Urgent firmware downgrade request (2.0.6_0051) Home Assistant Unavailability

Thanks for everyone’s patience and assistance with this issue. We have decided to publish the 65 firmware officially soon, as no issues have been reported since we released the beta version days ago. I also want to explain the issue in detail here, because we think every user who suffered from it has the right to know.

The issue was reported after we published the 2.0.x firmware, which supports Apple HomeKit through the newly introduced soft-auth method. This feature brought 3 major changes: a) Apple’s Bonjour protocol, which is built on the mDNS protocol. b) Increased internal resource usage, since the device needs to run one more protocol stack in parallel. c) 2.0 SDK changes, specifically the cloud monitor (phone home) thread. These 3 changes caused all the issues, which we didn’t expect and hadn’t found in our lab (or in Apple’s lab).
For change a), we removed our old mDNS code and stopped announcing “miio.udp”, which is exactly what HA uses to auto-discover Yeelight devices. Our interworking spec clearly states that an SSDP-like discovery protocol should be used and that no other discovery mechanism is promised. Therefore we don’t think we made a mistake on this one, and we think HA should improve on their side.
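
For anyone who wants to see what SSDP-like discovery looks like from the client side, here is a minimal sketch in Python. The multicast address, port and `ST` value are taken from the published Yeelight inter-operation spec as I understand it, so treat them as assumptions and check them against the spec. The function names `parse_response` and `discover` are made up for this example.

```python
import socket

# Assumed values from the published Yeelight inter-operation spec; verify before use.
MULTICAST_ADDR = ("239.255.255.250", 1982)

SEARCH_REQUEST = (
    "M-SEARCH * HTTP/1.1\r\n"
    "HOST: 239.255.255.250:1982\r\n"
    'MAN: "ssdp:discover"\r\n'
    "ST: wifi_bulb\r\n"
    "\r\n"
)


def parse_response(raw: bytes) -> dict:
    """Turn the header lines of a discovery reply into a dict with lowercase keys."""
    headers = {}
    lines = raw.decode("utf-8", errors="replace").split("\r\n")
    for line in lines[1:]:  # skip the status line
        if ":" in line:
            key, value = line.split(":", 1)
            headers[key.strip().lower()] = value.strip()
    return headers


def discover(timeout: float = 2.0) -> list:
    """Send one search request and collect replies until none arrives within `timeout` seconds."""
    found = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(SEARCH_REQUEST.encode("ascii"), MULTICAST_ADDR)
        while True:
            try:
                data, _addr = sock.recvfrom(4096)
            except socket.timeout:
                break
            found.append(parse_response(data))
    return found
```

Calling `discover()` on the same LAN as a bulb should return one dict per reply. This path does not depend on any mDNS announcement, which is why it is unaffected by change a).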

For change b), this is the major cause of the hang issue. Because we are using an embedded system instead of a “high-level” OS like Linux, all the internal resources need to be fine-tuned. The hang was caused by two problems: UDP socket resources and one internal semaphore resource. If the system ever runs out of semaphores, one task intentionally hangs the system, which makes the bulb freeze. We successfully reproduced this in our network environment when there was a lot of mDNS traffic while 4 local TCP clients were connected. After this finding, we published some beta firmware and did see some improvement.

The UDP socket resource issue was the last one we tracked down. Even now we still can’t reproduce it locally, but thanks to our kind user @chrisnwh05, who let us run our test firmware with debugging information on his device again and again, we were finally able to conclude that the problem is caused by a UDP resource leak. One task opens a UDP socket but somehow doesn’t call “recvfrom” when the socket is readable, which causes the TCP/IP stack to run out of UDP buffers. UDP is also used internally for command responses from the main control task to the local control task, which is why you can still command the bulb through HA but get no response from it. We finally fixed this issue by tuning the UDP resources a bit to make the stack more tolerant in some network environments.
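
To illustrate the UDP leak described above, here is a toy simulation in Python. It is not the firmware code: the shared buffer pool, the class names and the numbers (8 buffers, 20 packets) are all hypothetical. It only shows how one socket that is never read can starve an unrelated internal socket that shares the same pool.

```python
from collections import deque


class BufferPool:
    """Fixed pool of datagram buffers shared by every socket (hypothetical model)."""

    def __init__(self, size: int):
        self.free = size

    def take(self) -> bool:
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def give_back(self) -> None:
        self.free += 1


class SimSocket:
    """Toy UDP socket: each queued datagram holds one pool buffer until it is read."""

    def __init__(self, pool: BufferPool):
        self.pool = pool
        self.queue = deque()

    def deliver(self, payload: bytes) -> bool:
        """Called when a datagram arrives; False means it was dropped for lack of a buffer."""
        if not self.pool.take():
            return False
        self.queue.append(payload)
        return True

    def recvfrom(self):
        """Read one datagram and release its buffer; returns None if nothing is queued."""
        if not self.queue:
            return None
        self.pool.give_back()
        return self.queue.popleft()


def run(drain_leaky_socket: bool, traffic: int = 20, pool_size: int = 8) -> bool:
    """Return True if an internal command response can still be delivered after the traffic."""
    pool = BufferPool(pool_size)
    leaky = SimSocket(pool)     # the task that opens a socket but may never read it
    internal = SimSocket(pool)  # main control task -> local control task responses
    for _ in range(traffic):
        leaky.deliver(b"mdns-packet")
        if drain_leaky_socket:
            leaky.recvfrom()
    return internal.deliver(b"command-response")
```

With `run(drain_leaky_socket=False)` the pool is exhausted by the unread socket and the internal response is dropped, which matches the "command works but no response" symptom. With `run(drain_leaky_socket=True)` the response gets through.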

For change c), this was introduced to solve the offline issue frequently reported by mesh router users. With some mesh routers, if we don’t intentionally bounce the WiFi interface when the bulb switches from mesh node A to B, the bulb can never come online again. We didn’t expect that many users would block internet traffic for the bulb, and the lesson is that we need to think twice before introducing a behavior change next time. This logic has been removed from the new firmware.

That’s basically the entire story and issue analysis, and we apologize for our mistakes. We also want to thank all our users, who are always supportive.
