TCP Stack issue (might help with the fw 51 issues)

Capturing my network traffic, I have noticed that, occasionally, a bulb will send a TCP ACK with the seq and ack numbers reversed, hence the message is never really acked. I also note that I see Seq 242 and 243 / Ack 111 a lot in these messages. Below are a couple of select frames from the capture. If there is someplace I can upload the capture to, I’d be happy to do so. I believe the full capture (only 39K in size) may be helpful to you.

Below: 10.1.0.3 is a host system, 10.1.0.204 is a Yeelight RGB bulb. My program sends a discovery message, then makes a connection to each discovered bulb and fetches some properties. This yields quite a few bulbs (perhaps all of them) sending a mangled ack packet. There are 14 bulbs total here, so be aware that you are only seeing a snippet. Note the Seq and Ack numbers below.

490 70.823262 10.1.0.3 10.1.0.204 TCP 60 56772 → 55443 [ACK] Seq=242 Ack=111 Win=29200 Len=0

553 71.809680 10.1.0.204 10.1.0.3 TCP 54 [TCP ACKed unseen segment] 55443 → 56772 [FIN, ACK] Seq=111 Ack=243 Win=13758 Len=0

554 71.809929 10.1.0.3 10.1.0.204 TCP 60 [TCP Previous segment not captured] 56772 → 55443 [ACK] Seq=243 Ack=112 Win=29200 Len=0
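One way to sanity-check frames like these is to track how far each side's sequence number should have advanced, remembering that SYN and FIN each consume one sequence number. The sketch below is illustrative only: `Segment` and `classify_ack` are hypothetical helper names, the values are typed in by hand from the three frames above, and sequence wraparound and SACK are ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    src: str
    seq: int
    ack: int
    length: int = 0
    syn: bool = False
    fin: bool = False

    def next_seq(self) -> int:
        # Payload bytes advance the sequence number; SYN and FIN each count as one.
        return self.seq + self.length + int(self.syn) + int(self.fin)


def classify_ack(acker: Segment, last_seen_from_peer: Segment) -> str:
    """Compare an ack number with what the capture says the peer has sent."""
    expected = last_seen_from_peer.next_seq()
    if acker.ack == expected:
        return "consistent"
    if acker.ack > expected:
        return "acks data missing from capture"
    return "duplicate or stale ack"


# Values copied from frames 490, 553 and 554 above.
host_490 = Segment("10.1.0.3", seq=242, ack=111)
bulb_553 = Segment("10.1.0.204", seq=111, ack=243, fin=True)
host_554 = Segment("10.1.0.3", seq=243, ack=112)

print(classify_ack(bulb_553, host_490))  # acks data missing from capture
print(classify_ack(host_554, bulb_553))  # consistent
```

Under these assumptions, frame 553 acknowledges one sequence number more than the capture shows the host sending, which is what Wireshark's "ACKed unseen segment" note flags, while frame 554 matches the bulb's FIN exactly.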

I’ve learned a bit more. To trigger the problem I must send commands to multiple bulbs simultaneously (or at least in quick succession). If I, for example, send a message to all 14 of my bulbs to change color temp, I see a bunch of improper acks shortly thereafter. Some of these occur on conversations with your servers rather than mine, which suggests the effect may not show up in the conversation that triggers it: my commands clearly triggered errors in subsequent, unrelated exchanges. Hopefully you can recreate this in-house and, as always, I’d be happy to share captures or test however you like.
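For anyone trying to recreate the trigger, here is a minimal sketch of such a fan-out. It is not the actual program used here. It assumes the bulbs accept CRLF-terminated JSON commands on TCP port 55443 (the port visible in the capture above); the `set_ct_abx` method name and its parameters are my understanding of Yeelight's LAN control documentation and should be checked against it. `build_command`, `send_command` and `send_to_all` are illustrative names.

```python
import json
import socket
from concurrent.futures import ThreadPoolExecutor


def build_command(cmd_id, method, params):
    """Encode one command as a single JSON line terminated by CRLF."""
    message = {"id": cmd_id, "method": method, "params": params}
    return (json.dumps(message) + "\r\n").encode("utf-8")


def send_command(host, payload, port=55443, timeout=2.0):
    """Send one command; return the first reply line, or None on error or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            buf = b""
            while b"\r\n" not in buf:
                chunk = sock.recv(4096)
                if not chunk:  # peer closed before sending a full line
                    return None
                buf += chunk
            return buf.split(b"\r\n", 1)[0].decode("utf-8", "replace")
    except OSError:  # refused connections and socket timeouts both land here
        return None


def send_to_all(hosts, payload, port=55443, timeout=2.0):
    """Send the same payload to every host at once, one worker thread per host."""
    hosts = list(hosts)
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        replies = pool.map(lambda h: send_command(h, payload, port, timeout), hosts)
        return dict(zip(hosts, replies))
```

Usage would look something like `send_to_all(bulbs, build_command(1, "set_ct_abx", [4000, "smooth", 500]))`, where `bulbs` is a list of bulb addresses. One thread per bulb is deliberate, since the point is to have all the connections in flight at the same moment.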

Scratch all of this. I’m running up against a limit in a DD-WRT router where I’m missing packets. Thus my evidence is garbage. I’ll continue to poke around, but the rest of this topic’s messages can be ignored.

Thanks for helping debug this issue, we really appreciate your support.

I understand you are not seeing this issue in your lab at all. Allow me to share what I’ve done and learned so far. FWIW, I’m a retired Unix/Networking consultant, I have skills and, as demonstrated earlier, the ability to say “I was wrong” when I am (though proudly I don’t need to do that too often). Of course, I have to treat the bulbs as a black box since I don’t (and don’t expect to) have source code.

My setup:
I have 14 RGB bulbs here. I’m toying with my own software and with how far I can keep such devices from talking to the internet. They are typically not allowed to talk to the internet but are allowed to talk to DNS. The reason I allow them to talk to DNS is that they tend to become ‘sluggish’ in answering initial requests; that is, a bulb that has sat idle a long time may take a couple of messages before it responds. Allowing DNS fixed that. This has been true since before fw .51; I just never made much fuss about it, as I am certainly working in my own little world here. But, at this point, it may be a useful hint. My servers are all Linux machines. I doubt that has any bearing, but the more info you get the better.

Things I’ve tried:
I have tried allowing them freedom to talk to anything, and I get hung bulbs overnight. I’ve tried to capture that with tcpdump. Unfortunately, the hardware I bought to load DD-WRT on is really low end and doesn’t have enough memory to store an overnight tcpdump, and pushing the capture over the net via ssh (while capturing the wireless side, so the ssh traffic itself is not in the capture) yields enough packet loss to be confusing.

mDNS?
I do not run mDNS. For some reason I’ve been wondering if that’s a factor. I have yet to investigate that in my experiments so far.

Isolating the bulbs
I have isolated all 14 bulbs from the internet including DNS. They cannot talk to anything but my devices for DHCP. The result is, they are sitting idle overnight with no messages being sent to them. I have still had a couple of bulbs hang requiring a power cycle this morning. However…

Tonight’s experiment:
Just now, I power cycled all of the bulbs and will leave everything blocked. We’ll see if I have hung bulbs in the morning. I do this because I have a suspicion that the bulbs that hung were not power cycled since I isolated them.

A suggestion to you: take a dozen or so bulbs and put them in a similar situation. Let them connect to a wireless network, get a DHCP address and nothing else: no mDNS, no DNS, nada. I would be surprised if you don’t see the problem unless, of course, tonight’s experiment shows otherwise. I’ll post the result of that tomorrow morning.

Meanwhile, if there is anything you’d like me to try, please let me know. I’d even go so far as to set up some remote access to that isolated network if that would be helpful.

I know you guys are super frustrated and nervous about this. It’s just a software bug; we just need to find it and the world will be all happy again. Whatever I can do to help.

One other very important fact: bulbs that ‘hang’ remain pingable. Pinging a bulb will not tell me whether it is functional; only by issuing a command to the bulb will I find out that it has hung.
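Since ping is no help, a health check has to go through the command port. Below is a rough sketch of one, assuming the bulb answers CRLF-terminated JSON on TCP port 55443 and that `get_prop` with a `power` parameter is a harmless query (both from my understanding of the Yeelight LAN protocol, so treat them as assumptions). `probe_bulb` and its three result strings are made-up names for illustration.

```python
import socket

# Assumed-harmless query; the exact method and parameter should be checked
# against the Yeelight LAN control documentation.
PROBE = b'{"id":1,"method":"get_prop","params":["power"]}\r\n'


def probe_bulb(host, port=55443, timeout=3.0):
    """Return "ok", "no-reply" or "unreachable" for one bulb."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"  # could not even open the TCP connection
    with sock:
        try:
            sock.sendall(PROBE)
            buf = b""
            while b"\r\n" not in buf:
                chunk = sock.recv(4096)
                if not chunk:
                    return "no-reply"  # peer closed without answering
                buf += chunk
            return "ok"
        except socket.timeout:
            return "no-reply"  # connected, but nothing came back in time
        except OSError:
            return "unreachable"  # connection broke mid-exchange
```

A bulb that answers ping but comes back as "no-reply" here would be a candidate for the hung state described above. A single missed reply could also be ordinary packet loss, so it is probably worth probing more than once before power cycling anything.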

Thanks a lot again for helping us with your expertise, we are really lucky to have users like you. :grinning: We will follow your advice and method to reproduce this problem.
We have now released beta firmware 56, which fixes one potential issue. Could you help verify it in your environment? If you are willing to, please share your Xiaomi ID with us and we will add you to the whitelist so you can do a firmware upgrade.

BTW, Oct 1st to Oct 7th is the Chinese National Holiday, so we can only do the experiment on Oct 8th.

To confirm: with fully rebooted lights, only DHCP allowed and all other packets dropped, 3 of my 14 bulbs hung last night. I’ll upgrade to the beta and try it all again if you have a chance to put me on the list. Otherwise, enjoy the holiday.

Mid 1880846078

Already added to whitelist, thanks for helping.

I’m not seeing a firmware update. I am using the Yeelight App (Android) version 3.2.15. Since it is a holiday there, please do not feel pressured to spend a great deal of time getting the beta to me. If I see it pop up I’ll carry on testing, but no worries.

Even with my low end router, I did finally manage to get decent captures and have learned that the bulbs still actually accept and obey commands, they simply don’t send the response message.

Did you see this behavior on the latest firmware?

No. This is version .51 still.

A message or so back in this subject chain I stated that I’m not being offered the beta for some reason (or perhaps I am unaware of how to obtain it) but I also did not wish to pressure anyone to solve that, especially if all I am doing is turning up evidence of what you may have already fixed.

OK. I have added you to the whitelist, so you can use the Yeelight App to check for the firmware upgrade. Anyway, I will be back from vacation very soon and will try to reproduce this issue with debugging enabled.

Still no update. Please do not concern yourself with this until you are back. It was my intent to be helpful, not to become another project. Enjoy some holiday.

I can also reproduce the hang issue on firmware 51; however, I can’t see this problem on version 56. Could you try it on your side?

To update the firmware, you may need to reset your bulb and connect to the China server; then you will see the beta firmware.

After the update, I can already see a great improvement in responsiveness. I’ll still report tomorrow if any of the bulbs hang, but I feel confident they will all stay running. The timing on my in-house lighting show just got a bunch better :slight_smile:

Thanks for your update!

Last update! Everything appears to be working wonderfully. Thank you.

Dang, I was overconfident. I found two bulbs hung when I got home just now. I’ve rebooted them and verified they are all on .56, we’ll see if it happens again.

As a reminder, I’m coding everything myself, not using the Python lib or HomeKit, so if it appears that I’m the only one having this issue, please say so. Though I think I’ve got very simple and solid code here, I would never be so arrogant as to say “my code couldn’t be at fault”.