This is an interesting story that baffled me for two days and even affected my presence during work hours, so I was eager to solve it.
My provider is the Hungarian T-Home (a DT subsidiary, just like T-Mobile and all the other "magenta" companies). Since January we have had IPTV, which meant a new cable modem too. The old "dumb" cable modem had to be replaced, and I wanted to migrate my home networking infrastructure without any disruption. I did, and it worked. Up until yesterday, when something happened...
The day the serviceman came to install the units was interesting, so let's start with that. He brought two boxes, one with the modem and one with the IPTV Set Top Box (STB). Let's forget the fact that he initially brought the wrong STB -- the one without an HDD, while we had ordered the one with an HDD; he installed them quickly. The interesting part was when he spotted my infra sitting next to the modem while he was swapping the old modem for the new one (which turned out to be more than just a modem). He put on a blank smile and told me, "You have to stop using your own router, it interferes with the STB. The new modem has router capabilities too." I asked how it interferes. He just kept repeating "You have to uninstall your router" and smiling. I believe he had to tell me that due to some company policy (the contract has some stupid limit on the number of machines allowed to connect, but nowadays, when even microwave ovens have WiFi, those policies may wipe out my... um). "Okay, I will remove everything the moment you finish," I lied.
So, what he installed looked very promising. Both devices wear a "Cisco" sticker. The modem (and router, and AP, as it later turned out) is a "Cisco EPC3925 EuroDOCSIS 3.0 2-PORT Voice Gateway", model EPC3925. It features 4 LAN ports, 2 phone ports (I am not using those, SIP phone rulez), and a Wireless-N WiFi AP. The STB is a Cisco ISB6030MT.
Both are "high quality" Cisco gear, not some cheap shit. Yeah. I believed that for a few days, until I tried to google them. It's cheap shit with nice stickers on it. Cisco acquired a few companies and blatantly rebranded their products (why are they ruining their own trademark?). I did not care about the TV as long as it works and does what we want (it does, even if it runs ancient WinCE!!!), but this was one more reason I did not want to rely on this modem as a router. I wanted to use it as little as possible. So, I decided to change the network segment for my home stuff. This is what I ended up with:
In short, the modem was set up on the 192.168.0.1 IP and I did not want to fiddle with it too much, so I switched my home network to 192.168.1.x. The modem, STB and WRT-H are directly wired (which is better for reducing multicast group latency), and WRT-H (H as in home) routes to the 192.168.1.x segment, but also does DHCP and DNS for home (and "fixes" the damn Apache Software Foundation SVN server to work with git-svn, but that's another story) and QoS. Wired connections from it go to an Apple TimeCapsule (TC) and the Gigaset SIP phone's base station. It also serves as the WiFi AP for home machines like Macs, phones and such. Both WRTs are actually good old Linksys WRT-54GLs running the best custom firmware I have had the chance to find, the Tomato firmware. And the WDS is there just to "hop" over the internet to my office, and to be able to use the printer (it's actually an MFP) from home.
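To make the two segments concrete, here is a minimal sketch using Python's standard ipaddress module. The addresses come from the description above; the /24 prefix lengths and the segment_of helper are my assumptions for illustration, not something read off the actual devices.

```python
import ipaddress

# Addresses from the text; the /24 masks are an assumption.
MODEM_LAN = ipaddress.ip_network("192.168.0.0/24")  # modem, STB, WRT-H uplink
HOME_LAN = ipaddress.ip_network("192.168.1.0/24")   # everything behind WRT-H


def segment_of(addr: str) -> str:
    """Name the segment an address belongs to (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in MODEM_LAN:
        return "modem"
    if ip in HOME_LAN:
        return "home"
    return "other"


# The two segments must not overlap, otherwise WRT-H could not route between them.
assert not MODEM_LAN.overlaps(HOME_LAN)

print(segment_of("192.168.0.1"))  # the modem itself
print(segment_of("192.168.1.1"))  # WRT-H, the gateway for home machines
```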
Not wanting to fiddle with the modem, all I did (in terms of changing the "factory" preset config that T-Home ships) was shut down the WiFi on it. Yes, T-Home ships them with WiFi on, and my neighborhood is full of WiFi noise with meaningless SSIDs (they are randomly generated), and many of my neighbors are simply unaware they have WiFi! Why oh why is T-Home shipping them like this? Why not turn on WiFi on the spot only if the customer asks for it?
And everything was working like a charm. Until yesterday.
Since the change to IPTV and the new modem, the network had been fairly stable and fast. I did notice some small "drops" (like a browser taking too long to load a page), but they were intermittent and rare, so I did not look into them.
Yesterday it started falling apart. My wife was unable to browse anything; my browser, git and svn were timing out (not connection refused, but as if the TCP packets went to /dev/null somehow)... It was a nightmare. And the most interesting thing is that UDP was working without a problem! Initially I thought it was a network outage (or brownout) that kept recurring on the provider side, but I was suspicious because Skype, for example, worked without interruption (same for TV reception, which uses UDP multicast). So I phoned my provider, asking about an outage and describing the problem, but after a long session (they did some remote measurements and other checks), they convinced me the problem was on my side: they had good signal quality readouts and no packet loss reported (I did confirm the signal quality, since the modem prints those values out in its ugly UI). To convince myself even more, I hooked up a Mac directly to the wire the STB was using to try the network (to rule out WiFi, any in-the-middle router, etc). It was working like a charm. So, really, it had to be my equipment.
Tracing the problem clearly showed that TCP packets were somehow disappearing in my network, and WRT-H was becoming the prime suspect. But it was reporting no problem, and to make things worse, the "outage" was sporadic: one moment the network was working just fine (TCP at least, since UDP services had no outage at all), and the next moment it stopped and packets were lost. The routing table did look okay there, but still, I wanted to check.
The Mac (actually all BSD kernels, I believe) has a nice monitoring tool, route -n monitor, and it clearly showed that packets were being lost:
got message of size 124 on Tue May 3 11:31:17 2011
RTM_LOSING: Kernel Suspects Partitioning: len 124, pid: 0, seq 0, errno 0, ifscope 0, flags:<UP,GATEWAY,HOST,DONE,WASCLONED,IFSCOPE>
locks: inits:
sockaddrs: <DST,GATEWAY>
 220.127.116.11 192.168.1.1

got message of size 124 on Tue May 3 11:31:32 2011
RTM_LOSING: Kernel Suspects Partitioning: len 124, pid: 0, seq 0, errno 0, ifscope 0, flags:<UP,GATEWAY,HOST,DONE,WASCLONED,IFSCOPE>
locks: inits:
sockaddrs: <DST,GATEWAY>
 18.104.22.168 192.168.1.1
The gateway was WRT-H's IP address, meaning the TCP packet did leave the Mac, but was lost. There were a LOT of these messages while the problem was present, but the next moment they stopped and the network worked. I was freaking out. I took my network apart bit by bit to rule out the WDS, one router, the other router, shut down the TimeCapsule, but got nothing reliable. Btw, try to google these kernel messages: there is NOTHING, really nothing, to be found about them.
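When there are a LOT of these messages, counting them per gateway is easier than eyeballing them. Below is a rough sketch that tallies RTM_LOSING events from captured route -n monitor output. The sample text is the two messages quoted above; the exact line breaks of the real output may differ from what I assume here, so the parser only relies on the "got message of size" header and the sockaddrs part. The function name is made up for this example.

```python
import re
from collections import Counter

# The two messages quoted in the post; line breaks are an assumption.
SAMPLE = """\
got message of size 124 on Tue May 3 11:31:17 2011
RTM_LOSING: Kernel Suspects Partitioning: len 124, pid: 0, seq 0, errno 0, ifscope 0, flags:<UP,GATEWAY,HOST,DONE,WASCLONED,IFSCOPE>
locks: inits:
sockaddrs: <DST,GATEWAY>
 220.127.116.11 192.168.1.1
got message of size 124 on Tue May 3 11:31:32 2011
RTM_LOSING: Kernel Suspects Partitioning: len 124, pid: 0, seq 0, errno 0, ifscope 0, flags:<UP,GATEWAY,HOST,DONE,WASCLONED,IFSCOPE>
locks: inits:
sockaddrs: <DST,GATEWAY>
 18.104.22.168 192.168.1.1
"""

# Destination and gateway follow the "<DST,GATEWAY>" marker, whitespace-separated.
SOCKADDRS = re.compile(
    r"sockaddrs:\s*<DST,GATEWAY>\s*(?P<dst>\S+)\s+(?P<gateway>\S+)"
)


def losing_by_gateway(text: str) -> Counter:
    """Count RTM_LOSING messages per gateway address (hypothetical helper)."""
    counts = Counter()
    # Split just before each message header so one message is handled at a time.
    for message in re.split(r"(?=got message of size)", text):
        if "RTM_LOSING" not in message:
            continue  # some other routing message, not a loss report
        match = SOCKADDRS.search(message)
        if match:
            counts[match.group("gateway")] += 1
    return counts


print(losing_by_gateway(SAMPLE))
```

One could pipe the live output of route -n monitor into a file and feed that file to losing_by_gateway; if every loss points at the same gateway, the problem is at or behind that hop, which is exactly what misled me at first.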
So I searched the Hungarian hacker community, knowing I could not be the only one with this piece of crap equipment. And what luck, I found the answer here. Many thanks to the Hungarian Unix Portal and the people participating in this forum! The guy who started this thread had exactly the same symptoms as I had, but with different hardware and OS: he used Ubuntu (I had started suspecting Apple's OS X and who knows what else; actually, I was clueless).
In short, it turned out that the crappy wannabe-Cisco modem has a conntrack connection limit set to 1024! And there is nowhere in the admin UI where you can find that out, or even read the value! When the connection count goes over that threshold, it starts dropping connections! This applies to TCP (stateful) connections, hence UDP is unaffected. It turns out -- luckily the guy in the forum experimented with his modem -- that the modem's "SPI Firewall" is what does this: when turned on, it limits the connection count to 1024. And guess what the modem's default is! I did not apply the other fixes he proposed (again, I am not using the modem's AP), but shutting down the modem's firewall did make it work! Again, many thanks to HUP user "ufoka"!
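To illustrate why a full connection table produces exactly these symptoms, here is a toy model in Python. It is only a sketch: I have no idea how the modem's firmware really implements its tracking, so the class, its methods and the "drop only new TCP flows when full" rule are my assumptions. Only the 1024 limit and the "UDP unaffected" behavior come from what was observed.

```python
class ConnTrackTable:
    """Toy model of a stateful firewall with a fixed-size connection table."""

    def __init__(self, limit: int = 1024):
        self.limit = limit      # 1024 is the value reported in the forum thread
        self.tcp_flows = set()  # currently tracked TCP connections

    def admit(self, proto: str, flow) -> bool:
        """Return True if a packet of this flow is let through."""
        if proto != "tcp":
            return True   # as observed, UDP traffic was not affected
        if flow in self.tcp_flows:
            return True   # connection is already tracked
        if len(self.tcp_flows) >= self.limit:
            return False  # table full: the new connection is silently dropped
        self.tcp_flows.add(flow)
        return True

    def close(self, flow) -> None:
        """Forget a connection, freeing one slot in the table."""
        self.tcp_flows.discard(flow)


table = ConnTrackTable(limit=1024)

# Open more TCP connections than the table can hold.
admitted = sum(table.admit("tcp", ("home-gw", port)) for port in range(1030))
print("TCP connections admitted:", admitted)

# New TCP is dropped while UDP still passes.
print("new TCP admitted:", table.admit("tcp", ("home-gw", 5000)))
print("UDP admitted:", table.admit("udp", ("home-gw", 5000)))

# As soon as one old connection goes away, a new one fits again.
table.close(("home-gw", 0))
print("new TCP after a close:", table.admit("tcp", ("home-gw", 5000)))
```

The last two steps are the interesting part: hovering right at the limit means new connections fail or succeed depending on whether some old connection happened to expire a moment earlier, which would look just like a sporadic outage from the inside.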
Later, I figured out what had happened. At home we have two laptops and two smartphones going out (to the internet, making connections through the modem); the printer, for example, is just a "local" connection. The phones, while they did have WiFi set up for the home network, were mostly left on 3G to conserve battery. But when I bumped their firmware to the latest Froyo, I started using mine with WiFi constantly on (since battery life turned out to be very good). Over the weekend my wife's phone was updated too, and her WiFi got turned on as well. It seems we were already near the 1k connections, and this just pushed us closer.
Simply put, the dumb modem, when the threshold was hit, started silently dropping TCP connections, since it detected them as a "flood" or whatnot, and this is what hit us. Enabling the phones just made things worse. This explains the "sporadic nature" of the problem too: the phones sync every now and then, when my wife pressed Enter in her browser she actually created a connection "burst", same for me, etc. Blah.
Long story short: It's solved!