Tag Archives: Failure

MikroTik local link up / down on high traffic

One of our RouterBOARD routers – a MikroTik RB750 (mipsbe, with an Atheros 7240 switch chip) running v6.35.2 (the latest version at the time) – has been constantly failing under high network load: its local Ethernet links flap up and down. One particularly interesting detail is that this only happened with traffic going through the internal switch, not with other traffic.

The log file shows the following pattern:
may/13 21:23:17 interface,info ether1-gateway link up (speed 100M, full duplex)
may/13 21:23:17 interface,info ether2-master-local link up (speed 100M, full duplex)
may/13 21:23:17 interface,info ether4-slave-local link up (speed 100M, full duplex)
may/13 21:24:21 system,info sntp change time May/13/2016 21:23:38 => May/13/2016 21:24:21
may/13 21:26:22 interface,info ether2-master-local link down
may/13 21:26:22 interface,info ether4-slave-local link down
may/13 21:26:24 interface,info ether2-master-local link up (speed 100M, full duplex)
may/13 21:26:24 interface,info ether4-slave-local link up (speed 100M, full duplex)
may/13 21:29:16 interface,info ether2-master-local link up
may/13 21:29:16 interface,info ether4-slave-local link down
may/13 21:29:17 interface,info ether2-master-local link down
may/13 21:29:18 interface,info ether2-master-local link up (speed 100M, full duplex)
may/13 21:29:18 interface,info ether4-slave-local link up (speed 100M, full duplex)
may/13 21:36:01 interface,info ether4-slave-local link down

A quick search on the internet shows that we are not alone: port flapping is widespread in the MikroTik world. There are many reports of similar problems dating back to 2011, but no solution.

My guess is that the switch chip is broken / dead / malfunctioning / buggy, or that the switch “part” of the MikroTik router is sensitive to voltage / current changes.

Anyway, we solved this by disabling the switch and moving each port to a different subnet (bridging may also work); a rough sketch of the change is shown below. Now all traffic goes through the CPU, and even though MikroTik advertises that the switch runs at wire speed, we noticed that traffic through the CPU actually performs even better.
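
For illustration, this is approximately what the change looks like in the RouterOS v6.3x console, assuming the RB750’s default ether2-master-local / etherX-slave-local interface names; the interface names and subnets here are placeholders, so adjust them to your own setup:

# take the slave ports out of the hardware switch group
/interface ethernet set ether3-slave-local master-port=none
/interface ethernet set ether4-slave-local master-port=none
/interface ethernet set ether5-slave-local master-port=none
# give each LAN port its own subnet so the CPU routes between them
/ip address add address=192.168.2.1/24 interface=ether2-master-local
/ip address add address=192.168.3.1/24 interface=ether3-slave-local
/ip address add address=192.168.4.1/24 interface=ether4-slave-local
/ip address add address=192.168.5.1/24 interface=ether5-slave-local

Setting master-port=none removes the ports from the Atheros switch group, and assigning each port its own /24 means every frame between LAN ports is routed by the CPU instead of forwarded by the switch chip.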

Real men do not test backups, remember?

I have always said that real men don’t make backups of their important data 🙂
But I do not want to lose data. I have been in the IT industry for some time, and I know that the question is not “IF a hard drive will fail”** … but “WHEN it will fail”. Here is a story we can all learn from:

About 20 years ago, I worked for a company, which I shall not name, that used CVS as its source repository. All of the developers’ home directories were NFS-mounted from central Network Appliance shared storage (Network Appliance was the manufacturer of the NAS device), so everyone worked in and built on that one central storage pool. The CVS repository also lived in that same pool. Surprisingly, this actually worked pretty well, performance-wise.

One of the big advantages touted for this approach was that it meant that there was a single storage system to back up. Backing up the NA device automatically got all of the devs’ machines and a bunch more. Cool… as long as it gets done.

One day, the NA disk crashed. I don’t know if it was a RAID or what, but whatever the case, it was gone. CVS repo gone. Every single one of 50+ developers’ home directories, including their current checkouts of the codebase, gone. Probably 500 person-years of work, gone.

Backups to the rescue! Oops. It turns out that the sysadmin had never tested the backups. His backup script hadn’t had permission to recurse into all of the developers’ home directories, or into the CVS repo, and had simply skipped everything it couldn’t read. 500 person-years of work, really gone.

Almost.

Luckily, we had a major client running an installation of our hardware and software that was an order of magnitude bigger and more complex than any other client’s. To support this big client, we constantly kept one or two developers on site at their facility on the other side of the country. So that those developers could work and debug problems, they had one of our workstations on-site, and of course *that* workstation used local disk. The code on that machine was about a week old, and it was only the tip of the tree, since CVS doesn’t keep a local copy of the history, only a single checked-out working tree.

But although we lost the entire history, including all previous tagged releases (there were snapshots of the releases of course… but they were all on the NA box), at least we had an only slightly outdated version of the current source code. The code was imported into a new CVS repo, and we got back to work.

In case you’re wondering about the hapless sysadmin, no he wasn’t fired. That week. He was given a couple of weeks to get the system back up and running, with good backups. He was called on the carpet and swore on his mother’s grave to the CEO that the backups were working. The next day, my boss deleted a file from his home directory and then asked the sysadmin to recover it from backup. The sysadmin was escorted from the building two minutes after he reported that he was unable to recover the file.

from Slashdot by swillden.

** I am talking not only about HDDs, but about storage media in general. And this also applies to humans: we make mistakes too, and we lose data every day.