Real men do not test backups, remember?

I have always said that real men don’t make backups of their important data :)
But I do not want to lose data. I have been in the IT industry for some time, and I know that the question is not “if the hard drive will fail”** but “when will it fail”. Here is a story we can all learn from:

About 20 years ago, I worked for a company which I shall not name, which used CVS as its source repository. All of the developers’ home directories were NFS mounted from a central Network Appliance shared storage (Network Appliance was the manufacturer of the NAS device), so everyone worked in and built on that one central storage pool. The CVS repository also lived in that same pool. Surprisingly, this actually worked pretty well, performance-wise.

One of the big advantages touted for this approach was that it meant that there was a single storage system to back up. Backing up the NA device automatically got all of the devs’ machines and a bunch more. Cool… as long as it gets done.

One day, the NA disk crashed. I don’t know if it was a RAID or what, but whatever the case, it was gone. CVS repo gone. Every single one of 50+ developers’ home directories, including their current checkouts of the codebase, gone. Probably 500 person-years of work, gone.

Backups to the rescue! Oops. It turns out that the sysadmin had never tested the backups. His backup script hadn’t had permission to recurse into all of the developers’ home directories, or into the CVS repo, and had simply skipped everything it couldn’t read. 500 person-years of work, really gone.

Almost.

Luckily, we had a major client running an installation of our hardware and software that was an order of magnitude bigger and more complex than any other client. To support this big client, we constantly kept one or two developers on site at their facility on the other side of the country. So those developers could work and debug problems, they had one of our workstations on-site, and of course *that* workstation used local disk. The code on that machine was about a week old, and it was only the tip of the tree, since CVS doesn’t keep a local copy of the history, only a single checked-out working tree.

But although we lost the entire history, including all previous tagged releases (there were snapshots of the releases of course… but they were all on the NA box), at least we had an only slightly outdated version of the current source code. The code was imported into a new CVS repo, and we got back to work.

In case you’re wondering about the hapless sysadmin, no he wasn’t fired. That week. He was given a couple of weeks to get the system back up and running, with good backups. He was called on the carpet and swore on his mother’s grave to the CEO that the backups were working. The next day, my boss deleted a file from his home directory and then asked the sysadmin to recover it from backup. The sysadmin was escorted from the building two minutes after he reported that he was unable to recover the file.

From Slashdot, by swillden.
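
The backup script in the story skipped everything it could not read and never complained. Here is a minimal sketch, in Python, of the kind of check that might have caught that. It is my own illustration, not anything from the original story: the function name `find_unreadable` is made up, and I am assuming the check runs as the same user as the backup job.

```python
import os


def find_unreadable(root):
    """Return the paths under root that the current user cannot read.

    Hypothetical helper: run it as the backup user before the backup,
    and treat a non-empty result as a failure instead of skipping.
    """
    unreadable = []

    def on_error(err):
        # os.walk calls this with the OSError raised when it cannot
        # list a directory (no permission, or it does not exist).
        unreadable.append(err.filename)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # os.access follows symlinks, so a dangling symlink is
            # reported here as well.
            if not os.access(path, os.R_OK):
                unreadable.append(path)
    return unreadable
```

The idea is that the backup job exits with an error (and somebody gets an email) whenever this list is not empty, rather than quietly producing a backup with holes in it.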

** I am talking not only about hard drives, but about storage media in general. This also applies to humans: we make mistakes too, and we lose data every day.
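
The boss in the story tested the backup the only way that counts: by asking for a file back. A rough sketch of an automated version of that drill, again in Python and again my own assumption rather than part of the story: restore the backup into a scratch directory with whatever tool you use, then compare it against the live data. The names `file_digest` and `verify_restore` are hypothetical.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_root, restored_root):
    """Compare every file under source_root with its restored copy.

    Returns a list of (relative_path, problem) tuples, where problem
    is "missing" or "differs". An empty list means every source file
    was restored with identical content.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(restored_root, rel)
            if not os.path.isfile(dst):
                problems.append((rel, "missing"))
            elif file_digest(src) != file_digest(dst):
                problems.append((rel, "differs"))
    return problems
```

Files that changed after the backup was taken will show up as "differs", so this works best against a snapshot or a directory that is quiet during the test. Even a crude run of this once a week would have told that sysadmin what he needed to know.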