Protecting your valuable data

These days it's cheap and easy to run out to Staples or Costco and buy USB storage devices. In fact, many shops will sell you a Seagate or Western Digital 1TB USB 3 external drive for $99 or less. These can be really handy.

Perhaps because these drives have been so cheap and easy to buy for so long, many of them are now old enough to be crashing and taking your valuable data with them. Human nature has us look into the safety of our data only after some disaster occurs, so we only find out when it's too late that there may be nothing anybody can reasonably do to recover data from a failed hard disk drive.

Next comes the question of data recovery services: Can experts who specialize in this kind of work restore your lost files? It is possible, but it will cost you a pretty penny to find out. Most people will back out when they are told that they must pay for the work even though there's no guarantee that the data will be recovered.

In this note I'll explain the basic storage issues that we face today. After that I'll explain what we are able to do today and, finally, I'll tell you about the ultimate solution: self-healing RAID arrays, which are coming soon.

Random loss of data

Hard disk drives work by writing digital data into something physical. The most common drives today use magnetic fields recorded onto a spinning disk. The digital ones and zeros are recorded in an analog form which is not 100% reliable. Over a period of time it's possible that some kind of failure will occur, causing one of the bits to randomly change from one state to the other. This is called Bit Rot (also written as bitrot) in the industry. Drives contain software and hardware designed to help detect and work around these problems, but this detection and correction doesn't always work.

As a result your photo collections, your music collections, your emails and your accounting information - all your personal and business data - are slowly deteriorating.

One day you will try to:

  • view an image that you snapped a few years earlier. Bitrot will cause damage that can take many forms depending on the type of image. A single bit error in a RAW file might only appear as a single strange dot that wasn't there before. The same error in a JPG image, on the other hand, might result in big splotches of strange colors in some parts of the image.
  • listen to an old MP3 recording. Bitrot will sound like clicks and chirps that were not there before.
  • read an archived email. Bitrot will result in strange characters appearing in the text.
  • view previous years of accounting data. Bitrot will result in random errors in the data and possibly even prevent that data from loading into your accounting system.
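
To make the idea of a single flipped bit concrete, here is a minimal Python sketch. The function name flip_bit and the sample invoice text are made up for illustration; real bitrot happens on the physical medium, not inside a program, so this only imitates the effect.

```python
def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << offset  # XOR with a one-bit mask inverts that bit
    return bytes(damaged)


original = b"Invoice total: 1250.00"

# Flip bit 1 of byte 15, which holds the first digit of the amount.
damaged = flip_bit(original, 8 * 15 + 1)

print(original.decode("ascii"))  # Invoice total: 1250.00
print(damaged.decode("ascii"))   # Invoice total: 3250.00
```

One changed bit out of 176 is enough to turn one plausible figure into another, which is why this kind of damage can go unnoticed for a long time.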

Bitrot is a long-term day-to-day data integrity issue. Remember that your nightly backup procedure is backing up the data that you have on your main storage device. Files that have experienced bitrot will therefore be happily copied to your backup system - usually overwriting previous copies which may have been correct. In other words: Bitrot has a way of working through your business process and wreaking havoc along the way.

Modern computers do not yet have a conveniently available solution for the problem of bitrot. As mentioned above there are features built into modern storage devices to detect and correct reliability problems. They do a great job - but they're not perfect. In the next few years we will see the general availability of Next Generation File Systems with the ability to detect and correct bitrot. These technologies do exist now and are available to those who are willing to make a little extra effort. We'll talk more about that below.

Total drive failure

Normally, though, bitrot happens very, very slowly. It might take a long time to do enough damage to your accounting data, for example, to prevent the accounting software from starting in the morning.

Total failure, on the other hand, is quickly visible. You will know when your hard disk starts to fail in a big way. At first you will hear strange clicking sounds as the drive tries to repair itself. Loading your data will become a very time-consuming exercise. Soon after that your computer may refuse to boot or you may not be able to access any of your files.

RAID protects against total loss

It is possible to avoid total data loss due to drive failure. This is done by writing the data to a Redundant Array of Inexpensive Disks or RAID. RAID takes your data and spreads it out over multiple drives. The technology comes in two main forms: Mirroring and Striping (strictly speaking, striping with parity, since plain striping on its own adds no redundancy).

Mirroring copies your data, bit for bit, onto one or more mirror drives. So, for example, if you have a 1TB drive and you want to protect yourself from total data loss due to the failure of that drive, you buy a second drive and mirror the two. Every time you save a file your operating system will then write the file to both drives. If one of them fails your data is still safely available on the other drive. You can go ahead and replace the failed drive. If you have a feature called Hot Swappable Drives you can remove the failed drive and install the replacement without having to turn off the computer.

Striping makes more effective use of disk space than Mirroring. If you buy, for example, three 1TB disk drives you can create something called a RAID5 array with 2TB of storage available. Striping will result in the data being spread out over the three drives, along with extra recovery information called parity. A simple mathematical algorithm is used to ensure that any one drive can fail at any moment without causing any data loss. After such a failure the RAID5 array will still be able to function, even though the array is degraded and no longer provides any further protection for you. Once you have replaced the failed drive the RAID5 array will use some simple math to reconstruct itself and, once the reconstruction is completed, your data will be protected once again.
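
The "simple math" behind a three-drive RAID5 array is commonly an XOR parity calculation. The Python sketch below is a simplified illustration, not how any particular RAID product is implemented: the name xor_blocks and the sample blocks are made up, and a real array works on much larger blocks and rotates the parity between the drives.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks live on two drives; the parity block lives on the third.
block_a = b"family p"
block_b = b"hotos 01"
parity = xor_blocks(block_a, block_b)

# The drive holding block_b fails. XOR-ing the two survivors brings it back.
rebuilt_b = xor_blocks(block_a, parity)
print(rebuilt_b == block_b)  # True
```

The same calculation recovers block_a from block_b and the parity, which is why any one of the three drives can be lost. Lose two and there is no longer enough information to rebuild anything.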

One more good thing about Striping is that it can be set up to provide protection from two simultaneous drive failures. This is good because most people buy all their drives at the same time and often from the same manufacturing production run. This, combined with power problems and vibration issues, can actually result in two drives (or more) failing at the same time. Using something called RAID6 it is possible to set up an array of 4 inexpensive disks such that the array will continue to function even if two drives fail at the same time.
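
The trade-off between protection and usable space comes down to simple arithmetic. The following Python sketch assumes identical drives and uses a made-up function name, array_summary; it returns the usable space together with the number of drive failures the array can survive.

```python
def array_summary(level: str, drives: int, drive_size_tb: float) -> tuple[float, int]:
    """Return (usable TB, drive failures survived) for identical drives."""
    # level -> (minimum drives, drives' worth of space given up to redundancy)
    rules = {
        "mirror": (2, drives - 1),  # every extra drive is a full copy
        "raid5": (3, 1),            # one drive's worth of parity
        "raid6": (4, 2),            # two drives' worth of parity
    }
    minimum, redundant = rules[level]
    if drives < minimum:
        raise ValueError(f"{level} needs at least {minimum} drives")
    return (drives - redundant) * drive_size_tb, redundant


print(array_summary("mirror", 2, 1.0))  # (1.0, 1)
print(array_summary("raid5", 3, 1.0))   # (2.0, 1)
print(array_summary("raid6", 4, 1.0))   # (2.0, 2)
```

With 1TB drives, a four-drive RAID6 array offers the same 2TB as a three-drive RAID5 array: the fourth drive buys the ability to survive a second failure.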

In practice you must use RAID for any data that you really need to keep safe. Your family photos and home videos, for example, must be stored on RAID. You will not be protected from bitrot but you will be protected from total loss of your data due to a single drive failure (or a double drive failure if you use RAID6).

Using Long-term Backup Rotations

Bitrot is not the only reason why you might lose data over a period of time. There are a great many problems that cause daily damage to your data. You may, for example, install some software one day that has a bug in it. This bug may introduce errors into your accounting data - and you might not see those errors right away. It is quite possible that you will discover the problem only a few days, weeks or months later. For this reason many companies maintain three rotations of nightly backups: The daily rotation, the weekly rotation and the monthly rotation:

  • Most of the time, staff who accidentally delete files will remember to ask for copies of those files a few days or weeks later. For this reason you might rotate nightly backups over a period of 7, 14, 21 or more days.
  • Backups made on Sunday nights are rotated out of the daily set and into the weekly set. This weekly rotation might also continue for 4, 8, 12 or more weeks. If you discover that you need a file from more than a few days ago (i.e. you can't find it in your nightly backup sets), you will probably find it in one of your weekly backup sets.
  • Backups made on the first Sunday of every month are rotated into both the weekly and monthly sets. This monthly rotation might continue for 6, 12, 18 or more months. These backups allow you to find good copies of your files many months after they have been damaged.
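
The three rotations above can be expressed as a small retention rule. This Python sketch is one possible reading of the scheme, not a description of any particular backup product: the function names are invented, and the retention lengths (14 days, 8 weeks, roughly 12 months) are just values picked from the ranges mentioned above.

```python
from datetime import date


def rotation_sets(backup_day: date) -> list[str]:
    """Name the rotation sets that a backup made on backup_day belongs to."""
    sets = ["daily"]
    if backup_day.weekday() == 6:  # Monday is 0, so Sunday is 6
        sets.append("weekly")
        if backup_day.day <= 7:    # the first Sunday of the month
            sets.append("monthly")
    return sets


def is_retained(backup_day: date, today: date,
                daily_days: int = 14, weekly_weeks: int = 8,
                monthly_days: int = 365) -> bool:
    """A backup is kept while at least one of its sets still wants it."""
    age = (today - backup_day).days
    limits = {"daily": daily_days, "weekly": weekly_weeks * 7, "monthly": monthly_days}
    return any(0 <= age < limits[name] for name in rotation_sets(backup_day))


today = date(2024, 10, 15)
print(is_retained(date(2024, 9, 4), today))  # False: a Wednesday, 41 days old
print(is_retained(date(2024, 9, 8), today))  # True: a Sunday, kept by the weekly set
print(is_retained(date(2024, 6, 2), today))  # True: a first Sunday, kept by the monthly set
```

A nightly job could run a check like is_retained over the existing backups and delete the ones that no set still needs.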

If, one day, you click on an old photo and find that it is no longer viewable:

  • Your immediate response would be to pull a copy of that file from your most recent nightly backup. Unfortunately the bitrot event that damaged the file probably didn't happen the night before.
  • You would then start rummaging through your nightly backups looking for a good copy of the file - but you might not find it.
  • You might go through all your weekly backup sets also and still not find a good copy of the file. This is because it's not unusual to discover damaged files long after they've been damaged.
  • Finally you would go through all the sets in your monthly rotation. If your monthly rotation is long enough you will find a good copy of the file which you can then copy over the damaged file on your main network.

Long backup rotations, for most people, are the only available solution for the majority of data loss problems. Each backup represents a snapshot of your data at the time of the backup. If you are able to go back far enough in time, to a time before a file was damaged or lost, you can easily recover that file or use the data to help correct damage caused by software bugs. Very long backup rotations are also the only real solution for bitrot as it mostly affects files that are rarely used. It turns out, though, that there is a better solution to the problem of bitrot.

Self-healing RAID arrays

Some of the data storage issues that we face in business today have no easy solution.

For example: If you have buggy software that is destroying your accounting data there may be no way for you to detect the problem until the damage appears in your reports and is recognized by a human.

Bitrot, however, can be detected using Next Generation File Systems. These new file systems add checksum information as the data is being written. Then, when reading the data, the new file systems are able to detect any errors in the data and report that information to the user.
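
The checksum idea can be sketched in a few lines of Python. This is an illustration of the principle only: the store is an ordinary dictionary standing in for a disk, the names write_block and read_block are invented, and SHA-256 is simply a convenient choice of checksum, not necessarily what any given file system uses.

```python
import hashlib


def write_block(store: dict, key: str, data: bytes) -> None:
    """Save the data together with a checksum computed at write time."""
    store[key] = (data, hashlib.sha256(data).digest())


def read_block(store: dict, key: str) -> bytes:
    """Return the data only if it still matches its stored checksum."""
    data, checksum = store[key]
    if hashlib.sha256(data).digest() != checksum:
        raise OSError(f"checksum mismatch in block {key!r}")
    return data


disk = {}
write_block(disk, "photo-001", b"holiday snapshot")
print(read_block(disk, "photo-001"))  # b'holiday snapshot'

# Imitate bitrot: the data changes behind the file system's back.
_, saved_checksum = disk["photo-001"]
disk["photo-001"] = (b"holiday snapshoT", saved_checksum)

try:
    read_block(disk, "photo-001")
except OSError as error:
    print("detected:", error)
```

Detection alone does not bring the file back, but it stops damaged data from being silently handed to you or to your backup job.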

Next Generation File Systems are also able to recognize the different components of a RAID array. If they detect an error while reading from an array the Next Generation File Systems will automatically try to reconstruct the correct data. For a mirrored array the file system will simply read from another drive in the mirror to find correct data (if it exists). For a Striped array a Next Generation File System will try to figure out which stripe is not valid and reconstruct it from the other stripes.
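
For the mirrored case, the repair step might look something like the Python sketch below. It is a simplified model under stated assumptions: each drive's copy is just an entry in a list, the expected checksum is assumed to have been saved when the data was written, and the names checksum and self_healing_read are made up for this example.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def self_healing_read(copies: list, expected: bytes) -> bytes:
    """Return a copy that matches the checksum and repair the ones that don't."""
    good = next((copy for copy in copies if checksum(copy) == expected), None)
    if good is None:
        raise OSError("every copy in the mirror is damaged")
    for index, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[index] = good  # overwrite the damaged copy with good data
    return good


ledger = b"2019 ledger"
saved = checksum(ledger)          # stored when the data was first written
mirror = [b"2019 ledgar", ledger]  # the copy on the first drive has rotted

print(self_healing_read(mirror, saved))  # b'2019 ledger'
print(mirror[0] == mirror[1])            # True: the bad copy was repaired
```

If every copy fails the check there is nothing left to repair from, which is the point at which the long backup rotations become the last line of defense.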

Note that you will not be able to detect and correct bitrot if you are not reading all your data from time to time. Setting up an automated nightly or weekly backup system is a simple and easy way to do this, because the backup has to read every file in order to copy it.
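
The habit of reading everything on a schedule can also be imitated on an ordinary file system. The Python sketch below records a checksum for every file under a folder and later re-reads them all to report the ones that no longer match. The function names and the idea of keeping a separate manifest are assumptions for this example. Note that a file you edited on purpose will also be reported, so this approach only suits archives that are not supposed to change.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Read the whole file in chunks and return its SHA-256 checksum."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def scrub(root: Path, manifest: dict[str, str]) -> list[str]:
    """Re-read every recorded file; list those that are missing or changed."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(name)
    return damaged
```

A typical use would be to call build_manifest once after archiving a set of photos, save the result somewhere safe, and then run scrub against it every week.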

Once you've got that process up and running you will have self-healing RAID storage. In other words: no more bitrot!

Conclusion

Next Generation File Systems are not readily available just yet. You can certainly get what you need today but it's not conveniently organized for you when you buy a new computer. Contact your local network services company for help getting started.

If you are a photographer, a videographer, an audiophile or the owner of a business, you really have no choice about using these technologies:

  • Use RAID to protect yourself from total drive failure.
  • Use a Next Generation File System to protect yourself from bitrot.
  • Use Daily, Weekly and Monthly Backup Rotations to protect yourself from other short-term and long-term data loss issues.