People usually have to learn the hard way that backups are important. For people not familiar with computers it takes time to appreciate the nature of digital data. Digital data is easy and cheap to create and copy, but equally easy to destroy and lose! It is very easy to accidentally delete files.
The medium that carries your data will become defective over time. This holds true whatever storage medium is used. Hard disks, CDs, DVDs and tape have usable lifetimes that are measured in years or at most decades. And while clay or stone tablets may still be readable after thousands of years, they are not really suited for digital data storage. The consequence is that digital data will become partially or completely unreadable if the media are left untended.
On this page I will focus on solutions for people looking to back up their PC, not for data centres. Those will hopefully not need my advice anyway.
Requirements for keeping your data intact
Copies as redundancy
The best way to guard against data loss is to make multiple copies of data, and store them in different places and on different media. This is called redundancy. The challenge here is to make sure that those different copies remain synchronized if the data is changed.
Data formats for longevity
A corollary of saving data for a long time is that you should still be able to read it afterwards. Therefore one should avoid proprietary and/or undocumented formats. Vendors and computer platforms come and go. And if they go, they might take your data with them! What are you going to do with data that was written by a program from a company that has gone out of business and which does not work on your current computer?
Stick with data formats for which the specifications are publicly available, and preferably not encumbered by patents, like Unicode text (UTF-8 seems most popular), JPEG and PNG for pictures, FLAC and Ogg Vorbis for sound, and Theora for video.
With these formats one can be pretty sure that in 20 years, there will still be applications that can read them.
Categories of data
For myself, I divide data into two categories;
- convenient to have a backup
This is e.g. a backup of the operating system and installed software. If your harddisk dies, it saves you from having to install everything from scratch. The best strategy here is to use the tools that come natively with the operating system. Ideally, if you have to restore a system you should be able to boot the computer from a live-CD and restore your data with the tools available in that situation.
Anything that you can find on the internet almost by definition falls into this category; you can always download it again.
An interesting corollary is that uploading something to an internet site is a form of backup. :-)
- critical to have a backup
This is data that you have created and which will be lost if you don’t save it; i.e. it cannot be recreated. Examples are photos that you’ve taken, things that you’ve written, etc. Realize what data falls into which category, and plan your backups accordingly.
If your collection of data is smaller than, say, a DVD (4.7 GB), then just burn the whole lot to a DVD. The frequency with which you should do that depends on how much data you’re willing to lose; if you cannot stand losing more than a week’s work, back up every week. :-)
If your data runs in the tens or hundreds of gigabytes, go for a pair of external harddisks. Hide one at home, and store the other at another safe place.
If you have a great amount of data that changes little (e.g. photo collections that are added to every year but change little afterwards), backing up everything every time takes up a lot of time and space. So use a program that compares your harddisk to the backup disk, and only sends new or changed data to the backup disk. This is called incremental backup.
With the ever increasing capacity of harddisks, traditional backup media for private citizens like floppies, CD-R or DVD-R are losing ground. 500 GB (465 GiB) harddisks are common as I write this. A DVD can hold 4.7 GB (4.38 GiB), so you’d need around 100 of them to completely back up the contents of such a big disk. Clearly that is not a viable alternative.
Magnetic tapes can hold much more, but high-capacity drives like LTO can easily cost more than your computer and require SCSI or SAS buses. This is a fine solution for data centres and people willing to spend lots of money. My focus here is on a practical backup for a private citizen.
In my opinion, the best backup for a big harddisk these days, is another big harddisk, unless you are willing to spend big time for a high-end tape drive. So that is what I will focus on.
My solution (on FreeBSD)
Essentially there are two kinds of backups, disaster recovery and archival storage. In this case I’m more concerned with the former than with the latter, since I keep all my data live on disk where I can reach it.
My first line of defence is to have two identical disks in the machine. The first one (ad4) is normally used, while the second one (ad6) is a copy which is updated nightly by rsync(1) in a cron(8) job. I tried using gmirror(8), but that felt slow, and rebuilding the array took days. Using a nightly cron job has the additional advantage that I have until the next nightly update to restore accidentally deleted files! That has already proved useful.
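For illustration, such a nightly job can be driven from /etc/crontab with an entry along these lines. The /disk2 mount point for the second disk is an assumption for the example, not necessarily my actual setup, and with multiple partitions you would need one line per filesystem.

```shell
# Example /etc/crontab entry (system crontab format, with a user field):
# mirror /home to the second disk at 03:00 every night.
# /disk2 as the mount point of ad6 is an assumption.
0  3  *  *  *  root  rsync -ax --delete /home/ /disk2/home/
```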
Having two disks doesn’t protect against calamities like fire or theft. So one should consider making backups and storing them away from the PC.
There are a few features that a backup program should have;
1. Tools for restoring system partitions should be available in the base system or on a live-CD. If you need to restore, you should be able to do so without first having to install a base system and a bunch of ports; that is a lot of work.
2. It should back up all the features, data and attributes of the system partitions. If the backup program doesn’t restore ownerships, permissions, flags and ACLs, you have to restore them by hand afterwards; again a lot of extra work.
In the /rescue directory of both the FreeBSD base system and the install CD, you’ll find statically linked binaries of restore(8) (for restoring backups made by dump(8)), tar(1) and pax(1). But only dump(8)/restore(8) is guaranteed to fulfill the second requirement on UFS2 filesystems. So that is what I use for system partitions.
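In outline, a dump(8)/restore(8) cycle for a system partition looks like this; the dump file name and the /mnt/newroot mount point are examples only.

```shell
# Level 0 dump of the root filesystem; -L snapshots the live filesystem,
# -a sizes the output automatically, -u records the dump in /etc/dumpdates.
dump -0 -L -a -u -f /tmp/root.dump /dev/ad4s1a

# Restoring later, e.g. from a live-CD, after newfs and mounting the
# freshly created partition (restore -r rebuilds the whole tree):
cd /mnt/newroot && restore -rf /tmp/root.dump
```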
The bulk of my data is in my /home partition. Things like holiday photos, scans and FLAC files take up a lot of space but change infrequently. Therefore I felt that a weekly level 0 backup was inappropriate for this partition.
As mentioned, I use a couple of disks in USB enclosures as external backup media. Their size is sufficient to store complete backups, and they are easy to use and to store in a vault or off-site. The next question is what to back up. It is customary to divide the disk into several partitions. Here is my breakdown;
df -h
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/ad4s1a       484M     93M    353M    21%    /
/dev/ad4s1g.eli   373G    141G    202G    41%    /home
/dev/ad4s1e        48G     36K     45G     0%    /tmp
/dev/ad4s1f        19G    5.8G     12G    32%    /usr
/dev/ad4s1d       1.9G    226M    1.6G    12%    /var
The /tmp filesystem is only used for temporary storage, so it doesn’t need to be backed up. So the filesystems that need to be backed up in my case are /, /home, /usr and /var. One can summarize those as all UFS filesystems except /tmp.
If your disk has died, you need to re-partition and newfs your new disk to match the old one. To that end, one should always store a copy of /etc/fstab and the output of the bsdlabel command;
bsdlabel /dev/ad4s1
# /dev/ad4s1:
8 partitions:
#          size     offset    fstype   [fsize bsize bps/cpg]
  a:    1024000         16    4.2BSD     2048 16384 64008
  b:   16777216    1024016      swap
  c:  976768002          0    unused        0     0   # "raw" part, don't edit
  d:    4194304   17801232    4.2BSD     2048 16384 28528
  e:  104857600   21995536    4.2BSD     2048 16384 28528
  f:   41943040  126853136    4.2BSD     2048 16384 28528
  g:  807971826  168796176    4.2BSD     2048 16384     0
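To keep this at hand, save the label to a file on the backup disk; a replacement disk of the same size can then be labelled from it with bsdlabel’s -R flag. The file name here is just an example.

```shell
# Save the current label to the backup disk:
bsdlabel /dev/ad4s1 > /mnt/root/ad4s1.label

# On a replacement disk, after recreating the slice with fdisk(8),
# write the saved label back:
bsdlabel -R ad4s1 /mnt/root/ad4s1.label
```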
Another thing that is handy to save is the output of the ‘dumpfs -m’ command for all of your partitions. This gives you the newfs command needed to recreate each partition.
dumpfs -m /
# newfs command for / (/dev/ad4s1a)
newfs -O 2 -a 8 -b 16384 -d 16384 -e 2048 -f 2048 -g 16384 -h 64 -m 8 -o time -s 256000 /dev/ad4s1a

dumpfs -m /home
# newfs command for /home (/dev/ad4s1g.eli)
newfs -O 2 -U -a 8 -b 16384 -d 16384 -e 2048 -f 2048 -g 16384 -h 64 -m 8 -o time -s 201992956 /dev/ad4s1g.eli

dumpfs -m /usr
# newfs command for /usr (/dev/ad4s1f)
newfs -O 2 -U -a 8 -b 16384 -d 16384 -e 2048 -f 2048 -g 16384 -h 64 -m 8 -o time -s 10485760 /dev/ad4s1f

dumpfs -m /var
# newfs command for /var (/dev/ad4s1d)
newfs -O 2 -U -a 8 -b 16384 -d 16384 -e 2048 -f 2048 -g 16384 -h 64 -m 8 -o time -s 1048576 /dev/ad4s1d
My method of choice is to use dump(8) to back up the system partitions to a USB disk, and to use rsync(1) for /home. The USB disk is divided into a single slice with two partitions. The smaller first one is for the OS dumps; the second one is encrypted and stores a copy of /home.
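Setting up such an encrypted partition is a one-time job with geli(8). The sketch below leaves the geli options (key length, keyfiles) at their defaults and uses the device name from the backup steps further on.

```shell
# One-time initialisation of the encrypted backup partition.
# geli init will prompt for a passphrase.
geli init /dev/da0s1d
geli attach /dev/da0s1d
newfs -U /dev/da0s1d.eli
```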
First, the system partitions are backed up. I’ve written a shell script to perform the dumps of the relevant partitions. It’s called dodumps.
As called below, it makes level 0 backups and stores them in /tmp;
rm -rf /usr/obj/*
src/scripts/dodumps 0 /tmp
mount /dev/da0s1a /mnt/root
rm -f /mnt/root/*.dump*
cp -vp /tmp/*.dump /mnt/root
umount /mnt/root
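The dodumps script itself is not reproduced here, but a minimal sketch of its shape might look like the following, shown as a shell function for illustration. The partition list and the naming scheme for the dump files are my assumptions.

```shell
# Hypothetical sketch of a dodumps-style helper (the real script is not
# shown). Usage: dodumps <level> <directory>
dodumps() {
    level=${1:-0}
    dest=${2:-/tmp}
    for fs in / /usr /var; do
        name=${fs#/}                  # "/" -> "", "/usr" -> "usr"
        [ -z "$name" ] && name=root   # call the root dump "root.dump"
        dump -"$level" -L -a -u -f "$dest/$name.dump" "$fs"
    done
}
```

Called as `dodumps 0 /tmp`, this would produce root.dump, usr.dump and var.dump in /tmp, matching the copy step above.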
Then the encrypted partition is mounted and the /home partition is synchronized;
geli attach /dev/da0s1d
mount /dev/da0s1d.eli /mnt/root
rsync -axq --delete /home/ /mnt/root/home
sync
umount /mnt/root
geli detach /dev/da0s1d.eli