GateKeeper Back-Up Services - Technical Overview

Data Repository

All data is stored as files within the server’s file system. The server is portable and runs on Linux, Solaris, or Win32. Our storage backends use the ReiserFS, XFS, or ZFS (Solaris) file systems. ReiserFS and XFS are journaling file systems and ZFS is a copy-on-write, transactional file system; all three support multi-terabyte partitions, 64-bit file sizes, and millions of files per directory, offering both horizontal and vertical scalability. Our use of ZFS adds a further layer of redundancy to mitigate data corruption. RAID-6 is used for our repository partitions.

Each account is assigned a subdirectory, which in turn contains one subdirectory for each root backup folder. Each root backup folder contains the following subdirectories:

data: Current versions of files
meta: Historical versions of files
deldata: Deleted versions of files
delmeta: Deleted, historical files
index: Mirror of the directory structure, so that directory listings can be generated quickly
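
The layout above can be sketched in a few lines of Python. This is only an illustration: the five subdirectory names come from the list above, while the function name, the account name, and the root folder name are hypothetical.

```python
from pathlib import Path

# Subdirectory names taken from the list above; everything else is illustrative.
ROOT_FOLDER_SUBDIRS = ("data", "meta", "deldata", "delmeta", "index")


def create_root_backup_folder(repository: Path, account: str, root_folder: str) -> Path:
    """Create <repository>/<account>/<root_folder>/ with its five subdirectories."""
    base = repository / account / root_folder
    for name in ROOT_FOLDER_SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

For example, create_root_backup_folder(Path("/repo"), "acct-001", "C_Documents") would produce /repo/acct-001/C_Documents/data, /repo/acct-001/C_Documents/meta, and so on.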

The current version of a file is always stored as the complete file (encrypted and compressed). Each historical version stores only the data blocks that differ from the next (more recent) version. Thus, to restore the 5th historical version, you start from the current version, apply the deltas of the 4 more recent historical versions in order, and then apply the 5th delta. This is done as the file is downloaded and is very efficient. This method also makes uploading new versions efficient, because existing historical versions do not need to be changed.
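
The reverse-delta scheme described above can be illustrated with a minimal Python sketch. It assumes that a file is a list of fixed-size blocks and that a delta records the older version's block count plus the blocks that differ; encryption and compression are ignored. The names ReverseDelta, make_delta, apply_delta, and restore are hypothetical and do not describe the actual on-disk format.

```python
from typing import Dict, List, NamedTuple


class ReverseDelta(NamedTuple):
    block_count: int           # number of blocks in the older version
    changed: Dict[int, bytes]  # blocks that differ from the next newer version


def make_delta(older: List[bytes], newer: List[bytes]) -> ReverseDelta:
    """Record only the blocks of `older` that are missing from or differ in `newer`."""
    changed = {
        i: block
        for i, block in enumerate(older)
        if i >= len(newer) or newer[i] != block
    }
    return ReverseDelta(len(older), changed)


def apply_delta(newer: List[bytes], delta: ReverseDelta) -> List[bytes]:
    """Rebuild the older version from the newer one plus its reverse delta."""
    blocks = list(newer[: delta.block_count])
    # If the older version was longer, make room; those blocks are all in `changed`.
    blocks += [b""] * (delta.block_count - len(blocks))
    for i, block in delta.changed.items():
        blocks[i] = block
    return blocks


def restore(current: List[bytes], deltas: List[ReverseDelta], steps_back: int) -> List[bytes]:
    """Walk back `steps_back` versions; `deltas` is ordered newest first."""
    blocks = list(current)
    for delta in deltas[:steps_back]:
        blocks = apply_delta(blocks, delta)
    return blocks
```

Under this model, uploading a new version only requires computing one new delta (old current against new current) and prepending it to the list; the deltas already stored are left untouched.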

When the client detects that a file has been deleted, it notifies the server during the next backup. The server annotates the filename with the deletion date and time and moves the file to the deleted data area. Once a week, the client program enumerates and destroys old deleted data. An end user can also destroy data with the file manager. Destroyed data is not erased immediately: it is moved to a parallel repository designed to hold “destroyed” data and is kept there for an additional 30 days, in case the destruction was unintentional.
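
The deletion life cycle above can be sketched as follows. This is a simplified illustration, not the server's implementation: the filename suffixes, the timestamp format, and the function names are assumptions, and only the 30-day hold and the data/deldata directory names come from this document. Timestamps are expected to be timezone-aware UTC values.

```python
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

RETENTION = timedelta(days=30)  # hold period for destroyed data, from the text above
STAMP = "%Y%m%dT%H%M%SZ"        # assumed timestamp format for filename annotations


def mark_deleted(root_folder: Path, relative_name: str, when: datetime) -> Path:
    """Annotate a file with its deletion time and move it from data/ to deldata/."""
    src = root_folder / "data" / relative_name
    dst = root_folder / "deldata" / f"{relative_name}.deleted-{when.strftime(STAMP)}"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst


def destroy(deleted_file: Path, destroyed_repo: Path, when: datetime) -> Path:
    """Move a deleted file into the parallel 'destroyed' repository."""
    destroyed_repo.mkdir(parents=True, exist_ok=True)
    dst = destroyed_repo / f"{deleted_file.name}.destroyed-{when.strftime(STAMP)}"
    shutil.move(str(deleted_file), str(dst))
    return dst


def purge_expired(destroyed_repo: Path, now: datetime) -> list:
    """Permanently remove destroyed files whose 30-day hold has passed."""
    removed = []
    for f in sorted(destroyed_repo.iterdir()):
        if ".destroyed-" not in f.name:
            continue  # not written by destroy(); leave it alone
        stamp = f.name.rsplit(".destroyed-", 1)[-1]
        destroyed_at = datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)
        if now - destroyed_at > RETENTION:
            f.unlink()
            removed.append(f.name)
    return removed
```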

All actions in the repository are transactional, so the system is always in a consistent state. A transaction log is kept on disk; if the server loses power, any incomplete transaction is rolled back at server startup and the system is restored to a consistent state. Transactions are also rolled back automatically if a network connection times out or another error occurs.
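
One common way to obtain this behavior is an undo log that is written to disk before each change is made. The Python sketch below illustrates that idea for file moves only; the log format, the Transaction class, and the recover function are assumptions made for illustration, not a description of the server's actual log.

```python
import json
import os
from pathlib import Path


class Transaction:
    """Record each file move in an on-disk log before performing it."""

    def __init__(self, log_path: Path):
        self.log_path = log_path
        self.log = open(log_path, "a", encoding="utf-8")

    def move(self, src: Path, dst: Path) -> None:
        # Persist the intent first, so a crash after the move can still be undone.
        self.log.write(json.dumps({"src": str(src), "dst": str(dst)}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        os.replace(src, dst)

    def commit(self) -> None:
        # Removing the log marks the transaction as complete.
        self.log.close()
        os.remove(self.log_path)


def recover(log_path: Path) -> int:
    """At startup, undo any transaction whose log was never removed."""
    if not log_path.exists():
        return 0
    entries = []
    for line in log_path.read_text(encoding="utf-8").splitlines():
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # torn final line: that move was logged incompletely and never ran
    undone = 0
    for entry in reversed(entries):  # undo the newest change first
        if os.path.exists(entry["dst"]) and not os.path.exists(entry["src"]):
            os.replace(entry["dst"], entry["src"])
            undone += 1
    os.remove(log_path)
    return undone
```

The same recover routine could be called when a network timeout or other error aborts a transaction, since the log then looks exactly like one left behind by a power loss.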

The repository uses “metadata” objects to track disk usage. There is one metadata object per directory; it records how much data the directory itself contains and how much is contained in the directory together with all of its subdirectories. These metadata objects are updated transactionally in real time, which allows the server to report disk usage to the client program (or a billing process) very efficiently. Because all repository data is stored as files within the native file system, an account’s data can be managed easily with the native operating system’s utilities. Existing tools for mirroring file systems (such as rsync) can also be used to provide additional protection against data loss.
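
The per-directory usage objects described above can be sketched in Python as follows. This is an in-memory illustration only: the names DirUsage and UsageTracker are hypothetical, and the transactional, on-disk aspect is left out.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict


@dataclass
class DirUsage:
    own_bytes: int = 0    # data stored directly in this directory
    total_bytes: int = 0  # this directory plus all of its subdirectories


class UsageTracker:
    def __init__(self) -> None:
        self.meta: Dict[PurePosixPath, DirUsage] = {}

    def record_change(self, directory: str, delta_bytes: int) -> None:
        """Apply a size change in `directory` and propagate it to every ancestor."""
        d = PurePosixPath(directory)
        self.meta.setdefault(d, DirUsage()).own_bytes += delta_bytes
        for node in (d, *d.parents):
            self.meta.setdefault(node, DirUsage()).total_bytes += delta_bytes

    def usage(self, directory: str) -> DirUsage:
        """Answer a usage query with a single lookup, without walking the tree."""
        return self.meta.get(PurePosixPath(directory), DirUsage())
```

Each change costs one update per ancestor directory, and in exchange a usage query for an account or any folder is a single lookup.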