September 18, 2013, Storage Developer Conference, Santa Clara, CA—Sam Fineberg from HP and Simona Rabinovici-Cohen from IBM talked about combining cloud, tape, and container format technologies for long-term big data retention. With the move in big data to use historical information as part of the analytics, long-term data retention becomes a necessity.
Big data now generates 2.5 quintillion bytes of new data per day, and the Internet of Things will only increase this volume. By 2020, 19.8 billion devices will be connected, of which 4.2 billion will be smartphones and tablets and 3.4 billion PCs. Data analytics will draw on all of these data, and analysts have found that historical information has significant value.
The analytics are generating new techniques and algorithms to gain new insights. These tools need to ensure that all necessary information is captured to extract the full value. As a result, big data has similarities to long-term preservation.
Some of the drivers for data preservation include regulatory compliance and legal issues, emerging Web services and applications, and other fixed content repositories. Some domains that have long-term preservation requirements include scientific and cultural data, which we would like to keep forever. Others include healthcare, where x-rays are stored for periods of 75 years and minors' records must be kept until the patients reach between 20 and 43 years of age. Media and entertainment would like to keep film masters, outtakes, and related artifacts for 100 years or more. Long-term retention of 20 years or more is now required by most companies due to legal risk, compliance regulations, and business and security risk.
The goals of digital preservation are to assure that digital assets remain accessible, undamaged, and usable for as long as desired. These time frames exceed the lifetime of most storage systems and existing technologies, and all these data must be stored at an affordable cost.
Data preservation requires more than just saving the raw data. The SNIA is developing the Self-contained Information Retention Format (SIRF), a logical container for a set of digital preservation objects and a catalog. The catalog contains metadata related to the entire contents of the container as well as to the individual objects.
The storage container may comprise a logical or physical storage area considered as a unit: for example, a file system, a tape, a block device, a stream device, or even a data bucket in a cloud storage system. A SIRF container must be self-describing, so that it can be interpreted by different systems; self-contained, so that all data needed for the interpretation is in the container; and extensible, to meet future needs.
The components of a SIRF system include a magic object, which identifies the SIRF container and its version; preservation objects, which are immutable; and the catalog, which is updatable and contains the metadata needed to make the container and preservation objects portable into the future without external functions. Detailed information about the various levels and properties within SIRF is being developed in the SNIA Long Term Retention Technical Work Group (LTR TWG).
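The three components can be pictured with a minimal Python sketch. This is only an illustration of the roles described above, not the SIRF specification: every class and field name here (MagicObject, PreservationObject, Catalog, SirfContainer, object_id, and so on) is hypothetical, and the version string "1.0" is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MagicObject:
    """Identifies the container as SIRF and records its version."""
    specification: str
    version: str


@dataclass(frozen=True)
class PreservationObject:
    """Immutable once created: frozen=True rejects attribute assignment."""
    object_id: str
    payload: bytes


@dataclass
class Catalog:
    """Updatable metadata about the container and about each object."""
    container_metadata: dict = field(default_factory=dict)
    object_metadata: dict = field(default_factory=dict)


@dataclass
class SirfContainer:
    magic: MagicObject
    catalog: Catalog = field(default_factory=Catalog)
    objects: dict = field(default_factory=dict)

    def add(self, obj: PreservationObject, **metadata) -> None:
        """Store a new object and record its metadata in the catalog."""
        if obj.object_id in self.objects:
            raise ValueError(f"{obj.object_id} already exists and is immutable")
        self.objects[obj.object_id] = obj
        self.catalog.object_metadata[obj.object_id] = dict(metadata)


# Hypothetical usage: the payload never changes, the catalog entry can.
container = SirfContainer(MagicObject("SIRF", "1.0"))
container.add(PreservationObject("po-1", b"scan data"), media_type="image/tiff")
container.catalog.object_metadata["po-1"]["note"] = "migrated to new media"
```

The split mirrors the description: the payload is write-once, while everything a future reader needs in order to interpret it lives in the updatable catalog that travels inside the same container.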
Complementary functions include the Cloud Data Management Interface (CDMI) (ISO/IEC 17826:2012), which defines an interoperable format for moving data and metadata between cloud providers. It has several implementations, including one for OpenStack, and supports data services such as security, encryption, replication, and retention. The data objects are designed to be accessed by standard browsers and Internet tools.
Another function is the Linear Tape File System (LTFS), which creates two partitions on a linear tape: an index partition and a data partition. In LTO5, the index partition consumes 37.5 GB of the 1.5 TB available. The file system module implements a set of standard file system interfaces using FUSE. The file system includes an on-tape structure to track the tape contents. This format is becoming the standard for linear tape.
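To put the LTO5 figures in proportion, the short calculation below works out the share of the tape taken by the index partition. It assumes decimal units (1 TB = 1000 GB), which the talk did not state explicitly.

```python
# Figures quoted for LTO5; decimal units (1 TB = 1000 GB) are an assumption.
INDEX_PARTITION_GB = 37.5
TAPE_CAPACITY_GB = 1.5 * 1000

index_share = INDEX_PARTITION_GB / TAPE_CAPACITY_GB
data_partition_gb = TAPE_CAPACITY_GB - INDEX_PARTITION_GB

print(f"index partition share: {index_share:.1%}")
print(f"left for the data partition: {data_partition_gb} GB")
```

Under that assumption the index partition takes 2.5% of the cartridge, leaving 1462.5 GB for the data partition.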
The combination of SIRF, CDMI, and LTFS enables future clients, whose properties are unknown to us today, to understand containers created by today's clients. Such a client can still identify the preservation objects in the container, the packaging format for each object, and so on.
A CDMI container qualifies as a SIRF container when the magic object is mapped to the container metadata and includes the specification ID and version, the SIRF level, and the catalog object ID. The SIRF catalog is an object in the CDMI container that is formatted in JSON. A SIRF preservation object that is a simple object is mapped to a CDMI data object, which can be a tar/zip. Composite objects can be mapped to a set of data objects on a one-to-one basis, plus a manifest data object that includes the IDs and fixities of the element data objects.
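The manifest idea can be sketched in a few lines of Python. The metadata key names, the JSON layout, and the choice of SHA-256 as the fixity algorithm are all assumptions made for illustration; the talk did not give the actual SIRF or CDMI field names, and nothing here calls a real CDMI API.

```python
import hashlib
import json

# Hypothetical key names standing in for the magic object fields that are
# mapped to the CDMI container metadata.
container_metadata = {
    "sirf_specification_id": "SIRF",
    "sirf_version": "1.0",
    "sirf_level": "1",
    "sirf_catalog_object_id": "catalog.json",
}


def sha256_hex(data: bytes) -> str:
    """Fixity value for one element (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(elements: dict) -> str:
    """Return a JSON manifest listing the ID and fixity of each element."""
    entries = [
        {"id": element_id, "fixity": {"algorithm": "sha256", "value": sha256_hex(data)}}
        for element_id, data in sorted(elements.items())
    ]
    return json.dumps({"elements": entries}, indent=2)


def find_damaged(manifest_json: str, elements: dict) -> list:
    """Return IDs of elements that are missing or no longer match their fixity."""
    damaged = []
    for entry in json.loads(manifest_json)["elements"]:
        data = elements.get(entry["id"])
        if data is None or sha256_hex(data) != entry["fixity"]["value"]:
            damaged.append(entry["id"])
    return damaged


# Hypothetical composite object with two element data objects.
elements = {"page-1": b"first scan", "page-2": b"second scan"}
manifest_json = build_manifest(elements)
damaged = find_damaged(manifest_json, {**elements, "page-2": b"bit rot"})
```

Storing the fixities in a separate manifest object lets a future client check each element data object independently, without needing to unpack or understand the others first.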
SIRF for LTFS can fit into the index partition, which is large enough to hold the index, the SIRF catalog, and other information such as thumbnails of images. An LTFS tape can also be a SIRF container when the magic object is mapped to extended attributes in the LTFS index or directory.
Other organizations are developing complementary frameworks and tools in the area of preservation. ForgetIT is an EU FP7 project that introduces new concepts for easing the adoption of preservation in personal and organizational contexts. The idea is to couple information management and preservation management so that preserved information can be brought back into active use in a meaningful way. It also incorporates managed forgetting, which is inspired by human forgetting. See www.forgetit-project.EU/ and www.snia.org/ltr.