Big Data Retention

September 18, 2013, Storage Developer Conference, Santa Clara, CA—Sam Fineberg from HP and Simona Rabinovici-Cohen from IBM talked about combining cloud, tape, and container format technologies for long-term big data retention. With the move in big data to use historical information as part of the analytics, long-term data retention becomes a necessity.

Big data is now generating 2.5 quintillion bytes of new data per day, and the Internet of Things will only increase this volume. By 2020, 19.8 billion devices will be connected, of which 4.2 billion will be smartphones and tablets and 3.4 billion will be PCs. Data analytics will be looking at all of this data, and analysts have found that historical information has significant value.

The analytics are generating new techniques and algorithms to gain new insights. These tools need to ensure that all necessary information is captured to extract the full value. As a result, big data has similarities to long-term preservation.

Some of the drivers for data preservation include regulatory compliance and legal issues, emerging Web services and applications, and other fixed content repositories. Domains with long-term preservation requirements include scientific and cultural data, which we would like to keep forever. Another is healthcare, where x-rays are stored for periods of 75 years and minors' records are needed until the patients are between 20 and 43 years of age. Media and entertainment companies would like to keep film masters, outtakes, and related artifacts for 100 years or more. Long-term retention of 20 years or more is now required by most companies due to legal risk, compliance regulations, and business and security risk.

The goals of digital preservation are to assure that digital assets remain accessible, undamaged, and usable for as long as desired. These time frames exceed the lifetime of most storage systems and existing technologies, and all these data must be stored at an affordable cost.

Data preservation requires more than just saving the raw data. The SNIA is developing the Self-contained Information Retention Format (SIRF), a logical container for a set of digital preservation objects and a catalog. The catalog contains metadata related to the entire contents of the container as well as to the individual objects.

The storage container may comprise a logical or physical storage area that is treated as a unit, for example a file system, a tape, a block device, a stream device, or even a data bucket in a cloud storage system. A SIRF container must be self-describing, so that it can be interpreted by different systems; self-contained, so that all data needed for the interpretation is in the container; and extensible, to meet future needs.

The components of a SIRF system include a magic object, which identifies the SIRF container and its version; preservation objects, which are immutable; and the catalog, which is updatable and contains the metadata needed to make the container and preservation objects portable into the future without external functions. Detailed information about the various levels and properties within SIRF is being developed in the SNIA Long Term Retention Technical Work Group (LTR TWG).
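The three components can be pictured as a small data model. The following is only an illustrative sketch in Python, not the SIRF specification: the class names, field names, and the use of SHA-256 as a fixity value are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MagicObject:
    """Identifies the container as SIRF and records its version (hypothetical fields)."""
    specification: str
    version: str


@dataclass(frozen=True)
class PreservationObject:
    """Immutable once created: a frozen dataclass rejects attribute assignment."""
    object_id: str
    payload: bytes


@dataclass
class Catalog:
    """Updatable metadata for the container as a whole and for each object."""
    container_metadata: dict = field(default_factory=dict)
    object_metadata: dict = field(default_factory=dict)


@dataclass
class SirfContainer:
    magic: MagicObject
    catalog: Catalog = field(default_factory=Catalog)
    objects: dict = field(default_factory=dict)

    def add(self, obj: PreservationObject, **metadata) -> None:
        """Store a preservation object and describe it in the catalog."""
        if obj.object_id in self.objects:
            # Preservation objects are never overwritten in this sketch.
            raise ValueError(f"object {obj.object_id!r} already exists")
        self.objects[obj.object_id] = obj
        # Keeping the description next to the data is what makes the
        # container self-describing; sha256 is an assumed fixity choice.
        self.catalog.object_metadata[obj.object_id] = {
            "sha256": hashlib.sha256(obj.payload).hexdigest(),
            **metadata,
        }


if __name__ == "__main__":
    container = SirfContainer(magic=MagicObject("SIRF", "1.0"))
    container.add(PreservationObject("po-1", b"scan data"), media_type="image/png")
    print(container.catalog.object_metadata["po-1"])
```

Only the catalog is meant to change after an object is written; the objects themselves stay fixed, which is why they are modeled as frozen.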

Complementary functions include the Cloud Data Management Interface (CDMI) (ISO/IEC 17826:2012), which defines an interoperable format for moving data and metadata between cloud providers. CDMI has several implementations, including one for OpenStack, and offers data services including security ratio, encryption, replication, retention, etc. The data objects are designed to be accessed by standard browsers and Internet tools.

Another function is the Linear Tape File System (LTFS), which creates two partitions on a linear tape: an index partition and a data partition. In LTO-5, the index partition consumes 37.5 GB of the 1.5 TB available. The file system module implements a set of standard file system interfaces using FUSE. The file system includes an on-tape structure to track the tape contents. This format is becoming the standard for linear tape.
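To put the LTO-5 figures in proportion, the short calculation below works out the share of the tape taken by the index partition. It assumes decimal units (1 GB = 10^9 bytes) and that the 1.5 TB figure is the total native capacity from which the index partition is carved; neither assumption is stated in the talk.

```python
GB = 10**9   # decimal gigabyte (assumed unit convention)
TB = 10**12  # decimal terabyte (assumed unit convention)

INDEX_PARTITION_BYTES = 37_500_000_000   # 37.5 GB, as quoted for LTO-5
TAPE_CAPACITY_BYTES = 1_500_000_000_000  # 1.5 TB, as quoted for LTO-5


def index_share(index_bytes: int, capacity_bytes: int) -> float:
    """Fraction of the tape capacity used by the index partition."""
    return index_bytes / capacity_bytes


share = index_share(INDEX_PARTITION_BYTES, TAPE_CAPACITY_BYTES)
data_partition_bytes = TAPE_CAPACITY_BYTES - INDEX_PARTITION_BYTES

print(f"index partition share: {share:.1%}")
print(f"left for the data partition: {data_partition_bytes / TB} TB")
```

Under these assumptions the index partition takes 2.5% of the tape, leaving 1.4625 TB for the data partition.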

The combination of SIRF, CDMI, and LTFS enables future clients to understand containers created by today's clients. This is important because the properties of future clients are unknown to us today, yet those clients must still be able to identify the preservation objects in the container, the packaging format for each object, and so on.

A CDMI container qualifies as a SIRF container when the magic object is mapped to the container metadata and includes the specification ID and version, the SIRF level, and the catalog object ID. The SIRF catalog is an object in the CDMI container that is formatted in JSON. A simple SIRF preservation object is mapped to a single CDMI data object, which can be a tar or zip archive. A composite object can be mapped to a set of data objects on a one-to-one basis, plus a manifest data object that includes the IDs and fixities of the element data objects.
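The mapping can be sketched in a few lines of Python. This is a hedged illustration only: the metadata key names (`sirf_specification_id` and so on), the manifest layout, and the choice of SHA-256 for fixity are assumptions for the example and are not taken from the CDMI or SIRF specifications.

```python
import hashlib
import json


def magic_metadata(spec_id: str, version: str, level: int, catalog_id: str) -> dict:
    """Magic object expressed as container metadata (hypothetical key names)."""
    return {
        "sirf_specification_id": spec_id,
        "sirf_specification_version": version,
        "sirf_level": level,
        "sirf_catalog_id": catalog_id,
    }


def fixity(data: bytes) -> dict:
    """Fixity record for one element; SHA-256 is an assumed algorithm."""
    return {"algorithm": "sha256", "digest": hashlib.sha256(data).hexdigest()}


def build_manifest(elements: dict) -> str:
    """JSON manifest listing the ID and fixity of each element data object."""
    entries = [
        {"id": object_id, "fixity": fixity(data)}
        for object_id, data in sorted(elements.items())
    ]
    return json.dumps({"elements": entries}, indent=2)


def verify_manifest(manifest_json: str, elements: dict) -> bool:
    """Check that every listed element is present and still matches its digest."""
    for entry in json.loads(manifest_json)["elements"]:
        data = elements.get(entry["id"])
        if data is None or fixity(data) != entry["fixity"]:
            return False
    return True


if __name__ == "__main__":
    # A composite preservation object stored as two element data objects.
    parts = {"film-master": b"frames", "outtakes": b"more frames"}
    manifest = build_manifest(parts)
    print(json.dumps(magic_metadata("SIRF", "1.0", 1, "catalog-001"), indent=2))
    print(manifest)
    print(verify_manifest(manifest, parts))
```

Storing the fixities in the manifest lets a future client confirm that each element is intact without relying on any service outside the container.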

SIRF for LTFS can fit into the index partition which is large enough to hold the index, the SIRF catalog and other information such as thumbnails of images. The LTFS tape can also be a SIRF container when the magic object is mapped to extended attributes in the LTFS index or directory.
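A rough sketch of the extended-attribute mapping follows. The `user.sirf.` attribute prefix and the field names are hypothetical, chosen only for illustration; the helper that writes the attributes uses Python's `os.setxattr`, which is available on Linux only, and it is defined here but not called.

```python
import os

# Hypothetical attribute namespace; the real names are set by the SIRF work.
XATTR_PREFIX = "user.sirf."


def magic_to_xattrs(magic: dict) -> dict:
    """Encode magic-object fields as extended-attribute name/value pairs."""
    return {
        XATTR_PREFIX + key: str(value).encode("utf-8")
        for key, value in magic.items()
    }


def xattrs_to_magic(attrs: dict) -> dict:
    """Recover the magic-object fields, ignoring unrelated attributes."""
    return {
        name[len(XATTR_PREFIX):]: value.decode("utf-8")
        for name, value in attrs.items()
        if name.startswith(XATTR_PREFIX)
    }


def write_xattrs(path: str, attrs: dict) -> None:
    """Attach the attributes to a file or directory (Linux only)."""
    for name, value in attrs.items():
        os.setxattr(path, name, value)


if __name__ == "__main__":
    magic = {"specification": "SIRF", "version": "1.0", "catalog_id": "catalog-001"}
    attrs = magic_to_xattrs(magic)
    print(attrs)
    print(xattrs_to_magic(attrs) == magic)
```

On a mounted LTFS volume, `write_xattrs` would be pointed at the directory that represents the container, assuming the mount supports user extended attributes.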

Other organizations are developing complementary frameworks and tools in the area of preservation. ForgetIT is an FP7 EU project that introduces a new concept for easing the adoption of preservation in personal and organizational contexts. The concept is to couple information management and preservation management so that information can be brought back into active use in a meaningful way. It also incorporates managed forgetting, which is inspired by human forgetting. See www.forgetit-project.EU/ and www.snia.org/ltr
