
EDUCAUSE review online

Using Cloud Infrastructure as Part of a Digital Preservation Strategy with DuraCloud


Key Takeaways

  • Preservation of digital content in digital repositories and archives requires appropriate backup and replication, with geographic separation of the replicas as a best practice.
  • Cloud services can aid in replication and preservation of digital content, with concerns about risk mitigated by having trusted organizations provide oversight of data stored with cloud providers.
  • The DuraCloud open-source software platform being developed by DuraSpace aims to provide a fully integrated platform where services and data can be managed across multiple cloud providers, to prevent lock-in or reliance on any single provider.
  • DuraCloud will be offered as a hosted service providing data storage, data replication, and services to support data preservation, data transformation, and data access.

To ensure multiple copies of digital content for digital repositories and archives, both systems must offer backup and replication services. Backup alone does not serve as an appropriate solution for trusted digital archives, however; replication of content is best practice, and it is especially important to separate the replicas geographically. To mitigate the risks of technology failure, it is even better to store replicated content in systems that use different underlying technologies than the original archive system. To avoid information loss due to obsolescence of content and metadata formats, archiving systems should provide mechanisms to monitor and transform content as needed. Lastly, appropriate security mechanisms are essential to prevent tampering with and unauthorized access to content.

With the emergence of cloud infrastructure as a service (IaaS), using cloud technologies to support data replication has become an option. While well-documented concerns exist (trust, control of data location, guarantees against data loss),1 taking advantage of the scalability and low cost of utility cloud providers looks increasingly attractive. Research commissioned by DuraSpace found that 50 percent of technology decision makers surveyed in the DSpace and Fedora communities expected to use cloud services within the next year. Participants in the study cited replication and preservation services as their top interests in the cloud. They also indicated that the prospect of trusted organizations providing oversight of data stored with cloud providers mitigated their concerns about risk.

DuraCloud, a software platform being developed by the DuraSpace not-for-profit organization, will provide easy entry into the cloud infrastructure by offering data storage, data replication, and services to support data preservation, data transformation, and data access. DuraSpace is planning to host DuraCloud as a service following the completion of a pilot phase with three partners, now under way and funded by the Library of Congress.2 Since the core components of DuraCloud will be released as open-source software, institutions or consortia will be able to install the DuraCloud core to create and manage their own cloud networks.

DuraCloud aims to provide trusted cloud mediation with different levels of service aimed at making digital content (1) durable, meaning accessible for long periods of time, and (2) usable, meaning that it can be retrieved and accessed or dynamically transformed to fit within a variety of application and system contexts. DuraCloud provides a simple, open application programming interface (API) with back-end connectors to multiple cloud storage providers. The strategy of mediating between digital repositories and archive systems and multiple cloud providers hedges risks and overcomes obstacles to storing data at any one provider, such as having a single point of failure for data storage and data lock-in.2 Currently, there are DuraCloud connectors to three commercial cloud services (Amazon Web Services, EMC Atmos Online Services, and the Rackspace Cloud). Key features of the software enable users to:

  • Transparently push content to multiple third-party storage providers. This allows organizations to take advantage of cost-effective Internet-based storage, using the DuraCloud software to send content to one or more underlying cloud storage providers.
  • Use value-added services. The DuraCloud platform adds value to what the underlying storage providers offer, with a particular focus on services that enable longevity of content and facilitate flexible use and reuse. These services are provided as a menu from which users can choose services to implement. Services planned include:
    • Preservation support: Replication, file format transformation, and bit integrity checking.
    • Access and reuse: Image viewing and editing, video streaming and editing, and faceted browse and search.
  • Leverage open-source technologies. DuraCloud is being built as open-source software, in keeping with the open-source principles promoted by both Fedora Commons and DSpace. Core DuraCloud software components will be released as open source in the summer of 2010.
  • Choose hosted or run-your-own. The DuraSpace organization plans to run the DuraCloud platform software as a service. Since it is built on open-source technologies, others can pick up the service and run local instances to create their own hybrid cloud or cloud consortium network.
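The mediation idea described above, one simple interface in front of several back-end storage connectors, can be illustrated with a minimal Python sketch. This is not the DuraCloud API: the names `StorageProvider`, `InMemoryProvider`, `Mediator`, `store`, `retrieve`, and the `container` parameter are all hypothetical, and the in-memory class merely stands in for a real cloud connector.

```python
import hashlib
from typing import Dict, List, Protocol, Tuple


class StorageProvider(Protocol):
    """Hypothetical minimal interface a back-end connector would satisfy."""

    name: str

    def put(self, container: str, key: str, data: bytes) -> None: ...

    def get(self, container: str, key: str) -> bytes: ...


class InMemoryProvider:
    """Stand-in for a real cloud connector; keeps objects in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, container: str, key: str, data: bytes) -> None:
        self._objects[(container, key)] = data

    def get(self, container: str, key: str) -> bytes:
        # Raises KeyError when the object is missing.
        return self._objects[(container, key)]


class Mediator:
    """Pushes each object to every configured provider (assumed design)."""

    def __init__(self, providers: List[StorageProvider]) -> None:
        if not providers:
            raise ValueError("at least one storage provider is required")
        self.providers = providers

    def store(self, container: str, key: str, data: bytes) -> Dict[str, str]:
        """Write to all providers, read back, and verify each copy."""
        expected = hashlib.sha256(data).hexdigest()
        verified: Dict[str, str] = {}
        for provider in self.providers:
            provider.put(container, key, data)
            stored = hashlib.sha256(provider.get(container, key)).hexdigest()
            if stored != expected:
                raise IOError(f"checksum mismatch at provider {provider.name}")
            verified[provider.name] = stored
        return verified

    def retrieve(self, container: str, key: str) -> bytes:
        """Return the first available replica, so one provider can fail."""
        for provider in self.providers:
            try:
                return provider.get(container, key)
            except KeyError:
                continue
        raise KeyError(f"{container}/{key} not found at any provider")
```

Because callers only talk to the mediator, adding or dropping a provider is a configuration change rather than an application change, which is the lock-in hedge the list above describes.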

The DuraCloud project has already demonstrated cloud replication capabilities using data from our initial pilot partners. We have ingested 10 terabytes of data from each partner and are currently testing cloud replication with up to three replicas across different cloud providers. We have implemented integrity checking (checksum calculation and validation) at every transfer point in the ingestion process. During the pilots we have also successfully navigated both policy-imposed and practical data-transfer limits that are lower than the actual storage limit for a file. Cloud providers are establishing limits on maximum single-file size (in Amazon S3, the established limit is 5 gigabytes per object stored). Files that exceed the transfer limit can be stored after they have been broken into parts ("chunked") from which they can be reassembled, a capability that can be used across our multiple underlying storage providers. In addition, we are currently working on the "stitching" capability that puts the chunks back together for access. To enable graceful evolution of DuraCloud, we created a service plug-in architecture and demonstrated the deployment of an initial set of services, including format transformation and viewing of very large images, with a data mining service and a video streaming service up next.
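The chunking, stitching, and checksum validation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the function names, the manifest layout, and the choice of SHA-256 are all hypothetical, and the chunks are held in memory only to keep the example short.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterator, List, Tuple


def chunk_stream(stream: BinaryIO, max_chunk_size: int) -> Iterator[bytes]:
    """Yield successive parts of at most max_chunk_size bytes."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    while True:
        part = stream.read(max_chunk_size)
        if not part:
            break
        yield part


def build_manifest(stream: BinaryIO, max_chunk_size: int) -> Tuple[Dict, List[bytes]]:
    """Split a stream and record a checksum per chunk and for the whole file."""
    whole = hashlib.sha256()
    chunks: List[bytes] = []
    entries: List[Dict] = []
    for index, part in enumerate(chunk_stream(stream, max_chunk_size)):
        whole.update(part)
        chunks.append(part)  # a real tool would upload each part instead
        entries.append({
            "index": index,
            "size": len(part),
            "sha256": hashlib.sha256(part).hexdigest(),
        })
    return {"sha256": whole.hexdigest(), "chunks": entries}, chunks


def stitch(manifest: Dict, chunks: List[bytes]) -> bytes:
    """Reassemble chunks in order, validating every checksum on the way."""
    if len(chunks) != len(manifest["chunks"]):
        raise ValueError("chunk count does not match the manifest")
    out = io.BytesIO()
    for entry, part in zip(manifest["chunks"], chunks):
        if hashlib.sha256(part).hexdigest() != entry["sha256"]:
            raise ValueError(f"chunk {entry['index']} failed its checksum")
        out.write(part)
    data = out.getvalue()
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("reassembled file failed its checksum")
    return data
```

Keeping a checksum for each chunk as well as for the whole file lets a failed transfer be traced to a single part, so only that part needs to be sent again.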

We have begun work to support repository replication and synchronization with existing local Fedora and DSpace repositories as well as other file-based content management systems. The synchronization tool will copy the underlying file directory from the repository or content management system to DuraCloud, keeping the cloud store synchronized with the primary local store if desired.
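One way to picture the synchronization step is as a comparison of checksums between the local file directory and the cloud store. The sketch below assumes that approach; the source does not say how the synchronization tool decides what to copy. `checksum_tree`, `plan_sync`, `SyncPlan`, and the `delete_extraneous` flag are hypothetical names, and the remote listing is represented as a plain dictionary of path to checksum.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    upload: List[str]  # paths that are new or changed locally
    delete: List[str]  # paths present remotely but gone locally


def checksum_tree(root: Path) -> Dict[str, str]:
    """Map each file under root (as a relative POSIX path) to its SHA-256."""
    result: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            result[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def plan_sync(
    local: Dict[str, str],
    remote: Dict[str, str],
    delete_extraneous: bool = False,
) -> SyncPlan:
    """Decide which files to copy so the remote store mirrors the local one."""
    upload = sorted(
        path for path, digest in local.items() if remote.get(path) != digest
    )
    delete = (
        sorted(path for path in remote if path not in local)
        if delete_extraneous
        else []
    )
    return SyncPlan(upload=upload, delete=delete)
```

Leaving `delete_extraneous` off by default is a deliberately conservative choice for a preservation setting: a file removed locally by mistake is not automatically removed from the replica.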

The DuraSpace team will conclude the pilot program by the end of 2010. The DuraSpace organization believes IaaS will become as ubiquitous as electric utility infrastructure is today. Therefore, it is imperative we begin the process of learning how to use this infrastructure and connect it to our existing systems to adapt and take advantage of what the cloud has to offer for digital repositories and archive systems.

Endnotes
  1. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing, Electrical Engineering and Computer Sciences," University of California at Berkeley, Technical Report No. UCB/EECS-2009-28, February 10, 2009.
  2. With funding from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP), initial DuraCloud pilots were formed with three partners: Biodiversity Heritage Library, New York Public Library, and WGBH Media Archive.

Michele Kimpton

Michele Kimpton is Chief Executive Officer of DuraSpace and one of the founders of the organization. DuraSpace was formed in July 2009, bringing together the DSpace Foundation and Fedora Commons organizations. DuraSpace is a not-for-profit organization that provides guidance and support for the open source software projects DSpace, Fedora, and, more recently, DuraCloud. Michele sets the strategic direction for DuraSpace with the executive team and members of the Board. She was recently named a Digital Preservation Pioneer by the NDIIPP program at the Library of Congress; more detail is available at http://www.digitalpreservation.gov/partners/pioneers/detail_kimpton.html.

Prior to joining DuraSpace, Michele Kimpton was the founder of the DSpace Foundation, a not-for-profit organization set up to provide leadership and support to the community of users of the DSpace open source software platform. The mission of the Foundation was to promote open access and preservation of the world's scholarly works. The DSpace open source software platform is freely available to any person or institution wishing to preserve, manage, and provide Internet access to their digital collections. Currently there are over one thousand installations worldwide using DSpace software.

Prior to joining DSpace, Michele Kimpton was the Director at Internet Archive for five years. In that role she worked closely with national libraries, archives, and universities to provide technical expertise and services in web archiving. She developed partnerships with several of these institutions to collaborate on web archiving activities, and was one of the founding members of the International Internet Preservation Consortium.

 

Sandy Payette

As Executive Director of Fedora Commons, Sandy bridges research and innovation with practical applications and open source software deployment. Her original research at Cornell University Information Science led to the founding of the Fedora Project, which, in 2007, she successfully directed into the Fedora Commons non-profit organization. Sandy continually collaborates with scholars, scientists, and practitioners nationally and internationally to further the mission of Fedora Commons and to continue her research in scholarly communication, digital libraries, digital preservation, and information modeling. Sandy also spent ten years in industry leading information technology projects at Corning Incorporated, a Fortune 500 company. Her leadership led to early adoption of computing and information technologies by executives and senior management, helping to forge new processes and techniques for strategic business analysis.

 
