Starting the Conversation: University-wide Research Data Management Policy
This article represents a call for action to address the high-level benefits of adopting a university-wide policy regarding research data management. An institution should identify all the stakeholders, bring them together to discuss their interests, create policy, and actively determine how it will manage the university's data assets.
OCLC Research Library Partnership Data Curation Policy Working Group
Daniel Tsang, University of California, Irvine (Chair)
Anna Clements, University of St. Andrews
Joy Davidson, Digital Curation Centre
Mike Furlough, Pennsylvania State University
Amy Nurnberger, Columbia University
Sally Rumsey, University of Oxford
Anna Shadbolt, University of Melbourne
Claire Stewart, Northwestern University
Beth Forrest Warner, Ohio State University
Perry Willett, California Digital Library
This article represents a call for action. It addresses the high-level benefits of adopting a university-wide policy regarding research data management. Much information is available about research funders requiring data management plans, and universities have moved quickly to address those requirements (see, for example, the University of California Library's DMP Tool and UC3 consulting service). Now is a good time to reflect on broader issues. This article identifies the various university stakeholders and suggests a conversation among them to get buy-in for a proactive, rather than reactive, high-level policy for responsible data planning and management.
For those institutions that have not yet had this broader discussion, this is an opportunity for one or more of the stakeholders to play an entrepreneurial role in furthering the mission of the larger enterprise. No one stakeholder need own all the functions, responsibilities, or systems, but by initiating the university-wide conversation, the proactive stakeholder can make sure to be at the table and contribute their expertise to the discussion. The intent of this call for action is to initiate the conversation and to secure buy-in from all key stakeholders in a policy that is supported and sustainable.
Data, the units of information observed, collected, or created during the course of research, is not limited to scientific data, but includes social science statistical and ethnographic data, humanities texts, or any other data used or produced in the course of academic research, whether it takes the form of text, numbers, image, audio, video, models, analytic code, or some yet-to-be-identified data type. Responsible data policy and planning is not limited to managing data while the research project is active and storing the data afterwards; it also encompasses the institutional rationale for managing research data and the ensuing implications for the university.
Now that universities have a few years of experience preparing data management plans required by grant funding agencies, desirable outcomes are becoming apparent. Making data sets available can support validation of results and the reproducibility of research. Data can be repurposed in ways not foreseen by the originating researchers, inspiring collaborations and new areas of research. Planning for data management early on will make curation activities much easier throughout the data lifecycle. Efficiencies can be achieved when data curation activities are not treated as one-off occurrences.
If responsible data policy is implemented university-wide, many other desirable outcomes are possible:
- Clear expectations will ease the way for data managers.
- Uniform requirements will facilitate data understandability and sharing among researchers.
- Consistent data management standards, training, and tracking programs can foster harmony within the university.
- A standardized approach to data management will ease compliance and improve management of and access to the university's intellectual assets.
- Positive impacts and efficiencies can benefit all research conducted at the university, not just that funded by agencies requiring a data management plan.
Assuming that these benefits look appealing, stakeholders should enter into discussions to make sure their university realizes them.
The major stakeholders who should be at the table are the university, office of research, research compliance office, information technology department, researchers, academic units, and library. We consider each in turn.
The University
Research data can be viewed as university assets, stemming from the university's mission to support quality research. Applying best practices to safeguard such assets protects the university's intellectual, financial, human, and material investment in research. The aspiration to commercialize research and patents must be balanced with the desire (in addition to the requirement) to share data. The university will want to ensure that it is a responsible steward of the institution's research outputs, and will want to find economical and sustainable ways to do so.
Responsible data management, and the resulting access to research data, can contribute to an improved public understanding of the university's contributions to the public good. Public support can help ensure future research funding. The university also may wish to make a public commitment to open access. A university-wide policy should address how best practices in managing research data and making it publicly accessible, when feasible, contribute to high-quality research, academic integrity, and responsible stewardship.
The Office of Research
The office of research (sometimes known as the division of sponsored programs and by various other names) has broad responsibility for administration of sponsored research and related policies and services. The senior research officer is a key contact with funding agencies and is involved in university and consortial advocacy around legislative and regulatory matters affecting research funding and the conduct of research. Depending on the organizational structure of a specific college or university, the office of research may have responsibility for technology transfer, patent and other intellectual property administration, research integrity, the institutional review board, oversight of major research centers, and grants management and administration.
In its capacity as contracts and grants administrator, the office of research typically assists investigators with funder requirements, including those for data management and sharing. This usually is where proposals, awards, progress reports, and project completion are tracked. When a data management plan is required at the proposal stage, the office of research can ensure that those who will implement the plan are involved as early as possible. Its staff should be the first point of contact for researchers and should be able to provide knowledgeable guidance about services for data management, both within the institution and externally, as appropriate. Staff will be key partners in conversations about local services, infrastructure, and practices needed to manage data during the active phases of research, and will ensure the data's validity as it is transformed, deposited, and distributed. They will be concerned with the funding, policy, and governance of data management programs, both to maintain good relationships with funders and to contribute to responsible data management. They also will be instrumental in assisting researchers with identifying data management costs for their grant proposals. The research office is in the best position to embed research data management into grant management workflows, providing an opportunity to track how project reporting aligns with grant requirements for the management of research data.
The Research Compliance Office
It is important to recognize the particular point of view that the research compliance office represents. An office of compliance ensures that institutional policies comply with sponsor policies and regulations, and carefully reviews proposed institutional policies with a view towards the practical and procedural issues of compliance, weighing both benefits and risks. The office's responsibility for ensuring compliance with institutional policy through training, communication, and enforcement requires its involvement in policy discussions. Some points of consideration may include uniformity of data management expectations, requirements, and standards; the measures of validation or support that proposed data management systems will require; the responsibilities of the institution for data housed elsewhere; and the impacts of changing data retention requirements.
New compliance requirements for access to data are continually emerging, as evidenced by the White House Office of Management and Budget Memorandum, "Open Data Policy — Managing Information as an Asset" (May 9, 2013) and the earlier memorandum from the Office of Science and Technology Policy, "Increasing Access to the Results of Federally Funded Scientific Research" (February 22, 2013). The compliance office will be responsible for ensuring conformance with resulting future requirements, as well.
The Information Technology Department
As the use of technology extends the reach of research, there is a corresponding increase in the impact on university services and research technology environments or cyberinfrastructure. Today's cyberinfrastructure must support advanced data acquisition, storage, management, security, integration, mining, and visualization, as well as other information-processing services. Many universities' infrastructure is decentralized to research units, departments, and individual laboratories, with varying degrees of coordination by the central information technology department.
Large-scale data storage and data preservation represent the most people-intensive parts of the infrastructure; replicating these functions in multiple locations needs careful consideration. While some laboratories have reasonably reliable systems, many researchers keep irreplaceable data on personal storage devices without documentation, version control, backup, or redundancy. Even where data are handled effectively, they are unlikely to be made available to others for inspection or to enable innovation. All infrastructure must now include systems for documenting, depositing, managing, archiving, and preserving data; facilitating efficient search and retrieval; and providing access.
Rather than depending on individual researchers or labs, these efforts should be based on the premise that long-term stewardship of digital data — the intellectual assets of the university — is a critical responsibility of the university as a whole. Existing technical infrastructure can be coordinated to support data management, but any gaps must be addressed. A coordinated cyberinfrastructure environment can offer advantages such as economies of scale, integration, and a focused approach to coordinating technology and expertise, computing power, and the planning, acquisition, and management of storage space. Critical to the centralized coordination of technical infrastructure is the cost model used. How can costs be managed to support rather than hinder compliance and good practice?
Information technology departments are increasingly aware of their role in strengthening university services to adequately support the various stages of research activity and, in particular, how the resulting research data sets are to be managed throughout their existence. As high-performance computing becomes more affordable, services will need to be commoditized to make them more efficient and scalable. Training also will be needed. In order to situate data management in the larger research information environment, technology leaders may need to integrate the data management system with related systems, such as current research information systems or virtual research environments, to make data management part of the researchers' workflow.
The Researchers
As the producers of the research data that must be managed and preserved, researchers are central stakeholders. They may be especially invested when their career advancement depends on their research outputs. Faculty members and other researchers confront a mix of requirements for data management and open access that are mandated by funding agencies, national and state law, and their own universities. They may negotiate publishing agreements that determine ownership of data — and that in some cases mandate, or preclude, open access. Some researchers already may have experience depositing data in institutional or discipline-based data repositories.
The relationship between researchers and their data is an intimate one. Trust is critical for central university services to meet the needs of researchers and productively engage them. Researchers are likely to resist new administrative burdens, but may be incentivized by evidence that sharing will increase the visibility of their research. Researcher representatives should be included in policy discussions, and all researchers must be clearly informed of resulting decisions and procedures.
The Academic Units
While the office of research is the locus for policies, oversight, and other activities regarding research grants, the researchers themselves are generally in academic units overseen by the university provost. At the operational level, research projects are managed by the principal investigator's home department.
Some academic units have support staff to help with proposal writing, administration, budgets, tracking, and compliance. Some also may have their own technology infrastructure. Academic support staff are an important part of the university's research milieu and should be included as stakeholders. They have close relationships with the researchers in their departments and thus can serve as good conduits for communication. They may feel uncertain about how to respond to the new data management requirements and might welcome guidance, including the provision of a more robust and sustainable infrastructure than they can manage independently.
The Library
The library is well situated to be a key player in data management, curation, and preservation, given its extensive experience with selection, metadata, collections, institutional repositories, preservation, curation, and access. In fact, the library may be the most appropriate place on campus for safe, sustained, and trusted stewardship of research data. Best practice in research data management dictates that research data be actively curated, not just stored or backed up.
Many components of the library have contributions to make:
- Many libraries have subject area liaisons who offer researchers expertise in managing their research projects.
- Research services often provide functional liaisons for research support, and data management activities can build on those existing services.
- The university archives or a digital resources unit can help address appraisal, deposit, retention, reappraisal, and continued availability of research data over the long term.
- Technical processing staff can offer advice about metadata. The library's experience with name authorities will come into play in the area of researcher name disambiguation, making the research easier to discover and giving acknowledgment to the researchers.
- Many research libraries already run an institutional repository for research outcomes, and this infrastructure may be extended to encompass data sets.
The library offers other areas of expertise:
- Copyright issues related to ownership of both source materials and research outcomes are familiar to library staff, as are privacy issues and ensuring implementation of any access restrictions.
- When it makes sense to put the data in an external repository, the library can provide guidance to help researchers meet deposit requirements.
- In many universities, the library has led the way in the creation of data management plans.
- The library has a track record with long-term preservation and provision of access.
Elements of the Conversation
To achieve maximum benefit (and minimal burden), the conversation among stakeholders — and the resulting policy and procedures — should address the following points:
Who owns the data? Many universities assert ownership of research data generated on their campuses, as do some funding agencies. There is, however, widespread misunderstanding among researchers on this issue. Policies on data ownership must be clearly communicated and understood.
What requirements are imposed by others? Funding agencies may mandate public distribution of the resulting data set and require that data management plans be incorporated into the grant proposal. Publishers may require that the data supporting an article be deposited in a particular repository. Collaborative agreements with other institutions may impose stipulations. These requirements should be clarified early in the process.
Which data should be retained? No university can, or should, retain all research data generated by its researchers. Curating research data requires significant investment of staff time and financial resources, so universities should aim to ensure that they are investing only in data that is worth keeping. For example, data from a failed experiment may not merit curation, nor may data derived from secondary analysis of large data sets that are publicly available and archived elsewhere.
Who decides which data to keep? Is it the researcher or someone else? Should other domain experts be consulted? Should peers comment on the data management plan? Which data sets are likely to be reused in future research? In which cases must the underlying data be retained to enable the validation of the research findings by others? What data would be prohibitively expensive to recreate? When a data management plan is required, it may sometimes reasonably state that the data do not merit preservation; perhaps the data could be easily recreated, or an algorithm might have more significance than the data set itself.
For how long should data be maintained? Data may have long-term scientific or institutional value (e.g., as evidence in cases of scientific misconduct), but all preserved data should be subject to review. How will retention periods be tracked? Can notifications for reappraisal be automated? When an agreed-upon retention period is due to expire, how will it be decided whether it should be extended?
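As one illustration of how reappraisal notifications might be automated, the following minimal Python sketch flags data sets whose agreed-upon retention period is about to expire. It assumes a simple registry of data set records; the `DatasetRecord` fields, the `due_for_reappraisal` function, the 180-day notice window, and the sample entries are all hypothetical and are not drawn from any particular repository system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    """Hypothetical registry entry; field names are illustrative only."""
    identifier: str
    steward: str           # person or office to notify
    retention_ends: date   # agreed-upon end of the retention period


def due_for_reappraisal(records, today, notice_days=180):
    """Return records whose retention period ends within `notice_days`
    of `today` (or has already ended), soonest first."""
    cutoff = today + timedelta(days=notice_days)
    due = [r for r in records if r.retention_ends <= cutoff]
    return sorted(due, key=lambda r: r.retention_ends)


# Invented sample data for illustration.
registry = [
    DatasetRecord("ds-001", "archives@example.edu", date(2025, 3, 1)),
    DatasetRecord("ds-002", "lab-a@example.edu", date(2031, 9, 30)),
    DatasetRecord("ds-003", "library@example.edu", date(2024, 12, 15)),
]

for record in due_for_reappraisal(registry, today=date(2025, 1, 10)):
    print(f"Notify {record.steward}: {record.identifier} "
          f"retention ends {record.retention_ends}")
```

A sketch like this only identifies candidates for review; the decision whether to extend a retention period remains a policy question for the stakeholders.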
What metrics could assist with reappraising data for long-term retention? Who should be involved in the reassessment? What happens when the primary researcher leaves the institution? How will reappraisal of data be managed within the repository system?
When a data set is deemed no longer worth keeping, who should be notified? Should deaccessioned data sets be offered to others or destroyed? What records should be kept to document the disposition?
How should digital data be preserved? Each data set will need to be assessed for any unique digital preservation needs. Are the needs different from the approaches identified by the Open Archival Information System reference model and the Trustworthy Repositories Audit and Certification process? What are the ramifications of cloud storage?
Should the data management plan be kept with the data? Should it be made public to provide provenance and additional context? What other information should be provided, such as project and personnel records or instrument calibration documentation? Are the file formats of the data supported by the repository? What descriptors should be applied? What standards (e.g., for identifiers, citation, metadata) will be required?
Are there ethical considerations? Data should be kept in a way that is compliant with institutional review board requirements, grant conditions, or specific research protocols mandated by laws and regulations. How will the institution handle intellectual property rights and privacy issues (e.g., personally identifiable information or protected health information)? How will sensitive data be identified and contained? Are there access restrictions that must be enforced? How can ethical issues be identified during the proposal stage so that consent forms can be developed?
What sort of risk management is needed for research data? How will the impact of risk management on data sharing be mitigated? Should the same security protocols that pertain to an institution's business data apply to research data? Will security measures be applied in a different manner during the course of research than afterward?
How are data accessed? Depending on who will most likely use the data, and how, it will be necessary to determine how access will be provided. Is it necessary only to make the metadata discoverable, with links to the data files, or is deeper support for manipulating the data needed? Which indices and catalogs should reference the data's availability? What service-level assurances (e.g., up-time, support) should be made? How will the repository monitor access to ensure that restrictions are enforced? What implications do tracking and monitoring of data access have? What are the possibilities for quantifying access, and how might this information play into questions of impact, promotion, and tenure? Indeed, what is the measure of "access" — the number of clicks, downloads, or citations?
How open should the data be? An institution may decide to provide access to its research data unless constrained by law or grant conditions, or it may decide to share only on a case-by-case basis. Data may also be embargoed with the goal of sharing at some stated date in the future. In situations where data can never be released or shared, what explanation or justification should be provided?
How should the costs be borne? Data management will incur substantial new costs, and approaches to funding are likely to be controversial. Where will the necessary funds come from? Will funders permit investigators to include data management costs in their grant proposals? If funding is project-based and therefore time-limited, how will the costs of long-term preservation be supported? How will universities fund data curation for research that is not grant-supported?
Funders and researchers might sometimes assume that data curation costs will be covered by the indirect costs that the home university includes in grant budgets. On the other hand, it might become permissible to include data management as a direct cost in proposals. The latter may, however, apply only to in-project costs, not longer-term curation and preservation. Discussion among universities, publishers, and funding bodies is necessary to identify how the longer-term costs can most realistically be shared.
It is important that the university be clear about which services it will cover, and which are considered "over and above." Whichever approach to funding is adopted, it should strongly encourage use of the university cyberinfrastructure, rather than relying on individual unit or lab resources.
Another possibility is co-investment by multiple partners. As demands for research data management and sharing increase, shared services are becoming more and more attractive. 3TU in the Netherlands is an excellent example of three technical universities joining up to develop and deliver data management infrastructure and support.
Ultimately, universities need a better means of assessing curation costs and projecting them into the future to ensure that they can develop scalable and sustainable services. Identifying the costs is not enough, however. Universities must be able to make a case about the potential return on their investment. The university must make the case for retaining its research data assets — and identify funding to do so.
What alternatives to local data management exist? Not all data need be stored at the researcher's own institution; in some cases a more appropriate home exists. Should the data set be deposited in a national, international, or discipline-based data center? Many funders require data to be deposited in large national or international repositories that hold other like data (e.g., the National Climatic Data Center or the RCSB Protein Data Bank). In some cases, researchers in a particular field use a specific data repository and develop a disciplinary culture around data sharing (e.g., Inter-university Consortium for Political and Social Research for social science data or Open Context for archaeology data). New services such as figshare and Zenodo allow users to upload "orphan" data sets into a repository for discovery, citing, and reuse.
Some research is collaborative and involves investigators at different institutions. A decision must be made as to which institution will take responsibility for the data, both during and after the project. This decision should be made explicit during data management planning.
If the data set will be stored elsewhere, the ingest requirements and retention policy of the off-site repository should be reviewed. Most likely the university will want to include in its local institutional repository a metadata record describing the data along with a link to the data set where it resides, thus enabling the university to keep a complete record of its research assets.
One home may be appropriate for preservation and another for access. These two main components of data curation can be accommodated independently, but they are interrelated and should be linked. Preservation enables access, and active use of the data is often the best justification for continued preservation.
With the recent Office of Science and Technology Policy mandate, other players may emerge in the data management milieu. The Association of Research Libraries, the Association of American Universities, and the Association of Public and Land-grant Universities have issued a proposal, "SHared Access Research Ecosystem (SHARE)," that imagines a workflow architecture implemented across a network of university-operated repositories fulfilling the mandate's requirements. Representatives of 30 organizations that archive scientific data released a call for action urging the creation of sustainable funding streams for domain repositories that are closely tied to scholarly communities. Regardless of how this settles out, universities still will want to have a record of their own research output — and it could be that these data repositories will be important nodes in the evolving U.S. research data network.
It is important to recognize the current uncertainty as to how data management support and services will be distributed among university, disciplinary, funder, and national and international stakeholders. In this complex environment, an institution must actively determine how it will manage and distribute data services internally. Various university players are important stakeholders in determining the appropriate governance structure to ensure efficient coordination; adequate security and regulatory compliance; and scalable, sustainable, and useful data management services to researchers.
Effective data management is just one aspect of achieving the ultimate goal of ensuring ongoing access to the outputs of academic research. This goal can be achieved only if the right questions have been asked and answered along the way.