EDUCAUSE Review
Curating Scientific Web Services and Workflows

© 2008 Carole Goble and David De Roure. The text of this article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

EDUCAUSE Review, vol. 43, no. 5 (September/October 2008)

A bewildering array of digital resources is available to the modern researcher, ranging from libraries of articles and data collections to analytical tools and visualization applications, many publicly available. Take bioinformatics, for example. Nucleic Acids Research describes more than 1,000 databases¹ drawn from a “Bioinformatics Nation”² of different subdisciplines, research teams, and institutes. The same is true for chemistry, astronomy, earth sciences, and just about any information-rich scientific area.

These digital resources are combined and their data aggregated and analyzed in the day-to-day work of skilled scientific investigators. But even though we are familiar with the need to curate our data for dissemination and for the long term, we must not neglect the curation and cataloguing of the processes that we use to search, integrate, and analyze that data.

Where researchers may once have used applications or software libraries, increasingly we see functionality provided instead through web-based resources. In nanotechnology, for example, nanoHUB (http://www.nanohub.org) provides more than 1,000 resources for research, education, and collaboration, including simulation tools accessed from the web browser. Hence the web has become a distributed computing platform supporting research on an everyday basis.

In this article we will highlight two specific kinds of processes, web services and workflows, that are enjoying increasing adoption. They have already become prevalent in the in silico research of life scientists, providing a revealing glimpse into the future.
Web services offer a well-defined programming interface that software applications, written in various programming languages and running on various platforms, can use to process data over the Internet. An increasing number of resources are available via web services interfaces, turning these resources into services that can be combined into complex networked applications. In the life sciences, open source and commercial integration systems, data warehouses, and integration frameworks use web services behind the scenes.

Workflows are an alternative to precooked applications that embed their own data pipelines and analysis scripts. Scientific workflow management systems (such as Taverna, Triana, Kepler, and Pipeline Pilot) provide a mechanism to automatically orchestrate the execution of services, coordinating processes (control flow) and managing the flow of data between them (data flow).³ The workflows are explicit and precise descriptions of a scientific process: the instruction scripts that define the flow of data and the order of execution of the service steps. In turn, these workflows can become services within other workflows and applications.

Workflows are becoming rather fashionable. As more resources become available, exposed as web services, workflows provide an attractive means for rapid assembly of customized integrations. They link together and cross-reference data in different repositories, both public and private, which could be widely distributed. For example, workflows can assist in automatically text-mining the literature. From a developer’s standpoint, they are an agile means of delivering a process as an application. From a scientific programmer’s standpoint, they are a means to automatically, repetitively, and systematically run a process while accurately tracking the provenance of results.
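To make the idea of a workflow as an explicit description of data flow and execution order more concrete, here is a minimal sketch in Python. It is not the format used by Taverna, Triana, Kepler, or Pipeline Pilot. The step names, the three local functions standing in for remote web services, and the `run_workflow` helper are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stand-ins for remote web services. A real workflow system
# would invoke services over the network rather than local functions.
def fetch_sequence(inputs):
    return "ATGGCC"

def reverse_sequence(inputs):
    return inputs["fetch"][::-1]

def count_bases(inputs):
    return len(inputs["reverse"])

# The workflow is plain data: each step names its service and the upstream
# steps whose outputs it consumes (the data flow).
WORKFLOW = {
    "fetch": {"service": fetch_sequence, "needs": []},
    "reverse": {"service": reverse_sequence, "needs": ["fetch"]},
    "count": {"service": count_bases, "needs": ["reverse"]},
}

def run_workflow(workflow):
    """Run each step after its dependencies and record a provenance log."""
    graph = {name: set(step["needs"]) for name, step in workflow.items()}
    results, provenance = {}, []
    # static_order() yields every step after all of its predecessors,
    # which gives the order of execution (the control flow).
    for name in TopologicalSorter(graph).static_order():
        step = workflow[name]
        inputs = {dep: results[dep] for dep in step["needs"]}
        results[name] = step["service"](inputs)
        provenance.append(
            {"step": name, "inputs": inputs, "output": results[name]}
        )
    return results, provenance

results, provenance = run_workflow(WORKFLOW)
```

Because the workflow is data rather than code, it could itself be catalogued, shared, and rerun, and the provenance log records which inputs produced each result.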
From a scientist’s standpoint, workflows are a reliable and transparent means of encoding a scientific method that supports reproducible science and the sharing and replicating of best practice and know-how through reuse.

Curating Processes

Processes are the methods that form a core component of scientific discovery. Given a predicted rise in the number of openly available web services and workflows, it would seem necessary, and certainly prudent, to curate processes as effectively as we curate the data they consume and the publications they generate. The systematic curation of processes would enable programmers and scientists to survey available, well-characterized, and established methods, to avoid unnecessary reinvention, and to be better informed about best-practice techniques and how to use them correctly and appropriately. The lack of adequate and standard metadata describing individual services often prevents their discovery unless users already know that these services exist, know what they do, and know how to use them. We should be able to
Both web services and workflows need accurate and flexible metadata that is understandable both by people and by software applications. However, the comprehensive cataloguing needed to serve the broader research community, beyond project-specific efforts, is lacking. Web services and workflows are scattered across the web. They are most likely to be located by word of mouth or by Google searches, which find them through textual references. Groups or individuals gather them on websites or portals. Broader initiatives such as seekda (http://www.seekda.com) gather a very wide range of web services but are insufficiently curated. Yet curation is crucial:
Although some curators are domain experts who understand web services and workflows, we see two other key approaches. The first is community curation: the trend is to follow in the footsteps of popular Web 2.0 social computing sites and encourage community curation through user feedback, blogging, e-tracking, recommendations, and folksonomy-based tagging. Community curation requires built-in incentive models, such as credit and attribution, for people to contribute. The second is automation: operational and usage metadata can be generated from monitoring services, application diagnostics, customer reports, and social network analysis. Workflow analytics is the term used for processing workflow collections to identify, for example, service co-use patterns and service popularity.

Two Approaches

We are tackling these challenges in two efforts to systematically catalogue processes for the benefit of specific scientific communities:
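The workflow analytics mentioned earlier in this section (service popularity and service co-use patterns) can be illustrated with a short Python sketch. This is only an assumed, simplified form of such analysis. The workflow identifiers, the service names, and the `analyze` function are hypothetical and are not drawn from any real catalogue.

```python
from collections import Counter
from itertools import combinations

def analyze(workflows):
    """Count how many workflows use each service and each pair of services."""
    popularity = Counter()
    co_use = Counter()
    for services in workflows.values():
        # Deduplicate and sort so each service counts once per workflow
        # and each pair has one canonical (alphabetical) ordering.
        unique = sorted(set(services))
        popularity.update(unique)
        co_use.update(combinations(unique, 2))
    return popularity, co_use

# Hypothetical collection: workflow id -> services the workflow invokes.
workflows = {
    "wf1": ["blast", "clustalw", "fetch_sequence"],
    "wf2": ["blast", "fetch_sequence"],
    "wf3": ["text_mining", "fetch_sequence"],
}

popularity, co_use = analyze(workflows)
```

Here `popularity` ranks services by the number of workflows that use them, and `co_use` shows which services tend to appear together. Both are kinds of usage metadata that could be generated automatically rather than entered by a curator.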
Conclusion

We have an increasing understanding of the practices of data curation, but we should not neglect the curation and cataloguing of the processes that we use to work with the data. A well-curated resource would potentially enable reuse by including knowledge of and about processes, and would hence avoid wasteful reinvention, increase reliability by pooling operational histories and reputations, and improve validation by promoting best-practice, verified procedures and popular processes. However, an absence of curated processes leads to ignorance of availability and creates obstacles to adoption. Active curation of these resources with accurate and flexible descriptions to check their availability, reliability, and general quality of service is required. Community curation and automation provide a powerful approach to addressing these challenges.

Notes