DataFlow project: storing and sharing large datasets on the cloud

The DataFlow project is building tools for research groups to share data internally, and to store and search high-quality data in long-term repositories.  The final product will be freely available, fully documented and open-source.

One thing we need is “user stories” to check that what we’re building has the functionality that researchers (and sysadmin types) want.  Please tell us how you might like to use a tool like this!  We also need some users to test-drive the system, which will be ready for use this summer.  Visit us at www.dataflow.ox.ac.uk, and be sure to check into our “suggestion box” at the top of the page.

Further details

The DataFlow Project is building a two-stage, cloud-deployable data management infrastructure for researchers that can be used across national Higher Education Institutions: (a) DataStage, for managing research data locally, and (b) DataBank, for preserving and publishing valuable research.

For local data management, rather than storing datasets on external hard drives in the lab, researchers can save their work to a DataStage file system that appears as a mapped drive on their computer, with no special software to install.  DataStage will let read and write permissions be set separately for Principal Investigators and for individual members of a research group, to ensure appropriate levels of data confidentiality.  The system will be lightweight, and will adopt best-practice standards to make sure data is secure and easy to retrieve.

Through a convenient web interface, researchers will then be able to deposit selected datasets from their local DataStage file system into their institutional or subject-specific data repository, which will be a cloud-based instance of the generic DataBank data repository.

DataStage will allow users to:

  • Package datasets with descriptive metadata, including a DOI issued by DataCite;
  • Determine who can access data: options include fully public datasets, embargoed material, and “dark” or private data; and
  • Search for and retrieve available datasets, enabling data to be reused and/or made available for peer review.
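
For readers who think in code, the three capabilities listed above can be pictured with a minimal sketch.  This is an illustration only, not the DataFlow implementation: the class and function names, the field layout, the rule that an embargo lifts on a fixed date, and the DOI strings (made-up placeholders) are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Access(Enum):
    """The three access options named in the list above."""
    PUBLIC = "public"
    EMBARGOED = "embargoed"
    DARK = "dark"  # private data, never shown in searches


@dataclass
class DatasetPackage:
    """Hypothetical package: a dataset plus its descriptive metadata."""
    title: str
    doi: str  # placeholder value, not a real DataCite DOI
    access: Access
    embargo_until: Optional[date] = None  # assumed: embargo lifts on this date

    def is_available(self, today: date) -> bool:
        """Return True if the dataset may be shown to other users today."""
        if self.access is Access.PUBLIC:
            return True
        if self.access is Access.EMBARGOED:
            return self.embargo_until is not None and today >= self.embargo_until
        return False  # dark data stays hidden


def search(packages: List[DatasetPackage], term: str, today: date) -> List[DatasetPackage]:
    """Case-insensitive title search over the datasets available today."""
    term = term.lower()
    return [p for p in packages
            if p.is_available(today) and term in p.title.lower()]
```

In this sketch an embargoed dataset only starts to appear in search results once its embargo date has passed, while a dark dataset never appears; a real repository would also need to handle authenticated access for the data owners.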

We are using a standards-based approach to ensure compatibility with as many systems as possible.  Researchers will be able to use Linux, Windows or Mac operating systems, while DataBank will be deployable on the Eduserv cloud, on a commercial data storage cloud, or on a local institutional server.  The infrastructure will be open and scalable, to meet the needs of individual researchers and their institutions, for both public and confidential data.