About
DataFlow is a software and hardware solution that helps scientists easily
transmit scientific data from data generators, such as scientific instruments, to a designated, centralized data storage resource
and, optionally, capture metadata related to the data
via a user-friendly web application or an application programming interface (API).
DataFlow is especially impactful in bringing data movement capabilities to air-gapped computers,
such as instrument control computers that are off the enterprise network.
Key benefits
- Data bridge to air-gapped instruments
- Data moved from (instrument) computers to the desired storage solution:
- Source (instrument computer):
- Storage drive can be freed up for larger / more experiments since generated data can be deleted once it has been transmitted to the destination
- Failure of storage drives would not result in data loss
- Destination (storage solution):
- Data more easily accessible to desired computational resource (e.g. personal computer, cloud, cluster, etc.)
- Secure location for data, limiting / preventing access by other users
- Data more resilient against storage drive failure
- (Typically) scalable / large storage solutions
- Convenient user interfaces for data movement and metadata capture:
- User-friendly and intuitive web interface
- REST API and Python wrapper for automated transfer of data (and metadata),
thereby enabling automated data processing / management pipelines (see the sketch after this list)
- Metadata captured digitally:
- Standardized capture of metadata fields across all researchers through the use of custom metadata forms (web only)
- Digitized capture of metadata in close proximity to raw data
- Customizable data transfer adapters to suit different applications - only Globus is supported for now, but more adapters are coming soon.
- Streamlined data pipeline for subsequent data utilization - Once data is at the desired location, researchers can share, organize, analyze, and process it more efficiently.
For example, moving data to an HPC environment enables processing via large-scale batch computing jobs, sharing via Globus, etc.
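As an illustration of the automated pipelines mentioned above, here is a minimal sketch of a script that creates a Dataset and uploads a freshly acquired file through the REST API. The base path, endpoint routes, payload fields, and authentication header below are illustrative assumptions rather than the documented API; consult the "API access" page in the web interface for the actual routes and token usage.

```python
# Minimal sketch of an automated upload via the DataFlow REST API.
# NOTE: the base path, routes, payload fields, and auth header are
# assumptions for illustration; check the "API access" documentation
# in the web interface for the real ones.
import requests

BASE_URL = "https://dataflow.ornl.gov/api/v1"      # assumed base path
API_KEY = "YOUR_API_KEY"                           # from "API access"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}   # assumed auth scheme

# 1. Create a Dataset that groups the raw data with its metadata.
dataset = requests.post(
    f"{BASE_URL}/datasets",
    headers=HEADERS,
    json={
        "name": "STEM session 2024-05-01",
        "metadata": {"sample": "MoS2 flake", "voltage_kV": 200},
    },
).json()

# 2. Upload the file produced by the instrument into that Dataset;
#    DataFlow then moves it to the configured destination.
with open("scan_0001.dm4", "rb") as fh:
    requests.post(
        f"{BASE_URL}/datasets/{dataset['id']}/files",
        headers=HEADERS,
        files={"file": fh},
    )
```

A script like this can be triggered by the instrument control software after each acquisition, which is what turns DataFlow into a hands-off part of a data management pipeline.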
Overview
DataFlow is available in a variety of flavors to suit different needs. Here is a quick overview of the main dimensions: deployments, data adapters, and user interfaces.
Deployments
The DataFlow software stack can be deployed in a variety of physical and virtual locations.
- Facility-local servers - The DataFlow software stack can be deployed on a dedicated server at the desired laboratory
or facility to provide the aforementioned data movement capabilities to air-gapped computers / instruments.
Please see the dedicated page on facility-local servers for more information.
- Centralized server - The DataFlow software is also deployed on the CADES cloud at https://dataflow.ornl.gov for users who want to try out DataFlow.
Data adapters
Adapters enable DataFlow users to authenticate and move data to specific destinations, as allowed by each adapter.
Each adapter has its own benefits and limitations,
so researchers should choose the adapter that works best for their needs.
At the moment, DataFlow only supports the Globus adapter, but we plan to expand the list of adapters in subsequent iterations.
User interfaces
Users (or programs) can interact with DataFlow in two main ways:
- Web interface - Users can use a user-friendly web application to move data, capture metadata about experiments,
manage DataFlow settings, and create, search, and manage Datasets.
- Programming interfaces - Users can develop scripts that automate the transfer of data and capture of metadata
without human intervention, using DataFlow's:
- REST API - accessible by clicking on the user's name at the top right and selecting "API access".
This provides a low-level interface to DataFlow, allowing users to interact with it
from their programming language of choice.
- ordflow - a lightweight Python wrapper around the REST API (see the sketch below).
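For example, a minimal transfer script using the wrapper might look like the sketch below. The class and method names (API, dataset_create, file_upload) and their return values follow my reading of the ordflow tutorial and should be treated as assumptions; consult the ordflow documentation for the exact signatures.

```python
# Minimal sketch of an automated transfer using the ordflow wrapper.
# NOTE: class/method names and return shapes are assumptions based on the
# ordflow tutorial; verify against the ordflow documentation before use.
from ordflow import API

api = API("YOUR_API_KEY")   # API key from "API access" in the web interface

# Create a Dataset that groups the raw data and metadata for this measurement
dataset = api.dataset_create(
    "Raman map 2024-05-01",
    metadata={"sample": "graphene on SiO2", "laser_nm": 532},
)

# Upload the instrument output into the new Dataset; DataFlow moves it to
# the configured destination (e.g. via the Globus adapter)
api.file_upload("raman_map_0001.h5", dataset["id"])
```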