Gareth Murphy
ESS DMSC
2017-09-14
The European Spallation Source (ESS) generates vast amounts of neutron data from its state-of-the-art instruments. Managing and curating this data effectively is essential to ensure scientific reproducibility, compliance with FAIR (Findable, Accessible, Interoperable, Reusable) principles, and the facilitation of advanced data analytics and machine learning applications.
This document provides a comprehensive guide to managing and curating neutron data at ESS, covering key processes, tools, and strategies to optimize data workflows.
- Instrumentation: Neutron instruments at ESS produce raw data from experiments, capturing interactions between neutrons and sample materials.
- Metadata Capture: Real-time recording of experimental parameters (e.g., sample ID, beam settings, temperature).
- Data Formats: Use standardized formats like NeXus (HDF5-based) for compatibility and interoperability.
- Calibration and Correction: Apply instrument-specific calibration and corrections to raw data.
- Reduction Pipelines: Convert raw data into scientifically meaningful units (e.g., scattering intensity, reflectivity).
- Software Tools: Use ESS-supported platforms (e.g., Mantid) for data reduction.
- Short-term Storage: Store raw and processed data on high-speed storage systems for immediate access.
- Long-term Archival: Migrate data to high-capacity, redundant storage systems for preservation.
- Backup Policies: Regularly back up data to ensure recovery in case of system failures.
- Analysis Software: Provide researchers with robust tools for data visualization and interpretation (e.g., Python libraries, Jupyter Notebooks).
- Collaboration: Facilitate collaborative analysis by enabling shared access to data and software environments.
- DOI Assignment: Assign Digital Object Identifiers (DOIs) to datasets to ensure traceability.
- Publication Repositories: Deposit datasets in recognized repositories like ESS Dataportal or Zenodo.
- Supplementary Materials: Link datasets to related publications, experimental logs, and protocols.
- Metadata Standards: Ensure datasets are annotated with rich, standardized metadata.
- Searchability: Develop advanced search functionalities to locate datasets using keywords, sample IDs, or experiment parameters.
- Persistent Identifiers: Use DOIs or similar identifiers to ensure datasets are uniquely findable.
- Access Policies: Define clear access levels (e.g., public, restricted, or embargoed).
- Authentication Systems: Implement secure authentication for user-specific access.
- Data Portals: Provide intuitive web interfaces for data discovery and download.
- Standard Formats: Adopt NeXus and other widely recognized formats for data sharing.
- APIs: Develop RESTful APIs for programmatic access to datasets.
- Ontologies: Use domain-specific ontologies to describe data and metadata consistently.
- Licensing: Apply clear, permissive licenses (e.g., Creative Commons) to datasets.
- Documentation: Provide detailed documentation on data formats, processing pipelines, and analysis methodologies.
- Quality Assurance: Implement robust validation processes to ensure data accuracy and reliability.
- Data Stewards: Ensure compliance with FAIR principles and manage metadata quality.
- Instrument Scientists: Oversee data acquisition and provide guidance on processing workflows.
- IT Teams: Maintain infrastructure for data storage, backup, and security.
- Raw Data: Retain for a defined period (e.g., 10 years), subject to ESS policies.
- Processed Data: Retain indefinitely for reproducibility and reuse.
- User Data: Allow users to specify retention preferences for non-critical datasets.
- Encryption: Encrypt data during transfer and storage to prevent unauthorized access.
- Anonymization: Remove or mask sensitive information from datasets.
- Monitoring: Implement systems to detect and respond to potential data breaches.
- ESS Dataportal: Centralized platform for dataset storage, discovery, and access.
- ICAT: Middleware for cataloging and retrieving experimental metadata.
- Mantid: For neutron data reduction and visualization.
- Python Ecosystem: Libraries like NumPy, SciPy, and matplotlib for custom analyses.
- Jupyter Notebooks: For interactive data analysis and sharing workflows.
- Workflow Automation: Use tools like Apache Airflow for automated data processing pipelines.
- Integration: Connect ESS systems with external repositories and analysis platforms.
- Challenge: Instruments generate terabytes of data daily.
- Solution: Use scalable storage solutions and tiered storage architectures.
- Challenge: Inconsistent or incomplete metadata can hinder data reuse.
- Solution: Develop mandatory metadata templates and validation tools.
- Challenge: Researchers may lack expertise in data management practices.
- Solution: Offer workshops and online resources on data curation and FAIR principles.
- Leverage AI to identify patterns in neutron data and automate routine analyses.
- Develop machine learning models to predict experimental outcomes based on historical data.
- Build platforms for seamless collaboration between ESS researchers and external partners.
- Foster open science by enabling public access to selected datasets.
- Implement energy-efficient data centers and workflows.
- Regularly review and update data policies to align with emerging standards.
Effective management and curation of neutron data at ESS are critical for advancing scientific discovery. By adopting FAIR principles, leveraging state-of-the-art tools, and fostering a culture of collaboration and innovation, ESS can ensure that its data remains a valuable resource for the global research community.
References:
- Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship.
- European Spallation Source (ESS) Website: https://europeanspallationsource.se
- NeXus Data Format: https://www.nexusformat.org
- Make data available and citeable
- Need metadata
- eg. technical metadata (experimental setup, detector geometry and area, pulse width and shape, source type, derived quantities, if raw or analyzed data, energy resolution, calibration information, sample type )
- descriptive metadata (date, investigator details, ORCID etc)
- administrative metadata (date, how created, Digital object identifier or other persistent ID, File type )
--
---?image=assets/catanie.png&size=auto 90%
What metadata would you like to have visible?
What services should the data catalogue provide?