ThermoML-FAIR is a modern Python toolkit for downloading, validating, and structuring ThermoML data from NIST’s ThermoML Archive. Designed for seamless integration with data science and machine learning workflows in materials science, ThermoML-FAIR enables reproducible, automated extraction of thermophysical property data into long-format pandas
DataFrames, with detailed phase and method information for every measurement.
ThermoML-FAIR is built to support FAIR data practices—making ThermoML data Findable, Accessible, Interoperable, and Reusable. This ensures that data workflows are robust, transparent, and ready for open science and sustainable materials discovery.
This project is a ground-up reimplementation inspired by the original choderalab/thermopyl, rewritten for robust schema validation, high-throughput data processing, and downstream compatibility with tools like Matminer and Citrine. The toolkit is built with sustainability and open science in mind, making it easy to access, analyze, and share high-quality thermophysical property data for materials discovery and informatics.
- FAIR data principles: All workflows are designed to make data Findable, Accessible, Interoperable, and Reusable
- Automated mirroring of the NIST ThermoML Archive (RSS and archive-based)
- Schema validation: All XML files are validated against the official ThermoML XSD
- Efficient, parallelized parsing and DataFrame construction: Cross-platform support with `ProcessPoolExecutor` for high-throughput workflows (see the sketch after this list)
- Rich CLI experience: Intuitive command-line interface with progress bars, robust error handling, and flexible options for parallelism (`--max-workers`)
- Long-format DataFrame output: Each measurement is a row, with `phase` and `method` columns included for every property
- Comprehensive compounds DataFrame: Always includes a `symbol` column (chemical formula or fallback name) for all files
- Flexible output: Export to CSV, HDF5, or Parquet for scalable analytics and ML workflows
- Resilient download logic: DOI resolution, override support, and robust error handling
- Modular, extensible architecture: Built with `dataclasses`, `pathlib`, and modern Python best practices
- Ready for ML pipelines: Designed for easy integration with scikit-learn, matminer, and other data science tools
- Sustainability focus: Streamlines reproducible data extraction for green chemistry, energy materials, and more
- Cache management: Tools for clearing and managing parsed data caches
- Cross-platform compatibility: Works on Windows, macOS, and Linux
- SPDX License: GPL-2.0
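To make the validation and parallel-parsing features above concrete, here is a minimal, self-contained sketch of the general pattern, not the library's internal implementation: each XML file is checked against an XSD with lxml and then summarized in a separate worker process via `ProcessPoolExecutor`. The schema path, archive path, and helper names are illustrative.

```python
# Illustrative sketch only -- not thermoml_fair's internal API.
# Validate ThermoML XML files against an XSD, then process them in parallel.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from lxml import etree  # third-party; used here for XSD validation

XSD_PATH = Path("ThermoML.xsd")                    # hypothetical local copy of the schema
ARCHIVE_DIR = Path("~/.thermoml/archive").expanduser()

def validate_and_summarize(xml_path: Path) -> dict:
    """Validate one file against the schema and return a tiny summary."""
    schema = etree.XMLSchema(etree.parse(str(XSD_PATH)))
    doc = etree.parse(str(xml_path))
    return {
        "file": xml_path.name,
        "valid": schema.validate(doc),
        "n_elements": sum(1 for _ in doc.getroot().iter()),
    }

if __name__ == "__main__":  # guard required for ProcessPoolExecutor on Windows
    files = sorted(ARCHIVE_DIR.glob("*.xml"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        for summary in pool.map(validate_and_summarize, files):
            print(summary)
```

In the toolkit itself, this kind of work is what `thermoml_fair parse-all --max-workers N` handles for you, including caching of the parsed results.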
ThermoML-FAIR requires Python 3.8 or newer.
Basic install:
pip install thermoml-fair
With Parquet support (recommended for large datasets):
pip install 'thermoml-fair[parquet]'
Or install from source:
git clone https://github.com/your-username/thermoml_fair.git
cd thermoml_fair
pip install .
The `thermoml_fair` package provides a command-line interface (CLI) for interacting with the ThermoML data archive. All major operations support parallel processing via the `--max-workers` (or `-mw`) option for efficient, scalable workflows.
You can explore the available commands and their options by running:
thermoml_fair --help
For help on a specific command:
thermoml_fair <command> --help
Here's a typical workflow for using the `thermoml_fair` CLI. By default, files are stored in `~/.thermoml/`.
- Update the local ThermoML archive: This command downloads or updates the ThermoML XML files from the NIST repository into the default archive directory (`~/.thermoml/archive/`).

  `thermoml_fair update-archive`

  To force a re-download of all files:

  `thermoml_fair update-archive --force-download`
- Parse all downloaded XML files (with parallelism): This command processes all `.xml` files in the archive directory and saves the parsed data as `.parsed.pkl` cache files alongside them.

  `thermoml_fair parse-all --max-workers 4`

  You can specify a different directory:

  `thermoml_fair parse-all --dir path/to/your/archive --max-workers 2`
- Build the consolidated DataFrames (with parallelism): This command consolidates the parsed data into three separate CSV files: one for measurements (`data.csv`), one for compounds (`compounds.csv`), and one for unique property names (`properties.csv`).

  `thermoml_fair build-dataframe --max-workers 4`

  You can specify custom output paths and formats (e.g., `parquet`, `h5`):

  `thermoml_fair build-dataframe --output-data-file my_data.parquet --output-compounds-file my_compounds.csv --output-properties-file my_properties.csv --max-workers 4`
- Explore the Data: After building your data files, you can quickly explore their contents.

  - List Unique Properties:

    `thermoml_fair properties --properties-file my_properties.csv`

  - List Unique Chemicals:

    List unique common names (default): `thermoml_fair chemicals --compounds-file my_compounds.csv`

    List unique molecular formulas: `thermoml_fair chemicals --compounds-file my_compounds.csv --field sFormulaMolec`
- (Optional) Clear cached files: If you need to clear the `.parsed.pkl` files from the archive directory:

  `thermoml_fair clear-cache --yes`

  You can also specify a different directory:

  `thermoml_fair clear-cache --dir path/to/your/custom_directory --yes`
Note: For every command, you can use the `--help` flag to see all available options and their descriptions. For example, `thermoml_fair update-archive --help`.
- DataFrame: Always in long format, with each row representing a single measurement. Includes `phase`, `method`, `property`, `value`, `material_id`, and more.
- Compounds DataFrame: Always includes a `symbol` column (chemical formula or fallback name) for all files, supporting robust downstream analytics.
- Rich CLI output: Progress bars and status messages for all major operations.
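As a quick illustration (not part of the CLI), the files produced by `build-dataframe` can be loaded directly with pandas. The snippet below assumes the default `data.csv` and `compounds.csv` names in the current directory; the exact property strings available depend on your archive.

```python
# Minimal sketch: load the build-dataframe outputs with pandas and filter by property.
import pandas as pd

data = pd.read_csv("data.csv")            # long format: one row per measurement
compounds = pd.read_csv("compounds.csv")  # includes the 'symbol' column

# Inspect the available columns and the most common properties.
print(data.columns.tolist())
print(data["property"].value_counts().head(10))

# Example filter: keep only one property of interest (the name here is illustrative).
viscosity = data[data["property"].str.contains("Viscosity", case=False, na=False)]

# Optionally persist the filtered subset as Parquet for faster reloads
# (requires the 'parquet' extra, i.e. pyarrow or fastparquet).
viscosity.to_parquet("viscosity.parquet", index=False)
```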
ThermoML-FAIR captures a wide range of thermophysical properties from the NIST ThermoML Archive.
The dataset includes 97 unique properties, spanning transport, thermodynamic, and phase equilibrium measurements.
The chart below highlights the Top 10 most frequently reported properties:

Figure 1. Distribution of the top 10 most frequently reported thermophysical properties in the ThermoML dataset. Dataset covers 97 unique properties in total.
This breadth of high-quality, peer-reviewed data is rare in materials informatics.
By making 97 distinct properties machine learning–ready, ThermoML-FAIR enables reproducible benchmarking and cross-property modeling workflows.
Such coverage supports not only thermal conductivity or viscosity prediction but also broader efforts in sustainable materials discovery and process design.
⚠️ Note: While ThermoML-FAIR implements schema validation and robust error handling, some rare properties or edge-case entries may not parse perfectly.
I encourage users to review outputs for their specific use case and welcome contributions to further improve coverage.
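To give a flavor of what a cross-property workflow could look like, here is a heavily simplified sketch using scikit-learn. The column names follow the long-format schema described above, but the two property strings are placeholders and the model ignores state variables such as temperature and pressure, so treat it as a starting point rather than a benchmark.

```python
# Purely illustrative cross-property sketch -- property names are placeholders;
# check data['property'].unique() for the strings actually present in your build.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv")

# Pivot to one row per material and one column per property
# (mean over repeated measurements; state variables are ignored for brevity).
wide = data.pivot_table(index="material_id", columns="property", values="value", aggfunc="mean")

x_prop = "Viscosity, Pa*s"              # placeholder
y_prop = "Thermal conductivity, W/m/K"  # placeholder
pair = wide[[x_prop, y_prop]].dropna()

X_train, X_test, y_train, y_test = train_test_split(
    pair[[x_prop]], pair[y_prop], test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out materials: {model.score(X_test, y_test):.2f}")
```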
Contributions are welcome! Please open an issue or submit a pull request. For Parquet support, install with `pip install 'thermoml-fair[parquet]'` before running related tests.
To set up a development environment:
git clone https://github.com/YOURNAME/thermoml-fair.git
cd thermoml-fair
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e .[dev,parquet] # Editable install with dev and Parquet dependencies
# Run tests
pytest
Linting and type checking are recommended (e.g., Black, Flake8, MyPy) before committing. ThermoML-FAIR is designed for robust, reproducible data extraction and analysis—ideal for accelerating sustainable materials discovery and informatics workflows.
- Set up your environment (first time only):

  - Create a virtual environment:

    `python -m venv venv`

  - Activate it. On Windows (PowerShell), you may need to allow script execution for the current process with `Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process`, then activate with `.\venv\Scripts\Activate.ps1`.

  - On macOS/Linux:

    `source venv/bin/activate`

  - Install build tools:

    `pip install build twine`

- Build the package: This command creates the `dist/` directory with the source archive and wheel.

  `python -m build`

- Upload to TestPyPI (recommended): First, make sure you have a TestPyPI account and have configured your `~/.pypirc` file.

  `twine upload --repository testpypi dist/*`

- Upload to PyPI: Once you've verified the package on TestPyPI, upload it to the official Python Package Index.

  `twine upload dist/*`
Modern materials science and machine learning rely on robust, high-quality datasets.
ThermoML-FAIR unlocks over 2.6 million peer-reviewed experimental measurements spanning 97 distinct thermophysical properties from leading journals such as Journal of Chemical & Engineering Data, Journal of Chemical Thermodynamics, and Fluid Phase Equilibria.
By converting the NIST ThermoML archives into FAIR, ML-ready formats, this toolkit enables reproducible benchmarking, cross-property modeling, and sustainable materials discovery.
ThermoML-FAIR is a step toward a future where validated data and automation accelerate innovation in green chemistry, energy materials, and advanced manufacturing.
Angela C. Davis
Materials Scientist and Data Innovator passionate about sustainable materials discovery and open science.
- Background: Coatings, thermoplastics, polymer AM, composites, green chemistry, corrosion, advanced manufacturing
- Focus: AI, data science, materials informatics, process modeling, and continuous learning
- Mission: To build FAIR, reproducible tools that accelerate sustainable innovation and enable the community to unlock peer-reviewed data for next-generation discovery
Through ThermoML-FAIR and related projects, Angela bridges experimental expertise with modern data science and ML, creating open, scalable workflows that advance sustainability and innovation.
Contact: angela.cf.davis@gmail.com