CSD OPTIMADE API

This repo contains prototyping work for creating an OPTIMADE API for searching and accessing structures from the Cambridge Structural Database (CSD).

The structures are accessed via the CSD Python API and cast to the OPTIMADE format; the optimade-maker and optimade-python-tools are then used to launch a local OPTIMADE API.

Installation

Clone this repository and install the package into a virtual environment of your choice. The current recommendation is to use uv:

git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
uv sync --extra-index-url https://pip.ccdc.cam.ac.uk

or

git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
pip install . --extra-index-url https://pip.ccdc.cam.ac.uk

Note that the extra index URL is required to install the csd-python-api package.

Important

Any attempts to use CSD data will additionally require a CSD license and appropriate configuration.

Usage

Ingesting CSD data

The CSD can be ingested into the OPTIMADE format using the csd-ingest entrypoint:

csd-ingest

This will use multiple processes (controlled by --num-processes) to ingest the local copy of the CSD database in chunks of size --chunk-size until the target --num-structures has been reached (defaults to the entire CSD). Each batch is written to its own OPTIMADE JSONLines file; on completion, the batches are combined into a single JSONLines file named <--run-name>-optimade.jsonl (around 5.5 GB for the entire CSD, or 2 GB compressed).
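As a quick sanity check on the combined file, the sketch below counts records by their "type" field. It assumes only that each non-empty line is a standalone JSON object and that entries carry a "type" key (for example "structures"); header or metadata lines without that key are bucketed separately. The function name `count_entry_types` is hypothetical and not part of this package.

```python
import json
from collections import Counter
from pathlib import Path


def count_entry_types(jsonl_path):
    """Count records in a JSON Lines file, grouped by their "type" field."""
    counts = Counter()
    # Stream line by line so a multi-GB file never has to fit in memory.
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: entries are JSON objects with a "type" key;
            # anything else (e.g. a header line) is counted as "<no type>".
            if isinstance(record, dict) and "type" in record:
                counts[record["type"]] += 1
            else:
                counts["<no type>"] += 1
    return counts
```

Calling `count_entry_types("my-run-optimade.jsonl")` (with a placeholder file name) returns a `Counter` that can be compared against the --num-structures target.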

Depending on parallelisation, this process should take a few minutes to ingest the entire CSD on consumer hardware (around 10 minutes with 8 processes on an AMD Ryzen 7 PRO 7840U mobile processor, requiring around 3 GB of RAM per process with the default chunk size of 100k).
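To choose --num-processes for a given machine, a back-of-the-envelope estimate can help. The sketch below takes the figures quoted above (a chunk size of 100k and roughly 3 GB of RAM per process) and assumes, as a simplification, that peak memory scales linearly with the number of busy processes. The function name is hypothetical.

```python
import math


def estimate_ingest_resources(
    num_structures,
    chunk_size=100_000,   # default chunk size quoted in the text
    num_processes=8,
    gb_per_process=3.0,   # rough per-process figure quoted in the text
):
    """Return (number of chunks, estimated peak RAM in GB)."""
    num_chunks = math.ceil(num_structures / chunk_size)
    # No more processes can be busy than there are chunks to work on.
    busy_processes = min(num_processes, num_chunks)
    # Assumption: memory use adds up linearly across busy processes.
    return num_chunks, busy_processes * gb_per_process
```

For example, `estimate_ingest_resources(1_000_000)` gives 10 chunks and an estimated 24 GB of peak RAM with 8 processes; lowering --num-processes or --chunk-size is the obvious lever if that exceeds the available memory.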

Creating an OPTIMADE API

The csd-serve entrypoint provides a thin wrapper around the optimade-maker tool, and bundles the simple configuration required to launch a local OPTIMADE API with a simple in-memory database (if --mongo-uri is provided, a real MongoDB backend will be used). Just provide the path to your combined OPTIMADE JSONLines file:

csd-serve <path-to-optimade-jsonl>

You should now be able to try out some queries locally, either in the browser or with a tool like curl:

curl -G http://localhost:5000/structures --data-urlencode 'filter=elements HAS "C"'
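The same query can be issued from Python using only the standard library. The filter string contains spaces and double quotes, so it has to be percent-encoded before it goes into the URL. This is a minimal sketch: the helper names `build_query_url` and `fetch_structures` are hypothetical, and the base URL assumes the API is listening on localhost port 5000 as in the curl example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_query_url(base_url, optimade_filter, endpoint="structures"):
    """Build an OPTIMADE query URL with a percent-encoded filter."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode({"filter": optimade_filter}, quote_via=quote)
    return f"{base_url.rstrip('/')}/{endpoint}?{query}"


def fetch_structures(base_url, optimade_filter):
    """Send the query and return the decoded JSON response."""
    with urlopen(build_query_url(base_url, optimade_filter)) as response:
        return json.load(response)


# Example (requires a running server):
# result = fetch_structures("http://localhost:5000", 'elements HAS "C"')
```

Here `build_query_url("http://localhost:5000", 'elements HAS "C"')` produces `http://localhost:5000/structures?filter=elements%20HAS%20%22C%22`.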

Containerised version

For ease of deployment, a containerised version of the ingestion pipeline is available.

Important

You should verify that your license agreement allows for any kind of deployment outside of your private network; it likely does not.

To build the container from scratch, you need both a time-limited CSD installer download link (CSD_INSTALLER_URL), and your activation key (CSD_ACTIVATION_KEY).

Note

As of January 2025, you can request your time-limited CSD installer link at https://www.ccdc.cam.ac.uk/support-and-resources/download-the-csd/. Once you receive the email, the CSD_INSTALLER_URL should be the one listed as "CSD Portfolio Linux Online Installer (recommended, small download)".

Both values (CSD_INSTALLER_URL and CSD_ACTIVATION_KEY) should be stored in a .env file that is available at both build time and runtime. Note that managing these secrets requires a recent Docker version that includes Buildx.
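Before starting a long build, it may be worth checking that the .env file defines both values. The sketch below assumes a simple KEY=value file with optional # comments; the helper name `missing_env_keys` is hypothetical and not part of this package.

```python
from pathlib import Path

# The two variables named in the text above.
REQUIRED_KEYS = ("CSD_INSTALLER_URL", "CSD_ACTIVATION_KEY")


def missing_env_keys(env_path=".env", required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in a .env file."""
    defined = set()
    for raw_line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.strip():
            defined.add(key.strip())
    return [key for key in required if key not in defined]
```

An empty list from `missing_env_keys()` means both variables are present with a non-empty value; it does not check that the link or key is still valid.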

Once configured, you can build the container with

docker build --secret id=env,src=.env --target csd-optimade-server -t csd-optimade-server .

This will install the CSD inside the container, run the ingestion pipeline and prepare an encrypted version of the CSD in the OPTIMADE JSONLines format. The file can be decrypted with your CSD_ACTIVATION_KEY.

To launch the container (which will decrypt the file and start the OPTIMADE API locally):

docker run --env-file .env -p 5000:5000 csd-optimade-server

For development and deployment, you may prefer to use the bake definitions in docker-bake.hcl to build and tag the relevant build stages:

docker buildx bake csd-optimade-server
docker run --env-file .env -p 5000:5000 ghcr.io/datalab-industries/csd-optimade-server

Runtime configuration options

As noted above, the CSD_ACTIVATION_KEY used to build the container must be provided at runtime.

The API container can also be configured with any of the OPTIMAKE_-prefixed environment variables.

The most important ones are listed here:

  • OPTIMAKE_MONGO_URI: to use a persistent MongoDB backend, you can provide a MONGO_URI via:

    OPTIMAKE_DATABASE_BACKEND=mongodb
    OPTIMAKE_MONGO_URI=mongodb://mongodb_server:27017/optimade
  • OPTIMAKE_BASE_URL: to set the base URL of the API (used to generate pagination links), you can provide a BASE_URL via:

    OPTIMAKE_BASE_URL=https://my-csd-deployment.com

Finally, if using a persistent database, future runs of the API can be controlled with the CSD_OPTIMADE_INSERT environment variable. If true (default), the configured database will be wiped and rebuilt from the JSONL file directly, and a separate process will run the API. If false, only the API will be started, with no database rebuild.
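The runtime options above can be combined into a single `docker run` invocation. The sketch below only assembles the argument list (for example to pass to `subprocess.run`) and relies on several assumptions: the image tag and port match the earlier examples, the backend variable is spelled OPTIMAKE_DATABASE_BACKEND, and CSD_OPTIMADE_INSERT accepts the lowercase strings true and false. The function name `docker_run_args` is hypothetical.

```python
def docker_run_args(
    image="csd-optimade-server",
    env_file=".env",
    port=5000,
    mongo_uri=None,
    base_url=None,
    insert=None,
):
    """Assemble a `docker run` argument list for the API container."""
    args = ["docker", "run", "--env-file", env_file, "-p", f"{port}:{port}"]
    if mongo_uri is not None:
        # Assumption: selecting MongoDB needs both the backend and the URI.
        args += ["-e", "OPTIMAKE_DATABASE_BACKEND=mongodb"]
        args += ["-e", f"OPTIMAKE_MONGO_URI={mongo_uri}"]
    if base_url is not None:
        args += ["-e", f"OPTIMAKE_BASE_URL={base_url}"]
    if insert is not None:
        # Assumption: the variable is read as a lowercase "true"/"false".
        args += ["-e", f"CSD_OPTIMADE_INSERT={'true' if insert else 'false'}"]
    args.append(image)
    return args
```

For a restart against an already populated MongoDB, `docker_run_args(mongo_uri="mongodb://mongodb_server:27017/optimade", insert=False)` yields a command that starts only the API without rebuilding the database.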

Contributing and Getting Help

All development of this package (bug reports, suggestions, feedback and pull requests) occurs in the csd-optimade GitHub repository. Contribution guidelines and tips for getting help can be found in the contributing notes.

Funding

This project was developed by datalab industries ltd., on behalf of the UK's Physical Sciences Data Infrastructure (PSDI), supported by the Cambridge Crystallographic Data Centre (CCDC).
