This repo contains prototyping work for creating an OPTIMADE API for searching and accessing structures from the Cambridge Structural Database (CSD).
The structures are accessed via the CSD Python API and cast to the OPTIMADE format; optimade-maker and optimade-python-tools are then used to launch a local OPTIMADE API.
After cloning this repository and creating a virtual environment with an appropriate tool (uv is currently recommended), this package can be installed with
git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
uv sync --extra-index-url https://pip.ccdc.cam.ac.uk
or
git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
pip install . --extra-index-url https://pip.ccdc.cam.ac.uk
Note that the extra index URL is required to install the csd-python-api package.
Important
Any attempts to use CSD data will additionally require a CSD license and appropriate configuration.
The CSD can be ingested into the OPTIMADE format using the csd-ingest entrypoint:
csd-ingest
This will use multiple processes (controlled by --num-processes) to ingest the local copy of the CSD in chunks of size --chunk-size until the target --num-structures has been reached (defaults to the entire CSD).
Each batch will be written to an OPTIMADE JSONLines file, and the batches will be combined on completion into a single JSONLines file named <--run-name>-optimade.jsonl (~5.5 GB for the entire CSD, or 2 GB compressed).
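The chunk-and-combine flow described above can be sketched in a few lines of Python. This is an illustration only and not the csd-ingest implementation: chunk_ranges, write_chunk and combine_jsonl are hypothetical helpers, the records are dummy dictionaries rather than real CSD entries, and the dispatch of chunks to worker processes is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def chunk_ranges(num_structures: int, chunk_size: int) -> List[range]:
    """Split the indices [0, num_structures) into consecutive chunks."""
    return [
        range(start, min(start + chunk_size, num_structures))
        for start in range(0, num_structures, chunk_size)
    ]


def write_chunk(records: List[Dict], path: Path) -> Path:
    """Write one batch as JSON Lines: one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def combine_jsonl(parts: List[Path], combined: Path) -> int:
    """Concatenate per-chunk files into one file and return the line count."""
    count = 0
    with combined.open("w", encoding="utf-8") as out:
        for part in parts:
            with part.open(encoding="utf-8") as handle:
                for line in handle:
                    out.write(line)
                    count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    parts = []
    for i, chunk in enumerate(chunk_ranges(num_structures=250, chunk_size=100)):
        # Dummy records standing in for converted CSD entries (assumption).
        records = [{"id": f"entry-{n}", "type": "structures"} for n in chunk]
        parts.append(write_chunk(records, workdir / f"chunk-{i}.jsonl"))
    total = combine_jsonl(parts, workdir / "demo-optimade.jsonl")
    print(len(parts), total)  # 3 250
```

Writing each chunk to its own file means workers never contend for a shared output handle; the final concatenation is a cheap sequential pass.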
Depending on parallelisation, this process should take a few minutes to ingest the entire CSD on consumer hardware (around 10 minutes with 8 processes on an AMD Ryzen 7 PRO 7840U mobile processor, requiring around 3 GB of RAM per process with the default chunk size of 100k).
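As a rough planning aid, the per-process figure quoted above can be turned into a quick estimate. The sketch below assumes that memory use scales linearly with the number of processes and that roughly 3 GB per process holds for the default chunk size; both function names are hypothetical and the numbers are only as good as those assumptions.

```python
def estimated_peak_ram_gb(num_processes: int, gb_per_process: float = 3.0) -> float:
    """Estimate peak RAM, assuming each process needs about gb_per_process."""
    return num_processes * gb_per_process


def max_processes_for(ram_budget_gb: float, gb_per_process: float = 3.0) -> int:
    """Largest process count that fits the RAM budget (at least one)."""
    return max(1, int(ram_budget_gb // gb_per_process))


print(estimated_peak_ram_gb(8))  # 24.0
print(max_processes_for(16))     # 5
```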
The csd-serve entrypoint provides a thin wrapper around the optimade-maker tool, and bundles the configuration required to launch a local OPTIMADE API with a simple in-memory database (if --mongo-uri is provided, a real MongoDB backend will be used instead).
Just provide the path to your combined OPTIMADE JSONLines file:
csd-serve <path-to-optimade-jsonl>
You should now be able to try out some queries locally, either in the browser or with a tool like curl:
curl -G --data-urlencode 'filter=elements HAS "C"' http://localhost:5000/structures
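When scripting queries, the filter string has to be percent-encoded, since it contains spaces and double quotes. A minimal sketch using only the Python standard library follows; structures_url is a hypothetical helper, the base URL assumes the local server from the example above, and page_limit is the standard OPTIMADE query parameter for limiting the page size.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def structures_url(
    base_url: str, optimade_filter: str, page_limit: Optional[int] = None
) -> str:
    """Build a /structures query URL with a percent-encoded filter."""
    params = {"filter": optimade_filter}
    if page_limit is not None:
        params["page_limit"] = page_limit
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode(params, quote_via=quote)
    return f"{base_url.rstrip('/')}/structures?{query}"


print(structures_url("http://localhost:5000", 'elements HAS "C"'))
# http://localhost:5000/structures?filter=elements%20HAS%20%22C%22
```

The resulting URL can be passed to any HTTP client; the response is a JSON document whose data field lists the matching structures.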
For ease of deployment, a containerised version of the ingestion pipeline is available.
Important
You should verify that your license agreement allows for any kind of deployment outside of your private network; it likely does not.
To build the container from scratch, you need both a time-limited CSD installer download link (CSD_INSTALLER_URL) and your activation key (CSD_ACTIVATION_KEY).
Note
As of January 2025, you can request your time-limited CSD installer link at https://www.ccdc.cam.ac.uk/support-and-resources/download-the-csd/. Once you receive the email, the CSD_INSTALLER_URL should be the one listed as "CSD Portfolio Linux Online Installer (recommended, small download)".
These should be stored in a .env file that is available both at build time and runtime.
Note that managing these secrets requires a recent Docker version that includes Buildx.
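Before starting a long build, it can help to confirm that the .env file defines both variables. The sketch below assumes the common KEY=value format with unquoted values and # comments; parse_env and missing_keys are hypothetical helpers, and the values shown are placeholders rather than real credentials.

```python
from typing import Dict, List

REQUIRED_KEYS = ("CSD_INSTALLER_URL", "CSD_ACTIVATION_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_keys(env: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Placeholder values only (assumption): substitute your own link and key.
example = """
# secrets for the container build and runtime
CSD_INSTALLER_URL=https://example.org/time-limited-installer-link
CSD_ACTIVATION_KEY=XXXXXX-XXXXXX-XXXXXX
"""

print(missing_keys(parse_env(example)))  # []
```

To check a real file, pass the result of Path(".env").read_text() to parse_env instead of the example string.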
Once configured, you can build the container with
docker build --secret id=env,src=.env --target csd-optimade-server -t csd-optimade-server .
This will install the CSD inside the container, run the ingestion pipeline and
prepare an encrypted version of the CSD in the OPTIMADE JSONLines format.
The file can be decrypted with your CSD_ACTIVATION_KEY.
To launch the container (which will decrypt the file and start the OPTIMADE API locally):
docker run --env-file .env -p 5000:5000 csd-optimade-server
For development and deployment, you may prefer to use the bake definitions in docker-bake.hcl to build and tag the relevant build stages:
docker buildx bake csd-optimade-server
docker run --env-file .env -p 5000:5000 ghcr.io/datalab-industries/csd-optimade-server
As noted above, the CSD_ACTIVATION_KEY used to build the container must be provided at runtime.
The API container can also be configured with any of the OPTIMAKE_-prefixed environment variables.
The most important ones are listed here:
- OPTIMAKE_MONGO_URI: to use a persistent MongoDB backend, you can provide a MONGO_URI via: OPTIMAKE_DATABASE_BACKEND=mongodb OPTIMAKE_MONGO_URI=mongodb://mongodb_server:27017/optimade
- OPTIMAKE_BASE_URL: to set the base URL of the API (used to generate pagination links), you can provide a BASE_URL via: OPTIMAKE_BASE_URL=https://my-csd-deployment.com
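One way to pass these settings is with extra -e flags on docker run, alongside the .env file. The sketch below only assembles the command line and does not run it; docker_run_command is a hypothetical helper, and the image name, port and example values are taken from the commands and list above.

```python
import shlex
from typing import Dict, List, Optional


def docker_run_command(
    image: str,
    env_file: str = ".env",
    port: int = 5000,
    extra_env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a docker run argument list with optional -e overrides."""
    cmd = ["docker", "run", "--env-file", env_file, "-p", f"{port}:{port}"]
    for key, value in (extra_env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd


cmd = docker_run_command(
    "csd-optimade-server",
    extra_env={
        "OPTIMAKE_MONGO_URI": "mongodb://mongodb_server:27017/optimade",
        "OPTIMAKE_BASE_URL": "https://my-csd-deployment.com",
    },
)
# shlex.join quotes each argument so the result is safe to paste into a shell.
print(shlex.join(cmd))
```

The same variables could equally be added to the .env file; explicit -e flags simply keep per-deployment settings separate from the secrets.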
Finally, if using a persistent database, future runs of the API can be controlled with the CSD_OPTIMADE_INSERT environment variable.
If true (default), the configured database will be wiped and rebuilt from the JSONL file directly, and a separate process will run the API.
If false, only the API will be started, with no database rebuild.
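The decision described above amounts to reading a boolean flag with a default of true. The sketch below shows one way such a flag could be interpreted; it is not the container's actual startup logic, should_rebuild_database is a hypothetical name, and the set of spellings treated as false is an assumption.

```python
import os
from typing import Mapping, Optional


def should_rebuild_database(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless CSD_OPTIMADE_INSERT is explicitly set to a false value."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CSD_OPTIMADE_INSERT", "true")
    # Accepted false spellings are an assumption made for this sketch.
    return raw.strip().lower() not in {"false", "0", "no"}


print(should_rebuild_database({}))                                # True
print(should_rebuild_database({"CSD_OPTIMADE_INSERT": "false"}))  # False
```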
All development of this package (bug reports, suggestions, feedback and pull requests) occurs in the csd-optimade GitHub repository. Contribution guidelines and tips for getting help can be found in the contributing notes.
This project was developed by datalab industries ltd., on behalf of the UK's Physical Sciences Data Infrastructure (PSDI), supported by the Cambridge Crystallographic Data Centre (CCDC).