Assorted fixes for disorder and reduced formulae #22

Merged: 29 commits, Feb 12, 2025

Commits
efdc87c
Handle deuterated reduced formulae
ml-evs Feb 4, 2025
8d164c2
Handle missing or multi-component formulae
ml-evs Feb 4, 2025
ea94afb
Add `_csd_remark` as a searchable field
ml-evs Feb 4, 2025
60b675f
Remove nsites check and add simple tests for reduced formulae
ml-evs Feb 4, 2025
f820a75
Bump optimade-python-tools version to better handle disorder
ml-evs Feb 4, 2025
18e90dd
Add bigger subset test case
ml-evs Feb 4, 2025
b0338bc
Fix deuterated reduced formula
ml-evs Feb 4, 2025
f6892b5
Handle another formula edge case
ml-evs Feb 4, 2025
34373d4
Bump optimade-python-tools version to better handle disorder
ml-evs Feb 4, 2025
edba7c4
Fix type of z-prime field
ml-evs Feb 4, 2025
5a8c92f
Attempt to use packing automatically
ml-evs Feb 4, 2025
ec9974e
Add example license info
ml-evs Feb 7, 2025
ff2cef7
Add ability to exit the API after inserting, to allow asynchronous re…
ml-evs Feb 7, 2025
f3f17b3
Run async insertion pipeline in Dockerfile
ml-evs Feb 7, 2025
7c292cb
Add ability to turn off insertion pipeline via `CSD_OPTIMADE_INSERT` …
ml-evs Feb 7, 2025
5492bf7
Update README with more deployment instructions
ml-evs Feb 7, 2025
4a4ec7b
Refactoring to allow for `implicit_atoms` and completed elements list…
ml-evs Feb 7, 2025
a234e6b
Add fat test for all entries with results saved to disk
ml-evs Feb 8, 2025
dabcec5
Do not decrypt unless inserting
ml-evs Feb 9, 2025
2f74b6a
Fix typo in Dockerfile
ml-evs Feb 9, 2025
b9cc898
Use latest optimade-python-tools pre-release
ml-evs Feb 9, 2025
cd8c636
Skip bad identifiers in big test
ml-evs Feb 9, 2025
4617fc0
Move metadata into fixed modules rather than serve
ml-evs Feb 11, 2025
dd60d06
Export info endpoints dynamically into JSONL
ml-evs Feb 11, 2025
713da8d
Bump optimade-maker version
ml-evs Feb 11, 2025
221733c
Add more debug output to big test
ml-evs Feb 11, 2025
2265908
Properly check `CSD_OPTIMADE_INGEST` variable in docker entrypoint
ml-evs Feb 12, 2025
2d5a3b7
Tweak blocking behaviour in Dockerfile
ml-evs Feb 12, 2025
7f3ceb1
Tweak formula tests
ml-evs Feb 12, 2025
9 changes: 7 additions & 2 deletions Dockerfile
@@ -183,9 +183,14 @@ if [ -z "$CSD_ACTIVATION_KEY" ]; then
exit 1
fi

gpg --batch --passphrase ${CSD_ACTIVATION_KEY} --decrypt /opt/csd-optimade/csd-optimade.jsonl.gz.gpg | gunzip > /opt/csd-optimade/csd-optimade.jsonl
if [ "$CSD_OPTIMADE_INSERT" = "1" ] || [ "$CSD_OPTIMADE_INSERT" = "true" ]; then
# Run the API twice: once to wipe and reinsert the data then exit, the second to run the API
(gpg --batch --passphrase ${CSD_ACTIVATION_KEY} --decrypt /opt/csd-optimade/csd-optimade.jsonl.gz.gpg | gunzip > /opt/csd-optimade/csd-optimade.jsonl &&
exec uv run --no-sync csd-serve --port 5001 --exit-after-insert --drop-first /opt/csd-optimade/csd-optimade.jsonl) &
fi

exec uv run --no-sync csd-serve --no-insert /opt/csd-optimade/csd-optimade.jsonl

exec uv run --no-sync csd-serve --drop-first /opt/csd-optimade/csd-optimade.jsonl
EOF

RUN chmod +x /entrypoint.sh
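
The entrypoint's shell test accepts only the literal strings `1` and `true` for `CSD_OPTIMADE_INSERT`. As a rough illustration of the same check outside the shell, here is a minimal Python sketch. The function name `insert_enabled` is hypothetical and not part of this repository.

```python
import os


def insert_enabled(environ=None) -> bool:
    """Mirror the entrypoint's test: only the exact strings "1" or "true" enable insertion."""
    env = os.environ if environ is None else environ
    return env.get("CSD_OPTIMADE_INSERT", "") in ("1", "true")


print(insert_enabled({"CSD_OPTIMADE_INSERT": "true"}))  # True
print(insert_enabled({"CSD_OPTIMADE_INSERT": "TRUE"}))  # False: the comparison is case-sensitive
print(insert_enabled({}))  # False: an unset variable does not trigger reinsertion
```

Like the shell test, this treats any other value (including `yes` or `True`) as "do not insert".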
42 changes: 38 additions & 4 deletions README.md
@@ -113,7 +113,7 @@ Buildx.
Once configured, you can build the container with

```shell
docker build --secret id=env,src=.env -t csd-optimade .
docker build --secret id=env,src=.env --target csd-optimade-server -t csd-optimade-server .
```

This will install the CSD inside the container, run the ingestion pipeline and
@@ -124,11 +124,45 @@ To launch the container (which will decrypt the file and start the OPTIMADE
API locally):

```shell
docker run --env-file .env -p 5000:5000 csd-optimade
docker run --env-file .env -p 5000:5000 csd-optimade-server
```

For development, you may prefer to use the bake definitions in
`docker-bake.hcl` to build and tag the relevant build stages.

For development and deployment, you may prefer to use the bake definitions in
`docker-bake.hcl` to build and tag the relevant build stages:

```shell
docker buildx bake csd-optimade-server
docker run --env-file .env -p 5000:5000 ghcr.io/datalab-industries/csd-optimade-server
```

### Runtime configuration options

As noted above, the `CSD_ACTIVATION_KEY` used to build the container must be provided at runtime.

The API container can also be configured with any of the `OPTIMAKE_`-prefixed environment variables.

The most important ones are listed here:

- `OPTIMAKE_MONGO_URI`: to use a persistent MongoDB backend, you can provide a `MONGO_URI` via:

```shell
OPTIMAKE_DATABASE_BACKEND=mongodb
OPTIMAKE_MONGO_URI=mongodb://mongodb_server:27017/optimade
```

- `OPTIMAKE_BASE_URL`: to set the base URL of the API (used to generate pagination links), you can provide a `BASE_URL` via:

```shell
OPTIMAKE_BASE_URL=https://my-csd-deployment.com
```

Finally, if using a persistent database, future runs of the API can be controlled with the `CSD_OPTIMADE_INSERT` environment variable.
If `true` (default), the configured database will be wiped and rebuilt from the JSONL file directly, and a separate process will run the API.
If `false`, only the API will be started, with no database rebuild.
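
As an illustration of how these options fit together, the following sketch renders the contents of an `.env` file from a few settings. The helper `render_env_file` is hypothetical (it is not part of this project). The variable names are taken from the text above, assuming the backend variable is spelled `OPTIMAKE_DATABASE_BACKEND`, and the `insert` default follows the stated default of `true`.

```python
from typing import Optional


def render_env_file(
    activation_key: str,
    mongo_uri: Optional[str] = None,
    base_url: Optional[str] = None,
    insert: bool = True,
) -> str:
    """Build .env file contents for the container (hypothetical helper)."""
    lines = [f"CSD_ACTIVATION_KEY={activation_key}"]
    if mongo_uri is not None:
        # A persistent MongoDB backend needs both the backend switch and the URI.
        lines.append("OPTIMAKE_DATABASE_BACKEND=mongodb")
        lines.append(f"OPTIMAKE_MONGO_URI={mongo_uri}")
    if base_url is not None:
        lines.append(f"OPTIMAKE_BASE_URL={base_url}")
    lines.append(f"CSD_OPTIMADE_INSERT={'true' if insert else 'false'}")
    return "\n".join(lines) + "\n"


print(
    render_env_file(
        "<your-activation-key>",
        mongo_uri="mongodb://mongodb_server:27017/optimade",
        base_url="https://my-csd-deployment.com",
        insert=False,  # reuse the existing database without rebuilding it
    )
)
```

The resulting text can be saved as `.env` and passed to `docker run --env-file .env`.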

## Contributing and Getting Help

4 changes: 1 addition & 3 deletions pyproject.toml
@@ -20,7 +20,7 @@ classifiers = [
]
requires-python = ">= 3.11, < 3.12"
dependencies = [
"optimade @ git+https://github.com/Materials-Consortia/optimade-python-tools.git@ml-evs/jsonl-relationships-links",
"optimade @ git+https://github.com/Materials-Consortia/optimade-python-tools.git",
"optimade-maker @ git+https://github.com/materialscloud-org/optimade-maker.git@ml-evs/qol-server",
"tqdm ~= 4.66",
"pymongo >= 4, < 5",
@@ -80,6 +80,4 @@ testpaths = "tests"
addopts = "-rs"
filterwarnings = [
"error",
"ignore:.*total_num_atoms.*:RuntimeWarning",
"ignore:.*unable to reduce formula.*:UserWarning"
]
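
With only `"error"` left in `filterwarnings`, any warning raised during the tests becomes a failure, including the two message patterns that were previously ignored. A minimal sketch of that behaviour using the standard `warnings` module follows. The function `reduce_formula_stub` is a hypothetical stand-in, not code from this repository.

```python
import warnings


def reduce_formula_stub():
    """Hypothetical stand-in for code that emits a formerly ignored warning."""
    warnings.warn("unable to reduce formula", UserWarning)


with warnings.catch_warnings():
    # Comparable to the "error" entry: every warning is raised as an exception.
    warnings.simplefilter("error")
    try:
        reduce_formula_stub()
        raised = False
    except UserWarning:
        raised = True

print(raised)  # True
```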
41 changes: 40 additions & 1 deletion src/csd_optimade/fields.py
@@ -1,3 +1,7 @@
from optimade import __api_version__
from optimade.models.baseinfo import BaseInfoAttributes, BaseInfoResource


def generate_csd_provider_fields():
    return {
        "structures": [
@@ -92,8 +96,43 @@ def generate_csd_provider_fields():
            },
            {
                "name": "_csd_z_prime",
                "type": "integer",
                "type": "float",
                "description": "The number of formula units in the asymmetric unit.",
            },
            {
                "name": "_csd_remarks",
                "type": "string",
                "description": "Free-text remarks about the structure.",
            },
        ]
    }


def generate_csd_provider_info():
    return {
        "prefix": "csd",
        "name": "Cambridge Structural Database",
        "description": "A database of crystal structures curated by the Cambridge Crystallographic Data Centre.",
        "homepage": "https://www.ccdc.cam.ac.uk",
    }


def generate_license_link():
    return "https://www.ccdc.cam.ac.uk/licence-agreement"


def generate_csd_info_endpoint() -> dict[str, BaseInfoResource]:
    return {
        "data": BaseInfoResource(
            attributes=BaseInfoAttributes(
                api_version=__api_version__,
                available_api_versions=[],
                formats=["json"],
                available_endpoints=["info", "structures", "references"],
                entry_types_by_format={"json": ["info", "structures", "references"]},
                is_index=False,
                license={"href": generate_license_link()},
                available_licenses=None,
            )
        )
    }
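
The provider field definitions are plain dictionaries, so they can be sanity-checked without the server. The sketch below is one possible check and is not part of this PR. `check_provider_fields` and `ALLOWED_TYPES` are hypothetical names, and the type list is the set of data types named in the OPTIMADE specification. Note that a `float` type for `_csd_z_prime` accommodates fractional values such as 0.5.

```python
# OPTIMADE data type names (assumed complete for this sketch).
ALLOWED_TYPES = {
    "string", "integer", "float", "boolean",
    "timestamp", "list", "dictionary", "unknown",
}


def check_provider_fields(fields, prefix="csd"):
    """Return a list of human-readable problems found in the field definitions."""
    problems = []
    for field in fields:
        name = field.get("name", "")
        if not name.startswith(f"_{prefix}_"):
            problems.append(f"{name!r}: missing provider prefix '_{prefix}_'")
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name!r}: unknown type {field.get('type')!r}")
        if not field.get("description"):
            problems.append(f"{name!r}: empty description")
    return problems


sample_fields = [
    {
        "name": "_csd_z_prime",
        "type": "float",
        "description": "The number of formula units in the asymmetric unit.",
    },
    {
        "name": "_csd_remarks",
        "type": "string",
        "description": "Free-text remarks about the structure.",
    },
]

print(check_provider_fields(sample_fields))  # []
```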
40 changes: 37 additions & 3 deletions src/csd_optimade/ingest.py
@@ -1,8 +1,17 @@
from __future__ import annotations

from optimade import __api_version__

from csd_optimade.fields import (
    generate_csd_info_endpoint,
    generate_csd_provider_fields,
    generate_csd_provider_info,
)

BAD_IDENTIFIERS = {
    "QIJZOB",  # hangs infinitely during mapping
    "VOHZIB",  # no 3D structure
    "YIGKOP",
}

import glob
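
How `BAD_IDENTIFIERS` is consumed is not visible in this hunk. A minimal sketch of one way such a skip-list could be applied is shown below. `filter_identifiers` is a hypothetical helper, and the input refcodes other than the three above are illustrative placeholders.

```python
BAD_IDENTIFIERS = {"QIJZOB", "VOHZIB", "YIGKOP"}


def filter_identifiers(identifiers):
    """Yield identifiers in order, skipping any that are on the skip-list."""
    for identifier in identifiers:
        if identifier in BAD_IDENTIFIERS:
            continue
        yield identifier


print(list(filter_identifiers(["ABEBUF", "QIJZOB", "ACSALA"])))  # ['ABEBUF', 'ACSALA']
```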
@@ -189,13 +198,38 @@ def cli():
    with open(tmp_jsonl_path) as tmp_jsonl:
        ids_by_type: dict[str, set] = {}
        with open(output_file, "w") as final_jsonl:
            # Write headers
            # Write headers and info endpoints
            final_jsonl.write(
                json.dumps({"x-optimade": {"meta": {"api_version": "1.1.0"}}}) + "\n"
                json.dumps({"x-optimade": {"meta": {"api_version": __api_version__}}})
                + "\n"
            )

            info = generate_csd_info_endpoint()
            provider = generate_csd_provider_info()
            final_jsonl.write(
                json.dumps(
                    {
                        "data": info["data"].model_dump(
                            exclude_unset=True, exclude_none=False
                        )
                    }
                )
                + "\n"
            )
            final_jsonl.write(
                _construct_entry_type_info(
                    "structures", properties=[], provider_prefix=""
                    "structures",
                    properties=generate_csd_provider_fields()["structures"],
                    provider_prefix=provider["prefix"],
                ).model_dump_json()
                + "\n"
            )

            final_jsonl.write(
                _construct_entry_type_info(
                    "references",
                    properties=[],
                    provider_prefix=provider["prefix"],
                ).model_dump_json()
                + "\n"
            )
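
The hunk above writes a header line, then the info endpoint, then one entry-info line per entry type, each as a single JSON document. A standard-library sketch of that line layout is given below. `write_jsonl_preamble` is a hypothetical helper, and the payloads and the version string `"1.2.0"` are simplified placeholders rather than the real model output.

```python
import io
import json


def write_jsonl_preamble(stream, api_version, info, entry_infos):
    """Write the header, the info endpoint, then one line per entry-type info."""
    stream.write(
        json.dumps({"x-optimade": {"meta": {"api_version": api_version}}}) + "\n"
    )
    stream.write(json.dumps({"data": info}) + "\n")
    for entry_info in entry_infos:
        stream.write(json.dumps(entry_info) + "\n")


buffer = io.StringIO()
write_jsonl_preamble(
    buffer,
    "1.2.0",  # placeholder version string
    {"type": "info", "id": "/"},
    [{"type": "info", "id": "structures"}, {"type": "info", "id": "references"}],
)
lines = buffer.getvalue().splitlines()
print(len(lines))  # 4
print(json.loads(lines[0])["x-optimade"]["meta"]["api_version"])  # 1.2.0
```

Keeping each record on its own line means a consumer can stream the file and parse it one line at a time.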