The STRATA-FIT Data Validation App is a FastAPI-based tool designed to validate CSV data files containing clinical information about patients with rheumatoid arthritis (RA). The validation is performed against a customizable YAML schema, with support for Pydantic constraints like ge, le, and more, allowing users to enforce specific data quality and integrity rules.
The application supports file uploads through an API endpoint where users can submit their CSV files to be validated. Errors and discrepancies are reported back in a user-friendly format, making it easier for clinicians and researchers to identify and correct data issues.
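To make the ge and le constraints concrete, here is a minimal stdlib-only sketch of what a greater-than-or-equal / less-than-or-equal bound check does on a CSV row. The field names (age, das28) and their bounds are hypothetical, and the app itself relies on Pydantic and its YAML schema rather than on this code.

```python
import csv
import io

# Hypothetical bounds, mirroring Pydantic's ge (>=) and le (<=) constraints.
BOUNDS = {
    "age": {"ge": 18, "le": 120},
    "das28": {"ge": 0.0, "le": 10.0},
}


def check_row(row, bounds=BOUNDS):
    """Return a list of human-readable constraint violations for one row."""
    errors = []
    for field, limits in bounds.items():
        raw = row.get(field, "")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            errors.append(f"{field}: {raw!r} is not a number")
            continue
        if "ge" in limits and value < limits["ge"]:
            errors.append(f"{field}: {value} is below the minimum {limits['ge']}")
        if "le" in limits and value > limits["le"]:
            errors.append(f"{field}: {value} is above the maximum {limits['le']}")
    return errors


# Two sample rows: the first is valid, the second violates both bounds.
SAMPLE = "age,das28\n45,3.2\n17,11.5\n"
results = [check_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for line_no, errs in enumerate(results, start=1):
    print(line_no, errs or "ok")
```

Each violation is reported per field, which is the kind of row-level feedback a clinician or researcher would need to locate and correct a bad value.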
Ensure you have a Docker daemon installed and running. The easiest option is to download and install Docker Desktop.
To run the application in a Docker container, optionally with custom configurations (config/):
1. Pull the Docker image:
docker pull ghcr.io/mdw-nl/strata-fit-data-val:latest
2. Run the Docker container:
2.1. Using the regular configuration:
docker run --rm -p 8000:8000 ghcr.io/mdw-nl/strata-fit-data-val:latest
2.2. With your custom configuration directory mounted into the container:
docker run --rm -p 8000:8000 -v $(pwd)/config:/app/config ghcr.io/mdw-nl/strata-fit-data-val:latest
This command maps the local config/ directory to /app/config within the container (with -v $(pwd)/config:/app/config), ensuring your custom settings are used.
3. Access the application: visit http://localhost:8000/docs to interact with the API.
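On shells where $(pwd) is not available, the same docker run invocation can be assembled from Python. The following is a small sketch, not part of the project: the helper name docker_run_command is hypothetical, and it only builds and prints the command rather than starting a container.

```python
import shlex
from pathlib import Path

IMAGE = "ghcr.io/mdw-nl/strata-fit-data-val:latest"


def docker_run_command(config_dir=None, port=8000):
    """Build the `docker run` argument list, optionally mounting config_dir."""
    cmd = ["docker", "run", "--rm", "-p", f"{port}:8000"]
    if config_dir is not None:
        # With -v, a relative name would be treated as a named volume,
        # so resolve the host directory to an absolute path first.
        host_path = Path(config_dir).resolve()
        cmd += ["-v", f"{host_path}:/app/config"]
    cmd.append(IMAGE)
    return cmd


# Print the command for inspection; nothing is executed here.
print(shlex.join(docker_run_command("config")))
```

Passing the returned list to subprocess.run(..., check=True) would start the container in the foreground, just like the shell command.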
To use your own data validation schema:
1. Edit the schema.yaml file: modify the config/schema.yaml file with your custom data validation rules.
2. Mount the configuration directory: ensure your modified config/ directory is correctly mounted when running the Docker container:
docker run --rm -p 8000:8000 -v $(pwd)/config:/app/config ghcr.io/mdw-nl/strata-fit-data-val:latest
3. Check your schema: you can verify the current schema by accessing the /schema endpoint at http://localhost:8000/docs or with the following command:
curl http://localhost:8000/schema
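The same check can be scripted. This is a minimal sketch using only the standard library; it assumes the /schema endpoint returns a JSON body, which the source does not state explicitly, and the helper names endpoint and fetch_json are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def endpoint(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path such as '/schema'."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """GET an endpoint and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(endpoint(path, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, fetch_json("/schema") would return the decoded schema, which can then be compared against the rules you expect after editing config/schema.yaml.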
All runtime parameters live in config/settings.yaml, so you can adjust them without touching code:
app:
  data:
    chunksize: 10 # number of rows to process per pandas chunk
    model_name: PatientData
  errors:
    max_to_collect: 1000 # stop streaming after this many validation errors
- app.data.chunksize controls how many rows are read and validated at once (lower it to reduce memory use).
- app.data.model_name chooses which Pydantic model from config/schema.yaml to use.
- app.errors.max_to_collect caps the total number of error objects emitted to the client.
Update these values, then restart your container (or local server). After the restart, the /validate endpoint uses the new limits, with no code changes or redeploy required.
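The interaction between chunksize and max_to_collect can be sketched as follows. This is a stdlib-only illustration under stated assumptions, not the app's implementation: the app reads chunks with pandas, whereas this sketch uses csv and itertools.islice, and the rule non_negative_age and the column age are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` items from an iterator."""
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


def validate_stream(csv_text, check_row, chunksize=10, max_to_collect=1000):
    """Validate rows chunk by chunk, stopping once the error cap is reached."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    row_number = 0
    for chunk in iter_chunks(reader, chunksize):
        for row in chunk:
            row_number += 1
            for message in check_row(row):
                errors.append({"row": row_number, "error": message})
                if len(errors) >= max_to_collect:
                    return errors  # cap reached: stop reading further rows
    return errors


def non_negative_age(row):
    """Hypothetical rule: `age` must be a non-negative integer."""
    try:
        return [] if int(row["age"]) >= 0 else ["age must be >= 0"]
    except (KeyError, TypeError, ValueError):
        return ["age is missing or not an integer"]


# 25 invalid rows, but only the first 5 errors are collected.
data = "age\n" + "\n".join(["-1"] * 25)
capped = validate_stream(data, non_negative_age, chunksize=10, max_to_collect=5)
print(len(capped))
```

Only one chunk is held in memory at a time, which is why lowering chunksize reduces memory use, while the cap bounds the size of the error report regardless of file length.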
For local development:
1. Install dependencies: ensure Python 3.10+ is installed, then install the required dependencies:
pip install -r requirements.txt
2. Run the application: start the FastAPI server with Uvicorn:
uvicorn api.main:app --reload
3. Interact with the API: access the API through a web browser, or use tools like curl or Postman to upload CSV files for validation:
curl -F 'file=@path_to_your_file.csv' http://localhost:8000/validate
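The curl upload can also be reproduced from Python without extra dependencies. The sketch below builds the multipart/form-data body by hand, using the form field name file from the curl command; the helper names are hypothetical, and because the response format is not described in the source, the raw response text is returned undecoded.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, content, content_type="text/csv"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def upload_csv(path, url="http://localhost:8000/validate", timeout=60):
    """POST a CSV file to /validate and return the raw response text."""
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, upload_csv("path_to_your_file.csv") sends the same request as the curl command and returns whatever the endpoint reports.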
- /validate: Upload and validate your CSV file.
- /settings: Access the current application settings.
- /schema: Access the current data schema.
For more detailed development guidelines, please refer to the DEV.md file.