This repository serves as an example template for building your own analysis workflows. You should duplicate this repository and customize it for your specific analysis needs. Do not modify this template directly - instead, create your own copy and build your analysis on top of the provided framework.
The analysis wrapper is a standardized framework for running large-scale data analysis workflows on cloud infrastructure. It processes job input models from the job dispatcher, executes your custom analysis code, and automatically handles metadata tracking and result storage.
The analysis wrapper:
- Receives job input models containing data file locations and analysis parameters
- Executes your custom analysis code on the specified datasets
- Tracks metadata including inputs, parameters, code versions, and execution details
- Stores results to cloud storage and writes metadata records to a document database
- Prevents duplicate processing by checking if analysis has already been completed
The steps below are needed to configure the analysis wrapper. Go to the environment tab in the capsule. Steps 1-3 are found under Environment Variables. Step 4 is found under Secrets. A sketch of how these values might be read inside the capsule follows the list.

1. Set the `DOCDB_COLLECTION` to the project collection.
2. Set the `CODEOCEAN_EMAIL`.
3. Replace the `ANALYSIS_BUCKET` with the path to the analysis bucket on S3.
4. Set the corresponding secrets. Users will have to make a Code Ocean token; read permissions are sufficient for this. See the creating token docs for more information.
5. Add any required packages needed for analysis.
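As a rough illustration only (not the wrapper's actual code), the environment variables set above can be read inside the capsule along these lines; the secret's variable name shown here is a hypothetical placeholder.

```python
import os

# Values configured under Environment Variables in the capsule's environment tab.
docdb_collection = os.environ["DOCDB_COLLECTION"]  # project collection name
codeocean_email = os.environ["CODEOCEAN_EMAIL"]    # email associated with the Code Ocean account
analysis_bucket = os.environ["ANALYSIS_BUCKET"]    # S3 path where /results/ is copied

# The Code Ocean token is stored as a secret; "CODEOCEAN_TOKEN" is a
# hypothetical name -- use whatever name the secret was given in the capsule.
codeocean_token = os.environ.get("CODEOCEAN_TOKEN")
```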
To help facilitate tracking of analysis parameters for reproducibility, a user should define their own pydantic model in the analysis wrapper. Follow the steps below; a sketch of a custom model is shown after this list. An example can be found in `/code/example_analysis_model.py`:

- Start by renaming the example analysis model to match the user's custom model.
- Then add any fields that would be useful in aiding reproducibility of the analysis. The listed fields are just examples, not a requirement. It is recommended to add a field to tag the version of the analysis run. See Analysis Metadata Tracking.
- Additionally, for any small numerical outputs, define these in the output model if needed. Larger files such as arrays and tables should be written to the results folder (see the next section).
- Once this is done, be sure to replace all references to `ExampleAnalysisSpecification` and `ExampleAnalysisOutputs` in `run_capsule.py`. If no output model is needed, remove the lines referencing `ExampleAnalysisOutputs` in the `run_analysis` function.
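The snippet below is a minimal sketch of what a renamed pair of models might look like. The class names and fields (`analysis_version`, `smoothing_window`, `mean_response`, `n_units`) are illustrative placeholders, not fields required by the framework.

```python
from pydantic import BaseModel, Field


class MyAnalysisSpecification(BaseModel):
    """Stands in for ExampleAnalysisSpecification; fields are illustrative."""

    analysis_version: str = Field(..., description="Version tag for this analysis run")
    smoothing_window: int = Field(10, description="Example analysis parameter")


class MyAnalysisOutputs(BaseModel):
    """Stands in for ExampleAnalysisOutputs; only small numerical outputs belong here."""

    mean_response: float = Field(..., description="Example small scalar output")
    n_units: int = Field(..., description="Example count stored with the metadata record")
```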
User-defined analysis can be specified in the `run_analysis` function in `run_capsule.py`. An example of the input model passed in can be found in `/data/job_dict`. An example of analysis parameters that correspond to the example pydantic model can be found in `/data/analysis_parameters.json`. Modify the analysis parameters JSON for testing if needed, and make sure the fields match those in the model defined. Some other notes below (a sketch of a `run_analysis` implementation follows these notes):

- Users can also add an app panel for input arguments that are part of the analysis model.
- If no file extension was specified when dispatching, change to `analysis_dispatch_inputs.s3_location` in the for loop in the `run_analysis` function. Users will then need to read from the S3 bucket directly.
- Results should be written to the `/results/` folder in the capsule. The results folder will then be copied to the S3 analysis bucket path set in the environment variables. This path will be stored as part of the metadata record that gets written to the document database and can be queried later on.
- The metadata record is a combination of input data, analysis parameters, git commits, etc. All of these are used to query whether analysis has already been run on that combination of input data, parameters, etc. IMPORTANT: BE SURE TO COMMIT ALL CHANGES IN THIS CAPSULE. IF CHANGES ARE NOT COMMITTED AND ANALYSIS NEEDS TO BE RUN, IT COULD BE SKIPPED IF A METADATA RECORD ALREADY EXISTS FOR THE GIVEN COMBINATION OF INPUT DATA, ANALYSIS PARAMETERS, CODE, ETC.
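As a loose sketch only (the actual signature and objects passed by `run_capsule.py` may differ), a `run_analysis` implementation generally parses the parameters into the custom model, loops over the dispatched inputs, writes large outputs to `/results/`, and returns small numerical outputs via the output model. The import path and class names below are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical imports -- adjust to the renamed model module and classes.
from example_analysis_model import MyAnalysisSpecification, MyAnalysisOutputs


def run_analysis(dispatch_inputs, parameters: MyAnalysisSpecification) -> MyAnalysisOutputs:
    """Illustrative only; match the real signature in run_capsule.py."""
    results_dir = Path("/results")
    results_dir.mkdir(parents=True, exist_ok=True)

    summaries = []
    for analysis_dispatch_inputs in dispatch_inputs:
        # If no file extension was specified when dispatching, read from
        # analysis_dispatch_inputs.s3_location directly (e.g. with boto3).
        summaries.append({"s3_location": str(analysis_dispatch_inputs.s3_location)})

    # Large outputs (arrays, tables, figures) belong in /results/, which is
    # copied to the S3 analysis bucket after the run.
    with open(results_dir / "summary.json", "w") as f:
        json.dump(summaries, f)

    # Small numerical outputs are returned through the output model and are
    # stored as part of the metadata record.
    return MyAnalysisOutputs(mean_response=0.0, n_units=len(summaries))
```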
This framework automatically tracks completed analyses using a metadata record that includes:
- The input data
- The analysis parameters
- The committed code version
- The output S3 bucket containing analysis results
If all of these are the same, the system will skip re-running the analysis to avoid duplicate processing.
If you modify your analysis parameters or code, the system will consider it a different record and run it normally after the changes in the capsule have been committed.
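Conceptually, the duplicate check behaves roughly like the sketch below; the field names and hashing scheme are illustrative and not the framework's actual implementation.

```python
import hashlib
import json


def record_fingerprint(input_data: dict, parameters: dict, git_commit: str, output_bucket: str) -> str:
    """Illustrative fingerprint over the pieces that make up the metadata record."""
    payload = json.dumps(
        {
            "input_data": input_data,
            "parameters": parameters,
            "git_commit": git_commit,
            "output_bucket": output_bucket,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# If a record with the same fingerprint already exists in the document database,
# the analysis is skipped. Changing any piece -- a version label in the
# parameters, or newly committed code -- produces a new fingerprint and the
# analysis runs again.
```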
To intentionally rerun the same analysis on the same data:
- Ensure that at least one part of the metadata record changes; this could be a version label or a code comment.
- While not required, adding a versioning field to the user-defined pydantic model is strongly recommended for reproducibility and for distinguishing between analysis runs.
To test, a reproducible run can be executed. When ready to run the analysis and post results, be sure to set the dry run flag in the app panel to 0 so the results are posted. By default, dry run is enabled.
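The snippet below is only an assumed illustration of how an app panel dry run flag could be consumed; the actual argument handling in `run_capsule.py` may differ.

```python
import sys

# Hypothetical handling of the app panel dry run flag; dry run stays enabled
# unless the flag is explicitly set to 0.
dry_run = True
if len(sys.argv) > 1:
    dry_run = sys.argv[1] != "0"

if dry_run:
    print("Dry run: results and the metadata record will not be posted.")
else:
    print("Posting results to the analysis bucket and writing the metadata record.")
```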