Skip to content

Incremental BIDS import #1211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

maximemulder
Copy link
Contributor

@maximemulder maximemulder commented Dec 3, 2024

BIDS dataset import script rewrite

List of new features

  • Fully incremental BIDS import. That is, a BIDS dataset can be imported in several parts, and only the files that have not been imported yet will be added to LORIS.
  • Better logging and error reporting. Prints and errors are now logged in the LORIS-MRI logging system. Moreover, additional information is given during the script run, and previously silent errors (such as unidentified scan types) are now signaled to the user.
  • Better support for sessionful and sessionless BIDS datasets. Before, some code paths were only implemented for one of those case, now the system is generic over the presence or absence of BIDS session directories, ensuring better support for both these cases.
  • The --create_candidate or --create_session options are now transactional. Either all candidates and sessions are created without error, or none is created if an error occurs (due to incorrect input data).
  • The --create_candidate option now looks for sites and projects using both names and aliases, in that order.

List of breaking changes

  • The script has been renamed from bids_import.py to import_bids_dataset.py. This should better follow the LORIS-MRI scripts naming conventions.
  • An error is now returned if the dataset_description.json file is absent. This file is a required BIDS file, and was already needed unless the --nocopy option was used. It is also trivial to write anyway.
  • An error is now returned for MRI files if the --type=derivatives or --no_copy options are used. It seems to me that the current code was not tested and probably not working, so it is better to have a hard error rather than an untested and potentially wrong implementation IMO.
  • The --idsvalidation option has been removed. This option allowed to validate the BIDS directory name according to a pscid_candid_blablabla format. This is obviously esoteric and overly-specific as BIDS dataset can contain several participants, and their identifiers should be retrieved from the subject directory names or the participants.tsv file instead.
  • The --create_candidate option no longer tries to find the site by matching site aliases against participants IDs. This approach can give false positives if a similar (and not even necessarily equal) alias as a project. The site participants.tsv column is now mandatory to create candidates.

List of major architectural changes

Note that these changes only apply to the generic BIDS import code (subjects and sessions), and the MRI import code. It does not apply to EEG import code.

  • The code has been fully rewritten.
  • The code is now fully typed 😌.
  • The code now uses the new ORM database abstraction instead of the old database abstraction or inline queries.
  • The code now uses the LORIS-MRI log abstraction instead of print.
  • The BIDS dataset import code has been moved and isolated in the new lib.import_bids_dataset module.
  • Created the new lib.imaging_lib module to replace the old lib.imaging module. Unlike lib.imaging, the new module is fully typed, uses the new ORM database abstraction, and is divided into submodules related to different kinds of imaging data (BIDS, DICOM, file parameters...).
  • Candidates and sessions are obtained at the subject and session levels. This was previously done at the data type level like before and introduced significant duplication between MRI and EEG code and nonsensical dependencies.

Future works

  • Migrate the EEG code.
  • Add an overwrite option to replace the existing files by the new ones if present.
  • Remove the default visit label configuration option and make it an argument (along with default site, default cohort...) ?

Other comments

This PR is very large so it will likely be divided in smaller PRs to be reviewed and merged.

@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 10 times, most recently from 8ccc017 to 751bf02 Compare December 3, 2024 22:18
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 3 times, most recently from d48df6c to 2ce3fb9 Compare December 22, 2024 04:41
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 2 times, most recently from 8033f6e to b3aeb95 Compare January 2, 2025 14:04
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 2 times, most recently from 00d3c65 to d2caaab Compare January 8, 2025 17:14
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 3 times, most recently from 30a3c2c to f27561a Compare February 18, 2025 09:24
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch from 6a4b0bf to c7f6407 Compare March 20, 2025 10:18
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 4 times, most recently from 83afa11 to 56a4c42 Compare April 13, 2025 09:36
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch from 56a4c42 to 84dcebc Compare April 18, 2025 09:57
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 2 times, most recently from 1d3ee78 to 3494167 Compare May 2, 2025 09:35
@maximemulder maximemulder force-pushed the 2024-12-02_incremental_import_bids branch 3 times, most recently from a5a5a1a to 92bc1fd Compare May 4, 2025 09:54
@laemtl
Copy link
Contributor

laemtl commented May 23, 2025

Hello, a few comments:

  • instead of removing --idsvalidation, I think it is better to migrate the logic to use participants.tsv instead.
  • ideally, we need a way for the script to work without dataset_description.json, or to use the one in a specific location, depending on how the incremental ingestion works.

bids session dataclass

bids participants dataclass

factorize combination iteration

fix layout ignore

skip files already inserted

wip

fix mri path join

rebase commit
@MaximeBICMTL MaximeBICMTL force-pushed the 2024-12-02_incremental_import_bids branch from 8f2f244 to b533b6b Compare June 16, 2025 07:09
@MaximeBICMTL MaximeBICMTL force-pushed the 2024-12-02_incremental_import_bids branch from b533b6b to 646dd27 Compare June 16, 2025 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants