Skip to content

New DICOM study import script #1117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 12, 2025
Merged

Conversation

maximemulder
Copy link
Contributor

@maximemulder maximemulder commented Apr 25, 2024

Description

This PR is a rewrite in Python (+ cleaning up / refactor) of the DICOM archiving scripts in Python, with a few added features.

In more details, the changes are the following:

  • dicom-archive/dicomTar.plpython/scripts/import_dicom_study.py
  • dicom-archive/dicomSummary.plpython/scripts/summarize_dicom_study.py
  • dicom-archive/DICOM/DCMSUM.pm (and a bunch of other code) → python/lib/import_dicom_study/*

This PR aims to be feature-compatible with the current Perl scripts, such that it can eventually replace it seemlessly. It removes some arguments that were unused (-centerName, -today and -addMriUpload), sorts the DICOM archive in a year subdirectory automatically (this was previously done later in the pipeline), and automatically use the tarchive directory as a target.

This PR also adds the --session argument to determine the DICOM archive session based on the Python configuration file (and fail if no session is found).

Testing

Tested all options successfully with Raisinbread and MPN data. The data looks identical with both the old DICOM archiver and this new script.

Potential future work

  • Replace the old DICOM archiver by the new script within the existing MRI pipeline.
  • Clean up the tarchive table.
  • Add integration test for CI.

Notes for existing projects

  • need to install pydicom library pip install pydicom

@maximemulder maximemulder marked this pull request as draft April 25, 2024 16:15
@maximemulder
Copy link
Contributor Author

Okay, I think the PR is finished, I am happy with the code and tested it on a few DICOMs on my side.

The tests are failing because I am using a few match statements, which were added in Python 3.10. So I'd like to bump the Python minimum supported version to 3.10, if not, I can remove them (but obviously prefer the newer and shinier version).

We should also test it on DICOMs from other projects. Then, when this PR is merged, I can make another PR to adopt it in the different superscripts, although a problem will be that this script requires the Python profile file, while the scripts that call dicomTar.pl use the Perl one (so both will probably need to be passed as arguments).

I will note a few things on the code:

  • I did not use our custom argument parser because there is now argparse in the standard Python library, which handles all the cases we want. We probably want to transition to that long term.
  • I also wrote a few database helpers for my code, but I think in the long term we should transition to a database and query abstraction library, like sqlalchemy which I've heard about.

@maximemulder maximemulder marked this pull request as ready for review May 21, 2024 21:15
@maximemulder
Copy link
Contributor Author

maximemulder commented Jul 4, 2024

Blocked on Python 3.10, as well as #1141 and #1142 to see if this PR should be updated with new libraries.

NOTE for myself: Bring a fix from the new-mri-upload branch before merging (I don't remember which one but it should be easy) and do not initiate the database connection if the profile is not supplied.

Copy link
Collaborator

@cmadjar cmadjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maximemulder I read through the code and have a few comments/refactoring suggestions. Let me know your thoughts.

Note: I did not test yet, just went through the code.

@maximemulder
Copy link
Contributor Author

Thanks for the review @cmadjar ! I wrote some comments on a few points. Points with no comments mean that I agree or will look into it.

@cmadjar
Copy link
Collaborator

cmadjar commented Aug 1, 2024

@maximemulder I think if you rebase this one, it will pass the flake8 now that your PR with the python install update has been merged. Thanks!

Also, regarding the -mri_upload_update, this option is used by projects that do not use the imaging uploader to upload their scan and run directly dicomTar.pl. I think we should still support that functionality as some projects might still be using the old way of importing DICOMs. Otherwise, the script to create the MINC or BIDS dataset would break because there is no entry in the mri_upload table. Actually, before you do this change, could you bring it up during the next imaging call? It should be next week on August 8th. Thank you :)

@maximemulder
Copy link
Contributor Author

Ok I replied to all your (very good !) comments @cmadjar. The only ones I have intentionally left out are these ones :

  1. Move SQL queries to database_lib (waiting on a decision on SQLAlchemy at the next imaging meeting).
  2. Add the --db-upload parameter (done for a while on my SQLAlchemy branch)
  3. Use log.py (full implementation depends on --db-upload)

If you like the rest of the PR, I can add these very quickly after the next imaging meeting.

@laemtl
Copy link
Contributor

laemtl commented Sep 13, 2024

Waiting for the first test implementation to polish it.

@laemtl laemtl added Area: CI PR or issue related to continuous integration, including automated tests and static checks Needs Work Before Merging and removed Area: CI PR or issue related to continuous integration, including automated tests and static checks labels Sep 13, 2024
@maximemulder maximemulder marked this pull request as draft September 13, 2024 14:39
@maximemulder maximemulder force-pushed the rewrite-dicom-archive branch 11 times, most recently from 3503263 to 3626bcb Compare September 26, 2024 19:46
@maximemulder maximemulder added Difficuly: Complex PR or issue that require a great effort to implementat, review, or test Area: DICOM PR or issue related to the handling of DICOM files labels Oct 4, 2024
@maximemulder maximemulder force-pushed the rewrite-dicom-archive branch from 8fa6000 to 3c2d49d Compare March 2, 2025 16:27
@maximemulder maximemulder force-pushed the rewrite-dicom-archive branch 2 times, most recently from 07ba539 to fbcc9ca Compare March 7, 2025 14:59
Comment on lines +127 to +131
if args.insert and args.update:
log_error_exit(
env,
"Arguments '--insert' and '--update' cannot be used both at the same time.",
lib.exitcode.INVALID_ARG,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make the distinction between --insert and --update here? I think a simple --database option is enough.

If --overwrite is used, then, it would update, if not, it would insert after having check that there is not already an entry in the DB for the StudyUID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this to provide more options in case there are inconsistencies between the file system and the database (a file but no entry or an entry but no file). --insert and --update are only about the database, while --overwrite is about the file system. I don't have a strong opinion on this but it seemed like a good idea when I did it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Interesting. I have to think about this one. I would have a tendency to have a simple database flag and then figure out if data should be updated or inserted in the database when want to update the database instead of having to do that outside of the script in order to properly call import_dicom_study.py... I will sleep on it and get back to you.

Copy link
Contributor Author

@maximemulder maximemulder Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hhhhm, not sure I understand your last message correctly. In any case, I am ok to have a single --update option that both overwrites the file system and updates the entries in the database. However, the script should definitely not overwrite / update anything if the user does not specify it IMO, that is too error prone and could lead to a loss of data IMO.

@cmadjar
Copy link
Collaborator

cmadjar commented Mar 7, 2025

@maximemulder very nice work overall!!

Two summary points:

  • issue with a Philips odd B1 images (discussion on slack with details about the issue found and suggestion of improvements)
  • issue with enhanced DICOMs (I want to take a closer look at this one on Monday to see if it would be possible to read enhanced DICOMs parameters). That last one is not a blocker to merge the PR though and can be implemented later on.

* print

* print

* print

* print

* add support for enhanced DICOMs

* add support for enhanced DICOMs
@maximemulder
Copy link
Contributor Author

maximemulder commented Mar 10, 2025

Thanks for your feedback @cmadjar ! For your information, I tested the PR with some Rainsinbread data and some MPN data, which is acquired using a Siemens scanner, so testing this PR on DICOMs from other scanners is very valuable.

When I wrote the DICOM sorting logic of this PR, I tried to closely follow the code of DCMSUM.pl, which itself seems to follow the structure of the generated "DICOM summary" file rather closely. I took another look at the code I wrote (back in June damn) and the (demonic) Perl code in DCMSUM.pl.

The DCMSUM.pl file basically looks at all the DICOM files and considers that each DICOM series is uniquely identified by a tuple of sequence name, echo number and series number. It also considers that a DICOM series has a single echo time, repetition time, inversion time... This is obviously false, notably for multi-echo series (which I have on MPN). When that is the case, the DCMSUM.pl code only saves a single DICOM series in the database, with a single record of the parameters used (a single echo time out of all the echo times of the series, a single repetition time out of all the repetition times of the series...).

As far as I understand it, the standard way to uniquely identify a DICOM series within a DICOM study is to use the SeriesUID DICOM field. This field is nullable in our database schema, but I checked and there is no NULL entry in neither Rainsinbread nor C-BIG. I think fixing our database schema to handle multi-parameter series is out of scope for this PR. IMO, what we should do for now in this PR is:

  • Stop assuming DICOM series can be identified using the tuple of sequence name, echo number and series number.
  • Create a DICOM series entry in the database for each tuple of all the parameters of the series. This results in duplicate entries for a single DICOM series, but it is better than ignoring the some of the parameter values of the series IMO (also note that there can already be duplicate entries since the current identification method is flawed).
  • Instead of inserting all the DICOM series in the database, and then querying the database to get the DICOM series of each DICOM file as it is currently done (copied from DCMSUM.pl), simply use a dict[DicomStudyDicomSeries, list[DicomStudyDicomFile]] when identifying the DICOM files, and insert each series along with its files, resulting in more straightforward and less error-prone code.

Pointers for DCMSUM.pl:

Copy link
Collaborator

@cmadjar cmadjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maximemulder I test the python DICOM ingestion script on:

  • 6 Philips datasets
  • 4 Siemens XA 30 (enhanced DICOMs)
  • 2 GE datasets

All differences noted between the perl and python ingestion were improvements (ability to read enhanced DICOMs, ability to distinguish multi-echo or multi-tr DICOMs etc...).

The only missing piece at the moment is that if the DeviceSerialNumber header has been removed from the DICOMs due to anonymization, I get the following error:

"import_dicom_study.py -p database_config_27.py -s /data/tmp/CHSTL0158_705052_V02_SessionA_1.3.12.2.1107.5.2.43.166158.30000025012107581494200000808_siemens_prisma --insert
Extracting DICOM information... (may take a long time)
Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/scripts/import_dicom_study.py", line 314, in <module>
    main()
  File "/opt/loris/bin/mri/python/scripts/import_dicom_study.py", line 147, in main
    dicom_summary = get_dicom_study_summary(args.source, args.verbose)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 39, in get_dicom_study_summary
    study_info = get_dicom_study_info(dicom)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 83, in get_dicom_study_info
    read_value(dicom, 'DeviceSerialNumber'),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 154, in read_value
    raise Exception(f"Expected DICOM tag '{tag}' but found none.")
Exception: Expected DICOM tag 'DeviceSerialNumber' but found none."

Ideally, when reading the DICOM, should either read the content of the DICOM tag if available or set it to NULL to avoid this error

@maximemulder
Copy link
Contributor Author

@cmadjar Hhhhm, none of the scanner fields in the tarchive table are nullable, so I guess I will store None in the code, and insert an empty string in the database for now (we can still change the schema and insert NULL later). Should I do that for all the scanner fields (manufacturer, model, serial number, software version) or only some of them ?

@cmadjar
Copy link
Collaborator

cmadjar commented Mar 11, 2025

@cmadjar Hhhhm, none of the scanner fields in the tarchive table are nullable, so I guess I will store None in the code, and insert an empty string in the database for now (we can still change the schema and insert NULL later). Should I do that for all the scanner fields (manufacturer, model, serial number, software version) or only some of them ?

Good catch! Looks like the current dicomTar.pl is indeed inserting with an empty string so probably best to use the same strategy here. And I think it is a good idea to apply that on all the Scanner fields, in case other anonymization software completely removes those.

Thanks!

Copy link
Collaborator

@cmadjar cmadjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!!!

@cmadjar cmadjar merged commit 485f3a8 into aces:main Mar 12, 2025
9 checks passed
cmadjar added a commit that referenced this pull request Mar 21, 2025
* remove trailing whitespaces (#1149)

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* SQLAlchemy proof-of-concept (#1152)

* add sqlalchemy

* fix lints

* add comment

---------

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* fix (#1168)

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* integration test (#1138)

* test

* main Loris here

* ignore Loris python

* clone Loris

* rm loris

* override test

* override test

* override RB data

* fix folder

* test php files

* test

* try

* test

* try install loris mri

* os

* test

* rm modules

* Update flake8_python_linter.yml

* Small linting improvements (#1167)

* better linting

* remove tab in comment

---------

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* Improve Python subject determination and validation error handling (#1146)

* improve subjects determination and validation

* use tuple for database query arguments

* query on single line

---------

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* Add literal regex leading r (#1176)

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* fix (#1178)

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* LORIS-MRI tooling configuration (#1170)

* add pyproject.toml

* keep flake8 (for now)

* double type checking

* Use Pyright as main type checker

* remove comment

* test

* test

* change configuration

* add python environment variables

* try factorization

* test

* test

* test

* test

* remove temporary things

* test ruff error

* ruff output github

* further test errors

* test pyright output github

* test

* test

* test

* change comment

* remove test errors

* test

---------

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* update docker install (#1171)

Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>

* add RB override tables for the MRI side (#1173)

* Remove php 8.1 and 8.2 from the LORIS tests to keep only test on PHP 8.3 (#1175)

* remove php 8.1 and 8.2 testing

* update to 8.3 in Dockerfile.test.php8

* ADD minc tools install in gitactions (#1179)

* todo install loris mri

* test

* run pytest in gitaction (#1172)

* test

* ttt

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* Delete test_example.py

* rename

* done

* Delete test/Dockerfile.test.py

* test (#1187)

* unit tests poc (#1188)

* add more pytest global rules (#1194)

* Stricter linter configuration (#1193)

* stricter ruff config

* do not ignore N818

* lint for sorted imports

* add python scripts directory (#1196)

* Bunch of SQLAlchemy definitions (#1192)

* bunch of sqlalchemy definitions

* add project orm definition

* fix candidate to site query function

* optimize docker image (#1195)

* fix pytest config (#1197)

* fix python scripts (#1199)

* remove mysql-connector (#1200)

* Improve subject configuration structure (#1150)

* refacor subject config

* remove suspicious phantom code

* change naming

* fix rebase lints

* cleaner database url

* Cecile CandID path join bug fix

Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca>

---------

Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca>

* Integration test to check that the ORM definitions match the LORIS SQL database (#1198)

* orm sync integration test

* add docker health check

* use integration config file

* test error

* Revert "test error"

This reverts commit d1a70e6.

* Clean up Docker database scripts (#1202)

* clean docker database

* remove docker warning

* remove unused attribute (#1201)

* try fix perl ci

* Add file, file_parameter and scanner SQLAlchemy models (#1207)

* add tables

* rename file model

* rename parameter_file

* fix lints (#1210)

* Unify LORIS-MRI logging (#1191)

* new env

* change print order in database log

* create notification type if needed

* fix typing pyright update

* Integration test for `run_dicom_archive_loader.py` (#1203)

add integration test

put keys in github repo

debugging

add database check

add comments and debugging

remove s3 keys

change converter in test

check file tree

fix doc

fix docs 2

rename variable

go back to using s3 keys

fix path

try fix

* rename database sub-namepaces (#1214)

* fix warnings (#1215)

* Factorize get subject session (#1190)

* factorize subject session

* re-add site log

* Migrate "get all sites from the database" to the SQLAlchemy abstraction (#1216)

* Fix new PEP585 Ruff lint (#1218)

* Clean-up database scan type (#1139)

Complementary PR of LORIS #9304

* Migrate DICOM archive files to new database abstraction (#1217)

* wip

* use self.dicom_archive.id instead of self.tarchive_id

* fix todo

* Fix SQLAlchemy file model (#1221)

* Improve Python requirements order (#1222)

* add python documentation (#1223)

* Update SQLAlchemy bindings to match LORIS#9556 (#1229)

* trigger_tests

* try update

* fix old queries for integration tests

* Migrate `ImagingUpload` and `MriUploadDB` to the new database abstraction (#1224)

* migrate mri upload to sqlalchemy

* fix concurrency bug

* fix clean up

* display mri upload ids

use dicom archive id instead of tarchiveid because it is prettier

ruff check

* migrate scanner to new abstraction (#1232)

* Fix install script (#1234)

* fix install script and move files related to install into the install directory

* fix install script and move files related to install into the install directory

* fix install script and move files related to install into the install directory

* fix paths in imaging_install_test.sh

* fix paths in imaging_install_test.sh

* fix SQL error in install script

* fix tests

* fix tests

* fix github action

* add cpanfile

* copy cpanfile in MRI dockerfile

* add new dependency in order to be able to source the Digest::BLAKE2 in the cpanfile

* fix Digest::BLAKE2 installation though cpanfile

* move config templates to a templates directory and requirements files into a requirements directory

* remove copy

* fix test

* fix test

* fix test

* fix test

* Maxime's feedback

* specify python version (#1237)

* fix large file hash memory (#1236)

* add lib.util module (#1238)

* New DICOM study import script (#1117)

* rewrite dicom archive

* update

* remove --year and --target options

* get session from config file

* remove unused upload argument

* fix little oopsie for --session

* handle large files hash

* fix unhandled modalities

* use new lib.util module

* Add support to be able to read headers from enhanced DICOMs  (#7)

* print

* print

* print

* print

* add support for enhanced DICOMs

* add support for enhanced DICOMs

* rework associate files with series

* sort dicom series and files before inserting in the database

* handle null scanner

---------

Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca>

* refactoring of CandID on the non-ORM python side

* tests

* fix insertion in mri_protocol_violation_scans

* alphabetical order is hard sometimes

* FIx `get_candidate_id` documentation

---------

Co-authored-by: Maxime Mulder <maxime-mulder@outlook.com>
Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca>
Co-authored-by: Shen <kongtiaowangshen@gmail.com>
@maximemulder maximemulder deleted the rewrite-dicom-archive branch March 29, 2025 09:47
@cmadjar cmadjar added this to the 27.0 milestone Apr 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Add to release notes Area: DICOM PR or issue related to the handling of DICOM files Caveat for Existing Projects Difficuly: Complex PR or issue that require a great effort to implementat, review, or test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants