New DICOM study import script #1117

maximemulder · 2024-04-25T16:03:56Z

Description

This PR is a rewrite in Python (+ cleaning up / refactor) of the DICOM archiving scripts in Python, with a few added features.

In more details, the changes are the following:

dicom-archive/dicomTar.pl → python/scripts/import_dicom_study.py
dicom-archive/dicomSummary.pl → python/scripts/summarize_dicom_study.py
dicom-archive/DICOM/DCMSUM.pm (and a bunch of other code) → python/lib/import_dicom_study/*

This PR aims to be feature-compatible with the current Perl scripts, such that it can eventually replace it seemlessly. It removes some arguments that were unused (-centerName, -today and -addMriUpload), sorts the DICOM archive in a year subdirectory automatically (this was previously done later in the pipeline), and automatically use the tarchive directory as a target.

This PR also adds the --session argument to determine the DICOM archive session based on the Python configuration file (and fail if no session is found).

Testing

Tested all options successfully with Raisinbread and MPN data. The data looks identical with both the old DICOM archiver and this new script.

Potential future work

Replace the old DICOM archiver by the new script within the existing MRI pipeline.
Clean up the tarchive table.
Add integration test for CI.

Notes for existing projects

need to install pydicom library pip install pydicom

maximemulder · 2024-05-21T21:15:02Z

Okay, I think the PR is finished, I am happy with the code and tested it on a few DICOMs on my side.

The tests are failing because I am using a few match statements, which were added in Python 3.10. So I'd like to bump the Python minimum supported version to 3.10, if not, I can remove them (but obviously prefer the newer and shinier version).

We should also test it on DICOMs from other projects. Then, when this PR is merged, I can make another PR to adopt it in the different superscripts, although a problem will be that this script requires the Python profile file, while the scripts that call dicomTar.pl use the Perl one (so both will probably need to be passed as arguments).

I will note a few things on the code:

I did not use our custom argument parser because there is now argparse in the standard Python library, which handles all the cases we want. We probably want to transition to that long term.
I also wrote a few database helpers for my code, but I think in the long term we should transition to a database and query abstraction library, like sqlalchemy which I've heard about.

maximemulder · 2024-07-04T17:31:41Z

Blocked on Python 3.10, as well as #1141 and #1142 to see if this PR should be updated with new libraries.

NOTE for myself: Bring a fix from the new-mri-upload branch before merging (I don't remember which one but it should be easy) and do not initiate the database connection if the profile is not supplied.

cmadjar

@maximemulder I read through the code and have a few comments/refactoring suggestions. Let me know your thoughts.

Note: I did not test yet, just went through the code.

python/dicom_archive.py

python/lib/dicom/dicom_database.py

python/dicom_archive.py

python/lib/dicom/dicom_log.py

python/lib/dicom/summary_make.py

python/dicom_archive.py

maximemulder · 2024-07-30T17:03:34Z

Thanks for the review @cmadjar ! I wrote some comments on a few points. Points with no comments mean that I agree or will look into it.

cmadjar · 2024-08-01T14:51:34Z

@maximemulder I think if you rebase this one, it will pass the flake8 now that your PR with the python install update has been merged. Thanks!

Also, regarding the -mri_upload_update, this option is used by projects that do not use the imaging uploader to upload their scan and run directly dicomTar.pl. I think we should still support that functionality as some projects might still be using the old way of importing DICOMs. Otherwise, the script to create the MINC or BIDS dataset would break because there is no entry in the mri_upload table. Actually, before you do this change, could you bring it up during the next imaging call? It should be next week on August 8th. Thank you :)

maximemulder · 2024-08-06T15:33:26Z

Ok I replied to all your (very good !) comments @cmadjar. The only ones I have intentionally left out are these ones :

Move SQL queries to database_lib (waiting on a decision on SQLAlchemy at the next imaging meeting).
Add the --db-upload parameter (done for a while on my SQLAlchemy branch)
Use log.py (full implementation depends on --db-upload)

If you like the rest of the PR, I can add these very quickly after the next imaging meeting.

laemtl · 2024-09-13T14:02:37Z

Waiting for the first test implementation to polish it.

cmadjar · 2025-03-07T16:02:25Z

python/scripts/import_dicom_study.py

+    if args.insert and args.update:
+        log_error_exit(
+            env,
+            "Arguments '--insert' and '--update' cannot be used both at the same time.",
+            lib.exitcode.INVALID_ARG,


Why make the distinction between --insert and --update here? I think a simple --database option is enough.

If --overwrite is used, then, it would update, if not, it would insert after having check that there is not already an entry in the DB for the StudyUID.

I did this to provide more options in case there are inconsistencies between the file system and the database (a file but no entry or an entry but no file). --insert and --update are only about the database, while --overwrite is about the file system. I don't have a strong opinion on this but it seemed like a good idea when I did it.

ok. Interesting. I have to think about this one. I would have a tendency to have a simple database flag and then figure out if data should be updated or inserted in the database when want to update the database instead of having to do that outside of the script in order to properly call import_dicom_study.py... I will sleep on it and get back to you.

Hhhhm, not sure I understand your last message correctly. In any case, I am ok to have a single --update option that both overwrites the file system and updates the entries in the database. However, the script should definitely not overwrite / update anything if the user does not specify it IMO, that is too error prone and could lead to a loss of data IMO.

cmadjar · 2025-03-07T23:46:59Z

@maximemulder very nice work overall!!

Two summary points:

issue with a Philips odd B1 images (discussion on slack with details about the issue found and suggestion of improvements)
issue with enhanced DICOMs (I want to take a closer look at this one on Monday to see if it would be possible to read enhanced DICOMs parameters). That last one is not a blocker to merge the PR though and can be implemented later on.

* print * print * print * print * add support for enhanced DICOMs * add support for enhanced DICOMs

maximemulder · 2025-03-10T09:21:28Z

Thanks for your feedback @cmadjar ! For your information, I tested the PR with some Rainsinbread data and some MPN data, which is acquired using a Siemens scanner, so testing this PR on DICOMs from other scanners is very valuable.

When I wrote the DICOM sorting logic of this PR, I tried to closely follow the code of DCMSUM.pl, which itself seems to follow the structure of the generated "DICOM summary" file rather closely. I took another look at the code I wrote (back in June damn) and the (demonic) Perl code in DCMSUM.pl.

The DCMSUM.pl file basically looks at all the DICOM files and considers that each DICOM series is uniquely identified by a tuple of sequence name, echo number and series number. It also considers that a DICOM series has a single echo time, repetition time, inversion time... This is obviously false, notably for multi-echo series (which I have on MPN). When that is the case, the DCMSUM.pl code only saves a single DICOM series in the database, with a single record of the parameters used (a single echo time out of all the echo times of the series, a single repetition time out of all the repetition times of the series...).

As far as I understand it, the standard way to uniquely identify a DICOM series within a DICOM study is to use the SeriesUID DICOM field. This field is nullable in our database schema, but I checked and there is no NULL entry in neither Rainsinbread nor C-BIG. I think fixing our database schema to handle multi-parameter series is out of scope for this PR. IMO, what we should do for now in this PR is:

Stop assuming DICOM series can be identified using the tuple of sequence name, echo number and series number.
Create a DICOM series entry in the database for each tuple of all the parameters of the series. This results in duplicate entries for a single DICOM series, but it is better than ignoring the some of the parameter values of the series IMO (also note that there can already be duplicate entries since the current identification method is flawed).
Instead of inserting all the DICOM series in the database, and then querying the database to get the DICOM series of each DICOM file as it is currently done (copied from DCMSUM.pl), simply use a dict[DicomStudyDicomSeries, list[DicomStudyDicomFile]] when identifying the DICOM files, and insert each series along with its files, resulting in more straightforward and less error-prone code.

Pointers for DCMSUM.pl:

Where the code gets data from the DICOM files: https://github.com/aces/Loris-MRI/blob/main/dicom-archive/DICOM/DCMSUM.pm#L808-L815
Where the code goes through the DICOM files data to list the DICOM series: https://github.com/aces/Loris-MRI/blob/main/dicom-archive/DICOM/DCMSUM.pm#L657-L674

cmadjar

@maximemulder I test the python DICOM ingestion script on:

6 Philips datasets
4 Siemens XA 30 (enhanced DICOMs)
2 GE datasets

All differences noted between the perl and python ingestion were improvements (ability to read enhanced DICOMs, ability to distinguish multi-echo or multi-tr DICOMs etc...).

The only missing piece at the moment is that if the DeviceSerialNumber header has been removed from the DICOMs due to anonymization, I get the following error:

"import_dicom_study.py -p database_config_27.py -s /data/tmp/CHSTL0158_705052_V02_SessionA_1.3.12.2.1107.5.2.43.166158.30000025012107581494200000808_siemens_prisma --insert
Extracting DICOM information... (may take a long time)
Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/scripts/import_dicom_study.py", line 314, in <module>
    main()
  File "/opt/loris/bin/mri/python/scripts/import_dicom_study.py", line 147, in main
    dicom_summary = get_dicom_study_summary(args.source, args.verbose)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 39, in get_dicom_study_summary
    study_info = get_dicom_study_info(dicom)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 83, in get_dicom_study_info
    read_value(dicom, 'DeviceSerialNumber'),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/loris/bin/mri/python/lib/import_dicom_study/summary_get.py", line 154, in read_value
    raise Exception(f"Expected DICOM tag '{tag}' but found none.")
Exception: Expected DICOM tag 'DeviceSerialNumber' but found none."

Ideally, when reading the DICOM, should either read the content of the DICOM tag if available or set it to NULL to avoid this error

maximemulder · 2025-03-11T15:40:00Z

@cmadjar Hhhhm, none of the scanner fields in the tarchive table are nullable, so I guess I will store None in the code, and insert an empty string in the database for now (we can still change the schema and insert NULL later). Should I do that for all the scanner fields (manufacturer, model, serial number, software version) or only some of them ?

cmadjar · 2025-03-11T15:44:17Z

@cmadjar Hhhhm, none of the scanner fields in the tarchive table are nullable, so I guess I will store None in the code, and insert an empty string in the database for now (we can still change the schema and insert NULL later). Should I do that for all the scanner fields (manufacturer, model, serial number, software version) or only some of them ?

Good catch! Looks like the current dicomTar.pl is indeed inserting with an empty string so probably best to use the same strategy here. And I think it is a good idea to apply that on all the Scanner fields, in case other anonymization software completely removes those.

Thanks!

cmadjar

Great job!!!

* remove trailing whitespaces (#1149) Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * SQLAlchemy proof-of-concept (#1152) * add sqlalchemy * fix lints * add comment --------- Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * fix (#1168) Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * integration test (#1138) * test * main Loris here * ignore Loris python * clone Loris * rm loris * override test * override test * override RB data * fix folder * test php files * test * try * test * try install loris mri * os * test * rm modules * Update flake8_python_linter.yml * Small linting improvements (#1167) * better linting * remove tab in comment --------- Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * Improve Python subject determination and validation error handling (#1146) * improve subjects determination and validation * use tuple for database query arguments * query on single line --------- Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * Add literal regex leading r (#1176) Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * fix (#1178) Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * LORIS-MRI tooling configuration (#1170) * add pyproject.toml * keep flake8 (for now) * double type checking * Use Pyright as main type checker * remove comment * test * test * change configuration * add python environment variables * try factorization * test * test * test * test * remove temporary things * test ruff error * ruff output github * further test errors * test pyright output github * test * test * test * change comment * remove test errors * test --------- Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * update docker install (#1171) Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> * add RB override tables for the MRI side (#1173) * Remove php 8.1 and 8.2 from the LORIS tests to keep only test on PHP 8.3 (#1175) * remove php 8.1 and 8.2 testing * update to 8.3 in Dockerfile.test.php8 * ADD minc tools install in gitactions (#1179) * todo install loris mri * test * run pytest in gitaction (#1172) * test * ttt * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * Delete test_example.py * rename * done * Delete test/Dockerfile.test.py * test (#1187) * unit tests poc (#1188) * add more pytest global rules (#1194) * Stricter linter configuration (#1193) * stricter ruff config * do not ignore N818 * lint for sorted imports * add python scripts directory (#1196) * Bunch of SQLAlchemy definitions (#1192) * bunch of sqlalchemy definitions * add project orm definition * fix candidate to site query function * optimize docker image (#1195) * fix pytest config (#1197) * fix python scripts (#1199) * remove mysql-connector (#1200) * Improve subject configuration structure (#1150) * refacor subject config * remove suspicious phantom code * change naming * fix rebase lints * cleaner database url * Cecile CandID path join bug fix Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca> --------- Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca> * Integration test to check that the ORM definitions match the LORIS SQL database (#1198) * orm sync integration test * add docker health check * use integration config file * test error * Revert "test error" This reverts commit d1a70e6. * Clean up Docker database scripts (#1202) * clean docker database * remove docker warning * remove unused attribute (#1201) * try fix perl ci * Add file, file_parameter and scanner SQLAlchemy models (#1207) * add tables * rename file model * rename parameter_file * fix lints (#1210) * Unify LORIS-MRI logging (#1191) * new env * change print order in database log * create notification type if needed * fix typing pyright update * Integration test for `run_dicom_archive_loader.py` (#1203) add integration test put keys in github repo debugging add database check add comments and debugging remove s3 keys change converter in test check file tree fix doc fix docs 2 rename variable go back to using s3 keys fix path try fix * rename database sub-namepaces (#1214) * fix warnings (#1215) * Factorize get subject session (#1190) * factorize subject session * re-add site log * Migrate "get all sites from the database" to the SQLAlchemy abstraction (#1216) * Fix new PEP585 Ruff lint (#1218) * Clean-up database scan type (#1139) Complementary PR of LORIS #9304 * Migrate DICOM archive files to new database abstraction (#1217) * wip * use self.dicom_archive.id instead of self.tarchive_id * fix todo * Fix SQLAlchemy file model (#1221) * Improve Python requirements order (#1222) * add python documentation (#1223) * Update SQLAlchemy bindings to match LORIS#9556 (#1229) * trigger_tests * try update * fix old queries for integration tests * Migrate `ImagingUpload` and `MriUploadDB` to the new database abstraction (#1224) * migrate mri upload to sqlalchemy * fix concurrency bug * fix clean up * display mri upload ids use dicom archive id instead of tarchiveid because it is prettier ruff check * migrate scanner to new abstraction (#1232) * Fix install script (#1234) * fix install script and move files related to install into the install directory * fix install script and move files related to install into the install directory * fix install script and move files related to install into the install directory * fix paths in imaging_install_test.sh * fix paths in imaging_install_test.sh * fix SQL error in install script * fix tests * fix tests * fix github action * add cpanfile * copy cpanfile in MRI dockerfile * add new dependency in order to be able to source the Digest::BLAKE2 in the cpanfile * fix Digest::BLAKE2 installation though cpanfile * move config templates to a templates directory and requirements files into a requirements directory * remove copy * fix test * fix test * fix test * fix test * Maxime's feedback * specify python version (#1237) * fix large file hash memory (#1236) * add lib.util module (#1238) * New DICOM study import script (#1117) * rewrite dicom archive * update * remove --year and --target options * get session from config file * remove unused upload argument * fix little oopsie for --session * handle large files hash * fix unhandled modalities * use new lib.util module * Add support to be able to read headers from enhanced DICOMs (#7) * print * print * print * print * add support for enhanced DICOMs * add support for enhanced DICOMs * rework associate files with series * sort dicom series and files before inserting in the database * handle null scanner --------- Co-authored-by: Cécile Madjar <cecile.madjar@mcin.ca> * refactoring of CandID on the non-ORM python side * tests * fix insertion in mri_protocol_violation_scans * alphabetical order is hard sometimes * FIx `get_candidate_id` documentation --------- Co-authored-by: Maxime Mulder <maxime-mulder@outlook.com> Co-authored-by: Maxime Mulder <maxime.mulder@mcgill.ca> Co-authored-by: Shen <kongtiaowangshen@gmail.com>

maximemulder marked this pull request as draft April 25, 2024 16:15

maximemulder marked this pull request as ready for review May 21, 2024 21:15

This was referenced Jul 4, 2024

Python library proposal: Typed argument parser #1141

Closed

Python library proposal: SQLAlchemy #1142

Closed

cmadjar requested changes Jul 30, 2024

View reviewed changes

maximemulder mentioned this pull request Jul 31, 2024

Update Python installation #1145

Merged

laemtl added Area: CI PR or issue related to continuous integration, including automated tests and static checks Needs Work Before Merging and removed Area: CI PR or issue related to continuous integration, including automated tests and static checks labels Sep 13, 2024

maximemulder marked this pull request as draft September 13, 2024 14:39

maximemulder force-pushed the rewrite-dicom-archive branch 11 times, most recently from 3503263 to 3626bcb Compare September 26, 2024 19:46

maximemulder mentioned this pull request Sep 27, 2024

Factorize get subject session #1190

Merged

maximemulder added Difficuly: Complex PR or issue that require a great effort to implementat, review, or test Area: DICOM PR or issue related to the handling of DICOM files labels Oct 4, 2024

maximemulder force-pushed the rewrite-dicom-archive branch from 8fa6000 to 3c2d49d Compare March 2, 2025 16:27

maximemulder added 8 commits March 7, 2025 14:43

rewrite dicom archive

14e71a7

update

5356275

remove --year and --target options

7c870d3

get session from config file

47f5cc9

remove unused upload argument

2031e80

fix little oopsie for --session

b34e515

handle large files hash

d3bf820

fix unhandled modalities

4a8cd3f

maximemulder force-pushed the rewrite-dicom-archive branch 2 times, most recently from 07ba539 to fbcc9ca Compare March 7, 2025 14:59

use new lib.util module

4fbe555

maximemulder force-pushed the rewrite-dicom-archive branch from fbcc9ca to 4fbe555 Compare March 7, 2025 15:17

cmadjar added Caveat for Existing Projects Add to release notes labels Mar 7, 2025

cmadjar reviewed Mar 7, 2025

View reviewed changes

Add support to be able to read headers from enhanced DICOMs (#7)

62fc93d

* print * print * print * print * add support for enhanced DICOMs * add support for enhanced DICOMs

maximemulder added 2 commits March 10, 2025 11:01

rework associate files with series

ecfcd6d

sort dicom series and files before inserting in the database

222c048

cmadjar requested changes Mar 11, 2025

View reviewed changes

handle null scanner

31581a3

cmadjar approved these changes Mar 12, 2025

View reviewed changes

cmadjar merged commit 485f3a8 into aces:main Mar 12, 2025
9 checks passed

maximemulder deleted the rewrite-dicom-archive branch March 29, 2025 09:47

cmadjar added this to the 27.0 milestone Apr 22, 2025

New DICOM study import script #1117

New DICOM study import script #1117

Uh oh!

Conversation

maximemulder commented Apr 25, 2024 • edited by cmadjar Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Potential future work

Notes for existing projects

Uh oh!

maximemulder commented May 21, 2024

Uh oh!

maximemulder commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmadjar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

maximemulder commented Jul 30, 2024

Uh oh!

cmadjar commented Aug 1, 2024

Uh oh!

maximemulder commented Aug 6, 2024

Uh oh!

laemtl commented Sep 13, 2024

Uh oh!

cmadjar Mar 7, 2025

Choose a reason for hiding this comment

Uh oh!

maximemulder Mar 7, 2025

Choose a reason for hiding this comment

Uh oh!

cmadjar Mar 7, 2025

Choose a reason for hiding this comment

Uh oh!

maximemulder Mar 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmadjar commented Mar 7, 2025

Uh oh!

maximemulder commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmadjar left a comment

Choose a reason for hiding this comment

Uh oh!

maximemulder commented Mar 11, 2025

Uh oh!

cmadjar commented Mar 11, 2025

Uh oh!

cmadjar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

maximemulder commented Apr 25, 2024 •

edited by cmadjar

Loading

maximemulder commented Jul 4, 2024 •

edited

Loading

maximemulder Mar 8, 2025 •

edited

Loading

maximemulder commented Mar 10, 2025 •

edited

Loading