Validate subject IDs early in the imaging pipeline #1144

Closed

Conversation

maximemulder
Contributor

@maximemulder maximemulder commented Jul 30, 2024

This PR validates the subject IDs at the very beginning of the imaging pipeline, preventing the insertion of any MRI upload / DICOM archive that would not be associated with a session because of a typo or configuration mistake.

This PR uses the Python configuration function get_subject_ids even in the Perl code (the Perl code calls the Python code). It also factors out and de-duplicates some existing Python code around the retrieval and validation of subject IDs.
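
For illustration, a rough sketch of the Python side that the Perl pipeline shells out to (script layout, argument handling, and the naming convention regex are hypothetical, not the actual files touched by this PR): the script validates the subject IDs and reports failure through its exit code, which the Perl caller checks with system().

#!/usr/bin/env python
# Rough sketch only: derive the subject IDs from a DICOM patient name and signal
# failure through the exit code so the Perl caller can abort before any insertion.
import re
import sys

def get_subject_ids(patient_name: str) -> dict:
    # stand-in for the real get_subject_ids configuration function; the naming
    # convention below is an example, not the one used by any particular project
    match = re.fullmatch(r'([A-Z]{3}\d{4})_(\d{6})_(V\d{2})', patient_name)
    if match is None:
        raise ValueError(f'patient name "{patient_name}" does not follow the naming convention')
    return {'PSCID': match.group(1), 'CandID': match.group(2), 'visit_label': match.group(3)}

def main() -> int:
    patient_name = sys.argv[1] if len(sys.argv) > 1 else ''
    try:
        print(get_subject_ids(patient_name))
    except ValueError as error:
        print(f'ERROR: {error}', file=sys.stderr)
        return 1
    return 0

if __name__ == '__main__':
    sys.exit(main())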

Caveats for projects

  • need to have the Python dependencies installed (if not already installed) and database_config.py configured.

@maximemulder maximemulder marked this pull request as ready for review August 5, 2024 14:53
Collaborator

@cmadjar cmadjar left a comment

a few comments.

The main change would be that we need to still insert the DICOMs in the imaging uploader and mark the insertion as failed to keep a trace on the frontend. I know that for CBIGR, you don't want to insert but most studies would still want to see the failure.

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.

  my $upload_id = undef;
- my ($debug, $verbose) = (0,1);
+ my ($debug, $verbose) = (0,0);
Collaborator

why has this changed?

Contributor Author

Oh, maybe I should have proposed this change in a separate PR indeed.
The current line makes the imaging pipeline always verbose, whether or not the --verbose argument is provided. When calling the Python script from Perl, this becomes very noticeable since all database queries are logged.

Comment on lines +253 to +255
if (system($command) != 0) {
# The error is already printed by the Python script, just exit
exit $NeuroDB::ExitCodes::CANDIDATE_MISMATCH;
Collaborator

This would fail silently. I think the data should still be inserted in the imaging uploader but with a failure status instead.

Comment on lines +44 to +46
query = 'SELECT PSCID ' \
'FROM candidate ' \
'WHERE CandID = %s'
Collaborator

You should revert to the old SQL. There is a reason for those joins: they allow checking that the PSCID and DCCID refer to the same candidate.

Contributor Author

@maximemulder maximemulder Aug 5, 2024

Whoops, I forgot that the PSCID is not unique, my mistake!

EDIT: Actually, even with duplicate PSCIDs, I don't see how the new code could fail. It seems to me that this join is kind of useless:

SELECT c2.PSCID 
    FROM candidate c1
    LEFT JOIN candidate c2 ON (c1.CandID=c2.CandID AND c2.PSCID = %s)
    WHERE c1.CandID = %s;

which seems to me equivalent (ignoring NULL results and argument order) to:

SELECT PSCID
    FROM candidate
    WHERE CandID = %s AND PSCID = %s;
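
In Python, mirroring the pselect calls used elsewhere in this PR (variable names here are hypothetical), the single-query check would keep the guarantee that the PSCID and DCCID refer to the same candidate:

query = 'SELECT PSCID ' \
        'FROM candidate ' \
        'WHERE CandID = %s AND PSCID = %s'
results = self.db.pselect(query=query, args=(cand_id, psc_id))
# an empty result means the two IDs do not belong to the same candidate row
candidate_exists = bool(results)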

results = self.db.pselect(query=query, args=(visit_label,))

return results if results else None
results = self.db.pselect(query=query, args=[visit_label])
Collaborator

I think that would break since pselect args expects a tuple.

Contributor Author

The code worked fine (on my machine™) with lists, and I saw a few places that use lists (in lib/candidate.py), so I went with them. But it seems that even the MySQL docs use tuples, so you're probably right that I should use a tuple here. (Although I must say I don't get why one would design this API around tuples rather than lists.)
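
For what it's worth, the part of tuples that is easy to trip on in plain Python is the single-element case, which needs a trailing comma; a list has no such pitfall:

visit_label = 'V01'
args_tuple = (visit_label,)   # a 1-element tuple: ('V01',)
not_a_tuple = (visit_label)   # just the string 'V01'; the parentheses alone do nothing
args_list = [visit_label]     # a 1-element list, no trailing comma needed
print(type(args_tuple).__name__, type(not_a_tuple).__name__, type(args_list).__name__)
# -> tuple str list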


return results if results else None
results = self.db.pselect(query=query, args=[visit_label])
return bool(results)
Collaborator

Have you checked that changing the return value type does not break any other code?

Contributor Author

Yep! The function only has one reference in the codebase (the place I changed).

Comment on lines +111 to +113
except DetermineSubjectException as exception:
self.log_error_and_exit(
self.subject_id_dict['error_message'],
exception.message,
Collaborator

Nice exception refactoring! Have you checked that determine_subject_ids is not used anywhere else, to make sure nothing gets broken by this refactoring? Thanks!

Contributor Author

@maximemulder maximemulder Aug 5, 2024

Yep! I changed the error handling in all the places where this function is called. There is just one place that had no error handling at all, so I did not add any:
https://github.com/aces/Loris-MRI/blob/main/python/lib/dcm2bids_imaging_pipeline_lib/nifti_insertion_pipeline.py#L84
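
For reference, the refactored exception is essentially just a carrier for the message consumed by log_error_and_exit; a minimal sketch, assuming it only needs the message attribute seen in the excerpt above:

class DetermineSubjectException(Exception):
    """Raised when the subject IDs cannot be determined from the patient name."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message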

@maximemulder
Contributor Author

maximemulder commented Aug 5, 2024

The main change would be that we need to still insert the DICOMs in the imaging uploader and mark the insertion as failed to keep a trace on the frontend. I know that for CBIGR, you don't want to insert but most studies would still want to see the failure.

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.

This sounds reasonable to me, and I wonder if C-BIG should not be doing the same. @ridz1208, do you have an opinion?

I understand we want to avoid NULL session IDs as much as possible, but would it really be a problem to have NULL-sessioned MRI uploads here? IMO this seems like the only way to have failed-insertion logs on the front-end, and these entries should not cause any harm since there is no data linked to them.

For manual insertions, just reporting the error is fine, but manual insertions would rather go through the front-end. For automated scripts, having the errors directly accessible to the admins in LORIS seems like a sensible solution to me.
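
A rough sketch of how the config-driven behaviour could look, just to make the tradeoff concrete (the function, field names, and config flag below are hypothetical, not existing LORIS-MRI code):

from typing import Optional

def handle_subject_id_mismatch(upload: dict, insert_failures_in_uploader: bool) -> Optional[dict]:
    """Illustrative only: either keep a failed-upload trace or reject outright."""
    if insert_failures_in_uploader:
        # keep a trace visible on the front-end: no session, explicit failure status
        return {**upload, 'session_id': None, 'status': 'failed: subject ID mismatch'}
    # CBIGR-style behaviour: nothing gets inserted and the caller aborts the pipeline
    return None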

return subject_id_dict

def validate_subject_ids(self, subject_id_dict):
def determine_subject_ids_from_name(self, subject_name: str, scanner_id: Optional[int] = None) -> dict[str, Any]:
Collaborator

duplicated function name. See line 515

Contributor Author

Oh, there is no duplicate name: on line 515 there is determine_subject_ids, and on line 538 there is determine_subject_ids_from_name; the former calls the latter. The diff is a bit messy here.
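
In other words, a rough sketch of the structure (argument plumbing simplified; the attribute holding the patient name is assumed here, not the actual one):

def determine_subject_ids(self, scanner_id=None):
    # thin wrapper: delegate to the name-based function with the patient name
    # of the upload currently being processed
    return self.determine_subject_ids_from_name(self.patient_name, scanner_id)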

@ridz1208

ridz1208 commented Aug 6, 2024

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.
@maximemulder @cmadjar This point was brought up by @nicolasbrossard the first time we discussed this; he was asking how we would handle a failed upload if it is not inserted anywhere. A few suggestions came up, I believe, but now that it is more concrete, maybe we can rediscuss at the imaging meeting?

A few suggestions that came up (but we had not settled on any) were:

  1. keep a local log at the location the scans are uploaded from to store unsuccessful scans
  2. use crontab's email system to email specifically designated coordinators (by project) when a scan fails to upload
  3. use the new DICOM API to have proper feedback from an upload failure

New:
4. have a config option for accepting NULL session IDs or not

Briefly, my thoughts are that even for CBIG, if a NULL SessionID is not accepted, we need a way to notify the scan "owners" that it failed to upload, so solutions 1, 2 and/or 3 will need to be implemented at least for CBIG even if 4 is implemented. As I explained previously, my issue with NULL SessionIDs is technical, not procedural. The problem is that a NULL SessionID means that we were not able to match the "patient_name" to a session for a candidate, for whatever reason... so on the front end, who sees it?

  • Option 1: no one sees it, it's filtered out - that's pretty useless, we may as well not insert it
  • Option 2: everyone sees it - problematic in an institutional instance given that the patient name may contain identifying information even if we don't want it to; after all, it failed to match with a candidate/session, otherwise it would not have had SessionID=NULL
  • Option 3: a subset of users with a "special" permission can see only those scans... - sure, but then we still need to know the source of the scan to know whom to send it to so they can fix the name and re-upload... we are back to the file system of origin

Unless I'm missing something, that's my view of the issue. We can certainly discuss it at the meeting.

Edit: And just to make sure: the ONLY use case where a SessionID would be null is if the file name is incorrect. We are not talking about a failed upload for any other reason, just the naming convention not being respected, which the front end would have rejected either way.

@maximemulder
Contributor Author

Closed in favor of using a new script that calls the DICOM upload API to insert scans.
