Validate subject IDs early in the imaging pipeline #1144

Closed

Conversation

maximemulder
Contributor

@maximemulder maximemulder commented Jul 30, 2024

This PR validates the subject IDs at the very beginning of the imaging pipeline, preventing the insertion of any MRI upload / DICOM archive that would not be associated with a session because of a typo or configuration mistake.

This PR uses the Python configuration function get_subject_ids even in the Perl code (the Perl code calls the Python code). It also factors out and de-duplicates some existing Python code around the retrieval and validation of subject IDs.
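
For illustration, a rough sketch of the Python side that the Perl pipeline shells out to (script layout, argument handling, and the naming convention regex are hypothetical, not the actual files touched by this PR): the script validates the subject IDs and reports failure through its exit code, which the Perl caller checks with system().

#!/usr/bin/env python
# Rough sketch only: derive the subject IDs from a DICOM patient name and signal
# failure through the exit code so the Perl caller can abort before any insertion.
import re
import sys

def get_subject_ids(patient_name: str) -> dict:
    # stand-in for the real get_subject_ids configuration function; the naming
    # convention below is an example, not the one used by any particular project
    match = re.fullmatch(r'([A-Z]{3}\d{4})_(\d{6})_(V\d{2})', patient_name)
    if match is None:
        raise ValueError(f'patient name "{patient_name}" does not follow the naming convention')
    return {'PSCID': match.group(1), 'CandID': match.group(2), 'visit_label': match.group(3)}

def main() -> int:
    patient_name = sys.argv[1] if len(sys.argv) > 1 else ''
    try:
        print(get_subject_ids(patient_name))
    except ValueError as error:
        print(f'ERROR: {error}', file=sys.stderr)
        return 1
    return 0

if __name__ == '__main__':
    sys.exit(main())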

Caveats for projects

  • need to have the Python dependencies installed (if not already installed) and database_config.py configured.

@maximemulder maximemulder marked this pull request as ready for review August 5, 2024 14:53
Collaborator

@cmadjar cmadjar left a comment

a few comments.

The main change would be that we need to still insert the DICOMs in the imaging uploader and mark the insertion as failed to keep a trace on the frontend. I know that for CBIGR, you don't want to insert but most studies would still want to see the failure.

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.

  my $upload_id = undef;
- my ($debug, $verbose) = (0,1);
+ my ($debug, $verbose) = (0,0);
Collaborator

why has this changed?

Contributor Author

Oh, maybe I should have proposed this change in a separate PR indeed.
The current line makes the imaging pipeline always verbose, whether or not the --verbose argument is provided. When calling the Python script from Perl, this becomes very noticeable since all database queries are logged.

Comment on lines +253 to +255
if (system($command) != 0) {
# The error is already printed by the Python script, just exit
exit $NeuroDB::ExitCodes::CANDIDATE_MISMATCH;
Collaborator

This would fail silently. I think the data should still be inserted in the imaging uploader but with a failure status instead.

Comment on lines +44 to +46
query = 'SELECT PSCID ' \
'FROM candidate ' \
'WHERE CandID = %s'
Collaborator

You should revert to the old SQL. There is a reason for those joins: they allow checking that the PSCID and DCCID refer to the same candidate.

Contributor Author

@maximemulder maximemulder Aug 5, 2024

Whoops, I forgot that the PSCID is not unique, my mistake!

EDIT: Actually, even with duplicate PSCIDs, I don't see how the new code could fail. It seems to me that this join is kind of useless:

SELECT c2.PSCID 
    FROM candidate c1
    LEFT JOIN candidate c2 ON (c1.CandID=c2.CandID AND c2.PSCID = %s)
    WHERE c1.CandID = %s;

which seems to me equivalent (ignoring NULL results and argument order) to:

SELECT PSCID
    FROM candidate
    WHERE CandID = %s AND PSCID = %s;
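
In Python, mirroring the pselect calls used elsewhere in this PR (variable names here are hypothetical), the single-query check would keep the guarantee that the PSCID and DCCID refer to the same candidate:

query = 'SELECT PSCID ' \
        'FROM candidate ' \
        'WHERE CandID = %s AND PSCID = %s'
results = self.db.pselect(query=query, args=(cand_id, psc_id))
# an empty result means the two IDs do not belong to the same candidate row
candidate_exists = bool(results)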

results = self.db.pselect(query=query, args=(visit_label,))

return results if results else None
results = self.db.pselect(query=query, args=[visit_label])
Collaborator

I think that would break since pselect args expects a tuple.

Contributor Author

The code worked fine (on my machine™) with lists, and I saw a few places that use lists (in lib/candidate.py), so I went with them. But it seems that even the MySQL docs use tuples, so you're probably right that I should use a tuple here. (Although I must say I don't get why one would design this API around tuples rather than lists.)
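
For what it's worth, the part of tuples that is easy to trip on in plain Python is the single-element case, which needs a trailing comma; a list has no such pitfall:

visit_label = 'V01'
args_tuple = (visit_label,)   # a 1-element tuple: ('V01',)
not_a_tuple = (visit_label)   # just the string 'V01'; the parentheses alone do nothing
args_list = [visit_label]     # a 1-element list, no trailing comma needed
print(type(args_tuple).__name__, type(not_a_tuple).__name__, type(args_list).__name__)
# -> tuple str list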


return results if results else None
results = self.db.pselect(query=query, args=[visit_label])
return bool(results)
Collaborator

Have you checked that changing the return value type does not break any other code?

Contributor Author

Yep! The function only has one reference in the codebase (the place I changed).

Comment on lines +111 to +113
except DetermineSubjectException as exception:
self.log_error_and_exit(
self.subject_id_dict['error_message'],
exception.message,
Collaborator

Nice exception refactoring! Have you checked that determine_subject_ids is not used anywhere else, to make sure nothing gets broken by this refactoring? Thanks!

Contributor Author

@maximemulder maximemulder Aug 5, 2024

Yep! I changed the error handling in all the places where this function is called. There is just one place that had no error handling at all, so I did not add any:
https://github.com/aces/Loris-MRI/blob/main/python/lib/dcm2bids_imaging_pipeline_lib/nifti_insertion_pipeline.py#L84
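
For reference, the refactored exception is essentially just a carrier for the message consumed by log_error_and_exit; a minimal sketch, assuming it only needs the message attribute seen in the excerpt above:

class DetermineSubjectException(Exception):
    """Raised when the subject IDs cannot be determined from the patient name."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message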

@maximemulder
Contributor Author

maximemulder commented Aug 5, 2024

The main change would be that we need to still insert the DICOMs in the imaging uploader and mark the insertion as failed to keep a trace on the frontend. I know that for CBIGR, you don't want to insert but most studies would still want to see the failure.

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.

This sounds reasonable to me, and I wonder if C-BIG should not be doing the same. @ridz1208, do you have an opinion?

I understand we want to avoid NULL session IDs as much as possible, but would it really be a problem to have NULL-sessioned MRI uploads here? IMO this seems like the only way to have failed-insertion logs on the front-end, and these entries should not cause any harm since there is no data linked to them.

For manual insertions, just reporting the error is fine, but manual insertions would rather go through the front-end. For automated scripts, having the errors directly accessible to the admins in LORIS seems like a sensible solution to me.
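
A rough sketch of how the config-driven behaviour could look, just to make the tradeoff concrete (the function, field names, and config flag below are hypothetical, not existing LORIS-MRI code):

from typing import Optional

def handle_subject_id_mismatch(upload: dict, insert_failures_in_uploader: bool) -> Optional[dict]:
    """Illustrative only: either keep a failed-upload trace or reject outright."""
    if insert_failures_in_uploader:
        # keep a trace visible on the front-end: no session, explicit failure status
        return {**upload, 'session_id': None, 'status': 'failed: subject ID mismatch'}
    # CBIGR-style behaviour: nothing gets inserted and the caller aborts the pipeline
    return None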

return subject_id_dict

def validate_subject_ids(self, subject_id_dict):
def determine_subject_ids_from_name(self, subject_name: str, scanner_id: Optional[int] = None) -> dict[str, Any]:
Collaborator

duplicated function name. See line 515

Contributor Author

Oh, there is no duplicate name: on line 515 there is determine_subject_ids, and on line 538 there is determine_subject_ids_from_name; the former calls the latter. The diff is a bit messy here.
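
In other words, a rough sketch of the structure (argument plumbing simplified; the attribute holding the patient name is assumed here, not the actual one):

def determine_subject_ids(self, scanner_id=None):
    # thin wrapper: delegate to the name-based function with the patient name
    # of the upload currently being processed
    return self.determine_subject_ids_from_name(self.patient_name, scanner_id)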

@ridz1208

ridz1208 commented Aug 6, 2024

A tradeoff would be to create a config option under the imaging pipeline section, defaulting to always inserting in the imaging uploader, that you can turn off for CBIGR.
@maximemulder @cmadjar This point was brought up by @nicolasbrossard the first time we discussed this; he was asking how we would handle a failed upload if it is not inserted anywhere. A few suggestions came up, I believe, but now that it is more concrete, maybe we can rediscuss at the imaging meeting?

A few suggestions that came up (but we had not settled on any) were:

  1. keep a local log at the location the scans are uploaded from to store unsuccessful scans
  2. use crontab's email system to email specifically designated coordinators (by project) when a scan fails to upload
  3. use the new DICOM API to have proper feedback from an upload failure

New:
4. have a config option for accepting NULL session IDs or not

Briefly, my thoughts are that even for CBIG, if a NULL SessionID is not accepted, we need a way to notify the scan "owners" that it failed to upload, so solutions 1, 2 and/or 3 will need to be implemented at least for CBIG even if 4 is implemented. As I explained previously, my issue with NULL SessionIDs is technical, not procedural. The problem is that a NULL SessionID means that we were not able to match the "patient_name" to a session for a candidate, for whatever reason... so on the front end, who sees it?

  • Option 1: no one sees it, it's filtered out - that's pretty useless, we may as well not insert it
  • Option 2: everyone sees it - problematic in an institutional instance given that the patient name may contain identifying information even if we don't want it to; after all, it failed to match with a candidate/session, otherwise it would not have had SessionID=NULL
  • Option 3: a subset of users with a "special" permission can see only those scans... - sure, but then we still need to know the source of the scan to know whom to send it to so they can fix the name and re-upload... we are back to the file system of origin

Unless I'm missing something, that's my view of the issue. We can certainly discuss it at the meeting.

Edit: And just to make sure: the ONLY use case where a SessionID would be null is if the file name is incorrect. We are not talking about a failed upload for any other reason, just the naming convention not being respected, which the front end would have rejected either way.

@maximemulder
Contributor Author

Closed in favor of using a new script that calls the DICOM upload API to insert scans.
