-
Notifications
You must be signed in to change notification settings - Fork 344
[ENG-8175] Fixed potential race condition #11179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/pbs-25-10
Are you sure you want to change the base?
[ENG-8175] Fixed potential race condition #11179
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question.
'missing_files': err.missing_files, | ||
}, | ||
) | ||
self.retry(exc=err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the retry? Are you expecting a situation where the files will be found on the retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it looks like this celery task is run faster than attaching files to a node, this is why they cannot be found. Why I think so because Mark mentioned that despite a user got email with "some files were altered" message, these files were actually included in the node registration, so maybe there was some retry and this celery task was re-run or Matt ran it manually. So if we don't find some files, it may be related to race condition and we make one more attempt in 5 minutes. If it doesn't work -> raise error
In order to avoid this race condition I think we can make this task synchronous, but in this case user'll need to wait a bit more time on registration page to not abort a request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I described this issue to Matt and it's a known bug:
Issue appears to be a known bug wherein those file_urls don't get updated after registration is created, leaving those links pointing to the same file on the node. Unsure about the exact cause, but usually just calling migrate_file_metadata works. Would you like me to run that against it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And migrate_file_metadata
is a problem place that checks if files are found
Purpose
User received an email of archive failure regardless of it was successful. It may happen because
archive_success
is run asynchronouslyosf.io/website/archiver/listeners.py
Lines 33 to 49 in 2328dd6
and may be finished before the main thread finishes archiving process. So at first
archive_success
is processed and no files found, thus email is sent, then the main thread finishes files processing and this archiving is successful actually.Also it's possible that the celery queue had too many tasks to process and when the main thread finishes archiving, user sees his registration and when celery processes
archive_success
tasks that fails, user receives this email.Changes
Added a one-time retry.
Ticket
https://openscience.atlassian.net/browse/ENG-8175?atlOrigin=eyJpIjoiMjg4MWM1YWI1ZTE3NDMyZmEyODk2Y2QxZjlhNjFlOGQiLCJwIjoiaiJ9