Skip to content

[ENG-8175] Fixed potential race condition #11179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: feature/pbs-25-10
Choose a base branch
from

Conversation

ihorsokhanexoft
Copy link
Contributor

@ihorsokhanexoft ihorsokhanexoft commented Jun 11, 2025

Purpose

User received an email of archive failure regardless of it was successful. It may happen because archive_success is run asynchronously

@project_signals.archive_callback.connect
def archive_callback(dst):
"""Blinker listener for updates to the archive task. When the tree of ArchiveJob
instances is complete, proceed to send success or failure mails
:param dst: registration Node
"""
root = dst.root
root_job = root.archive_job
if not root_job.archive_tree_finished():
return
if root_job.sent:
return
if root_job.success:
# Prevent circular import with app.py
from website.archiver import tasks
tasks.archive_success.delay(dst_pk=root._id, job_pk=root_job._id)

and may be finished before the main thread finishes archiving process. So at first archive_success is processed and no files found, thus email is sent, then the main thread finishes files processing and this archiving is successful actually.
Also it's possible that the celery queue had too many tasks to process and when the main thread finishes archiving, user sees his registration and when celery processes archive_success tasks that fails, user receives this email.

Changes

Added a one-time retry.

Ticket

https://openscience.atlassian.net/browse/ENG-8175?atlOrigin=eyJpIjoiMjg4MWM1YWI1ZTE3NDMyZmEyODk2Y2QxZjlhNjFlOGQiLCJwIjoiaiJ9

@ihorsokhanexoft ihorsokhanexoft changed the base branch from develop to feature/pbs-25-10 June 11, 2025 10:40
Copy link
Collaborator

@brianjgeiger brianjgeiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question.

'missing_files': err.missing_files,
},
)
self.retry(exc=err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the retry? Are you expecting a situation where the files will be found on the retry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it looks like this celery task is run faster than attaching files to a node, this is why they cannot be found. Why I think so because Mark mentioned that despite a user got email with "some files were altered" message, these files were actually included in the node registration, so maybe there was some retry and this celery task was re-run or Matt ran it manually. So if we don't find some files, it may be related to race condition and we make one more attempt in 5 minutes. If it doesn't work -> raise error
In order to avoid this race condition I think we can make this task synchronous, but in this case user'll need to wait a bit more time on registration page to not abort a request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I described this issue to Matt and it's a known bug:

Issue appears to be a known bug wherein those file_urls don't get updated after registration is created, leaving those links pointing to the same file on the node. Unsure about the exact cause, but usually just calling migrate_file_metadata works. Would you like me to run that against it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And migrate_file_metadata is a problem place that checks if files are found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants