
Minimise db changes when crawling sources #293


Merged
11 commits merged into master on May 18, 2025

Conversation

ndepaola
Collaborator

Description

  • This PR is a precursor to development of a user-controlled tagging system
  • Previously, the Google Drive crawler task recorded its changes to the database by dropping the contents of the Card table and repopulating it, then dropping the entire Elasticsearch index and rebuilding it
    • This is very slow and leaves the Elasticsearch index in a partial state while it rebuilds, which affects the usability of the app
    • It also means we can't foreign key to the Card table, or persist any data in the Card table which isn't sourced from Google Drive, because the table is constantly dropped and repopulated
  • In the past we've tackled this by using the django-bulk-sync package to manage database changes, but this became prohibitively slow when we switched from SQLite to PostgreSQL
  • I've reworked this code to only apply the minimal set of changes to the database and Elasticsearch:
    • We now read dateModified from the Google Drive API and use it to determine when updates to existing cards must be recorded
  • Some rough performance measurements:
    • Crawling + applying database changes + applying Elasticsearch changes with a handful of meaty Google Drives (55,900 images in total)
    • Before this PR: 5 minutes and 25 seconds
    • After this PR:
      • First run (from empty database): 3 minutes and 28 seconds
      • Second run (no changes to apply to database or Elasticsearch): 2 minutes and 33 seconds. Most of this time is spent reading from the Google Drive API
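
For readers unfamiliar with the approach, the "minimal set of changes" idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `CrawledCard`, `ChangeSet`, and `compute_changes` are hypothetical names, and it assumes each card has a stable identifier and that a newer modification timestamp is what triggers an update.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CrawledCard:
    # Hypothetical stand-in for an image discovered by the crawler.
    identifier: str
    name: str
    date_modified: datetime


@dataclass
class ChangeSet:
    # The only rows that need to be written to the database and search index.
    to_create: list[CrawledCard]
    to_update: list[CrawledCard]
    to_delete: list[str]


def compute_changes(
    existing: dict[str, datetime], crawled: list[CrawledCard]
) -> ChangeSet:
    """Diff the crawl results against what is already stored.

    `existing` maps each stored card identifier to its stored modification
    time (an assumption about how the stored state is summarised).
    """
    crawled_by_id = {card.identifier: card for card in crawled}
    # Cards seen by the crawler which are not stored yet.
    to_create = [c for i, c in crawled_by_id.items() if i not in existing]
    # Cards already stored whose source file was modified since the last run.
    to_update = [
        c
        for i, c in crawled_by_id.items()
        if i in existing and c.date_modified > existing[i]
    ]
    # Stored cards which no longer appear in any crawled source.
    to_delete = [i for i in existing if i not in crawled_by_id]
    return ChangeSet(to_create, to_update, to_delete)
```

When nothing has changed in the sources, all three lists are empty, so no database or index writes are needed. This is consistent with the second-run measurement above, where the remaining time is mostly spent reading from the API.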

Checklist

  • I have installed pre-commit and installed the hooks with pre-commit install before creating any commits.
  • I have updated any related tests for code I modified or added new tests where appropriate.
  • I have manually tested my changes as follows:
    • Exercised database updater in dev environment, validated results in-app
  • I have updated any relevant documentation or created new documentation where appropriate.
    • None required

@ndepaola ndepaola moved this to In Progress in MPC Autofill Backend May 18, 2025
@ndepaola ndepaola merged commit 7ca6321 into master May 18, 2025
3 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in MPC Autofill Backend May 18, 2025