
Slow dataset queries #1665


Open
forgetso opened this issue Feb 1, 2025 · 2 comments · May be fixed by #1705
Labels
bug Something isn't working dev Product development size-s

Comments

@forgetso
Member

forgetso commented Feb 1, 2025

We use $sample with a size of 2 when fetching captchas, and this is causing slow queries on the nodes. We need to change this approach as follows:

  1. Create an index on { datasetId: 1, solved: 1 }

  2. Instead of $sample, use a random selection method to improve performance. For example:

  • Add a random field to each document at insertion time.
  • Index this field.
  • Query using $gte or $lte to efficiently retrieve random documents.
  3. Use $limit before $sample

Instead of sampling from the entire dataset, limit the query first:

```javascript
db.captchas.aggregate([
  { $match: { datasetId: "0xe666b35451f302b9fccfbe783b1de9a6a4420b840abed071931d68a9ccc1c21d", solved: true } },
  { $limit: 1000 },  // Get a subset first
  { $sample: { size: 2 } },  // Then sample from that subset
  { $project: { datasetId: 1, datasetContentId: 1, captchaId: 1, captchaContentId: 1, items: 1, target: 1 } }
]);
```

This caps the input to $sample at 1,000 documents, which reduces the number of documents MongoDB has to scan.
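
For comparison, the random-field idea from step 2 could look roughly like the sketch below. It is simulated in memory so it runs without a database. The field name `rand`, the helpers `withRandomField` and `pickRandom`, the example documents, and the wrap-around fallback are all assumptions for illustration, not code from the repository. The comments show the query shapes this is meant to mirror.

```javascript
// Sketch of the random-field approach, simulated in memory (no MongoDB needed).
// `rand`, `withRandomField` and `pickRandom` are hypothetical names.

// Insertion time: tag each document with a uniform random value in [0, 1).
function withRandomField(doc, rng = Math.random) {
  return { ...doc, rand: rng() };
}

// Query time: emulates
//   db.captchas.find({ datasetId, solved: true, rand: { $gte: r } })
//     .sort({ rand: 1 }).limit(size)
// and, if that returns too few documents, wraps around with
//   db.captchas.find({ datasetId, solved: true, rand: { $lt: r } })
//     .sort({ rand: 1 }).limit(missing)
// An index such as { datasetId: 1, solved: 1, rand: 1 } is assumed to back both queries.
function pickRandom(docs, datasetId, size, r = Math.random()) {
  const candidates = docs
    .filter((d) => d.datasetId === datasetId && d.solved === true)
    .sort((a, b) => a.rand - b.rand);
  const atOrAbove = candidates.filter((d) => d.rand >= r).slice(0, size);
  const missing = size - atOrAbove.length;
  const wrapped = candidates.filter((d) => d.rand < r).slice(0, missing);
  return atOrAbove.concat(wrapped);
}

// Made-up documents with fixed `rand` values so the behaviour is easy to follow.
const exampleDocs = [
  { captchaId: "a", datasetId: "ds1", solved: true, rand: 0.1 },
  { captchaId: "b", datasetId: "ds1", solved: true, rand: 0.5 },
  { captchaId: "c", datasetId: "ds1", solved: true, rand: 0.9 },
  { captchaId: "d", datasetId: "ds1", solved: false, rand: 0.6 },
  { captchaId: "e", datasetId: "ds2", solved: true, rand: 0.7 },
];

// Two solved captchas from "ds1", starting at a random point in `rand` order.
const picked = pickRandom(exampleDocs, "ds1", 2);
console.log(picked.map((d) => d.captchaId));
```

One trade-off to be aware of: documents returned together are neighbours in `rand` order, so the same pairs tend to recur unless `rand` is refreshed or each captcha is fetched with its own random threshold.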

@forgetso forgetso added bug Something isn't working dev Product development labels Feb 1, 2025
@goastler
Member

goastler commented Feb 3, 2025

Aggregate results have no guaranteed ordering, so you don't need the random field.

@forgetso forgetso added the size-s label Feb 4, 2025
@forgetso
Member Author

forgetso commented Mar 5, 2025

#1705

@forgetso forgetso linked a pull request Mar 5, 2025 that will close this issue
2 participants