Slow dataset queries #1665

Open
@forgetso

We use $sample with size 2 when fetching captchas. This causes slow queries on the nodes. We should change the approach as follows:

  1. Create an index on { datasetId: 1, solved: 1 }

  2. Replace $sample with an index-backed random selection method. For example:

  • Add a random field to each document at insertion time.
  • Index this field.
  • Query with $gte or $lte against a freshly generated random value, so the index can locate random documents without scanning the whole match set.

  3. Alternatively, apply $limit before $sample

Instead of sampling from the entire dataset, limit the query first:

db.captchas.aggregate([
  { $match: { datasetId: "0xe666b35451f302b9fccfbe783b1de9a6a4420b840abed071931d68a9ccc1c21d", solved: true } },
  { $limit: 1000 },  // Get a subset first
  { $sample: { size: 2 } },  // Then sample from that subset
  { $project: { datasetId: 1, datasetContentId: 1, captchaId: 1, captchaContentId: 1, items: 1, target: 1 } }
]);

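This caps the number of documents $sample has to consider at 1000. Note that without a $sort, $limit typically returns documents in index or natural order, so the sample is drawn from roughly the same 1000 documents each time rather than from the whole dataset.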
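
The random-field approach from step 2 could look roughly like the sketch below. It is not the project's implementation: the field name `rand`, the helper names, and the wrap-around fallback are all assumptions made for illustration. The index includes the random field after `datasetId` and `solved`, so that the range query can be served by a single index.

```javascript
// Hypothetical name for the per-document random value (not in the current schema).
const RANDOM_FIELD = "rand";

// Assumed index spec: the compound index from step 1, extended with the random field.
// With the Node.js driver this would be created via: await collection.createIndex(randomIndex)
const randomIndex = { datasetId: 1, solved: 1, [RANDOM_FIELD]: 1 };

// Attach a uniform random value in [0, 1) to a captcha document at insertion time.
function withRandomField(doc, rng = Math.random) {
  return { ...doc, [RANDOM_FIELD]: rng() };
}

// Build the primary query ($gte a random pivot) and a wrap-around fallback
// ($lt the pivot) for when fewer than `size` documents lie at or above the pivot.
function buildRandomQueries(datasetId, size, pivot = Math.random()) {
  const base = { datasetId, solved: true };
  const sort = { [RANDOM_FIELD]: 1 };
  return {
    primary: { filter: { ...base, [RANDOM_FIELD]: { $gte: pivot } }, sort, limit: size },
    fallback: { filter: { ...base, [RANDOM_FIELD]: { $lt: pivot } }, sort, limit: size },
  };
}

// Fetch `size` captchas using the queries above. `collection` is assumed to be
// a Node.js driver Collection (find().sort().limit().toArray()).
async function getRandomCaptchas(collection, datasetId, size = 2) {
  const { primary, fallback } = buildRandomQueries(datasetId, size);
  const docs = await collection
    .find(primary.filter)
    .sort(primary.sort)
    .limit(size)
    .toArray();
  if (docs.length < size) {
    // Wrap around to the lowest random values to fill the remainder.
    const rest = await collection
      .find(fallback.filter)
      .sort(fallback.sort)
      .limit(size - docs.length)
      .toArray();
    docs.push(...rest);
  }
  return docs;
}
```

Two caveats with this sketch: existing documents would need a one-off backfill of the random field, and a single pivot returns neighbouring documents, so the same captchas tend to appear together. Running the query once per captcha with an independent pivot would avoid that, at the cost of an extra round trip.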

Labels: bug, dev, size-s
