
[POC] Ticket Scraping Storage

Jack Li edited this page Oct 25, 2022 · 10 revisions

Assumptions

We assume the following to be true in this POC:

  • Ticket scraping requests are handled individually; that is, each request keeps its own data for monitoring the price;
  • Ticket scraping requests with the same configuration should be grouped and handled as a single request to save computing resources.
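
The grouping assumption can be illustrated with a minimal sketch. Which fields make up a "configuration" is not defined on this page, so the `CONFIG_FIELDS` list below is an assumption, and `configKey` and `groupByConfig` are hypothetical helper names.

```javascript
// Assumption: these fields of an events document define its configuration.
const CONFIG_FIELDS = ["tm_event_id", "target_price", "tolerance", "ticket_num"];

// Build a stable string key from the configuration fields of one request.
function configKey(request) {
  return JSON.stringify(CONFIG_FIELDS.map((field) => request[field]));
}

// Group requests so that each distinct configuration needs only one scraping job.
function groupByConfig(requests) {
  const groups = new Map();
  for (const request of requests) {
    const key = configKey(request);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(request);
  }
  return groups;
}

// Made-up sample requests: "a" and "b" share a configuration, "c" does not.
const sampleRequests = [
  { _id: "a", tm_event_id: "E1", target_price: 100, tolerance: 20, ticket_num: 2 },
  { _id: "b", tm_event_id: "E1", target_price: 100, tolerance: 20, ticket_num: 2 },
  { _id: "c", tm_event_id: "E1", target_price: 150, tolerance: 20, ticket_num: 2 },
];
const sampleGroups = groupByConfig(sampleRequests);
console.log(sampleGroups.size); // 2
```

Each request still keeps its own document, so per-request data stays separate while the scraping work is shared.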

Problem Analysis

  • Due to the high volume of data accumulated during ticket scraping, the database is often overwhelmed with non-essential data, which needs to be cleaned out and backed up to file storage such as AWS S3.
  • The database should only store data that is essential for finding the best seats under the given constraints.

Design Involving S3 Bucket

  • Periodically, the system should write the scraping results to a comma-separated (CSV) file and save it in AWS S3. Each dataset file should be linked to a ticket scraping configuration.
  • The datasets stored in AWS S3 should be retrievable by an identifier stored in the database.
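
A minimal sketch of the export step follows, assuming the column names from the seat collections on this page. The key layout in `datasetKey` is hypothetical, and the actual upload to S3 is left out.

```javascript
// Columns taken from the seat collections described on this page.
const CSV_COLUMNS = ["scraping_id", "quality", "price", "section", "row", "seat_columns"];

// Quote a cell when it contains a comma, a double quote or a line break.
function csvCell(value) {
  const text = value === undefined || value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serialize seat documents to CSV text with a header row.
function seatsToCsv(seats) {
  const header = CSV_COLUMNS.join(",");
  const rows = seats.map((seat) => CSV_COLUMNS.map((col) => csvCell(seat[col])).join(","));
  return [header, ...rows].join("\n");
}

// Hypothetical object key: the identifier the database would store to find the file again.
function datasetKey(scrapingId, date) {
  return `scraping/${scrapingId}/${date.toISOString().slice(0, 10)}.csv`;
}
```

Storing the returned key on the scraping configuration would give the link between a dataset file and its configuration.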

Design Involving NoSQL Database

events (scraping requests) collection

| _id | client_name | client_emails | name | target_price | tolerance | ticket_num | tm_event_id | last_modified | markPaused ...

  • price range = [target_price - tolerance, target_price + tolerance].
  • markPaused (true) indicates that the client has requested to pause ticket scraping. Otherwise (empty or false), ticket scraping is in progress.
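
The two rules above can be sketched as small helpers. The function names are hypothetical, and the pause flag is read under the name `markPaused`, as spelled in the table header.

```javascript
// price range = [target_price - tolerance, target_price + tolerance]
function priceRange(event) {
  return [event.target_price - event.tolerance, event.target_price + event.tolerance];
}

// The range is closed, so both end points count as matches.
function isPriceInRange(event, price) {
  const [low, high] = priceRange(event);
  return price >= low && price <= high;
}

// Scraping is in progress unless the flag is explicitly true.
function isScrapingActive(event) {
  return event.markPaused !== true;
}

console.log(priceRange({ target_price: 100, tolerance: 20 })); // [ 80, 120 ]
```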

best_available_seats collection

| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ...

db.getCollection('best-available-seats').createIndex({"scraping_id": 1})
The rank of each seat is a float in (0, 1) representing the viewing experience, from worst (0) to best (1).

All seats returned by a query for the same scraping_id

  • should not exceed the maximum fetch size (i.e. 100);
  • should always reflect the latest scraping result.
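
One way to satisfy both constraints is to rebuild the set from the newest scraping result and cap it. The sketch below works in memory. Ordering by quality before truncating is an assumption, and `latestSeats` is a hypothetical helper.

```javascript
const MAX_FETCH_SIZE = 100;

// Keep only the seats of one scraping request, best quality first, capped at the limit.
function latestSeats(scrapingId, scrapedSeats, limit = MAX_FETCH_SIZE) {
  return scrapedSeats
    .filter((seat) => seat.scraping_id === scrapingId) // filter returns a new array
    .sort((a, b) => b.quality - a.quality)             // so sorting does not touch the input
    .slice(0, limit);
}
```

Replacing the stored documents for a scraping_id with this result on every run would keep the collection aligned with the latest scrape.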

best_history_seats collection

The collection contains seats that were available in the past. There is no guarantee that these seats are still available.

| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ...

db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "price": 1, "quality": -1 })
db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "section": 1, "row": 1, "seat_columns": 1, "last_modified": 1})

The collection

  • is an indexed collection based on (scraping_id) and (rank, section, row?, seat?) and stores the picks;
  • should store the tickets that are overwritten when a new scraping result is added to the best_available_seats collection;
  • stores historical best seats, which can be better or worse than the seats in the best_available_seats collection;
  • should store better seats when the ticket price is increasing;
  • should store worse seats when the ticket price is decreasing;
  • stores historical best seats whose availability is uncertain, so availability should be verified before informing the client to make a purchase.
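
The overwrite rule can be sketched as a diff between the previous and the latest best available seats. Identifying a seat by (section, row, seat_columns) follows the second index above. Treating only seats missing from the latest result as overwritten is an assumption, and `seatKey` and `overwrittenSeats` are hypothetical helpers.

```javascript
// Identify a seat by the fields used in the second best-history-seats index.
function seatKey(seat) {
  return JSON.stringify([seat.section, seat.row, seat.seat_columns]);
}

// Seats that were in the previous result but are absent from the latest one.
// They are stamped with last_modified so they can be archived as history.
function overwrittenSeats(previous, latest, now = new Date()) {
  const stillListed = new Set(latest.map(seatKey));
  return previous
    .filter((seat) => !stillListed.has(seatKey(seat)))
    .map((seat) => ({ ...seat, last_modified: now }));
}
```

Seats returned by `overwrittenSeats` would be inserted into the history collection. Their availability is unknown at that point, so it still needs to be checked before a client is notified.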