[POC] Ticket Scraping Storage
We assume the following to be true in this POC:
- Ticket scraping requests are handled individually; that is, each ticket scraping request keeps its own data to monitor the price.
- Ticket scraping requests with the same configuration should be grouped and handled as a single request to save computing.
- Due to the high volume of data accumulated in the ticket scraping process, the database is often overwhelmed with non-essential data and needs to be cleaned and backed up in traditional file storage such as AWS S3.
- The database should only store data that is essential to fulfill the purpose of finding the best seats under some constraints.
- Periodically, the system should store the scraping results in a comma-separated file and save it to AWS S3. Each dataset file should link to a ticket scraping configuration.
- The datasets stored in AWS S3 should be retrievable by an identifier stored in the database (see the sketch after this list).
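A minimal sketch of the periodic export step, assuming a pymongo connection and a boto3 S3 client. The bucket name, the `scraping-configs` collection name, the `dataset_s3_key` link field, and the use of the configuration `_id` as `scraping_id` are illustrative assumptions, not part of this POC:

```python
import csv
import io
from datetime import datetime, timezone

import boto3
from pymongo import MongoClient

# Assumed connection details; adjust to the real deployment.
db = MongoClient("mongodb://localhost:27017")["ticket-scraping"]
s3 = boto3.client("s3")
BUCKET = "ticket-scraping-datasets"  # hypothetical bucket name


def export_scraping_results(scraping_id):
    """Dump one scraping request's accumulated seats to CSV, push the file
    to S3, and link the dataset back to its scraping configuration."""
    fields = ["scraping_id", "quality", "price", "section",
              "row", "area", "seat_columns", "offer"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for seat in db["best-history-seats"].find({"scraping_id": scraping_id}):
        writer.writerow(seat)

    # One CSV object per configuration per export run.
    key = f"datasets/{scraping_id}/{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8"))

    # Store the identifier so the dataset is retrievable from the database.
    db["scraping-configs"].update_one(
        {"_id": scraping_id}, {"$set": {"dataset_s3_key": key}}
    )

    # Non-essential accumulated rows can now be cleaned from the database.
    db["best-history-seats"].delete_many({"scraping_id": scraping_id})
```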
| _id | client_name | client_emails | name | target_price | tolerance | ticket_num | tm_event_id | last_modified | markPaused | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- price range = [target_price - tolerance, target_price + tolerance].
- markPaused (true) indicates that the client has requested to pause ticket scraping; otherwise (empty or false) ticket scraping is in progress (see the sketch below).
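A hedged sketch of how the price range and the markPaused flag could be applied when querying seats. The connection details, the `scraping-configs` collection name, and the example identifier are assumptions:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ticket-scraping"]  # assumed connection
scraping_id = "example-scraping-id"  # hypothetical identifier

config = db["scraping-configs"].find_one({"_id": scraping_id})  # collection name is assumed

# Empty or false markPaused means scraping is still in progress.
if config and not config.get("markPaused"):
    low = config["target_price"] - config["tolerance"]
    high = config["target_price"] + config["tolerance"]
    in_range = db["best-available-seats"].find(
        {"scraping_id": scraping_id, "price": {"$gte": low, "$lte": high}}
    )
```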
| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
db.getCollection('best-available-seats').createIndex({"scraping_id": 1})
The rank of each seat is a float in (0, 1) representing the viewing experience, from worst (near 0) to best (near 1).
All seats referenced by the same scraping_id and obtained from the query
- should not exceed the maximum fetch size (i.e. 100);
- should always reflect the latest scraping result.
The collection contains seats that were available in the past; there is no guarantee that they are still available in the present. A query sketch respecting the constraints above follows.
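A minimal pymongo sketch of such a query; the fetch limit of 100 comes from the bullet above, while sorting by quality descending (best viewing experience first) is an assumption about how "best" is ranked here:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ticket-scraping"]  # assumed connection
scraping_id = "example-scraping-id"  # hypothetical identifier

MAX_FETCH_SIZE = 100  # maximum fetch size from the constraint above

best_available = list(
    db["best-available-seats"]
    .find({"scraping_id": scraping_id})
    .sort("quality", -1)   # best viewing experience first (assumed ordering)
    .limit(MAX_FETCH_SIZE)
)
```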
| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "price": 1, "quality": -1 })
db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "section": 1, "row": 1, "seat_columns": 1, "last_modified": 1})
The collection
- is an indexed table based on (scraping_id) and (rank, sec, row?, seat?) and stores the picks;
- should store the tickets that are overwritten when a new scraping result is added to the best-available-seats collection;
- stores historical best seats that can be better or worse than the seats in the best-available-seats collection;
- should store better seats in the event of increasing ticket price;
- should store worse seats in the event of decreasing ticket price;
- stores historical best seats whose availability is uncertain; we should verify availability before informing the client to make a purchase (one possible flow is sketched after this list).
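One possible way to realize the overwrite behavior described above, sketched with pymongo under assumed connection details and helper naming; this is not the project's actual code:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ticket-scraping"]  # assumed connection


def apply_new_scraping_result(scraping_id, new_seats):
    """Archive the seats about to be overwritten, then store the latest result."""
    available = db["best-available-seats"]
    history = db["best-history-seats"]

    # Seats being overwritten move to history; their availability is now
    # uncertain and must be re-verified before advising a purchase.
    old_seats = list(available.find({"scraping_id": scraping_id}))
    if old_seats:
        for seat in old_seats:
            seat.pop("_id", None)  # let MongoDB assign fresh _ids in the history collection
        history.insert_many(old_seats)

    # best-available-seats should always reflect the latest scraping result.
    available.delete_many({"scraping_id": scraping_id})
    if new_seats:
        available.insert_many(new_seats)
```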