[POC] Ticket Scraping Storage

Assumptions

We assume the following in this POC:

- Each ticket scraping request is tracked individually; that is, each request keeps its own data for monitoring the price.
- Ticket scraping requests that share the same configuration should be grouped and handled as a single request to save computing resources.
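
The grouping assumption can be illustrated with a minimal sketch. Which fields make two requests "the same configuration" is not specified here, so `CONFIG_FIELDS` below (`tm_event_id` plus `ticket_num`) is only an assumption for illustration, and the helper names are hypothetical.

```javascript
// Assumption: these fields define "the same configuration". Adjust as needed.
const CONFIG_FIELDS = ["tm_event_id", "ticket_num"];

// Build a stable key from the fields that define a scraping configuration.
function configKey(request) {
  return CONFIG_FIELDS.map((field) => String(request[field])).join("|");
}

// Group requests so that each distinct configuration is scraped only once,
// while every request keeps its own price-monitoring data.
function groupByConfiguration(requests) {
  const groups = new Map();
  for (const request of requests) {
    const key = configKey(request);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(request);
  }
  return groups;
}

const requests = [
  { _id: "a", tm_event_id: "E1", ticket_num: 2, target_price: 100, tolerance: 20 },
  { _id: "b", tm_event_id: "E1", ticket_num: 2, target_price: 150, tolerance: 10 },
  { _id: "c", tm_event_id: "E2", ticket_num: 4, target_price: 80, tolerance: 5 },
];
const groups = groupByConfiguration(requests);
```

Here requests `a` and `b` fall into one group and would share a single scrape, while each still applies its own `target_price` and `tolerance` to the shared result.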

Problem Analysis

Because the ticket scraping process accumulates a high volume of data, the database is often overwhelmed with non-essential data. That data needs to be cleaned out of the database and backed up to file storage such as AWS S3.
The database should store only the data that is essential for finding the best seats under a given set of constraints.

Design Involving S3 Bucket

Periodically, the system should write the scraping results to a comma-separated (CSV) file and save it in AWS S3. Each dataset file should be linked to a ticket scraping configuration.
The datasets stored in AWS S3 should be retrievable by an identifier stored in the database.
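
As a rough sketch of this design, the snippet below serializes seats to CSV and derives an object key that ties the file to a scraping configuration. The key layout (`scraping-results/<scraping_id>/<timestamp>.csv`), the `s3_key` field, and the function names are assumptions made for illustration; the actual upload to S3 is left out.

```javascript
// Columns written to the CSV file, taken from the seat fields listed in this document.
const COLUMNS = ["scraping_id", "quality", "price", "section", "row", "area", "seat_columns", "offer"];

// Quote a value when it contains a comma, a double quote, or a line break.
function csvField(value) {
  const text = value === undefined || value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize seats to CSV text with a header row.
function toCsv(seats) {
  const lines = [COLUMNS.join(",")];
  for (const seat of seats) {
    lines.push(COLUMNS.map((column) => csvField(seat[column])).join(","));
  }
  return lines.join("\n") + "\n";
}

// Hypothetical key layout: one prefix per scraping configuration, one object per snapshot.
function datasetKey(scrapingId, takenAt) {
  return `scraping-results/${scrapingId}/${takenAt.toISOString()}.csv`;
}

const seats = [
  { scraping_id: "s1", quality: 0.8, price: 120, section: "101", row: "A",
    area: "Lower, Left", seat_columns: [5, 6], offer: "standard" },
];
const csv = toCsv(seats);
const key = datasetKey("s1", new Date(Date.UTC(2020, 0, 2, 3, 4, 5)));
// Record kept in the database so that the dataset can be looked up later.
const datasetRecord = { scraping_id: "s1", s3_key: key };
```

Storing only `datasetRecord` in the database keeps the bulky scraping results out of it while still allowing retrieval by identifier.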

Design Involving NoSQL Database

`events` (scraping requests) collection

| _id | client_name | client_emails | name | target_price | tolerance | ticket_num | tm_event_id | last_modified | markPaused | ...

The price range is [target_price - tolerance, target_price + tolerance].
markPaused set to true indicates that the client has requested to pause ticket scraping. Otherwise (empty or false), ticket scraping is in progress.
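
The two rules above can be sketched as small helpers. The function names are hypothetical, and the pause flag is read using the `markPaused` spelling from the field list.

```javascript
// Price range as defined above: [target_price - tolerance, target_price + tolerance].
function priceRange(event) {
  return [event.target_price - event.tolerance, event.target_price + event.tolerance];
}

// True when a seat price falls inside the event's price range (bounds included).
function isInPriceRange(event, price) {
  const [low, high] = priceRange(event);
  return price >= low && price <= high;
}

// Scraping is in progress unless the pause flag is explicitly true;
// a missing or false flag means "in progress".
function isScrapingActive(event) {
  return event.markPaused !== true;
}

const event = { target_price: 100, tolerance: 20 };
const range = priceRange(event);
```

With `target_price` 100 and `tolerance` 20, a seat priced at 120 is inside the range and a seat priced at 121 is not.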

`best_available_seats` collection

| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ...

db.getCollection('best-available-seats').createIndex({"scraping_id": 1})
The rank (quality) of each seat is a float in (0, 1) representing the viewing experience, from worst (near 0) to best (near 1).

The seats returned by a query for a given scraping_id

- should not exceed the maximum fetch size (i.e. 100);
- should always reflect the latest scraping result.
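
One way to satisfy both constraints is to rebuild the set of seats for a scraping_id from the newest scraping result only and cap it at the maximum fetch size. The sketch below works on plain arrays rather than the database; keeping the highest-ranked seats when truncating is an assumption, and `latestBestSeats` is a hypothetical name.

```javascript
const MAX_FETCH_SIZE = 100;

// Return the new contents for one scraping_id: only seats from the latest
// scraping result, best rank first, capped at the maximum fetch size.
function latestBestSeats(scrapedSeats, maxFetchSize = MAX_FETCH_SIZE) {
  return [...scrapedSeats] // copy so that the caller's array is not reordered
    .sort((a, b) => b.quality - a.quality)
    .slice(0, maxFetchSize);
}

// Example: 150 scraped seats are reduced to the 100 best-ranked ones.
const scraped = Array.from({ length: 150 }, (_, i) => ({
  scraping_id: "s1",
  quality: (i + 1) / 151,
  price: 100 + i,
}));
const latest = latestBestSeats(scraped);
```

Replacing the stored seats with `latest`, instead of appending to them, is what keeps the collection in step with the most recent scrape.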

`best_history_seats` collection

The collection contains seats that were available in the past. There is no guarantee that they are still available.

| _id | scraping_id | quality (rank) | price | section | row | area | seat_columns | offer | ...

db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "price": 1, "quality": -1 })
db.getCollection('best-history-seats').createIndex({"scraping_id": 1, "section": 1, "row": 1, "seat_columns": 1, "last_modified": 1})

The collection

- is indexed on scraping_id and on (rank, section, and possibly row and seat), and stores the picks;
- should store the seats that are overwritten when a new scraping result is added to the best_available_seats collection;
- stores historical best seats, which can be better or worse than the seats in the best_available_seats collection;
- should store better seats when ticket prices increase;
- should store worse seats when ticket prices decrease;
- stores historical best seats whose availability is uncertain, so availability should be verified before informing the client to make a purchase.
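
The overwrite rule can be sketched as a pure function that picks the seats to move into the history collection. Treating a seat as "overwritten" when it appears in the previous result but not in the latest one is an assumption, as are the `verified` flag and the function names; `last_modified` follows the field used in the second index above.

```javascript
// Identify a seat pick by where it is, not by its database _id.
function seatKey(seat) {
  return [seat.scraping_id, seat.section, seat.row, String(seat.seat_columns)].join("|");
}

// Return the seats from the previous result that are absent from the latest one.
// Each is stamped with last_modified and marked unverified, because its
// availability must be checked again before the client is informed.
function seatsToArchive(previousSeats, latestSeats, now) {
  const latestKeys = new Set(latestSeats.map(seatKey));
  return previousSeats
    .filter((seat) => !latestKeys.has(seatKey(seat)))
    .map((seat) => ({ ...seat, last_modified: now, verified: false }));
}

const previous = [
  { scraping_id: "s1", section: "101", row: "A", seat_columns: [1, 2], price: 90, quality: 0.9 },
  { scraping_id: "s1", section: "102", row: "B", seat_columns: [3, 4], price: 95, quality: 0.7 },
];
const current = [
  { scraping_id: "s1", section: "102", row: "B", seat_columns: [3, 4], price: 110, quality: 0.7 },
  { scraping_id: "s1", section: "103", row: "C", seat_columns: [5, 6], price: 115, quality: 0.6 },
];
const archived = seatsToArchive(previous, current, new Date(0));
```

In this example only the seat in section 101 is archived. It is better and cheaper than anything in the latest result, which is the rising-price case described in the list.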
