[CV2-6075] documentation for Check-Alegre-Presto media indexing and similarity workflows #489
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# How Check Processes Items For Similarity | ||
|
||
## Images | ||
|
||
 | ||
[Edit Link](https://docs.google.com/drawings/d/1jXgbM_06rlpPeip1vxUKpiRYyhumrkFlr-2EC3qBxHg/edit) | ||
|
||
At a high level, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures: | ||
|
||
1. Store the `ProjectMedia`, | ||
2. Send the `ProjectMedia` through `Bot::Alegre.run`, | ||
3. `Bot::Alegre.run` searches for items via image hashing, | ||
4. Matches are checked asynchronously and in parallel for suggested and confirmed items, | ||
5. Once both queries have completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback, | ||
6. We also store OCR text annotations in the text index for the item, so that subsequent items can match against them (the OCR text itself is *not* matched against existing items), | ||
7. OCR data is written to OpenSearch for text lookups, and image hashes are written to Postgres for image lookups on Alegre, | ||
8. Relationships at the Check API level are persisted after (5). | ||
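
The "two parallel queries, then one callback" shape of steps 4 and 5 can be sketched as follows. This is a minimal illustration in Python rather than Check API's Ruby code; `find_matches`, `relate_callback`, the thresholds, and the rule that confirmed matches are dropped from the suggested list are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def find_matches(item_id, threshold):
    # Hypothetical stand-in for a similarity query against Alegre.
    known_scores = {"a": 0.95, "b": 0.80, "c": 0.40}
    return sorted(k for k, score in known_scores.items() if score >= threshold)


def relate_callback(item_id, suggested, confirmed):
    # Runs only after both queries have finished. Assumption for this sketch:
    # an item that is confirmed is not also reported as suggested.
    return {
        "item": item_id,
        "confirmed": confirmed,
        "suggested": [m for m in suggested if m not in confirmed],
    }


def run(item_id, suggested_threshold=0.7, confirmed_threshold=0.9):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both lookups are submitted before either result is awaited.
        suggested = pool.submit(find_matches, item_id, suggested_threshold)
        confirmed = pool.submit(find_matches, item_id, confirmed_threshold)
        # .result() blocks until the corresponding query is done.
        return relate_callback(item_id, suggested.result(), confirmed.result())
```

Nothing is persisted in this sketch; it only shows that the callback sees the item after both result sets are available.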
|
||
When searching images, the following events occur: | ||
|
||
1. Eventually, during the chain of a sync or async lookup on an existing item, we hit `app.main.lib.image_similarity.search_image`, | ||
2. We pass into `search_image` the hash value of the current item (either as yielded from an existing DB record or as received from Presto), | ||
3. We then set the value on the image object identified by URL or Doc ID, | ||
4. We then search via PDQ or PHASH, whichever is set, | ||
5. When searching for PDQ we use a custom `bit_count_pdq` function for the similarity score in Postgres queries, | ||
6. When searching for PHASH we use a custom `bit_count_image` function for the similarity score in Postgres queries, | ||
7. A set of image records is returned; we then render them in the response to Check API or another requester. | ||
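
The names `bit_count_pdq` and `bit_count_image` suggest a bit-count (Hamming-style) comparison between hashes. Assuming that is what they compute, the scoring idea looks roughly like the Python sketch below; the real functions live in Postgres, and `hamming_similarity`, `search`, the hash width, and the threshold are hypothetical names and values chosen for illustration.

```python
def hamming_similarity(hash_a: int, hash_b: int, bits: int) -> float:
    """Fraction of bit positions on which two equal-width hashes agree."""
    differing = bin(hash_a ^ hash_b).count("1")
    return 1.0 - differing / bits


def search(query_hash, stored, bits=64, threshold=0.9):
    # `stored` maps a document id to its integer hash.
    scored = [
        (doc_id, hamming_similarity(query_hash, h, bits))
        for doc_id, h in stored.items()
    ]
    matches = [pair for pair in scored if pair[1] >= threshold]
    # Best matches first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

The same function covers hashes of different widths by changing `bits` (PDQ hashes, for instance, are 256 bits wide).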
|
||
|
||
## Video | ||
|
||
 | ||
[Edit Link](https://docs.google.com/drawings/d/1HQTwHmkhzp-J742-QAowfYMNaYoALYTPwOTA-PASHnk/edit) | ||
|
||
For video, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures: | ||
|
||
1. Store the `ProjectMedia`, | ||
2. Send the `ProjectMedia` through `Bot::Alegre.run`, | ||
3. `Bot::Alegre.run` searches for items via video fingerprinting, | ||
4. Matches are checked asynchronously and in parallel for suggested and confirmed items, | ||
|
||
5. Once both queries have completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback, | ||
6. We also store transcription text annotations in the text index for the item, so that subsequent items can match against them (the transcription text itself is *not* matched against existing items), | ||
7. Transcription data is written to OpenSearch for text lookups, video hashes are written to Postgres for video lookups on Alegre, and `.tmk` files are written to disk for TMK lookups, | ||
8. Relationships at the Check API level are persisted after (5). | ||
|
||
When searching videos, the following events occur: | ||
|
||
1. Eventually, during the chain of a sync or async lookup on an existing item, we hit `app.main.lib.shared_models.video_model.VideoModel.search`, | ||
2. We pass into `search` either the references sufficient to find an existing hash from an existing video, or the data yielded from Presto in order to set that hash value / tmk filepath value, | ||
3. We then identify *all videos that have similar context* and pull up that full list, | ||
4. We then calculate `l1` scores based on the simple hash stored on the objects to determine candidates for deeper analysis, | ||
5. We then conduct a more thorough TMK-based analysis for videos passing the candidate test, | ||
6. We return the list of matching TMK-based results. | ||
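
Steps 4 through 6 describe a two-stage search: a cheap L1 filter selects candidates, and only those candidates get the expensive comparison. A minimal Python sketch of that pattern follows. `frame_agreement` is a hypothetical stand-in for the TMK comparison, not the real algorithm, and the cutoff values are arbitrary.

```python
def l1_distance(a, b):
    """Sum of absolute element-wise differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def frame_agreement(a, b, tolerance=0.05):
    # Hypothetical stand-in for the TMK comparison: the share of positions
    # whose values differ by at most `tolerance`.
    close = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return close / max(len(a), 1)


def search_videos(query, videos, l1_cutoff, deep_threshold, deep_score=frame_agreement):
    # Stage 1: cheap L1 filter over the coarse hashes.
    candidates = [
        (vid, h) for vid, h in videos.items() if l1_distance(query, h) <= l1_cutoff
    ]
    # Stage 2: expensive comparison, run on the surviving candidates only.
    scored = [(vid, deep_score(query, h)) for vid, h in candidates]
    return [(vid, score) for vid, score in scored if score >= deep_threshold]
```

The point of the split is cost: the second stage never runs for videos that the first stage has already ruled out.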
|
||
## Text | ||
|
||
 | ||
[Edit Link](https://docs.google.com/drawings/d/12WljT8-qsUi8xG584clD_eV1ABOcB6CqkMX0eAxSPrE/edit) | ||
|
||
For text, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures: | ||
|
||
1. Store the `ProjectMedia`, | ||
2. Send the `ProjectMedia` through `Bot::Alegre.run`, | ||
3. `Bot::Alegre.run` searches for items via text similarity (OpenSearch text queries and vector models), | ||
|
||
4. Matches are checked asynchronously and in parallel for suggested and confirmed items *for `original_title` and `original_description`*, *for all vector models applied*, | ||
5. Once both queries have completed *for both fields*, *for all vector models applied*, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback; that method tracks remaining messages and only processes a match once no messages are still in flight, | ||
6. Relationships at the Check API level are persisted after (5). | ||
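
The "only process once nothing is in flight" behaviour in step 5 can be pictured as a set of pending lookups that shrinks as responses arrive. The Python sketch below is an illustration under assumptions: `PendingLookups` is a hypothetical class, lookups are keyed by (field, model) only, and the model name `vector_model` is invented for the example.

```python
class PendingLookups:
    """Collects responses and fires a callback once none remain in flight."""

    def __init__(self, fields, models, on_complete):
        # One pending entry per (field, model) combination.
        self.pending = {(f, m) for f in fields for m in models}
        self.results = []
        self.on_complete = on_complete
        self.done = False

    def receive(self, field, model, matches):
        self.pending.discard((field, model))
        self.results.extend(matches)
        # Fire exactly once, and only when every lookup has reported back.
        if not self.pending and not self.done:
            self.done = True
            self.on_complete(self.results)


# Usage: two fields and two models give four lookups to wait for.
processed = []
tracker = PendingLookups(
    ["original_title", "original_description"],
    ["elasticsearch", "vector_model"],
    processed.append,
)
tracker.receive("original_title", "elasticsearch", ["a"])
tracker.receive("original_title", "vector_model", [])
tracker.receive("original_description", "elasticsearch", ["b"])
# `processed` is still empty here; the last response triggers the callback.
tracker.receive("original_description", "vector_model", ["c"])
```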
|
||
When searching text, the following events occur: | ||
|
||
1. Eventually, during the chain of a sync or async lookup on an existing item, we hit `app.main.lib.text_similarity.search_text` (after waiting to pass on to this step until all vectors are completed, subject to `elastic_crud.requires_encoding`), | ||
2. We pass into `search_text` the OpenSearch document which contains all completed relevant vectors along with the list of models from which we will process search results, | ||
3. For each model, including `elasticsearch`, we search for results from OpenSearch, using language analyzers where applicable and cosine similarity searches over vectors otherwise, | ||
4. We append results into a large list of results, each item of which contains sufficient data to indicate the model that yielded the result. | ||
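
For the vector models, steps 3 and 4 amount to running one cosine-similarity search per model and tagging each hit with the model that produced it. The Python sketch below shows that shape in memory. It is an assumption-laden illustration: the real search runs inside OpenSearch, the language-analyzer branch is not modelled, and `search_text_sketch`, `model_a`, and the threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero-length vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def search_text_sketch(query_vectors, index, threshold=0.8):
    # `query_vectors` maps model name -> vector for the query document.
    # `index` maps document id -> {model name -> vector}.
    results = []
    for model, query_vec in query_vectors.items():
        for doc_id, doc_vectors in index.items():
            if model not in doc_vectors:
                continue
            score = cosine_similarity(query_vec, doc_vectors[model])
            if score >= threshold:
                # Each result records which model yielded it.
                results.append({"id": doc_id, "model": model, "score": score})
    return results
```

Because every result carries its `model` key, a caller can later weigh or filter matches per model.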
|
||
## Audio | ||
|
||
 | ||
[Edit Link](https://docs.google.com/drawings/d/1YwWJMgPxAlonCdq4M5RWaSOzSSucwHkg7EWTggWOhw8/edit) | ||
|
||
For audio, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures: | ||
|
||
1. Store the `ProjectMedia`, | ||
2. Send the `ProjectMedia` through `Bot::Alegre.run`, | ||
3. `Bot::Alegre.run` searches for items via audio fingerprinting, | ||
4. Matches are checked asynchronously and in parallel for suggested and confirmed items, | ||
|
||
5. Once both queries have completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback, | ||
6. We also store transcription text annotations in the text index for the item, so that subsequent items can match against them (the transcription text itself is *not* matched against existing items), | ||
7. Transcription data is written to OpenSearch for text lookups, and audio hashes are written to Postgres for audio lookups on Alegre, | ||
8. Relationships at the Check API level are persisted after (5). | ||
|
||
When searching audio, the following events occur: | ||
|
||
1. Eventually, during the chain of a sync or async lookup on an existing item, we hit `app.main.lib.shared_models.audio_model.AudioModel.search`, | ||
2. We pass into `search` either the references sufficient to find an existing hash from an existing audio item, or the data yielded from Presto in order to set that chromaprint hash value, | ||
3. We then run that hash through the custom Postgres/Perl script `get_audio_chromaprint_score` in order to calculate similarity, | ||
4. We then return all matches yielded. | ||
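
This document does not describe what `get_audio_chromaprint_score` computes internally. One common way to compare chromaprint fingerprints, which are arrays of 32-bit integers, is to count the bits that agree over the overlapping part of the two arrays. The Python sketch below shows that approach as an assumption, not as a port of the Postgres/Perl script; `chromaprint_score` is a hypothetical name.

```python
def chromaprint_score(fp_a, fp_b):
    """Share of matching bits across the overlapping 32-bit integers."""
    length = min(len(fp_a), len(fp_b))
    if length == 0:
        return 0.0
    differing = sum(
        # Mask to 32 bits so signed values are compared as raw bit patterns.
        bin((a ^ b) & 0xFFFFFFFF).count("1")
        for a, b in zip(fp_a, fp_b)
    )
    return 1.0 - differing / (32 * length)
```

A score of 1.0 means the overlapping portions are bit-for-bit identical; unrelated audio tends to land near 0.5 under this measure, since about half the bits agree by chance.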
|