diff --git a/.github/workflows/gitlab-mirror.yml b/.github/workflows/gitlab-mirror.yml
new file mode 100644
index 00000000..07a15946
--- /dev/null
+++ b/.github/workflows/gitlab-mirror.yml
@@ -0,0 +1,19 @@
+name: GitLab Mirror
+
+on:
+  - push
+  - delete
+
+jobs:
+  sync:
+    runs-on: ubuntu-latest
+    name: Git Repo Sync
+    steps:
+    - uses: actions/checkout@v2
+      with:
+        fetch-depth: 0
+    - uses: wangchucheng/git-repo-sync@v0.1.0
+      with:
+        target-url: https://gitlab.com/meedan/alegre.git/
+        target-username: sonoransun
+        target-token: ${{ secrets.GITLAB_ACCESS_TOKEN }}
diff --git a/doc/img/alegre-audio-flow.png b/doc/img/alegre-audio-flow.png
new file mode 100644
index 00000000..ed114d51
Binary files /dev/null and b/doc/img/alegre-audio-flow.png differ
diff --git a/doc/img/alegre-image-flow.png b/doc/img/alegre-image-flow.png
new file mode 100644
index 00000000..7867e329
Binary files /dev/null and b/doc/img/alegre-image-flow.png differ
diff --git a/doc/img/alegre-text-flow.png b/doc/img/alegre-text-flow.png
new file mode 100644
index 00000000..108a3032
Binary files /dev/null and b/doc/img/alegre-text-flow.png differ
diff --git a/doc/img/alegre-video-flow.png b/doc/img/alegre-video-flow.png
new file mode 100644
index 00000000..94b09195
Binary files /dev/null and b/doc/img/alegre-video-flow.png differ
diff --git a/doc/similarity-high-level.md b/doc/similarity-high-level.md
new file mode 100644
index 00000000..51d21ed6
--- /dev/null
+++ b/doc/similarity-high-level.md
@@ -0,0 +1,59 @@
+# How Check Processes Items For Similarity
+
+## Images
+
+![Typical Flow, Check Image Matching](doc/img/alegre-image-flow.png?raw=true "Typical Flow, Check Image Matching")
+[Edit Link](https://docs.google.com/drawings/d/1jXgbM_06rlpPeip1vxUKpiRYyhumrkFlr-2EC3qBxHg/edit)
+
+At a high level, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures:
+
+1. Store the `ProjectMedia`,
+2. Send the `ProjectMedia` through `Bot::Alegre.run`,
+3. `Bot::Alegre.run` searches for items via image hashing,
+4. Suggested and confirmed matches are queried concurrently and asynchronously,
+5. Once both queries are completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback,
+6. We also store the item's OCR text annotations in the text index, so that subsequent items can match against them (the OCR text itself is *not* matched against existing items),
+7. OCR data is written to OpenSearch for text lookups, and image hashes are written to Postgres for image lookups on Alegre,
+8. Relationships at the Check API level are persisted after (5).
+
+## Video
+
+![Typical Flow, Check Video Matching](doc/img/alegre-video-flow.png?raw=true "Typical Flow, Check Video Matching")
+[Edit Link](https://docs.google.com/drawings/d/1HQTwHmkhzp-J742-QAowfYMNaYoALYTPwOTA-PASHnk/edit)
+
+For video, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures:
+
+1. Store the `ProjectMedia`,
+2. Send the `ProjectMedia` through `Bot::Alegre.run`,
+3. `Bot::Alegre.run` searches for items via video fingerprinting,
+4. Suggested and confirmed matches are queried concurrently and asynchronously,
+5. Once both queries are completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback,
+6. We also store the item's transcription text annotations in the text index, so that subsequent items can match against them (the transcription text itself is *not* matched against existing items),
+7. Transcription data is written to OpenSearch for text lookups, and video hashes are written to Postgres for video lookups on Alegre, as well as to disk as `.tmk` files for file-based lookups,
+8. Relationships at the Check API level are persisted after (5).
+
+## Text
+
+![Typical Flow, Check Text Matching](doc/img/alegre-text-flow.png?raw=true "Typical Flow, Check Text Matching")
+[Edit Link](https://docs.google.com/drawings/d/12WljT8-qsUi8xG584clD_eV1ABOcB6CqkMX0eAxSPrE/edit)
+
+1. Store the `ProjectMedia`,
+2. Send the `ProjectMedia` through `Bot::Alegre.run`,
+3. `Bot::Alegre.run` searches for items via text similarity,
+4. Suggested and confirmed matches are queried concurrently and asynchronously *for `original_title` and `original_description`* and *for all vector models applied*,
+5. Once both queries are completed *for both fields* and *for all vector models applied*, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback; that method tracks the remaining messages and only processes a match once no messages are still in flight,
+6. Relationships at the Check API level are persisted after (5).
+
+## Audio
+
+![Typical Flow, Check Audio Matching](doc/img/alegre-audio-flow.png?raw=true "Typical Flow, Check Audio Matching")
+[Edit Link](https://docs.google.com/drawings/d/1YwWJMgPxAlonCdq4M5RWaSOzSSucwHkg7EWTggWOhw8/edit)
+
+1. Store the `ProjectMedia`,
+2. Send the `ProjectMedia` through `Bot::Alegre.run`,
+3. `Bot::Alegre.run` searches for items via audio fingerprinting,
+4. Suggested and confirmed matches are queried concurrently and asynchronously,
+5. Once both queries are completed, we process the item and store the results in the `Bot::Alegre.relate_project_media_callback` callback,
+6. We also store the item's transcription text annotations in the text index, so that subsequent items can match against them (the transcription text itself is *not* matched against existing items),
+7. Transcription data is written to OpenSearch for text lookups, and audio hashes are written to Postgres for audio lookups on Alegre,
+8. Relationships at the Check API level are persisted after (5).
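Reviewer note on step 5 of the Text flow in `doc/similarity-high-level.md`: the "only process once no messages are in flight" behaviour may be easier to follow with a small sketch. The Python below is illustrative only and is not the Check-API implementation (which is Ruby). The class `PendingMatch`, its methods, and the model name `model-a` are hypothetical names invented for this example; only the field names `original_title` and `original_description` and the suggested/confirmed split come from the document.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class PendingMatch:
    """Hypothetical tracker for the in-flight similarity queries of one item."""

    pending: set = field(default_factory=set)
    results: dict = field(default_factory=dict)

    def expect(self, fields, models, kinds=("suggested", "confirmed")):
        # One outstanding message per (field, model, kind) combination.
        self.pending = set(product(fields, models, kinds))

    def receive(self, field_name, model, kind, matches):
        # Record one response; return True only when nothing is left in flight.
        key = (field_name, model, kind)
        if key not in self.pending:
            raise KeyError(f"unexpected or duplicate response: {key}")
        self.pending.remove(key)
        self.results[key] = matches
        return not self.pending


# Usage: two fields, one (assumed) vector model, two query kinds = 4 messages.
tracker = PendingMatch()
tracker.expect(["original_title", "original_description"], ["model-a"])
for key in sorted(tracker.pending):
    if tracker.receive(*key, matches=[]):
        print("all responses in; relationships can now be processed")
```

With one model this waits for four responses; each additional vector model adds four more, which is why the callback has to count outstanding messages rather than react to the first response.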
diff --git a/doc/similarity.png b/doc/similarity.png
deleted file mode 100644
index b0de5777..00000000
Binary files a/doc/similarity.png and /dev/null differ
diff --git a/doc/similarity.svg b/doc/similarity.svg
deleted file mode 100644
index a048134a..00000000
--- a/doc/similarity.svg
+++ /dev/null
@@ -1,533 +0,0 @@
-image/svg+xml
-Check
-Check Bot(JS
-1. A claim is created on Check
-2. Claim is sent to the Similarity Bot, which is a bot from the Bot Garden like any other
-3.1. The bot calls Alegre's /similarity endpoint to save this claim to Alegre's index... for now, the type can be wordvec or es (default)... this diagram uses the wordvec case as example
-3.2. The bot also calls Alegre's /similarity/query endpoint in order to get similar claims... today this can be done based on pure ElasticSearch match or Word2Vec similarity - let's use the latter as example
-Alegre
-4.1. Claim is saved in ElasticSearch index, along with any other context information that was provided (on Check case, this means project_media_id and project_id)
-4.2. Query is sent to ElasticSearch in order to get the results that have a minimum threshold and that are from the same context (on Check case, same project)
-5.1. The Meedan ElasticSearch plugin calls Alegre's /wordvec/vector endpoint which returns a word vector for that claim, as a n-dimensional array of floats, where n is the number of dimensions in the corpus... that vector is saved in the vector field in the ElasticSearch index
-5.2. The Meedan ElasticSearch plugin calls Alegre's /wordvec/similarity endpoint for the input claim vector and for each vector stored in the index... the similarity (between 0 and 1) returned by Alegre is used as the ElasticSearch _score for that result
-Elastic Search Plugin
-6.1. Alegre returns whether the claim was saved in the index
-6.2. Alegre returns the list of similar claims, along with the context information
-7.1. The bot creates a comment saying if the claim was saved in the similarity index or not
-7.2. The bot creates relationships between the created claim and any other similar claim returned by Alegre
-Claim Similarity Bot