10 changes: 7 additions & 3 deletions README.md
@@ -1,7 +1,9 @@
alegre
------

A media analysis service. Part of the [Check platform](https://meedan.com/check). Refer to the [main repository](https://github.com/meedan/check) for quick start instructions.
A media similarity analysis service. Part of the [Check platform](https://meedan.com/check). Refer to the [main repository](https://github.com/meedan/check) for quick start instructions.

There is also an [overview of the similarity infrastructure](doc/meedan_similarity_infra_overview.md) and a more [detailed explanation of the process for each media type](doc/similarity-media-type-detail.md).

## Development

@@ -36,13 +38,15 @@ To test individual modules:

## Diagrams

NOTE: these diagrams need to be updated with the new endpoints from the Presto migration.

### Similarity-Related HTTP requests Alegre receives from Check API

![Similarity-Related HTTP requests Alegre receives from Check API](elasticsearch_detail.png?raw=true "Similarity-Related HTTP requests Alegre receives from Check API")
![Similarity-Related HTTP requests Alegre receives from Check API](doc/elasticsearch_detail.png?raw=true "Similarity-Related HTTP requests Alegre receives from Check API")

(Source: https://docs.google.com/drawings/d/1-teqtZJfU4MSDUGVwWL9F4cXDKDnVObDYg3a9jJOP1Y/edit)
### Text Queries generated by Similarity Requests from Check API within Alegre

![Text Queries generated by Similarity Requests from Check API within Alegre](alegre_parameter_breakdown.png?raw=true "Text Queries generated by Similarity Requests from Check API within Alegre")
![Text Queries generated by Similarity Requests from Check API within Alegre](doc/alegre_parameter_breakdown.png?raw=true "Text Queries generated by Similarity Requests from Check API within Alegre")

(Source: https://docs.google.com/drawings/d/1jvwn5wM6T2jlnaS_fS7_u6sH02HVHi6L8Q9H_vD4SuY/edit)
File renamed without changes
File renamed without changes
77 changes: 77 additions & 0 deletions doc/meedan_similarity_infra_overview.md
@@ -0,0 +1,77 @@
# Meedan Similarity Infrastructure Overview

This document illustrates the relationships between the parts of the Meedan similarity services (Alegre) and the other Meedan systems they support and depend on. Elements in the diagram correspond to observable pieces of infrastructure, i.e. components whose log files we can inspect when tracing a request.

The most common use case is for one of the Meedan services (CheckWeb, Timpani, etc.) to call the Alegre API to request storing a fingerprint for a media item and returning a list of previously stored media items with similar fingerprints. The mechanisms for fingerprinting vary considerably by media type (see [similarity-media-type-detail](similarity-media-type-detail.md)), but in general the media is passed via SQS queues to an appropriate "Presto" model to compute a fingerprint (e.g. a text embedding vector), which is stored in OpenSearch (vectors), an AWS S3 filestore (video fingerprints), or a Postgres database (audio fingerprints, image fingerprints).

The media items returned by a search request are used to show search results in the Check web UI, to establish stored relationships between content in a workspace (Check API), or to look up corresponding fact-checks to return to users (tipline queries).
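
The store-then-search request described above can be sketched roughly as follows. This is a minimal illustration, not Alegre's code: `ToyVectorIndex`, `store_and_search`, the `FINGERPRINT_STORES` names, the use of cosine similarity, and the `0.9` threshold are all assumptions made for the example, and an in-memory dict stands in for the real vector store.

```python
import math

# Where each kind of fingerprint is stored, per the overview above.
# The short labels are illustrative, not real configuration keys.
FINGERPRINT_STORES = {
    "text": "OpenSearch",
    "video": "S3",
    "audio": "Postgres",
    "image": "Postgres",
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for the stored text vectors."""

    def __init__(self):
        self.vectors = {}  # item id -> fingerprint vector

    def store_and_search(self, item_id, vector, threshold=0.9):
        """Return previously stored items similar to `vector`, then store it."""
        matches = []
        for other_id, other_vector in self.vectors.items():
            score = cosine_similarity(vector, other_vector)
            if score >= threshold:
                matches.append((other_id, score))
        # Most similar items first.
        matches.sort(key=lambda match: match[1], reverse=True)
        self.vectors[item_id] = vector
        return matches
```

The first call for a new index returns an empty list; later calls return the ids and scores of earlier items whose vectors clear the threshold.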


Alegre relies on a number of third-party APIs to perform some intermediate classification (identifying the language of text, etc.) or to extract text (Optical Character Recognition, transcripts) from other media types.

Redis is used to manage the state of blocking calls while waiting for work to complete, and also to cache fingerprint results for quickly repeated queries.
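
As a rough illustration of the caching role only, the sketch below keeps computed fingerprints for a short time so that a quickly repeated query skips the recomputation. It uses a plain dict rather than a Redis client; the class name `FingerprintCache`, the `get_or_compute` method, and the 60-second default expiry are assumptions for the example, not details of Alegre.

```python
import time


class FingerprintCache:
    """In-memory stand-in for a Redis-backed cache with per-key expiry."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, calling `compute()` on a miss or expiry."""
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self.entries[key] = (now + self.ttl_seconds, value)
        return value
```

With a real Redis deployment the expiry would normally be handled by the server rather than checked in application code.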

Deprecated: Most of the text models exist in both Alegre and Presto, because a refactor is in progress to move them out of Alegre.


```mermaid
graph LR
Tipline_Queries --> Check_API
Check_Web --> Check_API --> Alegre_API_ECS_Service
Rake_Indexing_Tasks --> Check_API
Check_API --> Check_Relationship_Store_Postgres_DB
Check_API --> Alegre_API_ECS_Service
Timpani --> Alegre_API_ECS_Service
Alegre_API_ECS_Service --> Alegre_RDS_Postgres_DB
Alegre_API_ECS_Service --> OpenSearch_Vector_Index
Alegre_API_ECS_Service --> Redis_State_Cache

Alegre_API_ECS_Service --> TMK_Video_Fingerprint_Files_S3
subgraph Deprecated_Alegre_Text_Models_ECS
Multilingual_Text
MeanTokens_Text
IndianSbert_Text
FPTG_Text
Worker
end
Alegre_API_ECS_Service --> Deprecated_Alegre_Text_Models_ECS

subgraph External_APIs
Google_Lang_ID
Google_Image_OCR
Google_Image_Classification
OpenAI_Text_Embeddings
AWS_Audio_Transcriptions
end
Alegre_API_ECS_Service --> External_APIs
Alegre_API_ECS_Service --> Presto_API_Service
Presto_API_Service --> Alegre_API_ECS_Service

subgraph AWS_SQS_Queues
Input_Queues
Output_Queues
DeadLetter_Queues
end

Presto_API_Service --> Input_Queues --> Presto_Models_ECS --> Output_Queues --> Presto_API_Service

subgraph Presto_Models_ECS
YAKE_Keywords_Model
Multilingual_Text_Model
IndianSbert_Text_Model
MeanTokens_Text_Model
FPTG_Text_Model
Video_Model
Audio_Model
Image_Model
ClassyCat_Model
end


Video_Model --> TMK_Video_Fingerprint_Files_S3
Audio_Model --> Alegre_RDS_Postgres_DB
Image_Model --> Alegre_RDS_Postgres_DB

```
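
The `Input_Queues --> Presto_Models_ECS --> Output_Queues` round trip in the diagram can be sketched with standard-library queues standing in for SQS. Everything named here (`run_model_worker`, `toy_text_model`, the message fields `id`, `body`, `result`, `error`) is hypothetical. The diagram does not say how messages reach the dead-letter queues, so the sketch assumes, as a simplification, that a message goes there as soon as the model raises an error.

```python
import queue


def toy_text_model(text):
    """Hypothetical model: 'fingerprints' text as a one-element vector."""
    if not text:
        raise ValueError("empty text")
    return [float(len(text))]


def run_model_worker(input_queue, output_queue, dead_letter_queue, model):
    """Drain input_queue, applying `model` to each message body."""
    while True:
        try:
            message = input_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to process
        try:
            result = model(message["body"])
        except Exception as error:
            # Assumed behaviour: failed messages are set aside for inspection.
            dead_letter_queue.put({"id": message["id"], "error": str(error)})
        else:
            output_queue.put({"id": message["id"], "result": result})


input_queue, output_queue, dead_letter_queue = queue.Queue(), queue.Queue(), queue.Queue()
input_queue.put({"id": 1, "body": "hello"})
input_queue.put({"id": 2, "body": ""})
run_model_worker(input_queue, output_queue, dead_letter_queue, toy_text_model)
```

After the run, the successful result for message 1 sits on the output queue and message 2 sits on the dead-letter queue with its error text.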


@@ -1,8 +1,10 @@
# How Check Processes Items For Similarity
# Similarity Processing by Media Type

This document provides a general outline of the steps and calls made for each of the media types that Meedan's "Alegre" similarity service knows how to compute similarity for: Images, Video, Text and Audio.

## Images

![Typical Flow, Check Image Matching](doc/img/alegre-image-flow.png?raw=true "Typical Flow, Check Image Matching")
![Typical Flow, Check Image Matching](img/alegre-image-flow.png?raw=true "Typical Flow, Check Image Matching")
[Edit Link](https://docs.google.com/drawings/d/1jXgbM_06rlpPeip1vxUKpiRYyhumrkFlr-2EC3qBxHg/edit)

At a high level, Check-API receives new `ProjectMedia` items and, as they are created, we perform the following procedures:
@@ -29,7 +31,7 @@ When Searching images, the following events occur:

## Video

![Typical Flow, Check Video Matching](doc/img/alegre-video-flow.png?raw=true "Typical Flow, Check Video Matching")
![Typical Flow, Check Video Matching](img/alegre-video-flow.png?raw=true "Typical Flow, Check Video Matching")
[Edit Link](https://docs.google.com/drawings/d/1HQTwHmkhzp-J742-QAowfYMNaYoALYTPwOTA-PASHnk/edit)

For video, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures:
@@ -54,7 +56,7 @@ When Searching videos, the following events occur:

## Text

![Typical Flow, Check Text Matching](doc/img/alegre-text-flow.png?raw=true "Typical Flow, Check Text Matching")
![Typical Flow, Check Text Matching](img/alegre-text-flow.png?raw=true "Typical Flow, Check Text Matching")
[Edit Link](https://docs.google.com/drawings/d/12WljT8-qsUi8xG584clD_eV1ABOcB6CqkMX0eAxSPrE/edit)

For text, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures:
@@ -75,7 +77,7 @@ When Searching text, the following events occur:

## Audio

![Typical Flow, Check Audio Matching](doc/img/alegre-audio-flow.png?raw=true "Typical Flow, Check Audio Matching")
![Typical Flow, Check Audio Matching](img/alegre-audio-flow.png?raw=true "Typical Flow, Check Audio Matching")
[Edit Link](https://docs.google.com/drawings/d/1YwWJMgPxAlonCdq4M5RWaSOzSSucwHkg7EWTggWOhw8/edit)

For audio, Check-API receives new `ProjectMedia` items and, as they are created, performs the following procedures: