Object identification and question answering. The main goal of this project is to make decisions about images based on a provided question. The question MUST contain the possible answers, for example:

'Is this apple red or blue?'

Keywords are extracted with HuSpaCy, so you can swap out that part to adapt the model to your language. The keyword extraction relies on the CCONJ part-of-speech tag, so the question also has to include a word that fulfils that role (like *or*). The model then returns an answer based on which keyword best represents the image.
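To make the language-adaptation point concrete, here is a minimal sketch of CCONJ-based candidate extraction, written with English spaCy purely for illustration; the `extract_keywords` helper is hypothetical, and the repo's actual extraction (which also involves Multi-Rake) may differ:

```python
import spacy

# Illustration with the small English pipeline; swap in your own
# language's model (the repo itself uses HuSpaCy's Hungarian pipeline).
nlp = spacy.load("en_core_web_sm")

def extract_keywords(question: str):
    """Hypothetical helper: collect the answer candidates that a
    coordinating conjunction (CCONJ) ties together."""
    doc = nlp(question)
    for token in doc:
        if token.pos_ == "CCONJ":
            # The conjunction attaches to the first conjunct ("red");
            # `conjuncts` yields the coordinated rest ("green", "blue").
            first = token.head
            return [first.text] + [t.text for t in first.conjuncts]
    return []  # no CCONJ found, so no answer candidates

print(extract_keywords("Is this apple red, green or blue?"))
# expected: ['red', 'green', 'blue']
```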
The project requires Python 3.9 and can be installed directly from GitHub:

```
pip install git+https://github.com/ficstamas/huclip-the-text.git
```

```python
from huclip_the_text.model.clip import KeywordCLIP
from PIL import Image
model = KeywordCLIP(model_name='M-BERT-Base-ViT-B')
img = Image.open('bananas.jpg')
out = model.evaluate(img, 'Sárga, kerek vagy lila banánt látsz?')  # "Do you see a yellow, round or purple banana?"
# Output:
# Probability of the answer 'Sárga banán' is 0.601322591304779
# Probability of the answer 'kerek banán' is 0.20016320049762726
# Probability of the answer 'lila banán' is 0.19851425290107727
```
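For intuition, the returned probabilities follow the usual CLIP recipe: embed the image and each candidate phrase, compare them by cosine similarity, and normalize with a softmax. The sketch below shows that recipe with OpenAI's English `clip` package purely as an illustration; KeywordCLIP's multilingual text encoder and exact pre-/post-processing may differ:

```python
import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("bananas.jpg")).unsqueeze(0).to(device)
candidates = ["yellow banana", "round banana", "purple banana"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each candidate phrase,
    # scaled and turned into a probability distribution with softmax.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100 * image_features @ text_features.T).softmax(dim=-1)

for phrase, p in zip(candidates, probs[0].tolist()):
    print(f"Probability of the answer '{phrase}' is {p}")
```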
Pre-trained models and projection weights are from MultilingualCLIP:

| Name | Language Model | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
|---|---|---|---|---|---|---|
| M-BERT-Distil-40 | M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
| M-BERT-Base-69 | M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
| M-BERT-Base-ViT-B | M-BERT Base ViT-B | M-BERT Base | ViT-B/32 | 101 Languages | 68 Languages | 110 M |
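The usage example above passes the `Name` column as `model_name`; the other models should presumably be selectable the same way, e.g. `KeywordCLIP(model_name='M-BERT-Distil-40')` for the smaller distilled variant.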
- MultilingualCLIP model: https://github.com/FreddeFrallan/Multilingual-CLIP/
- HuSpaCy: https://github.com/huspacy/huspacy/
- Multi-Rake: https://github.com/vgrabovets/multi_rake/