SONAR

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.

Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.

SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations.

The full list of supported languages (along with download links) can be found below.

SONAR Architecture:

Text results

Speech results

Installing

You can install SONAR with pip install sonar-space. Note that there is another sonar package on pip that IS NOT this project, so make sure to use sonar-space in your dependencies.

Note that SONAR depends on fairseq2, whose build must precisely match your versions of PyTorch and CUDA (here are the possible variants). You can check which version of PyTorch you have with pip show torch. For example, if it is 2.6.0+cu124, you should install fairseq2 from the following source:

pip install fairseq2 --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/pt2.6.0/cu124

If fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.
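The index URL in the command above appears to follow the pattern pt<torch version>/<variant>. As a minimal sketch, assuming that pattern holds for other builds too, the helper below derives the URL from a version string such as the one reported by pip show torch. The name fairseq2_index_url is hypothetical and not part of SONAR or fairseq2, and the resulting URL is not guaranteed to exist for every combination.

```python
def fairseq2_index_url(torch_version: str) -> str:
    """Build the extra index URL for a torch version string like '2.6.0+cu124'.

    Assumption: the URL layout matches the single example given in this README.
    """
    base, _, variant = torch_version.partition("+")
    if not base or not variant:
        # Versions without a '+variant' suffix are not covered by this sketch.
        raise ValueError(
            f"cannot infer a build variant from {torch_version!r}; "
            "check the fairseq2 readme for the available variants"
        )
    return f"https://fair.pkg.atmeta.com/fairseq2/whl/pt{base}/{variant}"


print(fairseq2_index_url("2.6.0+cu124"))
# https://fair.pkg.atmeta.com/fairseq2/whl/pt2.6.0/cu124
```

The printed URL can then be passed to pip install fairseq2 --extra-index-url.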

We recommend installing SONAR only after you have a correct version of fairseq2 installed. Note that SONAR currently relies on the stable version of fairseq2 0.4.5 (with minor variations possible).

If you want to install SONAR manually, you can install it locally from a checkout of the repository:

pip install --upgrade pip
pip install -e .

Usage

fairseq2 will automatically download models into your $TORCH_HOME/hub directory upon using the commands below.

Compute text sentence embeddings with SONAR:

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                           tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)
# torch.Size([2, 1024])
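Because every sentence maps to a vector of the same size, two sentences can be compared directly in the embedding space. One common choice is cosine similarity. The sketch below uses plain Python lists so that it runs without downloading any model; cosine_similarity is a hypothetical helper, not a SONAR function, and the 3-dimensional vectors stand in for real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: identical directions score 1.0, orthogonal ones score 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

With real embeddings, a row of the tensor returned by predict could be converted with .tolist() before being passed to this helper.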

Note that by default, all SONAR models are loaded to a CPU device, which is relatively slow. If you want to use a GPU instead, you should provide the device argument when initializing the model (this applies to every model). Similarly, you can pass a dtype argument. For example:

import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

embedder = TextToEmbeddingModelPipeline(
  encoder="text_sonar_basic_encoder", 
  tokenizer="text_sonar_basic_encoder", 
  device=torch.device("cuda"),
  dtype=torch.float16,
)

Reconstruct text from SONAR embeddings

from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder",
                                              tokenizer="text_sonar_basic_encoder")
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.
print(reconstructed)
# ['My name is SONAR.', 'I can embed the sentences into vector space.']

By default, text generation in SONAR is based on beam search (BeamSearchSeq2SeqGenerator from fairseq2) with the default setting of beam_size=5. If one passes a sampler argument, we will use a SamplingSeq2SeqGenerator instead. All additional arguments are passed to the generator constructor. For example:

from fairseq2.generation import TopPSampler, TopKSampler
embeddings = t2vec_model.predict(["Bonjour le monde!"] * 10, source_lang="fra_Latn")
vec2text_model.predict(embeddings, target_lang="eng_Latn", sampler=TopPSampler(0.99), max_seq_len=128)
# ['Hello, the world!',
#  'Hey, everybody!',
#  'Good day to you, world!',
#  'Hello, the world!',
#  'Hello, people.',
#  'Hello, everybody, around the world.',
#  'Hello, world. How are you?',
#  "Hey, what's up?",
#  'Good afternoon, everyone.',
#  'Hello to the world!']
# the outputs are now random, so they will be different every time

Note that the sampler argument is the signal to use a SamplingSeq2SeqGenerator instead of a BeamSearchSeq2SeqGenerator, and the max_seq_len argument is passed to the SamplingSeq2SeqGenerator constructor.
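To give an intuition for why sampled outputs vary from run to run, here is a conceptual sketch of top-p (nucleus) sampling over a toy next-token distribution. It is not the fairseq2 implementation of TopPSampler; top_p_sample and the token probabilities are invented for illustration.

```python
import random


def top_p_sample(probs, p, rng):
    """Sample one token from the smallest set of most likely tokens
    whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    total = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens = [token for token, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_distribution = {"Hello": 0.7, "Hey": 0.2, "Good": 0.1}
rng = random.Random(0)
print([top_p_sample(toy_distribution, 0.8, rng) for _ in range(5)])
```

With p=0.8 only "Hello" and "Hey" can be drawn here, while a value close to 1, such as the 0.99 used above, keeps almost the whole distribution and therefore produces more varied outputs.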

Translate text with SONAR

from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
                                    decoder="text_sonar_basic_decoder",
                                    tokenizer="text_sonar_basic_encoder")  # tokenizer is attached to both encoder and decoder cards

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]

Compute speech sentence embeddings with SONAR

from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                     "./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
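The assertion above reflects that the speech examples expect 16 kHz audio. When passing file paths rather than loaded tensors, the sample rate can be checked beforehand. The following is a minimal sketch using only the Python standard library; has_expected_sample_rate is a hypothetical helper, and the wave module only reads uncompressed PCM WAV files.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # the examples in this README assert 16 kHz input


def has_expected_sample_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the WAV file at `path` is sampled at `expected` Hz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected
```

Files that fail the check would need to be resampled to 16 kHz before being embedded.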

Speech-to-text translation with SONAR

from sonar.inference_pipelines.speech import SpeechToTextModelPipeline

s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                      decoder="text_sonar_basic_decoder",
                                      tokenizer="text_sonar_basic_decoder")

import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']

# passing multiple wav files 
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                   "./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']

Predicting sentence similarity with BLASER 2.0 models

BLASER 2.0 is a family of models for automatic evaluation of machine translation quality based on SONAR embeddings. They predict cross-lingual semantic similarity between the translation and the source (optionally, also using a reference translation).

import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")

src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")

with torch.inference_mode():
    print(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
    print(blaser_qe(src=src_embs, mt=mt_embs).item())  # 4.708

Detailed model cards with more examples: facebook/blaser-2.0-ref, facebook/blaser-2.0-qe.

Classifying the toxicity of sentences with MuTox

MuTox is the first highly multilingual audio-based binary classifier and dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages. The classifier uses the multimodal and multilingual encoders from SONAR. The output of the MuTox classifier is a logit of the evaluated input being "toxic", according to the definition adopted in the corresponding dataset.

from sonar.models.mutox.loader import load_mutox_model
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
    device=device,
)
text_column='lang_txt'
classifier = load_mutox_model(
    "sonar_mutox",
    device=device,
    dtype=dtype,
).eval()

with torch.inference_mode():
    emb = t2vec_model.predict(["De peur que le pays ne se prostitue et ne se remplisse de crimes."], source_lang='fra_Latn')
    x = classifier(emb.to(device).to(dtype)) 
    print(x) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["She worked hard and made a significant contribution to the team."], source_lang='eng_Latn')
    x = classifier(emb.to(device).to(dtype))
    print(x) # tensor([[-53.5938]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["El no tiene ni el más mínimo talento, todo lo que ha logrado ha sido gracias a sobornos y manipulaciones."], source_lang='spa_Latn')
    x = classifier(emb.to(device).to(dtype))
    print(x) # tensor([[-21.4062]], device='cuda:0', dtype=torch.float16)
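The raw outputs above are logits, so strongly negative values mean "very unlikely to be toxic". Assuming the logit is a standard log-odds score, it can be mapped to a probability with the logistic sigmoid. The sketch below uses plain floats; logit_to_probability is a hypothetical helper, and any decision threshold on the result is left to the user.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Numerically stable logistic sigmoid: maps log-odds to a value in (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Logits copied from the example outputs above.
for logit in (-19.7812, -53.5938, -21.4062):
    print(f"{logit:9.4f} -> {logit_to_probability(logit):.3e}")
```

On tensors, torch.sigmoid(x) performs the same transformation.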

For a CLI way of running the MuTox pipeline, go to Seamless Communication/.../MuTox.

Demo notebooks

See more complete demo notebooks :

Supported languages and download links

The SONAR text encoder and decoder support 200 languages. SONAR speech encoders support 37 languages.

Available text encoders/decoders

model	link
encoder	download
decoder	download
finetuned decoder	download
tokenizer	download

The languages supported by SONAR text encoders/decoders are all the 202 languages from the NLLB-200 models. They comprise all 204 FLORES-200 languages, except arb_Latn and min_Arab (note that sat_Olck is supported under the name sat_Beng, although Olck is the correct script).
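The exceptions described above can be captured in a small lookup when converting FLORES-200 codes to the codes expected by the SONAR text models. This is a minimal sketch: to_sonar_lang_code is a hypothetical helper, it only encodes the three special cases named in this section, and it does not check that the input is a valid FLORES-200 code.

```python
# FLORES-200 codes that the SONAR text models do not cover.
UNSUPPORTED_FLORES_CODES = {"arb_Latn", "min_Arab"}
# FLORES-200 codes that SONAR expects under a different name.
RENAMED_FLORES_CODES = {"sat_Olck": "sat_Beng"}


def to_sonar_lang_code(flores_code: str) -> str:
    """Map a FLORES-200 language code to the corresponding SONAR code."""
    if flores_code in UNSUPPORTED_FLORES_CODES:
        raise ValueError(f"{flores_code} is not supported by SONAR text models")
    return RENAMED_FLORES_CODES.get(flores_code, flores_code)


print(to_sonar_lang_code("sat_Olck"))  # sat_Beng
print(to_sonar_lang_code("fra_Latn"))  # fra_Latn
```

The returned value is what would be passed as source_lang or target_lang in the pipelines shown earlier.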

See more details on the languages list in the No Language Left Behind paper (the table below is based on Table 1 in this paper):

flores_lang_code	sonar_lang_code	lang_name	script	family	subgrouping	resource_level	variety
ace_Arab	ace_Arab	Acehnese	Arabic	Austronesian	Malayo-Polynesian	Low	North Acehnese
ace_Latn	ace_Latn	Acehnese	Latin	Austronesian	Malayo-Polynesian	Low	North Acehnese
acm_Arab	acm_Arab	Mesopotamian Arabic	Arabic	Afro-Asiatic	Semitic	Low	Baghdadi
acq_Arab	acq_Arab	Taʽizzi-Adeni Arabic	Arabic	Afro-Asiatic	Semitic	Low
aeb_Arab	aeb_Arab	Tunisian Arabic	Arabic	Afro-Asiatic	Semitic	Low	Derja
afr_Latn	afr_Latn	Afrikaans	Latin	Indo-European	Germanic	High
ajp_Arab	ajp_Arab	South Levantine Arabic	Arabic	Afro-Asiatic	Semitic	Low	Ammani
aka_Latn	aka_Latn	Akan	Latin	Atlantic-Congo	Kwa Volta-Congo	Low	Asante
amh_Ethi	amh_Ethi	Amharic	Geʽez	Afro-Asiatic	Semitic	Low	Addis Ababa
apc_Arab	apc_Arab	North Levantine Arabic	Arabic	Afro-Asiatic	Semitic	Low
arb_Arab	arb_Arab	Modern Standard Arabic	Arabic	Afro-Asiatic	Semitic	High
arb_Latn	-	Modern Standard Arabic	Latin	Afro-Asiatic	Semitic	Low
ars_Arab	ars_Arab	Najdi Arabic	Arabic	Afro-Asiatic	Semitic	Low
ary_Arab	ary_Arab	Moroccan Arabic	Arabic	Afro-Asiatic	Semitic	Low
arz_Arab	arz_Arab	Egyptian Arabic	Arabic	Afro-Asiatic	Semitic	Low
asm_Beng	asm_Beng	Assamese	Bengali	Indo-European	Indo-Aryan	Low	Eastern
ast_Latn	ast_Latn	Asturian	Latin	Indo-European	Italic	Low	Central
awa_Deva	awa_Deva	Awadhi	Devanagari	Indo-European	Indo-Aryan	Low	Ayodhya
ayr_Latn	ayr_Latn	Central Aymara	Latin	Aymaran	Central Southern Aymara	Low	Aymara La Paz jilata
azb_Arab	azb_Arab	South Azerbaijani	Arabic	Turkic	Common Turkic	Low	Tabrizi
azj_Latn	azj_Latn	North Azerbaijani	Latin	Turkic	Common Turkic	Low	Shirvan
bak_Cyrl	bak_Cyrl	Bashkir	Cyrillic	Turkic	Common Turkic	Low	Literary
bam_Latn	bam_Latn	Bambara	Latin	Mande	Western Mande	Low
ban_Latn	ban_Latn	Balinese	Latin	Austronesian	Malayo-Polynesian	Low
bel_Cyrl	bel_Cyrl	Belarusian	Cyrillic	Indo-European	Balto-Slavic	Low	Central
bem_Latn	bem_Latn	Bemba	Latin	Atlantic-Congo	Benue-Congo	Low	Central
ben_Beng	ben_Beng	Bengali	Bengali	Indo-European	Indo-Aryan	High	Rarhi
bho_Deva	bho_Deva	Bhojpuri	Devanagari	Indo-European	Indo-Aryan	Low
bjn_Arab	bjn_Arab	Banjar	Arabic	Austronesian	Malayo-Polynesian	Low	Banjar Kuala
bjn_Latn	bjn_Latn	Banjar	Latin	Austronesian	Malayo-Polynesian	Low	Banjar Kuala
bod_Tibt	bod_Tibt	Standard Tibetan	Tibetan	Sino-Tibetan	Bodic	Low	Lhasa
bos_Latn	bos_Latn	Bosnian	Latin	Indo-European	Balto-Slavic	High
bug_Latn	bug_Latn	Buginese	Latin	Austronesian	Malayo-Polynesian	Low	Bone
bul_Cyrl	bul_Cyrl	Bulgarian	Cyrillic	Indo-European	Balto-Slavic	High
cat_Latn	cat_Latn	Catalan	Latin	Indo-European	Italic	High
ceb_Latn	ceb_Latn	Cebuano	Latin	Austronesian	Malayo-Polynesian	Low
ces_Latn	ces_Latn	Czech	Latin	Indo-European	Balto-Slavic	High
cjk_Latn	cjk_Latn	Chokwe	Latin	Atlantic-Congo	Benue-Congo	Low
ckb_Arab	ckb_Arab	Central Kurdish	Arabic	Indo-European	Iranian	Low
crh_Latn	crh_Latn	Crimean Tatar	Latin	Turkic	Common Turkic	Low
cym_Latn	cym_Latn	Welsh	Latin	Indo-European	Celtic	Low	Y Wyndodeg
dan_Latn	dan_Latn	Danish	Latin	Indo-European	Germanic	High
deu_Latn	deu_Latn	German	Latin	Indo-European	Germanic	High
dik_Latn	dik_Latn	Southwestern Dinka	Latin	Nilotic	Western Nilotic	Low	Rek
dyu_Latn	dyu_Latn	Dyula	Latin	Mande	Western Mande	Low
dzo_Tibt	dzo_Tibt	Dzongkha	Tibetan	Sino-Tibetan	Bodic	Low
ell_Grek	ell_Grek	Greek	Greek	Indo-European	Graeco-Phrygian	High
eng_Latn	eng_Latn	English	Latin	Indo-European	Germanic	High
epo_Latn	epo_Latn	Esperanto	Latin	Constructed	Esperantic	Low
est_Latn	est_Latn	Estonian	Latin	Uralic	Finnic	High
eus_Latn	eus_Latn	Basque	Latin	Basque	–	High
ewe_Latn	ewe_Latn	Ewe	Latin	Atlantic-Congo	Kwa Volta-Congo	Low	Aŋlo
fao_Latn	fao_Latn	Faroese	Latin	Indo-European	Germanic	Low
fij_Latn	fij_Latn	Fijian	Latin	Austronesian	Malayo-Polynesian	Low	Bau
fin_Latn	fin_Latn	Finnish	Latin	Uralic	Finnic	High
fon_Latn	fon_Latn	Fon	Latin	Atlantic-Congo	Kwa Volta-Congo	Low
fra_Latn	fra_Latn	French	Latin	Indo-European	Italic	High
fur_Latn	fur_Latn	Friulian	Latin	Indo-European	Italic	Low	Central
fuv_Latn	fuv_Latn	Nigerian Fulfulde	Latin	Atlantic-Congo	North-Central Atlantic	Low	Sokoto
gla_Latn	gla_Latn	Scottish Gaelic	Latin	Indo-European	Celtic	Low	Northern Hebrides
gle_Latn	gle_Latn	Irish	Latin	Indo-European	Celtic	Low
glg_Latn	glg_Latn	Galician	Latin	Indo-European	Italic	Low
grn_Latn	grn_Latn	Guarani	Latin	Tupian	Maweti-Guarani	Low
guj_Gujr	guj_Gujr	Gujarati	Gujarati	Indo-European	Indo-Aryan	Low	Amdavadi/Surti
hat_Latn	hat_Latn	Haitian Creole	Latin	Indo-European	Italic	Low
hau_Latn	hau_Latn	Hausa	Latin	Afro-Asiatic	Chadic	Low
heb_Hebr	heb_Hebr	Hebrew	Hebrew	Afro-Asiatic	Semitic	High
hin_Deva	hin_Deva	Hindi	Devanagari	Indo-European	Indo-Aryan	High
hne_Deva	hne_Deva	Chhattisgarhi	Devanagari	Indo-European	Indo-Aryan	Low
hrv_Latn	hrv_Latn	Croatian	Latin	Indo-European	Balto-Slavic	High
hun_Latn	hun_Latn	Hungarian	Latin	Uralic	–	High
hye_Armn	hye_Armn	Armenian	Armenian	Indo-European	Armenic	Low	Yerevan
ibo_Latn	ibo_Latn	Igbo	Latin	Atlantic-Congo	Benue-Congo	Low	Central
ilo_Latn	ilo_Latn	Ilocano	Latin	Austronesian	Malayo-Polynesian	Low
ind_Latn	ind_Latn	Indonesian	Latin	Austronesian	Malayo-Polynesian	High
isl_Latn	isl_Latn	Icelandic	Latin	Indo-European	Germanic	High
ita_Latn	ita_Latn	Italian	Latin	Indo-European	Italic	High
jav_Latn	jav_Latn	Javanese	Latin	Austronesian	Malayo-Polynesian	Low
jpn_Jpan	jpn_Jpan	Japanese	Japanese	Japonic	Japanesic	High
kab_Latn	kab_Latn	Kabyle	Latin	Afro-Asiatic	Berber	Low	North Eastern
kac_Latn	kac_Latn	Jingpho	Latin	Sino-Tibetan	Brahmaputran	Low
kam_Latn	kam_Latn	Kamba	Latin	Atlantic-Congo	Benue-Congo	Low	Machakos
kan_Knda	kan_Knda	Kannada	Kannada	Dravidian	South Dravidian	Low	Central
kas_Arab	kas_Arab	Kashmiri	Arabic	Indo-European	Indo-Aryan	Low	Kishtwari
kas_Deva	kas_Deva	Kashmiri	Devanagari	Indo-European	Indo-Aryan	Low	Kishtwari
kat_Geor	kat_Geor	Georgian	Georgian	Kartvelian	Georgian-Zan	Low	Kartlian
knc_Arab	knc_Arab	Central Kanuri	Arabic	Saharan	Western Saharan	Low	Yerwa
knc_Latn	knc_Latn	Central Kanuri	Latin	Saharan	Western Saharan	Low	Yerwa
kaz_Cyrl	kaz_Cyrl	Kazakh	Cyrillic	Turkic	Common Turkic	High
kbp_Latn	kbp_Latn	Kabiyè	Latin	Atlantic-Congo	North Volta-Congo	Low	Kɛ̀̀wɛ
kea_Latn	kea_Latn	Kabuverdianu	Latin	Indo-European	Italic	Low	Sotavento
khm_Khmr	khm_Khmr	Khmer	Khmer	Austroasiatic	Khmeric	Low	Central
kik_Latn	kik_Latn	Kikuyu	Latin	Atlantic-Congo	Benue-Congo	Low	Southern
kin_Latn	kin_Latn	Kinyarwanda	Latin	Atlantic-Congo	Benue-Congo	Low
kir_Cyrl	kir_Cyrl	Kyrgyz	Cyrillic	Turkic	Common Turkic	Low	Northern
kmb_Latn	kmb_Latn	Kimbundu	Latin	Atlantic-Congo	Benue-Congo	Low
kmr_Latn	kmr_Latn	Northern Kurdish	Latin	Indo-European	Iranian	Low
kon_Latn	kon_Latn	Kikongo	Latin	Atlantic-Congo	Benue-Congo	Low
kor_Hang	kor_Hang	Korean	Hangul	Koreanic	Korean	High
lao_Laoo	lao_Laoo	Lao	Lao	Tai-Kadai	Kam-Tai	Low	Vientiane
lij_Latn	lij_Latn	Ligurian	Latin	Indo-European	Italic	Low	Zeneise
lim_Latn	lim_Latn	Limburgish	Latin	Indo-European	Germanic	Low	Maastrichtian
lin_Latn	lin_Latn	Lingala	Latin	Atlantic-Congo	Benue-Congo	Low
lit_Latn	lit_Latn	Lithuanian	Latin	Indo-European	Balto-Slavic	High
lmo_Latn	lmo_Latn	Lombard	Latin	Indo-European	Italic	Low	Western
ltg_Latn	ltg_Latn	Latgalian	Latin	Indo-European	Balto-Slavic	Low	Central
ltz_Latn	ltz_Latn	Luxembourgish	Latin	Indo-European	Germanic	Low
lua_Latn	lua_Latn	Luba-Kasai	Latin	Atlantic-Congo	Benue-Congo	Low
lug_Latn	lug_Latn	Ganda	Latin	Atlantic-Congo	Benue-Congo	Low
luo_Latn	luo_Latn	Luo	Latin	Nilotic	Western Nilotic	Low
lus_Latn	lus_Latn	Mizo	Latin	Sino-Tibetan	Kuki-Chin-Naga	Low	Aizawl
lvs_Latn	lvs_Latn	Standard Latvian	Latin	Indo-European	Balto-Slavic	High
mag_Deva	mag_Deva	Magahi	Devanagari	Indo-European	Indo-Aryan	Low	Gaya
mai_Deva	mai_Deva	Maithili	Devanagari	Indo-European	Indo-Aryan	Low
mal_Mlym	mal_Mlym	Malayalam	Malayalam	Dravidian	South Dravidian	Low
mar_Deva	mar_Deva	Marathi	Devanagari	Indo-European	Indo-Aryan	Low	Varhadi
min_Arab	-	Minangkabau	Arabic	Austronesian	Malayo-Polynesian	Low	Agam-Tanah Datar
min_Latn	min_Latn	Minangkabau	Latin	Austronesian	Malayo-Polynesian	Low	Agam-Tanah Datar
mkd_Cyrl	mkd_Cyrl	Macedonian	Cyrillic	Indo-European	Balto-Slavic	High
plt_Latn	plt_Latn	Plateau Malagasy	Latin	Austronesian	Malayo-Polynesian	Low	Merina
mlt_Latn	mlt_Latn	Maltese	Latin	Afro-Asiatic	Semitic	High
mni_Beng	mni_Beng	Meitei	Bengali	Sino-Tibetan	Kuki-Chin-Naga	Low
khk_Cyrl	khk_Cyrl	Halh Mongolian	Cyrillic	Mongolic-Khitan	Mongolic	Low
mos_Latn	mos_Latn	Mossi	Latin	Atlantic-Congo	North Volta-Congo	Low	Ouagadougou
mri_Latn	mri_Latn	Maori	Latin	Austronesian	Malayo-Polynesian	Low	Waikato-Ngapuhi
mya_Mymr	mya_Mymr	Burmese	Myanmar	Sino-Tibetan	Burmo-Qiangic	Low	Mandalay-Yangon
nld_Latn	nld_Latn	Dutch	Latin	Indo-European	Germanic	High
nno_Latn	nno_Latn	Norwegian Nynorsk	Latin	Indo-European	Germanic	Low
nob_Latn	nob_Latn	Norwegian Bokmål	Latin	Indo-European	Germanic	Low
npi_Deva	npi_Deva	Nepali	Devanagari	Indo-European	Indo-Aryan	Low	Eastern
nso_Latn	nso_Latn	Northern Sotho	Latin	Atlantic-Congo	Benue-Congo	Low
nus_Latn	nus_Latn	Nuer	Latin	Nilotic	Western Nilotic	Low
nya_Latn	nya_Latn	Nyanja	Latin	Atlantic-Congo	Benue-Congo	Low
oci_Latn	oci_Latn	Occitan	Latin	Indo-European	Italic	Low
gaz_Latn	gaz_Latn	West Central Oromo	Latin	Afro-Asiatic	Cushitic	Low
ory_Orya	ory_Orya	Odia	Oriya	Indo-European	Indo-Aryan	Low	Baleswari (Northern)
pag_Latn	pag_Latn	Pangasinan	Latin	Austronesian	Malayo-Polynesian	Low
pan_Guru	pan_Guru	Eastern Panjabi	Gurmukhi	Indo-European	Indo-Aryan	Low	Majhi
pap_Latn	pap_Latn	Papiamento	Latin	Indo-European	Italic	Low	Römer-Maduro-Jonis
pes_Arab	pes_Arab	Western Persian	Arabic	Indo-European	Iranian	High
pol_Latn	pol_Latn	Polish	Latin	Indo-European	Balto-Slavic	High
por_Latn	por_Latn	Portuguese	Latin	Indo-European	Italic	High	Brazil
prs_Arab	prs_Arab	Dari	Arabic	Indo-European	Iranian	Low	Kabuli
pbt_Arab	pbt_Arab	Southern Pashto	Arabic	Indo-European	Iranian	Low	Literary
quy_Latn	quy_Latn	Ayacucho Quechua	Latin	Quechuan	Chinchay	Low	Southern Quechua
ron_Latn	ron_Latn	Romanian	Latin	Indo-European	Italic	High
run_Latn	run_Latn	Rundi	Latin	Atlantic-Congo	Benue-Congo	Low
rus_Cyrl	rus_Cyrl	Russian	Cyrillic	Indo-European	Balto-Slavic	High
sag_Latn	sag_Latn	Sango	Latin	Atlantic-Congo	North Volta-Congo	Low
san_Deva	san_Deva	Sanskrit	Devanagari	Indo-European	Indo-Aryan	Low
sat_Olck	sat_Beng	Santali	Ol Chiki	Austroasiatic	Mundaic	Low
scn_Latn	scn_Latn	Sicilian	Latin	Indo-European	Italic	Low	Literary Sicilian
shn_Mymr	shn_Mymr	Shan	Myanmar	Tai-Kadai	Kam-Tai	Low
sin_Sinh	sin_Sinh	Sinhala	Sinhala	Indo-European	Indo-Aryan	Low
slk_Latn	slk_Latn	Slovak	Latin	Indo-European	Balto-Slavic	High
slv_Latn	slv_Latn	Slovenian	Latin	Indo-European	Balto-Slavic	High
smo_Latn	smo_Latn	Samoan	Latin	Austronesian	Malayo-Polynesian	Low
sna_Latn	sna_Latn	Shona	Latin	Atlantic-Congo	Benue-Congo	Low
snd_Arab	snd_Arab	Sindhi	Arabic	Indo-European	Indo-Aryan	Low	Vicholi
som_Latn	som_Latn	Somali	Latin	Afro-Asiatic	Cushitic	Low	Nsom
sot_Latn	sot_Latn	Southern Sotho	Latin	Atlantic-Congo	Benue-Congo	High
spa_Latn	spa_Latn	Spanish	Latin	Indo-European	Italic	High	Latin American
als_Latn	als_Latn	Tosk Albanian	Latin	Indo-European	Albanian	High
srd_Latn	srd_Latn	Sardinian	Latin	Indo-European	Italic	Low	Logudorese and Campidanese
srp_Cyrl	srp_Cyrl	Serbian	Cyrillic	Indo-European	Balto-Slavic	Low
ssw_Latn	ssw_Latn	Swati	Latin	Atlantic-Congo	Benue-Congo	Low
sun_Latn	sun_Latn	Sundanese	Latin	Austronesian	Malayo-Polynesian	Low
swe_Latn	swe_Latn	Swedish	Latin	Indo-European	Germanic	High
swh_Latn	swh_Latn	Swahili	Latin	Atlantic-Congo	Benue-Congo	High	Kiunguja
szl_Latn	szl_Latn	Silesian	Latin	Indo-European	Balto-Slavic	Low
tam_Taml	tam_Taml	Tamil	Tamil	Dravidian	South Dravidian	Low	Chennai
tat_Cyrl	tat_Cyrl	Tatar	Cyrillic	Turkic	Common Turkic	Low	Central and Middle
tel_Telu	tel_Telu	Telugu	Telugu	Dravidian	South Dravidian	Low	Coastal
tgk_Cyrl	tgk_Cyrl	Tajik	Cyrillic	Indo-European	Iranian	Low
tgl_Latn	tgl_Latn	Tagalog	Latin	Austronesian	Malayo-Polynesian	High
tha_Thai	tha_Thai	Thai	Thai	Tai-Kadai	Kam-Tai	High
tir_Ethi	tir_Ethi	Tigrinya	Geʽez	Afro-Asiatic	Semitic	Low
taq_Latn	taq_Latn	Tamasheq	Latin	Afro-Asiatic	Berber	Low	Kal Ansar
taq_Tfng	taq_Tfng	Tamasheq	Tifinagh	Afro-Asiatic	Berber	Low	Kal Ansar
tpi_Latn	tpi_Latn	Tok Pisin	Latin	Indo-European	Germanic	Low
tsn_Latn	tsn_Latn	Tswana	Latin	Atlantic-Congo	Benue-Congo	High	Sehurutshe
tso_Latn	tso_Latn	Tsonga	Latin	Atlantic-Congo	Benue-Congo	Low
tuk_Latn	tuk_Latn	Turkmen	Latin	Turkic	Common Turkic	Low	Teke
tum_Latn	tum_Latn	Tumbuka	Latin	Atlantic-Congo	Benue-Congo	Low	Rumphi
tur_Latn	tur_Latn	Turkish	Latin	Turkic	Common Turkic	High
twi_Latn	twi_Latn	Twi	Latin	Atlantic-Congo	Kwa Volta-Congo	Low	Akuapem
tzm_Tfng	tzm_Tfng	Central Atlas Tamazight	Tifinagh	Afro-Asiatic	Berber	Low
uig_Arab	uig_Arab	Uyghur	Arabic	Turkic	Common Turkic	Low
ukr_Cyrl	ukr_Cyrl	Ukrainian	Cyrillic	Indo-European	Balto-Slavic	High
umb_Latn	umb_Latn	Umbundu	Latin	Atlantic-Congo	Benue-Congo	Low
urd_Arab	urd_Arab	Urdu	Arabic	Indo-European	Indo-Aryan	Low	Lashkari
uzn_Latn	uzn_Latn	Northern Uzbek	Latin	Turkic	Common Turkic	High
vec_Latn	vec_Latn	Venetian	Latin	Indo-European	Italic	Low	Venice
vie_Latn	vie_Latn	Vietnamese	Latin	Austroasiatic	Vietic	High
war_Latn	war_Latn	Waray	Latin	Austronesian	Malayo-Polynesian	Low	Tacloban
wol_Latn	wol_Latn	Wolof	Latin	Atlantic-Congo	North-Central Atlantic	Low	Dakkar
xho_Latn	xho_Latn	Xhosa	Latin	Atlantic-Congo	Benue-Congo	High	Ngqika
ydd_Hebr	ydd_Hebr	Eastern Yiddish	Hebrew	Indo-European	Germanic	Low	Hasidic
yor_Latn	yor_Latn	Yoruba	Latin	Atlantic-Congo	Benue-Congo	Low	Ọyọ and Ibadan
yue_Hant	yue_Hant	Yue Chinese	Han (Traditional)	Sino-Tibetan	Sinitic	Low
zho_Hans	zho_Hans	Chinese	Han (Simplified)	Sino-Tibetan	Sinitic	High
zho_Hant	zho_Hant	Chinese	Han (Traditional)	Sino-Tibetan	Sinitic	High
zsm_Latn	zsm_Latn	Standard Malay	Latin	Austronesian	Malayo-Polynesian	High
zul_Latn	zul_Latn	Zulu	Latin	Atlantic-Congo	Benue-Congo	High

Available speech encoders

lang_code	language	link
arb	ms arabic	download
asm	assamese	download
bel	belarusian	download
ben	bengali	download
bos	bosnian	download
bul	bulgarian	download
cat	catalan	download
ces	czech	download
cmn	mandarin chinese	download
cym	welsh	download
dan	danish	download
deu	german	download
est	estonian	download
fin	finnish	download
fra	french	download
guj	gujarati	download
heb	hebrew	download
hin	hindi	download
hrv	croatian	download
ind	indonesian	download
ita	italian	download
jpn	japanese	download
kan	kannada	download
kor	korean	download
lao	lao	download
lit	lithuanian	download
lvs	standard latvian	download
mal	malayalam	download
mar	marathi	download
mkd	macedonian	download
mlt	maltese	download
npi	nepali	download
nld	dutch	download
ory	odia	download
pan	punjabi	download
pes	western persian	download
pol	polish	download
por	portuguese	download
ron	romanian	download
rus	russian	download
slk	slovak	download
slv	slovenian	download
snd	sindhi	download
srp	serbian	download
spa	spanish	download
swe	swedish	download
swh	swahili	download
tam	tamil	download
tel	telugu	download
tgl	tagalog	download
tha	thai	download
tur	turkish	download
ukr	ukrainian	download
urd	urdu	download
uzn	northern uzbek	download
vie	vietnamese	download
yue	yue	download

Citation Information

Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:

@misc{Duquenne:2023:sonar_arxiv,
  author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
  title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
  publisher = {arXiv},
  year = {2023},
  url = {https://arxiv.org/abs/2308.11466},
}

Contributing

See the CONTRIBUTING file for how to help out.

License

SONAR code is released under the MIT license (see CODE_LICENSE).

Some SONAR models are released under the same MIT license, BUT BEWARE, some of them are released under a non-commercial license (see NC_MODEL_LICENSE). Please refer to LICENSE for the details.
