A multilingual compilation of open textual corpora across major and minor world languages, curated for accessibility and linguistic research. It includes links and metadata for publicly available, openly licensed (public-domain or CC), machine-readable datasets.
We identify well‑curated public‑domain text collections that already include rich metadata or linguistic annotation.
For example:
- The Perseus Digital Library provides large Greek (32M words) and Latin (16M words) corpora with morphological parsing and dictionary links (https://methods.clsinfra.io; https://wiki.digitalclassicist.org)
- The Open Greek and Latin project exposes the same Perseus texts in TEI XML with lemmas and POS tags.
- Similarly, Project Gutenberg offers a massive PD literary corpus (primarily English) (https://methods.clsinfra.io)
- Language‑specific projects like the Deutsches Textarchiv provide TEI‑encoded historical German texts with lemmatization (https://tei-c.org)
The table below summarizes key sources (public domain or openly licensed, except where noted) with their languages, genres, annotation, formats, licenses, and suitability for lexical research.
Source (Name & URL) | Languages | Types of works | Annotation Types | Format/Access | License/Reuse | Suitability |
---|---|---|---|---|---|---|
Open Greek & Latin (Perseus) – Greek corpus | Ancient Greek | Classical Greek literature (Homer, tragedians) | Lemmas, POS tags, morphological features [*][*] | TEI XML (Scaife, GitHub) | CC-BY-SA 4.0 (source texts PD) [*] | Richly annotated syntax & lexicon |
Open Greek & Latin (Perseus) – Latin corpus | Latin | Latin classical texts (Caesar, Cicero, Vergil) | Lemmas, POS tags, morphological features [*][*] | TEI XML (Scaife, GitHub) | CC-BY-SA 4.0 (source texts PD) [*] | Fully parsed Latin texts |
Ancient Greek Dependency Treebank | Ancient Greek | Homer, Hesiod, Aeschylus, etc. | Lemmas, POS tags, morphological codes; syntactic dependencies [*] | CoNLL-U / XML (download) | CC (open) [*] | Expert-annotated treebank |
Latin Dependency Treebank | Latin | Caesar, Cicero, Ovid, Vergil, etc. | Lemmas, POS tags, morphological codes; syntactic dependencies [*] | CoNLL-U / XML (download) | CC (open) [*] | Expert-annotated treebank |
Universal Dependencies (UD) | 150+ langs (incl. English, Latin, Greek, French, German) | Mixed corpora (news, Wikipedia, web) | Universal POS tags, morphological features, dependency parses, lemmas [*] | CoNLL-U (plaintext) | Varies by treebank (mostly CC-BY or CC-BY-SA) | Broad multilingual corpora with consistent annotation |
Project Gutenberg | Primarily English (also FR, DE, etc.) | Literature, philosophy, classics | None (raw text); header metadata (title, author, date) | Plain text, HTML, EPUB | Public Domain [*] | Massive PD corpus for English and other literatures |
Wikisource | Multilingual (EN, FR, DE, etc.) | Literature, historical and religious texts | Wiki markup; page metadata (titles, categories, authors) | Wiki XML dumps / HTML | CC-BY-SA | Crowdsourced collections of PD texts |
Oxford Text Archive (OTA) | Various (English, Latin, etc.) | Scholarly corpora (poetry, drama, prose) | TEI-encoded texts (structural+metadata tags: author, date, genre) | TEI XML, plain text | Varies by collection [*] | Curated humanities texts (check individual licenses) |
IntraText Digital Library | Latin (ancient to modern) | Published Latin works (literature, theology, etc.) | Full text with search/concordances; edition metadata [*] | HTML (search interface) | Free access (copyright respected) | Scholarly Latin corpus with lexicon links |
The Latin Library | Latin | Classical and medieval Latin literature | None (raw text) | HTML | Public domain (collected PD) [*] | Broad Latin text collection (no annotation) |
Bibliotheca Augustana | Greek, Latin (incl. medieval/Neo-Latin) | Selected classical/medieval texts | None (raw text) | HTML | PD academic collection [*] | Curated texts with simple interface (no annotation) |
Deutsches Textarchiv (DTA) | German (1600–1900) | Historical German literature (prose, poetry) | Tokenized, lemmatized; normalized orthography [*] | TEI-XML, HTML, TCF, plain text [*] | CC-BY (open) | Annotated diachronic German corpus [*] |
Each source in the table above is public-domain or openly licensed, apart from the entries whose License column says otherwise (IntraText, and some Oxford Text Archive collections), and is minimally processed. Roughly half of the entries include lemma and POS annotation or rich TEI markup; the rest are raw text with header or page metadata. The annotated sources can be readily searched for word contexts and used to build chronological lexica: for example, the Perseus corpora and dependency treebanks provide lemmatized, dated examples for Greek and Latin, while Gutenberg and Wikisource supply large quantities of dated literary usage (with author/date metadata). The combination of these resources covers a wide temporal span.
Sources: We prioritize academically curated collections [*]. Citations indicate where full details of each resource’s content and licensing are documented.
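The idea of a chronological lexicon built from lemmatized, dated examples can be sketched as follows. This is a minimal illustration, not the workflow of any project listed above: the `Attestation` record, the `earliest_attestations` helper, and the sample records (placeholder lemmas, years, and source names) are all hypothetical and would be replaced by data extracted from a real corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """One dated occurrence of a lemma (hypothetical record layout)."""
    lemma: str
    year: int      # year of the source text
    source: str    # identifier of the source text
    context: str   # the sentence or line containing the word


def earliest_attestations(records):
    """Return {lemma: Attestation}, keeping the earliest-dated record per lemma."""
    earliest = {}
    for rec in records:
        best = earliest.get(rec.lemma)
        if best is None or rec.year < best.year:
            earliest[rec.lemma] = rec
    return earliest


# Made-up placeholder records, for illustration only.
RECORDS = [
    Attestation("lemma_a", 1712, "text-2", "... lemma_a ..."),
    Attestation("lemma_a", 1650, "text-1", "... lemma_a ..."),
    Attestation("lemma_b", 1801, "text-3", "... lemma_b ..."),
]

lexicon = earliest_attestations(RECORDS)
for lemma in sorted(lexicon):
    first = lexicon[lemma]
    print(lemma, first.year, first.source)
```

In practice the quality of such a lexicon depends on the date metadata: a source with only an edition date rather than a composition date will push first attestations later than they should be.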
English: The Open American National Corpus (OANC) (https://anc.org/oanc) provides ~15 M words of modern American English (1990s–present) across diverse written (news, fiction, academic, web) and spoken genres (https://anc.org/). It is richly annotated with POS tags and lemmas (via ANC’s annotation pipeline) and can be delivered in XML/CoNLL-U formats using the ANC2Go tool (https://en.wikipedia.org/, https://anc.org/). The OANC is fully open/unrestricted (effectively CC0; https://anc.org/). Its Manually Annotated Sub-Corpus (MASC) (500K words) is balanced over 18+ genres and adds gold-standard annotations (tokenization, POS, lemma, syntax, named entities, coreference, discourse, etc.; https://anc.org/). MASC is licensed CC-BY 3.0 (US; https://anc.org/). These corpora are excellent for syntactic and semantic research. Other key English resources include Project Gutenberg (PD literature in many genres; plain text; PD/no restriction; https://gutenberg.org/policy/license.html) and English Wikipedia (encyclopedic articles; XML dumps; CC-BY-SA 4.0; https://en.wikipedia.org/wiki/Wikipedia:Database_download) for general lexical usage. The Leipzig Corpora Collection offers up to 1 M sentences of English (news and web text, sentence-aligned; plain text) with word co-occurrence stats (https://wortschatz.uni-leipzig.de/en/download) – data freely downloadable under CC-BY (https://wortschatz.uni-leipzig.de/en/usage).
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format / Access | License | Notes (suitability) |
---|---|---|---|---|---|---|
OANC (Open American National Corpus) | American English | 15M words; balanced modern genres (news, fiction, blogs, transcripts) | POS tags, lemmas, etc. (automatic) [*] | XML/CoNLL-U via ANC2Go [*] | CC0 (unrestricted) [*] | Good for lexical frequencies, collocations, grammar (chronological). |
MASC (Manually Annotated Sub-Corpus) | American English | 500K words; balanced 18+ genres (including speech, blogs, news) | Manual annotation: sentence, token, lemma, POS, NP/VP chunks, NER, Penn Treebank syntax, coreference, etc. [*] | XML/GrAF (coming); samples downloadable | CC-BY 3.0 US [*] | Gold-standard grammar and semantics – ideal for NLP, lexicography. |
Project Gutenberg | Many (English-centric) | Public-domain literature & nonfiction (classics, poetry, etc.) | None (plain text) | Plain text, ePub, HTML | Public domain [*] | Rich historical texts; useful for rare/archaic usage and large-scale corpora. |
Wikipedia (English) | English | Encyclopedia articles (all domains) | None (wiki markup; can extract text) | XML dumps | CC-BY-SA 4.0 [*] | Huge up-to-date lexicon; varied style and topics (user-generated). |
Leipzig Corpora (English) [*] | English | News and random web sentences (10K–1M) | None (pre-tokenized sentences; includes co-occurrence stats) | Plain text (sentences) | CC-BY 4.0 [*] | Balanced word-freq and collocation data (quick lookup tables). |
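Since Wikipedia is distributed as XML dumps, a first processing step is usually to stream page titles and wikitext out of the dump. The sketch below uses only the Python standard library and ignores XML namespaces, so it does not depend on a particular export schema version. The `iter_pages` helper and the tiny inline sample are illustrative assumptions; a real dump is far larger, is normally compressed, and still needs its wiki markup stripped afterwards.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki-style XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free memory used by the finished page


# Tiny hand-written sample; the namespace URI here is only an example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Some ''wiki'' markup.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Streaming with `iterparse` and clearing each finished page keeps memory use manageable, which matters because full dumps do not fit comfortably in memory.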
Spanish: Open Spanish corpora include Leipzig News Crawls, Wikipedia (es), and public-domain literature. The Leipzig Corpora (Spanish) (100K–1M sentences) cover newspaper and web text; plain sentences with frequency info (CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage). Spanish Wikipedia provides up-to-date encyclopedic text (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). Project Gutenberg (Spanish) offers PD literary classics. Together these form robust corpora for Spanish lexicography.
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format / Access | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (Spanish) [*] | Spanish | News articles and web sentences (10K–1M) | None | Plain text (sentences) | CC-BY 4.0 [*] | Good for frequency and collocation stats. |
Wikipedia (Spanish) | Spanish | Encyclopedic articles | None (wiki markup) | XML dumps | CC-BY-SA 4.0 [*] | Large modern lexicon across topics. |
Project Gutenberg (Spanish) | Spanish | PD literature and poetry | None | Plain text, ePub | Public domain [*] | Classic works (Cervantes, etc.) for literary language. |
French: Similarly, Leipzig French corpora (news/web; CC-BY) and French Wikipedia (CC-BY-SA) provide broad coverage. Project Gutenberg (French) and France’s Gallica offer PD classics and historical texts.
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (French) [*] | French | News articles, web (10K–1M) | None | Plain text | CC-BY 4.0 [*] | Balanced modern text (journals, blogs). |
Wikipedia (French) | French | Encyclopedia entries | None | XML dump | CC-BY-SA 4.0 [*] | General-purpose text, up to date. |
Project Gutenberg (French) | French | PD literature (Voltaire, Hugo) | None | Plain text | Public domain [*] | Classic literature for stylistic study. |
German: Open German corpora include Leipzig German (newspaper/web; CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage), German Wikipedia (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download), and German Project Gutenberg. The Deutsches Textarchiv (DTA) is a scholarly open-access collection of historical German texts (c. 1600–1900), and parliamentary corpora (e.g. transcripts of German parliamentary debates) may also be used.
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (German) [*] | German | News, web sentences (10K–1M) | None | Plain text | CC-BY 4.0 [*] | Modern usage frequency data. |
Wikipedia (German) | German | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 [*] | Large multi-domain text. |
Project Gutenberg (German) | German | PD classics (Goethe, Grimm) | None | Plain text | Public domain [*] | Standard literary language. |
Russian: The Leipzig Russian corpus (news/web; CC-BY) and Russian Wikipedia (CC-BY-SA) are open sources. The Open Russian National Corpus (ORNC) is in development, but the public Russian Wikipedia, PD literary texts, and Leipzig provide ample raw text. (The fully annotated Russian National Corpus exists but is not available for open download.)
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (Russian) [*] | Russian | News and web (10K–1M) | None | Plain text | CC-BY 4.0 [*] | Modern usage; large size. |
Wikipedia (Russian) | Russian | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 [*] | Broad vocabulary. |
Project Gutenberg (Russian) | Russian | PD literature (Tolstoy, Dost.) | None | Plain text | Public domain [*] | 19th-century classic lit. |
Other Indo-European Languages: For classical and less-common IE languages, open corpora exist: e.g. the Latin Library (latinlibrary.com) and Perseus Project provide extensive Latin/Greek texts (mostly PD). The OpenITI corpus offers open-access Persian and Arabic texts (incl. many medieval works; https://openiti.org/projects/OpenITI%20Corpus.html). UD treebanks cover dozens of languages (e.g. Old Norse, Welsh, Haitian, etc.) with POS/morphology annotations (CoNLL-U format; varying CC-BY/CC-BY-SA). These can seed lexicographic data where available.
Arabic: The Leipzig Arabic corpus (news/web; CC-BY) and Arabic Wikipedia (CC-BY-SA) are major open sources. In addition, OpenITI provides thousands of classical Arabic and Persian texts (liturgical and scholarly works; https://openiti.org/projects/OpenITI%20Corpus.html) – open-access and increasingly cleaned. The Quranic Arabic Corpus (grammar/lemmas) is annotated but carries a CC-BY-NC-SA license, so use primarily for reference (not included here). These corpora (especially wiki and news) cover modern MSA; classical texts enrich dictionary entries.
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (Arabic) [*] | Arabic | News and web (10K–1M) | None | Plain text | CC-BY 4.0 [*] | Modern MSA usage, word co-occurrences. |
Wikipedia (Arabic) | Arabic | Encyclopedic articles | None | XML dump | CC-BY-SA 4.0 [*] | General vocabulary, cross-domain. |
OpenITI (Arabic/Persian) | Arabic; Persian | Classical religious/philosophical works | Metadata (author, date) | TEI/XML | Open access (CC-BY-like) | Rich historical texts; some OCR errors. |
Hebrew: Open resources include Hebrew Bible (Tanakh) texts (PD) and the Hebrew Wikipedia (CC-BY-SA). The Bar-Ilan Responsa Project is scholarly but not open. UD Hebrew corpora (modern liturgical and wiki) exist with annotations (UD format, CC-BY). These can support lexicographic work in modern and Biblical Hebrew.
Other Afro-Asiatic Languages: For Amharic/Ge‘ez, no large open corpora are easily available. Somali Wikipedia and small Leipzig corpora may be used. In general, many Ethiopian languages rely on religious texts (often PD) and small parallel data.
Mandarin Chinese: Open corpora include Leipzig Chinese (CC-BY) and Chinese Wikipedia (CC-BY-SA) for modern Chinese. Classical Chinese is available via the Chinese Text Project (ancient texts; open-access but CC-BY-NC-SA – not strictly CC-BY). Modern annotated treebanks exist (UD Chinese, POS-tagged), and the CC-CEDICT dictionary is open (for reference). These corpora serve both modern usage and classical study.
Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
---|---|---|---|---|---|---|
Leipzig Corpora (Chinese) [*] | Chinese (Mandarin) | News and web (10K–1M) | None | Plain text | CC-BY 4.0 [*] | High-frequency modern Chinese. |
Wikipedia (Chinese) | Chinese | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 [*] | Wide topical coverage. |
Other Sino-Tibetan Languages: For Burmese, Tibetan, etc., open data is sparse. Wikipedia exists for Burmese (CC-BY-SA) and smaller Wikipedias for Tibetan, plus some scholarly parallel texts (e.g. Buddhist Tripiṭaka translations). UD has limited treebanks for Tibetan. For major Chinese languages/dialects (e.g. Cantonese), Leipzig Corpora include some web data. These should be listed under a broader “multilingual” note if used.
Tamil: The Tamil Virtual Academy and Project Madurai (https://projectmadurai.org) provide PD Tamil literary works (epics, poems) in UTF-8 text. Tamil Wikipedia (CC-BY-SA) and Leipzig Tamil corpora (from news/web) also exist. UD Tamil is available (CC-BY-SA). These textual resources cover both classical and contemporary Tamil usage.
Other Dravidian Languages: Malayalam, Telugu, Kannada have small Wikipedias and some UD treebanks. The Indian Language Corpora Initiative (ILCI) produced parallel corpora with open licensing (CC-BY) for Indian languages, useful for lexical entries.
Vietnamese: Leipzig Vietnamese (news/web; CC-BY) and Vietnamese Wikipedia (CC-BY-SA) are available. Vietnamese Parallel Corpora from OPUS (e.g. TED talks, Tatoeba) exist under CC licenses. These support modern language usage.
Thai: Thai Wikipedia (CC-BY-SA) is openly available, while the National Electronics and Computer Technology Center (NECTEC) corpus is not fully open. Leipzig includes small Thai corpora. Open parallel data (e.g. Tatoeba sentences, CC-BY) can supplement these.
Austronesian (Malay/Indonesian): Open corpora include Leipzig Indonesian/Malay and Wikipedia (MS, ID) under CC-BY-SA. The OSCAR web corpus and UD (e.g. Indonesian-GSD) are CC-BY-SA. These cover modern usage in Malay and Indonesian.
Finnish/Hungarian: The Leipzig Corpora include Finnish and Hungarian (CC-BY). The Finnish and Hungarian Wikipedias (fi, hu) provide broad text, and UD treebanks exist for both languages. These resources support standard lexical work in these languages.
Turkish: Leipzig Turkish and Wikipedia Turkish (CC-BY-SA) are open. The Turkish National Corpus is not open, but news archives (e.g. open news articles) and historical texts may be used. The UD Turkish treebank is CC-BY.
Other Altaic/Caucasian Languages: Open corpora for Uzbek, Azerbaijani (Leipzig, Wikipedia) and Caucasian languages (e.g. Georgian Wikipedia) are limited but exist. Notably, open glossed corpora have been published for indigenous Siberian languages (Selkup, Nganasan, Kamas, Dolgan) with linguistic annotation (https://copius.univie.ac.at/). These specialized corpora (accessible via inel.corpora.uni-hamburg.de) are rare examples of endangered-language resources.
Swahili: Swahili Wikipedia (CC-BY-SA) and Leipzig Swahili corpus (news; CC-BY) are key. The UD Swahili treebank (CC-BY) also exists. These support lexical analysis of modern Swahili.
Yoruba, Zulu, etc.: Wikipedias exist (Yoruba, Zulu, etc., CC-BY-SA). Leipzig has some corpora (e.g. Yoruba). PanAfrican Parallel Corpora (Paracrawl) include news and the UDHR. These can provide base texts, though limited.
Other African Languages: Many African languages lack large corpora. Parallel religious texts (Bible, UDHR) in local languages can offer seed data. Efforts like Masakhane (MT dataset) are promising but often custom-licensing (not CC-BY).
Global: The Spanish, Portuguese, and English Wikipedias cover the major colonial languages. The Leipzig Corpora include Spanish and Portuguese (Brazilian), among others. Latin American historical texts (PD Spanish/Portuguese literature) and US/Canadian legal texts (PD) are available, but not as unified corpora.
Indigenous Languages: UD provides treebanks for some (Quechua, Guarani, Nahuatl, Navajo, etc.) with CC-BY-SA. Small collections exist (e.g. collections of folk tales, bilingual corpora). The Universal Declaration of Human Rights in 500+ languages (UN translations) is PD and a handy parallel snippet for rare languages. Tools like Universal Dependencies (https://universaldependencies.org/) and Leipzig (https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage) can be leveraged for others.
Several resources span many languages:
- Leipzig Corpora Collection: Provides monolingual corpora (10K–1M sentences) for 100+ languages (including major and many minority languages; https://wortschatz.uni-leipzig.de/en/download). Texts are drawn from news and the web; formats are plain-text sentences. All downloads are CC-BY 4.0 (https://wortschatz.uni-leipzig.de/en/usage).
- Wikipedia (dumps): >300 languages; encyclopedic articles (all licensed CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). It is a huge multilingual corpus (though quality varies).
- Universal Dependencies (UD): Over 200 treebanks in 150+ languages (Indo-European, Uralic, Altaic, Sino-Tibetan, etc.; https://universaldependencies.org/), annotated for POS, morphology, and dependencies in CoNLL-U format. Licenses vary by treebank (mostly CC-BY or CC-BY-SA). UD provides structured, annotated data across families.
- PubMed Central Open Access Subset (PMC OA): Millions of English biomedical articles (full text XML) under liberal (often CC-BY) licenses (https://pmc.ncbi.nlm.nih.gov/tools/openftlist/). This is an authoritative corpus of scientific prose, suitable for technical vocabulary.
- Project Gutenberg & Wikisource (multilingual): Multi-language texts (many European classics in the public domain), usually plain text or ePub, virtually no restrictions (https://gutenberg.org/policy/license.html).
- Parallel Bible Corpora: While not strictly CC (e.g. JW300), Bible translations in hundreds of languages are broadly PD (e.g. King James, Septuagint) and can be used for basic lexicography in underserved languages.
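For the UD treebanks listed above, the CoNLL-U format is simple enough to read without extra dependencies: each token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines start with `#`, and a blank line ends a sentence. The following is a minimal sketch; the `read_conllu` helper and the three-token sample are illustrative, and dedicated parsing libraries may be preferable for production use.

```python
from collections import Counter


def read_conllu(lines):
    """Yield sentences as lists of (form, lemma, upos) tuples."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                 # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):     # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10:          # skip malformed lines
            continue
        token_id = cols[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:                     # file may lack a trailing blank line
        yield sentence


# Hand-written sample in CoNLL-U layout (not taken from a real treebank).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemma_counts = Counter(
    (lemma, upos) for sent in sentences for _form, lemma, upos in sent
)
print(lemma_counts.most_common())
```

Counting `(lemma, UPOS)` pairs rather than bare word forms is what makes UD data directly useful for headword lists, since inflected forms are already grouped under their lemma.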
Most corpora are available in plain-text or XML formats and include at least basic metadata (author/date) when provided. They are suitable for building lexicons (word lists, frequency), etymological studies (via cross-language parallels), and semantic fields (through usage contexts). Priority should be given to sources cited above with open licenses (PD or CC-BY/CC-BY-SA) and minimal cleaning needs, ensuring authoritative coverage (classic literature, major newspapers, official documents) across linguistic families.
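As a rough illustration of the word-list and frequency use case, the sketch below counts lowercased word tokens from sentence-per-line text. It assumes each line is either a bare sentence or an `id<TAB>sentence` pair; that layout, the `word_frequencies` helper, and the two sample sentences are assumptions made for the example, so the actual file layout of any given download should be checked first.

```python
import re
from collections import Counter

# \w+ matches runs of letters/digits in any script (Unicode-aware for str).
TOKEN_RE = re.compile(r"\w+")


def word_frequencies(lines):
    """Count lowercased word tokens in sentence-per-line text."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Assumed layout: "<id>\t<sentence>"; otherwise use the whole line.
        _id, sep, text = line.partition("\t")
        if not sep:
            text = line
        counts.update(tok.lower() for tok in TOKEN_RE.findall(text))
    return counts


# Made-up sample sentences, for illustration only.
sample = ["1\tThe cat saw the dog.", "2\tThe dog ran."]
freq = word_frequencies(sample)
print(freq.most_common(2))  # [('the', 3), ('dog', 2)]
```

Whitespace-and-punctuation tokenization of this kind is only adequate for languages that separate words with spaces; Chinese and Thai text, for example, would need a proper word segmenter before counting.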
Sources: Authoritative corpora and documentation for each language (as cited above) were used to compile this list (https://anc.org/, https://gutenberg.org/policy/license.html, https://en.wikipedia.org/wiki/Wikipedia:Database_download, https://wortschatz.uni-leipzig.de/en/download, https://wortschatz.uni-leipzig.de/en/usage, https://copius.univie.ac.at/, https://pmc.ncbi.nlm.nih.gov/tools/openftlist/), prioritizing open licenses and finished texts.
Dictionary / Project | Languages | Type | Format / Access | License | Notes (coverage, quality) |
---|---|---|---|---|---|
Wiktionary | 4,400+ [*] (all major languages) | Collaborative multilingual (definitions, translations) | Web (MediaWiki); data dumps, API | CC BY-SA 4.0 [*] | around 8.4M entries [*], includes etymologies, pronunciation, translations. |
GCIDE (GNU Collab. Int. Dict. of English) | English | Historical monolingual (Webster 1913 + WordNet) | Download (tar.gz text) | GPLv3+ [*] | Derived from Webster’s 1913 [*]; includes supplemental WordNet definitions. |
Open English WordNet | English | Semantic lexicon (synset database) | XML/RDF download | CC BY 4.0 [*] | WordNet-style network (synonyms, hypernyms) derived from Princeton WordNet. |
Bueno Spanish–English Dictionary (XML) | Spanish ↔ English | Bilingual (modern) | Download (TEI-XML) | Apache 2.0 [*] | ~58K entries; high-quality, manually curated dataset [*]. |
FreeDict | around 45 languages (e.g. Afrikaans, Arabic, Breton) [*] | Bilingual (various pairs) | Download (TEI XML; StarDict, etc.) | GPL [*] | 140+ bilingual dictionaries (≈45 languages) [*]. Offline lookup, corpus-compatible format. |
CC-CEDICT | Chinese (Simplified, Traditional) ↔ English | Bilingual (learning dictionary) | Download (UTF-8 text) | CC BY-SA 4.0 [*] | ~123K entries (2025) [*]; includes pinyin. Widely used in apps and studies. |
Lane’s Arabic–English Lexicon (1863–93) | Classical Arabic ↔ English | Historical lexicon | Text scans (DjVu/TXT OCR) on Internet Archive | Public Domain [*] | 8 volumes; exhaustive classical Arabic dictionary (derived from earlier Kāmūs). |
Latin WordNet | Latin | Semantic lexicon (WordNet) | JSON/API (REST) | CC BY-SA 4.0 [*] | ~70,000 words (archaic to medieval) [*]; modern online WordNet for Latin. |
OpenWordnet-PT | Portuguese | Semantic lexicon (WordNet) | RDF/JSON download | CC BY 4.0 [*] | Open Portuguese WordNet; linked concepts and glosses in Portuguese. |
JMdict (Japanese-Multilingual) | Japanese ↔ English, plus French/Ger/Rus… | Multilingual (bilingual entries) | Download (XML) | CC BY-SA 4.0 [*] | Expanded EDICT; ~200K entries with translations into multiple languages. |
PanLex | 5,000+ languages (broad “global lexicon”) | Multilingual lexicon | Download (CSV/JSON snapshots) | CC0 1.0 Universal [*] | Panlingual database of translations; includes many minority and endangered languages. |
Apertium Dictionaries | 20+ language pairs (e.g. Catalan‑Sp., Asturian‑Sp., etc.) [*] | Bilingual (MT lexicons) | Download (XML on GitHub) | GPL (majority) / CC BY-SA [*] | Rule-based MT dictionaries; open-source lexical data for many languages. |
Sources: Authoritative project pages and documentation for each resource (https://en.wiktionary.org/wiki/Wiktionary:Main_Page, https://gcide.gnu.org.ua/license, https://gcide.gnu.org.ua/, https://en-word.net/, https://github.com/mananoreboton/en-es-en-Dic, https://freedict.org/, https://tei-c.org/activities/projects/freedict/, https://www.mdbg.net/chinese/dictionary?page=cedict, https://archive.org/details/ArabicEnglishLexicon.CopiousEasternSources.EnlargedSuppl.Kamoos.Lane.Poole.1863, https://latinwordnet.exeter.ac.uk/, https://github.com/own-pt/openWordnet-PT, https://blog.okfn.org/2009/07/21/open-dictionary-databases-an-overview/, https://panlex.org/license/), which detail formats, coverage, and licensing.
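Plain-text dictionaries such as CC-CEDICT from the table above can be loaded with a few lines of code. The sketch below assumes the commonly documented one-entry-per-line layout, `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`, with `#` marking comment lines; the `parse_cedict_line` helper is hypothetical and the sample lines are written in that layout for illustration rather than copied from a specific release.

```python
import re

# Traditional, simplified, [pinyin], then slash-delimited glosses.
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Parse one dictionary line; return a dict, or None for comments/bad lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "glosses": glosses.split("/"),
    }


# Illustrative lines in the assumed layout.
sample_lines = [
    "# comment line",
    "中國 中国 [Zhong1 guo2] /China/",
    "你好 你好 [ni3 hao3] /hello/hi/",
]

entries = [e for e in map(parse_cedict_line, sample_lines) if e is not None]
for entry in entries:
    print(entry["simplified"], entry["pinyin"], entry["glosses"])
```

Returning `None` for lines that do not match, instead of raising, keeps a bulk load running when a release contains a header or an irregular entry; the skipped lines can be logged separately if completeness matters.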