OpenCorpus

A multilingual compilation of open-source textual corpora across major and minor world languages, curated for accessibility and linguistic research. It includes links and metadata for publicly available, CC-licensed, and machine-readable datasets.

We identify well‑curated public‑domain text collections that already include rich metadata or linguistic annotation.

For example:

The Perseus Digital Library provides large Greek (32M words) and Latin (16M words) corpora with morphological parsing and dictionary links (https://methods.clsinfra.io; https://wiki.digitalclassicist.org).
The Open Greek and Latin project exposes the same Perseus texts in TEI XML with lemmas and POS tags.
Similarly, Project Gutenberg offers a massive public-domain (PD) literary corpus, primarily English (https://methods.clsinfra.io).
Language‑specific projects like the Deutsches Textarchiv provide TEI‑encoded historical German texts with lemmatization (https://tei-c.org).

The table below summarizes key sources (public domain or openly licensed) with their languages, genres, annotation, formats, licenses, and suitability for lexical research.

Source (Name & URL)	Languages	Types of works	Annotation Types	Format/Access	License/Reuse	Suitability
Open Greek & Latin (Perseus) – Greek corpus	Ancient Greek	Classical Greek literature (Homer, tragedians)	Lemmas, POS tags, morphological features	TEI XML (Scaife, GitHub)	CC-BY-SA 4.0 (PD) [*]	Richly annotated syntax & lexicon
Open Greek & Latin (Perseus) – Latin corpus	Latin	Latin classical texts (Caesar, Cicero, Vergil)	Lemmas, POS tags, morphological features	TEI XML (Scaife, GitHub)	CC-BY-SA 4.0 (PD) [*]	Fully parsed Latin texts
Ancient Greek Dependency Treebank	Ancient Greek	Homer, Hesiod, Aeschylus, etc.	Lemmas, POS tags, morphological codes; syntactic dependencies [*]	CoNLL-U / XML (download)	CC (open) [*]	Expert-annotated treebank
Latin Dependency Treebank	Latin	Caesar, Cicero, Ovid, Vergil, etc.	Lemmas, POS tags, morphological codes; syntactic dependencies [*]	CoNLL-U / XML (download)	CC (open) [*]	Expert-annotated treebank
Universal Dependencies (UD)	150+ langs (incl. English, Latin, Greek, French, German)	Mixed corpora (news, Wikipedia, web)	Universal POS tags, morphological features, dependency parses, lemmas [*]	CoNLL-U (plaintext)	CC-BY (treebanks)	Broad multilingual corpora with consistent annotation
Project Gutenberg	Primarily English (also FR, DE, etc.)	Literature, philosophy, classics	None (raw text); header metadata (title, author, date)	Plain text, HTML, EPUB	Public Domain [*]	Massive PD corpus for English and other literatures
Wikisource	Multilingual (EN, FR, DE, etc.)	Literature, historical and religious texts	Wiki markup; page metadata (titles, categories, authors)	Wiki XML dumps / HTML	CC-BY-SA	Crowdsourced collections of PD texts
Oxford Text Archive (OTA)	Various (English, Latin, etc.)	Scholarly corpora (poetry, drama, prose)	TEI-encoded texts (structural+metadata tags: author, date, genre)	TEI XML, plain text	Varies by collection [*]	Curated humanities texts (check individual licenses)
IntraText Digital Library	Latin (ancient to modern)	Published Latin works (literature, theology, etc.)	Full text with search/concordances; edition metadata [*]	HTML (search interface)	Free access (copyright respected)	Scholarly Latin corpus with lexicon links
The Latin Library	Latin	Classical and medieval Latin literature	None (raw text)	HTML	Public domain (collected PD) [*]	Broad Latin text collection (no annotation)
Bibliotheca Augustana	Greek, Latin (incl. medieval/Neo-Latin)	Selected classical/medieval texts	None (raw text)	HTML	PD academic collection [*]	Curated texts with simple interface (no annotation)
Deutsches Textarchiv (DTA)	German (1600–1900)	Historical German literature (prose, poetry)	Tokenized, lemmatized; normalized orthography [*]	TEI-XML, HTML, TCF, plain text [*]	CC-BY (open)	Annotated diachronic German corpus [*]

Each source in Table 1 is public-domain (or CC‑licensed) and minimally processed. Most entries include lemma and POS annotations or rich TEI markup. This means they can be readily searched for word contexts and used to build chronological lexica: for example, the Perseus corpora and dependency treebanks provide lemmatized, dated examples for Greek and Latin, while Gutenberg and Wikisource supply large quantities of dated literary usage (with author/date metadata). The combination of these resources covers a wide temporal span.
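
As a rough illustration of how lemma and POS annotation in TEI XML can be searched, the sketch below counts (lemma, POS) pairs in a small inline sample. The `<w lemma="..." pos="...">` encoding, the sample text, and the function name `lemma_counts` are assumptions made for this example; real editions differ in how (and whether) individual tokens are marked up, so the XPath would need adjusting per source.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI documents normally live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: token-level <w> elements with lemma/pos attributes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w lemma="arma" pos="NOUN">arma</w>
    <w lemma="vir" pos="NOUN">virum</w>
    <w lemma="cano" pos="VERB">cano</w>
    <w lemma="vir" pos="NOUN">viri</w>
  </p></body></text>
</TEI>"""


def lemma_counts(tei_xml: str) -> Counter:
    """Count (lemma, pos) pairs over all <w> elements that carry a lemma."""
    root = ET.fromstring(tei_xml)
    counts: Counter = Counter()
    for w in root.iterfind(".//tei:w", TEI_NS):
        lemma = w.get("lemma")
        if lemma:
            counts[(lemma, w.get("pos", ""))] += 1
    return counts


if __name__ == "__main__":
    for (lemma, pos), n in lemma_counts(SAMPLE).most_common():
        print(lemma, pos, n)
```

Pairing such counts with the date metadata of each edition is one possible route to the chronological lexica mentioned above.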

Sources: We prioritize academically curated collections [*]. Citations indicate where full details of each resource’s content and licensing are documented.

Indo-European Languages

English: The Open American National Corpus (OANC) (https://anc.org/oanc) provides ~15 M words of modern American English (1990s–present) across diverse written (news, fiction, academic, web) and spoken genres (https://anc.org/). It is richly annotated with POS tags and lemmas (via ANC’s annotation pipeline) and can be delivered in XML/CoNLL-U formats using the ANC2Go tool (https://en.wikipedia.org/, https://anc.org/). The OANC is fully open/unrestricted (effectively CC0; https://anc.org/). Its Manually Annotated Sub-Corpus (MASC) (500K words) is balanced over 18+ genres and adds gold-standard annotations (tokenization, POS, lemma, syntax, named entities, coreference, discourse, etc.; https://anc.org/). MASC is licensed CC-BY 3.0 (US; https://anc.org/). These corpora are excellent for syntactic and semantic research. Other key English resources include Project Gutenberg (PD literature in many genres; plain text; PD/no restriction; https://gutenberg.org/policy/license.html) and English Wikipedia (encyclopedic articles; XML dumps; CC-BY-SA 4.0; https://en.wikipedia.org/wiki/Wikipedia:Database_download) for general lexical usage. The Leipzig Corpora Collection offers up to 1 M sentences of English (news and web text, sentence-aligned; plain text) with word co-occurrence stats (https://wortschatz.uni-leipzig.de/en/download) – data freely downloadable under CC-BY (https://wortschatz.uni-leipzig.de/en/usage).

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format / Access	License	Notes (suitability)
OANC (Open American National Corpus)	American English	15M words; balanced modern genres (news, fiction, blogs, transcripts)	POS tags, lemmas, etc. (automatic) [*]	XML/CoNLL-U via ANC2Go [*]	CC0 (unrestricted) [*]	Good for lexical frequencies, collocations, grammar (chronological).
MASC (Manually Annotated Sub-Corpus)	American English	500K words; balanced 18+ genres (including speech, blogs, news)	Manual annotation: sentence, token, lemma, POS, NP/VP chunks, NER, Penn Treebank syntax, coreference, etc. [*]	XML/GrAF (coming); samples downloadable	CC-BY 3.0 US [*]	Gold-standard grammar and semantics – ideal for NLP, lexicography.
Project Gutenberg	Many (English-centric)	Public-domain literature & nonfiction (classics, poetry, etc.)	None (plain text)	Plain text, ePub, HTML	Public domain [*]	Rich historical texts; useful for rare/archaic usage and large-scale corpora.
Wikipedia (English)	English	Encyclopedia articles (all domains)	None (wiki markup; can extract text)	XML dumps	CC-BY-SA 4.0 [*]	Huge up-to-date lexicon; varied style and topics (user-generated).
Leipzig Corpora (English) [*]	English	News and random web sentences (10K–1M)	None (pre-tokenized sentences; includes co-occurrence stats)	Plain text (sentences)	CC-BY 4.0 [*]	Balanced word-freq and collocation data (quick lookup tables).
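
A minimal sketch of deriving word frequencies from a sentence-per-line corpus file such as the Leipzig downloads. It assumes each line holds a numeric ID, a tab, and one sentence; that layout, the sample lines, and the simple `\w+` tokenizer are assumptions for illustration and should be checked against the actual archive contents.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

TOKEN = re.compile(r"\w+")  # crude tokenizer; adequate only for a first look


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sentence part of 'id<TAB>sentence' lines, skipping others."""
    for line in lines:
        line = line.rstrip("\n")
        _id, sep, sentence = line.partition("\t")
        if sep and sentence:
            yield sentence


def word_frequencies(lines: Iterable[str]) -> Counter:
    """Lower-cased token counts over all sentences."""
    counts: Counter = Counter()
    for sentence in iter_sentences(lines):
        counts.update(tok.lower() for tok in TOKEN.findall(sentence))
    return counts


# Hypothetical sample in the assumed layout.
SAMPLE_LINES = [
    "1\tThe cat sat on the mat.\n",
    "2\tThe dog barked.\n",
]

if __name__ == "__main__":
    print(word_frequencies(SAMPLE_LINES).most_common(3))
```

For a real file, the same functions can be fed an open file handle, for example `word_frequencies(open(path, encoding="utf-8"))`, where `path` is whatever sentence file was extracted from the download.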

Spanish: Open Spanish corpora include Leipzig News Crawls, Wikipedia (es), and public-domain literature. The Leipzig Corpora (Spanish) (100K–1M sentences) cover newspaper and web text; plain sentences with frequency info (CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage). Spanish Wikipedia provides up-to-date encyclopedic text (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). Project Gutenberg (Spanish) offers PD literary classics. Together these form robust corpora for Spanish lexicography.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format / Access	License	Notes
Leipzig Corpora (Spanish) [*]	Spanish	News articles and web sentences (10K–1M)	None	Plain text (sentences)	CC-BY 4.0 [*]	Good for frequency and collocation stats.
Wikipedia (Spanish)	Spanish	Encyclopedic articles	None (wiki markup)	XML dumps	CC-BY-SA 4.0 [*]	Large modern lexicon across topics.
Project Gutenberg (Spanish)	Spanish	PD literature and poetry	None	Plain text, ePub	Public domain [*]	Classic works (Cervantes, etc.) for literary language.
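
Wikipedia XML dumps are large, so they are usually streamed rather than loaded whole. The sketch below iterates over `<page>` elements and yields each title with its raw wiki markup. The inline sample and the helper names are assumptions for this example; the namespace version in real dumps varies, which is why the code matches on local tag names only. It does not strip wiki markup.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Optional, Tuple


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[Optional[str], str]]:
    """Stream (title, wikitext) pairs without holding the whole dump in memory."""
    title: Optional[str] = None
    text = ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


# Hypothetical two-page sample shaped like a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Gato</title><revision><text>El '''gato''' es un felino.</text></revision></page>
  <page><title>Perro</title><revision><text>El perro es un canino.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

Dumps are typically distributed compressed; assuming a bzip2 file, the stream could come from `bz2.open(path, "rb")` in the standard library instead of `io.BytesIO`.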

French: Similarly, Leipzig French corpora (news/web; CC-BY) and French Wikipedia (CC-BY-SA) provide broad coverage. Project Gutenberg (French) and France’s Gallica offer PD classics and historical texts.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format	License	Notes
Leipzig Corpora (French) [*]	French	News articles, web (10K–1M)	None	Plain text	CC-BY 4.0 [*]	Balanced modern text (journals, blogs).
Wikipedia (French)	French	Encyclopedia entries	None	XML dump	CC-BY-SA 4.0 [*]	General-purpose text, up to date.
Project Gutenberg (French)	French	PD literature (Voltaire, Hugo)	None	Plain text	Public domain [*]	Classic literature for stylistic study.

German: Open German corpora include Leipzig German (newspaper/web; CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage), German Wikipedia (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download), and German Project Gutenberg. The Deutsches Textarchiv (DTA) is a scholarly open-access collection of historical German texts (roughly 1600–1900), and parliamentary corpora (e.g. proceedings of the German Bundestag) may also be used.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format	License	Notes
Leipzig Corpora (German) [*]	German	News, web sentences (10K–1M)	None	Plain text	CC-BY 4.0 [*]	Modern usage frequency data.
Wikipedia (German)	German	Encyclopedia articles	None	XML dump	CC-BY-SA 4.0 [*]	Large multi-domain text.
Project Gutenberg (German)	German	PD classics (Goethe, Grimm)	None	Plain text	Public domain [*]	Standard literary language.

Russian: The Leipzig Russian corpus (news/web; CC-BY) and Russian Wikipedia (CC-BY-SA) are open sources. The Open Russian National Corpus (ORNC) is in development; in the meantime, Russian Wikipedia, PD literary texts, and Leipzig provide ample raw text. The fully annotated Russian National Corpus (RNC) exists but is not available for open download.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format	License	Notes
Leipzig Corpora (Russian) [*]	Russian	News and web (10K–1M)	None	Plain text	CC-BY 4.0 [*]	Modern usage; large size.
Wikipedia (Russian)	Russian	Encyclopedia articles	None	XML dump	CC-BY-SA 4.0 [*]	Broad vocabulary.
Project Gutenberg (Russian)	Russian	PD literature (Tolstoy, Dost.)	None	Plain text	Public domain [*]	19th-century classic lit.

Other Indo-European Languages: For classical and less-common IE languages, open corpora exist: e.g. the Latin Library (latinlibrary.com) and Perseus Project provide extensive Latin/Greek texts (mostly PD). The OpenITI corpus offers open-access Persian and Arabic texts (incl. many medieval works; https://openiti.org/projects/OpenITI%20Corpus.html). UD treebanks cover dozens of languages (e.g. Old Norse, Welsh, Haitian, etc.) with POS/morphology annotations (CoNLL-U format; varying CC-BY/CC-BY-SA). These can seed lexicographic data where available.

Afro-Asiatic Languages

Arabic: The Leipzig Arabic corpus (news/web; CC-BY) and Arabic Wikipedia (CC-BY-SA) are major open sources. In addition, OpenITI provides thousands of classical Arabic and Persian texts (liturgical and scholarly works; https://openiti.org/projects/OpenITI%20Corpus.html) – open-access and increasingly cleaned. The Quranic Arabic Corpus (grammar/lemmas) is annotated but carries a CC-BY-NC-SA license, so use primarily for reference (not included here). These corpora (especially wiki and news) cover modern MSA; classical texts enrich dictionary entries.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format	License	Notes
Leipzig Corpora (Arabic) [*]	Arabic	News and web (10K–1M)	None	Plain text	CC-BY 4.0 [*]	Modern MSA usage, word co-occurrences.
Wikipedia (Arabic)	Arabic	Encyclopedic articles	None	XML dump	CC-BY-SA 4.0 [*]	General vocabulary, cross-domain.
OpenITI (Arabic/Persian)	Arabic; Persian	Classical religious/philosophical works	Metadata (author, date)	TEI/XML	Open access (CC-BY-like)	Rich historical texts; some OCR errors.

Hebrew: Open resources include Hebrew Bible (Tanakh) texts (PD) and the Hebrew Wikipedia (CC-BY-SA). The Bar-Ilan Responsa Project is scholarly but not open. UD Hebrew corpora (modern liturgical and wiki) exist with annotations (UD format, CC-BY). These can support lexicographic work in modern and Biblical Hebrew.

Other Afro-Asiatic Languages: For Amharic/Ge‘ez, no large open corpora are easily available. Somali Wikipedia and small Leipzig corpora may be used. In general, many Ethiopian languages rely on religious texts (often PD) and small parallel data.

Sino-Tibetan and East Asian Languages

Mandarin Chinese: Open corpora include Leipzig Chinese (CC-BY) and Chinese Wikipedia (CC-BY-SA) for modern Chinese. Classical Chinese is available via the Chinese Text Project (ancient texts; open-access but CC-BY-NC-SA – not strictly CC-BY). Modern annotated treebanks exist (UD Chinese, POS-tagged), and the CC-CEDICT dictionary is open (for reference). These corpora serve both modern usage and classical study.

Corpus / Resource	Language(s)	Genres / Text Types	Annotation / Metadata	Format	License	Notes
Leipzig Corpora (Chinese) [*]	Chinese (Mandarin)	News and web (10K–1M)	None	Plain text	CC-BY 4.0 [*]	High-frequency modern Chinese.
Wikipedia (Chinese)	Chinese	Encyclopedia articles	None	XML dump	CC-BY-SA 4.0 [*]	Wide topical coverage.

Other Sino-Tibetan Languages: For Burmese, Tibetan, etc., open data is sparse. Wikipedia exists for Burmese (CC-BY-SA) and smaller Wikipedias for Tibetan, plus some scholarly parallel texts (e.g. Buddhist Tripiṭaka translations). UD has limited treebanks for Tibetan. For major Chinese languages/dialects (e.g. Cantonese), Leipzig Corpora include some web data. These should be listed under a broader “multilingual” note if used.

Dravidian Languages

Tamil: The Tamil Virtual Academy and Project Madurai (https://projectmadurai.org) provide PD Tamil literary works (epics, poems) in UTF-8 text. Tamil Wikipedia (CC-BY-SA) and Leipzig Tamil corpora (from news/web) also exist. UD Tamil is available (CC-BY-SA). These textual resources cover both classical and contemporary Tamil usage.

Other Dravidian Languages: Malayalam, Telugu, Kannada have small Wikipedias and some UD treebanks. The Indian Language Corpora Initiative (ILCI) produced parallel corpora with open licensing (CC-BY) for Indian languages, useful for lexical entries.

Austro-Asiatic and Austronesian Languages

Vietnamese: Leipzig Vietnamese (news/web; CC-BY) and Vietnamese Wikipedia (CC-BY-SA) are available. Vietnamese Parallel Corpora from OPUS (e.g. TED talks, Tatoeba) exist under CC licenses. These support modern language usage.

Thai: Thai Wikipedia (CC-BY-SA) is openly available, while the National Electronics and Computer Technology Center (NECTEC) corpus is not fully open. Leipzig includes small Thai corpora. An open parallel corpus (e.g. Tatoeba sentences, CC-BY 4.0) can supplement these.

Austronesian (Malay/Indonesian): Open corpora include Leipzig Indonesian/Malay and Wikipedia (MS, ID) under CC-BY-SA. The OSCAR web corpus and UD (e.g. Indonesian-GSD) are CC-BY-SA. These cover modern usage in Malay and Indonesian.

Uralic and Altaic Languages

Finnish/Hungarian: The Leipzig Corpora include Finnish and Hungarian (CC-BY). The Finnish and Hungarian Wikipedias (fi, hu) provide broad text, and UD treebanks exist for both languages. These resources support standard lexical work in these languages.

Turkish: Leipzig Turkish and Turkish Wikipedia (CC-BY-SA) are open. The Turkish National Corpus is not open, but openly available news archives and historical texts may be used instead. The UD Turkish treebank is CC-BY.

Other Altaic/Caucasian Languages: Open corpora for Uzbek, Azerbaijani (Leipzig, Wikipedia) and Caucasian languages (e.g. Georgian Wikipedia) are limited but exist. Notably, open glossed corpora have been published for indigenous Siberian languages (Selkup, Nganasan, Kamas, Dolgan) with linguistic annotation (https://copius.univie.ac.at/). These specialized corpora (accessible via inel.corpora.uni-hamburg.de) are rare examples of endangered-language resources.

African Languages

Swahili: Swahili Wikipedia (CC-BY-SA) and Leipzig Swahili corpus (news; CC-BY) are key. The UD Swahili treebank (CC-BY) also exists. These support lexical analysis of modern Swahili.

Yoruba, Zulu, etc.: Wikipedias exist (Yoruba, Zulu, etc., CC-BY-SA). Leipzig has some corpora (e.g. Yoruba). PanAfrican Parallel Corpora (Paracrawl) include news and the UDHR. These can provide base texts, though limited.

Other African Languages: Many African languages lack large corpora. Parallel religious texts (Bible, UDHR) in local languages can offer seed data. Efforts like Masakhane (MT dataset) are promising but often use custom licensing (not CC-BY).

Americas

Global: The Spanish, Portuguese, and English Wikipedias cover the major colonial languages. Leipzig Corpora include Spanish, Portuguese (Brazilian), etc. Latin American historical texts (PD Spanish/Portuguese literature) and US/Canadian legal texts (PD) are available, but not as unified corpora.

Indigenous Languages: UD provides treebanks for some (Quechua, Guarani, Nahuatl, Navajo, etc.) with CC-BY-SA. Small collections exist (e.g. collections of folk tales, bilingual corpora). The Universal Declaration of Human Rights in 500+ languages (UN translations) is PD and a handy parallel snippet for rare languages. Tools like Universal Dependencies (https://universaldependencies.org/) and Leipzig (https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage) can be leveraged for others.

Multilingual Corpora

Several resources span many languages:

Leipzig Corpora Collection: Provides monolingual corpora (10K–1M sentences) for 100+ languages (including major and many minority languages; https://wortschatz.uni-leipzig.de/en/download). Texts are drawn from news and the web; formats are plain-text sentences. All downloads are CC-BY 4.0 (https://wortschatz.uni-leipzig.de/en/usage).
Wikipedia (dumps): >300 languages; encyclopedic articles (all licensed CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). It is a huge multilingual corpus (though quality varies).
Universal Dependencies (UD): Over 200 treebanks in 150+ languages (Indo-European, Uralic, Altaic, Sino-Tibetan, etc.; https://universaldependencies.org/), annotated for POS, morphology, and dependencies in CoNLL-U format. Licenses vary by treebank (mostly CC-BY or CC-BY-SA). UD provides structured, annotated data across families.
PubMed Central Open Access Subset (PMC OA): Millions of English biomedical articles (full text XML) under liberal (often CC-BY) licenses (https://pmc.ncbi.nlm.nih.gov/tools/openftlist/). This is an authoritative corpus of scientific prose, suitable for technical vocabulary.
Project Gutenberg & Wikisource (multilingual): Multi-language texts (many European classics in the public domain), usually plain text or ePub, virtually no restrictions (https://gutenberg.org/policy/license.html).
Parallel Bible Corpora: While not strictly CC (e.g. JW300), Bible translations in hundreds of languages are broadly PD (e.g. King James, Septuagint) and can be used for basic lexicography in underserved languages.
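
For the CoNLL-U treebanks listed above, a small reader is often enough to get lemma and POS statistics. The sketch below relies on the standard ten tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the sample sentence and function names are made up for this example, and dedicated CoNLL-U libraries would be a reasonable alternative.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical one-sentence sample in CoNLL-U layout.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def iter_tokens(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per syntactic word, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token ranges (1-2) and empty nodes (1.1)
        yield {"form": cols[1], "lemma": cols[2], "upos": cols[3], "feats": cols[5]}


def lemma_pos_counts(lines: Iterable[str]) -> Counter:
    """Count (lemma, UPOS) pairs, ignoring punctuation."""
    return Counter(
        (tok["lemma"], tok["upos"])
        for tok in iter_tokens(lines)
        if tok["upos"] != "PUNCT"
    )


if __name__ == "__main__":
    print(lemma_pos_counts(SAMPLE.splitlines()))
```

Because the column layout is shared across UD treebanks, the same reader should apply to any language in the collection, though lemma conventions differ between treebanks.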

Most corpora are available in plain-text or XML formats and include at least basic metadata (author/date) when provided. They are suitable for building lexicons (word lists, frequency), etymological studies (via cross-language parallels), and semantic fields (through usage contexts). Priority should be given to sources cited above with open licenses (PD or CC-BY/CC-BY-SA) and minimal cleaning needs, ensuring authoritative coverage (classic literature, major newspapers, official documents) across linguistic families.
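
Even plain-text sources with "minimal cleaning needs" usually require one small step before counting words. As an example, Project Gutenberg plain-text files wrap each work in license boilerplate delimited by `*** START OF ... ***` and `*** END OF ... ***` marker lines. The sketch below trims everything outside those markers; the exact marker wording has varied over time, so the patterns and the sample text here are assumptions to verify against the files actually downloaded.

```python
import re

# Marker lines as commonly seen in Gutenberg plain-text files (wording varies).
START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers, if present."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


# Hypothetical miniature file.
SAMPLE = (
    "The Project Gutenberg eBook of Example\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "Call me Ishmael.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "License text follows here.\n"
)

if __name__ == "__main__":
    print(strip_boilerplate(SAMPLE))
```

Text without markers passes through unchanged apart from surrounding whitespace, so the function can be applied uniformly to mixed sources.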

Sources: Authoritative corpora and documentation for each language (as cited above) were used to compile this list (https://anc.org/, https://gutenberg.org/policy/license.html, https://en.wikipedia.org/wiki/Wikipedia:Database_download, https://wortschatz.uni-leipzig.de/en/download, https://wortschatz.uni-leipzig.de/en/usage, https://copius.univie.ac.at/, https://pmc.ncbi.nlm.nih.gov/tools/openftlist/), prioritizing open licenses and finished texts.

Open-Access Dictionaries (by Language and Type)

Dictionary / Project	Languages	Type	Format / Access	License	Notes (coverage, quality)
Wiktionary	4,400+ [*] (all major languages)	Collaborative multilingual (definitions, translations)	Web (MediaWiki); data dumps, API	CC BY-SA 4.0 [*]	around 8.4M entries [*], includes etymologies, pronunciation, translations.
GCIDE (GNU Collab. Int. Dict. of English)	English	Historical monolingual (Webster 1913 + WordNet)	Download (tar.gz text)	GPLv3+ [*]	Derived from Webster’s 1913 [*]; includes supplemental WordNet definitions.
Open English WordNet	English	Semantic lexicon (synset database)	XML/RDF download	CC BY 4.0 [*]	WordNet-style network (synonyms, hypernyms) derived from Princeton WordNet.
Bueno Spanish–English Dictionary (XML)	Spanish ↔ English	Bilingual (modern)	Download (TEI-XML)	Apache 2.0 [*]	~58K entries; high-quality, manually curated dataset [*].
FreeDict	around 45 languages (e.g. Afrikaans, Arabic, Breton) [*]	Bilingual (various pairs)	Download (TEI XML; StarDict, etc.)	GPL [*]	140+ bilingual dictionaries (≈45 languages) [*]. Offline lookup, corpus-compatible format.
CC-CEDICT	Chinese (Simplified, Traditional) ↔ English	Bilingual (learning dictionary)	Download (UTF-8 text)	CC BY-SA 4.0 [*]	~123K entries (2025) [*]; includes pinyin. Widely used in apps and studies.
Lane’s Arabic–English Lexicon (1863–93)	Classical Arabic ↔ English	Historical lexicon	Text scans (DjVu/TXT OCR) on Internet Archive	Public Domain [*]	8 volumes; exhaustive classical Arabic dictionary (derived from earlier Kāmūs).
Latin WordNet	Latin	Semantic lexicon (WordNet)	JSON/API (REST)	CC BY-SA 4.0 [*]	~70,000 words (archaic to medieval) [*]; modern online WordNet for Latin.
OpenWordnet-PT	Portuguese	Semantic lexicon (WordNet)	RDF/JSON download	CC BY 4.0 [*]	Open Portuguese WordNet; linked concepts and glosses in Portuguese.
JMdict (Japanese-Multilingual)	Japanese ↔ English, plus French/Ger/Rus…	Multilingual (bilingual entries)	Download (XML)	CC BY-SA 4.0 [*]	Expanded EDICT; ~200K entries with translations into multiple languages.
PanLex	5,000+ languages (broad “global lexicon”)	Multilingual lexicon	Download (CSV/JSON snapshots)	CC0 1.0 Universal [*]	Panlingual database of translations; includes many minority and endangered languages.
Apertium Dictionaries	20+ language pairs (e.g. Catalan‑Sp., Asturian‑Sp., etc.) [*]	Bilingual (MT lexicons)	Download (XML on GitHub)	GPL (majority) / CC BY-SA [*]	Rule-based MT dictionaries; open-source lexical data for many languages.

Sources: Authoritative project pages and documentation for each resource (https://en.wiktionary.org/wiki/Wiktionary:Main_Page, https://gcide.gnu.org.ua/license, https://gcide.gnu.org.ua/, https://en-word.net/, https://github.com/mananoreboton/en-es-en-Dic, https://freedict.org/, https://tei-c.org/activities/projects/freedict/, https://www.mdbg.net/chinese/dictionary?page=cedict, https://archive.org/details/ArabicEnglishLexicon.CopiousEasternSources.EnlargedSuppl.Kamoos.Lane.Poole.1863, https://latinwordnet.exeter.ac.uk/, https://github.com/own-pt/openWordnet-PT, https://blog.okfn.org/2009/07/21/open-dictionary-databases-an-overview/, https://panlex.org/license/), which detail formats, coverage, and licensing.
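
Several of the dictionaries above ship as simple line-oriented text. As one example, CC-CEDICT entries follow the pattern `Traditional Simplified [pin1 yin1] /gloss/gloss/`, with `#` starting comment lines. The sketch below parses a single line into a small record; the `Entry` type, the function name, and the sample line are illustrative choices for this example, not part of the CC-CEDICT distribution.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    glosses: List[str]


# Traditional, simplified, bracketed pinyin, then slash-delimited glosses.
LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None for comments or unmatched lines."""
    if line.startswith("#"):
        return None
    match = LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, body = match.groups()
    return Entry(traditional, simplified, pinyin, body.split("/"))


# Illustrative line in the documented layout.
SAMPLE = "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Keeping the glosses as a list makes it straightforward to build a reverse (English to Chinese) index later if needed.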
