A multilingual compilation of open-source textual corpora across major & minor world languages - curated for accessibility and linguistic research. Includes links and metadata for publicly available, CC-licensed, and machine-readable datasets.


madhav1k/OpenCorpus


OpenCorpus

A multilingual compilation of open-source textual corpora across major & minor world languages - curated for accessibility and linguistic research. Includes links and metadata for publicly available, CC-licensed, and machine-readable datasets.

We identify well‑curated public‑domain text collections that already include rich metadata or linguistic annotation.

For example:

  1. The Perseus Digital Library provides large Greek (32M words) and Latin (16M words) corpora with morphological parsing and dictionary links (https://methods.clsinfra.io; https://wiki.digitalclassicist.org)
  2. The Open Greek and Latin project exposes the same Perseus texts in TEI XML with lemmas and POS tags.
  3. Similarly, Project Gutenberg offers a massive PD literary corpus (primarily English) (https://methods.clsinfra.io)
  4. Language‑specific projects like the Deutsches Textarchiv provide TEI‑encoded historical German texts with lemmatization (https://tei-c.org)

The table below (Table 1) summarizes key sources (public domain or CC-licensed) with their languages, genres, annotation, formats, licenses, and suitability for lexical research.

| Source (Name & URL) | Languages | Types of works | Annotation Types | Format/Access | License/Reuse | Suitability |
| --- | --- | --- | --- | --- | --- | --- |
| Open Greek & Latin (Perseus) – Greek corpus | Ancient Greek | Classical Greek literature (Homer, tragedians) | Lemmas, POS tags, morphological features | TEI XML (Scaife, GitHub) | CC-BY-SA 4.0 (PD) | Richly annotated syntax & lexicon |
| Open Greek & Latin (Perseus) – Latin corpus | Latin | Latin classical texts (Caesar, Cicero, Vergil) | Lemmas, POS tags, morphological features | TEI XML (Scaife, GitHub) | CC-BY-SA 4.0 (PD) | Fully parsed Latin texts |
| Ancient Greek Dependency Treebank | Ancient Greek | Homer, Hesiod, Aeschylus, etc. | Lemmas, POS tags, morphological codes; syntactic dependencies | CoNLL-U / XML (download) | CC (open) | Expert-annotated treebank |
| Latin Dependency Treebank | Latin | Caesar, Cicero, Ovid, Vergil, etc. | Lemmas, POS tags, morphological codes; syntactic dependencies | CoNLL-U / XML (download) | CC (open) | Expert-annotated treebank |
| Universal Dependencies (UD) | 150+ langs (incl. English, Latin, Greek, French, German) | Mixed corpora (news, Wikipedia, web) | Universal POS tags, morphological features, dependency parses, lemmas | CoNLL-U (plaintext) | CC-BY (treebanks) | Broad multilingual corpora with consistent annotation |
| Project Gutenberg | Primarily English (also FR, DE, etc.) | Literature, philosophy, classics | None (raw text); header metadata (title, author, date) | Plain text, HTML, EPUB | Public Domain | Massive PD corpus for English and other literatures |
| Wikisource | Multilingual (EN, FR, DE, etc.) | Literature, historical and religious texts | Wiki markup; page metadata (titles, categories, authors) | Wiki XML dumps / HTML | CC-BY-SA | Crowdsourced collections of PD texts |
| Oxford Text Archive (OTA) | Various (English, Latin, etc.) | Scholarly corpora (poetry, drama, prose) | TEI-encoded texts (structural+metadata tags: author, date, genre) | TEI XML, plain text | Varies by collection | Curated humanities texts (check individual licenses) |
| IntraText Digital Library | Latin (ancient to modern) | Published Latin works (literature, theology, etc.) | Full text with search/concordances; edition metadata | HTML (search interface) | Free access (copyright respected) | Scholarly Latin corpus with lexicon links |
| The Latin Library | Latin | Classical and medieval Latin literature | None (raw text) | HTML | Public domain (collected PD) | Broad Latin text collection (no annotation) |
| Bibliotheca Augustana | Greek, Latin (incl. medieval/Neo-Latin) | Selected classical/medieval texts | None (raw text) | HTML | PD academic collection | Curated texts with simple interface (no annotation) |
| Deutsches Textarchiv (DTA) | German (1600–1900) | Historical German literature (prose, poetry) | Tokenized, lemmatized; normalized orthography | TEI-XML, HTML, TCF, plain text | CC-BY (open) | Annotated diachronic German corpus |

Each source in Table 1 is public-domain (or CC‑licensed) and minimally processed. Most entries include lemma and POS annotations or rich TEI markup. This means they can be readily searched for word contexts and used to build chronological lexica: for example, the Perseus corpora and dependency treebanks provide lemmatized, dated examples for Greek and Latin, while Gutenberg and Wikisource supply large quantities of dated literary usage (with author/date metadata). The combination of these resources covers a wide temporal span.
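As a rough illustration of searching lemmatized TEI markup for word contexts, the sketch below collects every sentence containing a given lemma. It assumes a simplified, hypothetical TEI layout in which each sentence is an `<s>` element with an `n` attribute and each token is a `<w>` element carrying `lemma` and `pos` attributes; real editions (Perseus, DTA, OTA) differ in structure, so the paths would need adjusting per source.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the element layout below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment, made up for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s n="1"><w lemma="arma" pos="NOUN">arma</w> <w lemma="vir" pos="NOUN">virum</w> <w lemma="cano" pos="VERB">cano</w></s>
    <s n="2"><w lemma="vir" pos="NOUN">viri</w> <w lemma="venio" pos="VERB">veniunt</w></s>
  </body></text>
</TEI>"""


def contexts_for_lemma(tei_xml, lemma):
    """Return (sentence id, surface form, sentence text) for each hit."""
    root = ET.fromstring(tei_xml)
    hits = []
    for sent in root.iterfind(".//tei:s", TEI_NS):
        words = sent.findall("tei:w", TEI_NS)
        sentence_text = " ".join(w.text or "" for w in words)
        for w in words:
            if w.get("lemma") == lemma:
                hits.append((sent.get("n"), w.text, sentence_text))
    return hits


if __name__ == "__main__":
    for sent_id, form, text in contexts_for_lemma(SAMPLE_TEI, "vir"):
        print(sent_id, form, "|", text)
```

Pairing each hit with the edition's date metadata (not modeled here) would be the next step toward a chronological lexicon.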

Sources: We prioritize academically curated collections. Citations indicate where full details of each resource’s content and licensing are documented.

Indo-European Languages

English: The Open American National Corpus (OANC) (https://anc.org/oanc) provides ~15 M words of modern American English (1990s–present) across diverse written (news, fiction, academic, web) and spoken genres (https://anc.org/). It is richly annotated with POS tags and lemmas (via ANC’s annotation pipeline) and can be delivered in XML/CoNLL-U formats using the ANC2Go tool (https://en.wikipedia.org/, https://anc.org/). The OANC is fully open/unrestricted (effectively CC0; https://anc.org/). Its Manually Annotated Sub-Corpus (MASC) (500K words) is balanced over 18+ genres and adds gold-standard annotations (tokenization, POS, lemma, syntax, named entities, coreference, discourse, etc.; https://anc.org/). MASC is licensed CC-BY 3.0 (US; https://anc.org/). These corpora are excellent for syntactic and semantic research. Other key English resources include Project Gutenberg (PD literature in many genres; plain text; PD/no restriction; https://gutenberg.org/policy/license.html) and English Wikipedia (encyclopedic articles; XML dumps; CC-BY-SA 4.0; https://en.wikipedia.org/wiki/Wikipedia:Database_download) for general lexical usage. The Leipzig Corpora Collection offers up to 1 M sentences of English (news and web text, sentence-aligned; plain text) with word co-occurrence stats (https://wortschatz.uni-leipzig.de/en/download) – data freely downloadable under CC-BY (https://wortschatz.uni-leipzig.de/en/usage).

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format / Access | License | Notes (suitability) |
| --- | --- | --- | --- | --- | --- | --- |
| OANC (Open American National Corpus) | American English | 15M words; balanced modern genres (news, fiction, blogs, transcripts) | POS tags, lemmas, etc. (automatic) | XML/CoNLL-U via ANC2Go | CC0 (unrestricted) | Good for lexical frequencies, collocations, grammar (chronological). |
| MASC (Manually Annotated Sub-Corpus) | American English | 500K words; balanced 18+ genres (including speech, blogs, news) | Manual annotation: sentence, token, lemma, POS, NP/VP chunks, NER, Penn Treebank syntax, coreference, etc. | XML/GrAF (coming); samples downloadable | CC-BY 3.0 US | Gold-standard grammar and semantics – ideal for NLP, lexicography. |
| Project Gutenberg | Many (English-centric) | Public-domain literature & nonfiction (classics, poetry, etc.) | None (plain text) | Plain text, ePub, HTML | Public domain | Rich historical texts; useful for rare/archaic usage and large-scale corpora. |
| Wikipedia (English) | English | Encyclopedia articles (all domains) | None (wiki markup; can extract text) | XML dumps | CC-BY-SA 4.0 | Huge up-to-date lexicon; varied style and topics (user-generated). |
| Leipzig Corpora (English) | English | News and random web sentences (10K–1M) | None (pre-tokenized sentences; includes co-occurrence stats) | Plain text (sentences) | CC-BY 4.0 | Balanced word-freq and collocation data (quick lookup tables). |

Spanish: Open Spanish corpora include Leipzig News Crawls, Wikipedia (es), and public-domain literature. The Leipzig Corpora (Spanish) (100K–1M sentences) cover newspaper and web text; plain sentences with frequency info (CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage). Spanish Wikipedia provides up-to-date encyclopedic text (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). Project Gutenberg (Spanish) offers PD literary classics. Together these form robust corpora for Spanish lexicography.

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format / Access | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (Spanish) | Spanish | News articles and web sentences (10K–1M) | None | Plain text (sentences) | CC-BY 4.0 | Good for frequency and collocation stats. |
| Wikipedia (Spanish) | Spanish | Encyclopedic articles | None (wiki markup) | XML dumps | CC-BY-SA 4.0 | Large modern lexicon across topics. |
| Project Gutenberg (Spanish) | Spanish | PD literature and poetry | None | Plain text, ePub | Public domain | Classic works (Cervantes, etc.) for literary language. |

French: Similarly, Leipzig French corpora (news/web; CC-BY) and French Wikipedia (CC-BY-SA) provide broad coverage. Project Gutenberg (French) and France’s Gallica offer PD classics and historical texts.

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (French) | French | News articles, web (10K–1M) | None | Plain text | CC-BY 4.0 | Balanced modern text (journals, blogs). |
| Wikipedia (French) | French | Encyclopedia entries | None | XML dump | CC-BY-SA 4.0 | General-purpose text, up to date. |
| Project Gutenberg (French) | French | PD literature (Voltaire, Hugo) | None | Plain text | Public domain | Classic literature for stylistic study. |

German: Open German corpora include Leipzig German (newspaper/web; CC-BY; https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage), German Wikipedia (CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download), and German Project Gutenberg. The Deutsches Textarchiv (DTA) is a scholarly open-access collection of historical German texts (1600–1900), and parliamentary corpora (e.g. Bundestag plenary proceedings) may also be used.

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (German) | German | News, web sentences (10K–1M) | None | Plain text | CC-BY 4.0 | Modern usage frequency data. |
| Wikipedia (German) | German | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 | Large multi-domain text. |
| Project Gutenberg (German) | German | PD classics (Goethe, Grimm) | None | Plain text | Public domain | Standard literary language. |

Russian: The Leipzig Russian corpus (news/web; CC-BY) and Russian Wikipedia (CC-BY-SA) are open sources. The Open Russian National Corpus (ORNC) is in development, but the public Russian Wikipedia, PD literary texts, and Leipzig provide ample raw text. (The fully annotated Russian National Corpus exists but is not available for open download.)

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (Russian) | Russian | News and web (10K–1M) | None | Plain text | CC-BY 4.0 | Modern usage; large size. |
| Wikipedia (Russian) | Russian | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 | Broad vocabulary. |
| Project Gutenberg (Russian) | Russian | PD literature (Tolstoy, Dostoevsky) | None | Plain text | Public domain | 19th-century classic literature. |

Other Indo-European Languages: Open corpora also exist for classical and less-common IE languages. The Latin Library (latinlibrary.com) and the Perseus Project provide extensive Latin and Greek texts (mostly PD). The OpenITI corpus offers open-access Persian and Arabic texts (incl. many medieval works; https://openiti.org/projects/OpenITI%20Corpus.html). UD treebanks cover dozens of languages (e.g. Old Norse, Welsh, Haitian) with POS/morphology annotations (CoNLL-U format; licenses vary between CC-BY and CC-BY-SA). These can seed lexicographic data where available.

Afro-Asiatic Languages

Arabic: The Leipzig Arabic corpus (news/web; CC-BY) and Arabic Wikipedia (CC-BY-SA) are major open sources. In addition, OpenITI provides thousands of classical Arabic and Persian texts (liturgical and scholarly works; https://openiti.org/projects/OpenITI%20Corpus.html) – open-access and increasingly cleaned. The Quranic Arabic Corpus (grammar/lemmas) is annotated but carries a CC-BY-NC-SA license, so use primarily for reference (not included here). These corpora (especially wiki and news) cover modern MSA; classical texts enrich dictionary entries.

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (Arabic) | Arabic | News and web (10K–1M) | None | Plain text | CC-BY 4.0 | Modern MSA usage, word co-occurrences. |
| Wikipedia (Arabic) | Arabic | Encyclopedic articles | None | XML dump | CC-BY-SA 4.0 | General vocabulary, cross-domain. |
| OpenITI (Arabic/Persian) | Arabic; Persian | Classical religious/philosophical works | Metadata (author, date) | TEI/XML | Open access (CC-BY-like) | Rich historical texts; some OCR errors. |

Hebrew: Open resources include Hebrew Bible (Tanakh) texts (PD) and the Hebrew Wikipedia (CC-BY-SA). The Bar-Ilan Responsa Project is scholarly but not open. UD Hebrew corpora (modern liturgical and wiki) exist with annotations (UD format, CC-BY). These can support lexicographic work in modern and Biblical Hebrew.

Other Afro-Asiatic Languages: For Amharic/Ge‘ez, no large open corpora are easily available. Somali Wikipedia and small Leipzig corpora may be used. In general, many Ethiopian languages rely on religious texts (often PD) and small parallel data.

Sino-Tibetan and East Asian Languages

Mandarin Chinese: Open corpora include Leipzig Chinese (CC-BY) and Chinese Wikipedia (CC-BY-SA) for modern Chinese. Classical Chinese is available via the Chinese Text Project (ancient texts; open-access but CC-BY-NC-SA – not strictly CC-BY). Modern annotated treebanks exist (UD Chinese, POS-tagged), and the CC-CEDICT dictionary is open (for reference). These corpora serve both modern usage and classical study.

| Corpus / Resource | Language(s) | Genres / Text Types | Annotation / Metadata | Format | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Leipzig Corpora (Chinese) | Chinese (Mandarin) | News and web (10K–1M) | None | Plain text | CC-BY 4.0 | High-frequency modern Chinese. |
| Wikipedia (Chinese) | Chinese | Encyclopedia articles | None | XML dump | CC-BY-SA 4.0 | Wide topical coverage. |

Other Sino-Tibetan Languages: For Burmese, Tibetan, etc., open data is sparse. Wikipedia exists for Burmese (CC-BY-SA) and smaller Wikipedias for Tibetan, plus some scholarly parallel texts (e.g. Buddhist Tripiṭaka translations). UD has limited treebanks for Tibetan. For major Chinese languages/dialects (e.g. Cantonese), Leipzig Corpora include some web data. These should be listed under a broader “multilingual” note if used.

Dravidian Languages

Tamil: The Tamil Virtual Academy and Project Madurai (https://projectmadurai.org) provide PD Tamil literary works (epics, poems) in UTF-8 text. Tamil Wikipedia (CC-BY-SA) and Leipzig Tamil corpora (from news/web) also exist. UD Tamil is available (CC-BY-SA). These textual resources cover both classical and contemporary Tamil usage.

Other Dravidian Languages: Malayalam, Telugu, Kannada have small Wikipedias and some UD treebanks. The Indian Language Corpora Initiative (ILCI) produced parallel corpora with open licensing (CC-BY) for Indian languages, useful for lexical entries.

Austro-Asiatic and Austronesian Languages

Vietnamese: Leipzig Vietnamese (news/web; CC-BY) and Vietnamese Wikipedia (CC-BY-SA) are available. Vietnamese Parallel Corpora from OPUS (e.g. TED talks, Tatoeba) exist under CC licenses. These support modern language usage.

Thai: Thai Wikipedia (CC-BY-SA) is openly available; the National Electronics and Computer Technology Center (NECTEC) corpus is not fully open. Leipzig includes small Thai corpora. Open parallel data (e.g. Tatoeba sentences, CC-BY 2.0 FR) can supplement these.

Austronesian (Malay/Indonesian): Open corpora include Leipzig Indonesian/Malay and Wikipedia (MS, ID) under CC-BY-SA. The OSCAR web corpus and UD (e.g. Indonesian-GSD) are CC-BY-SA. These cover modern usage in Malay and Indonesian.

Uralic and Altaic Languages

Finnish/Hungarian: The Leipzig Corpora include Finnish and Hungarian (CC-BY). Wikipedia (fi, hu) provide broad text. UD treebanks exist. These resources support standard lexical work in these languages.

Turkish: Leipzig Turkish and Turkish Wikipedia (CC-BY-SA) are open. The Turkish National Corpus is not open, but openly published news articles and historical texts may be used. The UD Turkish treebank is CC-BY.

Other Altaic/Caucasian Languages: Open corpora for Uzbek, Azerbaijani (Leipzig, Wikipedia) and Caucasian languages (e.g. Georgian Wikipedia) are limited but exist. Notably, open glossed corpora have been published for indigenous Siberian languages (Selkup, Nganasan, Kamas, Dolgan) with linguistic annotation (https://copius.univie.ac.at/). These specialized corpora (accessible via inel.corpora.uni-hamburg.de) are rare examples of endangered-language resources.

African Languages

Swahili: Swahili Wikipedia (CC-BY-SA) and Leipzig Swahili corpus (news; CC-BY) are key. The UD Swahili treebank (CC-BY) also exists. These support lexical analysis of modern Swahili.

Yoruba, Zulu, etc.: Wikipedias exist (Yoruba, Zulu, etc., CC-BY-SA). Leipzig has some corpora (e.g. Yoruba). PanAfrican Parallel Corpora (Paracrawl) include news and the UDHR. These can provide base texts, though limited.

Other African Languages: Many African languages lack large corpora. Parallel religious texts (Bible, UDHR) in local languages can offer seed data. Efforts like Masakhane (MT dataset) are promising but often use custom licenses (not CC-BY).

Americas

Global: The Spanish, Portuguese, and English Wikipedias cover the major colonial languages. Leipzig Corpora include Spanish, Portuguese (Brazilian), etc. Latin American historical texts (PD Spanish/Portuguese literature) and US/Canadian legal texts (PD) are available, but not as unified corpora.

Indigenous Languages: UD provides treebanks for some (Quechua, Guarani, Nahuatl, Navajo, etc.) under CC-BY-SA. Small collections exist (e.g. collections of folk tales, bilingual corpora). The Universal Declaration of Human Rights in 500+ languages (UN translations) is PD and a handy parallel snippet for rare languages. Resources like Universal Dependencies (https://universaldependencies.org/) and the Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/en/download; https://wortschatz.uni-leipzig.de/en/usage) can be checked for other languages.

Multilingual Corpora

Several resources span many languages:

  1. Leipzig Corpora Collection: Provides monolingual corpora (10K–1M sentences) for 100+ languages (including major and many minority languages; https://wortschatz.uni-leipzig.de/en/download). Texts are drawn from news and the web; formats are plain-text sentences. All downloads are CC-BY 4.0 (https://wortschatz.uni-leipzig.de/en/usage).
  2. Wikipedia (dumps): >300 languages; encyclopedic articles (all licensed CC-BY-SA; https://en.wikipedia.org/wiki/Wikipedia:Database_download). It is a huge multilingual corpus (though quality varies).
  3. Universal Dependencies (UD): Over 200 treebanks in 150+ languages (Indo-European, Uralic, Altaic, Sino-Tibetan, etc.; https://universaldependencies.org/), annotated for POS, morphology, and dependencies in CoNLL-U format. Licenses vary by treebank (mostly CC-BY or CC-BY-SA). UD provides structured, annotated data across families.
  4. PubMed Central Open Access Subset (PMC OA): Millions of English biomedical articles (full text XML) under liberal (often CC-BY) licenses (https://pmc.ncbi.nlm.nih.gov/tools/openftlist/). This is an authoritative corpus of scientific prose, suitable for technical vocabulary.
  5. Project Gutenberg & Wikisource (multilingual): Multi-language texts (many European classics in the public domain), usually plain text or ePub, virtually no restrictions (https://gutenberg.org/policy/license.html).
  6. Parallel Bible Corpora: While not strictly CC (e.g. JW300), Bible translations in hundreds of languages are broadly PD (e.g. King James, Septuagint) and can be used for basic lexicography in underserved languages.
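To make the CoNLL-U format used by UD treebanks more concrete, here is a minimal sketch that counts (lemma, UPOS) pairs. It relies only on the documented ten-column, tab-separated layout (ID, FORM, LEMMA, UPOS, ...); the sample sentences are invented for illustration and are not taken from any treebank.

```python
from collections import Counter

# Invented two-sentence sample in CoNLL-U layout (10 tab-separated columns).
SAMPLE_CONLLU = (
    "# sent_id = 1\n"
    "# text = Canes currunt.\n"
    "1\tCanes\tcanis\tNOUN\t_\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tcurrunt\tcurro\tVERB\t_\tNumber=Plur\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = Canis currit.\n"
    "1\tCanis\tcanis\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tcurrit\tcurro\tVERB\t_\tNumber=Sing\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def lemma_counts(conllu_text):
    """Count (lemma, UPOS) pairs over ordinary word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank lines separate sentences; '#' lines are comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines rather than guess
        tok_id, lemma, upos = cols[0], cols[2], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword-token ranges (1-2) and empty nodes (1.1)
        counts[(lemma, upos)] += 1
    return counts


if __name__ == "__main__":
    for (lemma, upos), n in lemma_counts(SAMPLE_CONLLU).most_common():
        print(lemma, upos, n)
```

For real treebanks, filtering out `PUNCT` and checking each treebank's license file before redistribution would be sensible additions.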

Most corpora are available in plain-text or XML formats and include at least basic metadata (author/date) when provided. They are suitable for building lexicons (word lists, frequency), etymological studies (via cross-language parallels), and semantic fields (through usage contexts). Priority should be given to sources cited above with open licenses (PD or CC-BY/CC-BY-SA) and minimal cleaning needs, ensuring authoritative coverage (classic literature, major newspapers, official documents) across linguistic families.
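A word-frequency list is the simplest of these lexicon-building steps. The sketch below assumes a Leipzig-style sentence file in which each line is a numeric ID, a tab, and one sentence; that layout and the naive `\w+` tokenizer are assumptions for illustration, and languages without whitespace word boundaries (e.g. Chinese, Thai) would need a proper segmenter instead.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Build a lowercase word-frequency counter from 'id<TAB>sentence' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Take the text after the first tab; fall back to the whole line.
        sentence = line.split("\t", 1)[-1]
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


if __name__ == "__main__":
    # Invented sample lines standing in for a downloaded sentence file.
    sample = ["1\tThe cat sat on the mat.", "2\tThe dog sat."]
    for word, n in word_frequencies(sample).most_common(3):
        print(word, n)
```

With a real download, the same function can be fed an open file handle, since iterating over a text file yields its lines.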

Sources: Authoritative corpora and documentation for each language (as cited above) were used to compile this list (https://anc.org/, https://gutenberg.org/policy/license.html, https://en.wikipedia.org/wiki/Wikipedia:Database_download, https://wortschatz.uni-leipzig.de/en/download, https://wortschatz.uni-leipzig.de/en/usage, https://copius.univie.ac.at/, https://pmc.ncbi.nlm.nih.gov/tools/openftlist/), prioritizing open licenses and finished texts.

Open-Access Dictionaries (by Language and Type)

| Dictionary / Project | Languages | Type | Format / Access | License | Notes (coverage, quality) |
| --- | --- | --- | --- | --- | --- |
| Wiktionary | 4,400+ (all major languages) | Collaborative multilingual (definitions, translations) | Web (MediaWiki); data dumps, API | CC BY-SA 4.0 | Around 8.4M entries; includes etymologies, pronunciation, translations. |
| GCIDE (GNU Collab. Int. Dict. of English) | English | Historical monolingual (Webster 1913 + WordNet) | Download (tar.gz text) | GPLv3+ | Derived from Webster’s 1913; includes supplemental WordNet definitions. |
| Open English WordNet | English | Semantic lexicon (synset database) | XML/RDF download | CC BY 4.0 | WordNet-style network (synonyms, hypernyms) derived from Princeton WordNet. |
| Bueno Spanish–English Dictionary (XML) | Spanish ↔ English | Bilingual (modern) | Download (TEI-XML) | Apache 2.0 | ~58K entries; high-quality, manually curated dataset. |
| FreeDict | Around 45 languages (e.g. Afrikaans, Arabic, Breton) | Bilingual (various pairs) | Download (TEI XML; StarDict, etc.) | GPL | 140+ bilingual dictionaries (≈45 languages). Offline lookup, corpus-compatible format. |
| CC-CEDICT | Chinese (Simplified, Traditional) ↔ English | Bilingual (learning dictionary) | Download (UTF-8 text) | CC BY-SA 4.0 | ~123K entries (2025); includes pinyin. Widely used in apps and studies. |
| Lane’s Arabic–English Lexicon (1863–93) | Classical Arabic ↔ English | Historical lexicon | Text scans (DjVu/TXT OCR) on Internet Archive | Public Domain | 8 volumes; exhaustive classical Arabic dictionary (derived from earlier Kāmūs). |
| Latin WordNet | Latin | Semantic lexicon (WordNet) | JSON/API (REST) | CC BY-SA 4.0 | ~70,000 words (archaic to medieval); modern online WordNet for Latin. |
| OpenWordnet-PT | Portuguese | Semantic lexicon (WordNet) | RDF/JSON download | CC BY 4.0 | Open Portuguese WordNet; linked concepts and glosses in Portuguese. |
| JMdict (Japanese-Multilingual) | Japanese ↔ English, plus French/Ger/Rus… | Multilingual (bilingual entries) | Download (XML) | CC BY-SA 4.0 | Expanded EDICT; ~200K entries with translations into multiple languages. |
| PanLex | 5,000+ languages (broad “global lexicon”) | Multilingual lexicon | Download (CSV/JSON snapshots) | CC0 1.0 Universal | Panlingual database of translations; includes many minority and endangered languages. |
| Apertium Dictionaries | 20+ language pairs (e.g. Catalan‑Sp., Asturian‑Sp., etc.) | Bilingual (MT lexicons) | Download (XML on GitHub) | GPL (majority) / CC BY-SA | Rule-based MT dictionaries; open-source lexical data for many languages. |
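Several of these dictionaries ship as simple line-oriented text. As one example, the sketch below parses CC-CEDICT-style lines, assuming the commonly seen layout `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/` with `#` comment lines. The `Entry` type and `parse_line` helper are hypothetical names introduced here, and the regular expression is a best-effort reading of that layout rather than an official grammar.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    glosses: List[str]


# Traditional, simplified, [pinyin], then slash-delimited English glosses.
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None for comments or non-matches."""
    if line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, glosses = match.groups()
    return Entry(trad, simp, pinyin, glosses.split("/"))


if __name__ == "__main__":
    sample = [
        "# comment line, skipped",
        "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/",
    ]
    for raw in sample:
        entry = parse_line(raw)
        if entry is not None:
            print(entry.simplified, entry.pinyin, entry.glosses)
```

Because CC-CEDICT is CC BY-SA, any lexicon derived this way would need to carry attribution and the same license.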

Sources: Authoritative project pages and documentation for each resource (https://en.wiktionary.org/wiki/Wiktionary:Main_Page, https://gcide.gnu.org.ua/license, https://gcide.gnu.org.ua/, https://en-word.net/, https://github.com/mananoreboton/en-es-en-Dic, https://freedict.org/, https://tei-c.org/activities/projects/freedict/, https://www.mdbg.net/chinese/dictionary?page=cedict, https://archive.org/details/ArabicEnglishLexicon.CopiousEasternSources.EnlargedSuppl.Kamoos.Lane.Poole.1863, https://latinwordnet.exeter.ac.uk/, https://github.com/own-pt/openWordnet-PT, https://blog.okfn.org/2009/07/21/open-dictionary-databases-an-overview/, https://panlex.org/license/), which detail formats, coverage, and licensing.
