OPUS - a collection of parallel corpora and tools

Structure of the repository

corpus: corpus data and corpus build scripts/makefiles
doc: (rudimentary) documentation
eflomal: recipes for creating eflomal priors
incoming: notes about incoming data sets
templates: template recipes for importing additional data sets
tools: some additional scripts and tools (mostly obsolete)

Submodules and generated files

releases: released data files (submodule OPUS)
public_html: websites and data sample files (submodule OPUS-website)
admin: administration stuff (non-public git repository OPUS-admin)
cwb: Corpus Workbench index files and registers (generated)

Pre-requisites

Python packages: opustools, polyglot, fast-mosestokenizer
Perl modules: OpusTools, Uplug and dependencies
subalign (for subtitle conversion and alignment)
pdftotext, recode, tidy, pigz, GNU parallel and other common GNU/Unix tools
Moses and eflomal (optional for word alignment and phrase table extraction)
the Corpus Workbench (CWB) and CWB Perl modules (optional for CWB index generation)
optional: yasa (our fork from https://github.com/Helsinki-NLP/yasa)
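
Before running the installation, it can help to check which of the command-line tools listed above are already on the PATH. The following is a minimal sketch, not part of this repository: the function name `missing_tools` is hypothetical, and it assumes GNU parallel is installed under the command name `parallel`. It only covers the generic tools (pdftotext, recode, tidy, pigz, GNU parallel), not the Python packages or Perl modules.

```shell
#!/bin/sh
# Print the name of every given tool that cannot be found on PATH.
# `missing_tools` is a hypothetical helper, not something this repository provides.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command is available
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools taken from the pre-requisites list; `parallel` is the assumed
# command name for GNU parallel.
missing=$(missing_tools pdftotext recode tidy pigz parallel)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All listed command-line tools were found."
fi
```

Anything reported as missing has to be installed through the package manager or module system of the local environment before `make install` is likely to succeed.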

Installation and setup

git clone git@github.com:Helsinki-NLP/OPUS-ingest.git
cd OPUS-ingest
git submodule update --init --recursive --remote
make install

The last step (make install) will most likely fail. Check the error messages and the Makefile for details.

Documentation

NOTE: The documentation below requires serious updates!

TODO

make build scripts more readable
consistent language codes
get rid of hard-coded paths to tools and make the repo more general and less dependent on specific environments (like the one on puhti/CSC)
better documentation (as always)
more efficient pre-processing
consistent pre-processing (UD-based?)
more frequent corpus updates (Tatoeba, wikimedia and other frequently changing corpora)
streamline corpus creation, processing and maintenance procedures
improve integration/updates of OPUS-API and website updates
…
