 - corpus: corpus data, corpus build scripts/makefiles
 - doc: (rudimentary) documentation
 - eflomal: recipes for creating eflomal priors
 - incoming: notes about incoming data sets
 - templates: template recipes for importing additional data sets
 - tools: some additional scripts and tools (mostly obsolete)
 - releases: released data files (submodule OPUS)
 - public_html: websites and data sample files (submodule OPUS-website)
 - admin: administration stuff (non-public git repository OPUS-admin)
 - cwb: Corpus Workbench index files and registers (generated)

Requirements:

 - Python packages: opustools, polyglot, fast-mosestokenizer
 - Perl modules: OpusTools, Uplug and dependencies
 - subalign (for subtitle conversion and alignment)
 - pdftotext, recode, tidy, pigz, GNU parallel and other common GNU/Unix tools
 - Moses and eflomal (optional for word alignment and phrase table extraction)
 - the Corpus Workbench (CWB) and CWB Perl modules (optional for CWB index generation)
 - optional: yasa (our fork at https://github.com/Helsinki-NLP/yasa)
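
Before installing, it can help to see which of the command-line tools listed above are already on your `PATH`. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the command names (for example `parallel` for GNU parallel, plus `git`, `make`, `perl` and `python3`) are assumptions that may differ on your system. It does not check Python packages or Perl modules.

```shell
#!/bin/sh
# Print every name from the argument list that is not an executable on PATH.
# (missing_tools is a hypothetical helper, not something shipped with OPUS-ingest.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names; adjust the list to your environment.
missing=$(missing_tools git make perl python3 pdftotext recode tidy pigz parallel)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    echo "All checked tools were found."
fi
```

The optional tools (Moses, eflomal, CWB, yasa) are left out on purpose, since their executable names depend on how they were built and installed.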
 
```shell
git clone git@github.com:Helsinki-NLP/OPUS-ingest.git
cd OPUS-ingest
git submodule update --init --recursive --remote
make install
```
The last step will most likely fail. Check error messages and the Makefile for details.
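
Since the install step is expected to fail, keeping its full output in a file makes the error messages easier to go through afterwards. The sketch below is one way to do that under stated assumptions: `run_logged` is a hypothetical helper and `install.log` an arbitrary file name, neither of which comes from the repository.

```shell
#!/bin/sh
# Run a command, save its stdout and stderr to a log file, print the log,
# and return the command's own exit status.
# (run_logged is a hypothetical helper, not part of OPUS-ingest.)
run_logged() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    status=$?
    cat "$log"
    return "$status"
}

# Possible usage from inside the OPUS-ingest checkout:
#   run_logged install.log make install || grep -n -i 'error' install.log
```

With GNU make, `make -n install` prints the commands the target would run without executing them, which is another way to see what the Makefile expects before running it for real.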

NOTE: The documentation below requires serious updates!

TODO:

 - make build scripts more readable
 - consistent language codes
 - get rid of hard-coded paths to tools and make the repo more general and less dependent on specific environments (like the one on puhti/CSC)
 - better documentation (as always)
 - more efficient pre-processing
 - consistent pre-processing (UD-based?)
 - more frequent corpus updates (Tatoeba, wikimedia and other frequently changing corpora)
 - streamline corpus creation, processing and maintenance procedures
 - improve integration/updates of OPUS-API and website updates
 - …