Le Boucher d'Amsterdam

Boudams, or "Le boucher d'Amsterdam", is a deep-learning tool for tokenizing (segmenting into words) Latin or Medieval French text.

How to cite

A working paper describing this work is available at https://hal.archives-ouvertes.fr/hal-02154122v1

@unpublished{clerice:hal-02154122,
  TITLE = {{Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin}},
  AUTHOR = {Cl{\'e}rice, Thibault},
  URL = {https://hal.archives-ouvertes.fr/hal-02154122},
  NOTE = {working paper or preprint},
  YEAR = {2019},
  MONTH = Jun,
  KEYWORDS = {convolutional network ; scripta continua ; tokenization ; Old French ; word segmentation},
  PDF = {https://hal.archives-ouvertes.fr/hal-02154122/file/Evaluating_Deep_Learning_Methods_for_Tokenization_of_Scripta_Continua_in_Old_French_and_Latin%284%29.pdf},
  HAL_ID = {hal-02154122},
  HAL_VERSION = {v1},
}

How to

Install it the usual way for a Python package: python setup.py install (requires Python >= 3.6).

A starter config file can be generated with boudams template config.json. We recommend the following settings:

linear-conv-no-pos for the model, as it is not limited by the input size;
normalize and lower set to true, depending on your dataset size.

The initial dataset is fairly small, but building your own is simple. Each line of data must have the following shape: "samesentence<TAB>same sentence", where the first element is the same as the second with the spaces removed, and the two are separated by a tab (\t, shown here as <TAB>).
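The line format described above can be produced from ordinary spaced text. The following is a minimal sketch, not part of Boudams: to_training_line is a hypothetical helper name, and collapsing repeated whitespace is an assumption about how you want your data cleaned.

```python
def to_training_line(sentence: str) -> str:
    """Return 'samesentence<TAB>same sentence' for one spaced sentence."""
    # Collapse any run of whitespace so the second column uses single spaces.
    spaced = " ".join(sentence.split())
    # The first column is the same text with every space removed.
    unspaced = spaced.replace(" ", "")
    return f"{unspaced}\t{spaced}"


if __name__ == "__main__":
    # Prints the two columns separated by a real tab character.
    print(to_training_line("same sentence"))
```

Applying this helper to every sentence of a corpus and writing one result per line gives a file in the expected shape.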

{
    "name": "model",
    "max_sentence_size": 150,
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 3,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.0001
    },
    "label_encoder": {
        "normalize": true,
        "lower": true
    },
    "datasets": {
        "test": "./test.tsv",
        "train": "./train.tsv",
        "dev": "./dev.tsv",
        "random": true
    }
}
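The datasets entries in this config point at three TSV files. As a rough sketch of how one might produce them from a single list of prepared lines, the snippet below shuffles and splits the data. The write_splits helper, the 10% dev and test ratios and the fixed seed are all assumptions made for illustration, not something Boudams prescribes.

```python
import random
from pathlib import Path


def write_splits(lines, out_dir=".", dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle TSV lines and write train.tsv, dev.tsv and test.tsv to out_dir."""
    lines = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(lines)
    n_dev = int(len(lines) * dev_ratio)
    n_test = int(len(lines) * test_ratio)
    splits = {
        "dev": lines[:n_dev],
        "test": lines[n_dev:n_dev + n_test],
        "train": lines[n_dev + n_test:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, chunk in splits.items():
        # One "unspaced<TAB>spaced" pair per line.
        text = "".join(f"{line}\n" for line in chunk)
        (out / f"{name}.tsv").write_text(text, encoding="utf-8")
    # Return the number of lines written to each file.
    return {name: len(chunk) for name, chunk in splits.items()}
```

Calling write_splits(lines) from the directory that holds the config file produces files matching the ./train.tsv, ./dev.tsv and ./test.tsv paths used above.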

The best architecture I found for medieval French was Conv to Linear without POS, using the following setup:

{
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 5,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "batch_size": 64,
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.00005,
        "lr_factor": 0.5
    }
}
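This setup leaves out keys such as name and datasets. If you start from a complete config file, one possible way to apply it is a recursive merge of the two JSON objects. The sketch below is only an illustration: merge_config is a hypothetical helper, not a Boudams function, and the small base dictionary stands in for a full config.

```python
def merge_config(base: dict, overrides: dict) -> dict:
    """Return base with overrides applied, recursing into nested objects."""
    merged = dict(base)  # shallow copy, so the top level of base is not modified
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and new keys simply replace or extend the base.
            merged[key] = value
    return merged


if __name__ == "__main__":
    base = {
        "name": "model",
        "network": {"enc_kernel_size": 3, "enc_dropout": 0.25},
        "learner": {"lr": 0.0001},
    }
    overrides = {
        "network": {"enc_kernel_size": 5},
        "batch_size": 64,
        "learner": {"lr": 0.00005, "lr_factor": 0.5},
    }
    print(merge_config(base, overrides))
```

Keys present only in the base, such as name, are kept, while keys present in both take the value from the overrides.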

Credits

The inspiration, bits of code, and the sources that helped me understand how Seq2Seq works and write my own Torch modules come from both Ben Trevett and Enrique Manjavacas.

