CharCut

Character-based MT evaluation and difference highlighting

CharCut compares outputs of MT systems with reference translations. It can compare multiple file pairs simultaneously and produce HTML outputs showing character-based differences along with scores that are directly inferred from the lengths of those differences, thus making the link between evaluation and visualisation straightforward.

The matching algorithm is based on an iterative search for longest common substrings, combined with a length-based threshold that limits short and noisy character matches. As a similarity metric this is not new, but to the best of our knowledge it had never been applied to highlighting and scoring of MT outputs. It has the neat effect of keeping character-based differences readable by humans.
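The idea can be illustrated with a minimal sketch. This is not the code in charcut.py: it uses difflib from the standard library, keeps matches in their original order, and the names common_blocks and difference_ratio are hypothetical. The ratio at the end is only a stand-in to show how a score can be read off the lengths of the differences; it is not CharCut's actual formula.

```python
from difflib import SequenceMatcher


def common_blocks(cand, ref, min_match=3):
    """Return sorted (cand_start, ref_start, length) matches of at least min_match chars."""
    min_match = max(min_match, 1)  # a zero-length "match" is never useful
    # autojunk=False stops difflib from ignoring frequent characters in long strings.
    matcher = SequenceMatcher(None, cand, ref, autojunk=False)
    blocks = []
    # Work list of (cand_lo, cand_hi, ref_lo, ref_hi) regions still to search.
    todo = [(0, len(cand), 0, len(ref))]
    while todo:
        c_lo, c_hi, r_lo, r_hi = todo.pop()
        m = matcher.find_longest_match(c_lo, c_hi, r_lo, r_hi)
        if m.size < min_match:
            continue  # too short: the whole region counts as a difference
        blocks.append((m.a, m.b, m.size))
        # Keep searching on both sides of the match just found.
        todo.append((c_lo, m.a, r_lo, m.b))
        todo.append((m.a + m.size, c_hi, m.b + m.size, r_hi))
    return sorted(blocks)


def difference_ratio(cand, ref, min_match=3):
    """Illustrative only: share of characters not covered by any kept match."""
    total = len(cand) + len(ref)
    if total == 0:
        return 0.0
    matched = sum(size for _, _, size in common_blocks(cand, ref, min_match))
    return (total - 2 * matched) / total


if __name__ == "__main__":
    cand = "the cat sat on the mat"
    ref = "the cat sits on the mat"
    for c_start, r_start, size in common_blocks(cand, ref):
        print(repr(cand[c_start:c_start + size]))
    print(round(difference_ratio(cand, ref), 3))
```

In this example the single shared "t" inside "sat"/"sits" is shorter than the threshold, so it is left as part of the difference rather than highlighted as a match, which is what keeps the highlighted output readable.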

Incidentally, the scores inferred from those differences correlate very well with human judgments, similarly to other great character-based metrics like chrF(++) or CharacTER. The metric was evaluated in:

Adrien Lardilleux and Yves Lepage: "CharCut: Human-Targeted Character-Based MT Evaluation with Loose Differences". In Proceedings of IWSLT 2017.

CharCut is intended to be lightweight and easy to use, so the HTML outputs are, and will be kept, minimal on purpose.

Usage

CharCut is written in Python 3. It only relies on the standard library.

Basic usage:

python3 charcut.py cand.txt,ref.txt

where cand.txt and ref.txt contain corresponding candidate (MT) and reference (human) segments, one per line. Multiple file pairs can be specified on the command line: candidates with references, candidates with other candidates, etc. By default, only document-level scores are displayed on standard output. To produce an HTML output file, use the -o option:

python3 charcut.py cand.txt,ref.txt -o mydiff.html

A few more options are available; call

python3 charcut.py -h

to list them.
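Because each file pair is expected to correspond segment by segment, it can help to check that both files have the same number of lines before running CharCut. The following is a small sketch of such a check; the helper names count_segments and check_pair are hypothetical and not part of charcut.py, and UTF-8 encoding is an assumption about the input files.

```python
def count_segments(path):
    """Count segments (lines) in a text file, assuming UTF-8 encoding."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_pair(cand_path, ref_path):
    """Return the shared segment count, or raise ValueError if the files differ in length."""
    n_cand = count_segments(cand_path)
    n_ref = count_segments(ref_path)
    if n_cand != n_ref:
        raise ValueError(
            f"{cand_path} has {n_cand} segments but {ref_path} has {n_ref}"
        )
    return n_cand
```

For example, check_pair("cand.txt", "ref.txt") could be called once per file pair before passing the pair to charcut.py.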

Consider lowering the -m option value (minimum match size) for non-alphabetical writing systems such as Chinese or Japanese. The default value (3 characters) should be acceptable for most European languages, but depending on the language and data, larger values might produce better-looking results.

Changes

11/02/2022

added a Flask app to use CharCut in the browser; charcut.py was modified so that it can return strings instead of only writing to files
deployed to Vercel at charcut.vercel.app

27/07/2022

forked by Luis Kolb to include a directory crawler, formatting and execution script (run.py)
run the script with: python run.py sample-data/ format

09/07/2019

ported code to Python 3
added support for comparing multiple file pairs simultaneously
removed "-c" and "-r" command line arguments, replaced with a space-separated list of (comma-separated) file pairs
