
Commit 30e02f9

Merge pull request #36 from CopticScriptorium/dev
V4.0.0
2 parents: c190dd5 + 22ffd03


90 files changed: 97,321 additions, 148,099 deletions

README.md

Lines changed: 22 additions & 16 deletions
@@ -3,7 +3,11 @@
 An end-to-end NLP pipeline for Coptic text in UTF-8 encoding.
 
 Online production version available as a web interface at:
-https://corpling.uis.georgetown.edu/coptic-nlp/
+https://tools.copticscriptorium.org/coptic-nlp/
+
+The pipeline supports normalization, segmentation (at the word and subword levels), part of speech tagging, lemmatization, language of origin detection, sentence splitting, syntactic dependency parsing, multiword expression recognition, entity recognition, Wikification, and more.
+
+🔥**New**🔥: coptic-nlp now supports both Sahidic and Bohairic Coptic dialect varieties!
 
 ## Installation
 
@@ -17,24 +21,13 @@ The NLP pipeline can run as a script or as part of the included web interface vi
 
 ### Python libraries
 
-The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). Required libraries:
-
-* requests
-* numpy
-* pandas
-* scikit-learn==0.19.0
-
-You should be able to install these manually via pip if necessary (i.e. `pip install scikit-learn==0.19.0`).
-
-Note that some versions of Python + Windows do not install numpy correctly from pip, in which case you can download compiled binaries for your version of Python + Windows here: https://www.lfd.uci.edu/~gohlke/pythonlibs/, then run for example:
-
-`pip install c:\some_directory\numpy‑1.15.0+mkl‑cp27‑cp27m‑win_amd64.whl`
+The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). See requirements.txt for required libraries.
 
 ### External dependencies
 
-The pipeline also requires **perl** and **java** to be available (the latter only for parsing). Note you will also need binaries of TreeTagger and MaltParser 1.8 if you want to use POS tagging and parsing. These are not included in the distribution but the script will offer to attempt to download them if they are missing.
+The pipeline also requires **perl** for segmentation. If you want to use the old MarMoT tagger instead of flair (not recommended), or the old MaltParser model or MarMoT tagging model (also not recommended) instead of the Python models, **java** needs to be available. Additionally, if you want to use the old TreeTagger for POS tagging and lemmatization, TreeTagger must be installed. These are not included in the distribution, but the script will offer to attempt to download them if they are missing.
 
-**Note on older Linux distributions**: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error `FATAL: kernel too old`, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz
+**Note on using TreeTagger with older Linux distributions**: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error `FATAL: kernel too old`, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz. This should not be necessary if you are using our recommended Python tagger built on flair.
 
 ## Command line usage
 
@@ -57,9 +50,12 @@ standard module options:
 -p, --parse           Parse with dependency parser
 -e, --etym            Add etymological language of origin for loan words
 -s SENT, --sent SENT  XML tag to split sentences, e.g. verse for <verse ..>
-                      (otherwise PUNCT tag is used to split sentences)
+                      (otherwise PUNCT tag is used to split sentences);
+                      use -s=predict to use the neural segmenter instead
 -o {pipes,sgml,conllu}, --outmode {pipes,sgml,conllu}
                       Output SGML, conllu or tokenize with pipes
+--dialect {auto,sahidic,bohairic}
+                      Coptic dialect of input data (default: auto-detect)
 
 less common options:
 -f, --finitestate     Use old finite-state tokenizer (less accurate)

@@ -82,6 +78,12 @@ less common options:
 --pos_spans           Harvest POS tags and lemmas from SGML spans
 --merge_parse         Merge/add a parse into a ready SGML file
 --version             Print version number and quit
+--treetagger          Tag using TreeTagger instead of flair
+--marmot              Tag using MarMoT instead of flair
+--malt                Parse using MaltParser instead of Diaparser (requires Java)
+--no_gold_parse       Do not use UD_Coptic cache for gold parses
+--processing_meta     Add segmentation/tagging/parsing/entities="auto"
+--old_testament       Use Old Testament identities (Jesus means Jesus son of Naue, i.e. Joshua, etc.)
 ```
 
 ### Example usage
@@ -202,3 +204,7 @@ If all requirements are installed correctly, you can verify that modules are wor
 ```
 python run_tests.py
 ```
+
+## Acknowledgments
+
+We would like to thank Tito Orlandi (CMCL) for contributing Sahidic lexicon data to the project, Hany Takla (Saint Shenouda the Archimandrite Society) for much of the Bohairic data that the tools are based on, and Pishoy Georgios for contributing lexical data for Bohairic. A full list of [contributors and collaborators](https://copticscriptorium.org/about) in other projects can be found on the [Coptic Scriptorium](https://copticscriptorium.org/) website.
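
Editor's note: to make the new v4.0.0 options above concrete, here is a minimal invocation sketch. The entry-point script name (`coptic_nlp.py`) and the input file name are assumptions for illustration; neither appears in this diff.

```python
# Hypothetical invocation of the pipeline with the new v4.0.0 options.
# "coptic_nlp.py" and "in.txt" are assumed names, not taken from this diff.
import subprocess

cmd = [
    "python", "coptic_nlp.py",
    "--dialect", "bohairic",  # new in 4.0.0; omit to auto-detect the dialect
    "-s=predict",             # new: neural sentence segmenter instead of an XML tag
    "-o", "conllu",           # output CoNLL-U rather than SGML or pipes
    "in.txt",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```

Per the help text above, `--dialect` defaults to auto-detection, so the flag is only needed to force a specific dialect.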

_version.py

Lines changed: 7 additions & 1 deletion
@@ -1,7 +1,13 @@
 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 
-__version__ = "3.0.0"
+__version__ = "4.0.0"
 __author__ = "Amir Zeldes"
 __copyright__ = "Copyright 2015-2019, Amir Zeldes"
 __license__ = "Apache 2.0 License"
+
+tool_versions = {
+	"tokenizer_version": "stk-6.0.0",
+	"tagger_version": "flairmbert-6.0.0",
+	"parser_version": "diambert-UD2.15"
+}
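
Editor's note: a small sketch of reading the new `tool_versions` metadata; the diff does not show how the pipeline itself consumes it (the `--processing_meta` flag above suggests it may be stamped into output metadata, but that is an inference).

```python
# Sketch (not part of the commit): reading the version metadata added in 4.0.0.
from _version import __version__, tool_versions

print("coptic-nlp " + __version__)  # -> coptic-nlp 4.0.0
for tool, version in sorted(tool_versions.items()):
    print(tool + " = " + version)   # e.g. parser_version = diambert-UD2.15
```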

api.py

Lines changed: 22 additions & 65 deletions
@@ -1,86 +1,43 @@
-#!/usr/bin/python
+#!/usr/bin/python3.6
 # -*- coding: utf-8 -*-
 
-#Example call on localhost:
-#http://localhost/coptic-nlp/api.py?data=%E2%B2%81%CF%A5%E2%B2%A5%E2%B2%B1%E2%B2%A7%E2%B2%99%20%E2%B2%9B%CF%AD%E2%B2%93%E2%B2%A1%E2%B2%A3%E2%B2%B1%E2%B2%99%E2%B2%89&lb=line
-
-from nlp_form import nlp_coptic
-import cgi, sys, re
+import requests
+import cgi, sys
 
 PY3 = sys.version_info[0] == 3
 if PY3:
-    sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
+	sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
 
 storage = cgi.FieldStorage()
-#storage = {"data":"ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ"}
+#storage = {"data": "ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ"}
 if "data" in storage:
-    #data = storage["data"]
-    data = storage.getvalue("data")
+	#data = storage["data"]
+	data = storage.getvalue("data")
 else:
-    data = ""
+	data = ""
 
 # Diagnose detokenization needs
-detok = 0
-segment_merged = False
-orig_chars = ["̈", "", "̄", "̀", "̣", "`", "̅", "̈", "̂", "︤", "︥", "︦", "⳿", "~", "\n", "[", "]", "̇", "᷍", "⸍", "›", "‹"]
-clean = "".join([c for c in data if c not in orig_chars])
-clean = re.sub(r'<[^<>]+>','',clean).replace(" ","_").replace("\n","").lower()
-preps = clean.count("_ϩⲛ_") + clean.count("_ⲙⲛ_")
-if preps > 4:
-    detok = 1
-    segment_merged = True
-
 if "lb" in storage:
-    line = storage.getvalue("lb")
+	line = storage.getvalue("lb")
 else:
-    if "<lb" in data:
-        line = "noline"
-    else:
-        line = "line"
+	if "<lb" in data:
+		line = "noline"
+	else:
+		line = "line"
 
 if "format" in storage:
-    format = storage.getvalue("format")
+	format = storage.getvalue("format")
 else:
-    format = "sgml"
+	format = "sgml_no_parse"
 
 if format not in ["conll", "pipes", "sgml_no_parse", "sgml_entities"]:
-    format = "sgml"
-
-if format == "pipes":
-    print("Content-Type: text/plain; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=line=="line",sgml_mode="pipes",do_tok=True, detokenize=detok, segment_merged=segment_merged)
-    if "</lb>" in processed:
-        processed = processed.replace("</lb>","</lb>\n")
-    print(processed.strip())
-elif format == "sgml_no_parse":
-    print("Content-Type: text/sgml; charset=UTF-8\n")
-    # secure call, note that htaccess prevents this running without authentication
-    if "|" in data:
-        processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
-                               do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
-                               do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
-                               tok_mode="from_pipes", old_tokenizer=False)
-    else:
-        processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
-                               do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
-                               do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
-                               tok_mode="auto", old_tokenizer=False)
-    print(processed.strip() + "\n")
-elif format == "sgml_entities":
-    print("Content-Type: text/sgml; charset=UTF-8\n\n")
-    # secure call, note that htaccess prevents this running without authentication
-    processed = nlp_coptic(data, lb=line == "line", parse_only=False, do_tok=False, do_mwe=False,
-                           do_norm=False, do_tag=False, do_lemma=False, do_lang=False, sent_tag="translation",
-                           do_milestone=True, do_parse=False, sgml_mode="sgml", merge_parse=True,
-                           tok_mode="auto", old_tokenizer=False, do_entities=True, pos_spans=True, preloaded={"stk":"","xrenner":None})
-    print(processed.strip() + "\n")
-elif format != "conll":
-    print("Content-Type: text/"+format+"; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=line=="line")
-    print("<doc>\n"+processed.strip()+"\n</doc>\n")
+	format = "sgml_no_parse"
 
+if "sgml" in format:
+	print("Content-Type: text/sgml; charset=UTF-8\n")
 else:
-    print("Content-Type: text/plain; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=False,parse_only=True,do_tok=True,do_tag=True)
-    print(processed.strip())
+	print("Content-Type: text/plain; charset=UTF-8\n")
 
+params = {"data":data,"lb":line,"format":format}
+result = requests.post("http://localhost:5555/",params=params)
+print(result.content.decode("utf8"))
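
Editor's note: the rewritten api.py no longer calls `nlp_coptic` directly; it is now a thin CGI shim that sets the Content-Type and forwards `data`, `lb`, and `format` to a persistent service on localhost:5555, presumably so the neural models stay loaded between requests. A minimal client sketch follows; the `api.py` endpoint path is assumed from the example comment removed in this commit, and availability or authentication may differ in production.

```python
# Sketch of a client call against the public instance. The endpoint path is
# assumed from the removed example comment; parameter names come from api.py.
import requests

params = {
    "data": "ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ",  # Coptic input in UTF-8
    "lb": "line",                # or "noline" if the text already has <lb> tags
    "format": "sgml_no_parse",   # new default; also: conll, pipes, sgml_entities
}
resp = requests.get("https://tools.copticscriptorium.org/coptic-nlp/api.py", params=params)
resp.encoding = "utf8"
print(resp.text)
```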

bin/coptic.mco

157 KB
Binary file not shown.

bin/coptic_foma.bin

5.02 KB
Binary file not shown.
