Commit c8931ba

Merge pull request #27 from CopticScriptorium/dev
V3.0.0
2 parents a35476b + 195d713 commit c8931ba

295 files changed, +1955418 -21822 lines changed

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+*.pyc
+__pycache__/
+*.swp
+.*.tmp
+.idea
+/_scrap/
+/.idea/
+_tmp*.tab
+errors/

.idea/dictionaries/luke.xml

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 12 additions & 4 deletions
@@ -63,9 +63,9 @@ standard module options:
 
 less common options:
   -f, --finitestate     Use old finite-state tokenizer (less accurate)
-  -d {0,1,2}, --detokenize {0,1,2}
+  -d {0,1,2,3}, --detokenize {0,1,2,3}
                         Re-group non-standard bound groups (a.k.a.
-                        'laytonize') - 1=normal 2=aggressive
+                        'laytonize') - 1=normal 2=aggressive 3=smart
   --segment_merged      When re-grouping bound groups, assume merged groups
                         have segmentation boundary between them
   -q, --quiet           Suppress verbose messages
@@ -116,7 +116,7 @@ The pipeline accepts the following kinds of input:
 * Plain text, with bound groups separated by underscores or spaces.
 * Note that if punctuation has not been separated from bound groups, you can use the `--space` option to attempt to automatically separate punctuation
 * If your Coptic text represents line breaks as new line characters, you can automatically add line break tags using `-b` / `--breaklines`
-* Gold tokenization information may be present in the input at pipes between part-of-speech bearing units and hyphens between morphemes
+* Gold tokenization information may be present in the input as pipes between part-of-speech bearing units and hyphens between morphemes
 * XML/SGML input, with bound groups separated by underscores or spaces. The script will retain XML tags as-is around Coptic text.
 * Coptic Scriptorium style TreeTagger SGML, with normalized units in tags such as <norm norm="...">.
 * This input format is used when adding a parse to an existing .tt file using the `--merge_parse` option
@@ -193,4 +193,12 @@ The pipeline accepts the following kinds of input:
 </norm>
 </norm_group>
 </lb>
-```
+```
+
+## Testing installation
+
+If all requirements are installed correctly, you can verify that modules are working correctly by running the built-in unit tests:
+
+```
+python run_tests.py
+```
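
The first README hunk widens the `-d` / `--detokenize` choices from `{0,1,2}` to `{0,1,2,3}`. The usage text looks like argparse output, so the following is a minimal sketch of how such an option could be declared. It is not the code from coptic_nlp.py: the `build_parser` helper, the `default=0` value, and the `store_true` action for `--segment_merged` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring two options shown in the README usage text."""
    parser = argparse.ArgumentParser(prog="coptic_nlp.py")
    parser.add_argument(
        "-d", "--detokenize",
        type=int,
        choices=[0, 1, 2, 3],  # 3 ("smart") is the value added in this commit
        default=0,             # assumed default, not confirmed by the diff
        help="Re-group non-standard bound groups (a.k.a. 'laytonize') - "
             "1=normal 2=aggressive 3=smart",
    )
    parser.add_argument(
        "--segment_merged",
        action="store_true",   # assumed to be a plain flag
        help="When re-grouping bound groups, assume merged groups have "
             "segmentation boundary between them",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-d", "3"])
print(args.detokenize)
```

With `choices` set this way, argparse rejects any value outside 0 to 3 with a usage error before the pipeline code runs.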

_version.py

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 
-__version__ = "1.1.0"
+__version__ = "3.0.0"
 __author__ = "Amir Zeldes"
-__copyright__ = "Copyright 2015-2016, Amir Zeldes"
+__copyright__ = "Copyright 2015-2019, Amir Zeldes"
 __license__ = "Apache 2.0 License"

api.py

Lines changed: 11 additions & 6 deletions
@@ -1,15 +1,20 @@
-#!/usr/local/bin/python2.7
+#!/usr/bin/python3.5
 # -*- coding: utf-8 -*-
 
-#from lib.tokenize_rf import MultiColumnLabelEncoder, DataFrameSelector, lambda_underscore
-
 #Example call on localhost:
 #http://localhost/coptic-nlp/api.py?data=%E2%B2%81%CF%A5%E2%B2%A5%E2%B2%B1%E2%B2%A7%E2%B2%99%20%E2%B2%9B%CF%AD%E2%B2%93%E2%B2%A1%E2%B2%A3%E2%B2%B1%E2%B2%99%E2%B2%89&lb=line
 
 from nlp_form import nlp_coptic
-import cgi
+import cgi, sys
+
+PY3 = sys.version_info[0] == 3
+if PY3:
+    sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
+
 storage = cgi.FieldStorage()
+#storage = {"data":"ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ"}
 if "data" in storage:
+    #data = storage["data"]
     data = storage.getvalue("data")
 else:
     data = ""
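
The example call in the comment above passes percent-encoded Coptic text in the `data` parameter. As a rough sketch of how a client might build that query string, the snippet below uses only the standard library. The `build_query` helper is hypothetical, and the base URL is simply the localhost address from the comment. The sample text is written with Unicode escapes so the code points are unambiguous.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the example comment in api.py; adjust for a real deployment.
BASE_URL = "http://localhost/coptic-nlp/api.py"


def build_query(text, lb="line"):
    """Return a GET URL with UTF-8 percent-encoded parameters (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20; urlencode's default would emit '+'.
    params = urlencode({"data": text, "lb": lb}, quote_via=quote)
    return BASE_URL + "?" + params


# The two bound groups from the commented-out test value in api.py.
SAMPLE = (
    "\u2c81\u03e5\u2ca5\u2cb1\u2ca7\u2c99"
    " "
    "\u2c9b\u03ed\u2c93\u2ca1\u2ca3\u2cb1\u2c99\u2c89"
)

print(build_query(SAMPLE))
```

For this sample the result matches the URL given in the comment character for character.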
@@ -40,12 +45,12 @@
 print("Content-Type: text/sgml; charset=UTF-8\n")
 # secure call, note that htaccess prevents this running without authentication
 if "|" in data:
-    processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True,
+    processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
                            do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
                            do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
                            tok_mode="from_pipes", old_tokenizer=False)
 else:
-    processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True,
+    processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
                            do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
                            do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
                            tok_mode="auto", old_tokenizer=False)
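
The two `nlp_coptic` calls in this hunk differ only in `tok_mode`: input containing a pipe is treated as carrying gold tokenization (`"from_pipes"`), and anything else is tokenized automatically (`"auto"`). The sketch below shows one way the shared keyword arguments could be built once. The helper names `choose_tok_mode` and `build_nlp_kwargs` are hypothetical, and `nlp_coptic` itself is deliberately not imported or called, since its full signature is not shown in the diff.

```python
def choose_tok_mode(data):
    """Pick the tokenization mode the way the hunk's if/else does."""
    # A pipe in the input signals gold tokenization between POS-bearing units.
    return "from_pipes" if "|" in data else "auto"


def build_nlp_kwargs(data, line, fmt):
    """Collect the keyword arguments seen in the diff (hypothetical helper).

    `line` and `fmt` stand for the `line` and `format` variables that api.py
    defines in a part of the file not shown in this hunk.
    """
    return dict(
        lb=(line == "line"),
        parse_only=False,
        do_tok=True,
        do_mwe=False,  # argument added to both calls in this commit
        do_norm=True,
        do_tag=True,
        do_lemma=True,
        do_lang=True,
        do_milestone=True,
        do_parse=("no_parse" not in fmt),
        sgml_mode="sgml",
        tok_mode=choose_tok_mode(data),
        old_tokenizer=False,
    )


kwargs = build_nlp_kwargs("a|b c", line="line", fmt="sgml")
print(kwargs["tok_mode"], kwargs["do_mwe"])
```

Under these assumptions the two branches would collapse to a single call of the form `nlp_coptic(data, **build_nlp_kwargs(data, line, format))`.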

bin/coptic.mco

441 KB
Binary file not shown.

bin/coptic_foma.bin

893 KB
Binary file not shown.

bin/foma/README.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+This directory contains the files needed to compile a fresh transducer binary for the Foma based normalization module.
+
+To compile a new transducer based on the latest normalization info, make sure that foma binaries for your system are in bin/foma/ and that data/norm_table.tab contains the latest normalization data. Foma binaries are available for **Windows** and **Mac OSX** and will be automatically unzipped when you run coptic_nlp.py.
+
+If you are running coptic_nlp.py on **Linux**, you will need to compile Foma (which should work):
+
+```
+wget https://bitbucket.org/mhulden/foma/downloads/foma-0.9.18.tar.gz
+tar -xvzf foma-0.9.18.tar.gz
+cd foma-0.9.18/
+make
+sudo make install
+```
+
+When you are ready, run:
+
+```
+> python compile_grammar.py
+```
+
+A new coptic_foma.bin will be generated in this folder, which should replace the existing coptic_foma.bin in bin/
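
The last step of that README, replacing the existing coptic_foma.bin in bin/ with the newly generated one, could be scripted roughly as follows. This is only a sketch: the `install_transducer` helper and the `.bak` backup are assumptions added for illustration and are not part of the repository. The paths are the ones named in the README, taken relative to the repository root.

```python
import shutil
from pathlib import Path


def install_transducer(foma_dir="bin/foma", bin_dir="bin", name="coptic_foma.bin"):
    """Copy a freshly compiled transducer over the one in bin/ (hypothetical helper)."""
    new_bin = Path(foma_dir) / name
    target = Path(bin_dir) / name
    if not new_bin.is_file():
        raise FileNotFoundError(
            "{} not found; run compile_grammar.py first".format(new_bin)
        )
    if target.exists():
        # Keep the previous binary so the swap can be undone by hand.
        shutil.copy2(str(target), str(target.with_name(name + ".bak")))
    shutil.copy2(str(new_bin), str(target))
    return target
```

Calling `install_transducer()` from the repository root after `python compile_grammar.py` has finished would then leave the old binary as bin/coptic_foma.bin.bak.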
