
Commit 30e02f9

Merge pull request #36 from CopticScriptorium/dev
V4.0.0
2 parents: c190dd5 + 22ffd03


90 files changed: 97,321 additions, 148,099 deletions

README.md

Lines changed: 22 additions & 16 deletions
@@ -3,7 +3,11 @@
 An end-to-end NLP pipeline for Coptic text in UTF-8 encoding.
 
 Online production version available as a web interface at:
-https://corpling.uis.georgetown.edu/coptic-nlp/
+https://tools.copticscriptorium.org/coptic-nlp/
+
+The pipeline supports normalization, segmentation (at the word and subword levels), part of speech tagging, lemmatization, language of origin detection, sentence splitting, syntactic dependency parsing, multiword expression recognition, entity recognition, Wikification, and more.
+
+🔥**New**🔥: coptic-nlp now supports both Sahidic and Bohairic Coptic dialect varieties!
 
 ## Installation
 
@@ -17,24 +21,13 @@ The NLP pipeline can run as a script or as part of the included web interface vi
 
 ### Python libraries
 
-The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). Required libraries:
-
-* requests
-* numpy
-* pandas
-* scikit-learn==0.19.0
-
-You should be able to install these manually via pip if necessary (i.e. `pip install scikit-learn==0.19.0`).
-
-Note that some versions of Python + Windows do not install numpy correctly from pip, in which case you can download compiled binaries for your version of Python + Windows here: https://www.lfd.uci.edu/~gohlke/pythonlibs/, then run for example:
-
-`pip install c:\some_directory\numpy‑1.15.0+mkl‑cp27‑cp27m‑win_amd64.whl`
+The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). See requirements.txt for required libraries.
 
 ### External dependencies
 
-The pipeline also requires **perl** and **java** to be available (the latter only for parsing). Note you will also need binaries of TreeTagger and MaltParser 1.8 if you want to use POS tagging and parsing. These are not included in the distribution but the script will offer to attempt to download them if they are missing.
+The pipeline also requires **perl** for segmentation. If you want to use the old MarMoT tagger instead of flair (not recommended), or the old MaltParser model or MarMoT tagging model (also not recommended) instead of the Python models, **java** needs to be available. Additionally, if you want to use the old TreeTagger for POS tagging and lemmatization, TreeTagger must be installed. These are not included in the distribution, but the script will offer to attempt to download them if they are missing.
 
-**Note on older Linux distributions**: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error `FATAL: kernel too old`, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz
+**Note on using TreeTagger with older Linux distributions**: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error `FATAL: kernel too old`, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz. This should not be necessary if you are using our recommended Python tagger built on flair.
 
 ## Command line usage
 
@@ -57,9 +50,12 @@ standard module options:
 -p, --parse           Parse with dependency parser
 -e, --etym            Add etymological language of origin for loan words
 -s SENT, --sent SENT  XML tag to split sentences, e.g. verse for <verse ..>
-                      (otherwise PUNCT tag is used to split sentences)
+                      (otherwise PUNCT tag is used to split sentences);
+                      use -s=predict to use the neural segmenter instead
 -o {pipes,sgml,conllu}, --outmode {pipes,sgml,conllu}
                       Output SGML, conllu or tokenize with pipes
+--dialect {auto,sahidic,bohairic}
+                      Coptic dialect of input data (default: auto-detect)
 
 less common options:
 -f, --finitestate     Use old finite-state tokenizer (less accurate)

@@ -82,6 +78,12 @@ less common options:
 --pos_spans           Harvest POS tags and lemmas from SGML spans
 --merge_parse         Merge/add a parse into a ready SGML file
 --version             Print version number and quit
+--treetagger          Tag using TreeTagger instead of flair
+--marmot              Tag using MarMoT instead of flair
+--malt                Parse using MaltParser instead of Diaparser (requires Java)
+--no_gold_parse       Do not use UD_Coptic cache for gold parses
+--processing_meta     Add segmentation/tagging/parsing/entities="auto"
+--old_testament       Use Old Testament identities (Jesus means Jesus son of Naue, i.e. Joshua, etc.)
 ```
 
 ### Example usage
@@ -202,3 +204,7 @@ If all requirements are installed correctly, you can verify that modules are wor
 ```
 python run_tests.py
 ```
+
+## Acknowledgments
+
+We would like to thank Tito Orlandi (CMCL) for contributing Sahidic lexicon data to the project, Hany Takla (Saint Shenouda the Archimandrite Society) for much of the Bohairic data that the tools are based on, and Pishoy Georgios for contributing lexical data for Bohairic. A full list of [contributors and collaborators](https://copticscriptorium.org/about) in other projects can be found on the [Coptic Scriptorium](https://copticscriptorium.org/) website.
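
Editor's note: to make the new v4.0.0 options above concrete, here is a minimal invocation sketch. The entry-point script name (`coptic_nlp.py`) and the input file name are assumptions for illustration; neither appears in this diff.

```python
# Hypothetical invocation of the pipeline with the new v4.0.0 options.
# "coptic_nlp.py" and "in.txt" are assumed names, not taken from this diff.
import subprocess

cmd = [
    "python", "coptic_nlp.py",
    "--dialect", "bohairic",  # new in 4.0.0; omit to auto-detect the dialect
    "-s=predict",             # new: neural sentence segmenter instead of an XML tag
    "-o", "conllu",           # output CoNLL-U rather than SGML or pipes
    "in.txt",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```

Per the help text above, `--dialect` defaults to auto-detection, so the flag is only needed to force a specific dialect.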

_version.py

Lines changed: 7 additions & 1 deletion
@@ -1,7 +1,13 @@
 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 
-__version__ = "3.0.0"
+__version__ = "4.0.0"
 __author__ = "Amir Zeldes"
 __copyright__ = "Copyright 2015-2019, Amir Zeldes"
 __license__ = "Apache 2.0 License"
+
+tool_versions = {
+	"tokenizer_version": "stk-6.0.0",
+	"tagger_version": "flairmbert-6.0.0",
+	"parser_version": "diambert-UD2.15"
+}
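
Editor's note: a small sketch of reading the new `tool_versions` metadata; the diff does not show how the pipeline itself consumes it (the `--processing_meta` flag above suggests it may be stamped into output metadata, but that is an inference).

```python
# Sketch (not part of the commit): reading the version metadata added in 4.0.0.
from _version import __version__, tool_versions

print("coptic-nlp " + __version__)  # -> coptic-nlp 4.0.0
for tool, version in sorted(tool_versions.items()):
    print(tool + " = " + version)   # e.g. parser_version = diambert-UD2.15
```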

api.py

Lines changed: 22 additions & 65 deletions
@@ -1,86 +1,43 @@
-#!/usr/bin/python
+#!/usr/bin/python3.6
 # -*- coding: utf-8 -*-
 
-#Example call on localhost:
-#http://localhost/coptic-nlp/api.py?data=%E2%B2%81%CF%A5%E2%B2%A5%E2%B2%B1%E2%B2%A7%E2%B2%99%20%E2%B2%9B%CF%AD%E2%B2%93%E2%B2%A1%E2%B2%A3%E2%B2%B1%E2%B2%99%E2%B2%89&lb=line
-
-from nlp_form import nlp_coptic
-import cgi, sys, re
+import requests
+import cgi, sys
 
 PY3 = sys.version_info[0] == 3
 if PY3:
-    sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
+	sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
 
 storage = cgi.FieldStorage()
-#storage = {"data":"ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ"}
+#storage = {"data": "ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ"}
 if "data" in storage:
-    #data = storage["data"]
-    data = storage.getvalue("data")
+	#data = storage["data"]
+	data = storage.getvalue("data")
 else:
-    data = ""
+	data = ""
 
 # Diagnose detokenization needs
-detok = 0
-segment_merged = False
-orig_chars = ["̈", "", "̄", "̀", "̣", "`", "̅", "̈", "̂", "︤", "︥", "︦", "⳿", "~", "\n", "[", "]", "̇", "᷍", "⸍", "›", "‹"]
-clean = "".join([c for c in data if c not in orig_chars])
-clean = re.sub(r'<[^<>]+>','',clean).replace(" ","_").replace("\n","").lower()
-preps = clean.count("_ϩⲛ_") + clean.count("_ⲙⲛ_")
-if preps > 4:
-    detok = 1
-    segment_merged = True
-
 if "lb" in storage:
-    line = storage.getvalue("lb")
+	line = storage.getvalue("lb")
 else:
-    if "<lb" in data:
-        line = "noline"
-    else:
-        line = "line"
+	if "<lb" in data:
+		line = "noline"
+	else:
+		line = "line"
 
 if "format" in storage:
-    format = storage.getvalue("format")
+	format = storage.getvalue("format")
 else:
-    format = "sgml"
+	format = "sgml_no_parse"
 
 if format not in ["conll", "pipes", "sgml_no_parse", "sgml_entities"]:
-    format = "sgml"
-
-if format == "pipes":
-    print("Content-Type: text/plain; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=line=="line",sgml_mode="pipes",do_tok=True, detokenize=detok, segment_merged=segment_merged)
-    if "</lb>" in processed:
-        processed = processed.replace("</lb>","</lb>\n")
-    print(processed.strip())
-elif format == "sgml_no_parse":
-    print("Content-Type: text/sgml; charset=UTF-8\n")
-    # secure call, note that htaccess prevents this running without authentication
-    if "|" in data:
-        processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
-                               do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
-                               do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
-                               tok_mode="from_pipes", old_tokenizer=False)
-    else:
-        processed = nlp_coptic(data, lb=line=="line", parse_only=False, do_tok=True, do_mwe=False,
-                               do_norm=True, do_tag=True, do_lemma=True, do_lang=True,
-                               do_milestone=True, do_parse=("no_parse" not in format), sgml_mode="sgml",
-                               tok_mode="auto", old_tokenizer=False)
-    print(processed.strip() + "\n")
-elif format == "sgml_entities":
-    print("Content-Type: text/sgml; charset=UTF-8\n\n")
-    # secure call, note that htaccess prevents this running without authentication
-    processed = nlp_coptic(data, lb=line == "line", parse_only=False, do_tok=False, do_mwe=False,
-                           do_norm=False, do_tag=False, do_lemma=False, do_lang=False, sent_tag="translation",
-                           do_milestone=True, do_parse=False, sgml_mode="sgml", merge_parse=True,
-                           tok_mode="auto", old_tokenizer=False, do_entities=True, pos_spans=True, preloaded={"stk":"","xrenner":None})
-    print(processed.strip() + "\n")
-elif format != "conll":
-    print("Content-Type: text/"+format+"; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=line=="line")
-    print("<doc>\n"+processed.strip()+"\n</doc>\n")
+	format = "sgml_no_parse"
 
+if "sgml" in format:
+	print("Content-Type: text/sgml; charset=UTF-8\n")
 else:
-    print("Content-Type: text/plain; charset=UTF-8\n")
-    processed = nlp_coptic(data,lb=False,parse_only=True,do_tok=True,do_tag=True)
-    print(processed.strip())
+	print("Content-Type: text/plain; charset=UTF-8\n")
 
+params = {"data":data,"lb":line,"format":format}
+result = requests.post("http://localhost:5555/",params=params)
+print(result.content.decode("utf8"))
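
Editor's note: the rewritten api.py no longer calls `nlp_coptic` directly; it is now a thin CGI shim that sets the Content-Type and forwards `data`, `lb`, and `format` to a persistent service on localhost:5555, presumably so the neural models stay loaded between requests. A minimal client sketch follows; the `api.py` endpoint path is assumed from the example comment removed in this commit, and availability or authentication may differ in production.

```python
# Sketch of a client call against the public instance. The endpoint path is
# assumed from the removed example comment; parameter names come from api.py.
import requests

params = {
    "data": "ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡⲣⲱⲙⲉ",  # Coptic input in UTF-8
    "lb": "line",                # or "noline" if the text already has <lb> tags
    "format": "sgml_no_parse",   # new default; also: conll, pipes, sgml_entities
}
resp = requests.get("https://tools.copticscriptorium.org/coptic-nlp/api.py", params=params)
resp.encoding = "utf8"
print(resp.text)
```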

bin/coptic.mco

157 KB
Binary file not shown.

bin/coptic_foma.bin

5.02 KB
Binary file not shown.
