LapDevelopment_Giellatekno
In wrapping the UiT Giellatekno (GT) pipeline for Sami segmentation, tokenization, morphological analysis, and dependency parsing, we need to decide (a) how many individual processing steps to distinguish (from the LAP point of view) and (b) how to map the input and output data of the GT pipeline to LAF-compatible annotations.
The following is a sample invocation (against the LAP Tree), originally suggested by Trond Trosterud of UiT:
cat $LAPTREE/gt/etc/test.txt \
| $LAPTREE/perl/lap/perl $LAPTREE/gt/script/preprocess --abbr=$LAPTREE/gt/sme/abbr.txt \
| $LAPTREE/hfst/lap/hfst-lookup $LAPTREE/gt/sme/analyser-gt-norm.hfstol \
| $LAPTREE/perl/lap/perl $LAPTREE/gt/script/lookup2cg | sed 's/0.000000//g' \
| $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/disambiguation.cg3 \
| $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/functions.cg3 \
| $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/dependency.cg3
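To help decide how many steps to expose, the invocation above can be broken into separately runnable stages whose intermediate output is kept for inspection. The following is a minimal Python sketch, not part of the GT or LAP code: the function names `build_stages` and `run_stages` and the stage labels are hypothetical, while the commands and paths are copied from the invocation above. Whether lookup2cg and the sed clean-up should count as steps of their own is left open here.

```python
import subprocess


def build_stages(laptree):
    """Return the sample pipeline as an ordered list of (label, argv) pairs.

    The labels are hypothetical names chosen for this sketch; the commands
    and file paths mirror the shell invocation above.
    """
    perl = f"{laptree}/perl/lap/perl"
    vislcg3 = f"{laptree}/vislcg3/lap/vislcg3"
    gt = f"{laptree}/gt"
    return [
        ("preprocess", [perl, f"{gt}/script/preprocess",
                        f"--abbr={gt}/sme/abbr.txt"]),
        ("hfst-lookup", [f"{laptree}/hfst/lap/hfst-lookup",
                         f"{gt}/sme/analyser-gt-norm.hfstol"]),
        ("lookup2cg", [perl, f"{gt}/script/lookup2cg"]),
        # Same sed expression as in the invocation: strips the weights.
        ("strip-weights", ["sed", "s/0.000000//g"]),
        ("disambiguation", [vislcg3, "-g", f"{gt}/sme/disambiguation.cg3"]),
        ("functions", [vislcg3, "-g", f"{gt}/sme/functions.cg3"]),
        ("dependency", [vislcg3, "-g", f"{gt}/sme/dependency.cg3"]),
    ]


def run_stages(stages, text, run=subprocess.run):
    """Pipe text through every stage and keep each stage's output.

    Returns a dict mapping stage label to that stage's standard output.
    The `run` parameter exists only so the runner can be swapped in tests.
    """
    outputs = {}
    data = text
    for label, argv in stages:
        result = run(argv, input=data, capture_output=True,
                     encoding="utf-8", check=True)
        data = result.stdout
        outputs[label] = data
    return outputs
```

With a LAP Tree installed, something like `run_stages(build_stages(os.environ["LAPTREE"]), open(path, encoding="utf-8").read())` would give one output per stage, which makes it easier to see which intermediate formats are worth exposing as separate annotation layers.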
It probably makes sense to conceptually distinguish the following layers:
(a) (maybe normalization and) tokenization (preprocess)
(b) morphological analysis (hfst-lookup)
(c) segmentation and morpho-syntactic tagging (disambiguation.cg3)
(e) grammatical function analysis (functions.cg3)
(f) dependency parsing (dependency.cg3)
The output of the final step, in the VISL CG-3 stream format, looks as follows:
"<Dán>"
"dát" Pron Dem Sg Acc @OBJ> #1->3
"<skábma>"
"skábma" N Sg Nom @SUBJ> #2->3
"<čuojaha>"
"čuojahit" V TV Ind Prs Sg3 @FMV #3->0
"<Mari>"
"Mari" N Prop Sem/Fem Attr @>N #4->5
"<Boine>"
"Boine" N Prop Sem/Sur Sg Nom @<SUBJ #5->3
"<ovttas>"
"ovttas" Adv @<ADVL #6->3
"<Kai>"
"Kai" N Prop Sem/Mal Attr @>N #7->8
"<Sombyn>"
"Somby" N Prop Sem/Sur Ess @<OPRED #8->1
"<ja>"
"ja" CC @CNP #9->8
"<Ája>"
"Ája" N Prop Sem/Org Sg Gen @>N #10->11
"<joavkkuin>"
"joavku" N Sg Com @<ADVL #11->3
"<.>"
"." CLB #12->3
"<Go>"
"go" CS @CVP #1->5
"<beakkán>"
"beakkán" A Attr @>N #2->3
"<artistta>"
"artista" N Sg Acc @OBJ> #3->5
"<guovttos>"
"guovttos" N Coll Sg Nom @SUBJ> #4->5
"<čuojaheaba>"
"čuojahit" V TV Ind Prs Du3 @FS-ADVL> #5->13
"<oktasaš>"
"oktasaš" A Attr @>N #6->7
"<konsearttaid>"
"konsearta" N Pl Gen @>N #7->9
"<vuosttaš>"
"vuosttaš" A Ord Attr @>N #8->9
"<gearddi>"
"geardi" N Sg Gen @<ADVL #9->5
"gearddi" Adv @<ADVL #9->5
"<,>"
"," CLB #10->1
"<de>"
"de" Adv @ADVL> #11->13
"<šaddaba>"
"šaddat" V IV Ind Prs Du3 @FAUX #12->0
"<garvit>"
"garvit" V TV Inf @IMV #13->12
"<sámi>"
"sápmi" N Sg Gen @>N #14->15
"<báikkiid>"
"báiki" N Pl Acc @<OBJ #15->13
"<gos>"
"gos" Adv @ADVL> #16->18
"<girkuin>"
"girku" N Pl Loc @ADVL> #17->18
"<lea>"
"leat" V IV Ind Prs Sg3 @FS-<ADVL #18->13
"<juoigangielddus>"
"juoigan#gielddus" N Sg Nom <ext> @<SUBJ #19->18
"<.>"
"." CLB #20->12
"<Dattetge>"
"dattetge" Adv @ADVL> #1->2
"<illudeaba>"
"illudit" V IV Ind Prs Du3 @FMV #2->0
"<sakka>"
"sakka" Adv @>N #3->4
"<ovttasbargui>"
"ovttasbargu" N Sg Ill @<ADVL #4->2
"<.>"
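As a first step towards mapping this output to LAF-compatible annotations, the stream can be read into a simple intermediate structure. The following Python sketch is only an illustration under stated assumptions: the class and function names (`Cohort`, `Reading`, `parse_cg3`, `split_sentences`) are hypothetical, tags starting with `@` are treated as grammatical function labels, `#n->m` is read as "token n depends on token m", and a new sentence is assumed to start wherever the token numbering restarts. It does not produce LAF or GrAF itself.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Dependency marker as it appears in the sample output, e.g. "#3->0".
DEP_RE = re.compile(r"#(\d+)->(\d+)")


@dataclass
class Reading:
    lemma: str
    tags: List[str] = field(default_factory=list)       # e.g. N, Sg, Nom
    functions: List[str] = field(default_factory=list)  # e.g. @SUBJ>
    dep_id: Optional[int] = None                         # n in "#n->m"
    head: Optional[int] = None                           # m in "#n->m"


@dataclass
class Cohort:
    form: str                                            # surface word form
    readings: List[Reading] = field(default_factory=list)


def parse_cg3(stream):
    """Parse CG-3 style output into a flat list of cohorts."""
    cohorts = []
    for raw in stream.splitlines():
        line = raw.strip()  # tolerate readings with or without indentation
        if line.startswith('"<') and line.endswith('>"'):
            cohorts.append(Cohort(form=line[2:-2]))
        elif line.startswith('"') and cohorts:
            lemma, _, rest = line[1:].partition('"')
            reading = Reading(lemma=lemma)
            for tok in rest.split():
                match = DEP_RE.fullmatch(tok)
                if match:
                    reading.dep_id = int(match.group(1))
                    reading.head = int(match.group(2))
                elif tok.startswith("@"):
                    reading.functions.append(tok)
                else:
                    reading.tags.append(tok)
            cohorts[-1].readings.append(reading)
    return cohorts


def split_sentences(cohorts):
    """Group cohorts into sentences (heuristic: numbering restarts)."""
    sentences, last_id = [], None
    for cohort in cohorts:
        dep_id = cohort.readings[0].dep_id if cohort.readings else None
        restarts = (dep_id is not None and last_id is not None
                    and dep_id <= last_id)
        if not sentences or restarts:
            sentences.append([])
        sentences[-1].append(cohort)
        if dep_id is not None:
            last_id = dep_id
    return sentences
```

In such a structure each cohort would correspond to a token-level unit, the plain tags to the morpho-syntactic layer, the `@` labels to the grammatical function layer, and the `(dep_id, head)` pairs to dependency edges. Cohorts that keep more than one reading, as for gearddi in the sample, stay ambiguous, so the mapping has to allow several analyses per token.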