
LapDevelopment_Giellatekno

StephanOepen edited this page Jan 27, 2015 · 15 revisions

Background

In wrapping the UiT Giellatekno (GT) pipeline for Sami segmentation, tokenization, morphological analysis, and dependency parsing, we need to decide (a) how many individual processing steps to distinguish (from the LAP point of view) and (b) how to map the input and output data of the GT pipeline to LAF-compatible annotations.

Following is a sample invocation (against the LAP Tree), which was originally suggested by Trond Trosterud of UiT:

  cat $LAPTREE/gt/etc/test.txt \
  | $LAPTREE/perl/lap/perl $LAPTREE/gt/script/preprocess --abbr=$LAPTREE/gt/sme/abbr.txt \
  | $LAPTREE/hfst/lap/hfst-lookup $LAPTREE/gt/sme/analyser-gt-norm.hfstol \
  | $LAPTREE/perl/lap/perl $LAPTREE/gt/script/lookup2cg | sed 's/0.000000//g' \
  | $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/disambiguation.cg3 \
  | $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/functions.cg3 \
  | $LAPTREE/vislcg3/lap/vislcg3 -g $LAPTREE/gt/sme/dependency.cg3

It probably makes sense to conceptually distinguish the following layers:

  • (a) (maybe normalization and) tokenization (preprocess)
  • (b) morphological analysis (hfst-lookup)
  • (c) segmentation and morpho-syntactic tagging (disambiguation.cg3)
  • (d) grammatical function analysis (functions.cg3)
  • (e) dependency parsing (dependency.cg3)
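
As a rough illustration of how these layers could be kept apart, the sample invocation can be split into one stdin-to-stdout filter per layer. This is only a sketch: the function names are hypothetical, the commands and paths are copied from the invocation above, and attaching the lookup2cg conversion and the sed clean-up to layer (b) is an assumption rather than a settled decision.

```shell
# Sketch only: one shell function per conceptual layer, each reading stdin
# and writing stdout. Function names are hypothetical; commands and paths are
# taken from the sample invocation and assume that LAPTREE is set.

# (a) (maybe normalization and) tokenization
gt_tokenize() {
  "$LAPTREE/perl/lap/perl" "$LAPTREE/gt/script/preprocess" \
    --abbr="$LAPTREE/gt/sme/abbr.txt"
}

# (b) morphological analysis; grouping the lookup2cg conversion and the sed
# clean-up with this layer is an assumption of this sketch
gt_analyse() {
  "$LAPTREE/hfst/lap/hfst-lookup" "$LAPTREE/gt/sme/analyser-gt-norm.hfstol" \
  | "$LAPTREE/perl/lap/perl" "$LAPTREE/gt/script/lookup2cg" \
  | sed 's/0.000000//g'
}

# (c), (d), (e): one vislcg3 pass per grammar file
gt_cg3() {
  "$LAPTREE/vislcg3/lap/vislcg3" -g "$LAPTREE/gt/sme/$1.cg3"
}
gt_disambiguate() { gt_cg3 disambiguation; }
gt_functions()    { gt_cg3 functions; }
gt_dependencies() { gt_cg3 dependency; }
```

Chained together, gt_tokenize < $LAPTREE/gt/etc/test.txt | gt_analyse | gt_disambiguate | gt_functions | gt_dependencies should be equivalent to the single pipeline, while each intermediate result can also be captured separately if a layer is to be exposed on its own.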

Sample Input

Dán skábma čuojaha Mari Boine ovttas Kai Sombyn ja Ája joavkkuin. Go beakkán artistta guovttos čuojaheaba oktasaš konsearttaid vuosttaš gearddi, de šaddaba garvit sámi báikkiid gos girkuin lea juoigangielddus. Dattetge illudeaba sakka ovttasbargui.

Sample Output (morphological analysis, before disambiguation)

"<Dán>"
         "dát" Pron Dem Sg Loc Attr
         "dát" Pron Dem Sg Ill Attr
         "dát" Pron Dem Sg Acc
         "dát" Pron Dem Sg Gen
"<skábma>"
         "skábma" N Sg Nom
"<čuojaha>"
         "čuodjat" V* IV* Der/h V TV Ind Prs Sg3
         "čuojahit" V TV Ind Prs Sg3
"<Mari>"
         "Mari" N Prop Sem/Fem Sg Acc
         "Mari" N Prop Sem/Fem Sg Nom
         "Mari" N Prop Sem/Fem Attr
         "Mari" N Prop Sem/Fem Sg Gen
"<Boine>"
         "Boine" N Prop Sem/Sur Sg Gen
         "Boine" N Prop Sem/Sur Sg Nom
         "Boine" N Prop Sem/Sur Sg Acc
"<ovttas>"
         "ovttas" N Coll Sg Nom
         "ovttastit" V TV Imprt ConNeg
         "ovttastit" V TV Ind Prs ConNeg
         "ovttastit" V TV Imprt Sg2
         "ovttas" Adv
         "okta" Num Sg Loc
         "ovttastit" V TV VGen
"<Kai>"
         "Kai" N Prop Sem/Mal Sg Nom
         "Kai" N Prop Sem/Mal Sg Gen
         "Kai" N Prop Sem/Mal Attr
         "Kai" N Prop Sem/Mal Sg Acc
"<Sombyn>"
         "Somby" N Prop Sem/Sur Ess
"<ja>"
         "ja" CC
"<Ája>"
         "Ája" N Prop Sem/Org Sg Gen
         "ája" N Sg Nom
         "Ája" N Prop Sem/Org Sg Nom
         "Ája" N Prop Sem/Org Sg Acc
"<joavkkuin>"
         "joavku" N Pl Loc
         "joavku" N Sg Com
"<.>"
         "." CLB
"<Go>"
         "go" Pcle Qst
         "go" CS
"<beakkán>"
         "beakkán" A Sg Gen
         "beakkán" A Sg Nom
         "beakkán" A Sg Acc
         "beaggit" V IV Ind Prs Sg1
         "beakkán" A Attr
"<artistta>"
         "artista" N Sg Gen
         "artista" N Sg Acc
"<guovttos>"
         "guovttos" N Coll Sg Loc
         "guovttos" N Coll Sg Gen PxSg3
         "guovttos" N Coll Sg Nom
         "guovttos" N Coll Sg Acc PxSg3
"<čuojaheaba>"
         "čuojahit" V TV Ind Prs Du3
         "čuodjat" V* IV* Der/h V TV Ind Prs Du3
"<oktasaš>"
         "oktasaš" A Attr
         "oktasaš" A Sg Nom
"<konsearttaid>"
         "konsearta" N Pl Acc
         "konsearta" N Pl Gen
"<vuosttaš>"
         "vuosttaš" A Ord Sg Nom
         "vuosttaš" A Ord Attr
"<gearddi>"
         "gearddi" Adv
         "geardi" N Sg Gen
         "geardi" N Sg Acc
"<,>"
         "," CLB
"<de>"
         "de" Adv
"<šaddaba>"
         "šaddat" V IV Ind Prs Du3
"<garvit>"
         "garvit" V TV Inf
         "garvit" V TV Ind Prs Pl1
         "garvit" V* TV* Der/NomAg N Pl Nom
         "garvit" V TV Imprt Pl2
"<sámi>"
         "sápmi" N Sg Acc
         "sápmi" N Sg Gen
"<báikkiid>"
         "báiki" N Pl Gen
         "báiki" N Pl Acc
"<gos>"
         "gos" Adv
"<girkuin>"
         "girku" N Sg Com
         "girku" N Pl Loc
"<lea>"
         "leat" V IV Ind Prs Sg3
"<juoigangielddus>"
         "juoigan#gieldu" N Sg Acc PxSg3
         "juoigan#gieldu" N Sg Loc
         "juoigan#gieldu" N Sg Gen PxSg3
         "juoigan#gielddus" N Sg Nom
"<.>"
         "." CLB
"<Dattetge>"
         "dattetge" Adv
"<illudeaba>"
         "illudit" V IV Ind Prs Du3
"<sakka>"
         "sakka" Adv
"<ovttasbargui>"
         "ovttasbargu" N Sg Ill
         "ovttasbargat" V* IV* Der/PassS V IV Ind Prt Sg3
"<.>"
         "." CLB

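To make the cohort format above easier to map onto token-level annotations, the following sketch flattens it into one tab-separated row per reading: token index, surface form, lemma, and tag string. The function name cg_readings and the column layout are assumptions made for illustration only; the sketch assumes one cohort line of the form "<form>" followed by indented reading lines, as in the sample.

```shell
# Sketch only: flatten CG-style cohorts into tab-separated rows.
# The name cg_readings and the column order are hypothetical.
cg_readings() {
  awk '
    /^"<.*>"$/ {                      # cohort line: "<form>"
      n++                             # running token index
      form = substr($0, 3, length($0) - 4)
      next
    }
    /^[ \t]+"/ {                      # reading line: "lemma" tag tag ...
      line = $0
      sub(/^[ \t]+"/, "", line)       # drop indentation and opening quote
      q = index(line, "\"")           # position of the closing quote
      lemma = substr(line, 1, q - 1)
      tags = substr(line, q + 1)
      sub(/^[ \t]+/, "", tags)
      printf "%d\t%s\t%s\t%s\n", n, form, lemma, tags
    }
  '
}
```

Piping the rows through cut -f1,2 | uniq -c then gives the number of readings per token, which is a quick way to see how much ambiguity is left for the disambiguation step.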
Sample Output (after disambiguation, function tagging, and dependency parsing)

"<Dán>"
        "dát" Pron Dem Sg Acc @OBJ> #1->3
"<skábma>"
        "skábma" N Sg Nom @SUBJ> #2->3
"<čuojaha>"
        "čuojahit" V TV Ind Prs Sg3 @FMV #3->0
"<Mari>"
        "Mari" N Prop Sem/Fem Attr @>N #4->5
"<Boine>"
        "Boine" N Prop Sem/Sur Sg Nom @<SUBJ #5->3
"<ovttas>"
        "ovttas" Adv @<ADVL #6->3
"<Kai>"
        "Kai" N Prop Sem/Mal Attr @>N #7->8
"<Sombyn>"
        "Somby" N Prop Sem/Sur Ess @<OPRED #8->1
"<ja>"
        "ja" CC @CNP #9->8
"<Ája>"
        "Ája" N Prop Sem/Org Sg Gen @>N #10->11
"<joavkkuin>"
        "joavku" N Sg Com @<ADVL #11->3
"<.>"
        "." CLB #12->3

"<Go>"
        "go" CS @CVP #1->5
"<beakkán>"
        "beakkán" A Attr @>N #2->3
"<artistta>"
        "artista" N Sg Acc @OBJ> #3->5
"<guovttos>"
        "guovttos" N Coll Sg Nom @SUBJ> #4->5
"<čuojaheaba>"
        "čuojahit" V TV Ind Prs Du3 @FS-ADVL> #5->13
"<oktasaš>"
        "oktasaš" A Attr @>N #6->7
"<konsearttaid>"
        "konsearta" N Pl Gen @>N #7->9
"<vuosttaš>"
        "vuosttaš" A Ord Attr @>N #8->9
"<gearddi>"
        "geardi" N Sg Gen @<ADVL #9->5
        "gearddi" Adv @<ADVL #9->5
"<,>"
        "," CLB #10->1
"<de>"
        "de" Adv @ADVL> #11->13
"<šaddaba>"
        "šaddat" V IV Ind Prs Du3 @FAUX #12->0
"<garvit>"
        "garvit" V TV Inf @IMV #13->12
"<sámi>"
        "sápmi" N Sg Gen @>N #14->15
"<báikkiid>"
        "báiki" N Pl Acc @<OBJ #15->13
"<gos>"
        "gos" Adv @ADVL> #16->18
"<girkuin>"
        "girku" N Pl Loc @ADVL> #17->18
"<lea>"
        "leat" V IV Ind Prs Sg3 @FS-<ADVL #18->13
"<juoigangielddus>"
        "juoigan#gielddus" N Sg Nom <ext> @<SUBJ #19->18
"<.>"
        "." CLB #20->12

"<Dattetge>"
        "dattetge" Adv @ADVL> #1->2
"<illudeaba>"
        "illudit" V IV Ind Prs Du3 @FMV #2->0
"<sakka>"
        "sakka" Adv @>N #3->4
"<ovttasbargui>"
        "ovttasbargu" N Sg Ill @<ADVL #4->2
"<.>"
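
For the dependency-annotated output, each reading carries a function tag starting with @ and a relation of the form #n->m, where n is the token's position in the sentence and m is the position of its head (0 for the root). The sketch below pulls these out as tab-separated rows of token position, head position, surface form, and function tag. The function name cg_dependencies, the column layout, and the "_" placeholder for readings without a function tag are assumptions for illustration.

```shell
# Sketch only: extract dependency edges from CG-style output.
# The name cg_dependencies and the column order are hypothetical.
cg_dependencies() {
  awk '
    /^"<.*>"$/ {                      # cohort line: remember the surface form
      form = substr($0, 3, length($0) - 4)
      next
    }
    /^[ \t]+"/ {                      # reading line
      label = "_"                     # placeholder when no @-tag is present
      dep = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^@/) label = $i
        else if ($i ~ /^#[0-9]+->[0-9]+$/) dep = $i
      }
      if (dep == "") next             # reading without a dependency relation
      sub(/^#/, "", dep)
      split(dep, ends, "->")          # ends[1] = token, ends[2] = head
      printf "%s\t%s\t%s\t%s\n", ends[1], ends[2], form, label
    }
  '
}
```

A cohort that keeps more than one reading after disambiguation (as with gearddi in the second sentence) yields one row per reading, so a consumer mapping this to annotations would still need to decide how to represent the remaining ambiguity.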