UtTop
Ubertagging is our name for supertagging over ambiguous tokenisation. The process filters the lexical lattice prior to full parsing according to a statistical model (a trigram semi-HMM; see Dridan, 2013 for details). As of the 1214 version of the ERG, mechanisms are in place to use ubertagging when parsing with PET and the ERG.
PET looks for ubertagging-specific files in a ut/ subdirectory of the grammar. You will find two types of files there:
- model files: these come in pairs (transition and emission), and you select which set to use by using the basename in the settings file, described in Runtime configuration below.
- configuration files: pending a more transparent way to extract this
information from the grammar directly, three configuration files
specify required information in an easily accessible fashion. These
files need to be kept up to date with respect to the grammar. They
are:
- generics.cfg - a mapping from generic lexical type to the appropriate native lexical type
- prefix.cfg - possible surface strings for each prefix inflection rule
- suffix.cfg - possible surface strings for each suffix inflection rule
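The layout described above can be sanity-checked before parsing. The following is a minimal Python sketch and not part of PET: the helper name missing_ut_files is hypothetical, and it only checks the three configuration files listed above, because the model file names depend on the basename you choose.

```python
from pathlib import Path

# The three configuration files named in the list above.
CONFIG_FILES = ("generics.cfg", "prefix.cfg", "suffix.cfg")


def missing_ut_files(grammar_dir, names=CONFIG_FILES):
    """Return the names that are not present in <grammar_dir>/ut/.

    Hypothetical helper for illustration; PET does its own lookup.
    """
    ut_dir = Path(grammar_dir) / "ut"
    return [name for name in names if not (ut_dir / name).is_file()]
```

An empty list means all three configuration files are where PET expects them; it says nothing about whether they are up to date with respect to the grammar.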
In addition, various options need to be set. These are handled using the standard PET settings mechanism, with .set files under the pet/ subdir. See below for the actual settings.
To use ubertagging with PET, give the -ut[=file] option to the parser. The file should be a settings file in the pet/ subdir of the grammar. The required options are:
- ut-model - the basename of the model files
- generics_map - the name of the generics mapping file in the ut/ subdir
- prefixes - the name of the prefix file in the ut/ subdir
- suffixes - the name of the suffix file in the ut/ subdir
Other possible options:
- ut-threshold - this is the tag probability under which associated lexical items are filtered. It can be set in the configuration file, or else on the cheap commandline as -lpthreshold=n (0 < n < 1). The commandline option will override the config file option. If no threshold is given, probabilities are calculated, but no filtering is done. (This can be useful for debugging, with the right output setup.)
- ut-viterbi - if set to true will filter all lexical items not associated with the single best path through the lexical lattice, as calculated by Viterbi. The threshold is ignored in this case.
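The way these two options interact can be summarised in a few lines of Python. This is only a sketch of the precedence rules stated above, not PET's implementation: the function names and the (item, probability) representation are assumptions made for illustration, and whether an item exactly at the threshold is kept is also an assumption.

```python
def resolve_filter_mode(config_threshold, cli_threshold, viterbi):
    """Decide which filtering applies, following the rules described above.

    Returns a (mode, threshold) pair where mode is "viterbi",
    "threshold" or "none".
    """
    if viterbi:
        # Viterbi filtering ignores any threshold.
        return ("viterbi", None)
    # The commandline value overrides the settings file value.
    threshold = cli_threshold if cli_threshold is not None else config_threshold
    if threshold is None:
        # Probabilities are still calculated, but nothing is filtered.
        return ("none", None)
    if not 0 < threshold < 1:
        raise ValueError("threshold must satisfy 0 < n < 1")
    return ("threshold", threshold)


def keep_items(items, threshold):
    """Keep lexical items whose tag probability is not under the threshold.

    `items` is an illustrative list of (item, probability) pairs.
    """
    return [item for item, prob in items if prob >= threshold]
```

For example, with a settings file threshold of 0.01 and -lpthreshold=0.05 on the commandline, resolve_filter_mode(0.01, 0.05, False) selects threshold filtering at 0.05.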
The options regarding tag type, caseclass separator and whether or not to map generics are all set at model training time, and as such are selected by selecting the right model.
An example file, ut.set, is shown below. It would be invoked by giving cheap the option -ut=ut.
ut-model := tri-nanc-wsjr5-noaffix.
ut-threshold := "0.01".
;; uncomment to turn on full Viterbi filtering
;;ut-viterbi := true.
generics_map := "generics.cfg".
prefixes := "prefix.cfg".
suffixes := "suffix.cfg".
;;for model creation, set from model when tagging
ut-caseclass_separator := ▲.
ut-tagtype := NOAFFIX.
ut-mapgen := true.
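To check a settings file like the one above for the required options before running the parser, something along these lines may help. It is a minimal Python sketch and not PET's own settings reader: it assumes only the simple one-value "name := value." form and ";" comments seen in the example, and the names parse_settings and missing_required are hypothetical.

```python
import re

# The four options listed as required for ubertagging.
REQUIRED = ("ut-model", "generics_map", "prefixes", "suffixes")

# Matches the simple form `name := value.` used in the example file.
_LINE = re.compile(r"^([\w-]+)\s*:=\s*(.*?)\s*\.\s*$")

EXAMPLE = """\
ut-model := tri-nanc-wsjr5-noaffix.
ut-threshold := "0.01".
;;ut-viterbi := true.
generics_map := "generics.cfg".
prefixes := "prefix.cfg".
suffixes := "suffix.cfg".
"""


def parse_settings(text):
    """Read `name := value.` lines into a dict, skipping ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        match = _LINE.match(line)
        if match:
            # Drop surrounding double quotes, if any.
            settings[match.group(1)] = match.group(2).strip('"')
    return settings


def missing_required(settings):
    """Return the required option names that are absent from settings."""
    return [name for name in REQUIRED if name not in settings]
```

Calling missing_required(parse_settings(EXAMPLE)) returns an empty list for the example, since all four required options are set; the commented-out ut-viterbi line is skipped.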