LogonProcessing_BatchParsing
Batch parsing for LOGON is the process of sending a collection of inputs (often termed test items) through the analysis component, collecting a range of metrics of grammar and system behavior in the [incr tsdb()] database.
In the standard LOGON set-up, the itsdb cpu definition that instantiates the Norwegian parsing client (using NorGram) is termed :norgram. Thus, the Lisp command
(tsdb:tsdb :cpu :norgram :task :parse :file t)
will create a new process that runs the analysis grammar (i.e. the XLE running inside the LOGON Common-Lisp wrapper and spawning a tokenizer and morphological analysis server autonomously).
Once loaded, the client will register itself with the itsdb server, e.g.
wait-for-clients(): `ld.uio.no' registered as tid <40044> [1:40].
In order to interactively run batch parsing from the itsdb podium, first create a target profile, i.e. a new itsdb database to record the results of batch processing. Typically, this can be achieved by means of the File | Create menu, which provides a choice of pre-defined data sets (termed test suite skeletons, as they provide the test material but no processing results yet). See below for extra functionality that allows batch processing from a plain textual input file. Executing the File | Create menu command will prompt for a new database name (suggesting a name based on the skeleton, processing engine and grammar and current date) and then add a new profile entry to the list of profiles showing in the itsdb podium.
Assuming the target profile is selected in the podium window (i.e. highlighted), make sure that Process | Switches is set to Parsing and then execute one of the processing commands, e.g. Process | All Items.
A streamlined way of running itsdb batch parsing is by means of the LOGON parse script. The script resides in the top-level LOGON directory $LOGONROOT and is invoked from a command shell, e.g.
$LOGONROOT/parse --norgram mrs
The fan-out parse script requires a functional LOGON installation (please see the LogonInstallation page for background information); it first loads the itsdb environment and then configures one or more parsing clients. As a result of running the parse script, a new itsdb profile will be stored in the itsdb profile repository, and a log file will be generated in the user home directory. For the example command above, the profile will be called norgram/mrs/05-11-16/xle (assuming the current date was November 16, 2005), with its corresponding log file mrs.parse.05-11-16.log.
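The naming scheme in the example above can be sketched as a pair of small shell helpers, for instance to locate the log file of a run from another script. This is only an illustration: the function names are hypothetical, and the pattern (grammar, skeleton, two-digit date, engine, optional suffix) is inferred from the example names on this page rather than taken from the itsdb sources.

```shell
#!/bin/sh
# Hypothetical helpers; the naming pattern is inferred from the examples
# on this page (e.g. norgram/mrs/05-11-16/xle), not from the itsdb code.

profile_name() {
  # usage: profile_name GRAMMAR SKELETON ENGINE [DATE] [SUFFIX]
  grammar=$1; skeleton=$2; engine=$3
  stamp=${4:-$(date +%y-%m-%d)}   # defaults to today, as yy-mm-dd
  suffix=${5:-}
  printf '%s/%s/%s/%s%s\n' "$grammar" "$skeleton" "$stamp" "$engine" "$suffix"
}

log_name() {
  # usage: log_name SKELETON [DATE]
  skeleton=$1
  stamp=${2:-$(date +%y-%m-%d)}
  printf '%s.parse.%s.log\n' "$skeleton" "$stamp"
}

profile_name norgram mrs xle 05-11-16
log_name mrs 05-11-16
```

With the date given explicitly, the two calls at the end reproduce the profile and log file names quoted above; without a date argument, the helpers use the current date.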
Note that parsing Norwegian (using the NorGram LFG grammar) is only possible with proprietary add-ons installed (see the LogonExtras page for details). Using the core LOGON tree (without the XLE and a full Allegro Common Lisp), it is necessary to request the LOGON system to use its pre-compiled run-time binaries (by virtue of the --binary command line option), and to request another analysis client, for example --erg, --gg, --jacy, or --srg (or variants of these, see below). By default, the LOGON tree sets the itsdb skeleton directory to English (as of sometime in 2010; earlier, the default used to be Norwegian). The Options | Skeleton Root menu command in the itsdb podium can be used to change the set of available skeletons interactively. The batch parse script, on the other hand, will select the appropriate language based on the choice of analysis grammar used.
To parse a file in textual input format using the German Grammar, for example, the following command could be used:
cd $LOGONROOT
./parse --binary --gg --text ./dfki/gg/data/mrs.deen.txt
The LOGON parse script has a number of command line options that facilitate a limited amount of customization. Note that the script is not very robust in its option parsing, i.e. it is vital to spell everything exactly right (the script may just hang when given incorrect option names).
- --binary (which requires no argument) builds on the precompiled LOGON run-time binaries, instead of loading itsdb from source (the default);
- --reset (no argument) performs a complete restart of the PVM daemon (on the local host); use this option with care, as it will terminate other itsdb jobs on the same host;
- --count n parallelizes processing by starting up n full instantiations of the parser client;
- --suffix string appends string to the name for the newly created profile, e.g. when more than one run per day needs to be recorded;
- --target string sets the name of the newly created profile to string, rather than letting itsdb assign a name based on the specific configuration used and current date;
- --best n enables n-best parsing (e.g. selective unpacking, in clients that support it), with a maximum of n results;
- --gold string sets the name of the 'gold' profile (relevant to post-parsing operations like --update or --compare, see below for details) to string;
- --update (no argument) invokes an automated Redwoods treebank update after parsing, using the 'gold' profile as its source;
- --thin (no argument) applies a Redwoods thinning step (after the automated treebank update), writing into a secondary profile with the suffix .1 appended to the 'target' profile name;
- --compare string performs an in-depth comparison (equivalent to the interactive itsdb command 'Compare | Detail') against the 'gold' profile, on the fields named in string, e.g. pedges,readings (see below);
- --compress (no argument) compresses the profile (all non-empty database files) after parsing (and treebank updating, where applicable);
- --text (no argument) toggles the input source to a plain-text test item file (see below), rather than a pre-existing itsdb skeleton.
Thus, a command like:
./parse --norgram --text --suffix ".42" /tmp/avis.txt
will import test items from the file avis.txt in the /tmp/ directory into a new [incr tsdb()] profile (called norgram/avis/05-11-16/xle.42) prior to batch processing. While batch parsing from textual input files adds flexibility, it is often desirable to freeze frequently used data sets as itsdb skeletons, so as to make sure that a stable version of a data set is readily available from the File | Create menu.
- --norgram: the default client, using NorGram, not supported in --binary mode; this client definition depends on availability of the XLE system and Allegro Common Lisp.
- --erg or --erg+tnt: the English Resource Grammar, optionally using TnT for input pre-processing and unknown word handling; this client uses PET for parsing.
- --gg: the German Grammar; this client (for the time being) uses the LKB parser.
- --jacy or --jacy+chasen: the Japanese Grammar, optionally using ChaSen for input segmentation, morphological analysis, and PoS tagging; this client uses PET.
- --srg: the Spanish Resource Grammar, always using FreeLing for input pre-processing and morphological analysis; this client uses PET.
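Because the parse script may hang on a misspelled option, it can help to check the command line before handing it over. The following is a minimal sketch of such a pre-flight check; the option names are copied from the lists on this page, while the function names check_parse_options and safe_parse and the error message are hypothetical and not part of LOGON.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of LOGON): reject option names
# that do not appear in the lists on this page.

check_parse_options() {
  for arg in "$@"; do
    case $arg in
      --binary|--reset|--update|--thin|--compress|--text) ;;
      --count|--suffix|--target|--best|--gold|--compare) ;;
      --norgram|--erg|--erg+tnt|--gg|--jacy|--jacy+chasen|--srg) ;;
      --*) echo "unknown option: $arg" >&2; return 1 ;;
      *) ;;  # option values and the input name are passed through unchecked
    esac
  done
}

# Only invoke the parse script when every option name is known.
safe_parse() {
  check_parse_options "$@" &&
    "${LOGONROOT:?LOGONROOT is not set}/parse" "$@"
}
```

A call such as safe_parse --binary --gg --text ./dfki/gg/data/mrs.deen.txt would then behave like the plain parse command, while a typo in an option name fails immediately with an error message. The check only looks at option names; it does not verify that options taking an argument actually receive one.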
When using the --text option to the LOGON parse script or the [incr tsdb()] File | Import | Test Items command, processing will first construct the target profile from a plain-text input file, essentially a newline-separated list of test items (always using, in LOGON at least, UTF-8 encoding).
Following is an example textual input file comprising three test items:
Vi skal møte Ask på mandag.
Ta båt til Ortnevik.
Tar du båten til Ortnevik, kan du gå stien samme dagen.
Since we are running this on Unix, it is important to produce Unix-style linebreaks, i.e. either create the file in a Unix environment itself or make sure the linebreaks are ^J (linefeed) and not ^M (carriage return).
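A quick way to check an input file for carriage returns, and to strip them if needed, is sketched below. The helper names are hypothetical; the code relies only on the standard grep, tr, and printf utilities. Note that it handles DOS-style files (carriage return followed by linefeed); a file that uses bare carriage returns as its only line separator would need a different conversion.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names), using only standard tools.

has_carriage_returns() {
  # usage: has_carriage_returns FILE
  # succeeds when FILE contains at least one carriage return (^M)
  cr=$(printf '\r')
  grep -q "$cr" "$1"
}

to_unix_linebreaks() {
  # usage: to_unix_linebreaks INPUT OUTPUT
  # deletes every carriage return, turning CR+LF linebreaks into plain LF
  tr -d '\r' < "$1" > "$2"
}
```

For example, to_unix_linebreaks avis.dos.txt avis.txt (with illustrative file names) writes a cleaned copy, which can then be passed to the parse script with the --text option.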
During batch parsing, a condensed summary of processing results for each input item is written to the standard output (and also to the log file, which is named after the specific configuration used and current date). The itsdb Reference Manual (http://www.delph-in.net/itsdb/publications/index.html#manual) provides a discussion of the syntax (although some additional fields may have been added since the late 1990s).
The LOGON parse script can be used to partially automate regression testing, for example when making changes to a processing client like PET. Assuming a functional (and up-to-date, as of at least August 2011) LOGON installation, a command like the following can be used to establish a point of comparison (adjust the --count value to the number of CPUs you have available):
$LOGONROOT/parse --binary --erg --count 4 mrs
In general, the next step would be to invoke a different configuration on the same data, and compare the results in depth. For use with PET, the LOGON environment includes precompiled binaries (which are used in the predefined cpus), and the above command will by default use the current stable binary. For comparison to a binary external to the LOGON tree (e.g. the result of locally compiling a modified PET source tree), the parse script (or, strictly speaking, the LOGON wrapper for PET: $LOGONROOT/bin/cheap) can be made to use a different binary. This is accomplished by setting the environment variable $LOGONCHEAP to a suitable value, for example
LOGONCHEAP=~/src/pet/repp/debug/cheap/cheap \
$LOGONROOT/parse --binary --reset --erg --count 4 \
--suffix ".n" --gold magic --compare pedges,readings \
mrs
In the above command, the reserved value magic to the --gold option will determine the value of the 'gold' profile dynamically, viz. as the name of the (new) 'target' profile, stripped of the --suffix value. Alternatively, one could provide the full name of an existing 'gold' profile, of course, for example gold/erg/mrs.
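The effect of the magic value, as described here, amounts to removing the suffix from the end of the target profile name. A minimal sketch of that string operation follows; the function name is hypothetical, and the snippet only illustrates the described behaviour rather than showing how itsdb implements it.

```shell
#!/bin/sh
# Hypothetical illustration of the described 'magic' behaviour:
# the gold profile name is the target name without the --suffix value.

gold_for_target() {
  # usage: gold_for_target TARGET SUFFIX
  target=$1; suffix=$2
  # ${var%pattern} removes a trailing match; quoting keeps SUFFIX literal
  printf '%s\n' "${target%"$suffix"}"
}

gold_for_target norgram/avis/05-11-16/xle.42 .42
```

Using the profile name from the earlier --suffix example, the call prints norgram/avis/05-11-16/xle.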
The in-depth comparison of parsing results (using the pedges and readings fields, in the above example) will print out one line for each item where either of the fields shows a different value across the two profiles.
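The idea behind such a comparison can be sketched outside of itsdb as follows. This is a conceptual illustration only: it assumes two hypothetical plain-text exports with one line per item of the form "item-id pedges readings", which is not the actual itsdb profile format, and the output layout is likewise invented for the example.

```shell
#!/bin/sh
# Conceptual sketch only: the input files are hypothetical exports with
# lines of the form "<item-id> <pedges> <readings>", not itsdb profiles.

compare_fields() {
  # usage: compare_fields GOLD TARGET
  # prints one line per item whose pedges or readings value differs
  awk 'NR == FNR { gold[$1] = $2 " " $3; next }
       ($1 in gold) && gold[$1] != ($2 " " $3) {
         print $1 ": gold <" gold[$1] "> vs. target <" $2 " " $3 ">"
       }' "$1" "$2"
}
```

Items that are identical in both files produce no output, so an empty result indicates that the two runs agree on the chosen fields. Items present in only one of the two files are silently ignored in this sketch.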