LogonProcessing_BatchParsing
Batch parsing for LOGON is the process of sending a collection of inputs (often termed test items) through the analysis component and recording a range of metrics on grammar and system behavior in the [incr tsdb()] database (abbreviated itsdb on this page).
In the standard LOGON set-up, the itsdb cpu definition that instantiates the Norwegian parsing client (using NorGram) is termed :norgram. Thus, the Lisp command
(tsdb:tsdb :cpu :norgram :task :parse :file t)
will create a new process that runs the analysis grammar (i.e. the XLE running inside the LOGON Common-Lisp wrapper and spawning a tokenizer and morphological analysis server autonomously).
Once loaded, the client will register itself with the itsdb server, e.g.
wait-for-clients(): `ld.uio.no' registered as tid <40044> [1:40].
In order to interactively run batch parsing from the itsdb podium, first create a target profile, i.e. a new itsdb database to record the results of batch processing. Typically, this can be achieved by means of the File | Create menu, which provides a choice of pre-defined data sets (termed test suite skeletons, as they provide the test material but no processing results yet). See below for extra functionality that allows batch processing from a plain textual input file. Executing the File | Create menu command will prompt for a new database name (suggesting a name based on the skeleton, processing engine and grammar and current date) and then add a new profile entry to the list of profiles showing in the itsdb podium.
Assuming the target profile is selected in the podium window (i.e. highlighted), make sure that Process | Switches is set to Parsing and then execute one of the processing commands, e.g. Process | All Items.
A streamlined way of running itsdb batch parsing is by means of the LOGON parse script. The script resides in the top-level LOGON directory $LOGONROOT and is invoked from a command shell, e.g.
$LOGONROOT/parse mrs
The fan-out parse script requires a functional LOGON installation and will first load up the itsdb environment and then configure one or more parsing clients. As a result of running the parse script, a new itsdb profile will be stored in the itsdb profile repository, and a log file will be generated in the user home directory. For the example command above, the profile will be called norgram/mrs/05-11-16/xle (assuming the current date was November 16, 2005, and the NorGram parsing client was active), with its corresponding log file mrs.2005-11-16.log.
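The naming pattern in this example can be sketched as a pair of small shell functions. This is only an illustration inferred from the names quoted on this page; profile_name and log_name are hypothetical helpers, not part of LOGON, and the actual script may compose its names differently. The optional last argument corresponds to the --suffix option of the parse script.

```shell
# Hypothetical helpers (not part of LOGON): rebuild the names quoted in the text.

# Usage: profile_name GRAMMAR SKELETON ENGINE YYYY-MM-DD [SUFFIX]
profile_name() {
  grammar=$1 skeleton=$2 engine=$3 day=$4 suffix=${5:-}
  # The profile name carries a two-digit year: 2005-11-16 becomes 05-11-16
  printf '%s/%s/%s/%s%s\n' "$grammar" "$skeleton" "${day#??}" "$engine" "$suffix"
}

# Usage: log_name SKELETON YYYY-MM-DD
log_name() {
  # The log file keeps the four-digit year
  printf '%s.%s.log\n' "$1" "$2"
}

profile_name norgram mrs xle 2005-11-16   # prints norgram/mrs/05-11-16/xle
log_name mrs 2005-11-16                   # prints mrs.2005-11-16.log
```

A helper of this kind can be handy for locating the profile and log of a given run after the fact, e.g. from a nightly job.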
Note that parsing Norwegian (using the NorGram LFG grammar) is only possible with proprietary add-ons installed (see the LogonExtras page for details). Using the core LOGON tree (without the XLE and a full Allegro Common Lisp), it is necessary to request the LOGON system to use its pre-compiled run-time binaries (by virtue of the --binary command line option), and to request another analysis client, for example --erg, --gg, --jacy, or --srg (or variants of these, see below). By default, the LOGON tree sets the itsdb skeleton directory to Norwegian. The Options | Skeleton Root menu command in the itsdb podium can be used to change the set of available skeletons interactively. The batch parse script, on the other hand, will select the appropriate language based on the choice of analysis grammar used. To parse a file in textual input format using the German Grammar, for example, the following command could be used:
$LOGONROOT/parse --binary --gg --text $LOGONROOT/dfki/gg/data/mrs.deen.txt
The LOGON parse script has a few command line options that facilitate a limited amount of customization: --count n will parallelize processing and start up n full instantiations of the parser client; --suffix string will append string to the name of the newly created profile, e.g. when more than one run per day needs to be recorded. Finally, the --text option (which requires no argument) switches the input source to a plain-text test item file (see below), rather than a pre-existing itsdb skeleton. Thus, a command like:
$LOGONROOT/parse --text --suffix ".42" /tmp/avis.txt
will import test items from the file avis.txt in the /tmp/ directory into a new [incr tsdb()] profile (called norgram/avis/05-11-16/xle.42) prior to batch processing. While batch parsing from textual input files adds flexibility, it is often desirable to freeze frequently used data sets as itsdb skeletons, so as to make sure that a stable version of a data set is readily available from the File | Create menu.
- --norgram: the default client, using NorGram, not supported in --binary mode; this client definition depends on availability of the XLE system and Allegro Common Lisp.
- --erg or --erg+tnt: the English Resource Grammar, optionally using TnT for input pre-processing and unknown word handling; this client uses PET for parsing.
- --gg: the German Grammar; this client (for the time being) uses the LKB parser.
- --jacy or --jacy+chasen: the Japanese Grammar, optionally using ChaSen for input segmentation, morphological analysis, and PoS tagging; this client uses PET.
- --srg: the Spanish Resource Grammar, always using FreeLing for input pre-processing and morphological analysis; this client uses PET.
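When wrapping the parse script in a larger job, it can help to check the client option before launching a long run. The lookup below simply restates the list above as a shell case statement; client_parser is a hypothetical helper and not something LOGON provides.

```shell
# Hypothetical helper (not part of LOGON): map a client option to the parser
# that the list above associates with it.
client_parser() {
  case $1 in
    --norgram)             echo "XLE" ;;   # not supported in --binary mode
    --erg|--erg+tnt)       echo "PET" ;;
    --gg)                  echo "LKB" ;;
    --jacy|--jacy+chasen)  echo "PET" ;;
    --srg)                 echo "PET" ;;
    *) echo "unknown client option: $1" >&2; return 1 ;;
  esac
}

client_parser --gg        # prints LKB
client_parser --erg+tnt   # prints PET
```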
When using the --text option to the LOGON parse script or the [incr tsdb()] File | Import | Test Items command, processing will first construct the target profile from a plain-text input file, essentially a newline-separated list of test items (always using, in LOGON at least, UTF-8 encoding).
Following is an example textual input file comprising three test items:
Vi skal møte Ask på mandag.
Ta båt til Ortnevik.
Tar du båten til Ortnevik, kan du gå stien samme dagen.
Since we are running this on Unix, it is important to produce Unix-style linebreaks, i.e. either create the file in a Unix environment or make sure that each line ends in ^J (linefeed) alone, without a preceding ^M (carriage return).
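One way to check a test item file for stray carriage returns, and to strip them, is sketched below using standard Unix tools. The function names has_cr and strip_cr are hypothetical helpers introduced for illustration; the temporary files merely stand in for a real input file.

```shell
# Hypothetical helper: succeed (exit 0) if FILE contains any carriage return.
has_cr() {
  grep -q "$(printf '\r')" "$1"
}

# Hypothetical helper: write a copy of FILE to OUT with carriage returns removed.
# Usage: strip_cr FILE OUT
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Example with a throwaway file that has DOS-style (CR LF) line endings
tmp=$(mktemp) && clean=$(mktemp)
printf 'Vi skal møte Ask på mandag.\r\nTa båt til Ortnevik.\r\n' > "$tmp"
if has_cr "$tmp"; then
  strip_cr "$tmp" "$clean"
fi
has_cr "$clean" || echo "no carriage returns left"
rm -f "$tmp" "$clean"
```

Writing to a separate output file rather than editing in place keeps the original around in case the conversion needs to be repeated.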