LogonProcessing_BatchParsing
Batch parsing for LOGON is the process of sending a collection of inputs (often termed test items) through the analysis component and recording a range of metrics on grammar and system behavior in an [incr tsdb()] database.
In the standard LOGON set-up, the [incr tsdb()] cpu definition that instantiates the parsing client (using NorGram) is termed :norgram. Thus, the command
(tsdb :cpu :norgram)
will create a new process that runs the analysis grammar (i.e. the XLE running inside the LOGON Common-Lisp wrapper and spawning a tokenizer and morphological analysis server autonomously).
Once loaded, the client will register itself; e.g.
wait-for-clients(): `ld.uio.no' registered as tid <40044> [1:40].
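When scripting around client start-up, it can be handy to check a log for this registration message. The following is a minimal Python sketch, assuming the message format matches the single example line above. The name parse_registration is hypothetical and not part of [incr tsdb()], and the trailing bracketed field is deliberately left uninterpreted.

```python
import re

# Pattern inferred from the one example message shown above (an assumption,
# not a documented [incr tsdb()] log format).
REGISTRATION = re.compile(
    r"wait-for-clients\(\): `(?P<host>[^']+)' registered as tid <(?P<tid>\d+)>"
)


def parse_registration(line):
    """Return (host, tid) if the line reports a client registration, else None."""
    match = REGISTRATION.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("tid"))
```

A caller could scan the output line by line and proceed once the expected number of clients has been seen.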
In order to run batch parsing from the [incr tsdb()] podium, first create a target profile, i.e. a new [incr tsdb()] database to record the results of batch processing. Typically, this can be achieved by means of the File | Create menu, which provides a choice of pre-defined data sets (termed test suite skeletons, as they provide the test material but no processing results yet). See below for extra functionality that allows batch processing from an ASCII input file. Executing the File | Create menu command will prompt for a new database name (suggesting a name based on the skeleton, processing engine and grammar and current date) and then add a new profile entry to the list of profiles showing in the [incr tsdb()] podium.
Assuming the target profile is selected in the podium window (i.e. highlighted), make sure that Process | Switches is set to Parsing and then execute one of the processing commands, e.g. Process | All Items.
A streamlined way of running [incr tsdb()] batch parsing is by means of the LOGON parse script. The script resides in the top-level LOGON directory $LOGONROOT and is invoked from a command shell, e.g.
$LOGONROOT/parse mrs
The fan-out parse script requires a functional LOGON installation (see separate instructions; currently on the LOGON workspace, September 2005) and will first load up the [incr tsdb()] environment and then configure one or more parsing clients. As a result of running the parse script, a new [incr tsdb()] profile will be stored in the [incr tsdb()] profile repository, and a log file will be generated in the user home directory. For the example command above, the profile will be called norgram/mrs/05-11-16/xle (assuming the current date was 16-nov-05 and the NorGram parsing client was active), with its corresponding log file mrs.2005-11-16.log.
The LOGON parse script has a few command line options that allow a limited amount of customization: --count n will parallelize processing and start up n full instantiations of the parser client; --suffix string will append string to the name of the newly created profile, e.g. when more than one run per day needs to be recorded. Finally, the --ascii option (which takes no argument) switches the input source to an ASCII test item file (see below), rather than a pre-existing [incr tsdb()] skeleton. Thus, a command like:
$LOGONROOT/parse --ascii --suffix ".42" /tmp/avis.txt
will import test items from the file avis.txt in the /tmp/ directory into a new [incr tsdb()] profile (called norgram/avis/05-11-16/xle.42) prior to batch processing. While batch parsing from ASCII input files adds flexibility, it is often desirable to freeze frequently used data sets as [incr tsdb()] skeletons, so as to make sure that a stable version of a data set is readily available from the File | Create menu.
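The naming pattern in the two examples above (norgram/mrs/05-11-16/xle and norgram/avis/05-11-16/xle.42) can be sketched in a few lines of Python, e.g. to locate the profile after a run. This is an illustration inferred from those examples only, not the parse script's actual logic; the function names are hypothetical, and the log file pattern is attested only for the skeleton case (mrs.2005-11-16.log).

```python
import os
from datetime import date


def expected_profile_name(source, run_date, suffix="", grammar="norgram", engine="xle"):
    """Guess the profile name for a skeleton name or an ASCII input file path.

    Assumption: for ASCII input, the directory and extension are dropped,
    so /tmp/avis.txt yields 'avis', as in the example above.
    """
    stem = os.path.splitext(os.path.basename(source))[0]
    return f"{grammar}/{stem}/{run_date:%y-%m-%d}/{engine}{suffix}"


def expected_log_name(source, run_date):
    """Guess the log file name; only the skeleton case appears in the text."""
    stem = os.path.splitext(os.path.basename(source))[0]
    return f"{stem}.{run_date:%Y-%m-%d}.log"
```

For instance, expected_profile_name("/tmp/avis.txt", date(2005, 11, 16), suffix=".42") reproduces the profile name from the --ascii example.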
When using the --ascii option to the LOGON parse script or the [incr tsdb()] File | Import | Test Items command, processing will first construct the target profile from an ASCII input file, essentially a newline-separated list of test items (always using, in LOGON at least, UTF-8 encoding).
Following is an example ASCII input file comprising three test items:
Vi skal møte Ask på mandag.
Ta båt til Ortnevik.
Tar du båten til Ortnevik, kan du gå stien samme dagen.
Since LOGON runs on Unix, the input file needs Unix-style linebreaks, i.e. either create the file in a Unix environment or make sure each line ends in a bare ^J (linefeed), not in ^M (carriage return) or ^M^J.
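For files prepared on another platform, the conversion can be done before import. The following Python sketch is not part of LOGON; the function name to_unix_utf8 and the source_encoding parameter are hypothetical. It rewrites carriage-return based linebreaks as linefeeds and emits UTF-8.

```python
def to_unix_utf8(raw, source_encoding="utf-8"):
    """Return the test item data with LF-only linebreaks, encoded as UTF-8.

    raw is the file content as bytes; source_encoding is what it is
    currently encoded in (an assumption the caller must supply).
    """
    text = raw.decode(source_encoding)
    # Handle CR LF pairs first so they do not turn into two linefeeds.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Make sure the last test item is terminated by a linefeed.
    if text and not text.endswith("\n"):
        text += "\n"
    return text.encode("utf-8")
```

To convert a file, read it in binary mode, pass the bytes through to_unix_utf8, and write the result back in binary mode so that no further newline translation takes place.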