MatrixTDBProcedures
MatrixTDB is the regression test facility for the Grammar Matrix and the Matrix customization system. It allows us to create gold standard tsdb++ profiles on demand for language types defined in choices files.
There are three main things you might want to do with MatrixTDB, all of which amount to putting data in or getting data out: add new strings, add a language type, or extract a profile for a language type. These three high-level tasks break down into smaller sub-tasks. The breakdown into sub-tasks is displayed here, while the Detailed Processes section of this page breaks each of those down into step-by-step instructions.
- Add new strings
  - Create a source profile
  - Import the source profile
  - Add permutes
  - Run specific filters
- Add a language type
  - Import the language type (this task has just the one sub-task)
- Extract a profile for a language type
  - Generate a profile
  - Evaluate the profile using [incr_tsdb()]
This section provides step-by-step instructions on how to perform the various tasks and sub-tasks with MatrixTDB.
If you're not sure of the effect of what you are about to do, you may want to make a dump of the database so that the data can be quickly restored if what you do doesn't go the way you want. To do this:
- Run the following command:
  - $ mysqldump -h capuchin.ling.washington.edu -u username -p --result-file=resultfile dbname
- The arguments are as follows:
  - username - your username for the database
  - resultfile - the name of the file you want to dump the backup to
  - dbname - the name of the database on capuchin to back up. Currently MatrixTDB2 is the database we are using.

Note: This won't actually back up the data per se, but it will create a (very large) file full of SQL statements that can be used to restore the database to its state at the time of the dump.
If you want to revert the database to a previous point:
- Log in to the database
- Issue the following command:
  - mysql> source filename
  - where filename is the name of the file with the dump you want to revert to
Source profiles (sometimes also called 'original source profiles') are used to bring the big, hairy mrs semantics into the database. To create one:
- Create a flat file with one harvester string per line
- Start LKB and load the grammar you want to use to create the mrs semantics to import
- Start [incr_tsdb()] and process all the items in that file
Once you have a [incr_tsdb()] profile that was created by processing a flat file of items, you can use it to import a source profile into the database. To do so:
- Create a file that has each harvester string in your profile on a line preceded by its mrs tag and a '@'. E.g., wo1@n1 iv
- Run the following command:
  - $ python import_from_itsdb.py itsdbDir harvMrsFilename choicesFilename
- The arguments are as follows:
  - itsdbDir - the absolute path of your [incr_tsdb()] profile directory. Be sure to end it in a '/'
  - harvMrsFilename - the name of the file you created above with mrs tags and harvester strings
  - choicesFilename - the choices file of the grammar you used to create the profile
- The system will prompt for a username and password to the database.
- The system may ask you if the tags you're adding really are new or if you want to replace the existing tags with the new semantics you are importing. Answer appropriately. If the system indicates you are changing some semantics, make sure that is what you want to do.
- The system will also ask you for a description of the source profile. It can be up to 1000 characters.
- The system will import the profile and, if the choices file you used represents a language type not already in the database, will create a language type for that, too. It will return the osp_id and the language type ID, both of which you will need in order to add permutes and run specific filters.
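The harvester/mrs file in the first step is just plain text, one `tag@string` entry per line. A minimal sketch of producing such a file (the second tag/string pair is an invented placeholder, not real MatrixTDB data):

```python
# Sketch: write a harvester/mrs mapping file in the "tag@string" format
# that import_from_itsdb.py reads. The second pair below is a
# hypothetical placeholder for illustration only.
pairs = [
    ("wo1", "n1 iv"),       # example from this page
    ("wo2", "n1 n2 tv"),    # invented second entry
]

with open("harv_mrs", "w") as f:
    for tag, string in pairs:
        f.write(f"{tag}@{string}\n")
```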
At this point you will have imported a profile with its harvester strings. But a harvester string gives rise to potentially millions of other possible strings with the same semantics. Specifically, each harvester string gives rise to seed strings, which are then permuted and added to the database as strings to be run through specific filters. (Earlier versions of MatrixTDB added all permutations, which were run through universal filters and then specific filters, but more recently only those string/semantic tag combos that pass all universal filters are added to the database.) Seed strings are stored in a canonical form: words in alphabetical order, followed by prefixes in alphabetical order, followed by suffixes in alphabetical order. The permutations are then every possible permutation of the words, with every possible placement of the prefixes and suffixes on the words in each of those permutations. Seed strings are generated from harvester strings by the stringmods in stringmod.py. Here is how to generate all the permutations for an imported original source profile:
- Make sure stringmods is updated to meet your needs (optional)
- Create a condor file. A template named addPerms.cmd is in the repository.
  - change ospID to be the ID of the source profile you want to create permutes for
  - change username to be your username for the MatrixTDB database
  - change password to be your password for the MatrixTDB database
- Submit your command file to condor with the following line:
  - $ condor_submit addPerms.cmd
- The process may take many hours depending on how many strings you have and how long they are. Two ways to monitor the progress are to check the count of records in the result table or to check the .warning file from time to time.
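The combinatorics described above can be sketched as follows. This is only an illustration of why the counts explode, not the actual add_permutes.py code; the function name and hyphenated affix notation are invented for the example:

```python
from itertools import permutations, product

def permutes(words, prefixes=(), suffixes=()):
    """Illustrative sketch of the permutation scheme: every ordering of
    the words, with every assignment of each prefix and suffix to some
    word in that ordering. Not the real add_permutes.py algorithm."""
    results = set()
    for order in permutations(words):
        slots = range(len(order))
        # each affix independently chooses a word to attach to
        for pre_choice in product(slots, repeat=len(prefixes)):
            for suf_choice in product(slots, repeat=len(suffixes)):
                toks = list(order)
                for p, i in zip(prefixes, pre_choice):
                    toks[i] = p + "-" + toks[i]
                for s, i in zip(suffixes, suf_choice):
                    toks[i] = toks[i] + "-" + s
                results.add(" ".join(toks))
    return sorted(results)
```

Even a two-word seed string with a single suffix already yields four candidate strings (two word orders, two attachment sites each), which is why real profiles can reach millions of permutations.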
Matrix developers should update MatrixTDB as follows:
- Determine the harvester strings you'll need to illustrate your library.
- Determine the semantically neutral variants your library allows for each harvester string, at the level of bags of words. For example, the basic lexical library allows for case-marking adpositions. So p-nom can be added to any string with an overt subject to get a semantically equivalent string, provided p-nom is in the right place.
- Update customize/sql_profiles/stringmods.py to reflect the modifications.
- Create a harvester grammar to process your harvester strings with. Save the choices file from that grammar.
- Create a file listing the harvester strings and their mrs_tags (see customize/sql_profiles/harv_str/harv_mrs_1 for an example).
- Create a file with just the harvester strings.
- Start the LKB and tsdb++.
- Load the harvester grammar into the LKB.
- In tsdb++, go to File > Import > Test items to import the harvester strings.
- Make sure tsdb++ is set to write the mrs field.
- Process the items you imported with your grammar.
- The resulting profile will be your source_profile.
- Next, run customize/sql_profiles/import_from_itsdb.py with your source_profile, choices file, and harv_mrs file as arguments.
- Get the resulting osp_id.
- Then run customize/sql_profiles/add_permutes.py and give it the osp_id.
- Update the universal and specific filters in u_filters.py and s_filters.py.
- Run run_u_filters.py.
- Run the SQL query that separates the universally ungrammatical from the universally grammatical results.
- Run run_specific_filters.py.
At this point, MatrixTDB is up to date. We can also use import_from_itsdb.py to update the mrs semantics we want to have corresponding to particular mrs_tags.
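The universal and specific filters applied in the steps above are essentially predicates over candidate strings. A hypothetical filter in that spirit, for orientation only (the class name, interface, and regex are invented for illustration and do not match the actual u_filters.py/s_filters.py code):

```python
import re

class SketchFilter:
    """Hypothetical filter: rejects any candidate string whose word
    sequence matches a 'fail' pattern. Invented interface; see
    u_filters.py and s_filters.py for the real filter classes."""

    def __init__(self, name, fail_pattern):
        self.name = name
        self.fail_pattern = re.compile(fail_pattern)

    def passes(self, sentence):
        # a candidate passes if the forbidden pattern is absent
        return self.fail_pattern.search(sentence) is None

# e.g. a universal-style filter ruling out two negation markers
one_neg = SketchFilter("one-neg-only", r"\bneg\b.*\bneg\b")
```

A real run applies every universal filter to every permutation, keeps only the survivors, and then applies the language-type-specific filters to those.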
To export a profile corresponding to a given choices file:
- _FIX_ME_ instructions here.
TODO:
- Work out how to run filters recursively for coordination et al
- Update filters for coordination
- Update filters for inflection version of question particles