Handling Dependencies Automatically
Suppose a user requests SRL output for 10,000 documents. Consulting the Dependency Tree for Annotation Tools, we see that SRL requires the following tools, in order:
- Tokenizer
- POS
- Chunker
- Charniak parser
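As a rough illustration, that dependency tree can be thought of as a map from each annotation to the annotations it directly requires. This is only a sketch: the class name and annotation identifiers below are hypothetical, not the project's actual names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified view of the Dependency Tree for Annotation Tools:
// each annotation maps to the annotations it directly requires.
public class AnnotationDependencies {
    public static final Map<String, List<String>> DEPENDS_ON =
            new LinkedHashMap<String, List<String>>();
    static {
        DEPENDS_ON.put("TOKEN", Collections.<String>emptyList()); // Tokenizer
        DEPENDS_ON.put("POS",   Arrays.asList("TOKEN"));
        DEPENDS_ON.put("CHUNK", Arrays.asList("POS"));
        DEPENDS_ON.put("PARSE", Arrays.asList("CHUNK"));          // Charniak parser
        DEPENDS_ON.put("SRL",   Arrays.asList("PARSE"));
    }
}
```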
Our tool needs to do the following:
- Figure out which of the tools above to run, and in which order (trivial; see the sketch after this list)
- Ensure that we do not copy the tool results out of Hadoop until after SRL runs
- Ideally, signal the controller (script) outside the cluster when the run through one tool finishes, so we are ready for the next job and the next tool
- Between runs, shut down the running tools, but do not delete their directories
- Ideally, ensure that all of a given document's dependencies stay in the same HDFS block, for easy access later
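The first item, determining which tools to run and in what order, amounts to a walk over a dependency map like the one sketched above. A minimal sketch, again with hypothetical names rather than the project's real ones:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative post-order walk over a dependency map: returns every tool that
// must run before `target`, in a valid execution order.
public class DependencyOrder {
    public static List<String> toolsToRun(String target,
                                          Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<String>();
        collect(target, dependsOn, order);
        order.remove(order.size() - 1); // drop the target itself; its job runs last, separately
        return order;
    }

    private static void collect(String annotation,
                                Map<String, List<String>> dependsOn,
                                List<String> order) {
        for (String dep : dependsOn.get(annotation)) {
            if (!order.contains(dep)) {
                collect(dep, dependsOn, order);
            }
        }
        if (!order.contains(annotation)) {
            order.add(annotation);
        }
    }
}
```

For SRL, this yields the Tokenizer, POS, Chunker, and Charniak parser, in that order, matching the list above.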
This is now performed by the JobHandler. It takes an input directory and the desired annotation from the user, then:
- Samples the input directory to decide whether it contains new, unannotated raw text or serialized Records output by a previous MapReduce job.
- If it contains serialized Records, it takes a random sample of the input and determines which annotations those Records have in common.
- It gets the list of annotations that the user's requested annotation depends on.
- If the input was serialized Records, it removes from that list the annotations those Records already have in common.
- It starts a Hadoop job for each annotation remaining in the list, in order.
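A rough outline of that control flow is below. The class and helper names are assumptions for illustration, not the JobHandler's actual API; the abstract methods stand in for the real sampling, dependency, and job-launching code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical skeleton of the JobHandler flow described above.
public abstract class JobHandlerSketch {

    /** True if the sampled input is serialized Records from a previous job. */
    protected abstract boolean isSerializedRecords(String inputDir);

    /** Annotations that a random sample of the Records have in common. */
    protected abstract Set<String> commonAnnotations(String inputDir);

    /** Annotations the requested annotation depends on, in execution order. */
    protected abstract List<String> dependenciesOf(String annotation);

    /** Launches one Hadoop job; returns the HDFS directory holding its output. */
    protected abstract String runHadoopJob(String annotation, String inputDir);

    public void annotate(String inputDir, String requestedAnnotation) {
        // Everything the requested annotation depends on, in order...
        List<String> toRun = new ArrayList<String>(dependenciesOf(requestedAnnotation));

        // ...minus whatever annotations the already-serialized Records have in common.
        if (isSerializedRecords(inputDir)) {
            toRun.removeAll(commonAnnotations(inputDir));
        }

        // One Hadoop job per remaining annotation, each job's output feeding the next.
        String currentInput = inputDir;
        for (String annotation : toRun) {
            currentInput = runHadoopJob(annotation, currentInput);
        }
    }
}
```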