
Handling Dependencies Automatically


Suppose a user requests SRL output for 10,000 documents. Consulting the Dependency Tree for Annotation Tools, we see that SRL requires the following tools to have run first, in this order (see the sketch after the list):

  1. Tokenizer
  2. POS
  3. Chunker
  4. Charniak parser
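
Since the dependencies form a simple chain (each tool needs only the one before it), resolving the run order is trivial. Below is a minimal sketch in Java, assuming a hard-coded dependency map; the annotation names, the `DEPENDS_ON` map, and the `resolveDependencies` helper are illustrative, not the project's actual API.

```java
import java.util.*;

/** Illustrative dependency resolution for annotation tools (names are assumptions). */
public class AnnotationDependencies {
    // Each annotation maps to the annotation it directly depends on (null = no dependency).
    private static final Map<String, String> DEPENDS_ON = new HashMap<>();
    static {
        DEPENDS_ON.put("TOKEN", null);
        DEPENDS_ON.put("POS", "TOKEN");
        DEPENDS_ON.put("CHUNK", "POS");
        DEPENDS_ON.put("PARSE", "CHUNK");   // Charniak parser
        DEPENDS_ON.put("SRL", "PARSE");
    }

    /** Returns the full chain of prerequisites for the requested annotation, in run order. */
    public static List<String> resolveDependencies(String requested) {
        LinkedList<String> order = new LinkedList<>();
        for (String a = requested; a != null; a = DEPENDS_ON.get(a)) {
            order.addFirst(a);   // prerequisites go before the annotations that need them
        }
        return order;
    }

    public static void main(String[] args) {
        // Prints [TOKEN, POS, CHUNK, PARSE, SRL]
        System.out.println(resolveDependencies("SRL"));
    }
}
```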

Our tool needs to do the following:

  • Figure out which of the tools above to run, and in which order (trivial)
  • Ensure that we do not copy the tool results out of Hadoop until after SRL runs (see the sketch after this list)
    • Ideally, we indicate to the controller script outside the cluster when the run through each tool is finished, so we are ready for the next job and the next tool.
    • Between runs, we shut down the running tools, but we do not delete the directories.
    • Ideally, we ensure that all dependencies for a given document stay in the same HDFS block, so they are available for easy access later.
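
A rough sketch of such a controller loop is below. The helper methods (runAnnotationJob, signalStageComplete, shutDownTool, copyOutOfHadoop) are hypothetical stand-ins, not the project's real methods; the point is that intermediate results stay in HDFS between stages and only the final output is copied out.

```java
import java.util.List;

/**
 * Illustrative controller loop (hypothetical helper methods; not the project's actual code).
 * Intermediate results stay in HDFS between stages; only the final SRL output is copied out.
 */
public class AnnotationController {

    public void annotate(String hdfsInputDir, List<String> runOrder) throws Exception {
        String currentInput = hdfsInputDir;
        for (String annotation : runOrder) {
            // Each stage reads the previous stage's HDFS output and writes its own.
            String outputDir = hdfsInputDir + "_" + annotation.toLowerCase();
            runAnnotationJob(annotation, currentInput, outputDir);  // blocks until the Hadoop job finishes

            // Tell the controller script outside the cluster that this tool is done, then shut the
            // tool down -- but leave its directories and HDFS output in place for the next stage.
            signalStageComplete(annotation);
            shutDownTool(annotation);

            currentInput = outputDir;
        }
        // Only now, after the last annotation (e.g. SRL), copy results out of Hadoop.
        copyOutOfHadoop(currentInput);
    }

    // --- Hypothetical helpers; each would wrap the real Hadoop / tool-management calls. ---
    private void runAnnotationJob(String annotation, String in, String out) throws Exception { /* ... */ }
    private void signalStageComplete(String annotation) { /* ... */ }
    private void shutDownTool(String annotation) { /* ... */ }
    private void copyOutOfHadoop(String hdfsDir) { /* ... */ }
}
```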

Conclusion

All of this is now performed by the JobHandler. It takes an input directory from the user, along with the desired annotation, and then (see the sketch after this list):

  1. Samples the input directory to decide whether it contains new, unannotated raw text or the serialized Records output by a previous MapReduce job.
    • If it contains serialized Records, it takes a random sample of the input and determines which annotations those Records have in common.
  2. Gets the list of annotations that the user's requested annotation depends on.
    • If the input was serialized Records, it removes from that list the annotations the Records already have in common.
  3. Starts a Hadoop job for each annotation remaining in the list, in order.
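
A condensed sketch of that flow is below, reusing the illustrative resolveDependencies helper from the earlier sketch and assuming a hypothetical Record type with a getAnnotationKeys() accessor; none of these names come from the project's actual code.

```java
import java.util.*;

/** Illustrative JobHandler flow (hypothetical types and helpers throughout). */
public class JobHandlerSketch {

    public void handle(String inputDir, String requestedAnnotation) throws Exception {
        // 1. Sample the input directory: raw text, or serialized Records from a previous job?
        List<Record> sample = sampleInput(inputDir);
        Set<String> existing = new HashSet<>();
        if (!sample.isEmpty()) {
            // Annotations the sampled Records have in common (set intersection).
            existing.addAll(sample.get(0).getAnnotationKeys());
            for (Record r : sample) {
                existing.retainAll(r.getAnnotationKeys());
            }
        }

        // 2. Full dependency chain for the requested annotation, minus what is already present.
        List<String> toRun = AnnotationDependencies.resolveDependencies(requestedAnnotation);
        toRun.removeAll(existing);

        // 3. One Hadoop job per remaining annotation, in dependency order.
        String currentInput = inputDir;
        for (String annotation : toRun) {
            String outputDir = inputDir + "_" + annotation.toLowerCase();
            launchHadoopJob(annotation, currentInput, outputDir);
            currentInput = outputDir;
        }
    }

    // --- Hypothetical stand-ins for the real pieces. ---
    interface Record { Set<String> getAnnotationKeys(); }
    private List<Record> sampleInput(String inputDir) { return Collections.emptyList(); }
    private void launchHadoopJob(String annotation, String in, String out) throws Exception { /* ... */ }
}
```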