This repository was archived by the owner on Feb 1, 2022. It is now read-only.

Handling Dependencies Automatically

s3cur3 edited this page Jul 3, 2012 · 2 revisions

Suppose a user requests SRL output for 10,000 documents. Consulting the Dependency Tree for Annotation Tools, we see that SRL requires the following tools, in order:

  1. Tokenizer
  2. POS
  3. Chunker
  4. Charniak parser
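The run order above can be derived automatically from the dependency tree rather than hard-coded. Below is a minimal sketch of that resolution step, assuming a hand-written dependency map (the tool names and the `run_order` helper are illustrative, not part of the actual codebase):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical encoding of the Dependency Tree for Annotation Tools:
# each tool maps to the tools it directly requires.
DEPENDENCIES = {
    "tokenizer": [],
    "pos": ["tokenizer"],
    "chunker": ["pos"],
    "charniak": ["chunker"],
    "srl": ["charniak"],
}

def run_order(target):
    """Return every tool `target` needs (including itself), in execution order."""
    # Collect the transitive dependencies of the target.
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(DEPENDENCIES[tool])
    # Topologically sort just the needed subgraph.
    ts = TopologicalSorter({t: DEPENDENCIES[t] for t in needed})
    return list(ts.static_order())

print(run_order("srl"))
# → ['tokenizer', 'pos', 'chunker', 'charniak', 'srl']
```

Because the SRL chain is linear, the sort is deterministic; for tools with multiple prerequisites, any valid topological order works.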

Our tool needs to do the following:

  • Figure out which of the tools above to run, and in which order (trivial, since the dependency tree fixes the order)
  • Ensure that we do not copy the tool results out of Hadoop until after SRL runs
    • Ideally, we signal the controller (a script outside the cluster) when the run through each tool is finished, so it knows the cluster is ready for the next job and the next tool.
    • Between runs, we shut down the running tools, but we do not delete their output directories.
    • Ideally, we ensure that all dependencies for a given document stay in the same HDFS block, so they remain available for easy access later.
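The controller loop implied by these requirements could be sketched as follows. This is an assumption-laden illustration: the `hadoop jar curator.jar` invocation is a placeholder for whatever actually launches each tool's job, and `run_job` is injectable purely so the sequencing logic is testable.

```python
import subprocess

def run_pipeline(tools, in_dir, out_base, run_job=None):
    """Run each tool's Hadoop job in sequence, chaining output to input.

    run_job(tool, src, dst) launches one job and blocks until it exits;
    the default is an illustrative hadoop CLI call, not the project's
    actual invocation.
    """
    if run_job is None:
        run_job = lambda tool, src, dst: subprocess.run(
            ["hadoop", "jar", "curator.jar", tool, src, dst], check=True)
    prev, outputs = in_dir, []
    for tool in tools:
        dst = f"{out_base}/{tool}"
        run_job(tool, prev, dst)  # blocks: returning signals the controller
                                  # that the cluster is ready for the next tool
        # Intermediate directories are deliberately NOT deleted between runs.
        outputs.append(dst)
        prev = dst
    # Only after the last tool (SRL) would results be copied out of HDFS.
    return outputs
```

Each job's exit doubles as the "ready for the next tool" signal to the controller; the intermediate HDFS directories are left in place until the final SRL stage completes.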