
Handling Dependencies Automatically


Suppose a user requests SRL output for 10,000 documents. Consulting the Dependency Tree for Annotation Tools, we see that SRL requires the following tools to have run first, in this order (see the sketch after the list):

  1. Tokenizer
  2. POS
  3. Chunker
  4. Charniak parser
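
Since the dependencies form a simple chain (each tool needs only the one before it), resolving the run order is trivial. Below is a minimal sketch in Java, assuming a hard-coded dependency map; the annotation names, the `DEPENDS_ON` map, and the `resolveDependencies` helper are illustrative, not the project's actual API.

```java
import java.util.*;

/** Illustrative dependency resolution for annotation tools (names are assumptions). */
public class AnnotationDependencies {
    // Each annotation maps to the annotation it directly depends on (null = no dependency).
    private static final Map<String, String> DEPENDS_ON = new HashMap<>();
    static {
        DEPENDS_ON.put("TOKEN", null);
        DEPENDS_ON.put("POS", "TOKEN");
        DEPENDS_ON.put("CHUNK", "POS");
        DEPENDS_ON.put("PARSE", "CHUNK");   // Charniak parser
        DEPENDS_ON.put("SRL", "PARSE");
    }

    /** Returns the full chain of prerequisites for the requested annotation, in run order. */
    public static List<String> resolveDependencies(String requested) {
        LinkedList<String> order = new LinkedList<>();
        for (String a = requested; a != null; a = DEPENDS_ON.get(a)) {
            order.addFirst(a);   // prerequisites go before the annotations that need them
        }
        return order;
    }

    public static void main(String[] args) {
        // Prints [TOKEN, POS, CHUNK, PARSE, SRL]
        System.out.println(resolveDependencies("SRL"));
    }
}
```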

Our tool needs to do the following:

  • Figure out which of the tools above to run, and in which order (trivial)
  • Ensure that we do not copy the tool results out of Hadoop until after SRL runs (see the sketch after this list)
    • Ideally, we indicate to the controller script outside the cluster when the run through each tool is finished, so we are ready for the next job and the next tool.
    • Between runs, we shut down the running tools, but we do not delete the directories.
    • Ideally, we ensure that all dependencies for a given document stay in the same HDFS block, so they are available for easy access later.
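
A rough sketch of such a controller loop is below. The helper methods (runAnnotationJob, signalStageComplete, shutDownTool, copyOutOfHadoop) are hypothetical stand-ins, not the project's real methods; the point is that intermediate results stay in HDFS between stages and only the final output is copied out.

```java
import java.util.List;

/**
 * Illustrative controller loop (hypothetical helper methods; not the project's actual code).
 * Intermediate results stay in HDFS between stages; only the final SRL output is copied out.
 */
public class AnnotationController {

    public void annotate(String hdfsInputDir, List<String> runOrder) throws Exception {
        String currentInput = hdfsInputDir;
        for (String annotation : runOrder) {
            // Each stage reads the previous stage's HDFS output and writes its own.
            String outputDir = hdfsInputDir + "_" + annotation.toLowerCase();
            runAnnotationJob(annotation, currentInput, outputDir);  // blocks until the Hadoop job finishes

            // Tell the controller script outside the cluster that this tool is done, then shut the
            // tool down -- but leave its directories and HDFS output in place for the next stage.
            signalStageComplete(annotation);
            shutDownTool(annotation);

            currentInput = outputDir;
        }
        // Only now, after the last annotation (e.g. SRL), copy results out of Hadoop.
        copyOutOfHadoop(currentInput);
    }

    // --- Hypothetical helpers; each would wrap the real Hadoop / tool-management calls. ---
    private void runAnnotationJob(String annotation, String in, String out) throws Exception { /* ... */ }
    private void signalStageComplete(String annotation) { /* ... */ }
    private void shutDownTool(String annotation) { /* ... */ }
    private void copyOutOfHadoop(String hdfsDir) { /* ... */ }
}
```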

Conclusion

All of this is now performed by the JobHandler. It takes an input directory from the user, along with the desired annotation, and then (see the sketch after this list):

  1. Samples the input directory to decide whether it contains new, unannotated raw text or the serialized Records output by a previous MapReduce job.
    • If it contains serialized Records, it takes a random sample of the input and determines which annotations those Records have in common.
  2. Gets the list of annotations that the user's requested annotation depends on.
    • If the input was serialized Records, it removes from that list the annotations the Records already have in common.
  3. Starts a Hadoop job for each annotation remaining in the list, in order.
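
A condensed sketch of that flow is below, reusing the illustrative resolveDependencies helper from the earlier sketch and assuming a hypothetical Record type with a getAnnotationKeys() accessor; none of these names come from the project's actual code.

```java
import java.util.*;

/** Illustrative JobHandler flow (hypothetical types and helpers throughout). */
public class JobHandlerSketch {

    public void handle(String inputDir, String requestedAnnotation) throws Exception {
        // 1. Sample the input directory: raw text, or serialized Records from a previous job?
        List<Record> sample = sampleInput(inputDir);
        Set<String> existing = new HashSet<>();
        if (!sample.isEmpty()) {
            // Annotations the sampled Records have in common (set intersection).
            existing.addAll(sample.get(0).getAnnotationKeys());
            for (Record r : sample) {
                existing.retainAll(r.getAnnotationKeys());
            }
        }

        // 2. Full dependency chain for the requested annotation, minus what is already present.
        List<String> toRun = AnnotationDependencies.resolveDependencies(requestedAnnotation);
        toRun.removeAll(existing);

        // 3. One Hadoop job per remaining annotation, in dependency order.
        String currentInput = inputDir;
        for (String annotation : toRun) {
            String outputDir = inputDir + "_" + annotation.toLowerCase();
            launchHadoopJob(annotation, currentInput, outputDir);
            currentInput = outputDir;
        }
    }

    // --- Hypothetical stand-ins for the real pieces. ---
    interface Record { Set<String> getAnnotationKeys(); }
    private List<Record> sampleInput(String inputDir) { return Collections.emptyList(); }
    private void launchHadoopJob(String annotation, String in, String out) throws Exception { /* ... */ }
}
```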