
Interface Design


To send a batch job (i.e., a large set of documents, each of which will be run through a single tool) to the Hadoop cluster, the default, as-is Curator (the "Master Curator") will make a command-line call of the following form: ./hadoop jar CuratorHadoopInterface.jar <location_of_documents_in_hdfs>.
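For example, assuming the documents for a job were copied to a hypothetical HDFS directory such as /user/curator/job_1 (the path is an illustrative assumption, not fixed by the interface), the call might look like this:

    ./hadoop jar CuratorHadoopInterface.jar /user/curator/job_1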

In order for this to work, a few things need to be in place first:

  • The document collection and corresponding annotations must have been transferred to the Hadoop Distributed File System (HDFS), typically via a command-line call like this: ./hadoop dfs -copyFromLocal <location_of_docs_on_local_machine> <destination_in_hdfs> (a complete example appears after this list).
    • The directory being copied in must have the following structure:
      • <Top-Level Directory>/
        • <Document Hash/ID>/
          • <annotation type>.txt
          • . . .
          • <annotation type>.txt
    • For instance:
      • job_1/
        • 0956d2fbd5d5c29844a4d21ed2f76e0c/
          • srl.txt
          • chunking.txt
          • ner.txt
  • The document collection copied into HDFS must include all annotations required by the tool you want to run; Hadoop will not calculate dependencies and obtain them for you automatically. Instead, it will simply skip any document for which the required annotations have not been provided.
  • Each Hadoop node must run a special Curator instance that relies only on a local copy of the annotation tool being used. For instance, if you want NER performed on the document collection, each Hadoop node would run one copy of the Curator and one copy of the NER tool.
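
As a concrete sketch of the copy step above, suppose the job_1 directory from the structure example sits on the local machine and should land under /user/curator in HDFS (both paths are hypothetical; substitute your own):

    # Copy the prepared job directory into HDFS. The local source path and
    # the HDFS destination below are illustrative assumptions only.
    ./hadoop dfs -copyFromLocal /home/user/job_1 /user/curator/job_1

    # Optionally verify that the per-document directories and their
    # annotation files arrived intact.
    ./hadoop dfs -ls /user/curator/job_1
    ./hadoop dfs -ls /user/curator/job_1/0956d2fbd5d5c29844a4d21ed2f76e0c

Once the copy has completed, the batch job can be launched with the ./hadoop jar call shown at the top of this page.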