
Interface Design


To send a batch job (i.e., a large set of documents, each of which will be run through a single tool) to the Hadoop cluster, the default, as-is Curator (the "Master Curator") will make a command-line call of the following form: ./hadoop jar CuratorHadoopInterface.jar <location_of_documents_in_hdfs>.
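For example, assuming the documents for a job were copied to a hypothetical HDFS directory such as /user/curator/job_1 (the path is an illustrative assumption, not fixed by the interface), the call might look like this:

    ./hadoop jar CuratorHadoopInterface.jar /user/curator/job_1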

In order for this to work, a few things need to be in place first:

  • The document collection and corresponding annotations must have been transferred to the Hadoop Distributed File System (HDFS), typically via a command-line call like this: ./hadoop dfs -copyFromLocal <location_of_docs_on_local_machine> <destination_in_hdfs> (a complete example appears after this list).
    • The directory being copied in must have the following structure:
      • <Top-Level Directory>/
        • <Document Hash/ID>/
          • <annotation type>.txt
          • . . .
          • <annotation type>.txt
    • For instance:
      • job_1/
        • 0956d2fbd5d5c29844a4d21ed2f76e0c/
          • srl.txt
          • chunking.txt
          • ner.txt
  • The document collection copied into HDFS must include all annotations required by the tool you want to run; Hadoop will not calculate dependencies and obtain them for you automatically. Instead, it will simply skip any document for which the required annotations have not been provided.
  • Each Hadoop node must run a special Curator instance that relies only on a local copy of the annotation tool being used. For instance, if you want NER performed on the document collection, each Hadoop node would run one copy of the Curator and one copy of the NER tool.
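
As a concrete sketch of the copy step above, suppose the job_1 directory from the structure example sits on the local machine and should land under /user/curator in HDFS (both paths are hypothetical; substitute your own):

    # Copy the prepared job directory into HDFS. The local source path and
    # the HDFS destination below are illustrative assumptions only.
    ./hadoop dfs -copyFromLocal /home/user/job_1 /user/curator/job_1

    # Optionally verify that the per-document directories and their
    # annotation files arrived intact.
    ./hadoop dfs -ls /user/curator/job_1
    ./hadoop dfs -ls /user/curator/job_1/0956d2fbd5d5c29844a4d21ed2f76e0c

Once the copy has completed, the batch job can be launched with the ./hadoop jar call shown at the top of this page.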