
Interface Design


To send a batch job (i.e., a large set of documents, each to be run through a single tool) to the Hadoop cluster, the default, as-is Curator (the "Master Curator") will make a command-line call of the following form: ./hadoop jar CuratorHadoopInterface.jar <location_of_documents_in_hdfs>.
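
For instance, if the document collection had already been copied to an HDFS path such as /user/curator/job_1 (a hypothetical path used here only for illustration), the Master Curator's call would look like:

    ./hadoop jar CuratorHadoopInterface.jar /user/curator/job_1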

In order for this to work, a few things need to be in place first:

  • The document collection must have been transferred to the Hadoop Distributed File System (HDFS), probably through a command-line call like this: ./hadoop dfs -copyFromLocal <location_of_docs_on_local_machine> <destination_in_hdfs> (see the sketch after this list).
    • The directory being copied in must have the following structure:
      • <Top-Level_Directory>/
        • <Document Hash/ID>/
          • <annotation_type>.txt
          • ...
          • <annotation_type>.txt
    • For instance:
      • job_1/
        • 0956d2fbd5d5c29844a4d21ed2f76e0c/
          • srl.txt
          • chunking.txt
          • ner.txt
  • Each Hadoop node must have a special Curator instance, which relies only on a local instance of the annotation tool.
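
As a concrete sketch of the copy step above, using the job_1 example: the local directory layout matches the structure shown, while the HDFS destination /user/curator/job_1 is an assumption made for illustration.

    # Assumed local layout (hash and annotation files taken from the example above):
    #   job_1/0956d2fbd5d5c29844a4d21ed2f76e0c/srl.txt
    #   job_1/0956d2fbd5d5c29844a4d21ed2f76e0c/chunking.txt
    #   job_1/0956d2fbd5d5c29844a4d21ed2f76e0c/ner.txt

    # Copy the whole collection into HDFS; the destination path is hypothetical
    ./hadoop dfs -copyFromLocal job_1 /user/curator/job_1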