This repository was archived by the owner on Feb 1, 2022. It is now read-only.
Interface Design
s3cur3 edited this page Aug 5, 2012 · 10 revisions
Note that this page is woefully out of date. For one thing, we now work with truly serialized records, so there is only ever a single text file per document ([document hash].txt). For information on how to use the program, see either Building and Running or Project Overview.
To send a batch job (i.e., a large set of documents, each to be run through a single tool) to the Hadoop cluster, the master job script will make a call similar to this:
./hadoop jar CuratorHadoopInterface.jar <location_of_documents_in_hdfs> <mode>
where mode is a tool type: NER, SRL, chunking, etc.
In order for this to work, a few things need to be in place first:
- The document collection and corresponding annotations must have been transferred to the Hadoop Distributed File System (HDFS), probably through a command-line call like this (**):
./hadoop dfs -copyFromLocal <location_of_docs_on_local_machine> <destination_in_hdfs>
- The document collection copied into HDFS must include all required annotations for the tool you want to use. Hadoop will not calculate dependencies and fetch them for you automatically; it will simply pass over any document for which the required annotations have not been provided. Eventually, the master script will handle linking batch annotation jobs when many documents lack the dependencies for the tool you requested. For details on this, see the page on Handling Dependencies Automatically.
- Each Hadoop node must have a special Curator instance, which relies only on a local instance of the annotation tool being used. For instance, if you want NER performed on the document collection, each Hadoop node would have one copy of Curator and one copy of the NER tool running.
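The two commands above have to run in a fixed order: copy the collection into HDFS, then submit the job against the copied location. The following is a minimal Python sketch of that ordering, not the project's actual master script. The function names and the example HDFS path are hypothetical; the `./hadoop` launcher path, the jar name, and the argument order are taken from the commands shown on this page. It only builds and prints the argument lists and does not contact a cluster.

```python
import shlex

# Launcher path and jar name as they appear in the commands on this page;
# adjust both for your own installation.
HADOOP = "./hadoop"
JOB_JAR = "CuratorHadoopInterface.jar"


def copy_command(local_docs, hdfs_dest):
    """Argument list for copying the document collection into HDFS."""
    return [HADOOP, "dfs", "-copyFromLocal", local_docs, hdfs_dest]


def submit_command(hdfs_docs, mode):
    """Argument list for launching the batch job for one tool type."""
    return [HADOOP, "jar", JOB_JAR, hdfs_docs, mode]


def plan_batch_job(local_docs, hdfs_dest, mode):
    """Return both commands in the order they must run: copy, then submit."""
    return [copy_command(local_docs, hdfs_dest), submit_command(hdfs_dest, mode)]


if __name__ == "__main__":
    # "/user/curator/job_1" is a made-up HDFS destination for illustration.
    for argv in plan_batch_job("job_1", "/user/curator/job_1", "NER"):
        print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string means it can later be handed to `subprocess.run` without shell quoting problems, if you choose to execute it.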
** Note that the directory being copied in must have the following structure:
- <Top-Level Directory (Job name)>/
  - <Document Hash/ID>/
    - original.txt
    - <annotation type in same form as AnnotationMode>.txt
    - . . .
    - <annotation type in same form as AnnotationMode>.txt
  - <Document Hash/ID>/
For instance:
- job_1/
  - 0956d2fbd5d5c29844a4d21ed2f76e0c/
    - original.txt
    - SRL.txt
    - CHUNK.txt
    - NER.txt
  - 0956d2fbd5d5c29844a4d21ed2f76e0c/
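Because Hadoop passes over documents that lack the required annotations, it can help to check a job directory locally before copying it into HDFS. The sketch below assumes the per-document layout described above (one directory per document, one `<AnnotationMode>.txt` file per annotation). The function names are hypothetical, and the list of required annotation types must be supplied by the caller: this page does not say which annotations each tool depends on.

```python
import tempfile
from pathlib import Path


def annotation_file(mode):
    """File name for an annotation type, e.g. 'NER' -> 'NER.txt'."""
    return f"{mode}.txt"


def find_skipped_documents(job_dir, required_modes):
    """Return the IDs of documents missing any required annotation file.

    required_modes comes from the caller; this sketch does not know which
    annotations a given tool actually depends on.
    """
    needed = [annotation_file(mode) for mode in required_modes]
    skipped = []
    for doc_dir in sorted(Path(job_dir).iterdir()):
        if not doc_dir.is_dir():
            continue  # only per-document directories are expected at this level
        if any(not (doc_dir / name).is_file() for name in needed):
            skipped.append(doc_dir.name)
    return skipped


if __name__ == "__main__":
    # Build a throwaway job with two made-up document IDs for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        job = Path(tmp) / "job_1"
        layout = {"doc_a": ["original.txt", "NER.txt"], "doc_b": ["original.txt"]}
        for doc_id, files in layout.items():
            (job / doc_id).mkdir(parents=True)
            for name in files:
                (job / doc_id / name).write_text("")
        print(find_skipped_documents(job, ["NER"]))  # prints ['doc_b']
```

Any document ID returned here would need its missing annotations generated before the requested tool can process it.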