Interface Design
To send a batch job (i.e., a large set of documents, each to be run through a single tool) to the Hadoop cluster, the default, as-is Curator (the "Master Curator") will make a command-line call of the following form:

```
./hadoop jar CuratorHadoopInterface.jar <location_of_documents_in_hdfs>
```
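For example, assuming the document collection has already been copied to a hypothetical HDFS directory `/user/curator/job_1` (the path is illustrative only), the call might look like this:

```
./hadoop jar CuratorHadoopInterface.jar /user/curator/job_1
```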
In order for this to work, a few things need to be in place first:
- The document collection must have been transferred to the Hadoop Distributed File System (HDFS), probably through a command-line call like this (a worked example appears after this list):

  ```
  ./hadoop dfs -copyFromLocal <location_of_docs_on_local_machine> <destination_in_hdfs>
  ```
- The directory that is being copied in must have the following structure:

  ```
  <Top-Level_Directory>/
      <Document Hash/ID>/
          <annotation type>.txt
          . . .
          <annotation type>.txt
      <Document Hash/ID>/
          . . .
  ```
  For instance:

  ```
  job_1/
      0956d2fbd5d5c29844a4d21ed2f76e0c/
          srl.txt
          chunking.txt
          ner.txt
      . . .
  ```
- Each Hadoop node must have a special Curator instance, which relies only on a local instance of the annotation tool.
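Putting the first two requirements together, here is a minimal sketch of the copy-in step. It assumes a local document collection already laid out in the structure above; the local path `~/docs/job_1` and the HDFS destination `/user/curator/job_1` are hypothetical, chosen only for illustration:

```
# Inspect the local layout before copying (hypothetical path).
# Expected: job_1/<document hash>/<annotation type>.txt
ls ~/docs/job_1/*/

# Copy the whole job directory into HDFS, preserving the layout.
./hadoop dfs -copyFromLocal ~/docs/job_1 /user/curator/job_1

# Verify that the document directories arrived.
./hadoop dfs -ls /user/curator/job_1
```

The HDFS destination directory would then serve as the `<location_of_documents_in_hdfs>` argument passed to CuratorHadoopInterface.jar.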