Curator Modifications
- Create Master mode with the following responsibilities:
- Sets up document collection in Hadoop Distributed File System (HDFS).
- Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
- Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
- Waits for error messages from the annotation tools, and logs them in a user-actionable way.
- Create local mode with the following responsibilities:
- Interfaces with exactly one annotation tool, as specified by the Master Curator.
- Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
- Logs errors from the annotation tools in a user-actionable way.
Here's what the Master Curator needs to do, along with thoughts on how to do it.
First, a shell script does the following:
- Launch the Curator
  - Specify that configuration comes from `curator.hadoop.master.properties` (for example)
- Decide what tool will be run on all documents
  - Where is this specified?
- Figure out (parse?) what documents and annotations will be sent to Hadoop
  - Where does this input come from?
- Write a serialized form of all those records to the disk.
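Since Curator Records are Thrift objects, one way the Master Curator could write that serialized form is with Thrift's `TSerializer`. The sketch below is only an illustration of that idea, not existing Master Curator code; the `writeRecords()` helper, the file naming, and the directory layout are assumptions:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordSerializer {
    // Writes each Record to <outputDir>/<docId>.record as raw Thrift bytes.
    // (Hypothetical helper; the file naming and layout are assumptions, not Curator's API.)
    public static void writeRecords(Map<String, Record> records, String outputDir)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        for (Map.Entry<String, Record> entry : records.entrySet()) {
            byte[] bytes = serializer.serialize(entry.getValue());
            try (FileOutputStream out =
                    new FileOutputStream(outputDir + "/" + entry.getKey() + ".record")) {
                out.write(bytes);
            }
        }
    }
}
```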
At this point, a shell script initiates an scp
to transfer all those serialized records to the Hadoop cluster. A script launches the Hadoop job on that document collection. Later, after the Hadoop job has finished, a script transfers all those serialized records back out from Hadoop to the local machine. It then launches Curator again and does the following:
- Re-construct Record objects from the ones on the disk.
- Write them to the database, give them to the user, etc.
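Reading the serialized Records back in is the mirror image; a minimal sketch using Thrift's `TDeserializer`, under the same assumptions about file layout as the serialization sketch above:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordDeserializer {
    // Reads one serialized Record back from disk and rebuilds the Thrift object.
    public static Record readRecord(File recordFile) throws TException, IOException {
        byte[] bytes = Files.readAllBytes(recordFile.toPath());
        Record record = new Record();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, bytes);
        return record;
    }
}
```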
Somewhere in there, error logs need to be handled and made available to the user.
At this point, it appears the local Curator will be completely unchanged. Woohoo!
Here's what will happen, from its point of view:
- (A MapReduce job sees that the local Curator is running on "this" node; if not, it will launch said local Curator.)
- The local Curator's config file will tell it that there is only 1 annotation tool running.
- At some point, a Curator Client calls the local Curator's `performAnnotation()` (found in CuratorHandler; this method will trust that the Record is providing the right dependencies). The local Curator will respond by connecting to the annotation tool and running it on the Record.
- Steps 1-3 will be repeated until the user is done running jobs on the Hadoop interface.
- Finally, the local Curator is shut down via an external shell script.
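The Curator Client in the steps above would presumably reach the local Curator over Thrift, the same way any Curator client does. A minimal connection sketch, assuming the standard framed Thrift transport; the host and port are placeholders that would really come from the node's config file:

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;

public class LocalCuratorConnection {
    // Opens a Thrift connection to the Curator running on this node.
    // Host and port are placeholders; in practice they come from the
    // local-mode Curator's configuration.
    public static Curator.Client connect(String host, int port) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket(host, port));
        transport.open();
        return new Curator.Client(new TBinaryProtocol(transport));
    }
}
```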
It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (bold denotes things the Overarching shell script does):
- Overarching shell script launched. This will collect a list of the documents to be annotated.
- Ask the Master Curator to serialize records for that big list of documents and place them in some directory.
- Copy that serialized-record document directory over to HDFS. (HDFS gets the serialized forms of all the Records to be annotated.)
- Launch a MapReduce job on the Hadoop cluster.
- `reduce()` job checks annotation tool, Curator, and client to make sure they're running; if not, it launches them.
  - Probably by running a shell script. Do this in Java like this: `Runtime.getRuntime().exec(myShellScript);` (Source).
  - Curator must be launched in "no cache" mode (i.e., force it to trust the annotation prerequisites provided in the Record and thus not make a database call).
    - This appears to already be done. Just call CuratorHandler's `performAnnotation()` method with an already-filled Record.
- `reduce()` constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
- `reduce()` calls `client.provide()` to get the requested annotation for our Record. Pass it the input text and the Record. (See the `reduce()` sketch after this list.)
- MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
- Copy the data back from HDFS.
- Call a new (Master) Curator to read the local, serialized Records (complete with the new annotations) back into the database.
- Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.
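Pulling the reduce-side steps together, here is a rough sketch of what the `reduce()` body might look like. It assumes Hadoop's mapreduce API and Curator's Thrift-generated `Curator.Client`; the script path, host, port, and view name are placeholders, and the exact `provide()`/`performAnnotation()` signatures should be checked against the actual Curator code:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class CuratorReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text docId, Iterable<BytesWritable> serializedRecords, Context context)
            throws IOException, InterruptedException {
        try {
            // 1. Make sure the local Curator and its annotation tool are up.
            //    (Hypothetical script name; the check/launch logic lives in the script.)
            Runtime.getRuntime().exec("/path/to/check_and_launch_curator.sh").waitFor();

            // 2. Connect to the locally running Curator over Thrift.
            //    Host and port are assumptions; they would come from the node's config.
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9090));
            transport.open();
            Curator.Client client = new Curator.Client(new TBinaryProtocol(transport));

            TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
            TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());

            for (BytesWritable value : serializedRecords) {
                // 3. Rebuild the Curator-friendly Record from its serialized form.
                Record record = new Record();
                deserializer.deserialize(record,
                        Arrays.copyOf(value.getBytes(), value.getLength()));

                // 4. Ask the Curator for the requested annotation ("ner" is a placeholder
                //    view name; forceUpdate = true so no cache/database lookup is made).
                Record annotated = client.provide("ner", record.getRawText(), true);

                // 5. Emit the serialized, annotated Record; the job's output lands in HDFS.
                context.write(docId, new BytesWritable(serializer.serialize(annotated)));
            }
            transport.close();
        } catch (Exception e) {
            // Log failures in a user-actionable way (placeholder: rethrow with the doc id).
            throw new IOException("Annotation failed for document " + docId, e);
        }
    }
}
```

Note that `provide()` in this sketch is handed the raw text rather than the whole pre-filled Record; if the dependencies already stored in the Record need to be passed along explicitly, that would presumably go through the CuratorHandler's `performAnnotation()` path instead.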
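For the copy-to-HDFS and copy-back-from-HDFS steps, the overarching shell script could simply shell out to `hadoop fs -put` and `hadoop fs -get`; the same can be done programmatically through Hadoop's `FileSystem` API, as in this sketch (all paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRecordCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Push the locally serialized Records into HDFS before the job runs...
        fs.copyFromLocalFile(new Path("/tmp/serialized_records"),
                             new Path("/user/curator/input"));

        // ...and pull the annotated Records back out once the job has finished.
        fs.copyToLocalFile(new Path("/user/curator/output"),
                           new Path("/tmp/annotated_records"));

        fs.close();
    }
}
```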
- Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
- Figure out how to send a job (programmatically) to the annotation tool
- Figure out where to modify the local-mode Curator code to check for the input directory (described in Step 4 above)
- Figure out how output is returned (programmatically) from annotation tools