Curator Modifications
- Create Master mode with the following responsibilities:
- Sets up document collection in Hadoop Distributed File System (HDFS).
- Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
- Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
- Waits for error messages from the annotation tools, and logs them in a user-actionable way.
- Create local mode with the following responsibilities:
- Interfaces with exactly one annotation tool, as specified by the Master Curator.
- Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
- Logs errors from the annotation tools in a user-actionable way.
Here's what the Master Curator needs to do, along with thoughts on how to do it.
First, a shell script does the following:
- Launch the Curator
  - Specify that configuration comes from `curator.hadoop.master.properties` (for example)
- Decide what tool will be run on all documents
  - Where is this specified?
- Figure out (parse?) what documents and annotations will be sent to Hadoop
  - Where does this input come from?
- Write a serialized form of all those records to the disk.
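Since Curator Records are Thrift objects, one way the Master Curator could write that serialized form is with Thrift's `TSerializer`. The sketch below is only an illustration of that idea, not existing Master Curator code; the `writeRecords()` helper, the file naming, and the directory layout are assumptions:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordSerializer {
    // Writes each Record to <outputDir>/<docId>.record as raw Thrift bytes.
    // (Hypothetical helper; the file naming and layout are assumptions, not Curator's API.)
    public static void writeRecords(Map<String, Record> records, String outputDir)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        for (Map.Entry<String, Record> entry : records.entrySet()) {
            byte[] bytes = serializer.serialize(entry.getValue());
            try (FileOutputStream out =
                    new FileOutputStream(outputDir + "/" + entry.getKey() + ".record")) {
                out.write(bytes);
            }
        }
    }
}
```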
At this point, a shell script initiates an scp
to transfer all those serialized records to the Hadoop cluster. A script launches the Hadoop job on that document collection. Later, after the Hadoop job has finished, a script transfers all those serialized records back out from Hadoop to the local machine. It then launches Curator again and does the following:
- Re-construct Record objects from the ones on the disk.
- Write them to the database, give them to the user, etc.
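Reading the serialized Records back in is the mirror image; a minimal sketch using Thrift's `TDeserializer`, under the same assumptions about file layout as the serialization sketch above:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordDeserializer {
    // Reads one serialized Record back from disk and rebuilds the Thrift object.
    public static Record readRecord(File recordFile) throws TException, IOException {
        byte[] bytes = Files.readAllBytes(recordFile.toPath());
        Record record = new Record();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, bytes);
        return record;
    }
}
```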
Somewhere in there, error logs need to be handled and made available to the user.
At this point, it appears the local Curator will be completely unchanged. Woohoo!
Here's what will happen, from its point of view:
- (A MapReduce job sees that the local Curator is running on "this" node; if not, it will launch said local Curator.)
- The local Curator's config file will tell it that there is only 1 annotation tool running.
- At some point, a Curator Client calls the local Curator's `performAnnotation()` (found in CuratorHandler; this method will trust that the Record is providing the right dependencies). The local Curator will respond by connecting to the annotation tool and running it on the Record.
- Steps 1-3 will be repeated until the user is done running jobs on the Hadoop interface.
- Finally, the local Curator is shut down via an external shell script.
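The Curator Client in the steps above would presumably reach the local Curator over Thrift, the same way any Curator client does. A minimal connection sketch, assuming the standard framed Thrift transport; the host and port are placeholders that would really come from the node's config file:

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;

public class LocalCuratorConnection {
    // Opens a Thrift connection to the Curator running on this node.
    // Host and port are placeholders; in practice they come from the
    // local-mode Curator's configuration.
    public static Curator.Client connect(String host, int port) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket(host, port));
        transport.open();
        return new Curator.Client(new TBinaryProtocol(transport));
    }
}
```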
It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (bold denotes things the Overarching shell script does):
- Overarching shell script launched. This will collect a list of the documents to be annotated.
- Ask the Master Curator to serialize records for that big list of documents and place them in some directory.
- Copy that serialized-record document directory over to HDFS. (HDFS gets the serialized forms of all the Records to be annotated.)
- Launch a MapReduce job on the Hadoop cluster.
- `reduce()` job checks annotation tool, Curator, and client to make sure they're running; if not, it launches them.
  - Probably by running a shell script. Do this in Java like this: `Runtime.getRuntime().exec(myShellScript);` (Source).
  - Curator must be launched in "no cache" mode (i.e., force it to trust the annotation prerequisites provided in the Record and thus not make a database call).
    - This appears to already be done. Just call CuratorHandler's `performAnnotation()` method with an already-filled Record.
- `reduce()` constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
- `reduce()` calls `client.provide()` to get the requested annotation for our Record. Pass it the input text and the Record. (See the `reduce()` sketch after this list.)
- MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
- Copy the data back from HDFS.
- Call a new (Master) Curator to read the local, serialized Records (complete with the new annotations) back into the database.
- Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.
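Pulling the reduce-side steps together, here is a rough sketch of what the `reduce()` body might look like. It assumes Hadoop's mapreduce API and Curator's Thrift-generated `Curator.Client`; the script path, host, port, and view name are placeholders, and the exact `provide()`/`performAnnotation()` signatures should be checked against the actual Curator code:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class CuratorReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text docId, Iterable<BytesWritable> serializedRecords, Context context)
            throws IOException, InterruptedException {
        try {
            // 1. Make sure the local Curator and its annotation tool are up.
            //    (Hypothetical script name; the check/launch logic lives in the script.)
            Runtime.getRuntime().exec("/path/to/check_and_launch_curator.sh").waitFor();

            // 2. Connect to the locally running Curator over Thrift.
            //    Host and port are assumptions; they would come from the node's config.
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9090));
            transport.open();
            Curator.Client client = new Curator.Client(new TBinaryProtocol(transport));

            TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
            TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());

            for (BytesWritable value : serializedRecords) {
                // 3. Rebuild the Curator-friendly Record from its serialized form.
                Record record = new Record();
                deserializer.deserialize(record,
                        Arrays.copyOf(value.getBytes(), value.getLength()));

                // 4. Ask the Curator for the requested annotation ("ner" is a placeholder
                //    view name; forceUpdate = true so no cache/database lookup is made).
                Record annotated = client.provide("ner", record.getRawText(), true);

                // 5. Emit the serialized, annotated Record; the job's output lands in HDFS.
                context.write(docId, new BytesWritable(serializer.serialize(annotated)));
            }
            transport.close();
        } catch (Exception e) {
            // Log failures in a user-actionable way (placeholder: rethrow with the doc id).
            throw new IOException("Annotation failed for document " + docId, e);
        }
    }
}
```

Note that `provide()` in this sketch is handed the raw text rather than the whole pre-filled Record; if the dependencies already stored in the Record need to be passed along explicitly, that would presumably go through the CuratorHandler's `performAnnotation()` path instead.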
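For the copy-to-HDFS and copy-back-from-HDFS steps, the overarching shell script could simply shell out to `hadoop fs -put` and `hadoop fs -get`; the same can be done programmatically through Hadoop's `FileSystem` API, as in this sketch (all paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRecordCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Push the locally serialized Records into HDFS before the job runs...
        fs.copyFromLocalFile(new Path("/tmp/serialized_records"),
                             new Path("/user/curator/input"));

        // ...and pull the annotated Records back out once the job has finished.
        fs.copyToLocalFile(new Path("/user/curator/output"),
                           new Path("/tmp/annotated_records"));

        fs.close();
    }
}
```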
- Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
- Figure out how to send a job (programmatically) to the annotation tool
- Figure out where to modify the local-mode Curator code to check for the input directory (described in Step 4 above)
- Figure out how output is returned (programmatically) from annotation tools