Curator Modifications
- Create Master mode with the following responsibilities:
- Sets up the document collection in the Hadoop Distributed File System (HDFS) (see the sketch after this list).
- Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
- Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
- Waits for error messages from the annotation tools, and logs them in a user-actionable way.
- Create local mode with the following responsibilities:
- Interfaces with exactly one annotation tool, as specified by the Master Curator.
- Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
- Logs errors from the annotation tools in a user-actionable way.
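
For the HDFS setup responsibility, a minimal sketch using Hadoop's `FileSystem` API might look like the following; the local and HDFS paths are placeholders, and the NameNode location is assumed to come from the cluster's own configuration files rather than anything this design has decided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSetup {
    /**
     * Copies a local directory of documents into HDFS so the MapReduce
     * job can read it. Both paths here are placeholders.
     */
    public static void uploadDocumentCollection(String localDir, String hdfsDir)
            throws java.io.IOException {
        // Assumes the NameNode address is picked up from core-site.xml on the
        // classpath; it could also be set explicitly on the Configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(localDir), new Path(hdfsDir));
        fs.close();
    }
}
```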
Here's what the Master Curator needs to do, along with thoughts on how to do it.
First, a shell script does the following:
- Launch the Master Curator.
  - Specify that configuration comes from `curator.hadoop.master.properties` (for example).
- Decide what tool will be run on all documents.
  - Where is this specified?
- Figure out (parse?) what documents and annotations will be sent to Hadoop.
  - Where does this input come from?
- Write a serialized form of all those records to the disk (see the sketch after this list).
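
As a rough sketch of that last step, assuming the Record class is the Thrift-generated `edu.illinois.cs.cogcomp.thrift.curator.Record` (and can therefore be serialized with Thrift's `TSerializer`), and assuming each file is named after the Record's identifier field; the output directory and naming scheme are illustrative only:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordWriter {
    /** Serializes one Record to <outputDir>/<identifier>.record. */
    public static void writeRecord(Record record, File outputDir)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        byte[] bytes = serializer.serialize(record);
        // Using the Record's identifier as the file name is an assumption,
        // not something the design above has settled on.
        File outFile = new File(outputDir, record.getIdentifier() + ".record");
        FileOutputStream out = new FileOutputStream(outFile);
        try {
            out.write(bytes);
        } finally {
            out.close();
        }
    }
}
```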
At this point, a shell script initiates an `scp` to transfer all those serialized records to the Hadoop cluster. A script launches the Hadoop job on that document collection. Later, after the Hadoop job has finished, a script transfers all those serialized records back out from Hadoop to the local machine. It then launches Curator again and does the following:
- Re-construct Record objects from the ones on the disk (see the sketch after this list).
- Write them to the database, give them to the user, etc.
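
Reading the Records back is the mirror image of writing them out; again a sketch, assuming the same Thrift binary serialization and file layout as above:

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordReader {
    /** Reads one serialized Record back from disk. */
    public static Record readRecord(File recordFile)
            throws TException, IOException {
        byte[] bytes = new byte[(int) recordFile.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(recordFile));
        try {
            in.readFully(bytes);
        } finally {
            in.close();
        }
        Record record = new Record();
        TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
        deserializer.deserialize(record, bytes);
        return record;
    }
}
```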
Somewhere in there, there should be handling of error logs, and making those available to the user.
At this point, it appears the local Curator will be completely unchanged. Woohoo!
Here's what will happen, from its point of view:
1. (A MapReduce job checks whether the local Curator is running on "this" node; if it's not, it will launch said local Curator.)
2. The local Curator's config file will tell it there is only one annotation tool running.
3. At some point, a Curator Client calls the local Curator's `performAnnotation()` method (found in CuratorHandler; this method will "trust," so to speak, that the Record is providing the right dependencies), so this local Curator will respond by connecting to the annotation tool and having it run on the Record. (A sketch of the client side of this call follows this list.)
4. Steps 1-3 will be repeated a bunch of times, until finally the local Curator is shut down.
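
To make step 3 a bit more concrete: the Curator is exposed as a Thrift service, so the Curator Client would presumably talk to the local Curator roughly as below. This is a sketch based on the stock Curator Thrift interface, where the client-facing call is `provide(viewName, text, forceUpdate)`; the host, port, and view name are placeholders.

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class LocalCuratorCall {
    public static Record annotate(String text, String viewName) throws Exception {
        // Host and port are placeholders for wherever the local-mode Curator
        // is listening on this Hadoop node.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9090));
        TProtocol protocol = new TBinaryProtocol(transport);
        Curator.Client client = new Curator.Client(protocol);
        transport.open();
        try {
            // forceUpdate = false: trust whatever annotations are already attached.
            return client.provide(viewName, text, false);
        } finally {
            transport.close();
        }
    }
}
```

Note that the stock `provide()` call takes the view name and raw text rather than an already-filled Record; passing a Record complete with its dependencies, as described in the job steps below, would rely on the "no cache" handling discussed there.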
It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (bold denotes things the Overarching shell script does):
1. **Overarching shell script launched.** This will collect a list of the documents to be annotated, then...
2. **Ask the Master Curator to serialize records for that big list of documents and place them in some directory.**
3. **Copy that serialized-record document directory over to HDFS.** (HDFS gets serialized forms of all the Records to be annotated.)
4. **Launch a MapReduce job on the Hadoop cluster.**
5. The `reduce()` job checks the annotation tool, Curator, and client to make sure they're running; if not, it launches them.
   - Probably by running a shell script. Do this in Java like this: `Runtime.getRuntime().exec(myShellScript);` (a fuller sketch appears after this list).
   - Curator must be launched in "no cache" mode (i.e., force it to trust the annotation prerequisites provided in the Record and thus not make a database call).
     - This appears to already be done. Just call CuratorHandler's `performAnnotation()` method with an already-filled Record.
6. `reduce()` constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
7. `reduce()` calls `client.provide()` to get the requested annotation for our Record. Pass it the input text and the Record.
8. MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
9. **Copy the data back from HDFS.**
10. **Call a new (Master) Curator to read the local, serialized Records (complete with the new annotations) back into the database.**
11. **Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.**
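
As a fuller sketch of the launch in step 5: the bare `exec()` call above doesn't wait for the script, check its exit code, or capture its output, all of which matter for the user-actionable error logging mentioned earlier. The script name below is made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ToolLauncher {
    /**
     * Runs a (hypothetical) startup script that launches the local Curator
     * and its annotation tool, and fails loudly if the script fails.
     */
    public static void launchLocalCurator() throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec("./start_local_curator.sh");

        // Drain stdout so the child can't block on a full pipe, and keep the
        // output around for the error logs. (stderr should really be drained
        // too, or merged into stdout via ProcessBuilder.redirectErrorStream.)
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        StringBuilder output = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            output.append(line).append('\n');
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Curator startup script failed (exit " + exitCode
                    + "):\n" + output);
        }
    }
}
```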
Remaining things to figure out:
- Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
- Figure out how to send a job (programmatically) to the annotation tool
- Figure out where to modify the local-mode Curator code to check for the input directory (described in Step 4 above)
- Figure out how output is returned (programmatically) from annotation tools