
Curator Modifications


Bird's-Eye view of these modifications

  • Create Master mode with the following responsibilities:
    • Sets up document collection in Hadoop Distributed File System (HDFS).
    • Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
    • Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
    • Waits for error messages from the annotation tools, and logs them in a user-actionable way.
  • Create local mode with the following responsibilities:
    • Interfaces with exactly one annotation tool, as specified by the Master Curator.
    • Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
    • Logs errors from the annotation tools in a user-actionable way.

Master Curator Mode for Hadoop

Here's what the Master Curator needs to do, along with thoughts on how to do it.

First, a shell script does the following:

  1. Launch
    • Specify that configuration comes from curator.hadoop.master.properties (for example)
  2. Decide what tool will be run on all documents
    • Where is this specified?
  3. Figure out (parse?) what documents and annotations will be sent to Hadoop
    • Where does this input come from?
  4. Write a serialized form of all of those records to disk (a sketch follows this list).
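
Step 4 might look roughly like the sketch below. It assumes Curator's Thrift-generated Record class and the standard Thrift binary serializer; the class name and the one-file-per-document naming scheme are placeholders, not the actual implementation.

```java
// Hypothetical sketch of step 4: write each Record to disk in Thrift's binary
// form so it can later be shipped to HDFS. The Record import and file-naming
// scheme are assumptions, not the project's actual code.
import edu.illinois.cs.cogcomp.thrift.curator.Record;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

public class RecordSerializer {
    /** Writes one ".record" file per document into outputDir. */
    public static void writeRecords(Map<String, Record> recordsByDocId, String outputDir)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        for (Map.Entry<String, Record> entry : recordsByDocId.entrySet()) {
            byte[] bytes = serializer.serialize(entry.getValue());
            FileOutputStream out =
                new FileOutputStream(outputDir + "/" + entry.getKey() + ".record");
            try {
                out.write(bytes);
            } finally {
                out.close();
            }
        }
    }
}
```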

At this point, a shell script uses scp to transfer all of those serialized records to the Hadoop cluster, and another script launches the Hadoop job on that document collection. After the Hadoop job has finished, a script transfers the serialized records back out of Hadoop to the local machine (see the HDFS transfer sketch at the end of this section). It then launches Curator again and does the following:

  1. Reconstruct Record objects from the serialized ones on disk (a sketch follows this list).
  2. Write them to the database, give them to the user, etc.
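
A minimal sketch of step 1, the mirror image of the serialization sketch above; again, the Record class and the file path are assumptions.

```java
// Hypothetical sketch: rebuild a Record object from its serialized form on
// local disk, ready to be stored in the database or handed to the user.
import edu.illinois.cs.cogcomp.thrift.curator.Record;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RecordReader {
    public static Record readRecord(String recordFile) throws IOException, TException {
        TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
        Record record = new Record();
        deserializer.deserialize(record, Files.readAllBytes(Paths.get(recordFile)));
        return record;
    }
}
```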

Somewhere in there, there should be handling of error logs, and making those available to the user.
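
For the transfer steps above, one option (instead of scp plus hadoop fs commands) is Hadoop's FileSystem API; the paths below are placeholders, and this assumes the script runs on a machine with HDFS access.

```java
// Hypothetical sketch: copy the serialized-Record directory into HDFS before
// the job and pull the annotated Records back out afterwards.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class HdfsTransfer {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Before the MapReduce job: ship the serialized Records into HDFS.
        fs.copyFromLocalFile(new Path("/tmp/serialized_records"),
                             new Path("/user/curator/input"));

        // After the job finishes: bring the annotated Records back.
        fs.copyToLocalFile(new Path("/user/curator/output"),
                           new Path("/tmp/annotated_records"));

        fs.close();
    }
}
```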

Local/Slave Curator Mode for Hadoop

At this point, it appears the local Curator will be completely unchanged. Woohoo!

Here's what will happen, from its point of view:

  1. (A MapReduce job checks that the local Curator is running on "this" node; if it's not, it launches said local Curator.)
  2. The local Curator's config file will tell it there is only 1 annotation tool running.
  3. At some point, a Curator Client calls the local Curator's performAnnotation() (found in CuratorHandler; this method will "trust," so to speak, that the Record is providing the right dependencies), so this local Curator will respond by connecting to the annotation tool and having it run on the Record. (A sketch of this call follows the list.)
  4. Steps 1--3 will be repeated a bunch of times, until finally, the local Curator is shut down.
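
For reference, the Curator Client side of step 3 might look roughly like this. The port, the framed binary transport, and especially the performAnnotation() signature are assumptions that should be checked against CuratorHandler.

```java
// Hypothetical sketch of step 3: connect to the locally-running Curator over
// Thrift and ask it to annotate a Record whose dependency views are already
// filled in. Port, transport, and the performAnnotation() signature are
// assumptions, not the actual CuratorHandler interface.
import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class LocalCuratorClient {
    public static Record annotate(Record record, String annotationTool) throws TException {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9010));
        try {
            transport.open();
            Curator.Client curator = new Curator.Client(new TBinaryProtocol(transport));
            // performAnnotation() "trusts" the dependencies already in the Record,
            // so the local Curator makes no cache/database calls.
            return curator.performAnnotation(record, annotationTool);
        } finally {
            transport.close();
        }
    }
}
```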

Interactions between MapReduce, the Curator Client, Curator, and Annotators

It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (bold denotes things the Overarching shell script does):

  1. The Overarching shell script is launched. It will collect a list of the documents to be annotated, then...
  2. Ask the Master Curator to serialize records for that big list of documents and place them in some directory.
  3. Copy that serialized-record document directory over to HDFS. (HDFS gets serialized forms of all the Records to be annotated.)
  4. Launch a MapReduce job on the Hadoop cluster.
  5. reduce() checks that the annotation tool, Curator, and client are running; if not, it launches them (see the sketch after this list).
    • Probably by running a shell script. In Java: Runtime.getRuntime().exec(myShellScript);
    • Curator must be launched in "no cache" mode (i.e., force it to trust the annotation prerequisites provided in the Record and thus not make a database call).
      • This appears to already be done. Just call CuratorHandler's performAnnotation() method with an already-filled Record.
  6. reduce() constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
  7. reduce() calls client.provide() to get the requested annotation for our Record. Pass it the input text and the Record.
  8. MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
  9. Copy the data back from HDFS.
  10. Call a new (Master) Curator to read the local serialized Records (complete with the new annotations) back into the database.
  11. Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.
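
Pulling steps 5 through 8 together, the reduce() side might be structured roughly as below. The key/value types, the launch script path, and the helper methods are all placeholders; a real version would deserialize the Record and call the Curator Client (client.provide() or performAnnotation()) inside annotateViaLocalCurator().

```java
// Hypothetical sketch of steps 5-8: each reduce() call makes sure the local
// Curator stack is up, annotates the serialized Records it receives, and
// writes the re-serialized results back out so they land in HDFS.
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Arrays;

public class CuratorReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text docId, Iterable<BytesWritable> serializedRecords, Context context)
            throws IOException, InterruptedException {
        ensureCuratorStackIsRunning();  // step 5
        for (BytesWritable value : serializedRecords) {
            byte[] input = Arrays.copyOf(value.getBytes(), value.getLength());
            byte[] annotated = annotateViaLocalCurator(input);   // steps 6-7
            context.write(docId, new BytesWritable(annotated));  // step 8
        }
    }

    /** Step 5: launch the annotator, Curator, and client via a shell script if needed. */
    private void ensureCuratorStackIsRunning() throws IOException, InterruptedException {
        // Script name is a placeholder; see the Runtime.exec() note above.
        Process p = Runtime.getRuntime().exec("./scripts/launch_local_curator.sh");
        p.waitFor();
    }

    /** Steps 6-7: deserialize the Record, call the Curator Client, re-serialize. */
    private byte[] annotateViaLocalCurator(byte[] serializedRecord) {
        // Placeholder: the real implementation would rebuild the Record and
        // call client.provide() / performAnnotation() on the local Curator.
        return serializedRecord;
    }
}
```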

An actually useful to-do list for making this happen

  • Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
  • Figure out how to send a job (programmatically) to the annotation tool
  • Figure out where to modify the local-mode Curator code to check for the input directory (described in Step 4 above)
  • Figure out how output is returned (programmatically) from annotation tools