
Curator Modifications


Before reading this page

Note that this page is not the easiest to follow. You're probably better off reading the Project Overview or Project Roadmap for an overview of how we use the Curator.

Bird's-Eye view of these modifications

  • Create Master mode with the following responsibilities:
    • Serializes and de-serializes Record objects, one for each input file
  • Create local mode with the following responsibilities:
    • Interfaces with exactly one annotation tool, as specified by the Hadoop job.
    • Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
    • Logs errors from the annotation tools in a user-actionable way.

Master Curator Mode for Hadoop

Here's what the Master Curator needs to do, along with thoughts on how to do it.

First, a shell script does the following:

  1. Launch
    • Specify that configuration comes from curator.hadoop.master.properties (for example)
  2. Figure out what documents and annotations will be sent to Hadoop and ask the Master Curator for Record objects for these documents
  3. Write a serialized form of all those records to disk (see the sketch below).
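
A minimal sketch of what step 3 might look like, assuming the Record is the Thrift-generated struct from the curator-client jar (so Apache Thrift's TSerializer can flatten it to bytes) and that it exposes a getIdentifier() accessor. The writeRecordToDisk helper and the one-file-per-document layout are our own assumptions, not existing Curator code:

```java
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordWriter {
    // Hypothetical helper: serialize one Thrift Record and write it to disk,
    // one file per input document, named after the Record's identifier.
    public static void writeRecordToDisk(Record record, String outputDir)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        byte[] bytes = serializer.serialize(record);

        FileOutputStream out =
                new FileOutputStream(outputDir + "/" + record.getIdentifier());
        try {
            out.write(bytes);
        } finally {
            out.close();
        }
    }
}
```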

At this point, a shell script initiates an scp to transfer all those serialized records to the Hadoop cluster, and another script launches the Hadoop job on that document collection. After the Hadoop job has finished, a script transfers the serialized records back from Hadoop to the local machine. It then launches the Curator again, which does the following:

  1. Reconstruct Record objects from the serialized forms on disk (see the sketch after this list).
  2. Write them to the database, give them to the user, etc.
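
And the corresponding read-back for step 1, again assuming a Thrift-generated Record so TDeserializer can rebuild it from bytes; readRecordFromDisk is a hypothetical helper matching the writer sketched above:

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordReader {
    // Hypothetical helper: read the bytes written by writeRecordToDisk()
    // and rebuild the Thrift Record object from them.
    public static Record readRecordFromDisk(File serializedRecord)
            throws TException, IOException {
        byte[] bytes = new byte[(int) serializedRecord.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(serializedRecord));
        try {
            in.readFully(bytes);
        } finally {
            in.close();
        }

        Record record = new Record();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, bytes);
        return record;
    }
}
```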

Somewhere in this process, error logs must be collected and made available to the user.

Local/Slave Curator Mode for Hadoop

At this point, it appears the local Curator will be completely unchanged. Woohoo!

Here's what will happen, from its point of view:

  1. (A MapReduce job checks whether the local Curator is running on "this" node; if not, it launches said local Curator.)
  2. The local Curator's config file will tell it that there is only one annotation tool running.
  3. At some point, a Curator Client calls the local Curator's performAnnotation() method (found in CuratorHandler; this method will trust that the Record provides the right dependencies). The local Curator will respond by connecting to the annotation tool and running it on the Record. (See the client sketch after this list.)
  4. Steps 1--3 will be repeated until the user is done running jobs on the Hadoop interface.
  5. Finally, the local Curator is shut down via an external shell script.
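
For step 3, here's a rough sketch of how a Curator Client might talk to the locally-running Curator over Thrift. The port, the framed binary transport, the "ner" view name, and the exact provide() signature are all assumptions to be checked against the Curator Thrift definition; if performAnnotation() gets exposed through the Thrift service (see the to-do list below), that call would replace provide() here:

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class LocalCuratorClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumed: the local Curator is listening on this node at port 9010
        // over a framed binary Thrift transport.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9010));
        TProtocol protocol = new TBinaryProtocol(transport);
        Curator.Client client = new Curator.Client(protocol);

        transport.open();
        try {
            // Ask the single locally-running annotator (here assumed to be "ner")
            // for its view. With caching disabled, the Curator trusts whatever
            // dependency views are already present in the Record.
            Record record = client.provide("ner", "John Smith works in Chicago.", false);
            System.out.println(record);
        } finally {
            transport.close();
        }
    }
}
```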

Interactions between MapReduce, the Curator Client, Curator, and Annotators

It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (bold denotes steps performed by the overarching shell script):

  1. The overarching shell script is launched. It collects a list of the documents to be annotated.
  2. Ask the Master Curator to serialize records for that big list of documents and place them in some directory.
  3. Copy that serialized-record document directory over to HDFS. (HDFS gets the serialized forms of all the Records to be annotated.)
  4. Launch a MapReduce job on the Hadoop cluster.
  5. The reduce() task checks that the annotation tool, Curator, and client are all running; if not, it launches them.
    • Probably by running a shell script. Do this in Java like this: Runtime.getRuntime().exec(myShellScript); (Source).
    • Curator must be launched in "no cache" mode (i.e., force it to trust the annotation prerequisites provided in the Record and thus not make a database call).
      • This appears to already be done. Just call CuratorHandler's performAnnotation() method with an already-filled Record.
  6. reduce() constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
  7. reduce() calls client.provide() to get the requested annotation for our Record, passing it the input text and the Record. (See the reducer sketch after this list.)
  8. MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
  9. Copy the data back from HDFS.
  10. Call a new (Master) Curator to read the local, serialized-Records (complete with the new annotations) back into the database.
  11. Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.
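
A skeletal picture of steps 5-8 from the reducer's point of view, using the org.apache.hadoop.mapreduce API. The key/value types, the ensure_curator_running.sh wrapper script, and the annotateRecord() placeholder are all hypothetical; they just mark where the real shell scripts and Curator Client calls would go:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record;

// Sketch only: keys are document identifiers, values are serialized Records.
public class CuratorReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Step 5: make sure the annotator and the local Curator are up on this node.
        // ensure_curator_running.sh is a hypothetical wrapper around the
        // launch_annotator_on_this_node.sh / launch_curator_on_this_node.sh scripts.
        Runtime.getRuntime().exec("ensure_curator_running.sh").waitFor();
    }

    @Override
    protected void reduce(Text docId, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        try {
            for (BytesWritable value : values) {
                // Step 6: rebuild the Curator-friendly Record from its serialized form.
                byte[] inBytes = Arrays.copyOf(value.getBytes(), value.getLength());
                Record record = new Record();
                new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, inBytes);

                // Step 7: hand the Record to the locally-running Curator Client and
                // get back a Record containing the new annotation.
                Record annotated = annotateRecord(record);

                // Step 8: write the serialized, annotated Record back out so it can
                // be copied off HDFS afterwards.
                byte[] outBytes = new TSerializer(new TBinaryProtocol.Factory()).serialize(annotated);
                context.write(docId, new BytesWritable(outBytes));
            }
        } catch (TException e) {
            throw new IOException(e);
        }
    }

    // Hypothetical placeholder: the Thrift call into the local Curator
    // (client.provide() / performAnnotation()) would go here.
    private Record annotateRecord(Record record) throws TException {
        return record;
    }
}
```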

An actually useful to-do list for making this happen

  • Figure out how to send a job (programmatically) to the annotation tool
    • Use a CuratorClient (details still unknown)
      • Expose the CuratorHandler's performAnnotation() method through the Thrift Curator service (documentation)
      • Looks like this will just require adding a method to curator-client-0.6.jar!/edu/illinois/cs/cogcomp/thrift/curator/Curator.java -- ask Mark how to modify this?

Done

  • Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
    • Run the launch_annotator_on_this_node.sh, then launch_curator_on_this_node.sh shell scripts
  • Figure out how output is returned (programmatically) from annotation tools
    • CuratorClient's client.provide() returns a Record, which will contain the newly added annotation
  • Figure out how to serialize and de-serialize a Record from within the Curator; create methods in Curator to do this en masse
    • cf. Record's toString() method
    • Basically, get each view map (a Map<String, SomeViewType>), then iterate through its keys to figure out what type of annotation each entry holds
    • The Record uses Strings to identify the different types of annotations. More info on the Curator Annotation Identifiers page.
    • New Curator methods serializeRecord() and deserializeRecord() should be filesystem-independent, returning a Map of annotation keys and their text string values (sketched below).
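
One possible shape for serializeRecord(), assuming the Record's Thrift-generated accessors for its view maps (e.g., getLabelViews()) and using Thrift's JSON protocol to turn each view into a text string. The accessor names, the Labeling import path, and the choice of JSON are assumptions to be checked against the actual Record definition:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TJSONProtocol;

import edu.illinois.cs.cogcomp.thrift.base.Labeling;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

public class RecordSerialization {
    // Sketch of serializeRecord(): walk each view map on the Record and turn
    // every view into a JSON string keyed by its annotation identifier.
    // Only the label views are shown; the other view maps would be handled
    // the same way. Accessor names are assumptions.
    public static Map<String, String> serializeRecord(Record record) throws TException {
        TSerializer serializer = new TSerializer(new TJSONProtocol.Factory());
        Map<String, String> serialized = new HashMap<String, String>();

        if (record.getLabelViews() != null) {
            for (Map.Entry<String, Labeling> view : record.getLabelViews().entrySet()) {
                // The key is the annotation identifier (e.g., "ner", "pos");
                // see the Curator Annotation Identifiers page.
                serialized.put(view.getKey(), serializer.toString(view.getValue()));
            }
        }
        // deserializeRecord() would invert this with a TDeserializer and the
        // same TJSONProtocol, first looking up each key's view type.
        return serialized;
    }
}
```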