
Curator Modifications


Before reading this page

For a higher-level overview of these modifications, you're probably better off reading the Project Overview or Project Roadmap.

Bird's-eye view of these modifications

  • Create "master" CuratorClient with the following responsibilities:
    • Serializes and de-serializes Record objects, one for each input file
  • Create Hadoop-side Curator (local to individual MapReduce nodes) with the following responsibilities:
    • Interfaces with exactly one annotation tool, as specified by the Hadoop job.
    • Assumes all dependencies for all documents are present in HDFS, and skips any documents whose dependencies are missing.
    • Shuts down both the Curator and the annotator if either is inactive for too long. (This prevents orphaned processes running indefinitely on Hadoop nodes.)
    • Logs errors from the annotation tools in a user-actionable way.

Master (User-Side) CuratorClient

Here's what the Master Curator does.

First, a shell script does the following:

  1. Launches the Master Curator.
  2. Figures out which documents and annotations will be sent to Hadoop and asks the Master Curator for Record objects for those documents.
  3. Writes a serialized form of all those Records to disk (serialization is sketched at the end of this section).

At this point, a shell script initiates an scp to transfer all those serialized records to the Hadoop cluster. A script launches the Hadoop job on that document collection. Later, after the Hadoop job has finished, a script transfers all those serialized records back out from Hadoop to the local machine. It then launches Curator again and does the following:

  1. Re-constructs Record objects from the serialized ones on the disk.
  2. Writes them to the database, gives them to the user, etc.

Along the way, error logs from the annotation tools are captured and made available to the user.
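Since Record is defined in the Curator's Thrift interfaces, the serialize/de-serialize steps above can be handled with Thrift's own TSerializer and TDeserializer. The sketch below is a minimal illustration only: the Record import, the binary protocol, and the one-file-per-document layout are assumptions, and the actual Master Curator code may differ in both protocol choice and file layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

import edu.illinois.cs.cogcomp.thrift.curator.Record; // assumed location of the generated Record class

/** Minimal sketch: one serialized Record per input document, written to a local directory. */
public class RecordSerializationSketch {

    /** Serialize a single Record to disk (one file per document). */
    public static void writeRecord(Record record, Path outputFile)
            throws TException, IOException {
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        Files.write(outputFile, serializer.serialize(record));
    }

    /** Reconstruct a Record from its serialized form on disk. */
    public static Record readRecord(Path inputFile) throws TException, IOException {
        TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
        Record record = new Record();
        deserializer.deserialize(record, Files.readAllBytes(inputFile));
        return record;
    }
}
```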

Hadoop-Side (Node-local) Curator and Annotators

The Curator and annotators running on the Hadoop cluster (and, of course, the Thrift definitions that these inherit from) are all modified in one way: they keep track of when they last performed an annotation and shut themselves down if they wait more than about 5 minutes between annotations (as seen in the code for the CuratorServer, IllinoisSRLServer, and other "Server" classes). This prevents orphaned processes from running indefinitely on the cluster (eventually forcing system administrators to reboot the cluster!) in the event that we don't cleanly shut down.

Note: the changes in the "Server" classes of each annotator are all pretty straightforward: they simply create a thread for an InactiveCuratorKiller or InactiveAnnotatorKiller, which queries the Handler for its time of last annotation and shuts it down if it has been idle too long. When you build the Curator (using ant dist as usual) with the updated .java and .thrift files found in our modified_files_in_curator directory, the Server classes pick up the required changes. There is one exception: the two classes used for SRL (the nominal and verb SRL handlers) rely on a single underlying JAR, illinoisSRL-3.0.3.jar. Because changing that JAR requires rebuilding it via Maven (after adding our rebuilt curator-interfaces.jar to the dependencies and swapping in our IllinoisSRLServer.java and IllinoisSRLHandler.java), we have pre-built it as illinoisSRL-3.0.3-1.jar and included it in Git. To use this SRL Handler/Server combination instead of the default, be sure to modify the classpaths in the Verb and Nom SRL startup scripts.
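The inactivity-killer pattern is easy to picture in isolation. Below is a minimal, illustrative sketch of such a watchdog thread; the interface, method names, polling interval, and exit behavior are all assumptions made for the sake of the example, not the actual InactiveCuratorKiller/InactiveAnnotatorKiller code, which lives in the modified "Server" classes.

```java
/**
 * Illustrative sketch of an inactivity killer: a daemon thread that polls the
 * handler for the time of its last annotation and exits the process when the
 * handler has been idle for too long. Names, the 5-minute threshold, and the
 * use of System.exit() are assumptions; see the modified *Server classes for
 * the real implementation.
 */
public class InactivityKillerSketch extends Thread {

    private static final long MAX_IDLE_MILLIS = 5 * 60 * 1000;  // roughly 5 minutes
    private static final long POLL_INTERVAL_MILLIS = 30 * 1000; // check every 30 s

    /** Hypothetical interface: anything that can report when it last annotated. */
    public interface IdleReporter {
        long getTimeOfLastAnnotationMillis();
    }

    private final IdleReporter handler;

    public InactivityKillerSketch(IdleReporter handler) {
        this.handler = handler;
        setDaemon(true); // never keep the JVM alive on our account
    }

    @Override
    public void run() {
        while (true) {
            long idle = System.currentTimeMillis() - handler.getTimeOfLastAnnotationMillis();
            if (idle > MAX_IDLE_MILLIS) {
                // Shut the whole server process down so no orphaned annotator
                // lingers on the Hadoop node.
                System.exit(0);
            }
            try {
                Thread.sleep(POLL_INTERVAL_MILLIS);
            } catch (InterruptedException e) {
                return; // treat interruption as a request to stop watching
            }
        }
    }
}
```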

Here's what will happen, from the local Curator's point of view:

  1. (A MapReduce job checks whether the local Curator is running on "this" node; if it is not, it launches the local Curator.)
  2. The local Curator's config file will tell it that there is only 1 annotation tool running.
  3. At some point, a Curator Client calls the local Curator's performAnnotation() (found in CuratorHandler; this method trusts that the Record provides the right dependencies). The local Curator responds by connecting to the annotation tool and running it on the Record. (A sketch of this call appears after this list.)
  4. Steps 1--3 will be repeated until the user is done running jobs on the Hadoop interface.
  5. Finally, the local Curator is shut down via an external shell script.
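For concreteness, step 3 boils down to an ordinary Thrift client call against the node-local Curator. The sketch below assumes the usual Curator client setup (framed transport, binary protocol), a performAnnotation(Record) signature, and a localhost port; all of these are illustrative assumptions, and the real signature is whatever the modified .thrift files declare.

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Assumed locations of the generated Curator service and Record struct.
import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;

/** Sketch of step 3: a Curator Client asking the node-local Curator for one annotation. */
public class LocalAnnotationSketch {

    public static Record annotateLocally(Record record) throws Exception {
        // The node-local Curator is assumed to listen on localhost:9010;
        // the real host/port come from the Hadoop job's configuration.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9010));
        try {
            transport.open();
            Curator.Client curator = new Curator.Client(new TBinaryProtocol(transport));
            // performAnnotation() trusts that the Record already carries the
            // dependencies required by the (single) locally-running annotator.
            // Its exact signature is an assumption here.
            return curator.performAnnotation(record);
        } finally {
            transport.close();
        }
    }
}
```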

Interactions between MapReduce, the Curator Client, Curator, and Annotators

It's hard to think about these apart from the MapReduce job, so here's what a job will look like as a whole (several of these steps are carried out by the overarching shell script itself):

  1. The overarching shell script is launched by the JobHandler; it collects a list of the documents to be annotated.
  2. Ask the master Curator to serialize records for that big list of documents and place them in some local directory.
  3. Copy that directory of serialized Records over to HDFS. (HDFS now holds serialized forms of all the Records to be annotated.)
  4. Launch a MapReduce job on the Hadoop cluster.
  5. reduce() checks that the annotation tool, Curator, and Curator Client are running; if not, it launches them.
    • It uses Runtime.exec() to launch them just as you would from the command line, e.g. Runtime.getRuntime().exec("myshellcommand"); (see the sketch after this list).
  6. reduce() constructs (Curator-friendly) Records to pass to the locally-running Curator Client.
  7. reduce() calls client.provide() to get the requested annotation for our Record, passing it the input text and the Record.
  8. MapReduce writes the (serialized Record) output to a place that's easy to access from the outside (hopefully just HDFS).
  9. The JobHandler copies the data back from HDFS to the local machine.
  10. The JobHandler calls the master Curator to read the local serialized Records (complete with the new annotations) back into the database.
  11. Shut down all Curators, Curator Clients, and annotators on the Hadoop cluster.
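As a concrete illustration of step 5, launching a node-local process from reduce() looks roughly like the following. The script name startLocalCurator.sh is hypothetical, and the real job code decides how to detect whether the Curator is already running; this is only a sketch of the Runtime.exec() approach.

```java
import java.io.IOException;

/** Sketch of step 5: launching the node-local Curator (or an annotator) from reduce(). */
public class LaunchLocalCuratorSketch {

    /**
     * Launch a start script on this Hadoop node and wait for it to finish.
     * "startLocalCurator.sh" is a hypothetical name used for illustration only.
     */
    public static void launchIfNeeded() throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec("sh startLocalCurator.sh");

        // The start script is expected to daemonize and return quickly; a nonzero
        // exit code means the local Curator could not be started on this node.
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("Local Curator start script failed with exit code " + exitCode);
        }
    }
}
```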