This repository was archived by the owner on Feb 1, 2022. It is now read-only.

Curator Modifications

s3cur3 edited this page Jun 27, 2012 · 20 revisions

Bird's-eye view of these modifications

  • Create a Master mode with the following responsibilities:
    • Sets up document collection in Hadoop Distributed File System (HDFS).
    • Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
    • Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
    • Waits for error messages from the annotation tools, and logs them in a user-actionable way.
  • Create a local mode with the following responsibilities:
    • Interfaces with exactly one annotation tool, as specified by the Master Curator.
    • Assumes each document's dependencies are present in HDFS, and skips documents that do not meet that requirement.
    • Logs errors from the annotation tools in a user-actionable way.

Master Curator Mode for Hadoop

Here's what the Master Curator needs to do, along with thoughts on how to do it:

  1. Launch
    • Specify that configuration comes from curator.hadoop.master.properties (for example)
  2. Decide what tool will be run on all documents
    • Where is this specified?
  3. Launch the local Curator with that annotation tool on all Hadoop nodes
    • Run shell script that "knows" the location of all Hadoop nodes?
  4. Wait for confirmation from those nodes that their tools are up and running
    • Pass message over network?
  5. Figure out (parse?) what documents and annotations will be sent to Hadoop
    • Where does this input come from?
  6. Transfer those documents, with their prerequisite annotations
    • Initiate scp (or equivalent) transfer to Hadoop master (namenode?)
  7. Send job to Hadoop
    • Pass message over network to job tracker?
  8. Wait for the job to finish
    • How do we know when it finishes?
  9. Copy data out from HDFS
    • Initiate scp or equivalent transfer from Hadoop master
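
Steps 5–9 above could be sketched as a shell script on the Master Curator's side. This is only a sketch under stated assumptions: it assumes the `hadoop` CLI is on the PATH of the machine initiating the transfer (rather than a separate `scp` hop to the namenode), and the HDFS paths, jar name, class name, and `-in`/`-out` flags are hypothetical placeholders, not a final interface. One useful property: `hadoop jar` blocks until the job completes, which would answer the question in step 8.

```shell
#!/bin/sh
# Sketch of the Master Curator's Hadoop workflow (steps 5-9).
# All paths, the jar name, and the flags are hypothetical placeholders.

run_curator_job() {
    input_dir=$1                 # local documents + prerequisite annotations
    output_dir=$2                # where to copy the annotated results
    hdfs_in=/curator/input       # hypothetical HDFS staging directories
    hdfs_out=/curator/output

    # 6. Transfer the documents (with prerequisite annotations) into HDFS.
    hadoop fs -mkdir "$hdfs_in"
    hadoop fs -put "$input_dir"/* "$hdfs_in"

    # 7-8. Submit the job. `hadoop jar` blocks until the job finishes,
    #      so step 8's "wait" is implicit in the command returning.
    hadoop jar HadoopInterface.jar HadoopInterface \
        -in "$hdfs_in" -out "$hdfs_out"

    # 9. Copy the annotated results back out of HDFS.
    hadoop fs -get "$hdfs_out" "$output_dir"
}
```

Driving the workflow through the `hadoop` CLI also sidesteps the "pass message over network to job tracker?" question in step 7: the CLI already knows how to reach the jobtracker from the cluster configuration.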

Local/Slave Curator Mode for Hadoop

Here's what each local Curator (running on each Hadoop node) needs to do, along with thoughts on how to do it:

  1. Launch
    • Gets launched by the Master Curator
    • Bundled with the launch command is a note specifying which annotation tool to launch (?)
    • Launch that tool in the standard way to run it locally
  2. Wait for required tool to finish launching, then give the OK to Master Curator
    • How do we know it's ready?
    • How do we pass a message back to the MC?
  3. (MapReduce job launches outside of local Curator)
    • Was launched by the jobtracker after getting a message from the MC
  4. Wait for input from a local map() operation
    • map() will copy the data to be processed to the user directory (~/document_hash_here/)
    • map() will add a .lock file to that directory while it is still writing to it
  5. When input is received and there is no lock on the input directory:
    1. Lock the input directory (i.e., create a .lock file)
    2. Prepare the job to send to the tool
      • Build an edu.illinois.cs.cogcomp.thrift.curator.Record structure?
    3. Send the job to the annotation tool
    4. Write the output to the local disk
      • Once we know how to send a job to the tool, this should be easy.
    5. Unlock (i.e., delete the .lock file)
  6. (MapReduce job will handle transfer of the output back to the Master Curator once the .lock file is gone)
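
The `.lock` handshake in steps 4–6 can be sketched with ordinary shell commands. The file names follow the notes above; the one-second polling interval and the `process_document` helper (a stand-in for steps 5.2–5.4, i.e., building the job, sending it to the annotation tool, and writing the output) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the .lock handshake between map() and the local Curator.
# process_document is a hypothetical stand-in for "send the job to the
# annotation tool and write its output"; real behavior is tool-specific.

wait_and_process() {
    doc_dir=$1                      # e.g., the per-document directory map() wrote

    # 4-5. Wait until map() has finished writing (its .lock is gone).
    while [ -e "$doc_dir/.lock" ]; do
        sleep 1
    done

    # 5.1 Lock the directory ourselves while we annotate.
    touch "$doc_dir/.lock"

    # 5.2-5.4 Prepare the job, run the tool, write output to local disk.
    process_document "$doc_dir"

    # 5.5 Unlock; the MapReduce job may now copy the output back (step 6).
    rm -f "$doc_dir/.lock"
}
```

Note that check-then-create locking like this has an inherent race between the existence test and `touch`; if that matters in practice, an atomic primitive such as a `mkdir`-based lock would be safer.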