This repository was archived by the owner on Feb 1, 2022. It is now read-only.
Curator Modifications
s3cur3 edited this page Jun 27, 2012 · 20 revisions
- Create Master mode with the following responsibilities:
  - Sets up document collection in Hadoop Distributed File System (HDFS).
  - Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
  - Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
  - Waits for error messages from the annotation tools, and logs them in a user-actionable way.
- Create local mode with the following responsibilities:
  - Interfaces with exactly one annotation tool, as specified by the Master Curator.
  - Assumes all dependencies for all documents are present in HDFS, and skips those documents which do not meet the requirements.
  - Logs errors from the annotation tools in a user-actionable way.
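The "skip documents that do not meet the requirements" responsibility of local mode could look roughly like the sketch below. It is only an illustration: the class `DependencyFilter`, its method names, and the annotation names in the usage note are hypothetical. It also assumes that something else has already listed which annotations exist for each document in HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for a local-mode Curator; not part of the existing code base. */
class DependencyFilter {

    /** Returns the required annotations that a document does not yet have (sorted). */
    static Set<String> missing(Set<String> required, Set<String> present) {
        Set<String> result = new TreeSet<String>(required);
        result.removeAll(present);
        return result;
    }

    /**
     * Returns the IDs of documents whose prerequisite annotations are all present,
     * in sorted order. Documents that fail the check are skipped and reported on stderr.
     */
    static List<String> runnableDocuments(Map<String, Set<String>> annotationsByDocument,
                                          Set<String> required) {
        // Copy into a TreeMap so that documents are visited in a deterministic order.
        Map<String, Set<String>> sorted = new TreeMap<String, Set<String>>(annotationsByDocument);
        List<String> runnable = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> doc : sorted.entrySet()) {
            Set<String> absent = missing(required, doc.getValue());
            if (absent.isEmpty()) {
                runnable.add(doc.getKey());
            } else {
                // User-actionable message: says which document and which prerequisites.
                System.err.println("Skipping " + doc.getKey() + ": missing annotations " + absent);
            }
        }
        return runnable;
    }
}
```

For example, if the tool requires (hypothetical) `tokens` and `pos` annotations, a document that only has `tokens` is left out of the returned list and named in the log message, so the user can see which prerequisite to generate first.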
Here's what the Master Curator needs to do, along with thoughts on how to do it:
- Launch
  - Specify that configuration comes from `curator.hadoop.master.properties` (for example)
- Decide what tool will be run on all documents
  - Where is this specified?
- Launch the local Curator with that annotation tool on all Hadoop nodes
  - Run shell script that "knows" the location of all Hadoop nodes?
  - Defer work on this (probably) until we actually have access to a Hadoop cluster
- Wait for confirmation from those nodes that their tools are up and running
  - Pass message over network?
- Figure out (parse?) what documents and annotations will be sent to Hadoop
  - Where does this input come from?
- Transfer those documents, with their prerequisite annotations
  - Initiate `scp` (or equivalent) transfer to Hadoop master (namenode?)
- Send job to Hadoop
  - Pass message over network to job tracker?
- Wait for the job to finish
  - How do we know when it finishes?
- Copy data out from HDFS
  - Initiate `scp` or equivalent transfer from Hadoop master
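One possible shape for the two `scp` transfer steps above is sketched here, using `ProcessBuilder` from the Java standard library. The class `ScpTransfer`, its method names, and the user, host, and path values in the usage note are all hypothetical. The sketch also assumes that `scp` is on the PATH and that key-based login to the Hadoop master is already set up. Note that `scp` only places files on the master's local disk. Moving them into HDFS would need a further step (for instance `hadoop fs -put`), which is not shown.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for the Master Curator; a sketch, not the project's actual transfer code. */
class ScpTransfer {

    /** Builds the command that recursively copies a local directory to the remote host. */
    static List<String> uploadCommand(String localDir, String user, String host, String remoteDir) {
        return Arrays.asList("scp", "-r", localDir, user + "@" + host + ":" + remoteDir);
    }

    /** Builds the command that recursively copies a remote directory back to the local machine. */
    static List<String> downloadCommand(String user, String host, String remoteDir, String localDir) {
        return Arrays.asList("scp", "-r", user + "@" + host + ":" + remoteDir, localDir);
    }

    /** Runs the command, forwarding its output to our own stdout/stderr, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A call such as `ScpTransfer.run(ScpTransfer.uploadCommand("/data/docs", "curator", "namenode.example.org", "/tmp/curator-in"))` would then block until the copy finishes. A non-zero exit code is the signal to log a user-actionable error.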
Here's what each local Curator (running on each Hadoop node) needs to do, along with thoughts on how to do it:
- Launch
  - Gets launched by the Master Curator
  - Bundled with the launch command is a note about which annotation tool we should launch (?)
  - Launch that tool in whatever the standard way to do so locally is
  - Scaffolding (during design): we can launch this by hand if we want to work on this code before moving on to the MC
- Wait for required tool to finish launching, then give the OK to Master Curator
  - How do we know it's ready?
  - How do we pass a message back to the MC?
  - Scaffolding (during design): we can skip giving the OK until we're ready to code the MC
- (MapReduce job launches outside of local Curator)
  - Was launched by the jobtracker after getting a message from the MC
  - Scaffolding (during design): Simulate job submission to a locally-running version of Hadoop
- Wait for input from a local map() operation
  - map() will copy the data to be processed to the user directory (`~/document_hash_here/`)
  - map() will add a `.lock` file to that directory while it is still writing to it
- When input is received and there is no lock on the input directory:
  - Lock the input directory (i.e., create a `.lock` file)
  - Prepare the job to send to the tool
    - Build a `edu.illinois.cs.cogcomp.thrift.curator.Record` structure?
  - Send the job to the annotation tool
    - How does this work? Is it similar to `client.provide()` in CuratorDemo.java?
  - Write the output to the local disk
    - Once we know how to send a job to the tool, this should be easy.
  - Unlock (i.e., delete the `.lock` file)
- (MapReduce job will handle transfer of the output back to the Master Curator once the `.lock` file is gone)
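The `.lock` file handshake described above could be sketched as follows. The class `InputLock` and its method names are hypothetical, and the annotation step is passed in as a plain `Runnable` because how a job is actually sent to the tool is still an open question. The sketch relies on `Files.createFile`, which checks for the file and creates it as one atomic step, so map() and the local Curator cannot both believe they hold the lock.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper implementing the ".lock" file convention for one input directory. */
class InputLock {
    static final String LOCK_FILE_NAME = ".lock";

    /** True while some process (map() or the local Curator) holds the lock. */
    static boolean isLocked(Path inputDir) {
        return Files.exists(inputDir.resolve(LOCK_FILE_NAME));
    }

    /** Tries to take the lock; returns false if the lock file already exists. */
    static boolean tryLock(Path inputDir) {
        try {
            Files.createFile(inputDir.resolve(LOCK_FILE_NAME)); // atomic check-and-create
            return true;
        } catch (FileAlreadyExistsException alreadyLocked) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the lock file (a no-op if it is already gone). */
    static void unlock(Path inputDir) {
        try {
            Files.deleteIfExists(inputDir.resolve(LOCK_FILE_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Runs the annotation step under the lock. Returns false, without running it,
     * if the directory is still locked (for example because map() is still writing).
     */
    static boolean processIfUnlocked(Path inputDir, Runnable annotate) {
        if (!tryLock(inputDir)) {
            return false;
        }
        try {
            annotate.run();
        } finally {
            unlock(inputDir); // always release, even if the tool throws
        }
        return true;
    }
}
```

The local Curator could call `processIfUnlocked` in a polling loop for each input directory. A `false` return simply means "try again later".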
- Figure out how to launch a locally-running Curator with a single annotation tool (probably from the command line)
- Figure out how to send a job (programmatically) to the annotation tool
- Figure out where to modify the local-mode Curator code to check for the input directory (described in Step 4 above, "Wait for input from a local map() operation")
- Figure out how output is returned (programmatically) from annotation tools