This repository was archived by the owner on Feb 1, 2022. It is now read-only.

Project Roadmap

s3cur3 edited this page Jun 21, 2012 · 14 revisions

In Progress: Hadoop Parallelization Interface
- Sets up the annotation jobs in HadoopInterface.java, using the modified CuratorMapper.java and CuratorReducer.java.
- Accepts command-line arguments flexibly: only the location of the document collection in distributed storage and the annotation mode are required, but much finer control is available when desired.
- Distributes inputs across the cluster at the file level.
- Uses the locally-running Curator on each node to interface with the annotation tool.
- Requires a new class, Record, that holds each document and any of its annotations. To use Record as input to a MapReduce job, we also need several infrastructure classes: DirectoryInputFormat, DirectorySplit, and CuratorRecordReader (all described in the Infrastructure UML Diagram).
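
As a rough illustration of the Record class described above, here is a minimal, dependency-free sketch. The field names, method names, and the choice to store each annotation as a serialized string keyed by annotation type are all assumptions made for illustration; they are not taken from the project's code. A version used as a MapReduce value would likely also need to implement Hadoop's Writable interface, which is omitted here.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: names and storage layout are assumptions,
// not the project's actual Record implementation.
class Record {
    private final String documentId;    // assumed unique identifier for the document
    private final String originalText;  // the raw document text
    // Assumed layout: annotation type (e.g. "TOKEN") -> serialized annotation
    private final Map<String, String> annotations = new LinkedHashMap<>();

    Record(String documentId, String originalText) {
        this.documentId = documentId;
        this.originalText = originalText;
    }

    String getDocumentId() {
        return documentId;
    }

    String getOriginalText() {
        return originalText;
    }

    // Stores (or overwrites) the annotation of the given type.
    void addAnnotation(String annotationType, String serializedAnnotation) {
        annotations.put(annotationType, serializedAnnotation);
    }

    boolean hasAnnotation(String annotationType) {
        return annotations.containsKey(annotationType);
    }

    // Read-only view, so callers cannot modify the record's state directly.
    Map<String, String> getAnnotations() {
        return Collections.unmodifiableMap(annotations);
    }
}
```

Keeping the document and its annotations together in one object means a single input record carries everything a map task needs for that document.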

TODO: Changes to Curator
- Create Master mode with the following responsibilities:
  - Sets up document collection in Hadoop Distributed File System (HDFS).
  - Launches local-mode Curators and associated annotation tools on all Hadoop nodes.
  - Sends batch job to Hadoop cluster (i.e., starts HadoopInterface.java with the proper parameters).
  - Waits for error messages from the annotation tools, and logs them in a user-actionable way.
- Create local mode with the following responsibilities:
  - Interfaces with exactly one annotation tool, as specified by the Master Curator.
  - Assumes the dependencies for every document are already present in HDFS, and skips any document that does not meet the requirements.
  - Logs errors from the annotation tools in a user-actionable way.
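
The "skip documents that do not meet the requirements" behavior of local mode could be sketched as follows. This assumes that a document's dependencies are the annotation types a tool needs as input, which is one reading of the item above. The DependencyChecker class, its methods, and the example annotation names are hypothetical and do not come from the Curator code base.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: decides whether a document can be annotated in a
// given mode, based on which annotations it already has.
class DependencyChecker {
    // Assumed table: annotation mode -> annotation types it requires as input
    private final Map<String, Set<String>> requirements;

    DependencyChecker(Map<String, Set<String>> requirements) {
        this.requirements = requirements;
    }

    // Returns the required annotation types that the document lacks
    // (empty if the mode has no known requirements or all are present).
    Set<String> missingDependencies(String mode, Set<String> presentAnnotations) {
        Set<String> missing =
            new TreeSet<>(requirements.getOrDefault(mode, Collections.emptySet()));
        missing.removeAll(presentAnnotations);
        return missing;
    }

    // A document is skipped when this returns false.
    boolean canAnnotate(String mode, Set<String> presentAnnotations) {
        return missingDependencies(mode, presentAnnotations).isEmpty();
    }
}
```

Returning the set of missing dependencies, rather than only a boolean, would let the local-mode Curator name the missing annotations when it logs a skipped document.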

TODO: Scripts and Miscellaneous
- Script to launch local-mode Curators and associated annotation tools on all Hadoop nodes. (?)