Project Roadmap
Ready for testing in Hadoop: Hadoop Parallelization Interface
- TODO: Fix issues running on the Cognitive Computation Group's cluster (the Altocumulus Cloud at the Illinois Cloud Testbed), as described on the page Issues Running on the Altocumulus Cloud's Hadoop Cluster
- Sets up the annotation jobs in HadoopInterface.java, using the modified CuratorMapper and CuratorReducer (a sketch of this setup follows the list).
- Accepts command-line arguments flexibly: only the location of the document collection in distributed storage and the annotation mode are required, but much finer control is available when desired.
- Distributes inputs across the cluster at the file level.
- Uses the locally running Curator on each MapReduce node to interface with the annotation tool.
- Stores data as a HadoopRecord (a child class of the standard Curator Record), which holds each document and any of its annotations. To use this as input to a MapReduce job, we need a number of infrastructure classes, including DirectoryInputFormat, DirectorySplit, and CuratorRecordReader (all described in the Infrastructure UML Diagram).
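A minimal sketch of what this job setup might look like, assuming DirectoryInputFormat extends Hadoop's FileInputFormat and that the annotation mode travels through the job Configuration. The config key and output path are illustrative, not the project's actual choices, and the project classes (DirectoryInputFormat, CuratorMapper, CuratorReducer, HadoopRecord) are assumed to be on the classpath in the same package:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopInterfaceSketch {
    public static void main(String[] args) throws Exception {
        // Only two arguments are strictly required: the document
        // collection's location in HDFS and the annotation mode to run.
        Path inputDir = new Path(args[0]);
        String annotationMode = args[1];

        Configuration conf = new Configuration();
        conf.set("curator.annotation.mode", annotationMode); // hypothetical key

        Job job = new Job(conf, "Curator annotation: " + annotationMode);
        job.setJarByClass(HadoopInterfaceSketch.class);

        // DirectoryInputFormat hands out one split per document, so
        // inputs are distributed across the cluster at the file level.
        job.setInputFormatClass(DirectoryInputFormat.class);
        FileInputFormat.addInputPath(job, inputDir);

        job.setMapperClass(CuratorMapper.class);
        job.setReducerClass(CuratorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(HadoopRecord.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0] + "_out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```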
Finished: Job Handler
- The front-end with which the user interacts.
- Takes either raw text documents or serialized records (i.e., the output of a previous Hadoop job) as input and pre-processes them for the Hadoop cluster.
- Copies input to Hadoop and launches MapReduce jobs using user-modifiable shell scripts.
- Copies output from the Hadoop cluster after all jobs finish, then updates the client-side Curator's database with those newly annotated records.
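The actual copying is done by the user-modifiable shell scripts mentioned above, but the programmatic equivalent via Hadoop's FileSystem API looks roughly like this (all paths here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobHandlerCopySketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());

        // Stage the pre-processed input (raw text or serialized records)
        // in HDFS before any MapReduce job is launched.
        hdfs.copyFromLocalFile(new Path("/tmp/curator/input"),       // hypothetical
                               new Path("/user/curator/job_input")); // hypothetical

        // ... launch the MapReduce job(s) here and wait for them to finish ...

        // Pull the annotated records back so the client-side Curator's
        // database can be updated with them.
        hdfs.copyToLocalFile(new Path("/user/curator/job_output"),
                             new Path("/tmp/curator/output"));
    }
}
```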
Finished: Curator Modifications
- Create "Master" (i.e., client-side, outside of Hadoop) Curator Client with the following responsibilities:
- Serializes Record objects (turns the user's raw text into serialized Record objects usable on the Hadoop cluster)
- Deserializes Record objects (reads in serialized Record objects output by Hadoop, storing them in the locally running Curator's database; see the serialization sketch after this list)
- Create Hadoop Curator Client with the following responsibilities:
- Interfaces with a Curator running a single annotation tool, as requested by the Job Handler.
- Assumes all dependencies for all documents are present in HDFS, and skips documents that do not meet the requirements.
- Logs errors from the annotation tools in a user-actionable way.
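Assuming Record is the Curator's Thrift-generated type (the import's package is an assumption), the Master client's serialize/deserialize responsibilities can be sketched with Thrift's standard TSerializer and TDeserializer:

```java
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import edu.illinois.cs.cogcomp.thrift.curator.Record; // assumed package

public class RecordSerializationSketch {
    // Record -> bytes, suitable for shipment to the Hadoop cluster.
    public static byte[] serialize(Record record) throws TException {
        return new TSerializer(new TBinaryProtocol.Factory()).serialize(record);
    }

    // bytes -> Record, ready to be stored in the locally running
    // Curator's database after a Hadoop job completes.
    public static Record deserialize(byte[] bytes) throws TException {
        Record record = new Record();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, bytes);
        return record;
    }
}
```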
- Create "Master" (i.e., client-side, outside of Hadoop) Curator Client with the following responsibilities:
-
Finished: Scripts and Miscellaneous
- Script to set up the document collection in the Hadoop Distributed File System (HDFS).
- Script to launch Hadoop MapReduce jobs.
- Script to copy the output of a MapReduce job back to the user's machine.
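A sketch of how the Job Handler might drive these user-modifiable scripts from Java; the script directory, names, and arguments below are purely illustrative:

```java
import java.io.File;

public class ScriptRunnerSketch {
    // Run one of the user-modifiable shell scripts, streaming its output
    // to the console and failing loudly on a non-zero exit status.
    static void run(String... command) throws Exception {
        Process p = new ProcessBuilder(command)
                .directory(new File("scripts")) // hypothetical script directory
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("Script failed: " + command[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        // All script names and arguments below are illustrative only.
        run("./copy_input_to_hdfs.sh", "/tmp/curator/input");
        run("./launch_mapreduce_job.sh", "POS");
        run("./copy_output_from_hdfs.sh", "/tmp/curator/output");
    }
}
```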
TODO: Comprehensive Package Tests
- Preferably an automated tool that exercises all components of the software.
- Primary test: annotate a document in the standard, non-Hadoop Curator. Send the same document for annotation in Hadoop, then compare the resulting records from the two.
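Such a comparison could be sketched as a JUnit test like the one below. The two annotate helpers are hypothetical stand-ins for the standard Curator and the Hadoop pipeline, and the comparison assumes Record is a Thrift-generated class, which defines a structural equals():

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;
import edu.illinois.cs.cogcomp.thrift.curator.Record; // assumed package

public class AnnotationParityTest {
    @Test
    public void hadoopOutputMatchesStandardCurator() throws Exception {
        String document = "The quick brown fox jumps over the lazy dog.";

        Record expected = annotateWithStandardCurator(document, "POS");
        Record actual = annotateViaHadoop(document, "POS");

        // Thrift-generated classes define structural equals(), so the
        // two records are compared field by field.
        assertEquals(expected, actual);
    }

    // Stubs for illustration; real versions would call the two pipelines.
    private Record annotateWithStandardCurator(String doc, String mode) {
        throw new UnsupportedOperationException("call the standard Curator");
    }

    private Record annotateViaHadoop(String doc, String mode) {
        throw new UnsupportedOperationException("run the Hadoop pipeline");
    }
}
```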