This repository was archived by the owner on Feb 1, 2022. It is now read-only.

Final Project Report

s3cur3 edited this page Aug 8, 2012 · 10 revisions

At the beginning of the summer, our first decision was to choose a parallelization framework: Apache Hadoop or Amazon EC2. We chose Hadoop for its automatic load-balancing capabilities and existing software framework.

Having decided on a Hadoop implementation of the Curator parallelization interface, we began writing the CuratorMapper and CuratorReducer classes. The map() method turned out to be trivial: we simply passed each document through to the Hadoop interface unchanged and performed the annotation in the reduce() portion of each job. The Reducer's results were then written to disk in the Hadoop Distributed File System (HDFS).
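The division of labor described above can be sketched in plain Java. This is a simplified, Hadoop-free illustration: the real CuratorMapper and CuratorReducer would extend Hadoop's Mapper and Reducer classes, and the annotate() stand-in for a node-local annotation tool is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified, Hadoop-free sketch of the map/reduce split described above.
class CuratorFlowSketch {
    // map(): trivial -- emit the document keyed by its ID, unchanged.
    static Map.Entry<String, String> map(String docId, String docText) {
        return new SimpleEntry<>(docId, docText);
    }

    // reduce(): do the real work -- run the annotation tool on each document.
    // (In the actual job, these results are then written to HDFS.)
    static List<String> reduce(String docId, List<String> docs) {
        List<String> annotated = new ArrayList<>();
        for (String doc : docs) {
            annotated.add(annotate(doc));
        }
        return annotated;
    }

    // Hypothetical stand-in for invoking a node-local Curator annotation tool.
    static String annotate(String doc) {
        return "[annotated] " + doc;
    }
}
```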

In order to pass in all of the necessary information about the document to be annotated, we created a HadoopRecord class inheriting from the Curator's Record class. As required by Hadoop, we also created several infrastructure classes. Moreover, we finalized the architecture of a Master Curator on the user's side of the Hadoop interface and individual node-local Curators on the other side of the interface.
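The relationship between the two classes might look roughly like the following sketch. The real Record is a Thrift-generated class with a different shape; all field and method names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- the actual Curator Record is Thrift-generated,
// and these fields and helpers are hypothetical.
class Record {
    protected String identifier;   // document ID
    protected String rawText;      // original document text
    protected Map<String, String> labelViews = new HashMap<>(); // annotations by tool
}

// HadoopRecord adds what the Hadoop side needs on top of the Curator Record,
// e.g. bookkeeping for where the serialized record lives in HDFS.
class HadoopRecord extends Record {
    private final String hdfsPath;

    HadoopRecord(String identifier, String rawText, String hdfsPath) {
        this.identifier = identifier;
        this.rawText = rawText;
        this.hdfsPath = hdfsPath;
    }

    String getHdfsPath() { return hdfsPath; }

    void addAnnotation(String tool, String view) {
        labelViews.put(tool, view);
    }

    String getAnnotation(String tool) {
        return labelViews.get(tool);
    }
}
```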

At this point, we discovered that Record serialization had to be handled through Apache Thrift (rather than our own serialize() and deserialize() methods) in order to properly interact with the Curator. So we overhauled the Hadoop interface to use the Thrift API. We also made some necessary modifications to the Curator itself with new child classes of the CuratorClient.
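To illustrate what Thrift-based serialization means in practice, a hypothetical IDL fragment in the spirit of the Curator's definitions might look like the following (the actual curator.thrift definitions differ). Thrift generates the serialization code from such a definition, which is why our hand-written serialize() and deserialize() methods had to be replaced.

```thrift
// Hypothetical sketch -- not the actual Curator IDL.
struct Record {
  1: required string identifier,              // document ID
  2: required string rawText,                 // original document text
  3: optional map<string, string> labelViews  // annotations keyed by tool name
}
```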

Under this new architecture, the Curator's automatic dependency handling no longer applied to our Hadoop jobs. We replicated the important functionality using a JobHandler class as a wrapper for the Hadoop interface and shell scripts that chained multiple MapReduce jobs together (one job for each annotation tool). The job handler also took care of I/O operations between the local filesystem and the HDFS.
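The chaining performed by the job handler can be sketched as follows. This is a simplified stand-in: the real jobs are Hadoop MapReduce invocations launched by shell scripts, whereas here each "job" is just a function from input to output, with each stage reading its predecessor's result.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified sketch of the JobHandler idea: run one "job" per annotation
// tool, feeding each job's output into the next. In the real handler the
// jobs are Hadoop MapReduce invocations and the data moves through HDFS.
class JobHandlerSketch {
    static String runPipeline(String input, List<UnaryOperator<String>> jobs) {
        String data = input;
        for (UnaryOperator<String> job : jobs) {
            data = job.apply(data);  // each job consumes the previous output
        }
        return data;                 // final result copied back out of HDFS
    }
}
```

For example, chaining stand-ins for the tokenizer, part-of-speech tagger, and chunker runs them strictly in sequence, mirroring one MapReduce job per annotation tool.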

After extensive testing in a pseudo-distributed setup, we attempted to test our interface on a truly distributed Hadoop installation on the Altocumulus cloud cluster. However, unforeseen usage restrictions forced us to reconfigure our planned architecture, since we were unable to install local Curators on each node of the cluster. This issue remains largely unresolved (though we can successfully run a few annotators on a limited number of Hadoop nodes) but has been fully documented. Despite these issues, we were able to annotate about 1,000 documents with the tokenizer, part-of-speech tagger, and chunker on up to about 25 Hadoop nodes. The performance measurements we took in doing so are documented in a spreadsheet.

Finally, we completed a testing suite and full documentation of the project on this GitHub wiki. The Readme lays out the specifics of installing and replicating our setup, and summary reports are available in text (see the pages under the "Documentation" heading) or poster format.

In future work, we hope to resolve the remaining issues with the Altocumulus cloud cluster, and then to implement a web annotation service for the Curator as previously proposed.
