This repository was archived by the owner on Feb 1, 2022. It is now read-only.

Infrastructure UML Diagram

s3cur3 edited this page Aug 5, 2012 · 2 revisions

Discussion of the Classes

The following classes are involved in setting up a MapReduce job in our system.

The HadoopInterface is the main class in our compiled JAR. It is what gets launched when you submit the job to Hadoop with a command like this:

 bin/hadoop jar CuratorHadoopInterface.jar -i input_dir -m TOKEN -out output_dir -reduces 10 -curator /path/to/curator-0.6.9 -shared

The CuratorJob is an extension of the Hadoop Job class, which is used to specify things like the input and output directories for a MapReduce job, the Mapper and Reducer classes to be used, and so on. However, unlike a typical Job, the CuratorJob "knows" how it needs to be set up (it handles its own configuration). It creates an ArgumentParser to deal with the command line arguments; those arguments contain all the information required for the CuratorJob to set itself up completely.
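
As a rough illustration of how a job could configure itself from the command line, the sketch below parses flags like those in the example command into a map. The class name `ArgSketch` and its `parse` method are hypothetical and are not the project's actual ArgumentParser. Treating a flag with no following value (such as `-shared`) as a boolean switch is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the project's ArgumentParser.
class ArgSketch {
    // Maps each "-flag" to the token that follows it. A flag followed by
    // another flag (or by nothing) is assumed to be a switch and maps to "true".
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Unexpected token: " + args[i]);
            }
            String flag = args[i].substring(1);
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            opts.put(flag, hasValue ? args[++i] : "true");
        }
        return opts;
    }
}
```

With the arguments from the command above, `parse` would map `reduces` to `"10"` and `shared` to `"true"`; a job class could then read its input directory, output directory, and reducer count from that map.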

The DirectoryInputFormat, which extends the generic InputFormat class, is used by the Hadoop back-end when setting up the inputs. It looks at the input directory and creates a DirectorySplit for each document (a serialized Record) that it finds there.
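
The one-split-per-document idea can be illustrated without Hadoop. The sketch below is plain Java; `SplitSketch` and `listSplits` are hypothetical names. It assumes each serialized Record is a regular file directly inside the input directory, which may not match how the project lays out its documents. The real DirectoryInputFormat would return Hadoop split objects rather than paths.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the split-creation step; not Hadoop API.
class SplitSketch {
    // Returns one entry per regular file found directly in inputDir,
    // sorted so the result is deterministic. Subdirectories are skipped.
    static List<Path> listSplits(Path inputDir) throws IOException {
        List<Path> splits = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    splits.add(entry);
                }
            }
        }
        Collections.sort(splits);
        return splits;
    }
}
```

Because each document becomes its own split, each document can be handed to a separate map task.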

The CuratorRecordReader, like the DirectoryInputFormat, is used by the Hadoop back-end when setting up the inputs. It supplies the key-value pairs that Hadoop passes to map(); the output of map() is what then reaches reduce().
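
To make the reader's role concrete, here is a plain-Java analogue that yields exactly one key-value pair per document. The method names mirror Hadoop's `RecordReader` (`nextKeyValue`, `getCurrentKey`, `getCurrentValue`), but the class `OneDocReaderSketch` is hypothetical. Using the file name as the key and the file's text as the value is an assumption, not the project's documented choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical plain-Java analogue of a record reader; not Hadoop API.
class OneDocReaderSketch {
    private final Path document;
    private boolean consumed = false;
    private String key;
    private String value;

    OneDocReaderSketch(Path document) {
        this.document = document;
    }

    // Advances to the next pair; returns false once the split is exhausted.
    boolean nextKeyValue() throws IOException {
        if (consumed) {
            return false;
        }
        key = document.getFileName().toString();
        value = new String(Files.readAllBytes(document), StandardCharsets.UTF_8);
        consumed = true;
        return true;
    }

    String getCurrentKey() { return key; }

    String getCurrentValue() { return value; }
}
```

A caller loops on `nextKeyValue()` and reads the current key and value after each successful call, which is the same pull-style contract the Hadoop framework uses when feeding a mapper.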

The HadoopRecord, which extends the Cognitive Computation Group's Thrift-generated Record class, represents a document and all the annotations we have for it. Thus, it contains the "raw text" of the document (the plain text that the user started with), as well as a number of "views" (such as the parse, part-of-speech, or named entity recognition annotations).
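
The "raw text plus views" shape can be pictured with a small model class. `RecordSketch` and all of its members are hypothetical; the real Record is Thrift-generated and its actual fields are not shown on this page. Annotations are represented here as plain strings purely for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "raw text plus named views"; not the Thrift-generated Record.
class RecordSketch {
    private final String rawText;
    private final Map<String, String> views = new LinkedHashMap<>();

    RecordSketch(String rawText) {
        this.rawText = rawText;
    }

    String getRawText() { return rawText; }

    // Stores an annotation (a plain string in this sketch) under a view name.
    void addView(String name, String annotation) {
        views.put(name, annotation);
    }

    boolean hasView(String name) { return views.containsKey(name); }

    // Read-only view of all annotations added so far.
    Map<String, String> getViews() { return Collections.unmodifiableMap(views); }
}
```

Checking `hasView` before running an annotator is one way a job could skip documents that already carry the requested annotation.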
