Infrastructure UML Diagram

Discussion of the Classes

The following classes are involved in setting up a MapReduce job in our system.

The HadoopInterface is the main class in our compiled JAR. This is what is launched when you send the job to Hadoop, via a command like this:

 bin/hadoop jar CuratorHadoopInterface.jar -i input_dir -m TOKEN -out output_dir -reduces 10 -curator /path/to/curator-0.6.9 -shared

The CuratorJob is an extension of the Hadoop Job class, which is used to specify things like the input and output directories for a MapReduce job, the Mapper and Reducer classes to be used, and so on. However, unlike a typical Job, the CuratorJob "knows" how it needs to be set up (it handles its own configuration). It creates an ArgumentParser to deal with the command line arguments; those arguments contain all the information required for the CuratorJob to set itself up completely.
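To make the argument handling concrete, here is a minimal plain-Java sketch of how flags like the ones in the command above could be turned into a lookup table. The class name ArgumentParserSketch and its parsing rule (a flag followed by a non-flag token takes that token as its value, otherwise it is treated as a boolean switch) are assumptions made for illustration; the project's actual ArgumentParser may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the project's actual ArgumentParser. */
class ArgumentParserSketch {

    /**
     * Maps each "-flag" to the token that follows it. A flag with no
     * following value (such as "-shared") is stored with the value "true".
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String token = args[i];
            if (!token.startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag but found: " + token);
            }
            String name = token.substring(1); // drop the leading dash
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            // Consume the next token as the value, or record a boolean switch.
            options.put(name, hasValue ? args[++i] : "true");
        }
        return options;
    }
}
```

A job that configures itself could then read entries such as "i", "out", and "reduces" from this map and pass them on to the usual Hadoop setup calls (for example, Job.setNumReduceTasks for the reducer count).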

The DirectoryInputFormat, which extends the generic InputFormat class, is used by the Hadoop back-end when setting up the inputs. It looks at the input directory and creates a DirectorySplit for each document (a serialized Record) that it finds there.
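The one-split-per-document idea can be sketched without Hadoop on the classpath. The version below is an illustration only: it reads from the local file system through java.nio rather than HDFS, it assumes that every entry directly under the input directory is one document, and the class names DirectorySplitSketch and DirectoryInputFormatSketch are made up for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a split that covers exactly one document. */
class DirectorySplitSketch {
    final Path document;

    DirectorySplitSketch(Path document) {
        this.document = document;
    }
}

/** Hypothetical stand-in for the split-creation step of an InputFormat. */
class DirectoryInputFormatSketch {

    /** Creates one split for each entry found directly under inputDir. */
    static List<DirectorySplitSketch> getSplits(Path inputDir) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        // Directory listing order is unspecified, so sort for a stable result.
        Collections.sort(entries);

        List<DirectorySplitSketch> splits = new ArrayList<>();
        for (Path entry : entries) {
            splits.add(new DirectorySplitSketch(entry));
        }
        return splits;
    }
}
```

Because each split covers a whole document, no document is ever divided between two map tasks.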

The CuratorRecordReader, like the DirectoryInputFormat, is used by the Hadoop back-end when setting up the inputs. It gives the Hadoop back-end the key-value pairs that will be passed to the map() phase, whose output in turn feeds the reduce() phase.
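The reader's contract can be illustrated with a small plain-Java stand-in. The method names nextKeyValue(), getCurrentKey(), and getCurrentValue() mirror those of Hadoop's RecordReader class, but everything else here is an assumption: the class name RecordReaderSketch is hypothetical, and the choice of a document identifier as the key and the serialized document as the value is a guess, since the actual key and value types are not described above.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical plain-Java stand-in for a Hadoop RecordReader. */
class RecordReaderSketch {
    // Each entry is assumed to be {documentId, serializedRecord}.
    private final Iterator<String[]> documents;
    private String currentKey;
    private String currentValue;

    RecordReaderSketch(List<String[]> documents) {
        this.documents = documents.iterator();
    }

    /** Advances to the next pair; returns false once the input is exhausted. */
    boolean nextKeyValue() {
        if (!documents.hasNext()) {
            currentKey = null;
            currentValue = null;
            return false;
        }
        String[] document = documents.next();
        currentKey = document[0];
        currentValue = document[1];
        return true;
    }

    String getCurrentKey() {
        return currentKey;
    }

    String getCurrentValue() {
        return currentValue;
    }
}
```

The framework drives a reader in a loop: it calls nextKeyValue() until that returns false, and hands each current key and value to one map() call.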

The HadoopRecord, which extends the Cognitive Computation Group's Thrift-generated Record class, represents a document and all the annotations we have for it. Thus, it contains the "raw text" of the document (the plain text that the user started with), as well as a number of "views" (like the parsing, part of speech, or named entity recognition annotations).
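As a rough picture of that structure, the sketch below models a record as raw text plus a set of named views. It is a simplification under stated assumptions: the class name HadoopRecordSketch and its methods are hypothetical, views are stored as plain strings, and the real Thrift-generated Record class has its own, richer field layout.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a document and its annotations. */
class HadoopRecordSketch {
    private final String rawText;
    // View name (e.g. "pos") mapped to its annotation, kept as a string here.
    private final Map<String, String> views = new LinkedHashMap<>();

    HadoopRecordSketch(String rawText) {
        this.rawText = rawText;
    }

    String getRawText() {
        return rawText;
    }

    /** Adds an annotation, replacing any existing view with the same name. */
    void addView(String name, String annotation) {
        views.put(name, annotation);
    }

    boolean hasView(String name) {
        return views.containsKey(name);
    }

    /** Returns a read-only set of the names of the views present so far. */
    Set<String> viewNames() {
        return Collections.unmodifiableSet(views.keySet());
    }
}
```

A check like hasView lets a caller see which annotations a document already carries before deciding what still needs to be run on it.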
