Project Overview
Here's how the project looks from 30,000 feet:
The JobHandler (which handles all the shell scripts in the front end) analyzes the text files that the user passed in as input and determines the dependencies that need to be satisfied in order to get the annotation requested by the user. Then, for each annotation that is required:
- The JobHandler calls the shell script to copy input to Hadoop. This results in:
  - the "master" (i.e., locally running) CuratorClient creating serialized Records from the user's input (these are the input files after preliminary processing), and
  - the same shell script (`copy_input_to_hadoop.sh`) sending those serialized Records to Hadoop. (A sketch of this HDFS round trip appears after the list.)
- The JobHandler then calls `launch_hadoop_job.sh` and has the Hadoop Job Handler start running our HadoopInterface program on each of the nodes in the cluster.
- After a bit of Hadoop back-end wizardry, each node in the cluster reaches the Reduce phase. There, it launches a Curator and the required annotator on that node, and launches the HadoopCuratorClient to interface with them. (A sketch of this step also follows the list.)
- The input Records are annotated using the required annotator.
- The newly annotated Records are stored in the Hadoop Distributed File System (HDFS) as serialized Records.
- After all Reduce phases finish, the JobHandler runs the `copy_output_from_hadoop.sh` script and copies the output back to the local disk. Once those serialized Records finish copying, it de-serializes them and has the local ("master") Curator store the updates in its database cache.
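To make the orchestration loop concrete, here is a minimal sketch of how a driver like the JobHandler could chain the three shell scripts for each required annotation. The class name, script arguments, and the sample annotation list are all hypothetical illustrations, not the project's actual JobHandler code.

```java
import java.io.IOException;

/**
 * Hypothetical sketch of the JobHandler's orchestration loop:
 * for each required annotation, copy input in, launch the job,
 * and copy the results back out. Script arguments are assumptions.
 */
public class JobHandlerSketch {

    /** Runs a shell script and fails fast on a non-zero exit code. */
    private static void runScript(String... command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()   // stream the script's output to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command[0] + " exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        String inputDir = args[0];                       // user's raw text files
        String[] requiredAnnotations = {"TOKEN", "POS"}; // assumed dependency order

        for (String annotation : requiredAnnotations) {
            // 1. Serialize the input Records locally and push them to HDFS.
            runScript("./copy_input_to_hadoop.sh", inputDir);

            // 2. Ask the Hadoop Job Handler to run HadoopInterface on the cluster.
            runScript("./launch_hadoop_job.sh", annotation);

            // 3. Pull the annotated, serialized Records back to local disk.
            runScript("./copy_output_from_hadoop.sh", inputDir);
        }
    }
}
```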
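Under the hood, the two copy steps amount to moving serialized Record files between local disk and HDFS. The sketch below uses the standard Hadoop FileSystem API; the directory paths are made up, and the real `copy_input_to_hadoop.sh` and `copy_output_from_hadoop.sh` scripts may accomplish the same thing with `hadoop fs` commands instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the HDFS round trip performed by the copy scripts.
 * The directory names are assumptions, not the project's actual layout.
 */
public class RecordTransferSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml, etc.
        FileSystem hdfs = FileSystem.get(conf);

        // copy_input_to_hadoop.sh: push locally serialized Records into HDFS.
        hdfs.copyFromLocalFile(new Path("serialized_input/"),
                               new Path("/user/curator/job_input/"));

        // ... the Hadoop job runs and writes annotated Records to HDFS ...

        // copy_output_from_hadoop.sh: pull the annotated Records back out.
        hdfs.copyToLocalFile(new Path("/user/curator/job_output/"),
                             new Path("serialized_output/"));
    }
}
```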
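The Reduce phase can be pictured as a reducer that takes a document's serialized Record, hands it to the node-local annotator through a Curator client, and emits the annotated Record so Hadoop writes it back to HDFS. Everything in this sketch — the `AnnotationClient` interface, the use of `Text` as the serialized form, and the method names — is a hypothetical stand-in for the real HadoopCuratorClient and HadoopInterface code; it only illustrates the shape of the step.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Hypothetical reducer illustrating the Reduce phase described above:
 * each serialized Record is annotated via a client that talks to the
 * Curator/annotator pair running on this node, then written back out.
 * Types and client calls are assumptions, not the project's real API.
 */
public class AnnotationReducerSketch
        extends Reducer<Text, Text, Text, Text> {

    /** Stand-in for the HadoopCuratorClient; the real interface may differ. */
    public interface AnnotationClient {
        String annotate(String serializedRecord) throws IOException;
    }

    private AnnotationClient client;

    @Override
    protected void setup(Context context) {
        // In the real system, this is where the node-local Curator and
        // annotator would be launched and a client connected to them.
        client = serializedRecord -> serializedRecord + "\t<annotated>";
    }

    @Override
    protected void reduce(Text docId, Iterable<Text> serializedRecords,
                          Context context)
            throws IOException, InterruptedException {
        for (Text record : serializedRecords) {
            // Annotate the Record and emit it; Hadoop persists the output to HDFS.
            String annotated = client.annotate(record.toString());
            context.write(docId, new Text(annotated));
        }
    }
}
```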