Issues Running on the Altocumulus Cloud's Hadoop Cluster
This page documents the issues we encountered in attempting to run the Hadoop interface on the Altocumulus Cloud, sponsored by the Illinois Cloud Testbed.
We designed the system with the assumption that we would be able to install the (pre-compiled) Curator along with the 11 annotation tools (totaling between 15 and 20 GB) locally on each Hadoop node. Then, when we ran a job, we would simply launch the Curator from, e.g., ~/curator/dist/ as necessary during a Reduce phase.
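For concreteness, the sketch below shows roughly what this design looks like in a Hadoop Reducer: the locally installed Curator is started once per task (in setup()) and then reused for every document. The startServers.sh script name and the Text key/value types are placeholders for illustration, not the actual job definition.

```java
// Sketch of the original design: each node has its own local Curator copy,
// and the Reducer launches it before processing any documents.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CuratorReducer extends Reducer<Text, Text, Text, Text> {

    private Process curator;

    @Override
    protected void setup(Context context) throws IOException {
        // Launch the locally installed Curator from its dist directory.
        File curatorDist = new File(System.getProperty("user.home"), "curator/dist");
        curator = new ProcessBuilder("sh", "startServers.sh")  // hypothetical launch script
                .directory(curatorDist)
                .redirectErrorStream(true)
                .start();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Annotate each document by talking to the already-running local Curator.
        for (Text document : values) {
            // ... send the document to the Curator and collect its annotations ...
            context.write(key, document);
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (curator != null) {
            curator.destroy();  // shut the local Curator down when the task ends
        }
    }
}
```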
In contrast, the Altocumulus Cloud does not allow us to load data directly on individual nodes. Instead, we are forced to have every Hadoop node access a single, shared network disk. Each Hadoop machine can access our project directory at /project/cogcomp/.
To compensate for this drastic change, we attempted the following two solutions:
- Have all nodes access a single network installation of Curator (located at, e.g., /project/cogcomp/curator-0.6.9/)
  - This has the obvious disadvantage of significant network traffic each time we launch the Curator.
  - For an unknown reason, this resulted in strange behavior when attempting to launch the Curator on many nodes. On most machines, the launch command produced no output and the Curator never started, no matter how long we waited.
    - We did not experience the same lack of output when running a single shared "Hello World" program (which printed "Hello World" once per second) from each node. Thus, the problem does not appear to be caused by file locking on the server. (Perhaps it is due to the Curator locking some components of itself?)
- Create many copies of Curator (located at, e.g., /project/cogcomp/curator-0.6.9_0/, /project/cogcomp/curator-0.6.9_1/, ..., /project/cogcomp/curator-0.6.9_31/), then launch Curator with the "-shared" flag
  - Our attempts at "locking" an installation of Curator once it has been launched on a node have been unsuccessful.
    - Java warns that using the File class for file locking is unreliable. However, because we need instances of Curator to stay running on a node between individual document annotations (i.e., between reduce() calls), and each node needs to be able to figure out which Curator directory, if any, it has "claimed" in previous reduce() calls, we require more information to be encoded than a typical Java lock stores. In particular, we need the lock to be associated with a machine (we presently use the machine's MAC address as an identifier); a rough sketch of this claiming scheme follows this list.
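The following is a rough sketch of the machine-keyed claiming scheme described in the last bullet. It assumes a node.lock file inside each copy's directory; the file name and copy count are illustrative, not the actual implementation. As noted above, plain File-based locking of this kind is not something the Java API guarantees to be reliable, particularly on a shared network disk.

```java
// Sketch: a node claims a Curator copy by writing its MAC address into a lock
// file inside that copy's directory; later reduce() calls re-discover the claim
// by comparing the file's contents with the local MAC address.
import java.io.File;
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class CuratorClaim {

    private static final int NUM_COPIES = 32;
    private static final String BASE = "/project/cogcomp/curator-0.6.9_";

    /** Hex-encoded MAC address of this machine, used as its identifier.
     *  (Error handling for loopback-only hosts is omitted in this sketch.) */
    static String localMac() throws IOException {
        NetworkInterface nic = NetworkInterface.getByInetAddress(InetAddress.getLocalHost());
        StringBuilder sb = new StringBuilder();
        for (byte b : nic.getHardwareAddress()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Returns the Curator directory this machine has claimed, claiming one if necessary. */
    static File claimCuratorDir() throws IOException {
        String mac = localMac();
        for (int i = 0; i < NUM_COPIES; i++) {
            File dir = new File(BASE + i);
            File lock = new File(dir, "node.lock");  // hypothetical lock-file name
            if (lock.exists()) {
                // Re-use a copy we claimed in an earlier reduce() call.
                String owner = new String(Files.readAllBytes(lock.toPath()),
                                          StandardCharsets.UTF_8).trim();
                if (owner.equals(mac)) {
                    return dir;
                }
            } else if (lock.createNewFile()) {
                // The File API does not guarantee this check-and-create is reliable
                // for locking, especially over a network file system (the caveat above).
                Files.write(lock.toPath(), mac.getBytes(StandardCharsets.UTF_8));
                return dir;
            }
        }
        throw new IOException("No unclaimed Curator copy available for machine " + mac);
    }
}
```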
Going forward, we see two possible solutions:

- The most straightforward is to convince the operators of the Altocumulus Cloud to implement a method of mirroring a directory (in this case, a single Curator installation) to each node in the cluster. This does not seem to be excessively complex.
  - Advantage: we could use our original, known-reliable design and simply launch Curator locally on each node.
- Alternatively, more research into why the single, network-shared installation of the Curator failed to launch may allow us to work within the existing constraints.
  - Advantage: requires no work on the part of the cluster's operators.
Despite the issues in sharing a single directory between many machines, we were able to run a large document collection through a small number of tools. To do so, we had to disable speculative execution (allowing us to specify exactly how many nodes will attempt to launch a Curator) and set a relatively low number of Reduce operations (a maximum of about 25).
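The two settings involved look roughly like the following in the (old-style) org.apache.hadoop.mapred API; the job name and the rest of the configuration are placeholders, and which API version is used is an assumption.

```java
import org.apache.hadoop.mapred.JobConf;

public class JobSettings {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setJobName("curator-annotation");  // placeholder name

        // Disable speculative execution so exactly the intended number of nodes
        // try to launch a Curator (no duplicate attempts on "slow" tasks).
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);

        // Keep the number of Reduce tasks low (at most about 25 in our runs),
        // since each reducer needs its own Curator instance.
        conf.setNumReduceTasks(25);

        return conf;
    }
}
```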
Using this setup, we ran the tokenizer, part-of-speech tagger, and chunker over about 1,000 documents (each on the order of 1 KB, although the speedup should extend linearly to larger document sizes). Because each of these tools has very small memory requirements, we ran all three as a single job, thus minimizing the overhead associated with using Hadoop.
The performance observed in this setup is documented in a spreadsheet.