
Issues Running on the Altocumulus Cloud's Hadoop Cluster


This page documents the issues we encountered in attempting to run the Hadoop interface on the Altocumulus Cloud, sponsored by the Illinois Cloud Testbed.

Intended architecture and modifications forced by the Altocumulus Cloud's policies

We designed the system with the assumption that we would be able to install the (pre-compiled) Curator, along with the 11 annotation tools (totaling between 15 and 20 GB), locally on each Hadoop node. Then, when we ran a job, we would simply launch the Curator from, e.g., ~/curator/dist/ as needed during the Reduce phase.
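For concreteness, a minimal sketch of that per-node design follows, using the Hadoop Reducer API. The launch script name (startServers.sh) and the class are illustrative assumptions, not our actual code:

```java
// Sketch of the intended per-node design: each Reduce task launches the
// node-local Curator once in setup() and reuses it across reduce() calls.
// The script name and paths are illustrative.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CuratorReducer extends Reducer<Text, Text, Text, Text> {
    private Process curator;

    @Override
    protected void setup(Context context) throws IOException {
        // Launch the Curator from the (pre-installed) local copy.
        ProcessBuilder pb = new ProcessBuilder("sh", "startServers.sh");
        pb.directory(new File(System.getProperty("user.home"), "curator/dist"));
        curator = pb.start();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) {
        // ... send each input document to the local Curator for annotation ...
    }

    @Override
    protected void cleanup(Context context) {
        if (curator != null) {
            curator.destroy(); // shut the Curator down when the task ends
        }
    }
}
```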

In contrast, the Altocumulus Cloud does not allow us to load data directly onto individual nodes. Instead, we are forced to have every Hadoop node access a single, shared network disk; each Hadoop machine can access our project directory at /project/cogcomp/.

Failed attempts to address the modifications

To compensate for this (rather drastic) change, we attempted the following two solutions:

  • Have all nodes access a single network installation of Curator (located at, e.g., /project/cogcomp/curator-0.6.9/)

    • This has the obvious disadvantage of significant network traffic each time we launch the Curator.

    • For reasons we could not determine, launching the Curator this way on many nodes produced strange behavior: on most machines, the launch command gave no (apparent) output, and the Curator never started, no matter how long we waited.

      • We did not experience the same lack of output when running a single shared "Hello World" program (which printed "Hello World" once per second) from each node, so the issue does not appear to be caused by file locking on the server. (Perhaps the Curator locks some components of itself?)
  • Create many copies of the Curator (located at, e.g., /project/cogcomp/curator-0.6.9_0/, /project/cogcomp/curator-0.6.9_1/, ..., /project/cogcomp/curator-0.6.9_31/), then launch the Curator with the "-shared" flag

    • Our attempts at "locking" an installation of the Curator once it has been launched on a node have been unsuccessful.

      • Java's documentation warns that using the File class for file locking is unreliable. Moreover, a typical Java lock does not encode enough information for our purposes: instances of the Curator must stay running on a node between individual document annotations (i.e., between reduce() calls), so each node must be able to determine which Curator directory, if any, it has "claimed" in previous reduce() calls. In particular, the lock must be associated with a machine; we presently use the machine's MAC address as the identifier. A rough sketch of this claiming scheme appears after this list.
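The sketch below illustrates the claiming scheme, assuming a hypothetical claimed_by.txt marker file in each copied directory and Java 7 NIO; it is not our production code:

```java
// Rough sketch of the claiming scheme: each node identifies itself by MAC
// address and records that identity in a marker file inside one of the
// copied Curator directories. File names are illustrative.
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CuratorClaim {
    private static final int NUM_COPIES = 32; // curator-0.6.9_0 .. _31

    /** This machine's MAC address, rendered as a hex string. Assumes
     *  getLocalHost() resolves to a physical network interface. */
    static String localMac() throws IOException {
        NetworkInterface nic =
                NetworkInterface.getByInetAddress(InetAddress.getLocalHost());
        StringBuilder sb = new StringBuilder();
        for (byte b : nic.getHardwareAddress()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Returns the Curator copy this machine claimed in an earlier
     * reduce() call, claims a free copy, or returns null if all are taken.
     */
    static Path claimCuratorDir() throws IOException {
        String mac = localMac();
        for (int i = 0; i < NUM_COPIES; i++) {
            Path dir = Paths.get("/project/cogcomp/curator-0.6.9_" + i);
            Path claim = dir.resolve("claimed_by.txt");
            if (Files.exists(claim)) {
                String owner = new String(Files.readAllBytes(claim),
                                          StandardCharsets.UTF_8).trim();
                if (owner.equals(mac)) {
                    return dir; // we already own this copy
                }
            } else {
                try {
                    // CREATE_NEW fails if the file already exists, but even
                    // this cannot be made fully reliable over a network file
                    // system (the weakness discussed above).
                    Files.write(claim, mac.getBytes(StandardCharsets.UTF_8),
                                StandardOpenOption.CREATE_NEW);
                    return dir;
                } catch (FileAlreadyExistsException race) {
                    // another node claimed this copy first; try the next one
                }
            }
        }
        return null; // every copy is claimed by some other machine
    }
}
```

Using the MAC address (rather than, say, a process ID) is what lets a node re-discover its own claim across separate reduce() calls, since the claim must outlive any single JVM.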

Potential true solutions

  • The most straightforward solution is to convince the operators of the Altocumulus Cloud to implement a method of mirroring a directory (in this case, a single Curator) to each node in the cluster. This does not seem to be excessively complex.
    • Advantage: we could use our original, known-reliable design and simply launch Curator locally on each node.
  • Alternatively, more research into why the single, network-shared installation of the Curator fails to launch successfully may allow us to work within the existing constraints; a first diagnostic step would be to capture each node's launch output, as in the sketch after this list.
    • Advantage: requires no work on the part of the cluster's operators.
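The following sketch shows one such diagnostic, again assuming a hypothetical startServers.sh launch script and an illustrative dist/ layout under the shared installation: redirect each node's Curator output to a per-host log file on the shared disk so we can see how far the silent launches actually get (requires Java 7 for redirectOutput()):

```java
// Diagnostic sketch: launch the shared Curator with stdout/stderr
// redirected to per-host files, so silent failures leave evidence.
// Paths and the script name are illustrative.
import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

public class LaunchWithLogs {
    public static void main(String[] args) throws IOException {
        String host = InetAddress.getLocalHost().getHostName();
        File logDir = new File("/project/cogcomp/launch-logs");
        logDir.mkdirs();

        ProcessBuilder pb = new ProcessBuilder("sh", "startServers.sh");
        pb.directory(new File("/project/cogcomp/curator-0.6.9/dist"));
        pb.redirectOutput(new File(logDir, host + ".out"));
        pb.redirectError(new File(logDir, host + ".err"));
        pb.start(); // any startup errors now land in the .err file
    }
}
```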