
Open Design Questions


s3cur3 edited this page Jun 18, 2012 · 5 revisions

The following questions make reference to the "pseudo-sequence diagram."

In `run()`, what does it mean to set up a document collection on the local storage (i.e., bring into HDFS)?

It might be best to have the Master Curator do this. Since it has access to all the records, it can create a directory in HDFS for each job. If a given document has several annotations, each document can get its own sub-directory, like this:
- hdfs:/user/home/
  - job123/
    - <Unique ID/hash here>/
      - NER.txt
      - SRL.txt
      - chunking.txt
      - . . .
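
The layout above can be sketched as a small path-building helper. This is only an illustration: the function names are hypothetical, and using a SHA-1 digest of the document text as the "Unique ID/hash" is an assumption, since the page does not say how that ID would be computed.

```python
import hashlib
import posixpath

# Root directory taken from the layout sketched above.
HDFS_ROOT = "hdfs:/user/home"


def document_id(text: str) -> str:
    """Hypothetical unique ID: SHA-1 hex digest of the document text (an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def annotation_path(job_id: str, doc_text: str, annotation: str) -> str:
    """Build <root>/<job>/<document ID>/<annotation>.txt for one annotation file."""
    return posixpath.join(HDFS_ROOT, job_id, document_id(doc_text), annotation + ".txt")


if __name__ == "__main__":
    # All annotations of the same document land in the same sub-directory.
    for name in ("NER", "SRL", "chunking"):
        print(annotation_path("job123", "Some raw document text.", name))
```

Deriving the directory name from the document's content would mean the same document always maps to the same sub-directory within a job, though any other stable ID scheme would work equally well.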

In `run()`, what should clean-up be like?

- Simply write the output annotation (e.g., NER.txt) to HDFS, just as the input was written.
- Have the Master Curator copy it out of HDFS, put it in the database, etc.
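
A rough sketch of the collection step on the Master Curator side, under loud assumptions: a local directory stands in for the job directory after it has been copied out of HDFS, the job directory uses the per-document layout proposed earlier on this page, and `collect_annotations` is a hypothetical name. Writing the results to the database is left out, since the page does not describe that interface.

```python
from pathlib import Path


def collect_annotations(job_dir):
    """Gather annotation files from a job directory.

    Assumes one sub-directory per document, each holding files such as
    NER.txt. Returns {document_id: {annotation_name: file_contents}}.
    """
    results = {}
    for doc_dir in sorted(Path(job_dir).iterdir()):
        if not doc_dir.is_dir():
            continue  # ignore stray files at the job level
        results[doc_dir.name] = {
            f.stem: f.read_text(encoding="utf-8")
            for f in sorted(doc_dir.glob("*.txt"))
        }
    return results
```

The returned dictionary could then be handed to whatever code stores records in the database.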

In `map()`, how do we optimize things so that we're bringing the computation to the data?

Low priority. MapReduce may take care of this for us, but in any case, we should wait to see whether there will be a significant benefit from optimizing here.

Do we actually need to be running the Curator on the nodes of the Hadoop cluster?

If yes . . .

- Pre-launch, how does the Master Curator ensure that the Hadoop nodes are running Curator?
- How does the CuratorReducer connect to the local Curator client?
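
One possible pre-launch check, offered only as a sketch: if each node's Curator listens on a known TCP port (an assumption; the page does not say how Curator is exposed), the Master Curator could try to open a connection to every node before submitting the job. The function name, host names, and port below are hypothetical.

```python
import socket


def curator_is_listening(host, port, timeout=2.0):
    """Return True if something accepts TCP connections at host:port.

    This only shows that a process is listening there; it does not prove
    that the process is actually a Curator instance.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical node names and port, for illustration only.
    for node in ("node1.example.org", "node2.example.org"):
        print(node, curator_is_listening(node, 9010))
```

The same kind of connection to `localhost` might be one answer to the second question, but that depends on how the Curator client is deployed on each node.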