This repository was archived by the owner on Feb 1, 2022. It is now read-only.
Open Design Questions
s3cur3 edited this page Jun 27, 2012 · 5 revisions
See basically every item on Curator modifications.
The following questions make reference to the "pseudo-sequence diagram."
In run(), what does it mean to set up a document collection on the local storage (i.e., bring into HDFS)?
- Might be best to have the master curator do this. Since it will have access to all the records, it can create a directory in HDFS for each job. If there are a number of annotations for a given document, there can be sub-directories at the document level, like this:
  - hdfs:/user/home/<retrieve with Configuration.get("inputDirectory")>
    - job123/
      - <Unique hash ID here>/
        - original.txt
        - CHUNK.txt
        - COREF.txt
        - NOM_SRL.txt
        - POS.txt
        - TOKEN.txt
        - VERB_SRL.txt
        - WIKI.txt
        - PARSE.txt
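The per-job, per-document layout described above could be computed along these lines. This is only a sketch under several assumptions: a plain Map stands in for Hadoop's Configuration, the "unique hash ID" is taken to be a SHA-1 of the document text (the page does not say how it is derived), and the class and method names (JobLayout, documentId, documentDir, filePaths) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class JobLayout {
    // Assumption: the unique hash ID is a hex-encoded SHA-1 of the document text.
    static String documentId(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Builds <root>/<inputDirectory>/<jobId>/<hash>/ for one document.
    // The Map is a stand-in for Configuration.get("inputDirectory").
    static String documentDir(Map<String, String> conf, String jobId, String text) {
        String inputDir = conf.get("inputDirectory");
        return "hdfs:/user/home/" + inputDir + "/" + jobId + "/" + documentId(text) + "/";
    }

    // Lists original.txt plus one <VIEW>.txt per existing annotation.
    static List<String> filePaths(String docDir, List<String> views) {
        List<String> paths = new ArrayList<>();
        paths.add(docDir + "original.txt");
        for (String view : views) {
            paths.add(docDir + view + ".txt");
        }
        return paths;
    }
}
```

Hashing the text gives every document a stable directory name within a job, so re-submitting the same document maps to the same location. Whether that is desirable is itself an open question.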
- Simply write the output annotation (e.g., NER.txt) to HDFS just like the input was written
- Have the Master Curator copy it out of HDFS, put it in the database, etc.
- Low priority. MapReduce may take care of this for us, but in any case, we should wait to see if there will be significant benefit from optimizing here.
If yes . . .
- Pre-launch, how does the Master Curator ensure that the Hadoop nodes are running Curator?
- How does the CuratorReducer connect to the local Curator client?
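The write-back idea mentioned earlier (write the output annotation next to the input files, then have the Master Curator copy it out) might look roughly like the following. It is a sketch only: java.nio on a local directory stands in for HDFS, the serialized annotation is treated as an opaque string, and AnnotationRoundTrip, writeAnnotation and collectAnnotations are hypothetical names, not part of Curator.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class AnnotationRoundTrip {
    // Worker side: write <docDir>/<VIEW>.txt, mirroring how the input was laid out.
    static Path writeAnnotation(Path docDir, String viewName, String serialized)
            throws IOException {
        Path out = docDir.resolve(viewName + ".txt");
        return Files.write(out, serialized.getBytes(StandardCharsets.UTF_8));
    }

    // Master side: read every annotation file in the document directory,
    // skipping original.txt, keyed by view name (e.g. "NER").
    static Map<String, String> collectAnnotations(Path docDir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (Stream<Path> files = Files.list(docDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                if (name.equals("original.txt") || !name.endsWith(".txt")) {
                    continue;
                }
                String view = name.substring(0, name.length() - ".txt".length());
                result.put(view, new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
            }
        }
        return result;
    }
}
```

The map returned by collectAnnotations is what the Master Curator would then hand to whatever code stores records in the database. That step is deliberately left out here.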