This repository was archived by the owner on Feb 1, 2022. It is now read-only.
Open Design Questions
s3cur3 edited this page Jun 18, 2012 · 5 revisions
The following questions make reference to the "pseudo-sequence diagram."
In `run()`, what does it mean to set up a document collection on the local storage (i.e., bring into HDFS)?
- It might be best to have the Master Curator do this. Since it has access to all the records, it can create a directory in HDFS for each job. If a document has several annotations, each document can get its own sub-directory holding them, like this:
  - hdfs:/user/home/
    - job123/
      - <Unique ID/hash here>/
        - NER.txt
        - SRL.txt
        - chunking.txt
      - . . .
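
The layout above could be built with a small path helper. This is only a sketch under stated assumptions: the page does not say how the per-document unique ID is produced, so the SHA-1 digest of the document text used here is a hypothetical choice, and the names `document_id` and `annotation_path` are illustrative rather than part of Curator.

```python
import hashlib
import posixpath

# Root directory taken from the layout sketched on this page.
HDFS_ROOT = "hdfs:/user/home"


def document_id(text: str) -> str:
    """Hypothetical unique ID: SHA-1 hex digest of the document text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def annotation_path(job_id: str, text: str, annotation: str) -> str:
    """Build <root>/<job>/<document ID>/<annotation>.txt for one annotation."""
    return posixpath.join(HDFS_ROOT, job_id, document_id(text), annotation + ".txt")


if __name__ == "__main__":
    # One path per annotation view of the same document.
    for view in ("NER", "SRL", "chunking"):
        print(annotation_path("job123", "Some document text.", view))
```

Hashing the text means the same document always maps to the same directory, which keeps all of its annotations together. Any other stable identifier would work equally well.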
- Simply write the output annotation (e.g., NER.txt) to HDFS just like the input was written
- Have the Master Curator copy it out of HDFS, put it in the database, etc.
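
The two steps above might look roughly like this when driven through the standard `hadoop fs -put` and `hadoop fs -get` shell commands. This is a sketch, not the project's implementation: the helper names are hypothetical, the document directory `doc-id` in the example is a placeholder, and the database step is left out because this page does not describe it.

```python
import subprocess
from typing import List


def put_command(local_file: str, hdfs_dir: str) -> List[str]:
    """Command that writes an output annotation into HDFS."""
    return ["hadoop", "fs", "-put", local_file, hdfs_dir]


def get_command(hdfs_file: str, local_dir: str) -> List[str]:
    """Command the Master Curator could use to copy an annotation back out."""
    return ["hadoop", "fs", "-get", hdfs_file, local_dir]


def execute(command: List[str]) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Printed rather than executed, so the sketch runs without a Hadoop install.
    print(put_command("NER.txt", "hdfs:/user/home/job123/doc-id/"))
    print(get_command("hdfs:/user/home/job123/doc-id/NER.txt", "/tmp/job123/"))
```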
- Low priority. MapReduce may take care of this for us, but in any case, we should wait to see whether there will be a significant benefit from optimizing here.
If yes . . .
- Pre-launch, how does the Master Curator ensure that the Hadoop nodes are running Curator?
- How does the CuratorReducer connect to the local Curator client?