If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:
- Build up an index of keys to file location/offset of one of the data sets.
- Use the other data set as normal input to a map job.
- For each key, look up the corresponding file/offset in the index.
- Directly read the file, seeking to the offset.
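
To make the steps above concrete, here is a minimal sketch in plain Python rather than Scoobi or Hadoop code. It assumes tab-separated `key\tvalue` records and an in-memory file standing in for the indexed data set; the names `build_index`, `lookup` and `map_side_join` are hypothetical and only for illustration. It relies on the sort order so that all records for a key are contiguous and a single seek is enough. A real implementation would probably keep a sparse index rather than one entry per key.

```python
import io


def build_index(indexed_file):
    """Map each join key to the byte offset of its first record."""
    index = {}
    indexed_file.seek(0)
    while True:
        offset = indexed_file.tell()
        line = indexed_file.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0]
        # Keep only the first offset; later records with the same key follow it.
        index.setdefault(key, offset)
    return index


def lookup(indexed_file, index, key):
    """Seek to the key's offset and read the run of records sharing that key."""
    offset = index.get(key)
    if offset is None:
        return []
    indexed_file.seek(offset)
    values = []
    for line in iter(indexed_file.readline, b""):
        record_key, value = line.rstrip(b"\n").split(b"\t", 1)
        if record_key != key:
            break  # sorted input: the run for this key has ended
        values.append(value)
    return values


def map_side_join(input_records, indexed_file, index):
    """Stand-in for the map function: join each input record via the index."""
    for key, left_value in input_records:
        for right_value in lookup(indexed_file, index, key):
            yield key, left_value, right_value


# The indexed data set, sorted on the join key.
right = io.BytesIO(b"a\t1\nb\t2\nb\t3\nd\t4\n")
index = build_index(right)

# The other data set, fed to the "mapper" as normal input.
left = [(b"a", b"x"), (b"b", b"y"), (b"c", b"z")]
joined = list(map_side_join(left, right, index))
```

Here `joined` contains one row for `a` and two for `b`; `c` has no match and is dropped, so this sketch behaves like an inner join.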
Both Pig and Hive already have implementations, and this would be a nice addition to Scoobi.
Pig's implementation - http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation - https://issues.apache.org/jira/browse/HIVE-1194