
Implement sorted merge join #197

@raronson

If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:

  • Build up an index of keys to file location/offset of one of the data sets.
  • Use the other data set as normal input to a map job.
  • For each key, look up the corresponding file/offset from the index.
  • Directly read the file, seeking to the offset.
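
A minimal sketch of the steps above, in plain Python rather than Scoobi's API. It assumes a local tab-separated file stands in for the sorted data set on HDFS, and that the index fits in memory; `build_index`, `lookup`, `map_side_join` and `demo` are hypothetical names used only for illustration.

```python
import os
import tempfile


def build_index(path):
    """Map each join key to the byte offset of its first record in a sorted file."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()          # offset of the line about to be read
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0]
            index.setdefault(key, offset)  # keep only the first offset per key
    return index


def lookup(f, index, key):
    """Seek to the key's offset and read every consecutive record with that key."""
    offset = index.get(key)
    if offset is None:
        return []
    f.seek(offset)
    values = []
    for line in iter(f.readline, b""):
        k, _, v = line.rstrip(b"\n").partition(b"\t")
        if k != key:                   # sorted input: a new key ends the run
            break
        values.append(v)
    return values


def map_side_join(records, indexed_path, index):
    """Inner-join (key, value) records against the indexed file, as a mapper would."""
    with open(indexed_path, "rb") as f:
        for key, left in records:
            for right in lookup(f, index, key):
                yield key, left, right


def demo():
    # Assumed sample data: the indexed side, already sorted on the join key.
    rows = [(b"a", b"1"), (b"b", b"2"), (b"b", b"3"), (b"d", b"4")]
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for k, v in rows:
            tmp.write(k + b"\t" + v + b"\n")
    try:
        index = build_index(tmp.name)
        inputs = [(b"a", b"x"), (b"b", b"y"), (b"c", b"z")]  # the "map input" side
        return list(map_side_join(inputs, tmp.name, index))
    finally:
        os.remove(tmp.name)
```

Calling `demo()` joins the three input records against the indexed file; the key `c` has no entry in the index and is dropped, so this sketch behaves as an inner join. Because both sides are sorted, a real implementation could likely keep a sparse index (one entry per block) instead of one entry per key.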

Both Pig and Hive already have implementations, and this would be a nice addition to Scoobi.

Pig's implementation - http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation - https://issues.apache.org/jira/browse/HIVE-1194
