Skip to content

timyiu478/concurrency-mapreduce

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Concurrency Mapreduce

Map Reduce Background

In 2004, engineers at Google introduced a new paradigm for large-scale parallel data processing known as MapReduce (see the original paper here, and make sure to look in the citations at the end). One key aspect of MapReduce is that it makes programming such tasks on large-scale clusters easy for developers; instead of worrying about how to manage parallelism, handle machine crashes, and many other complexities common within clusters of machines, the developer can instead just focus on writing little bits of code (described below) and the infrastructure handles the rest.

Design Overview

Example Usage

To run the MapReduce example, you can use the following command:

./mapreduce test_files/3/in/*
jumps 10
 10
dog 10
brown 10
quick 10
the 20
fox 10
lazy 10
over 10

MapReduce Infrastructure setting: 10 mappers, 10 reducers, and default hash partitioning.

int main(int argc, char *argv[]) {
    MR_Run(argc, argv, Map, 10, Reduce, 10, MR_DefaultHashPartition);
}

Implementation Limitations/Assumptions

  • does not support running multiple MapReduce jobs in parallel.
  • does not support running workers on multiple machines.
  • does not support key lookup in O(1) time.
  • key-value pairs are stored in memory, so the total size of the data is limited by the memory of the machine.
  • does not support task scheduling policy.
  • number of partitions is equal to the number of reducers.

About

A MapReduce Infrastructure runs on one machine

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published