Skip to content

kristifidani/map_reduce_data_processing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MapReduce Solutions for Diverse Data Tasks

Project Overview

In this project, I've harnessed the power of MapReduce to tackle a variety of tasks:

  1. Encompass counting
  2. Reverse web-graph link analysis
  3. K-Nearest Neighbors (KNN) algorithm
  4. Matrix multiplication.

Each task is associated with its unique dataset and specific requirements, all meticulously detailed in the project description.

Utilizing the MapReduce programming paradigm, I've efficiently processed substantial volumes of data by breaking them into manageable chunks and orchestrating parallel operations on a cluster, exemplified by technologies such as Hadoop. My implementation leverages the MRJob framework, a powerful tool that empowers seamless programming with the MapReduce model in Python.

Key Aspects:

  • Mapper: Transforms input data into logical key-value pairs, generating a collection of intermediate key-values.
  • Combiner: Acts as a localized reducer for each map operation, optimizing intermediate data processing.
  • Reducer: Aggregates intermediate values linked with the same key, culminating in the desired outcome.

This project integrates mapper and reducer functions within a unified class, offering a streamlined and coherent approach. The flexibility to define multiple steps, each composed of at least one function, facilitates a comprehensive data processing pipeline that can be tailored to the intricacies of the tasks at hand.

Conclusion

Utilizing the MapReduce paradigm, this project showcases the power of parallel processing for solving a range of tasks, from keyword extraction to graph analysis and classification. The provided solutions leverage the MRJob framework and Python libraries to efficiently process large datasets, demonstrating the capabilities of MapReduce in a diverse set of applications.

Feel free to explore the code and adapt the provided solutions to your specific needs.

About

Project related to my Masters program for the Distributed Computing and Storage Architectures course.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages