
PageRank

Implementation of the PageRank algorithm in Python on Hadoop.
mapper.py and reducer.py compute the initial page ranks and the adjacency list from input data in which each line contains the current page and a page it points to, separated by a tab. The path to the file w is provided as a command-line argument.
The initial page ranks are stored locally in "w" and the adjacency list is stored in HDFS. A sketch of both scripts is given after the examples below.
sample_input.txt is given as
1 3
2 1
2 4
4 5
4 3
4 1
5 3

The w file will be
1,1
3,1
4,1
5,1

The computed sample_adjacency_list will be
1 [3]
2 [4, 1]
4 [5, 1, 3]
5 [3]
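
Below is a minimal sketch of what mapper.py and reducer.py could look like under Hadoop Streaming. It assumes tab-separated edges on standard input and an initial rank of 1 for every page that receives a link, matching the sample w file above; the actual scripts in the repository may differ.

    #!/usr/bin/env python3
    # mapper.py (sketch): forward each "source<TAB>target" edge unchanged,
    # keyed by the source page so the shuffle groups a page's outgoing links.
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        src, dst = line.split("\t")
        print(f"{src}\t{dst}")

    #!/usr/bin/env python3
    # reducer.py (sketch): build the adjacency list per source page and write
    # an initial rank of 1 for every page that is pointed to (the "w" file).
    import sys

    adjacency = {}   # source page -> list of pages it links to
    targets = set()  # pages that receive at least one incoming link

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        src, dst = line.split("\t")
        adjacency.setdefault(src, []).append(int(dst))
        targets.add(int(dst))

    # The adjacency list goes to standard output (collected into HDFS by the
    # streaming job), e.g. "1 [3]".
    for src, dsts in adjacency.items():
        print(f"{src} {dsts}")

    # The initial ranks are written to the local path passed as the
    # command-line argument (the "w" file), e.g. "1,1".
    with open(sys.argv[1], "w") as w_file:
        for page in sorted(targets):
            w_file.write(f"{page},1\n")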

mapper2.py and reducer2.py compute the new page ranks from the provided page embeddings and the w file containing the current ranks.
mapper2.py contains functions for computing the contribution and similarity with respect to the current node and its incoming nodes.
These values are sent to reducer2.py as key-value pairs of current node and contribution, from which the new rank is calculated.
The inputs to mapper2.py are sample_page_embeddings.json and the w file, passed as command-line arguments; the adjacency list is read directly from HDFS.
This step can be repeated manually until the page ranks of the previous iteration match those of the current iteration. A sketch of both scripts follows.
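
A minimal sketch of mapper2.py and reducer2.py is given below. It assumes sample_page_embeddings.json maps page ids (as strings) to embedding vectors, uses cosine similarity as the similarity measure, and applies a 0.85 damping factor in the reducer; these choices are illustrative and the repository's scripts may differ.

    #!/usr/bin/env python3
    # mapper2.py (sketch): distribute each page's current rank over its
    # outgoing links, weighting each contribution by embedding similarity,
    # and emit "target<TAB>contribution" pairs.
    import ast
    import json
    import math
    import sys

    def similarity(u, v):
        # Cosine similarity between two embedding vectors (assumed measure).
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def contribution(rank, sims, idx):
        # Share of the source page's rank sent to neighbour idx,
        # proportional to its similarity weight.
        total = sum(sims)
        return rank * sims[idx] / total if total else 0.0

    embeddings = json.load(open(sys.argv[1]))  # sample_page_embeddings.json
    ranks = {}                                 # page -> current rank from the w file
    with open(sys.argv[2]) as w_file:
        for line in w_file:
            if line.strip():
                page, rank = line.strip().split(",")
                ranks[page] = float(rank)

    # The adjacency list arrives on standard input (read from HDFS by the
    # streaming job); each line looks like "1 [3]".
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        src, rest = line.split(" ", 1)
        neighbours = ast.literal_eval(rest)
        rank = ranks.get(src, 1.0)
        sims = [similarity(embeddings[src], embeddings[str(n)]) for n in neighbours]
        for i, n in enumerate(neighbours):
            print(f"{n}\t{contribution(rank, sims, i)}")

    #!/usr/bin/env python3
    # reducer2.py (sketch): sum the contributions arriving for each page and
    # apply a damping factor to obtain the new rank. The output uses the same
    # "page,rank" format as w, so it can replace w for the next iteration.
    import sys

    DAMPING = 0.85   # assumed damping factor
    totals = {}      # page -> summed contribution from all incoming links

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        page, contrib = line.split("\t")
        totals[page] = totals.get(page, 0.0) + float(contrib)

    for page in sorted(totals, key=int):
        new_rank = (1 - DAMPING) + DAMPING * totals[page]
        print(f"{page},{new_rank}")

One way to run an iteration with Hadoop Streaming (jar location and HDFS paths are illustrative):

    hadoop jar hadoop-streaming.jar \
        -files mapper2.py,reducer2.py,sample_page_embeddings.json,w \
        -mapper "python3 mapper2.py sample_page_embeddings.json w" \
        -reducer "python3 reducer2.py" \
        -input /pagerank/sample_adjacency_list \
        -output /pagerank/new_ranks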
