
Wikipedia-Database-Parser-Analyzer

A parser and analyzer for Wikipedia database dumps, written in Rust.

Generate the graph linking each article to its sources, keep the most important vertices using algorithms such as PageRank, and render it using Gephi!
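
As an illustration, a minimal PageRank iteration over such an adjacency-list graph might look like the sketch below. The graph type and function name are assumptions for the example, not this repository's API:

```rust
use std::collections::HashMap;

/// Minimal PageRank sketch over an adjacency-list graph mapping each
/// article ID to the articles it links to. Textbook damping (0.85) and a
/// fixed iteration count; dangling-node mass is ignored for brevity.
fn pagerank(graph: &HashMap<u32, Vec<u32>>, iterations: usize, damping: f64) -> HashMap<u32, f64> {
    let n = graph.len() as f64;
    let mut rank: HashMap<u32, f64> = graph.keys().map(|&v| (v, 1.0 / n)).collect();
    for _ in 0..iterations {
        // Every vertex starts each round with the teleportation term.
        let mut next: HashMap<u32, f64> =
            graph.keys().map(|&v| (v, (1.0 - damping) / n)).collect();
        for (&v, out) in graph {
            if out.is_empty() {
                continue;
            }
            let share = damping * rank[&v] / out.len() as f64;
            for &w in out {
                *next.entry(w).or_insert((1.0 - damping) / n) += share;
            }
        }
        rank = next;
    }
    rank
}

fn main() {
    let graph = HashMap::from([(0, vec![1]), (1, vec![0, 2]), (2, vec![0])]);
    let ranks = pagerank(&graph, 20, 0.85);
    println!("{:?}", ranks); // vertex 0, with the most incoming links, ranks highest
}
```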

You can also find the shortest path between two Wikipedia articles using breadth-first search (BFS).
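
A shortest-path BFS on the same kind of graph could be sketched as follows (again an assumption about the types involved, not the repository's exact code):

```rust
use std::collections::{HashMap, VecDeque};

/// BFS returning the shortest path of article IDs from `start` to `goal`,
/// if one exists. A sketch over an assumed HashMap adjacency list.
fn shortest_path(graph: &HashMap<u32, Vec<u32>>, start: u32, goal: u32) -> Option<Vec<u32>> {
    let mut parent: HashMap<u32, u32> = HashMap::new();
    let mut queue = VecDeque::from([start]);
    parent.insert(start, start);
    while let Some(v) = queue.pop_front() {
        if v == goal {
            // Rebuild the path by walking parent pointers back to the start.
            let mut path = vec![v];
            let mut cur = v;
            while cur != start {
                cur = parent[&cur];
                path.push(cur);
            }
            path.reverse();
            return Some(path);
        }
        for &w in graph.get(&v).into_iter().flatten() {
            if !parent.contains_key(&w) {
                parent.insert(w, v);
                queue.push_back(w);
            }
        }
    }
    None
}

fn main() {
    let graph = HashMap::from([(0, vec![1]), (1, vec![2]), (2, vec![])]);
    println!("{:?}", shortest_path(&graph, 0, 2)); // Some([0, 1, 2])
}
```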

Cluster dealing with UK politics in a generated graph:
[Image: graph of UK politics]

Table of contents

  • Getting started
      • Installation
      • Wikipedia dump download
      • Run
  • Provided tools
      • Generate and load link databases
      • Different graph types with built-in functions
      • Render graphs using Gephi
  • Wikipedia documentation
  • License

Getting started

Installation

# Clone the repository:
git clone git@github.com:rreemmii-dev/Wikipedia-Database-Parser-Analyzer.git

cd Wikipedia-Database-Parser-Analyzer

Ensure you have cargo installed.

Wikipedia dump download

  1. Download a Wikipedia dump (a list can be found here). The dump I used is English Wikipedia, 2025-05-01 (uncompressed size: 101 GB, contains about 7 million parsed articles).
  2. Extract the dump.
  3. Set the WIKI_PATH constant in both src/main.rs and src/simple_main.rs to the dump file path, relative to Cargo.toml (see the example below).
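
For example (the file name below is only an illustration; the constant's exact type in the repository may differ):

```rust
// In src/main.rs and src/simple_main.rs; the path is an example only.
const WIKI_PATH: &str = "dumps/enwiki-20250501-pages-articles.xml";
```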

Run

Run the example file:

src/simple_main.rs does the following:

  1. Generates required databases (using generate_databases, see #Provided tools/Generate and load link databases for more information).
  2. Loads databases.
  3. Executes a BFS to find the shortest path from the Wikipedia article to the Rust_(programming_language) article.
  4. Filters the graph to keep only vertices having more than 1000 children or parents.
  5. Exports the filtered graph as a CSV file.

It can be run using:

cargo run --release --bin Example

Run the src/main.rs file

cargo run --release

Provided tools

Generate and load link databases

generate_databases in src/database/generate_database.rs generates the following databases:

  • graph: The graph with each article pointing towards its sources, stored as an adjacency list. Each article is represented by an ID.
  • name_from_id: The article name corresponding to the article ID.

load_name_id_databases and load_graph_database in src/database/load_database.rs load the two previous databases, plus the id_from_name database.
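
As a rough mental model, the three databases can be pictured like this (a sketch with assumed types; the repository's concrete types may differ):

```rust
use std::collections::HashMap;

// Assumed shapes for the three databases (illustrative, not the actual API):
type Graph = HashMap<u32, Vec<u32>>;    // article ID -> IDs of the articles it points to
type NameFromId = HashMap<u32, String>; // article ID -> article name
type IdFromName = HashMap<String, u32>; // article name -> article ID

fn main() {
    // Tiny hand-built instance for illustration.
    let name_from_id: NameFromId = HashMap::from([
        (0, "Rust_(programming_language)".to_string()),
        (1, "Mozilla".to_string()),
    ]);
    let id_from_name: IdFromName = name_from_id.iter().map(|(&id, n)| (n.clone(), id)).collect();
    let graph: Graph = HashMap::from([(0, vec![1])]); // article 0 points to article 1

    let rust_id = id_from_name["Rust_(programming_language)"];
    println!("{} -> {:?}", name_from_id[&rust_id], graph[&rust_id]);
}
```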

Different graph types with built-in functions

There are two graph types:

Render graphs using Gephi

To open the graph in Gephi, it must first be exported as a CSV file using export_as_csv in src/hashmap_graph_utils.rs.

It is then stored as an adjacency list using # as the CSV delimiter (Wikipedia article names cannot contain #).
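
The exact layout is defined by export_as_csv, but an adjacency list delimited by # presumably resembles the following (illustrative lines, not real output):

```
United_Kingdom#Parliament_of_the_United_Kingdom#Brexit#London
London#United_Kingdom#River_Thames
```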

It is not recommended to export graphs having more than 10,000 vertices. See https://gephi.org/users/requirements/.
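
Filtering by degree before exporting (as step 4 of the example does) is one way to stay under that size. A sketch, again assuming a HashMap adjacency list and a hypothetical helper name:

```rust
use std::collections::HashMap;

/// Keep only vertices with more than `min_degree` children (out-edges)
/// or parents (in-edges); edges to dropped vertices are removed as well.
/// Hypothetical helper, not the repository's API.
fn filter_by_degree(graph: &HashMap<u32, Vec<u32>>, min_degree: usize) -> HashMap<u32, Vec<u32>> {
    // Count parents (incoming edges) for every vertex.
    let mut in_degree: HashMap<u32, usize> = HashMap::new();
    for children in graph.values() {
        for &c in children {
            *in_degree.entry(c).or_insert(0) += 1;
        }
    }
    let keep = |v: u32| {
        graph.get(&v).map_or(0, |ch| ch.len()) > min_degree
            || in_degree.get(&v).copied().unwrap_or(0) > min_degree
    };
    graph
        .iter()
        .filter(|&(&v, _)| keep(v))
        .map(|(&v, children)| (v, children.iter().copied().filter(|&c| keep(c)).collect()))
        .collect()
}

fn main() {
    let graph = HashMap::from([(0, vec![1, 2]), (1, vec![0]), (2, vec![])]);
    // With min_degree = 1, only vertex 0 survives (2 children), and its
    // edges to dropped vertices disappear.
    println!("{:?}", filter_by_degree(&graph, 1));
}
```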

Wikipedia documentation

Here is some documentation about Wikipedia databases:

License

Distributed under the MIT License. See LICENSE.md.
