Revolutionizing website internal linking by leveraging cutting-edge data processing techniques, vector embeddings, and graph-based link prediction algorithms. By combining these advanced technologies and methodologies, the project aims to create an intelligent solution that optimizes internal link structures, enhancing both SEO performance and user navigation.
We're enabling the first publicly available, transparent research into end-to-end SEO and technical marketing, for both academic and industry use. This initiative opens the door to innovation and collaboration, setting a new standard for how large-scale websites can manage and improve their internal linking strategies using AI-powered, reproducible methods. A scientific paper is in progress and will follow.
Note: We've implemented clearer separation between frontend, backend, testing, and data logic, and are now conducting rigorous stress tests with the SEO community.
- Manual testing confirms module stability
- Initial test cases are provided
- Deep test automation is still to be implemented
Target Audience • Sponsors • Getting Started • App UIs • Product Roadmap • License • About the Creator
The project is organized into a modular structure to promote maintainability, reusability, and clear separation of concerns. This is the current folder layout; it may change over time:
WebKnoGraph/ (Your project root)
├── assets/                          # Project assets (images, etc.)
│   ├── 01_crawler.png
│   ├── 02_embeddings.png
│   ├── 03_link_graph.png
│   ├── 04_graphsage_01.png
│   ├── 04_graphsage_02.png
│   ├── 06_HITS_PageRank_Sorted_URLs.png
│   ├── WL_logo.png
│   ├── fcse_logo.png
│   └── kalicube.com.png
├── data/                            # (Typically empty in the repo; used for runtime output)
│   ├── link_graph_edges.csv         # Example of existing data files
│   ├── prediction_model/
│   │   └── model_metadata.json      # Example of existing data files
│   └── url_analysis_results.csv     # Example of existing data files
├── notebooks/                       # Jupyter notebooks, each acting as a UI entry point
│   ├── crawler_ui.ipynb             # UI for Content Crawler
│   ├── embeddings_ui.ipynb          # UI for Embeddings Pipeline
│   ├── link_crawler_ui.ipynb        # UI for Link Graph Extractor
│   ├── link_prediction_ui.ipynb     # UI for GNN Link Prediction & Recommendation
│   └── pagerank_ui.ipynb            # UI for PageRank & HITS Analysis (newly added)
├── src/                             # Core source code for the application
│   ├── backend/                     # Backend logic for various functionalities
│   │   ├── __init__.py
│   │   ├── config/                  # Configuration settings for each module
│   │   │   ├── __init__.py
│   │   │   ├── crawler_config.py
│   │   │   ├── embeddings_config.py
│   │   │   ├── link_crawler_config.py
│   │   │   ├── link_prediction_config.py
│   │   │   └── pagerank_config.py
│   │   ├── data/                    # Data loading, saving, and state management components
│   │   │   ├── __init__.py
│   │   │   ├── repositories.py          # For Content Crawler state (SQLite)
│   │   │   ├── embeddings_loader.py
│   │   │   ├── embeddings_saver.py
│   │   │   ├── embedding_state_manager.py
│   │   │   ├── graph_dataloader.py      # For Link Prediction data loading
│   │   │   ├── graph_processor.py       # For Link Prediction data processing
│   │   │   └── link_graph_repository.py # For Link Graph Extractor state (SQLite) & CSV saving
│   │   ├── graph/                   # Graph-specific algorithms and analysis
│   │   │   ├── __init__.py
│   │   │   └── analyzer.py
│   │   ├── models/                  # Machine learning model definitions
│   │   │   ├── __init__.py
│   │   │   └── graph_models.py      # For GNN Link Prediction (GraphSAGE)
│   │   ├── services/                # Orchestrators and core business logic for each module
│   │   │   ├── __init__.py
│   │   │   ├── crawler_service.py
│   │   │   ├── embeddings_service.py
│   │   │   ├── graph_training_service.py
│   │   │   ├── link_crawler_service.py
│   │   │   ├── pagerank_service.py
│   │   │   └── recommendation_engine.py
│   │   └── utils/                   # General utility functions
│   │       ├── __init__.py
│   │       ├── http.py                  # HTTP client utilities (reusable)
│   │       ├── url.py                   # URL filtering/extraction for Content Crawler
│   │       ├── link_url.py              # URL filtering/extraction for Link Graph Extractor
│   │       ├── strategies.py            # Crawling strategies (BFS/DFS), generalized for both crawlers
│   │       ├── text_processing.py       # Text extraction from HTML
│   │       ├── embedding_generation.py  # Embedding model loading & generation
│   │       └── url_processing.py        # URL path processing (e.g., folder depth)
│   └── shared/                      # Components shared across frontend and backend
│       ├── __init__.py
│       ├── interfaces.py            # Abstract interfaces (e.g., ILogger)
│       └── logging_config.py        # Standardized logging setup
├── tests/                           # Top-level directory for all unit tests
│   └── backend/                     # Mirrors src/backend
│       ├── __init__.py              # Makes 'backend' a Python package
│       └── services/                # Mirrors src/backend/services
│           ├── __init__.py          # Makes 'services' a Python package
│           ├── test_crawler_service.py        # Unit tests for crawler_service
│           ├── test_embeddings_service.py     # Unit tests for embeddings_service
│           ├── test_link_crawler_service.py   # Unit tests for link_crawler_service
│           ├── test_graph_training_service.py # Unit tests for graph_training_service
│           └── test_pagerank_service.py       # Unit tests for pagerank_service (newly added)
├── .github/
│   └── workflows/
│       └── python_tests.yaml        # GitHub Actions workflow for automated tests
├── LICENSE
├── README.md
├── requirements.txt
└── technical_report/                # Placeholder for documentation
    └── WebKnoGraph_Technical_Report.pdf
To begin a new crawl for a different website, delete the entire data/ folder. This directory stores all intermediate and final outputs from the previous crawl session. Removing it ensures a clean start without residual data interfering.
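If you are working in Colab with Drive mounted, the reset can be done from a notebook cell. This is a minimal sketch assuming the default project location described in Getting Started; adjust the path if your setup differs.

```python
# Minimal sketch: reset the data/ folder before crawling a new website.
# Assumes the default Colab + Google Drive location used throughout this README.
import shutil
from pathlib import Path

data_dir = Path("/content/drive/My Drive/WebKnoGraph/data")

if data_dir.exists():
    shutil.rmtree(data_dir)                      # remove all outputs from the previous crawl
data_dir.mkdir(parents=True, exist_ok=True)      # recreate an empty data/ directory
print(f"Reset {data_dir}")
```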
| Path | Description |
|---|---|
| `data/` | Root folder for all crawl-related data and model artifacts. |
| `data/link_graph_edges.csv` | Stores inter-page hyperlinks, forming the basis of the internal link graph. |
| `data/url_analysis_results.csv` | Contains extracted structural features such as PageRank and folder depth per URL. |
| `data/crawled_data_parquet/` | Directory for the raw HTML content captured by the crawler in Parquet format. |
| `data/crawler_state.db` | SQLite database that maintains the crawl state to support resume capability. |
| `data/url_embeddings/` | Holds vector embeddings representing the semantic content of each URL. |
| `data/prediction_model/` | Includes the trained GraphSAGE model and metadata for link prediction. |
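For a quick look at the tabular artifacts above, pandas is enough. The snippet below is only a sketch; the exact columns depend on what your crawl produced.

```python
# Sketch: inspect the main CSV artifacts (column names depend on your run).
import pandas as pd

base = "/content/drive/My Drive/WebKnoGraph/data"

edges = pd.read_csv(f"{base}/link_graph_edges.csv")          # internal link edge list
analysis = pd.read_csv(f"{base}/url_analysis_results.csv")   # per-URL structural features

print(edges.head())
print(analysis.head())
```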
For additional details about how this fits into the full project workflow, refer to the Project Structure section of the README.
We are incredibly grateful to our sponsors for their continued support in making this project possible. Their contributions have been vital in pushing the boundaries of what can be achieved through data-driven internal linking solutions.
- WordLift.io: We extend our deepest gratitude to WordLift.io for their generous sponsorship and for sharing insights and data that were essential for this project's success.
- Kalicube.com: Special thanks to Kalicube.com for providing invaluable data, resources, and continuous encouragement. Your support has greatly enhanced the scope and impact of WebKnoGraph.
- Faculty of Computer Science and Engineering (FCSE) Skopje: A heartfelt thanks to FCSE Skopje professors PhD Georgina Mircheva and PhD Miroslav Mirchev for their innovative ideas and technical suggestions. Their expertise and advice throughout the project were key in shaping the direction of WebKnoGraph.
Without the contributions from these amazing sponsors, WebKnoGraph would not have been possible. Thank you for believing in the vision and supporting the evolution of this groundbreaking project.
We welcome more sponsors and partners who are passionate about driving innovation in SEO and website optimization. If you're interested in collaborating or sponsoring, feel free to reach out!
WebKnoGraph is created for companies where content plays a central role in business growth. It is suited for mid to large-sized organizations that manage high volumes of content, often exceeding 1,000 unique pages within each structured folder, such as a blog, help center, or product documentation section.
These organizations publish regularly, with dedicated editorial workflows that add new content across folders, subdomains, or language versions. Internal linking is a key part of their SEO and content strategies. However, maintaining these links manually becomes increasingly difficult as the content volume grows.
WebKnoGraph addresses this challenge by offering AI-driven link prediction workflows. It supports teams that already work with technical SEO, semantic search, or structured content planning. It fits well into environments where companies prefer to maintain direct control over their data, models, and optimization logic rather than relying on opaque external services.
The tool is especially relevant for the following types of companies:
- Media and Publishing Groups: Teams operating large-scale news websites, online magazines, or niche vertical content hubs.
- B2B SaaS Providers: Companies managing growing knowledge bases, release notes, changelogs, and resource libraries.
- Ecommerce Brands and Marketplaces: Organizations that handle thousands of product pages, category overviews, and search-optimized content.
- Enterprise Knowledge Platforms: Firms supporting complex internal documentation across departments or in multiple languages.
WebKnoGraph empowers these organizations to scale internal linking with precision, consistency, and clarity, while keeping full control over their infrastructure.
WebKnoGraph is designed for tech-savvy marketers and marketing engineers with a strong understanding of advanced data analytics and data-driven marketing strategies. Our ideal users are professionals who have experience with Python or have access to development support within their teams.
These individuals are skilled in interpreting and utilizing data, as well as working with technical tools to optimize internal linking structures, improve SEO performance, and enhance overall website navigation. Whether directly coding or collaborating with developers, they are adept at leveraging data and technology to streamline marketing operations, increase customer engagement, and drive business growth.
If you're a data-driven marketer comfortable with using cutting-edge tools to push the boundaries of SEO, WebKnoGraph is built for you.
To explore and utilize WebKnoGraph, follow the instructions below to get started with the code, data, and documentation provided in the repository:
- Code: The core code for this project is located in the `src` folder, organized into `backend` and `shared` modules. The `notebooks` folder contains the Jupyter notebooks that serve as interactive Gradio UIs for each application.
- Data: The data used for analysis and testing, as well as generated outputs (like crawled content, embeddings, and link graphs), are stored within the `data` folder (though this folder is typically empty in the repository and populated at runtime).
- Technical Report: For a comprehensive understanding of the project, including the methodology, algorithms, and results, refer to the detailed technical report provided in `technical_report/WebKnoGraph_Technical_Report.pdf`. This document provides in-depth coverage of the concepts and the execution of the solution.
By following these resources, you will gain full access to the materials and insights needed to experiment with and extend WebKnoGraph.
This project is designed to be easily runnable in a Google Colab environment, leveraging Google Drive for persistent data storage.
- Google Account: Required for Google Colab and Google Drive.
- Python 3.8+
- Clone (if using Git locally):

  git clone https://github.com/martech-engineer/WebKnoGraph.git
  cd WebKnoGraph

  Then, upload this `WebKnoGraph` folder to your Google Drive.
- Upload (if directly from Colab):
  - Download the entire `WebKnoGraph` folder as a ZIP from the repository.
  - Unzip it.
  - Upload the `WebKnoGraph` folder directly to your Google Drive (e.g., into `My Drive/`). Ensure the internal folder structure is preserved exactly as shown in the Project Structure section.
All notebooks assume your `WebKnoGraph` project is located at `/content/drive/My Drive/WebKnoGraph/` after Google Drive is mounted. This path is explicitly set in each notebook.
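Conceptually, the path setup in each notebook's first cell looks roughly like the sketch below. The `project_root` variable name follows the troubleshooting notes later in this README; the notebooks themselves are the source of truth.

```python
# Sketch of the per-notebook path setup (the notebooks contain the authoritative version).
import sys

project_root = "/content/drive/My Drive/WebKnoGraph"  # must match your Drive layout exactly

if project_root not in sys.path:
    sys.path.append(project_root)  # lets `from src.backend... import ...` resolve from the project root
```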
Each notebook's first cell contains the necessary Python code to mount your Google Drive. You will be prompted to authenticate.
# Part of the first cell in each notebook
from google.colab import drive
drive.mount("/content/drive")
Each notebook's first cell also contains commented-out `!pip install` commands. It's recommended to:

- Open any of the notebooks (e.g., `notebooks/crawler_ui.ipynb`).
- Uncomment the `!pip install ...` lines in the first cell.
- Run that first cell. This will install all necessary libraries into your Colab environment for the current session. Alternatively, you can manually run `!pip install -r requirements.txt` in a Colab cell, ensuring your `requirements.txt` is up to date.
Running the Applications (Gradio UIs)
Each module has its own dedicated Gradio UI notebook. It's recommended to run them in the order below, since the outputs of one serve as inputs to the next.
General Steps for Each Notebook:
- Open the desired `*.ipynb` file in Google Colab.
- Go to Runtime -> Disconnect and delete runtime (this is CRUCIAL for a clean start and to pick up any code changes).
- Go to Runtime -> Run all cells.
- After the cells finish executing, a Gradio UI link (local and/or public `ngrok.io` link) will appear in the output of the last cell. Click this link to interact with the application (a simplified sketch of this launch step follows below).
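For orientation only, the link in the last cell comes from a Gradio launch along these lines; this is a simplified sketch, not the notebooks' actual UI code.

```python
# Simplified sketch of how a notebook's last cell produces the clickable Gradio link.
import gradio as gr

def run_module(start_url: str) -> str:
    # placeholder for the module's real entry point
    return f"Would start processing {start_url}"

demo = gr.Interface(fn=run_module, inputs="text", outputs="text", title="WebKnoGraph module")
demo.launch(share=True)  # prints a local URL and a temporary public share link
```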
5.1. Content Crawler
- Notebook: `notebooks/crawler_ui.ipynb`
- Purpose: Crawl a website and save content as Parquet files (a quick inspection sketch follows below).
- Default Output: `/content/drive/My Drive/WebKnoGraph/data/crawled_data_parquet/`
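To sanity-check the crawl, the Parquet output can be read back with pandas. This is a sketch; the column names in your files may differ.

```python
# Sketch: load the crawled pages back into a DataFrame (schema depends on the crawler's output).
import pandas as pd

parquet_dir = "/content/drive/My Drive/WebKnoGraph/data/crawled_data_parquet"
pages = pd.read_parquet(parquet_dir)  # pyarrow can read a directory of Parquet files

print(pages.shape)
print(pages.columns.tolist())
```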
5.2. Embeddings Pipeline
- Notebook: `notebooks/embeddings_ui.ipynb`
- Purpose: Generate embeddings for crawled URLs (an illustrative sketch follows below).
- Requires: Output from the Content Crawler (`crawled_data_parquet/`).
- Default Output: `/content/drive/My Drive/WebKnoGraph/data/url_embeddings/`
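For intuition about what this step produces, the sketch below embeds short texts with a sentence-transformers model. The model name is an illustrative assumption; the pipeline's actual implementation lives in `src/backend/utils/embedding_generation.py` and may differ.

```python
# Illustration only: turning page text into vector embeddings.
from sentence_transformers import SentenceTransformer

texts = [
    "Guide to technical SEO audits",
    "How internal linking distributes PageRank",
]

model = SentenceTransformer("all-MiniLM-L6-v2")      # example model, not necessarily the project's
embeddings = model.encode(texts, normalize_embeddings=True)

print(embeddings.shape)  # (2, 384) for this particular model
```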
5.3. Link Graph Extractor
- Notebook: `notebooks/link_crawler_ui.ipynb`
- Purpose: Extract internal FROM→TO links and save them as a CSV edge list.
- Default Output: `/content/drive/My Drive/WebKnoGraph/data/link_graph_edges.csv`
5.4. GNN Link Prediction & Recommendation Engine
- Notebook: `notebooks/link_prediction_ui.ipynb`
- Purpose: Train a GNN model on the link graph and embeddings, then get link recommendations (a simplified modeling sketch follows below).
- Requires:
  - Output from the Link Graph Extractor (`link_graph_edges.csv`).
  - Output from the Embeddings Pipeline (`url_embeddings/`).
- Default Output: `/content/drive/My Drive/WebKnoGraph/data/prediction_model/`
- Important Note: After training, you must select a specific URL from the dropdown in the "Get Link Recommendations" tab for recommendations to be generated. Do not use the placeholder message.
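To make the modeling step concrete, here is a heavily simplified GraphSAGE link-prediction sketch in PyTorch Geometric. The project's actual architecture and training loop live in `src/backend/models/graph_models.py` and `src/backend/services/graph_training_service.py` and differ in detail.

```python
# Simplified GraphSAGE link-prediction sketch (illustration, not the project's exact model).
import torch
from torch_geometric.nn import SAGEConv

class SAGEEncoder(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

def score_links(z, pairs):
    # dot-product decoder: higher score = more likely that the link should exist
    return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)

# Toy example: 4 pages with 8-dimensional embeddings and a few existing links.
x = torch.randn(4, 8)                                # node features (URL embeddings)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])    # existing FROM -> TO edges

model = SAGEEncoder(in_dim=8, hidden_dim=16)
z = model(x, edge_index)                             # node representations
candidate = torch.tensor([[0], [3]])                 # should page 0 link to page 3?
print(torch.sigmoid(score_links(z, candidate)))
```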
5.5. PageRank & HITS Analysis
- Notebook: `notebooks/pagerank_ui.ipynb`
- Purpose: Calculate PageRank and HITS scores for URLs based on the link graph, and analyze folder depths (a conceptual sketch follows below).
- Requires: Output from the Link Graph Extractor (`link_graph_edges.csv`). It also generates `url_analysis_results.csv`, which is then used internally for HITS analysis.
- Default Output: `/content/drive/My Drive/WebKnoGraph/data/url_analysis_results.csv`
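Conceptually, these scores can be reproduced with networkx, as in the sketch below. The edge CSV column names (`FROM`, `TO`) are assumptions based on the extractor's description; the service's actual implementation may differ.

```python
# Sketch: PageRank and HITS over the extracted link graph (assumed columns: FROM, TO).
import pandas as pd
import networkx as nx

edges = pd.read_csv("/content/drive/My Drive/WebKnoGraph/data/link_graph_edges.csv")
G = nx.from_pandas_edgelist(edges, source="FROM", target="TO", create_using=nx.DiGraph)

pagerank = nx.pagerank(G, alpha=0.85)
hubs, authorities = nx.hits(G, max_iter=1000)

for url in sorted(pagerank, key=pagerank.get, reverse=True)[:10]:
    print(f"{pagerank[url]:.5f}  {url}")
```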
To execute all unit tests located within the tests/backend/services/ directory and its subdirectories, navigate to the root of your WebKnoGraph project in your terminal. Once there, you can use Python's built-in unittest module with its discover command:
python -m unittest discover tests/backend/services/
- `python -m unittest`: invokes the `unittest` module as a script.
- `discover`: tells `unittest` to search for and load all test cases.
- `tests/backend/services/`: the starting directory for test discovery. `unittest` will look for any file whose name begins with `test` (e.g., `test_crawler_service.py`, `test_pagerank_service.py`) within this directory and its subdirectories, and then run all test methods found within the `unittest.TestCase` classes in those files.
A successful test run will typically show a series of dots (.) indicating passed tests. If any tests fail (F) or encounter errors (E), they will be clearly marked, and a summary of the failures/errors will be provided at the end of the output.
The runner's output confirms that all tests in the `tests/backend/services/` directory were found and executed; the final summary indicates whether all of them passed.
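If you add your own tests, a minimal file following the existing naming convention might look like the hypothetical sketch below (the class and assertion are placeholders, not part of the repository).

```python
# tests/backend/services/test_example_service.py -- hypothetical minimal test file.
import unittest

class TestExampleService(unittest.TestCase):
    def test_placeholder(self):
        # Replace with real assertions against a service from src/backend/services/.
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    unittest.main()
```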
- Ensure your `WebKnoGraph` folder is directly under `/content/drive/My Drive/`.
- Verify that the `src` directory exists within `WebKnoGraph` and contains `backend/` and `shared/`.
- Make sure the `project_root` variable in the first cell of your notebook exactly matches the absolute path to your `WebKnoGraph` folder on Google Drive.
- Always perform a Runtime -> Disconnect and delete runtime before re-running.
- Check your file paths (`!ls -R "/content/drive/My Drive/WebKnoGraph"`) to ensure the module file (`some_module.py`) is physically located at the path implied by the import (`src/backend/data/`).
- Ensure there's an `__init__.py` file (even if empty) in every directory along the import path (e.g., `src/backend/__init__.py`, `src/backend/data/__init__.py`).
- Verify the exact case-sensitivity of folder and file names.
- Confirm you have copy-pasted the entire content into the file and saved it correctly. An empty or syntax-error-laden file will also cause this error.
- Always perform a Runtime -> Disconnect and delete runtime before re-running.
- This typically indicates a conflict from multiple installations or an unclean session.
- Perform a Runtime -> Disconnect and delete runtime and then run all cells from scratch. Ensure the `!pip install` commands run in the very first cell, before any other imports.
- Ensure the model training pipeline completes successfully first.
- After training, manually select a valid URL from the dropdown for recommendations. The dropdown might initially show a placeholder if artifacts don't exist.
- If retraining, ensure old output artifacts are cleared or overwritten.
This roadmap outlines the planned feature development and research milestones for WebKnoGraph across upcoming quarters. It is organized around key strategic themes: algorithmic enhancements, deployment, testing, user interface customization, and research paper work. Each milestone reflects a step toward building a robust, AI-driven system for optimizing internal linking at scale.
WebKnoGraph invites contributions from developers, researchers, marketers, and anyone driven by curiosity and purpose. This project evolves through collaboration.
You can contribute by improving the codebase, refining documentation, testing workflows, or proposing new use cases. Every pull request, idea, and experiment helps shape a more open and intelligent future for SEO and internal linking.
Clone the repo, start a branch, and share your expertise. Progress happens when people build together.
WebKnoGraph is released under the Apache License 2.0.
This license allows open use, adaptation, and distribution. You can integrate the project into your own workflows, extend its functionality, or build on top of it. The license ensures the project remains accessible and reusable for individuals, teams, and institutions working at the intersection of SEO, AI, and web infrastructure.
Use the code. Improve the methods. Share what you learn.
This interactive calculator estimates the potential cost savings and ROI from optimizing internal links, based on your keyword data, CPC benchmarks, and click-through assumptions.
Emilija Gjorgjevska brings a rare blend of technical depth, product strategy, and marketing insight to the development of WebKnoGraph. She operates at the intersection of applied AI, SEO engineering, and knowledge representation, crafting solutions that are performant and deeply aligned with the real-world needs of digital platforms.
Beyond code, Emilija's background in marketing technology and ontology engineering empowers her to translate abstract research into actionable tooling for SEO professionals, SaaS teams, and content-heavy enterprises. She is a strong advocate for cross-disciplinary collaboration, and her leadership in the WebKnoGraph project signals a new paradigm in how we architect, evaluate, and scale intelligent linking systems, anchored in open science, responsible automation, and strategic real-world value.
In her free time, Emilija co-leads Women in AI & Digital DACH, a community committed to increasing visibility and opportunity for women shaping the future of AI and digital work across the DACH region.