Skip to content

yuyu1815/deepwiki_to_md

Repository files navigation

Deepwiki to Markdown Converter

The Japanese version of this document is available at README_ja.md.

A powerful Python tool designed to scrape content from deepwiki sites and convert it to clean Markdown format. It offers multiple scraping strategies and functions for data processing.

Features

  • Scrapes content from deepwiki sites using multiple strategies:
    • Direct Markdown Fetching (default)
    • Direct HTML Scraping with conversion
    • Simple static fallback
  • Extracts navigation items from specified UI elements to traverse libraries
  • Converts HTML content to Markdown format using markdownify
  • Saves the converted files in an organized directory structure
  • Supports scraping multiple libraries in a single run
  • Includes error handling with domain validation, reachability checks, and retry mechanisms
  • Offers a utility to convert Markdown files to YAML format while preserving formatting
  • Provides a utility to fix links within the scraped Markdown files
  • Supports scraping responses from chat interfaces using Selenium

Requirements

Core Dependencies

  • requests
  • beautifulsoup4
  • argparse
  • markdownify

Optional Dependencies

  • selenium (Required for the chat scraping feature)
  • webdriver-manager (Required for the chat scraping feature)
  • pyyaml (Required for the Markdown to YAML conversion feature)

Installation

Option 1: Install from PyPI

pip install deepwiki-to-md

This will install the core dependencies listed in setup.py. Note that selenium, webdriver-manager, and pyyaml are listed in requirements.txt but not as install dependencies in setup.py. Install them manually or install from source including requirements.txt if you need the chat scraping or YAML conversion features.

Option 2: Install from source

Clone this repository:

git clone https://github.com/yuyu1815/deepwiki_to_md.git
cd deepwiki_to_md

Install the package in development mode, including all dependencies from requirements.txt:

pip install -e . -r requirements.txt

Usage

Basic Usage (Command Line)

If installed from PyPI, you can use the command-line tool:

deepwiki-to-md "https://deepwiki.com/library_path"

Or with explicit parameters:

deepwiki-to-md --library "library_name" "https://deepwiki.example.com/library_path"

If installed from source, you can run the script directly:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/library_path"

Or with explicit parameters:

python -m deepwiki_to_md.run_scraper --library "library_name" "https://deepwiki.example.com/library_path"

Note: The output directory will be created in the current working directory where the command is executed, not in the package installation directory.

Repository Creation Tool

The package also includes a tool to create repository requests by setting an email and submitting a form:

If installed from PyPI, you can use the command-line tool:

deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"

To run in headless mode (without opening a browser window):

deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless

If installed from source, you can run the script directly:

python -m deepwiki_to_md.create --url "https://example.com/repository/create" --email "user@example.com"

Using the Python API

You can also use the DeepwikiScraper class directly in your Python code:

from deepwiki_to_md import DeepwikiScraper
# Import specific scraper classes if needed for direct use
from deepwiki_to_md.direct_scraper import DirectDeepwikiScraper  # For HTML -> MD
from deepwiki_to_md.direct_md_scraper import DirectMarkdownScraper  # For Direct MD
# Import the RepositoryCreator class for repository creation
from deepwiki_to_md.create import RepositoryCreator

# Create a scraper instance (DirectMarkdownScraper is used by default)
scraper = DeepwikiScraper(output_dir="MyDocuments")

# Scrape a library using the default (DirectMarkdownScraper)
scraper.scrape_library("python", "https://deepwiki.com/python/cpython")

# Create another scraper with a different output directory
other_scraper = DeepwikiScraper(output_dir="OtherDocuments")

# Scrape another library (still uses DirectMarkdownScraper by default)
other_scraper.scrape_library("javascript", "https://deepwiki.example.com/javascript")

# --- Using DirectDeepwikiScraper explicitly (HTML to Markdown) ---
# Create a scraper instance explicitly using DirectDeepwikiScraper
# This scraper fetches HTML and converts it to Markdown
html_scraper = DeepwikiScraper(
    output_dir="HtmlScrapedDocuments",
    use_direct_scraper=True,  # Enable DirectDeepwikiScraper
    use_alternative_scraper=False,  # Disable alternative fallback for clarity
    use_direct_md_scraper=False  # Disable DirectMarkdownScraper
)
html_scraper.scrape_library("go", "https://deepwiki.com/go")

# --- Using DirectMarkdownScraper explicitly (Direct Markdown Fetching) ---
# Create a scraper instance explicitly using DirectMarkdownScraper
# This is already the default, but can be specified for clarity or if other defaults change
md_scraper = DeepwikiScraper(
    output_dir="DirectMarkdownDocuments",
    use_direct_scraper=False,
    use_alternative_scraper=False,
    use_direct_md_scraper=True  # Enable DirectMarkdownScraper (this is the default)
)
md_scraper.scrape_library("rust", "https://deepwiki.com/rust")

# --- Using the individual direct scrapers directly ---
# These classes can be used independently for scraping specific pages or lists of pages

# Create a DirectDeepwikiScraper instance (HTML to Markdown)
direct_html_scraper = DirectDeepwikiScraper(output_dir="DirectHtmlScraped")

# Scrape a specific page directly (HTML to Markdown)
direct_html_scraper.scrape_page(
    "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
    "python_bytecode",  # Library name/path part for output folder
    save_html=True  # Optionally save the original HTML
)

# Create a DirectMarkdownScraper instance (Direct Markdown Fetching)
direct_md_scraper = DirectMarkdownScraper(output_dir="DirectMarkdownFetched")

# Scrape a specific page directly as Markdown
direct_md_scraper.scrape_page(
   "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
    "python_bytecode"  # Library name/path part for output folder
)

# --- Using the RepositoryCreator for repository creation requests ---
# Create a RepositoryCreator instance
creator = RepositoryCreator(headless=False)  # Set headless=True to run without browser UI

try:
  # Send a repository creation request
  success = creator.create(
    url="https://example.com/repository/create",
    email="user@example.com"
  )

  if success:
    print("Repository creation request sent successfully")
  else:
    print("Failed to send repository creation request")
finally:
  # Always close the browser when done
  creator.close()

Command-line Arguments

For deepwiki-to-md or python -m deepwiki_to_md.run_scraper:

  • library_url: URL of the library to scrape (can be provided as a positional argument).
  • --library, -l: Library name and URL to scrape. Can be specified multiple times for different libraries. Format: --library NAME URL.
  • --output-dir, -o: Output directory for Markdown files (default: Documents).
  • --use-direct-scraper: Use DirectDeepwikiScraper (HTML to Markdown conversion). Prioritized over --use-direct-md-scraper if both are specified.
  • --no-direct-scraper: Disable DirectDeepwikiScraper.
  • --use-alternative-scraper: Use the scrape_deepwiki function from direct_scraper.py as a fallback if the primary method fails (default: True).
  • --no-alternative-scraper: Disable the alternative scraper fallback.
  • --use-direct-md-scraper: Use DirectMarkdownScraper (fetches Markdown directly). This is the default behavior if no scraper type is explicitly specified.
  • --no-direct-md-scraper: Disable DirectMarkdownScraper.

Scraper Priority:

  • If --use-direct-scraper is specified, DirectDeepwikiScraper (HTML to Markdown) is used.
  • If --use-direct-md-scraper is specified (and --use-direct-scraper is not), DirectMarkdownScraper (Direct Markdown) is used.
  • If neither is specified, DirectMarkdownScraper (Direct Markdown) is used by default.
  • The --use-alternative-scraper flag controls a fallback mechanism within the chosen primary scraper.

For deepwiki-create or python -m deepwiki_to_md.create:

  • --url (required): The URL of the repository creation page.
  • --email (required): The email address to notify.
  • --headless: Run the browser in headless mode (without UI).

Examples (Command Line)

Simplified usage (uses DirectMarkdownScraper by default):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython"
# Or if installed via pip: deepwiki-to-md "https://deepwiki.com/python/cpython"

Scrape a single library with explicit parameters:

python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython"

Scrape multiple libraries:

python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython" --library "microsoft/vscode" "https://deepwiki.com/microsoft/vscode"

Specify a custom output directory:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --output-dir "MyDocuments"

Explicitly use DirectMarkdownScraper (Direct Markdown):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-md-scraper

Explicitly use DirectDeepwikiScraper (HTML to Markdown):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-scraper

Disable the alternative scraper fallback:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --no-alternative-scraper

Using the repository creation tool:

deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"

Using the repository creation tool in headless mode:

deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless

Usage with run_direct_scraper.py

You can also use the run_direct_scraper.py script, which is a simplified entry point specifically for the DirectDeepwikiScraper (HTML to Markdown):

python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython"
# Or with explicit parameters:
python -m deepwiki_to_md.run_direct_scraper --library "python" "https://deepwiki.com/python/cpython"
# To save HTML as well:
python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython" --save-html

Arguments for run_direct_scraper.py:

  • library_url: URL of the library (positional).
  • --library, -l: Library name and URL (can be multiple).
  • --output-dir, -o: Output directory (default: DynamicDocuments).
  • --save-html: Save original HTML files alongside Markdown.

Output Structure

The converted Markdown files will be saved in the following directory structure:

<output_dir>/
├── <library_name1>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
│   └── html/ # Only if --save-html is used with DirectDeepwikiScraper
│       ├── <page_name1>.html
│       ├── <page_name2>.html
│       └── ...
├── <library_name2>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
└── ...
  • <output_dir> is the directory specified by --output-dir (default: Documents for run_scraper.py, DynamicDocuments for run_direct_scraper.py).
  • <library_name> is the name provided for the library (or inferred from the URL path).
  • Each page from the Deepwiki site is saved as a separate .md file within the md subdirectory.
  • Original HTML is saved in the html subdirectory if the --save-html option is used with DirectDeepwikiScraper.

How It Works

The tool offers different scraping strategies to maximize compatibility and output quality:

1. Direct Markdown Scraping (DirectMarkdownScraper - Default)

  • Priority: Highest (used by default if no other scraper is explicitly chosen).
  • Method: Attempts to fetch the raw Markdown content directly from the Deepwiki site's underlying data source or API. This is done by sending requests with specialized headers that mimic internal application requests.
  • Process:
    • Sends requests designed to retrieve Markdown data (using specific Accept headers or query parameters)
    • Parses the response to extract the Markdown content
    • Performs minimal cleaning on the extracted Markdown
    • Splits the content into multiple files based on level 2 headings (##)
    • Saves the cleaned and split Markdown content directly to .md files
  • Advantage: Produces the highest fidelity Markdown, preserving the original formatting and structure as intended by the author.

2. Direct HTML Scraping (DirectDeepwikiScraper)

  • Priority: Medium (used if --use-direct-scraper is specified).
  • Method: Connects to the Deepwiki site using headers that mimic a standard browser request to fetch the fully rendered HTML page.
  • Process:
    • Fetches the full HTML of the page using the scrape_deepwiki function
    • Uses BeautifulSoup to parse the HTML
    • Identifies the main content area using a list of potential CSS selectors
    • Uses the markdownify library to convert the selected HTML content to Markdown
    • Saves the converted Markdown
  • Advantage: More robust than basic static scraping if direct Markdown fetching fails or is unavailable.

3. Alternative Scraper Fallback

  • Priority: Lowest (used as a fallback if --use-alternative-scraper is enabled).
  • Method: A simpler static requests mechanism with specific headers designed to fetch the page HTML reliably.

Markdown to YAML Conversion Utility

The tool provides a utility to convert Markdown files to YAML format while preserving formatting. This is particularly useful for processing the scraped content for LLMs.

Using the Conversion Tool (Command Line)

python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md"
# Or if console script entry point is installed:
# deepwiki-chat convert --md "path/to/markdown/file.md"

To specify a custom output directory:

python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md" --output "path/to/output/directory"

Using the Python API (Markdown to YAML)

from deepwiki_to_md.md_to_yaml import convert_md_file_to_yaml, markdown_to_yaml

# Convert a Markdown file to YAML
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md")

# Convert a Markdown file to YAML with a custom output directory
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md", "path/to/output/directory")

# Or convert a Markdown string directly to a YAML string
markdown_string = "# My Document\n\nThis is the content."
yaml_string = markdown_to_yaml(markdown_string)
print(yaml_string)

YAML Format

The converted YAML file includes a structured representation of the document while embedding the original Markdown content:

timestamp: 'YYYY-MM-DD HH:MM:SS'  # Timestamp of the conversion
title: Extracted Document Title    # Title extracted from the first H1/H2 header
content: |
  # Original Title
  ## Section 1

  Content of section 1.

  * List item 1
  * List item 2

  print("code")

  [Link Text](url)

  ## Section 2

  Content of section 2.
  ...                              # Full original Markdown content is preserved
links:
  - text: Link Text
    url: url                       # List of links extracted from the Markdown
images: [ ]                         # List of images extracted (currently empty)
metadata:
  headers: # List of all header texts
    - Original Title
    - Section 1
    - Section 2
    ...
  paragraphs_count: 5              # Count of paragraphs
  lists_count: 1                   # Count of lists
  tables_count: 0                  # Count of tables

Markdown Link Fixing Utility

The tool automatically runs a link-fixing utility on the generated .md files. This utility finds Markdown links in the format Text and replaces them with Text.

Using the Link Fixing Tool (Command Line)

python -m deepwiki_to_md.fix_markdown_links "path/to/your/markdown/directory"

Using the Python API (Link Fixing)

from deepwiki_to_md.fix_markdown_links import fix_markdown_links

# Fix links in all markdown files within a directory
fix_markdown_links("path/to/your/markdown/directory")

Chat Scraping Feature (Requires Selenium)

The tool includes a feature to interact with chat interfaces using Selenium and save the responses.

Using the Chat Scraper (Command Line)

python -m deepwiki_to_md.chat --url "https://deepwiki.com/some_chat_page" --message "Your message here" --wait 10 --debug --format "html,md,yaml" --output "MyChatResponses" --deep

Arguments for chat.py:

  • --url: URL of the chat interface.
  • --message: Message to send.
  • --selector: CSS selector for the chat input (default: textarea).
  • --button: CSS selector for the submit button (default: button).
  • --wait: Time to wait for response in seconds (default: 30).
  • --debug: Enable debug mode.
  • --output: Output directory (default: ChatResponses).
  • --deep: Enable "Deep Research" mode (specific to some interfaces).
  • --headless: Run browser in headless mode.
  • --format: Output format(s): html, md, yaml, or comma-separated list (default: html).

Note: The chat scraper uses Selenium, which requires a compatible browser installed.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

md the contents of deepwiki and add it as documentation

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages