The Japanese version of this document is available at README_ja.md.
A powerful Python tool designed to scrape content from deepwiki sites and convert it to clean Markdown format. It offers multiple scraping strategies and functions for data processing.
- Scrapes content from deepwiki sites using multiple strategies:
  - Direct Markdown Fetching (default)
  - Direct HTML Scraping with conversion
  - Simple static fallback
- Extracts navigation items from specified UI elements to traverse libraries
- Converts HTML content to Markdown format using `markdownify`
- Saves the converted files in an organized directory structure
- Supports scraping multiple libraries in a single run
- Includes error handling with domain validation, reachability checks, and retry mechanisms
- Offers a utility to convert Markdown files to YAML format while preserving formatting
- Provides a utility to fix links within the scraped Markdown files
- Supports scraping responses from chat interfaces using Selenium
- Python 3.6 or higher
- Required Python packages (see `requirements.txt`):
  - `requests`
  - `beautifulsoup4`
  - `argparse`
  - `markdownify`
  - `selenium` (required for the chat scraping feature)
  - `webdriver-manager` (required for the chat scraping feature)
  - `pyyaml` (required for the Markdown to YAML conversion feature)
```bash
pip install deepwiki-to-md
```
This will install the core dependencies listed in `setup.py`. Note that `selenium`, `webdriver-manager`, and `pyyaml` are listed in `requirements.txt` but not as install dependencies in `setup.py`. If you need the chat scraping or YAML conversion features, install them manually (for example, `pip install selenium webdriver-manager pyyaml`) or install from source including `requirements.txt`.
Clone this repository:
```bash
git clone https://github.com/yuyu1815/deepwiki_to_md.git
cd deepwiki_to_md
```
Install the package in development mode, including all dependencies from requirements.txt:
```bash
pip install -e . -r requirements.txt
```
If installed from PyPI, you can use the command-line tool:
deepwiki-to-md "https://deepwiki.com/library_path"
Or with explicit parameters:
```bash
deepwiki-to-md --library "library_name" "https://deepwiki.example.com/library_path"
```
If installed from source, you can run the script directly:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/library_path"
Or with explicit parameters:
```bash
python -m deepwiki_to_md.run_scraper --library "library_name" "https://deepwiki.example.com/library_path"
```
Note: The output directory will be created in the current working directory where the command is executed, not in the package installation directory.
The package also includes a tool to create repository requests by setting an email and submitting a form:
If installed from PyPI, you can use the command-line tool:
```bash
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"
```
To run in headless mode (without opening a browser window):
```bash
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless
```
If installed from source, you can run the script directly:
```bash
python -m deepwiki_to_md.create --url "https://example.com/repository/create" --email "user@example.com"
```
You can also use the DeepwikiScraper class directly in your Python code:
```python
from deepwiki_to_md import DeepwikiScraper
# Import specific scraper classes if needed for direct use
from deepwiki_to_md.direct_scraper import DirectDeepwikiScraper  # For HTML -> MD
from deepwiki_to_md.direct_md_scraper import DirectMarkdownScraper  # For Direct MD
# Import the RepositoryCreator class for repository creation
from deepwiki_to_md.create import RepositoryCreator

# Create a scraper instance (DirectMarkdownScraper is used by default)
scraper = DeepwikiScraper(output_dir="MyDocuments")

# Scrape a library using the default (DirectMarkdownScraper)
scraper.scrape_library("python", "https://deepwiki.com/python/cpython")

# Create another scraper with a different output directory
other_scraper = DeepwikiScraper(output_dir="OtherDocuments")

# Scrape another library (still uses DirectMarkdownScraper by default)
other_scraper.scrape_library("javascript", "https://deepwiki.example.com/javascript")

# --- Using DirectDeepwikiScraper explicitly (HTML to Markdown) ---
# This scraper fetches HTML and converts it to Markdown.
html_scraper = DeepwikiScraper(
    output_dir="HtmlScrapedDocuments",
    use_direct_scraper=True,  # Enable DirectDeepwikiScraper
    use_alternative_scraper=False,  # Disable alternative fallback for clarity
    use_direct_md_scraper=False,  # Disable DirectMarkdownScraper
)
html_scraper.scrape_library("go", "https://deepwiki.com/go")

# --- Using DirectMarkdownScraper explicitly (Direct Markdown Fetching) ---
# This is already the default, but it can be specified for clarity
# or in case other defaults change.
md_scraper = DeepwikiScraper(
    output_dir="DirectMarkdownDocuments",
    use_direct_scraper=False,
    use_alternative_scraper=False,
    use_direct_md_scraper=True,  # Enable DirectMarkdownScraper (the default)
)
md_scraper.scrape_library("rust", "https://deepwiki.com/rust")

# --- Using the individual direct scrapers directly ---
# These classes can be used independently to scrape specific pages.

# Create a DirectDeepwikiScraper instance (HTML to Markdown)
direct_html_scraper = DirectDeepwikiScraper(output_dir="DirectHtmlScraped")

# Scrape a specific page directly (HTML to Markdown)
direct_html_scraper.scrape_page(
    "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
    "python_bytecode",  # Library name/path part for the output folder
    save_html=True,  # Optionally save the original HTML
)

# Create a DirectMarkdownScraper instance (Direct Markdown Fetching)
direct_md_scraper = DirectMarkdownScraper(output_dir="DirectMarkdownFetched")

# Scrape a specific page directly as Markdown
direct_md_scraper.scrape_page(
    "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
    "python_bytecode",  # Library name/path part for the output folder
)

# --- Using the RepositoryCreator for repository creation requests ---
creator = RepositoryCreator(headless=False)  # Set headless=True to run without a browser UI
try:
    # Send a repository creation request
    success = creator.create(
        url="https://example.com/repository/create",
        email="user@example.com",
    )
    if success:
        print("Repository creation request sent successfully")
    else:
        print("Failed to send repository creation request")
finally:
    # Always close the browser when done
    creator.close()
```
For `deepwiki-to-md` or `python -m deepwiki_to_md.run_scraper`:

- `library_url`: URL of the library to scrape (can be provided as a positional argument).
- `--library`, `-l`: Library name and URL to scrape. Can be specified multiple times for different libraries. Format: `--library NAME URL`.
- `--output-dir`, `-o`: Output directory for Markdown files (default: `Documents`).
- `--use-direct-scraper`: Use DirectDeepwikiScraper (HTML to Markdown conversion). Prioritized over `--use-direct-md-scraper` if both are specified.
- `--no-direct-scraper`: Disable DirectDeepwikiScraper.
- `--use-alternative-scraper`: Use the `scrape_deepwiki` function from `direct_scraper.py` as a fallback if the primary method fails (default: True).
- `--no-alternative-scraper`: Disable the alternative scraper fallback.
- `--use-direct-md-scraper`: Use DirectMarkdownScraper (fetches Markdown directly). This is the default behavior if no scraper type is explicitly specified.
- `--no-direct-md-scraper`: Disable DirectMarkdownScraper.
Scraper priority (see the sketch after this list):

- If `--use-direct-scraper` is specified, DirectDeepwikiScraper (HTML to Markdown) is used.
- If `--use-direct-md-scraper` is specified (and `--use-direct-scraper` is not), DirectMarkdownScraper (Direct Markdown) is used.
- If neither is specified, DirectMarkdownScraper (Direct Markdown) is used by default.
- The `--use-alternative-scraper` flag controls a fallback mechanism within the chosen primary scraper.
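The precedence rule can be condensed to a few lines. This is an illustrative sketch of the selection behavior described above, not the package's actual internals:

```python
def choose_scraper(use_direct_scraper: bool, use_direct_md_scraper: bool) -> str:
    """Illustrative flag precedence: --use-direct-scraper wins, then
    --use-direct-md-scraper, then the default (direct Markdown)."""
    if use_direct_scraper:
        return "DirectDeepwikiScraper"  # HTML to Markdown
    if use_direct_md_scraper:
        return "DirectMarkdownScraper"  # direct Markdown fetching
    return "DirectMarkdownScraper"      # default when neither flag is given

assert choose_scraper(True, True) == "DirectDeepwikiScraper"
assert choose_scraper(False, False) == "DirectMarkdownScraper"
```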
For `deepwiki-create` or `python -m deepwiki_to_md.create`:

- `--url` (required): The URL of the repository creation page.
- `--email` (required): The email address to notify.
- `--headless`: Run the browser in headless mode (without UI).
Simplified usage (uses DirectMarkdownScraper by default):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython"
# Or if installed via pip: deepwiki-to-md "https://deepwiki.com/python/cpython"
Scrape a single library with explicit parameters:
```bash
python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython"
```
Scrape multiple libraries:
```bash
python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython" --library "microsoft/vscode" "https://deepwiki.com/microsoft/vscode"
```
Specify a custom output directory:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --output-dir "MyDocuments"
Explicitly use DirectMarkdownScraper (Direct Markdown):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-md-scraper
Explicitly use DirectDeepwikiScraper (HTML to Markdown):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-scraper
Disable the alternative scraper fallback:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --no-alternative-scraper
Using the repository creation tool:
```bash
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"
```
Using the repository creation tool in headless mode:
```bash
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless
```
You can also use the `run_direct_scraper.py` script, which is a simplified entry point specifically for the DirectDeepwikiScraper (HTML to Markdown):

```bash
python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython"
# Or with explicit parameters:
python -m deepwiki_to_md.run_direct_scraper --library "python" "https://deepwiki.com/python/cpython"
# To save HTML as well:
python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython" --save-html
```
Arguments for `run_direct_scraper.py`:

- `library_url`: URL of the library (positional).
- `--library`, `-l`: Library name and URL (can be multiple).
- `--output-dir`, `-o`: Output directory (default: `DynamicDocuments`).
- `--save-html`: Save original HTML files alongside Markdown.
The converted Markdown files will be saved in the following directory structure:
```
<output_dir>/
├── <library_name1>/
│   ├── md/
│   │   ├── <page_name1>.md
│   │   ├── <page_name2>.md
│   │   └── ...
│   └── html/          # Only if --save-html is used with DirectDeepwikiScraper
│       ├── <page_name1>.html
│       ├── <page_name2>.html
│       └── ...
├── <library_name2>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
└── ...
```
- `<output_dir>` is the directory specified by `--output-dir` (default: `Documents` for `run_scraper.py`, `DynamicDocuments` for `run_direct_scraper.py`).
- `<library_name>` is the name provided for the library (or inferred from the URL path).
- Each page from the Deepwiki site is saved as a separate `.md` file within the `md` subdirectory.
- Original HTML is saved in the `html` subdirectory if the `--save-html` option is used with DirectDeepwikiScraper.
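A page's output path follows mechanically from this layout. A minimal sketch (the helper name here is ours, not part of the package):

```python
from pathlib import Path

def page_output_path(output_dir: str, library_name: str, page_name: str,
                     html: bool = False) -> Path:
    """Build <output_dir>/<library_name>/md/<page_name>.md, or the
    html/ variant when original HTML is also being saved."""
    subdir, ext = ("html", ".html") if html else ("md", ".md")
    return Path(output_dir) / library_name / subdir / f"{page_name}{ext}"

print(page_output_path("Documents", "python", "page_name1"))
# Documents/python/md/page_name1.md
```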
The tool offers different scraping strategies to maximize compatibility and output quality:
Direct Markdown Fetching (DirectMarkdownScraper):

- Priority: Highest (used by default if no other scraper is explicitly chosen).
- Method: Attempts to fetch the raw Markdown content directly from the Deepwiki site's underlying data source or API. This is done by sending requests with specialized headers that mimic internal application requests.
- Process:
- Sends requests designed to retrieve Markdown data (using specific Accept headers or query parameters)
- Parses the response to extract the Markdown content
- Performs minimal cleaning on the extracted Markdown
- Splits the content into multiple files based on level 2 headings (`##`), as sketched after this list
- Saves the cleaned and split Markdown content directly to .md files
- Advantage: Produces the highest fidelity Markdown, preserving the original formatting and structure as intended by the author.
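The heading-based splitting step can be pictured as partitioning the document at each `##` line. A minimal sketch of the idea, assuming simple line-based headings (not the scraper's actual code):

```python
from typing import List

def split_on_h2(markdown: str) -> List[str]:
    """Split Markdown into chunks, each starting at a level 2 heading
    ('## '). Content before the first heading becomes its own chunk."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Title\nintro\n## Section 1\nbody\n## Section 2\nmore"
print(len(split_on_h2(doc)))  # 3 chunks: preamble, Section 1, Section 2
```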
Direct HTML Scraping (DirectDeepwikiScraper):

- Priority: Medium (used if `--use-direct-scraper` is specified).
- Method: Connects to the Deepwiki site using headers that mimic a standard browser request to fetch the fully rendered HTML page.
- Process (see the sketch after this list):
- Fetches the full HTML of the page using the scrape_deepwiki function
- Uses BeautifulSoup to parse the HTML
- Identifies the main content area using a list of potential CSS selectors
- Uses the markdownify library to convert the selected HTML content to Markdown
- Saves the converted Markdown
- Advantage: More robust than basic static scraping if direct Markdown fetching fails or is unavailable.
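In outline, that pipeline looks roughly like the following sketch; the candidate selectors and the header value are illustrative assumptions, not the scraper's real configuration:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def html_page_to_markdown(url: str) -> str:
    """Fetch a page, locate its main content area, and convert it to
    Markdown. The selector list below is an assumed example."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("main", "article", "div.content"):  # assumed candidates
        content = soup.select_one(selector)
        if content is not None:
            return md(str(content), heading_style="ATX")
    return md(html, heading_style="ATX")  # fall back to the whole page
```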
Simple static fallback:

- Priority: Lowest (used as a fallback if `--use-alternative-scraper` is enabled).
- Method: A simpler static `requests` mechanism with specific headers designed to fetch the page HTML reliably, as sketched below.
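In spirit, this fallback is a plain `requests` call with browser-like headers. A hedged sketch (the exact headers the tool sends may differ):

```python
import requests

def fetch_static(url: str) -> str:
    """Static fetch with generic browser-like headers; the values here
    are examples, not the tool's exact ones."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text
```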
The tool provides a utility to convert Markdown files to YAML format while preserving formatting. This is particularly useful when preparing the scraped content for LLMs.
```bash
python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md"
# Or if the console script entry point is installed:
# deepwiki-chat convert --md "path/to/markdown/file.md"
```
To specify a custom output directory:
```bash
python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md" --output "path/to/output/directory"
```
```python
from deepwiki_to_md.md_to_yaml import convert_md_file_to_yaml, markdown_to_yaml

# Convert a Markdown file to YAML
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md")

# Convert a Markdown file to YAML with a custom output directory
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md", "path/to/output/directory")

# Or convert a Markdown string directly to a YAML string
markdown_string = "# My Document\n\nThis is the content."
yaml_string = markdown_to_yaml(markdown_string)
print(yaml_string)
```
The converted YAML file includes a structured representation of the document while embedding the original Markdown content:
```yaml
timestamp: 'YYYY-MM-DD HH:MM:SS' # Timestamp of the conversion
title: Extracted Document Title # Title extracted from the first H1/H2 header
content: |
  # Original Title
  ## Section 1
  Content of section 1.
  * List item 1
  * List item 2
  print("code")
  [Link Text](url)
  ## Section 2
  Content of section 2.
  ... # Full original Markdown content is preserved
links:
  - text: Link Text
    url: url # List of links extracted from the Markdown
images: [] # List of images extracted (currently empty)
metadata:
  headers: # List of all header texts
    - Original Title
    - Section 1
    - Section 2
    # ...
  paragraphs_count: 5 # Count of paragraphs
  lists_count: 1 # Count of lists
  tables_count: 0 # Count of tables
```
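Conceptually, the converter assembles a dictionary with the shape above and serializes it with `pyyaml`. A deliberately naive sketch of that idea (the real converter's extraction logic is more thorough):

```python
import re
from datetime import datetime
import yaml

def markdown_to_yaml_sketch(markdown: str) -> str:
    """Naive illustration: pull out a title, headers, and links, then
    embed the full Markdown under 'content'. Not the real converter."""
    headers = re.findall(r"^#{1,6}\s+(.+)$", markdown, flags=re.MULTILINE)
    links = [{"text": text, "url": url}
             for text, url in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", markdown)]
    data = {
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "title": headers[0] if headers else "",
        "content": markdown,
        "links": links,
        "metadata": {"headers": headers},
    }
    return yaml.safe_dump(data, allow_unicode=True, sort_keys=False)
```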
The tool automatically runs a link-fixing utility on the generated `.md` files. This utility finds Markdown links in the scraped files and rewrites their targets so that they resolve correctly within the output.
```bash
python -m deepwiki_to_md.fix_markdown_links "path/to/your/markdown/directory"
```
```python
from deepwiki_to_md.fix_markdown_links import fix_markdown_links

# Fix links in all Markdown files within a directory
fix_markdown_links("path/to/your/markdown/directory")
```
The tool includes a feature to interact with chat interfaces using Selenium and save the responses.
```bash
python -m deepwiki_to_md.chat --url "https://deepwiki.com/some_chat_page" --message "Your message here" --wait 10 --debug --format "html,md,yaml" --output "MyChatResponses" --deep
```
Arguments for `chat.py`:

- `--url`: URL of the chat interface.
- `--message`: Message to send.
- `--selector`: CSS selector for the chat input (default: `textarea`).
- `--button`: CSS selector for the submit button (default: `button`).
- `--wait`: Time to wait for the response, in seconds (default: 30).
- `--debug`: Enable debug mode.
- `--output`: Output directory (default: `ChatResponses`).
- `--deep`: Enable "Deep Research" mode (specific to some interfaces).
- `--headless`: Run the browser in headless mode.
- `--format`: Output format(s): `html`, `md`, `yaml`, or a comma-separated list (default: `html`).
Note: The chat scraper uses Selenium, which requires a compatible browser to be installed.
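Under the hood, the flow resembles this Selenium sketch; the response selector and the overall structure are assumptions for illustration, not the tool's actual logic:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def send_chat_message(url, message, selector="textarea", button="button", wait=30):
    """Open a chat page, type a message, submit it, and wait for a reply.
    The '.response' selector below is a placeholder assumption."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # comparable to --headless
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        box = WebDriverWait(driver, wait).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
        box.send_keys(message)
        driver.find_element(By.CSS_SELECTOR, button).click()
        reply = WebDriverWait(driver, wait).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".response")))
        return reply.text
    finally:
        driver.quit()
```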
This project is licensed under the MIT License - see the LICENSE file for details.