`twars-url2md` is a fast and robust command-line tool and Rust library that fetches web pages, intelligently cleans up their HTML content, and converts them into clean, readable Markdown files. It's designed for high-performance batch processing, making it ideal for archiving, research, content conversion, or any task requiring structured text from web sources.
- What it Does
- Who It's For
- Why It's Useful
- Installation
- Usage
- Technical Details
- Development
- Contributing
- Troubleshooting
- License
- Author
- Acknowledgments
`twars-url2md` takes one or more URLs (or local HTML files) as input. For each URL, it:
- Fetches the web page content, handling various network conditions and CDN behaviors.
- Cleans the HTML by removing scripts, styles, ads, and other unnecessary elements to isolate the core content, powered by Monolith.
- Converts the cleaned HTML into well-formatted Markdown using htmd.
- Saves the Markdown to your local filesystem, either as individual files organized by URL structure or packed into a single file.
This tool is valuable for:
- Developers: Integrating web content conversion into applications or build processes.
- Researchers & Academics: Archiving web pages for citation, analysis, or offline reading.
- Content Creators & Curators: Converting articles or blog posts into Markdown for easier editing and republishing.
- Knowledge Workers: Building personal knowledge bases from online resources.
- Anyone needing to quickly and reliably transform web pages into a clean, portable text format.
`twars-url2md` stands out due to its combination of speed, reliability, and the quality of its output:
- 🚀 High-Performance Processing: Leverages asynchronous operations and adaptive concurrency (scaling with your CPU cores) for rapid batch processing of hundreds of URLs.
- 📝 Clean & Readable Markdown: Produces well-structured Markdown that accurately represents the original content's hierarchy.
- 🧹 Advanced HTML Cleaning: Goes beyond simple conversion by first pruning irrelevant HTML elements, resulting in focused and clutter-free Markdown.
- 🔄 Robust Error Handling: Features automatic retries with exponential backoff for network issues and graceful recovery from HTML parsing errors, ensuring maximum success during large jobs.
- 🔗 Intelligent URL Handling:
  - Extracts URLs from plain text, HTML, and Markdown input.
  - Resolves relative links if a base URL is provided.
  - Supports processing local HTML files (`/path/to/file.html` or `file:///path/to/file.html`).
  - Validates URLs and filters out non-HTTP(S) schemes.
- 🛡️ CDN Compatibility: Designed to work smoothly with modern CDNs (like Cloudflare, Fastly, Akamai), employing browser-like headers and HTTP/2 support to avoid bot detection and connection issues.
- 📂 Flexible Input/Output:
  - Accepts URLs via command-line arguments, input files, or standard input (stdin).
  - Organizes output into a logical directory structure mirroring the source URLs or creates a single `.md` file.
  - Can "pack" content from multiple URLs into one consolidated Markdown file, with original URLs as headers.
- 🖥️ Cross-Platform: Works consistently on Windows, macOS, and Linux.
- 🔧 Configurable: Offers options for verbose logging, custom output paths, and retry limits.
You can install `twars-url2md` using pre-compiled binaries (recommended for most users) or by building it from source using Cargo.
Linux and macOS:
curl -fsSL https://raw.githubusercontent.com/twardoch/twars-url2md/main/install.sh | bash
Or with custom install directory:
curl -fsSL https://raw.githubusercontent.com/twardoch/twars-url2md/main/install.sh | bash -s -- --install-dir ~/.local/bin
Download the latest release for your platform from the GitHub Releases page.
macOS:
# Intel x86_64
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-macos-x86_64.tar.gz | tar xz
# Apple Silicon (M1/M2)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-macos-aarch64.tar.gz | tar xz
# Move to a directory in your PATH
sudo mv twars-url2md /usr/local/bin/
Linux:
# x86_64
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-x86_64.tar.gz | tar xz
# ARM64 (aarch64)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-aarch64.tar.gz | tar xz
# Static binary (musl)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-x86_64-musl.tar.gz | tar xz
# Move to a directory in your PATH
sudo mv twars-url2md /usr/local/bin/
Windows:
# Download
Invoke-WebRequest -Uri https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-windows-x86_64.zip -OutFile twars-url2md.zip
# Extract
Expand-Archive twars-url2md.zip -DestinationPath .
# Move twars-url2md.exe to a directory in your system's PATH
# For example, move it to C:\Windows\System32 or add its current directory to PATH
Note: For Windows, ensure the directory where you place `twars-url2md.exe` is included in your `PATH` environment variable so you can run it from any command prompt.
If you have Rust installed (version 1.70.0 or later), you can install `twars-url2md` directly from Crates.io:
cargo install twars-url2md
To build `twars-url2md` from the source code:
- Clone the repository:
  git clone https://github.com/twardoch/twars-url2md.git
  cd twars-url2md
- Build in release mode:
  cargo build --release
  The executable will be located at `target/release/twars-url2md`.
- Install (optional): You can install it to your Cargo binary path (`~/.cargo/bin/`) by running:
  cargo install --path .
After installation, verify it by running:
twars-url2md --version
`twars-url2md` can be used as a command-line tool or as a library in your Rust projects.
The basic syntax for the CLI is:
twars-url2md [OPTIONS] [URLS...]
If URLs are provided directly as arguments, they will be processed. Otherwise, use `--input` or `--stdin`.
CLI Options:
You can view all options by running `twars-url2md --help`. Here are the main ones:
- `-i, --input <FILE>`: Input file containing URLs (one per line, or text with extractable URLs).
- `-o, --output <PATH>`: Output directory for Markdown files. If `<PATH>` ends with `.md`, all content will be saved into this single file instead of a directory structure (unless `--pack` is also used).
- `--stdin`: Read URLs from standard input.
- `--base-url <URL>`: Base URL for resolving relative links found in the input content (e.g., if parsing URLs from an HTML page).
- `-p, --pack <FILE.md>`: Pack all converted Markdown content into a single specified `.md` file. Each URL's content will be headed by its original URL.
- `-v, --verbose`: Enable verbose output with detailed logging (INFO and DEBUG levels).
- `-h, --help`: Print help information.
- `-V, --version`: Print version information.
Input Formats:
The tool can extract URLs from various input sources when using `-i` or `--stdin`:
- Plain URLs: A list of URLs, one per line.
  https://example.com
  https://another-site.com/page
- HTML Content: Text containing HTML with links (e.g., `<a href="https://example.com">Example</a>`).
- Markdown Content: Text containing Markdown links (e.g., `[Example](https://example.com)`).
- Local Files: Paths to local HTML files, e.g. `/path/to/your/file.html` or `file:///absolute/path/to/another/file.html`. Note: For local files, the content is read and converted to Markdown directly.
CLI Examples:
- Process a single URL, output to console (default if no `-o` or `--pack`):
  twars-url2md https://www.rust-lang.org
  (Note: This will print Markdown to stdout. For saving, use `-o` or `--pack`.)
- Process multiple URLs, save to the default directory structure (`./output/<domain>/...`):
  twars-url2md https://www.rust-lang.org https://crates.io -o ./output
- Process URLs from a file, save to a custom directory:
  # urls.txt contains one URL per line
  twars-url2md -i urls.txt -o ./markdown_files
- Process URLs from stdin, with verbose logging:
  echo "https://example.com" | twars-url2md --stdin -o ./output -v
- Extract URLs from a webpage and process them (using `curl` as an example source):
  curl -s https://news.ycombinator.com | \
    twars-url2md --stdin --base-url https://news.ycombinator.com -o ./hn_articles
- Process local HTML files (using `find` to supply file paths):
  find . -name "*.html" | twars-url2md --stdin -o ./local_markdown
- Create a single combined Markdown file from multiple URLs (`--pack`):
  twars-url2md -i urls.txt --pack combined_report.md
  Each URL's content in `combined_report.md` will be preceded by a header like `# https://example.com/some/page`.
- Output to a single `.md` file (alternative to the directory structure):
  # This is useful if you have one primary URL or want a simpler output than --pack
  twars-url2md https://example.com/main_article -o article.md
  If multiple URLs are processed and the output is a single file (not using `--pack`), their content will be concatenated.
- Use both individual file output and packed output:
  twars-url2md -i urls.txt -o ./individual_files --pack all_content.md
`twars-url2md` can also be used as a Rust library to integrate its functionality into your own projects.
Add it to your `Cargo.toml`:
[dependencies]
twars-url2md = "0.3.0" # Replace with the latest version from crates.io
tokio = { version = "1", features = ["full"] }
anyhow = "1"
(Check Crates.io for the most current version number.)
Example:
use twars_url2md::{process_urls, Config, url::extract_urls_from_text};
use std::path::PathBuf;
use anyhow::Result;
#[tokio::main]
async fn main() -> Result<()> {
// Example text from which to extract URLs
let text_with_urls = "Check out https://www.rust-lang.org and also see https://crates.io for packages.";
// Extract URLs. A base_url (Option<&str>) can be provided if needed.
let urls_to_process = extract_urls_from_text(text_with_urls, None);
if urls_to_process.is_empty() {
println!("No URLs found to process.");
return Ok(());
}
// Configure the processing task
let config = Config {
verbose: true, // Enable detailed internal logging (uses `tracing` crate)
max_retries: 3, // Max *additional* retries after the initial attempt (so, 3 means up to 4 total attempts)
output_base: PathBuf::from("./my_markdown_output"), // Base path for output
single_file: false, // False: create directory structure; True: if output_base is a file.md, save all to it
has_output: true, // True if output_base is a path to save to, false if just processing (e.g. for pack_file only)
pack_file: Some(PathBuf::from("./packed_documentation.md")), // Optional: combine all into one .md file
};
// Ensure the output directory exists if not using pack_file primarily or single_file mode
if config.has_output && !config.single_file && config.output_base.extension().is_none() {
if !config.output_base.exists() {
tokio::fs::create_dir_all(&config.output_base).await?;
}
} else if config.has_output && config.single_file { // Output is a single file
if let Some(parent) = config.output_base.parent() { // Ensure parent directory for the single file exists
if !parent.exists() {
tokio::fs::create_dir_all(parent).await?;
}
}
}
// Process the URLs
// `process_urls` returns a Result containing a list of (String, anyhow::Error) for failed URLs.
match process_urls(urls_to_process, config).await {
Ok(errors) => {
if errors.is_empty() {
println!("All URLs processed successfully!");
} else {
eprintln!("Some URLs failed to process:");
for (url, error) in errors {
eprintln!("- {}: {}", url, error);
}
}
}
Err(e) => {
eprintln!("A critical error occurred during processing: {}", e);
}
}
Ok(())
}
For more detailed information on the library API, please refer to the official documentation on docs.rs.
This section provides a deeper dive into the architecture and inner workings of `twars-url2md`.
`twars-url2md` is built with a modular design in Rust, emphasizing performance, concurrency, and resilience.
Core Components:
┌──────────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ CLI / Library │────▶│ URL Extractor │────▶│ HTTP Client │
│ (Input Handler) │ │ (src/url.rs) │ │ (src/html.rs) │
└──────────────────┘ └───────────────────┘ └─────────┬────────┘
│
▼
┌──────────────────┐ ┌───────────────────┐ ┌─────────┴────────┐
│ Output Writer │◀────│ Markdown Converter│◀────│ HTML Cleaner │
│ (File System) │ │ (src/markdown.rs) │ │ (Monolith Lib) │
└──────────────────┘ └───────────────────┘ └──────────────────┘
- CLI / Library Interface (`src/cli.rs`, `src/lib.rs`):
  - Handles command-line argument parsing (using `clap`) or library function calls.
  - Manages input sources (files, stdin, direct arguments).
  - Orchestrates the overall processing flow.
- URL Extractor & Validator (`src/url.rs`):
  - Extracts URLs from various text formats (plain text, HTML, Markdown) using libraries like `linkify`.
  - Validates, normalizes, and deduplicates URLs.
  - Resolves relative URLs against a provided base URL.
  - Generates structured output paths based on URL components.
- HTTP Client (`src/html.rs`):
  - Fetches web content using a robust, `curl`-based HTTP client (`curl_rust` crate).
  - Configured to mimic browser behavior to enhance compatibility with modern websites and CDNs (see "HTTP Client Details & CDN Compatibility" below).
  - Implements retry logic with exponential backoff for transient network errors.
- HTML Cleaner (Monolith Integration):
  - Utilizes the Monolith library to process and clean raw HTML.
  - Strips scripts, styles, iframes, and other non-essential elements.
  - Aims to isolate the main article or content body of the page.
  - Includes panic recovery: if Monolith encounters an issue with severely malformed HTML, `twars-url2md` attempts a fallback to basic HTML processing or skips the URL, logging an error and allowing the batch job to continue.
- Markdown Converter (`src/markdown.rs`):
  - Converts the cleaned HTML into Markdown using the htmd library.
  - Preserves semantic structure (headings, lists, links, etc.).
- Output Writer (`src/lib.rs`):
  - Manages writing the Markdown content to the filesystem.
  - Supports creating a directory structure based on URL paths, outputting to a single specified file, or packing all content into one file.
- Concurrent Processing (see the sketch after this list):
  - Leverages Tokio's asynchronous runtime for non-blocking I/O.
  - Processes multiple URLs concurrently using an adaptive worker pool. The number of concurrent workers is typically based on available CPU cores (e.g., `min(CPU_COUNT * 2, 16)`), optimizing throughput without overloading the system.
  - Uses `futures::stream::StreamExt::buffer_unordered` for managing concurrent tasks.
- Robust Error Handling:
  - Uses the `anyhow` crate for flexible and context-rich error reporting.
  - Retry Mechanism: For network-related issues during fetching, `twars-url2md` automatically retries. The `Config.max_retries` field (default `2` for CLI, configurable for library use) specifies the number of additional attempts after the first one, so `max_retries: 2` means up to 3 total attempts. Retries use exponential backoff.
  - Panic Recovery: Specifically for the Monolith HTML cleaning stage, if Monolith panics (e.g., due to extremely malformed HTML), the application catches the panic for that specific URL, logs an error, and attempts to fall back to a simpler HTML processing path or skips the URL, allowing the batch job to continue.
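The following sketch illustrates the two patterns described above — bounded concurrency via `buffer_unordered` and retries with exponential backoff. It is a simplified illustration under stated assumptions, not the crate's actual internals: the `fetch_markdown` helper, the `num_cpus` dependency, and the specific delays are made up for the example.

```rust
use std::time::Duration;
use futures::stream::{self, StreamExt};

// Placeholder for fetch + clean + convert; the real pipeline lives in the crate.
async fn fetch_markdown(url: &str) -> anyhow::Result<String> {
    Ok(format!("# {url}\n"))
}

// Retry the fetch with exponential backoff: wait, double the delay, try again.
async fn fetch_with_backoff(url: &str, max_retries: u32) -> anyhow::Result<String> {
    let mut delay = Duration::from_millis(500);
    for attempt in 0..=max_retries {
        match fetch_markdown(url).await {
            Ok(md) => return Ok(md),
            Err(e) if attempt == max_retries => return Err(e),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
        }
    }
    unreachable!()
}

#[tokio::main]
async fn main() {
    let urls = vec!["https://www.rust-lang.org", "https://crates.io"];
    // Adaptive worker pool: scale with CPU cores, capped at 16.
    let workers = usize::min(num_cpus::get() * 2, 16);

    let results: Vec<_> = stream::iter(urls)
        .map(|url| async move { (url, fetch_with_backoff(url, 2).await) })
        .buffer_unordered(workers) // at most `workers` fetches in flight
        .collect()
        .await;

    for (url, res) in results {
        match res {
            Ok(md) => println!("{url}: {} bytes of Markdown", md.len()),
            Err(e) => eprintln!("{url}: failed after retries: {e}"),
        }
    }
}
```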
The built-in HTTP client is carefully configured to maximize compatibility with various web servers and Content Delivery Networks (CDNs):
- Underlying Engine: Uses `curl` via the `curl_rust` crate, known for its robustness and wide protocol support.
- User-Agent: Sends a common browser User-Agent string (e.g., `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36`) to appear like a standard browser. This is defined in `src/lib.rs` as `USER_AGENT_STRING`.
- HTTP/2 Support: Auto-negotiates the HTTP version, preferring HTTP/2 where available for better performance and compatibility with CDNs that might otherwise challenge or block older clients.
- Browser-like Headers: Sends a comprehensive set of HTTP headers (e.g., `Accept`, `Accept-Language`, `Sec-Ch-Ua`, `Sec-Fetch-Site`, etc.) to further mimic legitimate browser traffic, reducing the likelihood of being blocked by bot detection systems.
- Timeouts: Implements connection and total request timeouts (e.g., 20s for connection, 60s total by default in `curl_http_client`) to prevent indefinite hangs.
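As an illustration of this configuration style, here is a sketch of a browser-like fetch using the Rust `curl` crate's `Easy` handle. The header values, timeouts, and function layout are examples chosen for the sketch, not necessarily the exact ones `twars-url2md` uses.

```rust
use std::time::Duration;
use curl::easy::{Easy, HttpVersion, List};

fn fetch(url: &str) -> Result<Vec<u8>, curl::Error> {
    let mut handle = Easy::new();
    handle.url(url)?;
    // Browser-like identity and protocol preferences.
    handle.useragent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")?;
    handle.http_version(HttpVersion::V2TLS)?; // prefer HTTP/2 over TLS, fall back otherwise
    handle.follow_location(true)?;            // follow redirects like a browser
    handle.connect_timeout(Duration::from_secs(20))?;
    handle.timeout(Duration::from_secs(60))?;

    // A few example browser-like headers; a real client would send more.
    let mut headers = List::new();
    headers.append("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")?;
    headers.append("Accept-Language: en-US,en;q=0.9")?;
    headers.append("Sec-Fetch-Site: none")?;
    handle.http_headers(headers)?;

    // Collect the response body into a buffer.
    let mut body = Vec::new();
    {
        let mut transfer = handle.transfer();
        transfer.write_function(|chunk| {
            body.extend_from_slice(chunk);
            Ok(chunk.len())
        })?;
        transfer.perform()?;
    }
    Ok(body)
}
```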
- Directory Structure (default with `-o <directory>`): Creates a hierarchy like `output_dir/example.com/path/to/page.md` (see the sketch below).
- Single File (with `-o <filename>.md`): Concatenates all Markdown content into the specified file. If multiple URLs are processed, their content is simply joined. This mode is active if the path given to `-o` ends with `.md` and is not a directory.
- Packed File (with `--pack <filename>.md`): Combines Markdown from all URLs into a single file. Each URL's content is clearly demarcated by a heading like `# <URL>`. This mode also preserves the original input order of URLs in the output file.
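To make the directory-structure mode concrete, here is a rough sketch of how a URL could map to an output path. It is illustrative only: it is not the tool's actual path-generation code, it relies on the `url` crate, and it ignores edge cases (query strings, index pages, unusual characters) that the real implementation has to handle.

```rust
use std::path::PathBuf;
use url::Url;

// Derive something like "output_dir/example.com/path/to/page.md" from a URL.
fn output_path(output_dir: &str, raw: &str) -> Option<PathBuf> {
    let parsed = Url::parse(raw).ok()?;
    let host = parsed.host_str()?;
    let mut path = PathBuf::from(output_dir);
    path.push(host);

    // Re-use the URL path segments as directories; an empty path becomes "index".
    let trimmed = parsed.path().trim_matches('/');
    let stem = if trimmed.is_empty() { "index" } else { trimmed };
    path.push(format!("{stem}.md"));
    Some(path)
}

fn main() {
    let p = output_path("./output", "https://example.com/path/to/page").unwrap();
    assert_eq!(p, PathBuf::from("./output/example.com/path/to/page.md"));
    println!("{}", p.display());
}
```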
- Utilizes the `tracing` crate for structured and configurable logging.
- Log levels (from most to least severe): `ERROR`, `WARN`, `INFO`, `DEBUG`, `TRACE`.
- Verbosity is controlled by the `-v, --verbose` flag (enables INFO and DEBUG for `twars_url2md` modules by setting `RUST_LOG=info,twars_url2md=debug`).
- Finer-grained control is available via the `RUST_LOG` environment variable (e.g., `RUST_LOG=twars_url2md=trace` for maximum detail from this application, or `RUST_LOG=info` for general info level). See `src/main.rs` for initialization logic.
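For reference, initializing this kind of `RUST_LOG`-driven logging typically looks like the following sketch using `tracing-subscriber` with its `env-filter` feature. This is a simplified illustration of the pattern; the project's actual setup in `src/main.rs` may differ.

```rust
use tracing_subscriber::EnvFilter;

// Sketch: map -v/--verbose to "info,twars_url2md=debug", otherwise honor RUST_LOG,
// falling back to "info" when the variable is unset.
fn init_logging(verbose: bool) {
    let filter = if verbose {
        EnvFilter::new("info,twars_url2md=debug")
    } else {
        EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"))
    };

    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .init();
}
```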
- The build process (via `build.rs`) embeds the build time, target architecture, and profile (debug/release) into the binary. This information is accessible via the `twars-url2md --version` command (see the `version()` function in `src/lib.rs`).
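A minimal `build.rs` along these lines could produce such metadata. This is a sketch of the standard `cargo:rustc-env` mechanism; the environment variable names are illustrative, not necessarily the ones the project uses.

```rust
// build.rs sketch: expose build metadata to the binary via cargo:rustc-env.
use std::time::{SystemTime, UNIX_EPOCH};

fn main() {
    // Unix timestamp of the build; the real script may use a formatted date instead.
    let build_time = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs().to_string())
        .unwrap_or_default();
    println!("cargo:rustc-env=BUILD_TIMESTAMP={build_time}");

    // TARGET and PROFILE are set by Cargo for build scripts.
    println!("cargo:rustc-env=BUILD_TARGET={}", std::env::var("TARGET").unwrap_or_default());
    println!("cargo:rustc-env=BUILD_PROFILE={}", std::env::var("PROFILE").unwrap_or_default());
}
```

The binary can then read these values at compile time with `env!("BUILD_TIMESTAMP")` and friends when assembling its `--version` output.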
This section provides guidance for setting up a development environment, building the project, and running tests.
- Rust: Version 1.70.0 or later. You can install Rust via rustup.
- Cargo: The Rust package manager (installed with Rust).
- Git: For cloning the repository.
- (Optional) Python 3: For running the issue verification suite (`issues/issuetest.py`).
- Clone the repository:
  git clone https://github.com/twardoch/twars-url2md.git
  cd twars-url2md
- Debug build:
  cargo build
  The executable will be in `target/debug/twars-url2md`.
- Release build (optimized):
  cargo build --release
  The executable will be in `target/release/twars-url2md`.
- Run directly with arguments (debug mode):
  cargo run -- -i urls.txt -o ./output
Consistent code style is maintained using `rustfmt`, and `clippy` is used for linting.
- Format code:
  cargo fmt
- Run linter (Clippy):
  cargo clippy --all-targets --all-features -- -D warnings
  (The `-D warnings` flag promotes warnings to errors, ensuring high code quality.)
The project has a suite of unit and integration tests.
- Run all tests:
  cargo test --all-features
- Run a specific test function:
  # Example: cargo test test_url_extraction --all-features
  cargo test <TEST_FUNCTION_NAME> --all-features
  (Replace `<TEST_FUNCTION_NAME>` with the actual test function name.)
- Run tests with output (e.g., for debugging print statements):
  cargo test --all-features -- --nocapture
- Run only integration tests: Integration tests are typically located in the `tests/` directory (e.g., `tests/integration/e2e_tests.rs`).
  # Example: cargo test --test e2e_tests --all-features
  cargo test --test <INTEGRATION_TEST_FILENAME_WITHOUT_RS> --all-features
- Issue Verification Suite: The project includes an issue verification script to test various CLI functionalities and confirm fixes for reported issues.
  python3 issues/issuetest.py
  (Ensure you have Python 3 installed, along with any dependencies required by that script.)
- Generate and open local documentation:
cargo doc --no-deps --open
- Check for known vulnerabilities in dependencies:
  cargo audit
  (You may need to install `cargo-audit` first: `cargo install cargo-audit`)
- Verify package before publishing:
cargo package
- Publish to Crates.io:
cargo publish
Contributions are welcome and greatly appreciated! Whether it's bug reports, feature suggestions, documentation improvements, or code contributions, your help makes `twars-url2md` better.
Please see the CONTRIBUTING.md file for detailed guidelines on how to contribute to the project, including information on reporting issues, submitting pull requests, and the code of conduct.
If you're looking for ways to contribute, here are some areas where help would be valuable:
- Additional Output Formats: Implementing converters for formats like AsciiDoc, reStructuredText, or others.
- Performance Optimizations: Identifying and improving bottlenecks in processing speed or memory usage.
- Enhanced Error Messages: Making error messages even more user-friendly and actionable.
- Expanded Test Coverage: Adding more unit, integration, or end-to-end tests, especially for edge cases.
- Documentation: Improving the README, API documentation, or adding usage examples.
- New Features: Proposing and implementing new functionalities that align with the tool's purpose.
Before starting significant work, it's a good idea to open an issue to discuss your proposed changes.
If you encounter issues while using `twars-url2md`, this section may help.
- SSL/TLS Certificate Errors:
  - Issue: You might see errors related to SSL/TLS certificate validation, especially with sites using self-signed certificates or older TLS configurations.
  - Details: `twars-url2md` uses `curl`, which typically relies on the system's certificate store. Ensure your system's CA certificates are up-to-date. For specific problematic sites, this can be complex; the tool aims for secure defaults.
  - Action: Check verbose output (`-v`) for more details on the TLS handshake. If you're in a corporate environment with a custom CA, ensure that CA is trusted by your system.
- CDN-Protected Sites (e.g., Cloudflare, Akamai, Adobe):
  - Issue: Previously, some sites behind aggressive CDNs might have blocked requests or timed out.
  - Solution: Recent versions of `twars-url2md` have significantly improved CDN compatibility by:
    - Auto-negotiating HTTP/2, which is preferred by many CDNs.
    - Sending a comprehensive set of browser-like HTTP headers (including User-Agent, Sec-CH-UA, etc.) to avoid simplistic bot detection.
    - Using `curl` as the underlying HTTP client, which has a network stack more aligned with browsers.
  - Action: Ensure you are using the latest version of `twars-url2md`. If you still encounter issues, verbose logging (`-v` or `RUST_LOG`) can provide clues.
- Timeouts on Large or Slow Pages:
  - Issue: The tool might time out when processing very large pages or pages from slow-responding servers.
  - Details: Default timeouts are generally generous (e.g., 60 seconds for the entire request).
  - Action: Check your network connection. If the page is exceptionally large or the server is consistently slow, this might be unavoidable with default settings. Verbose output can confirm if a timeout is the cause.
- Monolith Panics or Poor Conversion on Specific Pages:
  - Issue: The Monolith library (for HTML cleaning) or the `htmd` library (for Markdown conversion) might struggle with extremely complex, malformed, or unusual HTML structures.
  - Details: `twars-url2md` includes panic recovery for Monolith. If Monolith panics, the tool logs an error and attempts to fall back to a more basic HTML processing step or skips the URL. This prevents the entire batch from failing.
  - Action: If a specific page consistently fails to convert well, it might be due to its unique structure. You can report such pages as issues, providing the URL.
For more detailed insight into what the tool is doing, especially when troubleshooting, use verbose logging.
- Using the `-v` flag: The simplest way to get more logs is to add the `-v` or `--verbose` flag to your command. This typically sets the log level to show `INFO` messages from all crates and `DEBUG` messages from `twars_url2md` itself (`RUST_LOG=info,twars_url2md=debug`).
  twars-url2md -i urls.txt -o output -v
- Using the `RUST_LOG` Environment Variable: For more fine-grained control, you can use the `RUST_LOG` environment variable. `twars-url2md` uses the `tracing` library.
  Syntax: `RUST_LOG="target[span{field=value}]=level"` (simplified: `RUST_LOG="crate_name=level,another_crate=level"`)
  Examples:
  - Enable `DEBUG` level for `twars_url2md` and `INFO` for everything else (similar to `-v`):
    RUST_LOG=info,twars_url2md=debug twars-url2md -i urls.txt -o output
  - Enable `TRACE` level for `twars_url2md` (very detailed, for deep debugging):
    RUST_LOG=twars_url2md=trace twars-url2md -i urls.txt -o output
  - Enable `DEBUG` for the HTML processing module specifically:
    RUST_LOG=twars_url2md::html=debug twars-url2md -i urls.txt -o output
  - Show all `DEBUG` level messages from all crates:
    RUST_LOG=debug twars-url2md -i urls.txt -o output
Logging Levels (most to least verbose):
- `TRACE`: Extremely detailed information, typically for fine-grained debugging.
- `DEBUG`: Detailed information useful for debugging.
- `INFO`: Informational messages about the progress of the application.
- `WARN`: Warnings about potential issues that don't stop execution.
- `ERROR`: Errors that prevent a specific operation (e.g., processing one URL) but don't crash the application.
If you consistently encounter an issue not covered here, consider opening an issue on the GitHub repository with detailed information, including the command you ran, the output, and logs if possible.
This project is licensed under the MIT License. See the LICENSE file for details.
Adam Twardoch (@twardoch)
`twars-url2md` builds upon the excellent work of others:
- Monolith for robust HTML cleaning and resource embedding.
- htmd for efficient HTML-to-Markdown conversion.
- The Rust community and the creators of the many high-quality crates used in this project.
For bug reports, feature requests, or questions, please open an issue on the GitHub repository.