Convert HTML to clean, readable Markdown. Designed for content extraction, this library handles common HTML patterns while filtering out non-content elements such as navigation and scripts.
Add html2markdown to your list of dependencies in mix.exs:
def deps do
  [
    {:html2markdown, "~> 0.3.0"}
  ]
end
# Basic conversion
Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"
# With custom options
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "custom-nav"],
  normalize_whitespace: true
})
- Smart Content Extraction: Automatically removes navigation, ads, and other non-content elements
- HTML5 Support: Handles modern semantic elements like <details>, <mark>, <time>
- Table Conversion: Converts HTML tables to clean Markdown tables
- Entity Handling: Properly decodes HTML entities (&amp;, &lt;, &nbsp;, etc.)
- Configurable: Customize filtering and processing behavior
Html2Markdown.convert(html, %{
  # CSS classes that identify navigation elements to remove
  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
  # HTML tags to filter out during conversion
  non_content_tags: ["script", "style", "form", "nav", ...],
  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
  markdown_flavor: :basic,
  # Normalize whitespace (collapses multiple spaces, trims)
  normalize_whitespace: true
})
Extract readable content from web pages:
# Req.get!/1 returns the response directly (it raises on failure), not an {:ok, _} tuple
%{body: html} = Req.get!(url)
markdown = Html2Markdown.convert(html)
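When a page might fail to load, a more defensive version of the same fetch-and-convert step could look like the sketch below. It assumes the Req HTTP client is installed alongside html2markdown. `PageFetcher`, `fetch_markdown/3`, the `:unexpected_response` error tag, and the injectable `fetcher` and `converter` arguments are hypothetical names used only for illustration; none of them are part of either library.

```elixir
defmodule PageFetcher do
  @moduledoc """
  Hypothetical helper (not part of html2markdown): fetches a page and
  converts its HTML body to Markdown, returning tagged tuples.
  """

  # `fetcher` and `converter` default to the real library calls but can be
  # replaced with stubs, so the branching can be exercised offline.
  def fetch_markdown(url, fetcher \\ &Req.get/1, converter \\ &Html2Markdown.convert/1) do
    case fetcher.(url) do
      # Only a 200 response with a binary body is treated as convertible HTML.
      {:ok, %{status: 200, body: html}} when is_binary(html) ->
        {:ok, converter.(html)}

      # Any other response (non-200 status or a non-binary body) is an error.
      {:ok, %{status: status}} ->
        {:error, {:unexpected_response, status}}

      # Transport-level failures are passed through unchanged.
      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Returning tagged tuples instead of raising lets a crawler skip a bad URL and continue with the rest of its queue.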
Convert existing HTML content to Markdown:
# Convert blog posts from HTML to Markdown
html_content
|> Html2Markdown.convert(%{normalize_whitespace: true})
|> save_as_markdown()
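The `save_as_markdown()` step above is left to the reader. One possible shape for a batch migration is sketched here, assuming the posts live as `.html` files in a single directory. `MarkdownMigration` and `migrate_dir/2` are hypothetical names, not part of html2markdown, and the converter is injectable so the file handling can be tried without the dependency.

```elixir
defmodule MarkdownMigration do
  @moduledoc """
  Hypothetical migration helper (not part of html2markdown): converts every
  `.html` file in a directory to a sibling `.md` file.
  """

  # `converter` defaults to the library call shown in this README; pass a
  # stub function to test the file handling in isolation.
  def migrate_dir(dir, converter \\ &Html2Markdown.convert(&1, %{normalize_whitespace: true})) do
    dir
    |> Path.join("*.html")
    |> Path.wildcard()
    |> Enum.map(fn html_path ->
      # post.html -> post.md, written next to the original file
      md_path = Path.rootname(html_path, ".html") <> ".md"
      markdown = html_path |> File.read!() |> converter.()
      File.write!(md_path, markdown)
      md_path
    end)
  end
end
```

The function returns the list of written paths, which makes it easy to log or verify the migration afterwards. The original HTML files are left in place.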
Clean up HTML emails for plain text storage:
email_html
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "script", "meta"],
  navigation_classes: ["unsubscribe", "footer"]
})
- Headings: <h1> through <h6>
- Text: Paragraphs, emphasis (<em>, <i>), strong (<strong>, <b>)
- Lists: Ordered and unordered lists with nesting
- Links: <a> tags with proper URL handling
- Images: <img> and <picture> elements
- Code: Both inline <code> and block <pre> elements
- Tables: Full table support with headers
- Quotes: <blockquote> and <q> elements
- HTML5: <details>, <summary>, <mark>, <abbr>, <cite>, <time>, <video>
Full documentation is available at https://hexdocs.pm/html2markdown.
This project includes comprehensive testing and quality assurance tools:
# Run all tests
mix test
# Run tests with coverage
mix coveralls.html
# Run all quality checks (formatting, security, linting)
mix quality
# Individual checks
mix format --check-formatted # Code formatting
mix credo --only warning # Code linting
mix sobelow --config # Security analysis
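A combined task like `mix quality` is usually defined as a Mix alias. The fragment below is a guess at how such an alias could be wired up in a mix.exs; it is not taken from this project's source. The app name, version, and dependency versions are placeholders, and the project's actual definition may differ.

```elixir
# Hypothetical mix.exs; names and versions are illustrative assumptions.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      deps: deps(),
      aliases: aliases()
    ]
  end

  defp deps do
    [
      # Versions are assumptions; check Hex for current releases.
      {:credo, "~> 1.7", only: [:dev, :test], runtime: false},
      {:sobelow, "~> 0.13", only: [:dev, :test], runtime: false}
    ]
  end

  # `mix quality` runs each task in order and stops at the first failure.
  defp aliases do
    [
      quality: [
        "format --check-formatted",
        "credo --only warning",
        "sobelow --config"
      ]
    ]
  end
end
```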
This project uses GitHub Actions for continuous integration with:
- Multi-version testing (Elixir 1.15-1.17, OTP 25-27)
- Code quality enforcement
- Security scanning
- Test coverage reporting
MIT License - see LICENSE file for details.