agoodway/html2markdown

Html2Markdown

Convert HTML to clean, readable Markdown. Designed for content extraction, this library handles common HTML patterns while filtering out non-content elements such as navigation and scripts.

Installation

Add html2markdown to your list of dependencies in mix.exs:

def deps do
  [
    {:html2markdown, "~> 0.3.0"}
  ]
end

Quick Start

# Basic conversion
Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"

# With custom options
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "custom-nav"],
  normalize_whitespace: true
})

Features

  • Smart Content Extraction: Automatically removes navigation, ads, and other non-content elements
  • HTML5 Support: Handles modern semantic elements like <details>, <mark>, <time>
  • Table Conversion: Converts HTML tables to clean Markdown tables
  • Entity Handling: Properly decodes HTML entities (&amp;, &lt;, &nbsp;, etc.)
  • Configurable: Customize filtering and processing behavior
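To see several of these features at once, the sketch below feeds one small document through `Html2Markdown.convert/1`. The HTML snippet is made up for illustration, and the exact Markdown produced is not shown here because it depends on the library version; print the result to inspect it.

```elixir
# Made-up sample input: a navigation block, an entity, and a small table.
html = """
<nav class="menu"><a href="/">Home</a></nav>
<h2>Prices &amp; Plans</h2>
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>5</td></tr>
</table>
"""

# Default options; per the feature list, the <nav> block should be filtered
# out and the table rendered as a Markdown table.
html
|> Html2Markdown.convert()
|> IO.puts()
```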

Configuration Options

Html2Markdown.convert(html, %{
  # CSS classes that identify navigation elements to remove
  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
  
  # HTML tags to filter out during conversion
  non_content_tags: ["script", "style", "form", "nav", ...],
  
  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
  markdown_flavor: :basic,
  
  # Normalize whitespace (collapses multiple spaces, trims)
  normalize_whitespace: true
})
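If the same options are used in many places, one possible approach is to keep them in a small wrapper. The sketch below is not part of the library: `MyApp.Markdown` and `from_html/2` are hypothetical names, and it assumes that a partial options map is accepted, as in the Quick Start example.

```elixir
defmodule MyApp.Markdown do
  # Hypothetical wrapper module. Only Html2Markdown.convert/2 and the
  # option keys below come from this README.
  @default_opts %{
    navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
    normalize_whitespace: true
  }

  # Converts HTML using the shared defaults; `overrides` replaces
  # individual keys for a single call.
  def from_html(html, overrides \\ %{}) when is_binary(html) do
    Html2Markdown.convert(html, Map.merge(@default_opts, overrides))
  end
end
```

A call such as `MyApp.Markdown.from_html(html, %{normalize_whitespace: false})` would then change one setting while keeping the rest.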

Common Use Cases

Web Scraping

Extract readable content from web pages:

# Req.get!/1 returns the response struct directly and raises on failure
%{body: html} = Req.get!(url)
markdown = Html2Markdown.convert(html)
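When a failed request should not raise, the non-raising `Req.get/1` can be used instead. The following is a minimal sketch, assuming the Req library is installed; `PageFetcher` and `fetch_markdown/1` are hypothetical names, and the error tuples are one possible convention, not something the library prescribes.

```elixir
defmodule PageFetcher do
  # Hypothetical helper module, not part of html2markdown or Req.

  # Fetches `url` and returns {:ok, markdown} or {:error, reason}.
  def fetch_markdown(url) do
    case Req.get(url) do
      # Only convert successful responses whose body is still a string
      # (Req may decode some content types, such as JSON, into other terms).
      {:ok, %Req.Response{status: 200, body: html}} when is_binary(html) ->
        {:ok, Html2Markdown.convert(html)}

      {:ok, %Req.Response{status: status}} ->
        {:error, {:unexpected_response, status}}

      {:error, exception} ->
        {:error, exception}
    end
  end
end
```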

Content Migration

Convert existing HTML content to Markdown:

# Convert blog posts from HTML to Markdown
html_content
|> Html2Markdown.convert(%{normalize_whitespace: true})
|> save_as_markdown()
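The `save_as_markdown/1` step above is left to the application. As one possible shape for it, the sketch below converts every `.html` file in a directory and writes a `.md` file next to it. `HtmlMigration` and `convert_dir/1` are hypothetical names; only `Html2Markdown.convert/2` comes from the library.

```elixir
defmodule HtmlMigration do
  # Hypothetical helper, not part of html2markdown.

  # Converts each *.html file directly inside `dir` and writes a sibling
  # *.md file. Returns the list of written paths.
  def convert_dir(dir) do
    dir
    |> Path.join("*.html")
    |> Path.wildcard()
    |> Enum.map(&convert_file/1)
  end

  defp convert_file(html_path) do
    md_path = Path.rootname(html_path, ".html") <> ".md"

    markdown =
      html_path
      |> File.read!()
      |> Html2Markdown.convert(%{normalize_whitespace: true})

    # File.read!/1 and File.write!/2 raise on I/O errors, which keeps the
    # sketch short; a real migration may prefer explicit error handling.
    File.write!(md_path, markdown)
    md_path
  end
end
```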

Email Processing

Clean up HTML emails for plain text storage:

email_html
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "script", "meta"],
  navigation_classes: ["unsubscribe", "footer"]
})

Supported Elements

  • Headings: <h1> through <h6>
  • Text: Paragraphs, emphasis (<em>, <i>), strong (<strong>, <b>)
  • Lists: Ordered and unordered lists with nesting
  • Links: <a> tags with proper URL handling
  • Images: <img> and <picture> elements
  • Code: Both inline <code> and block <pre> elements
  • Tables: Full table support with headers
  • Quotes: <blockquote> and <q> elements
  • HTML5: <details>, <summary>, <mark>, <abbr>, <cite>, <time>, <video>

Documentation

Full documentation is available at https://hexdocs.pm/html2markdown.

Development

This project includes comprehensive testing and quality assurance tools:

Running Tests

# Run all tests
mix test

# Run tests with coverage
mix coveralls.html

Code Quality

# Run all quality checks (formatting, security, linting)
mix quality

# Individual checks
mix format --check-formatted  # Code formatting
mix credo --only warning       # Code linting
mix sobelow --config          # Security analysis

CI/CD

This project uses GitHub Actions for continuous integration with:

  • Multi-version testing (Elixir 1.15-1.17, OTP 25-27)
  • Code quality enforcement
  • Security scanning
  • Test coverage reporting

License

MIT License - see LICENSE file for details.
