dfop02/html4docx

HTML FOR DOCX

Convert HTML to DOCX. This project is a fork of the discontinued pqzx/html2docx.

How to install

pip install html-for-docx

Usage

The basic usage

Add formatted HTML to an existing DOCX

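from docx import Document
from html4docx import HtmlToDocx

# Open the existing file first; the second argument is a python-docx Document, not a path.
# 'existing_file.docx' is a placeholder name.
document = Document('existing_file.docx')
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save('existing_file.docx')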

You can also use python-docx to manipulate the file. Here is an example:

from docx import Document
from html4docx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()

html_string = '<h1>Hello world</h1>'
new_parser.add_html_to_document(html_string, document)

document.save('your_file_name.docx')

Convert files directly

from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
new_parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')

Convert files from a string

from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
docx = new_parser.parse_html_string(input_html_file_string)

Change table styles

Tables are not styled by default. Set the table_style attribute on the parser before converting the HTML. The style is applied to all tables.

from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.table_style = 'Light Shading Accent 4'
docx = new_parser.parse_html_string(input_html_file_string)

To add borders to tables, use the Table Grid style:

new_parser.table_style = 'Table Grid'

All table styles we support can be found here.

Metadata

You're able to read or set docx metadata:

from docx import Document
from html4docx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()
new_parser.set_initial_attrs(document)
metadata = new_parser.metadata

# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")

# You can get the metadata as a dict
metadata_json = metadata.get_metadata()
print(metadata_json['author']) # Jane
# or just print all metadata if you want
metadata.get_metadata(print_result=True)

document.save('your_file_name.docx')

You can find all available metadata attributes here.

Why

My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.

Differences (fixes and new features)

Fixes

  • Fix table_style not working | Dfop02 from Issue
  • Handle missing run for leading br tag | dashingdove from PR
  • Fix base64 images | djplaner from Issue
  • Handle img tag without src attribute | johnjor from PR
  • Fix bug when any style has !important | Dfop02
  • Fix 'style lookup by style_id is deprecated.' | Dfop02
  • Fix background-color not working | Dfop02
  • Fix crashes when img or bookmark is created without paragraph | Dfop02
  • Fix Ordered and Unordered Lists | TaylorN15 from PR

New Features

  • Add Width/Height style to images | maifeeulasad from PR
  • Support px, cm, pt, in, rem, em, mm, pc and % units for styles | Dfop02
  • Improve performance on large tables | dashingdove from PR
  • Support for HTML Pagination | Evilran from PR
  • Support Table style | Evilran from PR
  • Support alternative encoding | HebaElwazzan from PR
  • Support colors by name | Dfop02
  • Support font-size keywords, e.g. small, medium, etc. | Dfop02
  • Support for internal links (anchors) | Dfop02
  • Support for rowspan and colspan in tables | Dfop02 from Issue
  • Support for 'vertical-align' in table cells | Dfop02
  • Support for metadata | Dfop02
  • Add support for table cell styles (border, background-color, width, height, margin) | Dfop02
  • Allow inline images in the same paragraph | Dfop02
  • Refactor tests to be more consistent and rely less on 'human validation' | Dfop02
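
The unit support listed above (px, cm, pt, in, rem, em, mm, pc and %) implies that CSS lengths get normalized to something Word understands. The sketch below shows one way such a normalization to points could look. It is only an illustration and not the conversion html4docx actually performs: the function name css_length_to_points and the base_pt parameter (the size that em, rem and % are resolved against) are assumptions made for this example.

```python
import re

# Points per unit for the absolute CSS length units (CSS defines 1in = 96px = 72pt).
POINTS_PER_UNIT = {
    "pt": 1.0,
    "px": 0.75,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
    "pc": 12.0,
}

# A number followed by one of the units named in the feature list.
_LENGTH = re.compile(
    r"^\s*(-?\d*\.?\d+)\s*(px|cm|pt|in|rem|em|mm|pc|%)\s*$", re.IGNORECASE
)


def css_length_to_points(value, base_pt=12.0):
    """Convert a CSS length string to points.

    base_pt is a hypothetical reference size used for the relative
    units (em, rem and %); it is an assumption of this sketch.
    """
    match = _LENGTH.match(value)
    if match is None:
        raise ValueError(f"unsupported CSS length: {value!r}")
    number = float(match.group(1))
    unit = match.group(2).lower()
    if unit in ("em", "rem"):
        return number * base_pt
    if unit == "%":
        return number / 100.0 * base_pt
    return number * POINTS_PER_UNIT[unit]


print(css_length_to_points("16px"))
print(css_length_to_points("2em"))
```

Relative units need a reference value, which is why the sketch takes base_pt explicitly instead of guessing one from context.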

Known Issues

  • Maximum Nesting Depth: Ordered lists support up to 3 nested levels. Any additional depth beyond level 3 will be treated as level 3.
  • Counter Reset Behavior:
    • At level 1, starting a new ordered list will reset the counter.
    • At levels 2 and 3, the counter will continue from the previous item unless explicitly reset.
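
To check whether a given HTML input is affected by the nesting limit described above, the following standard-library sketch reports the level each ordered-list item would end up at once depth is capped at 3. It does not use html4docx and is not the library's implementation; the names OrderedListLevels and effective_levels are hypothetical, and only <ol> nesting is considered.

```python
from html.parser import HTMLParser

MAX_LEVEL = 3  # documented cap for ordered-list nesting


class OrderedListLevels(HTMLParser):
    """Record the capped nesting level of every <li> inside <ol> lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <ol> nesting depth in the HTML
        self.levels = []  # capped level of each list item, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.depth += 1
        elif tag == "li" and self.depth > 0:
            # Anything deeper than MAX_LEVEL is treated as MAX_LEVEL.
            self.levels.append(min(self.depth, MAX_LEVEL))

    def handle_endtag(self, tag):
        if tag == "ol" and self.depth > 0:
            self.depth -= 1


def effective_levels(html):
    """Return the capped nesting level of each ordered-list item."""
    parser = OrderedListLevels()
    parser.feed(html)
    parser.close()
    return parser.levels


html = (
    "<ol><li>a<ol><li>b<ol><li>c<ol><li>d</li></ol>"
    "</li></ol></li></ol></li></ol>"
)
print(effective_levels(html))  # [1, 2, 3, 3]
```

The fourth item is nested four levels deep in the HTML but is reported at level 3, which matches the behavior described in the first known issue.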

Project Guidelines

This project is primarily designed for compatibility with Microsoft Word, but it currently works well with LibreOffice and Google Docs, based on our testing. The goal is to maintain this cross-platform harmony while continuing to implement fixes and updates.

⚠️ However, please note that Microsoft Word is the priority. Bugs or issues specific to other editors (e.g., LibreOffice or Google Docs) may be considered, but fixing them is secondary to maintaining full compatibility with Word.

License

This project is licensed under the MIT License - see the LICENSE file for details
