Convert HTML to DOCX. This project is a fork of the discontinued pqzx/html2docx.
pip install html-for-docx
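Add formatted HTML to an existing Docx
from docx import Document
from html4docx import HtmlToDocx
# add_html_to_document expects a python-docx Document, so open the existing file first
document = Document(filename_docx)
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save(filename_docx)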
You can also use python-docx to manipulate the file. Here is an example:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
new_parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
new_parser.add_html_to_document(html_string, document)
document.save('your_file_name.docx')
from html4docx import HtmlToDocx
new_parser = HtmlToDocx()
new_parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
new_parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')
from html4docx import HtmlToDocx
new_parser = HtmlToDocx()
docx = new_parser.parse_html_string(input_html_file_string)
Tables are not styled by default. Set the table_style attribute on the parser before converting the HTML to choose a table style. The same style is applied to all tables.
from html4docx import HtmlToDocx
new_parser = HtmlToDocx()
new_parser.table_style = 'Light Shading Accent 4'
docx = new_parser.parse_html_string(input_html_file_string)
To add borders to tables, use the Table Grid style:
new_parser.table_style = 'Table Grid'
All table styles we support can be found here.
You're able to read or set docx metadata:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
new_parser = HtmlToDocx()
new_parser.set_initial_attrs(document)
metadata = new_parser.metadata
# You can get metadata as dict
metadata_json = metadata.get_metadata()
print(metadata_json['author']) # Jane
# or just print all metadata if you want
metadata.get_metadata(print_result=True)
# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')
You can find all available metadata attributes here.
My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.
Fixes
- Fix table_style not working | Dfop02 from Issue
- Handle missing run for leading br tag | dashingdove from PR
- Fix base64 images | djplaner from Issue
- Handle img tag without src attribute | johnjor from PR
- Fix bug when any style has !important | Dfop02
- Fix 'style lookup by style_id is deprecated.' | Dfop02
- Fix background-color not working | Dfop02
- Fix crashes when img or bookmark is created without paragraph | Dfop02
- Fix Ordered and Unordered Lists | TaylorN15 from PR
New Features
- Add Width/Height style to images | maifeeulasad from PR
- Support px, cm, pt, in, rem, em, mm, pc and % units for styles | Dfop02
- Improve performance on large tables | dashingdove from PR
- Support for HTML Pagination | Evilran from PR
- Support Table style | Evilran from PR
- Support alternative encoding | HebaElwazzan from PR
- Support colors by name | Dfop02
- Support font-size given as a text keyword, e.g. small, medium, etc. | Dfop02
- Support for internal links (Anchor) | Dfop02
- Support for rowspan and colspan in tables | Dfop02 from Issue
- Support for 'vertical-align' in table cells | Dfop02
- Support for metadata | Dfop02
- Add support for table cell styles (border, background-color, width, height, margin) | Dfop02
- Support inline images within the same paragraph | Dfop02
- Refactor tests to be more consistent and rely less on 'human validation' | Dfop02
- Maximum Nesting Depth: Ordered lists support up to 3 nested levels. Any additional depth beyond level 3 will be treated as level 3.
- Counter Reset Behavior:
  - At level 1, starting a new ordered list will reset the counter.
  - At levels 2 and 3, the counter will continue from the previous item unless explicitly reset.
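The numbering rules above can be illustrated with a small toy model. This is only a sketch of one reading of those rules, not the library's implementation; the class `OrderedListCounter` and its methods are hypothetical names introduced for this example.

```python
MAX_LEVEL = 3  # deeper nesting is treated as level 3


class OrderedListCounter:
    """Toy model of the ordered-list rules (hypothetical, not library code)."""

    def __init__(self):
        self.counters = {level: 0 for level in range(1, MAX_LEVEL + 1)}

    def _clamp(self, depth):
        # Any depth beyond level 3 is handled as level 3
        return min(depth, MAX_LEVEL)

    def start_list(self, depth):
        level = self._clamp(depth)
        if level == 1:
            # Only a new top-level list resets its counter automatically
            self.counters[1] = 0
        return level

    def reset(self, depth):
        # Explicit reset, needed at levels 2 and 3
        self.counters[self._clamp(depth)] = 0

    def next_item(self, depth):
        level = self._clamp(depth)
        self.counters[level] += 1
        return level, self.counters[level]


counter = OrderedListCounter()

counter.start_list(1)
first = counter.next_item(1)            # (1, 1)

counter.start_list(2)
nested_a = counter.next_item(2)         # (2, 1)

counter.start_list(2)                   # second nested list, no reset
nested_b = counter.next_item(2)         # (2, 2): numbering continues

counter.reset(2)                        # explicit reset
counter.start_list(2)
nested_c = counter.next_item(2)         # (2, 1)

counter.start_list(5)                   # depth 5 is clamped to level 3
deep = counter.next_item(5)             # (3, 1)

counter.start_list(1)                   # new top-level list resets
restarted = counter.next_item(1)        # (1, 1)
```

In practice this means a nested list that should restart at 1 has to reset its numbering explicitly, while top-level lists restart on their own.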
This project is primarily designed for compatibility with Microsoft Word, but it currently works well with LibreOffice and Google Docs, based on our testing. The goal is to maintain this cross-platform harmony while continuing to implement fixes and updates.
⚠️ However, please note that Microsoft Word is the priority. Bugs or issues specific to other editors (e.g., LibreOffice or Google Docs) may be considered, but fixing them is secondary to maintaining full compatibility with Word.
This project is licensed under the MIT License. See the LICENSE file for details.