@Libres-coder Libres-coder commented Oct 26, 2025

Summary

Implements optional merging of tables that span multiple pages. The feature is disabled by default to preserve backward compatibility.

Related Issue

#131

Solution

  • Added post-processing to merge Markdown and HTML tables from consecutive pages
  • Merges based on column count and header similarity (≥80% for Markdown)
  • Graceful fallback if merging fails
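
The merge rule described above can be sketched roughly as follows. This is only an illustration, not the code in process/table_merger.py: the function names (`parse_row`, `header_similarity`, `merge_markdown_tables`) are hypothetical, the similarity measure (`difflib.SequenceMatcher`, averaged per column) is an assumption, and the sketch assumes the continuation table on the next page repeats its header row.

```python
from difflib import SequenceMatcher


def parse_row(line):
    """Split one Markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line):
    """True for a header separator row such as '| --- | :-: |'."""
    cells = parse_row(line)
    return all(cell and "-" in cell and set(cell) <= {"-", ":"} for cell in cells)


def header_similarity(a, b):
    """Average per-column similarity of two header rows, in [0, 1]."""
    ratios = [SequenceMatcher(None, x.lower(), y.lower()).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios) if ratios else 0.0


def merge_markdown_tables(first, second, threshold=0.8):
    """Return the merged table text, or None if the tables should stay separate."""
    first_lines = [l for l in first.strip().splitlines() if l.strip()]
    second_lines = [l for l in second.strip().splitlines() if l.strip()]
    if len(first_lines) < 2 or len(second_lines) < 2:
        return None

    head_a = parse_row(first_lines[0])
    head_b = parse_row(second_lines[0])

    # Rule 1: column counts must match.
    if len(head_a) != len(head_b):
        return None
    # Rule 2: headers must be at least `threshold` similar (0.8 = 80%).
    if header_similarity(head_a, head_b) < threshold:
        return None

    # Drop the repeated header and separator rows from the continuation.
    body = second_lines[1:]
    if body and is_separator(body[0]):
        body = body[1:]
    return "\n".join(first_lines + body)
```

Returning `None` rather than raising lets a caller keep both tables unchanged whenever the check fails, which fits the fallback behaviour listed above.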

Changes

Modified:

  • config.py - Added ENABLE_TABLE_MERGE, ENABLE_HTML_TABLE_MERGE, TABLE_MERGE_LOG flags
  • run_dpsk_ocr_pdf.py - Integrated optional merge post-processing

Added:

  • process/table_merger.py - Table extraction, parsing, and merge logic

Features

  • Backward compatible (disabled by default)
  • No new dependencies
  • Supports Markdown and HTML tables
  • Error handling with fallback
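
One way to combine "disabled by default" with "error handling with fallback" is a thin wrapper around the merge step. The sketch below is an assumption about the general shape, not the PR's actual integration in run_dpsk_ocr_pdf.py; `merge_with_fallback`, `merge_fn`, and the logger name are hypothetical.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("table_merge")  # hypothetical logger name


def merge_with_fallback(
    pages: List[str],
    merge_fn: Callable[[List[str]], List[str]],
    enabled: bool = False,
) -> List[str]:
    """Run merge_fn over per-page outputs, returning the input unchanged on failure."""
    if not enabled:
        # Feature flag off: behave exactly as before the feature existed.
        return pages
    try:
        return merge_fn(pages)
    except Exception:
        # Any merge error is logged and the original pages are returned.
        logger.exception("Table merge failed; returning unmerged pages")
        return pages
```

With `enabled` left at its default of `False`, the wrapper is a no-op, so existing pipelines see identical output.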

Usage

Enable in config.py:

ENABLE_TABLE_MERGE = True

@Libres-coder Libres-coder marked this pull request as draft October 26, 2025 18:39
@minhnguyent546

Could you please support merging HTML tables also?

@Libres-coder Libres-coder marked this pull request as ready for review October 27, 2025 12:13
@Libres-coder
Author

> Could you please support merging HTML tables also?

Got it.

@Libres-coder
Author

@ImadSaddik @Ucas-HaoranWei PTAL, thanks.

@athrael-soju

This is a much needed feature. Any progress on this PR?

@Libres-coder
Author

@athrael-soju This feature has been implemented. Cross-page table merging is now available and is disabled by default to ensure backward compatibility. It supports both Markdown and HTML tables, with automatic fallback on errors.
To enable it, set the following in config.py:

ENABLE_TABLE_MERGE = True

@Libres-coder
Author

@athrael-soju I've tested it with both Markdown and HTML tables. Here's my local demo:

[image] [image]

@athrael-soju

@Libres-coder this is excellent, thanks for sharing!
