MAINT: Move code from _page.py to _text_extraction #3343

MartinThoma · 2025-06-30T10:53:06Z

The goal of this PR is to increase maintainability by reducing the size of _page.py. This helps because:

The risk of merge conflicts decreases
It becomes easier to see which kind of change is done

It also prepares for a bigger, more complex refactoring (see #3339).

For the Reviewer

This PR is almost entirely copy-paste (code is moved to a new location, with no additions or changes beyond the following):

The license preamble was added
The new TextExtraction class was added and is used by _page.py
_get_actual_font_widths and _handle_tj were only moved
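
For readers unfamiliar with this kind of move, here is a minimal sketch of the pattern: helpers leave the page class, land in a dedicated class, and the page delegates to an instance of it. This is not the actual pypdf code. The `Page` class, the `handle_text_operand` and `result` methods, and their bodies are hypothetical stand-ins chosen only for illustration; only the class name `TextExtraction` comes from this PR.

```python
from __future__ import annotations

# Hypothetical sketch of a move-and-delegate refactoring.
# None of these method names or bodies are taken from pypdf.


class TextExtraction:
    """Owns text-extraction state and helpers that used to live on the page class."""

    def __init__(self) -> None:
        self.parts: list[str] = []

    def handle_text_operand(self, operand: str) -> None:
        # Stand-in for a moved helper: collect one text fragment.
        self.parts.append(operand)

    def result(self) -> str:
        # Join the collected fragments into the final text.
        return "".join(self.parts)


class Page:
    """Stand-in for the page class; it delegates instead of owning the logic."""

    def __init__(self, operands: list[str]) -> None:
        self._operands = operands

    def extract_text(self) -> str:
        # The public method keeps its signature; only the helper location changed.
        extraction = TextExtraction()
        for operand in self._operands:
            extraction.handle_text_operand(operand)
        return extraction.result()


text = Page(["Hello", ", ", "world"]).extract_text()
print(text)
```

Because the public entry point is unchanged in a move like this, the existing tests should continue to exercise the relocated code without modification.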

codecov · 2025-06-30T11:02:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.73%. Comparing base (dfadde5) to head (aae5d9f).
Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3343   +/-   ##
=======================================
  Coverage   96.73%   96.73%           
=======================================
  Files          53       54    +1     
  Lines        9060     9069    +9     
  Branches     1676     1676           
=======================================
+ Hits         8764     8773    +9     
  Misses        177      177           
  Partials      119      119

Copilot

Pull Request Overview

This PR moves code related to PDF text extraction from _page.py to a dedicated file in _text_extraction to improve maintainability and reduce merge conflicts. Key changes include:

Moving the implementations of _get_actual_font_widths and _handle_tj into a new TextExtraction class.
Adding a license preamble and creating a new file (pypdf/_text_extraction/_text_extractor.py) to house the extracted methods.
Updating _page.py to remove duplicate logic and use the new TextExtraction instance.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
pypdf/_text_extraction/_text_extractor.py	New file with the TextExtraction class and moved text extraction logic.
pypdf/_page.py	Removed duplicate extraction functions and updated calls to use the new TextExtraction class.

stefan6419846

Just did a quick check myself and the diff looks correct.

MAINT: Move code from _page.py to _text_extraction

0214f87

MartinThoma force-pushed the refactor-page-prep branch from 76ca472 to 0214f87 on June 30, 2025 10:54

MartinThoma requested a review from Copilot June 30, 2025 12:05

MartinThoma marked this pull request as ready for review June 30, 2025 12:05

Copilot AI reviewed Jun 30, 2025

Merge branch 'main' into refactor-page-prep

aae5d9f

stefan6419846 approved these changes Jul 1, 2025

stefan6419846 merged commit af645a4 into main Jul 1, 2025
16 checks passed

stefan6419846 deleted the refactor-page-prep branch July 1, 2025 11:52

larsga pushed a commit to larsga/pypdf that referenced this pull request Jul 21, 2025

MAINT: Move code from _page.py to _text_extraction (py-pdf#3343)

b0d6324
