MAINT: Refactor _page.py #3339

MartinThoma · 2025-06-29T13:26:57Z

This is an experiment to see how well GitHub Copilot works.

The goal of this PR is to make the PageObject easier to understand and maintain.

For the Reviewer

I'm uncertain if the license header is necessary / correct. I think most (all?) of the text extraction part is written by @pubpub-zz
See Refactor regular text extraction into dedicated module #3010

codecov · 2025-06-29T14:03:15Z

Codecov Report

Attention: Patch coverage is 95.72650% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.71%. Comparing base (af645a4) to head (347bdf4).

Files with missing lines	Patch %	Lines
pypdf/_text_extraction/_text_extractor.py	95.65%	7 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3339      +/-   ##
==========================================
- Coverage   96.73%   96.71%   -0.03%     
==========================================
  Files          54       54              
  Lines        9073     9096      +23     
  Branches     1676     1661      -15     
==========================================
+ Hits         8777     8797      +20     
- Misses        177      180       +3     
  Partials      119      119

stefan6419846

Please re-use the existing text extraction submodule instead of a new one. Additionally, we should review the coverage.

Just for interest: Did you do a complete review/diff of the changes to see whether anything was changed during refactoring without it being expected?

MartinThoma · 2025-06-29T19:04:13Z

> Just for interest: Did you do a complete review/diff of the changes to see whether anything was changed during refactoring without it being expected?

I did some smoke tests, and all parts I checked were identical.
(Except for the parts that I wanted to be changed, of course 😄 )

However, I did not check everything.

MartinThoma · 2025-06-29T19:05:58Z

I was pleasantly surprised by how well the agent mode of GitHub Copilot worked for this. I essentially gave it #3010 (comment) (rephrased a little bit) and it did almost all the work. After that I ran ruff check / ruff format, fixed some mypy issues, and that was it :-)

MartinThoma · 2025-06-29T19:11:05Z

> Additionally, we should review the coverage.

The missing lines are not new, so I would expect them to be uncovered on main as well.

MartinThoma · 2025-06-29T19:11:24Z

> Please re-use the existing text extraction submodule instead of a new one.

I'll look into it 👍

MartinThoma · 2025-06-29T20:01:01Z

@stefan6419846 I feel like this change is too hard to review. What do you think about me doing this in 2 (or more) PRs?

In a first PR, I would create pypdf/_text_extraction/_text_extractor.py with a TextExtraction class whose basic initializer gets called from PageObject._extract_text, without refactoring how _extract_text works internally. Most of the code there should then be character-for-character identical.
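
A minimal sketch of what that first step could look like, assuming a class that only stores its inputs while the page method keeps its existing body. Every name below (`TextExtraction` as written here, `PageSketch`, the `orientations` parameter, the placeholder body) is an illustrative stand-in and not pypdf's actual code:

```python
from typing import Tuple


class TextExtraction:
    """Hypothetical holder for extraction state; not pypdf's real class."""

    def __init__(self, orientations: Tuple[int, ...] = (0, 90, 180, 270)) -> None:
        # Step 1 only stores the inputs. No extraction logic moves here yet.
        self.orientations = orientations
        self.text = ""


class PageSketch:
    """Stand-in for PageObject, reduced to the one method under discussion."""

    def __init__(self, raw_content: str) -> None:
        self.raw_content = raw_content

    def _extract_text(self, orientations: Tuple[int, ...] = (0, 90, 180, 270)) -> str:
        # The only new line in step 1: construct the extractor object.
        extractor = TextExtraction(orientations)
        # Placeholder for the existing body, which would stay unchanged in
        # this step and only write its result through the new object.
        extractor.text = self.raw_content.strip()
        return extractor.text


page = PageSketch("  Hello PDF  ")
result = page._extract_text()
```

Because the method body stays as it is, a reviewer could diff it line by line against the old version; moving the internals into the class would be left to follow-up PRs.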

stefan6419846 · 2025-06-30T06:43:29Z

@MartinThoma The easier the code is to review, the better - there surely are similar reasons why some of the larger recent PRs are still in my review queue.

stefan6419846 · 2025-07-22T19:38:03Z

@MartinThoma What is the current state of this PR?

MartinThoma force-pushed the refactor-page-copilot branch 6 times, most recently from 993054c to 9f1cce2 June 29, 2025 13:51

MAINT: Refactor _page.py

abb5130

This is an experiment to see how well Github Copilot works

MartinThoma force-pushed the refactor-page-copilot branch from 9f1cce2 to abb5130 June 29, 2025 13:54

MartinThoma marked this pull request as ready for review June 29, 2025 14:21

stefan6419846 requested changes Jun 29, 2025

MartinThoma added 2 commits June 29, 2025 21:39

Remove code duplication

f4e9285

Move _text_extractor into _text_extraction

1ed2e38

MartinThoma force-pushed the refactor-page-copilot branch from f47f1d9 to 1ed2e38 June 29, 2025 19:52

MartinThoma mentioned this pull request Jun 30, 2025

MAINT: Move code from _page.py to _text_extraction #3343

Merged

MartinThoma added 3 commits July 1, 2025 16:43

Merge branch 'main' into refactor-page-copilot

347bdf4

Merge branch 'main' into refactor-page-copilot

9608900

Reduce diff

2c643d8

MartinThoma force-pushed the refactor-page-copilot branch 3 times, most recently from b85814e to f082a56 July 4, 2025 20:19

Reduce diff

c93aadd

MartinThoma force-pushed the refactor-page-copilot branch from f082a56 to c93aadd July 4, 2025 20:22
