Skip to content

MAINT: Refactor _page.py #3339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

MAINT: Refactor _page.py #3339

wants to merge 7 commits into from

Conversation

MartinThoma
Copy link
Member

@MartinThoma MartinThoma commented Jun 29, 2025

This is an experiment to see how well Github Copilot works

The goal of this PR is to make the PageObject easier to understand and maintain.

For the Reviewer

@MartinThoma MartinThoma force-pushed the refactor-page-copilot branch 6 times, most recently from 993054c to 9f1cce2 Compare June 29, 2025 13:51
This is an experiment to see how well Github Copilot works
@MartinThoma MartinThoma force-pushed the refactor-page-copilot branch from 9f1cce2 to abb5130 Compare June 29, 2025 13:54
Copy link

codecov bot commented Jun 29, 2025

Codecov Report

Attention: Patch coverage is 95.72650% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.71%. Comparing base (af645a4) to head (347bdf4).

Files with missing lines Patch % Lines
pypdf/_text_extraction/_text_extractor.py 95.65% 7 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3339      +/-   ##
==========================================
- Coverage   96.73%   96.71%   -0.03%     
==========================================
  Files          54       54              
  Lines        9073     9096      +23     
  Branches     1676     1661      -15     
==========================================
+ Hits         8777     8797      +20     
- Misses        177      180       +3     
  Partials      119      119              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MartinThoma MartinThoma marked this pull request as ready for review June 29, 2025 14:21
Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please re-use the existing text extraction submodule instead of a new one. Additionally, we should review the coverage.

Just for interest: Did you do a complete review/diff of the changes to see whether anything was changed during refactoring without it being expected?

@MartinThoma
Copy link
Member Author

MartinThoma commented Jun 29, 2025

Just for interest: Did you do a complete review/diff of the changes to see whether anything was changed during refactoring without it being expected?

I did some smoke-tests and all parts I checked were identical.
(Except for the parts that I wanted to be changed, of course 😄 )

However, I did not check everything.

@MartinThoma
Copy link
Member Author

I was pleasently surprised how well agent mode of Github copilot worked for this. I essentially gave it #3010 (comment) (rephrased a little bit) and it did almost do all the work. After that I used ruff check / ruff format + fix some mypy issues and that was it :-)

@MartinThoma
Copy link
Member Author

Additionally, we should review the coverage.

The missing lines are not new. So I would expect that they are currently missing as well.

@MartinThoma
Copy link
Member Author

Please re-use the existing text extraction submodule instead of a new one.

I'll look into it 👍

@MartinThoma MartinThoma force-pushed the refactor-page-copilot branch from f47f1d9 to 1ed2e38 Compare June 29, 2025 19:52
@MartinThoma
Copy link
Member Author

@stefan6419846 I feel like this change is too hard to review. What do you think about me doing this in 2 (or more) PRs?

In a first PR, I would create pypdf/_text_extraction/_text_extractor.py: I would create the TextExtraction class with a basic initializer that gets called from PageObject._extract_text, but not refactor how _extract_text works internally. So most of the code there should be character-for-character identical.

@stefan6419846
Copy link
Collaborator

@MartinThoma The easier to review the code, the better - there surely are similar reasons why some of the larger recent PRs are still in my review queue.

@MartinThoma MartinThoma force-pushed the refactor-page-copilot branch 3 times, most recently from b85814e to f082a56 Compare July 4, 2025 20:19
@MartinThoma MartinThoma force-pushed the refactor-page-copilot branch from f082a56 to c93aadd Compare July 4, 2025 20:22
@stefan6419846
Copy link
Collaborator

@MartinThoma What is the current state of this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants