-
Notifications
You must be signed in to change notification settings - Fork 1.5k
ENH: Add all font metrics for base 14 Type 1 PDF fonts. #3363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
[about failing tests] EDIT 2 The essentials in my view: having more complete font metrics available so that we can properly generate right-aligned and centered output. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3363 +/- ##
=======================================
Coverage 96.85% 96.86%
=======================================
Files 54 55 +1
Lines 9163 9180 +17
Branches 1676 1677 +1
=======================================
+ Hits 8875 8892 +17
Misses 172 172
Partials 116 116 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
pypdf/_codecs/base14_fontmetrics.py
Outdated
|
||
|
||
FONT_METRICS : Dict[str, Tuple[Dict[str, object], Dict[Any, float]]] = { | ||
"Courier": ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this complex typing and unclear indexing when using the data, I would suggest to use actual data containers for each font.
Example container:
@dataclass(frozen=True)
class Font:
name: str
family: str
weight: str
ascent: float
descent: float
cap_height: float
x_height: float
italic_angle: float
flags: int
bbox: Tuple[float, float, float, float]
character_widths: Dict[str, int]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is roughly what we have over here: https://github.com/py-pdf/pypdf/blob/main/pypdf/_text_extraction/_layout_mode/_font.py
Do I try to replace this? Or do I add a separate font.py in pypdf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I try to replace this? Or do I add a separate font.py in pypdf?
Ideally, we avoid duplication if possible.
Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.
I see nothing which would argue against it, especially considering future maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, for future purposes, it would be nice to extend the existing Font dataclass so that it includes:
- name: str
- family: str
- weight: str
- ascent: float
- descent: float
- cap_height: float
- x_height: float
- italic_angle: float
- flags: int
- bbox: Tuple[float, float, float, float]
Some of this might already be included in
- char_map (dict): character map
- font_dictionary (dict): font dictionary
To be honest, I have no idea what a dataclass really does, and I wouldn't know how to extend the existing class. So, I propose that I redo this PR with the FONT_METRICS as I've reproduced them locally, and with the script that I've adapted, but without the addition of a new Font dataclass, or changes to the existing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be nice, but I'll see what I can do myself as well. I've looked a bit further in the code and in the spec. The existing font class has width_map
, which corresponds with the character widths in the FONT_METRICS file. It also includes font_dictionary
, which, as far as I can tell, just is a pdf font dictionary as you would find it in the page resources of a PDF file.
According to the spec, a font dictionary must usually include a set of font descriptors:
Except for Type 0 fonts, Type 3 fonts in non-tagged PDF documents, and certain standard Type 1 fonts,
every font dictionary shall contain a subsidiary dictionary, the font descriptor, containing font-wide
metrics and other attributes of the font; see 9.8, "Font descriptors".
These font descriptors include basically all other information in the FONT_METRICS class. So, the point is that the existing class will usually include the font descriptors for a font, but not for the 14 core fonts. So, one would actually need a separate dataclass FontDescriptors that reads the first part of the FONT_METRICS, or a dataclass FontDictionary, that includes both the FontDescriptors and the character widths. That way, you could parse the FONT_METRICS using the FontDictionary dataclass and then use the resulting dictionary to instantiate the Font.
Still thinking :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking at pdfminer.six's PDFFont class, https://github.com/pdfminer/pdfminer.six/blob/51683b2528e2aa685dd8b9e61f6ccf9f76a59a62/pdfminer/pdffont.py#L869 , and the more I think about it, the current Font class used by pypdf for text extraction is actually a specific case of what could be a more generic font class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I finally get it. The pypdf Font class has no specific attributes (?) as described in the font descriptor. In most cases, the font descriptor is a dict within a font dictionary, meaning that the information contained in that dictionary would already be available within the pypdf font class. This is not the case, though, for the 14 core fonts. Instead, the FONT_METRICS that I took from pdfminer.six actually are a fallback for when those metrics aren't available. The same holds for the font widths.
So, the most elegant solution would be to:
- Move the Font dataclass to the top pypdf directory (one patch)
- Add attributes for the information one would expect in a font descriptor (one patch)
- Fill the font descriptor information from the font dictionary or from FONT_METRICS (one patch)
- Parse widths from FONT_METRICS (if absent; one patch)
This way, the typing can all be done in the Font dataclass.
I must note that, for the other patches I'm preparing (text alignment and text wrapping) I'll initially only need the font widths, but to begin mimicking how pdftk generates a text appearance stream would require the detailed information from the font metrics.
How does this plan sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that the Font class should be public API if you mean this. For the other cases, this is hard to evaluate from my side without seeing it implemented, thus I cannot provide a proper/reliable statement about this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, things begin taking more and more sense. First, there actually is quite some code duplication between pypdf/_cmap.py
and pypdf/_text_extraction/_layout_mode/_font.py
. Second, in both cases the font metrics added in this patch would enable removing / simplifying some existing functionality and adding font descriptor information. For my own purposes, however, I first and foremost only need the font widths. So, I simplified this PR and limited it to the point that build_font_width_map
in _cmap.py reads font widths from the new font metrics information.
Specifically, I removed the huge typing and now limited it to one cast in _cmap.py:
font_width_map = cast(dict[str, float], FONT_METRICS[font_name][1])
Some points for further improvement might be:
- Removal of
_default_fonts_space_width
from _cmap.py, since that information is already present in the new font metrics, and it is incomplete as well - Turning the type of font_width_map within _cmap.py from Dict[Any, float] tot Dict[str, int] (or maybe Dict[Union[str, int], int]
- Removing
_font_widths.py
, which contains incomplete information and does not have license information (for this, I have patches, but I removed it from this PR for now) - Addition of a font descriptor dataclass to _cmap.py
I'd be happy to provide additional PRs for any of the above. For now however, I would really like to concentrate on this PR specifically, since it is a prerequisite for generating a text appearance stream with proper text wrapping and with proper right aligned or centered text.
3f6c20d
to
e2658b8
Compare
This patch adds a new file with the font metrics for the 14 core Type 1 pdf fonts. The file was inspired by the pdfminer.six project, where a very similar one is called fontmetrics.py. The information itself is generated by a separate file added with this patch: resources/get_core_fontmetrics.py. The PDF specification expects a pdf reader to include these font metrics. For the purposes of pypdf, these metrics are needed to build build_font_width_map font widths when a font is not embedded.
Version 1.7 of the PDF reference lists various alternatives names as accepted for the 14 core fonts, such as Arial for Helvetica and CourierNew for Courier. Add these alternative names to the font metrics.
This patch reads the character widths for the core 14 Type 1 PDF fonts from the _codecs/core_fontmetrics.py file.
Thanks for the further changes. There still are some further changes I would like to see:
|
No problem!
I think about 20%? The resulting formatting is still very like pdfminer (although their original script did not produce unicode codes but ints). EDIT:
This is how it is in the original AFM files:
and
So, the same copyright is in two locations, one after Comment and one after Notice. Only the second, however, includes mention of the trademark. And do notice the lack of a space between "Reserved." and "Helvetica". So, with all that, you indeed get a rather strange copyright detection logic!
OK. Then I think I need more information about what that would entail. What I think it does:
Is that correct? If not, perhaps it might be easier to reach out via discord, where I use the same user name. |
In the first step, it would be sufficient for me to just create a new dataclass with the properties outlined previously. Merging it with the existing class can always happen later, but I would like to see real structured data for now. We basically need to define the new dataclass and adapt the generator script to call the dataclass constructor accordingly. Regarding the pdfminer.six code and the copyrights, I will need to have another look at it in the next days. |
I'm afraid that means I'm officially out of my depth, I think I just lack the prerequisite knowledge to understand what you intend the generator script to do. If possible, please contact me on Discord one of these days so that we can discuss. And as always, thanks for your comments! |
P.S., this is what Google Gemini thinks:
|
I have just used your script and adapted it to show what I mean: https://gist.github.com/stefan6419846/3d368b26ee5260a7886657909f26ca15 The This will generate the code to create instances of the dataclass shown in #3363 (comment) together with the copyrights. The only things are are currently missing are the dataclass definition itself and the necessary imports as well as the mapping in the footer, but this should be easy enough to add to the script. By the way: We should probably keep the script outside of the original code, but run some linting and testing on it nevertheless to ensure it matches our standards and does not break for some reason. |
@stefan6419846 OK, now I understand better! Basically, you want the script to immediately produce the specific Font instances, instead of ending up in an intermediate form. With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...? At least, that seems to work here. I understand also, that using this, I can do something like:
and then do stuff like:
One question; Why do you only add 255 widths? The AFMs contain about 314 widths. Second; so, this really collects a lot of information in the Font class. In my local patches, I need the widths (and I expect also some other metrics) for text wrapping a text stream when flattening an annotation. With my original patches, I changed build_font_width_map to include the character widths that are now in the Font instances in FONT_METRICS. However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this? I see two ways forward:
EDIT
|
Yes, although for the repository and its CI/CD, we might need to extend the current configuration. I am open to help with this if desired/required.
I used a mix of the original pdfminer.six code and your code for writing my script. The limitation is from pdfminer.six. This seems to mostly eliminate the characters with the code
I would split this into two separate topics. This PR should focus on the new container and retrieving the data from the AFM files to use them where appropriate. Possibly unifying this with handling embedded fonts could/should be a separate step in a PR afterwards. This way, we avoid too large PRs which simplifies reviewing the changes from my side. |
This patch includes font metrics for the standard 14 fonts. This is intended to be useful for generating a text appearance stream, especially if you want to take into account right-aligned or centred text. (I have some other patches that include this as well as text wrapping.)
Note that some of this information was already included in _font_widths.py, but that information is incomplete. I thought it better to copy this information from pdfminer.six and be able to potentially benefit from their work later on, than to improve on what already was included here.
The first three patches introduce the new functionality. The last three patches are for moving the Font class to the new font metrics information and removing the old _font_widths.py file.
This is what the spec has to say about it: