Skip to content

ENH: Add all font metrics for base 14 Type 1 PDF fonts. #3363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

PJBrs
Copy link
Contributor

@PJBrs PJBrs commented Jul 7, 2025

This patch includes font metrics for the standard 14 fonts. This is intended to be useful for generating a text appearance stream, especially if you want to take into account right-aligned or centred text. (I have some other patches that include this as well as text wrapping.)

Note that some of this information was already included in _font_widths.py, but that information is incomplete. I thought it better to copy this information from pdfminer.six and be able to potentially benefit from their work later on, than to improve on what already was included here.

The first three patches introduce the new functionality. The last three patches are for moving the Font class to the new font metrics information and removing the old _font_widths.py file.

This is what the spec has to say about it:

9.6.2.2 Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)

The PostScript language names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows:
Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats,
Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, CourierBoldOblique.

In PDF 1.0 to PDF 1.7, the FirstChar, LastChar, Widths and FontDescriptor (see Table 109: Entries in
a Type 1 font dictionary") were optional in Type 1 font dictionaries for the standard 14 fonts. PDF
processors supporting PDF 1.0 to PDF 1.7 files shall have these fonts, or their font metrics and suitable
substitution fonts, available.

These fonts, or their font metrics and suitable substitution fonts, shall be available to the PDF processor.

@PJBrs PJBrs changed the title ENH / ROB: Add all font metrics for base 14 Type 1 PDF fonts. ENH: Add all font metrics for base 14 Type 1 PDF fonts. Jul 7, 2025
@PJBrs
Copy link
Contributor Author

PJBrs commented Jul 7, 2025

[about failing tests]
EDIT
I tested on the wrong branch and will investigate.

EDIT 2
Switching over the Font class to the new font metrics causes space changes in the extractor output. Otherwise the output is the same. I don't know this code, but I still assume that this is for the better. I can also drop the last three patches. But I'd like to hear from a reviewer to see what a good way forward would be.

The essentials in my view: having more complete font metrics available so that we can properly generate right-aligned and centered output.

Copy link

codecov bot commented Jul 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.86%. Comparing base (d76976b) to head (8cb825d).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3363   +/-   ##
=======================================
  Coverage   96.85%   96.86%           
=======================================
  Files          54       55    +1     
  Lines        9163     9180   +17     
  Branches     1676     1677    +1     
=======================================
+ Hits         8875     8892   +17     
  Misses        172      172           
  Partials      116      116           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.



FONT_METRICS : Dict[str, Tuple[Dict[str, object], Dict[Any, float]]] = {
"Courier": (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this complex typing and unclear indexing when using the data, I would suggest to use actual data containers for each font.

Example container:

@dataclass(frozen=True)
class Font:
    name: str
    family: str
    weight: str

    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    character_widths: Dict[str, int]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is roughly what we have over here: https://github.com/py-pdf/pypdf/blob/main/pypdf/_text_extraction/_layout_mode/_font.py

Do I try to replace this? Or do I add a separate font.py in pypdf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I try to replace this? Or do I add a separate font.py in pypdf?

Ideally, we avoid duplication if possible.

Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

I see nothing which would argue against it, especially considering future maintainability.

Copy link
Contributor Author

@PJBrs PJBrs Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, for future purposes, it would be nice to extend the existing Font dataclass so that it includes:

  • name: str
  • family: str
  • weight: str
  • ascent: float
  • descent: float
  • cap_height: float
  • x_height: float
  • italic_angle: float
  • flags: int
  • bbox: Tuple[float, float, float, float]

Some of this might already be included in

  • char_map (dict): character map
  • font_dictionary (dict): font dictionary

To be honest, I have no idea what a dataclass really does, and I wouldn't know how to extend the existing class. So, I propose that I redo this PR with the FONT_METRICS as I've reproduced them locally, and with the script that I've adapted, but without the addition of a new Font dataclass, or changes to the existing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be nice, but I'll see what I can do myself as well. I've looked a bit further in the code and in the spec. The existing font class has width_map, which corresponds with the character widths in the FONT_METRICS file. It also includes font_dictionary, which, as far as I can tell, just is a pdf font dictionary as you would find it in the page resources of a PDF file.

According to the spec, a font dictionary must usually include a set of font descriptors:

Except for Type 0 fonts, Type 3 fonts in non-tagged PDF documents, and certain standard Type 1 fonts,
every font dictionary shall contain a subsidiary dictionary, the font descriptor, containing font-wide
metrics and other attributes of the font; see 9.8, "Font descriptors".

These font descriptors include basically all other information in the FONT_METRICS class. So, the point is that the existing class will usually include the font descriptors for a font, but not for the 14 core fonts. So, one would actually need a separate dataclass FontDescriptors that reads the first part of the FONT_METRICS, or a dataclass FontDictionary, that includes both the FontDescriptors and the character widths. That way, you could parse the FONT_METRICS using the FontDictionary dataclass and then use the resulting dictionary to instantiate the Font.

Still thinking :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking at pdfminer.six's PDFFont class, https://github.com/pdfminer/pdfminer.six/blob/51683b2528e2aa685dd8b9e61f6ccf9f76a59a62/pdfminer/pdffont.py#L869 , and the more I think about it, the current Font class used by pypdf for text extraction is actually a specific case of what could be a more generic font class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I finally get it. The pypdf Font class has no specific attributes (?) as described in the font descriptor. In most cases, the font descriptor is a dict within a font dictionary, meaning that the information contained in that dictionary would already be available within the pypdf font class. This is not the case, though, for the 14 core fonts. Instead, the FONT_METRICS that I took from pdfminer.six actually are a fallback for when those metrics aren't available. The same holds for the font widths.

So, the most elegant solution would be to:

  • Move the Font dataclass to the top pypdf directory (one patch)
  • Add attributes for the information one would expect in a font descriptor (one patch)
  • Fill the font descriptor information from the font dictionary or from FONT_METRICS (one patch)
  • Parse widths from FONT_METRICS (if absent; one patch)

This way, the typing can all be done in the Font dataclass.

I must note that, for the other patches I'm preparing (text alignment and text wrapping) I'll initially only need the font widths, but to begin mimicking how pdftk generates a text appearance stream would require the detailed information from the font metrics.

How does this plan sound?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that the Font class should be public API if you mean this. For the other cases, this is hard to evaluate from my side without seeing it implemented, thus I cannot provide a proper/reliable statement about this here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, things begin taking more and more sense. First, there actually is quite some code duplication between pypdf/_cmap.py and pypdf/_text_extraction/_layout_mode/_font.py. Second, in both cases the font metrics added in this patch would enable removing / simplifying some existing functionality and adding font descriptor information. For my own purposes, however, I first and foremost only need the font widths. So, I simplified this PR and limited it to the point that build_font_width_map in _cmap.py reads font widths from the new font metrics information.

Specifically, I removed the huge typing and now limited it to one cast in _cmap.py:
font_width_map = cast(dict[str, float], FONT_METRICS[font_name][1])

Some points for further improvement might be:

  • Removal of _default_fonts_space_width from _cmap.py, since that information is already present in the new font metrics, and it is incomplete as well
  • Turning the type of font_width_map within _cmap.py from Dict[Any, float] tot Dict[str, int] (or maybe Dict[Union[str, int], int]
  • Removing _font_widths.py, which contains incomplete information and does not have license information (for this, I have patches, but I removed it from this PR for now)
  • Addition of a font descriptor dataclass to _cmap.py

I'd be happy to provide additional PRs for any of the above. For now however, I would really like to concentrate on this PR specifically, since it is a prerequisite for generating a text appearance stream with proper text wrapping and with proper right aligned or centered text.

@PJBrs PJBrs marked this pull request as draft July 8, 2025 16:01
@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 3f6c20d to e2658b8 Compare July 12, 2025 15:34
PJBrs added 4 commits July 12, 2025 18:15
This patch adds a new file with the font metrics
for the 14 core Type 1 pdf fonts. The file was
inspired by the pdfminer.six project, where a
very similar one is called fontmetrics.py.
The information itself is generated by a
separate file added with this patch:
resources/get_core_fontmetrics.py.

The PDF specification expects a pdf reader to
include these font metrics. For the purposes
of pypdf, these metrics are needed to build
build_font_width_map font widths when a font is
not embedded.
Version 1.7 of the PDF reference lists various alternatives
names as accepted for the 14 core fonts, such as Arial for
Helvetica and CourierNew for Courier. Add these alternative
names to the font metrics.
This patch reads the character widths for the core 14
Type 1 PDF fonts from the _codecs/core_fontmetrics.py
file.
@PJBrs PJBrs marked this pull request as ready for review July 12, 2025 16:25
@stefan6419846
Copy link
Collaborator

Thanks for the further changes. There still are some further changes I would like to see:

  • Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.
  • To which extent does the generator script re-use code from pdfminer.six?
  • Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?
  • Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

@PJBrs
Copy link
Contributor Author

PJBrs commented Jul 13, 2025

Thanks for the further changes. There still are some further changes I would like to see:

  • Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.

No problem!

  • To which extent does the generator script re-use code from pdfminer.six?

I think about 20%? The resulting formatting is still very like pdfminer (although their original script did not produce unicode codes but ints). EDIT:
grep -F -x -f /usr/lib64/python3.9/site-packages/pdfminer/fontmetrics.py resources/get_core_fontmetrics.py (excluding lines with only hashes or quotation marks):

            f = line.strip().split(" ")
            if not f:
                continue
            k = f[0]
            if k == "FontName":
                fontname = f[1]
                props = {"FontName": fontname, "Flags": 0}
            elif k == "C":
            elif k in ("CapHeight", "XHeight", "ItalicAngle", "Ascender", "Descender"):
                k = {"Ascender": "Ascent", "Descender": "Descent"}.get(k, k)
                props[k] = float(f[1])
            elif k in ("FontName", "FamilyName", "Weight"):
                k = {"FamilyName": "FontFamily", "Weight": "FontWeight"}.get(k, k)
                props[k] = f[1]
            elif k == "IsFixedPitch":
                if f[1].lower() == "true":
                    props["Flags"] = 64
            elif k == "FontBBox":
                props[k] = tuple(map(float, f[1:5]))

  • Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?

This is how it is in the original AFM files:

Comment Copyright (c) 1985, 1987, 1989, 1990, 1997 Adobe Systems Incorporated. All Rights Reserved.

and

Notice Copyright (c) 1985, 1987, 1989, 1990, 1997 Adobe Systems Incorporated. All Rights Reserved.Helvetica is a trademark of Linotype-Hell AG and/or its subsidiaries.

So, the same copyright is in two locations, one after Comment and one after Notice. Only the second, however, includes mention of the trademark. And do notice the lack of a space between "Reserved." and "Helvetica". So, with all that, you indeed get a rather strange copyright detection logic!

  • Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

OK. Then I think I need more information about what that would entail. What I think it does:

  • Keep all information as it is now
  • For the widths information, this can be used as I propose using it now, in _cmap.build_font_width_map
  • Add a @dataclass font_descriptor to _cmap.py that can initialise itself using the information in core_fontmetrics.py
  • Ideally, such a dataclass could actually use either the Type 1 font information from core_fontmetrics, or an existing /Descriptors dict within an embedded /Font resource. But I would need an easy test solution for that case (ideally, a form that doesn't use one of the core fonts).

Is that correct?

If not, perhaps it might be easier to reach out via discord, where I use the same user name.

@stefan6419846
Copy link
Collaborator

In the first step, it would be sufficient for me to just create a new dataclass with the properties outlined previously. Merging it with the existing class can always happen later, but I would like to see real structured data for now. We basically need to define the new dataclass and adapt the generator script to call the dataclass constructor accordingly.

Regarding the pdfminer.six code and the copyrights, I will need to have another look at it in the next days.

@PJBrs
Copy link
Contributor Author

PJBrs commented Jul 13, 2025

I'm afraid that means I'm officially out of my depth, I think I just lack the prerequisite knowledge to understand what you intend the generator script to do. If possible, please contact me on Discord one of these days so that we can discuss.

And as always, thanks for your comments!

@PJBrs PJBrs marked this pull request as draft July 13, 2025 13:12
@PJBrs
Copy link
Contributor Author

PJBrs commented Jul 13, 2025

P.S., this is what Google Gemini thinks:

import re
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Font:
    """
    Dataclass to store parsed information from a Type 1 font AFM file.
    """
    name: str
    family: str
    weight: str

    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    character_widths: Dict[str, int]

def parse_afm_file(afm_content: str) -> Font:
    """
    Parses the content of a Type 1 font AFM file and returns a Font dataclass instance.

    Args:
        afm_content: A string containing the full content of the AFM file.

    Returns:
        A Font dataclass instance populated with the parsed data.

    Raises:
        ValueError: If essential font properties are missing or malformed in the AFM content.
    """
    font_properties = {}
    character_widths = {}
    in_char_metrics_section = False

    lines = afm_content.splitlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check for start/end of character metrics section
        if line == "StartCharMetrics":
            in_char_metrics_section = True
            continue
        elif line == "EndCharMetrics":
            in_char_metrics_section = False
            continue

        if in_char_metrics_section:
            # Parse character metrics: C <char_code>; WX <width_x>; N <char_name>; ...
            match = re.match(r"C\s+\d+;\s+WX\s+([\d.]+);\s+N\s+([a-zA-Z0-9_.]+);", line)
            if match:
                width = int(float(match.group(1))) # Widths are typically integers in AFM
                char_name = match.group(2)
                character_widths[char_name] = width
        else:
            # Parse general font properties
            parts = line.split(' ', 1) # Split only on the first space
            if len(parts) == 2:
                key, value = parts[0], parts[1]
                font_properties[key] = value

    # Extract and convert properties, handling potential missing values
    try:
        name = font_properties.get("FontName", "Unknown")
        family = font_properties.get("FamilyName", "Unknown")
        weight = font_properties.get("Weight", "Unknown")

        ascent = float(font_properties.get("Ascender", 0.0))
        descent = float(font_properties.get("Descender", 0.0))
        cap_height = float(font_properties.get("CapHeight", 0.0))
        x_height = float(font_properties.get("XHeight", 0.0))
        italic_angle = float(font_properties.get("ItalicAngle", 0.0))

        # Calculate flags: bit 0 is set if IsFixedPitch is true
        is_fixed_pitch = font_properties.get("IsFixedPitch", "false").lower() == "true"
        flags = 1 if is_fixed_pitch else 0

        # Parse FontBBox
        bbox_str = font_properties.get("FontBBox", "0 0 0 0")
        bbox_values = tuple(map(float, bbox_str.split()))
        if len(bbox_values) != 4:
            raise ValueError(f"Malformed FontBBox: {bbox_str}")
        bbox = bbox_values

    except (KeyError, ValueError) as e:
        raise ValueError(f"Error parsing AFM file: Missing or malformed property - {e}")

    return Font(
        name=name,
        family=family,
        weight=weight,
        ascent=ascent,
        descent=descent,
        cap_height=cap_height,
        x_height=x_height,
        italic_angle=italic_angle,
        flags=flags,
        bbox=bbox,
        character_widths=character_widths
    )

@stefan6419846
Copy link
Collaborator

I have just used your script and adapted it to show what I mean: https://gist.github.com/stefan6419846/3d368b26ee5260a7886657909f26ca15 The adobe_glyphs module imported there is a standalone copy of https://github.com/py-pdf/pypdf/blob/main/pypdf/_codecs/adobe_glyphs.py

This will generate the code to create instances of the dataclass shown in #3363 (comment) together with the copyrights. The only things are are currently missing are the dataclass definition itself and the necessary imports as well as the mapping in the footer, but this should be easy enough to add to the script.

By the way: We should probably keep the script outside of the original code, but run some linting and testing on it nevertheless to ensure it matches our standards and does not break for some reason.

@PJBrs
Copy link
Contributor Author

PJBrs commented Jul 15, 2025

@stefan6419846 OK, now I understand better! Basically, you want the script to immediately produce the specific Font instances, instead of ending up in an intermediate form. With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...? At least, that seems to work here.

I understand also, that using this, I can do something like:

if font_name in FONT_METRICS:
    font = FONT_METRICS[font_name]

and then do stuff like:

total_width = sum(font.character_widths[char] for char in "This is a long sentence")
print (total_width)

One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

Second; so, this really collects a lot of information in the Font class. In my local patches, I need the widths (and I expect also some other metrics) for text wrapping a text stream when flattening an annotation. With my original patches, I changed build_font_width_map to include the character widths that are now in the Font instances in FONT_METRICS. However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this? I see two ways forward:

  1. Change build_font_width_map and use the new Font instances, something like this:
+    else:
+        font_name = str(ft["/BaseFont"])[1:]
+        if font_name in FONT_METRICS:
+            font_width_map = cast(Dict[str, float], FONT_METRICS[font_name].character_widths)

  1. Change the Font dataclass so that you can initialise it with a /Font resource dictionary, set attributes bases on FONT_METRICS instances when available, and otherwise set character_widths using the existing build_font_width_map. This makes it possible, in the future, to also set the other metrics based on embedded font information, if available.

EDIT

  1. Where would you add the Font class?
  2. Where would you place the font generation script?

@stefan6419846
Copy link
Collaborator

With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...?

Yes, although for the repository and its CI/CD, we might need to extend the current configuration. I am open to help with this if desired/required.

One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

I used a mix of the original pdfminer.six code and your code for writing my script. The limitation is from pdfminer.six. This seems to mostly eliminate the characters with the code -1. I have no hard opinion on this/do not know what is correct.

However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this?

I would split this into two separate topics. This PR should focus on the new container and retrieving the data from the AFM files to use them where appropriate. Possibly unifying this with handling embedded fonts could/should be a separate step in a PR afterwards. This way, we avoid too large PRs which simplifies reviewing the changes from my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants