Add UnicodeRange for font query #377

Draft
wants to merge 6 commits into
base: main

Conversation

dhardy
Contributor

@dhardy dhardy commented Jun 7, 2025

This is a partial solution to #371.
My testing shows improvements; e.g. that arrows now use the most appropriate font instead of the first which happens to contain a match (this makes styles much more consistent).

  • Add enum UnicodeRange. New code because I couldn't find an appropriate impl on crates.io (though there are some things touching on this in ttf_parser and in read-fonts).
  • Fill in remaining ranges
  • Add struct UnicodeRanges
  • Filter Query by matching ranges
  • Extend query when no matches are available

The missing part (last item) is to add a second-stage fallback (all fonts functional over the range) for when there are no matches. This is not quite so straightforward since Collection doesn't have a function to list all available families; I'd like some guidance on the best way to do this within fontique. Should Query::matches_with automatically do this when it has no other matches, or should it be up to the caller to call something else like Query::all_fonts_for_range(range: UnicodeRange) in this case?
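As a rough illustration of the two types listed above, here is a minimal sketch, not the PR's actual code. It assumes UnicodeRange mirrors the bit numbers of the OS/2 ulUnicodeRange field (Basic Latin = 0, Hebrew = 11, Arrows = 37, Dingbats = 47) and that UnicodeRanges is a 128-bit set built from the four 32-bit words ulUnicodeRange1..4. The method names `of`, `from_os2_words` and `contains` are hypothetical, and only four ranges are included. The Arrows bit also covers supplemental arrow blocks, which this sketch omits.

```rust
/// Illustrative subset of the OS/2 `ulUnicodeRange` bits (assumed layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(u8)]
pub enum UnicodeRange {
    BasicLatin = 0,
    Hebrew = 11,
    Arrows = 37,
    Dingbats = 47,
}

impl UnicodeRange {
    /// Map a char to a range, or `None` if this (incomplete) table has no entry.
    /// Hypothetical helper; only the primary block of each range is listed.
    pub fn of(c: char) -> Option<Self> {
        match c as u32 {
            0x0000..=0x007F => Some(Self::BasicLatin),
            0x0590..=0x05FF => Some(Self::Hebrew),
            0x2190..=0x21FF => Some(Self::Arrows),
            0x2700..=0x27BF => Some(Self::Dingbats),
            _ => None,
        }
    }
}

/// The set of ranges a font declares as functional, stored as a 128-bit set.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct UnicodeRanges(u128);

impl UnicodeRanges {
    /// Build from the four words `ulUnicodeRange1..4` (word 0 holds bits 0..=31).
    pub fn from_os2_words(words: [u32; 4]) -> Self {
        let mut bits = 0u128;
        for (i, w) in words.iter().enumerate() {
            bits |= (*w as u128) << (32 * i);
        }
        Self(bits)
    }

    /// True if the font declares `range` as functional.
    pub fn contains(self, range: UnicodeRange) -> bool {
        (self.0 & (1u128 << (range as u8))) != 0
    }
}
```

A query filter would then keep only fonts whose set contains the range of the char being resolved; a `None` range would skip the filter.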

Parley changes are needed to use this. kas-text is updated here: kas-gui/kas-text#97.

@dhardy
Contributor Author

dhardy commented Jun 7, 2025

The test input from #371; also some Hebrew characters:
image

Font matches

[2025-06-07T14:11:48Z DEBUG kas_text::fonts::resolver] select: Script::Latn, Some(BasicLatin), GenericFamily::SystemUi, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Cantarell
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: DejaVu Sans
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::resolver] select: Script::Zyyy, Some(BasicLatin), GenericFamily::SystemUi, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Cantarell
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::resolver] select: Script::Hebr, Some(Hebrew), GenericFamily::SystemUi, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Noto Sans Hebrew
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::resolver] select: Script::Latn, Some(BasicLatin), GenericFamily::Serif, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Noto Serif
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Nimbus Roman
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: URW Bookman
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: C059
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: P052
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Standard Symbols PS
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Caladea
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Symbola
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: STIX
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Bengali
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Gujarati
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Marathi
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Tamil
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Kannada
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Telugu
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: DejaVu Sans
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::resolver] select: Script::Zyyy, Some(BasicLatin), GenericFamily::Serif, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Noto Serif
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Nimbus Roman
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: URW Bookman
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: C059
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: P052
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Standard Symbols PS
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Caladea
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Symbola
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: STIX
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Bengali
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Gujarati
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Marathi
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Tamil
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Kannada
[2025-06-07T14:11:48Z DEBUG kas_text::fonts::library] match: Lohit Telugu
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::resolver] select: Script::Zyyy, Some(Dingbats), GenericFamily::Serif, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Symbola
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: STIX
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::resolver] select: Script::Zyyy, None, GenericFamily::Serif, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Noto Serif
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Nimbus Roman
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: URW Bookman
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: C059
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: P052
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Standard Symbols PS
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Caladea
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Symbola
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: STIX
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Bengali
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Gujarati
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Marathi
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Tamil
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Kannada
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Lohit Telugu
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::resolver] select: Script::Zyyy, Some(Arrows), GenericFamily::Serif, FontWeight(400), FontWidth(256), Normal
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: Symbola
[2025-06-07T14:11:56Z DEBUG kas_text::fonts::library] match: STIX

Note here that Hebrew and Arrow ranges only match one/two fonts; before this they would match many more (inappropriately). The None case (second-last match) is because UnicodeRange is incomplete.

@nicoburns
Contributor

Hmm... I believe that for CSS matching we need support for arbitrary numeric ranges (not just "standard ranges") specified as part of the FontInfoOverride. See https://developer.mozilla.org/en-US/docs/Web/CSS/@font-face/unicode-range. So perhaps it would make sense to have some variant on Vec<Range<u32>> (SmallVec<[Range<u32>; 4]>?) as the core type for unicode ranges within Fontique?

Reading this PR it looks like the 128 standard ranges thing is part of an OpenType standard? So I guess we may still need that code (not my area of expertise), although presumably one can also check which codepoints a font actually supports?
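To make the "arbitrary numeric ranges" idea concrete, here is a minimal sketch of such a type, assuming a plain Vec<Range<u32>> rather than a SmallVec so that it needs no external crate. The name `CodepointRanges` and its methods are hypothetical and not part of Fontique. Ranges are half-open here, whereas CSS unicode-range bounds are inclusive, so a caller would add one to the upper bound when converting.

```rust
use std::ops::Range;

/// Arbitrary codepoint ranges, in the spirit of CSS `unicode-range`.
/// Invariant: ranges are half-open, sorted by start, and non-overlapping.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct CodepointRanges(Vec<Range<u32>>);

impl CodepointRanges {
    /// Build from unsorted, possibly overlapping ranges, merging as needed.
    pub fn new(mut ranges: Vec<Range<u32>>) -> Self {
        ranges.retain(|r| r.start < r.end); // drop empty ranges
        ranges.sort_by_key(|r| r.start);
        let mut merged: Vec<Range<u32>> = Vec::with_capacity(ranges.len());
        for r in ranges {
            if let Some(last) = merged.last_mut() {
                if r.start <= last.end {
                    // Overlapping or adjacent: extend the previous range.
                    last.end = last.end.max(r.end);
                    continue;
                }
            }
            merged.push(r);
        }
        Self(merged)
    }

    /// Binary search for a range containing `c`.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        // Index of the first range starting after `cp`; the candidate precedes it.
        let idx = self.0.partition_point(|r| r.start <= cp);
        idx > 0 && cp < self.0[idx - 1].end
    }
}
```

Unlike a fixed bit set, this can represent any block (including those added after the OS/2 bits ran out), at the cost of a search per lookup and a key that is awkward to hash into a small number of categories.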

@dhardy
Contributor Author

dhardy commented Jun 7, 2025

although presumably one can also check which codepoints a font actually supports?

Yes, of course. That's largely orthogonal to this PR. I had assumed that Parley would already do this, but maybe not; this may be why kas-text already did better than Parley in #371.

The way this works in kas-text is that, per glyph, I create a hash of the font selector (family & attributes), the script and (after this PR) the unicode_range; this hash is used to select a matching font list and to cache it. Then, per glyph, I take the first font face from that list which contains the char (with the quirk that it prefers to use the last char's face if possible).

kas-text docs describe both FontId (the list) and FaceId (one font face), though that's a little dated (there's no longer a "default font").

This is per-char fallback, and the PR here mostly makes it more efficient (those font lists are smaller, omitting fonts that likely wouldn't cover the current char).
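The caching and per-char selection described above could look roughly like the following. This is a sketch under stated assumptions, not kas-text's implementation: `ListKey`, `Face`, `ListCache` and `pick_face` are hypothetical names, faces are reduced to a list of covered chars, and a HashMap keyed by the (selector, script, range) tuple stands in for the explicit hash.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: font selector, script tag and optional range bit.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ListKey {
    pub selector: String,  // family & attributes
    pub script: [u8; 4],   // e.g. *b"Hebr"
    pub range: Option<u8>, // ulUnicodeRange bit, if known
}

/// A font face, reduced to a name and the chars it can render.
pub struct Face {
    pub name: &'static str,
    pub chars: Vec<char>,
}

impl Face {
    fn has(&self, c: char) -> bool {
        self.chars.contains(&c)
    }
}

/// Cache from key to a list of face indices in preference order.
#[derive(Default)]
pub struct ListCache {
    lists: HashMap<ListKey, Vec<usize>>,
}

impl ListCache {
    /// Return the cached list for `key`, running `build` only on a miss.
    pub fn get_or_build(
        &mut self,
        key: ListKey,
        build: impl FnOnce(&ListKey) -> Vec<usize>,
    ) -> &[usize] {
        let list = self.lists.entry(key).or_insert_with_key(build);
        list.as_slice()
    }
}

/// Pick a face for `c`: prefer the previous char's face, else the first
/// face in `list` that covers `c`.
pub fn pick_face(faces: &[Face], list: &[usize], c: char, last: Option<usize>) -> Option<usize> {
    if let Some(i) = last {
        if faces[i].has(c) {
            return Some(i);
        }
    }
    list.iter().copied().find(|&i| faces[i].has(c))
}
```

With a shorter list per key, the `find` in `pick_face` touches fewer faces, which is the efficiency gain mentioned above. The `last` parameter is what keeps a space between two Hebrew chars on the Hebrew face instead of reverting to the first face in the list.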

Using this PR without per-char fallback would probably mostly work, but likely not everywhere.

The second (missing) part of this PR would help by providing some likely font matches when no suitable fonts are found otherwise. E.g. if someone uses an arrow then the family won't identify a suitable font (though some contain a few common arrow glyphs anyway) and the script ("common") won't either, but the UnicodeRange will.


I believe that for CSS matching we need support for arbitrary numeric ranges

Hmm, UnicodeRange is intended to mirror ulUnicodeRange from the OS/2 table, which most fonts provide. I guess the CSS unicode-range is designed to do a similar thing, but (a) is more flexible and (b) relies on a web page's CSS to specify available fonts (with the ranges they cover).

For fonts found on the system, I think we normally only have the info from the OS/2 table.

The other reason I chose to precisely mirror the OS/2 specification is how kas-text caches a set of fonts as a FontId; I want some finite small-ish number of categories here (far fewer than the number of possible char codes, and independent of fonts for simplicity).

I guess there are other possible approaches; e.g. building a (very large) map from (FontFamily, FontWeight, FontWidth, FontStyle, char) to a font face directly, or caching maps from char to a font face for each (FontFamily, FontWeight, FontWidth, FontStyle) or just performing the whole look-up chain for each char (probably too slow for reactive UI).

I'm not sure if you want to copy the approach I've adopted for kas-text in Parley since it's not exactly compatible with CSS's unicode-range — but I don't see a good alternative that allows fast cached look-ups (and is not horribly complex / needing a huge hash map).

I would like this to be adopted by fontique anyway, since it's not incompatible with using another approach for Parley.

@dhardy
Contributor Author

dhardy commented Jun 9, 2025

Combined with #378 (which effectively just massively increases the number of font matches), this PR is effectively a fast pre-filter for callbacks. E.g. with this PR, matching for a font supporting Hebrew (Script and UnicodeRange properties) yields 6 matches; without this PR (only Script property), it yields 61 (of which only 17 and 50 are obviously Hebrew fonts).

That is because the Script property is only used for fallbacks, not to prune other matches. I'm not sure whether that should change (not if we also have this PR, possibly otherwise though I don't know if it would be problematic), but even if so it still wouldn't work for arrows (which have a UnicodeRange but not a specific Script).

@nicoburns
Contributor

this PR is effectively a fast pre-filter for callbacks

Am I understanding correctly that the invariant this PR relies on / takes advantage of is "If a font doesn't list a 'standard unicode range' then it doesn't contain any glyphs for any codepoints within that range"? If a font does support a given "standard range", we still need to check whether the font actually contains the specific glyph we are looking for, but this allows fast-rejecting when a font doesn't cover a given "standard range"?

@dhardy
Contributor Author

dhardy commented Jun 9, 2025

Honestly, the only specification I found for behaviour is this one:

This field is used to specify the Unicode blocks or ranges encompassed by the font file in 'cmap' subtables for platform 3, encoding ID 1 (Microsoft platform, Unicode BMP) and platform 3, encoding ID 10 (Microsoft platform, Unicode full repertoire). If a bit is set (1), then the Unicode ranges assigned to that bit are considered functional. If the bit is clear (0), then the range is not considered functional. Each of the bits is treated as an independent flag and the bits can be set in any combination. The determination of “functional” is left up to the font designer, although character set selection should attempt to be functional by ranges if at all possible.

I fully expect that there are some chars contained by some fonts which do not indicate that they are "functional" over the corresponding input range, and thus are rejected by the new filter in this PR.
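The filter-then-check behaviour, including the false rejection just described, can be sketched as follows. This is an illustration only: `FontInfo`, `declares`, `maps` and `select` are hypothetical names, the `declared` field stands for the packed ulUnicodeRange bits, and a sorted Vec of code points stands in for a real cmap lookup.

```rust
/// Hypothetical per-font data: declared OS/2 range bits plus actual coverage.
pub struct FontInfo {
    pub name: &'static str,
    /// `ulUnicodeRange1..4` packed into one integer (bit n = range n).
    pub declared: u128,
    /// Stand-in for a `cmap` lookup: sorted list of covered code points.
    pub cmap: Vec<u32>,
}

impl FontInfo {
    /// Cheap test: does the font declare range `bit` as functional?
    pub fn declares(&self, bit: u8) -> bool {
        (self.declared & (1u128 << bit)) != 0
    }

    /// Exact test: does the font map `c` to a glyph?
    pub fn maps(&self, c: char) -> bool {
        self.cmap.binary_search(&(c as u32)).is_ok()
    }
}

/// Fast-reject on the declared range, then confirm with the exact check.
pub fn select<'a>(fonts: &'a [FontInfo], bit: u8, c: char) -> Option<&'a FontInfo> {
    fonts.iter().filter(|f| f.declares(bit)).find(|f| f.maps(c))
}
```

A font that maps the char but does not set the corresponding bit is never reached by `select`, so whether the filter is acceptable depends on what happens when it returns `None`.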

But does that matter? Let's consider a common arrow character ("Arrow" range), using a Sans-Serif font family:

  • Without this PR, the arrow would be picked from the first matching font, the Sans-Serif font. This may be more consistent with other text in that font.
  • With this PR and a suitable "Arrow" font, the arrow would be picked from the arrow font. This is (according to my testing) more consistent with other, less common arrows, but maybe less consistent with ordinary Sans-Serif text.
  • With this PR but without a suitable "Arrow" font, the arrow would not be matched. That is a problem if a system doesn't have such a font.

The latter point can be addressed by changing how glyph fallback works: instead of checking one long list of fonts, use at least two: a pre-filtered list of preferred fonts (possibly also using Script for filtering) and a longer (unfiltered) list. This would require further changes.
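A minimal sketch of that two-list lookup, assuming the caller already has both lists and a coverage predicate. The function name `pick_two_stage` is hypothetical, and the font type is left generic.

```rust
/// Search the pre-filtered `preferred` list first, then the longer
/// unfiltered `all` list, returning the first font accepted by `covers`.
pub fn pick_two_stage<'a, F>(
    preferred: &'a [F],
    all: &'a [F],
    covers: impl Fn(&F) -> bool,
) -> Option<&'a F> {
    preferred.iter().chain(all.iter()).find(|f| covers(*f))
}
```

When the preferred list has a covering font, it wins even if an earlier entry of the unfiltered list also covers the char; when the preferred list is empty or has no covering font, the result is the same as without any filtering.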

Related questions:

  • Which font pick is preferable? (See example above — IMO the "Arrow"-specific font is preferable in the very specific case I tested.)
  • How do you ensure that a space between two Arabic glyphs does not revert to the default Sans-Serif font? (This is important for shaping runs.) The (first) answer I came up with is the aforementioned kas-text quirk, but this feels like a cheap hack. A better answer may be to ensure the Script and locale affect the font selection.

Conclusion: we likely still want UnicodeRange, but the query API needs some other changes (new issue).

@khaledhosny

Note that fonts often “lie” about ulUnicodeRange. The main user of this information is Windows, and it does sometimes use it in surprising ways, so fonts might set or unset specific ulUnicodeRange bits to work around some specific Windows behavior. So I’d use ulUnicodeRange with caution. Or, better yet, not use it at all and rely on the font’s cmap table to determine which characters are supported. Also, all available ulUnicodeRange bits were exhausted as of Unicode 5.1, so any ranges added after that can’t be represented with it.

3 participants