ICU-22885 Add parsing of approximately sign #3454

sffc · 2025-03-26T02:31:11Z

This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.

Checklist

Required: Issue filed: ICU-22885
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

FrankYFTang · 2025-03-26T19:52:16Z

Do we also need a Java fix for this?

FrankYFTang · 2025-03-26T19:53:48Z

icu4c/source/common/static_unicode_sets.cpp

    gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
+    U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
+    gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated


How was this set determined? What is it based on? Having "約" in this set seems strange. Could we add a comment about this?

I claim no moral authority for how this set was formed beyond "this set was manually curated".

What I did was open the xml files and look for characters used in the approximately pattern in various locales.

So you mean these characters were gathered by looking at the content of some XML files and some particular field in those XML files? If so, could you point out which XML files and which particular fields, i.e. HOW you "manually curated" the set? Please give a little more detail on how you did that.

OK, here is what I tried

find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u - ~ ∼ ≃ ≈ ca. dáàshì dáàṣì 約

So maybe add a comment such as "This set of characters is gathered from the values of approximatelySign element of CLDR common/main/*.xml files."
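One observation on the command output above: the multi-character values (ca., dáàshì, dáàṣì) do not appear in the set quoted in the diff, which holds only single characters. A minimal sketch of that filtering step follows, assuming the leading "-" in the output belongs to the sort invocation rather than being a value. The function names are hypothetical and this is not how the PR built the set; it only illustrates the relationship between the two lists.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: keeps only the candidates that are exactly one
// code point long and concatenates them in their original order.
std::u32string singleCodePointValues(const std::vector<std::u32string>& candidates) {
    std::u32string result;
    for (const auto& candidate : candidates) {
        if (candidate.size() == 1) {
            result.push_back(candidate[0]);
        }
    }
    return result;
}

// The values from the quoted command output, written with escapes so the
// example does not depend on the source file encoding.
std::vector<std::u32string> quotedCandidates() {
    return {
        U"~",                            // U+007E TILDE
        U"\u223C",                       // TILDE OPERATOR
        U"\u2243",                       // ASYMPTOTICALLY EQUAL TO
        U"\u2248",                       // ALMOST EQUAL TO
        U"ca.",                          // multi-character, filtered out
        U"d\u00E1\u00E0sh\u00EC",        // multi-character, filtered out
        U"d\u00E1\u00E0\u1E63\u00EC",    // multi-character, filtered out
        U"\u7D04"                        // CJK character quoted in the review
    };
}
```

Calling singleCodePointValues(quotedCandidates()) leaves the same five characters that appear in the quoted UnicodeSet pattern.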

sffc · 2025-03-26T22:19:47Z

Added Java and fixed the comment.

richgillam · 2025-03-26T22:50:54Z

I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?

What does this symbol mean in practice? Why would somebody be using it in text that we parse?
I get why calling abort() is a bad idea, but why wouldn't it be better to just signal a parse error?
If ignoring the character is the right thing to do, why do we need code to explicitly identify it as the approximately sign? Couldn't you just have a generic list of characters that should be ignored in parsing?

sffc · 2025-03-26T22:59:41Z

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patterns and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.
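To illustrate the behavior described here, the following is a minimal, self-contained sketch and not the ICU implementation: a leading plus sign or approximately sign is accepted and consumed, and the parsed value is unaffected. The names isApproximatelySign and parseLenient are hypothetical, and the character list simply mirrors the set quoted earlier in this review.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the curated approximately-sign set.
bool isApproximatelySign(char32_t c) {
    return c == U'~'          // U+007E TILDE
        || c == U'\u223C'     // TILDE OPERATOR
        || c == U'\u2248'     // ALMOST EQUAL TO
        || c == U'\u2243'     // ASYMPTOTICALLY EQUAL TO
        || c == U'\u7D04';    // CJK character from the quoted set
}

// Parses an optional leading '+' or approximately sign followed by ASCII
// digits. The sign is consumed but contributes nothing to the result.
// Returns std::nullopt if no digits follow or a non-digit is found.
std::optional<std::int64_t> parseLenient(const std::u32string& text) {
    std::size_t i = 0;
    if (i < text.size() && (text[i] == U'+' || isApproximatelySign(text[i]))) {
        ++i;  // accept the sign, same treatment for both kinds
    }
    if (i == text.size()) {
        return std::nullopt;  // a sign alone is not a number
    }
    std::int64_t value = 0;
    for (; i < text.size(); ++i) {
        if (text[i] < U'0' || text[i] > U'9') {
            return std::nullopt;
        }
        value = value * 10 + static_cast<std::int64_t>(text[i] - U'0');
    }
    return value;
}
```

Under these assumptions, parsing "≈200" and "+200" both yield 200, matching the expectation in the test added by this PR. The sketch ignores overflow, grouping, and locale-specific digits.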

richgillam · 2025-03-26T23:04:15Z

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patterns and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me.

richgillam

LOKTM

FrankYFTang · 2025-03-26T23:04:52Z

icu4c/source/common/static_unicode_sets.cpp

    gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
+    U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
+    // This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.


Please wrap the line in the comment. I think it is way too long.

FrankYFTang · 2025-03-26T23:05:40Z

icu4c/source/i18n/numparse_symbols.cpp

@@ -195,4 +195,18 @@ void PlusSignMatcher::accept(StringSegment& segment, ParsedNumber& result) const
 }


+ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing)
+        : SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN),


Please line-wrap these two lines.

FrankYFTang · 2025-03-26T23:05:53Z

icu4c/source/test/intltest/numfmtst.cpp

+    dfmt.parse(u"≈200", result, status);
+    ASSERT_SUCCESS(status);
+    if (result.getInt64() != 200) {
+        errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64()));


Please line-wrap this.

FrankYFTang · 2025-03-26T23:06:41Z

icu4j/main/core/src/main/java/com/ibm/icu/impl/StaticUnicodeSets.java

        unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze());
+        // This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.


Please line-wrap the comment.

ICU-22885 Add parsing of approximately sign

f17e5c0

sffc assigned FrankYFTang Mar 26, 2025

sffc requested a review from richgillam March 26, 2025 02:33

FrankYFTang reviewed Mar 26, 2025

sffc added 3 commits March 26, 2025 15:14

Port to Java

226cfe1

Add it as a standard matcher, C and J

36a388d

Improve comment

cb354b1

sffc requested a review from FrankYFTang March 26, 2025 22:19

richgillam approved these changes Mar 26, 2025

FrankYFTang reviewed Mar 26, 2025
