Skip to content

ICU-22885 Add parsing of approximately sign #3454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sffc
Copy link
Member

@sffc sffc commented Mar 26, 2025

This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.

Checklist

  • Required: Issue filed: ICU-22885
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@FrankYFTang
Copy link
Contributor

Do we also need a Java fix for this?

gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this set determeind? What does it based on ? having "約" in this set is strange? Could we have a comments about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I claim no moral authority for how this set was formed beyond "this set was manually curated".

What I did was open the xml files and look for characters used in the approximately pattern in various locales.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you mean these characters are gathered by looking at the content of some xml files and some particuarl field in those xml files? If so, could you point out which XML files and which particular fields about HOW you "manually curated" ? Give a little bit more details of how you did that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, here is what I tried

find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u
-
~
∼
≃
≈
ca.
dáàshì
dáàṣì
約

so... maybe adding comment as "This set of characters is gathered from the values of approximatelySign element of CLDR common/main/*.xml files." /

@sffc sffc requested a review from FrankYFTang March 26, 2025 22:19
@sffc
Copy link
Member Author

sffc commented Mar 26, 2025

Added Java and fixed the comment.

@richgillam
Copy link
Contributor

I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?

  1. What does this symbol mean in practice? Why would somebody be using it in text that we parse?
  2. I get why calling abort() is a bad idea, but why wouldn't it be better to just signal a parse error?
  3. If ignoring the character is the right thing to do, why do we need code to explicitly identify it as the approximately sign? Couldn't you just have a generic list of characters that should be ignored in parsing?

@sffc
Copy link
Member Author

sffc commented Mar 26, 2025

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

@richgillam
Copy link
Contributor

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me.

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOKTM

gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap the line in the comment. This is way too long I think.

@@ -195,4 +195,18 @@ void PlusSignMatcher::accept(StringSegment& segment, ParsedNumber& result) const
}


ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing)
: SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap for these two lines please

dfmt.parse(u"≈200", result, status);
ASSERT_SUCCESS(status);
if (result.getInt64() != 200) {
errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap please

unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze());
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap te comment plese.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants