-
-
Notifications
You must be signed in to change notification settings - Fork 800
ICU-22885 Add parsing of approximately sign #3454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Do we also need a Java fix for this? |
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status); | ||
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr); | ||
gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this set determeind? What does it based on ? having "約" in this set is strange? Could we have a comments about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I claim no moral authority for how this set was formed beyond "this set was manually curated".
What I did was open the xml files and look for characters used in the approximately pattern in various locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you mean these characters are gathered by looking at the content of some xml files and some particuarl field in those xml files? If so, could you point out which XML files and which particular fields about HOW you "manually curated" ? Give a little bit more details of how you did that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, here is what I tried
find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u
-
~
∼
≃
≈
ca.
dáàshì
dáàṣì
約
so... maybe adding comment as "This set of characters is gathered from the values of approximatelySign element of CLDR common/main/*.xml files." /
Added Java and fixed the comment. |
I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?
|
The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago). Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code. |
Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOKTM
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status); | ||
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr); | ||
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please wrap the line in the comment. This is way too long I think.
@@ -195,4 +195,18 @@ void PlusSignMatcher::accept(StringSegment& segment, ParsedNumber& result) const | |||
} | |||
|
|||
|
|||
ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing) | |||
: SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap for these two lines please
dfmt.parse(u"≈200", result, status); | ||
ASSERT_SUCCESS(status); | ||
if (result.getInt64() != 200) { | ||
errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap please
unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze()); | ||
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap te comment plese.
This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.
Checklist