|
| 1 | +Name: icu |
| 2 | +URL: https://github.com/unicode-org/icu |
| 3 | +Version: 71-1 |
| 4 | +CPEPrefix: cpe:/a:icu-project:international_components_for_unicode:71.1 |
| 5 | +License: MIT |
| 6 | +Security Critical: yes |
| 7 | + |
| 8 | +Description: |
| 9 | +This directory contains the source code of ICU 71.1 for C/C++. |
| 10 | + |
| 11 | +A. How to update ICU |
| 12 | + |
| 13 | +1. Run "scripts/update.sh <version>" (e.g. 71-1). |
| 14 | + This will download ICU from the upstream git repository. |
| 15 | + It preserves the Chrome-specific build files and |
| 16 | + converter files (see section C). |
| 17 | + |
| 18 | + source.gni and icu.gyp* files are automatically updated, too. |
| 19 | + |
| 20 | +2. Review and apply patches/changes in "D. Local Modifications" if |
| 21 | + necessary/applicable. Update patch files in patches/. |
| 22 | + |
| 23 | +3. Follow the instructions in section B on building ICU data files. |
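The version argument in step 1 uses the upstream release-tag form (71-1), while the CPEPrefix field at the top of this file uses the dotted form (71.1). The following is a minimal Python sketch of that conversion, assuming tags always look like <major>-<minor>. The helper names tag_to_dotted and cpe_prefix are hypothetical and are not part of scripts/update.sh.

```python
import re

# Prefix copied from the CPEPrefix field of this README, without the version.
CPE_BASE = "cpe:/a:icu-project:international_components_for_unicode:"


def tag_to_dotted(tag: str) -> str:
    """Convert a release tag such as '71-1' to the dotted form '71.1'.

    Assumption: tags consist of exactly two decimal numbers joined by '-'.
    """
    match = re.fullmatch(r"(\d+)-(\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected ICU release tag: {tag!r}")
    return f"{match.group(1)}.{match.group(2)}"


def cpe_prefix(tag: str) -> str:
    """Build the CPEPrefix value for a given release tag."""
    return CPE_BASE + tag_to_dotted(tag)
```

Something like this can help keep the Version and CPEPrefix fields in sync when the header is edited by hand after an update.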
| 24 | + |
| 25 | +B. How to build ICU data files |
| 26 | + |
| 27 | + |
| 28 | +Pre-built data files are generated and checked in with the following steps: |
| 29 | + |
| 30 | +1. ICU data files for Chrome OS, Linux, Mac and Windows |
| 31 | + |
| 32 | + a. Make an ICU data build directory outside the Chromium source tree |
| 33 | + and cd to that directory (say, $ICUBUILDIR). |
| 34 | + |
| 35 | + b. Run |
| 36 | + ${CHROME_ICU_TREE_TOP}/scripts/make_data_all.sh |
| 37 | + |
| 38 | + This script takes the following steps: |
| 39 | + |
| 40 | + i) Run |
| 41 | + ${CHROME_ICU_TREE_TOP}/source/runConfigureICU Linux --disable-layout --disable-tests |
| 42 | + |
| 43 | + ii) Run make |
| 44 | + |
| 45 | + iii) (cd data && make clean) |
| 46 | + |
| 47 | + iv) scripts/config_data.sh common |
| 48 | + This configures the build with the filter for common. |
| 49 | + |
| 50 | + v) Run make |
| 51 | + |
| 52 | + vi) scripts/copy_data.sh common |
| 53 | + This copies the ICU data files for non-Android platforms |
| 54 | + (both Little and Big Endian) to the following locations: |
| 55 | + |
| 56 | + common/icudtl.dat |
| 57 | + common/icudtb.dat |
| 58 | + |
| 59 | + vii) Repeat steps iii) - vi) for chromeos to produce chromeos/icudtl.dat |
| 60 | + |
| 61 | + viii) cast/patch_locale.sh |
| 62 | + Modify the file for cast, android, ios and flutter. |
| 63 | + |
| 64 | + ix) Repeat steps iii) - vi) for cast, android and ios to produce |
| 65 | + cast/icudtl.dat |
| 66 | + android/icudtl.dat |
| 67 | + ios/icudtl.dat |
| 68 | + |
| 69 | + x) flutter/patch_brkitr.sh |
| 70 | + On top of cast/patch_locale.sh (step viii)), further patch |
| 71 | + the code for flutter. |
| 72 | + |
| 73 | + xi) Repeat steps iii) - vi) for flutter to produce |
| 74 | + flutter/icudtl.dat |
| 75 | + |
| 76 | + xii) scripts/clean_up_data_source.sh |
| 77 | + |
| 78 | + This reverts the results of cast/patch_locale.sh and flutter/patch_brkitr.sh |
| 79 | + and makes the tree ready for committing updated ICU data files for |
| 80 | + non-Android and Android platforms. |
| 81 | + |
| 82 | + c. Whenever data is updated (e.g. a timezone update), only step b needs to be |
| 83 | + repeated, as long as the ICU build directory made in step a is kept. |
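The order of operations in step b can be sketched as a flat command list. This is a minimal Python illustration assembled only from the steps listed above; it prints nothing and executes nothing, and it is not the actual content of scripts/make_data_all.sh. The function names data_steps and build_plan are hypothetical.

```python
def data_steps(target):
    """Steps iii) - vi) for one data target (e.g. 'common')."""
    return [
        "(cd data && make clean)",
        f"scripts/config_data.sh {target}",
        "make",
        f"scripts/copy_data.sh {target}",
    ]


def build_plan():
    """Return the commands of step b in the order the README describes."""
    plan = [
        # Steps i) and ii): configure and build ICU once.
        "${CHROME_ICU_TREE_TOP}/source/runConfigureICU Linux "
        "--disable-layout --disable-tests",
        "make",
    ]
    # Steps iii) - vii): data for common and chromeos use the unpatched tree.
    for target in ("common", "chromeos"):
        plan += data_steps(target)
    # Steps viii) - ix): patch locale data, then build cast, android and ios.
    plan.append("cast/patch_locale.sh")
    for target in ("cast", "android", "ios"):
        plan += data_steps(target)
    # Steps x) - xi): flutter is patched on top of the cast patch.
    plan.append("flutter/patch_brkitr.sh")
    plan += data_steps("flutter")
    # Step xii): revert both patches before committing.
    plan.append("scripts/clean_up_data_source.sh")
    return plan
```

The point of the ordering is that the two patch scripts are cumulative: targets built before a patch is applied must come first, and the cleanup has to run last.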
| 84 | + |
| 85 | +2. Note on the locale data customization |
| 86 | + |
| 87 | + - filter/chromeos.json |
| 88 | + a. Filter the locale data for ChromeOS's UI languages : |
| 89 | + locales, lang, region, currency, zone |
| 90 | + b. Filter the locale data for non-UI languages to the bare minimum : |
| 91 | + ExemplarCharacters, LocaleScript, layout, and the name of the |
| 92 | + language for a locale in its native language. |
| 93 | + c. Filter the legacy Chinese character set-based collation |
| 94 | + (big5han/gb2312han) that doesn't make any sense and nobody uses. |
| 95 | + |
| 96 | + - filter/common.json |
| 97 | + Same as above in filter/chromeos.json, AND |
| 98 | + e. Filter exemplar cities in timezone data (data/zone). |
| 99 | + |
| 100 | + - filter/android.json and filter/ios.json |
| 101 | + a. Filter the locale data for Android / iOS UI languages : |
| 102 | + locales, lang, region, currency, zone |
| 103 | + b. Filter the locale data for non-UI languages to the bare minimum : |
| 104 | + ExemplarCharacters, LocaleScript, layout, and the name of the |
| 105 | + language for a locale in its native language. |
| 106 | + c. Filter the legacy Chinese character set-based collation |
| 107 | + d. Filter source/data/{region,lang} to exclude these data |
| 108 | + except the language and script names of zh_Hans and zh_Hant. |
| 109 | + e. Keep only the minimal calendar data in data/locales. |
| 110 | + f. Include currency display names for a smaller subset of currencies. |
| 111 | + g. Minimize the locale data for 9 locales to which Chrome on Android |
| 112 | + is not localized. |
| 113 | + |
| 114 | + |
| 115 | +C. Chromium-specific data build files and converters |
| 116 | + |
| 117 | +They're preserved in step A.1 above. In general, there's no need to touch |
| 118 | +them when updating ICU. |
| 119 | + |
| 120 | +1. source/data/mappings |
| 121 | + - convrtrs.txt : Lists encodings and aliases required by the WHATWG |
| 122 | + Encoding spec plus a few extras (see the file for the reasons). |
| 123 | + |
| 124 | + - ucmlocal.txt : Lists only the converters we need. |
| 125 | + |
| 126 | + - *html.ucm: Mapping files per WHATWG encoding standards for EUC-JP, |
| 127 | + Shift_JIS, Big5 (Big5+Big5HKSCS), EUC-KR and all the single byte encodings. |
| 128 | + They're generated with scripts/{eucjp,sjis,big5,euckr,single_byte}_gen.sh. |
| 129 | + |
| 130 | + - gb18030.ucm and windows-936.ucm |
| 131 | + gb_table.patch was applied for the following changes. No need |
| 132 | + to apply it again. The patch is kept for the record. |
| 133 | + a. Map \xA3\xA0 to U+3000 instead of U+E5E5 in gb18030 and windows-936 per |
| 134 | + the encoding spec (one-way mapping in toUnicode direction). |
| 135 | + b. Map \xA8\xBF to U+01F9 instead of U+E7C8. Add one-way map |
| 136 | + from U+1E3F to \xA8\xBC (windows-936/GBK). |
| 137 | + See https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c3 |
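The one-way mappings described for gb_table.patch can be pictured as two separate lookup tables, one per conversion direction. This is a minimal Python sketch that models only the directions stated explicitly in items a and b; it is not how the .ucm files or ICU converters are implemented, and the table and function names are hypothetical.

```python
# Byte pairs whose toUnicode (decoding) result is overridden.
TO_UNICODE_OVERRIDES = {
    b"\xa3\xa0": "\u3000",  # item a: instead of U+E5E5
    b"\xa8\xbf": "\u01f9",  # item b: instead of U+E7C8
}

# Code points that gain a fromUnicode-only (encoding) mapping.
FROM_UNICODE_ONLY = {
    "\u1e3f": b"\xa8\xbc",  # item b: one-way map for windows-936/GBK
}


def decode_override(pair):
    """Return the overridden character for a byte pair, or None."""
    return TO_UNICODE_OVERRIDES.get(pair)


def encode_override(char):
    """Return the one-way byte sequence for a character, or None."""
    return FROM_UNICODE_ONLY.get(char)
```

Because the tables are kept separate, decoding \xA3\xA0 yields U+3000 while encoding U+3000 finds no entry here and would fall through to the base table, which is what "one-way" means in this context.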
| 138 | + |
| 139 | +2. source/data/brkitr |
| 140 | + - dictionaries/khmerdict.txt: Abridged Khmer dictionary. See |
| 141 | + https://unicode-org.atlassian.net/browse/ICU-9451 |
| 142 | + - dictionaries/laodict.txt: Abridged Lao dictionary. We keep using the smaller |
| 143 | + old version from ICU 69-1. |
| 144 | + - rules/word_ja.txt (used only on Android) |
| 145 | + Added for Japanese-specific word-breaking without the C+J dictionary. |
| 146 | + - rules/{root,zh,zh_Hant}.txt |
| 147 | + a. Use line_normal by default. |
| 148 | + b. Drop local patches we used to have for the following issues. They'll |
| 149 | + be dealt with in the upstream (Unicode/CLDR). |
| 150 | + http://unicode.org/cldr/trac/ticket/6557 |
| 151 | + http://unicode.org/cldr/trac/ticket/4200 (http://crbug.com/39779) |
| 152 | + |
| 153 | +3. Add {an,ku,tg,wa}.txt to source/data/{locale,lang} |
| 154 | + with the minimal locale data necessary for the spellchecker |
| 155 | + and language menus. |
| 156 | + |
| 157 | +D. Local Modifications |
| 158 | + |
| 159 | +1. Applied locale data patches from Google obtained by diff'ing |
| 160 | + the upstream copy and Google's internal copy for source/data |
| 161 | + |
| 162 | + - patches/locale_google.patch: |
| 163 | + * Google's internal ICU locale changes |
| 164 | + * Simpler region names for Hong Kong and Macau in all locales |
| 165 | + * Currency signs in ru and uk locales (do not include 'tr' locale changes) |
| 166 | + * AM/PM, midnight, noon formatting for a few Indian locales |
| 167 | + * Timezone name changes in Korean and Chinese locales |
| 168 | + * The default digits for the Arabic locale are European digits. |
| 169 | + |
| 170 | + - patches/locale1.patch: Minor fixes for Korean |
| 171 | + |
| 172 | + |
| 173 | +2. Breakiterator patches |
| 174 | + - patches/wordbrk.patch for word.txt |
| 175 | + a. Move full stops (U+002E, U+FF0E) from MidNumLet to MidNum so that |
| 176 | + FQDN labels can be split at '.' |
| 177 | + b. Move fullwidth digits (U+FF10 - U+FF19) from Ideographic to Numeric. |
| 178 | + See http://unicode.org/cldr/trac/ticket/6555 |
| 179 | + |
| 180 | + - patches/khmer-dictbe.patch |
| 181 | + Adjust parameters to use a smaller Khmer dictionary (khmerdict.txt). |
| 182 | + https://unicode-org.atlassian.net/browse/ICU-9451 |
| 183 | + |
| 184 | + - Add several common Chinese words that were dropped previously to |
| 185 | + source/data/cjdict/brkitr/cjdict.txt |
| 186 | + patch: patches/cjdict.patch |
| 187 | + upstream bug: https://unicode-org.atlassian.net/browse/ICU-10888 |
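Item a of patches/wordbrk.patch above can be illustrated with a much simplified word-boundary model. In the UAX #29 rules that word.txt is based on, a MidNumLet character joins letters on both sides (WB6/WB7) as well as digits (WB11/WB12), whereas a MidNum character joins only digits. The Python sketch below implements just those rules plus the letter/digit adjacency rules, treating '.' as whichever class is passed in. It is an illustration only, not ICU's break iterator, and split_words is a hypothetical name.

```python
def split_words(text, dot_class):
    """Split text using a tiny subset of the UAX #29 word rules.

    dot_class is the class assigned to '.': 'MidNumLet' or 'MidNum'.
    """
    def kind(ch):
        if ch.isdigit():
            return "Numeric"
        if ch.isalpha():
            return "ALetter"
        if ch == ".":
            return dot_class
        return "Other"

    mid_num = ("MidNum", "MidNumLet")
    tokens, start, n = [], 0, len(text)
    for i in range(1, n):
        prev, cur = kind(text[i - 1]), kind(text[i])
        prev2 = kind(text[i - 2]) if i >= 2 else None
        nxt = kind(text[i + 1]) if i + 1 < n else None
        keep = (
            # WB5, WB8-WB10: letters and digits stick together.
            (prev in ("ALetter", "Numeric") and cur in ("ALetter", "Numeric"))
            # WB6/WB7: ALetter MidNumLet ALetter stays together.
            or (prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter")
            or (prev2 == "ALetter" and prev == "MidNumLet" and cur == "ALetter")
            # WB12/WB11: Numeric (MidNum | MidNumLet) Numeric stays together.
            or (prev == "Numeric" and cur in mid_num and nxt == "Numeric")
            or (prev2 == "Numeric" and prev in mid_num and cur == "Numeric")
        )
        if not keep:
            tokens.append(text[start:i])
            start = i
    if text:
        tokens.append(text[start:])
    return tokens
```

With '.' as MidNumLet, split_words("www.example.com", "MidNumLet") returns the whole host name as one token; with '.' as MidNum it returns ["www", ".", "example", ".", "com"], while "3.14" stays a single token in both cases.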
| 188 | + |
| 189 | +3. Timezone data update |
| 190 | + Run scripts/update_tz.sh to grab the latest version of the |
| 191 | + following timezone data files and put them in source/data/misc |
| 192 | + |
| 193 | + metaZones.txt |
| 194 | + timezoneTypes.txt |
| 195 | + windowsZones.txt |
| 196 | + zoneinfo64.txt |
| 197 | + |
| 198 | + As of Oct 13, 2022, the latest version is 2022e |
| 199 | + and the above files are available at the ICU GitHub repos. |
| 200 | + |
| 201 | +4. Build-related changes |
| 202 | + |
| 203 | + - patches/configure.patch: |
| 204 | + * Remove a section of configure that will cause breakage while |
| 205 | + running runConfigureICU. |
| 206 | + |
| 207 | + - patches/wpo.patch (only needed when icudata dll is used). |
| 208 | + upstream bugs : https://unicode-org.atlassian.net/browse/ICU-8043 |
| 209 | + https://unicode-org.atlassian.net/browse/ICU-5701 |
| 210 | + |
| 211 | + - patches/data_symb.patch : |
| 212 | + Put ICU_DATA_ENTRY_POINT(icudtXX_dat) in common when we use |
| 213 | + the icu data file or icudt.dll |
| 214 | + |
| 215 | + - patches/unused-var-unary-operators.patch: |
| 216 | + upstream bug: https://unicode-org.atlassian.net/browse/ICU-21966 |
| 217 | + upstream PR: https://github.com/unicode-org/icu/pull/2055 |
| 218 | + |
| 219 | +5. ISO-2022-JP encoding (fromUnicode) change per WHATWG encoding spec. |
| 220 | + - patches/iso2022jp.patch |
| 221 | + - upstream bug: |
| 222 | + https://unicode-org.atlassian.net/browse/ICU-20251 |
| 223 | + |
| 224 | +6. Enable tracing of files but not resources, only for Chromium |
| 225 | + to reduce performance impact/risk. |
| 226 | + - patches/restrace.patch |
| 227 | + |
| 228 | +7. Patch Arabic date time pattern back to the ICU 67 value to avoid test |
| 229 | + breakage in |
| 230 | + third_party/blink/web_tests/fast/forms/datetimelocal/datetimelocal-appearance-l10n.html |
| 231 | + - patches/ardatepattern.patch |
| 232 | + - https://bugs.chromium.org/p/chromium/issues/detail?id=1139186 |
| 233 | + |
| 234 | +8. Remove explicit std::atomic<NumberRangeFormatterImpl*> template |
| 235 | + instantiation |
| 236 | + patches/atomic_template_instantiation.patch |
| 237 | + - The explicit instantiation was added to silence MSVC C4251 warnings: |
| 238 | + https://unicode-org.atlassian.net/browse/ICU-20157 |
| 239 | + Small test cases show that it is generally an error to instantiate |
| 240 | + std::atomic<T*> with an incomplete type T with MSVC, clang, and GCC, so this |
| 241 | + instantiation never should have worked: |
| 242 | + https://gcc.godbolt.org/z/34xx8h |
| 243 | + At this time, it's not clear if this particular instantiation with |
| 244 | + NumberRangeFormatterImpl* was ever necessary for MSVC. Further testing with |
| 245 | + MSVC is required to upstream this patch. |
| 246 | + - https://unicode-org.atlassian.net/browse/ICU-21482 |
| 247 | + |
| 248 | +9. Patch source/common/uposixdefs.h so it compiles on Fuchsia on Macs. |
| 249 | + patches/fuchsia.patch |
| 250 | + - context bug: https://bugs.chromium.org/p/chromium/issues/detail?id=1184527 |
| 251 | + |
| 252 | +10. Patch i18n/dtitvfmt.cpp to fix DateIntervalFormat regression |
| 253 | + patches/DateIntervalFormatnormalizeHourMetacharacters.patch |
| 254 | + - https://github.com/unicode-org/icu/pull/2060 |
| 255 | + - https://unicode-org.atlassian.net/browse/ICU-21984 |
| 256 | + |
| 257 | +11. Patch common/locid.cpp to fix heap-buffer-overflow |
| 258 | + patches/AliasDataBuilder-readAlias.patch |
| 259 | + - https://patch-diff.githubusercontent.com/raw/unicode-org/icu/pull/2067 |
| 260 | + - https://unicode-org.atlassian.net/browse/ICU-21994 |
| 261 | + |
| 262 | +12. Patch source/i18n/collationdatabuilder.* |
| 263 | + patches/collationdatabuilder.patch |
| 264 | + - https://github.com/unicode-org/icu/pull/2052 |
| 265 | + - https://unicode-org.atlassian.net/browse/ICU-20715 |
| 266 | + |
| 267 | +13. Patch i18n/formatted_string_builder to fix int32_t overflow bug |
| 268 | + patches/formatted_string_builder.patch |
| 269 | + - https://github.com/unicode-org/icu/pull/2070 |
| 270 | + - https://unicode-org.atlassian.net/browse/ICU-22005 |
| 271 | + |
| 272 | +14. Patch to fix C++20 enum issues |
| 273 | + patches/cxx20enum.patch |
| 274 | + - https://unicode-org.atlassian.net/browse/ICU-22014 |
| 275 | + - https://github.com/unicode-org/icu/pull/2084 |
| 276 | + |
| 277 | +15. Patch to remove ATOMIC_VAR_INIT for C++20 |
| 278 | + patches/rmATOMIC_VAR_INIT.patch |
| 279 | + - https://github.com/unicode-org/icu/pull/2090 |
| 280 | + |
| 281 | +16. Patch Calendar and TimeZone code to fix out-of-bound result in get() |
| 282 | + patches/calendar-get-out-of-bound.patch |
| 283 | + - https://unicode-org.atlassian.net/browse/ICU-22023 |
| 284 | + - https://github.com/unicode-org/icu/pull/2086 |
| 285 | + AND |
| 286 | + patches/calendar-get-out-of-bound2.patch |
| 287 | + - https://unicode-org.atlassian.net/browse/ICU-22043 |
| 288 | + - https://github.com/unicode-org/icu/pull/2100 |
| 289 | + |
| 290 | +17. Patch TimeZone to fix incorrect name for "Africa/Casablanca" |
| 291 | + patches/timezone-rawoffset.patch |
| 292 | + - https://github.com/unicode-org/icu/pull/2096 |
| 293 | + - https://unicode-org.atlassian.net/browse/ICU-22041 |
| 294 | + |
| 295 | +18. Patch NumberRangeFormatter to fix numbering system resolution. |
| 296 | + patches/number_range_format.patch |
| 297 | + - https://github.com/unicode-org/icu/pull/2085 |
| 298 | + - https://unicode-org.atlassian.net/browse/ICU-22017 |
| 299 | + |
| 300 | +19. Patch Calendar to return error ASAP to avoid incorrect assert |
| 301 | + patches/calendar_return_error_early.patch |
| 302 | + - https://github.com/unicode-org/icu/pull/2177 |
| 303 | + - https://unicode-org.atlassian.net/browse/ICU-22070 |
| 304 | + |