Skip to content

Commit e6a0575

Browse files
committed
DRY logic to support files in arbitrary encoding
Introduce `read_utf8_lines` as a wrapper around `File.readlines` that ensures that it: 1. Takes the encoding the file was saved in into (by looking at its BOM if present) 2. Yield the lines as UTF-8 strings (after converting it to that encoding), especially to ensure that any subsequent matching using `RegExp.match?` doesn't throw a `Encoding::CompatibilityError` (as our `RegExp`s are UTF-8) This commit addresses the suggestion from #418 (comment)
1 parent c1e7e21 commit e6a0575

File tree

2 files changed

+39
-16
lines changed

2 files changed

+39
-16
lines changed

lib/fastlane/plugin/wpmreleasetoolkit/helper/ios/ios_l10n_helper.rb

Lines changed: 24 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,28 @@ def self.strings_file_type(path:)
3636
end
3737
end
3838

39+
# Read a file line by line and iterate over it (just like `File.readlines` does),
40+
# except that it also detects the encoding used by the file (using the BOM if present) when reading it,
41+
# and then convert each line to UTF-8 before yielding it
42+
#
43+
# This is particularly useful if you need to then use a `RegExp` to match part of the lines you're iterating over,
44+
# as the `RegExp` (which will typically be UTF-8) and the string you're matching with it have to use the same encoding
45+
# (otherwise we would get a `Encoding::CompatibilityError`)
46+
#
47+
# @important If you are then using a `RegExp` to match the UTF-8 lines you iterate on,
48+
# remember to use the `u` flag on it (`/…/u`) to make it UTF-8-aware too.
49+
#
50+
# @param [String] file The path to the file to read
51+
# @yield each line read from the file, after converting it to the UTF-8 encoding
52+
#
53+
def self.read_utf8_lines(file)
54+
# Be sure to guess file encoding using the Byte-Order-Mark, and fallback to UTF-8 if there's no BOM.
55+
File.readlines(file, mode: 'rb:BOM|UTF-8').map do |line|
56+
# Ensure the line is re-encoded to UTF-8 regardless of the encoding that was used in the input file
57+
line.encode(Encoding::UTF_8)
58+
end
59+
end
60+
3961
# Merge the content of multiple `.strings` files into a new `.strings` text file.
4062
#
4163
# @param [Hash<String, String>] paths The paths of the `.strings` files to merge together, associated with the prefix to prepend to each of their respective keys
@@ -68,11 +90,9 @@ def self.merge_strings(paths:, output_path:)
6890
all_keys_found += string_keys
6991

7092
tmp_file.write("/* MARK: - #{File.basename(input_file)} */\n\n")
71-
# Read line-by-line to reduce memory footprint during content copy; Be sure to guess file encoding using the Byte-Order-Mark.
72-
File.readlines(input_file, mode: 'rb:BOM|UTF-8').each do |line|
93+
# Read line-by-line to reduce memory footprint during content copy
94+
read_utf8_lines(input_file).each do |line|
7395
unless prefix.nil? || prefix.empty?
74-
# We need to ensure the line and RegExp are using the same encoding, so we transcode everything to UTF-8.
75-
line.encode!(Encoding::UTF_8)
7696
# The `/u` modifier on the RegExps is to make them UTF-8
7797
line.gsub!(/^(\s*")/u, "\\1#{prefix}") # Lines starting with a quote are considered to be start of a key; add prefix right after the quote
7898
line.gsub!(/^(\s*)([A-Z0-9_]+)(\s*=\s*")/ui, "\\1\"#{prefix}\\2\"\\3") # Lines starting with an identifier followed by a '=' are considered to be an unquoted key (typical in InfoPlist.strings files for example)

lib/fastlane/plugin/wpmreleasetoolkit/helper/ios/ios_strings_file_validation_helper.rb

Lines changed: 15 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -11,51 +11,51 @@ class StringsFileValidationHelper
1111

1212
TRANSITIONS = {
1313
root: {
14-
/\s/ => :root,
14+
/\s/u => :root,
1515
'/' => :maybe_comment_start,
1616
'"' => :in_quoted_key
1717
},
1818
maybe_comment_start: {
1919
'/' => :in_line_comment,
20-
/\*/ => :in_block_comment
20+
/\*/u => :in_block_comment
2121
},
2222
in_line_comment: {
2323
"\n" => :root,
24-
/./ => :in_line_comment
24+
/./u => :in_line_comment
2525
},
2626
in_block_comment: {
2727
/\*/ => :maybe_block_comment_end,
28-
/./m => :in_block_comment
28+
/./mu => :in_block_comment
2929
},
3030
maybe_block_comment_end: {
3131
'/' => :root,
32-
/./m => :in_block_comment
32+
/./mu => :in_block_comment
3333
},
3434
in_quoted_key: {
3535
'"' => lambda do |state, _|
3636
state.found_key = state.buffer.string.dup
3737
state.buffer.string = ''
3838
:after_quoted_key_before_eq
3939
end,
40-
/./ => lambda do |state, c|
40+
/./u => lambda do |state, c|
4141
state.buffer.write(c)
4242
:in_quoted_key
4343
end
4444
},
4545
after_quoted_key_before_eq: {
46-
/\s/ => :after_quoted_key_before_eq,
46+
/\s/u => :after_quoted_key_before_eq,
4747
'=' => :after_quoted_key_and_eq
4848
},
4949
after_quoted_key_and_eq: {
50-
/\s/ => :after_quoted_key_and_eq,
50+
/\s/u => :after_quoted_key_and_eq,
5151
'"' => :in_quoted_value
5252
},
5353
in_quoted_value: {
5454
'"' => :after_quoted_value,
55-
/./m => :in_quoted_value
55+
/./mu => :in_quoted_value
5656
},
5757
after_quoted_value: {
58-
/\s/ => :after_quoted_value,
58+
/\s/u => :after_quoted_value,
5959
';' => :root
6060
}
6161
}.freeze
@@ -70,8 +70,11 @@ def self.find_duplicated_keys(file:)
7070

7171
state = State.new(context: :root, buffer: StringIO.new, in_escaped_ctx: false, found_key: nil)
7272

73-
File.readlines(file, mode: 'rb:BOM|UTF-8').each_with_index do |line, line_no|
74-
line.encode('UTF-8').chars.each_with_index do |c, col_no|
73+
# Using our `each_utf8_line` helper instead of `File.readlines` ensures we can also read files that are
74+
# encoded in UTF-16, yet process each of their lines as a UTF-8 string, so that `RegExp#match?` don't throw
75+
# an `Encoding::CompatibilityError` exception. (Note how all our `RegExp`s in `TRANSITIONS` have the `u` flag)
76+
Fastlane::Helper::Ios::L10nHelper.read_utf8_lines(file).each_with_index do |line, line_no|
77+
line.chars.each_with_index do |c, col_no|
7578
# Handle escaped characters at a global level.
7679
# This is more straightforward than having to account for it in the `TRANSITIONS` table.
7780
if state.in_escaped_ctx || c == '\\'

0 commit comments

Comments
 (0)