diff --git a/CHANGELOG.md b/CHANGELOG.md index 54b59a1..20f90b7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,6 +1,45 @@ # SmarterCSV 1.x Change Log +## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡ + + CHANGED DEFAULT BEHAVIOR + ======================== + The changes are to improve robustness and to reduce the risk of data loss + + * implementing auto-detection of extra columns (thanks to James Fenley) + + * improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288)) thanks to Simon Rentzke), and ([issue 283](https://github.com/tilo/smarter_csv/issues/283)) thanks to James Fenley, Randall B, Matthew Kennedy) + -> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char. + + * bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley) + + * previous behavior: + when a CSV row had more columns than listed in the header, the additional columns were ignored + + * new behavior: + * new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc + * you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead + + * setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282)) + + The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file). + + SmarterCSV is now using a safer default behavior. + + * previous behavior: + Setting `user_provided_headers` did not change the default `headers_in_file: true` + If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data. + + * new behavior: + Setting `user_provided_headers` sets`headers_in_file: false` + a) Improved behavior if there was no header in the input data. + b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data. + + IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record. + + * handling of numeric columns with leading zeroes, e.g. ZIP codes. ([issue #151](https://github.com/tilo/smarter_csv/issues/151) thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead. + ## 1.12.1 (2024-07-10) * Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos) diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index f771736..8dc180e 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -54,3 +54,7 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed * [Kenton Hirowatari](https://github.com/hirowatari) * [Daniel Pepper](https://github.com/dpep) * [Nicolas Castellanos](https://github.com/nicastelo) + * [James Fenley](https://github.com/rex-remind101) + * [Simon Rentzke](https://github.com/simonrentzke) + * [Randall B](https://github.com/randall-coding) + * [Matthew Kennedy](https://github.com/MattKitmanLabs) diff --git a/docs/data_transformations.md b/docs/data_transformations.md index c1c5ded..20d9071 100644 --- a/docs/data_transformations.md +++ b/docs/data_transformations.md @@ -33,6 +33,8 @@ Here is an example of using `convert_values_to_numeric` for numbers with leading => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}] ``` +This will return the column `:zip` as a string with all digits intact. + ## Remove Zero Values `remove_zero_values` is disabled by default. When enabled, it removes key/value pairs which have a numeric value equal to zero. diff --git a/docs/header_transformations.md b/docs/header_transformations.md index 86af3b1..c397a57 100644 --- a/docs/header_transformations.md +++ b/docs/header_transformations.md @@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}] ``` +If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error. + ## Key Mapping The above example already illustrates how intermediate keys can be mapped into something different. diff --git a/docs/options.md b/docs/options.md index fff28a1..e2f3f51 100644 --- a/docs/options.md +++ b/docs/options.md @@ -41,7 +41,7 @@ | :skip_lines | nil | how many lines to skip before the first line or header line is processed | | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ | --------------------------------------------------------------------------------------------------------------------------------- - | :col_sep | :auto | column separator (default was ',') | + | :col_sep | :auto | column separator (default was ',') | | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. | | | | e.g. when :quote_char is not properly escaped | | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") | @@ -49,9 +49,10 @@ | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. | | :quote_char | '"' | quotation character | --------------------------------------------------------------------------------------------------------------------------------- - | :headers_in_file | true | Whether or not the file contains headers as the first line. | - | | | Important if the file does not contain headers, | - | | | otherwise you would lose the first line of data. | + | :headers_in_file | true(1) | Whether or not the file contains headers as the first line. | + | | | (1): if `user_provided_headers` is given, the default is `false`, | + | | | unless you specify it to be explicitly `true`. | + | | | This prevents losing the first line of data, which is otherwise assumed to be a header. | | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. | | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) | | :user_provided_headers | nil | *careful with that axe!* | @@ -61,6 +62,8 @@ | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values | | :verbose | false | print out line number while processing (to track down problems in input files) | | :with_line_numbers | false | add :csv_line_number to each data hash | + | :missing_header_prefix | column_ | can be set to a string of your liking | + | :strict | false | When set to `true`, extra columns will raise MalformedCSV exception | --------------------------------------------------------------------------------------------------------------------------------- Additional 1.x Options which may be replaced in 2.0 @@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin | Option | Default | Explanation | --------------------------------------------------------------------------------------------------------------------------------- | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash | - | :silence_missing_keys | false | ignore missing keys in `key_mapping` | - | | | if set to true: makes all mapped keys optional | + | :silence_missing_keys | false | ignore missing keys in `key_mapping` | + | | | if set to true: makes all mapped keys optional | | | | if given an array, makes only the keys listed in it optional | - | :required_keys | nil | An array. Specify the required names AFTER header transformation. | - | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead | + | :required_keys | nil | An array. Specify the required names AFTER header transformation. | + | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead | | | | or an exception is raised No validation if nil is given. | | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? | | :downcase_header | true | downcase all column headers | diff --git a/ext/smarter_csv/smarter_csv.c b/ext/smarter_csv/smarter_csv.c index 6f70320..76931e7 100644 --- a/ext/smarter_csv/smarter_csv.c +++ b/ext/smarter_csv/smarter_csv.c @@ -9,9 +9,10 @@ #define true ((bool)1) #endif -/* - max_size: pass nil if no limit is specified - */ +VALUE SmarterCSV = Qnil; +VALUE eMalformedCSVError = Qnil; +VALUE Parser = Qnil; + static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) { if (RB_TYPE_P(line, T_NIL) == 1) { return rb_ary_new(); @@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */ char *startP = RSTRING_PTR(line); /* may not be null terminated */ long line_len = RSTRING_LEN(line); - char *endP = startP + line_len ; /* points behind the string */ + char *endP = startP + line_len; /* points behind the string */ char *p = startP; char *col_sepP = RSTRING_PTR(col_sep); @@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot VALUE field; long i; - char prev_char = '\0'; // Store the previous character for comparison against an escape character - long backslash_count = 0; // to count consecutive backslash characters + /* Variables for escaped quote handling */ + long backslash_count = 0; + bool in_quotes = false; while (p < endP) { /* does the remaining string start with col_sep ? */ col_sep_found = true; - for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) { + for(i=0; (i < col_sep_len) && (p+i < endP); i++) { col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i)); } - /* if col_sep was found and we have even quotes */ - if (col_sep_found && (quote_count % 2 == 0)) { - /* if max_size != nil && lements.size >= header_size */ + /* if col_sep was found and we're not inside quotes */ + if (col_sep_found && !in_quotes) { + /* if max_size != nil && elements.size >= header_size */ if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) { break; } else { @@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot p += col_sep_len; startP = p; + backslash_count = 0; // Reset backslash count at the start of a new field } } else { if (*p == '\\') { backslash_count++; } else { - if (*p == *quoteP && (backslash_count % 2 == 0)) { - quote_count++; + if (*p == *quoteP) { + if (backslash_count % 2 == 0) { + /* Even number of backslashes means quote is not escaped */ + in_quotes = !in_quotes; + } + /* Else, quote is escaped; do nothing */ } - backslash_count = 0; // no more consecutive backslash characters + backslash_count = 0; // Reset after any character other than backslash } p++; } - - prev_char = *(p - 1); // Update the previous character } /* while */ + /* Check for unclosed quotes at the end of the line */ + if (in_quotes) { + rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line)); + } + /* check if the last part of the line needs to be processed */ if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) { /* copy the remaining line as a field with original encoding onto the results */ @@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot return elements; } -VALUE SmarterCSV = Qnil; -VALUE Parser = Qnil; - void Init_smarter_csv(void) { - SmarterCSV = rb_define_module("SmarterCSV"); - Parser = rb_define_module_under(SmarterCSV, "Parser"); + // these modules and the error class are already defined in Ruby code, make them accessible: + SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV")); + Parser = rb_const_get(SmarterCSV, rb_intern("Parser")); + eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV")); rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4); } diff --git a/lib/smarter_csv/auto_detection.rb b/lib/smarter_csv/auto_detection.rb index 31564f2..717ff4b 100644 --- a/lib/smarter_csv/auto_detection.rb +++ b/lib/smarter_csv/auto_detection.rb @@ -13,14 +13,13 @@ def guess_column_separator(filehandle, options) delimiters = [',', "\t", ';', ':', '|'] line = nil + escaped_quote = Regexp.escape(options[:quote_char]) has_header = options[:headers_in_file] candidates = Hash.new(0) count = has_header ? 1 : 5 count.times do line = readline_with_counts(filehandle, options) delimiters.each do |d| - escaped_quote = Regexp.escape(options[:quote_char]) - # Count only non-quoted occurrences of the delimiter non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join diff --git a/lib/smarter_csv/errors.rb b/lib/smarter_csv/errors.rb index ffe8d7d..ad285d2 100644 --- a/lib/smarter_csv/errors.rb +++ b/lib/smarter_csv/errors.rb @@ -11,6 +11,7 @@ class DuplicateHeaders < SmarterCSVException; end class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders class NoColSepDetected < SmarterCSVException; end class KeyMappingError < SmarterCSVException; end + class MalformedCSV < SmarterCSVException; end # Writer: class InvalidInputData < SmarterCSVException; end end diff --git a/lib/smarter_csv/options.rb b/lib/smarter_csv/options.rb index 1c89459..6686f8e 100644 --- a/lib/smarter_csv/options.rb +++ b/lib/smarter_csv/options.rb @@ -26,6 +26,7 @@ module Options invalid_byte_sequence: '', keep_original_headers: false, key_mapping: nil, + missing_header_prefix: 'column_', quote_char: '"', remove_empty_hashes: true, remove_empty_values: true, @@ -37,6 +38,7 @@ module Options row_sep: :auto, # was: $/, silence_missing_keys: false, skip_lines: nil, + strict: false, strings_as_keys: false, strip_chars_from_headers: nil, strip_whitespace: true, @@ -50,6 +52,18 @@ module Options def process_options(given_options = {}) puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose] + # Special case for :user_provided_headers: + # + # If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given, + # we could lose the first data row + # + # We now err on the side of treating an actual header as data, rather than losing a data row. + # + if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file) + given_options[:headers_in_file] = false + puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers." + end + @options = DEFAULT_OPTIONS.dup.merge!(given_options) # fix invalid input diff --git a/lib/smarter_csv/parser.rb b/lib/smarter_csv/parser.rb index f038a4a..abe68c9 100644 --- a/lib/smarter_csv/parser.rb +++ b/lib/smarter_csv/parser.rb @@ -7,6 +7,8 @@ module Parser ### ### Thin wrapper around C-extension ### + ### NOTE: we are no longer passing-in header_size + ### def parse(line, options, header_size = nil) # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose] @@ -31,59 +33,83 @@ def parse(line, options, header_size = nil) # - we are not assuming that quotes inside a fields need to be doubled # - we are not assuming that all fields need to be quoted (0 is even) # - works with multi-char col_sep - # - if header_size is given, only up to header_size fields are parsed # - # We use header_size for parsing the body lines to make sure we always match the number of headers - # in case there are trailing col_sep characters in line + # NOTE: we are no longer passing-in header_size # - # Our convention is that empty fields are returned as empty strings, not as nil. + # - if header_size was given, only up to header_size fields are parsed # + # We used header_size for parsing the body lines to make sure we always match the number of headers + # in case there are trailing col_sep characters in line # - # the purpose of the max_size parameter is to handle a corner case where - # CSV lines contain more fields than the header. - # In which case the remaining fields in the line are ignored + # the purpose of the max_size parameter was to handle a corner case where + # CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored # + # Our convention is that empty fields are returned as empty strings, not as nil. + def parse_csv_line_ruby(line, options, header_size = nil) - return [] if line.nil? + return [[], 0] if line.nil? line_size = line.size col_sep = options[:col_sep] col_sep_size = col_sep.size quote = options[:quote_char] - quote_count = 0 elements = [] start = 0 i = 0 - previous_char = '' + backslash_count = 0 + in_quotes = false + while i < line_size - if line[i...i+col_sep_size] == col_sep && quote_count.even? + # Check if the current position matches the column separator and we're not inside quotes + if line[i...i+col_sep_size] == col_sep && !in_quotes break if !header_size.nil? && elements.size >= header_size elements << cleanup_quotes(line[start...i], quote) - previous_char = line[i] - i += col_sep.size + i += col_sep_size start = i + backslash_count = 0 # Reset backslash count at the start of a new field else - quote_count += 1 if line[i] == quote && previous_char != '\\' - previous_char = line[i] + if line[i] == '\\' + backslash_count += 1 + else + if line[i] == quote + if backslash_count % 2 == 0 + # Even number of backslashes means quote is not escaped + in_quotes = !in_quotes + end + # Else, quote is escaped; do nothing + end + backslash_count = 0 # Reset after any character other than backslash + end i += 1 end end - elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size + + # Check for unclosed quotes at the end of the line + if in_quotes + raise MalformedCSV, "Unclosed quoted field detected in line: #{line}" + end + + # Process the remaining field + if header_size.nil? || elements.size < header_size + elements << cleanup_quotes(line[start..-1], quote) + end + [elements, elements.size] end def cleanup_quotes(field, quote) return field if field.nil? - # return if field !~ /#{quote}/ # this check can probably eliminated - + # Remove surrounding quotes if present if field.start_with?(quote) && field.end_with?(quote) - field.delete_prefix!(quote) - field.delete_suffix!(quote) + field = field[1..-2] end - field.gsub!("#{quote}#{quote}", quote) + + # Replace double quotes with a single quote + field.gsub!("#{quote * 2}", quote) + field end end diff --git a/lib/smarter_csv/reader.rb b/lib/smarter_csv/reader.rb index be0527f..c7f76b5 100644 --- a/lib/smarter_csv/reader.rb +++ b/lib/smarter_csv/reader.rb @@ -62,7 +62,8 @@ def process(&block) # rubocop:disable Lint/UnusedMethodArgument skip_lines(fh, options) - @headers, header_size = process_headers(fh, options) + # NOTE: we are no longer using header_size + @headers, _header_size = process_headers(fh, options) @headerA = @headers # @headerA is deprecated, use @headers puts "Effective headers:\n#{pp(@headers)}\n" if @verbose @@ -97,14 +98,23 @@ def process(&block) # rubocop:disable Lint/UnusedMethodArgument multiline = count_quote_chars(line, options[:quote_char]).odd? while multiline - next_line = fh.readline(options[:row_sep]) - next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8 - line += next_line - @file_line_count += 1 - - break if fh.eof? # Exit loop if end of file is reached - - multiline = count_quote_chars(line, options[:quote_char]).odd? + begin + next_line = fh.readline(options[:row_sep]) + next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8 + line += next_line + @file_line_count += 1 + + multiline = count_quote_chars(line, options[:quote_char]).odd? + rescue EOFError + # End of file reached. Check if quotes are balanced. + total_quotes = count_quote_chars(line, options[:quote_char]) + if total_quotes.odd? + raise MalformedCSV, "Unclosed quoted field detected in multiline data" + else + # Quotes are balanced; proceed without raising an error. + break + end + end end # :nocov: @@ -116,7 +126,18 @@ def process(&block) # rubocop:disable Lint/UnusedMethodArgument line.chomp!(options[:row_sep]) # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------ - dataA, _data_size = parse(line, options, header_size) + dataA, data_size = parse(line, options) # we parse the extra columns + + if options[:strict] + raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}" + else + # we create additional columns on-the-fly + current_size = @headers.size + while current_size < data_size + @headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym + current_size += 1 + end + end dataA.map!{|x| x.strip} if options[:strip_whitespace] diff --git a/lib/smarter_csv/version.rb b/lib/smarter_csv/version.rb index df26db3..410fc6c 100644 --- a/lib/smarter_csv/version.rb +++ b/lib/smarter_csv/version.rb @@ -1,5 +1,5 @@ # frozen_string_literal: true module SmarterCSV - VERSION = "1.12.1" + VERSION = "1.13.0" end diff --git a/spec/features/converters/convert_values_to_numeric_spec.rb b/spec/features/converters/convert_values_to_numeric_spec.rb index f61e95b..e24640b 100644 --- a/spec/features/converters/convert_values_to_numeric_spec.rb +++ b/spec/features/converters/convert_values_to_numeric_spec.rb @@ -26,12 +26,13 @@ end it 'can be prevented for some keys' do - options = { convert_values_to_numeric: { except: :reference }} + options = { convert_values_to_numeric: { except: [:reference, :zip_code] }} data = SmarterCSV.process("#{fixture_path}/numeric.csv", options) data.each do |hash| expect(hash[:wealth]).to be_a_kind_of(Numeric) unless hash[:wealth].nil? expect(hash[:reference]).to be_a_kind_of(String) unless hash[:reference].nil? + expect(hash[:zip_code]).to be_a_kind_of(String) unless hash[:zip_code].nil? end end diff --git a/spec/features/general/basic_spec.rb b/spec/features/general/basic_spec.rb index 1ebbaf8..b1e56cf 100644 --- a/spec/features/general/basic_spec.rb +++ b/spec/features/general/basic_spec.rb @@ -48,8 +48,8 @@ end end - context 'with full user_provided_headers' do - let(:options) { super().merge({user_provided_headers: %i[a b c d e f]}) } + context 'with correct cardinality user_provided_headers' do + let(:options) { super().merge({user_provided_headers: %i[a b c d e f], headers_in_file: true}) } it 'replaces headers with user_provided_headers' do data = reader.process @@ -60,8 +60,41 @@ end end - context 'with partial user_provided_headers' do - let(:options) { super().merge({user_provided_headers: %i[a b c d e]}) } + # We simulate that there is no existing header in the input, by skipping the first line. + # Because we don't provide `headers_in_file` in our options, they will be set to `false`. + # + context 'without headers_in_file, and with insufficient user_provided_headers' do + let(:options) { super().merge({user_provided_headers: %i[a b c d e], skip_lines: 1}) } + + it 'processes all 5 data rows, automatically adds column_6' do + data = reader.process + expect(data.size).to eq 5 + end + + it 'has no raw_header' do + reader.process + expect(reader.raw_header).to eq nil # because there was no header in the input + end + + it 'automatically adds a column_6 because the given headers were not sufficient' do + reader.process + expect(reader.headers).to eq %i[a b c d e column_6] + end + + context 'when :strict is set' do + let(:options) { super().merge({user_provided_headers: %i[a b c d e], strict: true, skip_lines: 1}) } + + it 'raises an exception if the number of user_provided_headers is incorrect' do + expect do + reader.process + end.to raise_exception(SmarterCSV::HeaderSizeMismatch) + end + end + end + + # if there is an existing header in the input, the user_provided_headers must have same cardinality + context 'with incorrect cardinality of user_provided_headers' do + let(:options) { super().merge({user_provided_headers: %i[a b c d e], headers_in_file: true}) } it 'raises an exception if the number of user_provided_headers is incorrect' do expect do @@ -71,7 +104,7 @@ end context 'with empty user_provided_headers' do - let(:options) { super().merge({user_provided_headers: []}) } + let(:options) { super().merge({user_provided_headers: [], headers_in_file: true}) } it 'raises an exception if the user_provided_headers is empty' do expect do @@ -80,10 +113,10 @@ end end - context 'with incorrect user_provided_headers' do - let(:options) { super().merge({user_provided_headers: {}}) } + context 'with invalid user_provided_headers' do + let(:options) { super().merge({user_provided_headers: {}, headers_in_file: true}) } - it 'raises an exception if the user_provided_headers is of incorrect type' do + it 'raises an exception if the user_provided_headers is of invalid type' do expect do reader.process end.to raise_exception(SmarterCSV::IncorrectOption, /ERROR: incorrect format for user_provided_headers! Expecting array with headers/) diff --git a/spec/features/header_handling/duplicate_headers_spec.rb b/spec/features/header_handling/duplicate_headers_spec.rb index 6288e97..a45cfea 100644 --- a/spec/features/header_handling/duplicate_headers_spec.rb +++ b/spec/features/header_handling/duplicate_headers_spec.rb @@ -12,7 +12,7 @@ it 'raises error when user_provided_headers with duplicates are given' do expect do - options = {user_provided_headers: %i[a b c d a]} + options = {user_provided_headers: %i[a b c d a], headers_in_file: false} SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options) end.to raise_exception(SmarterCSV::DuplicateHeaders) end @@ -49,7 +49,7 @@ end it 'raises when duplicate headers are given' do - options.merge!({user_provided_headers: %i[a b c a a]}) + options.merge!({user_provided_headers: %i[a b c a a], headers_in_file: false}) expect do SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options) end.to raise_exception(SmarterCSV::DuplicateHeaders) @@ -76,7 +76,7 @@ end it 'raises when duplicate headers are given' do - options.merge!({user_provided_headers: %i[a b c a a]}) + options.merge!({user_provided_headers: %i[a b c a a], headers_in_file: false}) expect do SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options) end.to raise_exception(SmarterCSV::DuplicateHeaders) diff --git a/spec/features/special_cases/extra_columns_spec.rb b/spec/features/special_cases/extra_columns_spec.rb new file mode 100644 index 0000000..17a17b8 --- /dev/null +++ b/spec/features/special_cases/extra_columns_spec.rb @@ -0,0 +1,109 @@ +# frozen_string_literal: true + +fixture_path = 'spec/fixtures' + +# when rows with extra columns are present, we want to be robust against this error, and we auto-generate headers +# +describe 'CSV file with more columns that shown in header' do + let(:csv_path) { "#{fixture_path}/extra_columns.csv" } + + [true, false].each do |bool| + context "with#{bool ? ' C-' : 'out '}acceleration" do + let(:options) { { acceleration: bool } } + let(:reader) { SmarterCSV::Reader.new(csv_path, options) } + + context "when strict mode" do + before do + options.merge!(strict: true) + end + + it "raises an exception" do + expect{reader.process}.to raise_exception SmarterCSV::HeaderSizeMismatch, "extra columns detected on line 2" + end + end + + it "reads all lines of the file" do + data = reader.process + expect(data.size).to eq 5 + end + + context "when default behavior" do + # default behavior is to remove empty values + # it finds column_11, but it is empty, and will not show up as a hash + it "generates all extra columns" do + reader.process + expect(reader.headers).to eq %i[one two three four five six column_7 column_8 column_9 column_10 column_11] + end + + it "parses all rows correctly" do + data = reader.process + expect(data[0]).to eq({one: 1, two: 2, five: 5, column_8: 8, column_9: 9}) + expect(data[1]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6}) + expect(data[2]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6}) + expect(data[3]).to eq({one: 1, column_10: 10}) + expect(data[4]).to eq({one: 1, column_10: 10}) + end + end + + context "when keeping empty values" do + before do + options.merge!(remove_empty_values: false) + end + + it "generates all extra columns" do + reader.process + expect(reader.headers).to eq %i[one two three four five six column_7 column_8 column_9 column_10 column_11] + end + + it "parses all rows correctly" do + data = reader.process + expect(data[0]).to eq({one: 1, two: 2, three: '', four: '', five: 5, six: '', column_7: '', column_8: 8, column_9: 9}) + expect(data[1]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, column_7: nil, column_8: nil, column_9: nil}) + expect(data[2]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, column_7: '', column_8: nil, column_9: nil}) + expect(data[3]).to eq({one: 1, two: '', three: '', four: '', five: '', six: '', column_7: '', column_8: '', column_9: '', column_10: 10}) + expect(data[4]).to eq({one: 1, two: '', three: '', four: '', five: '', six: '', column_7: '', column_8: '', column_9: '', column_10: 10, column_11: ''}) + end + end + + context "when missing_header_prefix is changed" do + before do + options.merge!(missing_header_prefix: 'col_') + end + + it "generates all extra columns" do + reader.process + expect(reader.headers).to eq %i[one two three four five six col_7 col_8 col_9 col_10 col_11] + end + + it "parses all rows correctly" do + data = reader.process + expect(data[0]).to eq({one: 1, two: 2, five: 5, col_8: 8, col_9: 9}) + expect(data[1]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6}) + expect(data[2]).to eq({one: 1, two: 2, three: 3, four: 4, five: 5, six: 6}) + expect(data[3]).to eq({one: 1, col_10: 10}) + expect(data[4]).to eq({one: 1, col_10: 10}) + end + end + + context "when user did not provide enough headers manually" do + before do + options.merge!(headers_in_file: true, user_provided_headers: %i[a b c d e f]) + end + + it "generates all extra columns" do + reader.process + expect(reader.headers).to eq %i[a b c d e f column_7 column_8 column_9 column_10 column_11] + end + + it "parses all rows correctly" do + data = reader.process + expect(data[0]).to eq({a: 1, b: 2, e: 5, column_8: 8, column_9: 9}) + expect(data[1]).to eq({a: 1, b: 2, c: 3, d: 4, e: 5, f: 6}) + expect(data[2]).to eq({a: 1, b: 2, c: 3, d: 4, e: 5, f: 6}) + expect(data[3]).to eq({a: 1, column_10: 10}) + expect(data[4]).to eq({a: 1, column_10: 10}) + end + end + end + end +end diff --git a/spec/features/special_cases/malformed_spec.rb b/spec/features/special_cases/malformed_spec.rb index dc94989..1df8143 100644 --- a/spec/features/special_cases/malformed_spec.rb +++ b/spec/features/special_cases/malformed_spec.rb @@ -3,23 +3,48 @@ fixture_path = 'spec/fixtures' # according to RFC-4180 quotes inside of "words" shouldbe doubled, but our parser is robust against that. -describe 'malformed CSV quotes' do - context "malformed quotes in header" do - let(:csv_path) { "#{fixture_path}/malformed_header.csv" } - it 'should be resilient against single quotes' do - data = SmarterCSV.process(csv_path) - expect(data[0]).to eq({name: "Arnold Schwarzenegger", dobdob: "1947-07-30"}) - expect(data[1]).to eq({name: "Jeff Bridges", dobdob: "1949-12-04"}) - end - end - context "malformed quotes in content" do - let(:csv_path) { "#{fixture_path}/malformed.csv" } +[true, false].each do |bool| + describe "handling files with escaped quote chars with#{bool ? ' C-' : 'out '}acceleration" do + let(:options) { { acceleration: bool } } + let(:reader) { SmarterCSV::Reader.new(csv_path, options) } + + describe 'malformed CSV quotes' do + context "malformed quotes in header" do + let(:csv_path) { "#{fixture_path}/malformed_header.csv" } + it 'should be resilient against single quotes' do + data = reader.process + expect(data[0]).to eq({name: "Arnold Schwarzenegger", dobdob: "1947-07-30"}) + expect(data[1]).to eq({name: "Jeff Bridges", dobdob: "1949-12-04"}) + end + end + + context "malformed quotes in content" do + let(:csv_path) { "#{fixture_path}/malformed.csv" } + + it 'should be resilient against single quotes' do + data = reader.process + expect(data[0]).to eq({name: "Arnold Schwarzenegger", dob: "1947-07-30"}) + expect(data[1]).to eq({name: "Jeff \"the dude\" Bridges", dob: "1949-12-04"}) + end + end - it 'should be resilient against single quotes' do - data = SmarterCSV.process(csv_path) - expect(data[0]).to eq({name: "Arnold Schwarzenegger", dob: "1947-07-30"}) - expect(data[1]).to eq({name: "Jeff \"the dude\" Bridges", dob: "1949-12-04"}) + context "malformed quotes in content" do + let(:csv_path) { "#{fixture_path}/malformed_data_eof.csv" } + + it 'should raise MalformedCSV error' do + expect { reader.process }.to raise_error(SmarterCSV::MalformedCSV, "Unclosed quoted field detected in multiline data") + end + end + + context "malformed quotes in content" do + let(:csv_path) { "#{fixture_path}/malformed_data_gobbled.csv" } + + it 'should raise MalformedCSV error' do + expect { reader.process }.to raise_error(SmarterCSV::MalformedCSV, /Unclosed quoted field detected/) + end + end end + end end diff --git a/spec/fixtures/extra_columns.csv b/spec/fixtures/extra_columns.csv new file mode 100644 index 0000000..2e3355b --- /dev/null +++ b/spec/fixtures/extra_columns.csv @@ -0,0 +1,6 @@ +one,two,three,four,five,six +1,2,,,5,,,8,9 +1,2,3,4,5,6 +1,2,3,4,5,6, +1,,,,,,,,,10 +1,,,,,,,,,10, diff --git a/spec/fixtures/malformed_data_eof.csv b/spec/fixtures/malformed_data_eof.csv new file mode 100644 index 0000000..ebd4265 --- /dev/null +++ b/spec/fixtures/malformed_data_eof.csv @@ -0,0 +1,3 @@ +name,dob,height +Arnold Schwarzenegger,1947-07-30,188 +Jeff Bridges,1949-12-04,"5 diff --git a/spec/fixtures/malformed_data_gobbled.csv b/spec/fixtures/malformed_data_gobbled.csv new file mode 100644 index 0000000..92fb050 --- /dev/null +++ b/spec/fixtures/malformed_data_gobbled.csv @@ -0,0 +1,3 @@ +name,dob,height +Arnold Schwarzenegger,1947-07-30,6'2" +Jeff Bridges,1949-12-04, diff --git a/spec/fixtures/numeric.csv b/spec/fixtures/numeric.csv index 7d6e374..cd0f593 100644 --- a/spec/fixtures/numeric.csv +++ b/spec/fixtures/numeric.csv @@ -1,5 +1,5 @@ -First Name,Last Name,Reference, Wealth -Dan,McAllister,0123,3.5 +First Name,Last Name,Reference, Wealth, Zip Code +Dan,McAllister,0123,3.5,93456 ,,, -Miles,O'Brian,2345,3 -Nancy,Homes,2345,01 +Miles,O'Brian,2345,3,00223 +Nancy,Homes,2345,01,02343