tilo · tilo · Nov 5, 2024 · Aug 4, 2024 · Nov 5, 2024 · Nov 5, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,45 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
+
+  CHANGED DEFAULT BEHAVIOR
+  ========================
+  These changes improve robustness and reduce the risk of data loss.
+
+  * implementing auto-detection of extra columns (thanks to James Fenley)
+
+  * improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288), thanks to Simon Rentzke, and [issue 283](https://github.com/tilo/smarter_csv/issues/283), thanks to James Fenley, Randall B, Matthew Kennedy)
+    -> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
+
+  * bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
+
+    * previous behavior:
+      when a CSV row had more columns than listed in the header, the additional columns were ignored
+
+    * new behavior:
+      * new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
+      * you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
+
+  * setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
+
+    The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
+
+    SmarterCSV is now using a safer default behavior.
+
+    * previous behavior:
+      Setting `user_provided_headers` did not change the default `headers_in_file: true`
+      If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
+
+    * new behavior:
+      Setting `user_provided_headers` sets `headers_in_file: false`
+      a) Improved behavior if there was no header in the input data.
+      b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
+
+    IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
+
+  * handling of numeric columns with leading zeroes, e.g. ZIP codes ([issue #151](https://github.com/tilo/smarter_csv/issues/151), thanks to David Moles): `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead.
+
 ## 1.12.1 (2024-07-10)
   * Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
 

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -54,3 +54,7 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Kenton Hirowatari](https://github.com/hirowatari)
  * [Daniel Pepper](https://github.com/dpep)
  * [Nicolas Castellanos](https://github.com/nicastelo)
+ * [James Fenley](https://github.com/rex-remind101)
+ * [Simon Rentzke](https://github.com/simonrentzke)
+ * [Randall B](https://github.com/randall-coding)
+ * [Matthew Kennedy](https://github.com/MattKitmanLabs)
diff --git a/docs/data_transformations.md b/docs/data_transformations.md
@@ -33,6 +33,8 @@ Here is an example of using `convert_values_to_numeric` for numbers with leading
    => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
 ```   
 
+This will return the column `:zip` as a string with all digits intact.
+
 ## Remove Zero Values
 `remove_zero_values` is disabled by default.
 When enabled, it removes key/value pairs which have a numeric value equal to zero.

diff --git a/docs/header_transformations.md b/docs/header_transformations.md
@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
    => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
 ```
 
+If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
+
 ## Key Mapping
 
 The above example already illustrates how intermediate keys can be mapped into something different.

diff --git a/docs/options.md b/docs/options.md
@@ -41,17 +41,18 @@
      | :skip_lines                 |   nil    | how many lines to skip before the first line or header line is processed             |
      | :comment_regexp             |   nil    | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/       |
      ---------------------------------------------------------------------------------------------------------------------------------
-     | :col_sep                    |   :auto   | column separator (default was ',')                                           |
+     | :col_sep                    |   :auto   | column separator (default was ',')                                                  |
      | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.             |
      |                             |          | e.g. when :quote_char is not properly escaped                                        |
      | :row_sep                    |  :auto   | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
      |                             |          | This can also be set to :auto, but will process the whole cvs file first  (slow!)    |
      | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
      ---------------------------------------------------------------------------------------------------------------------------------
-     | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
-     |                             |          | Important if the file does not contain headers,                                      |
-     |                             |          | otherwise you would lose the first line of data.                                     |
+     | :headers_in_file            |  true(1) | Whether or not the file contains headers as the first line.                          |
+     |                             |          | (1): if `user_provided_headers` is given, the default is `false`,                    |
+     |                             |          | unless you specify it to be explicitly `true`.                                       |
+     |                             |          | This prevents losing the first line of data, which is otherwise assumed to be a header. |
      | :duplicate_header_suffix    |   ''     | Adds numbers to duplicated headers and separates them by the given suffix.           |
      |                             |          | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior)        |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
@@ -61,6 +62,8 @@
      | :remove_empty_hashes        |   true   | remove / ignore any hashes which don't have any key/value pairs or all empty values  |
      | :verbose                    |   false  | print out line number while processing (to track down problems in input files)       |
      | :with_line_numbers          |   false  | add :csv_line_number to each data hash                                               |
+     | :missing_header_prefix      |  column_ | prefix for auto-generated headers of extra columns (any string)                      |
+     | :strict                     |   false  | When set to `true`, extra columns will raise MalformedCSV exception                  |
      ---------------------------------------------------------------------------------------------------------------------------------
 
 Additional 1.x Options which may be replaced in 2.0
@@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
      | Option                      | Default  |  Explanation                                                                         |
      ---------------------------------------------------------------------------------------------------------------------------------
      | :key_mapping                |   nil    | a hash which maps headers from the CSV file to keys in the result hash               |
-     | :silence_missing_keys        |   false  | ignore missing keys in `key_mapping`                                   |
-     |                             |          | if set to true: makes all mapped keys optional                         |
+     | :silence_missing_keys        |   false  | ignore missing keys in `key_mapping`                                                |
+     |                             |          | if set to true: makes all mapped keys optional                                       |
      |                             |          | if given an array, makes only the keys listed in it optional                         |
-     | :required_keys              |   nil    | An array. Specify the required names AFTER header transformation.                  |
-     | :required_headers           |   nil    | (DEPRECATED / renamed) Use `required_keys` instead                          |
+     | :required_keys              |   nil    | An array. Specify the required names AFTER header transformation.                    |
+     | :required_headers           |   nil    | (DEPRECATED / renamed) Use `required_keys` instead                                   |
      |                             |          | or an exception is raised   No validation if nil is given.                           |
      | :remove_unmapped_keys       |   false  | when using :key_mapping option, should non-mapped keys / columns be removed?         |
      | :downcase_header            |   true   | downcase all column headers                                                          |

diff --git a/ext/smarter_csv/smarter_csv.c b/ext/smarter_csv/smarter_csv.c
@@ -9,9 +9,10 @@
   #define true  ((bool)1)
 #endif
 
-/*
-   max_size: pass nil if no limit is specified
- */
+VALUE SmarterCSV = Qnil;
+VALUE eMalformedCSVError = Qnil;
+VALUE Parser = Qnil;
+
 static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
   if (RB_TYPE_P(line, T_NIL) == 1) {
     return rb_ary_new();
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
   char *startP = RSTRING_PTR(line); /* may not be null terminated */
   long line_len = RSTRING_LEN(line);
-  char *endP = startP + line_len ; /* points behind the string */
+  char *endP = startP + line_len; /* points behind the string */
   char *p = startP;
 
   char *col_sepP = RSTRING_PTR(col_sep);
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   VALUE field;
   long i;
 
-  char prev_char = '\0'; // Store the previous character for comparison against an escape character
-  long backslash_count = 0; // to count consecutive backslash characters
+  /* Variables for escaped quote handling */
+  long backslash_count = 0;
+  bool in_quotes = false;
 
   while (p < endP) {
     /* does the remaining string start with col_sep ? */
     col_sep_found = true;
-    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+    for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
       col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
     }
-    /* if col_sep was found and we have even quotes */
-    if (col_sep_found && (quote_count % 2 == 0)) {
-      /* if max_size != nil && lements.size >= header_size */
+    /* if col_sep was found and we're not inside quotes */
+    if (col_sep_found && !in_quotes) {
+      /* if max_size != nil && elements.size >= header_size */
       if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
         break;
       } else {
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
 
         p += col_sep_len;
         startP = p;
+        backslash_count = 0; // Reset backslash count at the start of a new field
       }
     } else {
       if (*p == '\\') {
         backslash_count++;
       } else {
-        if (*p == *quoteP && (backslash_count % 2 == 0)) {
-          quote_count++;
+        if (*p == *quoteP) {
+          if (backslash_count % 2 == 0) {
+            /* Even number of backslashes means quote is not escaped */
+            in_quotes = !in_quotes;
+          }
+          /* Else, quote is escaped; do nothing */
         }
-        backslash_count = 0; // no more consecutive backslash characters
+        backslash_count = 0; // Reset after any character other than backslash
       }
       p++;
     }
-
-    prev_char = *(p - 1); // Update the previous character
   } /* while */
 
+  /* Check for unclosed quotes at the end of the line */
+  if (in_quotes) {
+    rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
+  }
+
   /* check if the last part of the line needs to be processed */
   if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
     /* copy the remaining line as a field with original encoding onto the results */
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   return elements;
 }
 
-VALUE SmarterCSV = Qnil;
-VALUE Parser = Qnil;
-
 void Init_smarter_csv(void) {
-  SmarterCSV = rb_define_module("SmarterCSV");
-  Parser = rb_define_module_under(SmarterCSV, "Parser");
+  // these modules and the error class are already defined in Ruby code, make them accessible:
+  SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
+  Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
+  eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
 
   rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
 }
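The quote handling in the C loop above can be sketched in plain Ruby. This is a simplified illustration, not the gem's Ruby fallback: `split_line_sketch` and `MalformedCSVSketch` are hypothetical names, the `max_size` limit is omitted, and fields are returned with their surrounding quotes left in place.

```ruby
# Hypothetical stand-in for SmarterCSV::MalformedCSV, so the sketch runs without the gem.
class MalformedCSVSketch < StandardError; end

# Splits one line on col_sep, ignoring separators inside quoted sections.
# Mirrors the in_quotes / backslash_count bookkeeping of the C loop.
def split_line_sketch(line, col_sep = ',', quote_char = '"')
  fields = []
  start = 0
  pos = 0
  in_quotes = false
  backslash_count = 0

  while pos < line.length
    if !in_quotes && line[pos, col_sep.length] == col_sep
      fields << line[start...pos]
      pos += col_sep.length
      start = pos
      backslash_count = 0 # reset at the start of a new field
    else
      if line[pos] == '\\'
        backslash_count += 1
      else
        # a quote preceded by an even number of backslashes is not escaped
        in_quotes = !in_quotes if line[pos] == quote_char && backslash_count.even?
        backslash_count = 0
      end
      pos += 1
    end
  end

  raise MalformedCSVSketch, "Unclosed quoted field detected in line: #{line}" if in_quotes

  fields << line[start..-1] # remainder of the line is the last field
  fields
end
```

Tracking a boolean instead of testing `quote_count % 2` makes the end-of-line check explicit: if `in_quotes` is still true once the line is consumed, the quote characters are unbalanced.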
diff --git a/lib/smarter_csv/auto_detection.rb b/lib/smarter_csv/auto_detection.rb
@@ -13,14 +13,13 @@ def guess_column_separator(filehandle, options)
       delimiters = [',', "\t", ';', ':', '|']
 
       line = nil
+      escaped_quote = Regexp.escape(options[:quote_char])
       has_header = options[:headers_in_file]
       candidates = Hash.new(0)
       count = has_header ? 1 : 5
       count.times do
         line = readline_with_counts(filehandle, options)
         delimiters.each do |d|
-          escaped_quote = Regexp.escape(options[:quote_char])
-
           # Count only non-quoted occurrences of the delimiter
           non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
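As a standalone illustration of the quoted-section removal used above, the following sketch counts a delimiter only outside quoted sections. `count_unquoted_sketch` is a hypothetical helper, and counting via `scan` is an assumption about how the remaining (not shown) code tallies candidates.

```ruby
# Counts occurrences of delimiter that are not inside a quoted section.
def count_unquoted_sketch(line, delimiter, quote_char = '"')
  q = Regexp.escape(quote_char)
  # drop every quoted section, then join the remaining pieces
  non_quoted_text = line.split(/#{q}[^#{q}]*#{q}/).join
  non_quoted_text.scan(delimiter).length
end
```

For the line `a,"b,c",d` this counts two commas, because the comma inside the quotes is removed together with the quoted section.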
 

diff --git a/lib/smarter_csv/errors.rb b/lib/smarter_csv/errors.rb
@@ -11,6 +11,7 @@ class DuplicateHeaders < SmarterCSVException; end
   class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
   class NoColSepDetected < SmarterCSVException; end
   class KeyMappingError < SmarterCSVException; end
+  class MalformedCSV < SmarterCSVException; end
   # Writer:
   class InvalidInputData < SmarterCSVException; end
 end
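According to the changelog, `MalformedCSV` is also raised for extra columns when `strict: true` is set, while the default is to auto-generate headers using `missing_header_prefix`. A minimal sketch of that decision follows. `headers_for_row_sketch` and `MalformedCSVSketch` are hypothetical names, and numbering the generated headers by 1-based column position is an assumption based on the changelog example (`:column_7`, `:column_8`).

```ruby
# Hypothetical stand-in for SmarterCSV::MalformedCSV, so the sketch runs without the gem.
class MalformedCSVSketch < StandardError; end

# Returns the headers to use for a row, extending them when the row has extra columns.
def headers_for_row_sketch(headers, row_values, missing_header_prefix: 'column_', strict: false)
  extra = row_values.size - headers.size
  return headers if extra <= 0

  raise MalformedCSVSketch, "extra columns detected: #{extra}" if strict

  # assumption: generated headers are numbered by their 1-based column position
  generated = ((headers.size + 1)..row_values.size).map { |i| :"#{missing_header_prefix}#{i}" }
  headers + generated
end
```

With two headers and a four-column row, the sketch returns `[:a, :b, :column_3, :column_4]`; with `strict: true` it raises instead.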
diff --git a/lib/smarter_csv/options.rb b/lib/smarter_csv/options.rb
@@ -26,6 +26,7 @@ module Options
       invalid_byte_sequence: '',
       keep_original_headers: false,
       key_mapping: nil,
+      missing_header_prefix: 'column_',
       quote_char: '"',
       remove_empty_hashes: true,
       remove_empty_values: true,
@@ -37,6 +38,7 @@ module Options
       row_sep: :auto, # was: $/,
       silence_missing_keys: false,
       skip_lines: nil,
+      strict: false,
       strings_as_keys: false,
       strip_chars_from_headers: nil,
       strip_whitespace: true,
@@ -50,6 +52,18 @@ module Options
     def process_options(given_options = {})
       puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
 
+      # Special case for :user_provided_headers:
+      #
+      # If we used the default `headers_in_file: true` while `:user_provided_headers` is given,
+      # we could lose the first data row.
+      #
+      # We now err on the side of treating an actual header as data, rather than losing a data row.
+      #
+      if given_options[:user_provided_headers] && !given_options.key?(:headers_in_file)
+        given_options[:headers_in_file] = false
+        puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
+      end
+
       @options = DEFAULT_OPTIONS.dup.merge!(given_options)
 
       # fix invalid input
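The effect of the special case above on the resolved options can be sketched as follows. `resolve_options_sketch` and `DEFAULTS_SKETCH` are hypothetical names with only two of the real defaults, and unlike the code in the diff the sketch copies the given hash instead of modifying it.

```ruby
# Hypothetical subset of the real defaults.
DEFAULTS_SKETCH = { headers_in_file: true, user_provided_headers: nil }.freeze

# Resolves options so that user_provided_headers implies headers_in_file: false,
# unless the caller passed :headers_in_file explicitly.
def resolve_options_sketch(given = {})
  given = given.dup # avoid mutating the caller's hash
  if given[:user_provided_headers] && !given.key?(:headers_in_file)
    given[:headers_in_file] = false
  end
  DEFAULTS_SKETCH.merge(given)
end
```

Checking for the presence of the key, rather than its value, is what lets an explicit `headers_in_file: true` override the new default when the file does contain a header line.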