Increase speed of Table#to_csv #348

Status: Open · wants to merge 4 commits into main
Conversation

@wishdev commented Jul 2, 2025

If we set the encoding when we call Table#to_csv, we do not need to go row by row to determine the encoding to use. This allows the use of CSV.generate_lines as a faster exporter.

Add a single_encoding option to Table#to_csv to use the encoding of the first row of the table as the encoding.
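
To illustrate the idea (a sketch against the released csv gem, without this PR — the table and data are made up), supplying :encoding up front is what leaves nothing to detect row by row:

```ruby
require "csv"

rows = [
  CSV::Row.new(%w[a b], %w[1 2]),
  CSV::Row.new(%w[a b], %w[3 4]),
]
table = CSV::Table.new(rows)

# Default path: the output encoding is determined from the rows themselves.
default_csv = table.to_csv

# With :encoding supplied there is nothing left to detect per row, which is
# what makes a single CSV.generate_lines call possible in this PR.
fast_csv = table.to_csv(encoding: Encoding::UTF_8)
```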

@kou (Member) commented Jul 3, 2025

Hmm. I'm not sure whether this approach is optimal.

Could you also add a benchmark to https://github.com/ruby/csv/tree/main/benchmark ? Let's discuss which approach is better with numbers.

@wishdev (Author) commented Jul 3, 2025

Benchmark added.

At 100 rows I get about a 5.8x improvement.
At 1000 rows I get about a 6.7x improvement.
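
For anyone who wants a quick local comparison, a minimal sketch with stdlib Benchmark (not the repo's benchmark-driver setup; data is made up, and on released csv without this PR both paths still scan row by row, so the gap only appears with the patch applied):

```ruby
require "csv"
require "benchmark"

rows = Array.new(1000) do |i|
  CSV::Row.new(%w[id name], [i.to_s, "name-#{i}"])
end
table = CSV::Table.new(rows)

Benchmark.bm(20) do |x|
  x.report("default (row scan)") { 20.times { table.to_csv } }
  x.report("encoding supplied")  { 20.times { table.to_csv(encoding: Encoding::UTF_8) } }
end
```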

@tompng (Member) commented Jul 3, 2025

> Argument +single_encoding+, if true, encoding will be determined from the first row.

I think it should scan every row and the header.
Below is an example CSV whose encoding can't be determined from the header or first row. I think this kind of CSV is pretty common.

id, language, description
1, English, ascii-only row
2, Français, alphabet + cédille first non-ascii row
3, Español, alphabet + ntilde character row

If it's going to scan every row, can single_encoding be determined automatically instead of being specified by a keyword parameter?
That way we don't need to change the method signature, Table#to_csv will be faster in many cases, and
we don't need an additional option to get this speed-up benefit.

@wishdev (Author) commented Jul 3, 2025

> Argument +single_encoding+, if true, encoding will be determined from the first row.

> I think it should scan every row and the header. Below is an example CSV whose encoding can't be determined from the header or first row. I think this kind of CSV is pretty common.

To be very clear (and I apologize, my documentation skills are poor): scanning every row is the current state of affairs. That path remains fully available and remains the default.

But when I have 100,000 rows of data and I already know the encoding, why would I want to scan every row? The single_encoding argument simply lets the class make a calculation I could do by hand and supply via the existing :encoding option (although this code makes :encoding much faster as well; see the benchmark).

So I guess I described this work wrong: I'm not seeking to increase the speed of Table#to_csv in general, but rather to provide a fast path, when asked for, that removes the line-by-line checking for information that I can either provide (via :encoding) or have the system calculate from the first row of a homogeneous data set (via :single_encoding).

If one is working with "random" CSV files (from a client, for example), this path may not make sense. The default continues to be the "safe and secure" (but sometimes failing; see the added test) option of scanning each row and exporting each row as an individual single-row CSV string that is joined together at the end. But again, at 1,000+ rows there is a noticeable slowdown computing information that may not need to be calculated when the data is produced "in house".

One could remove :single_encoding, expose the little bit of code it adds as a separate method (Row#calculate_encoding), and have callers set the :encoding option to that method's return value when called manually. I'm not certain what that gains other than a cookbook method that isn't surfaced to all users.
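
A minimal sketch of that cookbook idea — the name Row#calculate_encoding is the hypothetical one floated above, not part of the csv gem, and the "first non-nil field wins" rule mirrors the first-match behavior described earlier:

```ruby
require "csv"

# Hypothetical helper (illustration only, not in the csv gem): pick the
# encoding of the first non-nil field of a row.
module RowEncoding
  def calculate_encoding
    field = fields.compact.first
    field ? field.encoding : (Encoding.default_internal || Encoding::UTF_8)
  end
end
CSV::Row.include(RowEncoding)

row = CSV::Row.new(%w[a b], ["x", "é"])
enc = row.calculate_encoding
```

A caller could then do the calculation by hand, e.g. table.to_csv(encoding: table[0].calculate_encoding).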

@wishdev (Author) commented Jul 3, 2025

Also, I've added another test which shows that @tompng's example works perfectly fine. The strings are all UTF-8 by default, so there is no issue with the first row being "ascii-only".
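
This is easy to check directly — Ruby string literals carry the source encoding (UTF-8 by default), even when their bytes happen to be ASCII-only:

```ruby
first    = "ascii-only row"
accented = "alphabet + cédille first non-ascii row"

first.encoding     # both literals report UTF-8 ...
accented.encoding
first.ascii_only?  # ... even though only this one is pure ASCII
```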

@tompng (Member) commented Jul 3, 2025

When the encoding is known, to_csv(encoding: Encoding::UTF_8), your fast-path improvement is significant 👍
The single_encoding: true option needs discussion. If there is a way to get a moderate improvement without that option, the reason for introducing a new option weakens or might disappear. If that's a possibility, casually adding a new option is not a good idea.

I tried the code below with the benchmark. Full scan + fast path also gets 3.5x - 4.3x improvement.

def to_csv(write_headers: true, limit: nil, **options)
  ...
  if options[:encoding]
    # fast path, 5.5x-6.2x improvement
    array.push(CSV.generate_lines(rows, **options))
  else
    # full scan + fast path, 3.5x-4.3x improvement
    options[:encoding] = detect_encoding_with_full_row_scan
    array.push(CSV.generate_lines(rows, **options))
  end
  array.join("")
end

This full-scan encoding detection improves most cases. Just calling table.to_csv will get much faster,
which reduces the rationale for introducing a single_encoding option.
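
detect_encoding_with_full_row_scan above is a placeholder name; one possible shape for it (my sketch, not code from this PR) folds Encoding.compatible? over every field and fails fast when no common encoding exists:

```ruby
require "csv"

# Sketch only: a full-scan detector in the spirit of the placeholder above.
# Returns nil when the fields have no common compatible encoding.
def detect_table_encoding(rows)
  encoding = nil
  rows.each do |row|
    row.fields.each do |field|
      next if field.nil?
      encoding = encoding ? Encoding.compatible?(encoding, field) : field.encoding
      return nil unless encoding # fast fail on incompatible encodings
    end
  end
  encoding
end
```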

@wishdev (Author) commented Jul 3, 2025

So the first test example has a string that is valid as either ASCII-8BIT or UTF-8. Currently table.to_csv fails to process that table because it contains two encodings that are not compatible. What does detect_encoding_with_full_row_scan do with that? Is it a fast fail? If two compatible encodings exist within the file, which one wins? Currently the first encoding on a line gets chosen; do we now scan each column?
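
For context (my illustration, not code from the PR), this is the kind of conflict in question — bytes that only fit ASCII-8BIT alongside UTF-8 text have no common encoding:

```ruby
binary = "\x00\xac".b # ASCII-8BIT string containing a non-ASCII byte
utf8   = "café"       # UTF-8 string with a non-ASCII character

# Encoding.compatible? returns the common encoding, or nil when there is
# none — the "fast fail" case asked about above.
Encoding.compatible?(utf8, binary) # nil: no common encoding
```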

I don't know enough about encodings to decide what a default should be. This was crafted to offer advantages without interfering for a second with existing code. I'm not comfortable with detect_encoding_with_full_row_scan because I'm not an expert in the edge cases.

Would there be objections to at least moving the row encoding detection code into something like CSV::detect_encoding? The fast path for options[:encoding] seems not to be an issue. That combination would at least allow something like single_encoding to be placed in a prepended module, if desired by the end user (i.e. me), fairly straightforwardly.
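
The prepended-module route mentioned here could look roughly like this (module and keyword names are illustrative; super is the stock Table#to_csv):

```ruby
require "csv"

# Illustrative end-user prepend: derive :encoding from the first row's
# first non-nil field, then delegate to the stock Table#to_csv.
module SingleEncodingFastPath
  def to_csv(single_encoding: false, **options)
    if single_encoding && !options.key?(:encoding)
      first_field = self[0]&.fields&.compact&.first
      options[:encoding] = first_field.encoding if first_field
    end
    super(**options)
  end
end
CSV::Table.prepend(SingleEncodingFastPath)
```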

@wishdev (Author) commented Jul 4, 2025

I've reworked it to remove the single_encoding option and moved the row encoding code into a private method of CSV (to allow my module prepending to work).

As I said, I'm not comfortable modifying the existing path when the encoding is not provided. That can be left for others to investigate. This certainly does not prevent any further enhancements.

@kou (Member) left a comment

Could you rebase on main to fix this failure in our benchmark CI?

/opt/hostedtoolcache/Ruby/3.4.4/x64/lib/ruby/3.4.0/rubygems/dependency.rb:303:in 'Gem::Dependency#to_specs': Could not find 'csv' (= 3.3.0) - did find: [csv-3.3.2,csv-3.0.2,csv-3.0.1] (Gem::MissingSpecVersionError)

Can we discuss the row_encoding extraction as a separate issue? I'm not sure whether it's a good approach for now...

CSV::Row.new(%w{A}, ["\x00\xac"]) ]
table = CSV::Table.new(rows)

assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
Suggested change
assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
assert_equal(Encoding::UTF_8, table.to_csv(encoding: 'UTF-8').encoding)

@wishdev (Author):
Done!

wishdev added 3 commits July 4, 2025 15:39
@wishdev (Author) commented Jul 4, 2025

Rebased to the latest main

@kou (Member) commented Jul 5, 2025

Do you want to discuss the row_encoding approach in this PR?

@wishdev (Author) commented Jul 6, 2025

I'm more than happy to discuss anything at all. However, as I've tried to be clear about, I'm not comfortable changing the default since I don't know enough about encoding to analyze that type of change.

I think what is currently here is a great starting point for any further changes along those lines in the future.

3 participants