Increase speed of Table#to_csv #348

Status: Open · wants to merge 4 commits into main
Conversation

@wishdev commented Jul 2, 2025

If we set the encoding when we call Table#to_csv, we do not need to go row by row to determine the encoding to use. This allows the use of CSV.generate_lines as a faster exporter.

Add a single_encoding option to Table#to_csv to use the encoding of the first row of the table as the encoding.
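
To illustrate the idea (a sketch against the released csv gem, without this PR — the table and data are made up), supplying :encoding up front is what leaves nothing to detect row by row:

```ruby
require "csv"

rows = [
  CSV::Row.new(%w[a b], %w[1 2]),
  CSV::Row.new(%w[a b], %w[3 4]),
]
table = CSV::Table.new(rows)

# Default path: the output encoding is determined from the rows themselves.
default_csv = table.to_csv

# With :encoding supplied there is nothing left to detect per row, which is
# what makes a single CSV.generate_lines call possible in this PR.
fast_csv = table.to_csv(encoding: Encoding::UTF_8)
```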

@kou (Member) commented Jul 3, 2025

Hmm. I'm not sure whether this approach is optimal.

Could you also add a benchmark to https://github.com/ruby/csv/tree/main/benchmark ? Let's discuss which approach is better with numbers.

@wishdev (Author) commented Jul 3, 2025

Benchmark added.

At 100 rows I get about a 5.8x improvement.
At 1000 rows I get about a 6.7x improvement.
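
For anyone who wants a quick local comparison, a minimal sketch with stdlib Benchmark (not the repo's benchmark-driver setup; data is made up, and on released csv without this PR both paths still scan row by row, so the gap only appears with the patch applied):

```ruby
require "csv"
require "benchmark"

rows = Array.new(1000) do |i|
  CSV::Row.new(%w[id name], [i.to_s, "name-#{i}"])
end
table = CSV::Table.new(rows)

Benchmark.bm(20) do |x|
  x.report("default (row scan)") { 20.times { table.to_csv } }
  x.report("encoding supplied")  { 20.times { table.to_csv(encoding: Encoding::UTF_8) } }
end
```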

@tompng (Member) commented Jul 3, 2025

> Argument +single_encoding+, if true, encoding will be determined from the first row.

I think it should scan every row and the header.
Below is an example CSV whose encoding can't be determined from the header or first row. I think this kind of CSV is pretty common.

id, language, description
1, English, ascii-only row
2, Français, alphabet + cédille first non-ascii row
3, Español, alphabet + ntilde character row

If it's going to scan every row, can single_encoding be determined automatically instead of being specified by a keyword parameter?
That way we don't need to change the method signature, Table#to_csv will be faster in many cases, and
we don't need an additional option to get this speed-up benefit.

@wishdev (Author) commented Jul 3, 2025

> Argument +single_encoding+, if true, encoding will be determined from the first row.

> I think it should scan every row and the header. Below is an example CSV whose encoding can't be determined from the header or first row. I think this kind of CSV is pretty common.

To be very clear (and I apologize, my documentation skills are poor): scanning every row is the current state of affairs. That path remains fully available and remains the default.

But when I have 100,000 rows of data and I already know the encoding, why would I want to scan every row? The single_encoding argument simply lets the class make a calculation I could do by hand and supply via the existing :encoding option (although this code makes :encoding much faster as well; see the benchmark).

So I guess I described this work wrong: I'm not seeking to increase the speed of Table#to_csv in general, but rather to provide a fast path, when asked for, that removes the line-by-line checking for information that I can either provide (via :encoding) or have the system calculate from the first row of a homogeneous data set (via :single_encoding).

If one is working with "random" CSV files (from a client, for example), this path may not make sense. The default continues to be the "safe and secure" (but sometimes failing; see the added test) option of scanning each row and exporting each row as an individual single-row CSV string that is joined together at the end. But again, at 1,000+ rows there is a noticeable slowdown computing information that may not need to be calculated when the data is produced "in house".

One could remove :single_encoding, expose the little bit of code it adds as a separate method (Row#calculate_encoding), and have callers set the :encoding option to that method's return value when called manually. I'm not certain what that gains other than a cookbook method that isn't surfaced to all users.
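
A minimal sketch of that cookbook idea — the name Row#calculate_encoding is the hypothetical one floated above, not part of the csv gem, and the "first non-nil field wins" rule mirrors the first-match behavior described earlier:

```ruby
require "csv"

# Hypothetical helper (illustration only, not in the csv gem): pick the
# encoding of the first non-nil field of a row.
module RowEncoding
  def calculate_encoding
    field = fields.compact.first
    field ? field.encoding : (Encoding.default_internal || Encoding::UTF_8)
  end
end
CSV::Row.include(RowEncoding)

row = CSV::Row.new(%w[a b], ["x", "é"])
enc = row.calculate_encoding
```

A caller could then do the calculation by hand, e.g. table.to_csv(encoding: table[0].calculate_encoding).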

@wishdev (Author) commented Jul 3, 2025

Also, I've added another test which shows that @tompng's example works perfectly fine. The strings are all UTF-8 by default, so there is no issue with the first row being "ascii-only".
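
This is easy to check directly — Ruby string literals carry the source encoding (UTF-8 by default), even when their bytes happen to be ASCII-only:

```ruby
first    = "ascii-only row"
accented = "alphabet + cédille first non-ascii row"

first.encoding     # both literals report UTF-8 ...
accented.encoding
first.ascii_only?  # ... even though only this one is pure ASCII
```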

@tompng (Member) commented Jul 3, 2025

When the encoding is known, to_csv(encoding: Encoding::UTF_8), your fast-path improvement is significant 👍
The single_encoding: true option needs discussion. If there is a way to get a moderate improvement without that option, the reason for introducing a new option weakens or might disappear. If that's a possibility, casually adding a new option is not a good idea.

I tried the code below with the benchmark. Full scan + fast path also gets 3.5x - 4.3x improvement.

def to_csv(write_headers: true, limit: nil, **options)
  ...
  if options[:encoding]
    # fast path, 5.5x-6.2x improvement
    array.push(CSV.generate_lines(rows, **options))
  else
    # full scan + fast path, 3.5x-4.3x improvement
    options[:encoding] = detect_encoding_with_full_row_scan
    array.push(CSV.generate_lines(rows, **options))
  end
  array.join("")
end

This full-scan encoding detection improves most cases. Just calling table.to_csv will get much faster,
which reduces the rationale for introducing a single_encoding option.
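
detect_encoding_with_full_row_scan above is a placeholder name; one possible shape for it (my sketch, not code from this PR) folds Encoding.compatible? over every field and fails fast when no common encoding exists:

```ruby
require "csv"

# Sketch only: a full-scan detector in the spirit of the placeholder above.
# Returns nil when the fields have no common compatible encoding.
def detect_table_encoding(rows)
  encoding = nil
  rows.each do |row|
    row.fields.each do |field|
      next if field.nil?
      encoding = encoding ? Encoding.compatible?(encoding, field) : field.encoding
      return nil unless encoding # fast fail on incompatible encodings
    end
  end
  encoding
end
```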

@wishdev (Author) commented Jul 3, 2025

So the first test example has a string that is valid as either ASCII-8BIT or UTF-8. Currently table.to_csv fails to process that table because it contains two encodings that are not compatible. What does detect_encoding_with_full_row_scan do with that? Is it a fast fail? If two compatible encodings exist within the file, which one wins? Currently the first encoding on a line gets chosen; do we now scan each column?
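
For context (my illustration, not code from the PR), this is the kind of conflict in question — bytes that only fit ASCII-8BIT alongside UTF-8 text have no common encoding:

```ruby
binary = "\x00\xac".b # ASCII-8BIT string containing a non-ASCII byte
utf8   = "café"       # UTF-8 string with a non-ASCII character

# Encoding.compatible? returns the common encoding, or nil when there is
# none — the "fast fail" case asked about above.
Encoding.compatible?(utf8, binary) # nil: no common encoding
```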

I don't know enough about encodings to decide what a default should be. This was crafted to offer advantages without interfering for a second with existing code. I'm not comfortable with detect_encoding_with_full_row_scan because I'm not an expert in the edge cases.

Would there be objections to at least moving the row encoding detection code into something like CSV::detect_encoding? The fast path for options[:encoding] seems not to be an issue. That combination would at least allow something like single_encoding to be placed in a prepended module, if desired by the end user (i.e. me), fairly straightforwardly.
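
The prepended-module route mentioned here could look roughly like this (module and keyword names are illustrative; super is the stock Table#to_csv):

```ruby
require "csv"

# Illustrative end-user prepend: derive :encoding from the first row's
# first non-nil field, then delegate to the stock Table#to_csv.
module SingleEncodingFastPath
  def to_csv(single_encoding: false, **options)
    if single_encoding && !options.key?(:encoding)
      first_field = self[0]&.fields&.compact&.first
      options[:encoding] = first_field.encoding if first_field
    end
    super(**options)
  end
end
CSV::Table.prepend(SingleEncodingFastPath)
```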

@wishdev (Author) commented Jul 4, 2025

I've reworked it to remove the single_encoding option and moved the row encoding code into a private method of CSV (to allow my module prepending to work).

As I said, I'm not comfortable modifying the existing path when the encoding is not provided. That can be left for others to investigate. This certainly does not prevent any further enhancements.

@kou (Member) left a comment

Could you rebase on main to fix this failure in our benchmark CI?

/opt/hostedtoolcache/Ruby/3.4.4/x64/lib/ruby/3.4.0/rubygems/dependency.rb:303:in 'Gem::Dependency#to_specs': Could not find 'csv' (= 3.3.0) - did find: [csv-3.3.2,csv-3.0.2,csv-3.0.1] (Gem::MissingSpecVersionError)

Can we discuss the row_encoding extraction as a separate issue? I'm not sure whether it's a good approach for now...

CSV::Row.new(%w{A}, ["\x00\xac"]) ]
table = CSV::Table.new(rows)

assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
Suggested change
assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
assert_equal(Encoding::UTF_8, table.to_csv(encoding: 'UTF-8').encoding)

@wishdev (Author):
Done!

wishdev added 3 commits July 4, 2025 15:39
@wishdev (Author) commented Jul 4, 2025

Rebased to the latest main

@kou (Member) commented Jul 5, 2025

Do you want to discuss the row_encoding approach in this PR?

@wishdev (Author) commented Jul 6, 2025

I'm more than happy to discuss anything at all. However, as I've tried to be clear about, I'm not comfortable changing the default since I don't know enough about encoding to analyze that type of change.

I think what is currently here is a great starting point for any further changes along those lines in the future.

3 participants