Increase speed of Table#to_csv #348
Conversation
Hmm. I'm not sure whether this approach is optimal. Could you also add a benchmark to https://github.com/ruby/csv/tree/main/benchmark ? Let's discuss which approach is better with numbers.
Benchmark added. At 100 rows I get about a 5.8x improvement.
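(For reference, a minimal local sketch of this kind of measurement. This is not the benchmark file added in the PR, and the 5.8x figure above comes from that file, not from this sketch.)

```ruby
# Rough local comparison of Table#to_csv with and without an explicit
# :encoding, assuming the fast path described in this thread.
require "csv"
require "benchmark"

rows  = Array.new(100) { CSV::Row.new(%w[A B], %w[foo bar]) }
table = CSV::Table.new(rows)

Benchmark.bm(20) do |x|
  x.report("default (row scan)") { 1_000.times { table.to_csv } }
  x.report("explicit :encoding") { 1_000.times { table.to_csv(encoding: "UTF-8") } }
end
```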
I think it should scan every row and header.
If it's going to scan every row, can […]
To be very clear (and I apologize, my documentation skills suck): scanning every row is the current state of affairs. That option remains fully available and remains the default. But when I have 100,000 rows of data and I already know the encoding, why would I want to scan every row? The single_encoding argument simply lets the class make a calculation I could do by hand and provide via the existing :encoding option (although this code makes :encoding much faster as well - see the benchmark).

So I guess I described this work wrong: I'm not seeking to increase the speed of Table#to_csv so much as to provide a fast path, when asked for, that removes the line-by-line checking for information that I can either provide (via :encoding) or have the system calculate from the first row of a homogeneous data set (via :single_encoding). If one is working with "random" CSV files (from a client, for example) then this path may not make sense. The default continues to be the "safe and secure" (but sometimes failing - see the added test) option of scanning each row and exporting each row as an individual single-row CSV string that is joined together at the end. But, again, at 1,000+ rows there is a noticeable slowdown for information that may not need to be calculated at all if one is producing the data "in house".

One could remove :single_encoding, provide the little bit of code it does as a separate method (Row#calculate_encoding), and then set the :encoding option to the value of that method when manually called. I'm not certain what that gains other than a cookbook method that isn't presented to all users.
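(A sketch of the call patterns described above, as I read this thread. :single_encoding is only the proposed option and is not part of the released csv gem.)

```ruby
require "csv"

table = CSV::Table.new([CSV::Row.new(%w[A], %w[foo])])

table.to_csv                         # default: scan every row (safe path)
table.to_csv(encoding: "UTF-8")      # caller supplies the encoding: fast path
table.to_csv(single_encoding: true)  # proposed: derive it from the first row
```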
Also, I've added another test which shows that @tompng's example works perfectly fine. The strings are all UTF-8 by default, so there is no issue with the first row being "ascii-only".
When encoding is known, I tried the code below with the benchmark. Full scan + fast path also gets a 3.5x-4.3x improvement.

```ruby
def to_csv(write_headers: true, limit: nil, **options)
  ...
  if options[:encoding]
    # fast path, 5.5x-6.2x improvement
    array.push(CSV.generate_lines(rows, **options))
  else
    # full scan + fast path, 3.5x-4.3x improvement
    options[:encoding] = detect_encoding_with_full_row_scan
    array.push(CSV.generate_lines(rows, **options))
  end
  array.join("")
end
```

This full-scan encoding detection improves most cases. Just calling […]
So the first test example has a string that is valid as either ASCII-8BIT or UTF-8. Currently I don't know enough about encodings to figure out what a default option should be. This was crafted to offer advantages without interfering for a second with existing code. I'm not comfortable changing the default path.

Would there be objections to at least moving the row encoding detection code into something like Row#calculate_encoding?
I've reworked it to remove the single_encoding option and move the row encoding code into a private method of CSV (to allow my module prepending to work). As I said, I'm not comfortable modifying the existing path when the encoding is not provided; that can be left for others to investigate. This certainly does not prevent any further enhancements.
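(To illustrate the prepending being referred to. The private method name below is a placeholder, not necessarily what the patch uses.)

```ruby
require "csv"

# Prepending puts this module ahead of CSV in the ancestor chain, so
# its method wins over CSV's private method of the same name, while
# `super` still reaches the original implementation.
module KnownEncodingPatch
  private

  # Placeholder name for the extracted row-encoding helper.
  def row_encoding(row)
    Encoding::UTF_8 # e.g. skip detection when the encoding is known
  end
end

CSV.prepend(KnownEncodingPatch)
```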
Could you rebase on main to fix the following error in our benchmark CI?

```
/opt/hostedtoolcache/Ruby/3.4.4/x64/lib/ruby/3.4.0/rubygems/dependency.rb:303:in 'Gem::Dependency#to_specs': Could not find 'csv' (= 3.3.0) - did find: [csv-3.3.2,csv-3.0.2,csv-3.0.1] (Gem::MissingSpecVersionError)
```
Can we discuss the row_encoding extraction as a separate issue? I'm not sure whether it's a good approach for now...
test/csv/test_table.rb (Outdated)

```ruby
CSV::Row.new(%w{A}, ["\x00\xac"]) ]
table = CSV::Table.new(rows)

assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
```
Suggested change:

```diff
-assert_equal('UTF-8', table.to_csv(encoding: 'UTF-8').encoding.to_s)
+assert_equal(Encoding::UTF_8, table.to_csv(encoding: 'UTF-8').encoding)
```
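(The suggestion compares `Encoding` objects directly instead of their string names, which avoids the string round-trip and gives clearer failure output. A quick illustrative check:)

```ruby
s = "abc".encode("UTF-8")
s.encoding                    # => #<Encoding:UTF-8>
s.encoding.to_s               # => "UTF-8"
s.encoding == Encoding::UTF_8 # => true
```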
Done!
Rebased to the latest main.
Do you want to discuss the […]?
I'm more than happy to discuss anything at all. However, as I've been pretty clear about, I'm not comfortable changing the default, as I don't know enough about encoding to analyze that type of change. I think what is currently here is a great starting point for any further changes along those lines in the future.
If we set the encoding when we call Table#to_csv, we do not need to go row by row to determine the encoding to use. This allows the use of CSV.generate_lines as a faster exporter.
Add a single_encoding option to Table#to_csv to use the encoding of the first row of the table as the encoding.