Description
Is your feature request related to a problem or challenge?
I'm trying to read CSVs that include newlines in (quoted) values.
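As a minimal illustration of the kind of input involved (the sample data below is made up for this sketch and is not taken from a real file), this compares a naive split on physical lines with a quote-aware parse using Python's standard `csv` module:

```python
import csv
import io

# Hypothetical sample: the "comment" value of the first record spans two physical lines.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Splitting on physical lines cuts the quoted value in half.
physical_lines = data.splitlines()

# A quote-aware parser treats the embedded newline as part of the value.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines), "physical lines")
print(len(records), "logical records (header included)")
print(repr(records[1][1]))
```

A reader that assumes one record per physical line sees an extra, malformed row here, which is the failure mode this request is about.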
Describe the solution you'd like
Some googling revealed that this isn't currently supported by the arrow-csv crate, whereas that functionality does exist in the C++ (ParseOptions::newlines_in_values) and Python (ParseOptions.newlines_in_values) implementations.
Ideally, a newlines_in_values field could be added to datafusion::common::config::CsvOptions to support this functionality.
Note that the Python docs call out the performance implications of this:
Setting this to True reduces the performance of multi-threaded CSV reading.
I haven't dug into the implementation, but I imagine it becomes harder to find the right split point for multi-threaded reading (though it seems not dissimilar to finding the previous or next line break, so perhaps not insurmountable).
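To make the split-point concern concrete, here is a rough sketch of a quote-aware boundary search. It is not how arrow-csv, the C++ reader, or DataFusion actually splits files; `next_record_boundary` is a hypothetical helper, and it assumes `"` is the quote character, `\n` ends records, and quotes inside values are escaped by doubling.

```python
def next_record_boundary(data: bytes, target: int, start: int = 0) -> int:
    """Return the offset just past the first record-ending newline at or after `target`.

    `start` must be a known record boundary, so that scanning begins outside quotes.
    Returns len(data) if no such newline exists.
    """
    in_quotes = False
    for i in range(start, len(data)):
        byte = data[i]
        if byte == 0x22:  # '"' toggles quote state; a doubled quote toggles twice
            in_quotes = not in_quotes
        elif byte == 0x0A and not in_quotes and i >= target:
            return i + 1
    return len(data)


# Hypothetical sample with a newline inside a quoted value.
data = b'id,comment\n1,"line one\nline two"\n2,plain\n'

naive_split = data.index(b"\n", 15) + 1        # lands inside the quoted value
safe_split = next_record_boundary(data, 15)    # lands after the full record

print(data[:naive_split])
print(data[:safe_split])
```

The catch is that the quote state at `target` is only known by scanning from an earlier known boundary, so this sketch is sequential. That is one plausible reason for the multi-threaded slowdown the Python docs mention, though I haven't verified it against any implementation.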
Describe alternatives you've considered
The only alternative I can see would be to preprocess the CSV before feeding it into DataFusion. I haven't explored this option, as I imagine it would take a lot of DataFusion plumbing, and it seems valuable to have parity with the other Arrow CSV packages (C++ and Python, at least).
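For reference, a preprocessing pass outside DataFusion could look roughly like the sketch below. `flatten_newlines` and the two-character `\n` placeholder are assumptions made for illustration; the rewrite is lossy if a value already contains the placeholder text, and the consumer has to reverse the substitution afterwards.

```python
import csv
import io


def flatten_newlines(text: str, placeholder: str = "\\n") -> str:
    """Rewrite CSV text so that every record occupies exactly one physical line.

    Newlines inside quoted values are replaced with `placeholder`
    (by default a literal backslash followed by "n").
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [field.replace("\r\n", placeholder).replace("\n", placeholder) for field in row]
        )
    return out.getvalue()


# Hypothetical sample with a newline inside a quoted value.
raw = 'id,comment\n1,"line one\nline two"\n2,plain\n'
flat = flatten_newlines(raw)
print(flat)
```

The output has one record per physical line, so a line-oriented reader can consume it, at the cost of an extra pass over the data and a changed value.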
Additional context
I was originally planning to report this against the arrow-rs repository, but since my use case is with datafusion, I decided to report it here. Let me know if this issue would be more appropriate there and I can move or copy it.