Add support for newlines_in_values to CsvOptions #11472

@connec

Description

Is your feature request related to a problem or challenge?

I'm trying to read CSVs that include newlines in (quoted) values.
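To make the problem concrete, here is a minimal sketch using only the Rust standard library. Both function names are hypothetical and are not part of arrow-csv or DataFusion; the sketch assumes RFC 4180-style quoting, where `""` inside a quoted field is an escaped quote. It only shows why splitting on line breaks miscounts records once a quoted value contains a newline.

```rust
/// Counts records by splitting on line breaks only, ignoring quotes.
/// (Hypothetical helper for illustration.)
fn naive_record_count(csv: &str) -> usize {
    csv.lines().filter(|line| !line.is_empty()).count()
}

/// Counts records while treating newlines inside double-quoted fields as data.
/// Assumes RFC 4180-style quoting: every `"` toggles the quoted state, so an
/// escaped `""` toggles twice and leaves the state unchanged.
fn quote_aware_record_count(csv: &str) -> usize {
    let mut in_quotes = false;
    let mut record_has_data = false;
    let mut records = 0;
    for ch in csv.chars() {
        match ch {
            '"' => {
                in_quotes = !in_quotes;
                record_has_data = true;
            }
            // An unquoted newline terminates the current record.
            '\n' if !in_quotes => {
                if record_has_data {
                    records += 1;
                }
                record_has_data = false;
            }
            // Ignore the carriage return of an unquoted CRLF terminator.
            '\r' if !in_quotes => {}
            _ => record_has_data = true,
        }
    }
    // Count a final record that has no trailing newline.
    if record_has_data {
        records += 1;
    }
    records
}
```

For an input such as `"id,note\n1,\"line one\nline two\"\n2,plain\n"`, the naive count sees four lines, while the quote-aware count sees the intended three records (one header plus two rows).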

Describe the solution you'd like

Some searching suggests that the arrow-csv crate doesn't currently support this, whereas the C++ (ParseOptions::newlines_in_values) and Python (ParseOptions.newlines_in_values) implementations do.

Ideally, a newlines_in_values field could be added to datafusion::common::config::CsvOptions to support this functionality.

Note that the Python docs call out the performance implications of this:

"Setting this to True reduces the performance of multi-threaded CSV reading."

I haven't dug into the implementation, but I imagine it becomes harder to find the right split point for multi-threaded reading, since a line break at an arbitrary offset might sit inside a quoted value. That said, it doesn't seem too dissimilar to finding the previous/next line break, so perhaps it's not insurmountable.
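As a rough illustration of the split-point concern, here is a sketch in standard-library Rust. The function names are hypothetical, and the sequential scan from the start of the data is just the simplest way I can think of to track quote state; it is not a claim about how arrow-csv, DataFusion, or the C++/Python readers actually do it. Quoting is assumed to be RFC 4180-style.

```rust
/// Naive split point: the first byte after the next `\n`, chosen so that the
/// returned offset is at or after `target`. Ignores quoting entirely.
fn naive_next_line_start(data: &[u8], target: usize) -> usize {
    if target == 0 {
        return 0;
    }
    data.iter()
        .enumerate()
        .skip(target - 1)
        .find(|&(_, &b)| b == b'\n')
        .map_or(data.len(), |(i, _)| i + 1)
}

/// Quote-aware split point: the first record start at or after `target`,
/// or `data.len()` if there is none. Scans from the beginning so that the
/// quoted/unquoted state is known at every byte.
fn next_record_start(data: &[u8], target: usize) -> usize {
    if target == 0 {
        return 0;
    }
    let mut in_quotes = false;
    for (i, &b) in data.iter().enumerate() {
        match b {
            // Each quote toggles the state; an escaped `""` toggles twice.
            b'"' => in_quotes = !in_quotes,
            // Only an unquoted newline ends a record.
            b'\n' if !in_quotes && i + 1 >= target => return i + 1,
            _ => {}
        }
    }
    data.len()
}
```

With the bytes of `id,note\n1,"a\nb"\n2,c\n` and a desired split near offset 12, the naive version returns 13, which lands in the middle of the quoted value, while the quote-aware version returns 16, the start of the `2,c` record. The cost is that the quote-aware version has to look at everything before the split point, which is presumably where the multi-threaded slowdown comes from.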

Describe alternatives you've considered

The only alternative I can see would be to preprocess the CSV before feeding it into DataFusion. I haven't explored this option, as I imagine it would take a lot of DataFusion plumbing, and it seems valuable to have parity with other Arrow CSV packages (C++ and Python, at least).
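For reference, the preprocessing alternative could look something like the following sketch (standard-library Rust, hypothetical function name). It rewrites newlines inside quoted values as the two-character escape `\n` so that every record fits on one physical line. The escape convention is my own assumption; anything reading the result would need to undo it, and a value that already contains a literal backslash followed by `n` would become ambiguous.

```rust
/// Replaces line breaks inside double-quoted fields with literal `\n` / `\r`
/// escape sequences, leaving record-terminating newlines untouched.
/// Assumes RFC 4180-style quoting (hypothetical helper for illustration).
fn escape_embedded_newlines(csv: &str) -> String {
    let mut out = String::with_capacity(csv.len());
    let mut in_quotes = false;
    for ch in csv.chars() {
        match ch {
            '"' => {
                // Each quote toggles the state; an escaped `""` toggles twice.
                in_quotes = !in_quotes;
                out.push(ch);
            }
            // Inside quotes, emit a backslash escape instead of a raw break.
            '\n' if in_quotes => out.push_str("\\n"),
            '\r' if in_quotes => out.push_str("\\r"),
            _ => out.push(ch),
        }
    }
    out
}
```

After this pass, a line-oriented reader sees one record per line, at the cost of an extra copy of the data and a post-processing step to restore the original values.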

Additional context

I was originally planning to report this against the arrow-rs repository, but since my use-case is with datafusion I decided to report it here. Let me know if this issue would be more appropriate there and I can move/copy it.


Labels: enhancement (New feature or request)
