Using select with explicit header names requires all the column names to be specified #701

Open
miromarszal opened this issue Jul 29, 2020 · 4 comments
Labels
improvement Improve an existing feature/functionality

Comments

@miromarszal

Consider the following example.

julia> csv = """1,2,3
       1,2,3
       1,2,3""";

julia> CSV.File(IOBuffer(csv), select=[2, 3], header=0)
3-element CSV.File{false}:
 CSV.Row: (Column2 = 2, Column3 = 3)
 CSV.Row: (Column2 = 2, Column3 = 3)
 CSV.Row: (Column2 = 2, Column3 = 3)

julia> CSV.File("data.csv", select=[2, 3], header=["a", "b", "c"])
3-element CSV.File{false}:
 CSV.Row: (b = 2, c = 3)
 CSV.Row: (b = 2, c = 3)
 CSV.Row: (b = 2, c = 3)

julia> CSV.File("data.csv", select=[2, 3], header=["b", "c"])
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 3. Ignoring any extra columns on this row

3-element CSV.File{false}:
 CSV.Row: (c = 2,)
 CSV.Row: (c = 2,)
 CSV.Row: (c = 2,)

I find this rather surprising. I can either specify all the column names, which is cumbersome for a file with a large number of columns, or use header=0 and rename the columns afterwards, which feels like an unnecessary step.
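
For reference, the second workaround (read with header=0, then rename) might look roughly like the sketch below. It assumes DataFrames.jl is available for the renaming step, which is not something the example above uses; `rename!` is a DataFrames.jl function, not part of CSV.jl.

```julia
using CSV, DataFrames

csv = """1,2,3
1,2,3
1,2,3"""

# Read without a header row, keeping only columns 2 and 3.
# The selected columns get the auto-generated names Column2 and Column3.
df = DataFrame(CSV.File(IOBuffer(csv); select=[2, 3], header=0))

# Assumption: renaming happens after parsing, via DataFrames.jl.
rename!(df, [:b, :c])
```

This avoids listing every column name, at the cost of the extra renaming step.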

@miromarszal miromarszal changed the title Using select with explicit header names requires all the column names to be specified Using select with explicit header names requires all the column names to be specified Jul 29, 2020
@quinnj
Member

quinnj commented Oct 17, 2020

Sorry for the slow response here; yes, I can see how this is a bit confusing, but when you provide the header, it is expected that you're providing all the column names. It is a bit awkward with select when you only want a few columns, but I'm not quite sure what we can do that would be better here. If it's not obvious, even when you're select-ing a subset of columns, we still need to "skip" over the other columns.

In general, I've grown to have the impression that we perhaps give too much weight to user-provided headers. If provided, we basically take that as absolute truth for the number of columns. Perhaps we need to rethink the approach here and have CSV.jl do more of its own work around what's actually in the file, allowing header to "rename" columns after the fact. For example, we could allow passing header as a Dict to rename columns while parsing.
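
As a very rough illustration of the renaming idea only: the helper below is hypothetical and is not an existing or planned CSV.jl API. It assumes the parser has already detected the column names from the file and simply applies a rename map to them, leaving unmapped names unchanged.

```julia
# Hypothetical helper, NOT part of CSV.jl: apply a rename map to the
# column names detected from the file. Names without an entry are kept.
function apply_renames(detected::Vector{Symbol}, renames::Dict{Symbol,Symbol})
    return [get(renames, name, name) for name in detected]
end

# Assumed input: auto-generated names for a three-column file.
detected = [:Column1, :Column2, :Column3]
renamed = apply_renames(detected, Dict(:Column2 => :b, :Column3 => :c))
```

The point of the sketch is that the file, not the user-provided header, determines the number of columns.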

@wnoise

wnoise commented Sep 3, 2021

Renaming columns while parsing is a feature I would definitely appreciate.

@BioTurboNick

I just encountered this issue.

The main issue is that the file I was reading used spaces both as delimiters and inside a text field, so the number of input columns varied. But I just needed the first few. So I thought I could select the first few by index and then provide the column names. Nope!

I got warnings about ignoring extra columns, which seemed fine; but I'm confused why it would drop rows.

@nickrobinson251 nickrobinson251 added the improvement Improve an existing feature/functionality label Oct 10, 2022
@roland-KA

I've also just stumbled over the behaviour of CSV.jl that you have to give all header names, even if you only select a subset. That doesn't seem very useful to me.

My typical use case is that I have files with quite a large number of columns but need only a few (say, 100 columns, of which I need just 10). So I select the useful ones by column number. And of course it doesn't make any sense to pass 100 column names if I only need 10 of them.
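
Until this changes, one possible stopgap is to generate the full header programmatically and fill in only the names that matter. The sketch below is plain Julia; `padded_header` is a hypothetical helper (not part of CSV.jl), and it assumes the total number of columns is known in advance.

```julia
# Hypothetical helper, NOT part of CSV.jl: build a header of length `ncols`
# with placeholder names, then put the wanted names at the selected positions.
function padded_header(ncols::Integer,
                       selected::AbstractVector{<:Integer},
                       names::AbstractVector{<:AbstractString})
    length(selected) == length(names) ||
        throw(ArgumentError("selected and names must have the same length"))
    header = ["Column$i" for i in 1:ncols]
    for (idx, name) in zip(selected, names)
        header[idx] = name
    end
    return header
end

# Example: a 5-column file where only columns 2 and 3 are of interest.
header = padded_header(5, [2, 3], ["b", "c"])
```

The resulting vector can then be passed alongside the same indices, e.g. `CSV.File("data.csv"; select=[2, 3], header=header)`, which is the full-header form shown to work in the original post.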
