Error when passing as source a vector with fewer unique elements than files. #1136

Open
snovum opened this issue Apr 16, 2024 · 0 comments

snovum commented Apr 16, 2024

When data from n sources is collected periodically into k separate files such that k > n (e.g. patient data, financial data, etc.), using the source kwarg to create a source column of non-unique identifiers throws a BoundsError.

Suppose I have the following CSV files.

using CSV, DataFrames, Glob

Alice1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])

Alice2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

Bob1 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E", "T"])

Bob2 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E", "T"])

# Make sure the target directory exists before writing
mkpath("folder")

CSV.write("folder/Alice-2024-04-15.csv", Alice1)
CSV.write("folder/Alice-2024-04-16.csv", Alice2)
CSV.write("folder/Bob-2024-04-15.csv", Bob1)
CSV.write("folder/Bob-2024-04-16.csv", Bob2)

# Collect the paths of all four files
fls = glob("*.csv", "folder")

I want to combine them into a single DataFrame while creating a new column that contains just the first part of each filename, i.e. the patient name. I can create a new column holding the split filenames using the source kwarg:

julia> DF = CSV.read(fls, DataFrame; source = "patients" => split.(basename.(fls), '-'), pool = false)
22×3 DataFrame
 Row │ A      B        patients                          
     │ Int64  String1  Array…                            
─────┼───────────────────────────────────────────────────
   1 │     1  M        SubString{String}["Alice", "2024…
   2 │     2  F        SubString{String}["Alice", "2024…
   3 │     3  F        SubString{String}["Alice", "2024…
   4 │     4  M        SubString{String}["Alice", "2024…
   5 │     5  F        SubString{String}["Alice", "2024…
   6 │    11  A        SubString{String}["Alice", "2024…
   7 │    12  B        SubString{String}["Alice", "2024…
   8 │    13  C        SubString{String}["Alice", "2024…
   9 │    14  D        SubString{String}["Alice", "2024…
  10 │    15  E        SubString{String}["Alice", "2024…
  11 │    16  A        SubString{String}["Bob", "2024",…
  12 │    17  B        SubString{String}["Bob", "2024",…
  13 │    18  C        SubString{String}["Bob", "2024",…
  14 │    19  D        SubString{String}["Bob", "2024",…
  15 │    20  E        SubString{String}["Bob", "2024",…
  16 │    21  T        SubString{String}["Bob", "2024",…
  17 │    16  A        SubString{String}["Bob", "2024",…
  18 │    17  B        SubString{String}["Bob", "2024",…
  19 │    18  C        SubString{String}["Bob", "2024",…
  20 │    19  D        SubString{String}["Bob", "2024",…
  21 │    20  E        SubString{String}["Bob", "2024",…
  22 │    21  T        SubString{String}["Bob", "2024",…

However, if I try to create the column by taking the first element of each split filename via first.(split.(basename.(fls), '-')), the following error is thrown:

julia> CSV.read(fls, DataFrame; source = "patients" => first.(split.(basename.(fls), '-')), pool = false)
ERROR: BoundsError: attempt to access 2-element Vector{SubString{String}} at index [4]
Stacktrace:
 [1] setindex!
   @ ./array.jl:1021 [inlined]
 [2] setindex!
   @ ./multidimensional.jl:698 [inlined]
 [3] _invert(d::Dict{SubString{String}, UInt32})
   @ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
 [4] PooledArray
   @ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{…}}, pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[367]:1
Some type information was truncated. Use `show(err)` to see complete types.

As already pointed out here by @nilshg, this happens because the vector passed to source has fewer unique elements than there are files. Non-unique values can nevertheless be useful for identifying the underlying sources.
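The mechanism can be reproduced with Base Julia alone. This is a minimal sketch, not the actual CSV.jl or PooledArrays code: it assumes the pool is built as a Dict from each label to its file index, and invert_pool is a simplified, hypothetical stand-in for the _invert function named in the stack trace. Under those assumptions, duplicate labels collapse into fewer Dict entries while the stored indices still run up to the number of files:

```julia
# Labels for four files, only two of which are unique
labels = ["Alice", "Alice", "Bob", "Bob"]

# Assumed pool construction: label => file index.
# Duplicate keys overwrite earlier ones, leaving "Alice" => 2 and "Bob" => 4.
pool = Dict(x => UInt32(i) for (i, x) in enumerate(labels))

# Hypothetical stand-in for the inversion step: the output is sized by the
# number of Dict entries (2), but it is indexed by the stored values (up to 4).
function invert_pool(d::Dict{K,V}) where {K,V}
    inverse = Vector{K}(undef, length(d))
    for (k, v) in d
        inverse[v] = k
    end
    return inverse
end

# Capture the error instead of letting it abort the script
err = try
    invert_pool(pool)
    nothing
catch e
    e
end

println(err isa BoundsError)
```

With four unique labels the Dict would have four entries with values 1 through 4, and the inversion would succeed, which is consistent with the first call above working.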

In this case the desired result is a source column of non-unique identifiers, which I can currently only obtain by post-processing the result of the first call:

DF.patients = first.(DF.patients); 
DF
22×3 DataFrame
 Row │ A      B        patients  
     │ Int64  String1  SubStrin… 
─────┼───────────────────────────
   1 │     1  M        Alice
   2 │     2  F        Alice
   3 │     3  F        Alice
   4 │     4  M        Alice
   5 │     5  F        Alice
   6 │    11  A        Alice
   7 │    12  B        Alice
   8 │    13  C        Alice
   9 │    14  D        Alice
  10 │    15  E        Alice
  11 │    16  A        Bob
  12 │    17  B        Bob
  13 │    18  C        Bob
  14 │    19  D        Bob
  15 │    20  E        Bob
  16 │    21  T        Bob
  17 │    16  A        Bob
  18 │    17  B        Bob
  19 │    18  C        Bob
  20 │    19  D        Bob
  21 │    20  E        Bob
  22 │    21  T        Bob

It would be nice if the source kwarg supported non-unique sources, because this seems rather intuitive, especially in cases where data from the same source has accumulated across separate files over time.
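Until then, one possible workaround is to bypass the source kwarg entirely: read each file on its own, attach its label as a constant column, and concatenate the pieces. The sketch below assumes every file has the same columns; read_labeled and its column keyword are hypothetical names, not part of CSV.jl.

```julia
using CSV, DataFrames

# Hypothetical helper (not part of CSV.jl): read each file separately and
# tag its rows with the matching label, which need not be unique.
function read_labeled(files, labels; column = :patients)
    length(files) == length(labels) ||
        throw(ArgumentError("need exactly one label per file"))
    parts = map(files, labels) do file, label
        df = CSV.read(file, DataFrame)
        # Broadcast assignment creates the new column filled with this label
        df[!, column] .= String(label)
        df
    end
    # Stack the per-file tables in the order the files were given
    return reduce(vcat, parts)
end
```

With the files above, this could be called as read_labeled(fls, first.(split.(basename.(fls), '-'))). Because each file is parsed independently, column types are inferred per file, so the result may differ slightly from a single multi-file CSV.read call.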

Alice-2024-04-15.csv
Alice-2024-04-16.csv
Bob-2024-04-15.csv
Bob-2024-04-16.csv
