Replies: 2 comments 2 replies
-
I feel guilty, but I am afraid it is a hallucination. Looking at the text that DeepWiki uses as a source, it is not talking about ranges but about a comma-separated list of fields.
-
@osevill you could use a scripting language to generate the list from 1 to 200 for you. I use bash here, but you can do it with whatever you prefer. Take my sample input file and let us assume that we want only 3 columns: you can then "print" the fields.txt content inside the miller command and run it.
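The code snippets in this reply did not survive extraction; a minimal sketch of the approach is below. The file names (fields.txt, source_file.csv, output_file.csv) and the use of GNU seq are my assumptions, not the original snippets:

```shell
# Sketch: generate the comma-separated list 1..200 with seq and save it
# to fields.txt (file names are placeholders).
seq -s, 1 200 > fields.txt

# "Print" the list into a variable for use inside the mlr command.
fields=$(cat fields.txt)
echo "$fields" | cut -c1-11   # first entries: 1,2,3,4,5,6

# The mlr invocation would then look like (not executed here):
# mlr --csv --allow-ragged-csv-input --headerless-csv-input \
#   cut -f "$fields" then unsparsify -f "$fields" source_file.csv > output_file.csv
```

This avoids typing the 200-entry list by hand while still satisfying unsparsify's streaming-mode requirement of an explicit field list.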
-
My original issue was that I wanted to use miller to clean up a large, ragged CSV file (over 1 million rows, 200 fields), where some of the data rows have more or fewer fields than the header row. After encountering out-of-memory issues using unsparsify in the default retain-in-memory mode, I read the docs and learned that if you specify the fields (in my case, all of them), unsparsify will run in streaming mode and use much less memory.
After some trial and error, this seems to work for me:
mlr --csv --allow-ragged-csv-input --headerless-csv-input cut -f 1,2,3,...,200 then unsparsify -f 1,2,3,...,200 ./source_file.csv > ./output_file.csv
where I replace
1,2,3,...,200
above with every number from 1 to 200, comma-delimited. This results in a very long mlr statement. (I'm cutting to exclude any fields after field 200; I'm unsparsifying to backfill any missing fields with a blank/empty value.)
My question is... does mlr support ranges for index-numbered headers, which would allow me to use something like -f 1:200 for field numbers 1 through 200, inclusive?
If not, is this a feature that could be added when headers are referenced as index numbers?
Interestingly enough, the DeepWiki @aborruso linked to in another discussion indicates mlr does support ranges in the format 1-n, but when I tried this, it did not work.
If there's a better way to go about cleaning up a large, ragged csv file while avoiding memory issues, I welcome any suggestions.
As an alternative to adding range support, could unsparsify have a new --streaming flag, so the user doesn't need to list every field in the header in order to activate streaming mode?
I tested all the above using mlr 6.12 on a Windows PC with 16 GB RAM.
With almost no other apps running, I was able to successfully clean up only about 100,000 rows by adding a
head -n 100000
verb before the unsparsify in default memory-retention mode. Also, I need to use the --csv format because some of my data has commas encapsulated in double-quotes, and the CSV format in mlr is the only one I'm aware of that ignores embedded commas as field separators. Thank you.
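To illustrate the point about embedded commas, here is a small shell sketch (the sample line is made up) showing why a CSV-aware parser is needed: a naive comma split breaks a quoted field in two, while a CSV reader keeps it whole.

```shell
# A quoted CSV field may itself contain the delimiter:
line='name,"Doe, Jane",42'

# Naive comma split (plain shell word splitting on IFS=','):
IFS=','; set -- $line; unset IFS
echo "$#"   # 4 pieces -- the quoted field is wrongly broken in two

# A CSV-aware reader (e.g. mlr with --csv) parses the same row as
# 3 fields, keeping "Doe, Jane" intact.
```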