Replies: 2 comments 2 replies
-
I feel guilty, but I am afraid it is a hallucination. Looking at the text that DeepWiki uses as a source, it is not talking about ranges but about a comma-separated list of fields.
-
@osevill you could use a scripting language to generate the list from 1 to 200 for you. I use bash here, but you can do it with whatever you prefer. Take my sample input file and let us assume that we want only 3 columns: you can then "print" the fields.txt content inside the miller command and run it.
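The code snippets in this reply did not survive extraction; a minimal sketch of the approach is below. The file names (fields.txt, source_file.csv, output_file.csv) and the use of GNU seq are my assumptions, not the original snippets:

```shell
# Sketch: generate the comma-separated list 1..200 with seq and save it
# to fields.txt (file names are placeholders).
seq -s, 1 200 > fields.txt

# "Print" the list into a variable for use inside the mlr command.
fields=$(cat fields.txt)
echo "$fields" | cut -c1-11   # first entries: 1,2,3,4,5,6

# The mlr invocation would then look like (not executed here):
# mlr --csv --allow-ragged-csv-input --headerless-csv-input \
#   cut -f "$fields" then unsparsify -f "$fields" source_file.csv > output_file.csv
```

This avoids typing the 200-entry list by hand while still satisfying unsparsify's streaming-mode requirement of an explicit field list.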
-
My original issue was that I wanted to use miller to clean up a large, ragged CSV file (over 1 million rows, 200 fields), where some of the data rows have more or fewer fields than the header row. After encountering out-of-memory issues using unsparsify in the default retain-in-memory mode, I read the docs and learned that if you specify the fields (in my case, all of them), unsparsify will run in streaming mode and use much less memory.
After some trial and error, this seems to work for me:
mlr --csv --allow-ragged-csv-input --headerless-csv-input cut -f 1,2,3,...,200 then unsparsify -f 1,2,3,...,200 ./source_file.csv > ./output_file.csv
where I replace
1,2,3,...,200
above with every number from 1 to 200, comma-delimited. This results in a very long mlr statement. (I'm cutting to exclude any fields after field 200; I'm unsparsifying to backfill any missing fields with a blank/empty value.)
My question is... does mlr support ranges for index-numbered headers, which would allow me to use something like -f 1:200 for field numbers 1 through 200, inclusive?
If not, is this a feature that could be added when headers are referenced as index numbers?
Interestingly enough, the DeepWiki @aborruso linked to in another discussion indicates mlr does support ranges in the format 1-n, but when I tried this, it did not work.
If there's a better way to go about cleaning up a large, ragged csv file while avoiding memory issues, I welcome any suggestions.
As an alternative to adding range support, could unsparsify have a new --streaming flag, so the user doesn't need to list every field in the header in order to activate streaming mode?
I tested all the above using mlr 6.12 on a Windows PC with 16 GB RAM.
With almost no other apps running, I was able to successfully clean up only about 100,000 rows by adding a
head -n 100000
verb before the unsparsify in default memory-retention mode. Also, I need to use the --csv format because some of my data has commas encapsulated in double-quotes, and the CSV format in mlr is the only one I'm aware of that ignores embedded commas as field separators. Thank you.
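To illustrate the point about embedded commas, here is a small shell sketch (the sample line is made up) showing why a CSV-aware parser is needed: a naive comma split breaks a quoted field in two, while a CSV reader keeps it whole.

```shell
# A quoted CSV field may itself contain the delimiter:
line='name,"Doe, Jane",42'

# Naive comma split (plain shell word splitting on IFS=','):
IFS=','; set -- $line; unset IFS
echo "$#"   # 4 pieces -- the quoted field is wrongly broken in two

# A CSV-aware reader (e.g. mlr with --csv) parses the same row as
# 3 fields, keeping "Doe, Jane" intact.
```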