adding sequence numbers by group #322
One way is to add a new special replacement symbol to
|
Implemented @ggrothendieck
|
What do you think about this for
In terms of the example this would allow the
|
Or maybe, instead of {enr}, the option could be {rnr}, which stands for numbering via run-length encoding and is more consistent with streaming. Here the first row is numbered 1, and each subsequent row gets the same number as the immediately preceding row if its value is the same, or that number incremented by 1 if not. For example, if the field in question has the values a, a, b, a, a in the first 5 rows, then {rnr} would number them 1, 1, 2, 3, 3. |
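The run-length numbering rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not csvtk's implementation; the function name `run_length_numbers` is made up for this example.

```python
def run_length_numbers(values):
    """Yield one number per value; consecutive equal values share a number.

    Hypothetical helper illustrating the {rnr} rule from the discussion.
    Only the previous value is remembered, so it works on a stream.
    """
    number = 0
    previous = None
    for value in values:
        # The first row always starts run 1; afterwards a new run starts
        # whenever the value differs from the immediately preceding row.
        if number == 0 or value != previous:
            number += 1
        previous = value
        yield number


print(list(run_length_numbers(["a", "a", "b", "a", "a"])))  # [1, 1, 2, 3, 3]
```

Because only the previous value is kept, memory use does not grow with the number of rows, which is the sense in which this numbering fits streaming.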
Can do this.
I thought about this before, but -f may specify multiple columns.
I can imagine the need to set --incr
I thought about this before, can do this.
It's too complicated. Is there any use case? |
Multiple columns for -g
This is possible, in which case all rows having the same value in all grouping columns would be sequentially numbered.
--start
An example where one might want to start at a number other than 1 is if the numbering represents a year and we want to start with the year 2000, say, and then increment by 1 year, by 10 years, or by some other step. Also, I think there could be examples where the index should start at 0. For example, suppose there are A time units between rows. It might make more sense to number them as 0, A, 2A, ...
rnr/enr
Here is an example of rnr/enr. The objective is to calculate the number of stretches of weight gain and weight loss. The example is taken from stackoverflow, where it was solved using R. Below we use gawk in place of csvtk for that one step in order to be able to actually run it. This uses Windows quoting and line continuation, so minor changes would be needed on Linux. (There are a number of other related questions on stackoverflow that involve this too. Some of them would also require cumulative sums, cumulative sums by group and/or filling in empty values with the most recent non-empty value. I don't think csvtk has any of those, but I might be wrong on that. In R these are available as
giving (using the rnr.awk and weight.csv files further below):
File weight.csv:
File rnr.awk:
|
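To make the -g and --start/--incr points concrete, here is a rough Python sketch of numbering rows within groups defined by several columns, with a configurable start and increment. It is an illustration only, not how csvtk does it; the function name `group_numbers`, its parameters, and the sample column names `name` and `kind` are all invented for this example.

```python
def group_numbers(rows, group_cols, start=1, incr=1):
    """Number rows sequentially within each group.

    Hypothetical helper: rows are dicts, and a group is identified by the
    combination of values in all of group_cols.
    """
    next_number = {}  # group key -> number to hand out next
    numbers = []
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        number = next_number.get(key, start)
        numbers.append(number)
        next_number[key] = number + incr
    return numbers


# Made-up sample data.
rows = [
    {"name": "A", "kind": "x"},
    {"name": "A", "kind": "y"},
    {"name": "A", "kind": "x"},
    {"name": "B", "kind": "x"},
    {"name": "A", "kind": "x"},
]

print(group_numbers(rows, ["name", "kind"]))                       # [1, 1, 2, 1, 3]
print(group_numbers(rows, ["name", "kind"], start=2000, incr=10))  # [2000, 2000, 2010, 2000, 2020]
print(group_numbers(rows, ["name"], start=0))                      # [0, 1, 2, 0, 3]
```

The second call mirrors the "start at year 2000 and step by 10 years" case, and the third the 0-based case.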
{gnr} does not seem to work in the following case.
File a.csv:
Here
giving
Version used:
|
OK, all implemented. Please test it.
An example with two columns as the group fields
|
A few comments:
where nr.csv is:
and clean.awk is:
Suppose a.csv is:
Specifying {cumsum} for the Balance column and {gcumsum} for the Group_Balance column and specifying -g Item would give:
|
Whether there is a single pair of --start and --incr or separate pairs, it is hard to satisfy every user. I prefer a single pair, to keep the settings simple. As for {cumsum} and similar features, I don't think the replace command should be overloaded with increasingly complex functionality. |
Is there some current way to do a cumulative sum or cumulative sum by group in csvtk? |
Unfortunately, no. |
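For anyone who needs this outside csvtk in the meantime, a running total and a running total by group can be computed in one pass. The following is a minimal Python sketch under assumed names: the `Item` column comes from the example above, while the `Amount` column, the sample values, and the function `cumulative_sums` are made up for illustration.

```python
from collections import defaultdict


def cumulative_sums(rows, value_col, group_col):
    """Return (overall running total, running total within group) per row.

    Hypothetical helper; rows are dicts with numeric values in value_col.
    """
    total = 0
    group_totals = defaultdict(int)  # group value -> running total so far
    result = []
    for row in rows:
        value = row[value_col]
        group = row[group_col]
        total += value
        group_totals[group] += value
        result.append((total, group_totals[group]))
    return result


# Made-up sample data.
rows = [
    {"Item": "apple", "Amount": 5},
    {"Item": "pear", "Amount": 3},
    {"Item": "apple", "Amount": 2},
]

print(cumulative_sums(rows, "Amount", "Item"))  # [(5, 5), (8, 3), (10, 7)]
```

The first element of each pair corresponds to what {cumsum} would produce and the second to what {gcumsum} with -g Item would produce in the proposal above.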
Originally posted by @ggrothendieck in #320