Skip to content

adding sequence numbers by group #322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
shenwei356 opened this issue Apr 18, 2025 · 12 comments
Open

adding sequence numbers by group #322

shenwei356 opened this issue Apr 18, 2025 · 12 comments

Comments

@shenwei356
Copy link
Owner

Assume this is the input ledger.csv:

transaction uuid,date,payee,amount
aaa123,2024-03-01,Bill's Garage,50
aaa123,2024-03-01,Bank account,50
456abc,2024-03-05,Solar Power Monopoly,200
456abc,2024-03-05,Credit card,200

and that we define a gawk program, seq.awk, which adds sequence numbers within transaction uuid:

# seq.awk:
BEGIN { OFS = "," }
NR == 1 { print $0, "n" }
NR > 1 { if ($1 == prev) ++n; else n=1; prev = $1; print $0, "payee " n }

Then we can run:

gawk --csv -f seq.awk ledger.csv | csvtk spread -k n -v payee

giving

transaction uuid,date,amount,payee 1,payee 2
aaa123,2024-03-01,50,Bill's Garage,Bank account
456abc,2024-03-05,200,Solar Power Monopoly,Credit card

I didn't notice anything in csvtk for adding sequence numbers by group but if it exists we could use that in place of the gawk program.

Originally posted by @ggrothendieck in #320

@shenwei356
Copy link
Owner Author

One way is to add a new special replacement symbol to csvtk replace.

{nr}    Record number, starting from 1 (existing)
{xx}    Record number of the value in the specified column

shenwei356 added a commit that referenced this issue Apr 18, 2025
@shenwei356
Copy link
Owner Author

Implemented @ggrothendieck

Special replacement symbols:

  {nr}    Record number, starting from 1
  {gnr}   Record number within a group (value of field -g/--gnr-field), starting from 1

  -g, --gnr-field string                    select a field for a group-specific record number {gnr}
$ cat ledger.csv \
    | csvtk mutate -f 'transaction uuid' -n n  \
    | csvtk replace -f n -p .+ -r 'payee {gnr}' -g 'transaction uuid' 

transaction uuid   date         payee                  amount   n      
----------------   ----------   --------------------   ------   -------
aaa123             2024-03-01   Bill's Garage          50       payee 1
aaa123             2024-03-01   Bank account           50       payee 2
456abc             2024-03-05   Solar Power Monopoly   200      payee 1
456abc             2024-03-05   Credit card            200      payee 2


$ cat ledger.csv \
    | csvtk mutate -f 'transaction uuid' -n n  \
    | csvtk replace -f n -p .+ -r 'payee {gnr}' -g 'transaction uuid' \
    | csvtk spread -k n -v payee \
    | csvtk pretty 

transaction uuid   date         amount   payee 1                payee 2     
----------------   ----------   ------   --------------------   ------------
aaa123             2024-03-01   50       Bill's Garage          Bank account
456abc             2024-03-05   200      Solar Power Monopoly   Credit card 

@ggrothendieck
Copy link

What do you think about this for replace

  • -p defaults to .*
  • -g defaults to the -f field
  • flags to specify the start (--start) and increment (--incr) for numbering. Defaults are 1 and 1.
  • {enr} which is enumeration numbering where all rows in the first group get assigned 1, all rows in the second group get assigned 2 and so on.

In terms of the example this would allow the replace line to be shortened to:

csvtk replace -f n -r "payee {gnr}"

@ggrothendieck
Copy link

ggrothendieck commented Apr 18, 2025

or maybe instead of the {enr} option it would be {rnr} which stands for numbering via run-length encoding and is more consistent with streaming.

Here the first row gets numbered 1 and each subsequent row gets the same number as the immediately prior row if its value is the same or is incremented by 1 if not. For example, if the field in question in the first 5 rows has the values a, a, b, a, a then {rnr} would number them as 1,1,2,3,3.

@shenwei356
Copy link
Owner Author

-p defaults to .*

Can do this.

-g defaults to the -f field

I thought about this before, but -f may specify multiple columns.

flags to specify the start (--start) and increment (--incr) for numbering. Defaults are 1 and 1.

I can imagine the need to set --incr

{enr} which is enumeration numbering where all rows in the first group get assigned 1, all rows in the second group get assigned 2 and so on.

I thought about this before, can do this.

or maybe instead of the {enr} option

It's too complicated. any use case?

@ggrothendieck
Copy link

Multiple columns for -g

This is possible in which case all rows having the same value in all grouping columns would be sequentially numbered.

--start

An example where one might want to use start at a number not 1 is if the numbering represents a year and we want to start with the year 2000, say, and then increment by 1 year or by 10 years or other. Also, I think there could be examples where the index should start at 0. For example, suppose there are A time units between rows. It might make more sense to number them as 0,A,2A,...
rather than as 1,A+1,2A+1,... .

rnr/enr

Here is an example of rnr/enr. The objective is to calculate the number of stretches of weight gain and weight loss. The example is taken from stackoverflow where it was solved using R.

Below we use gawk in place of csvtk for that one step in order to be able to actually run it. This is using Windows quoting and continuation so minor changes would be needed on Linux.

(There are a number of other related questions on stackoverflow that involve this too. Some of them would also require cumulative sums, cumulative sums by group and/or filling in empty values with the most recent non-empty value. I don't think csvtk has any of those but I might be wrong on that. In R these are available as cumsum, using cumsum with ave or in dplyr with group_by, fill in dplyr or na.locf in zoo.)

:: calculate the number of runs of weight loss and weight gain

csvtk mutate -n runs -f "Weight Change" --at 1 weight.csv |^
csvtk replace -p "[0-9.]" -r "" -f runs |^
gawk --csv -f rnr.awk |^
csvtk summary -f n:max

giving (using the rnr.awk and weight.csv files further below):

n:max
5.00

File weight.csv:

ID,Week,Weight Change
1,1,-0.5
1,2,-0.2
1,3,1
1,4,0.5
1,5,-0.5
2,1,-0.2
2,2,-0.2
2,3,0.6
2,4,-0.5
2,5,-0.3

File rnr.awk:

# rnr.awk:
BEGIN { OFS = "," }
NR == 1 { print $0, "n" }
NR > 1 { 
	if ($1 == prev) n = prevn; else n++
	prev = $1
	prevn = n
	print $0, n
}

@ggrothendieck
Copy link

ggrothendieck commented Apr 20, 2025

{gnr} does not seem to work in the following case:

File a.csv:

a,b,c
11,a,A
11,b,B
12,x,X
12,y,Y

Here n gets all 1's rather than 1,2,1,2. We are using Windows quoting here.

csvtk mutate -f a -n n a.csv | csvtk replace -f n -p ".*" -r "{gnr}" -g "a"

giving

a,b,c,n
11,a,A,1
11,b,B,1
12,x,X,1
12,y,Y,1

Version used:

>csvtk --version
csvtk v0.34.0

shenwei356 added a commit that referenced this issue Apr 21, 2025
@shenwei356
Copy link
Owner Author

OK, all implemented. Please test it.

Special replacement symbols:

  {nr}    Record number, starting from 1
  {gnr}   Record number within a group,    defined by value(s) of field(s) via -g/--group
  {enr}   Enumeration numbering of groups, defined by value(s) of field(s) via -g/--group
  {rnr}   Running numbering of groups,     defined by value(s) of field(s) via -g/--group

          Examples:
            group   {nr}   {gnr}   {enr}   {rnr}
            A        1       1       1       1
            A        2       2       1       1
            B        3       1       2       2
            B        4       2       2       2
            A        5       3       1       3
            C        6       1       3       4

An example with two columns as the group fields

$ echo -ne "a,b\nA,A\nA,A\nA,Z\nB,X\nB,X\nA,Z\nC,Y\n" \
    | csvtk mutate -f 1 -n n \
    | csvtk replace  -f n -p .+ -g a,b -r "{nr}   {gnr}   {enr}   {rnr}" \
           --start-gnr 100 --start-enr 100 --start-rnr 100 \
           --incr-gnr 10 --incr-enr 10 --incr-rnr 10 \
           --nr-width 4 \
    | csvtk rename -f n -n "nr     gnr    enr    rnr" \
    | csvtk pretty

a   b   nr     gnr    enr    rnr 
-   -   -------------------------
A   A   0001   0100   0100   0100
A   A   0002   0110   0100   0100
A   Z   0003   0100   0110   0110
B   X   0004   0100   0120   0120
B   X   0005   0110   0120   0120
A   Z   0006   0110   0110   0130
C   Y   0007   0100   0130   0140

@ggrothendieck
Copy link

A few comments:

  • I tried running that test code on Windows (with minor modifications for continuation lines, etc.) and got the same results.

  • regarding the example, in general one would want the nr,gnr,enr,rnr to be separate columns. Not sure how to do that in csvtk so I used a gawk program to clean it up but maybe you can suggest a better way

csvtk mutate -f 1 -n n nr.csv | ^
csvtk replace  -f n -p .+ -g a,b -r "{nr},{gnr},{enr},{rnr}" | ^
gawk -f clean.awk

where nr.csv is:

a,b
A,A
A,A
A,Z
B,X
B,X
A,Z
C,Y

and clean.awk is:

NR==1 {$0 = "a,b,nr,gnr,enr,rnr" }
{ gsub(/"/, "") }; 1
  • the difference between the {*nr} codes could cause confusion if not documented so it is important to explain what they mean in the --help output. That is, {nr} gives consecutive numbers, {gnr} (within group) does that separately within each group, {enr} (enumeration or across groups) labels each row of the same group with identical values and {rnr} (run length encoding) labels each run of identical values with the next consecutive number.

  • suggest that there be a --start and --incr which are used as defaults for the others.

  • consider removing --start-* and --incr-* and just having --start and --incr. Although that would
    mean using more than one replace if you needed different --start or --incr values for two of them in most cases the defaults will be used or the same values will be used so it would reduce the number of flags and ease the most frequent cases at the expense of slightly more verbosity for the less frequent cases. Also note this would remove the asymmetry where we do not currently have separate -g-* flags but we do have separate --start-* and --incr-* flags.

  • while we are at it other codes that could be useful would be {cumsum} which takes the cumulative running sum. {cumsum} of a column that is all 1's is the same as {nr} and {cumsum} of a column that is all 10's is the same as --start 10 --incr 10. Also one could have {gcumsum} to perform cumulative sums by group. Also {cummax}, {cummin}, {cumprod} and the {g...} versions of them. For example, suppose we want to generate balances and balances by Item. Then we could use {cumsum} and {gcumsum} for that on the Amount column.

Suppose a.csv is:

Item,Amount
A,3
A,-2
A,1
B,6
B,4

Specifying {cumsum} for the Balance column and {gcumsum} for the Group_Balance column and specifying -g Item would give:

Item,Amount,Balance,Group_Balance
A,3,3,3
A,-2,1,1
A,1,2,2
B,6,8,6
B,4,12,10

@shenwei356
Copy link
Owner Author

cat nr.csv \
    | csvtk mutate -n nr  | csvtk replace -f nr  -r "{nr}" \
    | csvtk mutate -n gnr | csvtk replace -f gnr -r "{gnr}" -g a,b \
    | csvtk mutate -n enr | csvtk replace -f enr -r "{enr}" -g a,b \
    | csvtk mutate -n rnr | csvtk replace -f rnr -r "{rnr}" -g a,b \
    | csvtk pretty

a   b   nr   gnr   enr   rnr
-   -   --   ---   ---   ---
A   A   1    1     1     1
A   A   2    2     1     1
A   Z   3    1     2     2
B   X   4    1     3     3
B   X   5    2     3     3
A   Z   6    2     2     4
C   Y   7    1     4     5

Whether using just one pair of --start and --incr, or using separate pairs, is hard to satisfy different users. I prefer using just one pair to simplify the setting.

For {cumsum} and other stuff, I don't think this replace command should be overloaded with increasingly complex functionality.

@ggrothendieck
Copy link

Is there some current way to do a cumulative sum or cumulative sum by group in csvtk?

@shenwei356
Copy link
Owner Author

Unfortunately, no. csvtk summary can only output the sum by group.

shenwei356 added a commit that referenced this issue Apr 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants