nth function/method for groupby #3128

@samukweku

It would be useful to have an nth function, similar to pandas groupby nth, where the selected row or rows can be returned alongside other aggregations when by is present:

from datatable import dt, f, by

data = {'x': [1, 1, 1, 2, 2, 3],
        'y': [1, 2, 3, 1, 2, 1],
        'n': [3, 2, 1, 1, 2, 1]}

DT = dt.Frame(data)

DT

   |     x      y      n
   | int32  int32  int32
-- + -----  -----  -----
 0 |     1      1      3
 1 |     1      2      2
 2 |     1      3      1
 3 |     2      1      1
 4 |     2      2      2
 5 |     3      1      1
[6 rows x 3 columns]

The nth function/method should return the row or rows along with other aggregations:

DT[:, {'sum': f.y.sum(), 'nth' : f.y.nth(1)}, 'x']

   |     x    sum    nth
   | int32  int64  int32
-- + -----  -----  -----
 0 |     1      6      2
 1 |     2      3      2
 2 |     3      1     NA
[3 rows x 3 columns]
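
To pin down the semantics requested above, here is a minimal pure-Python sketch of a per-group nth that yields a missing value for groups that are too short. The helper name `nth_per_group` is hypothetical and is not part of datatable; `None` stands in for NA.

```python
def nth_per_group(keys, values, n):
    """Map each group key to the n-th value of its group.

    Hypothetical reference helper, not a datatable API.
    Groups with no n-th row map to None (standing in for NA).
    """
    groups = {}
    for k, v in zip(keys, values):
        # Collect values per key, preserving first-seen group order.
        groups.setdefault(k, []).append(v)
    return {
        k: (vs[n] if -len(vs) <= n < len(vs) else None)
        for k, vs in groups.items()
    }


x = [1, 1, 1, 2, 2, 3]
y = [1, 2, 3, 1, 2, 1]

nth = nth_per_group(x, y, 1)
print(nth)  # {1: 2, 2: 2, 3: None}
```

Group 3 has only one row, so its entry is missing rather than dropped, which matches the NA in the desired output.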

One way to currently implement this would be to do a cbind:

dt.cbind(DT[:, {"sum": f.y.sum()}, "x"], DT[1, 'n', by('x', add_columns=False)], force=True)

which might not align properly, because cbind pairs the two frames by row position rather than by x; a much safer option would be a left join on x.
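
The alignment concern can be illustrated without datatable. The sketch below is pure Python with hypothetical helper names, and it assumes that selecting row 1 per group omits groups that have no second row. With the issue's data the short group happens to be last, so positional pairing would line up by coincidence; the made-up data here puts the short group in the middle to show the failure.

```python
def group_sums(keys, values):
    """Sum of values per group, in first-seen group order."""
    sums = {}
    for k, v in zip(keys, values):
        sums[k] = sums.get(k, 0) + v
    return sums


def group_nth_existing(keys, values, n):
    """n-th value per group (n >= 0), omitting groups that are too short."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    return {k: vs[n] for k, vs in groups.items() if n < len(vs)}


def cbind_positional(left, right):
    """Pair rows by position, padding the shorter side with None."""
    right_vals = list(right.values())
    right_vals += [None] * (len(left) - len(right_vals))
    return [(k, s, r) for (k, s), r in zip(left.items(), right_vals)]


def left_join(left, right):
    """Pair rows by group key; unmatched keys get None."""
    return [(k, s, right.get(k)) for k, s in left.items()]


# Made-up data: group 2 has a single row and sits between two longer groups.
x = [1, 1, 2, 3, 3]
y = [1, 2, 1, 1, 2]

sums = group_sums(x, y)
second = group_nth_existing(x, y, 1)

print(cbind_positional(sums, second))  # [(1, 3, 2), (2, 1, 2), (3, 3, None)]
print(left_join(sums, second))         # [(1, 3, 2), (2, 1, None), (3, 3, 2)]
```

The positional version assigns group 3's value to group 2, while the keyed join leaves group 2 missing as intended.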

The implementation in #2176 selects a single row or slice, and since datatable evaluates i before j, the results won't come out right when combined with other aggregations. Also, sequences are not currently accepted within i; it would be nice to be able to select a list of positions with the nth function.
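
As a rough illustration of the list-selection request, the following pure-Python sketch returns several positions per group. The helper `nth_rows` is hypothetical, and skipping positions that a group does not have (rather than emitting a missing value) is an assumption; how multiple rows per group should combine with other aggregations is left open here.

```python
def nth_rows(keys, values, positions):
    """Return (key, value) pairs for each requested position in each group.

    Hypothetical helper, not a datatable API. Positions that fall outside
    a group are skipped (an assumption made for this sketch).
    """
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    out = []
    for k, vs in groups.items():
        for p in positions:
            if -len(vs) <= p < len(vs):
                out.append((k, vs[p]))
    return out


x = [1, 1, 1, 2, 2, 3]
y = [1, 2, 3, 1, 2, 1]

print(nth_rows(x, y, [0, 1]))  # [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
```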
