An nth function, similar to pandas groupby nth, where the selected row or rows stay as part of the new aggregations in the presence of by:
from datatable import dt, f, by
data = {'x': [1, 1, 1, 2, 2, 3],
'y': [1, 2, 3, 1, 2, 1],
'n': [3, 2, 1, 1, 2, 1]}
DT = dt.Frame(data)
DT
| x y n
| int32 int32 int32
-- + ----- ----- -----
0 | 1 1 3
1 | 1 2 2
2 | 1 3 1
3 | 2 1 1
4 | 2 2 2
5 | 3 1 1
[6 rows x 3 columns]
The nth function/method should return the row or rows along with other aggregations:
DT[:, {'sum': f.y.sum(), 'nth' : f.y.nth(1)}, 'x']
| x sum nth
| int32 int64 int32
-- + ----- ----- -----
0 | 1 6 2
1 | 2 3 2
2 | 3 1 NA
[3 rows x 3 columns]
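To pin down the semantics being requested, here is a minimal plain-Python sketch that reproduces the table above. The helper name sum_and_nth is hypothetical and is not part of datatable; it only illustrates the expected result, with None standing in for NA when a group is too short.

```python
from collections import defaultdict

def sum_and_nth(rows, n):
    """Group (x, y) pairs by x; return (x, sum of y, n-th y or None) per group."""
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append(y)  # dicts preserve first-seen key order
    out = []
    for x, ys in groups.items():
        # Negative n counts from the end of the group, as in Python indexing.
        nth = ys[n] if -len(ys) <= n < len(ys) else None
        out.append((x, sum(ys), nth))
    return out

x = [1, 1, 1, 2, 2, 3]
y = [1, 2, 3, 1, 2, 1]
result = sum_and_nth(zip(x, y), 1)
# result == [(1, 6, 2), (2, 3, 2), (3, 1, None)]
```

The point is that every group keeps its row in the output, even when the requested position does not exist in that group.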
One way to implement this currently would be to do a cbind:
dt.cbind(DT[:, {"sum": f.y.sum()}, "x"], DT[1, 'n', by('x', add_columns=False)], force=True)
which might not align properly (for instance, when some groups lack the requested row); a much safer option would be a left join.
The implementation in #2176 selects a single row or slice, and since within datatable i is executed before j, the results won't come out right. Also, at the moment, sequences are not accepted within i; it would be nice to be able to select lists with the nth function.