groupby still seems slow, even with flox. Am I doing it right? #7566

flyaflya · 2023-02-27T21:18:16Z

Below, I use a 4,000-row dataframe and the Split-Apply-Combine paradigm. Using groupby on a pandas dataframe is WAY faster than using groupby on an xarray dataset. I thought this might get resolved with the introduction of flox, but as the code below shows, the performance difference remains.

import pandas as pd
import xarray as xr
import flox
import timeit

xr.set_options(use_flox=True)
print(xr.__version__)

## load the shipments data, indexed by shipID
shipDF = pd.read_csv("https://raw.githubusercontent.com/flyaflya/persuasive/main/shipments.csv", 
                     parse_dates=['plannedShipDate','actualShipDate'],
                     index_col = 'shipID')

## convert the first 4,000 rows to an xarray Dataset
shipLineItemDS = shipDF.head(4000).pipe(xr.Dataset.from_dataframe)

print("Start of Timing Test")

## even using flox, xarray groupby method is slow
print(timeit.timeit('shipLineItemDS.groupby("shipID").first()', number = 3, globals=globals()))

print("-----------------------")

## pandas method is fast
print(timeit.timeit('shipLineItemDS.to_dataframe().groupby("shipID").first()', number = 3, globals=globals()))

print("End of Timing Test")

2023.2.0
Start of Timing Test
27.578695357000015
-----------------------
0.022047137999834376
End of Timing Test

The output shows a dramatic difference in speed between the xarray groupby method (~27 seconds) and the pandas groupby method (~0.02 seconds). Is this still a known issue, or am I not accessing flox properly?

Answered by dcherian

Feb 27, 2023

dcherian · 2023-02-27T23:30:42Z

Maintainer

We haven't enabled first and last with flox yet; I need to think about how to do it with dask. I would just use pandas, since you can.

Or go to flox directly if you really want array support (first will work for numpy arrays, and nanfirst will work for both numpy and dask):

flox.xarray.xarray_reduce(da, by, func="first")

EDIT: Adding some tests to flox would be a very helpful contribution! (xarray-contrib/flox#29)

0 replies

flyaflya · 2023-02-28T16:52:23Z

Author

Deepak:

Thanks for this answer. I tested with mean() to get a comparison, and you are right, mean() is a whole lot faster.

In terms of going xarray -> pandas -> xarray, I have a strong preference to stay in one package for data manipulation; too much mental friction traversing packages all the time when I am not in Python on a daily basis.

I will play with going to flox directly and will consider contributing tests - although I know ZERO about dask. I typically work with in-memory datasets, but really like labels :-)

Here is the code and test results showing that .mean() is indeed faster than .first().

import pandas as pd
import xarray as xr
import timeit
import flox

xr.set_options(use_flox=True)
print(xr.__version__)

## get shipment data as xarray dataset
shipDS = (pd.read_csv("https://raw.githubusercontent.com/flyaflya/persuasive/main/shipments.csv", 
                     parse_dates=['plannedShipDate','actualShipDate'])
          .pipe(xr.Dataset.from_dataframe)
)
shipDS = shipDS.head(10000) # 96,805 observations reduced to 10,000 to shorten test time

## time comparison
print("--------------------------------------------------------------------")
print("XARRAY: dataset-> Split-Apply-Combine using .first() --- Time to Complete:")
print(timeit.timeit('shipDS.groupby("shipID").first()', number = 3, globals=globals()))
print("--------------------------------------------------------------------")
print("XARRAY & Pandas: dataset-> to_dataframe->Split-Apply-Combine-> to_dataset using .first() --- Time to Complete:")
print(timeit.timeit('shipDS.to_dataframe().groupby("shipID").first().pipe(xr.Dataset.from_dataframe)', number = 3, globals=globals()))
print("--------------------------------------------------------------------")

## note from https://github.com/pydata/xarray/discussions/7566#discussioncomment-5132779
## suggests the .mean() would be much faster, so showing time test below
print("--------------------------------------------------------------------")
print("XARRAY: dataset-> Split-Apply-Combine using .mean() --- Time to Complete:")
print(timeit.timeit('shipDS.groupby("shipID").mean()', number = 3, globals=globals()))
print("--------------------------------------------------------------------")
print("XARRAY & Pandas: dataset-> to_dataframe->Split-Apply-Combine-> to_dataset using .mean() --- Time to Complete:")
print(timeit.timeit('shipDS.to_dataframe().groupby("shipID").mean(numeric_only = True).pipe(xr.Dataset.from_dataframe)', number = 3, globals=globals()))
print("--------------------------------------------------------------------")

2023.2.0
--------------------------------------------------------------------
XARRAY: dataset-> Split-Apply-Combine using .first() --- Time to Complete:
88.56418734400359
--------------------------------------------------------------------
XARRAY & Pandas: dataset-> to_dataframe->Split-Apply-Combine-> to_dataset using .first() --- Time to Complete:
0.05769414600217715
--------------------------------------------------------------------
--------------------------------------------------------------------
XARRAY: dataset-> Split-Apply-Combine using .mean() --- Time to Complete:
0.15912020800169557
--------------------------------------------------------------------
XARRAY & Pandas: dataset-> to_dataframe->Split-Apply-Combine-> to_dataset using .mean() --- Time to Complete:
0.03216619799786713
--------------------------------------------------------------------

2 replies

keewis Feb 28, 2023
Maintainer

note that the time returned by timeit.timeit is cumulative: it is the total for all number=3 executions, not a per-run time, so the result has to be treated with care. The ipython %timeit magic is a bit easier to interpret:

In [21]: %timeit shipLineItemDS.to_dataframe().groupby("shipID").mean(numeric_only=True)
592 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [22]: %timeit shipLineItemDS.groupby("shipID").mean()
3.25 ms ± 6.84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %timeit shipLineItemDS.to_dataframe().groupby("shipID").first().to_xarray()
2.07 ms ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %timeit shipLineItemDS.groupby("shipID").first()
1.97 s ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

that still shows that pandas is much faster, but the total time is not nearly as dramatic.
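Outside IPython, a per-run figure can be recovered from the standard library by dividing the cumulative total by `number`. The sketch below uses a hypothetical helper, `per_run_seconds`, and a trivial placeholder statement rather than the groupby calls from this thread.

```python
import timeit


def per_run_seconds(stmt, number=3, repeat=5, **kwargs):
    """Return the best per-execution time in seconds for stmt.

    timeit.repeat returns one cumulative total per repetition, each covering
    `number` executions, so the best total is divided by `number`.
    """
    totals = timeit.repeat(stmt, number=number, repeat=repeat, **kwargs)
    return min(totals) / number


# Placeholder statement (assumption); substitute the groupby expression and
# pass globals=globals() to time code that uses names from the session.
total = timeit.timeit("sum(range(1000))", number=3)
per_run = per_run_seconds("sum(range(1000))", number=3)

print(f"total for 3 runs: {total:.6f} s")
print(f"best per run:     {per_run:.6f} s")
```

Taking the minimum over several repetitions is the approach the timeit documentation suggests, since higher values usually reflect interference from other processes rather than the code itself.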

flyaflya Feb 28, 2023
Author

thanks @keewis. I did not realize the cumulative part. That last line, shipLineItemDS.groupby("shipID").first(), is the killer though. For the full 96,000+ observation dataset, it takes 5 minutes for the result on replit. See here:

https://replit.com/@AdamFleischhack/SardonicPrimeBlocks#main.py

When using R or pandas, this result is sub-second, so there must be some room for optimization here. I am glad the flox improvements for aggregation functions like mean() do seem really successful.

dcherian · 2023-07-27T16:31:43Z

Maintainer

Update: the latest flox supports first and last for non-dask arrays, and nanfirst and nanlast for all array types.
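When comparing these reductions against pandas, it may help to keep in mind that pandas's groupby .first() returns the first non-null value in each group, which is closer to what the name nanfirst suggests than to a purely positional first. The sketch below illustrates the two behaviours in pandas only; the tiny dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): group 1 starts with a missing value.
df = pd.DataFrame(
    {
        "shipID": [1, 1, 2, 2],
        "quantity": [np.nan, 10.0, 20.0, np.nan],
    }
)
grouped = df.groupby("shipID")["quantity"]

# pandas .first() skips nulls: first non-null value per group.
first_non_null = grouped.first()

# Positional first: take row 0 of each group, even if it is NaN.
first_positional = grouped.agg(lambda s: s.iloc[0])

print(first_non_null)
print(first_positional)
```

On data with missing values the two can disagree, so a benchmark that compares a positional first against pandas's .first() is not necessarily comparing identical computations.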

0 replies

flyaflya · 2023-07-27T16:52:31Z

Author

Thanks for the follow-up!! It does seem to work faster. It is still orders of magnitude slower than pandas, but the time reduction achieved relative to the older flox versions does dramatically increase the usability of this workflow.

import pandas as pd
import xarray as xr
import flox
import timeit

xr.set_options(use_flox=True)
print(xr.__version__)

## load the shipments data, indexed by shipID
shipDF = pd.read_csv("https://raw.githubusercontent.com/flyaflya/persuasive/main/shipments.csv", 
                     parse_dates=['plannedShipDate','actualShipDate'],
                     index_col = 'shipID')

## convert the first 4,000 rows to an xarray Dataset
shipLineItemDS = shipDF.head(4000).pipe(xr.Dataset.from_dataframe)

print("Start of Timing Test")

## even using flox, xarray groupby method is slow
print(timeit.timeit('shipLineItemDS.groupby("shipID").first()', number = 3, globals=globals()))

print("-----------------------")

## pandas method is fast
print(timeit.timeit('shipLineItemDS.to_dataframe().groupby("shipID").first()', number = 3, globals=globals()))

print("End of Timing Test")

2023.7.0
Start of Timing Test
5.711965520000035
-----------------------
0.00877992899995661
End of Timing Test

2 replies

dcherian Jul 27, 2023
Maintainer

Ah we don't route Xarray's groupby.first to flox yet.

%timeit flox.xarray.xarray_reduce(shipLineItemDS,"shipID",func="first",skipna=False)
%timeit shipLineItemDS.to_dataframe().groupby("shipID").first()

Still slower but not bad.

5.5 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.12 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

flyaflya Jul 27, 2023
Author

oooh. that is much better!! is that routing of Xarray's groupby.first in the works? If not, I can file an issue if that is helpful.

Thanks for this!!
