DataFrames - Pandas or an alternative? #9

cmutel · 2024-04-12T18:53:37Z

Maintainer

Here are some Pandas alternatives:

Polars: Written in Rust for speed. Its API resembles Pandas in places, but it is not a drop-in replacement.
Dask: Scales to multiple machines. Can also run distributed tasks across clusters.
Modin: Drop-in replacement for Pandas that uses multiple cores. Uses Dask, Ray, or MPI for dispatch.
xarray: Not a dataframe per se, but allows labels across multiple dimensions and otherwise follows Numpy.

There are also Big Data alternatives like Spark, Snowflake, Databricks, and dbt.

We don't have great criteria for choosing between these alternatives. This might be one case where we postpone making a decision.

romainsacchi · 2024-04-12T19:30:09Z

Note that xarray isn't very good at storing sparse data; I noticed that you quickly run into memory issues. I believe it is possible to initialize it with a scipy.sparse matrix, but it is not really meant to be used that way (and working directly with scipy and dictionaries is often more straightforward). Storing only the non-null exchanges in a dataframe makes a lot more sense IMO (but since I do not really have a good mental picture of the prototype in question, I may be wrong).
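
To make the "non-null exchanges in a dataframe" idea concrete, here is a minimal sketch. The 4x4 matrix and the column names `row`, `col`, and `amount` are made up for illustration and are not taken from the prototype under discussion.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy matrix: 4 x 4, mostly zeros.
dense = np.array([
    [1.0, 0.0, 0.0, -0.2],
    [0.0, 1.0, 0.0, 0.0],
    [-0.5, 0.0, 1.0, 0.0],
    [0.0, -0.1, 0.0, 1.0],
])

# Long-format table: one row per non-null exchange.
rows, cols = np.nonzero(dense)
exchanges = pd.DataFrame({"row": rows, "col": cols, "amount": dense[rows, cols]})

# The same table converts to a scipy.sparse matrix when linear algebra is needed.
matrix = sparse.coo_matrix(
    (
        exchanges["amount"].to_numpy(),
        (exchanges["row"].to_numpy(), exchanges["col"].to_numpy()),
    ),
    shape=dense.shape,
).tocsr()

print(len(exchanges), "stored values instead of", dense.size)
```

The table only grows with the number of non-null entries, and the conversion to a sparse matrix stays a single call.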

1 reply

michaelweinold Apr 13, 2024

I investigated this a few weeks ago. It seems that xarray does now support sparse matrices (through scipy.sparse, as you noted). See also: sustainableaviation/EcoPyLot#8
Generally, I am not convinced additional dimensions are justified in most cases - simply adding another label column to a two-dimensional tabular data structure introduces less overhead.
I found that moving from Pandas to Apache Spark was rather straightforward - most of the PySpark API is similar. So postponing "big data" infrastructure decisions might be prudent, @cmutel.
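
As a rough illustration of the label-column point above, the sketch below stores what would otherwise be a third "scenario" dimension as an ordinary column. All names and values (`input`, `output`, `scenario`, the amounts) are hypothetical.

```python
import pandas as pd

# Hypothetical data: two exchanges in a baseline scenario.
base = pd.DataFrame({
    "input": ["steel", "electricity"],
    "output": ["car", "car"],
    "amount": [900.0, 50.0],
})

# Instead of a third "scenario" axis, tag each row with a label column.
long = pd.concat(
    [
        base.assign(scenario="2020"),
        base.assign(scenario="2050", amount=base["amount"] * 0.5),
    ],
    ignore_index=True,
)

# Select one slice of the would-be dimension with an ordinary filter.
future = long[long["scenario"] == "2050"]

# Pivot only when a wide, dimension-like view is actually needed.
wide = long.pivot_table(index=["input", "output"], columns="scenario", values="amount")
```

The long table stays two-dimensional however many labels are added, and the pivot is available on demand for the cases where a dimension-like view helps.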

romainsacchi · 2024-04-12T19:38:27Z

Other candidate: datatable. Their pitch: "a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible". It seems faster than polars on some operations.

cmutel · 2024-09-28T06:11:26Z

Maintainer Author

This feels like premature optimization. Let's build a data API which allows data to be returned as Numpy arrays (both normal and recordarrays), DataFrames, and plain Python or JavaScript objects, and build something more complicated only when we really need it.
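
One possible shape for such an API, purely as a sketch: the function name `get_data`, the `kind` argument, and the record fields are all assumptions for illustration, not an agreed design.

```python
import numpy as np
import pandas as pd

# Hypothetical records; the field names are illustrative only.
RECORDS = [
    {"row": 0, "col": 0, "amount": 1.0},
    {"row": 2, "col": 0, "amount": -0.5},
    {"row": 1, "col": 1, "amount": 1.0},
]
DTYPE = np.dtype([("row", "i8"), ("col", "i8"), ("amount", "f8")])


def get_data(kind="objects"):
    """Return the same records in the requested container."""
    if kind == "objects":
        # Plain Python dicts, copied so callers cannot mutate the source.
        return [dict(r) for r in RECORDS]
    if kind == "recarray":
        tuples = [(r["row"], r["col"], r["amount"]) for r in RECORDS]
        return np.array(tuples, dtype=DTYPE).view(np.recarray)
    if kind == "array":
        # Plain 2-D float array; column order is row, col, amount.
        return np.array(
            [[r["row"], r["col"], r["amount"]] for r in RECORDS], dtype=float
        )
    if kind == "dataframe":
        return pd.DataFrame(RECORDS)
    raise ValueError(f"Unknown kind: {kind!r}")
```

For example, `get_data("dataframe")` returns a DataFrame with columns `row`, `col`, and `amount`, while `get_data("recarray").amount` gives the amounts as a Numpy array. Keeping one simple record source behind the function leaves room to swap in a different backend later without changing callers.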

