Polars Backend over Pandas #1951
Comments
Hi @kailukowiak, this would indeed be quite a big ask and huge shift for a library named Have you had a look at our SDK for pandas at scale work?
This is currently available as a release candidate, but we hope to release it in the coming weeks. One major shortcoming of Polars, which I have raised with its maintainers, is that it's limited to a single node. This is why we have preferred to invest in Modin and Ray to support distributed computing.
Hi @jaidisido. Yes, I definitely understand that it would be a lot of work and my heart did sink when I saw the repo had been renamed (I believe) from I've used wrangler and aws batch/pcluster before and ran into API call throttling issues but it's possible that won't be an issue any more because I was using the now defunct Governed Tables. I presume updating Athena/Lake Formation would be a cheaper and less limited api call. My main concern with the distributed approach using pandas is the |
When using Modin/Ray, the data is spread across the cluster, whereas with pandas/Polars all the data must live on the same node. So instead of using one massive EC2 instance, as in your case, you can create a cluster of smaller machines and the library handles distributing the data across them.
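To make the single-node versus cluster distinction concrete, here is a minimal sketch of the general idea: the rows are split into partitions, each partition is aggregated on its own, and the partial results are combined. It uses only the Python standard library, with threads standing in for cluster nodes. The helper names (`split_rows`, `partial_sum`) are hypothetical, and this is not how Modin or Ray are implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_parts):
    """Split rows into up to n_parts chunks (stand-ins for per-node partitions)."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum(chunk):
    """Aggregate one partition independently of the others."""
    return sum(row["amount"] for row in chunk)


# Toy data: on a real cluster no single worker would hold all of these rows.
rows = [{"amount": i} for i in range(1, 11)]

partitions = split_rows(rows, 3)

# Each worker (a thread here, a node in a real cluster) handles one partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Combining the partial results gives the same answer as a single-node sum.
total = sum(partials)
print(partials, total)
```

The trade-off discussed in this thread follows from that shape: a single-node engine avoids the split and combine steps entirely, but it is capped by the memory of one machine.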
Yes, my concern is that because of the distributed nature, the I'll close the issue now. Thanks. |
Given the popularity that Polars has been gaining throughout 2023, what are the odds of revisiting the decision to invest in Modin/Ray? Polars does seem to be the future of data frames parallelized within a single machine.
Given all the hype around Polars recently, and with other packages like scikit-learn now supporting Polars dataframes, it would make sense to re-evaluate this.
I've started working on this: aws_sdk_polars
Sweet. I love it. Thanks. |
Unfortunately it does not cover Athena (yet), but it's a nice contribution nonetheless.
I prioritized supporting S3 and other I/O-intensive computation first. With Athena, 99% of the computation happens server-side, so migrating to Polars doesn't give you much benefit. That's why I have a super lightweight library for Athena + Polars only; check it out: https://github.com/MacHu-GWU/fixa-project/blob/main/fixa/aws/aws_athena_query.py
Hi @MacHu-GWU, have you considered contributing a Polars backend as part of the aws-sdk-pandas project?
I read the aws-sdk-pandas source code. Given how it is designed, I feel it would be impossible to create an abstract dataframe engine layer that makes it compatible with both pandas and Polars.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed. |
Is your feature request related to a problem? Please describe.
Pandas can be slow and memory intensive. When dealing with large files, I need much more memory on my EC2 instance than I would if I were using Polars.
Also, though this is a matter of personal preference, the Polars API can be much cleaner.
Describe the solution you'd like
It would be really nice if I could use a faster and more memory efficient DataFrame API to ingest and export data.
Describe alternatives you've considered
I often convert pandas DataFrames to Polars ones and then process the data before writing it back out. This works fine on small data sets, but on large ones it would be nice never to have to allocate all the memory pandas needs.
Comments
I know this is a large ask, and Polars currently isn't that popular, but I think this would be a huge performance increase if implemented and would make my ETL much prettier (subjectively) too.