-
Notifications
You must be signed in to change notification settings - Fork 16
[packages/calitp_data_analysis] Auth helpers for accessing GeoPandas geospatial data in Google Cloud Storage #4004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f5d77f9
to
a079d32
Compare
Enforce poetry version for the calitp-data-analysis package so that we have compatible versions across developer machines
a079d32
to
92e492a
Compare
e07281d
to
8fe14e4
Compare
import gcsfs # type: ignore | ||
import geopandas # type: ignore | ||
import google.auth # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vevetron How have you all typically been handling mypy errors like Skipping analyzing *: module is installed, but missing library stubs or py.typed marker [import-untyped]
? There are type/stub packages somewhat related to these, but none looked exactly right. The types-geopandas package is only compatible with geopandas version 1.0.1 and above and we use version 0.14. The stub packages for the google packages are not actually maintained by google. Would you prefer I try to use some of those instead of ignoring here?
@@ -21,7 +21,7 @@ repos: | |||
# Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source | |||
entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8 | |||
- repo: https://github.com/psf/black | |||
rev: 23.1.0 | |||
rev: 24.10.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this match the Black version specified in toml files. Let me know if that's ok or not.
I'm surprised GH Actions is running Black across all files in the repo and not sticking with just files that were added or modified as part of this PR. It is failing the build–is that intentional? As-is, it's failing the build so it seems I would either need to change that from running across all files or go ahead and commit the auto-reformatting changes from other files I didn't touch in this PR. Let me know what's preferred! https://github.com/cal-itp/data-infra/actions/runs/15863835524/job/44726744447?pr=4004#step:5:75 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good so far! I have a couple comments but they're both pretty minor! I'm excited to re-review, let me know also when you handle the gpd.read_file
case as documented in the original issue.
@@ -0,0 +1,40 @@ | |||
import gcsfs # type: ignore | |||
import geopandas # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit - it's typical to use import geopandas as gpd
Passes the auth credentials from Google auth as storage option token | ||
""" | ||
storage_options = kwargs.get("storage_options", {}) | {"token": self.token} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this has the intended effect - if you do something like the below, you still get an error because it still passes storage_options
twice, just one of the version has an empty dict.
>>> a = reader.read_parquet("gs://calitp-analytics-data/data-analyses/general_csis/general_CAPTI_alignment_metrics_intake/cleaned_arcgis_geography.parquet", storage_options={"foo": "bar"})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jovyan/data-infra/packages/calitp-data-analysis/calitp_data_analysis/gcs_geopandas.py", line 26, in read_parquet
return geopandas.read_parquet(
^^^^^^^^^^^^^^^^^^^^^^^
TypeError: geopandas.io.arrow._read_parquet() got multiple values for keyword argument 'storage_options'
Description
Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.
Resolves #3935
Type of change
How has this been tested?
See unit tests added in this PR's diff.
Post-merge follow-ups
follow-ups