
[packages/calitp_data_analysis] Auth helpers for accessing GeoPandas geospatial data in Google Cloud Storage #4004

Open · wants to merge 9 commits into main
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -21,7 +21,7 @@ repos:
        # Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source
        entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8
  - repo: https://github.com/psf/black
-    rev: 23.1.0
+    rev: 24.10.0
Author:
I made this match the Black version specified in toml files. Let me know if that's ok or not.

    hooks:
      - id: black
        args: ["--config=./pyproject.toml"]

40 changes: 40 additions & 0 deletions packages/calitp-data-analysis/calitp_data_analysis/gcs_geopandas.py
@@ -0,0 +1,40 @@
import gcsfs # type: ignore
import geopandas # type: ignore


Tiny nit - it's typical to use import geopandas as gpd

import google.auth # type: ignore
Comment on lines +1 to +3
Author:

@vevetron How have you all typically been handling mypy errors like "Skipping analyzing *: module is installed, but missing library stubs or py.typed marker [import-untyped]"? There are type/stub packages somewhat related to these, but none looked exactly right: the types-geopandas package is only compatible with geopandas 1.0.1 and above and we use 0.14, and the stub packages for the google packages are not actually maintained by Google. Would you prefer I use some of those instead of ignoring here?
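
A narrower option, if the bare type: ignore comments feel too broad (this is only a sketch, not code from the PR): scope each ignore to the specific mypy error code, so unrelated type errors on those imports are still reported.

import gcsfs  # type: ignore[import-untyped]
import geopandas  # type: ignore[import-untyped]
import google.auth  # type: ignore[import-untyped]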



class GCSGeoPandas:
"""
GCSGeoPandas contains authentication helpers for interacting with Google Cloud Storage with GeoPandas
"""

def __init__(self):
"""Sets instance token to credentials returned from Google auth request"""
credentials, _ = google.auth.default()
self.token = credentials

def gcs_filesystem(self, **kwargs):
"""Returns a Google Cloud Storage Filesystem"""
return gcsfs.GCSFileSystem(token=self.token, **kwargs)

def read_parquet(self, path, *args, **kwargs):
"""Delegates to geopandas.read_parquet with storage option token

Passes the auth credentials from Google auth as storage option token
"""
storage_options = kwargs.get("storage_options", {}) | {"token": self.token}


I'm not sure whether this has the intended effect - if you do something like the below, you still get an error, because storage_options is still passed twice: once as the merged dict and once again through **kwargs (kwargs.get doesn't remove the key).

>>> a = reader.read_parquet("gs://calitp-analytics-data/data-analyses/general_csis/general_CAPTI_alignment_metrics_intake/cleaned_arcgis_geography.parquet", storage_options={"foo": "bar"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jovyan/data-infra/packages/calitp-data-analysis/calitp_data_analysis/gcs_geopandas.py", line 26, in read_parquet
    return geopandas.read_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: geopandas.io.arrow._read_parquet() got multiple values for keyword argument 'storage_options'
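
One way to avoid the duplicate keyword (a sketch, not code from this PR): pop storage_options out of kwargs before merging in the token, so the merged dict is the only copy that gets passed.

    def read_parquet(self, path, *args, **kwargs):
        """Delegates to geopandas.read_parquet, merging the token into any caller-supplied storage_options"""
        # .pop() removes the caller's storage_options from kwargs so it is not passed a second time via **kwargs
        storage_options = kwargs.pop("storage_options", {}) | {"token": self.token}
        return geopandas.read_parquet(path, *args, storage_options=storage_options, **kwargs)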

        return geopandas.read_parquet(
            path, storage_options=storage_options, *args, **kwargs
        )

    def geo_data_frame_to_parquet(self, geo_data_frame, path, *args, **kwargs):
        """Delegates to .to_parquet on the passed geo_data_frame, providing a Google Cloud Storage filesystem

        Fetches a gcsfs.GCSFileSystem instance authenticated with the Google auth credentials
        and passes it as the filesystem argument
        """
        gcs_filesystem = self.gcs_filesystem()
        return geo_data_frame.to_parquet(
            path, filesystem=gcs_filesystem, *args, **kwargs
        )
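
For reference, a hypothetical usage sketch (the bucket and object names below are made up, not taken from this PR):

gcs_geopandas = GCSGeoPandas()

# Read a GeoParquet file from GCS using the default Google credentials
gdf = gcs_geopandas.read_parquet("gs://example-bucket/stops.parquet")

# ...work with the GeoDataFrame, then write it back to GCS
gcs_geopandas.geo_data_frame_to_parquet(gdf, "gs://example-bucket/stops_cleaned.parquet")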