Skip to content

[packages/calitp_data_analysis] Auth helpers for accessing GeoPandas geospatial data in Google Cloud Storage #4004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

raebot
Copy link

@raebot raebot commented Jun 24, 2025

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

Resolves #3935

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

See unit tests added in this PR's diff.

Post-merge follow-ups

  • No action required
  • Actions required (specified below)

follow-ups

@raebot raebot changed the title Auth helpers for accessing GeoPandas geospatial data in Google Cloud Storage [packages/calitp_data_analysis] Auth helpers for accessing GeoPandas geospatial data in Google Cloud Storage Jun 24, 2025
@raebot raebot force-pushed the gcs-auth-geopandas-helpers-3935 branch from f5d77f9 to a079d32 Compare June 24, 2025 22:20
@raebot raebot force-pushed the gcs-auth-geopandas-helpers-3935 branch from a079d32 to 92e492a Compare June 24, 2025 22:21
@raebot raebot force-pushed the gcs-auth-geopandas-helpers-3935 branch from e07281d to 8fe14e4 Compare June 24, 2025 23:45
Comment on lines +1 to +3
import gcsfs # type: ignore
import geopandas # type: ignore
import google.auth # type: ignore
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vevetron How have you all typically been handling mypy errors like Skipping analyzing *: module is installed, but missing library stubs or py.typed marker [import-untyped]? There are type/stub packages somewhat related to these, but none looked exactly right. The types-geopandas package is only compatible with geopandas version 1.0.1 and above and we use version 0.14. The stub packages for the google packages are not actually maintained by google. Would you prefer I try to use some of those instead of ignoring here?

@@ -21,7 +21,7 @@ repos:
# Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source
entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8
- repo: https://github.com/psf/black
rev: 23.1.0
rev: 24.10.0
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this match the Black version specified in toml files. Let me know if that's ok or not.

@raebot raebot requested review from vevetron and mrtopsyt June 24, 2025 23:57
@raebot raebot marked this pull request as ready for review June 24, 2025 23:58
@raebot
Copy link
Author

raebot commented Jun 25, 2025

I'm surprised GH Actions is running Black across all files in the repo and not sticking with just files that were added or modified as part of this PR. It is failing the build–is that intentional? As-is, it's failing the build so it seems I would either need to change that from running across all files or go ahead and commit the auto-reformatting changes from other files I didn't touch in this PR. Let me know what's preferred! https://github.com/cal-itp/data-infra/actions/runs/15863835524/job/44726744447?pr=4004#step:5:75

Copy link

@mrtopsyt mrtopsyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far! I have a couple comments but they're both pretty minor! I'm excited to re-review, let me know also when you handle the gpd.read_file case as documented in the original issue.

@@ -0,0 +1,40 @@
import gcsfs # type: ignore
import geopandas # type: ignore

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit - it's typical to use import geopandas as gpd

Passes the auth credentials from Google auth as storage option token
"""
storage_options = kwargs.get("storage_options", {}) | {"token": self.token}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this has the intended effect - if you do something like the below, you still get an error because it still passes storage_options twice, just one of the version has an empty dict.

>>> a = reader.read_parquet("gs://calitp-analytics-data/data-analyses/general_csis/general_CAPTI_alignment_metrics_intake/cleaned_arcgis_geography.parquet", storage_options={"foo": "bar"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jovyan/data-infra/packages/calitp-data-analysis/calitp_data_analysis/gcs_geopandas.py", line 26, in read_parquet
    return geopandas.read_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: geopandas.io.arrow._read_parquet() got multiple values for keyword argument 'storage_options'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Add decorator to simplify sso-related geopandas changes
2 participants