Skip to content

Feat: pems_data package #187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Feat: pems_data package #187

wants to merge 8 commits into from

Conversation

thekaveman
Copy link
Member

@thekaveman thekaveman commented Jul 21, 2025

Closes #179

What this PR does

Follows the same general approach and outline from #181 and #184.

New pems_data package

Create a new Python package with its own pyproject.toml listing dependencies and other metadata. The code is organized into the src structure similar to pems_web in #181 and pems_streamlit in #184.

Refactor pems_streamlit data access

#166 originally introduced loading of some key data from S3. This PR refactors that access from out of pems_streamlit and into pems_data in two initial modules:

  • pems_data.s3 contains general S3 access classes for PeMS
  • pems_data.stations contains station-specific S3 access classes for PeMS

pems_data is introduced as a dependency of pems_streamlit, which then uses the refactored classes in the same was as before.

Devcontainer

Add pems_data to the devcontainer Dockerfile so it is installed (as editable), with its dependencies.

Tests

Ensures coverage is reported for the new pems_data package. No changes needed to CI since pems_data is installed as a dependency of pems_streamlit as normal (we don't need editable installs in CI).

@thekaveman thekaveman self-assigned this Jul 21, 2025
Copy link

github-actions bot commented Jul 21, 2025

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  pems_data/src/pems_data/services
  stations.py
  pems_data/src/pems_data/sources
  base.py
  s3.py
Project Total  

This report was generated by python-coverage-comment-action

@thekaveman thekaveman force-pushed the feat/pems-data branch 2 times, most recently from a2ae975 to c341d86 Compare July 22, 2025 00:11
@thekaveman thekaveman force-pushed the refactor/pems-streamlit branch from e259419 to 3ed130e Compare July 22, 2025 13:41
@thekaveman thekaveman force-pushed the refactor/pems-streamlit branch from 774bb4e to 8a3ce21 Compare July 23, 2025 16:49
@thekaveman thekaveman force-pushed the feat/pems-data branch 2 times, most recently from 03937a7 to 46be947 Compare July 23, 2025 17:22
@thekaveman thekaveman force-pushed the refactor/pems-streamlit branch from 7e174b4 to 9be9a63 Compare July 23, 2025 22:56
Base automatically changed from refactor/pems-streamlit to main July 24, 2025 14:59
@thekaveman thekaveman marked this pull request as ready for review July 24, 2025 14:59
@thekaveman thekaveman requested a review from a team as a code owner July 24, 2025 14:59
Copy link
Member

@lalver1 lalver1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This organization is really nice @thekaveman!

There are maybe just a few things that need tweaking since I am testing these changes (started by running bin/build.sh) and I'm seeing this so far:

  • web built ✅
  • dev built ✅
  • streamlit ⚠️
    • ERROR [streamlit app_container 9/9] RUN pip install $(find /wheels -name pems_streamlit*.whl)
    • ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/caltrans/app/pems_data'
    • I think the reason is because streamlit's pyproject lists "pems_data @ file:./pems_data" as a dependency but we have not copied pems_data into the app_container stage. The fix would be to add COPY pems_data pems_data to the streamlit Dockerfile
    • Adding the above gets us over the error and pems_data starts getting installed, but then I see this error
      • Processing ./pems_data (from pems-streamlit==2024.12.2.dev183+gb6191ea.d20250725)
      • Getting requirements to build wheel: finished with status 'error'
      • LookupError: setuptools-scm was unable to detect version for /caltrans/app.
    • The fix is to also copy the .git directory to the app_container stage so that setuptools-scm can see it in the container. This fixes this but then we get this error
      • creating src/pems_data.egg-info
      • error: could not create 'src/pems_data.egg-info': Permission denied
    • I'm testing/troubleshooting this point now but wanted to let you know what I'm seeing so far.

@thekaveman
Copy link
Member Author

@lalver1 yep you are totally right. My bad, I was testing within the devcontainer and didn't run the Streamlit container directly.

Working on a fix...

@thekaveman
Copy link
Member Author

thekaveman commented Jul 25, 2025

@lalver1 1f9e7d9 should fix this issue, sorry and thanks again for catching it!

Your diagnosis was correct:

I think the reason is because streamlit's pyproject lists "pems_data @ file:./pems_data" as a dependency but we have not copied pems_data into the app_container stage.

And then of course you saw the issues with trying to copy the extra files in...

The issue happens because for the devcontainer, we want the file: dependency, but in the app container we just want to install the package as a package.

This commit makes a few changes. It sort of feels like a "hack", but honestly I wanted to try and avoid having to use a tool like poetry or hatch or something. Those tools were built to solve issues like this, but this workaround also works. LMK what you think:

  • Keep the pems_data @ file: ./pems_data reference in pems_streamlit/pyproject.toml -- this is for the devcontainer
  • In the Docker build for pems_streamlit:
    • Rewrite the reference to a direct package reference: pems_data @ file: ./pems_data becomes pems_data
    • Build the pems_data and pems_streamlit wheels and download all their dependencies as wheels into a "wheelhouse"
    • Install pems_streamlit from the "wheelhouse"

@thekaveman thekaveman requested a review from lalver1 July 25, 2025 19:28
@thekaveman
Copy link
Member Author

@lalver1 I'm considering whether we maybe want to go the other way...

  • Make the reference in pems_streamlit/pyproject.toml just the package pems_data
  • In the devcontainer build, rewrite it to the file: ... reference

Maybe that feels cleaner? And also, I'm worried about rewriting a file that is part of the source in terms of calculating the version info, since that looks at git metadata. If we rewrite a file, that would look like a pending change to git, which might change the version calculation.

@thekaveman
Copy link
Member Author

@lalver1 I did go the other way, turns out we don't need to rewrite anything. The direct package dependency reference works in both!

@lalver1
Copy link
Member

lalver1 commented Jul 25, 2025

Nice job @thekaveman! 🙌 It's so nice we didn't need to overwrite anything in the Dockerfile.
The locally running app works fine, and the devcontainer version does too, I can see changes made to the editable packages when running the debugger 👍 I'll just wait for a rebase to approve this.

- get a fully-formed bucket URL
- get a list of object prefixes matching a pattern
- read a parquet file into a DataFrame
- metadata and imputed aggregate 5min URLs
- get district metadata
- get station imputed aggregate 5min data
using a file: dependency on pems_data doesn't work during Docker build time
since we 1) don't copy the pems_data source for the install stage, and 2) we
don't copy the .git directory either -- nor do we want to copy either of these

instead, use a normal package dependency during Docker build
and build and download both local packages and all dependencies
into a 'wheelhouse' for install

update CI to install all 3 packages for tests
create an abstract base class interface for a data source,
which represents the "how" of fetching data

refactor S3Bucket to S3DataSource
the concept of a Service here is for the "what" of accessing
specific data for business logic

refactor the StationsBucket to a StationsService
StationsService needs an IDataSource to read from (the "how")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create pems_data Python package for data access layer
2 participants