
Enable huge 'select' statements #10


Closed
wants to merge 8 commits into from

Conversation


@wchinfeman-cedar wchinfeman-cedar commented Jun 24, 2024

Summary

Use temp-tables to enable us to pass in a million IDs for a given 'select' statement

In Postgres, there is some limit to how long you can make a select * from table where id in (1, 2, 3, 4, 5, ...) statement.

But at Cedar, we sometimes receive really big files from third parties, and such a file may contain a list of millions of IDs. We may then want to select our data out of the database based on these millions of incoming IDs. This feature enables us to do that.

There are lots of existing helper methods that allow us to implement this. The general approach (sketched below) is:

  • use the existing machinery to create a temp-table containing the data that we care about.
  • join this temp-table against the real table, in order to pull data from the real table.
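
As a rough illustration of that approach (not the PR's actual helper; the function, temp-table, and column names here are made up), a minimal sketch using Django's raw cursor might look like this:

```python
from django.db import connections


def select_rows_by_huge_id_list(db_name, ids):
    """Sketch of the temp-table approach described above.

    Hypothetical code: the real PR reuses existing helper machinery
    rather than this exact function.
    """
    connection = connections[db_name]
    with connection.cursor() as cursor:
        # Temp tables are private to this database session.
        cursor.execute("CREATE TEMP TABLE incoming_ids (id bigint)")

        # Stage the IDs in manageable batches instead of one giant IN (...).
        batch_size = 10_000
        for start in range(0, len(ids), batch_size):
            chunk = ids[start:start + batch_size]
            placeholders = ",".join(["(%s)"] * len(chunk))
            cursor.execute(
                f"INSERT INTO incoming_ids (id) VALUES {placeholders}", chunk
            )

        # Join the temp table against the real table to pull out the rows.
        cursor.execute(
            "SELECT t.* FROM real_table t JOIN incoming_ids i ON i.id = t.id"
        )
        return cursor.fetchall()
```

Note that if ids contains duplicates, this join returns duplicated rows, which is exactly the pitfall discussed in the test plan below.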

Test Plan

How can reviewers test your change manually?

Tinker with the automated tests that are part of this PR

What automated tests did you add? If none, please state why.

I originally wanted to rely just on the existing tests, since I viewed this change as being purely about implementation details. Then I realized that my first attempt was actually broken in a way that the existing tests did not catch: you can get extra records back if your filter_data contains duplicates. So I wrote tests that illustrate this potential problem.
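
A small illustration of that failure mode (hypothetical values, not the PR's test code): staging duplicate IDs makes the join emit the matching row once per duplicate, so the incoming IDs have to be de-duplicated first.

```python
# Illustrative only: duplicates in the staged IDs duplicate the joined rows.
filter_ids = [42, 42, 7]  # id 42 appears twice in the incoming file

# After INSERT INTO incoming_ids (id) VALUES (42), (42), (7), the query
#   SELECT t.* FROM real_table t JOIN incoming_ids i ON i.id = t.id
# returns the row with id 42 twice.

# De-duplicating before staging (or selecting DISTINCT in the join)
# restores the expected one-row-per-id behaviour.
unique_ids = list(set(filter_ids))
```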

Security Impact

Please answer the questions below about your PR.
If yes, please explain and add cedar-team/security as reviewers.

Add/modify authentication, authorization, encryption, or HTTP headers?

No

Add/modify any PII or PHI data fields?

No

Add new third parties the app interacts with (e.g., Nordis/Stripe)?

No

Ask patients to enter a new kind of data, (e.g. insurance capture)?

No

Impact any of the other [Security Engineering Practices]?

No

@@ -539,32 +536,13 @@ def bulk_select_model_dicts(
connection = connections[db_name]

with connection.cursor() as cursor:
Author

@wchinfeman-cedar wchinfeman-cedar Jun 24, 2024


Do I need a transaction here, now that I am making a temp-table? Or is there an explicit cleanup step?
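
(General Postgres behaviour, for context rather than an answer specific to this PR: a temp table persists for the rest of the session by default, so cleanup is either an explicit DROP or ON COMMIT DROP inside a transaction. A minimal sketch, assuming the default database alias:)

```python
from django.db import connections, transaction

connection = connections["default"]  # hypothetical alias

with transaction.atomic(using="default"):
    with connection.cursor() as cursor:
        # ON COMMIT DROP only makes sense inside a transaction block:
        # the table disappears automatically when the transaction commits.
        cursor.execute(
            "CREATE TEMP TABLE incoming_ids (id bigint) ON COMMIT DROP"
        )
        # ... stage IDs and join, as in the summary sketch ...

# Without a transaction, an explicit DROP TABLE incoming_ids (or the end
# of the database session) is what cleans the temp table up.
```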

@wchinfeman-cedar wchinfeman-cedar marked this pull request as ready for review June 25, 2024 14:40
@wchinfeman-cedar
Author

Closing! I tested it, and you can put 1 million IDs into your where in(...) clause and it will still work.
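
(A rough sketch of the kind of check being described, with a hypothetical model name; Django expands id__in into a single WHERE ... IN (...) clause, which is what the comment above reports Postgres handling fine:)

```python
# Hypothetical reproduction of the check described above.
from myapp.models import Record  # made-up model for illustration

ids = list(range(1_000_000))
# Renders as a single WHERE id IN (0, 1, 2, ...) clause with a million values.
rows = list(Record.objects.filter(id__in=ids).values_list("id", flat=True))
print(len(rows))
```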
