feat(api): implement upsert() using MERGE INTO
#11624
base: main
Conversation
Force-pushed from 5a27b34 to 21994eb
@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them
Force-pushed from 46b0e4a to 8e47b79
```python
query = query.sql(self.dialect)

if "MERGE" in query:
    query = f"{query};"
```
I'm honestly not sure how to do this better; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.
This isn't ideal, but it's fine with me. The "cleaner" way would be to override the `_build_upsert_from_table` method, but the amount of boilerplate for that doesn't feel worth it.
Or actually, is it not possible to just always stick a `;` on the end of `_build_upsert_from_table()` for every backend? Or does that break on some backends?
> Or actually, is it not possible to just always stick a `;` on the end of `_build_upsert_from_table()` for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If `_build_upsert_from_table()` were handling the conversion to SQL, that would have been simple enough.
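For reference, here is a minimal sketch of the workaround being discussed. The PR builds the expression with `sqlglot.expressions.merge()`; here a hypothetical MERGE statement (with made-up table and column names) is parsed directly just to get an expression to render:

```python
import sqlglot

# Hypothetical MERGE statement, parsed into a sqlglot expression.
expr = sqlglot.parse_one(
    """
    MERGE INTO target AS t
    USING source AS s
    ON t.i = s.i
    WHEN MATCHED THEN UPDATE SET t.f = s.f
    WHEN NOT MATCHED THEN INSERT (i, f) VALUES (s.i, s.f)
    """,
    read="tsql",
)

# The generator emits no statement terminator, so the semicolon that MSSQL
# requires for MERGE has to be appended after rendering the expression to SQL.
query = expr.sql(dialect="tsql")
if "MERGE" in query:
    query = f"{query};"
```

The semicolon lives on the rendered string, not on the expression itself, which is why the hack sits after the call to `.sql()`.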
Ah, I understand.
huh, is this a bug in the upstream mssql engine? Can you confirm that only merge statements require trailing semicolons, but other sql queries/statements do not?
Yes, semicolons are only enforced for certain statements, including MERGE.
There's also a link in that answer to the official documentation stating that, at some point in the future, they will require terminators for every statement, but that clearly hasn't happened in the nearly 10 years since that answer was given.
Force-pushed from 2e1403d to b40103e
Force-pushed from 25061ad to d052b43
Force-pushed from d2b5922 to 8b0d6fe
@cpcloud I've temporarily cherry-picked the changes from #11636 and the updates to …. With that, all of the functionality to implement …
Force-pushed from adc70e7 to ffc8125
@cpcloud FYI I've done this and everything is passing! Should be ready to go.
Force-pushed from 9224992 to a1c5406
@NickCrews Addressed all your comments (except I didn't add an xfail test, but replied to your suggestion). Please feel free to take another look!
```python
compiler = self.compiler
quoted = compiler.quoted

columns = self._get_columns_to_insert(
```
Assuming an existing table with columns {i: int64, s: string, f: float64}, can you add tests for upserting objects (using condition i=i) with the following schemas? (A sketch of one such case follows the list.)
- {i: int64, s: string, f: float64} (works)
- {s: !string, f: float32, i: uint8} (different order and flavors, but dtypes still compatible, works)
- {i: int64} (success, but nothing is updated)
- {i: int64, s: string} (only s is updated)
- {s: string} (error, i not present)
- {i: int64, b: boolean} (error, b is not in dest table)
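A minimal sketch of what the second case above could look like, assuming an upsert signature along the lines of `con.upsert(table_name, obj, on=...)` (the exact parameter names and backend support may differ from what this PR ends up with):

```python
import ibis

# Hypothetical target table with schema {i: int64, s: string, f: float64};
# assumes a backend with MERGE support (e.g. DuckDB >= 1.4.0).
con = ibis.duckdb.connect()
con.create_table(
    "target",
    schema=ibis.schema({"i": "int64", "s": "string", "f": "float64"}),
)

# Source with the same columns in a different order; matching should happen by
# name, so the upsert on i = i is expected to succeed.
source = ibis.memtable({"s": ["a"], "f": [1.0], "i": [1]})
con.upsert("target", source, on="i")  # "on" is an assumed parameter name
```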
Then, depending on how the "should we fallback to positional ordering" decision goes, we should add tests for that too. But I think all the above tests should still be valid regardless of what we decide there.
@NickCrews I've implemented all of the above, with the following differences:
- {i: int64}: ~~(success, but nothing is updated)~~ (error, no cols to insert)
- {i: int64, b: boolean}: (error, ~~b is not in dest table~~ cols to insert includes columns not in table)

The latter relates to your questions around the columns-to-insert behavior, I think. The former is somewhat related; it seems like it could be nice to support the no-update case, but I also don't know how important it is.
Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.
```diff
  target_cols = self.get_schema(target, catalog=catalog, database=db).keys()

- columns = (
+ return (
```
Oof, I see what you mean about how you're inheriting this logic from insert. I think it is essential that they use the same logic.
But I think this is a footgun waiting to happen. I think we should make a breaking change to `.insert()` and require that the source columns be a subset of the target columns. What do you think of this change @cpcloud?
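For concreteness, here is a rough, standalone sketch of the stricter behavior being proposed, not the actual `_get_columns_to_insert` implementation: match columns by name and raise when the source has columns the target lacks, instead of falling back to positional ordering.

```python
def columns_to_insert(source_cols: list[str], target_cols: list[str]) -> list[str]:
    """Return the source columns to insert, requiring them to be a subset of the target."""
    target = set(target_cols)
    extra = [c for c in source_cols if c not in target]
    if extra:
        raise ValueError(f"Source columns not present in the target table: {extra}")
    # Preserve the source order; matching happens by name, never by position.
    return list(source_cols)
```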
I don't think there is anywhere else in ibis where we rely on positional ordering of columns, is there? I think we should keep that in mind: if we make this change here, then I think we should also wipe out all other instances of relying on positional ordering.
Asking Copilot *Is there anywhere in this codebase where we rely on the position of columns in tables/relations? Or do we always match by column name?* gives:

Looking at the codebase, Ibis primarily relies on column names rather than positions, but there are some specific cases where positional ordering is used:

Column Name-Based Operations (Majority)

Most operations use column names:

- `Table.join()` matches columns by name (e.g., `"movieId"`)
- `Table.select()` references columns by name
- `Schema.name_at_position()` exists but is used to look up names by position, then uses names
- `Field` operations match by column name

Position-Based Operations (Specific Cases)

However, there are cases where position matters:

- Positional table access: `Table.__getitem__` supports `t[0]` for the first column, `t[1]` for the second, etc.
- Positional joins: There's a `"positional"` join kind mentioned in `JoinKind` that joins tables by row position rather than column values. See `test_positional_join`.
- Schema comparison: `Schema.equals()` explicitly states that "The order of fields in the schema is taken into account when computing equality."
- Column insertion during `insert()`: In `SQLBackend._build_insert_from_table`, columns are matched by position when source columns are not a subset of target columns.
- Info operations: `Table.info()` includes a `pos` field tracking column position.
So the answer is: primarily name-based, but position is significant for schema equality, positional joins, and some insertion scenarios.
I'm going to put this aside for now, since I think at the very least such a change should come in a commit separate from the Backend.upsert() implementation.
Force-pushed from 5f12ee5 to 10be51c
@NickCrews No worries, thanks for the revised review! I believe I've largely addressed your updated feedback.
Force-pushed from a2458ce to 613dce6
Description of changes
Implement `Backend.upsert()` using `sqlglot.expressions.merge()` under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with a most basic implementation, including only supporting one join column. I think this could be expanded to support a list without much effort.
`MERGE INTO` support is limited. DuckDB only added support for `MERGE` statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:

- … (currently using a hack to work around "AS" getting added to the MERGE statement)
- … (currently requires appending `;` onto the end of every statement 😅)

Should work, need help to test:

- …

Backends that don't work:

- … ("`MERGE` statement is only supported for Iceberg tables.")
- … ("`MERGE INTO` is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")

Issues closed
- `Backend.upsert` #5391