feat(api): implement upsert() using MERGE INTO
#11624
base: main
Conversation
Force-pushed from 5a27b34 to 21994eb
@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them
Force-pushed from 46b0e4a to 8e47b79
```python
query = query.sql(self.dialect)

if "MERGE" in query:
    query = f"{query};"
```
I'm honestly not sure how to do this better; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.
This isn't ideal, but it's fine with me. The "cleaner" way would be to override the `_build_upsert_from_table` method, but the amount of boilerplate for that doesn't feel worth it.
Or actually, is it not possible to just always stick a `;` on the end of `_build_upsert_from_table()` for every backend? Or does that break on some backends?
> Or actually, is it not possible to just always stick a `;` on the end of `_build_upsert_from_table()` for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If `_build_upsert_from_table()` were handling the conversion to SQL, that would have been simple enough.
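For reference, here is a minimal sketch of the workaround being discussed. The PR builds the expression with `sqlglot.expressions.merge()`; here a hypothetical MERGE statement (with made-up table and column names) is parsed directly just to get an expression to render:

```python
import sqlglot

# Hypothetical MERGE statement, parsed into a sqlglot expression.
expr = sqlglot.parse_one(
    """
    MERGE INTO target AS t
    USING source AS s
    ON t.i = s.i
    WHEN MATCHED THEN UPDATE SET t.f = s.f
    WHEN NOT MATCHED THEN INSERT (i, f) VALUES (s.i, s.f)
    """,
    read="tsql",
)

# The generator emits no statement terminator, so the semicolon that MSSQL
# requires for MERGE has to be appended after rendering the expression to SQL.
query = expr.sql(dialect="tsql")
if "MERGE" in query:
    query = f"{query};"
```

The semicolon lives on the rendered string, not on the expression itself, which is why the hack sits after the call to `.sql()`.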
Ah, I understand.
huh, is this a bug in the upstream mssql engine? Can you confirm that only merge statements require trailing semicolons, but other sql queries/statements do not?
Yes, semicolons are only enforced for certain statements, including MERGE.
There's also a link in that answer to the official documentation stating that, at some point in the future, they will require terminators for every statement, but that clearly hasn't happened in the nearly 10 years since that answer was given.
Force-pushed from 2e1403d to b40103e
Force-pushed from 25061ad to d052b43
Force-pushed from d2b5922 to 8b0d6fe
@cpcloud I've temporarily cherry-picked the changes from #11636 and the updates to …. With that, all of the functionality to implement …
Force-pushed from adc70e7 to ffc8125
@cpcloud FYI I've done this and everything is passing! Should be ready to go.
Force-pushed from 9224992 to a1c5406
@NickCrews Addressed all your comments (except I didn't add an xfail test, but replied to your suggestion). Please feel free to take another look!
```python
compiler = self.compiler
quoted = compiler.quoted

columns = self._get_columns_to_insert(
```
Assuming an existing table with columns {i: int64, s: string, f: float64}, can you add tests for upserting objects (using condition i=i) with the following schemas? (A sketch of one such case follows the list.)
- {i: int64, s: string, f: float64} (works)
- {s: !string, f: float32, i: uint8} (different order and flavors, but dtypes still compatible, works)
- {i: int64} (success, but nothing is updated)
- {i: int64, s: string} (only s is updated)
- {s: string} (error, i not present)
- {i: int64, b: boolean} (error, b is not in dest table)
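A minimal sketch of what the second case above could look like, assuming an upsert signature along the lines of `con.upsert(table_name, obj, on=...)` (the exact parameter names and backend support may differ from what this PR ends up with):

```python
import ibis

# Hypothetical target table with schema {i: int64, s: string, f: float64};
# assumes a backend with MERGE support (e.g. DuckDB >= 1.4.0).
con = ibis.duckdb.connect()
con.create_table(
    "target",
    schema=ibis.schema({"i": "int64", "s": "string", "f": "float64"}),
)

# Source with the same columns in a different order; matching should happen by
# name, so the upsert on i = i is expected to succeed.
source = ibis.memtable({"s": ["a"], "f": [1.0], "i": [1]})
con.upsert("target", source, on="i")  # "on" is an assumed parameter name
```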
Then, depending on how the "should we fallback to positional ordering" decision goes, we should add tests for that too. But I think all the above tests should still be valid regardless of what we decide there.
@NickCrews I've implemented all of the above, with the following differences:
- {i: int64}: ~~(success, but nothing is updated)~~ (error, no cols to insert)
- {i: int64, b: boolean}: (error, ~~b is not in dest table~~ cols to insert includes columns not in table)

The latter relates to your questions around the columns-to-insert behavior, I think. The former is somewhat related; it seems like it could be nice to support the no-update case, but I also don't know how important it is.
Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.
```diff
  target_cols = self.get_schema(target, catalog=catalog, database=db).keys()

- columns = (
+ return (
```
Oof, I see what you mean about how you're inheriting this logic from insert. I think it is essential that they use the same logic.
But I think this is a footgun waiting to happen. I think we should make a breaking change to `.insert()` and require that the source columns be a subset of the target columns. What do you think of this change @cpcloud?
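For concreteness, here is a rough, standalone sketch of the stricter behavior being proposed, not the actual `_get_columns_to_insert` implementation: match columns by name and raise when the source has columns the target lacks, instead of falling back to positional ordering.

```python
def columns_to_insert(source_cols: list[str], target_cols: list[str]) -> list[str]:
    """Return the source columns to insert, requiring them to be a subset of the target."""
    target = set(target_cols)
    extra = [c for c in source_cols if c not in target]
    if extra:
        raise ValueError(f"Source columns not present in the target table: {extra}")
    # Preserve the source order; matching happens by name, never by position.
    return list(source_cols)
```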
I don't think there is anywhere else in ibis where we rely on positional ordering of columns, is there? I think we should keep that in mind: if we make this change here, then I think we should also wipe out all other instances of relying on positional ordering.
Asking Copilot *Is there anywhere in this codebase where we rely on the position of columns in tables/relations? Or do we always match by column name?* gives:

Looking at the codebase, Ibis primarily relies on column names rather than positions, but there are some specific cases where positional ordering is used:

Column Name-Based Operations (Majority)

Most operations use column names:

- `Table.join()` matches columns by name (e.g., `"movieId"`)
- `Table.select()` references columns by name
- `Schema.name_at_position()` exists but is used to look up names by position, then uses names
- `Field` operations match by column name

Position-Based Operations (Specific Cases)

However, there are cases where position matters:

- Positional table access: `Table.__getitem__` supports `t[0]` for the first column, `t[1]` for the second, etc.
- Positional joins: There's a `"positional"` join kind mentioned in `JoinKind` that joins tables by row position rather than column values. See `test_positional_join`.
- Schema comparison: `Schema.equals()` explicitly states that "The order of fields in the schema is taken into account when computing equality."
- Column insertion during `insert()`: In `SQLBackend._build_insert_from_table`, columns are matched by position when source columns are not a subset of target columns.
- Info operations: `Table.info()` includes a `pos` field tracking column position.
So the answer is: primarily name-based, but position is significant for schema equality, positional joins, and some insertion scenarios.
I'm going to put this aside for now, since I think at the very least such a change should come in a commit separate from the Backend.upsert() implementation.
Force-pushed from 5f12ee5 to 10be51c
@NickCrews No worries, thanks for the revised review! I believe I've largely addressed your updated feedback.
Force-pushed from a2458ce to 613dce6
Description of changes
Implement `Backend.upsert()` using `sqlglot.expressions.merge()` under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with a most basic implementation, including only supporting one join column. I think this could be expanded to support a list without much effort.
`MERGE INTO` support is limited. DuckDB only added support for `MERGE` statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:

- … (currently using a hack to work around "AS" getting added to the MERGE statement)
- … (currently requires appending `;` onto the end of every statement 😅)

Should work, need help to test:

- …

Backends that don't work:

- … ("`MERGE` statement is only supported for Iceberg tables.")
- … ("`MERGE INTO` is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")

Issues closed
- `Backend.upsert` #5391