SQL server: consistent snapshot and lsn selection #32979

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 8 commits into main
Conversation

@martykulma (Contributor) commented on Jul 10, 2025

The current mechanism for snapshot and LSN selection is not correct due to the asynchronous nature of MS SQL CDC and the implementation details of SNAPSHOT isolation.

This implementation is similar to the MySQL implementation in that it locks the tables for a brief period to establish a point in time at which a consistent view of the DB can be observed. Once the LSN and transaction sequence number (XSN) are captured, the table locks are dropped, leaving us with a consistent version of the table(s) to read from as well as a correct LSN from which to process CDC events.
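The ordering described above is the crux of the fix: the LSN and the snapshot transaction's XSN must both be established while writes are quiesced. A minimal sketch of that sequencing, using a mock client so it can run standalone (every type and method name here is a hypothetical stand-in for illustration, not the connector's actual API):

```rust
// Mock stand-in for the upstream SQL Server client. All names are
// hypothetical; the point is only the order of operations.
struct MockDb {
    log: Vec<String>,
}

impl MockDb {
    fn lock_table_exclusive(&mut self, table: &str) {
        self.log.push(format!("lock {table}"));
    }
    fn capture_max_lsn(&mut self) -> u64 {
        self.log.push("capture lsn".to_string());
        42 // pretend LSN
    }
    fn begin_snapshot_txn(&mut self) {
        self.log.push("begin snapshot".to_string());
    }
    fn release_locks(&mut self) {
        self.log.push("release locks".to_string());
    }
}

/// Fence: lock every table, then capture the LSN and start the snapshot
/// transaction while writes are quiesced, and only then drop the locks.
/// The snapshot keeps seeing the locked-in version of the tables, and
/// CDC replay starts from the captured LSN.
fn fence(db: &mut MockDb, tables: &[&str]) -> u64 {
    for table in tables {
        db.lock_table_exclusive(table);
    }
    let lsn = db.capture_max_lsn();
    db.begin_snapshot_txn();
    db.release_locks();
    lsn
}
```

If the locks were released before the LSN/XSN capture, a concurrent write could land between the snapshot point and the CDC resume point, which is exactly the race this PR closes.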

Motivation

Fixes https://github.com/MaterializeInc/database-issues/issues/9476

Tips for reviewer

You can look at the commits to see the history, but the actual fix is not until the end.

I didn't squash my commits because the bug this fixes was filed in response to this PR (at the time, a draft containing only a test change to demonstrate the race condition). It will get squashed on merge!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ptravers ptravers self-requested a review July 10, 2025 21:59
@martykulma martykulma force-pushed the sql-server-resume-lsn branch 3 times, most recently from 8ca2ca9 to ba8b692 Compare July 18, 2025 23:46
@martykulma martykulma changed the title delete column, drop source SQL server: consistent snapshot and lsn selection Jul 22, 2025
@martykulma martykulma force-pushed the sql-server-resume-lsn branch 3 times, most recently from a4eb8f0 to e502e99 Compare July 23, 2025 02:32
@martykulma martykulma force-pushed the sql-server-resume-lsn branch from e502e99 to 831bb01 Compare July 23, 2025 11:47
@martykulma martykulma marked this pull request as ready for review July 23, 2025 15:04
@martykulma martykulma requested a review from a team as a code owner July 23, 2025 15:04
@@ -140,13 +181,57 @@ impl<'a> CdcStream<'a> {
// the upstream DB is ready for CDC.
self.wait_for_ready().await?;

tracing::info!("Upstream is ready");
Contributor: nit: might want to add some additional context here, and also, are we sure we want this at info level?

Contributor Author: added a comment

Contributor: ah I see! would we ever see these concurrently on the same node? would it be clear which database is ready?

Contributor Author: 🤦 - just realized i didn't include ?tables

tracing::trace!(%schema, %table, "locking table");
fence_txn.lock_table_exclusive(&*schema, &*table).await?;
}
tracing::info!(?tables, "Locked tables");
Contributor: nit: same comment as above.

Contributor Author: ditto!

@martykulma martykulma requested a review from ptravers July 23, 2025 16:10
let mut fencing_client = self.client.new_connection().await?;
let mut fence_txn = fencing_client.transaction().await?;

// TODO (maz): we should consider a timeout or a lock + snapshot per-table instead of collectively
Contributor @ptravers (Jul 23, 2025): what telemetry do we need to be able to make a decision about the necessity of locking per table? I would guess we need to know whether the table would have been locked for less time that way?

Contributor @ptravers (Jul 23, 2025): follow-up: maybe we can batch with a configurable number of tables per batch? we might want to track, with a histogram/metric, roughly how long we are locking the tables.

It seems like this only becomes a problem if there's a very large number of tables or a very slow lock acquisition.
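The batching idea floated above could be sketched as follows. This is purely illustrative (the function name, the batch-size knob, and the per-batch duration bookkeeping are all hypothetical, not anything in this PR); it shows how chunking the table list bounds how many tables each exclusive-lock window covers, and how a per-batch lock-hold duration could feed the suggested histogram:

```rust
use std::time::{Duration, Instant};

// Illustrative only: fence tables in batches of `batch_size`, recording
// how long each batch's "lock window" lasted. In a real implementation
// the body of the loop would lock the batch, capture the LSN/XSN, and
// release the locks; here we just record the batch contents and timing.
fn fence_in_batches(tables: &[&str], batch_size: usize) -> Vec<(Vec<String>, Duration)> {
    let mut out = Vec::new();
    for batch in tables.chunks(batch_size.max(1)) {
        let start = Instant::now();
        // ... lock each table in `batch`, capture LSN/XSN, release ...
        let names: Vec<String> = batch.iter().map(|t| t.to_string()).collect();
        // `start.elapsed()` is the value one would feed into a
        // lock-hold-duration histogram.
        out.push((names, start.elapsed()));
    }
    out
}
```

The trade-off, as noted in the thread: smaller batches shorten each lock window but mean the per-batch snapshots are taken at different points in time, so each batch would need its own LSN/XSN pair.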
