
Conversation

corylanou
Collaborator

Summary

This PR fixes issue #752 - a critical data loss bug where database restoration fails with 'nonsequential page numbers' after a checkpoint occurs while Litestream is offline.

Problem

When Litestream is killed (crash, OOM, server restart) and SQLite performs a checkpoint while it's down, attempting to restore from the replica fails completely with:

decode database: decode page 17743: copy page 17750 header: 
nonsequential page numbers in snapshot transaction: 17742,17750

Root Cause

When Litestream resumes and detects that a checkpoint occurred while it was down:

  1. The verify() function detects WAL changes and triggers a full snapshot
  2. writeLTXFromDB() creates the snapshot using existing LTX transaction IDs
  3. This creates a mix of old LTX files and a new snapshot with conflicting transaction IDs
  4. During restore, the decoder expects sequential pages and fails on the gaps

Solution

Reset the local LTX state when we detect a snapshot is required after a checkpoint:

  • Remove all local LTX files
  • Clear cached metadata
  • Start fresh with transaction ID 1

This ensures clean state management and prevents conflicts during restoration.
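
For illustration, a rough sketch of what that reset step could look like. The directory layout, the `.ltx` extension handling, and the `Pos` type here are assumptions made for this example, not Litestream's actual internals:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Pos stands in for Litestream's cached replication position (illustrative only).
type Pos struct {
	TXID uint64
}

// resetLTXState sketches the reset described above: remove local LTX files and
// clear the cached position so the next snapshot starts at transaction ID 1.
func resetLTXState(ltxDir string, pos *Pos) error {
	entries, err := os.ReadDir(ltxDir)
	if err != nil {
		return fmt.Errorf("read ltx dir: %w", err)
	}
	for _, entry := range entries {
		// Remove only regular *.ltx files; leave anything else untouched.
		if entry.IsDir() || filepath.Ext(entry.Name()) != ".ltx" {
			continue
		}
		if err := os.Remove(filepath.Join(ltxDir, entry.Name())); err != nil {
			return fmt.Errorf("remove %s: %w", entry.Name(), err)
		}
	}
	*pos = Pos{} // clear cached metadata; the next snapshot restarts at TXID 1
	return nil
}
```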

Changes

  • Added resetLTXState() function to clear local LTX files and metadata
  • Reset LTX state when a snapshot is required and a prior position already exists (i.e., a checkpoint was detected while offline)
  • Added comprehensive tests for checkpoint scenarios during downtime
  • Changed info.snapshotting tracking to info.reason for better debugging

Testing

Added two comprehensive tests that reproduce the issue:

  • TestRestoreFailsAfterFullCheckpointWhileDown - Tests FULL checkpoint scenario
  • TestRestoreLosesRowsAfterAutoCheckpointWhileDown - Tests automatic checkpoint scenario

Both tests now pass with this fix.
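
For context, the checkpoint-while-down condition these tests exercise can be reproduced outside the test suite with a plain SQLite connection. The standalone sketch below (using the mattn/go-sqlite3 driver; the file name and table are placeholders) only shows how a FULL checkpoint followed by a write restarts the WAL while the replicator is offline; it does not use Litestream's test harness:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Open the database in WAL mode, as Litestream requires.
	db, err := sql.Open("sqlite3", "test.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write some data while the replicator is (hypothetically) stopped.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO t DEFAULT VALUES`); err != nil {
		log.Fatal(err)
	}

	// Force a FULL checkpoint while the replicator is offline.
	if _, err := db.Exec(`PRAGMA wal_checkpoint(FULL)`); err != nil {
		log.Fatal(err)
	}

	// The next write after a complete checkpoint restarts the WAL with new
	// salts -- the state the original restore stumbled over.
	if _, err := db.Exec(`INSERT INTO t DEFAULT VALUES`); err != nil {
		log.Fatal(err)
	}
}
```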

Impact

This fix ensures databases can be successfully restored even after SQLite performs automatic or manual checkpoints while Litestream is offline, preventing complete data loss in production scenarios.

Fixes #752

🤖 Generated with Claude Code

@benbjohnson
Owner

We can’t reset TXID back to 1 as that will conflict with data already on the replica. We also can’t delete data on the replica (besides for retention reasons) because then we lose backups.

@corylanou
Collaborator Author

Closing this PR as the issue has been resolved by PR #761 instead.

Why this approach didn't work

As Ben correctly pointed out in the comments, resetting TXID back to 1 would conflict with data already on the replica. We also can't delete data from the replica (besides retention) because we'd lose backups.

The solution that worked (PR #761)

PR #761 takes a different approach:

  • Detects FULL/RESTART checkpoints by scanning for unexpected WAL salts
  • Triggers a snapshot at the next sequential TXID (not resetting to 1)
  • Maintains continuity with existing replica data
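
The salt check itself relies on the documented SQLite WAL header layout: salt-1 lives at byte offset 16 and salt-2 at offset 20 (big-endian) in the 32-byte header. A rough sketch of that detection follows; the function names and the stored "previous" salts are placeholders, not the code actually merged in #761:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// walSalts reads salt-1 and salt-2 from the 32-byte SQLite WAL header
// (byte offsets 16-19 and 20-23, big-endian).
func walSalts(walPath string) (salt1, salt2 uint32, err error) {
	f, err := os.Open(walPath)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	hdr := make([]byte, 32)
	if _, err := io.ReadFull(f, hdr); err != nil {
		return 0, 0, fmt.Errorf("read wal header: %w", err)
	}
	return binary.BigEndian.Uint32(hdr[16:20]), binary.BigEndian.Uint32(hdr[20:24]), nil
}

func main() {
	// prevSalt1/prevSalt2 stand in for the salts recorded when the WAL was
	// last seen; the values here are placeholders for the example.
	var prevSalt1, prevSalt2 uint32 = 0xDEADBEEF, 0xCAFEBABE

	salt1, salt2, err := walSalts("test.db-wal")
	if err != nil {
		fmt.Println("read salts:", err)
		return
	}
	if salt1 != prevSalt1 || salt2 != prevSalt2 {
		// Salts changed: the WAL was restarted by a checkpoint while we were
		// offline, so a snapshot is needed at the next sequential TXID.
		fmt.Println("checkpoint detected while offline; snapshot required")
	}
}
```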

Test confirmation

The exact reproduction case from #752 now passes successfully with PR #761 merged.

Thanks for the review and guidance @benbjohnson - the salt detection approach was the right solution!

@corylanou closed this on Oct 14, 2025
