
Conversation

corylanou
Collaborator

Summary

This PR fixes issue #752 - a critical data loss bug where database restoration fails with 'nonsequential page numbers' after a checkpoint occurs while Litestream is offline.

Problem

When Litestream is killed (crash, OOM, server restart) and SQLite performs a checkpoint while it's down, attempting to restore from the replica fails completely with:

decode database: decode page 17743: copy page 17750 header: 
nonsequential page numbers in snapshot transaction: 17742,17750

Root Cause

When Litestream resumes and detects that a checkpoint occurred while it was down:

  1. The verify() function detects WAL changes and triggers a full snapshot
  2. writeLTXFromDB() creates the snapshot using existing LTX transaction IDs
  3. This creates a mix of old LTX files and a new snapshot with conflicting transaction IDs
  4. During restore, the decoder expects sequential pages and fails on the gaps

Solution

Reset the local LTX state when we detect a snapshot is required after a checkpoint:

  • Remove all local LTX files
  • Clear cached metadata
  • Start fresh with transaction ID 1

This ensures clean state management and prevents conflicts during restoration.
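
For illustration, a rough sketch of what that reset step could look like. The directory layout, the `.ltx` extension handling, and the `Pos` type here are assumptions made for this example, not Litestream's actual internals:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Pos stands in for Litestream's cached replication position (illustrative only).
type Pos struct {
	TXID uint64
}

// resetLTXState sketches the reset described above: remove local LTX files and
// clear the cached position so the next snapshot starts at transaction ID 1.
func resetLTXState(ltxDir string, pos *Pos) error {
	entries, err := os.ReadDir(ltxDir)
	if err != nil {
		return fmt.Errorf("read ltx dir: %w", err)
	}
	for _, entry := range entries {
		// Remove only regular *.ltx files; leave anything else untouched.
		if entry.IsDir() || filepath.Ext(entry.Name()) != ".ltx" {
			continue
		}
		if err := os.Remove(filepath.Join(ltxDir, entry.Name())); err != nil {
			return fmt.Errorf("remove %s: %w", entry.Name(), err)
		}
	}
	*pos = Pos{} // clear cached metadata; the next snapshot restarts at TXID 1
	return nil
}
```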

Changes

  • Added resetLTXState() function to clear local LTX files and metadata
  • Reset LTX state when a snapshot is required and a prior position already exists (i.e., a checkpoint was detected while offline)
  • Added comprehensive tests for checkpoint scenarios during downtime
  • Changed info.snapshotting tracking to info.reason for better debugging

Testing

Added two comprehensive tests that reproduce the issue:

  • TestRestoreFailsAfterFullCheckpointWhileDown - Tests FULL checkpoint scenario
  • TestRestoreLosesRowsAfterAutoCheckpointWhileDown - Tests automatic checkpoint scenario

Both tests now pass with this fix.
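
For context, the checkpoint-while-down condition these tests exercise can be reproduced outside the test suite with a plain SQLite connection. The standalone sketch below (using the mattn/go-sqlite3 driver; the file name and table are placeholders) only shows how a FULL checkpoint followed by a write restarts the WAL while the replicator is offline; it does not use Litestream's test harness:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Open the database in WAL mode, as Litestream requires.
	db, err := sql.Open("sqlite3", "test.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write some data while the replicator is (hypothetically) stopped.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO t DEFAULT VALUES`); err != nil {
		log.Fatal(err)
	}

	// Force a FULL checkpoint while the replicator is offline.
	if _, err := db.Exec(`PRAGMA wal_checkpoint(FULL)`); err != nil {
		log.Fatal(err)
	}

	// The next write after a complete checkpoint restarts the WAL with new
	// salts -- the state the original restore stumbled over.
	if _, err := db.Exec(`INSERT INTO t DEFAULT VALUES`); err != nil {
		log.Fatal(err)
	}
}
```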

Impact

This fix ensures databases can be successfully restored even after SQLite performs automatic or manual checkpoints while Litestream is offline, preventing complete data loss in production scenarios.

Fixes #752

🤖 Generated with Claude Code

@benbjohnson
Owner

We can’t reset TXID back to 1 as that will conflict with data already on the replica. We also can’t delete data on the replica (besides for retention reasons) because then we lose backups.

@corylanou
Collaborator Author

Closing this PR as the issue has been resolved by PR #761 instead.

Why this approach didn't work

As Ben correctly pointed out in the comments, resetting TXID back to 1 would conflict with data already on the replica. We also can't delete data from the replica (besides retention) because we'd lose backups.

The solution that worked (PR #761)

PR #761 takes a different approach:

  • Detects FULL/RESTART checkpoints by scanning for unexpected WAL salts
  • Triggers a snapshot at the next sequential TXID (not resetting to 1)
  • Maintains continuity with existing replica data
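
The salt check itself relies on the documented SQLite WAL header layout: salt-1 lives at byte offset 16 and salt-2 at offset 20 (big-endian) in the 32-byte header. A rough sketch of that detection follows; the function names and the stored "previous" salts are placeholders, not the code actually merged in #761:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// walSalts reads salt-1 and salt-2 from the 32-byte SQLite WAL header
// (byte offsets 16-19 and 20-23, big-endian).
func walSalts(walPath string) (salt1, salt2 uint32, err error) {
	f, err := os.Open(walPath)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	hdr := make([]byte, 32)
	if _, err := io.ReadFull(f, hdr); err != nil {
		return 0, 0, fmt.Errorf("read wal header: %w", err)
	}
	return binary.BigEndian.Uint32(hdr[16:20]), binary.BigEndian.Uint32(hdr[20:24]), nil
}

func main() {
	// prevSalt1/prevSalt2 stand in for the salts recorded when the WAL was
	// last seen; the values here are placeholders for the example.
	var prevSalt1, prevSalt2 uint32 = 0xDEADBEEF, 0xCAFEBABE

	salt1, salt2, err := walSalts("test.db-wal")
	if err != nil {
		fmt.Println("read salts:", err)
		return
	}
	if salt1 != prevSalt1 || salt2 != prevSalt2 {
		// Salts changed: the WAL was restarted by a checkpoint while we were
		// offline, so a snapshot is needed at the next sequential TXID.
		fmt.Println("checkpoint detected while offline; snapshot required")
	}
}
```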

Test confirmation

The exact reproduction case from #752 now passes successfully with PR #761 merged.

Thanks for the review and guidance @benbjohnson - the salt detection approach was the right solution!

@corylanou closed this on Oct 14, 2025
