Fix: Reset LTX state after checkpoint during downtime #759
+247
−2
Summary
This PR fixes issue #752 - a critical data loss bug where database restoration fails with 'nonsequential page numbers' after a checkpoint occurs while Litestream is offline.
Problem
When Litestream is killed (crash, OOM, server restart) and SQLite performs a checkpoint while it's down, attempting to restore from the replica fails completely with the 'nonsequential page numbers' error.
Root Cause
When Litestream resumes after detecting a checkpoint occurred during downtime:
- The verify() function detects WAL changes and triggers a full snapshot
- writeLTXFromDB() creates the snapshot using existing LTX transaction IDs
Solution
Reset the local LTX state when we detect that a snapshot is required after a checkpoint.
This ensures clean state management and prevents conflicts during restoration.
Changes
- Added resetLTXState() function to clear local LTX files and metadata
- Changed info.snapshotting tracking to info.reason for better debugging
Testing
Added two comprehensive tests that reproduce the issue:
- TestRestoreFailsAfterFullCheckpointWhileDown - Tests FULL checkpoint scenario
- TestRestoreLosesRowsAfterAutoCheckpointWhileDown - Tests automatic checkpoint scenario
Both tests now pass with this fix.
Impact
This fix ensures databases can be successfully restored even after SQLite performs automatic or manual checkpoints while Litestream is offline, preventing complete data loss in production scenarios.
Fixes #752
🤖 Generated with Claude Code