Description
I am reaching out regarding a critical architectural limitation I’ve discovered while attempting to bootstrap a Sui mainnet full node using snapshots. After 48+ hours of intensive work and successfully downloading over 2.2TB of data, I’ve hit an insurmountable blocker that appears to be a fundamental issue with Sui’s snapshot restoration process.
Executive Summary
The Problem: Sui nodes cannot be bootstrapped from snapshots without proper genesis initialization. The node gets stuck at epoch 0 while the snapshot data is from epoch 889+, causing fatal “wrong epoch” errors and peer rejection.
Technical Details
Environment
Sui Version: 1.32.1-homebrew
Server: Hetzner AX162-R (48 cores, 192GB RAM, 2x1.92TB NVMe)
Network: Mainnet
Target Epoch: 889
Downloaded Data
- Formal Snapshot (Complete): 43.5GB - Epoch 889
  - Path: /opt/sui-data/mainnet-db-formal/snapshot/epoch_889/
  - Contains: 1,228 .ref files (1_1.ref through 1_1228.ref)
  - Status: Successfully downloaded and verified
- Checkpoint Snapshot (Complete): 2.2TB
  - Path: /opt/sui-data/mainnet-db-checkpoint-snapshot-backup/
  - Contains: Full transaction/checkpoint history
  - Status: Complete but unusable due to epoch mismatch
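For anyone reproducing this, the "1_1.ref through 1_1228.ref" claim can be checked with a short script. This is a minimal sketch, assuming the files follow the `<bucket>_<part>.ref` naming seen above; the function names `missing_ref_parts` and `check_snapshot_dir` are hypothetical helpers, not part of sui-tool. It only checks that every expected file name is present, not that the contents are intact.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the file names listed above: <bucket>_<part>.ref
REF_NAME = re.compile(r"^(\d+)_(\d+)\.ref$")


def missing_ref_parts(names, bucket=1, expected_last=1228):
    """Return the part numbers in 1..expected_last that have no <bucket>_<part>.ref name."""
    present = set()
    for name in names:
        match = REF_NAME.match(name)
        if match and int(match.group(1)) == bucket:
            present.add(int(match.group(2)))
    return [part for part in range(1, expected_last + 1) if part not in present]


def check_snapshot_dir(path, bucket=1, expected_last=1228):
    """List the regular files in `path` and report which expected parts are missing."""
    names = [entry.name for entry in Path(path).iterdir() if entry.is_file()]
    return missing_ref_parts(names, bucket, expected_last)
```

Calling `check_snapshot_dir("/opt/sui-data/mainnet-db-formal/snapshot/epoch_889/")` and getting back an empty list would mean every part from 1 to 1228 is present by name.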
The Fatal Error
When attempting to start the node with restored data:
ERROR: We should never enqueue certificate from wrong epoch.
Expected=0 Certificate=890
Root Cause Analysis
After extensive investigation, I’ve identified the core issue:
Database Initialization Flow:
Snapshot Problem:
Snapshots contain data (SST files) but NOT database metadata
The epoch configuration is stored in RocksDB metadata tables
Without proper initialization, the node defaults to epoch 0
The node receives epoch 889/890 data but expects epoch 0
All peer connections are rejected due to epoch mismatch
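The failure mode described above can be illustrated with a toy model. This is not Sui's code: `ToyNode`, `WrongEpochError`, and `enqueue_certificate` are hypothetical names, and the only behavior modeled is what the log line shows, namely that a certificate is rejected when its epoch differs from the epoch the node believes it is in.

```python
from dataclasses import dataclass


class WrongEpochError(Exception):
    """Raised when a certificate's epoch differs from the node's current epoch."""


@dataclass
class ToyNode:
    # Assumption from the report: without epoch metadata the node starts at epoch 0.
    current_epoch: int = 0

    def enqueue_certificate(self, certificate_epoch: int) -> None:
        # Reject anything that does not belong to the epoch the node thinks it is in.
        if certificate_epoch != self.current_epoch:
            raise WrongEpochError(
                f"Expected={self.current_epoch} Certificate={certificate_epoch}"
            )


node = ToyNode()  # restored data files, but no epoch metadata
try:
    node.enqueue_certificate(890)
except WrongEpochError as err:
    print(f"rejected: {err}")
```

In this model the restored data is irrelevant to the outcome: only a node whose `current_epoch` already matches the incoming certificates would accept them, which is why copying files alone does not help.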
What I’ve Tried (All Failed)
- sui-tool restore-db:
```
sui-tool restore-db --config-path fullnode.yaml \
  --db-checkpoint-path /path/to/snapshot
```
  Result: Only copies files, doesn't initialize epoch metadata
- Manual Restoration Attempts:
  - Copying formal snapshot to epoch_889 directory
  - Combining formal + checkpoint snapshots
  - Using --run-with-range-epoch flag
  Result: Node still reads epoch 0 from metadata
- Direct Database Manipulation:
  - Attempted to manually create proper database structure
  - Tried to inject epoch metadata from .ref files
  Result: Cannot bypass the initialization requirement
Technical Findings
- Snapshot Types Mismatch:
  - Formal snapshots: Data export without metadata structure
  - Checkpoint snapshots: Transaction history but wrong epoch in metadata
  - Neither provides complete database initialization
- Missing Components:
  - EpochStartConfiguration object
  - Proper typed_store entries
  - Committee information for epoch
  - Database MANIFEST/CURRENT files with correct epoch
- Data Verification:
  - Found 154 instances of epoch 889 pattern (0x79 0x03) in .ref files
  - Data is present but cannot be utilized without proper metadata
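The byte-pattern check can be reproduced roughly as follows. This is a sketch of one way to run such a scan, not necessarily the exact method used here: 889 is 0x0379, so its two low-order bytes in little-endian order are 0x79 0x03. The helper names are hypothetical. The `width=8` option assumes the epoch is stored as a little-endian 64-bit integer, which is an assumption about the on-disk encoding; it gives a stricter match than the two-byte pattern, which can also occur by coincidence in unrelated data.

```python
from pathlib import Path


def epoch_pattern(epoch: int, width: int = 2) -> bytes:
    """Little-endian encoding of `epoch` in `width` bytes (889, width 2 -> 0x79 0x03)."""
    return epoch.to_bytes(width, "little")


def count_pattern(data: bytes, pattern: bytes) -> int:
    """Count non-overlapping occurrences of `pattern` in `data`."""
    return data.count(pattern)


def count_in_dir(path, epoch: int = 889, width: int = 2) -> int:
    """Sum the pattern count over every .ref file in `path` (each file is read whole)."""
    pattern = epoch_pattern(epoch, width)
    total = 0
    for ref_file in sorted(Path(path).glob("*.ref")):
        total += count_pattern(ref_file.read_bytes(), pattern)
    return total
```

Running `count_in_dir` with `width=8` on the formal snapshot directory would show how many of the matches survive the stricter pattern.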
What I Need from Mysten Labs
Option 1: Proper Database Snapshot
Can you provide or point me to:
- A complete RocksDB database backup (not just snapshot) from epoch 889
- Must include all metadata tables and MANIFEST files
- Should preserve the exact typed_store structure
Option 2: Initialization Tools
Are there undocumented tools or methods to:
- Initialize a database directly at a specific epoch?
- Convert snapshots to proper database format?
- Skip the genesis → epoch 0 initialization requirement?
Option 3: Technical Guidance
Can you provide:
- The exact RocksDB column family where epoch metadata is stored?
- The BCS serialization format for EpochStartConfiguration?
- A method to construct proper epoch metadata from snapshot data?
- Any debug flags or environment variables that might help?
Option 4: Alternative Bootstrap Method
Is there any way to:
- Bootstrap from a specific epoch without full sync?
- Use checkpoint data to initialize at epoch 889?
- Create a minimal chain history that satisfies initialization?
Impact and Urgency
This issue is blocking production deployment of a Sui full node for:
- MEV bot operations
- Network participation
- Data availability requirements
The alternative (syncing from genesis) would require:
- 2-4 weeks of sync time
- 7+ TB of data transfer
- Significant bandwidth costs
- Extended downtime
Specific Questions
- Is this a known limitation of the snapshot system?
- Are there plans to provide complete database backups instead of just snapshots?
- Can sui-tool be enhanced to properly initialize epoch metadata?
- Is there a recommended way to get a node running at current epoch without full sync?
- Would Mysten consider providing a database backup service for node operators?
Proposed Solutions
I believe this could be addressed by:
- Providing complete database backups (with metadata) alongside snapshots
- Adding a --initialize-from-snapshot flag to sui-node
- Documenting the database structure and initialization requirements
- Creating tools to convert snapshots to proper databases
Environment Details for Reproduction
Server: Hetzner AX162-R
OS: Ubuntu 22.04 LTS
Sui Version: 1.32.1-homebrew
Snapshot Source: Google Cloud Storage bucket (requester pays)
Network: Mainnet
Target Epoch: 889
Data Downloaded: 2.25TB total
Thank you for your attention to this critical issue. I believe addressing this limitation would greatly benefit the entire Sui node operator community.