
Critical Issue: Cannot Initialize Sui Mainnet Node from Snapshots - Epoch Metadata Mismatch #23683

Reference: Sui Node v1.32.1 - Mainnet Epoch 889 Initialization Problem

@SachitNayak

Description


I am reaching out regarding a critical architectural limitation I’ve discovered while attempting to bootstrap a Sui mainnet full node using snapshots. After 48+ hours of intensive work and successfully downloading over 2.2TB of data, I’ve hit an insurmountable blocker that appears to be a fundamental issue with Sui’s snapshot restoration process.

Executive Summary
The Problem: Sui nodes cannot be bootstrapped from snapshots without proper genesis initialization. The node gets stuck at epoch 0 while the snapshot data is from epoch 889+, causing fatal “wrong epoch” errors and peer rejection.

Technical Details
Environment
  • Sui Version: 1.32.1-homebrew
  • Server: Hetzner AX162-R (48 cores, 192GB RAM, 2x1.92TB NVMe)
  • Network: Mainnet
  • Target Epoch: 889
Downloaded Data
  • Formal Snapshot (Complete): 43.5GB, Epoch 889
    • Path: /opt/sui-data/mainnet-db-formal/snapshot/epoch_889/
    • Contains: 1,228 .ref files (1_1.ref through 1_1228.ref)
    • Status: Successfully downloaded and verified
  • Checkpoint Snapshot (Complete): 2.2TB
    • Path: /opt/sui-data/mainnet-db-checkpoint-snapshot-backup/
    • Contains: Full transaction/checkpoint history
    • Status: Complete but unusable due to epoch mismatch
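The "Successfully downloaded and verified" status for the formal snapshot can be partially cross-checked with a simple presence test. The sketch below assumes the `1_<n>.ref` naming and the count of 1,228 files listed above; `missing_ref_files` is a hypothetical helper, not part of `sui-tool`. It only confirms that each expected file exists and is non-empty. It does not validate file contents or checksums.

```python
from pathlib import Path


def missing_ref_files(snapshot_dir: str, expected_count: int, prefix: str = "1_") -> list[str]:
    """Return expected `<prefix><n>.ref` names that are absent or empty.

    Hypothetical helper: the `1_1.ref` ... `1_N.ref` naming is taken from the
    directory listing described in this issue, not from Sui documentation.
    """
    root = Path(snapshot_dir)
    missing = []
    for n in range(1, expected_count + 1):
        path = root / f"{prefix}{n}.ref"
        # A file that exists but has zero bytes is reported as missing too.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    gaps = missing_ref_files("/opt/sui-data/mainnet-db-formal/snapshot/epoch_889/", 1228)
    print(f"{len(gaps)} missing or empty .ref files")
    for name in gaps[:20]:
        print("  ", name)
```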
The Fatal Error
When attempting to start the node with restored data:

ERROR: We should never enqueue certificate from wrong epoch.
Expected=0 Certificate=890
Root Cause Analysis
After extensive investigation, I’ve identified the core issue:

Database Initialization Flow:
Snapshot Problem:
  • Snapshots contain data (SST files) but NOT database metadata
  • The epoch configuration is stored in RocksDB metadata tables
  • Without proper initialization, the node defaults to epoch 0
  • The node receives epoch 889/890 data but expects epoch 0
  • All peer connections are rejected due to the epoch mismatch
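The root-cause hypothesis above can be illustrated with a toy model. This is not Sui's code: `ToyNode`, `enqueue_certificate`, and `EpochMismatch` are hypothetical names, and the real node's check (which produced the error quoted earlier) may be structured differently. The model only shows why a node whose stored epoch falls back to 0 would refuse every certificate from epoch 890.

```python
class EpochMismatch(Exception):
    """Raised when a certificate's epoch differs from the node's own epoch."""


class ToyNode:
    """Toy model, NOT Sui's implementation: a node that only accepts
    certificates from the epoch recorded in its own metadata."""

    def __init__(self, stored_epoch: int) -> None:
        # The epoch the node believes it is in, as read from local metadata.
        # In this model, a database without epoch metadata falls back to 0.
        self.current_epoch = stored_epoch
        self.queue: list[int] = []

    def enqueue_certificate(self, cert_epoch: int) -> None:
        # Reject anything not from the node's current epoch.
        if cert_epoch != self.current_epoch:
            raise EpochMismatch(
                f"wrong epoch. Expected={self.current_epoch} Certificate={cert_epoch}"
            )
        self.queue.append(cert_epoch)


if __name__ == "__main__":
    node = ToyNode(stored_epoch=0)  # metadata never advanced past genesis
    try:
        node.enqueue_certificate(890)  # certificate from the live network
    except EpochMismatch as err:
        print(err)  # prints: wrong epoch. Expected=0 Certificate=890
```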
What I’ve Tried (All Failed)
  1. sui-tool restore-db:

```
sui-tool restore-db --config-path fullnode.yaml \
  --db-checkpoint-path /path/to/snapshot
```

     Result: Only copies files; does not initialize epoch metadata.

  2. Manual Restoration Attempts:
    • Copying the formal snapshot into the epoch_889 directory
    • Combining the formal and checkpoint snapshots
    • Using the --run-with-range-epoch flag
      Result: Node still reads epoch 0 from metadata
  3. Direct Database Manipulation:
    • Attempted to manually create the proper database structure
    • Tried to inject epoch metadata from the .ref files
      Result: Cannot bypass the initialization requirement

Technical Findings

  1. Snapshot Types Mismatch:
    • Formal snapshots: Data export without metadata structure
    • Checkpoint snapshots: Transaction history, but the wrong epoch in metadata
    • Neither provides complete database initialization
  2. Missing Components:
    • EpochStartConfiguration object
    • Proper typed_store entries
    • Committee information for the epoch
    • Database MANIFEST/CURRENT files with the correct epoch
  3. Data Verification:
    • Found 154 occurrences of the epoch 889 byte pattern (0x79 0x03, i.e. 889 in little-endian byte order) in the .ref files
    • Data is present but cannot be used without proper metadata
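The pattern 0x79 0x03 is 889 (0x0379) written in little-endian byte order, which is also how BCS serializes integers such as a `u64` epoch. A two-byte pattern can occur by chance in unrelated data, so searching for the full 8-byte little-endian `u64` is a more specific signal. The sketch below uses hypothetical helper names and treats each .ref file as opaque bytes; if the files are compressed or use a different encoding, the counts say nothing about epoch metadata.

```python
from pathlib import Path


def epoch_le_bytes(epoch: int, width: int = 8) -> bytes:
    """Little-endian encoding of an epoch; width=8 matches a BCS u64."""
    return epoch.to_bytes(width, "little")


def count_pattern(data: bytes, pattern: bytes) -> int:
    """Count (possibly overlapping) occurrences of `pattern` in `data`."""
    count, start = 0, 0
    while True:
        idx = data.find(pattern, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1


def scan_ref_files(snapshot_dir: str, pattern: bytes) -> dict[str, int]:
    """Map each .ref file name to its match count, omitting files with none."""
    hits = {}
    for path in sorted(Path(snapshot_dir).glob("*.ref")):
        n = count_pattern(path.read_bytes(), pattern)
        if n:
            hits[path.name] = n
    return hits


if __name__ == "__main__":
    short = epoch_le_bytes(889, width=2)  # b'\x79\x03', the 2-byte pattern
    full = epoch_le_bytes(889)            # 8-byte little-endian u64
    d = "/opt/sui-data/mainnet-db-formal/snapshot/epoch_889/"
    print("2-byte matches:", sum(scan_ref_files(d, short).values()))
    print("8-byte matches:", sum(scan_ref_files(d, full).values()))
```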

What I Need from Mysten Labs

Option 1: Proper Database Snapshot

Can you provide or point me to:

  • A complete RocksDB database backup (not just snapshot) from epoch 889
  • Must include all metadata tables and MANIFEST files
  • Should preserve the exact typed_store structure

Option 2: Initialization Tools

Are there undocumented tools or methods to:

  • Initialize a database directly at a specific epoch?
  • Convert snapshots to proper database format?
  • Skip the genesis → epoch 0 initialization requirement?

Option 3: Technical Guidance

Can you provide:

  • The exact RocksDB column family where epoch metadata is stored?
  • The BCS serialization format for EpochStartConfiguration?
  • A method to construct proper epoch metadata from snapshot data?
  • Any debug flags or environment variables that might help?

Option 4: Alternative Bootstrap Method

Is there any way to:

  • Bootstrap from a specific epoch without full sync?
  • Use checkpoint data to initialize at epoch 889?
  • Create a minimal chain history that satisfies initialization?

Impact and Urgency

This issue is blocking production deployment of a Sui full node for:

  • MEV bot operations
  • Network participation
  • Data availability requirements

The alternative (syncing from genesis) would require:

  • 2-4 weeks of sync time
  • 7+ TB of data transfer
  • Significant bandwidth costs
  • Extended downtime
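To put these transfer figures in perspective, the average throughput they imply can be computed directly. This is back-of-the-envelope arithmetic on the numbers quoted in this issue (7 TB over 2 to 4 weeks for a genesis sync, and about 2.25 TB in roughly 48 hours for the snapshots), assuming decimal units (1 TB = 10^12 bytes) and continuous transfer. It describes transfer rate only, not how fast a node can execute checkpoints.

```python
SECONDS_PER_DAY = 86_400


def avg_mbit_per_s(terabytes: float, days: float) -> float:
    """Average rate in Mbit/s needed to move `terabytes` (decimal TB) in `days`."""
    bits = terabytes * 1e12 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    for label, tb, days in [
        ("genesis sync, 2 weeks", 7.0, 14),   # about 46.3 Mbit/s
        ("genesis sync, 4 weeks", 7.0, 28),   # about 23.1 Mbit/s
        ("snapshots, ~48 hours", 2.25, 2),    # about 104.2 Mbit/s
    ]:
        print(f"{label}: {avg_mbit_per_s(tb, days):.1f} Mbit/s")
```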

Specific Questions

  1. Is this a known limitation of the snapshot system?
  2. Are there plans to provide complete database backups instead of just snapshots?
  3. Can sui-tool be enhanced to properly initialize epoch metadata?
  4. Is there a recommended way to get a node running at current epoch without full sync?
  5. Would Mysten consider providing a database backup service for node operators?

Proposed Solutions

I believe this could be addressed by:

  1. Providing complete database backups (with metadata) alongside snapshots
  2. Adding a --initialize-from-snapshot flag to sui-node
  3. Documenting the database structure and initialization requirements
  4. Creating tools to convert snapshots to proper databases

Environment Details for Reproduction

Server: Hetzner AX162-R  
OS: Ubuntu 22.04 LTS  
Sui Version: 1.32.1-homebrew  
Snapshot Source: Google Cloud Storage bucket (requester pays)  
Network: Mainnet  
Target Epoch: 889  
Data Downloaded: 2.25TB total  

Thank you for your attention to this critical issue. I believe addressing this limitation would greatly benefit the entire Sui node operator community.
