
Problem: Snapshot process is inefficient #1526


Description

@outerlook

For current data sizes, snapshot creation is already a process that blocks our node for several minutes, and a few targeted changes can improve it substantially. Any change must be well tested to avoid introducing non-determinism into the snapshot output.

Problem Summary

The current snapshot creation process (node/snapshotter/snapshotter.go) has severe performance bottlenecks, primarily caused by inefficient disk I/O patterns. For large databases (e.g., 100GB+), creating a snapshot can take hours. This impacts node operations and recovery times significantly.

Root Cause Analysis

There are three major inefficiencies:

  1. Random Disk Access During Sorting (CRITICAL):

    • In sanitizeDump (STAGE2), the code sorts table data (COPY blocks) by reading the file once, storing file offsets, sorting the offsets, and then using f.Seek() to re-read every single row in the new order.
    • Impact: This performs millions of random disk operations. Random I/O is orders of magnitude slower than sequential I/O and is the primary bottleneck. (An illustrative sketch of this access pattern follows this list.)
  2. Excessive Intermediate Disk Writes:

    • Each stage (dump, sanitize, compress, chunk) writes its full output to disk before the next stage reads it.
    • pg_dump -> stage1.sql -> stage2.sql -> stage3.sql.gz -> chunks.
    • Impact: A 100GB database results in over 400GB of sequential writes and reads, adding significant unnecessary time.
  3. Double I/O for Chunk Hashing:

    • In splitDumpIntoChunks (STAGE4), the code writes a chunk to a file, closes it, and then immediately reads the entire file back just to calculate its hash.
    • Impact: Doubles the amount of I/O needed for the final chunking stage.
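
For illustration only (hypothetical names and types; the real sanitizeDump code differs in detail), the offset-sort pattern from item 1 boils down to a per-row Seek followed by a small read, so a table with millions of rows triggers millions of random disk reads:

```go
package snapshot

import (
	"io"
	"os"
	"sort"
)

// rowRef is a hypothetical stand-in for the per-row offsets collected in STAGE2.
type rowRef struct {
	offset int64  // byte offset of the row inside the dump file
	length int    // row length in bytes
	key    string // sort key extracted from the row
}

// writeSortedRows re-reads every row in sorted order. The per-row
// Seek + read pair is the random-I/O bottleneck described above.
func writeSortedRows(f *os.File, refs []rowRef, out io.Writer) error {
	sort.Slice(refs, func(i, j int) bool { return refs[i].key < refs[j].key })
	for _, r := range refs {
		buf := make([]byte, r.length)
		if _, err := f.Seek(r.offset, io.SeekStart); err != nil {
			return err
		}
		if _, err := io.ReadFull(f, buf); err != nil {
			return err
		}
		if _, err := out.Write(buf); err != nil {
			return err
		}
	}
	return nil
}
```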

Proposed Solution

We need to eliminate random I/O and reduce total sequential I/O.

  1. Implement External Merge Sort for Sanitization:

    • Action: Replace the f.Seek() logic in sanitizeDump with an external merge sort.
    • How: Read table data into a fixed-size memory buffer (e.g., 256MB). When the buffer fills, sort it, write it out as a temporary "run" file, and repeat. Finally, merge all sorted runs with a single sequential pass.
    • Benefit: Eliminates all random I/O and ensures constant memory usage, preventing Out-Of-Memory (OOM) errors even with huge tables. (A sketch of this merge sort follows this list.)
  2. Pipeline Stages with io.Pipe:

    • Action: Stream the output of one stage directly to the input of the next using Go's io.Pipe and goroutines (Sanitize -> Compress -> Chunk).
    • Benefit: Removes the need for large intermediate files (stage2output.sql, stage3output.sql.gz), cutting total sequential I/O by more than half. (Items 2 and 3 are sketched together after this list.)
  3. Hash While Writing Chunks:

    • Action: Use io.MultiWriter to write data to both the final chunk file on disk AND the hash function simultaneously.
    • Benefit: Eliminates the need to re-read the chunk file, halving the I/O for the final stage.
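
A minimal sketch of the external merge sort from step 1, assuming rows are newline-delimited and that sorting whole lines lexicographically stands in for whatever key sanitizeDump actually orders by; temp-file cleanup and scanner buffer sizing are omitted to keep it short:

```go
package snapshot

import (
	"bufio"
	"container/heap"
	"fmt"
	"io"
	"os"
	"sort"
)

// externalSortLines sorts newline-delimited rows from r into w while holding
// at most roughly maxBufBytes of row data in memory. Full buffers are sorted
// and spilled to temporary "run" files, which are then merged with purely
// sequential reads.
func externalSortLines(r io.Reader, w io.Writer, maxBufBytes int) error {
	var runs []*bufio.Scanner
	var batch []string
	size := 0

	spill := func() error {
		if len(batch) == 0 {
			return nil
		}
		sort.Strings(batch) // stand-in for sorting by the real row key
		f, err := os.CreateTemp("", "snapshot-run-*")
		if err != nil {
			return err
		}
		bw := bufio.NewWriter(f)
		for _, line := range batch {
			fmt.Fprintln(bw, line)
		}
		if err := bw.Flush(); err != nil {
			return err
		}
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			return err
		}
		runs = append(runs, bufio.NewScanner(f))
		batch, size = batch[:0], 0
		return nil
	}

	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		batch = append(batch, line)
		if size += len(line); size >= maxBufBytes {
			if err := spill(); err != nil {
				return err
			}
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	if err := spill(); err != nil { // final partial run
		return err
	}

	// k-way merge: a min-heap over the head line of each sorted run.
	h := &runHeap{}
	for _, s := range runs {
		if s.Scan() {
			heap.Push(h, head{s.Text(), s})
		}
	}
	out := bufio.NewWriter(w)
	for h.Len() > 0 {
		it := heap.Pop(h).(head)
		fmt.Fprintln(out, it.line)
		if it.src.Scan() {
			heap.Push(h, head{it.src.Text(), it.src})
		}
	}
	return out.Flush()
}

type head struct {
	line string
	src  *bufio.Scanner
}

type runHeap []head

func (h runHeap) Len() int           { return len(h) }
func (h runHeap) Less(i, j int) bool { return h[i].line < h[j].line }
func (h runHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *runHeap) Push(x any)        { *h = append(*h, x.(head)) }
func (h *runHeap) Pop() any {
	old := *h
	n := len(old) - 1
	x := old[n]
	*h = old[:n]
	return x
}
```

Because each run is sorted before it is written and the merge consumes every run front to back, all disk access stays sequential and memory is bounded by roughly maxBufBytes plus one buffered line per run.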
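
Steps 2 and 3 can be sketched together: a goroutine compresses the sanitized stream into an io.Pipe, and the reader side splits the compressed bytes into chunk files while io.MultiWriter feeds each chunk's hash as it is written. Chunk naming, chunk size, and SHA-256 here are assumptions for the sketch, not the snapshotter's actual conventions:

```go
package snapshot

import (
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// streamCompressAndChunk pipes the sanitized dump through gzip and splits the
// compressed stream into fixed-size chunk files, hashing each chunk while it
// is written so nothing is read back from disk.
func streamCompressAndChunk(sanitized io.Reader, dir string, chunkSize int64) ([][32]byte, error) {
	pr, pw := io.Pipe()

	// Compression stage runs concurrently and feeds the pipe; any error is
	// propagated to the reader side via CloseWithError.
	go func() {
		gz := gzip.NewWriter(pw)
		_, err := io.Copy(gz, sanitized)
		if err == nil {
			err = gz.Close()
		}
		pw.CloseWithError(err)
	}()

	// Chunking stage: each CopyN writes to the chunk file and the hasher at
	// the same time through io.MultiWriter.
	var hashes [][32]byte
	for i := 0; ; i++ {
		path := filepath.Join(dir, fmt.Sprintf("chunk-%06d.gz", i))
		f, err := os.Create(path)
		if err != nil {
			return nil, err
		}
		h := sha256.New()
		n, copyErr := io.CopyN(io.MultiWriter(f, h), pr, chunkSize)
		if err := f.Close(); err != nil {
			return nil, err
		}
		if n == 0 {
			os.Remove(path) // stream ended exactly on a chunk boundary
		} else {
			var sum [32]byte
			copy(sum[:], h.Sum(nil))
			hashes = append(hashes, sum)
		}
		if copyErr == io.EOF {
			return hashes, nil
		}
		if copyErr != nil {
			return nil, copyErr
		}
	}
}
```

The same io.Pipe pattern can chain the sanitizer's output into the sanitized reader, so no stage ever materializes a full intermediate file on disk.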

Impact & ROI

Implementing these changes will yield massive performance gains:

  • Speed: Estimated 10x-100x faster. A process that takes hours or days could be reduced to minutes.
  • Reduced Disk Load: Total data moved (I/O) will decrease by ~60% (e.g., from ~580GB to ~220GB for a 100GB source).
  • Scalability & Stability: The process will be able to handle arbitrarily large databases using a fixed amount of memory, preventing OOM crashes.

Alternatives Considered

  1. Simple In-Memory Sort:

    • Idea: Load the entire table's data into memory instead of using file offsets.
    • Problem: Unsafe. If a single table is larger than available RAM (e.g., a 50GB table on a 32GB machine), the node will crash (OOM).
  2. Do Nothing:

    • Problem: The current implementation does not scale and is unacceptably slow for large datasets due to the random I/O bottleneck.
