Description
For current data sizes, snapshotting is already a process that holds our node for several minutes, and we can improve it a lot with some tweaks. However, any changes should be well tested to prevent issues with non-determinism.
Problem Summary
The current snapshot creation process (node/snapshotter/snapshotter.go) has severe performance bottlenecks, primarily caused by inefficient disk I/O patterns. For large databases (e.g., 100GB+), creating a snapshot can take hours. This significantly impacts node operations and recovery times.
Root Cause Analysis
There are three major inefficiencies:
- Random Disk Access During Sorting (CRITICAL):
  - In sanitizeDump (STAGE2), the code sorts table data (COPY blocks) by reading the file once, storing file offsets, sorting the offsets, and then using f.Seek() to re-read every single row in the new order.
  - Impact: This performs millions of random disk operations. Random I/O is orders of magnitude slower than sequential I/O and is the primary bottleneck.
- Excessive Intermediate Disk Writes:
  - Each stage (dump, sanitize, compress, chunk) writes its full output to disk before the next stage reads it: pg_dump -> stage1.sql -> stage2.sql -> stage3.sql.gz -> chunks.
  - Impact: A 100GB database results in over 400GB of sequential writes and reads, adding significant unnecessary time.
- Double I/O for Chunk Hashing:
  - In splitDumpIntoChunks (STAGE4), the code writes a chunk to a file, closes it, and then immediately reads the entire file back just to calculate its hash.
  - Impact: Doubles the amount of I/O needed for the final chunking stage.
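For illustration, a minimal Go sketch of the access pattern described in the first point follows. This is not the actual snapshotter.go code; the names (sortCopyBlockBySeek, rowRef) and the one-row-per-line framing are simplified assumptions, but the shape is the same: one sequential pass to collect offsets, an in-memory sort, then one Seek per row.

```go
package sketch

import (
	"bufio"
	"io"
	"os"
	"sort"
)

// rowRef remembers where a row lives in the dump file instead of keeping the row itself.
type rowRef struct {
	key    string // sort key (here simply the whole row)
	offset int64  // byte offset of the row in the file
	length int    // row length in bytes, excluding the newline
}

// sortCopyBlockBySeek mimics the current approach: one sequential pass to record
// offsets, an in-memory sort of the references, then one Seek+Read per row to
// emit the rows in sorted order -- i.e. one random read for every single row.
func sortCopyBlockBySeek(path string, out io.Writer) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Pass 1: sequential scan, remembering where every row starts.
	var refs []rowRef
	var offset int64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		row := sc.Text()
		refs = append(refs, rowRef{key: row, offset: offset, length: len(row)})
		offset += int64(len(row)) + 1 // +1 for the trailing '\n'
	}
	if err := sc.Err(); err != nil {
		return err
	}

	sort.Slice(refs, func(i, j int) bool { return refs[i].key < refs[j].key })

	// Pass 2: re-read every row in its new position -- millions of random reads
	// on a large table, which is the bottleneck this issue describes.
	for _, ref := range refs {
		if _, err := f.Seek(ref.offset, io.SeekStart); err != nil {
			return err
		}
		buf := make([]byte, ref.length)
		if _, err := io.ReadFull(f, buf); err != nil {
			return err
		}
		if _, err := out.Write(append(buf, '\n')); err != nil {
			return err
		}
	}
	return nil
}
```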
Proposed Solution
We need to eliminate random I/O and reduce total sequential I/O.
- Implement External Merge Sort for Sanitization:
  - Action: Replace the f.Seek() logic in sanitizeDump with an external merge sort (see the first sketch after this list).
  - How: Read table data into a fixed-size memory buffer (e.g., 256MB). If the table exceeds the buffer, write the sorted buffer to a temporary file and repeat. Finally, merge all sorted temporary files sequentially.
  - Benefit: Eliminates all random I/O. Ensures constant memory usage, preventing Out-Of-Memory (OOM) errors even with huge tables.
- Pipeline Stages with io.Pipe:
  - Action: Stream the output of one stage directly to the input of the next using Go's io.Pipe and goroutines (Sanitize -> Compress -> Chunk); see the second sketch after this list.
  - Benefit: Removes the need for large intermediate files (stage2output.sql, stage3output.sql.gz), cutting total sequential I/O by more than half.
- Hash While Writing Chunks:
  - Action: Use io.MultiWriter to write data to both the final chunk file on disk AND the hash function simultaneously (third sketch after this list).
  - Benefit: Eliminates the need to re-read the chunk file, halving the I/O for the final stage.
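Sketch 1, the external merge sort: a minimal version assuming newline-delimited rows and a hypothetical sortRowsExternally helper. The real sanitizeDump would additionally need COPY block framing and a larger bufio.Scanner buffer for long rows, but the spill-and-merge structure would be the same.

```go
package sketch

import (
	"bufio"
	"io"
	"os"
	"sort"
)

// sortRowsExternally sorts newline-delimited rows from r into w using roughly
// maxBufBytes of memory. Full buffers are sorted and spilled to temporary
// files ("runs"), which are then merged with purely sequential reads.
func sortRowsExternally(r io.Reader, w io.Writer, maxBufBytes int) error {
	var runs []*os.File
	defer func() {
		for _, f := range runs {
			f.Close()
			os.Remove(f.Name())
		}
	}()
	// Phase 1: fill a bounded in-memory buffer, sort it, spill it as a run.
	rows, size := []string{}, 0
	flush := func() error {
		if len(rows) == 0 {
			return nil
		}
		sort.Strings(rows)
		tmp, err := os.CreateTemp("", "snapshot-run-*")
		if err != nil {
			return err
		}
		runs = append(runs, tmp)
		bw := bufio.NewWriter(tmp)
		for _, row := range rows {
			if _, err := bw.WriteString(row + "\n"); err != nil {
				return err
			}
		}
		if err := bw.Flush(); err != nil {
			return err
		}
		if _, err := tmp.Seek(0, io.SeekStart); err != nil {
			return err
		}
		rows, size = rows[:0], 0
		return nil
	}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		row := sc.Text()
		rows = append(rows, row)
		size += len(row)
		if size >= maxBufBytes {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	if err := flush(); err != nil {
		return err
	}
	// Phase 2: k-way merge of the sorted runs, reading each run sequentially.
	scanners := make([]*bufio.Scanner, len(runs))
	heads := make([]string, len(runs))
	alive := make([]bool, len(runs))
	for i, f := range runs {
		scanners[i] = bufio.NewScanner(f)
		if scanners[i].Scan() {
			heads[i], alive[i] = scanners[i].Text(), true
		}
	}
	bw := bufio.NewWriter(w)
	for {
		best := -1
		for i := range runs {
			if alive[i] && (best == -1 || heads[i] < heads[best]) {
				best = i
			}
		}
		if best == -1 {
			break
		}
		if _, err := bw.WriteString(heads[best] + "\n"); err != nil {
			return err
		}
		if scanners[best].Scan() {
			heads[best] = scanners[best].Text()
		} else {
			alive[best] = false
		}
	}
	return bw.Flush()
}
```

If the whole table fits in the buffer this degenerates into a single sorted run, so the common case stays as fast as an in-memory sort.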
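Sketch 2, the io.Pipe pipeline: the sanitize and chunk function parameters are hypothetical stand-ins for the existing stage functions, wired together so data streams through in-memory pipes instead of intermediate files.

```go
package sketch

import (
	"compress/gzip"
	"io"
)

// runPipeline streams sanitize -> compress -> chunk without intermediate files.
// Each upstream stage runs in its own goroutine; errors are propagated through
// the pipes so a failure in any stage unblocks the others.
func runPipeline(
	dump io.Reader,
	sanitize func(in io.Reader, out io.Writer) error,
	chunk func(in io.Reader) error,
) error {
	sanitizeR, sanitizeW := io.Pipe()
	gzipR, gzipW := io.Pipe()
	errCh := make(chan error, 2)

	// Stage: sanitize the raw dump, writing into the first pipe.
	go func() {
		err := sanitize(dump, sanitizeW)
		sanitizeW.CloseWithError(err) // nil err closes normally (EOF downstream)
		errCh <- err
	}()

	// Stage: gzip-compress the sanitized stream into the second pipe.
	go func() {
		gz := gzip.NewWriter(gzipW)
		_, err := io.Copy(gz, sanitizeR)
		if cerr := gz.Close(); err == nil {
			err = cerr
		}
		sanitizeR.CloseWithError(err) // unblock the sanitize stage if we stopped early
		gzipW.CloseWithError(err)
		errCh <- err
	}()

	// Final stage: split the compressed stream into chunks on this goroutine.
	chunkErr := chunk(gzipR)
	gzipR.CloseWithError(chunkErr) // let upstream goroutines finish if chunking failed

	for i := 0; i < 2; i++ {
		if err := <-errCh; err != nil && chunkErr == nil {
			chunkErr = err
		}
	}
	return chunkErr
}
```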
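Sketch 3, hashing while writing: a hypothetical writeChunk helper showing the io.MultiWriter pattern. SHA-256 is a placeholder for whatever hash splitDumpIntoChunks actually uses.

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// writeChunk copies up to chunkSize bytes from r into path while hashing the
// same bytes in a single pass, so the chunk never has to be re-read from disk.
func writeChunk(r io.Reader, path string, chunkSize int64) (hash string, written int64, err error) {
	f, err := os.Create(path)
	if err != nil {
		return "", 0, err
	}
	defer f.Close()

	h := sha256.New()
	// Every byte copied goes to both the chunk file and the hash at the same time.
	mw := io.MultiWriter(f, h)

	written, err = io.CopyN(mw, r, chunkSize)
	if err == io.EOF {
		err = nil // a short final chunk is fine
	}
	if err != nil {
		return "", written, err
	}
	return hex.EncodeToString(h.Sum(nil)), written, nil
}
```

Combined with sketch 2, the chunking stage would simply call this in a loop on the compressed pipe reader until the final short chunk is written.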
Impact & ROI
Implementing these changes will yield massive performance gains:
- Speed: Estimated 10x-100x faster. A process that takes hours or days could be reduced to minutes.
- Reduced Disk Load: Total data moved (I/O) will decrease by ~60% (e.g., from ~580GB to ~220GB for a 100GB source).
- Scalability & Stability: The process will be able to handle arbitrarily large databases using a fixed amount of memory, preventing OOM crashes.
Alternatives Considered
- Simple In-Memory Sort:
  - Idea: Load the entire table's data into memory instead of using file offsets.
  - Problem: Unsafe. If a single table is larger than available RAM (e.g., a 50GB table on a 32GB machine), the node will crash (OOM).
- Do Nothing:
  - Problem: The current implementation does not scale and is unacceptably slow for large datasets due to the random I/O bottleneck.