A high-performance, scalable file backup system designed to handle millions of files with deduplication, compression, and version history tracking.
- Scalable Architecture: Handles unlimited file counts using chunked META directory structure
- Content Deduplication: MD5-based deduplication saves storage space for identical files
- Gzip Compression: Fast compression reduces storage requirements
- Version History: Tracks up to 3 versions of each file with timestamps
- Memory Efficient: Streaming processing prevents out-of-memory errors
- Crash Resistant: Atomic operations and index backups ensure data integrity
- Fast Incremental Backups: Only processes changed files on subsequent runs
- Configurable: JSON-based configuration system
/backup_location/{DIR}/
├── files/ # Compressed file content (MD5 named)
├── backups/ # Timestamped index backups
├── META/ # Chunked metadata directory
│ ├── master.json # Master index with chunk references
│ ├── index_0000.json.gz # Index chunks (25k files each)
│ ├── index_0001.json.gz
│ └── ...
└── cache/ # Generated file lists
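For illustration, a minimal sketch of bootstrapping this layout with Node's standard library (the ensureLayout helper is hypothetical, not part of the shipped scripts):

```js
const fs = require('fs');
const path = require('path');

// Create the backup layout shown above if it does not already exist.
function ensureLayout(backupLocation) {
  for (const dir of ['files', 'backups', 'META', 'cache']) {
    fs.mkdirSync(path.join(backupLocation, dir), { recursive: true });
  }
}

ensureLayout('/run/media/exHDD/backups'); // path taken from the example config below
```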
- Files are stored by their MD5 hash in the files/ directory
- All files are gzip compressed for space efficiency
- Identical files are deduplicated automatically
- Original paths and metadata stored in index
- Master Index: Small JSON file with chunk metadata
- Index Chunks: Compressed chunks containing file metadata
- Automatic Migration: Legacy formats are automatically upgraded
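As a sketch of how these pieces fit together, the snippet below recovers one stored version from an index entry shaped like the per-file format shown later in this README. The restore helper, and the assumption that blobs in files/ are plain gzip streams named by their MD5 hash, are illustrative; the actual scripts may differ.

```js
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// Illustrative restore: look up one tracked version of a file in the index,
// read its MD5-named blob from files/, and gunzip it back to destPath.
function restore(backupLocation, indexEntry, versionIndex, destPath) {
  const version = indexEntry.timeline[versionIndex];                 // one version record
  const blobPath = path.join(backupLocation, 'files', version.path); // MD5-named blob
  const originalBytes = zlib.gunzipSync(fs.readFileSync(blobPath));
  fs.writeFileSync(destPath, originalBytes);
}
```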
- Clone or download the backer system
- Configure config.json with your settings
- Generate file caches using files.js
- Run backups using store.js
Important: This system requires root/administrator permissions to access system directories and files. Always run commands with sudo on Linux/macOS or as Administrator on Windows.
Edit config.json to configure the system:
{
"backup_location": "/run/media/exHDD/backups",
"backup_dirs": [
"/var",
"/usr",
"/etc",
"/opt",
"/lib",
"/srv",
"/root",
"/home"
],
"flags": {
"MAX_CONTENT_HISTORY": 3,
"INDEX_SAVE_FREQUENCY": 10000,
"GC_FREQUENCY": 2000,
"MEMORY_CHECK_FREQUENCY": 5000,
"PROGRESS_UPDATE_FREQUENCY": 250
}
}

- backup_location: Root directory for all backups
- backup_dirs: Array of directories to backup
- MAX_CONTENT_HISTORY: Number of file versions to keep (default: 3)
- INDEX_SAVE_FREQUENCY: Save index every N files (default: 10000)
- GC_FREQUENCY: Garbage collection frequency (default: 2000)
- MEMORY_CHECK_FREQUENCY: Memory usage check frequency (default: 5000)
- PROGRESS_UPDATE_FREQUENCY: Progress display update frequency (default: 250)
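A minimal sketch of reading these settings at startup, with the documented defaults filled in for any missing flag (illustrative only; the loading code in store.js may look different):

```js
const fs = require('fs');

// Load config.json and merge the documented defaults for any missing flag.
const config = JSON.parse(fs.readFileSync('config.json', 'utf8'));
const flags = Object.assign({
  MAX_CONTENT_HISTORY: 3,
  INDEX_SAVE_FREQUENCY: 10000,
  GC_FREQUENCY: 2000,
  MEMORY_CHECK_FREQUENCY: 5000,
  PROGRESS_UPDATE_FREQUENCY: 250
}, config.flags);

console.log(`Backing up ${config.backup_dirs.length} directories to ${config.backup_location}`);
```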
First, generate a cache of files to backup:
sudo node files.js

This creates cache files in the cache/ directory for each configured backup directory.
Backup a specific directory:
sudo node store.js <directory_name>

Examples:
sudo node store.js home # Backup /home directory
sudo node store.js var # Backup /var directory
sudo node store.js etc   # Backup /etc directory

Validate backup integrity:

sudo node valid.js

- Initial Backup: ~8 minutes for 1.2M files
- Incremental Run: ~25 seconds when no files changed
- Memory Usage: <4GB RAM for millions of files
- Storage Efficiency: ~70% compression ratio typical
- Streaming file processing prevents memory overflow
- Chunked index system scales to unlimited file counts
- Automatic garbage collection maintains memory stability
- Progress tracking with ETA calculations
- Background save operations don't block processing
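As an example of the ETA calculation mentioned above (this is the obvious average-rate estimate, not necessarily the exact formula used by store.js):

```js
// Rough ETA: average seconds per processed file, multiplied by what remains.
function etaSeconds(startTimeMs, processedFiles, totalFiles) {
  if (processedFiles === 0) return Infinity;
  const elapsedSeconds = (Date.now() - startTimeMs) / 1000;
  return (elapsedSeconds / processedFiles) * (totalFiles - processedFiles);
}

// e.g. 300,000 of 1,200,000 files done after 120 s -> ~360 s remaining
```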
- Load existing index from META directory
- Compare modification timestamps
- Skip unchanged files
- Hash and compress new/changed files
- Update index with new metadata
- Save index in chunks periodically
- Create timestamped backup of index
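Roughly, the incremental pass looks like the sketch below: files whose modification time matches the newest index entry are skipped, and everything else is handed off for hashing and storage. This is an illustrative outline, not the actual store.js implementation; the processFile callback stands in for the hash/compress/update step.

```js
const fs = require('fs');

// Illustrative incremental pass: skip unchanged files, process the rest.
function incrementalPass(index, fileList, processFile) {
  let processed = 0;
  for (const filePath of fileList) {
    let stat;
    try {
      stat = fs.statSync(filePath);
    } catch (err) {
      continue; // file vanished since the cache was generated
    }
    const entry = index[filePath];
    const lastModified = entry && entry.timeline[entry.latest].modified;
    if (lastModified === stat.mtime.toISOString()) continue; // unchanged: skip
    processFile(filePath, stat); // hash, compress, and update the index entry
    processed++;
  }
  return processed; // number of files that actually changed
}
```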
- Files are hashed using MD5
- Identical content is stored only once
- Multiple file paths can reference same content
- Significant space savings for duplicate files
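The effect of content-based deduplication can be seen with a few lines of Node (the two paths here are placeholders): identical bytes always yield the same MD5, so only one compressed blob is written to files/ while both index entries reference it.

```js
const crypto = require('crypto');
const fs = require('fs');

// MD5 of a file's contents; identical content gives identical hashes.
const md5 = p => crypto.createHash('md5').update(fs.readFileSync(p)).digest('hex');

// Placeholder paths: if the two files have the same bytes, the hashes match
// and the second file adds no new blob to files/.
console.log(md5('/etc/hosts') === md5('/tmp/hosts-copy'));
```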
- Up to 3 versions of each file are tracked
- Each version includes:
- Original file path
- Modification timestamp
- Storage timestamp
- Content hash reference
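A sketch of how a new version might be appended while keeping only the most recent MAX_CONTENT_HISTORY versions. The field names mirror the per-file index format shown below, but the helper itself (and the reading of latest as the index of the newest timeline entry) is an assumption:

```js
// Append a version record and trim the timeline to maxHistory entries.
function addVersion(entry, contentHash, modifiedIso, maxHistory = 3) {
  entry.timeline.push({
    modified: modifiedIso,            // file mtime at backup time
    stored: new Date().toISOString(), // when this backup run stored it
    path: contentHash                 // MD5 blob name in files/
  });
  while (entry.timeline.length > maxHistory) {
    entry.timeline.shift();           // drop the oldest version
  }
  entry.latest = entry.timeline.length - 1; // assumed: points at newest entry
}
```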
{
"version": "2.0",
"created": "2025-08-25T...",
"totalFiles": 1234567,
"totalChunks": 50,
"chunkSize": 25000,
"chunks": [
{
"filename": "index_0000.json.gz",
"files": 25000,
"startPath": "/first/file/path",
"endPath": "/last/file/path",
"size": 2458392
}
]
}

{
"/path/to/file": {
"latest": 0,
"path": "/path/to/file",
"timeline": [
{
"modified": "2025-07-01T13:05:46.723Z",
"stored": "2025-08-25T08:30:15.123Z",
"path": "a1b2c3d4e5f6..."
}
]
}
}

- Invalid string length errors trigger chunked storage
- Memory overflow switches to emergency save mode
- Corrupted chunks are skipped with warnings
- Atomic operations prevent data corruption
- Index is backed up before each run
- Temporary files used for atomic operations
- Legacy format migration preserves all data
- Multiple fallback save mechanisms
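The "temporary file plus rename" pattern mentioned above can be sketched as follows (illustrative; the real save path may add index backups, retries, or different naming):

```js
const fs = require('fs');
const zlib = require('zlib');

// Atomic chunk save: write the compressed JSON to a temp file, then rename it
// over the final name. A crash mid-write leaves the old chunk intact instead
// of a truncated one.
function atomicWriteChunk(finalPath, chunkObject) {
  const tmpPath = finalPath + '.tmp';
  fs.writeFileSync(tmpPath, zlib.gzipSync(JSON.stringify(chunkObject)));
  fs.renameSync(tmpPath, finalPath);
}
```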
- Streaming file operations
- Periodic garbage collection
- Memory usage monitoring
- Chunked index loading
- Automatic cleanup of processed file sets
- Designed to run within 4GB of RAM
- Automatic warnings at 1GB usage
- Emergency garbage collection triggers
- Chunk size optimized for memory efficiency
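A sketch of the periodic check described above, assuming node is started with --expose-gc so that global.gc is available; the thresholds follow the defaults and the 1GB warning mentioned earlier, but the exact logic in store.js may differ:

```js
// Every MEMORY_CHECK_FREQUENCY files, look at heap usage; warn past 1 GB
// and trigger a collection if --expose-gc made global.gc available.
function checkMemory(filesProcessed, checkFrequency = 5000) {
  if (filesProcessed % checkFrequency !== 0) return;
  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;
  if (heapUsedMb > 1024) {
    console.warn(`High memory usage: ${heapUsedMb.toFixed(0)} MB`);
    if (global.gc) global.gc(); // emergency garbage collection
  }
}
```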
Out of Memory Errors
- Increase Node.js memory:
  sudo node --max-old-space-size=4096 store.js <directory>
- System automatically switches to chunked mode
- Check available system RAM
Slow Performance
- Verify disk I/O isn't the bottleneck
- Check backup_location drive speed
- Consider SSD for index storage
- Adjust INDEX_SAVE_FREQUENCY if needed
Cache Generation Fails
- Check permissions on backup_dirs
- Verify paths exist in config.json
- Look for symbolic link issues
Add debug logging by modifying the console.log statements in the code, or by enabling verbose output during operations.
- fs: File system operations
- path: Path manipulation
- crypto: MD5 hashing
- zlib: Gzip compression
- config.json: Configuration management
- MD5 Hashing: Content-based deduplication
- Gzip Compression: Fast compression with level 1
- Streaming I/O: Memory-efficient file processing
- Chunked Storage: Scalable metadata management
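To show how these pieces combine, here is a small streaming sketch that computes a file's MD5 while gzip-compressing it at level 1, without loading the whole file into memory. The file names are placeholders, and the real store.js flow may differ, for example by hashing first so the blob can be written directly under its MD5 name in files/:

```js
const crypto = require('crypto');
const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream');

// Stream srcPath through an MD5 hash and a level-1 gzip compressor so large
// files never need to fit in memory at once.
function hashAndCompress(srcPath, destPath, callback) {
  const hash = crypto.createHash('md5');
  const source = fs.createReadStream(srcPath);
  source.on('data', chunk => hash.update(chunk)); // hash the bytes as they stream by

  pipeline(
    source,
    zlib.createGzip({ level: 1 }), // level 1: fast with modest compression
    fs.createWriteStream(destPath),
    err => callback(err, err ? null : hash.digest('hex'))
  );
}

// Placeholder usage:
hashAndCompress('/etc/hosts', '/tmp/hosts.gz', (err, md5) => {
  if (!err) console.log('MD5:', md5);
});
```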
- MD5 used for deduplication (not cryptographic security)
- File permissions preserved in metadata
- No encryption - files stored compressed but unencrypted
- Consider filesystem-level encryption for sensitive data
This project is licensed under the Apache License, Version 2.0. This software is provided "as-is". The authors are not responsible for any damages or data loss that may result from its use.
For issues or questions:
- Check this README for troubleshooting
- Verify configuration in config.json
- Check system resources (RAM, disk space)
- Review console output for specific error messages