Backer - File Backup System

A high-performance, scalable file backup system designed to handle millions of files with deduplication, compression, and version history tracking.

Features

  • Scalable Architecture: Handles millions of files using a chunked META directory structure
  • Content Deduplication: MD5-based deduplication saves storage space for identical files
  • Gzip Compression: Fast compression reduces storage requirements
  • Version History: Tracks up to 3 versions of each file with timestamps
  • Memory Efficient: Streaming processing prevents out-of-memory errors
  • Crash Resistant: Atomic operations and index backups protect data integrity
  • Fast Incremental Backups: Only processes changed files on subsequent runs
  • Configurable: JSON-based configuration system

Architecture

Directory Structure

/backup_location/{DIR}/
├── files/                    # Compressed file content (MD5 named)
├── backups/                  # Timestamped index backups
├── META/                     # Chunked metadata directory
│   ├── master.json          # Master index with chunk references
│   ├── index_0000.json.gz   # Index chunks (25k files each)
│   ├── index_0001.json.gz
│   └── ...
└── cache/                    # Generated file lists

File Storage

  • Files are stored by their MD5 hash in the files/ directory
  • All files are gzip compressed for space efficiency
  • Identical files are deduplicated automatically
  • Original paths and metadata stored in index
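
A minimal sketch of this scheme is shown below: the file is streamed through an MD5 hash, gzip-compressed at level 1, and written under files/ named by its hash. The helper names (md5File, storeFile) and the backupRoot parameter are illustrative, not the actual functions in store.js.

const fs = require("fs");
const path = require("path");
const crypto = require("crypto");
const zlib = require("zlib");
const { pipeline } = require("stream/promises");

// Sketch only: hash a file with MD5, then store it gzip-compressed under files/<hash>.
function md5File(srcPath) {
    return new Promise((resolve, reject) => {
        const hash = crypto.createHash("md5");
        fs.createReadStream(srcPath)
            .on("error", reject)
            .on("data", (chunk) => hash.update(chunk))
            .on("end", () => resolve(hash.digest("hex")));
    });
}

async function storeFile(srcPath, backupRoot) {
    const digest = await md5File(srcPath);
    const dest = path.join(backupRoot, "files", digest);
    await pipeline(
        fs.createReadStream(srcPath),      // stream the file again for compression
        zlib.createGzip({ level: 1 }),     // fast compression, as noted under Algorithms
        fs.createWriteStream(dest)
    );
    return digest;                         // callers record this hash in the index
}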

Index Structure

  • Master Index: Small JSON file with chunk metadata
  • Index Chunks: Compressed chunks containing file metadata
  • Automatic Migration: Legacy formats are automatically upgraded

Installation

  1. Clone or download the backer system
  2. Configure config.json with your settings
  3. Generate file caches using files.js
  4. Run backups using store.js

Important: This system requires root/administrator permissions to access system directories and files. Always run commands with sudo on Linux/macOS or as Administrator on Windows.

Configuration

Edit config.json to configure the system:

{
    "backup_location": "/run/media/exHDD/backups",
    "backup_dirs": [
        "/var",
        "/usr", 
        "/etc",
        "/opt",
        "/lib",
        "/srv",
        "/root",
        "/home"
    ],
    "flags": {
        "MAX_CONTENT_HISTORY": 3,
        "INDEX_SAVE_FREQUENCY": 10000,
        "GC_FREQUENCY": 2000,
        "MEMORY_CHECK_FREQUENCY": 5000,
        "PROGRESS_UPDATE_FREQUENCY": 250
    }
}

Configuration Options

  • backup_location: Root directory for all backups
  • backup_dirs: Array of directories to back up
  • MAX_CONTENT_HISTORY: Number of file versions to keep (default: 3)
  • INDEX_SAVE_FREQUENCY: Save index every N files (default: 10000)
  • GC_FREQUENCY: Run garbage collection every N files processed (default: 2000)
  • MEMORY_CHECK_FREQUENCY: Check memory usage every N files processed (default: 5000)
  • PROGRESS_UPDATE_FREQUENCY: Update the progress display every N files processed (default: 250)
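
As a rough sketch of how these settings might be consumed by the scripts (the exact variable names inside files.js and store.js may differ):

const fs = require("fs");

// Sketch: load config.json from the working directory and read the tuning flags,
// falling back to the documented defaults when a flag is missing.
const config = JSON.parse(fs.readFileSync("config.json", "utf8"));
const { backup_location, backup_dirs, flags = {} } = config;
const { MAX_CONTENT_HISTORY = 3, INDEX_SAVE_FREQUENCY = 10000 } = flags;

console.log(`Backing up ${backup_dirs.length} directories to ${backup_location}`);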

Usage

1. Generate File Cache

First, generate a cache of the files to back up:

sudo node files.js

This creates cache files in the cache/ directory, one for each configured backup directory.
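
The cache format written by files.js is internal to backer; the sketch below only illustrates the general idea of walking a configured directory and writing the discovered paths into cache/ (the helper name and the one-path-per-line format are assumptions, not the actual implementation).

const fs = require("fs");
const path = require("path");

// Sketch: recursively list every regular file under a directory and
// write the paths to a cache file, skipping directories we cannot read.
function walk(dir, out) {
    let entries;
    try {
        entries = fs.readdirSync(dir, { withFileTypes: true });
    } catch {
        return out; // permission denied or vanished directory
    }
    for (const entry of entries) {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) walk(full, out);
        else if (entry.isFile()) out.push(full);
    }
    return out;
}

const files = walk("/etc", []);
fs.writeFileSync(path.join("cache", "etc.txt"), files.join("\n"));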

2. Run Backup

Back up a specific directory, where <directory_name> is the base name of one of the configured backup_dirs:

sudo node store.js <directory_name>

Examples:

sudo node store.js home     # Backup /home directory
sudo node store.js var      # Backup /var directory  
sudo node store.js etc      # Backup /etc directory

3. Validate Backup (Optional)

Validate backup integrity:

sudo node valid.js

Performance

Benchmarks

  • Initial Backup: ~8 minutes for 1.2M files
  • Incremental Run: ~25 seconds when no files changed
  • Memory Usage: <4GB RAM for millions of files
  • Storage Efficiency: ~70% compression ratio typical

Optimization Features

  • Streaming file processing prevents memory overflow
  • Chunked index system scales to very large file counts
  • Automatic garbage collection maintains memory stability
  • Progress tracking with ETA calculations
  • Background save operations don't block processing
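
The ETA mentioned above can be derived from throughput so far; the helper below is a generic sketch of that arithmetic, not the exact formula used in store.js.

// Sketch: estimate the remaining time from the average throughput so far.
function etaSeconds(startTimeMs, processed, total) {
    if (processed === 0) return Infinity;
    const elapsedSec = (Date.now() - startTimeMs) / 1000;
    const filesPerSec = processed / elapsedSec;
    return (total - processed) / filesPerSec;
}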

File Operations

Backup Process

  1. Load existing index from META directory
  2. Compare modification timestamps
  3. Skip unchanged files
  4. Hash and compress new/changed files
  5. Update index with new metadata
  6. Save index in chunks periodically
  7. Create timestamped backup of index
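
Steps 2-3 usually come down to comparing a file's modification time against the newest entry in its index record. Below is a minimal sketch of that check, assuming a record shaped like the File Record Format documented later (the helper name is illustrative):

const fs = require("fs");

// Sketch: re-store a file only if it is new or its modification time differs
// from the latest timeline entry recorded in the index.
function needsBackup(filePath, record) {
    const stat = fs.statSync(filePath);
    if (!record || record.timeline.length === 0) return true;   // never backed up
    const latest = record.timeline[record.latest];
    return stat.mtime.toISOString() !== latest.modified;        // changed since last run
}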

Deduplication

  • Files are hashed using MD5
  • Identical content is stored only once
  • Multiple file paths can reference the same content
  • Significant space savings for duplicate files
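
Deduplication falls out of the content-addressed layout: before compressing and writing, the backup can check whether files/<hash> already exists and, if it does, only record the path-to-hash mapping in the index. A sketch of that check, reusing the md5File helper from the File Storage sketch above (names are illustrative):

const fs = require("fs");
const path = require("path");
const zlib = require("zlib");
const { pipeline } = require("stream/promises");

// Sketch: write identical content only once; every duplicate path simply
// ends up referencing the same files/<hash> object in the index.
async function storeDeduplicated(srcPath, backupRoot) {
    const digest = await md5File(srcPath);                 // helper from the File Storage sketch
    const dest = path.join(backupRoot, "files", digest);
    if (!fs.existsSync(dest)) {
        await pipeline(
            fs.createReadStream(srcPath),
            zlib.createGzip({ level: 1 }),
            fs.createWriteStream(dest)
        );
    }
    return digest;
}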

Version History

  • Up to 3 versions of each file are tracked
  • Each version includes:
    • Original file path
    • Modification timestamp
    • Storage timestamp
    • Content hash reference
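
A sketch of how a bounded timeline of this shape might be maintained (the field names follow the File Record Format below; keeping the newest entry at index 0 and dropping the oldest is an assumption, not confirmed store.js behaviour):

const MAX_CONTENT_HISTORY = 3;

// Sketch: prepend the newest version and keep at most MAX_CONTENT_HISTORY entries.
// Each entry carries the file's modification time, the time it was stored,
// and the content hash under which the data lives in files/.
function addVersion(record, modifiedIso, contentHash) {
    record.timeline.unshift({
        modified: modifiedIso,
        stored: new Date().toISOString(),
        path: contentHash,
    });
    record.timeline = record.timeline.slice(0, MAX_CONTENT_HISTORY);
    record.latest = 0; // newest entry is always at index 0
}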

Index Format

Master Index (master.json)

{
  "version": "2.0",
  "created": "2025-08-25T...",
  "totalFiles": 1234567,
  "totalChunks": 50,
  "chunkSize": 25000,
  "chunks": [
    {
      "filename": "index_0000.json.gz",
      "files": 25000,
      "startPath": "/first/file/path",
      "endPath": "/last/file/path", 
      "size": 2458392
    }
  ]
}
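
A chunked index of this shape can be produced by splitting the in-memory records into fixed-size groups, gzip-compressing each group, and summarising them in master.json. The sketch below follows the fields shown above but is not the literal store.js implementation:

const fs = require("fs");
const path = require("path");
const zlib = require("zlib");

// Sketch: write the index as gzip-compressed chunks of 25,000 records each,
// plus a small master.json describing every chunk.
function saveChunkedIndex(records, metaDir, chunkSize = 25000) {
    const paths = Object.keys(records).sort();
    const chunks = [];

    for (let i = 0; i < paths.length; i += chunkSize) {
        const slice = paths.slice(i, i + chunkSize);
        const chunk = Object.fromEntries(slice.map((p) => [p, records[p]]));
        const filename = `index_${String(chunks.length).padStart(4, "0")}.json.gz`;
        const compressed = zlib.gzipSync(JSON.stringify(chunk));
        fs.writeFileSync(path.join(metaDir, filename), compressed);

        chunks.push({
            filename,
            files: slice.length,
            startPath: slice[0],
            endPath: slice[slice.length - 1],
            size: compressed.length,
        });
    }

    const master = {
        version: "2.0",
        created: new Date().toISOString(),
        totalFiles: paths.length,
        totalChunks: chunks.length,
        chunkSize,
        chunks,
    };
    fs.writeFileSync(path.join(metaDir, "master.json"), JSON.stringify(master, null, 2));
}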

File Record Format

{
  "/path/to/file": {
    "latest": 0,
    "path": "/path/to/file",
    "timeline": [
      {
        "modified": "2025-07-01T13:05:46.723Z",
        "stored": "2025-08-25T08:30:15.123Z",
        "path": "a1b2c3d4e5f6..."
      }
    ]
  }
}

Error Handling

Automatic Recovery

  • "Invalid string length" errors (raised when the index is serialized as a single oversized JSON string) trigger chunked storage
  • Memory overflow switches to emergency save mode
  • Corrupted chunks are skipped with warnings
  • Atomic operations prevent data corruption

Backup Safety

  • Index is backed up before each run
  • Temporary files used for atomic operations
  • Legacy format migration preserves all data
  • Multiple fallback save mechanisms
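
Atomic updates of this kind are typically done by writing to a temporary file and renaming it into place, since a rename on the same filesystem is atomic. A minimal sketch (not the exact store.js code):

const fs = require("fs");

// Sketch: write to a temporary file first, then rename it over the target.
// rename() is atomic on the same filesystem, so readers never see a partial file.
function atomicWrite(targetPath, data) {
    const tmpPath = `${targetPath}.tmp`;
    fs.writeFileSync(tmpPath, data);
    fs.renameSync(tmpPath, targetPath);
}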

Memory Management

Optimization Strategies

  • Streaming file operations
  • Periodic garbage collection
  • Memory usage monitoring
  • Chunked index loading
  • Automatic cleanup of processed file sets

Memory Limits

  • Designed to run within 4GB of RAM
  • Automatic warnings at 1GB usage
  • Emergency garbage collection triggers
  • Chunk size optimized for memory efficiency
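
A memory check along these lines can poll process.memoryUsage() every MEMORY_CHECK_FREQUENCY files and warn once the heap passes the threshold; note that global.gc only exists when Node is started with --expose-gc, and the exact thresholds in store.js may differ.

// Sketch: warn at ~1GB of heap usage and force garbage collection if available.
const WARN_BYTES = 1024 * 1024 * 1024;

function checkMemory() {
    const { heapUsed } = process.memoryUsage();
    if (heapUsed > WARN_BYTES) {
        console.warn(`High memory usage: ${(heapUsed / 1024 / 1024).toFixed(0)} MB`);
        if (global.gc) global.gc(); // emergency collection when exposed via --expose-gc
    }
}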

Troubleshooting

Common Issues

Out of Memory Errors

  • Increase Node.js memory: sudo node --max-old-space-size=4096 store.js <directory>
  • System automatically switches to chunked mode
  • Check available system RAM

Slow Performance

  • Verify disk I/O isn't the bottleneck
  • Check backup_location drive speed
  • Consider SSD for index storage
  • Adjust INDEX_SAVE_FREQUENCY if needed

Cache Generation Fails

  • Check permissions on backup_dirs
  • Verify paths exist in config.json
  • Look for symbolic link issues

Debug Mode

Add debug logging by inserting or adjusting console.log statements in the scripts to get more verbose output about the operations you are inspecting.

Technical Details

Dependencies

  • fs: File system operations (Node.js built-in module)
  • path: Path manipulation (Node.js built-in module)
  • crypto: MD5 hashing (Node.js built-in module)
  • zlib: Gzip compression (Node.js built-in module)
  • config.json: JSON configuration file (not a code dependency)

Algorithms

  • MD5 Hashing: Content-based deduplication
  • Gzip Compression: Fast compression with level 1
  • Streaming I/O: Memory-efficient file processing
  • Chunked Storage: Scalable metadata management

Security Considerations

  • MD5 used for deduplication (not cryptographic security)
  • File permissions preserved in metadata
  • No encryption - files stored compressed but unencrypted
  • Consider filesystem-level encryption for sensitive data

License

This project is licensed under the Apache License, Version 2.0. This software is provided "as-is". The authors are not responsible for any damages or data loss that may result from its use.

Support

For issues or questions:

  1. Check this README for troubleshooting
  2. Verify configuration in config.json
  3. Check system resources (RAM, disk space)
  4. Review console output for specific error messages
