LowMemoryLLM

Chinese Documentation (δΈ­ζ–‡ζ–‡ζ‘£)

Overview

LowMemoryLLM is a lightweight C implementation of Large Language Model (LLM) inference and training designed for memory-constrained environments. It keeps the memory footprint small through quantization, disk offloading, memory mapping, and KV caching.

Key Features

  • πŸš€ Multiple quantization options (INT8, INT4, INT2)
  • πŸ’Ύ Smart memory management with disk offloading
  • πŸ”„ Efficient attention caching mechanism
  • πŸ“¦ Hugging Face model integration
  • πŸ› οΈ Flexible memory management and optimization
  • 🌐 Built-in download manager with proxy support
  • 🎯 Hardware-agnostic training support
  • πŸ”‹ Assembly-optimized computation kernels

Technical Features

  • Memory optimization through disk offloading and memory mapping
  • Configurable quantization with per-channel support
  • KV-cache for efficient inference
  • Support for various activation functions (ReLU, GELU, SILU, SWISH)
  • Matrix operations optimized for low memory environments
  • Comprehensive tensor operations and management
  • Cross-platform training support with hardware acceleration
  • Multiple optimizer implementations (SGD, Adam, AdamW, RMSprop)
  • Mixed precision training support
  • Gradient clipping and normalization

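As an illustration of per-channel quantization, below is a minimal sketch of symmetric INT8 quantization for one channel of a weight matrix. The function name and layout are assumptions for illustration, not the project's actual API.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch, not the project's API: quantize one channel with a
 * symmetric per-channel scale, q = round(w / scale), scale = max|w| / 127. */
static void quantize_channel_int8(const float *w, int8_t *q,
                                  size_t n, float *scale_out) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > max_abs) max_abs = a;
    }
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = w[i] / scale;
        /* clamp to [-127, 127] before rounding to avoid INT8 overflow */
        q[i] = (int8_t)lrintf(v < -127.0f ? -127.0f : (v > 127.0f ? 127.0f : v));
    }
    *scale_out = scale; /* kept in FP32 for dequantization at inference time */
}

INT4 and INT2 follow the same pattern with smaller clamping ranges and packed storage.
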
Requirements

  • C compiler with C11 support
  • CMake for build system
  • Sufficient disk space for model weights and swap files
  • Network connection for model downloads
  • (Optional) AVX2/NEON support for hardware acceleration

Installation

git clone https://github.com/2404589803/LowMemoryLLM.git
cd LowMemoryLLM
mkdir build && cd build
cmake ..
make

Usage

Inference

  1. Configure model settings:
LLMConfig config = {
    .vocab_size = 50257,
    .hidden_size = 768,
    .num_layers = 12,
    .max_seq_length = 1024,
    .batch_size = 1,
    .use_cache = 1
};
  2. Initialize memory manager:
MemoryManager mem_manager = {
    .use_disk_offload = 1,
    .use_memory_map = 1,
    .prefetch_size = 1024 * 1024
};
  3. Initialize and run inference:
llm_init(&config, &mem_manager);
llm_load_weights("path/to/weights");
// Run inference...
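
With `use_cache = 1`, attention keys and values for earlier positions are reused instead of recomputed. For the configuration above (12 layers, hidden size 768, sequence length 1024), an FP16 KV-cache needs roughly 2 Γ— 12 Γ— 1024 Γ— 768 Γ— 2 bytes β‰ˆ 36 MiB; INT8 caching would halve that. A minimal sketch of a token-by-token generation loop follows; `llm_generate_next` and `llm_free` are hypothetical names used for illustration (only `llm_init` and `llm_load_weights` above are confirmed API):

/* Hypothetical generation loop; llm_generate_next and llm_free are
 * illustrative names, not confirmed project functions. */
int tokens[1024] = { /* prompt token ids */ 0 };
size_t n_tokens = 1;

while (n_tokens < config.max_seq_length) {
    /* With use_cache = 1, only the newest token needs a forward pass;
     * earlier attention states come from the KV-cache. */
    int next = llm_generate_next(tokens, n_tokens);
    if (next < 0) break;            /* end-of-sequence or error */
    tokens[n_tokens++] = next;
}

llm_free();                          /* release weights and cache */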

Training

  1. Configure training settings:
TrainingConfig train_config = {
    .batch_size = 32,
    .num_epochs = 100,
    .loss_type = LOSS_CROSS_ENTROPY,
    .optimizer = {
        .type = OPTIMIZER_ADAM,
        .learning_rate = 0.001f,
        .beta1 = 0.9f,
        .beta2 = 0.999f,
        .epsilon = 1e-8f
    },
    .gradient_clip_norm = 1.0f,
    .enable_mixed_precision = 1
};
  2. Initialize training system:
TrainingExtension extension = {
    .backward_matrix_multiply = backward_matrix_multiply_asm,
    .backward_vector_add = backward_vector_add_asm,
    // Set other function pointers...
};

training_init(device, &extension);
training_configure(&train_config);
  3. Training loop:
TrainingState state = {0};
TrainingCallbacks callbacks = {
    .on_epoch_begin = my_epoch_begin_callback,
    .on_batch_end = my_batch_end_callback
};

for (size_t epoch = 0; epoch < train_config.num_epochs; epoch++) {
    for (size_t batch = 0; batch < num_batches; batch++) {
        /* prepare_batch_inputs/targets are user-supplied data loaders */
        void* inputs = prepare_batch_inputs(batch);
        void* targets = prepare_batch_targets(batch);
        training_step(model, inputs, targets, &state, &callbacks);
    }

    /* Evaluate on a held-out validation set at the end of each epoch */
    float metrics[2];
    training_evaluate(model, val_inputs, val_targets, metrics, 2);
}
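
The `gradient_clip_norm` field in `TrainingConfig` caps the global gradient norm before the optimizer step. Below is a minimal sketch of how global-norm clipping is typically implemented, assuming a flat FP32 gradient buffer; the helper name is illustrative, not the project's API:

#include <math.h>
#include <stddef.h>

/* Illustrative helper: scale the gradient buffer in place so its L2 norm
 * does not exceed max_norm (e.g. train_config.gradient_clip_norm). */
static void clip_grad_norm(float *grad, size_t n, float max_norm) {
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum_sq += grad[i] * grad[i];
    float norm = sqrtf(sum_sq);
    if (norm > max_norm && norm > 0.0f) {
        float scale = max_norm / norm;
        for (size_t i = 0; i < n; i++)
            grad[i] *= scale;
    }
}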

Hardware Support

  • x86_64 with AVX/AVX2 optimization
  • ARM64 with NEON optimization
  • Generic CPU fallback implementation
  • Extensible device abstraction layer

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

[Add your contact information here]
