CrowSoda/crows_qwen

πŸš€ Qwen3-235B Local Performance Testing Environment

Production-grade local testing setup with Prometheus + Grafana monitoring

πŸ“‹ Overview

This repository provides a comprehensive local testing environment for Qwen3-235B across three quantization levels (Q2, Q3, Q4), featuring:

  • Real-time monitoring with Prometheus + Grafana integration
  • System optimization for dual A6000 + i9-14900K setup
  • Professional benchmarking with comprehensive metrics
  • vLLM integration for maximum performance
  • Automated deployment and testing workflows

🎯 Key Features

βœ… Completed Components:

  • System Optimization: Hardware-specific tuning for maximum performance
  • Dependencies Management: Clean Python environment with all required packages
  • Model Downloader: Prometheus-monitored downloads with resume capability
  • Metrics Collector: Real-time system monitoring with Grafana integration
  • Professional Logging: Comprehensive logging and reporting

πŸ”„ In Progress:

  • vLLM configuration files
  • Model launcher scripts
  • Benchmark testing suite
  • Grafana dashboard templates

πŸ—οΈ Architecture

```
/mnt/nvme/qwen3_local/
β”œβ”€β”€ scripts/                    # Setup and utility scripts
β”‚   β”œβ”€β”€ 01_system_optimize.sh   # System-level optimizations
β”‚   β”œβ”€β”€ 02_install_dependencies.sh # Environment setup
β”‚   └── 03_download_models.py   # Model downloader with monitoring
β”œβ”€β”€ monitoring/                 # Prometheus monitoring
β”‚   └── metrics_collector.py    # Real-time system metrics
β”œβ”€β”€ models/                     # Model storage (Q2, Q3, Q4)
β”œβ”€β”€ configs/                    # vLLM and system configurations
β”œβ”€β”€ launchers/                  # Model launch scripts
β”œβ”€β”€ tests/                      # Testing and benchmarking
β”œβ”€β”€ dashboards/                 # Grafana dashboard templates
β”œβ”€β”€ logs/                       # Runtime logs and reports
└── results/                    # Benchmark results and analysis
```

πŸš€ Quick Start

Step 1: System Optimization

```shell
cd /mnt/nvme/qwen3_local
sudo ./scripts/01_system_optimize.sh
```

Step 2: Install Dependencies

```shell
./scripts/02_install_dependencies.sh
```

Step 3: Activate Environment

```shell
source activate.sh
```

Step 4: Download Models

```shell
# Download all models (Q2, Q3, Q4)
./scripts/03_download_models.py download

# Download specific models
./scripts/03_download_models.py download --models q3

# Check download status
./scripts/03_download_models.py status
```
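Resume capability means a re-run of `download` picks up partial files instead of starting over. The usual mechanism is an HTTP Range request for the missing tail; a minimal stdlib sketch (the helper names `resume_offset` and `range_header` are illustrative, not the actual script's API):

```python
import os

def resume_offset(path: str) -> int:
    """Bytes already on disk, so a restarted download can skip them."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def range_header(path: str) -> dict:
    """HTTP header asking the server for only the missing part of the file."""
    offset = resume_offset(path)
    return {"Range": f"bytes={offset}-"} if offset else {}

# A real downloader would pass this header to urllib/requests and append
# the response body to the partially downloaded file.
```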

Step 5: Start Monitoring

```shell
./monitoring/metrics_collector.py --interval 5.0
```
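The collector polls the host at the given interval and pushes each sample to Prometheus. A stdlib-only approximation of that loop, assuming nothing about the real script's internals (it presumably uses `psutil` and `nvidia-smi` for the full metric set):

```python
import os
import shutil
import time

def collect_system_metrics() -> dict:
    """One sample of host-level metrics (a small subset of what the collector reports)."""
    mount = "/mnt/nvme" if os.path.exists("/mnt/nvme") else "/"
    disk = shutil.disk_usage(mount)
    load1, _load5, _load15 = os.getloadavg()  # Unix-only
    return {
        "load_1m": load1,
        "disk_total_bytes": disk.total,
        "disk_free_bytes": disk.free,
        "timestamp": time.time(),
    }

def collect_loop(interval: float, samples: int) -> list:
    """Poll at a fixed interval, mirroring the --interval flag."""
    readings = []
    for _ in range(samples):
        readings.append(collect_system_metrics())
        time.sleep(interval)
    return readings
```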

πŸ“Š Prometheus Integration

Metrics Collection

  • System Metrics: CPU, memory, disk, network usage
  • GPU Metrics: Utilization, memory, temperature, power
  • Model Metrics: Download progress, inference performance
  • vLLM Metrics: Service status, response times, tokens/sec

Pushgateway Integration

  • URL: http://192.168.20.13:30091
  • Job: qwen3_performance
  • Instance: local_testing
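Metrics reach the Pushgateway as plain-text exposition format POSTed to `/metrics/job/<job>/instance/<instance>`. The real collector presumably uses `prometheus_client.push_to_gateway`; the stdlib sketch below just shows what that wraps:

```python
from urllib.parse import quote

def push_url(base: str, job: str, instance: str) -> str:
    """Pushgateway grouping path: /metrics/job/<job>/instance/<instance>."""
    return f"{base}/metrics/job/{quote(job)}/instance/{quote(instance)}"

def exposition_line(name: str, labels: dict, value: float) -> str:
    """One sample in Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# POSTing "\n".join(lines) + "\n" to push_url(...) with urllib.request
# is all the Pushgateway needs.
```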

Available Metrics

```
# Download metrics
qwen3_download_progress_percent{model="q2|q3|q4", file="filename"}
qwen3_download_speed_mbps{model="q2|q3|q4", file="filename"}
qwen3_download_status{model="q2|q3|q4", file="filename"}

# System metrics
qwen3_cpu_usage_percent{cpu="all|cpu0|cpu1...", type="total|core"}
qwen3_memory_usage_bytes{type="total|used|free|cached|buffers"}
qwen3_gpu_utilization_percent{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_memory_usage_bytes{gpu_id="0|1", name="gpu_name", type="used|total|free"}
qwen3_gpu_temperature_celsius{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_power_watts{gpu_id="0|1", name="gpu_name"}

# vLLM metrics
qwen3_vllm_status{model="q2|q3|q4", port="8001|8002|8003"}
qwen3_vllm_tokens_per_second{model="q2|q3|q4"}
qwen3_vllm_response_time_seconds{model="q2|q3|q4", endpoint="completions"}
```

πŸ”§ Hardware Configuration

Optimized For:

  • CPU: Intel i9-14900K (24 cores, 32 threads)
  • GPU: Dual NVIDIA RTX A6000 (48GB VRAM each)
  • RAM: 128GB DDR5-6400
  • Storage: NVMe SSD with 300GB+ free space

Performance Targets:

  • Q2_K_M: 25-40 tokens/sec (Speed optimized)
  • Q3_K_L: 15-30 tokens/sec (Balanced performance)
  • Q4_K_M: 10-20 tokens/sec (Quality optimized)

πŸ“ˆ Monitoring Dashboard

Grafana Integration

Monitor your Qwen3 performance in real-time:

  • Prometheus: http://192.168.20.13:30090
  • Pushgateway: http://192.168.20.13:30091
  • Grafana: (Configure with your Grafana instance)

Key Performance Indicators

  • Tokens per second by model
  • GPU utilization and memory usage
  • System resource consumption
  • Model download progress and speed
  • Response times and error rates
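With the metrics above in Prometheus, KPI panels can be built from queries like the following (illustrative; adjust label names to whatever the collector actually exports):

```
# Tokens per second by model
avg by (model) (qwen3_vllm_tokens_per_second)

# GPU memory used as a fraction of total, per card
qwen3_gpu_memory_usage_bytes{type="used"}
  / on (gpu_id, name) qwen3_gpu_memory_usage_bytes{type="total"}

# Download speed per model
avg by (model) (qwen3_download_speed_mbps)
```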

πŸ› οΈ Development Workflow

Testing Cycle

  1. System Optimization β†’ Maximum hardware performance
  2. Model Download β†’ Monitored with Prometheus
  3. Performance Testing β†’ Comprehensive benchmarks
  4. Results Analysis β†’ Grafana dashboards
  5. Optimization β†’ Iterative improvements

Logging and Monitoring

  • Logs: `/mnt/nvme/qwen3_local/logs/`
  • Results: `/mnt/nvme/qwen3_local/results/`
  • Metrics: Real-time Prometheus integration

🎯 Model Specifications

Q2_K_M (Speed Optimized)

  • Size: ~65GB
  • Performance: 25-40 tokens/sec
  • Context: 32K tokens
  • Use Case: Fast prototyping, simple tasks

Q3_K_L (Balanced)

  • Size: ~97GB
  • Performance: 15-30 tokens/sec
  • Context: 65K tokens (with YaRN)
  • Use Case: Daily development, most coding tasks

Q4_K_M (Quality Optimized)

  • Size: ~117GB
  • Performance: 10-20 tokens/sec
  • Context: 131K tokens (with YaRN)
  • Use Case: Complex algorithms, production code
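A launcher for the Q4 build might look like the sketch below. This is a sketch only: the model filename, port, and `--rope-scaling` JSON (to reach the 131K YaRN context) are assumptions to be verified against the vLLM documentation for your installed version.

```shell
# launchers/launch_q4.sh (illustrative; verify flags against your vLLM version)
# --tensor-parallel-size 2 splits the model across both A6000s;
# the YaRN rope-scaling values target the 131K context listed above.
vllm serve /mnt/nvme/qwen3_local/models/q4/Qwen3-235B-Q4_K_M.gguf \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --port 8003
```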

πŸ“‹ Environment Variables

```shell
# GPU Configuration
export CUDA_VISIBLE_DEVICES=0,1
export NVIDIA_VISIBLE_DEVICES=all

# vLLM Optimizations
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2

# Prometheus Integration
export PROMETHEUS_PUSHGATEWAY_URL=http://192.168.20.13:30091
export PROMETHEUS_JOB_NAME=qwen3_performance
export PROMETHEUS_INSTANCE_NAME=local_testing

# Performance Tuning
export OMP_NUM_THREADS=24
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024
```

πŸ› Troubleshooting

Common Issues

Out of Memory Errors:

  • Check GPU memory usage: `nvidia-smi`
  • Reduce batch size or context length
  • Ensure swap is configured for Q4 model

Slow Performance:

  • Verify the CPU governor: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` (should read `performance` after optimization)
  • Check GPU clocks: `nvidia-smi -q -d CLOCK`
  • Monitor CPU frequency: `grep MHz /proc/cpuinfo`

Monitoring Issues:

  • Verify Pushgateway connectivity: `curl http://192.168.20.13:30091/metrics`
  • Check log files: `tail -f logs/metrics_collector.log`
  • Test metric collection: `./monitoring/metrics_collector.py --info`

πŸ“Š Next Steps

  1. Complete vLLM Configuration: Model-specific configs for Q2, Q3, Q4
  2. Create Model Launchers: Easy startup scripts for each model
  3. Build Testing Suite: Comprehensive performance benchmarks
  4. Create Grafana Dashboards: Visual monitoring templates
  5. Implement Automated Testing: CI/CD-style testing workflows

πŸ’‘ Tips for Maximum Performance

  • Always run system optimization before testing
  • Monitor resource usage in real-time with Grafana
  • Use appropriate quantization for your use case
  • Test with different context lengths to find optimal settings
  • Keep logs and results for performance analysis

Status: Foundation complete; ready for vLLM integration and testing.
Next: vLLM configuration files and model launchers.
