Production-grade local testing setup with Prometheus + Grafana monitoring
This is a comprehensive testing environment for Qwen3-235B with all quantization levels (Q2, Q3, Q4), featuring:
- Real-time monitoring with Prometheus + Grafana integration
- System optimization for dual A6000 + i9-14900K setup
- Professional benchmarking with comprehensive metrics
- vLLM integration for maximum performance
- Automated deployment and testing workflows
- System Optimization: Hardware-specific tuning for maximum performance
- Dependencies Management: Clean Python environment with all required packages
- Model Downloader: Prometheus-monitored downloads with resume capability
- Metrics Collector: Real-time system monitoring with Grafana integration
- Professional Logging: Comprehensive logging and reporting
- vLLM configuration files
- Model launcher scripts
- Benchmark testing suite
- Grafana dashboard templates
/mnt/nvme/qwen3_local/
├── scripts/                       # Setup and utility scripts
│   ├── 01_system_optimize.sh      # System-level optimizations
│   ├── 02_install_dependencies.sh # Environment setup
│   └── 03_download_models.py      # Model downloader with monitoring
├── monitoring/                    # Prometheus monitoring
│   └── metrics_collector.py       # Real-time system metrics
├── models/                        # Model storage (Q2, Q3, Q4)
├── configs/                       # vLLM and system configurations
├── launchers/                     # Model launch scripts
├── tests/                         # Testing and benchmarking
├── dashboards/                    # Grafana dashboard templates
├── logs/                          # Runtime logs and reports
└── results/                       # Benchmark results and analysis
cd /mnt/nvme/qwen3_local
sudo ./scripts/01_system_optimize.sh
./scripts/02_install_dependencies.sh
source activate.sh
# Download all models (Q2, Q3, Q4)
./scripts/03_download_models.py download
# Download specific models
./scripts/03_download_models.py download --models q3
# Check download status
./scripts/03_download_models.py status
./monitoring/metrics_collector.py --interval 5.0
- System Metrics: CPU, memory, disk, network usage
- GPU Metrics: Utilization, memory, temperature, power
- Model Metrics: Download progress, inference performance
- vLLM Metrics: Service status, response times, tokens/sec
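The GPU figures above are typically scraped from `nvidia-smi`. A minimal stdlib sketch of that collection step, assuming the collector shells out with `--query-gpu` (field names and the dict layout are illustrative, not the actual `metrics_collector.py` code); the parser is exercised on a hardcoded sample so the snippet runs without a GPU:

```python
import csv
import io
import subprocess

# Fields queried per GPU; these map onto the GPU metrics listed above.
QUERY = "utilization.gpu,memory.used,temperature.gpu,power.draw"

def read_gpus():
    """Shell out to nvidia-smi and return one dict per GPU (requires a GPU)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)

def parse_gpu_csv(text):
    """Parse csv,noheader,nounits output into per-GPU metric dicts."""
    gpus = []
    for gpu_id, row in enumerate(csv.reader(io.StringIO(text))):
        util, mem_mib, temp_c, power_w = (float(v.strip()) for v in row)
        gpus.append({"gpu_id": gpu_id, "util_percent": util,
                     "mem_used_mib": mem_mib, "temp_c": temp_c,
                     "power_w": power_w})
    return gpus

# Hardcoded sample for a dual-GPU box, so the parser can be checked offline.
sample = "87, 40532, 64, 248.3\n91, 41210, 66, 251.0\n"
print(parse_gpu_csv(sample))
```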
- URL: http://192.168.20.13:30091
- Job: qwen3_performance
- Instance: local_testing
# Download metrics
qwen3_download_progress_percent{model="q2|q3|q4", file="filename"}
qwen3_download_speed_mbps{model="q2|q3|q4", file="filename"}
qwen3_download_status{model="q2|q3|q4", file="filename"}
# System metrics
qwen3_cpu_usage_percent{cpu="all|cpu0|cpu1...", type="total|core"}
qwen3_memory_usage_bytes{type="total|used|free|cached|buffers"}
qwen3_gpu_utilization_percent{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_memory_usage_bytes{gpu_id="0|1", name="gpu_name", type="used|total|free"}
qwen3_gpu_temperature_celsius{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_power_watts{gpu_id="0|1", name="gpu_name"}
# vLLM metrics
qwen3_vllm_status{model="q2|q3|q4", port="8001|8002|8003"}
qwen3_vllm_tokens_per_second{model="q2|q3|q4"}
qwen3_vllm_response_time_seconds{model="q2|q3|q4", endpoint="completions"}
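These series reach Prometheus through the Pushgateway configured above. A minimal stdlib sketch of that push path, assuming the Pushgateway's standard `PUT /metrics/job/<job>/instance/<instance>` endpoint (the real collector may use the `prometheus_client` library instead); the push function is defined but not called, so the snippet runs offline:

```python
import urllib.request

def exposition_line(name, labels, value):
    """Format one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}\n"

def push(body, gateway="http://192.168.20.13:30091",
         job="qwen3_performance", instance="local_testing"):
    """PUT replaces all metrics for this job/instance group on the Pushgateway."""
    url = f"{gateway}/metrics/job/{job}/instance/{instance}"
    req = urllib.request.Request(url, data=body.encode(), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 on success

sample = exposition_line("qwen3_gpu_utilization_percent",
                         {"gpu_id": "0", "name": "RTX_A6000"}, 87)
print(sample, end="")
```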
- CPU: Intel i9-14900K (24 cores, 32 threads)
- GPU: Dual NVIDIA RTX A6000 (48GB VRAM each)
- RAM: 128GB DDR5-6400
- Storage: NVMe SSD with 300GB+ free space
- Q2_K_M: 25-40 tokens/sec (Speed optimized)
- Q3_K_L: 15-30 tokens/sec (Balanced performance)
- Q4_K_M: 10-20 tokens/sec (Quality optimized)
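One way to check these expectations once a model is serving: time a single request against vLLM's OpenAI-compatible `/v1/completions` endpoint and divide generated tokens by wall time. A stdlib sketch, where the port and model name are placeholders and the network call lives in a function so the snippet runs offline:

```python
import json
import time
import urllib.request

def throughput(completion_tokens, elapsed_s):
    """Tokens per second for one completion."""
    return completion_tokens / elapsed_s

def measure(base_url="http://localhost:8001", model="qwen3-q2",
            prompt="Write a haiku about GPUs.", max_tokens=256):
    """Time one vLLM /v1/completions call and return tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(base_url + "/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return throughput(data["usage"]["completion_tokens"], elapsed)

# Offline check of the arithmetic: 256 tokens in 10 s -> 25.6 tok/s,
# within the Q2_K_M range above.
print(throughput(256, 10.0))
```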
Monitor your Qwen3 performance in real-time:
- Prometheus: http://192.168.20.13:30090
- Pushgateway: http://192.168.20.13:30091
- Grafana: (Configure with your Grafana instance)
- Tokens per second by model
- GPU utilization and memory usage
- System resource consumption
- Model download progress and speed
- Response times and error rates
- System Optimization → Maximum hardware performance
- Model Download → Monitored with Prometheus
- Performance Testing → Comprehensive benchmarks
- Results Analysis → Grafana dashboards
- Optimization → Iterative improvements
- Logs: /mnt/nvme/qwen3_local/logs/
- Results: /mnt/nvme/qwen3_local/results/
- Metrics: Real-time Prometheus integration
Q2_K_M (Speed optimized):
- Size: ~65GB
- Performance: 25-40 tokens/sec
- Context: 32K tokens
- Use Case: Fast prototyping, simple tasks
Q3_K_L (Balanced performance):
- Size: ~97GB
- Performance: 15-30 tokens/sec
- Context: 65K tokens (with YaRN)
- Use Case: Daily development, most coding tasks
Q4_K_M (Quality optimized):
- Size: ~117GB
- Performance: 10-20 tokens/sec
- Context: 131K tokens (with YaRN)
- Use Case: Complex algorithms, production code
# GPU Configuration
export CUDA_VISIBLE_DEVICES=0,1
export NVIDIA_VISIBLE_DEVICES=all
# vLLM Optimizations
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2
# Prometheus Integration
export PROMETHEUS_PUSHGATEWAY_URL=http://192.168.20.13:30091
export PROMETHEUS_JOB_NAME=qwen3_performance
export PROMETHEUS_INSTANCE_NAME=local_testing
# Performance Tuning
export OMP_NUM_THREADS=24
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024
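A sketch of how a Python script might consume these variables, with the exported values as fallbacks (the actual collector's configuration handling may differ):

```python
import os

# Defaults mirror the exports above; values set in the environment win.
PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY_URL",
                                 "http://192.168.20.13:30091")
JOB_NAME = os.environ.get("PROMETHEUS_JOB_NAME", "qwen3_performance")
INSTANCE = os.environ.get("PROMETHEUS_INSTANCE_NAME", "local_testing")
OMP_THREADS = int(os.environ.get("OMP_NUM_THREADS", "24"))

print(PUSHGATEWAY_URL, JOB_NAME, INSTANCE, OMP_THREADS)
```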
Out of Memory Errors:
- Check GPU memory usage: nvidia-smi
- Reduce batch size or context length
- Ensure swap is configured for the Q4 model

Slow Performance:
- Verify system optimization: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- Check GPU clocks: nvidia-smi -q -d CLOCK
- Monitor CPU frequency: grep MHz /proc/cpuinfo

Monitoring Issues:
- Verify Pushgateway connectivity: curl http://192.168.20.13:30091/metrics
- Check log files: tail -f logs/metrics_collector.log
- Test metric collection: ./monitoring/metrics_collector.py --info
- Complete vLLM Configuration: Model-specific configs for Q2, Q3, Q4
- Create Model Launchers: Easy startup scripts for each model
- Build Testing Suite: Comprehensive performance benchmarks
- Create Grafana Dashboards: Visual monitoring templates
- Implement Automated Testing: CI/CD-style testing workflows
- Always run system optimization before testing
- Monitor resource usage in real-time with Grafana
- Use appropriate quantization for your use case
- Test with different context lengths to find optimal settings
- Keep logs and results for performance analysis
Status: Foundation complete, ready for vLLM integration and testing.
Next: vLLM configuration files and model launchers.