Production-grade local testing setup with Prometheus + Grafana monitoring
This is a comprehensive testing environment for Qwen3-235B with all quantization levels (Q2, Q3, Q4), featuring:
- Real-time monitoring with Prometheus + Grafana integration
- System optimization for dual A6000 + i9-14900K setup
- Professional benchmarking with comprehensive metrics
- vLLM integration for maximum performance
- Automated deployment and testing workflows
- System Optimization: Hardware-specific tuning for maximum performance
- Dependencies Management: Clean Python environment with all required packages
- Model Downloader: Prometheus-monitored downloads with resume capability
- Metrics Collector: Real-time system monitoring with Grafana integration
- Professional Logging: Comprehensive logging and reporting
- vLLM configuration files
- Model launcher scripts
- Benchmark testing suite
- Grafana dashboard templates
/mnt/nvme/qwen3_local/
├── scripts/                       # Setup and utility scripts
│   ├── 01_system_optimize.sh      # System-level optimizations
│   ├── 02_install_dependencies.sh # Environment setup
│   └── 03_download_models.py      # Model downloader with monitoring
├── monitoring/                    # Prometheus monitoring
│   └── metrics_collector.py       # Real-time system metrics
├── models/                        # Model storage (Q2, Q3, Q4)
├── configs/                       # vLLM and system configurations
├── launchers/                     # Model launch scripts
├── tests/                         # Testing and benchmarking
├── dashboards/                    # Grafana dashboard templates
├── logs/                          # Runtime logs and reports
└── results/                       # Benchmark results and analysis
cd /mnt/nvme/qwen3_local
sudo ./scripts/01_system_optimize.sh
./scripts/02_install_dependencies.sh
source activate.sh
# Download all models (Q2, Q3, Q4)
./scripts/03_download_models.py download
# Download specific models
./scripts/03_download_models.py download --models q3
# Check download status
./scripts/03_download_models.py status
./monitoring/metrics_collector.py --interval 5.0
- System Metrics: CPU, memory, disk, network usage
- GPU Metrics: Utilization, memory, temperature, power
- Model Metrics: Download progress, inference performance
- vLLM Metrics: Service status, response times, tokens/sec
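The GPU figures above are typically scraped from `nvidia-smi`. A minimal stdlib sketch of that collection step, assuming the collector shells out with `--query-gpu` (field names and the dict layout are illustrative, not the actual `metrics_collector.py` code); the parser is exercised on a hardcoded sample so the snippet runs without a GPU:

```python
import csv
import io
import subprocess

# Fields queried per GPU; these map onto the GPU metrics listed above.
QUERY = "utilization.gpu,memory.used,temperature.gpu,power.draw"

def read_gpus():
    """Shell out to nvidia-smi and return one dict per GPU (requires a GPU)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)

def parse_gpu_csv(text):
    """Parse csv,noheader,nounits output into per-GPU metric dicts."""
    gpus = []
    for gpu_id, row in enumerate(csv.reader(io.StringIO(text))):
        util, mem_mib, temp_c, power_w = (float(v.strip()) for v in row)
        gpus.append({"gpu_id": gpu_id, "util_percent": util,
                     "mem_used_mib": mem_mib, "temp_c": temp_c,
                     "power_w": power_w})
    return gpus

# Hardcoded sample for a dual-GPU box, so the parser can be checked offline.
sample = "87, 40532, 64, 248.3\n91, 41210, 66, 251.0\n"
print(parse_gpu_csv(sample))
```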
- URL: http://192.168.20.13:30091
- Job: qwen3_performance
- Instance: local_testing
# Download metrics
qwen3_download_progress_percent{model="q2|q3|q4", file="filename"}
qwen3_download_speed_mbps{model="q2|q3|q4", file="filename"}
qwen3_download_status{model="q2|q3|q4", file="filename"}
# System metrics
qwen3_cpu_usage_percent{cpu="all|cpu0|cpu1...", type="total|core"}
qwen3_memory_usage_bytes{type="total|used|free|cached|buffers"}
qwen3_gpu_utilization_percent{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_memory_usage_bytes{gpu_id="0|1", name="gpu_name", type="used|total|free"}
qwen3_gpu_temperature_celsius{gpu_id="0|1", name="gpu_name"}
qwen3_gpu_power_watts{gpu_id="0|1", name="gpu_name"}
# vLLM metrics
qwen3_vllm_status{model="q2|q3|q4", port="8001|8002|8003"}
qwen3_vllm_tokens_per_second{model="q2|q3|q4"}
qwen3_vllm_response_time_seconds{model="q2|q3|q4", endpoint="completions"}
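These series reach Prometheus through the Pushgateway configured above. A minimal stdlib sketch of that push path, assuming the Pushgateway's standard `PUT /metrics/job/<job>/instance/<instance>` endpoint (the real collector may use the `prometheus_client` library instead); the push function is defined but not called, so the snippet runs offline:

```python
import urllib.request

def exposition_line(name, labels, value):
    """Format one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}\n"

def push(body, gateway="http://192.168.20.13:30091",
         job="qwen3_performance", instance="local_testing"):
    """PUT replaces all metrics for this job/instance group on the Pushgateway."""
    url = f"{gateway}/metrics/job/{job}/instance/{instance}"
    req = urllib.request.Request(url, data=body.encode(), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 on success

sample = exposition_line("qwen3_gpu_utilization_percent",
                         {"gpu_id": "0", "name": "RTX_A6000"}, 87)
print(sample, end="")
```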
- CPU: Intel i9-14900K (24 cores, 32 threads)
- GPU: Dual NVIDIA RTX A6000 (48GB VRAM each)
- RAM: 128GB DDR5-6400
- Storage: NVMe SSD with 300GB+ free space
- Q2_K_M: 25-40 tokens/sec (Speed optimized)
- Q3_K_L: 15-30 tokens/sec (Balanced performance)
- Q4_K_M: 10-20 tokens/sec (Quality optimized)
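One way to check these expectations once a model is serving: time a single request against vLLM's OpenAI-compatible `/v1/completions` endpoint and divide generated tokens by wall time. A stdlib sketch, where the port and model name are placeholders and the network call lives in a function so the snippet runs offline:

```python
import json
import time
import urllib.request

def throughput(completion_tokens, elapsed_s):
    """Tokens per second for one completion."""
    return completion_tokens / elapsed_s

def measure(base_url="http://localhost:8001", model="qwen3-q2",
            prompt="Write a haiku about GPUs.", max_tokens=256):
    """Time one vLLM /v1/completions call and return tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(base_url + "/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return throughput(data["usage"]["completion_tokens"], elapsed)

# Offline check of the arithmetic: 256 tokens in 10 s -> 25.6 tok/s,
# within the Q2_K_M range above.
print(throughput(256, 10.0))
```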
Monitor your Qwen3 performance in real-time:
- Prometheus: http://192.168.20.13:30090
- Pushgateway: http://192.168.20.13:30091
- Grafana: (Configure with your Grafana instance)
- Tokens per second by model
- GPU utilization and memory usage
- System resource consumption
- Model download progress and speed
- Response times and error rates
- System Optimization → Maximum hardware performance
- Model Download → Monitored with Prometheus
- Performance Testing → Comprehensive benchmarks
- Results Analysis → Grafana dashboards
- Optimization → Iterative improvements
- Logs: /mnt/nvme/qwen3_local/logs/
- Results: /mnt/nvme/qwen3_local/results/
- Metrics: Real-time Prometheus integration
Q2_K_M (Speed optimized):
- Size: ~65GB
- Performance: 25-40 tokens/sec
- Context: 32K tokens
- Use Case: Fast prototyping, simple tasks
Q3_K_L (Balanced performance):
- Size: ~97GB
- Performance: 15-30 tokens/sec
- Context: 65K tokens (with YaRN)
- Use Case: Daily development, most coding tasks
Q4_K_M (Quality optimized):
- Size: ~117GB
- Performance: 10-20 tokens/sec
- Context: 131K tokens (with YaRN)
- Use Case: Complex algorithms, production code
# GPU Configuration
export CUDA_VISIBLE_DEVICES=0,1
export NVIDIA_VISIBLE_DEVICES=all
# vLLM Optimizations
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2
# Prometheus Integration
export PROMETHEUS_PUSHGATEWAY_URL=http://192.168.20.13:30091
export PROMETHEUS_JOB_NAME=qwen3_performance
export PROMETHEUS_INSTANCE_NAME=local_testing
# Performance Tuning
export OMP_NUM_THREADS=24
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024
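A sketch of how a Python script might consume these variables, with the exported values as fallbacks (the actual collector's configuration handling may differ):

```python
import os

# Defaults mirror the exports above; values set in the environment win.
PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY_URL",
                                 "http://192.168.20.13:30091")
JOB_NAME = os.environ.get("PROMETHEUS_JOB_NAME", "qwen3_performance")
INSTANCE = os.environ.get("PROMETHEUS_INSTANCE_NAME", "local_testing")
OMP_THREADS = int(os.environ.get("OMP_NUM_THREADS", "24"))

print(PUSHGATEWAY_URL, JOB_NAME, INSTANCE, OMP_THREADS)
```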
Out of Memory Errors:
- Check GPU memory usage: nvidia-smi
- Reduce batch size or context length
- Ensure swap is configured for the Q4 model

Slow Performance:
- Verify system optimization: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- Check GPU clocks: nvidia-smi -q -d CLOCK
- Monitor CPU frequency: grep MHz /proc/cpuinfo

Monitoring Issues:
- Verify Pushgateway connectivity: curl http://192.168.20.13:30091/metrics
- Check log files: tail -f logs/metrics_collector.log
- Test metric collection: ./monitoring/metrics_collector.py --info
- Complete vLLM Configuration: Model-specific configs for Q2, Q3, Q4
- Create Model Launchers: Easy startup scripts for each model
- Build Testing Suite: Comprehensive performance benchmarks
- Create Grafana Dashboards: Visual monitoring templates
- Implement Automated Testing: CI/CD-style testing workflows
- Always run system optimization before testing
- Monitor resource usage in real-time with Grafana
- Use appropriate quantization for your use case
- Test with different context lengths to find optimal settings
- Keep logs and results for performance analysis
Status: Foundation complete, ready for vLLM integration and testing.
Next: vLLM configuration files and model launchers.