E.g., include these flags if the model fails to start: `--max-model-len 16384 --gpu-memory-utilization 0.95`
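A minimal sketch of a full invocation, assuming a vLLM server; the model name is a placeholder:

```
# Cap the context length and GPU memory budget so the engine fits in VRAM.
vllm serve <model-name> --max-model-len 16384 --gpu-memory-utilization 0.95
```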
- If you want to run larger models on GPUs with less VRAM, there are several techniques you can use to optimize GPU memory utilization:
- You can cap how much GPU memory the engine allocates with the `--gpu-memory-utilization` flag, which tells the model to use a specified fraction of the available GPU memory.
```
# This command sets the model to use 95% of the available GPU memory.
vllm serve <model-name> --gpu-memory-utilization 0.95
```
- Using mixed precision (FP16) instead of full precision (FP32) reduces the amount of memory required to store model weights, which can significantly lower VRAM usage.
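As an illustrative sketch, assuming a vLLM server (the model name is again a placeholder), half precision can be requested via the `--dtype` flag:

```
# Load the model weights in FP16 (half precision) rather than FP32,
# roughly halving the VRAM needed to store the weights.
vllm serve <model-name> --dtype float16
```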