0.13.6
This version adds two new arguments:
--net-turbo 0 - disables non-blocking sockets.
--gpu-segments <from>:<to> - specifies which segments of the neural network are loaded onto the GPU. Currently, this option is intended only for skipping the first layer (embedding). Other settings may not work.
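For readers unfamiliar with the term, the following is a minimal Python sketch of what toggling non-blocking sockets means at the socket level. It is an illustration only, not this project's implementation: the helper name apply_net_turbo and the assumption that any non-zero value means "non-blocking" are hypothetical.

```python
import socket


def apply_net_turbo(sock: socket.socket, net_turbo: int) -> None:
    """Hypothetical helper: 0 keeps the socket blocking,
    any other value switches it to non-blocking mode (assumed mapping)."""
    sock.setblocking(net_turbo == 0)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Non-blocking: recv/send return immediately instead of waiting.
    apply_net_turbo(sock, 1)
    turbo_on_blocking = sock.getblocking()

    # Blocking: recv/send wait until data can be transferred.
    apply_net_turbo(sock, 0)
    turbo_off_blocking = sock.getblocking()
finally:
    sock.close()
```

In blocking mode a read waits for data, which is simpler and sometimes more robust; in non-blocking mode the caller has to poll or retry, which can reduce latency at the cost of extra CPU work.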
These options made it possible to run Llama 3.3 70B Q40 on 4 x NVIDIA RTX 3060 12 GB. Check this test.