
Run large language models in a heterogeneous decentralized environment with offloading.
The rapid rise of generative AI has boosted demand for large language model (LLM) inference and fine-tuning services. While proprietary models are still favored, advancements in open-source LLMs have made them competitive. However, high costs and limited GPU resources hinder their deployment. This work introduces BloomBee, a decentralized offline serving system that leverages idle GPU resources to provide cost-effective access to LLMs.
BloomBee relies on global GPU sharing, including consumer-grade GPUs. If your GPU can hold only a small portion of a large language model, such as Llama 3.1 (405B), you can join a network of servers that each load a different part of the model and request inference or fine-tuning services from that network.
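Once such a swarm is running (setup steps below), a client can request inference from it. A minimal sketch, assuming BloomBee exposes a Petals-style `AutoDistributedModelForCausalLM` (check the repo for the exact client API):

```python
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed import path

# Replace with the --initial_peers address printed by your DHT server.
INITIAL_PEERS = ["/ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/QmefxzDL1DaJ7TcrZjLuz7Xs9sUVKpufyg7f5276ZHFjbQ"]

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoDistributedModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", initial_peers=INITIAL_PEERS
)

# Forward passes run across remote workers that each hold a slice of the model.
inputs = tokenizer("The capital of France is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```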
Install the latest release from PyPI:
pip install bloombee
Or install from source:
git clone https://github.com/ai-decentralized/BloomBee.git
cd BloomBee
python3 -m venv bloombee-venv
source bloombee-venv/bin/activate
pip install -e .
pip install pynvml
pip install attrs
Hivemind is required by BloomBee; install it from source as well:
git clone https://github.com/learning-at-home/hivemind
cd hivemind
pip install -e .
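As a quick sanity check, both packages should now import cleanly (the `bloombee` import path is assumed from the package name):

```python
# Verify that the environment is set up: both imports should succeed.
import bloombee  # assumed top-level package name
import hivemind

print("hivemind version:", hivemind.__version__)
```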
How to use BloomBee (try now in Colab)
First, start the main server (a DHT bootstrap peer):
python -m bloombee.cli.run_dht --host_maddrs /ip4/0.0.0.0/tcp/31340 --identity_path bootstrapp1.id
The command prints the BloomBee main server's address:
Mon 00 01:23:45.678 [INFO] Running a DHT instance. To connect other peers to this one, use --initial_peers /ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/QmefxzDL1DaJ7TcrZjLuz7Xs9sUVKpufyg7f5276ZHFjbQ
You can provide this address as --initial_peers to workers or other backbone servers.
If you want your swarm to be accessible outside of your local network, ensure that you have a public IP address or set up port forwarding correctly, so that your peer is reachable from the outside.
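If you prefer to start the bootstrap peer from Python rather than the CLI, a sketch using the hivemind API (which run_dht wraps) looks like this; `host_maddrs` and `identity_path` mirror the CLI flags above:

```python
# Sketch: start a DHT bootstrap peer programmatically with hivemind.
import hivemind

dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/31340"],  # same as --host_maddrs
    identity_path="bootstrapp1.id",          # same as --identity_path
    start=True,
)

# Print the multiaddrs that other peers can use as --initial_peers.
for addr in dht.get_visible_maddrs():
    print("--initial_peers", addr)
```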
Save the BloomBee server address in an environment variable (substitute the address printed on your machine):
export BBSERVER=/ip4/10.52.2.249/tcp/31340/p2p/QmefxzDL1DaJ7TcrZjLuz7Xs9sUVKpufyg7f5276ZHFjbQ
To set up the workers, connect to the GPU instances you will use (if accessing a remote instance over SSH):
chmod 400 ~/.ssh/<YOURKEYPAIR>.pem
ssh -i ~/.ssh/<YOURKEYPAIR>.pem cc@<FLOATING_IP>
Next, set up the BloomBee environment on each worker:
git clone https://github.com/ai-decentralized/BloomBee.git
cd BloomBee
python3 -m venv bloombee-venv
source bloombee-venv/bin/activate
pip install -e .
pip install pynvml
pip install attrs
git clone https://github.com/learning-at-home/hivemind
cd hivemind
pip install -e .
Start the first worker to host 16 blocks (16 transformer layers):
python -m bloombee.cli.run_server huggyllama/llama-7b --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_1.id
Start a second worker to host the remaining 16 blocks (16 transformer layers). Give each peer its own identity file (e.g., bootstrap_2.id) so the two workers don't collide:
python -m bloombee.cli.run_server huggyllama/llama-7b --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_2.id
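Why 16 blocks per worker: huggyllama/llama-7b has 32 transformer layers, so two workers with 16 blocks each cover the entire model. A back-of-the-envelope sketch of that split (illustrative only; the actual block assignment is coordinated through the DHT):

```python
# Illustrative only: how llama-7b's 32 layers divide across two workers.
NUM_LAYERS = 32   # transformer layers in huggyllama/llama-7b
NUM_WORKERS = 2
per_worker = NUM_LAYERS // NUM_WORKERS  # 16 blocks each

for i in range(NUM_WORKERS):
    start = i * per_worker
    end = start + per_worker - 1
    print(f"worker {i + 1}: blocks {start}-{end}")
```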
If your workers fail to start due to IP connection resets, update the configuration files that list the workers' IP addresses.
If a bitsandbytes error comes up, rebuild it from source:
cd ~/BloomBee
rm -rf bitsandbytes
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
Then build and install; the exact steps vary by bitsandbytes version, but a typical CUDA build (see the bitsandbytes documentation) is:
cmake -DCOMPUTE_BACKEND=cuda -S .
make
pip install -e .
If necessary, make sure the build picks up the correct CUDA library paths for your installed CUDA version.
Run the inference benchmark:
cd BloomBee/
python benchmarks/benchmark_inference.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --seq_len 128
Run the training (fine-tuning) benchmark:
cd BloomBee/
python benchmarks/benchmark_training.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --n_steps 20 --batch_size 32 --seq_len 128
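Beyond the benchmarks, you can fine-tune through the swarm from Python. A hedged sketch, assuming BloomBee inherits Petals' prompt-tuning interface (`tuning_mode` and `pre_seq_len` are Petals parameters; verify them in the BloomBee repo) and that BBSERVER is exported as above:

```python
# Sketch: prompt-tune over the swarm; gradients flow through remote blocks
# while only the local soft-prompt parameters are updated.
import os
import torch
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed import path

initial_peers = [os.environ["BBSERVER"]]  # address exported earlier

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoDistributedModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    initial_peers=initial_peers,
    tuning_mode="ptune",  # train soft prompts; base weights stay on workers
    pre_seq_len=16,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = tokenizer("Example training text.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
print("loss:", loss.item())
```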
BloomBee is built upon a few popular libraries:
- Hivemind - A PyTorch library for decentralized deep learning across the Internet.
- FlexLLMGen - An offloading-based system running on weak GPUs.
- Petals - A library for decentralized LLM fine-tuning and inference without offloading.