
Run CI on Modal, upgrade Bitsandbytes #641


Merged
58 commits merged on Mar 15, 2025

Commits
fc52696
Run CI on Modal, upgrade Bitsandbytes
mryab Feb 10, 2025
58f3d44
Add docs configuration
mryab Feb 10, 2025
6d36cd1
Fix formatting
mryab Feb 10, 2025
ab714bd
Configure concurrency for Modal tests
mryab Feb 10, 2025
c840ab9
Sort imports
mryab Feb 10, 2025
f717bf6
Set up the timeout
mryab Feb 10, 2025
0dca5a2
Set up concurrency for other actions as well
mryab Feb 10, 2025
11feccf
Remove concurrency limits
mryab Feb 10, 2025
cbf4450
Add concurrency, update bitsandbytes in dependencies
mryab Feb 10, 2025
4f303bd
Add cache, bump CI versions
mryab Feb 10, 2025
6a5ec5e
Skip test_allreduce_protocol for the time being
mryab Feb 10, 2025
ba3e386
Reduce the number of CPUs
mryab Feb 10, 2025
1fb8dec
Decrease the limits in test_dht_connection_successful
mryab Feb 10, 2025
67e040f
Restore the limits in test_dht_connection_successful
mryab Feb 10, 2025
c0af379
Clear the blacklist before attempting store
mryab Feb 10, 2025
6116570
Increase the wait in test_load_state_from_peers
mryab Feb 10, 2025
801bb4f
Parametrize tests by Python version, upload Codecov coverage
mryab Feb 11, 2025
fd69b64
Check out and build a specific version of bitsandbytes
mryab Feb 11, 2025
22739f5
Increase the timeouts to account for image builds
mryab Feb 11, 2025
635879f
Introduce timeouts
mryab Feb 22, 2025
8fbd9dd
Increase the number of CPUs for tests
mryab Feb 22, 2025
d70b4b9
Make tests more robust
mryab Feb 23, 2025
4254468
Make tests more robust
mryab Feb 23, 2025
1753bae
Reformat the code
mryab Feb 23, 2025
4753fef
Mark test_client_disconnect as flaky
mryab Feb 23, 2025
9705318
Build and test p2pd separately
mryab Feb 23, 2025
ae5ed98
Install Go only for a specific image
mryab Feb 23, 2025
11eb277
Don't use uv when building p2pd
mryab Feb 23, 2025
9d37fe9
Mark test_dhtnode_blacklist as flaky
mryab Feb 23, 2025
7abc9f0
Increase timeouts
mryab Feb 23, 2025
5b69835
Make test_averaging_trigger more robust
mryab Feb 23, 2025
9e37679
Download codecov with wget
mryab Feb 23, 2025
aa20215
Skip all training tests for the time being
mryab Feb 23, 2025
a03288e
Skip test_allgather for the time being
mryab Feb 23, 2025
a614a02
Mark test_performance_ema_threadsafe and test_remote_expert_worker_ru…
mryab Feb 23, 2025
2cfc94a
Reduce timeouts, mark test_background_server_identity_path as flaky
mryab Feb 23, 2025
df048db
Mention sponsorship by Prime Intellect
mryab Feb 23, 2025
e388e07
Fix missing import
mryab Feb 23, 2025
98e6a38
Mark flaky tests
mryab Feb 23, 2025
b317b29
Modify the codecov command
mryab Feb 23, 2025
66c9187
Pass extra environment variables to codecov
mryab Feb 23, 2025
93460aa
Remove --dist from codecov run
mryab Feb 23, 2025
75529a1
Pass GITHUB_EVENT_PULL_REQUEST_HEAD_SHA when running the test
mryab Feb 23, 2025
83b53bb
Mark test_fault_tolerance as flaky
mryab Feb 23, 2025
5984bad
Mark test_cli_run_server_identity_path as flaky
mryab Feb 23, 2025
2f67c52
Disable parallel execution for codecov management
mryab Feb 23, 2025
e8efb66
Increase codecov run timeout to 15 minutes
mryab Feb 23, 2025
f8ad2a8
Pass GITHUB_EVENT_PULL_REQUEST_HEAD_SHA to the workflow
mryab Feb 23, 2025
225439e
Pass additional secrets
mryab Feb 23, 2025
3695813
Mark one more test as flaky
mryab Feb 23, 2025
6bac780
Mark another test as flaky
mryab Feb 23, 2025
3228dfd
Pass codecov values explicitly
mryab Feb 23, 2025
0a9347d
Pass --no-use-pep517 to uv pip install
mryab Feb 23, 2025
87f0ece
Change uv pip to pip
mryab Feb 23, 2025
46fa9f5
Extract the blocksize for quantization into a constant
mryab Mar 15, 2025
717dd34
Fix missing newline
mryab Mar 15, 2025
0fcd2ba
Rewrite test_averaging_trigger with time.monotonic
mryab Mar 15, 2025
cfa51d2
Replace os.unlink with os.remove
mryab Mar 15, 2025
12 changes: 8 additions & 4 deletions .github/workflows/check-style.yml
@@ -5,20 +5,24 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
black:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: psf/black@stable
with:
options: "--check --diff"
version: "22.3.0"
isort:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v3
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: 3.11
- uses: isort/isort-action@master
@@ -28,7 +32,7 @@ jobs:
codespell:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@v1
with:
only_warn: 1
6 changes: 5 additions & 1 deletion .github/workflows/push-docker-image.yml
@@ -8,13 +8,17 @@ on:
pull_request:
branches: [ master ]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Docker meta
id: meta
12 changes: 8 additions & 4 deletions .github/workflows/run-benchmarks.yml
@@ -5,19 +5,23 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_benchmarks:

runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.11-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
@@ -28,7 +32,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
112 changes: 112 additions & 0 deletions .github/workflows/run-tests-on-modal.yml
@@ -0,0 +1,112 @@
name: Modal tests

on:
push:
branches: [master]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
fail-fast: false
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
PYTHON_VERSION: ${{ matrix.python-version }}
timeout-minutes: 15
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Run tests
run: |
modal run modal_ci.py::run_tests

measure_coverage:
runs-on: ubuntu-latest
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
GITHUB_EVENT_NAME: ${{ github.event_name }}
GITHUB_EVENT_NUMBER: ${{ github.event.number }}
GITHUB_EVENT_PULL_REQUEST_HEAD_SHA: ${{ github.event.pull_request.head.sha }}
PYTHON_VERSION: "3.11"
timeout-minutes: 15
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Measure and upload coverage
run: |
modal run modal_ci.py::run_codecov

build_and_test_p2pd:
runs-on: ubuntu-latest
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
PYTHON_VERSION: "3.11"
timeout-minutes: 10
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Run p2pd tests
run: |
modal run modal_ci.py::build_and_test_p2pd
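
All three jobs above delegate the actual work to functions defined in modal_ci.py (`modal run modal_ci.py::run_tests` and so on), but that file does not appear in this excerpt of the diff. For orientation, here is a minimal sketch of what such a Modal entrypoint could look like; the image definition, resource settings, and the way the repository checkout reaches the container are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch of modal_ci.py (not the PR's actual file).
import os
import subprocess

import modal

# Build a container image with the project's dependencies; PYTHON_VERSION is
# forwarded from the workflow's env, so the job matrix controls the interpreter.
image = (
    modal.Image.debian_slim(python_version=os.environ.get("PYTHON_VERSION", "3.11"))
    .pip_install_from_requirements("requirements.txt")
    .pip_install_from_requirements("requirements-dev.txt")
)

app = modal.App("hivemind-ci", image=image)


@app.function(cpu=8, timeout=900)  # 15 minutes, matching the workflow's timeout-minutes
def run_tests() -> None:
    # How the repository checkout is made available inside the container
    # (baked into the image, mounted, etc.) is omitted in this sketch.
    result = subprocess.run(["pytest", "--durations=0", "tests"], check=False)
    if result.returncode != 0:
        raise RuntimeError(f"pytest exited with code {result.returncode}")
```

On the GitHub runner, `modal run modal_ci.py::run_tests` then executes the function remotely on Modal, authenticated via the MODAL_TOKEN_ID / MODAL_TOKEN_SECRET secrets exported in the job env.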
20 changes: 11 additions & 9 deletions .github/workflows/run-tests.yml
@@ -1,9 +1,11 @@
name: Tests

on:
push:
branches: [ master ]
pull_request:
# Tests in GHA only run manually, see run-tests-on-modal.yml for the same tests in CI
on: workflow_dispatch

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:
@@ -15,13 +17,13 @@ jobs:
fail-fast: false
timeout-minutes: 15
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-${{ matrix.python-version }}-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
@@ -32,7 +34,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
@@ -94,7 +96,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install -e . --no-use-pep517
1 change: 1 addition & 0 deletions .readthedocs.yml
@@ -2,6 +2,7 @@ version: 2

sphinx:
fail_on_warning: true
configuration: docs/conf.py

python:
install:
4 changes: 4 additions & 0 deletions README.md
@@ -118,6 +118,10 @@ the [contributing guidelines](https://github.com/learning-at-home/hivemind/blob/
more about other ways to contribute, read
our [guide](https://learning-at-home.readthedocs.io/en/latest/user/contributing.html).

## Collaborators and Sponsorship

* [Prime Intellect](https://www.primeintellect.ai/) sponsors compute resources on [Modal](https://modal.com/) for CI

## Citation

If you found hivemind or its underlying algorithms useful for your research, please cite the following source:
6 changes: 3 additions & 3 deletions hivemind/compression/base.py
@@ -107,14 +107,14 @@ def extract(self, serialized_tensor: runtime_pb2.Tensor) -> torch.Tensor:
if serialized_tensor.dtype == "bfloat16":
numel = shape.numel()
if numel > 0 and len(serialized_tensor.buffer) // numel == 4:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.float32)
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.float32)
tensor = torch.as_tensor(array, dtype=torch.bfloat16)
else:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.int16)
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.int16)
# reinterpret_cast from an arbitrary 2-byte type supported by numpy
tensor = torch.as_tensor(array).view(torch.bfloat16)
else:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.dtype(serialized_tensor.dtype))
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.dtype(serialized_tensor.dtype))
tensor = torch.as_tensor(array)
return tensor.reshape(shape)

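A note on the bytearray() wrapping above: this is a reading of the change rather than something stated in the diff. np.frombuffer over the immutable bytes of a protobuf field returns a read-only array, and torch.as_tensor on a read-only NumPy array warns that the array is not writable (and would share memory with the buffer). Copying into a bytearray first yields a writable array. A minimal sketch:

```python
# Minimal illustration (an assumption about the motivation, not taken from the PR).
import numpy as np
import torch

buf = torch.arange(4, dtype=torch.float32).numpy().tobytes()  # immutable bytes

readonly = np.frombuffer(buf, dtype=np.float32)              # backed by bytes -> not writable
writable = np.frombuffer(bytearray(buf), dtype=np.float32)   # bytearray copy -> writable

print(readonly.flags.writeable, writable.flags.writeable)  # False True
# torch.as_tensor(readonly) emits a "The given NumPy array is not writable" UserWarning;
# torch.as_tensor(writable) converts silently.
tensor = torch.as_tensor(writable)
```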
16 changes: 13 additions & 3 deletions hivemind/compression/quantization.py
@@ -14,6 +14,7 @@
warnings.filterwarnings("ignore", module="bitsandbytes", category=UserWarning)

EXECUTOR = ThreadPoolExecutor(max_workers=int(os.environ.get("QUANTIZATION_THREADS", 128)))
_BLOCKWISE_QUANTIZATION_BLOCKSIZE = 4096


class Quantization(CompressionBase, ABC):
@@ -140,8 +141,15 @@ def quantize(
except ImportError:
raise ImportError(BNB_MISSING_MESSAGE)

quantized, (absmax, codebook, *extra_params) = quantize_blockwise(tensor, blocksize=4096, nested=False)
assert tuple(extra_params) == self.EXTRA_PARAMS # blocksize, nested, dtype, offset, state2
assert tensor.dtype == torch.float32

quantized, quant_state = quantize_blockwise(tensor, blocksize=_BLOCKWISE_QUANTIZATION_BLOCKSIZE, nested=False)
absmax, codebook = quant_state.absmax, quant_state.code
assert quant_state.blocksize == _BLOCKWISE_QUANTIZATION_BLOCKSIZE
assert quant_state.nested is False
assert quant_state.dtype == self.EXTRA_PARAMS[2]
assert quant_state.offset == self.EXTRA_PARAMS[3]
assert quant_state.state2 == self.EXTRA_PARAMS[4]
return quantized.numpy(), (absmax.numpy(), codebook.numpy())

def compress(self, tensor: torch.Tensor, info: CompressionInfo, allow_inplace: bool = False) -> runtime_pb2.Tensor:
@@ -187,5 +195,7 @@ def extract(self, serialized_tensor: runtime_pb2.Tensor) -> torch.Tensor:
absmax = torch.as_tensor(absmax)
codebook = torch.as_tensor(codebook)
quantized = torch.as_tensor(quantized).reshape(tuple(serialized_tensor.size))
result = dequantize_blockwise(quantized, (absmax, codebook, *self.EXTRA_PARAMS))
result = dequantize_blockwise(
quantized, absmax=absmax, code=codebook, blocksize=_BLOCKWISE_QUANTIZATION_BLOCKSIZE, nested=False
)
return result.to(getattr(torch, serialized_tensor.dtype)).requires_grad_(serialized_tensor.requires_grad)
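
The rewrite above tracks the bitsandbytes >= 0.45 interface, where quantize_blockwise returns a QuantState object instead of a plain tuple and dequantize_blockwise accepts absmax, code, and blocksize as keyword arguments. A small round-trip sketch under that assumption (the random input and printed error are illustrative only):

```python
# Round-trip sketch against the bitsandbytes>=0.45 blockwise API used above.
import torch
from bitsandbytes.functional import dequantize_blockwise, quantize_blockwise

BLOCKSIZE = 4096  # mirrors _BLOCKWISE_QUANTIZATION_BLOCKSIZE

tensor = torch.randn(16384, dtype=torch.float32)

# New API: returns the 8-bit payload plus a QuantState carrying absmax, code, blocksize, ...
quantized, quant_state = quantize_blockwise(tensor, blocksize=BLOCKSIZE, nested=False)

restored = dequantize_blockwise(
    quantized,
    absmax=quant_state.absmax,
    code=quant_state.code,
    blocksize=BLOCKSIZE,
    nested=False,
)

assert restored.shape == tensor.shape
print((restored - tensor).abs().max())  # small quantization error from 8-bit blockwise compression
```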