Description
The goal of this feature is to develop a highly efficient, pure-Go implementation for inference using Microsoft's BitNet b1.58‑2B 4T language model, optimized specifically for CPU environments, with potential future support for GPU acceleration. This implementation will handle language model inference with a context length of up to 4096 tokens, enabling practical text-generation and completion tasks. Leveraging BitNet's 1.58-bit ternary quantization (weights constrained to {-1, 0, +1}, typically packed at 2 bits per weight), it aims to achieve exceptionally low memory usage and high throughput by extensively using Go's native bitwise operations and scalable goroutine-based concurrency across multiple CPU cores. The resulting inference engine will be lightweight, scalable, and suitable for both edge and cloud environments.
This roadmap outlines a sequence of small, sequential tasks to implement Microsoft’s BitNet b1.58‑2B 4T model in pure Go (inference-only). The implementation aims to support a 4096-token context and leverage goroutine-based concurrency to utilize multiple CPU cores.
- Model: BitNet-b1.58-2B-4T
- Research Paper: https://arxiv.org/abs/2310.11453
- Our development branch: `bitnet`
- In addition to the sub-task list below, you can find all work issues related to this feature using the labels `bitnet` and `task`
- Feedback welcome! See the draft PR to `main`
- Discord: HyperifyIO