A high-performance Go library implementing the FP8 E4M3FN format, an IEEE 754-style 8-bit floating-point encoding commonly used in machine learning applications for reduced-precision computations.
- FP8 E4M3FN Format: Complete implementation of the 8-bit floating-point format
- High Performance: Optimized arithmetic operations with optional fast lookup tables
- Comprehensive API: Full support for conversion, arithmetic, and mathematical operations
- Machine Learning Ready: Designed for ML workloads requiring reduced precision
- Zero Dependencies: Pure Go implementation with no external dependencies
The Float8 type uses the E4M3FN variant of FP8, which follows IEEE 754 conventions in an 8-bit layout:
- 1 bit: Sign (0 = positive, 1 = negative)
- 4 bits: Exponent (biased by 7; unbiased range [-6, 8] for normal numbers, giving a maximum finite value of 448)
- 3 bits: Mantissa (3 explicit bits, 1 implicit leading bit for normal numbers)
- Zero: Exponent=0000, Mantissa=000 (both positive and negative)
- NaN: Exponent=1111, Mantissa=111
- No Infinities: The E4M3FN variant does not support infinity values
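The bit layout above can be sketched as a small standalone decoder. This is an illustration of the format only, not the library's implementation; the name `decodeE4M3FN` is hypothetical, and the sketch assumes the usual E4M3FN rule that an all-zero exponent field encodes zero and subnormals (no implicit leading bit).

```go
package main

import (
	"fmt"
	"math"
)

// decodeE4M3FN interprets b as sign(1) | exponent(4) | mantissa(3).
// Hypothetical helper for illustration; not part of the float8 package.
func decodeE4M3FN(b uint8) float64 {
	sign := 1.0
	if b>>7 == 1 {
		sign = -1.0
	}
	exp := int(b>>3) & 0x0F // 4-bit biased exponent
	man := int(b & 0x07)    // 3-bit mantissa

	switch {
	case exp == 0x0F && man == 0x07:
		// The only NaN encoding (for either sign bit).
		return math.NaN()
	case exp == 0:
		// Zero and subnormals: value = (man/8) * 2^-6.
		return sign * math.Ldexp(float64(man)/8, -6)
	default:
		// Normal numbers: value = (1 + man/8) * 2^(exp-7).
		return sign * math.Ldexp(1+float64(man)/8, exp-7)
	}
}

func main() {
	for _, b := range []uint8{0x00, 0x01, 0x38, 0x7E, 0x7F, 0xC0} {
		fmt.Printf("0x%02X  %08b  %g\n", b, b, decodeE4M3FN(b))
	}
}
```

In this sketch 0x38 decodes to 1, 0x7E to 448 (the largest finite value), and 0x7F to NaN. Exponent field 1111 with any other mantissa is an ordinary finite number, which is where E4M3FN gains range compared with a format that reserves that exponent for infinities.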
go get github.com/zerfoo/float8
package main

import (
	"fmt"

	"github.com/zerfoo/float8"
)

func main() {
	// Initialize the package (optional, done automatically)
	float8.Initialize()

	// Create Float8 values from float32
	a := float8.FromFloat32(3.14)
	b := float8.FromFloat32(2.71)

	// Perform arithmetic operations
	sum := a.Add(b)
	product := a.Mul(b)

	// Convert back to float32
	fmt.Printf("a = %f\n", a.ToFloat32())
	fmt.Printf("b = %f\n", b.ToFloat32())
	fmt.Printf("a + b = %f\n", sum.ToFloat32())
	fmt.Printf("a * b = %f\n", product.ToFloat32())
}
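With only three mantissa bits, inputs such as 3.14 and 2.71 cannot be stored exactly, so the printed values will differ from the literals. The standalone sketch below estimates what to expect. It assumes round-to-nearest with ties to even and saturation at ±448; both are assumptions about conversion behavior rather than documented properties of this library, and `roundToE4M3FN` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

const maxE4M3FN = 448.0 // largest finite E4M3FN magnitude

// roundToE4M3FN returns the E4M3FN-representable value nearest to x.
// Assumptions: ties round to even, out-of-range values saturate at ±448.
func roundToE4M3FN(x float64) float64 {
	if math.IsNaN(x) || x == 0 {
		return x
	}
	// Frexp yields x = frac * 2^exp with |frac| in [0.5, 1),
	// so the IEEE-style exponent of x is exp-1.
	_, exp := math.Frexp(x)
	e := exp - 1
	if e < -6 {
		e = -6 // subnormals share the smallest normal exponent
	}
	// Neighbouring values in this binade are 2^(e-3) apart (3 mantissa bits).
	step := math.Ldexp(1, e-3)
	q := math.RoundToEven(x/step) * step
	if math.Abs(q) > maxE4M3FN {
		return math.Copysign(maxE4M3FN, x)
	}
	return q
}

func main() {
	a := roundToE4M3FN(3.14)
	b := roundToE4M3FN(2.71)
	fmt.Println("a     =", a)
	fmt.Println("b     =", b)
	fmt.Println("a + b =", roundToE4M3FN(a+b))
	fmt.Println("a * b =", roundToE4M3FN(a*b))
}
```

Under those assumptions 3.14 becomes 3.25 and 2.71 becomes 2.75, so the sum is 6 and the product 8.9375 rounds to 9. The library's own output may differ if it rounds or handles overflow differently.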
The package can be tuned through a Config value, which toggles the lookup tables and selects the default conversion and arithmetic modes:
// Configure with custom settings
config := &float8.Config{
	EnableFastArithmetic: true, // Enable lookup tables for faster arithmetic
	EnableFastConversion: true, // Enable lookup tables for faster conversion
	DefaultMode:          float8.ModeDefault,
	ArithmeticMode:       float8.ArithmeticAuto,
}
float8.Configure(config)
- Float8: The main 8-bit floating-point type
- Config: Configuration options for the package
// From other numeric types
func FromFloat32(f float32) Float8
func FromFloat64(f float64) Float8
func FromInt(i int) Float8
// To other numeric types
func (f Float8) ToFloat32() float32
func (f Float8) ToFloat64() float64
func (f Float8) ToInt() int
// Arithmetic operations
func (f Float8) Add(other Float8) Float8
func (f Float8) Sub(other Float8) Float8
func (f Float8) Mul(other Float8) Float8
func (f Float8) Div(other Float8) Float8
// Mathematical functions
func (f Float8) Abs() Float8
func (f Float8) Neg() Float8
func (f Float8) Sqrt() Float8
// ... and more
// Classification and formatting
func (f Float8) IsZero() bool
func (f Float8) IsNaN() bool
func (f Float8) IsInf() bool
func (f Float8) String() string
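Given the encodings listed for zero and NaN, classification reduces to a mask and compare on the raw byte. The sketch below works on plain `uint8` bit patterns with hypothetical helpers `isNaN` and `isZero`; it illustrates the format and is not the library's implementation of `IsNaN` or `IsZero`.

```go
package main

import "fmt"

// Hypothetical helpers operating on raw E4M3FN bit patterns.
// Masking with 0x7F drops the sign bit before comparing.
func isNaN(b uint8) bool  { return b&0x7F == 0x7F } // exponent 1111, mantissa 111
func isZero(b uint8) bool { return b&0x7F == 0x00 } // +0 (0x00) or -0 (0x80)

func main() {
	nans, zeros := 0, 0
	for i := 0; i < 256; i++ {
		b := uint8(i)
		if isNaN(b) {
			nans++
		}
		if isZero(b) {
			zeros++
		}
	}
	fmt.Printf("NaN: %d, zero: %d, finite non-zero: %d\n",
		nans, zeros, 256-nans-zeros)
}
```

Counting over all 256 patterns gives 2 NaN encodings, 2 zeros, and 252 finite non-zero values, leaving no bit pattern for an infinity under this layout.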
The library offers two performance modes:
- Standard Mode: Compact implementation with minimal memory usage
- Fast Mode: Uses pre-computed lookup tables for faster operations at the cost of memory
Enable fast mode for performance-critical applications:
float8.EnableFastArithmetic()
float8.EnableFastConversion()
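To make the memory trade-off concrete, here is a rough standalone sketch of how lookup tables for an 8-bit format can be built. Since there are only 256 bit patterns, a decode table needs 256 entries and a table for a binary operation needs 256 x 256 one-byte results (64 KiB). The names `toFloat`, `addTable`, `decode`, `encode`, and `buildTables` are hypothetical, and the rounding and saturation rules are assumptions; the library's actual tables may be organized differently.

```go
package main

import (
	"fmt"
	"math"
)

var (
	toFloat  [256]float64    // decode table: bit pattern -> value
	addTable [256][256]uint8 // result bits for every operand pair (64 KiB)
)

// decode converts an E4M3FN bit pattern to its numeric value.
func decode(b uint8) float64 {
	exp, man := int(b>>3)&0x0F, float64(b&0x07)
	if exp == 0x0F && b&0x07 == 0x07 {
		return math.NaN()
	}
	v := math.Ldexp(1+man/8, exp-7) // normal number
	if exp == 0 {
		v = math.Ldexp(man/8, -6) // zero or subnormal
	}
	if b&0x80 != 0 {
		v = -v
	}
	return v
}

// encode returns the bit pattern nearest to x, ties to the even code.
// Saturating at ±448 is an assumption of this sketch. Brute force is
// acceptable because it only runs while the tables are being built.
func encode(x float64) uint8 {
	if math.IsNaN(x) {
		return 0x7F
	}
	var sign uint8
	if math.Signbit(x) {
		sign = 0x80
	}
	mag := math.Min(math.Abs(x), 448)
	best, bestDist := uint8(0), math.Inf(1)
	for m := uint8(0); m < 0x7F; m++ { // 0x7F is NaN, so stop before it
		d := math.Abs(mag - toFloat[m])
		if d < bestDist || (d == bestDist && m&1 == 0) {
			best, bestDist = m, d
		}
	}
	return sign | best
}

// buildTables fills the decode table first, then the addition table.
func buildTables() {
	for i := 0; i < 256; i++ {
		toFloat[i] = decode(uint8(i))
	}
	for a := 0; a < 256; a++ {
		for b := 0; b < 256; b++ {
			addTable[a][b] = encode(toFloat[a] + toFloat[b])
		}
	}
}

func main() {
	buildTables()
	a, b := uint8(0x45), uint8(0x43) // bit patterns for 3.25 and 2.75
	sum := addTable[a][b]            // addition is now a single array load
	fmt.Printf("%g + %g = %g (bits 0x%02X)\n",
		toFloat[a], toFloat[b], toFloat[sum], sum)
}
```

Once the tables exist, each addition or conversion is one indexed load, which is the speed-for-memory exchange that fast mode describes. A table per binary operation costs 64 KiB in this layout, so enabling every operation multiplies that footprint.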
Run the comprehensive test suite:
# Run all tests
go test ./...
# Run tests with coverage
go test -cover ./...
# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
Run performance benchmarks:
go test -bench=. -benchmem ./...
- Machine Learning: Reduced precision training and inference
- Neural Networks: Memory-efficient model parameters
- Scientific Computing: Applications requiring controlled precision
- Embedded Systems: Resource-constrained environments
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- IEEE 754 standard for floating-point arithmetic
- The machine learning community for driving FP8 adoption
- Contributors and maintainers of this project