-
@jim-plus Is there any progress on FP8 support? I am working on an AMD GPU and also need this FP8 feature.
-
It's straightforward enough, but the accuracy is worse than int8.

    #include <math.h>
    #include <stdint.h>

    typedef union {
        float value;   /**< Floating-point value */
        uint32_t bits; /**< Raw bit representation */
    } Float32;

    // Encode a float into an 8-bit floating-point representation:
    // 1 sign bit, 3 exponent bits (bias 3), 4 mantissa bits
    uint8_t encode_float8(float value) {
        if (value == 0.0f) {
            return 0; // Encoded as all zeros
        }
        Float32 encoder = {.value = value};
        // Extract IEEE-754 components
        uint32_t sign = (encoder.bits >> 31) & 0x1;
        int32_t exponent = (int32_t) ((encoder.bits >> 23) & 0xff);
        uint32_t mantissa = encoder.bits & 0x7fffff;
        // Define bias parameters
        const int32_t e_bias_32 = 127;
        const int32_t e_bias_8 = 3;
        // Define exponent limits
        const int32_t e_max = 7;
        const int32_t e_min = 0;
        // Rebias the exponent in signed arithmetic; with unsigned math a small
        // input wraps around and ends up clamped to e_max instead of e_min
        int32_t e_compressed = exponent - e_bias_32 + e_bias_8;
        // Calculate compressed mantissa (top 4 bits of the 23-bit mantissa, truncated)
        uint32_t m_compressed = (mantissa >> 19) & 0xf;
        if (e_compressed < e_min) {
            return 0; // Too small to represent: flush to zero
        }
        if (e_compressed > e_max) {
            // Too large (including inf/NaN): saturate to the largest magnitude
            e_compressed = e_max;
            m_compressed = 0xf;
        }
        // Pack into an 8-bit integer
        return (uint8_t) ((sign << 7) | ((uint32_t) e_compressed << 4) | m_compressed);
    }

    // Decode an 8-bit floating-point representation back to a float
    float decode_float8(uint8_t bits) {
        // An all-zero magnitude is reserved for zero, matching encode_float8.
        // Side effect: 0.125 (exponent 0, mantissa 0) also reads back as zero.
        if ((bits & 0x7f) == 0) {
            return 0.0f;
        }
        // Extract fields
        uint8_t sign = (bits >> 7) & 0x01;
        int32_t exponent = (bits >> 4) & 0x07;
        uint8_t mantissa = bits & 0x0F;
        // Define parameters
        const int32_t e_bias_8 = 3;
        // Expand mantissa with implicit leading 1
        float m_expanded = 1.0f + (mantissa / 16.0f);
        // Reconstruct float: m_expanded * 2^(exponent - bias)
        float result = ldexpf(m_expanded, exponent - e_bias_8);
        return sign ? -result : result;
    }

I've run multiple experiments comparing int8 and fp8, and int8 always gave the smaller margin of error. int8 is also much faster.
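For readers who want to reproduce the comparison, here is a minimal sketch of an int8 baseline. The comment does not say which int8 scheme was used, so symmetric absmax quantization is an assumption here, and `quantize_int8_absmax` and `max_abs_error_int8` are hypothetical helper names, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: absolute value without needing libm.
static float abs_f32(float v) { return v < 0.0f ? -v : v; }

// Hypothetical helper (assumed scheme): symmetric absmax int8 quantization.
// Writes n quantized values to q and returns the scale, so that
// x[i] is approximately q[i] * scale.
float quantize_int8_absmax(const float *x, int8_t *q, size_t n) {
    float absmax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = abs_f32(x[i]);
        if (a > absmax) absmax = a;
    }
    if (absmax == 0.0f) {
        for (size_t i = 0; i < n; i++) q[i] = 0;
        return 1.0f; // all inputs are zero, so any scale works
    }
    float scale = absmax / 127.0f;
    for (size_t i = 0; i < n; i++) {
        float v = x[i] / scale;
        // Round to nearest, halves away from zero
        int r = (int) (v >= 0.0f ? v + 0.5f : v - 0.5f);
        // Clamp to the symmetric int8 range
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t) r;
    }
    return scale;
}

// Largest absolute round-trip error over the array.
float max_abs_error_int8(const float *x, const int8_t *q, float scale, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float err = abs_f32(x[i] - (float) q[i] * scale);
        if (err > worst) worst = err;
    }
    return worst;
}
```

With this scheme the round-trip error is at most about half a step, roughly absmax / 254, everywhere in the range. By contrast, `encode_float8` above keeps only the top 4 mantissa bits by truncation, so it can lose up to 1/16 (6.25%) of a value's magnitude. That difference is one plausible reason for the int8 result, though the actual outcome depends on the data distribution.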
-
How difficult would it be to support conversion to fp8 for GGUF, and to add accelerated GPU support? I have a 4060 Ti 16GB (Ada Lovelace) GPU and am interested in leveraging its fp8 support.