Add bf16, f64f64 and f80 types #3456

121 changes: 121 additions & 0 deletions text/add-bf16-f64f64-and-f80-type.md
- Feature Name: `add-bf16-f64f64-and-f80-type`
- Start Date: 2023-07-10
- RFC PR: [rust-lang/rfcs#3456](https://github.com/rust-lang/rfcs/pull/3456)
- Rust Issue: [rust-lang/rfcs#2629](https://github.com/rust-lang/rfcs/issues/2629)

# Summary
[summary]: #summary

This RFC proposes new floating-point types to enhance FFI with specific targets:

- `bf16` as a builtin type for the 'brain floating point' format, which is widely used in machine learning and differs from the IEEE-754 standard `binary16` representation
- `f64f64` in `core::arch` for the legacy extended float format used on the PowerPC architecture
- `f80` in `core::arch` for the extended float format used on the x86 and x86_64 architectures

This proposal also introduces `c_longdouble` in `core::ffi` to represent the correct format for `long double` in C.

# Motivation
[motivation]: #motivation

The types listed above may be widely used in existing native code, but they are not available on all targets. Their underlying representations are quite different from the 16-bit and 128-bit binary floating-point formats defined in IEEE-754.

On the respective targets (namely PowerPC and x86), these target-specific extended types are what `long double` refers to, which makes `long double` ambiguous in the context of FFI. Defining `c_longdouble` therefore helps interoperation with C code that uses the `long double` type.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

`bf16` is available on all targets. The operators and constants defined for `f32` are also available for `bf16`.

For `f64f64` and `f80`, availability is limited to the following targets, though this may change over time:

- `f64f64` is supported on `powerpc-*` and `powerpc64(le)-*`, available in `core::arch::{powerpc, powerpc64}`
- `f80` is supported on `i[356]86-*` and `x86_64-*`, available in `core::arch::{x86, x86_64}`

The operators and constants defined for `f32` or `f64` are available for `f64f64` and `f80` in their respective arch-specific modules.

None of the proposed types (`bf16`, `f64f64` and `f80`) has a literal representation. Instead, values are converted to or from the IEEE-754 compliant types.
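A minimal sketch of the intended conversion-based usage; both conversions shown here are assumptions for illustration, since this RFC does not spell out the exact `From` impls for `bf16`:

```rust
// Hypothetical: `bf16` has no literals, so values come from IEEE-754 types.
fn example(x: f32) -> f32 {
    let y = bf16::from(x); // assumed From<f32> for bf16
    f32::from(y)           // assumed From<bf16> for f32
}
```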

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation
**Member:** For all of these types, what is their interaction with #3514 -- i.e., what exactly is guaranteed (or not) about their NaN values? Is there any other kind of non-determinism for any of them?


## `bf16` type

`bf16` consists of 1 sign bit, 8 bits of exponent and 7 bits of mantissa. Some ARM, AArch64, x86 and x86_64 targets support `bf16` operations natively. On other targets, `bf16` values will be promoted into `f32` before computation and truncated back into `bf16` afterwards.
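A minimal sketch of that promote/compute/truncate emulation, using a hypothetical `Bf16` wrapper over the raw bits; `bf16` occupies the upper 16 bits of an `f32` bit pattern, so widening is just a shift. (The review thread below questions whether plain truncation is the right rounding behavior.)

```rust
// Hypothetical bf16 emulation on targets without native support.
#[derive(Clone, Copy)]
struct Bf16(u16);

impl Bf16 {
    fn to_f32(self) -> f32 {
        // bf16 is the upper half of the corresponding f32 bit pattern.
        f32::from_bits((self.0 as u32) << 16)
    }
    fn from_f32_truncate(v: f32) -> Self {
        // Truncation: drop the low 16 bits (round toward zero).
        Bf16((v.to_bits() >> 16) as u16)
    }
    fn mul(self, rhs: Bf16) -> Bf16 {
        // Promote, compute in f32, truncate back.
        Self::from_f32_truncate(self.to_f32() * rhs.to_f32())
    }
}
```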
**@RalfJung** (Member, Apr 20, 2024): Is that equivalent to whatever the IEEE semantics of `bf16` are, if such a thing exists (i.e., a hypothetical IEEE type with 8 bits of exponent and 7 bits of mantissa)?

**@RalfJung** (Member, Apr 20, 2024): Reading other comments below, it seems like this is actually an incorrect emulation. So it will be the case, for the first time in Rust history, that primitive operations such as multiplying two `bf16` values have target-dependent behavior even when no NaNs are involved. That's a major downside of the RFC and needs to be discussed and justified more explicitly. The RFC should also state explicitly what is guaranteed to be true about `bf16` arithmetic on all targets -- that's needed, e.g., for unsafe code authors to know what they can rely on in terms of soundness.

Furthermore, the RFC needs to specify whether, on targets that have native `bf16` support, it is correct for the compiler to do compile-time optimizations using emulated `f32` semantics (in other words, the RFC needs to say whether there is any guarantee that `bf16` on such a target will actually behave like the native `bf16` of the hardware).

**@chorman0773** (Apr 20, 2024): It should be possible to emulate correct rounding semantics, so this frankly seems like a bug in those softfp impls. It does require temporarily switching to RTZ mode, and then you can truncate the result with RTN-ties-even.

Edit: Actually NVM, the above procedure still has an error from the correctly rounded result of at most -2^17*ULP. You'd have to first check FE_INEXACT and adjust how you round accordingly.

**@RalfJung** (Member, Apr 20, 2024): Or maybe alternatively the answer for all these float types is -- the resulting bit patterns are entirely unspecified and not guaranteed to be portable in any way.

But even then we need to document how deterministic they are. Does passing the exact same inputs to an operation multiple times during a program execution always definitely produce the exact same outputs, on all targets and optimization levels? For regular floats, the answer turns out to be "only when there are no NaNs" -- that's what #3514 is all about. Sometimes, the same operation with the same inputs on the same target can produce different results depending on optimization levels and how obfuscated the surrounding code is. Even if we don't want to specify the bits that are produced by these operations, we need to specify whether results are consistent across all programs on a given target (and define "target" -- is it per-triple or per-architecture?), or only consistent across all operations in a single execution, or arbitrarily inconsistent.

**Member:** IIRC, you can do a single add/sub/mul/div/sqrt `bf16` operation by promoting to `f32`, doing the operation in `f32`, and then rounding back to `bf16` using the same rounding mode, not truncating. That is assuming, of course, that `bf16` actually meets the conditions, which are IIRC something like: `bf16`'s mantissa bit count must be less than half of `f32`'s mantissa bit count, minus 1 or 2.

This is like how you can do that with `f32` and `f64`, which is how you can do `f32` operations in JavaScript by using `Math.fround`.
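For reference, a Rust sketch of the `f32`-via-`f64` version of this trick (widen, perform a single operation, round back), using only the standard library:

```rust
// One f32 multiplication computed through f64: because f64 carries more
// than twice f32's mantissa precision, this double rounding yields the
// same result as a direct, correctly rounded f32 multiplication.
fn mul_f32_via_f64(a: f32, b: f32) -> f32 {
    ((a as f64) * (b as f64)) as f32
}
```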

**@RalfJung** (Member, Apr 21, 2024): That's true for the regular IEEE formats, yes -- it was proven here.

But I don't know if `bf16` is enough like an IEEE format to make that theorem apply.

Also, when LLVM compiles `bf16` to `f32`, does it guarantee to do the rounding back to `bf16` after each and every operation, never doing more than one operation "at once" in `f32` mode?

**Member:**

> That's true for the regular IEEE formats, yes -- it was proven here. But I don't know if bf16 is enough like an IEEE format to make that theorem apply.

It is; `bf16` is just `f16` with a few more exponent bits and a few fewer mantissa bits. Everything else is the same.

**Comment:** `bf16` is just `f32` with the lower 16 mantissa bits dropped. As `f64` values that can be rounded to `f32` are effectively `f32` values with 29 extra mantissa bits, there would be no difference here.

**Member:**

> Also, does LLVM when it compiles bf16 to f32 guarantee to do the rounding back to bf16 after each and every operation, never doing more than one operation "at once" in f32 mode?

That I don't know, but I hope LLVM at least tries to be correct in non-fast-math mode.

**Member:**

> that I don't know, but I hope LLVM at least tries to be correct in non-fast-math mode

That should definitely be noted as something to figure out before stabilization.


`bf16` will generate the `bfloat` type in LLVM IR.

## `f64f64` type

`f64f64` is the legacy extended floating-point format used on PowerPC targets. It consists of two `f64` values, the first holding a normal `f64` and the second extending the mantissa.
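As a brief illustration of the 'double-double' idea behind this format (not the proposed API): the value is the unevaluated sum of the two halves, and the exact rounding error of an `f64` operation can itself be captured in the low half. Knuth's two-sum is the classic building block:

```rust
// Returns (s, err) with s = fl(a + b) and a + b == s + err exactly.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let b_virtual = s - a;
    let a_virtual = s - b_virtual;
    let err = (a - a_virtual) + (b - b_virtual);
    (s, err)
}
```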

The following `From` traits are implemented in `core::arch::{powerpc, powerpc64}` for conversion from other floating-point types to `f64f64`:

```rust
impl From<bf16> for f64f64 { /* ... */ }
impl From<f32> for f64f64 { /* ... */ }
impl From<f64> for f64f64 { /* ... */ }
```

`f64f64` will generate the `ppc_fp128` type in LLVM IR.

## `f80` type

`f80` represents the extended-precision floating-point type on x86 targets, with 1 sign bit, 15 bits of exponent and 64 bits of significand (including an explicitly stored integer bit).
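For illustration only (a hypothetical helper, not proposed API), the layout can be viewed as a 16-bit sign/exponent half plus a 64-bit significand whose top bit is the explicit integer bit:

```rust
// Decompose an 80-bit value given as (sign_exponent, significand) halves.
fn f80_fields(sign_exponent: u16, significand: u64) -> (bool, u16, bool, u64) {
    let sign = sign_exponent & 0x8000 != 0;
    let exponent = sign_exponent & 0x7FFF;
    let integer_bit = significand >> 63 != 0; // explicit, unlike f32/f64
    let fraction = significand & ((1u64 << 63) - 1);
    (sign, exponent, integer_bit, fraction)
}
```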

The following `From` traits are implemented in `core::arch::{x86, x86_64}` for conversion from other floating-point types to `f80`:

```rust
impl From<bf16> for f80 { /* ... */ }
impl From<f32> for f80 { /* ... */ }
impl From<f64> for f80 { /* ... */ }
```

`f80` will generate the `x86_fp80` type in LLVM IR.

## `c_longdouble` type in FFI

`core::ffi::c_longdouble` will always represent whatever `long double` does in C. Rust will defer to the compiler backend (LLVM) for what exactly this represents, but it will approximately be:

- 80-bit extended precision (`f80`) on `x86` and `x86_64`:
  - `f64` double precision with MSVC
- `f128` quadruple precision on AArch64
- `f64f64` on PowerPC
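A sketch of the intended FFI usage, binding a C function that takes `long double` (the declaration is illustrative; `cbrtl` is the C math library's cube-root function for `long double`):

```rust
use core::ffi::c_longdouble; // as proposed by this RFC

extern "C" {
    // C: long double cbrtl(long double x);
    fn cbrtl(x: c_longdouble) -> c_longdouble;
}
```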

# Drawbacks
[drawbacks]: #drawbacks

`bf16` is not an IEEE-754 standard type, so adding it as a primitive type may break the existing consistency of the builtin float types. The truncation after computation on targets that do not support `bf16` natively also differs from how Rust treats precision loss in other cases.
**@programmerjake** (Member, Jul 10, 2023): Correctly rounding `bf16` can be relatively easily implemented. `bf16` add/sub/mul/div/sqrt can then just convert to `f32`, do a single operation, and round to `bf16`, which will always give the correct `bf16` result. Round-to-`bf16` code (not tested):

```rust
fn f32_to_bf16(v: f32) -> bf16 {
    let b32 = v.to_bits();
    bf16::from_bits(if v.is_nan() {
        // NaN: keep the high half of the payload, no rounding needed.
        (b32 >> 16) as u16
    } else if b32 & 0xFFFF == 0x8000 {
        // Exactly halfway between two bf16 values: round to even.
        let b16 = (b32 >> 16) as u16;
        b16.wrapping_add(b16 & 1)
    } else {
        // Otherwise add half an ulp and truncate: round to nearest.
        (b32.wrapping_add(0x8000) >> 16) as u16
    })
}
```

**@programmerjake** (Member, Jul 10, 2023): Round-to-`bf16` code staying entirely in SSE registers on x86_64 (also untested): https://rust.godbolt.org/z/85Ks9sPP6

**Contributor (author):** The code you provided looks like direct truncation. I need to confirm whether the rounding behavior is fixed (to zero? to nearest? to infinity?) or depends on the system rounding mode.

Also, clang provides an option `-fbfloat16-excess-precision` to specify the 'merging' behavior of bfloat operations -- for example, whether the intermediate result of `a-b+c` is rounded. But I think that's not an issue for Rust; the value should be `none` (no merging will be performed).

**Member:**

> The code you provided looks like direct truncation.

It is round to nearest, ties to even. Truncation (round toward zero) would be only `bf16::from_bits((f32::to_bits(v) >> 16) as u16)`.

> I need to confirm if the rounding behavior is fixed (tozero? tonearest? toinfinity?) or depending on system rounding mode.

LLVM assumes the rounding mode is round to nearest, ties to even, unless you use the constrained FP intrinsics, which rustc doesn't support (yet?).


`c_longdouble` is not uniquely determined by architecture, OS and ABI. On the same target, C compiler options may change which representation `long double` uses.
**Contributor:** I think you are referring to options like `-mlong-double-128` here. This doesn't strike me as a drawback. Instead, I would mention in the `c_longdouble` section that what exactly `long double` represents can be changed at compile time in C, but Rust won't have this option.


# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The [half](https://github.com/starkat99/half-rs) crate provides implementations of the binary16 and bfloat16 types.

However, besides the inconsistency of using a crate-provided type where a primitive would be expected, there are further issues with such bindings.

The availability of the additional float types depends heavily on the CPU/OS/ABI/features of each target, and the evolution of LLVM may unlock these types on new targets over time. Implementing them in the compiler handles this in the best location.

Most such crates define their types on top of a C binding, but the extended float type definitions in C are complex and confusing: the meaning of `long double` and `_Float128` varies by target and compiler options. Implementing the types in the Rust compiler helps maintain a stable codegen interface.

And since third-party tools also rely on Rust compiler internals, implementing the additional float types in the compiler helps those tools recognize them.

# Prior art
[prior-art]: #prior-art

There is a previous proposal on `f16b` type to represent `bfloat16`: https://github.com/rust-lang/rfcs/pull/2690.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. Because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C or C++ mode).
**Contributor:** Suggested change:

```diff
-This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. Because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C or C++ mode).
+This proposal does not contain information for FFI with C's `_Float128` and `__float128` type, because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C and C++).
```

**@lygstate** (Aug 8, 2023): I don't think that is the reason; indeed, on some targets `c_longdouble` needs `_Float128`. The real reason is that we have a separate RFC 3453 for it, so stating it this way is misleading. I think we need to say that, in conjunction with RFC 3453, we can define `c_longdouble` properly on all targets.

**Contributor (author):** PowerPC has an option to control what `long double` means: `f64`, `f128` or `f64f64`. But since it is transitioning to `f128` as of the time of writing, we can drop a little historical burden and use `f128` on little-endian 64-bit targets.

I'm not sure whether x86 has a similar option. If not, we can confidently introduce `c_longdouble`.


Although statements like "target X supports type A" are used in the text above, some targets may only support a type when certain target features are enabled. Such features are assumed to be enabled, with precedents like `core::arch::x86_64::__m256d` (which is part of AVX).

The representation of `long double` in C may depend on compiler options. For example, with Clang on `powerpc64le-*`, the options `-mabi=ieeelongdouble`/`-mabi=ibmlongdouble`/`-mlong-double-64` set `long double` to `fp128`/`ppc_fp128`/`double` in LLVM. This RFC assumes the default option.

# Future possibilities
[future-possibilities]: #future-possibilities

[LLVM reference for floating types]: https://llvm.org/docs/LangRef.html#floating-point-types