Skip to content

Relative VTables for Rust #903

@PiJoules

Description

@PiJoules

Proposal

We’re experimenting with more ways of saving memory on Fuchsia and one of the places I thought could be a win is rust vtables. We saw a significant amount of memory savings when using a PIC-friendly vtable layout for C++, particularly in large binaries like chromium which heavily rely on vtables. While our rust binaries aren't very big, we do have many of them and together they add up to about 3.77 MB worth of relro data.

Relro data is essentially writeable data that is made readonly once relocations in that data are patched by the dynamic linker. This data must be writable at first because the dynamic linker needs to resolve these relocations to other symbols, which could be anywhere in memory. Once the relocations are patched, the page containing relro data is then made readonly. This is contrary to rodata which is purely readonly and never writable. The significance between the two is that relro data is stored in copy-on-write (COW) pages. These are pages that can be cloned once a new process starts writing to it. In the context of vtables, this means that whenever the dynamic linker in a new process patches relocations for functions inside a vtable, the COW page holding that vtable is cloned and then eventually made readonly. The number of COW pages that are cloned can scale to the number of processes, leading to significant memory usage for data that is effectively readonly for the remainder of the program’s lifetime. rodata doesn’t have this issue because it is pure readonly and can be shared between processes.

Relative vtables aims to essentially move vtables into rodata by making them PIC-friendly. Instead of storing a pointer to a trait function in the vtable, we would store the offset between that trait function and the vtable. This results in a static relocation that is only resolved at build time. This makes the vtable purely readonly since it would contain no dynamic relocations. With this layout, a call to a trait function involves loading this offset from the vtable then adding it back to the vtable’s address to get the corresponding function address.

Benefits

  • Reduced Memory Footprint: VTables can be pure readonly and shared between processes because they will contain no dynamic relocations which leads to fewer COW pages
  • Faster Startup Time: Having fewer dynamic relocations also means the dynamic linker does less work patching relocations on program startup.
  • Halved VTable Sizes: Under the small memory model, PC-relative distances within the binary are assumed to fit in signed 32-bit values (+/-2G distance), meaning we can use 4-byte offsets in the vtable.
  • Improved Cache Locality: Smaller vtables mean more vtable entries can fit into a single cache line, leading to fewer cache misses during function dispatches. This can result in faster program execution.

Design

Below is a brief overview of the current vtable layout and changes for the relative vtable layout. Similar to the implementation in Clang, I think a good rollout process would be hiding this behind an experimental flag initially for testing, then make the ABI dependent on the target so targets like Fuchsia could automatically get relative vtables without needing an explicit flag.

The Current Rust VTable ABI

Rust implements dynamic dispatch through trait objects. A trait is essentially a fat pointer containing two pointers: one to the actual data held by the object, and one to the actual vtable. This type of fat pointer is separate from the normal non-raw rust pointers which contain a base pointer and size. When calling a function that implements a trait, the vtable pointer is unpacked from the fat pointer and we load the appropriate function from this vtable. Similar to Itanium C++ ABI vtables, a Rust vtable is mostly just a struct of function pointers. The main difference is the components at the start of the vtable. Each Rust vtable starts with these 3 components:

  • The “drop in place” function pointer. This is a pointer to the core::ptr::drop_in_place implementation for this type. This function is essentially a destructor that recursively calls the destructors for each of the type’s members. This component is needed for Box to correctly destroy the object when a wrapping Box goes out of scope.
  • The size of the concrete type. This is also needed by Box to deallocate the correct number of bytes after destroying its object. I believe the size component should only be legally accessible via the size_of_val intrinsic.
  • The alignment of the concrete type. This is also needed by Box but mainly to ensure that the allocated object is correctly aligned. I believe the align component should only be legally accessible via the align_of_val intrinsic.

Each of these components is the size of one word. Given the following struct and trait implementation:

trait MyTrait {
    fn foo(&self);
    fn bar(&self) -> u32;
}

struct Struct {
    s: String,
}

impl MyTrait for Struct {
    fn foo(&self) {
        println!("Struct foo");
    }
    fn bar(&self) -> u32 {
        42
    }
}

We would see a vtable (in LLVM IR) that looks like:

@vtable = private unnamed_addr constant <{ ptr, [16 x i8], ptr, ptr }> <{
	;; This is the drop-in-place function
        ptr @"_ZN4core3ptr33drop_in_place$LT$test..Struct$GT$17haf3a6bdbafdd9b84E",

        ;; The first 8 bytes are the size of `Struct`, which is only composed of a
        ;; single `String` that's 24 bytes large.
        ;;
        ;; The second 8 bytes are the alignment of `Struct`.
        [16 x i8] c"\18\00\00\00\00\00\00\00\08\00\00\00\00\00\00\00",

        ;; These are the function pointers to `foo` and `bar`
        ptr @"_ZN46_$LT$test..Struct$u20$as$u20$test..MyTrait$GT$3foo17h4aa480e1d9c41f39E",
        ptr @"_ZN46_$LT$test..Struct$u20$as$u20$test..MyTrait$GT$3bar17hf944da1b1fe1b667E"
}>, align 8

One thing that’s kind of odd is the size and alignment fields are provided as i8 arrays rather than just i64s. Loads to these parts of the vtables always load them as i64, so it’s unclear to me why they can’t just be i64 in the vtable. This may or may not be just some suboptimal lowering in rust which could be changed independently of this MCP. If the object were effectively a POD not needing an actual destruction method, then this component is effectively a nullptr also represented as an array of i8s.

@vtable = private unnamed_addr constant <{ [24 x i8], ptr, ptr }> <{
        [24 x i8] c"\00\00\00\00\00\00\00\00\18\00\00\00\00\00\00\00\08\00\00\00\00\00\00\00",
...

The IR for calling a vtable function is just a normal dynamic dispatch. So given,

fn process_trait_object(obj: &dyn MyTrait) {
    obj.foo();
}

the IR would look like this:

define internal void @_ZN4test20process_trait_object17hdb64382979262772E(ptr align 1 %obj, ptr align 8 %vtable) unnamed_addr #2 {
start:
  ;; Get the function address from the vtable
  %0 = getelementptr inbounds i8, ptr %vtable, i64 24

  ;; Load the function pointer
  %1 = load ptr, ptr %0, align 8, !invariant.load !5, !nonnull !5

  ;; Call `foo`
  call void %1(ptr align 1 %obj)
  ret void
}

Relative VTable ABI

Changing the vtables to be PIC-friendly akin to relative vtables in C++ is pretty straightforward and can be done almost the exact same way. During vtable emission, for each function, we take the offset between that function and the start of the vtable. We can use the exact same IR sequence we use for C++ relative vtables. So the vtable would look like this:

@vtable = ...
        ;; `foo`
        i32 trunc (
                i64 sub (
                        i64 ptrtoint (ptr @"_ZN46_$LT$test..Struct$u20$as$u20$test..MyTrait$GT$3foo17h4aa480e1d9c41f39E" to i64),
                        i64 ptrtoint (ptr @vtable to i64)
                ) to i32
        ),


        ;; `bar`
        i32 trunc (
                i64 sub (
                        i64 ptrtoint (ptr @"_ZN46_$LT$test..Struct$u20$as$u20$test..MyTrait$GT$3bar17hf944da1b1fe1b667E" to i64),
                        i64 ptrtoint (ptr @vtable to i64)
                ) to i32
        )

Unlike relative vtables in C++, dso_local_equivalent isn’t needed since all rust trait functions have internal linkage and are guaranteed to be dso-local. Rust vtables are also always emitted with private linkage, so they are also always dso-local. This subtraction should always result in a PC32-style static relocation. If the object doesn't require a "destructor" (that is, the original drop-in-place component would be zero), then this component under a relative vtable abi would also be zero, but just as an i32. Otherwise, the component would be made relative just like other trait functions in the vtable.

The alignment component can just be an i32 since we probably won’t be having objects aligned up to 2^(2^32) bytes.

The tricky part is the size component. Ideally, it would just be i32 because we could make the entire vtable an array type with 4-byte alignment. But limiting the size component means the maximum size of the actual object would need to be 4GB. I’m not sure how often creating such large objects are in practice, but I imagine it should be entirely possible from the language level to make an array of isize::MAX bytes. If such large objects are needed, then we’d need an i64 to cover this object. This would unfortunately mean the vtable would be a struct of mostly i32s with an i64 component in the middle, forcing padding and making the vtable 8-byte aligned still. I don’t know if this is desirable, so I have a number of possible layout suggestions:

  1. Hide the feature behind a flag and just unconditionally force the size field to be i32. We say this feature is only usable when all objects are less than 4GB in size. This could be enforced with a compiler error.
  2. We make the size field two adjacent i32s. When reading the size from the vtable via the size_of_val intrinsic, we update it to do two 4-byte loads that get merged into an i64 rather than loading a single 8-byte value. This allows the vtable to keep the 4-byte alignment at the cost of an extra (potentially mostly unused) field and some extra instructions when calling this intrinsic. It’s possible the compiler will just fuse these into a single i64 load when the target doesn't mind unaligned loads, so this may be essentially free to do.
  3. We have the alignment field be an i32 that’s split into two parts: the bottom 26 bits which will act as an extension of the size field, and the top 6-bits which will be used as a power-2 representation of alignment. The max alignment we’ll probably want to support is 2^64, so the largest value we’ll need to represent is 64 which only requires 6 bits to represent. It's more likely the OS integration already limits the max alignment to page size effectively. That's unlikely ever to exceed 2^20 (and 2^12 or 2^16 are more common), so having even 4 or 5 bits is probably plenty here. Loading the size from the vtable via the size_of_val intrinsic involves the same logic from (2) but with an instruction to mask out the power-2 alignment bits. Loading the alignment from the vtable via the align_of_val intrinsic involves doing a load, shifting to get the alignment bits, and another shift by this value to get the real alignment. This allows for the size and alignment components to still be 4-bytes each and gives an extra buffer for the original size component.
  4. Store the actual size in a separate dso-local global elsewhere. We store the offset between the vtable and this global as the size component of the vtable. Loading the size via the size_of_val intrinsic would mean doing another PC-relative load for the size which can remain 64 bits. This would expand rodata by more 64-bit ints, but I think GC should only allow this to scale by the number of unique sizes if emitted into the .rodata.cst8 section, which is SHF_MERGE at size 8.. This would also allow the vtable to be 4-byte aligned. An extension of this could be moving the alignment field into this extra global as well, akin to the typeinfo struct in C++. The net rodata size chance from doing this is zero since we’re just moving an int, but GC might still be able to merge some of the extracted size/alignment struct.

The IR for invoking foo would look like this:

define internal void @_ZN4test20process_trait_object17hdb64382979262772E(ptr align 1 %obj, ptr align 4 %vtable) unnamed_addr #2 {
start:
  ;; This intrinsic does 2 things. First (assuming the first 12 bytes consist of the drop-in-place,
  ;; size, and alignment components), it loads the i32 offset 12 bytes after `vtable`.
  ;; Next, it adds this offset back to `vtable` which will be the trait function.
  %0 = call ptr @llvm.load.relative.i32(ptr %vtable, i32 12)

  ;; Call normally
  call void %0(ptr align 1 %obj)
  ret void
}

Compiler Changes

These are the changes I know I’ll need to make, but it’s gonna take some time for me to get familiar with rust to know where to find them.

  • VTable layout emission
  • Trait calling emission
  • Lowering of the align_of_val and size_of_val compiler intrinsics

Runtime Changes

I think this shouldn’t require any runtime changes because extracting information from the vtable is only legally done via the compiler intrinsics mentioned above. An actual rust expert can correct me on this. std::boxed::Box and std::layout::Layout which do care about the size and alignment fields just make calls to std::mem::size_of_val_raw and std::mem::align_of_val_raw which end up just calling their respective intrinsics.

Mentors or Reviewers

If you have a reviewer or mentor in mind for this work, mention them here. You can put your own name here if you are planning to mentor the work.

Process

The main points of the Major Change Process are as follows:

  • File an issue describing the proposal.
  • A compiler team member who is knowledgeable in the area can second by writing @rustbot second or kickoff a team FCP with @rfcbot fcp $RESOLUTION.
  • Once an MCP is seconded, the Final Comment Period begins.
    • Final Comment Period lasts for 10 days after all outstanding concerns are solved.
    • Outstanding concerns will block the Final Comment Period from finishing. Once all concerns are resolved, the 10 day countdown is restarted.
    • If no concerns are raised after 10 days since the resolution of the last outstanding concern, the MCP is considered approved.

You can read more about Major Change Proposals on forge.

Metadata

Metadata

Assignees

No one assigned

    Labels

    T-compilerAdd this label so rfcbot knows to poll the compiler teammajor-changeA proposal to make a major change to rustc

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions