RedLeaf: Isolation and Communication in a Safe Operating System
Abstract
RedLeaf is a new operating system developed from scratch in Rust to explore the impact of language safety on operating system organization. In contrast to commodity systems, RedLeaf does not rely on hardware address spaces for isolation and instead uses only type and memory safety of the Rust language. Departure from costly hardware isolation mechanisms allows us to explore the design space of systems that embrace lightweight fine-grained isolation. We develop a new abstraction of a lightweight language-based isolation domain that provides a unit of information hiding and fault isolation. Domains can be dynamically loaded and cleanly terminated, i.e., errors in one domain do not affect the execution of other domains. Building on RedLeaf isolation mechanisms, we demonstrate the possibility to implement end-to-end zero-copy, fault isolation, and transparent recovery of device drivers. To evaluate the practicality of RedLeaf abstractions, we implement Rv6, a POSIX-subset operating system as a collection of RedLeaf domains. Finally, to demonstrate that Rust and fine-grained isolation are practical, we develop efficient versions of 10Gbps Intel ixgbe network and NVMe solid-state disk device drivers that match the performance of the fastest DPDK and SPDK equivalents.
1 Introduction
Four decades ago, early operating system designs identified the ability to isolate kernel subsystems as a critical mechanism for increasing the reliability and security of the entire system [12, 32]. Unfortunately, despite many attempts to introduce fine-grained isolation to the kernel, modern systems remain monolithic. Historically, software and hardware mechanisms remained prohibitively expensive for isolation of subsystems with the tightest performance budgets. Multiple hardware projects explored the ability to implement fine-grained, low-overhead isolation mechanisms in hardware [84, 89, 90]. However, focusing on performance, modern commodity CPUs provide only basic support for coarse-grained isolation of user applications. Similarly, for decades, overheads of safe languages that can provide fine-grained isolation in software remained prohibitive for low-level operating system code. Traditionally, safe languages require a managed runtime, and specifically, garbage collection, to implement safety. Despite many advances in garbage collection, its overhead is high for systems designed to process millions of requests per second per core (the fastest garbage-collected languages experience a 20-50% slowdown compared to C on a typical device driver workload [28]).
For decades, breaking the design choice of a monolithic kernel remained impractical. As a result, modern kernels suffer from lack of isolation and its benefits: clean modularity, information hiding, fault isolation, transparent subsystem recovery, and fine-grained access control.
The historical balance of isolation and performance is changing with the development of Rust, arguably, the first practical language that achieves safety without garbage collection [45]. Rust combines an old idea of linear types [86] with pragmatic language design. Rust enforces type and memory safety through a restricted ownership model allowing only one unique reference to each live object in memory. This allows statically tracking the lifetime of the object and deallocating it without a garbage collector. The runtime overhead of the language is limited to bounds checking, which in many cases can be concealed by modern superscalar out-of-order CPUs that can predict and execute the correct path around the check [28]. To enable practical non-linear data structures, Rust provides a small set of carefully chosen primitives that allow escaping strict limitations of the linear type system.
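The ownership model described above can be illustrated with a minimal sketch. The `Packet` type and the `consume`, `peek`, and `demo_ownership` functions are hypothetical names chosen for this illustration and do not come from RedLeaf. The sketch shows a move (the caller loses access), a read-only borrow (the caller keeps access), and the one run-time cost the text mentions, a bounds check.

```rust
/// A buffer that owns its bytes; it is deliberately neither `Copy` nor `Clone`.
struct Packet {
    payload: Vec<u8>,
}

/// Takes ownership of the packet: the caller can no longer use it afterwards.
fn consume(p: Packet) -> usize {
    p.payload.len()
    // `p` goes out of scope here and its memory is freed,
    // statically, without a garbage collector.
}

/// Borrows the packet read-only: the caller keeps ownership.
fn peek(p: &Packet) -> u8 {
    // Indexing is bounds-checked at run time; it panics on an empty payload.
    p.payload[0]
}

/// Hypothetical driver function for the example.
fn demo_ownership() -> (u8, usize) {
    let p = Packet { payload: vec![7, 8, 9] };
    let first = peek(&p); // shared borrow: `p` is still usable afterwards
    let len = consume(p); // move: `p` is gone from this scope
    // let again = peek(&p); // would not compile: use of moved value `p`
    (first, len)
}
```

Uncommenting the last `peek` call makes the compiler reject the program, which is the static guarantee that no alias to a moved object remains with the caller.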
Rust is quickly gaining popularity as a tool for development of low-level systems that traditionally were written in C [4, 24, 40, 47, 50, 65]. Low-overhead safety brings a range of immediate security benefits: it is expected that two-thirds of vulnerabilities caused by low-level programming idioms typical for unsafe languages can be eliminated through the use of a safe language alone [20, 22, 67, 69, 77].
Unfortunately, recent projects mostly use Rust as a drop-in replacement for C. We, however, argue that true benefits of language safety lie in the possibility to enable practical, lightweight, fine-grained isolation and a range of mechanisms that remained in the focus of systems research but remained impractical for decades: fault isolation [79], transparent device driver recovery [78], safe kernel extensions [13, 75], fine-grained capability-based access control [76], and more.
RedLeaf^1^ is a new operating system aimed at exploring the impact of language safety on operating system organization, and specifically the ability to utilize fine-grained isolation and its benefits in the kernel. RedLeaf is implemented from scratch in Rust. It does not rely on hardware mechanisms for isolation and instead uses only type and memory safety of the Rust language.
Despite multiple projects exploring isolation in language-based systems [6, 35, 39, 85], articulating principles of isolation and providing a practical implementation in Rust remains challenging. In general, safe languages provide mechanisms to control access to the fields of individual objects (e.g., through the pub access modifier in Rust) and protect pointers, i.e., restrict access to the state of the program transitively reachable through visible global variables and explicitly passed arguments. Control over references and communication channels allows isolating the state of the program on function and module boundaries, enforcing confidentiality and integrity, and, more generally, constructing a broad range of least-privilege systems through a collection of techniques explored by object-capability languages [59].
Unfortunately, built-in language mechanisms alone are not sufficient for implementing a system that isolates mutually distrusting computations, e.g., an operating system kernel that relies on language safety for isolating applications and kernel subsystems. To protect the execution of the entire system, the kernel needs a mechanism that isolates faults, i.e., provides a way to terminate a faulting or misbehaving computation in such a way that it leaves the system in a clean state. Specifically, after the subsystem is terminated the isolation mechanisms should provide a way to 1) deallocate all resources that were in use by the subsystem, 2) preserve the objects that were allocated by the subsystem but then were passed to other subsystems through communication channels, and 3) ensure that all future invocations of the interfaces exposed by the terminated subsystem do not violate safety or block the caller, but instead return an error. Fault isolation is challenging in the face of semantically-rich interfaces encouraged by language-based systems—frequent exchange of references all too often implies that a crash of a single component leaves the entire system in a corrupted state [85].
Over the years the goal to isolate computations in language-based systems came a long way from early single-user, single-language, single-address space designs [9, 14, 19, 25, 34, 55, 71, 80] to ideas of heap isolation [6, 35] and use of linear types to enforce it [39]. Nevertheless, today the principles of language-based isolation are not well understood. Singularity [39], which implemented fault isolation in Sing#, relied on a tight co-design of the language and operating system to implement its isolation mechanisms. Several recent systems suggesting the idea of using Rust for lightweight isolation, e.g., Netbricks [68] and Splinter [47], struggled to articulate the principles of implementing isolation, instead falling back to substituting the information hiding already provided by Rust for fault isolation. Similarly, Tock, a recent operating system in Rust, supports fault isolation of user processes through traditional hardware mechanisms and a restricted system call interface, but fails to provide fault isolation of its device drivers (capsules) implemented in safe Rust [50]. Our work develops principles and mechanisms of fault isolation in a safe language. We introduce an abstraction of a language-based isolation domain that serves as a unit of information hiding, loading, and fault isolation. To encapsulate a domain's state and implement fault isolation at the domain boundary, we develop the following principles:
- Heap isolation: We enforce heap isolation as an invariant across domains, i.e., domains never hold pointers into private heaps of other domains. Heap isolation is key for termination and unloading of crashing domains: since no other domains hold pointers into the private heap of a crashing domain, it is safe to deallocate the entire heap. To enable cross-domain communication, we introduce a special shared heap that allows allocation of objects that can be exchanged between domains.
- Exchangeable types: To enforce heap isolation, we introduce the idea of exchangeable types, i.e., types that can be safely exchanged across domains without leaking pointers to private heaps. Exchangeable types allow us to statically enforce the invariant that objects allocated on the shared heap cannot have pointers into private domain heaps, but can have references to other objects on the shared heap.
- Ownership tracking: To deallocate resources owned by a crashing domain on the shared heap, we track ownership of all objects on the shared heap. When an object is passed between domains, we update its ownership depending on whether it is moved between domains or borrowed in a read-only access. We rely on Rust's ownership discipline to enforce that domains lose ownership when they pass a reference to a shared object in a cross-domain function call, i.e., Rust enforces that there are no aliases into the passed object left in the caller domain.
- Interface validation: To provide extensibility of the system and allow domain authors to define custom interfaces for subsystems they implement while retaining isolation, we validate all cross-domain interfaces, enforcing the invariant that interfaces are restricted to exchangeable types and hence preventing them from breaking the heap isolation invariants. We develop an interface definition language (IDL) that statically validates definitions of cross-domain interfaces and generates implementations for them.
- Cross-domain call proxying: We mediate all cross-domain invocations with invocation proxies, a layer of trusted code that interposes on all of a domain's interfaces. Proxies update ownership of objects passed across domains, provide support for unwinding execution of threads from a crashed domain, and protect future invocations of the domain after it is terminated. Our IDL generates implementations of the proxy objects from interface definitions.
The above principles allow us to enable fault-isolation in a practical manner: isolation boundaries introduce minimal overhead even in the face of semantically-rich interfaces. When a domain crashes, we isolate the fault by unwinding execution of all threads that currently execute inside the domain, and deallocate domain’s resources without affecting the rest of the system. Subsequent invocations of domain’s interfaces return errors, but remain safe and do not trigger panics. All objects allocated by the domain, but returned before the crash, remain alive.
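The heap isolation, exchangeable types, and ownership tracking principles can be sketched in a few lines of Rust. This is a simplified illustration under stated assumptions, not RedLeaf's implementation: the `Exchangeable` marker trait, the `SharedHeap` bookkeeping table, and this toy `RRef` are hypothetical, the marker is implemented by hand here (the paper assigns that check to the IDL compiler), and a `Box` stands in for memory on a real shared heap.

```rust
use std::collections::HashMap;

/// Hypothetical marker for types that hold no pointers into a private heap.
trait Exchangeable {}
impl Exchangeable for u8 {}
impl Exchangeable for u64 {}
impl<T: Exchangeable, const N: usize> Exchangeable for [T; N] {}

type DomainId = u32;
type ObjId = usize;

/// Toy remote reference: only exchangeable payloads are accepted.
struct RRef<T: Exchangeable> {
    id: ObjId,
    value: Box<T>, // stand-in for an allocation on the shared heap
}

/// Ownership table for every object on the (simulated) shared heap.
struct SharedHeap {
    owners: HashMap<ObjId, DomainId>,
    next: ObjId,
}

impl SharedHeap {
    fn new() -> Self {
        SharedHeap { owners: HashMap::new(), next: 0 }
    }

    /// Allocate an object and record the allocating domain as its owner.
    fn alloc<T: Exchangeable>(&mut self, owner: DomainId, value: T) -> RRef<T> {
        let id = self.next;
        self.next += 1;
        self.owners.insert(id, owner);
        RRef { id, value: Box::new(value) }
    }

    /// Record that the object has moved to another domain.
    fn transfer<T: Exchangeable>(&mut self, r: &RRef<T>, to: DomainId) {
        self.owners.insert(r.id, to);
    }

    /// Drop the records of everything a crashed domain still owns.
    fn reclaim(&mut self, crashed: DomainId) -> Vec<ObjId> {
        let mut dead: Vec<ObjId> = self
            .owners
            .iter()
            .filter(|(_, owner)| **owner == crashed)
            .map(|(id, _)| *id)
            .collect();
        dead.sort();
        for id in &dead {
            self.owners.remove(id);
        }
        dead
    }
}

/// Hypothetical scenario: domain 1 allocates two objects, passes one to
/// domain 2, and then crashes.
fn demo_shared_heap() -> (Vec<ObjId>, Option<DomainId>, u64) {
    let mut heap = SharedHeap::new();
    let _a = heap.alloc(1, [0u8; 4]);
    let b = heap.alloc(1, 42u64);
    // heap.alloc(1, vec![1u8]); // would not compile: Vec<u8> is not Exchangeable
    heap.transfer(&b, 2);
    let freed = heap.reclaim(1);
    (freed, heap.owners.get(&b.id).copied(), *b.value)
}
```

In this scenario only the object that domain 1 still owned is reclaimed, while the object passed to domain 2 before the crash stays alive, matching the behaviour the paragraph describes.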
To test these principles, we implement RedLeaf as a microkernel system in which a collection of isolated domains implement functionality of the kernel: typical kernel subsystems, POSIX-like interface, device drivers, and user applications. RedLeaf provides typical features of a modern kernel: multi-core support, memory management, dynamic loading of kernel extensions, POSIX-like user processes, and fast device drivers. Building on RedLeaf isolation mechanisms, we demonstrate the possibility to transparently recover crashing device drivers. We implement an idea similar to shadow drivers [78], i.e., lightweight shadow domains that mediate access to the device driver and, after a crash, restart it by replaying its initialization protocol.
To evaluate the generality of RedLeaf abstractions, we implement Rv6, a POSIX-subset operating system on top of RedLeaf. Rv6 follows the UNIX V6 specification [53]. Despite being a relatively simple kernel, Rv6 is a good platform that illustrates how ideas of fine-grained, language-based isolation can be applied to modern kernels centered around the POSIX interface. Finally, to demonstrate that Rust and fine-grained isolation introduce a non-prohibitive overhead, we develop efficient versions of 10Gbps Intel ixgbe network and PCIe-attached NVMe solid-state disk drivers.
We argue that a combination of practical language safety and ownership discipline allows us to enable many classical ideas of operating system research for the first time in an efficient way. RedLeaf is fast, supports fine-grained isolation of kernel subsystems [57, 61, 62, 79], fault isolation [78, 79], implements end-to-end zero-copy communication [39], enables user-level device drivers and kernel bypass [11, 21, 42, 70], and more.
2 Isolation in Language-Based Systems
Isolation has a long history of research in language-based systems that were exploring tradeoffs of enforcing lightweight isolation boundaries through language safety, fine-grained control of pointers, and type systems. Early operating systems applied safe languages for operating system development [9, 14, 19, 25, 34, 55, 71, 80]. These systems implemented an “open” architecture, i.e., a single-user, single-language, single-address space operating system that blurred the boundary between the operating system and the application itself [48]. These systems relied on language safety to protect against accidental errors but did not provide isolation of subsystems or user-applications (modern unikernels take a similar approach [2, 37, 56]).
SPIN was the first to suggest language safety as a mechanism to implement isolation of dynamic kernel extensions [13]. SPIN utilized Modula-3 pointers as capabilities to enforce confidentiality and integrity, but since pointers were exchanged across isolation boundaries it failed to provide fault isolation—a crashing extension left the system in an inconsistent state.
J-Kernel [85] and KaffeOS [6] were the first kernels to point out the problem that language safety alone is not sufficient for enforcing fault isolation and termination of untrusted subsystems. To support termination of isolated domains in Java, J-Kernel developed the idea of mediating accesses to all objects that are shared across domains [85]. J-Kernel introduced a special capability object that wraps the interface of the original object shared across isolated subsystems. To support domain termination, all capabilities created by a crashing domain were revoked, hence dropping the reference to the original object, which was then garbage collected, and preventing future accesses by returning an exception. J-Kernel relied on a custom class loader to validate cross-domain interfaces (i.e., generate remote-invocation proxies at run-time instead of using a static IDL compiler). To enforce isolation, J-Kernel utilized a special calling convention that allowed passing capability references by reference, but required a deep copy for regular unwrapped objects. Without ownership discipline for shared objects, J-Kernel provided a somewhat limited fault isolation model: the moment the domain that created the object crashed, all references to the shared objects were revoked, propagating faults into domains that acquired these objects through cross-domain invocations. Moreover, lack of "move" semantics, i.e., the ability to enforce that the caller lost access to the object when it was passed to the callee, implied that isolation required a deep copy of objects, which is prohibitive for isolation of modern, high-throughput device drivers.
Instead of mediating accesses to shared objects through capability references, KaffeOS adopts the technique of “write barriers” [88] that validate all pointer assignments throughout the system and hence can enforce a specific pointer discipline [6]. KaffeOS introduced separation of private domain and special shared heaps designated for sharing of objects across domains—explicit separation was critical to perform the write barrier check, i.e., if assigned pointer belonged to a specific heap. Write barriers were used to enforce the following invariants: 1) objects on the private heap were allowed to have pointers into objects on the shared heap, but 2) objects on the shared heap were constrained to the same shared heap.
On cross-domain invocations, when a reference to a shared object was passed to another domain, the write barrier was used to validate the invariants, and also to create a special pair of objects responsible for reference counting and garbage collecting shared objects. KaffeOS had the following fault isolation model: when the creator of the object terminated, other domains retained access to the object (reference counting ensured that eventually objects were deallocated when all sharers terminated). Unfortunately, while other domains were able to access the objects after their creator crashed, it was not sufficient for clean isolation—shared objects were potentially left in an inconsistent state (e.g., if the crash happened halfway through an object update), thus potentially halting or crashing other domains. Similar to J-Kernel, isolation of objects required a deep copy on a cross-domain invocation. Finally, performance overhead of mediating all pointer updates was high.
Singularity OS introduced a new fault isolation model built around a statically enforced ownership discipline [39]. Similar to KaffeOS, in Singularity applications used isolated private heaps and a special “exchange heap” for shared objects. A pioneering design decision was to enforce single ownership of objects allocated on the exchange heap, i.e., only one domain could have a reference to an object on the shared heap at a time. When a reference to an object was passed across domains the ownership of the object was “moved” between domains (an attempt to access the object after passing it to another domain was rejected by the compiler). Singularity developed a collection of novel static analysis and verification techniques enforcing this property statically in a garbage collected Sing# language. Single ownership was key for a clean and practical fault isolation model—crashing domains were not able to affect the rest of the system—not only their private heaps were isolated, but a novel ownership discipline allowed for isolation of the shared heap, i.e., there was no way for a crashing domain to trigger revocation of shared references in other domains, or leave shared objects in an inconsistent state. Moreover, single ownership allowed secure isolation in a zero-copy manner, i.e., the move semantics guaranteed that the sender of an object was losing access to it and hence allowed the receiver to update the object’s state knowing that the sender was not able to access new state or alter the old state underneath.
Building on the insights from J-Kernel, KaffeOS, and Singularity, our work develops principles for enforcing fault isolation in a safe language that enforces ownership. Similar to J-Kernel, we adopt wrapping of interfaces with proxies. We, however, generate proxies statically to avoid the run-time overhead. We rely on heap isolation similar to KaffeOS and Singularity. Our main reason for heap isolation is to be able to deallocate the domain's private heap without any semantic knowledge of objects inside. From Singularity, we borrow move semantics for the objects on the shared heap, which provide clean fault isolation and at the same time support zero-copy communication. We, however, extend it with read-only borrow semantics, which we need to support transparent domain recovery without giving up zero-copy. Since we implement RedLeaf in Rust, we benefit from its ownership discipline that allows us to enforce the move semantics for objects on the shared heap. Building on a body of research on linear types [86], affine types, alias types [18, 87], and region-based memory management [81], and being influenced by languages like Sing# [29], Vault [30], and Cyclone [43], Rust enforces ownership statically and without compromising usability of the language. In contrast to Singularity, which heavily relies on the co-design of Sing# [29] and its communication mechanisms, we develop RedLeaf's isolation abstractions (exchangeable types, interface validation, and cross-domain call proxying) outside of the Rust language. This allows us to clearly articulate the minimal set of principles required to provide fault isolation, and to develop a set of mechanisms implementing them independently from the language, which, arguably, allows adapting them to specific design tradeoffs. Finally, we make several design choices aimed at practicality of our system. We design and implement our isolation mechanisms for the most common, "migrating threads" model [31] instead of messages [39] to avoid a thread context switch on the critical cross-domain call path and allow a more natural programming idiom, e.g., in RedLeaf domain interfaces are just Rust traits.
3 RedLeaf Architecture
RedLeaf is structured as a microkernel system that relies on lightweight language-based domains for isolation (Figure 1). The microkernel implements functionality required to start threads of execution, memory management, domain loading, scheduling, and interrupt forwarding. A collection of isolated domains implement device drivers, personality of an operating system, i.e., the POSIX interface, and user applications (Section 4.5). As RedLeaf does not rely on hardware isolation primitives, all domains and the microkernel run in ring 0. Domains, however, are restricted to safe Rust (i.e., microkernel and trusted libraries are the only parts of RedLeaf that are allowed to use unsafe Rust extensions). We enforce the heap isolation invariant between domains. To communicate, domains allocate shareable objects from a global shared heap and exchange special pointers, remote references (RRef<T>), to objects allocated on the shared heap (Section 3.1). The ownership discipline allows us to implement lightweight zero-copy communication across isolated domains (Section 3.1).
Domains communicate via normal, typed Rust function invocations. Upon cross-domain invocation, the thread moves between domains but continues execution on the same stack. Domain developers provide an interface definition for the domain’s entry point and its interfaces. The RedLeaf IDL compiler automatically generates code for creating and initializing domains and checks the validity of all types passed across domain boundaries (Section 3.1.5).
RedLeaf mediates all cross-domain communication with trusted proxy objects. Proxies are automatically generated from the IDL definitions by the IDL compiler (Section 3.1.5). On every domain entry, the proxy checks if a domain is alive and if so, it creates a lightweight continuation that allows us to unwind execution of the thread if the domain crashes.
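A rough sketch of what such a proxy might look like follows. It is not the code RedLeaf's IDL compiler generates: the `Net` trait, `Driver`, `NetProxy`, and `RpcError` are hypothetical names, and `std::panic::catch_unwind` is used here only as a stand-in for the lightweight continuation mechanism mentioned in the text. The proxy checks a liveness flag on entry, converts a crash inside the domain into an error, and rejects later calls.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
enum RpcError {
    DomainDead,
}

/// Hypothetical cross-domain interface: just a Rust trait.
trait Net {
    fn send(&self, len: usize) -> Result<usize, RpcError>;
}

/// Hypothetical driver domain with a bug that triggers on empty packets.
struct Driver;

impl Net for Driver {
    fn send(&self, len: usize) -> Result<usize, RpcError> {
        if len == 0 {
            panic!("driver bug: empty packet");
        }
        Ok(len)
    }
}

/// Proxy that interposes on every call into the domain.
struct NetProxy {
    alive: AtomicBool,
    inner: Box<dyn Net>,
}

impl Net for NetProxy {
    fn send(&self, len: usize) -> Result<usize, RpcError> {
        // Entry check: a terminated domain is never invoked again.
        if !self.alive.load(Ordering::SeqCst) {
            return Err(RpcError::DomainDead);
        }
        // Stand-in for the continuation: catch the unwind at the boundary.
        match catch_unwind(AssertUnwindSafe(|| self.inner.send(len))) {
            Ok(result) => result,
            Err(_) => {
                self.alive.store(false, Ordering::SeqCst);
                Err(RpcError::DomainDead)
            }
        }
    }
}

/// Hypothetical scenario: one good call, one crashing call, one call after
/// the crash.
fn demo_proxy() -> Vec<Result<usize, RpcError>> {
    let proxy = NetProxy {
        alive: AtomicBool::new(true),
        inner: Box::new(Driver),
    };
    vec![proxy.send(64), proxy.send(0), proxy.send(64)]
}
```

The caller sees an ordinary `Result` in all three cases, so a crash in the driver never propagates as a panic into the calling domain. The sketch assumes the default unwinding panic strategy; it would not work with `panic = "abort"`.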
In RedLeaf, references to objects and traits are capabilities. In Rust, a trait declares a set of methods that a type must implement, hence providing an abstraction of an interface. To expose their functionality, domains exchange references to traits via cross-domain calls. We rely on capability-based access control [76] to enforce the principle of least privilege and enable flexible operating system organizations: e.g., we implement several scenarios in which applications talk to the device driver directly, bypassing the kernel, and can even link against device driver libraries, leveraging DPDK-style user-level device driver access.
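The idea that a trait reference acts as a capability can be sketched as follows. The `BlockRead` and `BlockWrite` traits, the `RamDisk` type, and the two "domain" functions are hypothetical and only illustrate the principle: code that is handed a `&dyn BlockRead` has no way, in safe Rust, to reach the write operation of the same object.

```rust
/// Hypothetical read-only capability.
trait BlockRead {
    fn read(&self, block: usize) -> u8;
}

/// Hypothetical write capability, granted separately.
trait BlockWrite {
    fn write(&mut self, block: usize, value: u8);
}

/// A tiny in-memory disk implementing both interfaces.
struct RamDisk {
    blocks: [u8; 4],
}

impl BlockRead for RamDisk {
    fn read(&self, block: usize) -> u8 {
        self.blocks[block]
    }
}

impl BlockWrite for RamDisk {
    fn write(&mut self, block: usize, value: u8) {
        self.blocks[block] = value;
    }
}

/// A "domain" that was granted only the read capability.
fn reader_domain(disk: &dyn BlockRead) -> u8 {
    // disk.write(1, 0); // would not compile: no `write` on `dyn BlockRead`
    disk.read(1)
}

/// A "domain" that was granted only the write capability.
fn writer_domain(disk: &mut dyn BlockWrite) {
    disk.write(1, 9);
}

/// Hypothetical scenario: the creator hands out different capabilities
/// to different domains.
fn demo_capabilities() -> u8 {
    let mut disk = RamDisk { blocks: [0; 4] };
    writer_domain(&mut disk);
    reader_domain(&disk)
}
```

Which trait reference is passed at domain creation decides what the domain is able to do, which is one way to read the least-privilege argument in the paragraph above.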
Protection model The core assumptions behind RedLeaf are that we trust (1) the Rust compiler to implement language safety correctly, and (2) Rust core libraries that use unsafe code, e.g., types that implement interior mutability, etc. RedLeaf’s TCB includes the microkernel, a small set of trusted RedLeaf crates required to implement hardware interfaces and low-level abstractions, device crates that provide a safe interface to hardware resources, e.g., access to DMA buffers, etc., the RedLeaf IDL compiler, and the RedLeaf trusted compilation environment. At the moment, we do not address vulnerabilities in unsafe Rust extensions, but again speculate that eventually all unsafe code will be verified for functional correctness [5, 8, 82]. Specifically, the RustBelt project provides a guide for ensuring that unsafe code is encapsulated within a safe interface [44].
We trust devices to be non-malicious. This requirement can be relaxed in the future by using IOMMUs to protect physical memory. Finally, we do not protect against side-channel attacks; while these are important, addressing them is beyond the scope of the current work. We speculate that hardware countermeasures that alleviate information leakage will find their way into future CPUs [41].
3.1 Domains and Fault Isolation
In RedLeaf, domains are units of information hiding, fault isolation, and composition. Device drivers, kernel subsystems (e.g., file system, network stack), and user programs are loaded as domains. Each domain starts with a reference to a microkernel system-call interface as one of its arguments. This interface allows every domain to create threads of execution, allocate memory, create synchronization objects, etc. By default, the microkernel system-call interface is the only authority of the domain, i.e., the only interface through which the domain can affect the rest of the system. Domains, however, can define a custom type for an entry function, requesting additional references to objects and interfaces to be passed when the domain is created. By default, we do not create a new thread of execution for the domain.
Every domain, however, can create threads from the init function called by the microkernel when the domain is loaded. Internally, the microkernel keeps track of all resources created on behalf of each domain: allocated memory, registered interrupt threads, etc. Threads can outlive the domain that created them, as they enter other domains where they can run indefinitely. Those threads continue running until they return to the crashed domain and it is the last domain in their continuation chain.
Fault isolation RedLeaf domains provide support for fault isolation. We define fault isolation in the following manner. We say that a domain crashes and needs to be terminated when one of the threads that enters the domain panics. A panic potentially leaves objects reachable from inside the domain in an inconsistent state, making further progress of any of the threads inside the domain impractical (i.e., even if threads do not deadlock or panic, the results of the computation are undefined). Then, we say that the fault is isolated if the following conditions hold. First, we can unwind all threads running inside the crashing domain to the domain entry point and return an error to the caller. Second, subsequent attempts to invoke the domain return errors but do not violate safety guarantees or result in panics. Third, all resources of the crashed domain can be safely deallocated, i.e., other domains do not hold references into the heap of the crashed domain (heap isolation invariant), and we can reclaim all resources owned by the domain without leaks. Fourth, threads in other domains continue execution, and can continue accessing objects that were allocated by the crashed domain but were moved to other domains before the crash.
Enforcing fault isolation is challenging. In RedLeaf, isolated subsystems export complex, semantically rich interfaces, i.e., domains are free to exchange references to interfaces and hierarchies of objects. We make several design choices that allow us to cleanly encapsulate a domain’s state and yet support semantically rich interfaces and zero-copy communication.
3.1.1 Heap Isolation and Sharing
Private and shared heaps To provide fault isolation across domains and ensure safe termination of domains, we enforce heap isolation across domains, i.e., objects allocated on the private heap, stack, or global data section of a domain cannot be reached from outside of the domain. This invariant allows us to safely terminate any domain at any moment of execution. Since no other domain holds pointers into the private heap of a terminated domain, it is safe to deallocate the entire heap.
To support efficient cross-domain communication, we provide a special, global shared heap for objects that can be sent across domains. Domains allocate objects on the shared heap in a way similar to normal heap allocation with the Rust Box<T> type, which allocates a value of type T on the heap. We construct a special type, remote reference or RRef<T>, that allocates a value of type T on the shared heap (Figure 2). RRef<T> consists of two parts: a small metadata header and the value itself.
The RRef<T> metadata contains an identifier of the domain currently owning the reference, a borrow counter, and type information for the value. The RRef<T> metadata and the value are allocated together on the shared heap, which allows an RRef<T> to outlive the domain that originally allocated it.
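The layout described above can be sketched in a few lines of Rust. This is only a model under stated assumptions: `Box` stands in for the shared heap, and the names `RRefInner`, `DomainId`, and `type_tag` are hypothetical and not taken from the RedLeaf sources.

```rust
use std::ops::Deref;

/// Hypothetical domain identifier; the text does not specify its representation.
pub type DomainId = u64;

/// Metadata and value live together in a single allocation.
struct RRefInner<T> {
    owner: DomainId,   // domain currently owning the reference
    borrow_count: u64, // outstanding immutable cross-domain borrows
    type_tag: u64,     // type information for the value
    value: T,
}

/// Model of a remote reference. `Box` stands in for the shared heap here.
pub struct RRef<T> {
    inner: Box<RRefInner<T>>,
}

impl<T> RRef<T> {
    /// Allocates metadata and value together, owned by `owner`.
    pub fn new(owner: DomainId, type_tag: u64, value: T) -> Self {
        RRef {
            inner: Box::new(RRefInner { owner, borrow_count: 0, type_tag, value }),
        }
    }

    pub fn owner(&self) -> DomainId {
        self.inner.owner
    }

    pub fn borrow_count(&self) -> u64 {
        self.inner.borrow_count
    }

    pub fn type_tag(&self) -> u64 {
        self.inner.type_tag
    }
}

/// Read access to the value, analogous to dereferencing a `Box<T>`.
impl<T> Deref for RRef<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.inner.value
    }
}
```

Because the metadata travels with the value, the allocation carries everything needed to reclaim it even after the allocating domain is gone.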
Memory allocation on the domain heap To provide encapsulation of a domain’s private heap, we implement a two-level memory allocation scheme. At the bottom, the microkernel provides domains with an interface for allocating untyped, coarse-grained memory regions (larger than one page). Each coarse-grained allocation is recorded in the heap registry. To serve fine-grained typed allocations on the domain’s private heap, each domain links against a trusted crate that provides the Rust memory allocation interface, Box<T>. Domain heap allocations follow Rust’s ownership discipline, i.e., objects are deallocated when they go out of scope. The two-level scheme has the following benefit: by allocating only large memory regions, the microkernel records all memory allocated by the domain without significant performance overhead. If the domain panics, the microkernel walks the registry of all untyped memory regions allocated by the allocator assigned to the domain and deallocates them without calling any destructors. Such untyped, coarse-grained deallocation is safe because we ensure the heap isolation invariant: other domains have no references into the deallocated heap.
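The bottom level of this scheme can be modeled as a small registry. The sketch below is illustrative only: `HeapRegistry`, `alloc_region`, `reclaim_domain`, the 4096-byte page size, and rounding requests up to whole pages are all assumptions, and a `Vec<u8>` stands in for an untyped physical region.

```rust
use std::collections::HashMap;

pub type DomainId = u64;
pub const PAGE_SIZE: usize = 4096; // assumed page size

/// Records every coarse-grained region handed out to each domain.
#[derive(Default)]
pub struct HeapRegistry {
    regions: HashMap<DomainId, Vec<Vec<u8>>>,
}

impl HeapRegistry {
    /// Allocates an untyped region of at least `min_bytes`, rounded up to
    /// whole pages, records it for `domain`, and returns the region size.
    pub fn alloc_region(&mut self, domain: DomainId, min_bytes: usize) -> usize {
        let pages = ((min_bytes + PAGE_SIZE - 1) / PAGE_SIZE).max(1);
        let size = pages * PAGE_SIZE;
        self.regions.entry(domain).or_default().push(vec![0u8; size]);
        size
    }

    /// Total bytes currently recorded for `domain`.
    pub fn bytes_owned(&self, domain: DomainId) -> usize {
        self.regions
            .get(&domain)
            .map_or(0, |rs| rs.iter().map(Vec::len).sum::<usize>())
    }

    /// Called when `domain` panics: releases all of its regions as raw bytes.
    /// No destructor of any object that lived inside them is run.
    pub fn reclaim_domain(&mut self, domain: DomainId) -> usize {
        let regions = self.regions.remove(&domain).unwrap_or_default();
        regions.iter().map(Vec::len).sum::<usize>()
    }
}
```

A fine-grained allocator linked into the domain would carve typed objects out of these regions; the registry itself never needs to know what they contain.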
3.1.2 Exchangeable Types
Objects allocated on the shared heap are subject to the following rule: they can be composed only of exchangeable types. Exchangeable types enforce the invariant that objects on the shared heap cannot have pointers into private or shared heaps, but can have RRef<T>s to other objects allocated on the shared heap. RedLeaf’s IDL compiler validates this invariant when generating interfaces of the domain (Section 3.1.5). We define exchangeable types as the following set: 1) RRef<T> itself, 2) a subset of Rust primitive Copy types, e.g., u32, u64, but not references in the general case, nor pointers, 3) anonymous (tuples, arrays) and named (enums, structs) composite types constructed out of exchangeable types, 4) references to traits with methods that receive exchangeable types. In addition, all trait methods must follow a calling convention that requires them to return the RpcResult<T> type to support clean abort semantics for threads returning from crashing domains (Section 3.1). The IDL compiler checks interface definitions and validates that all types are well-formed (Section 3.1.5).
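One way to picture the exchangeable-type rule is as a marker trait that is implemented only for the permitted building blocks. This is a swapped-in technique for illustration: the text above says RedLeaf validates the invariant in its IDL compiler, whereas the sketch below leans on the Rust type checker. The names `Exchangeable`, `is_exchangeable`, and `Packet` are hypothetical.

```rust
/// Marker for types that may be placed on the shared heap (hypothetical name).
pub trait Exchangeable {}

/// Stand-in for a remote reference; `Box` models the shared heap.
pub struct RRef<T: Exchangeable> {
    value: Box<T>,
}

impl<T: Exchangeable> RRef<T> {
    pub fn new(value: T) -> Self {
        RRef { value: Box::new(value) }
    }

    pub fn get(&self) -> &T {
        &self.value
    }
}

// 1) RRef<T> itself.
impl<T: Exchangeable> Exchangeable for RRef<T> {}

// 2) A subset of primitive Copy types. References and raw pointers are
//    deliberately left out, so `&u32` or `*const u8` fail the bound.
impl Exchangeable for u8 {}
impl Exchangeable for u32 {}
impl Exchangeable for u64 {}
impl Exchangeable for bool {}

// 3) Anonymous composites built from exchangeable types.
impl<T: Exchangeable, const N: usize> Exchangeable for [T; N] {}
impl<A: Exchangeable, B: Exchangeable> Exchangeable for (A, B) {}

/// A named composite. This manual impl is NOT checked field by field;
/// in this sketch the author of the impl takes the role of the IDL compiler.
pub struct Packet {
    pub len: u32,
    pub data: [u8; 64],
}
impl Exchangeable for Packet {}

/// Compiles only when `T` satisfies the marker bound.
pub fn is_exchangeable<T: Exchangeable>() -> bool {
    true
}
```

With these definitions, `is_exchangeable::<RRef<Packet>>()` type-checks, while `is_exchangeable::<&u32>()` is rejected at compile time.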
3.1.3 Ownership Tracking
In RedLeaf, RRef<T>s can be freely passed between domains.
We allow RRef<T>s to be moved or borrowed immutably. However, we implement an ownership discipline for RRef<T>s that is enforced on cross-domain invocations. Ownership tracking allows us to safely deallocate objects on the shared heap owned by a crashing domain. The metadata section of the RRef<T> keeps track of the owner domain and the number of times it was borrowed immutably on cross-domain invocations.
Initially, an RRef<T> is owned by the domain that allocates the reference. If the reference is moved to another domain in a cross-domain call, we change the owner identifier inside the RRef<T>, moving ownership from one domain to another. All cross-domain communication is mediated by trusted proxies, so we can securely update the owner identifier from the proxy. Rust’s ownership discipline ensures that there is always only one remote reference to the object inside the domain; hence, when the reference is moved between domains on a cross-domain call, the caller loses access to the object, passing it to the callee. If the reference is borrowed immutably in a cross-domain call, we do not change the owner identifier inside the RRef<T>, but instead increment the counter that tracks the number of times the RRef<T> was borrowed.
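The two cases, move and immutable borrow, can be sketched as two helpers that a trusted proxy might call. The names `proxy_move` and `proxy_borrow` are hypothetical, `Box` again stands in for the shared heap, and decrementing the counter when the call returns is an assumption of this sketch (the text only states that the counter is incremented).

```rust
use std::cell::Cell;

pub type DomainId = u64;

/// Minimal remote-reference model with ownership metadata.
pub struct RRef<T> {
    owner: DomainId,
    borrow_count: Cell<u64>,
    value: Box<T>,
}

impl<T> RRef<T> {
    pub fn new(owner: DomainId, value: T) -> Self {
        RRef { owner, borrow_count: Cell::new(0), value: Box::new(value) }
    }

    pub fn owner(&self) -> DomainId {
        self.owner
    }

    pub fn borrow_count(&self) -> u64 {
        self.borrow_count.get()
    }
}

/// Move across a domain boundary: the caller gives up the value (it is
/// consumed here) and the proxy rewrites the owner identifier.
pub fn proxy_move<T>(mut r: RRef<T>, callee: DomainId) -> RRef<T> {
    r.owner = callee;
    r
}

/// Immutable borrow across a domain boundary: the owner is unchanged and
/// only the borrow counter is adjusted around the call.
pub fn proxy_borrow<T, R>(r: &RRef<T>, call: impl FnOnce(&T) -> R) -> R {
    r.borrow_count.set(r.borrow_count.get() + 1);
    let out = call(&r.value);
    // Assumption: the borrow ends when the cross-domain call returns.
    r.borrow_count.set(r.borrow_count.get() - 1);
    out
}
```

Since `proxy_move` takes the `RRef<T>` by value, the Rust compiler itself rejects any later use of the reference by the caller.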
Recursive references RRef<T>s can form hierarchies of objects. To avoid moving all RRef<T>s in the hierarchy recursively on a cross-domain invocation, only the root of the object hierarchy has a valid owner identifier (in Figure 2 only object X has a valid domain identifier A, object Y does not). Upon a cross-domain call, the root RRef<T> is updated by the proxy, which changes the domain identifier to move ownership of the RRef<T> between domains. This requires a special scheme for deallocating RRef<T>s in case of a crash: we scan the entire RRef<T> registry to clean up resources owned by a crashing domain. To prevent deallocation of child objects of the hierarchy during the scan, we rely on the fact that they do not have a valid RRef<T> owner identifier (we skip them during the scan). The drop method of the root RRef<T> object walks the entire hierarchy and deallocates all child objects (RRef<T>s cannot form cycles). Note that we must carefully handle the case when an RRef<T> is taken out of the hierarchy. To deallocate this RRef<T> correctly, we need to assign it a valid domain identifier, i.e., Y gets a proper domain identifier when it is moved out from X. We mediate RRef<T> field assignments with trusted accessor methods. We generate accessor methods that provide the only way to take an RRef<T> out of an object field. This allows us to mediate the move operation and update the domain identifier for the moved RRef<T>. Note that accessors cannot be enforced for unnamed composite types, e.g., arrays and tuples. For these types, we update ownership of all composite elements upon crossing the domain boundary.
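A rough sketch of such an accessor follows, using the X and Y objects from the description above. Several details are assumptions made for illustration: the `NO_OWNER` sentinel for children, the accessor name `take_child`, and the choice to give the extracted child the root's current owner. In RedLeaf the accessors are generated rather than written by hand.

```rust
pub type DomainId = u64;

/// Hypothetical sentinel meaning "no valid owner" (child of a hierarchy).
pub const NO_OWNER: DomainId = 0;

pub struct RRef<T> {
    owner: DomainId,
    value: Box<T>,
}

impl<T> RRef<T> {
    /// Root of a hierarchy: carries a valid owner identifier.
    pub fn new_root(owner: DomainId, value: T) -> Self {
        RRef { owner, value: Box::new(value) }
    }

    /// Child object: starts without a valid owner identifier.
    pub fn new_child(value: T) -> Self {
        RRef { owner: NO_OWNER, value: Box::new(value) }
    }

    pub fn owner(&self) -> DomainId {
        self.owner
    }

    pub fn get(&self) -> &T {
        &self.value
    }
}

pub struct Y {
    pub payload: u32,
}

/// The child field is private, so the accessor below is the only way out.
pub struct X {
    child: Option<RRef<Y>>,
}

impl X {
    pub fn new(child: RRef<Y>) -> Self {
        X { child: Some(child) }
    }
}

impl RRef<X> {
    /// Trusted accessor: takes Y out of X and stamps it with a valid owner
    /// so that it can later be reclaimed on its own.
    pub fn take_child(&mut self) -> Option<RRef<Y>> {
        let mut child = self.value.child.take()?;
        child.owner = self.owner; // assumption: inherit the root's owner
        Some(child)
    }
}
```

Keeping the field private is what makes the accessor unavoidable; tuples and arrays have no such private fields, which matches the remark above that accessors cannot be enforced for them.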
Reclaiming shared heap Ownership tracking allows us to deallocate objects that are currently owned by the crashing domain. We maintain a global registry of all allocated RRef<T>s (Figure 2). When a domain panics, we walk through the registry and deallocate all references that are owned by the crashing domain. If an RRef<T> is borrowed, we defer deallocation until the borrow count drops to zero. Deallocation of each RRef<T> requires that we have a drop method for each RRef<T> type and can identify the type of the reference dynamically. Each RRef<T> has a unique type identifier generated by the IDL compiler (the IDL compiler knows all RRef<T> types in the system as it generates all cross-domain interfaces). We store the type identifier along with the RRef<T> and invoke the appropriate drop method to correctly deallocate any, possibly hierarchical, data structure on the shared heap.
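The registry walk can be modeled with a table that maps type identifiers to drop functions. Everything named here (`Registry`, `TypeTag`, `drop_as`, `Tracked`, the `pending_free` flag) is hypothetical, and `Box<dyn Any>` with a checked downcast is a safe stand-in for the raw shared-heap memory a real implementation would manage.

```rust
use std::any::Any;
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

pub type DomainId = u64;
pub type TypeTag = u64;
type DropFn = fn(Box<dyn Any>);

/// Drop function for one concrete type, selected at run time by its tag.
fn drop_as<T: 'static>(value: Box<dyn Any>) {
    let typed: Box<T> = value.downcast::<T>().expect("type tag does not match value");
    drop(typed);
}

struct Entry {
    owner: Option<DomainId>, // None: child of a hierarchy, skipped by the scan
    borrow_count: u64,
    type_tag: TypeTag,
    pending_free: bool, // owner crashed while the object was borrowed
    value: Box<dyn Any>,
}

#[derive(Default)]
pub struct Registry {
    entries: HashMap<usize, Entry>,
    drop_table: HashMap<TypeTag, DropFn>,
    next_id: usize,
}

impl Registry {
    pub fn register_type<T: 'static>(&mut self, tag: TypeTag) {
        self.drop_table.insert(tag, drop_as::<T> as DropFn);
    }

    pub fn alloc<T: 'static>(&mut self, owner: Option<DomainId>, tag: TypeTag, value: T) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        let entry = Entry { owner, borrow_count: 0, type_tag: tag, pending_free: false, value: Box::new(value) };
        self.entries.insert(id, entry);
        id
    }

    pub fn begin_borrow(&mut self, id: usize) {
        self.entries.get_mut(&id).expect("unknown id").borrow_count += 1;
    }

    pub fn end_borrow(&mut self, id: usize) {
        let e = self.entries.get_mut(&id).expect("unknown id");
        e.borrow_count -= 1;
        if e.borrow_count == 0 && e.pending_free {
            self.free(id); // deferred deallocation happens here
        }
    }

    fn free(&mut self, id: usize) {
        let e = self.entries.remove(&id).expect("unknown id");
        let drop_fn = self.drop_table[&e.type_tag];
        drop_fn(e.value);
    }

    /// Walks the registry after `crashed` panics; returns how many objects
    /// were freed immediately (borrowed ones are only marked).
    pub fn reclaim_domain(&mut self, crashed: DomainId) -> usize {
        let owned: Vec<usize> = self.entries.iter()
            .filter(|(_, e)| e.owner == Some(crashed))
            .map(|(id, _)| *id)
            .collect();
        let mut freed = 0;
        for id in owned {
            let e = self.entries.get_mut(&id).expect("id collected above");
            if e.borrow_count > 0 {
                e.pending_free = true;
            } else {
                self.free(id);
                freed += 1;
            }
        }
        freed
    }

    pub fn live_objects(&self) -> usize {
        self.entries.len()
    }
}

/// Demo payload that counts how many times it has been dropped.
pub struct Tracked(pub Rc<Cell<u32>>);

impl Drop for Tracked {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}
```

In this model an object owned by the crashed domain but still borrowed elsewhere survives the scan and is released by `end_borrow` once its count reaches zero.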
3.1.4 Cross-Domain Call Proxying
To enforce fault isolation, RedLeaf relies on invocation proxies to interpose on all cross-domain invocations (Figure 2). A proxy object exposes an interface identical to the interface it mediates. Hence the proxy interposition is transparent to the user of the interface. To ensure isolation and safety, the proxy implements the following inside each wrapped function:
1) The proxy checks if the domain is alive before performing the invocation. If the domain is alive, the proxy records the fact that the thread moves between domains by updating its state in the microkernel. We use this information to unwind all threads that happen to execute inside the domain when it crashes.
2) For each invocation, the proxy creates a lightweight continuation that captures the state of the thread right before the cross-domain invocation. The continuation allows us to unwind execution of the thread, and return an error to the caller.
3) The proxy moves ownership of all RRef<T>s passed as arguments between domains, or updates the borrow count for all references borrowed immutably.
4) Finally, the proxy wraps all trait references passed as arguments: the proxy creates a new proxy for each trait and passes the reference to the trait implemented by that proxy.
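The first of these steps, together with the transparency of the proxy, can be sketched as follows. The `BlockDev` trait, `Driver`, `DomainState`, and the `RpcError::DomainDead` variant are hypothetical names; continuation capture, RRef<T> ownership transfer, and trait wrapping are indicated only by a comment, and in RedLeaf the proxy is generated by the IDL compiler rather than written by hand.

```rust
use std::cell::Cell;
use std::rc::Rc;

#[derive(Debug, PartialEq)]
pub enum RpcError {
    DomainDead, // hypothetical error variant
}
pub type RpcResult<T> = Result<T, RpcError>;

/// A cross-domain interface (hypothetical example).
pub trait BlockDev {
    fn read(&self, block: u32) -> RpcResult<u8>;
}

/// The real implementation living inside the callee domain.
pub struct Driver;

impl BlockDev for Driver {
    fn read(&self, block: u32) -> RpcResult<u8> {
        Ok(block as u8)
    }
}

/// Per-domain state that the microkernel would keep.
pub struct DomainState {
    alive: Cell<bool>,
    threads_inside: Cell<u32>,
}

impl DomainState {
    pub fn new() -> Self {
        DomainState { alive: Cell::new(true), threads_inside: Cell::new(0) }
    }

    pub fn mark_crashed(&self) {
        self.alive.set(false);
    }

    pub fn threads_inside(&self) -> u32 {
        self.threads_inside.get()
    }
}

/// The proxy exposes the same trait as the object it mediates.
pub struct BlockDevProxy {
    state: Rc<DomainState>,
    target: Box<dyn BlockDev>,
}

impl BlockDevProxy {
    pub fn new(state: Rc<DomainState>, target: Box<dyn BlockDev>) -> Self {
        BlockDevProxy { state, target }
    }
}

impl BlockDev for BlockDevProxy {
    fn read(&self, block: u32) -> RpcResult<u8> {
        // Step 1: refuse the call if the callee domain is no longer alive.
        if !self.state.alive.get() {
            return Err(RpcError::DomainDead);
        }
        // Record that this thread is now executing inside the callee.
        self.state.threads_inside.set(self.state.threads_inside.get() + 1);
        // Steps 2-4 (continuation, RRef ownership, trait wrapping) go here.
        let result = self.target.read(block);
        self.state.threads_inside.set(self.state.threads_inside.get() - 1);
        result
    }
}
```

Because `BlockDevProxy` implements `BlockDev`, a caller holding a `Box<dyn BlockDev>` cannot tell whether it talks to the driver or to the proxy.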
Thread unwinding To unwind execution of a thread from a crashing domain, we capture the state of the thread right before it enters the callee domain. For each function of the trait mediated by the proxy, we utilize an assembly trampoline that saves all general registers into a continuation. The microkernel maintains a stack of continuations for each thread. Each continuation contains the state of all general registers and a pointer to an error handling function that has the signature identical to the function exported by the domain’s interface. If we have to unwind the thread, we restore the stack to the state captured by the continuation, and invoke the error handling function on the same stack and with the same values of general registers. The error handling function returns an error to the caller.
To cleanly return an error in case of a crash, we enforce the following calling convention for all cross-domain invocations: every cross-domain function must return RpcResult<T>, an enumerated type that either holds the returned value or an error (Figure 3). This allows us to implement the following invariant: functions unwound from the crashed domain never return corrupted data, but instead return an RpcResult<T> error.
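The effect of this calling convention can be imitated in ordinary Rust. Note the substitution: RedLeaf unwinds with its own continuations and assembly trampolines, whereas the sketch below uses the standard library's `std::panic::catch_unwind` purely to show the caller-visible behavior. `RpcError::CalleePanicked`, `invoke`, and `flaky_read` are hypothetical names, and the sketch assumes the default `panic = "unwind"` setting.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[derive(Debug, PartialEq)]
pub enum RpcError {
    CalleePanicked, // hypothetical error variant
}

/// Either the returned value or an error, as the convention requires.
pub type RpcResult<T> = Result<T, RpcError>;

/// Runs a cross-domain call. If the callee panics, the caller receives an
/// error value instead of partially computed data.
pub fn invoke<T>(callee: impl FnOnce() -> RpcResult<T>) -> RpcResult<T> {
    match catch_unwind(AssertUnwindSafe(callee)) {
        Ok(result) => result,
        Err(_) => Err(RpcError::CalleePanicked),
    }
}

/// A callee that panics on demand, standing in for a buggy driver.
pub fn flaky_read(fail: bool) -> RpcResult<u32> {
    if fail {
        panic!("simulated driver bug");
    }
    Ok(42)
}
```

A call such as `invoke(|| flaky_read(true))` yields an `Err` value that the caller must handle explicitly, which is the property the invariant above describes.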