Description
I'm currently investigating the potential of Rust for MCUs. The features provided by Embassy and Rust are a real step toward better codebases, but there is a huge drawback: binary size.
A simple blink program for an stm32f4 target in C is close to 1k bytes, half of which is the vector table (512 bytes). Adding async behavior, for example using ProtoThread, only slightly increases this size.
The current default release binary (text size returned by arm-none-eabi-size) for the stm32f4 blinky with embassy is 21260 bytes, which is huge.
File .text Size Crate Name
0.1% 18.5% 3.4KiB embassy_stm32 embassy_stm32::_generated::<impl core::ops::arith::Mul<stm32_metapac::rcc::vals::Plln> for embassy_stm32::time::Hertz>::mul
0.0% 8.0% 1.5KiB embassy_stm32 embassy_stm32::rcc::_version::init
0.0% 6.5% 1.2KiB std core::str::count::do_count_chars
0.0% 6.0% 1.1KiB embassy_stm32 embassy_stm32::rcc::_version::init_pll
0.0% 5.6% 1.0KiB embassy_stm32 embassy_stm32::_generated::<impl core::ops::arith::Div<stm32_metapac::rcc::vals::Pllm> for embassy_stm32::time::Hertz>::div
0.0% 3.9% 744B std compiler_builtins::mem::memcpy
0.0% 3.2% 602B std core::fmt::Formatter::pad_integral
0.0% 2.5% 476B embassy_stm32 embassy_stm32::rcc::bd::LsConfig::init
0.0% 2.5% 472B std core::fmt::Formatter::pad
0.0% 2.3% 438B embassy_stm32 embassy_stm32::init
0.0% 2.2% 408B defmt_rtt <defmt_rtt::Logger as defmt::traits::Logger>::write
0.0% 2.0% 384B embassy_stm32 TIM4
0.0% 1.9% 366B std core::fmt::write
0.0% 1.9% 364B embassy_stm32 embassy_stm32::dma::dma_bdma::<impl embassy_stm32::dma::AnyChannel>::on_irq
0.0% 1.9% 356B embassy_stm32 <embassy_stm32::_generated::Clocks as defmt::traits::Format>::_format_data
0.0% 1.6% 310B embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.0% 1.6% 300B embassy_executor embassy_executor::raw::wake_task
0.0% 1.6% 300B embassy_executor embassy_executor::raw::waker::wake
0.0% 1.4% 274B defmt_rtt <defmt_rtt::Logger as defmt::traits::Logger>::release
0.0% 1.4% 266B embassy_executor embassy_executor::raw::Executor::spawn
0.1% 23.3% 4.3KiB And 87 smaller methods. Use -n N to show more.
This may not be an issue for big MCUs (>= 256k of flash), but it makes embassy unusable on some low-flash MCUs (64k).
As a first step, we can remove all defmt logging and the associated panic handler from the release binary (diorcety@dcb1590). The binary size is more than halved: 9112 bytes
File .text Size Crate Name
1.0% 50.3% 4.2KiB embassy_stm32 embassy_stm32::rcc::_version::init_pll
0.4% 22.5% 1.9KiB embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.1% 3.8% 328B embassy_stm32 embassy_stm32::dma::dma_bdma::<impl embassy_stm32::dma::AnyChannel>::on_irq
0.1% 3.0% 258B embassy_time_queue_utils embassy_time_queue_utils::queue_integrated::Queue::next_expiration
0.0% 2.5% 212B embassy_time <embassy_time::timer::Timer as core::future::future::Future>::poll
0.0% 2.4% 202B embassy_executor embassy_executor::arch::thread::Executor::run
0.0% 1.9% 160B embassy_stm32 embassy_stm32::time_driver::RtcDriver::set_alarm
0.0% 1.6% 138B embassy_stm32 TIM4
0.0% 1.5% 128B embassy_executor embassy_executor::raw::waker::wake
0.0% 1.4% 116B embassy_stm32 embassy_stm32::rcc::RccInfo::enable_and_reset_with_cs
0.0% 1.2% 106B embassy_stm32 embassy_stm32::exti::on_irq
0.0% 1.0% 84B embassy_stm32 embassy_stm32::time_driver::RtcDriver::next_period
0.0% 0.7% 64B embassy_stm32 embassy_stm32::gpio::Flex::set_as_output
0.0% 0.7% 60B embassy_sync embassy_sync::waitqueue::atomic_waker::AtomicWaker::wake
0.0% 0.5% 44B embassy_stm32 embassy_stm32::_generated::<impl core::ops::arith::Div<stm32_metapac::rcc::vals::Ppre> for embassy_stm32::time::Hertz>::div
0.0% 0.5% 40B cortex_m_rt Reset
0.0% 0.3% 22B blinky blinky::__cortex_m_rt_main
0.0% 0.2% 16B embassy_executor embassy_executor::raw::waker::clone
0.0% 0.2% 14B embassy_stm32 DMA1_STREAM0
0.0% 0.2% 14B embassy_stm32 DMA1_STREAM1
0.1% 3.5% 300B And 27 smaller methods. Use -n N to show more.
A huge part of this code is the RCC initialization. Compared to a C blink (for example https://github.com/platformio/platform-ststm32/tree/develop/examples/stm32cube-ll-blink), where it takes less than 100 bytes,
in embassy the RCC initialization takes 4K! Doing a dummy RCC initialization (diorcety@a2bc3c7) reduces the binary to 2728 bytes
File .text Size Crate Name
0.1% 16.4% 358B embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.1% 15.0% 328B embassy_stm32 embassy_stm32::dma::dma_bdma::<impl embassy_stm32::dma::AnyChannel>::on_irq
0.1% 11.8% 258B embassy_time_queue_utils embassy_time_queue_utils::queue_integrated::Queue::next_expiration
0.1% 9.3% 202B embassy_executor embassy_executor::arch::thread::Executor::run
0.1% 7.1% 154B embassy_time <embassy_time::timer::Timer as core::future::future::Future>::poll
0.0% 5.9% 128B embassy_executor embassy_executor::raw::waker::wake
0.0% 4.9% 106B embassy_stm32 embassy_stm32::exti::on_irq
0.0% 4.4% 96B [Unknown] SysTick
0.0% 2.9% 64B embassy_stm32 embassy_stm32::gpio::Flex::set_as_output
0.0% 2.8% 60B embassy_sync embassy_sync::waitqueue::atomic_waker::AtomicWaker::wake
0.0% 1.8% 40B cortex_m_rt Reset
0.0% 1.0% 22B blinky blinky::__cortex_m_rt_main
0.0% 0.7% 16B embassy_executor embassy_executor::raw::waker::clone
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM0
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM1
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM2
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM3
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM4
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM5
0.0% 0.6% 14B embassy_stm32 DMA1_STREAM6
0.1% 10.6% 230B And 22 smaller methods. Use -n N to show more.
Duration, Instant and the associated functions use the u64 type. In some cases u64 may be overkill, and it is not very efficient on 32-bit targets. We can replace u64 with u32 (diorcety@002fefd): 2644 bytes
File .text Size Crate Name
0.1% 16.6% 348B embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.1% 15.6% 328B embassy_stm32 embassy_stm32::dma::dma_bdma::<impl embassy_stm32::dma::AnyChannel>::on_irq
0.1% 9.6% 202B embassy_executor embassy_executor::arch::thread::Executor::run
0.1% 9.5% 200B embassy_time_queue_utils embassy_time_queue_utils::queue_integrated::Queue::next_expiration
0.1% 6.6% 138B embassy_time <embassy_time::timer::Timer as core::future::future::Future>::poll
0.0% 6.1% 128B embassy_executor embassy_executor::raw::waker::wake
0.0% 5.1% 106B embassy_stm32 embassy_stm32::exti::on_irq
0.0% 4.5% 94B [Unknown] SysTick
0.0% 3.1% 64B embassy_stm32 embassy_stm32::gpio::Flex::set_as_output
0.0% 2.9% 60B embassy_sync embassy_sync::waitqueue::atomic_waker::AtomicWaker::wake
0.0% 1.9% 40B cortex_m_rt Reset
0.0% 1.0% 22B blinky blinky::__cortex_m_rt_main
0.0% 0.8% 16B embassy_executor embassy_executor::raw::waker::clone
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM0
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM1
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM2
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM3
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM4
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM5
0.0% 0.7% 14B embassy_stm32 DMA1_STREAM6
0.1% 11.0% 230B And 22 smaller methods. Use -n N to show more.
0.8% 100.0% 2.0KiB .text section size, the file size is 258.6KiB
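The cost of moving from u64 to u32 ticks is that the counter wraps around much sooner, so deadlines have to be compared with wrapping arithmetic. The following is a minimal sketch of that idea, not the code from the linked commit: `Instant32`, `add_ticks`, `has_reached` and `wrap_period_secs` are hypothetical names, and the comparison assumes that two instants are never more than 2^31 ticks apart.

```rust
/// Hypothetical 32-bit tick instant (illustration only, not an embassy-time type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Instant32(u32);

impl Instant32 {
    /// Instant `ticks` later; wraps around on overflow instead of panicking.
    fn add_ticks(self, ticks: u32) -> Self {
        Instant32(self.0.wrapping_add(ticks))
    }

    /// True if `self` is at or after `deadline`.
    /// Assumption: both instants are less than 2^31 ticks apart.
    fn has_reached(self, deadline: Instant32) -> bool {
        // Reinterpreting the wrapped difference as signed gives the ordering.
        (self.0.wrapping_sub(deadline.0) as i32) >= 0
    }
}

/// Whole seconds before a u32 tick counter wraps at `tick_hz` (must be non-zero).
fn wrap_period_secs(tick_hz: u32) -> u64 {
    (1u64 << 32) / tick_hz as u64
}

fn main() {
    // Start just before the counter wraps.
    let now = Instant32(u32::MAX - 10);
    let deadline = now.add_ticks(100);
    assert_eq!(deadline, Instant32(89)); // the deadline lies past the wrap point

    assert!(!now.has_reached(deadline));
    assert!(now.add_ticks(100).has_reached(deadline));
    assert!(now.add_ticks(150).has_reached(deadline));

    // A 1 MHz tick wraps after a bit more than 71 minutes,
    // a 32.768 kHz tick after roughly 36 hours.
    assert_eq!(wrap_period_secs(1_000_000), 4294);
    assert_eq!(wrap_period_secs(32_768), 131_072);
}
```

With this scheme, any timer longer than half the wrap period needs extra handling, which is part of what the u64 type avoids.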
The EXTI and DMA IRQ handlers are included even if they are not used at all. Removing them (diorcety@52d0353): 1768 bytes
File .text Size Crate Name
0.1% 25.8% 348B embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.1% 15.0% 202B embassy_executor embassy_executor::arch::thread::Executor::run
0.1% 14.8% 200B embassy_time_queue_utils embassy_time_queue_utils::queue_integrated::Queue::next_expiration
0.1% 10.2% 138B embassy_time <embassy_time::timer::Timer as core::future::future::Future>::poll
0.1% 9.5% 128B embassy_executor embassy_executor::raw::waker::wake
0.0% 7.0% 94B [Unknown] SysTick
0.0% 4.7% 64B embassy_stm32 embassy_stm32::gpio::Flex::set_as_output
0.0% 3.0% 40B cortex_m_rt Reset
0.0% 1.6% 22B blinky blinky::__cortex_m_rt_main
0.0% 1.2% 16B embassy_executor embassy_executor::raw::waker::clone
0.0% 0.9% 12B embassy_executor embassy_executor::raw::util::UninitCell<T>::write_in_place
0.0% 0.6% 8B std core::cell::panic_already_borrowed
0.0% 0.6% 8B std core::option::unwrap_failed
0.0% 0.6% 8B std core::panicking::panic_fmt
0.0% 0.6% 8B [Unknown] main
0.0% 0.4% 6B cortex_m_rt HardFault_
0.0% 0.4% 6B panic_halt __rustc::rust_begin_unwind
0.0% 0.4% 6B embassy_executor embassy_executor::raw::waker::drop
0.0% 0.4% 6B cortex_m_rt DefaultPreInit
0.0% 0.4% 6B cortex_m_rt DefaultHandler_
0.6% 100.0% 1.3KiB .text section size, the file size is 227.2KiB
A big part of what remains is the fmt functions called by the panic! and assert! macros. Most MCU programs in release mode have no debug interface, so all the formatting done in these macros is for nothing.
Forcing some unstable parameters in .cargo/config.toml (this requires a nightly toolchain) removes this formatting code:
[unstable]
build-std = ["core", "panic_abort"]
build-std-features = ["panic_immediate_abort"]
Final size (diorcety@971b8a8): 1720 bytes
File .text Size Crate Name
0.1% 26.2% 340B embassy_executor embassy_executor::raw::TaskStorage<F>::poll
0.1% 15.4% 200B embassy_time_queue_utils embassy_time_queue_utils::queue_integrated::Queue::next_expiration
0.1% 15.4% 200B embassy_executor embassy_executor::arch::thread::Executor::run
0.1% 10.2% 132B embassy_time <embassy_time::timer::Timer as core::future::future::Future>::poll
0.1% 9.8% 128B embassy_executor embassy_executor::raw::waker::wake
0.0% 7.1% 92B [Unknown] SysTick
0.0% 4.9% 64B embassy_stm32 embassy_stm32::gpio::Flex::set_as_output
0.0% 3.1% 40B cortex_m_rt Reset
0.0% 1.7% 22B blinky blinky::__cortex_m_rt_main
0.0% 1.2% 16B embassy_executor embassy_executor::raw::waker::clone
0.0% 0.9% 12B embassy_executor embassy_executor::raw::util::UninitCell<T>::write_in_place
0.0% 0.6% 8B [Unknown] main
0.0% 0.5% 6B cortex_m_rt HardFault_
0.0% 0.5% 6B embassy_executor embassy_executor::raw::waker::drop
0.0% 0.5% 6B cortex_m_rt DefaultPreInit
0.0% 0.5% 6B cortex_m_rt DefaultHandler_
0.5% 100.0% 1.3KiB .text section size, the file size is 237.3KiB
The remaining overhead compared to a C blinky (about 700 bytes) comes from the async/await machinery, which seems fair.
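To make the contribution of each step easier to compare, the sizes reported above can be tallied with a few lines of Rust. This is only arithmetic on the numbers already given. The step labels are shorthand, and `saved_bytes` is a helper name chosen for this sketch.

```rust
/// (step label, text size in bytes) as reported by arm-none-eabi-size above.
const STEPS: [(&str, u32); 6] = [
    ("default release build", 21260),
    ("defmt and panic handler removed", 9112),
    ("dummy RCC init", 2728),
    ("u32 instead of u64 ticks", 2644),
    ("unused EXTI/DMA IRQs removed", 1768),
    ("panic_immediate_abort", 1720),
];

/// Bytes saved by each step relative to the previous one.
/// Assumption: sizes never increase from one step to the next.
fn saved_bytes(steps: &[(&str, u32)]) -> Vec<u32> {
    steps.windows(2).map(|w| w[0].1 - w[1].1).collect()
}

fn main() {
    let saved = saved_bytes(&STEPS);
    for (step, bytes) in STEPS.iter().skip(1).zip(&saved) {
        println!("{:<32} {:>6} B (-{} B)", step.0, step.1, bytes);
    }
    // The per-step savings add up to the overall reduction.
    let total: u32 = saved.iter().sum();
    assert_eq!(total, STEPS[0].1 - STEPS[STEPS.len() - 1].1);
}
```

Almost all of the reduction comes from the first two steps (logging and RCC initialization), while the u32 change and panic_immediate_abort each save less than 100 bytes.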
I'm not saying that my modifications here are correct (notably the u32 for Durations/Ticks part and the dummy RCC, which were done without much thought), just pointing out that some design choices make it almost impossible to use embassy on low-end devices.
To summarize, here are the modifications that could be made to allow the use of embassy on such devices:
- Provide a way to reduce the size of init/rcc::init. Maybe using compile-time RCC initialization code generation? With support for multiple profiles, allowing switching between them at runtime (as available with some MCU toolchains)?
- Create another time_driver that only uses native unsigned integers, if possible based on the SysTick available in almost all (all?) Cortex cores, which would also reduce resource usage.
- Remove unused IRQs: some "real world" applications don't even require DMA or EXTI.
- More aggressive optimization in release: maybe behind a feature, or by default.
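One possible shape for the compile-time RCC idea in the first point is to move the PLL arithmetic and validation into a `const fn`, so that only the resulting constants need to reach the binary. The sketch below is not embassy's API: `PllConfig` and `pll_output_hz` are hypothetical, and the VCO limits are illustrative values that should be checked against the reference manual of the actual part.

```rust
/// Hypothetical PLL settings (illustration only, not an embassy-stm32 type).
#[derive(Clone, Copy)]
struct PllConfig {
    src_hz: u32, // PLL input clock, e.g. an external crystal
    m: u32,      // input divider
    n: u32,      // VCO multiplier
    p: u32,      // output divider
}

/// Computes the PLL output frequency and validates the settings.
/// When called in a const context, a bad configuration fails the build
/// instead of being checked at runtime.
const fn pll_output_hz(cfg: PllConfig) -> u32 {
    let vco_in = cfg.src_hz / cfg.m;
    // Assumed limits, to be verified against the device reference manual.
    assert!(vco_in >= 1_000_000 && vco_in <= 2_000_000, "VCO input out of range");
    let vco_out = vco_in * cfg.n;
    assert!(vco_out >= 100_000_000 && vco_out <= 432_000_000, "VCO output out of range");
    assert!(cfg.p == 2 || cfg.p == 4 || cfg.p == 6 || cfg.p == 8, "invalid P divider");
    vco_out / cfg.p
}

// Evaluated entirely by the compiler: 8 MHz / 8 * 336 / 2.
const PLL: PllConfig = PllConfig { src_hz: 8_000_000, m: 8, n: 336, p: 2 };
const SYSCLK_HZ: u32 = pll_output_hz(PLL);

fn main() {
    assert_eq!(SYSCLK_HZ, 168_000_000);
    println!("sysclk = {} Hz", SYSCLK_HZ);
}
```

The trade-off is flexibility: each clock profile that must be selectable at runtime would need its own set of constants, which is where the "multiple profiles" suggestion comes in.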