[RFC] DMA and Data cache coherency on arm M7 devices #36471

@FRASTM

Introduction

On STM32 MCUs based on the Arm Cortex-M7 core, when the L1 cache is enabled, the system needs to "ensure data coherency between the core and the main memory when dealing with shared data", especially with DMA transfers.

This is detailed in [AN4839: Level 1 cache on STM32F7 Series and STM32H7 Series](https://www.st.com/content/ccc/resource/technical/document/application_note/group0/08/dd/25/9c/4d/83/43/12/DM00272913/files/DM00272913.pdf/jcr:content/translations/en.DM00272913.pdf)
and in similar documents from other manufacturers.

Problem description

When using the DMA for peripheral transfers such as SPI or UART, or for other clients, on M7 devices (typically stm32f7xx or stm32h7xx), the following is recommended:

  • If the software uses cacheable memory regions for the DMA source and/or destination buffers, it
    must trigger a cache clean before starting a DMA operation to ensure that all the data is committed to the
    subsystem memory. After the DMA transfer completes, when reading the data from the peripheral, the
    software must perform a cache invalidate before reading the DMA-updated memory region.

  • It is always better to use non-cacheable regions for DMA buffers. The software can use the MPU to set up a
    non-cacheable memory block to use as shared memory between the CPU and the DMA.

  • Do not enable cache for the memory that is being used extensively for a DMA operation.
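
The ordering in the first recommendation can be sketched as below. This is only an illustration that builds on a host: `cache_clean_range()`, `cache_invalidate_range()` and `dma_run_blocking()` are hypothetical stand-ins, not existing driver functions. On a Cortex-M7 target the two cache hooks could wrap the CMSIS calls `SCB_CleanDCache_by_Addr()` and `SCB_InvalidateDCache_by_Addr()`.

```c
#include <stddef.h>

/* Trace of the steps taken, so the ordering can be inspected on a host. */
static char trace[16];
static size_t trace_len;

static void trace_step(char step)
{
    if (trace_len < sizeof(trace) - 1u) {
        trace[trace_len++] = step;
    }
}

/* Hypothetical hook: on target this could wrap SCB_CleanDCache_by_Addr(). */
static void cache_clean_range(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
    trace_step('C');
}

/* Hypothetical hook: on target this could wrap SCB_InvalidateDCache_by_Addr(). */
static void cache_invalidate_range(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    trace_step('I');
}

/* Hypothetical placeholder for "start the DMA and wait for completion". */
static void dma_run_blocking(void)
{
    trace_step('D');
}

/* Tx: commit CPU writes to memory, then let the DMA read the buffer. */
void dma_tx(const void *buf, size_t len)
{
    cache_clean_range(buf, len);
    dma_run_blocking();
}

/* Rx: let the DMA write the buffer, then drop stale cache lines. */
void dma_rx(void *buf, size_t len)
{
    dma_run_blocking();
    cache_invalidate_range(buf, len);
}
```

Calling `dma_tx()` followed by `dma_rx()` leaves the trace `C`, `D`, `D`, `I`: clean before the Tx transfer, invalidate after the Rx transfer.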

Proposed change

  • item 1: control the cache coherency on DMA buffers

DMA buffers are aligned on the cache line size.
User-application buffers are copied (memcpy) to/from those DMA buffers, with padding and alignment.

On each DMA transfer, Rx or Tx, cache maintenance is required: a clean (flush) before Tx and an invalidate after Rx.
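
A minimal sketch of such an aligned and padded DMA buffer follows, assuming the 32-byte D-cache line of the Cortex-M7. The names `bounce_tx_buf`, `bounce_tx_stage()` and `dcache_round_up()` and the 128-byte capacity are made up for illustration. GCC/Clang attribute syntax is assumed for the alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DCACHE_LINE_SIZE 32   /* Cortex-M7 L1 D-cache line size, in bytes */
#define BOUNCE_BUF_SIZE  128  /* hypothetical capacity, a multiple of the line size */

/* DMA buffer that starts on a cache line and spans whole lines only. */
static uint8_t bounce_tx_buf[BOUNCE_BUF_SIZE]
    __attribute__((aligned(DCACHE_LINE_SIZE)));

/* Round a length up to the next multiple of the cache line size. */
static size_t dcache_round_up(size_t len)
{
    size_t mask = (size_t)DCACHE_LINE_SIZE - 1u;

    return (len + mask) & ~mask;
}

/*
 * Copy the user data into the DMA buffer and zero the padding.
 * Returns the padded length (the range to clean before Tx),
 * or 0 if the request is empty or does not fit.
 */
size_t bounce_tx_stage(const void *user, size_t len)
{
    size_t padded;

    if (len == 0u || len > sizeof(bounce_tx_buf)) {
        return 0u;
    }

    padded = dcache_round_up(len);
    memcpy(bounce_tx_buf, user, len);
    memset(bounce_tx_buf + len, 0, padded - len);

    return padded;
}
```

Because the buffer owns every cache line it touches, a clean or invalidate on the padded range cannot affect unrelated variables that would otherwise share a line with the user data.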

1.1 Each client of the DMA controls the cache coherency of its own DMA operations (Tx, Rx)
--> this has to be handled by each driver that uses DMA transfers (typically spi, i2s, uart, etc.),
with similar code adapted to each one.
Example for the spi client in the PR "[RFC] Initial support for cache handling when doing SPI/DMA on STM32F7" #27911

1.2 The DMA driver controls the cache coherency of its own DMA operations (Tx, Rx)
--> not validated yet, but the PR above could help

  • item 2: it is always better to use non-cacheable regions for DMA buffers

These buffers must have RW access.
These buffers are allocated in the NoCache RAM area.
These buffers might be for DMA use only, or shared with the DMA client as well.

2.1 We map a special NONCACHE memory area for the DMA buffers (Tx and Rx)
Some memcpy operations are required to exchange data between the buffers initially allocated by the client
and that special non-cached DMA buffer.
--> We could see a significant overhead due to memcpy, especially when the buffers are small and frequently transferred
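
Option 2.1 could look roughly like the sketch below. It assumes that Zephyr's `__nocache` attribute is available when `CONFIG_NOCACHE_MEMORY` is set and the Zephyr headers are included; on a host build the hypothetical `DMA_NOCACHE` macro expands to nothing and the buffers live in ordinary RAM. The buffer names, the 64-byte size and the two helper functions are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: __nocache is provided by the Zephyr headers on target. */
#if defined(CONFIG_NOCACHE_MEMORY) && defined(__nocache)
#define DMA_NOCACHE __nocache
#else
#define DMA_NOCACHE /* host build: ordinary RAM */
#endif

#define NOCACHE_BUF_SIZE 64u /* hypothetical size */

/* DMA-only buffers; no cache maintenance is needed on these. */
static DMA_NOCACHE uint8_t nocache_tx_buf[NOCACHE_BUF_SIZE];
static DMA_NOCACHE uint8_t nocache_rx_buf[NOCACHE_BUF_SIZE];

/* Copy client data into the non-cached Tx buffer before starting the DMA. */
int nocache_tx_stage(const void *user, size_t len)
{
    if (len > sizeof(nocache_tx_buf)) {
        return -1;
    }
    memcpy(nocache_tx_buf, user, len);
    return 0;
}

/* Copy received data out of the non-cached Rx buffer after the DMA completes. */
int nocache_rx_fetch(void *user, size_t len)
{
    if (len > sizeof(nocache_rx_buf)) {
        return -1;
    }
    memcpy(user, nocache_rx_buf, len);
    return 0;
}
```

Every transfer pays one extra `memcpy` in this scheme, which is the overhead noted above for small, frequent buffers.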

2.2 We map a special NONCACHE memory area for the DMA-client buffers (Tx and Rx)
These buffers are allocated in the user area and used for the exchange between the client and the DMA.
The user application must allocate its buffers in the user NoCache RAM area.
PR "run test on stm32F7 with CONFIG_UART_ASYNC_API and DMA" #32833
--> requiring user buffers to be mapped in this NoCache area is a significant constraint; otherwise memcpy operations are also needed.
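
With option 2.2, a driver could check whether a user buffer really lies in the NoCache area and fall back to a copy otherwise. The helper below is a hypothetical sketch of such a check. The region bounds are passed in as parameters and are assumed to come from the linker script on a real target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if [buf, buf + len) lies entirely inside [start, end).
 * start/end are the (assumed) bounds of the NoCache RAM area.
 */
static bool buf_in_region(const void *buf, size_t len,
                          uintptr_t start, uintptr_t end)
{
    uintptr_t first = (uintptr_t)buf;

    if (len == 0u || first < start || first >= end) {
        return false;
    }

    /* Compare against the remaining room to avoid address overflow. */
    return len <= (size_t)(end - first);
}
```

A driver could pass the user buffer straight to the DMA when this returns true, and copy through a non-cached buffer when it returns false.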

  • item 3: "Do not enable cache"

We simply disable the data cache when DMA is enabled on stm32 M7-based MCUs
3.1 statically, based on the DMA node in the devicetree (DT)
Example in the PR "soc: stm32f7 remove cache memory with dma transfer" #35165

This is the simplest solution to avoid data cache coherency problems, and it is functional. In soc/arm/st_stm32/stm32f7/soc.c or soc/arm/st_stm32/stm32h7/soc.c, do not enable the D-cache when DMA is used or when CONFIG_NOCACHE_MEMORY is explicitly defined:

+#ifndef CONFIG_NOCACHE_MEMORY
 	if (!(SCB->CCR & SCB_CCR_DC_Msk)) {
 		SCB_EnableDCache();
 	}
+#endif

3.2 dynamically: disable the cache when starting/initialising the DMA device
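
The dynamic variant might be sketched as follows. Everything here is a host-side stand-in: `dcache_enabled` models the D-cache enable bit (`SCB->CCR & SCB_CCR_DC_Msk` on target), `dcache_disable()` is a hypothetical wrapper that could call the CMSIS function `SCB_DisableDCache()`, and `dma_dev_init()` is an illustrative name for the DMA driver's init routine.

```c
#include <stdbool.h>

/* Stand-in for the D-cache enable bit of the core. */
static bool dcache_enabled = true;

/* Hypothetical wrapper: on target this could call SCB_DisableDCache(). */
static void dcache_disable(void)
{
    dcache_enabled = false;
}

/* Hypothetical DMA init routine: turn the D-cache off before any transfer. */
int dma_dev_init(void)
{
    if (dcache_enabled) {
        dcache_disable();
    }

    /* ... the rest of the DMA controller setup would follow here ... */
    return 0;
}
```

The check makes the routine safe to call more than once, for example when several DMA instances are initialised.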

Concerns and Unresolved Questions

  • performance impact of the memcpy operations
  • system impact of disabling the cache on stm32f7 or h7 MCUs as soon as the DMA is enabled
  • constraint on user applications to allocate their buffers in the non-cached RAM area

Labels: RFC, area: API, area: DMA, area: Userspace
