DMA — Direct Memory Access

Imagine you are a chef in a busy kitchen. You need ingredients from the pantry. You could walk to the pantry yourself, grab each item one at a time, and walk back — but then you are not cooking. Now imagine you have a sous-chef whose only job is to fetch ingredients and put them on your counter. You just say "get me flour, eggs, and butter" and keep cooking. That sous-chef is DMA.

Why DMA Matters

Without DMA, the CPU handles every single byte of every transfer. When you receive 100 bytes over UART at 115200 baud, the CPU must:

  1. Get interrupted for each byte
  2. Read the byte from the peripheral register
  3. Store it in a buffer in memory
  4. Go back to whatever it was doing
  5. Repeat 99 more times

That is 100 interrupts in about 9 milliseconds. The CPU can do it, but it is wasting time on grunt work when it could be computing sensor fusion or running a PID loop.
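The arithmetic above is easy to sanity-check. This small sketch computes the wire time for the transfer and the interrupt overhead; the 10-bits-per-byte frame (start + 8 data + stop) matches the example, while the 50-cycles-per-interrupt figure is an illustrative assumption, not a measured number:

```rust
// Wire time for a UART transfer: 10 bits on the wire per byte
// (start + 8 data + stop) at the given baud rate.
fn transfer_time_ms(bytes: u32, baud: u32) -> f64 {
    (bytes as f64 * 10.0) / baud as f64 * 1000.0
}

// Total CPU cycles burned on per-byte interrupts.
// cycles_per_irq (~50 here) is an assumed round-trip cost.
fn interrupt_cycles(bytes: u32, cycles_per_irq: u32) -> u32 {
    bytes * cycles_per_irq
}

fn main() {
    let ms = transfer_time_ms(100, 115_200);
    println!("100 bytes at 115200 baud: ~{ms:.1} ms"); // ~8.7 ms
    println!("interrupt overhead: {} cycles", interrupt_cycles(100, 50));
}
```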

With DMA, the conversation goes like this:

  1. CPU tells the DMA controller: "Copy bytes from UART data register to this buffer. Let me know when you have 100 bytes."
  2. CPU goes back to computing.
  3. DMA silently shuttles bytes from the peripheral to memory, one at a time, without involving the CPU at all.
  4. When the transfer is complete, DMA fires a single interrupt: "All done."

The CPU did zero work during the transfer. It was free to run your control loop, process sensor data, or even sleep.

💡 Fun Fact: DMA is not unique to microcontrollers. Your PC's GPU uses DMA to read textures from RAM. Your SSD controller uses DMA to transfer data. Network cards use DMA. Any time a peripheral needs to move large amounts of data, DMA is the answer. The concept dates back to the mainframe era of the 1950s.

How DMA Works (Conceptually)

A DMA controller is a simple hardware unit that can copy data between two locations — typically between a peripheral's data register and a memory buffer. You configure it with:

| Parameter | Description |
|---|---|
| Source | Where to read from (e.g., UART data register) |
| Destination | Where to write to (e.g., a buffer in RAM) |
| Transfer count | How many bytes (or half-words, or words) to move |
| Direction | Peripheral-to-memory, memory-to-peripheral, or memory-to-memory |
| Increment | Whether to increment the source/destination address after each transfer |

For a UART receive, you would configure: source = UART data register (fixed address, no increment), destination = your buffer (incrementing address), count = buffer size.

The DMA controller then watches the peripheral. Every time the UART receives a byte, DMA grabs it and puts it in the next slot in your buffer. No CPU involvement at all.
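To make the roles of these parameters concrete, here is a toy model of the UART-receive configuration just described. It is a host-side simulation, not a real driver: the "data register" is a fixed source, the buffer address increments, and `count` bounds the transfer:

```rust
// Toy DMA configuration mirroring the table above. Names are
// illustrative, not an actual register layout.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Direction { PeriphToMem, MemToPeriph, MemToMem }

#[allow(dead_code)]
struct DmaConfig {
    direction: Direction,
    count: usize,        // how many transfers to perform
    src_increment: bool, // false: fixed peripheral data register
    dst_increment: bool, // true: walk through the RAM buffer
}

// Simulate a peripheral-to-memory transfer: each "received byte"
// appears at the fixed source, and DMA stores it at the next slot.
fn run_transfer(cfg: &DmaConfig, rx_stream: &[u8], buf: &mut [u8]) {
    let mut dst = 0;
    for i in 0..cfg.count {
        buf[dst] = rx_stream[i];           // read the fixed data register
        if cfg.dst_increment { dst += 1; } // advance through the buffer
    }
}

fn main() {
    let cfg = DmaConfig {
        direction: Direction::PeriphToMem,
        count: 4,
        src_increment: false,
        dst_increment: true,
    };
    let mut buf = [0u8; 4];
    run_transfer(&cfg, b"ping", &mut buf);
    assert_eq!(&buf, b"ping");
}
```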

Embassy Makes DMA Easy

Here is the beautiful thing about Embassy: you have already been using DMA. When you passed DMA channels to Uart::new or Spi::new in the previous chapters, you were enabling DMA transfers. The .await on a read or write operation sets up a DMA transfer, suspends your task, and wakes it up when the transfer completes.

// This UART transfer uses DMA — the CPU is free while data arrives
let mut buf = [0u8; 256];
let n = uart.read_until_idle(&mut buf).await.unwrap();
// ^ CPU was doing other things during this entire transfer

Compare with a blocking (non-DMA) approach:

// This would block the CPU for the entire transfer
let mut buf = [0u8; 256];
for i in 0..256 {
    // CPU busy-waits for EACH byte
    uart.blocking_read(&mut buf[i..i + 1]).unwrap();
}

The same applies to SPI:

// DMA transfer — CPU is free
let tx = [0u8; 1024];
let mut rx = [0u8; 1024];
spi.transfer(&mut rx, &tx).await.unwrap();

// Without DMA, the CPU would shuttle 1024 bytes one at a time.
// At 10 MHz SPI, that is only ~800 microseconds,
// but those are microseconds your control loop might need.

Passing DMA Channels

In Embassy, you assign DMA channels when creating a peripheral. Each peripheral transfer (TX and RX) needs its own DMA channel:

// UART with DMA
let uart = Uart::new(
    p.USART2,
    p.PA3, p.PA2,
    Irqs,
    p.DMA1_CH6,   // TX DMA channel
    p.DMA1_CH5,   // RX DMA channel
    config,
).unwrap();

// SPI with DMA
let spi = Spi::new(
    p.SPI1,
    p.PA5, p.PA7, p.PA6,
    p.DMA2_CH3,   // TX DMA channel
    p.DMA2_CH2,   // RX DMA channel
    spi_config,
);

// I2C with DMA (a separate example — a single program cannot hand
// DMA1_CH6 to both the UART and the I2C at the same time)
let i2c = I2c::new(
    p.I2C1,
    p.PB6, p.PB7,
    Irqs,
    p.DMA1_CH6,   // TX DMA channel
    p.DMA1_CH0,   // RX DMA channel
    Hertz(400_000),
    Default::default(),
);

🧠 Think About It: If DMA is strictly better, why does Embassy even offer blocking_read and blocking_write? Think about single-byte transfers. Setting up a DMA transfer has overhead (configuring registers, handling the interrupt). For a 1-byte transfer, the DMA setup time can exceed the time it takes the CPU to just do the transfer itself.
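That trade-off can be put into numbers. The sketch below uses purely illustrative assumptions (say ~200 cycles for DMA setup plus the completion interrupt, and ~20 cycles per byte for a polled CPU copy) to find the break-even transfer size:

```rust
// Break-even model for DMA vs. blocking transfers.
// Both cycle counts are assumed figures for illustration, not
// measurements from any particular chip.
fn dma_cycles(setup_cycles: u32) -> u32 {
    setup_cycles // roughly flat regardless of transfer size
}

fn cpu_cycles(bytes: u32, per_byte: u32) -> u32 {
    bytes * per_byte // grows linearly with transfer size
}

fn dma_wins(bytes: u32) -> bool {
    dma_cycles(200) < cpu_cycles(bytes, 20)
}

fn main() {
    assert!(!dma_wins(1));  // 200 > 20: blocking wins for one byte
    assert!(!dma_wins(10)); // break-even around here
    assert!(dma_wins(64));  // 200 < 1280: DMA wins for a burst
}
```

Under these assumptions the crossover sits around ten bytes, which is why the table later in this chapter marks small sensor reads as not worth DMA.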

The H743 Memory Trap

This section is critical if you are using an STM32H7 or STM32F7. If you are on an F1, F4, or G0, it does not apply to your chip — but read it anyway, because you will encounter H7 boards eventually.

The STM32H743 has a complicated memory map. The Cortex-M7 core has tightly-coupled memory (TCM) for maximum performance:

| Memory Region | Base Address | Size (H743) | CPU Access | DMA Access |
|---|---|---|---|---|
| DTCM | 0x2000_0000 | 128 KB | Fast (0 wait) | NO |
| ITCM | 0x0000_0000 | 64 KB | Fast (0 wait) | NO |
| AXI SRAM | 0x2400_0000 | 512 KB | Normal | Yes |
| SRAM1 | 0x3000_0000 | 128 KB | Normal | Yes |
| SRAM2 | 0x3002_0000 | 128 KB | Normal | Yes |
| SRAM3 | 0x3004_0000 | 32 KB | Normal | Yes |
| SRAM4 | 0x3800_0000 | 64 KB | Normal | Yes |

Here is the trap: by default, the stack lives in DTCM. And DMA cannot access DTCM. So if you declare a buffer on the stack and pass it to a DMA transfer, the DMA controller literally cannot read or write that memory. The transfer silently fails or produces corrupted data. No error. No panic. Just wrong data.

// THIS WILL SILENTLY FAIL ON H743
async fn broken_read(uart: &mut Uart<'_, Async>) {
    let mut buf = [0u8; 64]; // <-- This is on the stack, which is in DTCM
    uart.read_until_idle(&mut buf).await.unwrap(); // DMA cannot write here!
}
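One way to catch this class of bug early is a debug-time address check. This is a diagnostic sketch, not something Embassy provides: it takes a buffer address and decides whether it falls in a region the H743's DMA controllers can reach, using the ranges from the table above:

```rust
// Debug aid: does this address live in DMA-accessible memory on an
// STM32H743? Ranges follow the memory map table above.
fn dma_accessible(addr: u32) -> bool {
    match addr {
        0x2400_0000..=0x2407_FFFF => true, // AXI SRAM (512 KB)
        0x3000_0000..=0x3004_7FFF => true, // SRAM1 + SRAM2 + SRAM3
        0x3800_0000..=0x3800_FFFF => true, // SRAM4 (64 KB)
        _ => false,                        // DTCM, ITCM, flash, ...
    }
}

fn main() {
    // A stack buffer on the H743 typically lives in DTCM at 0x2000_xxxx:
    assert!(!dma_accessible(0x2000_1000)); // the trap from the example above
    assert!(dma_accessible(0x2400_0000));  // AXI SRAM is fine
    assert!(dma_accessible(0x3000_0100));  // SRAM1 is fine
}
```

On target you would feed it `buf.as_ptr() as u32` in a `debug_assert!` before starting a transfer.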

The Fix: Place DMA Buffers in Accessible SRAM

Use the #[link_section] attribute to place your buffer in a SRAM region that DMA can access:

#[link_section = ".sram1"]
static mut DMA_BUF: [u8; 256] = [0u8; 256]; // note: static mut needs unsafe to access

// Or for a more Rust-idiomatic approach with Embassy:
#[link_section = ".axisram"]
static DMA_BUF: embassy_sync::mutex::Mutex<
    embassy_sync::blocking_mutex::raw::CriticalSectionRawMutex,
    [u8; 256],
> = embassy_sync::mutex::Mutex::new([0u8; 256]);

You also need to define these memory regions in your linker script (memory.x), plus a SECTIONS entry that actually places the named output section (the MEMORY block alone only declares the regions):

MEMORY
{
    FLASH  : ORIGIN = 0x08000000, LENGTH = 2M
    DTCM   : ORIGIN = 0x20000000, LENGTH = 128K
    RAM    : ORIGIN = 0x24000000, LENGTH = 512K   /* AXI SRAM */
    SRAM1  : ORIGIN = 0x30000000, LENGTH = 128K
    SRAM2  : ORIGIN = 0x30020000, LENGTH = 128K
}

SECTIONS
{
    .sram1 (NOLOAD) : ALIGN(4)
    {
        *(.sram1 .sram1.*);
    } > SRAM1
} INSERT AFTER .bss;

Warning: This is the single most common source of mysterious bugs on the STM32H7. If your DMA transfers return all zeros, random data, or seem to "work sometimes," check your buffer placement first.

Cache Coherency (H7/F7 Only)

The Cortex-M7 core in the H7 and F7 has a data cache. This means the CPU does not always read directly from RAM — it reads from a fast local copy. This creates a problem with DMA:

  1. CPU writes data to a buffer (goes into the cache, not necessarily to RAM)
  2. DMA reads that buffer from RAM (sees stale data — the CPU's writes are still in the cache)
  3. DMA sends garbage to the peripheral

Or the reverse:

  1. DMA writes received data to a buffer in RAM
  2. CPU reads the buffer (gets stale data from the cache, not the fresh DMA data)

The simplest fix is to place DMA buffers in a non-cacheable memory region. SRAM1 through SRAM4 on the H743 can be configured as non-cacheable using the MPU (Memory Protection Unit).

In your Embassy H7 configuration, you typically set up the MPU at startup to mark the SRAM regions used for DMA buffers as non-cacheable. Embassy's H7 examples include this setup — follow them closely.

Alternatively, you can manually invalidate/clean the cache before and after DMA transfers, but this is error-prone. Using non-cacheable regions is the safer approach.
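Part of why manual cache maintenance is error-prone: the Cortex-M7 cleans and invalidates whole 32-byte cache lines, so a buffer that does not start and end on a line boundary shares lines with neighboring data, and that data gets cleaned or invalidated too. This sketch rounds a buffer span out to line boundaries, the granularity the cache ops actually operate on:

```rust
// Cortex-M7 D-cache line size in bytes.
const LINE: u32 = 32;

// Round an address down / a span end up to a cache-line boundary.
fn align_down(addr: u32) -> u32 {
    addr & !(LINE - 1)
}

fn align_up(addr: u32) -> u32 {
    (addr + LINE - 1) & !(LINE - 1)
}

fn main() {
    // A 100-byte buffer at 0x2400_0010 actually spans the cache lines
    // from 0x2400_0000 to 0x2400_0080 — including bytes it doesn't own.
    let (start, len) = (0x2400_0010u32, 100u32);
    assert_eq!(align_down(start), 0x2400_0000);
    assert_eq!(align_up(start + len), 0x2400_0080);
}
```

This is why DMA buffers that do share cache lines with unrelated variables can corrupt those variables during an invalidate; aligning buffers to 32 bytes (and padding their size to a multiple of 32) avoids the overlap.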

💡 Fun Fact: Cache coherency is the same problem that plagues multi-core CPUs in desktop computers. The H7's M7 core + DMA controller is essentially a tiny multi-core system. Enterprise CPUs solve this with hardware cache coherency protocols (like MESI). The Cortex-M7 does not have this, so you have to manage it yourself.

When DMA Matters Most

Not every transfer needs DMA. Here is a rough guide:

| Scenario | DMA Worth It? | Why |
|---|---|---|
| UART debug prints | Maybe | Frees the CPU, but transfers are small |
| UART GPS stream (continuous) | Yes | Constant data flow while the CPU computes |
| SPI sensor read (10 bytes) | No | Setup overhead exceeds transfer time |
| SPI display update (64 KB) | Yes | Huge transfer; the CPU would be blocked for ages |
| ADC continuous sampling | Yes | Endless sample stream, often thousands of samples per second |
| I2C sensor read (6 bytes) | No | Tiny transfer, barely worth the setup |

The general rule: use DMA when the transfer is large or continuous, and the CPU has better things to do.

Summary

DMA is your CPU's best friend. It offloads the boring work of moving bytes from point A to point B, leaving the CPU free to do actual computation. In Embassy, you get DMA almost for free — just pass DMA channels to your peripheral constructors, and every .await on a transfer uses DMA automatically.

The one gotcha to remember: on H7 and F7, DMA cannot access DTCM memory (where the stack lives by default), and the data cache can cause coherency issues. Place DMA buffers in SRAM1 or another DMA-accessible, non-cacheable region.

Next up: the watchdog timer — your firmware's safety net for when things go wrong.