STM32 bare-metal start-up and real bit banging speed

Update 2015.05.06: I ran the same tests again and got better results. The reset vector was missing the following attribute, which is needed before compiler optimizations can be enabled:

__attribute__ ((used))

I also added more information about the compiler version [1], the compiler flags [2] used for the tests, and the result according to optimization level -O.

My development board features an ARM Cortex-M3 microcontroller from ST (STM32F103VET6) with a maximum clock of 72MHz.

There are at least two ways to bit-bang the microcontroller's GPIO pins: using the CPU core directly, or using an SRAM-to-GPIO DMA transfer.

STM32F1 bare metal start-up (no compiler libraries added)

The ARM Cortex-M3 processor architecture allows the start-up code to be written entirely in C. This start-up code is meant to run on bare metal, in other words without any support from compiler libraries or additional code. When compiling with gcc, the flag -nostartfiles can be given to disable the default gcc start-up files [3].
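As a rough illustration of what such a C-only start-up has to do before calling main, the sketch below copies the initial values of .data from flash to SRAM and clears .bss. It is written to run on a PC: the arrays fake_flash_data, fake_sram_data and fake_bss are hypothetical stand-ins for the regions that a linker script would normally describe, and the function names are not those of the listing at the end of this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the initial values of .data from their load address (flash)
 * to their run address (SRAM). Both bounds are byte pointers, so the
 * pointer difference is a length in bytes. */
void copy_data(uint8_t *sram_start, const uint8_t *sram_end,
               const uint8_t *flash_start)
{
        size_t length = (size_t)(sram_end - sram_start);

        for (size_t i = 0; i < length; i++)
                sram_start[i] = flash_start[i];
}

/* Clear .bss so that zero-initialised variables really start at zero. */
void clear_bss(uint8_t *bss_start, const uint8_t *bss_end)
{
        while (bss_start < bss_end)
                *bss_start++ = 0;
}

/* Hypothetical stand-ins for the memory regions of a real target. */
const uint8_t fake_flash_data[4] = {0xDE, 0xAD, 0xBE, 0xEF};
uint8_t fake_sram_data[4];
uint8_t fake_bss[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* What a reset handler would do before jumping to main(). */
void startup_init(void)
{
        copy_data(fake_sram_data, fake_sram_data + sizeof fake_sram_data,
                  fake_flash_data);
        clear_bss(fake_bss, fake_bss + sizeof fake_bss);
}
```

On the real target the bounds come from linker symbols. If those symbols are declared with a wider type than char, the pointers need a cast to a byte pointer before subtracting, otherwise the difference is a count of words rather than bytes.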

Note about the system clock setting: if one flash wait state is configured (e.g., default configuration), the system clock cannot exceed 48MHz (in the start-up code listed at the end of this post, the PLL multiplier then cannot be more than 6 = RCC_CFGR_PLLMULL6). For the CPU to run at 72MHz, the flash latency must be increased. See [RM0008] (rev. 14, p. 54):

These bits represent the ratio of the SYSCLK (system clock) period to the Flash access time.
        000 Zero wait state, if 0 < SYSCLK ≤ 24 MHz
        001 One wait state, if 24 MHz < SYSCLK ≤ 48 MHz
        010 Two wait states, if 48 MHz < SYSCLK ≤ 72 MHz
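The table above can be turned into a small helper. The following is a minimal sketch that runs on a PC; the function names are hypothetical, and the second function assumes that SYSCLK is simply the PLL input frequency times the PLL multiplier (8MHz external oscillator, no input prescaler, as in this post).

```c
#include <stdint.h>

/* Number of flash wait states (FLASH_ACR LATENCY value) required by the
 * RM0008 table for a given SYSCLK, or -1 if SYSCLK is out of range. */
int flash_wait_states(uint32_t sysclk_hz)
{
        if (sysclk_hz == 0 || sysclk_hz > 72000000u)
                return -1;
        if (sysclk_hz <= 24000000u)
                return 0;
        if (sysclk_hz <= 48000000u)
                return 1;
        return 2;
}

/* Largest PLL multiplier that keeps SYSCLK within the limit of the given
 * wait-state setting, assuming SYSCLK = pll_input_hz * multiplier.
 * Returns 0 for invalid arguments. */
uint32_t max_pll_multiplier(uint32_t pll_input_hz, int wait_states)
{
        static const uint32_t max_sysclk_hz[3] = {
                24000000u, 48000000u, 72000000u
        };

        if (wait_states < 0 || wait_states > 2 || pll_input_hz == 0)
                return 0;
        return max_sysclk_hz[wait_states] / pll_input_hz;
}
```

With an 8MHz PLL input, max_pll_multiplier(8000000, 1) gives 6 and max_pll_multiplier(8000000, 2) gives 9, which matches the limits mentioned above.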

STM32F1 bit banging speeds

The simplest way to copy a memory buffer over GPIO is to run a for loop.

        u8 buffer[8] = {0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00};

        while(1) {
                for(u32 i=0; i<8; i++)
                        GPIOD->ODR = buffer[i];
        }

At 48MHz and no settings for flash wait states, the CPU can achieve slightly more than 1MHz at -O0 (no optimization). There are gaps when the while loop restarts:

At 48MHz and no settings for flash wait states, the CPU achieves 1.8MHz at -O2. The compiler optimized the while loop, and the gaps disappeared:

If the flash latency is raised to 2 wait states, performance decreases as the CPU waits for the flash (here -O0):

If the flash latency is raised to 2 wait states, performance decreases as the CPU waits for the flash (here -O2):

At 72MHz, performance is slightly better than at a 48MHz CPU clock (here -O0):

At 72MHz, performance is slightly better than at a 48MHz CPU clock (here -O2):

Unrolling the loop avoids the costly branch instruction even with no optimization activated.

        u8 buffer[8] = {0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00};

        while(1) {
                GPIOD->ODR = buffer[0];
                GPIOD->ODR = buffer[1];
                GPIOD->ODR = buffer[2];
                GPIOD->ODR = buffer[3];
                GPIOD->ODR = buffer[4];
                GPIOD->ODR = buffer[5];
                GPIOD->ODR = buffer[6];
                GPIOD->ODR = buffer[7];
        }

At 72MHz, with the unrolled loop, the CPU can now achieve 3.6MHz. Notice again the time lost to the branch at the end of the unrolled loop:

At 72MHz, the same with -O2 optimization shows an impressive 18MHz. Notice that the first cycle is longer than the following ones:

Note: 18MHz is the maximum GPIO speed according to the datasheet [5] (rev. 16, p. 20): "I/Os on APB2 with up to 18 MHz toggling speed."
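To put these figures in perspective, they can be converted into CPU cycles per GPIO write. The sketch below assumes that the measured frequency is that of the square wave seen on one pin, so that one period takes two writes to ODR (0xFF, then 0x00). This is an assumption about how the measurements were taken, and the function name is hypothetical.

```c
#include <stdint.h>

/* Average number of CPU cycles spent per write to GPIOx->ODR, assuming
 * one period of the observed square wave takes two writes (one high,
 * one low). Integer division: the result is rounded down. */
uint32_t cycles_per_write(uint32_t cpu_hz, uint32_t signal_hz)
{
        if (signal_hz == 0)
                return 0;
        return cpu_hz / (2u * signal_hz);
}
```

Under that assumption, 18MHz at a 72MHz core clock works out to 2 cycles per write, and 3.6MHz to 10 cycles per write.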

With DMA-to-GPIO

Another solution for copying from SRAM to GPIO is to use the DMA (Direct Memory Access). The DMA will copy the data from A to B without using the CPU.

u8 buffer[8] = {0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00};

/* DMA setup */

/* channel 1: mem:8bit -> peri:8bit */
DMA1_Channel1->CNDTR = 8;
DMA1_Channel1->CMAR = (uint32_t)buffer;
DMA1_Channel1->CPAR = (uint32_t)&(GPIOD->ODR);
DMA1_Channel1->CCR = 0;


In this example, the DMA is slower than the unoptimized CPU loop, but this is caused by the while loop at the end: polling the DMA register generates accesses on the data bus, which the DMA also needs for its transfers to GPIO.

Replace the while loop with an empty infinite for loop to see the difference.

for(;;) {}

Now the DMA reaches pretty much the same performance (3.6MHz) as the CPU:

Note: all DMA tests were made with -O0. There is a weird result when optimization is activated at -O2:


To conclude this post, the STM32F1 can achieve 18MHz on GPIO in an unrolled loop with compiler optimization -O2 (by the way, -O3 does not give better results and requires including some compiler libraries in the final binary).

You can compare this result with my post about the STM32F4 and its much nicer 84MHz for a CPU core clock of 168MHz.

I expected much higher performance from the DMA-to-GPIO solution. Maybe performance could be significantly increased by using 32-bit DMA reads from SRAM together with the memory controller (FSMC)? I would be glad to know if I made a configuration mistake somewhere in the DMA registers...

DMA, GPIO, and soft UART (update from 2015.02.10)

By email, M. Ambridge asked me an interesting question. He is planning to create a Tx-only soft-UART, and wrote to ask whether the DMA-to-GPIO solution could be used for a fixed-baudrate output (e.g., 230K4, i.e. 230400 baud).

Above: STM32F1 system architecture [RM0008] (rev. 275, p. 48).

As seen in this post, the DMA and the CPU share the same bus to access the SRAM. I imagine that they would disturb each other when running concurrently, and the soft-UART would not have a fixed baudrate as required.

This is actually confirmed in the reference manual [RM0008] (rev. 15, p. 275):

The DMA controller performs direct memory transfer by sharing the system bus with the Cortex®-M3 core. The DMA request may stop the CPU access to the system bus for some bus cycles, when the CPU and DMA are targeting the same destination (memory or peripheral). The bus matrix implements round-robin scheduling, thus ensuring at least half of the system bus bandwidth (both to memory and peripheral) for the CPU.

How to set the clock?

According to the clock tree [RM0008] (rev. 15, p. 93), the CPU core and the DMA also share the same clock, therefore it is not possible to run the DMA controller at a different speed than the CPU itself.

My suggested solution is:

  1. set the CPU/DMA clock to the clock required for your targeted baudrate.
  2. start the DMA-to-GPIO transfer and let the CPU do nothing (sleep).
  3. interrupt on "end of DMA", then restore the CPU clock and resume processing.
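The buffer that the DMA would send in step 2 can be prepared in advance. The following is a minimal sketch that runs on a PC, assuming 8N1 framing (one start bit, eight data bits LSB first, one stop bit), one DMA transfer per bit period, and a TX line on a single pin of the port selected by tx_mask, with the other pins of the low byte unused. The function names are hypothetical, and the number of clock cycles per DMA transfer passed to required_clock_hz is an assumption that would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_FRAME 10u /* 1 start + 8 data + 1 stop (8N1) */

/* Expand bytes into one GPIO sample per bit period, 8N1, LSB first.
 * Each sample is the value to write to the port: tx_mask when the line
 * is high, 0 when it is low. Returns the number of samples written,
 * or 0 if the output buffer is too small. */
size_t uart_encode_8n1(const uint8_t *data, size_t length, uint8_t tx_mask,
                       uint8_t *samples, size_t capacity)
{
        size_t n = 0;

        if (length > capacity / BITS_PER_FRAME)
                return 0;

        for (size_t i = 0; i < length; i++) {
                samples[n++] = 0;                       /* start bit: low */
                for (unsigned bit = 0; bit < 8; bit++)  /* data, LSB first */
                        samples[n++] = ((data[i] >> bit) & 1u) ? tx_mask : 0;
                samples[n++] = tx_mask;                 /* stop bit: high */
        }
        return n;
}

/* Step 1 of the list: clock needed so that one DMA transfer lasts one
 * bit period, assuming a constant number of cycles per transfer. */
uint32_t required_clock_hz(uint32_t baudrate, uint32_t cycles_per_transfer)
{
        return baudrate * cycles_per_transfer;
}
```

For example, if a DMA transfer took a constant 10 clock cycles, 230400 baud would need a 2.304MHz clock. Whether the number of cycles per transfer is really constant is exactly what the bus sharing described above puts in question, hence the suggestion to keep the CPU asleep during the transfer.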


[1] GNU gcc --version
$ arm-none-eabi-g++ --version
arm-none-eabi-g++ (4.8.3-11ubuntu1+11) 4.8.3 20140913 (release)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
[2] Compilation command-line: arm-none-eabi-g++ -O2 -std=gnu++0x -fno-exceptions -mcpu=cortex-m3 -mthumb --specs=nosys.specs -nostdlib -T main.ld main.cpp -o main.elf
[3] GNU gcc, Link options
[4] GNU gcc, Optimization options
[5] ST, STM32F103x8, STM32F103xB datasheet
[RM0008] ST, STM32F101xx, STM32F102xx, STM32F103xx, STM32F105xx and STM32F107xx Reference Manual


/* register definitions */
#include "stm32f10x.h"

/* Cortex-M architecture allows plain C startup code */

/* Symbols from the linker file */
extern unsigned int __data_flash_start_addr, __data_flash_end_addr,
        __data_sram_start_addr, __data_sram_end_addr,
        __bss_start_addr, __bss_end_addr,
        __stack_end_addr;

/* Exception handlers prototypes */
void EmptyHandler(void);
void ResetHandler(void);
void NmiHandler(void);
void HardFaultHandler(void);
void MemManageHandler(void);
void BusFaultHandler(void);
void UsageFaultHandler(void);
void SvCallHandler(void);
void DebugMonitorHandler(void);

/* Exception and interrupt vector */
void (* const vector[])(void) __attribute__ ((section(".vector"))) __attribute__ ((used)) = {
        (void (*)())&__stack_end_addr,  /* 0x0000_0000  stack address   */
        ResetHandler,                   /* 0x0000_0004  Reset           */
        NmiHandler,                     /* 0x0000_0008  NMI             */
        HardFaultHandler,               /* 0x0000_000C  HardFault       */
        MemManageHandler,               /* 0x0000_0010  MemManage       */
        BusFaultHandler,                /* 0x0000_0014  BusFault        */
        UsageFaultHandler,              /* 0x0000_0018  UsageFault      */
        0x0,                            /* 0x0000_001C  Reserved        */
        0x0,                            /* 0x0000_0020  Reserved        */
        0x0,                            /* 0x0000_0024  Reserved        */
        0x0,                            /* 0x0000_0028  Reserved        */
        SvCallHandler,                  /* 0x0000_002C  SVcall          */
        DebugMonitorHandler,            /* 0x0000_0030  Debug Monitor   */
        0x0,                            /* 0x0000_0034  Reserved        */
        EmptyHandler,                   /* 0x0000_0038  PendSV          */
        EmptyHandler,                   /* 0x0000_003C  SysTick         */
};

/* stack */
char stack[4096] __attribute__ ((section (".stack"))) __attribute__ ((used)) = { 0 };

/* Minimal memory helpers, since no C library is linked */
inline void memcpy(void* dest, const void* src, u32 length) {
        char* dst8 = (char*)dest;
        const char* src8 = (const char*)src;

        while (length--) {
                *dst8++ = *src8++;
        }
}

inline void mempat(void* dest, u8 pattern, u32 length) {
        char* dst8 = (char*)dest;

        while (length--) {
                *dst8++ = pattern;
        }
}

__attribute__ ((noreturn)) void EmptyHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void NmiHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void HardFaultHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void MemManageHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void BusFaultHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void UsageFaultHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void SvCallHandler(void) {
        for(;;) {}
}

__attribute__ ((noreturn)) void DebugMonitorHandler(void) {
        for(;;) {}
}

int main(void);

__attribute__ ((noreturn)) void ResetHandler(void) {
        /* Copy .data to SRAM (length in bytes, hence the char* casts) */
        memcpy(&__data_sram_start_addr, &__data_flash_start_addr, (char*)&__data_sram_end_addr - (char*)&__data_sram_start_addr);
        /* Set .bss to zero */
        mempat(&__bss_start_addr, 0x00, (char*)&__bss_end_addr - (char*)&__bss_start_addr);

        /* jump to main */
        main();

        /* should never return from main */
        for(;;) {}
}

int main(void) {
        /* #1 configuration
         * CPU now running at 8MHz (HSI) */

        /* flash settings */
        /* Enable or disable the Prefetch Buffer */
        FLASH->ACR =
                  FLASH_ACR_PRFTBE      /* Prefetch buffer enabled */
                /* FLASH_ACR_HLFCYA */
                | 0b010; /* FLASH_ACR_LATENCY: 2 wait states */

        /* Configure system clock
         * External oscillator: 8MHz
         * Max PLL multiplier: x9
         * Max SYSCLK: 72MHz
         * Max AHB: SYSCLK = 72MHz
         * Max APB1: SYSCLK/2 = 36MHz
         * Max APB2: SYSCLK = 72MHz
         * Max ADC: SYSCLK/6 = 12MHz (max 14MHz) */
        RCC->CFGR =
                  RCC_CFGR_MCO_PLL              /* Output clock is PLL/2 */
                /* USBPRE */
                | RCC_CFGR_PLLMULL9             /* PLL multiplier is 9 */
                | RCC_CFGR_PLLXTPRE_HSE         /* oscillator prescaler is /1 */
                | RCC_CFGR_PLLSRC_HSE           /* PLL input is external oscillator */
                | RCC_CFGR_ADCPRE_DIV6          /* ADC prescaler is 6 */
                | RCC_CFGR_PPRE2_DIV1           /* APB2 prescaler is 1 */
                | RCC_CFGR_PPRE1_DIV2           /* APB1 prescaler is 2 */
                | RCC_CFGR_HPRE_DIV1;           /* AHB prescaler is 1 */
                /* SWS */
                /* SW */

        {
                const u32 rcc_cr_hserdy_msk = 0x00020000;
                const u32 rcc_cr_pllrdy_msk = 0x02000000;
                const u32 rcc_cfgr_sw_msk   = 0x00000003;

                /* Clock control register */
                RCC->CR = RCC_CR_HSEON;         /* Enable external oscillator */

                /* Wait for locked external oscillator */
                while((RCC->CR & rcc_cr_hserdy_msk) != RCC_CR_HSERDY);

                /* Clock control register */
                RCC->CR |=
                        /* PLLRDY */
                          RCC_CR_PLLON          /* Enable PLL */
                        /* CSSON */
                        /* HSEBYP */
                        /* HSERDY */
                        /* HSEON */
                        /* HSICAL */
                        /* HSITRIM */
                        /* HSIRDY */
                        /* HSION */
                        ;

                /* Wait for locked PLL */
                while((RCC->CR & rcc_cr_pllrdy_msk) != RCC_CR_PLLRDY);

                RCC->CFGR &= ~rcc_cfgr_sw_msk;  /* clear */
                RCC->CFGR |= RCC_CFGR_SW_PLL;   /* SYSCLK is PLL */

                /* Wait for SYSCLK to be PLL */
                while((RCC->CFGR & rcc_cfgr_sw_msk) != RCC_CFGR_SW_PLL);
        }

        /* GPIO is in APB2 peripherals */
        /* enable APB2 clock */
        RCC->APB2ENR =
                  RCC_APB2ENR_IOPAEN
                | RCC_APB2ENR_IOPBEN
                | RCC_APB2ENR_IOPCEN
                | RCC_APB2ENR_IOPDEN
                | RCC_APB2ENR_IOPEEN
                | RCC_APB2ENR_USART1EN;

        /* Set PD0..PD7 to output mode */
        /* CRL: configuration register low (0..7) */
        GPIOD->CRL &= 0x00000000;       /* clear */
        GPIOD->CRL |= 0x33333333;       /* set: push-pull output, 50MHz */

        /* #2 bit banging */
        {
                const u8 buffer[8] = {0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00};

                /*
                 * <bit banging code here>
                 */
        }
}