Tuesday, January 1, 2008

blinker2

Blinking an led. Part 2.

Full source for a working, although slow blinking, example is located here:
blinker2-20080106a.tar.gz
blinker2.tar.gz


In this post I am going to look at blinking the LED using C with gcc and binutils instead of assembler. This example is targeted at the Luminary Micro Stellaris LM3S811, LM3S1968, and LM3S6965 evaluation boards.

Just like Linux, high level language development is an exercise in self abuse. I am pretty sure, no, I am certain that most developers are not aware of what the compiler is doing behind the scenes. Whats even more scary is how many embedded developers fall into this category. I will take a tangent here in a bit to discuss the problem.

Repeating blinker1 using C is actually not that big of a leap. I am using the C compiler that I created in a prior post, which has no C library to get in the way. The compiler switches I am using are hopefully keeping the C library out of the way anyway.

With blinker1, on reset, we branched directly to the main program. When you add C with or without a C library you will want to branch to some pre-main code that does things like zero the .bss memory (I find it best to initialize variables at run time and not make assumptions). Then prep the argc and argv arguments and call main. So the first change I made was to at least put a pre main step. Its not a bad idea to call main and then fall into an infinite loop. The alternative is to branch to main and remember that you cannot return from the function.

So the startup code using my bootloader now looks like this (there is a link to the full source below which includes the normal-not-bootloader startup code).

/* novectors.s */
.cpu cortex-m3
.thumb
.global _start
_start:
bl main
b .
.end

I have tried all of the tricks over the years and the compilers simply cannot understand when they should not optimize writes to hardware registers. The only way to insure it is done right is to write a few core functions in assembler. I call them PUT32 and GET32. And since the compiler doesnt optimize that well (it optimizes when you dont want it to but not so much when you do) I have assembler functions for the read-modify-write functions PUTGETSET32 and PUTGETRESET32.

/* putget.s */
;@-----------------------
.cpu cortex-m3
.thumb
;@-----------------------
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
;@-----------------------
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
;@-----------------------
.thumb_func
.globl PUTGETRESET
PUTGETRESET:
ldr r2,[r0]
bic r2,r1
str r2,[r0]
bx lr
;@-----------------------
.thumb_func
.globl PUTGETSET
PUTGETSET:
ldr r2,[r0]
orr r2,r1
str r2,[r0]
bx lr
;@-----------------------
.end
;@-----------------------

Okay so Reset goes to _start, then _start calls main. Just like blinker1 main is going to setup the leds then go into an infinite loop with counters for delays between the setting and resetting of the led output pin. I have created defines for the hardware addresses and bits.

//------------------------------------------------------------------------
//------------------------------------------------------------------------
/* blinker2.c */

#include "lmistuff.h"

#define CHIP6965

#ifdef CHIP1968
#define LED_RCGC2 LED_1968_RCGC2
#define LED_BASE LED_1968_BASE
#define LED_PIN LED_1968_PIN
#endif

#ifdef CHIP6965
#define LED_RCGC2 LED_6965_RCGC2
#define LED_BASE LED_6965_BASE
#define LED_PIN LED_6965_PIN
#endif
//------------------------------------------------------------------------
unsigned int ra;
//------------------------------------------------------------------------
void PUT32 ( unsigned long, unsigned long);
unsigned long GET32 ( unsigned long );
void PUTGETSET ( unsigned long addr, unsigned long bit );
void PUTGETRESET ( unsigned long addr, unsigned long bit );
//------------------------------------------------------------------------

void SetLed ( void )
{
PUT32(LED_BASE+GPIO_DATA+((LED_PIN) << 2),LED_PIN);
}
//------------------------------------------------------------------------
void ResetLed ( void )
{
PUT32(LED_BASE+GPIO_DATA+((LED_PIN) << 2),0);
}
//------------------------------------------------------------------------
void SetupLeds ( void )
{
PUTGETSET (SYSCON_BASE+SYSCON_RCGC2,LED_RCGC2);
PUTGETRESET(LED_BASE+GPIO_AFSEL,LED_PIN);
PUTGETSET (LED_BASE+GPIO_DIR ,LED_PIN);
PUTGETRESET(LED_BASE+GPIO_ODR ,LED_PIN);
PUTGETSET (LED_BASE+GPIO_DEN ,LED_PIN);
}
//------------------------------------------------------------------------
int main ( void )
{
SetupLeds();
while(1)
{
for(ra=0;ra < 0x3E0000;ra++) ; SetLed();
for(ra=0;ra < 0x3E0000;ra++) ; ResetLed();
}
}
//------------------------------------------------------------------------
//------------------------------------------------------------------------

Lastly the Makefile


COPS = -Wall -O2 -mthumb -nostdlib -nostartfiles -ffreestanding

all : blinker2.bin blinker2.norm.bin

novectors.o : novectors.s
arm-thumb-elf-as novectors.s -o novectors.o

vectors.o : vectors.s
arm-thumb-elf-as vectors.s -o vectors.o

putget.o : putget.s
arm-thumb-elf-as putget.s -o putget.o

blinker2.o : blinker2.c lmistuff.h
arm-thumb-elf-gcc $(COPS) -c blinker2.c -o blinker2.o

blinker2.elf : novectors.o putget.o blinker2.o blmemmap
arm-thumb-elf-gcc $(COPS) novectors.o putget.o blinker2.o -T blmemmap -o blinker2.elf
arm-thumb-elf-objdump -D blinker2.elf > blinker2.list

blinker2.bin : blinker2.elf
arm-thumb-elf-objcopy blinker2.elf blinker2.bin -O binary

blinker2.norm.elf : vectors.o putget.o blinker2.o memmap
arm-thumb-elf-gcc -Wall $(COPS) vectors.o putget.o blinker2.o -T memmap -o blinker2.norm.elf
arm-thumb-elf-objdump -D blinker2.norm.elf > blinker2.norm.list

blinker2.norm.bin : blinker2.norm.elf
arm-thumb-elf-objcopy blinker2.norm.elf blinker2.norm.bin -O binary

clean:
rm *.bin
rm *.o
rm *.elf
rm *.list

And that should do it right? Well, no actually. It doesnt work. The led comes on and stays on. Why is that? Doesnt a computer do what we tell it to do, not what we want it to do? Actually that is probably still true. Here is the problem. I optimized, if you remove the -O2 or change it to -O0 then it will work. Another way to make it work is to put the word volatile in front of the unsigned int ra; definition. Why?

The optimizer looks at the code and realizes that the variable ra is counting sure, but nothing is using ra as an input. That counting is just a waste of time get rid of it. So the compiler has reduced our main loop to:

while(1)
{
ra=0x3E0000; SetLed();
ra=0x3E0000; ResetLed();
}

Using the -S option to compile to assembler (you can also see this if you compile and dissasseble):

ldr r5, .L17
.L14:
mov r4, #248
lsl r4, r4, #14
str r4, [r5]
bl SetLed
str r4, [r5]
bl ResetLed
b .L14
.L18:
.align 2
.L17:
.word ra

248 is 0xF8, when shifted 14 times it becomes 0x3E0000, as expected. At least the compiler is not completely cruel and has the heart to set ra to 0x3E0000 to replace the for loop. As humans we can see that even that could have been optimized out. Or they could have set ra one time outside the while loop. But ra is a global and this compiler apparently does not penetrate into the SetLed and ResetLed code to see that ra is not needed nor touched there. (I have seen other compilers do this which usually means you have to separate functions into separately compiled files so that the optimizer cannot penetrate into your code). The least painful solution is to mark ra as volatile:

volatile unsigned int ra;

This tells the compiler that whatever you do, any time ra changes you must write the change to the memory location used to hold ra. You will find that some compilers will optimize variables to registers and do the things you said for them to do but didnt feel the need to waste cycles writing intermediate values to memory. This is actually quite desireable in general except in cases where you are writing to a hardware register and rely on all of the writes happening. Compiling for debug, depending on the compiler, might mean to the compiler assume all variables are volatile (so that the debugger can watch memory locations). Here again, you would be surprised how many developers are unaware of what is going on behind the curtain and are surprised when code that has been working for months/years when compiled for debug stops working when it is compiled for production.

Before showing the output of adding the volatile, a quick note. I am using gcc 4.2.2 built as a cross compiler as described in a previous post. I also created a gcc 3.3.6 cross compiler and a gcc 3.4.6 cross compiler. Both of the 3.x.x compilers are not willing and/or able to optimize out the delay counter, and for those you would get a blinking led. It would be a long while between blinks but it would work. I cut a lot out but you can basically see it counting from zero to some number before calling SetLed.

mov r3, #0
.L16:
add r3, r3, #1
cmp r3, r4
bls .L16
str r3, [r1]
bl SetLed

Since the count of 0x3E0000 was determined for the blinker1 by assuming two instructions in the delay loop and trying to count to 8 million, having three instructions means it will take 50% longer than its blinker1 counterpart. Ahh, this also demonstrates that during this loop ra was not stored back to memory each time it changed, only at the end after the loop is the value ra saved to memory.

The Coded Sourcery G++ Lite compiler, 2007-q3, is 4.2.2 based and likewise optimizes out the for loop all together.

Okay, so we tell the compiler that ra is volatile

.L20:
ldr r3, [r4]
add r3, r3, #1
str r3, [r4]
ldr r3, .L25+4
ldr r2, [r4]
cmp r2, r3
bls .L20

Well, the computer did what we told it to do, every time it needs to read and write ra it reads it from memory then changes it and writes it back. And perhaps because of the limited number of registers or who knows why, the max count value is not stored in a register it has to be fetched every time (ldr r3,.L25+4). So now our timeout count that was determined based on two instructions per loop, now has 7 instructions in the loop, assuming all instructions are one clock cycle that means it should take 3.5 times longer to blink than blinker1.

Just for my own entertainment if we follow blinker1 literally and use this as our delay loop

for(ra=0x3E0000;ra;ra--) ; ResetLed();

We get

.L21:
ldr r3, [r4]
sub r3, r3, #1
str r3, [r4]
ldr r3, [r4]
cmp r3, #0
bne .L21

Since it is comparing with zero now instead of a terminal count we save one instruction which because it had a memory access can take longer in general unless you have zero-wait-state memory.

Again, just for fun, if I remove the volatile and use gcc 3.4.6, it gets very close to the optimal solution:

.L20:
sub r3, r3, #1
cmp r3, #0
bne .L20

The problem I have is that other compilers know that a sub is really a subs which means set the flags after you do the subtract. And branch if not equal means branch if not equal to zero, subs already set the z or zero flag so there is no need for the additional compare with zero, it has already been done.

So this trivial program has created lots of problems, and for this specific task you would have been closer to success on your first try with the older gcc 3.x.x compiler over the current 4.2.2.

If you ever have the opportunity to examine what ARM's own compiler does, it can be absolutely amazing. Now that Keil is ARM and I think it uses the rvct as a compiler the Keil eval may very well do amazing things. I will have to look at that some day.

Oh, the reason I wrote the PUTGETSET and PUTGETRESET functions in assembler is because the output of gcc had a few extra unnecessary operations.

Hmmm, very interesting note. blinker2 had a bug. For those that dont know this semi-documented feature of the ARM cores (Even the popular ARM7), is that when you are executing in thumb mode the lsbit of your PC (r15) is set, in arm mode the bit is clear. Likewise compilers have to produce the proper addresses with the bit set or clear. I had the vectors.s table wrong, I wasnt using it because I was running with the bootloader. Bootloader works because the reset vector pointed to the C function main which was compiled as thumb thus telling the linker to generate an address in the vector table with the lsbit set. When I changed the code so that the reset vector jumped over the vector table to a _start label then from there I bl main, I had not declared _start as a .thumb_func so despite producing thumb instructions it created an ARM address in the vector table. And the 811 board I just got would not execute. What disturbs me greatly about all of this is that this is supposed to be a thumb or thumb2 ONLY core from ARM. If you know its thumb only shouldnt you be ignoring anything to do with ARM mode? Wouldnt that be a safe assumption? I guess not.

20070106a, fixed the vectors.s bug and added LM3S811 support.