Lately I've been having fun with Clang's (and GCC's) -S option in order to examine the assembly output of small programs written in C. I've found this is a really great way to learn both what your compiler is doing with your code and how to write assembly code on your own. One of the more interesting things I've learned from doing this is just how function calls are made on UNIX-like systems (Linux, BSD, macOS, Solaris, etc.) running on x86-64 hardware. When using a high(er) level language like C it's really not something I ever think about; the compiler takes care of the details. But what exactly happens when you call a function?
The answer is actually fairly complicated and depends on several factors including the computer's architecture, the operating system, the language, the compiler, and the number and nature of the arguments being passed to the function. For simplicity's sake in this post I'm going to make a few assumptions: we're on an x86-64 machine, running macOS, using C, compiling with Clang or GCC, and only passing 64-bit integer arguments. Given all that, what we'll be looking at is the System V AMD64 ABI calling convention. Under this convention the first six integer arguments are passed via the rdi, rsi, rdx, rcx, r8, and r9 registers, in that order. After that, integer arguments are passed on the stack in reverse order, so the last argument is pushed first and the seventh ends up at the top of the stack. Floating point arguments are a little different and are beyond the scope of this post.
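As a rough illustration of that rule, the register assignment can be modelled as a simple table lookup. This is only a sketch of the convention for integer arguments; the helper name integer_arg_register is made up for this example and is not part of any compiler or ABI header.

```c
#include <stddef.h>

/* Registers that carry the first six integer arguments, in order,
 * under the System V AMD64 calling convention. */
static const char *const INTEGER_ARG_REGISTERS[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

#define INTEGER_ARG_REGISTER_COUNT \
    (sizeof INTEGER_ARG_REGISTERS / sizeof INTEGER_ARG_REGISTERS[0])

/* Hypothetical helper: returns the register that carries the integer
 * argument at the given zero-based position, or NULL when that
 * argument is passed on the stack instead. */
const char *integer_arg_register(size_t position)
{
    if (position < INTEGER_ARG_REGISTER_COUNT) {
        return INTEGER_ARG_REGISTERS[position];
    }
    return NULL;
}
```

Position 0 maps to rdi and position 5 to r9; anything from position 6 onward comes back as NULL, meaning it travels on the stack.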
Sometimes the best way to really get a feel for how something works in general is to look at several specific examples that differ only slightly. In the case of function calling conventions one important aspect that can easily be changed is the number of arguments being passed. The following scenarios present a simple and contrived C program and its corresponding assembly output. Each example calls a function with more arguments than the last.
It's important to note that the output presented here is being created without any optimizations. Compiler optimizations tend to produce much faster code at the expense of understandability. Oftentimes the resulting optimized assembly code bears little resemblance to the structure of the original C code. Of course, it's a good idea to try looking at the output produced with optimizations, but for the purposes of this post it would only confuse things. Because no optimization is being done, the compiler is more or less forced to make naive assumptions about the state of the call stack. This can result in verbose assembly output that appears to serve little or no purpose in these or similar contrived programs.
Each example program below has been compiled with both Clang (1000.11.45.5) and GCC (9.2.0) using the following flags: -O0 -S -fno-asynchronous-unwind-tables. I'll go over the Clang output and then the GCC output individually before comparing the two.
One Argument
When only one argument is passed to the function only the rdi register is used. It's about as straightforward as you can hope for. The program below takes an integer and passes it to a function called doubleNumber which uses a left shift to double the argument before returning the result.
Example 1: Original C Code
    #include <stdint.h>

    uint64_t doubleNumber(uint64_t a) {
        return a << 1;
    }

    int main(int argc, char* argv[]) {
        uint64_t a = 1;
        doubleNumber(a);
        return 0;
    }
Note that the result is not printed after being calculated as it turns out that printf, and variadic functions in general, use a slightly different calling convention that is beyond the scope of this post. Additionally, if you ran this code through a linter (such as Splint) it would warn you about the return value of doubleNumber being unused. This is intentional, but we'll see how the System V AMD64 ABI handles integer return values in a moment even if nothing in this post actually uses them.
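If you would like to observe the return value without involving printf, one option is to turn it into a status code. The sketch below is a variation on the example, not the program used in this post; checkDoubleNumber is a hypothetical helper name.

```c
#include <stdint.h>

/* Same function as in the example above. */
uint64_t doubleNumber(uint64_t a)
{
    return a << 1;
}

/* Hypothetical helper: calls doubleNumber and converts its result into
 * a status code, 0 when the result is the expected 2 and 1 otherwise. */
int checkDoubleNumber(void)
{
    uint64_t a = 1;
    uint64_t result = doubleNumber(a); /* the integer result comes back in rax */
    return result == 2 ? 0 : 1;
}
```

Returning checkDoubleNumber() from main would let you inspect the outcome from the shell through the program's exit status.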
Example 1: Assembly Output (Clang)
        .section __TEXT,__text,regular,pure_instructions
        .macosx_version_min 10, 13
        .globl _doubleNumber ## -- Begin function doubleNumber
        .p2align 4, 0x90
    _doubleNumber: ## @doubleNumber
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        movq %rdi, -8(%rbp)
        movq -8(%rbp), %rdi
        shlq $1, %rdi
        movq %rdi, %rax
        popq %rbp
        retq
        ## -- End function
        .globl _main ## -- Begin function main
        .p2align 4, 0x90
    _main: ## @main
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        subq $32, %rsp
        movl $0, -4(%rbp)
        movl %edi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq $1, -24(%rbp)
        movq -24(%rbp), %rdi
        callq _doubleNumber
        xorl %ecx, %ecx
        movq %rax, -32(%rbp) ## 8-byte Spill
        movl %ecx, %eax
        addq $32, %rsp
        popq %rbp
        retq
        ## -- End function
    .subsections_via_symbols
The key part of the above assembly output is the following three instructions:
    movq $1, -24(%rbp)
    movq -24(%rbp), %rdi
    callq _doubleNumber

The first instruction moves our immediate argument, 1, onto the stack (24 bytes below wherever rbp is pointing), presumably for safekeeping. The second moves the argument into rdi for the call to _doubleNumber. And the third actually calls the function. In this particular case storing the argument on the stack is not necessary, and in fact if you change the above to just be

    movq $1, %rdi
    callq _doubleNumber
it will work just fine. I'm fairly certain the compiler does this because without further analysis (the kind it does when making optimizations) it can't know for sure whether or not the original value will be needed later.
Let's take a closer look inside _doubleNumber:

    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rdi
    shlq $1, %rdi
    movq %rdi, %rax

We can see some similar business with the stack is going on before doing a logical left shift on rdi and moving rdi into rax as the function return value. Like in the main function, all that stack manipulation isn't really necessary.
Example 1: Assembly Output (GCC)
        .text
        .globl _doubleNumber
    _doubleNumber:
        pushq %rbp
        movq %rsp, %rbp
        movq %rdi, -8(%rbp)
        movq -8(%rbp), %rax
        addq %rax, %rax
        popq %rbp
        ret
        .globl _main
    _main:
        pushq %rbp
        movq %rsp, %rbp
        subq $32, %rsp
        movl %edi, -20(%rbp)
        movq %rsi, -32(%rbp)
        movq $1, -8(%rbp)
        movq -8(%rbp), %rax
        movq %rax, %rdi
        call _doubleNumber
        movl $0, %eax
        leave
        ret
        .ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
        .subsections_via_symbols
Inside the main function there are four instructions worth looking at:

    movq $1, -8(%rbp)
    movq -8(%rbp), %rax
    movq %rax, %rdi
    call _doubleNumber

This moves the immediate value 1 onto the stack (8 bytes below where rbp is pointing) just in case it is needed later, takes that value from the stack and moves it into rax, moves rax into rdi, and then finally calls _doubleNumber. All of the funny business with placing the argument on the stack and moving it into rax is unnecessary but should come as no surprise when considering the lack of optimization.
GCC's output for doubleNumber, however, is somewhat unexpected:

    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rax
    addq %rax, %rax

First, the argument in rdi is moved onto the stack and from there it's moved into the return value register rax. As we've seen before, using the stack isn't necessary here. The reason I say this output is unexpected is because -O0 was used, which should turn off most optimizations (at least that's how I understand it). Despite using a left shift in the C code we can see that GCC instead simply adds the contents of rax to itself. This is functionally equivalent and sets things up nicely for the function return since the final computed value is already in rax.
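To make the equivalence concrete, here are the two ways of doubling written side by side in C. This is just a sketch for comparison; the function names doubleWithShift and doubleWithAdd are made up for this example.

```c
#include <stdint.h>

/* Doubles a value with a left shift, as written in the C source. */
uint64_t doubleWithShift(uint64_t a)
{
    return a << 1;
}

/* Doubles a value by adding it to itself, as GCC's output does. */
uint64_t doubleWithAdd(uint64_t a)
{
    return a + a;
}
```

Because uint64_t arithmetic wraps around modulo 2^64, both functions agree for every input, including values with the top bit set where the doubled result no longer fits in 64 bits.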
Differences between Clang and GCC
In this first example the output of Clang and GCC is largely the same. There are some superficial differences in where they place values on the stack before calling _doubleNumber, but the real surprise is GCC's use of addq over shlq.
Three Arguments
In order to clearly demonstrate the usage of three arguments this program is slightly different from the first. Instead of only doubling a number it doubles three numbers and adds them together.
Example 2: Original C Code
    #include <stdint.h>

    uint64_t doubleAndThenSumNumbers(uint64_t a, uint64_t b, uint64_t c) {
        a = a << 1;
        b = b << 1;
        c = c << 1;
        return a + b + c;
    }

    int main(int argc, char* argv[]) {
        uint64_t a = 1, b = 2, c = 3;
        doubleAndThenSumNumbers(a, b, c);
        return 0;
    }
Example 2: Assembly Output (Clang)
        .section __TEXT,__text,regular,pure_instructions
        .macosx_version_min 10, 13
        .globl _doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
        .p2align 4, 0x90
    _doubleAndThenSumNumbers: ## @doubleAndThenSumNumbers
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        movq %rdi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq %rdx, -24(%rbp)
        movq -8(%rbp), %rdx
        shlq $1, %rdx
        movq %rdx, -8(%rbp)
        movq -16(%rbp), %rdx
        shlq $1, %rdx
        movq %rdx, -16(%rbp)
        movq -24(%rbp), %rdx
        shlq $1, %rdx
        movq %rdx, -24(%rbp)
        movq -8(%rbp), %rdx
        addq -16(%rbp), %rdx
        addq -24(%rbp), %rdx
        movq %rdx, %rax
        popq %rbp
        retq
        ## -- End function
        .globl _main ## -- Begin function main
        .p2align 4, 0x90
    _main: ## @main
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        subq $48, %rsp
        movl $0, -4(%rbp)
        movl %edi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq $1, -24(%rbp)
        movq $2, -32(%rbp)
        movq $3, -40(%rbp)
        movq -24(%rbp), %rdi
        movq -32(%rbp), %rsi
        movq -40(%rbp), %rdx
        callq _doubleAndThenSumNumbers
        xorl %ecx, %ecx
        movq %rax, -48(%rbp) ## 8-byte Spill
        movl %ecx, %eax
        addq $48, %rsp
        popq %rbp
        retq
        ## -- End function
    .subsections_via_symbols
The important part of the main function looks pretty familiar:
    movq $1, -24(%rbp)
    movq $2, -32(%rbp)
    movq $3, -40(%rbp)
    movq -24(%rbp), %rdi
    movq -32(%rbp), %rsi
    movq -40(%rbp), %rdx
    callq _doubleAndThenSumNumbers

Just like the first example, this program moves the immediate value 1 onto the stack for safekeeping before moving it into rdi. It also moves the values 2 and 3 onto the stack and then moves them into rsi and rdx respectively before calling _doubleAndThenSumNumbers, just like you'd expect based on the calling convention.
Let's take a look at the interesting parts of _doubleAndThenSumNumbers:

    movq %rdi, -8(%rbp)
    movq %rsi, -16(%rbp)
    movq %rdx, -24(%rbp)

The function starts off by moving all three arguments onto the stack.

    movq -8(%rbp), %rdx
    shlq $1, %rdx
    movq %rdx, -8(%rbp)
    movq -16(%rbp), %rdx
    shlq $1, %rdx
    movq %rdx, -16(%rbp)
    movq -24(%rbp), %rdx
    shlq $1, %rdx
    movq %rdx, -24(%rbp)

And then it goes through each argument, moving it into rdx, left shifting it, and then putting it back on the stack.

    movq -8(%rbp), %rdx
    addq -16(%rbp), %rdx
    addq -24(%rbp), %rdx
    movq %rdx, %rax

Finally the first doubled number is moved into rdx and then the next two doubled numbers are added to it before moving the final total into the return value register rax.
Example 2: Assembly Output (GCC)
        .text
        .globl _doubleAndThenSumNumbers
    _doubleAndThenSumNumbers:
        pushq %rbp
        movq %rsp, %rbp
        movq %rdi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq %rdx, -24(%rbp)
        salq -8(%rbp)
        salq -16(%rbp)
        salq -24(%rbp)
        movq -8(%rbp), %rdx
        movq -16(%rbp), %rax
        addq %rax, %rdx
        movq -24(%rbp), %rax
        addq %rdx, %rax
        popq %rbp
        ret
        .globl _main
    _main:
        pushq %rbp
        movq %rsp, %rbp
        subq $48, %rsp
        movl %edi, -36(%rbp)
        movq %rsi, -48(%rbp)
        movq $1, -8(%rbp)
        movq $2, -16(%rbp)
        movq $3, -24(%rbp)
        movq -24(%rbp), %rdx
        movq -16(%rbp), %rcx
        movq -8(%rbp), %rax
        movq %rcx, %rsi
        movq %rax, %rdi
        call _doubleAndThenSumNumbers
        movl $0, %eax
        leave
        ret
        .ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
        .subsections_via_symbols
For the most part main begins like we'd expect:

    movq $1, -8(%rbp)
    movq $2, -16(%rbp)
    movq $3, -24(%rbp)
    movq -24(%rbp), %rdx
    movq -16(%rbp), %rcx
    movq -8(%rbp), %rax
    movq %rcx, %rsi
    movq %rax, %rdi
    call _doubleAndThenSumNumbers

Here we see the usual moving of the arguments onto the stack before moving them into rdi, rsi, and rdx. I would love to know why GCC thinks it's necessary to move 1 and 2 from their place on the stack into rax and rcx first before moving the values to rdi and rsi. After the arguments are in place _doubleAndThenSumNumbers gets called.

    movq %rdi, -8(%rbp)
    movq %rsi, -16(%rbp)
    movq %rdx, -24(%rbp)

Inside _doubleAndThenSumNumbers things start off with the arguments being moved onto the stack.

    salq -8(%rbp)
    salq -16(%rbp)
    salq -24(%rbp)
    movq -8(%rbp), %rdx
    movq -16(%rbp), %rax
    addq %rax, %rdx
    movq -24(%rbp), %rax
    addq %rdx, %rax

The next three instructions shift the arguments left by one bit (since no second operand is specified a default of 1 is used). After that the first argument is moved into rdx, the second argument is moved into rax, and then the two are added together. Finally the third argument is moved into rax and rdx is added to it for the return value.
Differences between Clang and GCC
When it comes to how the function is called, the only thing Clang and GCC do differently is some of the steps they take before moving the arguments into the appropriate registers. I think it's interesting that inside _doubleAndThenSumNumbers Clang goes through the trouble of moving the arguments from the stack into a register before doing the left shift but GCC does not.
Thirteen Arguments
Now things finally start to get interesting. As noted before, the System V AMD64 ABI specifies that the first six arguments to a function should be passed via registers. Until now our contrived functions have had fewer than six arguments. Let's see what happens when we go well over that limit.
Example 3: Original C Code
    #include <stdint.h>

    uint64_t doubleAndThenSumNumbers(
        uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e,
        uint64_t f, uint64_t g, uint64_t h, uint64_t i, uint64_t j,
        uint64_t k, uint64_t l, uint64_t m) {
        a = a << 1;
        b = b << 1;
        c = c << 1;
        d = d << 1;
        e = e << 1;
        f = f << 1;
        g = g << 1;
        h = h << 1;
        i = i << 1;
        j = j << 1;
        k = k << 1;
        l = l << 1;
        m = m << 1;
        return a + b + c + d + e + f + g + h + i + j + k + l + m;
    }

    int main(int argc, char* argv[]) {
        uint64_t a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7,
                 h = 8, i = 9, j = 10, k = 11, l = 12, m = 13;
        doubleAndThenSumNumbers(a, b, c, d, e, f, g, h, i, j, k, l, m);
        return 0;
    }
Example 3: Assembly Output (Clang)
        .section __TEXT,__text,regular,pure_instructions
        .macosx_version_min 10, 13
        .globl _doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
        .p2align 4, 0x90
    _doubleAndThenSumNumbers: ## @doubleAndThenSumNumbers
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        pushq %r15
        pushq %r14
        pushq %r12
        pushq %rbx
        movq 64(%rbp), %rax
        movq 56(%rbp), %r10
        movq 48(%rbp), %r11
        movq 40(%rbp), %rbx
        movq 32(%rbp), %r14
        movq 24(%rbp), %r15
        movq 16(%rbp), %r12
        movq %rdi, -40(%rbp)
        movq %rsi, -48(%rbp)
        movq %rdx, -56(%rbp)
        movq %rcx, -64(%rbp)
        movq %r8, -72(%rbp)
        movq %r9, -80(%rbp)
        movq -40(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -40(%rbp)
        movq -48(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -48(%rbp)
        movq -56(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -56(%rbp)
        movq -64(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -64(%rbp)
        movq -72(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -72(%rbp)
        movq -80(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, -80(%rbp)
        movq 16(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 16(%rbp)
        movq 24(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 24(%rbp)
        movq 32(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 32(%rbp)
        movq 40(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 40(%rbp)
        movq 48(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 48(%rbp)
        movq 56(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 56(%rbp)
        movq 64(%rbp), %rcx
        shlq $1, %rcx
        movq %rcx, 64(%rbp)
        movq -40(%rbp), %rcx
        addq -48(%rbp), %rcx
        addq -56(%rbp), %rcx
        addq -64(%rbp), %rcx
        addq -72(%rbp), %rcx
        addq -80(%rbp), %rcx
        addq 16(%rbp), %rcx
        addq 24(%rbp), %rcx
        addq 32(%rbp), %rcx
        addq 40(%rbp), %rcx
        addq 48(%rbp), %rcx
        addq 56(%rbp), %rcx
        addq 64(%rbp), %rcx
        movq %rax, -88(%rbp) ## 8-byte Spill
        movq %rcx, %rax
        movq %r12, -96(%rbp) ## 8-byte Spill
        movq %r10, -104(%rbp) ## 8-byte Spill
        movq %r11, -112(%rbp) ## 8-byte Spill
        movq %rbx, -120(%rbp) ## 8-byte Spill
        movq %r14, -128(%rbp) ## 8-byte Spill
        movq %r15, -136(%rbp) ## 8-byte Spill
        popq %rbx
        popq %r12
        popq %r14
        popq %r15
        popq %rbp
        retq
        ## -- End function
        .globl _main ## -- Begin function main
        .p2align 4, 0x90
    _main: ## @main
    ## %bb.0:
        pushq %rbp
        movq %rsp, %rbp
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %rbx
        subq $184, %rsp
        movl $0, -44(%rbp)
        movl %edi, -48(%rbp)
        movq %rsi, -56(%rbp)
        movq $1, -64(%rbp)
        movq $2, -72(%rbp)
        movq $3, -80(%rbp)
        movq $4, -88(%rbp)
        movq $5, -96(%rbp)
        movq $6, -104(%rbp)
        movq $7, -112(%rbp)
        movq $8, -120(%rbp)
        movq $9, -128(%rbp)
        movq $10, -136(%rbp)
        movq $11, -144(%rbp)
        movq $12, -152(%rbp)
        movq $13, -160(%rbp)
        movq -64(%rbp), %rdi
        movq -72(%rbp), %rsi
        movq -80(%rbp), %rdx
        movq -88(%rbp), %rcx
        movq -96(%rbp), %r8
        movq -104(%rbp), %r9
        movq -112(%rbp), %rax
        movq -120(%rbp), %r10
        movq -128(%rbp), %r11
        movq -136(%rbp), %rbx
        movq -144(%rbp), %r14
        movq -152(%rbp), %r15
        movq -160(%rbp), %r12
        movq %rax, (%rsp)
        movq %r10, 8(%rsp)
        movq %r11, 16(%rsp)
        movq %rbx, 24(%rsp)
        movq %r14, 32(%rsp)
        movq %r15, 40(%rsp)
        movq %r12, 48(%rsp)
        callq _doubleAndThenSumNumbers
        xorl %r13d, %r13d
        movq %rax, -168(%rbp) ## 8-byte Spill
        movl %r13d, %eax
        addq $184, %rsp
        popq %rbx
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        popq %rbp
        retq
        ## -- End function
    .subsections_via_symbols
The first thing to note about this example is that before main even starts to do anything with the arguments themselves it saves the contents of rbx, r12, r13, r14, and r15 before allocating more stack space for local variables.

    pushq %r15
    pushq %r14
    pushq %r13
    pushq %r12
    pushq %rbx

These registers are callee-saved, meaning that a function that uses them must restore their original values before it returns.
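For reference, the set of general-purpose registers a called function has to preserve under the System V AMD64 ABI can be written down as a small lookup. This is only a sketch; is_callee_saved is a hypothetical helper name, and rsp is left out because it is preserved implicitly by balancing pushes and pops.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* General-purpose registers a called function must preserve under the
 * System V AMD64 ABI. */
static const char *const CALLEE_SAVED_REGISTERS[] = {
    "rbx", "rbp", "r12", "r13", "r14", "r15"
};

/* Hypothetical helper: reports whether a function that modifies the
 * named register has to restore it before returning. */
bool is_callee_saved(const char *register_name)
{
    size_t count = sizeof CALLEE_SAVED_REGISTERS / sizeof CALLEE_SAVED_REGISTERS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(CALLEE_SAVED_REGISTERS[i], register_name) == 0) {
            return true;
        }
    }
    return false;
}
```

Registers such as rax, r10, and r11 are not in this list, so a function is free to overwrite them without saving them first.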
    subq $184, %rsp
    movl $0, -44(%rbp)
    movl %edi, -48(%rbp)
    movq %rsi, -56(%rbp)
With 13 64-bit variables we're going to need a lot of room. 184 bytes on the stack should be enough.
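One way to account for the 184 bytes is to add up everything Clang's main stores in that space, going by the offsets in the full listing above. The following is only a sketch: the names are made up for illustration, and labelling the zeroed 4-byte slot as main's return value is an assumption rather than something the listing states.

```c
#include <stddef.h>

/* Sizes, in bytes, of everything Clang's main keeps in the space it
 * reserves with subq $184, %rsp, read off the offsets in the listing. */
enum {
    RETURN_VALUE_SLOT = 4,      /* movl $0, -44(%rbp), assumed to be main's return value */
    ARGC_SLOT         = 4,      /* movl %edi, -48(%rbp) */
    ARGV_SLOT         = 8,      /* movq %rsi, -56(%rbp) */
    LOCAL_VARIABLES   = 13 * 8, /* a through m, -64(%rbp) down to -160(%rbp) */
    SPILL_SLOT        = 8,      /* movq %rax, -168(%rbp) */
    OUTGOING_ARGS     = 7 * 8   /* (%rsp) through 48(%rsp) */
};

/* Hypothetical helper that adds the pieces up. */
size_t clang_main_frame_size(void)
{
    return RETURN_VALUE_SLOT + ARGC_SLOT + ARGV_SLOT +
           LOCAL_VARIABLES + SPILL_SLOT + OUTGOING_ARGS;
}
```

The pieces come to 4 + 4 + 8 + 104 + 8 + 56 = 184 bytes, so in this listing every byte of the reserved space is spoken for.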
    movq $1, -64(%rbp)
    movq $2, -72(%rbp)
    movq $3, -80(%rbp)
    movq $4, -88(%rbp)
    movq $5, -96(%rbp)
    movq $6, -104(%rbp)
    movq $7, -112(%rbp)
    movq $8, -120(%rbp)
    movq $9, -128(%rbp)
    movq $10, -136(%rbp)
    movq $11, -144(%rbp)
    movq $12, -152(%rbp)
    movq $13, -160(%rbp)
By now the above thirteen instructions should have been predictable. All the arguments for our function are moved onto the stack.
    movq -64(%rbp), %rdi
    movq -72(%rbp), %rsi
    movq -80(%rbp), %rdx
    movq -88(%rbp), %rcx
    movq -96(%rbp), %r8
    movq -104(%rbp), %r9
    movq -112(%rbp), %rax
    movq -120(%rbp), %r10
    movq -128(%rbp), %r11
    movq -136(%rbp), %rbx
    movq -144(%rbp), %r14
    movq -152(%rbp), %r15
    movq -160(%rbp), %r12

And here the arguments are moved from the stack to their respective registers. But what about the last seven arguments? The System V AMD64 ABI clearly states that only the first six integer arguments are passed via registers. Why then do all the arguments end up in registers?

    movq %rax, (%rsp)
    movq %r10, 8(%rsp)
    movq %r11, 16(%rsp)
    movq %rbx, 24(%rsp)
    movq %r14, 32(%rsp)
    movq %r15, 40(%rsp)
    movq %r12, 48(%rsp)
    callq _doubleAndThenSumNumbers

Honestly, I'm not 100% sure why Clang takes this intermediary step. But as we can see above, all the arguments end up on the stack. It's important to note that when calling a function with more than six arguments rsp should point to the seventh, with the rest immediately following, at the moment the call instruction executes.
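The layout the caller sets up can be expressed as a small formula. This is a sketch that assumes every argument is a 64-bit integer, as in this post; caller_stack_offset is a hypothetical helper name.

```c
#include <stddef.h>

enum {
    REGISTER_ARGUMENT_COUNT = 6, /* rdi, rsi, rdx, rcx, r8, r9 */
    ARGUMENT_SLOT_SIZE      = 8  /* each stack argument occupies 8 bytes */
};

/* Hypothetical helper: byte offset from rsp, just before the call
 * instruction runs, of the integer argument with the given one-based
 * position. Only meaningful for positions 7 and up. */
size_t caller_stack_offset(size_t position)
{
    return (position - REGISTER_ARGUMENT_COUNT - 1) * ARGUMENT_SLOT_SIZE;
}
```

For the seventh argument this gives 0, matching the store to (%rsp), and for the thirteenth it gives 48, matching the store to 48(%rsp).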
This time around _doubleAndThenSumNumbers is a bit of a doozy at first glance. If we break it down like we did main we'll see it's no more complicated than before.

    pushq %r15
    pushq %r14
    pushq %r12
    pushq %rbx

First the callee-saved registers are being pushed for preservation. This means we should expect them to be used inside this function.

    movq 64(%rbp), %rax
    movq 56(%rbp), %r10
    movq 48(%rbp), %r11
    movq 40(%rbp), %rbx
    movq 32(%rbp), %r14
    movq 24(%rbp), %r15
    movq 16(%rbp), %r12

And they immediately are. The seven arguments passed on the stack are moved into registers (rax, r10, and r11 don't need to be saved).
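The offsets 16 through 64 follow from what sits between rbp and the stack arguments once the function's prologue has run: the saved rbp that pushq %rbp stored, and above it the return address that the call instruction pushed. A sketch of that arithmetic, again assuming only 64-bit integer arguments and using a made-up helper name, callee_frame_offset:

```c
#include <stddef.h>

enum {
    SAVED_RBP_SIZE       = 8, /* pushed by the callee's pushq %rbp */
    RETURN_ADDRESS_SIZE  = 8, /* pushed by the caller's call instruction */
    STACK_ARGUMENT_SIZE  = 8, /* each stack argument occupies 8 bytes */
    FIRST_STACK_ARGUMENT = 7  /* one-based position of the first stack argument */
};

/* Hypothetical helper: byte offset from rbp, after the callee has run
 * pushq %rbp and movq %rsp, %rbp, of the integer argument with the
 * given one-based position. Only meaningful for positions 7 and up. */
size_t callee_frame_offset(size_t position)
{
    return SAVED_RBP_SIZE + RETURN_ADDRESS_SIZE +
           (position - FIRST_STACK_ARGUMENT) * STACK_ARGUMENT_SIZE;
}
```

The seventh argument lands at 16(%rbp) and the thirteenth at 64(%rbp), which are exactly the first and last offsets the function reads from.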

    movq %rdi, -40(%rbp)
    movq %rsi, -48(%rbp)
    movq %rdx, -56(%rbp)
    movq %rcx, -64(%rbp)
    movq %r8, -72(%rbp)
    movq %r9, -80(%rbp)

After that the rest of the arguments, the ones in the registers, are moved onto the stack.

    movq -40(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -40(%rbp)
    movq -48(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -48(%rbp)
    movq -56(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -56(%rbp)
    movq -64(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -64(%rbp)
    movq -72(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -72(%rbp)
    movq -80(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, -80(%rbp)
    movq 16(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 16(%rbp)
    movq 24(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 24(%rbp)
    movq 32(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 32(%rbp)
    movq 40(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 40(%rbp)
    movq 48(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 48(%rbp)
    movq 56(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 56(%rbp)
    movq 64(%rbp), %rcx
    shlq $1, %rcx
    movq %rcx, 64(%rbp)

This part looks way more complicated than it actually is. All it's doing now is going through each of our variables on the stack, moving it into rcx, left shifting it, and then putting it back into its place on the stack.

    movq -40(%rbp), %rcx
    addq -48(%rbp), %rcx
    addq -56(%rbp), %rcx
    addq -64(%rbp), %rcx
    addq -72(%rbp), %rcx
    addq -80(%rbp), %rcx
    addq 16(%rbp), %rcx
    addq 24(%rbp), %rcx
    addq 32(%rbp), %rcx
    addq 40(%rbp), %rcx
    addq 48(%rbp), %rcx
    addq 56(%rbp), %rcx
    addq 64(%rbp), %rcx

Now to complete the useful part of _doubleAndThenSumNumbers the first argument/variable is moved into rcx and then the rest are added to rcx.

    movq %rax, -88(%rbp) ## 8-byte Spill
    movq %rcx, %rax
    movq %r12, -96(%rbp) ## 8-byte Spill
    movq %r10, -104(%rbp) ## 8-byte Spill
    movq %r11, -112(%rbp) ## 8-byte Spill
    movq %rbx, -120(%rbp) ## 8-byte Spill
    movq %r14, -128(%rbp) ## 8-byte Spill
    movq %r15, -136(%rbp) ## 8-byte Spill

In amongst the register spill cleanup rcx is moved to the return register rax. The register spill is yet another artifact of the lack of optimization.

    popq %rbx
    popq %r12
    popq %r14
    popq %r15
    popq %rbp
    retq

And finally the callee-saved registers are popped before returning back to _main.

    xorl %r13d, %r13d
    movq %rax, -168(%rbp) ## 8-byte Spill
    movl %r13d, %eax
    addq $184, %rsp

To begin winding things down the bottom 32 bits of r13 (r13d) are zeroed out. The return value from _doubleAndThenSumNumbers stored in rax is moved onto the stack where it could be used later if we weren't about to exit. r13d is then moved into the return register eax. Then the 184 bytes of stack space reserved at the start of main are released.

    popq %rbx
    popq %r12
    popq %r13
    popq %r14
    popq %r15
    popq %rbp
    retq

Finally, just like inside _doubleAndThenSumNumbers, the callee-saved registers are restored before returning.
Example 3: Assembly Output (GCC)
        .text
        .globl _doubleAndThenSumNumbers
    _doubleAndThenSumNumbers:
        pushq %rbp
        movq %rsp, %rbp
        movq %rdi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq %rdx, -24(%rbp)
        movq %rcx, -32(%rbp)
        movq %r8, -40(%rbp)
        movq %r9, -48(%rbp)
        salq -8(%rbp)
        salq -16(%rbp)
        salq -24(%rbp)
        salq -32(%rbp)
        salq -40(%rbp)
        salq -48(%rbp)
        salq 16(%rbp)
        salq 24(%rbp)
        salq 32(%rbp)
        salq 40(%rbp)
        salq 48(%rbp)
        salq 56(%rbp)
        salq 64(%rbp)
        movq -8(%rbp), %rdx
        movq -16(%rbp), %rax
        addq %rax, %rdx
        movq -24(%rbp), %rax
        addq %rax, %rdx
        movq -32(%rbp), %rax
        addq %rax, %rdx
        movq -40(%rbp), %rax
        addq %rax, %rdx
        movq -48(%rbp), %rax
        addq %rax, %rdx
        movq 16(%rbp), %rax
        addq %rax, %rdx
        movq 24(%rbp), %rax
        addq %rax, %rdx
        movq 32(%rbp), %rax
        addq %rax, %rdx
        movq 40(%rbp), %rax
        addq %rax, %rdx
        movq 48(%rbp), %rax
        addq %rax, %rdx
        movq 56(%rbp), %rax
        addq %rax, %rdx
        movq 64(%rbp), %rax
        addq %rdx, %rax
        popq %rbp
        ret
        .globl _main
    _main:
        pushq %rbp
        movq %rsp, %rbp
        addq $-128, %rsp
        movl %edi, -116(%rbp)
        movq %rsi, -128(%rbp)
        movq $1, -8(%rbp)
        movq $2, -16(%rbp)
        movq $3, -24(%rbp)
        movq $4, -32(%rbp)
        movq $5, -40(%rbp)
        movq $6, -48(%rbp)
        movq $7, -56(%rbp)
        movq $8, -64(%rbp)
        movq $9, -72(%rbp)
        movq $10, -80(%rbp)
        movq $11, -88(%rbp)
        movq $12, -96(%rbp)
        movq $13, -104(%rbp)
        movq -48(%rbp), %r8
        movq -40(%rbp), %rdi
        movq -32(%rbp), %rcx
        movq -24(%rbp), %rdx
        movq -16(%rbp), %rsi
        movq -8(%rbp), %rax
        pushq -104(%rbp)
        pushq -96(%rbp)
        pushq -88(%rbp)
        pushq -80(%rbp)
        pushq -72(%rbp)
        pushq -64(%rbp)
        pushq -56(%rbp)
        movq %r8, %r9
        movq %rdi, %r8
        movq %rax, %rdi
        call _doubleAndThenSumNumbers
        addq $56, %rsp
        movl $0, %eax
        leave
        ret
        .ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
        .subsections_via_symbols
GCC's output is short and straightforward with only a few bits whose purpose isn't immediately obvious.

    addq $-128, %rsp
    movl %edi, -116(%rbp)
    movq %rsi, -128(%rbp)

GCC begins by first allocating 128 bytes of space on the stack before preserving the values of edi and rsi.
    movq $1, -8(%rbp)
    movq $2, -16(%rbp)
    movq $3, -24(%rbp)
    movq $4, -32(%rbp)
    movq $5, -40(%rbp)
    movq $6, -48(%rbp)
    movq $7, -56(%rbp)
    movq $8, -64(%rbp)
    movq $9, -72(%rbp)
    movq $10, -80(%rbp)
    movq $11, -88(%rbp)
    movq $12, -96(%rbp)
    movq $13, -104(%rbp)
Right away all of our integer values are moved onto the stack.
    movq -48(%rbp), %r8
    movq -40(%rbp), %rdi
    movq -32(%rbp), %rcx
    movq -24(%rbp), %rdx
    movq -16(%rbp), %rsi
    movq -8(%rbp), %rax

And then the first six are moved into registers, though not all of them into the ones the calling convention calls for just yet.

    pushq -104(%rbp)
    pushq -96(%rbp)
    pushq -88(%rbp)
    pushq -80(%rbp)
    pushq -72(%rbp)
    pushq -64(%rbp)
    pushq -56(%rbp)

The remaining seven are actually pushed onto the stack (as opposed to being moved), starting with the thirteenth so that the seventh ends up at the top of the stack.
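To see why pushing in reverse order produces the layout the callee expects, the pushes can be simulated with a toy stack. This is only an illustrative model, not how the hardware stack is implemented; ToyStack and its functions are hypothetical names, and each argument's value doubles as its position because the example passes 1 through 13.

```c
#include <stddef.h>
#include <stdint.h>

enum { STACK_SLOTS = 16 };

/* A toy stack that grows toward lower indices, the way the x86-64
 * stack grows toward lower addresses. */
typedef struct {
    uint64_t slots[STACK_SLOTS];
    size_t top; /* index of the most recently pushed slot */
} ToyStack;

void toy_stack_init(ToyStack *stack)
{
    stack->top = STACK_SLOTS; /* empty: top is one past the last slot */
}

/* Mirrors pushq: move the top down one slot, then store the value. */
void toy_stack_push(ToyStack *stack, uint64_t value)
{
    stack->top -= 1;
    stack->slots[stack->top] = value;
}

/* Pushes arguments 13 down to 7, in the same order as GCC's seven
 * pushq instructions. */
void push_stack_arguments(ToyStack *stack)
{
    for (uint64_t argument = 13; argument >= 7; argument--) {
        toy_stack_push(stack, argument);
    }
}
```

After push_stack_arguments runs, the slot at top holds 7 and the six slots above it hold 8 through 13 in ascending order, which is the arrangement the callee reads back through 16(%rbp) to 64(%rbp).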

    movq %r8, %r9
    movq %rdi, %r8
    movq %rax, %rdi
    call _doubleAndThenSumNumbers

Next there's a little bit of shuffling of the variables to put them in the right order. Above we saw that r9 wasn't used and rax was used instead of rdi. This section of code corrects that before calling _doubleAndThenSumNumbers.

    movq %rdi, -8(%rbp)
    movq %rsi, -16(%rbp)
    movq %rdx, -24(%rbp)
    movq %rcx, -32(%rbp)
    movq %r8, -40(%rbp)
    movq %r9, -48(%rbp)

Inside _doubleAndThenSumNumbers the arguments passed by registers are first moved onto the stack.
    salq -8(%rbp)
    salq -16(%rbp)
    salq -24(%rbp)
    salq -32(%rbp)
    salq -40(%rbp)
    salq -48(%rbp)
    salq 16(%rbp)
    salq 24(%rbp)
    salq 32(%rbp)
    salq 40(%rbp)
    salq 48(%rbp)
    salq 56(%rbp)
    salq 64(%rbp)
All of the arguments are shifted left by one bit.
    movq -8(%rbp), %rdx
    movq -16(%rbp), %rax
    addq %rax, %rdx
    movq -24(%rbp), %rax
    addq %rax, %rdx
    movq -32(%rbp), %rax
    addq %rax, %rdx
    movq -40(%rbp), %rax
    addq %rax, %rdx
    movq -48(%rbp), %rax
    addq %rax, %rdx
    movq 16(%rbp), %rax
    addq %rax, %rdx
    movq 24(%rbp), %rax
    addq %rax, %rdx
    movq 32(%rbp), %rax
    addq %rax, %rdx
    movq 40(%rbp), %rax
    addq %rax, %rdx
    movq 48(%rbp), %rax
    addq %rax, %rdx
    movq 56(%rbp), %rax
    addq %rax, %rdx
    movq 64(%rbp), %rax
    addq %rdx, %rax

More straightforward but suboptimal code. The first doubled number is moved into rdx. After that each doubled number but the last is moved into rax and then added to rdx. Finally the last doubled number is moved into rax and rdx is added to it, which leaves the total in the return register rax.
Differences between Clang and GCC
Perhaps the largest difference between Clang and GCC in this example is that GCC does the left shift on the variables while they're on the stack. Obviously the location of the operand for salq doesn't matter as far as the calculation goes, but it can be as much as four times slower than a combined mov from the stack to a register and salq using the register's contents. I wrote a short program to test the performance of both methods and I found doing the left shift directly on a variable on the stack was about 1.5 times slower. It's likely the CPU is being clever about how it actually executes the instructions so the difference in a tight loop isn't as great as it could otherwise be.
With optimizations turned on both Clang and GCC produce nearly identical code that makes clever use of leaq. I think it's interesting that Clang appears to be making some minimal amount of consideration for performance while GCC opts for somewhat more straightforward output.
Conclusions
Hopefully this post has shed some light on what is happening when you call a function (in C anyways) on a modern computer. After going through the exercise of writing some simple programs and inspecting the assembly output produced by a couple of compilers it's hard for me to even remember exactly what my own initial confusion was regarding the System V ABI. Sometimes seeing something in action in a "real" scenario brings a clarity that reading a description in a spec can't.
Related Reading
The following links are highly recommended for an in depth understanding of the System V ABI.
- eds. Matz, Hubicka, Jaeger, Mitchell. System V Application Binary Interface - AMD64 Architecture Processor Supplement - Draft Version 0.99.7
- Intel 64 and IA-32 Architectures Software Developer's Manual
- Bryant, Randal E. and O'Hallaron, David R. x86-64 Machine-Level Programming
- Bendersky, Eli. Stack frame layout on x86-64
- Gunawardena, Ananda. Lecture 27 - C and Assembly
- Doeppner. x64 Cheat Sheet
- x86 Disassembly/Functions and Stack Frames
- System V AMD64 ABI
- System V ABI
- Calling Conventions
- Hubicka, Jan. Function Calling Sequence
- x86-64 Instructions and ABI
- Fog, Agner. Calling conventions for different C++ compilers and operating systems
Supplemental Reading
These links aren't relevant to the subject of this post but are nonetheless very helpful if you're not intimately familiar with assembly on UNIX-like systems.
- x64 Architecture
- Kalra, Mohit. gdb — Assembly Language Debugging 101
- Michaux, Peter. Assembly "hello, world" for OS X
- Albert, David. Understanding C by learning assembly
- Raskind, Gwynne. Disassembling the Assembly, Part 1
- .align directive proper usage with .align(5) and 0x90
- How can I examine the stack frame with GDB?
- Why do byte spills occur and what do they achieve?