Lately I've been having fun with Clang's (and GCC's) -S option in order to examine the assembly output of small programs written in C. I've found this is a really great way to learn both what your compiler is doing with your code and how to write assembly code on your own. One of the more interesting things I've learned from doing this is just how function calls are made on UNIX-like systems (Linux, BSD, macOS, Solaris, etc.) running on x86-64 hardware. When using a high(er) level language like C it's really not something I ever think about; the compiler takes care of the details. But what exactly happens when you call a function?
The answer is actually fairly complicated and depends on several factors including the computer's architecture, the operating system, the language, the compiler, and the number and nature of the arguments being passed to the function. For simplicity's sake in this post I'm going to make a few assumptions: we're on an x86-64 machine, running macOS, using C, compiling with Clang or GCC, and only passing 64-bit integer arguments. Given all that, what we'll be looking at is the System V AMD64 ABI calling convention. Under this convention the first six integer arguments are passed, in order from left to right, via the rdi, rsi, rdx, rcx, r8, and r9 registers. Any remaining integer arguments are passed on the stack and are pushed in reverse order (right to left), so the seventh argument ends up at the lowest address. Floating point arguments are a little different and are beyond the scope of this post.
Sometimes the best way to really get a feel for how something works in general is to look at several specific examples that differ only slightly. In the case of function calling conventions one important aspect that can easily be changed is the number of arguments being passed. The following scenarios present a simple and contrived C program and its corresponding assembly output. Each example calls a function with more arguments than the last.
It's important to note that the output presented here is being created without any optimizations. Compiler optimizations tend to produce much faster code at the expense of understandability. Oftentimes the resulting optimized assembly code bears little resemblance to the structure of the original C code. Of course, it's a good idea to try looking at the output produced with optimizations, but for the purposes of this post it would only confuse things. Because no optimization is being done the compiler is more or less forced to make naive assumptions about the state of the call stack. This can result in verbose assembly output that appears to serve little or no purpose in these or similar contrived programs.
Each example program below has been compiled with both Clang (1000.11.45.5) and GCC (9.2.0) using the following flags: -O0 -S -fno-asynchronous-unwind-tables. I'll go over the Clang output and then the GCC output individually before comparing the two.
One Argument
When only one argument is passed to the function only the rdi register is used. It's about as straightforward as you can hope for. The program below takes an integer and passes it to a function called doubleNumber which uses a left shift to double the argument before returning the result.
Example 1: Original C Code
#include <stdint.h>
uint64_t doubleNumber(uint64_t a) {
return a << 1;
}
int main(int argc, char* argv[]) {
uint64_t a = 1;
doubleNumber(a);
return 0;
}
Note that the result is not printed after being calculated as it turns out that printf, and variadic functions in general, use a slightly different calling convention that is beyond the scope of this post. Additionally, if you ran this code through a linter (such as Splint) it would warn you about the return value of doubleNumber being unused. This is intentional, but we'll see how the System V AMD64 ABI handles integer return values in a moment even if nothing in this post actually uses them.
Example 1: Assembly Output (Clang)
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 13
.globl _doubleNumber ## -- Begin function doubleNumber
.p2align 4, 0x90
_doubleNumber: ## @doubleNumber
## %bb.0:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rdi
shlq $1, %rdi
movq %rdi, %rax
popq %rbp
retq
## -- End function
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
## %bb.0:
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
movl $0, -4(%rbp)
movl %edi, -8(%rbp)
movq %rsi, -16(%rbp)
movq $1, -24(%rbp)
movq -24(%rbp), %rdi
callq _doubleNumber
xorl %ecx, %ecx
movq %rax, -32(%rbp) ## 8-byte Spill
movl %ecx, %eax
addq $32, %rsp
popq %rbp
retq
## -- End function
.subsections_via_symbols
The key part of the above assembly output is the following three instructions:
movq $1, -24(%rbp)
movq -24(%rbp), %rdi
callq _doubleNumber
The first instruction stores our immediate argument, 1, in main's stack frame (24 bytes below the address held in rbp); that slot is the home of the local variable a. The second loads the argument into rdi for the call to _doubleNumber. And the third actually calls the function. In this particular case storing the argument on the stack is not necessary, and in fact if you change the above to just be
movq $1, %rdi
callq _doubleNumber
it will work just fine. I'm fairly certain the compiler does this because without further analysis (the kind it does when making optimizations) it can't know for sure whether or not the original value will be needed later, so it keeps every local variable in its own stack slot and reloads it wherever it's used.
Let's take a closer look inside _doubleNumber:
movq %rdi, -8(%rbp)
movq -8(%rbp), %rdi
shlq $1, %rdi
movq %rdi, %rax
We can see some similar business with the stack is going on before doing a logical left shift on rdi and moving rdi into rax as the function return value. Like in the main function, all that stack manipulation isn't really necessary.
Example 1: Assembly Output (GCC)
.text
.globl _doubleNumber
_doubleNumber:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
addq %rax, %rax
popq %rbp
ret
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movq $1, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call _doubleNumber
movl $0, %eax
leave
ret
.ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
.subsections_via_symbols
Inside the main function there are four instructions worth looking at:
movq $1, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call _doubleNumber
This stores the immediate value 1 on the stack (8 bytes below the address in rbp) just in case it is needed later, loads that value from the stack into rax, copies rax into rdi, and then finally calls _doubleNumber. All of the funny business with placing the argument on the stack and moving it into rax is unnecessary but should come as no surprise when considering the lack of optimization.
GCC's output for doubleNumber, however, is somewhat unexpected:
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
addq %rax, %rax
First, the argument in rdi is moved onto the stack and from there it's moved into the return value register rax. As we've seen before, using the stack isn't necessary here. The reason I say this output is unexpected is because -O0 was used, which should eliminate all optimizations (at least that's how I understand it). Despite using a left shift in the C code we can see that GCC instead simply adds the contents of rax to itself. This is functionally equivalent and sets things up nicely for the function return since the final computed value is already in rax.
Differences between Clang and GCC
In this first example the output of Clang and GCC is largely the same. There are some superficial differences in where they store values on the stack before calling _doubleNumber but the real surprise is the use of addq over shlq.
Three Arguments
In order to clearly demonstrate the usage of three arguments this program is slightly different from the first. Instead of only doubling a number it doubles three numbers and adds them together.
Example 2: Original C Code
#include <stdint.h>
uint64_t doubleAndThenSumNumbers(uint64_t a, uint64_t b, uint64_t c) {
a = a << 1;
b = b << 1;
c = c << 1;
return a + b + c;
}
int main(int argc, char* argv[]) {
uint64_t
a = 1,
b = 2,
c = 3;
doubleAndThenSumNumbers(a, b, c);
return 0;
}
Example 2: Assembly Output (Clang)
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 13
.globl _doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
.p2align 4, 0x90
_doubleAndThenSumNumbers: ## @doubleAndThenSumNumbers
## %bb.0:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
movq -8(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -8(%rbp)
movq -16(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -16(%rbp)
movq -24(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -24(%rbp)
movq -8(%rbp), %rdx
addq -16(%rbp), %rdx
addq -24(%rbp), %rdx
movq %rdx, %rax
popq %rbp
retq
## -- End function
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
## %bb.0:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl $0, -4(%rbp)
movl %edi, -8(%rbp)
movq %rsi, -16(%rbp)
movq $1, -24(%rbp)
movq $2, -32(%rbp)
movq $3, -40(%rbp)
movq -24(%rbp), %rdi
movq -32(%rbp), %rsi
movq -40(%rbp), %rdx
callq _doubleAndThenSumNumbers
xorl %ecx, %ecx
movq %rax, -48(%rbp) ## 8-byte Spill
movl %ecx, %eax
addq $48, %rsp
popq %rbp
retq
## -- End function
.subsections_via_symbols
The important part of the main function looks pretty familiar:
movq $1, -24(%rbp)
movq $2, -32(%rbp)
movq $3, -40(%rbp)
movq -24(%rbp), %rdi
movq -32(%rbp), %rsi
movq -40(%rbp), %rdx
callq _doubleAndThenSumNumbers
Just like the first example, this program moves the immediate value 1 onto the stack for safe keeping before moving it into rdi. It also stores the values 2 and 3 on the stack and then moves them into rsi and rdx respectively before calling _doubleAndThenSumNumbers just like you'd expect based on the calling convention.
Let's take a look at the interesting parts of _doubleAndThenSumNumbers:
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
The function starts off by moving all three arguments onto the stack.
movq -8(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -8(%rbp)
movq -16(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -16(%rbp)
movq -24(%rbp), %rdx
shlq $1, %rdx
movq %rdx, -24(%rbp)
And then it goes through each argument, moving it into rdx, left shifting it, and then putting it back on the stack.
movq -8(%rbp), %rdx
addq -16(%rbp), %rdx
addq -24(%rbp), %rdx
movq %rdx, %rax
Finally the first doubled number is moved into rdx and then the next two doubled numbers are added to it before moving the final total into the return value register rax.
Example 2: Assembly Output (GCC)
.text
.globl _doubleAndThenSumNumbers
_doubleAndThenSumNumbers:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
salq -8(%rbp)
salq -16(%rbp)
salq -24(%rbp)
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
addq %rax, %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
popq %rbp
ret
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl %edi, -36(%rbp)
movq %rsi, -48(%rbp)
movq $1, -8(%rbp)
movq $2, -16(%rbp)
movq $3, -24(%rbp)
movq -24(%rbp), %rdx
movq -16(%rbp), %rcx
movq -8(%rbp), %rax
movq %rcx, %rsi
movq %rax, %rdi
call _doubleAndThenSumNumbers
movl $0, %eax
leave
ret
.ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
.subsections_via_symbols
For the most part main begins like we'd expect:
movq $1, -8(%rbp)
movq $2, -16(%rbp)
movq $3, -24(%rbp)
movq -24(%rbp), %rdx
movq -16(%rbp), %rcx
movq -8(%rbp), %rax
movq %rcx, %rsi
movq %rax, %rdi
call _doubleAndThenSumNumbers
Here we see the usual moving the arguments onto the stack before moving them into rdi, rsi, and rdx. I would love to know why GCC thinks it's necessary to move 1 and 2 from their place on the stack into rax and rcx first before moving the values to rdi and rsi. After the arguments are in place _doubleAndThenSumNumbers gets called.
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
Inside _doubleAndThenSumNumbers things start off with the arguments being moved onto the stack.
salq -8(%rbp)
salq -16(%rbp)
salq -24(%rbp)
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
addq %rax, %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
The first three instructions shift the arguments left by one bit, in place on the stack (since no shift count is specified, a count of 1 is used). After that the first argument is moved into rdx, the second argument is moved into rax, and then the two are added together. Finally the third argument is moved into rax and rdx is added to it for the return value.
Differences between Clang and GCC
When it comes to how the function is called, the only thing Clang and GCC do differently is some of the steps they take before moving the arguments into the appropriate registers. I think it's interesting that inside _doubleAndThenSumNumbers Clang goes through the trouble of moving the arguments from the stack into a register before doing the left shift but GCC does not.
Thirteen Arguments
Now things finally start to get interesting. As noted before, the System V AMD64 ABI specifies that the first six integer arguments to a function should be passed via registers. Until now our contrived functions have had fewer than six arguments. Let's see what happens when we go well over that limit.
Example 3: Original C Code
#include <stdint.h>
uint64_t doubleAndThenSumNumbers(
uint64_t a,
uint64_t b,
uint64_t c,
uint64_t d,
uint64_t e,
uint64_t f,
uint64_t g,
uint64_t h,
uint64_t i,
uint64_t j,
uint64_t k,
uint64_t l,
uint64_t m) {
a = a << 1;
b = b << 1;
c = c << 1;
d = d << 1;
e = e << 1;
f = f << 1;
g = g << 1;
h = h << 1;
i = i << 1;
j = j << 1;
k = k << 1;
l = l << 1;
m = m << 1;
return a + b + c + d + e + f + g + h + i + j + k + l + m;
}
int main(int argc, char* argv[]) {
uint64_t
a = 1,
b = 2,
c = 3,
d = 4,
e = 5,
f = 6,
g = 7,
h = 8,
i = 9,
j = 10,
k = 11,
l = 12,
m = 13;
doubleAndThenSumNumbers(a, b, c, d, e, f, g, h, i, j, k, l, m);
return 0;
}
Example 3: Assembly Output (Clang)
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 13
.globl _doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
.p2align 4, 0x90
_doubleAndThenSumNumbers: ## @doubleAndThenSumNumbers
## %bb.0:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
movq 64(%rbp), %rax
movq 56(%rbp), %r10
movq 48(%rbp), %r11
movq 40(%rbp), %rbx
movq 32(%rbp), %r14
movq 24(%rbp), %r15
movq 16(%rbp), %r12
movq %rdi, -40(%rbp)
movq %rsi, -48(%rbp)
movq %rdx, -56(%rbp)
movq %rcx, -64(%rbp)
movq %r8, -72(%rbp)
movq %r9, -80(%rbp)
movq -40(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -40(%rbp)
movq -48(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -48(%rbp)
movq -56(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -56(%rbp)
movq -64(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -64(%rbp)
movq -72(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -72(%rbp)
movq -80(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -80(%rbp)
movq 16(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 16(%rbp)
movq 24(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 24(%rbp)
movq 32(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 32(%rbp)
movq 40(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 40(%rbp)
movq 48(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 48(%rbp)
movq 56(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 56(%rbp)
movq 64(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 64(%rbp)
movq -40(%rbp), %rcx
addq -48(%rbp), %rcx
addq -56(%rbp), %rcx
addq -64(%rbp), %rcx
addq -72(%rbp), %rcx
addq -80(%rbp), %rcx
addq 16(%rbp), %rcx
addq 24(%rbp), %rcx
addq 32(%rbp), %rcx
addq 40(%rbp), %rcx
addq 48(%rbp), %rcx
addq 56(%rbp), %rcx
addq 64(%rbp), %rcx
movq %rax, -88(%rbp) ## 8-byte Spill
movq %rcx, %rax
movq %r12, -96(%rbp) ## 8-byte Spill
movq %r10, -104(%rbp) ## 8-byte Spill
movq %r11, -112(%rbp) ## 8-byte Spill
movq %rbx, -120(%rbp) ## 8-byte Spill
movq %r14, -128(%rbp) ## 8-byte Spill
movq %r15, -136(%rbp) ## 8-byte Spill
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
retq
## -- End function
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
## %bb.0:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $184, %rsp
movl $0, -44(%rbp)
movl %edi, -48(%rbp)
movq %rsi, -56(%rbp)
movq $1, -64(%rbp)
movq $2, -72(%rbp)
movq $3, -80(%rbp)
movq $4, -88(%rbp)
movq $5, -96(%rbp)
movq $6, -104(%rbp)
movq $7, -112(%rbp)
movq $8, -120(%rbp)
movq $9, -128(%rbp)
movq $10, -136(%rbp)
movq $11, -144(%rbp)
movq $12, -152(%rbp)
movq $13, -160(%rbp)
movq -64(%rbp), %rdi
movq -72(%rbp), %rsi
movq -80(%rbp), %rdx
movq -88(%rbp), %rcx
movq -96(%rbp), %r8
movq -104(%rbp), %r9
movq -112(%rbp), %rax
movq -120(%rbp), %r10
movq -128(%rbp), %r11
movq -136(%rbp), %rbx
movq -144(%rbp), %r14
movq -152(%rbp), %r15
movq -160(%rbp), %r12
movq %rax, (%rsp)
movq %r10, 8(%rsp)
movq %r11, 16(%rsp)
movq %rbx, 24(%rsp)
movq %r14, 32(%rsp)
movq %r15, 40(%rsp)
movq %r12, 48(%rsp)
callq _doubleAndThenSumNumbers
xorl %r13d, %r13d
movq %rax, -168(%rbp) ## 8-byte Spill
movl %r13d, %eax
addq $184, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
## -- End function
.subsections_via_symbols
The first thing to note about this example is that before it even starts to do anything with the arguments themselves it saves the contents of rbx, r12, r13, r14, and r15 before allocating more stack space for local variables.
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
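These registers are callee-saved (also called callee-preserved), meaning a function that wants to use them must save their original values and restore them before returning, so the caller can rely on them surviving the call.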
subq $184, %rsp
movl $0, -44(%rbp)
movl %edi, -48(%rbp)
movq %rsi, -56(%rbp)
With 13 64-bit variables we're going to need a lot of room. 184 bytes on the stack should be enough.
movq $1, -64(%rbp)
movq $2, -72(%rbp)
movq $3, -80(%rbp)
movq $4, -88(%rbp)
movq $5, -96(%rbp)
movq $6, -104(%rbp)
movq $7, -112(%rbp)
movq $8, -120(%rbp)
movq $9, -128(%rbp)
movq $10, -136(%rbp)
movq $11, -144(%rbp)
movq $12, -152(%rbp)
movq $13, -160(%rbp)
By now the above thirteen instructions should have been predictable. All the arguments for our function are moved onto the stack.
movq -64(%rbp), %rdi
movq -72(%rbp), %rsi
movq -80(%rbp), %rdx
movq -88(%rbp), %rcx
movq -96(%rbp), %r8
movq -104(%rbp), %r9
movq -112(%rbp), %rax
movq -120(%rbp), %r10
movq -128(%rbp), %r11
movq -136(%rbp), %rbx
movq -144(%rbp), %r14
movq -152(%rbp), %r15
movq -160(%rbp), %r12
And here the arguments are moved from the stack to their respective registers. But what about the last seven arguments? The System V AMD64 ABI clearly states that only the first six integer arguments are passed via registers. Why then do all the arguments end up in registers?
movq %rax, (%rsp)
movq %r10, 8(%rsp)
movq %r11, 16(%rsp)
movq %rbx, 24(%rsp)
movq %r14, 32(%rsp)
movq %r15, 40(%rsp)
movq %r12, 48(%rsp)
callq _doubleAndThenSumNumbers
Honestly, I'm not 100% sure why Clang takes this intermediary step. But as we can see above, the last seven arguments do end up on the stack, right where the callee will look for them. It's important to note that when calling a function with more than six arguments, rsp should point to the seventh argument at the moment the call instruction executes, with the rest immediately following at higher addresses.
This time around _doubleAndThenSumNumbers is a bit of a doozy at first glance. If we break it down like we did main we'll see it's no more complicated than before.
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
First the callee saved registers are being pushed for preservation. This means we should expect them to be used inside this function.
movq 64(%rbp), %rax
movq 56(%rbp), %r10
movq 48(%rbp), %r11
movq 40(%rbp), %rbx
movq 32(%rbp), %r14
movq 24(%rbp), %r15
movq 16(%rbp), %r12
And they immediately are. The seven arguments passed on the stack are moved into registers (rax, r10, and r11 don't need to be saved).
movq %rdi, -40(%rbp)
movq %rsi, -48(%rbp)
movq %rdx, -56(%rbp)
movq %rcx, -64(%rbp)
movq %r8, -72(%rbp)
movq %r9, -80(%rbp)
After that the rest of the arguments, the ones in the registers, are moved onto the stack.
movq -40(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -40(%rbp)
movq -48(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -48(%rbp)
movq -56(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -56(%rbp)
movq -64(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -64(%rbp)
movq -72(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -72(%rbp)
movq -80(%rbp), %rcx
shlq $1, %rcx
movq %rcx, -80(%rbp)
movq 16(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 16(%rbp)
movq 24(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 24(%rbp)
movq 32(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 32(%rbp)
movq 40(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 40(%rbp)
movq 48(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 48(%rbp)
movq 56(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 56(%rbp)
movq 64(%rbp), %rcx
shlq $1, %rcx
movq %rcx, 64(%rbp)
This part looks way more complicated than it actually is. All it's doing now is going through each of our variables on the stack, moving it into rcx, left shifting it, and then putting it back into its place on the stack.
movq -40(%rbp), %rcx
addq -48(%rbp), %rcx
addq -56(%rbp), %rcx
addq -64(%rbp), %rcx
addq -72(%rbp), %rcx
addq -80(%rbp), %rcx
addq 16(%rbp), %rcx
addq 24(%rbp), %rcx
addq 32(%rbp), %rcx
addq 40(%rbp), %rcx
addq 48(%rbp), %rcx
addq 56(%rbp), %rcx
addq 64(%rbp), %rcx
Now to complete the useful part of _doubleAndThenSumNumbers the first argument/variable is moved into rcx and then the rest are added to rcx.
movq %rax, -88(%rbp) ## 8-byte Spill
movq %rcx, %rax
movq %r12, -96(%rbp) ## 8-byte Spill
movq %r10, -104(%rbp) ## 8-byte Spill
movq %r11, -112(%rbp) ## 8-byte Spill
movq %rbx, -120(%rbp) ## 8-byte Spill
movq %r14, -128(%rbp) ## 8-byte Spill
movq %r15, -136(%rbp) ## 8-byte Spill
In amongst the register spill cleanup rcx is moved to the return register rax. The register spill is yet another artifact of the lack of optimization.
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
retq
And finally the callee-saved registers are popped before returning back to _main.
xorl %r13d, %r13d
movq %rax, -168(%rbp) ## 8-byte Spill
movl %r13d, %eax
addq $184, %rsp
To begin winding things down, r13d (the bottom 32 bits of r13) is zeroed out with an xor. On x86-64, writing to a 32-bit register also clears the upper 32 bits, so all of r13 ends up zero. The return value from _doubleAndThenSumNumbers stored in rax is moved onto the stack where it could be used later if we weren't about to exit. r13d is then moved into eax as main's return value. And the stack space allocated earlier is released.
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Finally, just like inside _doubleAndThenSumNumbers the callee saved registers are restored before returning.
Example 3: Assembly Output (GCC)
.text
.globl _doubleAndThenSumNumbers
_doubleAndThenSumNumbers:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
movq %rcx, -32(%rbp)
movq %r8, -40(%rbp)
movq %r9, -48(%rbp)
salq -8(%rbp)
salq -16(%rbp)
salq -24(%rbp)
salq -32(%rbp)
salq -40(%rbp)
salq -48(%rbp)
salq 16(%rbp)
salq 24(%rbp)
salq 32(%rbp)
salq 40(%rbp)
salq 48(%rbp)
salq 56(%rbp)
salq 64(%rbp)
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
addq %rax, %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movq -32(%rbp), %rax
addq %rax, %rdx
movq -40(%rbp), %rax
addq %rax, %rdx
movq -48(%rbp), %rax
addq %rax, %rdx
movq 16(%rbp), %rax
addq %rax, %rdx
movq 24(%rbp), %rax
addq %rax, %rdx
movq 32(%rbp), %rax
addq %rax, %rdx
movq 40(%rbp), %rax
addq %rax, %rdx
movq 48(%rbp), %rax
addq %rax, %rdx
movq 56(%rbp), %rax
addq %rax, %rdx
movq 64(%rbp), %rax
addq %rdx, %rax
popq %rbp
ret
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
addq $-128, %rsp
movl %edi, -116(%rbp)
movq %rsi, -128(%rbp)
movq $1, -8(%rbp)
movq $2, -16(%rbp)
movq $3, -24(%rbp)
movq $4, -32(%rbp)
movq $5, -40(%rbp)
movq $6, -48(%rbp)
movq $7, -56(%rbp)
movq $8, -64(%rbp)
movq $9, -72(%rbp)
movq $10, -80(%rbp)
movq $11, -88(%rbp)
movq $12, -96(%rbp)
movq $13, -104(%rbp)
movq -48(%rbp), %r8
movq -40(%rbp), %rdi
movq -32(%rbp), %rcx
movq -24(%rbp), %rdx
movq -16(%rbp), %rsi
movq -8(%rbp), %rax
pushq -104(%rbp)
pushq -96(%rbp)
pushq -88(%rbp)
pushq -80(%rbp)
pushq -72(%rbp)
pushq -64(%rbp)
pushq -56(%rbp)
movq %r8, %r9
movq %rdi, %r8
movq %rax, %rdi
call _doubleAndThenSumNumbers
addq $56, %rsp
movl $0, %eax
leave
ret
.ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
.subsections_via_symbols
GCC's output is short and straightforward with only a few bits whose purpose isn't immediately obvious.
addq $-128, %rsp
movl %edi, -116(%rbp)
movq %rsi, -128(%rbp)
GCC begins by first allocating 128 bytes of space on the stack before saving the values of edi and rsi (argc and argv).
movq $1, -8(%rbp)
movq $2, -16(%rbp)
movq $3, -24(%rbp)
movq $4, -32(%rbp)
movq $5, -40(%rbp)
movq $6, -48(%rbp)
movq $7, -56(%rbp)
movq $8, -64(%rbp)
movq $9, -72(%rbp)
movq $10, -80(%rbp)
movq $11, -88(%rbp)
movq $12, -96(%rbp)
movq $13, -104(%rbp)
Right away all of our integer values are moved onto the stack.
movq -48(%rbp), %r8
movq -40(%rbp), %rdi
movq -32(%rbp), %rcx
movq -24(%rbp), %rdx
movq -16(%rbp), %rsi
movq -8(%rbp), %rax
And then the first six are loaded into registers, though not all of them are in their final registers yet.
pushq -104(%rbp)
pushq -96(%rbp)
pushq -88(%rbp)
pushq -80(%rbp)
pushq -72(%rbp)
pushq -64(%rbp)
pushq -56(%rbp)
The remaining seven are actually pushed onto the stack (as opposed to being moved), starting with the thirteenth and ending with the seventh.
movq %r8, %r9
movq %rdi, %r8
movq %rax, %rdi
call _doubleAndThenSumNumbers
Next there's a little bit of shuffling of the variables to put them in the right registers. Above we saw that the sixth argument was loaded into r8 instead of r9, the fifth into rdi instead of r8, and the first into rax instead of rdi. This section of code corrects that before calling _doubleAndThenSumNumbers.
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
movq %rcx, -32(%rbp)
movq %r8, -40(%rbp)
movq %r9, -48(%rbp)
Inside _doubleAndThenSumNumbers the arguments passed by registers are first moved onto the stack.
salq -8(%rbp)
salq -16(%rbp)
salq -24(%rbp)
salq -32(%rbp)
salq -40(%rbp)
salq -48(%rbp)
salq 16(%rbp)
salq 24(%rbp)
salq 32(%rbp)
salq 40(%rbp)
salq 48(%rbp)
salq 56(%rbp)
salq 64(%rbp)
All of the arguments are shifted left by one bit.
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
addq %rax, %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movq -32(%rbp), %rax
addq %rax, %rdx
movq -40(%rbp), %rax
addq %rax, %rdx
movq -48(%rbp), %rax
addq %rax, %rdx
movq 16(%rbp), %rax
addq %rax, %rdx
movq 24(%rbp), %rax
addq %rax, %rdx
movq 32(%rbp), %rax
addq %rax, %rdx
movq 40(%rbp), %rax
addq %rax, %rdx
movq 48(%rbp), %rax
addq %rax, %rdx
movq 56(%rbp), %rax
addq %rax, %rdx
movq 64(%rbp), %rax
addq %rdx, %rax
More straightforward but suboptimal code. The first doubled number is moved into rdx. After that each doubled number in turn is moved into rax and then added to rdx. Finally the last doubled number is moved into rax and rdx is added to it, leaving the total in the return register rax.
Differences between Clang and GCC
Perhaps the largest difference between Clang and GCC in this example is that GCC does the left shift on the variables while they're on the stack. The location of the operand for salq doesn't matter as far as the calculation goes, but shifting a value in memory can be as much as four times slower than a mov from the stack to a register combined with a shift of the register's contents. I wrote a short program to test the performance of both methods and found that doing the left shift directly on a variable on the stack was about 1.5 times slower. It's likely the CPU is being clever about how it actually executes the instructions so the difference in a tight loop isn't as great as it could otherwise be.
With optimizations turned on both Clang and GCC produce nearly identical code that makes clever use of leaq. I think it's interesting that Clang appears to be making some minimal amount of consideration for performance while GCC opts for somewhat more straightforward output.
Conclusions
Hopefully this post has shed some light on what is happening when you call a function (in C anyway) on a modern computer. After going through the exercise of writing some simple programs and inspecting the assembly output produced by a couple of compilers it's hard for me to even remember exactly what my own initial confusion was regarding the System V ABI. Sometimes seeing something in action in a "real" scenario brings a clarity that reading a description in a spec can't.
Related Reading
The following links are highly recommended for an in depth understanding of the System V ABI.
- eds. Matz, Hubicka, Jaeger, Mitchell. System V Application Binary Interface - AMD64 Architecture Processor Supplement - Draft Version 0.99.7
- Intel 64 and IA-32 Architectures Software Developer's Manual
- Bryant, Randal E. and O'Hallaron, David R. x86-64 Machine-Level Programming
- Bendersky, Eli. Stack frame layout on x86-64
- Gunawardena, Ananda. Lecture 27 - C and Assembly
- Doeppner. x64 Cheat Sheet
- x86 Disassembly/Functions and Stack Frames
- System V AMD64 ABI
- System V ABI
- Calling Conventions
- Hubicka, Jan. Function Calling Sequence
- x86-64 Instructions and ABI
- Fog, Agner. Calling conventions for different C++ compilers and operating systems
Supplemental Reading
These links aren't relevant to the subject of this post but are nonetheless very helpful if you're not intimately familiar with assembly on UNIX-like systems.
- x64 Architecture
- Kalra, Mohit. gdb — Assembly Language Debugging 101
- Michaux, Peter. Assembly "hello, world" for OS X
- Albert, David. Understanding C by learning assembly
- Raskind, Gwynne. Disassembling the Assembly, Part 1
- .align directive proper usage with .align(5) and 0x90
- How can I examine the stack frame with GDB?
- Why do byte spills occur and what do they achieve?