Lately I've been having fun with Clang's (and GCC's) -S option to examine the assembly output of small programs written in C. I've found this is a really great way to learn both what your compiler is doing with your code and how to write assembly code on your own. One of the more interesting things I've learned from doing this is just how function calls are made on UNIX-like systems (Linux, BSD, macOS, Solaris, etc.) running on x86-64 hardware. When using a high(er) level language like C it's really not something I ever think about; the compiler takes care of the details. But what exactly happens when you call a function?
The answer is actually fairly complicated and depends on several factors including the computer's architecture, the operating system, the language, the compiler, and the number and nature of the arguments being passed to the function. For simplicity's sake in this post I'm going to make a few assumptions: we're on an x86-64 machine, running macOS, using C, compiling with Clang or GCC, and only passing 64-bit integer arguments. Given all that, what we'll be looking at is the System V AMD64 ABI calling convention. Under this convention the first six integer arguments are passed via the rdi, rsi, rdx, rcx, r8, and r9 registers, in that order (the first argument in rdi, the second in rsi, and so on). After that, integer arguments are passed on the stack in reverse order, so the last argument is placed first. Floating point arguments are a little different and are beyond the scope of this post.
Sometimes the best way to really get a feel for how something works in general is to look at several specific examples that differ only slightly. In the case of function calling conventions one important aspect that can easily be changed is the number of arguments being passed. The following scenarios present a simple and contrived C program and its corresponding assembly output. Each example calls a function with more arguments than the last.
It's important to note that the output presented here is being created without any optimizations. Compiler optimizations tend to produce much faster code at the expense of understandability. Oftentimes the resulting optimized assembly code bears little resemblance to the structure of the original C code. Of course, it's a good idea to try looking at the output produced with optimizations, but for the purposes of this post it would only confuse things. Because no optimization is being done, the compiler is more or less forced to make naive assumptions about the state of the call stack. This can result in verbose assembly output that appears to serve little or no purpose in these or similar contrived programs.
Each example program below has been compiled with both Clang (1000.11.45.5) and GCC (9.2.0) using the following flags: -O0 -S -fno-asynchronous-unwind-tables. I'll go over the Clang output and then the GCC output individually before comparing the two.
One Argument
When only one argument is passed to the function, only the rdi register is used. It's about as straightforward as you can hope for. The program below takes an integer and passes it to a function called doubleNumber, which uses a left shift to double the argument before returning the result.
Example 1: Original C Code
#include <stdint.h>

uint64_t doubleNumber(uint64_t a) {
    return a << 1;
}

int main(int argc, char* argv[]) {
    uint64_t a = 1;
    doubleNumber(a);
    return 0;
}
Note that the result is not printed after being calculated as it turns out that printf, and variadic functions in general, use a slightly different calling convention that is beyond the scope of this post. Additionally, if you ran this code through a linter (such as Splint) it would warn you about the return value of doubleNumber being unused. This is intentional, but we'll see how the System V AMD64 ABI handles integer return values in a moment even if nothing in this post actually uses them.
Example 1: Assembly Output (Clang)
    .section __TEXT,__text,regular,pure_instructions
    .macosx_version_min 10, 13
    .globl _doubleNumber            ## -- Begin function doubleNumber
    .p2align 4, 0x90
_doubleNumber:                      ## @doubleNumber
## %bb.0:
    pushq %rbp
    movq %rsp, %rbp
    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rdi
    shlq $1, %rdi
    movq %rdi, %rax
    popq %rbp
    retq
                                    ## -- End function
    .globl _main                    ## -- Begin function main
    .p2align 4, 0x90
_main:                              ## @main
## %bb.0:
    pushq %rbp
    movq %rsp, %rbp
    subq $32, %rsp
    movl $0, -4(%rbp)
    movl %edi, -8(%rbp)
    movq %rsi, -16(%rbp)
    movq $1, -24(%rbp)
    movq -24(%rbp), %rdi
    callq _doubleNumber
    xorl %ecx, %ecx
    movq %rax, -32(%rbp)            ## 8-byte Spill
    movl %ecx, %eax
    addq $32, %rsp
    popq %rbp
    retq
                                    ## -- End function
.subsections_via_symbols
The key part of the above assembly output is the following three instructions:
    movq $1, -24(%rbp)
    movq -24(%rbp), %rdi
    callq _doubleNumber
The first instruction stores our immediate argument, 1, in a stack slot 24 bytes below wherever rbp is pointing, presumably for safekeeping. The second loads the argument from that slot into rdi for the call to _doubleNumber. And the third actually calls the function. In this particular case storing the argument on the stack is not necessary, and in fact if you change the above to just be
    movq $1, %rdi
    callq _doubleNumber
it will work just fine. I'm fairly certain the compiler does this because without further analysis (the kind it does when making optimizations) it can't know for sure whether or not the original value will be needed later.
Let's take a closer look inside _doubleNumber:
    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rdi
    shlq $1, %rdi
    movq %rdi, %rax
We can see some similar business with the stack is going on before doing a logical left shift on rdi and moving rdi into rax as the function return value. Like in the main function, all that stack manipulation isn't really necessary.
Example 1: Assembly Output (GCC)
    .text
    .globl _doubleNumber
_doubleNumber:
    pushq %rbp
    movq %rsp, %rbp
    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rax
    addq %rax, %rax
    popq %rbp
    ret
    .globl _main
_main:
    pushq %rbp
    movq %rsp, %rbp
    subq $32, %rsp
    movl %edi, -20(%rbp)
    movq %rsi, -32(%rbp)
    movq $1, -8(%rbp)
    movq -8(%rbp), %rax
    movq %rax, %rdi
    call _doubleNumber
    movl $0, %eax
    leave
    ret
    .ident "GCC: (Homebrew GCC 9.2.0) 9.2.0"
    .subsections_via_symbols
Inside the main function there are four instructions worth looking at:
    movq $1, -8(%rbp)
    movq -8(%rbp), %rax
    movq %rax, %rdi
    call _doubleNumber
This stores the immediate value 1 on the stack (8 bytes below where rbp is pointing) just in case it is needed later, loads that value from the stack into rax, moves rax into rdi, and then finally calls _doubleNumber. All of the funny business with placing the argument on the stack and moving it into rax is unnecessary but should come as no surprise when considering the lack of optimization.
GCC's output for doubleNumber, however, is somewhat unexpected:
    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rax
    addq %rax, %rax
First, the argument in rdi is moved onto the stack and from there it's moved into the return value register rax. As we've seen before, using the stack isn't necessary here. The reason I say this output is unexpected is that -O0 was used, which should eliminate all optimizations (at least that's how I understand it), though evidently GCC still applies some simple transformations like this one even at -O0. Despite using a left shift in the C code we can see that GCC instead simply adds the contents of rax to itself. This is functionally equivalent and sets things up nicely for the function return since the final computed value is already in rax.
Differences between Clang and GCC
In this first example the output of Clang and GCC is largely the same. There are some superficial differences in where they store values on the stack before calling _doubleNumber, but the real surprise is GCC's use of addq over shlq.