Anatomy of a Function Call

Lately I've been having fun with Clang's (and GCC's) -S option in order to examine the assembly output of small programs written in C. I've found this is a really great way to learn both what your compiler is doing with your code and how to write assembly code on your own. One of the more interesting things I've learned from doing this is just how function calls are made on UNIX-like systems (Linux, BSD, macOS, Solaris, etc.) running on x86-64 hardware. When using a high(er) level language like C it's really not something I ever think about; the compiler takes care of the details. But what exactly happens when you call a function?

The answer is actually fairly complicated and depends on several factors including the computer's architecture, the operating system, the language, the compiler, and the number and nature of the arguments being passed to the function. For simplicity's sake in this post I'm going to make a few assumptions: we're on an x86-64 machine, running macOS, using C, compiling with Clang or GCC, and only passing 64-bit integer arguments. Given all that, what we'll be looking at is the System V AMD64 ABI calling convention. Under this convention the first six integer arguments are passed via the rdi, rsi, rdx, rcx, r8, and r9 registers, in that order: the first argument goes in rdi, the second in rsi, and so on. Any remaining integer arguments are passed on the stack, pushed in reverse order so that the seventh argument ends up closest to the top of the stack. Floating point arguments are a little different and are beyond the scope of this post.
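As a rough illustration of that rule, here is a small sketch that maps an argument's zero-based position to the place the caller puts it. The names ArgLocation, locateIntegerArgument, and checkThirteenArguments are made up for this post and aren't part of any ABI header, and the sketch assumes every argument is a 64-bit integer.

```c
#include <assert.h>
#include <stddef.h>

/* Where one 64-bit integer argument lives at the moment of the call. */
typedef struct {
  const char *reg;    /* register name, or NULL when passed on the stack */
  size_t stackOffset; /* byte offset from rsp; only meaningful if reg is NULL */
} ArgLocation;

/* Hypothetical helper: index is the zero-based position of the argument. */
ArgLocation locateIntegerArgument(size_t index) {
  /* The first six integer arguments use these registers, left to right. */
  static const char *const registers[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
  const size_t registerCount = sizeof registers / sizeof registers[0];
  ArgLocation location = {NULL, 0};

  if (index < registerCount) {
    location.reg = registers[index];
  } else {
    /* The seventh argument sits at (%rsp), the eighth at 8(%rsp), ... */
    location.stackOffset = (index - registerCount) * 8;
  }
  return location;
}

/* Sanity check for a call that passes thirteen arguments. */
void checkThirteenArguments(void) {
  assert(locateIntegerArgument(5).reg != NULL);        /* sixth: still in r9 */
  assert(locateIntegerArgument(6).reg == NULL);        /* seventh: on the stack */
  assert(locateIntegerArgument(12).stackOffset == 48); /* thirteenth: 48(%rsp) */
}
```

The offsets are relative to rsp just before the call instruction runs; the call itself pushes a return address, so the callee sees everything 8 bytes further away.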

Sometimes the best way to really get a feel for how something works in general is to look at several specific examples that differ only slightly. In the case of function calling conventions one important aspect that can easily be changed is the number of arguments being passed. The following scenarios present a simple and contrived C program and its corresponding assembly output. Each example calls a function with more arguments than the last.

It's important to note that the output presented here is being created without any optimizations. Compiler optimizations tend to produce much faster code at the expense of understandability. Oftentimes the resulting optimized assembly code bears little resemblance to the structure of the original C code. Of course, it's a good idea to try looking at the output produced with optimizations, but for the purposes of this post it would only confuse things. Because no optimization is being done, the compiler is more or less forced to make naive assumptions about the state of the call stack. This can result in verbose assembly output that appears to serve little or no purpose in these or similar contrived programs.

Each example program below has been compiled with both Clang (1000.11.45.5) and GCC (9.2.0) using the following flags: -O0 -S -fno-asynchronous-unwind-tables. I'll go over the Clang output and then the GCC output individually before comparing the two.

One Argument

When only one argument is passed to the function, only the rdi register is used. It's about as straightforward as you can hope for. The program below takes an integer and passes it to a function called doubleNumber which uses a left shift to double the argument before returning the result.

Example 1: Original C Code

#include <stdint.h>

uint64_t doubleNumber(uint64_t a) {
  return a << 1;
}

int main(int argc, char* argv[]) {
  uint64_t a = 1;

  doubleNumber(a);

  return 0;
}

Note that the result is not printed after being calculated as it turns out that printf, and variadic functions in general, use a slightly different calling convention that is beyond the scope of this post. Additionally, if you ran this code through a linter (such as Splint) it would warn you about the return value of doubleNumber being unused. This is intentional, but we'll see how the System V AMD64 ABI handles integer return values in a moment even if nothing in this post actually uses them.

Example 1: Assembly Output (Clang)

	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 13
	.globl	_doubleNumber           ## -- Begin function doubleNumber
	.p2align	4, 0x90
_doubleNumber:                          ## @doubleNumber
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	shlq	$1, %rdi
	movq	%rdi, %rax
	popq	%rbp
	retq
                                        ## -- End function
	.globl	_main                   ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$32, %rsp
	movl	$0, -4(%rbp)
	movl	%edi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	$1, -24(%rbp)
	movq	-24(%rbp), %rdi
	callq	_doubleNumber
	xorl	%ecx, %ecx
	movq	%rax, -32(%rbp)         ## 8-byte Spill
	movl	%ecx, %eax
	addq	$32, %rsp
	popq	%rbp
	retq
                                        ## -- End function

.subsections_via_symbols

The key part of the above assembly output is the following three instructions:

	movq    $1, -24(%rbp)
	movq    -24(%rbp), %rdi
	callq   _doubleNumber

The first instruction stores our immediate argument, 1, in main's stack frame (24 bytes below the address rbp is pointing to); that slot is the local variable a. The second moves the argument into rdi for the call to _doubleNumber. And the third actually calls the function. In this particular case storing the argument on the stack is not necessary, and in fact if you change the above to just be

	movq    $1, %rdi
	callq   _doubleNumber

it will work just fine. I'm fairly certain the compiler does this because without further analysis (the kind it does when making optimizations) it can't know for sure whether or not the original value will be needed later.

Let's take a closer look inside _doubleNumber:

	movq    %rdi, -8(%rbp)
	movq    -8(%rbp), %rdi
	shlq    $1, %rdi
	movq    %rdi, %rax

We can see some similar business with the stack is going on before doing a logical left shift on rdi and moving rdi into rax as the function return value. Like in the main function, all that stack manipulation isn't really necessary.

Example 1: Assembly Output (GCC)

	.text
	.globl _doubleNumber
_doubleNumber:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	addq	%rax, %rax
	popq	%rbp
	ret
	.globl _main
_main:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$32, %rsp
	movl	%edi, -20(%rbp)
	movq	%rsi, -32(%rbp)
	movq	$1, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, %rdi
	call	_doubleNumber
	movl	$0, %eax
	leave
	ret
        .ident	"GCC: (Homebrew GCC 9.2.0) 9.2.0"
	.subsections_via_symbols

Inside the main function there are four instructions worth looking at:

	movq    $1, -8(%rbp)
	movq    -8(%rbp), %rax
	movq    %rax, %rdi
	call    _doubleNumber

This moves the immediate value 1 onto the stack (8 bytes below where rbp is pointing) just in case it is needed later, takes that value on the stack and moves it into rax, moves rax into rdi, and then finally calls _doubleNumber. All of the funny business with placing the argument on the stack and moving it into rax is unnecessary but should come as no surprise when considering the lack of optimization.

GCC's output for doubleNumber, however, is somewhat unexpected:

	movq    %rdi, -8(%rbp)
	movq    -8(%rbp), %rax
	addq    %rax, %rax

First, the argument in rdi is moved onto the stack and from there it's moved into the return value register rax. As we've seen before, using the stack isn't necessary here. The reason I say this output is unexpected is because -O0 was used, which I understood to turn off all optimizations. In practice -O0 disables most optimization passes, but GCC still appears to make simple substitutions like this one while choosing instructions. Despite using a left shift in the C code we can see that GCC instead simply adds the contents of rax to itself. This is functionally equivalent and sets things up nicely for the function return since the final computed value is already in rax.
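To convince yourself that the substitution is safe, here is a minimal sketch comparing the two forms in C. The function names are invented for illustration. For an unsigned 64-bit value both operations wrap modulo 2^64, so they agree even when the top bit is lost.

```c
#include <assert.h>
#include <stdint.h>

/* The operation as written in the C source (shlq in Clang's output). */
uint64_t doubleWithShift(uint64_t a) {
  return a << 1;
}

/* The operation GCC emitted instead (addq %rax, %rax). */
uint64_t doubleWithAdd(uint64_t a) {
  return a + a;
}

/* Aborts if the two forms ever disagree for the given input. */
void checkDoublingAgrees(uint64_t a) {
  assert(doubleWithShift(a) == doubleWithAdd(a));
}
```

Calling checkDoublingAgrees with values such as 0, 1, and UINT64_MAX exercises both the ordinary case and the wraparound case.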

Differences between Clang and GCC

In this first example the output of Clang and GCC is largely the same. There are some superficial differences in where they store values on the stack before calling _doubleNumber, but the real surprise is the use of addq over shlq.

Three Arguments

In order to clearly demonstrate the usage of three arguments this program is slightly different from the first. Instead of only doubling a number it doubles three numbers and adds them together.

Example 2: Original C Code

#include <stdint.h>

uint64_t doubleAndThenSumNumbers(uint64_t a, uint64_t b, uint64_t c) {
  a = a << 1;
  b = b << 1;
  c = c << 1;

  return a + b + c;
}

int main(int argc, char* argv[]) {
  uint64_t
  a = 1,
  b = 2,
  c = 3;

  doubleAndThenSumNumbers(a, b, c);

  return 0;
}

Example 2: Assembly Output (Clang)

	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 13
	.globl	_doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
	.p2align	4, 0x90
_doubleAndThenSumNumbers:               ## @doubleAndThenSumNumbers
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	%rdx, -24(%rbp)
	movq	-8(%rbp), %rdx
	shlq	$1, %rdx
	movq	%rdx, -8(%rbp)
	movq	-16(%rbp), %rdx
	shlq	$1, %rdx
	movq	%rdx, -16(%rbp)
	movq	-24(%rbp), %rdx
	shlq	$1, %rdx
	movq	%rdx, -24(%rbp)
	movq	-8(%rbp), %rdx
	addq	-16(%rbp), %rdx
	addq	-24(%rbp), %rdx
	movq	%rdx, %rax
	popq	%rbp
	retq
                                        ## -- End function
	.globl	_main                   ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$48, %rsp
	movl	$0, -4(%rbp)
	movl	%edi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	$1, -24(%rbp)
	movq	$2, -32(%rbp)
	movq	$3, -40(%rbp)
	movq	-24(%rbp), %rdi
	movq	-32(%rbp), %rsi
	movq	-40(%rbp), %rdx
	callq	_doubleAndThenSumNumbers
	xorl	%ecx, %ecx
	movq	%rax, -48(%rbp)         ## 8-byte Spill
	movl	%ecx, %eax
	addq	$48, %rsp
	popq	%rbp
	retq
                                        ## -- End function

.subsections_via_symbols

The important part of the main function looks pretty familiar:

        movq    $1, -24(%rbp)
        movq    $2, -32(%rbp)
        movq    $3, -40(%rbp)
        movq    -24(%rbp), %rdi
        movq    -32(%rbp), %rsi
        movq    -40(%rbp), %rdx
        callq   _doubleAndThenSumNumbers

Just like the first example, this program moves the immediate value 1 onto the stack for safe keeping before moving it into rdi. It also stores the values 2 and 3 on the stack and then moves them into rsi and rdx respectively before calling _doubleAndThenSumNumbers, just like you'd expect based on the calling convention.

Let's take a look at the interesting parts of _doubleAndThenSumNumbers:

        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rdx, -24(%rbp)

The function starts off by moving all three arguments onto the stack.

        movq    -8(%rbp), %rdx
        shlq    $1, %rdx
        movq    %rdx, -8(%rbp)
        movq    -16(%rbp), %rdx
        shlq    $1, %rdx
        movq    %rdx, -16(%rbp)
        movq    -24(%rbp), %rdx
        shlq    $1, %rdx
        movq    %rdx, -24(%rbp)

And then it goes through each argument, moving it into rdx, left shifting it, and then putting it back on the stack.

        movq    -8(%rbp), %rdx
        addq    -16(%rbp), %rdx
        addq    -24(%rbp), %rdx
        movq    %rdx, %rax

Finally the first doubled number is moved into rdx and then the next two doubled numbers are added to it before moving the final total into the return value register rax.

Example 2: Assembly Output (GCC)

	.text
	.globl _doubleAndThenSumNumbers
_doubleAndThenSumNumbers:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	%rdx, -24(%rbp)
	salq	-8(%rbp)
	salq	-16(%rbp)
	salq	-24(%rbp)
	movq	-8(%rbp), %rdx
	movq	-16(%rbp), %rax
	addq	%rax, %rdx
	movq	-24(%rbp), %rax
	addq	%rdx, %rax
	popq	%rbp
	ret
	.globl _main
_main:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$48, %rsp
	movl	%edi, -36(%rbp)
	movq	%rsi, -48(%rbp)
	movq	$1, -8(%rbp)
	movq	$2, -16(%rbp)
	movq	$3, -24(%rbp)
	movq	-24(%rbp), %rdx
	movq	-16(%rbp), %rcx
	movq	-8(%rbp), %rax
	movq	%rcx, %rsi
	movq	%rax, %rdi
	call	_doubleAndThenSumNumbers
	movl	$0, %eax
	leave
	ret
	.ident	"GCC: (Homebrew GCC 9.2.0) 9.2.0"
	.subsections_via_symbols

For the most part main begins like we'd expect:

        movq    $1, -8(%rbp)
        movq    $2, -16(%rbp)
        movq    $3, -24(%rbp)
        movq    -24(%rbp), %rdx
        movq    -16(%rbp), %rcx
        movq    -8(%rbp), %rax
        movq    %rcx, %rsi
        movq    %rax, %rdi
	call	_doubleAndThenSumNumbers

Here we see the usual moving the arguments onto the stack before moving them into rdi, rsi, and rdx. I would love to know why GCC thinks it's necessary to move 1 and 2 from their place on the stack into rax and rcx first before moving the values to rdi and rsi. After the arguments are in place _doubleAndThenSumNumbers gets called.

        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rdx, -24(%rbp)

Inside _doubleAndThenSumNumbers things start off with the arguments being moved onto the stack.

        salq    -8(%rbp)
        salq    -16(%rbp)
        salq    -24(%rbp)
        movq    -8(%rbp), %rdx
        movq    -16(%rbp), %rax
        addq    %rax, %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax

The next three instructions shift the arguments left by one bit (since no second operand is specified a default of 1 is used). After that the first argument is moved into rdx, the second argument is moved into rax, and then the two are added together. Finally the third argument is moved into rax and rdx is added to it for the return value.

Differences between Clang and GCC

When it comes to how the function is called, the only thing Clang and GCC do differently is some of the steps they take before moving the arguments into the appropriate registers. I think it's interesting that inside _doubleAndThenSumNumbers Clang goes through the trouble of moving the arguments from the stack into a register before doing the left shift but GCC does not.
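If it helps to see the two shapes side by side, the sketch below transcribes each compiler's instruction sequence into C, one statement per instruction or so. It's only a model: the slot and register variable names are my own, and a real compiler is free to rearrange any of this.

```c
#include <assert.h>
#include <stdint.h>

/* Clang's shape: load a slot into a scratch register, shift, store it back. */
uint64_t sumClangStyle(uint64_t rdi, uint64_t rsi, uint64_t rdx) {
  uint64_t slotA = rdi, slotB = rsi, slotC = rdx; /* arguments copied to the stack */
  uint64_t scratch;                               /* plays the role of rdx */

  scratch = slotA; scratch <<= 1; slotA = scratch;
  scratch = slotB; scratch <<= 1; slotB = scratch;
  scratch = slotC; scratch <<= 1; slotC = scratch;

  scratch = slotA;
  scratch += slotB;
  scratch += slotC;
  return scratch; /* movq %rdx, %rax */
}

/* GCC's shape: shift each slot in place, then sum through two registers. */
uint64_t sumGccStyle(uint64_t rdi, uint64_t rsi, uint64_t rdx) {
  uint64_t slotA = rdi, slotB = rsi, slotC = rdx;
  uint64_t rdxReg, raxReg;

  slotA <<= 1; /* salq -8(%rbp) */
  slotB <<= 1; /* salq -16(%rbp) */
  slotC <<= 1; /* salq -24(%rbp) */

  rdxReg = slotA;
  raxReg = slotB;
  rdxReg += raxReg;
  raxReg = slotC;
  raxReg += rdxReg; /* the total lands directly in rax */
  return raxReg;
}

/* Aborts if the two transcriptions ever produce different totals. */
void checkStylesAgree(uint64_t a, uint64_t b, uint64_t c) {
  assert(sumClangStyle(a, b, c) == sumGccStyle(a, b, c));
}
```

With the arguments from the example program (1, 2, and 3) both versions return 12.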

Thirteen Arguments

Now things finally start to get interesting. As noted before, the System V AMD64 ABI specifies that the first six integer arguments to a function should be passed via registers. Until now our contrived functions have had fewer than six arguments. Let's see what happens when we go well over that limit.

Example 3: Original C Code

#include <stdint.h>

uint64_t doubleAndThenSumNumbers(
uint64_t a,
uint64_t b,
uint64_t c,
uint64_t d,
uint64_t e,
uint64_t f,
uint64_t g,
uint64_t h,
uint64_t i,
uint64_t j,
uint64_t k,
uint64_t l,
uint64_t m) {
  a = a << 1;
  b = b << 1;
  c = c << 1;
  d = d << 1;
  e = e << 1;
  f = f << 1;
  g = g << 1;
  h = h << 1;
  i = i << 1;
  j = j << 1;
  k = k << 1;
  l = l << 1;
  m = m << 1;

  return a + b + c + d + e + f + g + h + i + j + k + l + m;
}

int main(int argc, char* argv[]) {
  uint64_t
  a = 1,
  b = 2,
  c = 3,
  d = 4,
  e = 5,
  f = 6,
  g = 7,
  h = 8,
  i = 9,
  j = 10,
  k = 11,
  l = 12,
  m = 13;

  doubleAndThenSumNumbers(a, b, c, d, e, f, g, h, i, j, k, l, m);

  return 0;
}

Example 3: Assembly Output (Clang)

	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 13
	.globl	_doubleAndThenSumNumbers ## -- Begin function doubleAndThenSumNumbers
	.p2align	4, 0x90
_doubleAndThenSumNumbers:               ## @doubleAndThenSumNumbers
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rbx
	movq	64(%rbp), %rax
	movq	56(%rbp), %r10
	movq	48(%rbp), %r11
	movq	40(%rbp), %rbx
	movq	32(%rbp), %r14
	movq	24(%rbp), %r15
	movq	16(%rbp), %r12
	movq	%rdi, -40(%rbp)
	movq	%rsi, -48(%rbp)
	movq	%rdx, -56(%rbp)
	movq	%rcx, -64(%rbp)
	movq	%r8, -72(%rbp)
	movq	%r9, -80(%rbp)
	movq	-40(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -40(%rbp)
	movq	-48(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -48(%rbp)
	movq	-56(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -56(%rbp)
	movq	-64(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -64(%rbp)
	movq	-72(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -72(%rbp)
	movq	-80(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -80(%rbp)
	movq	16(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 16(%rbp)
	movq	24(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 24(%rbp)
	movq	32(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 32(%rbp)
	movq	40(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 40(%rbp)
	movq	48(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 48(%rbp)
	movq	56(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 56(%rbp)
	movq	64(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 64(%rbp)
	movq	-40(%rbp), %rcx
	addq	-48(%rbp), %rcx
	addq	-56(%rbp), %rcx
	addq	-64(%rbp), %rcx
	addq	-72(%rbp), %rcx
	addq	-80(%rbp), %rcx
	addq	16(%rbp), %rcx
	addq	24(%rbp), %rcx
	addq	32(%rbp), %rcx
	addq	40(%rbp), %rcx
	addq	48(%rbp), %rcx
	addq	56(%rbp), %rcx
	addq	64(%rbp), %rcx
	movq	%rax, -88(%rbp)         ## 8-byte Spill
	movq	%rcx, %rax
	movq	%r12, -96(%rbp)         ## 8-byte Spill
	movq	%r10, -104(%rbp)        ## 8-byte Spill
	movq	%r11, -112(%rbp)        ## 8-byte Spill
	movq	%rbx, -120(%rbp)        ## 8-byte Spill
	movq	%r14, -128(%rbp)        ## 8-byte Spill
	movq	%r15, -136(%rbp)        ## 8-byte Spill
	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
                                        ## -- End function
	.globl	_main                   ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
## %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$184, %rsp
	movl	$0, -44(%rbp)
	movl	%edi, -48(%rbp)
	movq	%rsi, -56(%rbp)
	movq	$1, -64(%rbp)
	movq	$2, -72(%rbp)
	movq	$3, -80(%rbp)
	movq	$4, -88(%rbp)
	movq	$5, -96(%rbp)
	movq	$6, -104(%rbp)
	movq	$7, -112(%rbp)
	movq	$8, -120(%rbp)
	movq	$9, -128(%rbp)
	movq	$10, -136(%rbp)
	movq	$11, -144(%rbp)
	movq	$12, -152(%rbp)
	movq	$13, -160(%rbp)
	movq	-64(%rbp), %rdi
	movq	-72(%rbp), %rsi
	movq	-80(%rbp), %rdx
	movq	-88(%rbp), %rcx
	movq	-96(%rbp), %r8
	movq	-104(%rbp), %r9
	movq	-112(%rbp), %rax
	movq	-120(%rbp), %r10
	movq	-128(%rbp), %r11
	movq	-136(%rbp), %rbx
	movq	-144(%rbp), %r14
	movq	-152(%rbp), %r15
	movq	-160(%rbp), %r12
	movq	%rax, (%rsp)
	movq	%r10, 8(%rsp)
	movq	%r11, 16(%rsp)
	movq	%rbx, 24(%rsp)
	movq	%r14, 32(%rsp)
	movq	%r15, 40(%rsp)
	movq	%r12, 48(%rsp)
	callq	_doubleAndThenSumNumbers
	xorl	%r13d, %r13d
	movq	%rax, -168(%rbp)        ## 8-byte Spill
	movl	%r13d, %eax
	addq	$184, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
                                        ## -- End function

.subsections_via_symbols

The first thing to note about this example is that before main even starts to do anything with the arguments themselves it saves the contents of rbx, r12, r13, r14, and r15, and only then allocates more stack space for local variables.

        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx

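These registers are callee preserved, meaning that a function which wants to use them must save their values first and restore them before it returns, so the caller never sees them change across a function call.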
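For reference, here is a small lookup sketch of which general-purpose registers fall into that category under the System V AMD64 ABI. The function name isCalleeSaved is hypothetical, and the sketch ignores floating point and vector registers entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if a function must hand reg back unchanged. */
bool isCalleeSaved(const char *reg) {
  static const char *const calleeSaved[] = {
    "rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"
  };

  assert(reg != NULL);
  for (size_t i = 0; i < sizeof calleeSaved / sizeof calleeSaved[0]; i++) {
    if (strcmp(reg, calleeSaved[i]) == 0) {
      return true;
    }
  }
  /* rax, rcx, rdx, rsi, rdi, and r8 through r11 may be overwritten freely. */
  return false;
}
```

This is why the listing pushes rbx and r12 through r15 but doesn't bother saving rax, r10, or r11.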

        subq    $184, %rsp
        movl    $0, -44(%rbp)
        movl    %edi, -48(%rbp)
        movq    %rsi, -56(%rbp)

With 13 64-bit variables we're going to need a lot of room. 184 bytes on the stack should be enough.
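Where does 184 come from? The breakdown below is my own reading of the offsets in the listing rather than anything Clang documents, so treat it as a sketch. The constant and function names are invented for illustration, and I'm assuming the zeroed 4-byte slot at -44(%rbp) is main's return value.

```c
#include <assert.h>
#include <stddef.h>

enum {
  SLOT = 8,             /* bytes in one 64-bit slot */
  PUSHED_REGISTERS = 5, /* r15, r14, r13, r12, rbx (pushed after rbp) */
  LOCALS = 13,          /* a through m, at -64(%rbp) down to -160(%rbp) */
  SPILL_SLOTS = 1,      /* the saved return value at -168(%rbp) */
  STACK_ARGUMENTS = 7   /* copies placed at (%rsp) through 48(%rsp) */
};

/* Bytes reserved by subq in Clang's main, per my reading of the listing. */
size_t mainFrameBytes(void) {
  size_t bookkeeping = 4 + 4 + SLOT; /* zeroed slot, argc, argv */
  return bookkeeping + LOCALS * SLOT + SPILL_SLOTS * SLOT + STACK_ARGUMENTS * SLOT;
}

/* Distance from rbp down to rsp at the call instruction. */
size_t bytesBelowRbp(void) {
  size_t total = PUSHED_REGISTERS * SLOT + mainFrameBytes();
  assert(total % 16 == 0); /* rsp must be 16-byte aligned at a call */
  return total;
}
```

The pieces add up to 16 + 104 + 8 + 56 = 184, and together with the 40 bytes of pushed registers rsp ends up 224 bytes below rbp, which keeps it on a 16-byte boundary as the ABI requires.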

        movq    $1, -64(%rbp)
        movq    $2, -72(%rbp)
        movq    $3, -80(%rbp)
        movq    $4, -88(%rbp)
        movq    $5, -96(%rbp)
        movq    $6, -104(%rbp)
        movq    $7, -112(%rbp)
        movq    $8, -120(%rbp)
        movq    $9, -128(%rbp)
        movq    $10, -136(%rbp)
        movq    $11, -144(%rbp)
        movq    $12, -152(%rbp)
        movq    $13, -160(%rbp)

By now the above thirteen instructions should have been predictable. All the arguments for our function are moved onto the stack.

        movq    -64(%rbp), %rdi
        movq    -72(%rbp), %rsi
        movq    -80(%rbp), %rdx
        movq    -88(%rbp), %rcx
        movq    -96(%rbp), %r8
        movq    -104(%rbp), %r9
        movq    -112(%rbp), %rax
        movq    -120(%rbp), %r10
        movq    -128(%rbp), %r11
        movq    -136(%rbp), %rbx
        movq    -144(%rbp), %r14
        movq    -152(%rbp), %r15
        movq    -160(%rbp), %r12

And here the arguments are moved from the stack to their respective registers. But what about the last seven arguments? The System V AMD64 ABI clearly states that only the first six integer arguments are passed via registers. Why then do all the arguments end up in registers?

        movq    %rax, (%rsp)
        movq    %r10, 8(%rsp)
        movq    %r11, 16(%rsp)
        movq    %rbx, 24(%rsp)
        movq    %r14, 32(%rsp)
        movq    %r15, 40(%rsp)
        movq    %r12, 48(%rsp)
        callq   _doubleAndThenSumNumbers

Honestly, I'm not 100% sure why Clang takes this intermediary step. But as we can see above, the last seven arguments do end up on the stack. It's important to note that when calling a function with more than six arguments, rsp should point to the seventh at the moment of the call instruction, with the rest immediately following at higher addresses.

This time around _doubleAndThenSumNumbers is a bit of a doozy at first glance. If we break it down like we did main we'll see it's no more complicated than before.

        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx

First the callee saved registers are being pushed for preservation. This means we should expect them to be used inside this function.

        movq  64(%rbp), %rax
        movq  56(%rbp), %r10
        movq  48(%rbp), %r11
        movq  40(%rbp), %rbx
        movq  32(%rbp), %r14
        movq  24(%rbp), %r15
        movq  16(%rbp), %r12

And they immediately are. The seven arguments passed on the stack are moved into registers (rax, r10, and r11 don't need to be saved).
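The offsets 16 through 64 aren't arbitrary. As a sketch (the helper name and constants are mine), the offset of a stack-passed argument relative to rbp can be worked out from what sits between rbp and the arguments: the caller's saved rbp and the return address pushed by callq.

```c
#include <assert.h>
#include <stddef.h>

enum {
  REGISTER_ARGUMENTS = 6,   /* rdi, rsi, rdx, rcx, r8, r9 */
  SAVED_RBP_BYTES = 8,      /* pushed by the callee's pushq %rbp */
  RETURN_ADDRESS_BYTES = 8  /* pushed by the caller's callq */
};

/*
 * Hypothetical helper: offset from rbp of a stack-passed argument, given
 * its zero-based position (6 is the seventh argument).
 */
size_t stackArgumentOffsetFromRbp(size_t index) {
  assert(index >= REGISTER_ARGUMENTS);
  return SAVED_RBP_BYTES + RETURN_ADDRESS_BYTES + (index - REGISTER_ARGUMENTS) * 8;
}
```

The seventh argument (index 6) comes out at 16(%rbp) and the thirteenth (index 12) at 64(%rbp), matching the loads in the listing.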

        movq    %rdi, -40(%rbp)
        movq    %rsi, -48(%rbp)
        movq    %rdx, -56(%rbp)
        movq    %rcx, -64(%rbp)
        movq    %r8, -72(%rbp)
        movq    %r9, -80(%rbp)

After that the rest of the arguments, the ones in the registers, are moved on the stack.

	movq	-40(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -40(%rbp)
	movq	-48(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -48(%rbp)
	movq	-56(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -56(%rbp)
	movq	-64(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -64(%rbp)
	movq	-72(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -72(%rbp)
	movq	-80(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, -80(%rbp)
	movq	16(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 16(%rbp)
	movq	24(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 24(%rbp)
	movq	32(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 32(%rbp)
	movq	40(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 40(%rbp)
	movq	48(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 48(%rbp)
	movq	56(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 56(%rbp)
	movq	64(%rbp), %rcx
	shlq	$1, %rcx
	movq	%rcx, 64(%rbp)

This part looks way more complicated than it actually is. All it's doing now is going through each of our variables on the stack, moving it into rcx, left shifting it, and then putting it back into its place on the stack.

	movq	-40(%rbp), %rcx
	addq	-48(%rbp), %rcx
	addq	-56(%rbp), %rcx
	addq	-64(%rbp), %rcx
	addq	-72(%rbp), %rcx
	addq	-80(%rbp), %rcx
	addq	16(%rbp), %rcx
	addq	24(%rbp), %rcx
	addq	32(%rbp), %rcx
	addq	40(%rbp), %rcx
	addq	48(%rbp), %rcx
	addq	56(%rbp), %rcx
	addq	64(%rbp), %rcx

Now to complete the useful part of _doubleAndThenSumNumbers the first argument/variable is moved into rcx and then the rest are added to rcx.

	movq	%rax, -88(%rbp)         ## 8-byte Spill
	movq	%rcx, %rax
	movq	%r12, -96(%rbp)         ## 8-byte Spill
	movq	%r10, -104(%rbp)        ## 8-byte Spill
	movq	%r11, -112(%rbp)        ## 8-byte Spill
	movq	%rbx, -120(%rbp)        ## 8-byte Spill
	movq	%r14, -128(%rbp)        ## 8-byte Spill
	movq	%r15, -136(%rbp)        ## 8-byte Spill

In amongst the register spill cleanup rcx is moved to the return register rax. The register spill is yet another artifact of the lack of optimization.

	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	popq	%rbp
	retq

And finally the callee saved registers are popped before returning back to _main.

	xorl	%r13d, %r13d
	movq	%rax, -168(%rbp)        ## 8-byte Spill
	movl	%r13d, %eax
	addq	$184, %rsp

To begin winding things down, the bottom 32 bits of r13 (r13d) are zeroed out; writing to a 32-bit register clears the upper half as well, so all of r13 ends up zero. The return value from _doubleAndThenSumNumbers stored in rax is moved onto the stack where it could be used later if we weren't about to exit. r13d is then moved into the return register eax. And the stack space allocated at the top of main is released.

	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq

Finally, just like inside _doubleAndThenSumNumbers the callee saved registers are restored before returning.

Example 3: Assembly Output (GCC)

	.text
	.globl _doubleAndThenSumNumbers
_doubleAndThenSumNumbers:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	%rdx, -24(%rbp)
	movq	%rcx, -32(%rbp)
	movq	%r8, -40(%rbp)
	movq	%r9, -48(%rbp)
	salq	-8(%rbp)
	salq	-16(%rbp)
	salq	-24(%rbp)
	salq	-32(%rbp)
	salq	-40(%rbp)
	salq	-48(%rbp)
	salq	16(%rbp)
	salq	24(%rbp)
	salq	32(%rbp)
	salq	40(%rbp)
	salq	48(%rbp)
	salq	56(%rbp)
	salq	64(%rbp)
	movq	-8(%rbp), %rdx
	movq	-16(%rbp), %rax
	addq	%rax, %rdx
	movq	-24(%rbp), %rax
	addq	%rax, %rdx
	movq	-32(%rbp), %rax
	addq	%rax, %rdx
	movq	-40(%rbp), %rax
	addq	%rax, %rdx
	movq	-48(%rbp), %rax
	addq	%rax, %rdx
	movq	16(%rbp), %rax
	addq	%rax, %rdx
	movq	24(%rbp), %rax
	addq	%rax, %rdx
	movq	32(%rbp), %rax
	addq	%rax, %rdx
	movq	40(%rbp), %rax
	addq	%rax, %rdx
	movq	48(%rbp), %rax
	addq	%rax, %rdx
	movq	56(%rbp), %rax
	addq	%rax, %rdx
	movq	64(%rbp), %rax
	addq	%rdx, %rax
	popq	%rbp
	ret
	.globl _main
_main:
	pushq	%rbp
	movq	%rsp, %rbp
	addq	$-128, %rsp
	movl	%edi, -116(%rbp)
	movq	%rsi, -128(%rbp)
	movq	$1, -8(%rbp)
	movq	$2, -16(%rbp)
	movq	$3, -24(%rbp)
	movq	$4, -32(%rbp)
	movq	$5, -40(%rbp)
	movq	$6, -48(%rbp)
	movq	$7, -56(%rbp)
	movq	$8, -64(%rbp)
	movq	$9, -72(%rbp)
	movq	$10, -80(%rbp)
	movq	$11, -88(%rbp)
	movq	$12, -96(%rbp)
	movq	$13, -104(%rbp)
	movq	-48(%rbp), %r8
	movq	-40(%rbp), %rdi
	movq	-32(%rbp), %rcx
	movq	-24(%rbp), %rdx
	movq	-16(%rbp), %rsi
	movq	-8(%rbp), %rax
	pushq	-104(%rbp)
	pushq	-96(%rbp)
	pushq	-88(%rbp)
	pushq	-80(%rbp)
	pushq	-72(%rbp)
	pushq	-64(%rbp)
	pushq	-56(%rbp)
	movq	%r8, %r9
	movq	%rdi, %r8
	movq	%rax, %rdi
	call	_doubleAndThenSumNumbers
	addq	$56, %rsp
	movl	$0, %eax
	leave
	ret
	.ident	"GCC: (Homebrew GCC 9.2.0) 9.2.0"
	.subsections_via_symbols

GCC's output is short and straightforward with only a few bits whose purpose isn't immediately obvious.

        addq    $-128, %rsp
        movl    %edi, -116(%rbp)
        movq    %rsi, -128(%rbp)

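GCC begins by allocating 128 bytes of space on the stack (adding -128 to rsp is the same as subtracting 128, and -128 fits in a one-byte immediate where 128 would not) before preserving the values of edi and rsi, which hold argc and argv.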

	movq	$1, -8(%rbp)
	movq	$2, -16(%rbp)
	movq	$3, -24(%rbp)
	movq	$4, -32(%rbp)
	movq	$5, -40(%rbp)
	movq	$6, -48(%rbp)
	movq	$7, -56(%rbp)
	movq	$8, -64(%rbp)
	movq	$9, -72(%rbp)
	movq	$10, -80(%rbp)
	movq	$11, -88(%rbp)
	movq	$12, -96(%rbp)
	movq	$13, -104(%rbp)

Right away all of our integer values are moved onto the stack.

	movq	-48(%rbp), %r8
	movq	-40(%rbp), %rdi
	movq	-32(%rbp), %rcx
	movq	-24(%rbp), %rdx
	movq	-16(%rbp), %rsi
	movq	-8(%rbp), %rax

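And then the first six are moved into registers, though not all of them are the final ones yet: rsi, rdx, and rcx are already correct, while the others get shuffled into place just before the call.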

	pushq	-104(%rbp)
	pushq	-96(%rbp)
	pushq	-88(%rbp)
	pushq	-80(%rbp)
	pushq	-72(%rbp)
	pushq	-64(%rbp)
	pushq	-56(%rbp)

The remaining seven are actually pushed onto the stack (as opposed to being moved).
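Notice that the pushes start with the thirteenth argument and end with the seventh. The toy model below sketches why: the stack grows downward, so whatever is pushed last ends up at the lowest address, which is where rsp points. The ToyStack type and its functions are made up for this illustration, and the values 7 through 13 are simply the ones from the example program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STACK_SLOTS = 16 };

/* A toy descending stack: sp is the index of the most recently pushed slot. */
typedef struct {
  uint64_t slots[STACK_SLOTS];
  size_t sp;
} ToyStack;

void toyInit(ToyStack *stack) {
  stack->sp = STACK_SLOTS; /* empty: sp is one past the highest slot */
}

/* Like pushq: move sp down by one slot, then store the value there. */
void toyPush(ToyStack *stack, uint64_t value) {
  assert(stack->sp > 0);
  stack->sp -= 1;
  stack->slots[stack->sp] = value;
}

/* Push the values 13 down to 7 the way GCC does: last argument first. */
void pushStackArguments(ToyStack *stack) {
  for (uint64_t value = 13; value >= 7; value--) {
    toyPush(stack, value);
  }
}
```

After pushStackArguments runs, slots[sp] holds 7 and slots[sp + 6] holds 13, which is the layout the callee expects. In the real listing the addq $56, %rsp after the call discards those seven 8-byte slots again.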

	movq	%r8, %r9
	movq	%rdi, %r8
	movq	%rax, %rdi
	call	_doubleAndThenSumNumbers

Next there's a little bit of shuffling of the variables to put them in the right order. Above we saw that r9 wasn't used and rax was used instead of rdi. This section of code corrects that before calling _doubleAndThenSumNumbers.

	movq	%rdi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	%rdx, -24(%rbp)
	movq	%rcx, -32(%rbp)
	movq	%r8, -40(%rbp)
	movq	%r9, -48(%rbp)

Inside _doubleAndThenSumNumbers the arguments passed by registers are first moved onto the stack.

	salq	-8(%rbp)
	salq	-16(%rbp)
	salq	-24(%rbp)
	salq	-32(%rbp)
	salq	-40(%rbp)
	salq	-48(%rbp)
	salq	16(%rbp)
	salq	24(%rbp)
	salq	32(%rbp)
	salq	40(%rbp)
	salq	48(%rbp)
	salq	56(%rbp)
	salq	64(%rbp)

All of the arguments are shifted left by one bit.

	movq	-8(%rbp), %rdx
	movq	-16(%rbp), %rax
	addq	%rax, %rdx
	movq	-24(%rbp), %rax
	addq	%rax, %rdx
	movq	-32(%rbp), %rax
	addq	%rax, %rdx
	movq	-40(%rbp), %rax
	addq	%rax, %rdx
	movq	-48(%rbp), %rax
	addq	%rax, %rdx
	movq	16(%rbp), %rax
	addq	%rax, %rdx
	movq	24(%rbp), %rax
	addq	%rax, %rdx
	movq	32(%rbp), %rax
	addq	%rax, %rdx
	movq	40(%rbp), %rax
	addq	%rax, %rdx
	movq	48(%rbp), %rax
	addq	%rax, %rdx
	movq	56(%rbp), %rax
	addq	%rax, %rdx
	movq	64(%rbp), %rax
	addq	%rdx, %rax

More straightforward but suboptimal code. The first doubled number is moved into rdx. After that each doubled number is moved into rax and then added to rdx. The last one is handled the other way around: it is moved into rax and rdx is added to it, which leaves the total in the return register rax.

Differences between Clang and GCC

Perhaps the largest difference between Clang and GCC in this example is that GCC does the left shift on the variables while they're on the stack. Obviously the location of the operand for salq doesn't matter as far as the calculation goes, but it can be as much as four times slower than a combined mov from the stack to a register and salq using the register's contents. I wrote a short program to test the performance of both methods and found that doing the left shift directly on a variable on the stack was about 1.5 times slower. It's likely the CPU is being clever about how it actually executes the instructions, so the difference in a tight loop isn't as great as it could otherwise be.

With optimizations turned on both Clang and GCC produce nearly identical code that makes clever use of leaq. I think it's interesting that Clang appears to give some minimal amount of consideration to performance while GCC opts for somewhat more straightforward output.

Conclusions

Hopefully this post has shed some light on what is happening when you call a function (in C anyway) on a modern computer. After going through the exercise of writing some simple programs and inspecting the assembly output produced by a couple of compilers it's hard for me to even remember exactly what my own initial confusion was regarding the System V ABI. Sometimes seeing something in action in a "real" scenario brings a clarity that reading a description in a spec can't.

Related Reading

The following links are highly recommended for an in depth understanding of the System V ABI.

Supplemental Reading

These links aren't relevant to the subject of this post but are nonetheless very helpful if you're not intimately familiar with assembly on UNIX-like systems.