264 Week 13 notes

Comp 264-002, Spring 2019, MWF, 11:30-12:20, Cuneo 218

Readings (from BOH3)

Chapter 3:

Section 3.1
Section 3.2
Section 3.3
Section 3.4
Section 3.5

Chapter 7: Linking

1. What the gcc command really does: drives preprocessing, compiling, assembling and linking

    cpp foo.c > foo.i
    gcc -c foo.c        // creates foo.o
    gcc -Og -S bar.c        // creates bar.s
    gcc foo.o bar.s

gcc -v

cpp/cc1/as/ld

On many Macs, the gcc command actually fires up the clang compiler. That is not the gcc compiler! Clang is the Xcode compiler, supporting C and also Objective-C.

2. Static linking (ld)

symbol resolution
relocation

Example 1

Look at two objdumps: main.od and sum.od (quite different from objdumps of elfs)

main.c:
#define MAX 1000

extern int R[MAX];

long summer(int * A);
long x;

void main() {
    int i;
    for (i=0; i<MAX; i++) R[i] = rand();
    long sum = summer(R);
    printf("sum is %ld\n", sum);
}

sum.c:
#define MAX 1000

int R[MAX];

long summer() {
    int i;
    long sum = 0;
    for (i=0; i<MAX; i++) sum += R[i];
    return sum;
}

How do address references get resolved?

jumps within a function: handled as a relative offset from %rip (IBM 360:
calls to other procedures: address has to be worked out, entered into code before running
references to stack data: relative to %rsp
references to global data: address is often worked out by ld

In the shellcode we had to make some extra effort to get the data references correct.

The assembler code contains copious relocation directives.

Three object files:

relocatable object file (.o file): the output of the compiler/assembler
shared object file (.so/.dll file): a dynamic library
executable object file: some people just call these "executable" or ".exe" or "elf" files
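
One rough way to tell the three kinds apart is the e_type field of the ELF header. The sketch below is an illustration only: the function name elf_kind is made up here, and it assumes a little-endian ELF file whose first 18 bytes have already been read into a buffer. The magic bytes, the offset of e_type (16) and the values 1, 2, 3 are from the ELF specification.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: classify an ELF file from the first bytes of its header.
   Assumes little-endian encoding of the 2-byte e_type field at offset 16. */
const char *elf_kind(const unsigned char *hdr, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (len < 18 || memcmp(hdr, magic, 4) != 0)
        return "not an ELF file";

    unsigned type = hdr[16] | (hdr[17] << 8);   /* e_type */
    switch (type) {
    case 1:  return "relocatable";     /* ET_REL: a .o file */
    case 2:  return "executable";      /* ET_EXEC */
    case 3:  return "shared object";   /* ET_DYN: .so files, and also PIE executables */
    default: return "other";
    }
}
```

readelf -h reports the same field on its "Type:" line.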

Basic elf sections:

header
.text: code
.rodata: read-only data, such as string constants
.data: initialized data
.bss: uninitialized data. "Block Started by Symbol", but maybe also Better Save Space
.symtab: table of information about globally visible data and functions. objdump -t matrix.o, objdump -t main.o
.rel.text: list of unresolved addresses in the .text section (data and jumps to other sections). objdump -R sum/a.out
.rel.data: relocation info for global data referenced or defined in this module
.debug: created by gcc -g; contains debug info like the source code
.line: source-code line-number info
.strtab: string table holding the names used by .symtab and the section headers

and more.

Example 2

objdump -h matrix.o

objdump -h matrix

Three kinds of symbols, from the perspective of a current module m:

global symbols defined within m
global symbols defined in some other module, but referenced by m
local symbols
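
Here is a small sketch showing one symbol of each kind in a single module. The names total, add_to_total and count_chars are invented for illustration; strlen stands in for "a global symbol defined in some other module" (here, the C library).

```c
#include <string.h>   /* declares strlen, which is defined in another module (libc) */

int total = 0;                       /* global symbol defined in this module */

static int add_to_total(int n)       /* local symbol: static hides it from other modules */
{
    total += n;
    return total;
}

int count_chars(const char *s)       /* global symbol defined in this module */
{
    /* strlen is a global symbol referenced here but defined elsewhere;
       the linker resolves it against the C library. */
    return add_to_total((int)strlen(s));
}
```

If this is compiled with gcc -c and without optimization, objdump -t on the .o file should mark total and count_chars as global (g), add_to_total as local (l), and strlen as undefined (*UND*).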

In C, all function names are global symbols, unless the function is declared static.

Example: sum. In main.c, I declared summer(), but I could leave this out and it would still compile.

In C, to use an external variable, you must declare it with extern. And a type; see foobar/main.c:

extern int which;

Why can you get away without declarations of functions?

Coordinating

#define MAX 100

with .h files
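
One way to keep MAX consistent is to put it, together with the shared declarations, in a single header that both main.c and sum.c include. Below is a sketch; the header name sum.h is an assumption, and the header and sum.c are shown together as one compilable unit, with comments marking where each file would start.

```c
/* ---- sum.h (hypothetical header, included by both .c files) ---- */
#ifndef SUM_H
#define SUM_H

#define MAX 1000          /* the one and only definition of MAX */

extern int R[MAX];        /* declaration only; sum.c owns the storage */
long summer(void);

#endif /* SUM_H */

/* ---- sum.c would begin with: #include "sum.h" ---- */
int R[MAX];

long summer(void)
{
    long sum = 0;
    for (int i = 0; i < MAX; i++)
        sum += R[i];
    return sum;
}
```

With this arrangement, changing MAX in one place changes it for every module after a recompile, so main.c and sum.c cannot disagree about the size of R.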

static variables:

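#include <stdio.h>

int f() {
    static int x = 0;
    return ++x;
}

int g() {
    static int x = 0;
    return ++x;
}

void main() {
    f(); f(); f(); f();
    g(); g();
    printf("%d %d\n", f(), g());
}

This prints 5 3: the two variables named x are separate, and each keeps its value between calls.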
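
A rough picture of what the compiler does with the two static locals: each becomes its own file-scope object with a unique local name. The names f_x and g_x below are made up for the sketch; BOH3 says the compiler might export names like x.1 and x.2 to the assembler.

```c
/* Each static local is lowered to a private, file-scope variable.
   The names are hypothetical stand-ins for compiler-generated ones. */
static int f_x = 0;
static int g_x = 0;

int f(void) { return ++f_x; }   /* counts calls to f only */
int g(void) { return ++g_x; }   /* counts calls to g only */
```

Because the two counters are distinct objects in .data or .bss rather than on the stack, they survive between calls and never interfere with each other.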

COMMON vs .bss: the former are uninitialized global variables. .bss is uninitialized static vars, and global vars initialized to 0.
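
The following sketch puts one variable of each kind in a file so the placement can be checked with readelf or objdump -t. The variable names are invented. One caveat: gcc 10 and later default to -fno-common, which places uninit_global in .bss as well; compiling with -fcommon should restore the COMMON behavior described here.

```c
int uninit_global;            /* uninitialized global: COMMON (with -fcommon) */
static int uninit_static;     /* uninitialized static: .bss */
int zero_global = 0;          /* global initialized to 0: .bss */
int init_global = 264;        /* initialized to nonzero: .data */

/* All three zero-valued variables read as 0 at startup, wherever they are placed. */
int sum_of_zeroed(void)
{
    return uninit_global + uninit_static + zero_global;
}
```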

readelf -a sum/main.o

Example 3

foobar with and without initialization

main.c:
#include <stdio.h>
#include <stdlib.h>
extern int which;
void foo();
void bar();
void main(){
    foo();
    bar();
    printf("the value is %d\n", which);
}

foo.c:
int which = 0;
void foo() { which =37; }

bar.c:
int which = 0;
void bar() { which =61; }

strong symbols: functions and initialized global variables
weak symbols: uninitialized global variables

Multiple strong symbols with the same name are not allowed.
If x is strong in one place and weak in others, use the strong instance.
If x is weak in two or more places, the linker picks one of them arbitrarily.

mangling and C++

(and C99 via gcc -std=c99; ok maybe not)

Foo::bar(int, long) is encoded as bar__3Fooil

compile c99.c with g++ instead of gcc. Then look at an objdump -ad of the resulting executable.

Example 4: doubleint

with gcc -Wall -Og -o foobar foo.c bar.c, we get a warning, but very bad things are happening.
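
The sketch below simulates the damage inside one file. It assumes the doubleint example follows the pattern in BOH3 section 7.6: one module defines int x and int y, while another declares double x and assigns to it. The struct is a stand-in for two 4-byte globals that the linker placed side by side; the names globals and store_x_as_double are invented.

```c
#include <string.h>

/* Stand-in for two 4-byte globals that the linker happened to place side by side. */
struct globals {
    int x;
    int y;
};

_Static_assert(sizeof(struct globals) == sizeof(double),
               "sketch assumes 4-byte int and 8-byte double");

/* What the other module effectively does when it believes x is a double:
   it stores 8 bytes starting at the address of x (x is the first member). */
void store_x_as_double(struct globals *g, double value)
{
    memcpy(g, &value, sizeof value);
}
```

Starting from x = 15213 and y = 15212, storing -0.0 on a little-endian x86-64 machine leaves x equal to 0 and y equal to 0x80000000: the 8-byte store silently overwrites y as well.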

What do you think -Wall does?

Linking to static libraries

Where is printf?

The linker copies to the executable the parts of the static library that are needed. Not the whole thing. This solves several problems. Static libraries end in .a: libc.a

If we had libc.o, then the entire thing would be linked in. We could put each function in its own .o file: printf.o, getchar.o, rand.o. But that is tedious.

gcc -c addvec.c mulvec.c // creates addvec.o, mulvec.o

ar rcs libvector.a addvec.o mulvec.o

Static linking with libraries via gcc is order-dependent! The linker scans its arguments from left to right, so a library must be listed after the object files that reference its symbols.

Executables

The linker assumes memory will start at 0x40 0000, and assigns addresses based on that. Multiple .data sections are combined into one; ditto multiple .bss and .text sections.

Relocation types:

R_X86_64_PC32: used for 32-bit addressing relative to %rip. That is, the signed offset from %rip must fit in 32 bits. The resulting address is still a full 64-bit address.

R_X86_64_32: absolute 32-bit addresses
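
The arithmetic behind the two relocation types can be sketched as follows. The function names are made up for illustration; the formulas follow the relocation algorithm described in BOH3 section 7.7.

```c
#include <stdint.h>

/* R_X86_64_PC32: value = symbol address + addend - address of the 4-byte field. */
uint32_t reloc_pc32(uint64_t sym_addr, int64_t addend, uint64_t field_addr)
{
    return (uint32_t)(sym_addr + (uint64_t)addend - field_addr);
}

/* R_X86_64_32: value = symbol address + addend, which must fit in 32 bits. */
uint32_t reloc_abs32(uint64_t sym_addr, int64_t addend)
{
    return (uint32_t)(sym_addr + (uint64_t)addend);
}
```

The same arithmetic explains the bytes of the callq at 0x62c in the lib2 disassembly later in these notes. The 4-byte field starts at 0x62d, the target is 0x510, and the addend is -4 because %rip points just past the field when the call executes: reloc_pc32(0x510, -4, 0x62d) is 0xfffffedf, which is stored little-endian as df fe ff ff.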

Today, with address-space layout randomization, executables are typically relocated at run time to a random location, i.e. not 0x40 0000.

Position-independent code

This is generated with the -fpic flag to gcc. It means that the code can be loaded anywhere, without the need for relocation.

Jumps and calls within a module are done relative to %rip, so these are not an issue. Neither is stack-based data. But what about global data, or calls from one module to another (that is, where do we find the address of a function in a different module)?

The trick is to observe that the offset between the start of the data section and the start of the text section is known at link time. BOH3 puts it this way:

No matter where we load an object module in memory, the data segment is always the same distance from the code segment

The BOH3 rule might be a little misleading: the data-to-text offset can be different for different linked libraries, with different sizes of the data and text segments. But for any one linked library, the distance is fixed and the data segment for that library will always be loaded at the same offset relative to the text segment.

At the start of the data segment we will place the global offset table (or GOT), which is a table of pointers to code and global data. If one of those global data items is myarray, a pointer to myarray might be stored in slot 5 of the GOT. Then the code, to access myarray, loads the pointer to the GOT, and then loads the address at GOT[5]. That is the pointer to myarray.

In order for this to work, code must be generated so that accesses are done relative to the GOT. The actual distance to the GOT can be supplied later; that is a single "relocation".

The other thing to note is that multiple processes are using the shared library, and each process's GOT is supposed to be at the same offset relative to the library code. How can that work? The answer here is virtual memory. The processes share the same physical pages that contain the code, but each has its own pages for the GOT data. That way, each process can have its own data, and yet share the code.

Use of the GOT table does mean an extra level of indirection: to access a variable, we have to load the address of the variable from the GOT, and only then can we load the variable itself. This does result in a modest performance hit.
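
The double load can be mimicked in plain C. This is only a model: the array got and the function fill_got are stand-ins invented here for the real GOT and for the dynamic linker that fills it in. Slot 5 is the one used in the myarray example above.

```c
static int myarray[4] = { 10, 20, 30, 40 };

/* Stand-in for the GOT: a table of pointers, filled in at load time. */
static void *got[8];

/* Plays the role of the dynamic linker, which knows where myarray ended up. */
void fill_got(void)
{
    got[5] = myarray;
}

/* PIC-style access: first load the pointer from GOT[5], then load through it. */
int read_myarray(int i)
{
    int *p = got[5];      /* first memory access: fetch the address */
    return p[i];          /* second memory access: fetch the data */
}
```

A non-PIC access would skip the first load, because the address of myarray would already be encoded in the instruction.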

The code for the GOT-mediated access is generated when we compile with the -fpic flag (pic = position-independent code).

Note that the GOT table would not be necessary for accessing variables defined in the same module: we could, in principle, access them the same way we access any other data. The offset is known at compile time; no relocation is needed. However, the GOT table is essential to access data defined in a different module. In fact, the GOT is used for pretty much all data accesses.

Example 5

(from eli.thegreenplace.net/2011/11/11/position-independent-code-pic-in-shared-libraries-on-x64)

// to be compiled with:
// gcc -c -fpic lib.c
// gcc -shared -o lib.so lib.o
//
int myglob = 42;

int ml_func(int a, int b)
{
return myglob + a + b;
}

If we compile this, the disassembled code we get for ml_func is this:

00000000000005aa <ml_func>:
5aa:    55                       push   %rbp
5ab:    48 89 e5                 mov    %rsp,%rbp
5ae:    89 7d fc                 mov    %edi,-0x4(%rbp)
5b1:    89 75 f8                 mov    %esi,-0x8(%rbp)
5b4:    48 8b 05 25 0a 20 00     mov    0x200a25(%rip),%rax        # 200fe0 <myglob-0x40>
5bb:    8b 10                    mov    (%rax),%edx
5bd:    8b 45 fc                 mov    -0x4(%rbp),%eax
5c0:    01 c2                    add    %eax,%edx
5c2:    8b 45 f8                 mov    -0x8(%rbp),%eax
5c5:    01 d0                    add    %edx,%eax
5c7:    5d                       pop    %rbp
5c8:    c3                       retq

At the line labeled 5b4, we see a load from 0x200a25(%rip) into %rax. This should mean %rax is now pointing to myglob. The next instructions confirm it: we load (%rax) into %edx, and then add the quantities that had been in %edi and %esi, namely a and b (they were saved on the stack in the third and fourth lines).

0x200a25 should be the offset to the GOT entry for myglob. But it's an offset relative to %rip, which, at the time the instruction executes, holds the address of the next instruction, 0x5bb. That puts the GOT entry at address 0x200a25 + 0x5bb = 0x200fe0.
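
The address calculation above can be written out as a small function. The name rip_relative_target is invented for this sketch; the inputs are the instruction's address, its length in bytes, and the signed 32-bit displacement.

```c
#include <stdint.h>

/* Address referenced by disp(%rip): %rip holds the address of the NEXT
   instruction, i.e. this instruction's address plus its length in bytes. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

The mov at 0x5b4 is 7 bytes long (48 8b 05 25 0a 20 00), so rip_relative_target(0x5b4, 7, 0x200a25) gives 0x200fe0, matching the address objdump prints in the comment on that line.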

Now let's call readelf -S on lib.so. We get this line for the address of the start of GOT:

[16] .got PROGBITS 0000000000200fd8 00000fd8

The start of the GOT is 8 bytes lower. GOT entries on x86-64 are 8 bytes each, making 0x200fe0 the second entry, GOT[1].

Trampolines and the Procedure Linkage Table

(Yes, they're called trampolines, because the flow of control jumps in and then jumps out. Sort of.)

It turns out that, while the same process could be used to look up the address of procedures in other modules, this is not how procedure lookups are done. Much of the issue is that, when an entire shared library is mapped into memory, there are lots of functions, and there is a reasonable chance we are only going to call a few of them. So there's a scheme to resolve function references only as needed.

Here's a code example:

int myglob = 42;

int ml_util_func(int a)
{
    return a + 1;
}

int ml_func(int a, int b)
{
    int c = b + ml_util_func(a);
    myglob += c;
    return b + myglob;
}

We save it in lib2.c and compile it as with lib.c above. If we disassemble lib2.so, we get

0000000000000619 <ml_func>:
 619:	55                   	push   %rbp
 61a:	48 89 e5             	mov    %rsp,%rbp
 61d:	48 83 ec 20          	sub    $0x20,%rsp
 621:	89 7d ec             	mov    %edi,-0x14(%rbp)
 624:	89 75 e8             	mov    %esi,-0x18(%rbp)
 627:	8b 45 ec             	mov    -0x14(%rbp),%eax
 62a:	89 c7                	mov    %eax,%edi
 62c:	e8 df fe ff ff       	callq  510 <ml_util_func@plt>

This looks like a normal function call. But let's look at the code for ml_util_func@plt (which is different from ml_util_func):

0000000000000510 <ml_util_func@plt>:
510:    ff 25 02 0b 20 00        jmpq   *0x200b02(%rip)        # 201018 <ml_util_func+0x200a0e>
516:    68 00 00 00 00           pushq $0x0
51b:    e9 e0 ff ff ff           jmpq   500 <.plt>

The first jmpq is an indirect jump through the pointer stored at 0x200b02 + 0x516, which is 0x201018. If we look with readelf -S, we find:

[18] .got.plt PROGBITS 0000000000201000 00001000

Using readelf -r, we find this:

Relocation section '.rela.plt' at offset 0x4c8 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
000000201018 000600000007 R_X86_64_JUMP_SLO 000000000000060a ml_util_func + 0

So, at 0x201018 there is a pointer to the actual function.

Why so roundabout? Here is the strategy:

1. The original call is to the location ml_util_func@plt, which is at a known location PLT[n]

2. That location, PLT[n], contains an instruction to branch to the address stored in GOT[n1].

3. Initially, the location GOT[n1] contains a pointer to code to run the resolver. When run, this code will:

find the real address of ml_util_func()
place that address in GOT[n1]
call the real ml_util_func()

4. The next time we call ml_util_func@plt, we again encounter the instruction to branch to the address stored in GOT[n1]. This time, though, that address is the address of the real ml_util_func(), so it is called with a total of one extra memory-address load.
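
The four steps above can be modeled with an ordinary function pointer. This is a simulation, not how the dynamic linker is actually written: got_slot plays the role of GOT[n1], ml_util_func_plt plays the role of PLT[n], and resolver_stub and resolve_count are names invented for the sketch.

```c
typedef int (*func_t)(int);

static int real_ml_util_func(int a) { return a + 1; }

static int resolver_stub(int a);

/* Step 3: initially the GOT slot points at the resolver, not the real function. */
static func_t got_slot = resolver_stub;
static int resolve_count = 0;

static int resolver_stub(int a)
{
    resolve_count++;                 /* count how often resolution happens */
    got_slot = real_ml_util_func;    /* place the real address in the GOT slot */
    return got_slot(a);              /* then run the real function */
}

/* Steps 1 and 2: the PLT entry just branches to whatever the GOT slot holds. */
int ml_util_func_plt(int a)
{
    return got_slot(a);
}

int times_resolved(void) { return resolve_count; }
```

Calling ml_util_func_plt twice runs the resolver only on the first call; the second call goes straight to the real function through the patched slot.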

Why doesn't the resolver do all this ahead of time? Because a typical shared library has thousands of functions, but only a few might be called. So this strategy, sometimes called lazy resolution, in most cases saves considerable time.

Why doesn't the resolver put the address of ml_util_func() into PLT[n] instead of GOT[n1], saving one memory fetch? Because the PLT is in a page (or set of pages) of virtual memory that is marked read-only. Generally, pages that contain executable code are marked read-only, for security if nothing else. But the GOT is on a data page, which is marked writable but also non-executable. We load an address from GOT[n1], but no instruction is ever executed from there.