The ELF Thread Local Storage ABI (TLS) is a storage model for variables that
allows each thread to have a unique copy of a global variable. This model
is used to implement C++'s thread_local storage model. On thread creation the
variable is given its initial value from the initial TLS image. TLS
variables are useful, for instance, as buffers in thread-safe code or for
per-thread bookkeeping. C-style error state such as errno or dlerror can also
be handled this way.
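As a minimal sketch of the per-thread copy behavior described above, the following uses a thread_local counter. The names counter, bump, and run_on_threads are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Every thread gets its own copy of `counter`, starting from the initial value 0.
thread_local int counter = 0;

// Hypothetical helper: increments the calling thread's copy n times.
int bump(int n) {
  for (int i = 0; i < n; ++i) ++counter;
  return counter;
}

// Hypothetical helper: runs bump(n) on `threads` new threads and returns each
// thread's final value of its own `counter`.
std::vector<int> run_on_threads(int threads, int n) {
  assert(threads >= 0);
  std::vector<int> results(threads);
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&results, t, n] { results[t] = bump(n); });
  }
  for (auto& th : pool) th.join();
  return results;
}
```

Each worker thread should observe only its own increments, and the calling thread's copy of counter should remain untouched.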
TLS variables are much like any other global/static variable. In the
implementation, their initial data ends up in the PT_TLS segment. The PT_TLS
segment is inside a read-only PT_LOAD segment even though TLS variables are
writable. For each thread, this segment is copied into the process at a unique
writable location. The location the PT_TLS segment is copied to is influenced
by the segment's alignment to ensure that the alignment of TLS variables is
respected.
ABI
The interface that the compiler, linker, and dynamic linker must adhere to is quite simple, even though the details of the implementation are more complex. The compiler and the linker must emit code and dynamic relocations that use one of the 4 access models (described in a following section). The dynamic linker and thread implementation must then set everything up so that this actually works. Different architectures have different ABIs, but they're similar enough in broad strokes that we can speak about most of them as if there were just one ABI. This document assumes that either x86-64 or AArch64 is being used and points out differences when they occur.
The TLS ABI makes use of a few terms:
- Thread Pointer: This is a unique address in each thread, generally stored in a register. Thread local variables lie at offsets from the thread pointer. Thread Pointer will be abbreviated and used as $tp in this document. $tp is what __builtin_thread_pointer() returns on AArch64. On AArch64 $tp is given by a special register named TPIDR_EL0 that can be accessed using mrs <reg>, TPIDR_EL0. On x86_64 the fs.base segment base is used; it can be accessed with %fs: and can be loaded from %fs:0 or with the rdfsbase instruction.
- TLS Segment: This is the image of data in each module and is specified by the PT_TLS program header in each module. Not every module has a PT_TLS program header and thus not every module has a TLS segment. Each module has at most one TLS segment and correspondingly at most one PT_TLS program header.
- Static TLS set: This is the sum total of modules that are known to the dynamic linker at program startup time. It consists of the main executable and every library transitively mentioned by DT_NEEDED. Modules that require being in the Static TLS set have DF_STATIC_TLS set on the DT_FLAGS entry in their dynamic table (given by the PT_DYNAMIC segment).
- TLS Region: This is a contiguous region of memory unique to each thread. $tp will point to some point in this region. It contains the TLS segment of every module in the Static TLS set as well as some implementation-private data, which is sometimes called the TCB (Thread Control Block). On AArch64 a 16-byte reserved space starting at $tp is also sometimes called the TCB. We will refer to this space as the "ABI TCB" in this doc.
- TLS Block: This is an individual thread's copy of a TLS segment. There is one TLS block per TLS segment per thread.
- Module ID: The module ID is not statically known, except for the main executable's module ID, which is always 1. Other modules' IDs are chosen by the dynamic linker. It's just a unique non-zero ID for each module. In theory it could be any non-zero 64-bit value that is unique to the module, such as a hash. In practice it's just a simple counter that the dynamic linker maintains.
- The main executable: This is the module that contains the start address. It is treated in a special way in one of the access models. It always has a Module ID of 1. This is the only module that can use fixed offsets from $tp via the Local Exec model described below.
To comply with the ABI all access models must be supported.
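The thread pointer described in the terms above can be read directly with the instructions the list mentions. The following is only a sketch: it assumes an ELF target such as Linux or Fuchsia on AArch64 or x86_64, and the names read_thread_pointer, probe, and probe_offset_from_tp are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Reads $tp with the instruction named for each architecture.
// Assumption: an ELF system where TPIDR_EL0 / %fs:0 hold the thread pointer.
inline std::uintptr_t read_thread_pointer() {
  std::uintptr_t tp;
#if defined(__aarch64__)
  __asm__ volatile("mrs %0, tpidr_el0" : "=r"(tp));
#elif defined(__x86_64__)
  __asm__ volatile("mov %%fs:0, %0" : "=r"(tp));
#else
#error "this sketch only covers AArch64 and x86_64"
#endif
  return tp;
}

// A thread-local variable used only as a probe for this illustration.
thread_local int probe = 0;

// Distance of the calling thread's `probe` from that thread's $tp.
std::intptr_t probe_offset_from_tp() {
  const std::uintptr_t tp = read_thread_pointer();
  assert(tp == read_thread_pointer());  // $tp is stable within one thread
  return reinterpret_cast<std::intptr_t>(&probe) - static_cast<std::intptr_t>(tp);
}
```

Calling probe_offset_from_tp() repeatedly on one thread should keep returning the same distance, which is the property the access models below rely on.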
Access Models
There are 4 access models specified by the ABI:
global-dynamic
local-dynamic
initial-exec
local-exec
These are the values that can be used for -ftls-model=... and __attribute__((tls_model("..."))).
Which model is used relates to:
- Which module is performing the access:
  - The main executable
  - A module in the static TLS set
  - A module that was loaded after startup, e.g. by dlopen
- Which module the variable being accessed is defined in:
  - Within the same module (i.e. local-*)
  - In a different module (i.e. global-*)
In summary:
- global-dynamic: Can be used from anywhere, for any variable.
- local-dynamic: Can be used by any module, for any variable defined in that same module.
- initial-exec: Can be used by any module for any variable defined in the static TLS set.
- local-exec: Can be used by the main executable for variables defined in the main executable.
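As a small illustration of selecting a model per variable, the sketch below uses the tls_model attribute mentioned above, which GCC and Clang accept. The variable and function names are hypothetical, and the choice of initial-exec here is only an example.

```cpp
#include <cassert>

// GCC and Clang extension: request a specific access model for one variable.
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

// No attribute: the model comes from -ftls-model=... or the compiler default.
thread_local int default_counter = 0;

// Hypothetical helper that touches both variables on the calling thread.
int touch_both() {
  ++ie_counter;
  ++default_counter;
  assert(ie_counter == default_counter);
  return ie_counter + default_counter;
}
```

The attribute only changes how the access is compiled; the observable behavior of the two variables is the same.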
Global Dynamic
Global dynamic is the most general access format. It is also the slowest.
Any thread-local global variable should be accessible with this method. This
access model must be used if a dynamic library accesses a symbol defined in
another module (see the exception in the section on Initial Exec). Symbols
defined within the executable need not use this access model, and the main
executable can also avoid using it. This is the default access model when
compiling with -fPIC, as is the norm for shared libraries.
This access model works by calling a function defined in the dynamic linker.
There are two ways the function might be called: via TLSDESC or via
__tls_get_addr.
In the case of __tls_get_addr, it is passed the pair of GOT entries
associated with this symbol. Specifically, it is passed a pointer to the first
entry, and the second entry comes right after it. For a given symbol S, the
first entry, denoted GOT_S[0], must contain the Module ID of the module in which
S was defined. The second entry, denoted GOT_S[1], must contain the offset into
the TLS block, which is the same as the offset of the symbol in the PT_TLS
segment of the associated module. The pointer to S is then computed using
__tls_get_addr(GOT_S). The implementation of __tls_get_addr is discussed later.
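The GOT pair lookup can be modeled with a toy data structure. This is a sketch under simplifying assumptions, not the dynamic linker's code: TlsIndex, ToyThread, and toy_tls_get_addr are hypothetical names, and each thread's TLS blocks are plain byte vectors indexed by module ID.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the GOT pair {GOT_S[0], GOT_S[1]}.
struct TlsIndex {
  std::size_t module_id;  // GOT_S[0]: ID of the module that defines S
  std::size_t offset;     // GOT_S[1]: offset of S in that module's TLS block
};

// Hypothetical per-thread state: blocks[i] is this thread's TLS block for
// module ID i. Index 0 is unused because module IDs are non-zero.
struct ToyThread {
  std::vector<std::vector<unsigned char>> blocks;
};

// Toy lookup: base of the module's block for this thread plus the offset of S.
void* toy_tls_get_addr(ToyThread& self, const TlsIndex& got) {
  assert(got.module_id != 0 && got.module_id < self.blocks.size());
  std::vector<unsigned char>& block = self.blocks[got.module_id];
  assert(got.offset < block.size());
  return block.data() + got.offset;
}
```

The same TlsIndex yields a different address for each ToyThread, which mirrors how one GOT pair serves every thread.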
TLSDESC is an alternative ABI for global-dynamic access (and local-dynamic)
in which a different pair of GOT slots is used. The first GOT slot
contains a function pointer. The second contains some dynamic linker defined
auxiliary data. This gives the dynamic linker a choice over which function is
called depending on circumstance.
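The idea of a function pointer plus auxiliary data can be sketched as follows. Everything here is hypothetical and heavily simplified: the struct layout, the resolver signatures, and what each resolver returns are assumptions for illustration and do not reproduce the real TLSDESC calling convention.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a TLSDESC GOT pair.
struct ToyTlsDesc {
  std::uintptr_t (*resolver)(const ToyTlsDesc*, std::uintptr_t tp);  // first slot
  std::uintptr_t arg;  // second slot: data chosen by the toy dynamic linker
};

// Resolver a toy dynamic linker might install when the variable sits at a
// fixed offset from the thread pointer: arg is that offset.
std::uintptr_t resolve_static(const ToyTlsDesc* desc, std::uintptr_t tp) {
  return tp + desc->arg;
}

// Resolver for a case that needs an indirection: arg is a pointer to a slot
// that already holds the resolved address (toy simplification).
std::uintptr_t resolve_indirect(const ToyTlsDesc* desc, std::uintptr_t) {
  return *reinterpret_cast<const std::uintptr_t*>(desc->arg);
}

// Caller side: always the same call through the first slot.
std::uintptr_t toy_tlsdesc_access(const ToyTlsDesc& desc, std::uintptr_t tp) {
  assert(desc.resolver != nullptr);
  return desc.resolver(&desc, tp);
}
```

The calling code never changes; only the contents of the two slots decide which path runs.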
In both cases the calls to these functions must be implemented by a specific
code sequence and a specific set of relocs. This allows the linker to recognize
these accesses and potentially relax them to the local-dynamic access model.
(NOTE: The following paragraph contains details about how the compiler upholds its end of the ABI. Skip this paragraph if you don't care about that.)
For the compiler to emit code for this access model, it needs to emit a call
to __tls_get_addr (defined by the dynamic linker) and a reference to the
symbol name. Specifically, the compiler emits code for __tls_get_addr(GOT_S)
(minding the additional relocation needed for the GOT itself). The
linker then emits two dynamic relocations when generating the GOT. On x86_64
these are R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64. On AArch64 these are
R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL. These relocations reference the
symbol regardless of whether the module defines a symbol by that name.
Local Dynamic
Local dynamic is the same as Global Dynamic but for local symbols. It can be
thought of as a single global-dynamic access to the TLS block of this module.
Then, because every variable defined in the module is at a fixed offset from
the start of that TLS block, the compiler can optimize multiple global-dynamic
calls into one.
The compiler will relax a global-dynamic access to a local-dynamic access
whenever the variables are local/static or have hidden visibility. The linker
may sometimes be able to relax some global-dynamic accesses to local-dynamic
as well.
The following gives an example of how the compiler might emit code for this access model:
static thread_local char buf[buf_cap];
static thread_local size_t buf_size = 0;
while (*str && buf_size < buf_cap) {
  buf[buf_size++] = *str++;
}
might be lowered to
// GOT_module[0] is the module ID of this module
// GOT_module[1] is just 0
// <X> denotes the offset of X in this module's TLS block
tls = __tls_get_addr(GOT_module);
while (*str && *(size_t*)(tls + <buf_size>) < buf_cap) {
  ((char*)(tls + <buf>))[(*(size_t*)(tls + <buf_size>))++] = *str++;
}
If this code used global dynamic it would have to make at least 2 calls, one to
get the pointer for buf and the other to get the pointer for buf_size.
Initial Exec
This access model can be used any time the compiler knows that the module in
which the symbol being accessed is defined will be loaded in the initial set of
executables rather than opened using dlopen. This access model is generally
only used when the main executable is accessing a global symbol with default
visibility. This is because compiling an executable is the only time the
compiler knows that any code generated will be in the initial executable set. If
a DSO is compiled to make thread local accesses use this model, then the DSO
cannot be safely opened with dlopen. This is acceptable in performance-critical
applications and in cases where you know the binary will never be
dlopen-ed, such as in the case of libc. Modules compiled/linked this way have
their DF_STATIC_TLS flag set.
Initial Exec is the default when compiling without -fPIC.
The compiler emits code without even calling __tls_get_addr for this access
model. It does so using a single GOT entry per symbol, which we'll denote
GOT[s] for symbol s and for which the compiler emits relocations, so that
extern thread_local int a;
extern thread_local int b;
int main() {
return a + b;
}
would be lowered to something like the following
int main() {
return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]);
}
Note that on x86 architectures GOT[s] will actually resolve to a negative value.
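The lowered form above amounts to one load from the GOT followed by one load at $tp plus that value. A toy model, assuming a byte buffer in place of the TLS region and the hypothetical name toy_initial_exec_load, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Toy model of *(int*)($tp + GOT[s]): `tp` stands in for the thread pointer and
// `got_entry` for the offset that the dynamic linker stored in the GOT slot.
// The offset may be negative, as it is on x86.
int toy_initial_exec_load(const unsigned char* tp, std::ptrdiff_t got_entry) {
  assert(tp != nullptr);
  int value;
  std::memcpy(&value, tp + got_entry, sizeof value);  // no alignment assumptions
  return value;
}
```

No function call into the dynamic linker is involved, which is why this model is faster than global-dynamic.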
Local Exec
This is the fastest access model and can only be used if the symbol is in the
first TLS block, which is the TLS block of the main executable. In practice only
the main executable can use this access mode because a shared library can't
(and normally wouldn't need to) know whether it is accessing something from the
main executable. The linker will relax initial-exec to local-exec. The compiler
can't do this without explicit instructions via -ftls-model or
__attribute__((tls_model("..."))) because the compiler cannot know whether the
current translation unit is going to be linked into a main executable or a
shared library.
The precise details of how this offset is computed change a bit from architecture to architecture.
Example code:
static thread_local int a;
static thread_local int b;
int main() {
return a + b;
}
would be lowered to
int main() {
  return *(int*)($tp + TPOFF_a) + *(int*)($tp + TPOFF_b);
}
On AArch64 TPOFF_a == max(16, p_align) + <a>, where p_align is exactly the
p_align field of the main executable's PT_TLS segment and <a> is the
offset of a from the beginning of the main executable's TLS segment.
On x86_64 TPOFF_a == -<a>, where <a> is the offset of a from the end
of the main executable's TLS segment.
The linker is aware of what TPOFF_X is for any given X and fills in this value.
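The two formulas above can be written out as a short worked sketch. The function names are hypothetical, and the functions simply restate the formulas as given in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// AArch64: TPOFF = max(16, p_align) + offset from the start of the main
// executable's TLS segment.
inline std::int64_t tpoff_aarch64(std::int64_t p_align, std::int64_t offset_from_start) {
  assert(p_align > 0 && offset_from_start >= 0);
  return std::max<std::int64_t>(16, p_align) + offset_from_start;
}

// x86_64: TPOFF = -(offset from the end of the main executable's TLS segment).
inline std::int64_t tpoff_x86_64(std::int64_t offset_from_end) {
  assert(offset_from_end >= 0);
  return -offset_from_end;
}
```

For example, with p_align of 8 and a variable 4 bytes into the segment, the AArch64 offset is 16 + 4 = 20. With p_align of 64 it is 64 + 4 = 68. On x86_64 a variable 8 bytes before the end of the segment has offset -8.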
Implementation
This section discusses the implementation as it exists on Fuchsia. That said, the broad strokes here are similar across different libc implementations, including musl and glibc.
The actual implementation of all of this introduces a few more details, namely
the so-called "DTV" (Dynamic Thread Vector) (denoted dtv in this doc), which
indexes TLS blocks by module ID. The following diagrams show what the initial
executable set looks like. In Fuchsia's implementation we actually store a
bunch of meta information in a thread descriptor struct along with the
ABI TCB (denoted tcb below). In our implementation we use the first 8 bytes
of this space to point to the DTV. At first tcb points to dtv as shown in
the diagrams below, but after a dlopen this can change.
arm64:
*------------------------------------------------------------------------------*
| thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] |
*------------------------------------------------------------------------------*
^        ^         ^                   ^     ^
td       tp        dtv[1]       dtv[n+1]     dtv
Here X has size max(16, tls_align) - 16, where tls_align is the maximum
alignment of all loaded TLS segments from the static TLS set. This is set by
the static linker since the static linker resolves TPOFF_* values. This
padding is chosen so that if, as required, $tp is aligned to the main
executable's PT_TLS segment's p_align value, then tls1 - $tp will be
max(16, p_align). This ensures that there is always at least a 16-byte space
for the ABI TCB (denoted tcb in the diagram above).
x86:
*-----------------------------------------------------------------------------*
| tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb | thread |
*-----------------------------------------------------------------------------*
^                                       ^            ^      ^
dtv                                     dtv[n+1]     dtv[1] tp/td
Here td denotes the "thread descriptor pointer". In both implementations this
points to the thread descriptor. A subtle point not made apparent in these
diagrams is that tcb is actually a member of the thread descriptor struct in
both cases, but on AArch64 it is the last member and on x86_64 it is the first
member.
dlopen
This picture explains what happens for the initial executables but it doesn't
explain what happens in the dlopen case. When __tls_get_addr is called, it
first checks whether tls_cnt is large enough that the module ID (given by
GOT_s[0]) is within the dtv. If it is, then it simply looks up
dtv[GOT_s[0]] + GOT_s[1]; if it isn't, something more complicated happens. See
the implementation of __tls_get_new in dynlink.c.
In a nutshell, a sufficiently large space was already allocated for a larger dtv
on a call to dlopen. It is an invariant of the system that sufficient space
will always exist somewhere already allocated. The larger space is then set up
to be a proper dtv, and tcb is then set to point to this new larger dtv. Future
accesses will then use the simpler code path since tls_cnt will be large
enough.
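The fast path and slow path described above can be sketched with a toy model. This is not the code in dynlink.c: all names are hypothetical, and unlike the real implementation, which relies on space already allocated during dlopen, this toy allocates new blocks on the slow path for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical module record: `image` stands in for the PT_TLS initial data.
struct ToyModule {
  std::vector<unsigned char> image;
};

// Hypothetical per-thread state. dtv[0] is unused so module IDs index directly.
struct ToyGrowableThread {
  std::vector<unsigned char*> dtv{nullptr};
  std::vector<std::unique_ptr<unsigned char[]>> owned_blocks;
  // Highest module ID that the current dtv covers.
  std::size_t tls_cnt() const { return dtv.size() - 1; }
};

// modules[0] is unused; modules[i] describes the module with ID i.
unsigned char* toy_tls_get_addr_grow(ToyGrowableThread& self,
                                     const std::vector<ToyModule>& modules,
                                     std::size_t module_id, std::size_t offset) {
  assert(module_id >= 1 && module_id < modules.size());
  // Slow path: the dtv is too small for this module ID, so extend it and give
  // this thread a fresh copy of each missing module's TLS image.
  while (self.tls_cnt() < module_id) {
    const std::vector<unsigned char>& image = modules[self.tls_cnt() + 1].image;
    auto block = std::make_unique<unsigned char[]>(image.size());
    std::copy(image.begin(), image.end(), block.get());
    self.dtv.push_back(block.get());
    self.owned_blocks.push_back(std::move(block));
  }
  // Fast path: dtv[module_id] + offset.
  return self.dtv[module_id] + offset;
}
```

After the first access to a newly loaded module, later accesses to that module skip the loop entirely because tls_cnt already covers its ID.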