|Age||Commit message (Collapse)||Author||Lines|
currently the bfd linker does not seem to create tls segments where
p_vaddr%p_align != 0, but this is valid in ELF and then the runtime
computed tls offset must satisfy
offset%p_align == (base+p_vaddr)%p_align
and in case of local exec tls (main executable) the smallest such
offset must be used (otherwise it is incompatible with the offset
computed by the static linker). the !TLS_ABOVE_TP case is handled
correctly (the offset is negative then in the formula).
the ldso code for TLS_ABOVE_TP is changed so the static tls offset
of each module satisfies the formula.
this is the first part of a series of patches intended to make
__syscall fully self-contained in the object file produced using
syscall.h, which will make it possible for crt1 code to perform
the (confusingly named) i386 __vsyscall mechanism, which this commit
removes, was introduced before the presence of a valid thread pointer
was mandatory; back then the thread pointer was setup lazily only if
threads were used. the intent was to be able to perform syscalls using
the kernel's fast entry point in the VDSO, which can use the sysenter
(Intel) or syscall (AMD) instruction instead of int $128, but without
inlining an access to the __syscall global at the point of each
syscall, which would incur a significant size cost from PIC setup
everywhere. the mechanism also shuffled registers/calling convention
around to avoid spills of call-saved registers, and to avoid
allocating ebx or ebp via asm constraints, since there are plenty of
broken-but-supported compiler versions which are incapable of
allocating ebx with -fPIC or ebp with -fno-omit-frame-pointer.
the new mechanism preserves the properties of avoiding spills and
avoiding allocation of ebx/ebp in constraints, but does it inline,
using some fairly simple register shuffling, and uses a field of the
thread structure rather than global data for the vdso-provided syscall
for now, the external __syscall function is refactored not to use the
old __vsyscall so it can be kept, but the intent is to remove it too.
the hard problem here is unlinking threads from a list when they exit
without creating a window of inconsistency where the kernel task for a
thread still exists and is still executing instructions in userspace,
but is not reflected in the list. the magic solution here is getting
rid of per-thread exit futex addresses (set_tid_address), and instead
using the exit futex to unlock the global thread list.
since pthread_join can no longer see the thread enter a detach_state
of EXITED (which depended on the exit futex address pointing to the
detach_state), it must now observe the unlocking of the thread list
lock before it can unmap the joined thread and return. it doesn't
actually have to take the lock. for this, a __tl_sync primitive is
offered, with a signature that will allow it to be enhanced for quick
return even under contention on the lock, if needed. for now, the
exiting thread always performs a futex wake on its detach_state. a
future change could optimize this out except when there is already a
initial/dynamic variants of detached state no longer need to be
tracked separately, since the futex address is always set to the
global list lock, not a thread-local address that could become invalid
on detached thread exit. all detached threads, however, must perform a
second sigprocmask syscall to block implementation-internal signals,
since locking the thread list with them already blocked is not
the arch-independent C version of __unmapself no longer needs to take
a lock or setup its own futex address to release the lock, since it
must necessarily be called with the thread list lock already held,
guaranteeing exclusive access to the temporary stack.
changes to libc.threads_minus_1 no longer need to be atomic, since
they are guarded by the thread list lock. it is largely vestigial at
this point, and can be replaced with a cheaper boolean indicating
whether the process is multithreaded at some point in the future.
as explained in commit 6ba5517a460c6c438f64d69464fdfc3269a4c91a, some
archs use an offset (typicaly -0x8000) with their DTPOFF relocations,
which __tls_get_addr needs to invert. on affected archs, which lack
direct support for large immediates, this can cost multiple extra
instructions in the hot path. instead, incorporate the DTP_OFFSET into
the DTV entries. this means they are no longer valid pointers, so
store them as an array of uintptr_t rather than void *; this also
makes it easier to access slot 0 as a valid slot count.
commit e75b16cf93ebbc1ce758d3ea6b2923e8b2457c68 left behind cruft in
two places, __reset_tls and __tls_get_new, from back when it was
possible to have uninitialized gap slots indicated by a null pointer
in the DTV. since the concept of null pointer is no longer meaningful
with an offset applied, remove this cruft.
presently there are no archs with both TLSDESC and nonzero DTP_OFFSET,
but the dynamic TLSDESC relocation code is also updated to apply an
inverted offset to its offset field, so that the offset DTV would not
impose a runtime cost in TLSDESC resolver functions.
this facilitates building software that assumes a large default stack
size without any patching to call pthread_setattr_default_np or
pthread_attr_setstacksize at each thread creation site, using just
normally the PT_GNU_STACK header is used only to reflect whether
executable stack is desired, but with GNU ld at least, passing
-Wl,-z,stack-size=N will set a size on the program header. with this
patch, that size will be incorporated into the default stack size
(subject to increase-only rule and DEFAULT_STACK_MAX limit).
both static and dynamic linking honor the program header. for dynamic
linking, all libraries loaded at program start, including preloaded
ones, are considered. dlopened libraries are not considered, for
several reasons. extra logic would be needed to defer processing until
the load of the new library is commited, synchronization woud be
needed since other threads may be running concurrently, and the
effectiveness woud be limited since the larger size would not apply to
threads that already existed at the time of dlopen. programs that will
dlopen code expecting a large stack need to declare the requirement
themselves, or pthread_setattr_default_np can be used.
this cleans up what had become widespread direct inline use of "GNU C"
style attributes directly in the source, and lowers the barrier to
increased use of hidden visibility, which will be useful to recovering
some of the efficiency lost when the protected visibility hack was
dropped in commit dc2f368e565c37728b0d620380b849c3a1ddd78f, especially
on archs where the PLT ABI is costly.
In TLS variant I the TLS is above TP (or above a fixed offset from TP)
but on some targets there is a reserved gap above TP before TLS starts.
This matters for the local-exec tls access model when the offsets of
TLS variables from the TP are hard coded by the linker into the
executable, so the libc must compute these offsets the same way as the
linker. The tls offset of the main module has to be
If there is no TLS in the main module then the gap can be ignored
since musl does not use it and the tls access models of shared
libraries are not affected.
The previous setup only worked if (tls_align & -GAP_ABOVE_TP) == 0
(i.e. TLS did not require large alignment) because the gap was
treated as a fixed offset from TP. Now the TP points at the end
of the pthread struct (which is aligned) and there is a gap above
it (which may also need alignment).
The fix required changing TP_ADJ and __pthread_self on affected
targets (aarch64, arm and sh) and in the tlsdesc asm the offset to
access the dtv changed too.
previously, some accesses to the detached state (from pthread_join and
pthread_getattr_np) were unsynchronized; they were harmless in
programs with well-defined behavior, but ugly. other accesses (in
pthread_exit and pthread_detach) were synchronized by a poorly named
"exitlock", with an ad-hoc trylock operation on it open-coded in
pthread_detach, whose only purpose was establishing protocol for which
thread is responsible for deallocation of detached-thread resources.
instead, use an atomic detach_state and unify it with the futex used
to wait for thread exit. this eliminates 2 members from the pthread
structure, gets rid of the hackish lock usage, and makes rigorous the
trap added in commit 80bf5952551c002cf12d96deb145629765272db0 for
catching attempts to join detached threads. it should also make
attempt to detach an already-detached thread reliably trap.
the tid field in the pthread structure is not volatile, and really
shouldn't be, so as not to limit the compiler's ability to reorder,
merge, or split loads in code paths that may be relevant to
performance (like controlling lock ownership).
however, use of objects which are not volatile or atomic with futex
wait is inherently broken, since the compiler is free to transform a
single load into multiple loads, thereby using a different value for
the controlling expression of the loop and the value passed to the
futex syscall, leading the syscall to block instead of returning.
reportedly glibc's pthread_join was actually affected by an equivalent
issue in glibc on s390.
add a separate, dedicated join_futex object for pthread_join to use.
the static-linked version of __init_tls needs to locate the TLS
initialization image via the ELF program headers, which requires
determining the base address at which the program was loaded. the
existing code attempted to do this by comparing the actual address of
the program headers (obtained via auxv) with the virtual address for
the PT_PHDR record in the program headers. however, the linker seems
to produce a PT_PHDR record only when a program interpreter (dynamic
linker) is used. thus the computation failed and used the default base
address of 0, leading to a crash when trying to access the TLS image
at the wrong address.
the dynamic linker entry point and static-PIE rcrt1.o startup code
compute the base address instead by taking the difference between the
run-time address of _DYNAMIC and the virtual address in the PT_DYNAMIC
record. this patch copies the approach they use, but with a weak
symbolic reference to _DYNAMIC instead of obtaining the address from
the crt_arch.h asm. this works because relocations have already been
performed at the time __init_tls is called.
this both allows removal of some of the main remaining uses of the
SHARED macro and clears one obstacle to static-linked dlopen support,
which may be added at some point in the future.
specialized single-TLS-module versions of __copy_tls and __reset_tls
are removed and replaced with code adapted from their dynamic-linked
versions, capable of operating on a whole chain of TLS modules, and
use of the dynamic linker's DSO chain (which contains large struct dso
objects) by these functions is replaced with a new chain of struct
tls_module objects containing only the information needed for
implementing TLS. this may also yield some performance benefit
initializing TLS for a new thread when a large number of modules
without TLS have been loaded, since since there is no need to walk
structures for modules without TLS.
both static and dynamic linked versions of the __copy_tls function
have a hidden assumption that the alignment of the beginning or end of
the memory passed is suitable for storing an array of pointers for the
dtv. pthread_create satisfies this requirement except when
libc.tls_size is misaligned, which cannot happen with dynamic linking
due to way update_tls_size computes the total size, but could happen
with static linking and odd-sized TLS.
commit dab441aea240f3b7c18a26d2ef51979ea36c301c, which made thread
pointer init mandatory for all programs, rendered this store obsolete
by removing the early-return path for static programs with no TLS.
this slightly reduces the code size cost of TLS/thread-pointer for
static linking since __init_tp can be inlined into its only caller and
removed. this is analogous to the handling of __init_libc in
__libc_start_main, where the function only has external linkage when
it needs to be called from the dynamic linker.
part of the goal here is to eliminate use of the ATTR_LIBC_VISIBILITY
macro outside of libc.h, since it was never intended to be 'public'.
since 1.1.0, musl has nominally required a thread pointer to be setup.
most of the remaining code that was checking for its availability was
doing so for the sake of being usable by the dynamic linker. as of
commit 71f099cb7db821c51d8f39dfac622c61e54d794c, this is no longer
necessary; the thread pointer is now valid before any libc code
(outside of dynamic linker bootstrap functions) runs.
this commit essentially concludes "phase 3" of the "transition path
for removing lazy init of thread pointer" project that began during
the 1.1.0 release cycle.
as a result of commit 12e1e324683a1d381b7f15dd36c99b37dd44d940, kernel
processing of the robust list is only needed for process-shared
mutexes. previously the first attempt to lock any owner-tracked mutex
resulted in robust list initialization and a set_robust_list syscall.
this is no longer necessary, and since the kernel's record of the
robust list must now be cleared at thread exit time for detached
threads, optimizing it out is more worthwhile than before too.
There are two main abi variants for thread local storage layout:
(1) TLS is above the thread pointer at a fixed offset and the pthread
struct is below that. So the end of the struct is at known offset.
(2) the thread pointer points to the pthread struct and TLS starts
below it. So the start of the struct is at known (zero) offset.
Assembly code for the dynamic TLSDESC callback needs to access the
dynamic thread vector (dtv) pointer which is currently at the front
of the pthread struct. So in case of (1) the asm code needs to hard
code the offset from the end of the struct which can easily break if
the struct changes.
This commit adds a copy of the dtv at the end of the struct. New members
must not be added after dtv_copy, only before it. The size of the struct
is increased a bit, but there is opportunity for size optimizations.
a conservative estimate of 4*sizeof(size_t) was used as the minimum
alignment for thread-local storage, despite the only requirements
being alignment suitable for struct pthread and void* (which struct
pthread already contains). additional alignment required by the
application or libraries is encoded in their headers and is already
over-alignment prevented the builtin_tls array from ever being used in
dynamic-linked programs on 64-bit archs, thereby requiring allocation
at startup even in programs with no TLS of their own.
C99 6.10.3p11 disallows such constructs
so use an #ifdef outside of the argument list of __syscall
the main motivation for this change is to remove the assumption that
the tid of the main thread is also the pid of the process. (the value
returned by the set_tid_address syscall was used to fill both fields
despite it semantically being the tid.) this is historically and
presently true on linux and unlikely to change, but it conceivably
could be false on other systems that otherwise reproduce the linux
only a few parts of the code were actually still using the cached pid.
in a couple places (aio and synccall) it was a minor optimization to
avoid a syscall. caching could be reintroduced, but lazily as part of
the public getpid function rather than at program startup, if it's
deemed important for performance later. in other places (cancellation
and pthread_kill) the pid was completely unnecessary; the tkill
syscall can be used instead of tgkill. this is actually a rather
subtle issue, since tgkill is supposedly a solution to race conditions
that can affect use of tkill. however, as documented in the commit
message for commit 7779dbd2663269b465951189b4f43e70839bc073, tgkill
does not actually solve this race; it just limits it to happening
within one process rather than between processes. we use a lock that
avoids the race in pthread_kill, and the use in the cancellation
signal handler is self-targeted and thus not subject to tid reuse
races, so both are safe regardless of which syscall (tgkill or tkill)
this commit adds non-stub implementations of setlocale, duplocale,
newlocale, and uselocale, along with the data structures and minimal
code needed for representing the active locale on a per-thread basis
and optimizing the common case where thread-local locale settings are
not in use.
at this point, the data structures only contain what is necessary to
represent LC_CTYPE (a single flag) and LC_MESSAGES (a name for use in
finding message translation files). representation for the other
categories will be added later; the expectation is that a single
pointer will suffice for each.
for LC_CTYPE, the strings "C" and "POSIX" are treated as special; any
other string is accepted and treated as "C.UTF-8". for other
categories, any string is accepted after being truncated to a maximum
supported length (currently 15 bytes). for LC_MESSAGES, the name is
kept regardless of whether libc itself can use such a message
translation locale, since applications using catgets or gettext should
be able to use message locales libc is not aware of. for other
categories, names which are not successfully loaded as locales (which,
at present, means all names) are treated as aliases for "C". setlocale
locale settings are not yet used anywhere, so this commit should have
no visible effects except for the contents of the string returned by
such separation serves multiple purposes:
- by having the common path for __tls_get_addr alone in its own
function with a tail call to the slow case, code generation is
- by having __tls_get_addr in it own file, it can be replaced on a
per-arch basis as needed, for optimization or ABI-specific purposes.
- by removing __tls_get_addr from __init_tls.c, a few bytes of code
are shaved off of static binaries (which are unlikely to use this
function unless the linker messed up).
the motivation for the errno_ptr field in the thread structure, which
this commit removes, was to allow the main thread's errno to keep its
address when lazy thread pointer initialization was used. &errno was
evaluated prior to setting up the thread pointer and stored in
errno_ptr for the main thread; subsequently created threads would have
errno_ptr pointing to their own errno_val in the thread structure.
since lazy initialization was removed, there is no need for this extra
level of indirection; __errno_location can simply return the address
of the thread's errno_val directly. this does cause &errno to change,
but the change happens before entry to application code, and thus is
such kernels cannot support threads, but the thread pointer is also
important for other purposes, most notably stack protector. without a
valid thread pointer, all code compiled with stack protector will
crash. the same applies to any use of thread-local storage by
applications or libraries.
the concept of this patch is to fall back to using the modify_ldt
syscall, which has been around since linux 1.0, to setup the gs
segment register. since the kernel does not have a way to
automatically assign ldt entries, use of slot zero is hard-coded. if
this fallback path is used, __set_thread_area returns a positive value
(rather than the usual zero for success, or negative for error)
indicating to the caller that the thread pointer was successfully set,
but only for the main thread, and that thread creation will not work
properly. the code in __init_tp has been changed accordingly to record
this result for later use by pthread_create.
the function itself was static, but the weak alias provided an
externally visible reference and thus prevented the dead code from
being omitted from the output. so this change actually reduces bloat
in mandatory static-linked code.
this is the first step in an overhaul aimed at greatly simplifying and
optimizing everything dealing with thread-local state.
previously, the thread pointer was initialized lazily on first access,
or at program startup if stack protector was in use, or at certain
random places where inconsistent state could be reached if it were not
initialized early. while believed to be fully correct, the logic was
fragile and non-obvious.
in the first phase of the thread pointer overhaul, support is retained
(and in some cases improved) for systems/situation where loading the
thread pointer fails, e.g. old kernels.
some notes on specific changes:
- the confusing use of libc.main_thread as an indicator that the
thread pointer is initialized is eliminated in favor of an explicit
- sigaction no longer needs to ensure that the thread pointer is
initialized before installing a signal handler (this was needed to
prevent a situation where the signal handler caused the thread
pointer to be initialized and the subsequent sigreturn cleared it
again) but it still needs to ensure that implementation-internal
thread-related signals are not blocked.
- pthread tsd initialization for the main thread is deferred in a new
manner to minimize bloat in the static-linked __init_tp code.
- pthread_setcancelstate no longer needs special handling for the
situation before the thread pointer is initialized. it simply fails
on systems that cannot support a thread pointer, which are
- pthread_cleanup_push/pop now check for missing thread pointer and
nop themselves out in this case, so stdio no longer needs to avoid
the cancellable path when the thread pointer is not available.
a number of cases remain where certain interfaces may crash if the
system does not support a thread pointer. at this point, these should
be limited to pthread interfaces, and the number of such cases should
be fewer than before.
the external mmap function is heavy because it has to handle error
reporting that the kernel cannot do, and has to do some locking for
arcane race-condition-avoidance purposes. for allocating initial TLS,
we do not need any of that; the raw syscall suffices.
on i386, this change shaves off 13% of the size of .text for the empty
this is needed for reused threads in the SIGEV_THREAD timer
notification system, and could be reused elsewhere in the future if
needed, though it should be refactored for such use.
for static linking, __init_tls.c is simply modified to export the TLS
info in a structure with external linkage, rather than using statics.
this perhaps makes the code more clear, since the statics were poorly
named for statics. the new __reset_tls.c is only linked if it is used.
for dynamic linking, the code is in dynlink.c. sharing code with
__copy_tls is not practical since __reset_tls must also re-zero
apparently this was never noticed before because the linker normally
optimizes dynamic TLS models to non-dynamic ones when static linking,
thus eliminating the calls to __tls_get_addr which crash when the dtv
is missing. however, some libsupc++ code on ARM was calling
__tls_get_addr when static linked and crashing. the reason is unclear
to me, but with this issue fixed it should work now anyway.
libc is the macro, __libc is the internal symbol, but under some
configurations on old/broken compilers, the symbol might not actually
exist and the libc macro might instead use __libc_loc() to obtain
access to the object.
this mirrors the stdio_impl.h cleanup. one header which is not
strictly needed, errno.h, is left in pthread_impl.h, because since
pthread functions return their error codes rather than using errno,
nearly every single pthread function needs the errno constants.
in a few places, rather than bringing in string.h to use memset, the
memset was replaced by direct assignment. this seems to generate much
better code anyway, and makes many functions which were previously
non-leaf functions into leaf functions (possibly eliminating a great
deal of bloat on some platforms where non-leaf functions require ugly
prologue and/or epilogue).
despite documentation that makes it sound a lot different, the only
ABI-constraint difference between TLS variants II and I seems to be
that variant II stores the initial TLS segment immediately below the
thread pointer (i.e. the thread pointer points to the end of it) and
variant I stores the initial TLS segment above the thread pointer,
requiring the thread descriptor to be stored below. the actual value
stored in the thread pointer register also tends to have per-arch
random offsets applied to it for silly micro-optimization purposes.
with these changes applied, TLS should be basically working on all
supported archs except microblaze. I'm still working on getting the
necessary information and a working toolchain that can build TLS
binaries for microblaze, but in theory, static-linked programs with
TLS and dynamic-linked programs where only the main executable uses
TLS should already work on microblaze.
alignment constraints have not yet been heavily tested, so it's
possible that this code does not always align TLS segments correctly
on archs that need TLS variant I.
the code in __libc_start_main is now responsible for parsing auxv,
rather than duplicating the parsing all over the place. this should
shave off a few cycles and some code size. __init_libc is left as an
external-linkage function despite the fact that it could be static, to
prevent it from being inlined and permanently wasting stack space when
main is called.
a few other minor changes are included, like eliminating per-thread
ssp canaries (they were likely broken when combined with certain
dlopen usages, and completely unnecessary) and some other unnecessary
checks. since this code gets linked into every program, it should be
as small and simple as possible.
unlike other implementations, this one reserves memory for new TLS in
all pre-existing threads at dlopen-time, and dlopen will fail with no
resources consumed and no new libraries loaded if memory is not
available. memory is not immediately distributed to running threads;
that would be too complex and too costly. instead, assurances are made
that threads needing the new TLS can obtain it in an async-signal-safe
way from a buffer belonging to the dynamic linker/new module (via
atomic fetch-and-add based allocator).
I've re-appropriated the lock that was previously used for __synccall
(synchronizing set*id() syscalls between threads) as a general
pthread_create lock. it's a "backwards" rwlock where the "read"
operation is safe atomic modification of the live thread count, which
multiple threads can perform at the same time, and the "write"
operation is making sure the count does not increase during an
operation that depends on it remaining bounded (__synccall or dlopen).
in static-linked programs that don't use __synccall, this lock is a
no-op and has no cost.
only TLS in the main program is supported so far; TLS defined in
shared libraries will not work yet.
the design for TLS in dynamic-linked programs is mostly complete too,
but I have not yet implemented it. cost is nonzero but still low for
programs which do not use TLS and/or do not use threads (a few hundred
bytes of new code, plus dependency on memcpy). i believe it can be
made smaller at some point by merging __init_tls and __init_security
into __libc_start_main and avoiding duplicate auxv-parsing code.
at the same time, I've also slightly changed the logic pthread_create
uses to allocate guard pages to ensure that guard pages are not
counted towards commit charge.