|Age||Commit message (Collapse)||Author||Lines|
dl_iterate_phdr was wrongly reporting the address of the DSO's PT_TLS
image rather than the calling thread's instance of the TLS. the man
page, which is essentially normative for a nonstandard function of
this sort, clearly specifies the latter. it does not clarify where
exactly within/relative-to the image the pointer should point, but the
reasonable thing to do is match the ABI's DTP offset, and this seems
to be what other implementations do.
as the outcome of Austin Group tracker issue #62, future editions of
POSIX have dropped the requirement that fork be AS-safe. this allows
but does not require implementations to synchronize fork with internal
locks and give forked children of multithreaded parents a partly or
fully unrestricted execution environment where they can continue to
use the standard library (per POSIX, they can only portably use
up until recently, taking this allowance did not seem desirable.
however, commit 8ed2bd8bfcb4ea6448afb55a941f4b5b2b0398c0 exposed the
extent to which applications and libraries are depending on the
ability to use malloc and other non-AS-safe interfaces in MT-forked
children, by converting latent very-low-probability catastrophic state
corruption into predictable deadlock. dealing with the fallout has
been a huge burden for users/distros.
while it looks like most of the non-portable usage in applications
could be fixed given sufficient effort, at least some of it seems to
occur in language runtimes which are exposing the ability to run
unrestricted code in the child as part of the contract with the
programmer. any attempt at fixing such contracts is not just a
technical problem but a social one, and is probably not tractable.
this patch extends the fork function to take locks for all libc
singletons in the parent, and release or reset those locks in the
child, so that when the underlying fork operation takes place, the
state protected by these locks is consistent and ready for the child
to use. locking is skipped in the case where the parent is
single-threaded so as not to interfere with legacy AS-safety property
of fork in single-threaded programs. lock order is mostly arbitrary,
but the malloc locks (including bump allocator in case it's used) must
be taken after the locks on any subsystems that might use malloc, and
non-AS-safe locks cannot be taken while the thread list lock is held,
imposing a requirement that it be taken last.
this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.
care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
thread-local buffers allocated for dlerror need to be queued for free
at a later time when the owning thread exits, since malloc may be
replaced by application code and the exiting context is not valid to
call application code from. the code to process queue of pending
frees, introduced in commit aa5a9d15e09851f7b4a1668e9dbde0f6234abada,
gratuitously held the lock for the entire duration of queue
processing, updating the global queue pointer after each free, despite
there being no logical requirement that all frees finish before
another thread can access the queue.
instead, immediately claim the whole queue for freeing and release the
lock, then walk the list and perform frees without the lock held. the
change is unlikely to make any meaningful difference to performance,
but it eliminates one point where the allocator is called under an
internal lock. since the allocator may be application-provided, such
calls are undesirable because they allow application code to impede
forward progress of libc functions in other threads arbitrarily long,
and to induce deadlock if it calls a libc function that requires the
the change also eliminates a lock ordering consideration that's an
impediment upcoming work with multithreaded fork.
in commit 22daaea39f1cc5f7391f0a5cd84576ffb58c2860, the
__dlsym_redir_time64 function providing the backend for __dlsym_time64
was defined only in the dynamic linker, and thus was undefined when
static linking a program referencing dlsym. use the same stub_dlsym
definition that provides __dlsym (the non-redirecting backend) for
static linked programs to provide it, conditional on _REDIR_TIME64.
Some declarations of __tls_get_new were left in the code, even
though the definition got removed in
install dynamic tls synchronously at dlopen, streamline access
this can make the build fail with
ld: lib/libc.so: hidden symbol `__tls_get_new' isn't defined
when libc.so is linked without --gc-sections, because a .hidden
declaration in asm code creates a reference even if the symbol
is not actually used.
we don't actually support building asm source files as thumb1, but
it's possible that the condition __ARM_ARCH>=5 would be false on old
compilers that did not define __ARM_ARCH at all. avoiding that would
require enumerating all of the possible __ARM_ARCH_*__ macros for
as noted in commit 05870abeaac0588fb9115cfd11f96880a0af2108, mov lr,pc
is not valid for saving a return address when in thumb mode. since
this code is a hot path (dynamic TLS access), don't do the out-of-line
bl->bx chaining to save the return value; instead, use the fact that
this file is preprocessed asm to add the missing thumb bit with an add
in place of the mov.
the change here does not affect builds for ISA levels new enough to
have a thread pointer read instruction, or for armv5 and later as long
as the compiler properly defines __ARM_ARCH, or for any build as arm
(not thumb) code. it's likely that it makes no difference whatsoever
to any present-day practical build environments, but nonetheless now
as an alternative, we could just assume __thumb__ implies availability
of blx since we don't support building asm source files as thumb1. I
didn't do that in order to avoid having a wrong assumption here if
that ever changes.
maintainer's note: these are not meaningful/correct/needed and the
clang integrated assembler errors out upon seeing them.
Author: Alex Suykov <email@example.com>
Author: Aric Belsito <firstname.lastname@example.org>
Author: Drew DeVault <email@example.com>
Author: Michael Clark <firstname.lastname@example.org>
Author: Michael Forney <email@example.com>
Author: Stefan O'Rear <firstname.lastname@example.org>
This port has involved the work of many people over several years. I
have tried to ensure that everyone with substantial contributions has
been credited above; if any omissions are found they will be noted
later in an update to the authors/contributors list in the COPYRIGHT
The version committed here comes from the riscv/riscv-musl repo's
commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by
me for issues found during final review:
- a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc
are not safe to use in separate inline asm fragments)
- a_cas[_p] is fixed to be a memory barrier
- the call from the _start assembly into the C part of crt1/ldso is
changed to allow for the possibility that the linker does not place
them nearby each other.
- DTP_OFFSET is defined correctly so that local-dynamic TLS works
- reloc.h LDSO_ARCH logic is simplified and made explicit.
- unused, non-functional crti/n asm files are removed.
- an empty .sdata section is added to crt1 so that the
__global_pointer reference is resolvable.
- indentation style errors in some asm files are fixed.
with the glibc generation counter model for reusing dynamic tls slots
after dlclose, it's really not possible to get away with fewer than 4
working registers. for us however it's always been possible, but
tricky, and only became apparent after the switch to installing new
dynamic tls at dlopen time. by merging the negated thread pointer into
the addend early, the register holding the thread pointer can
immediately be reused, bringing the working register count down to
three. this allows saving/restoring via a single stp/ldp pair, since
the return register x0 does not need to be saved.
net reduction of 3 instructions, 2 of which were push/pop.
previously, dynamic loading of new libraries with thread-local storage
allocated the storage needed for all existing threads at load-time,
precluding late failure that can't be handled, but left installation
in existing threads to take place lazily on first access. this imposed
an additional memory access and branch on every dynamic tls access,
and imposed a requirement, which was not actually met, that the
dynamic tlsdesc asm functions preserve all call-clobbered registers
before calling C code to to install new dynamic tls on first access.
the x86[_64] versions of this code wrongly omitted saving and
restoring of fpu/vector registers, assuming the compiler would not
generate anything using them in the called C code. the arm and aarch64
versions saved known existing registers, but failed to be future-proof
against expansion of the register file.
now that we track live threads in a list, it's possible to install the
new dynamic tls for each thread at dlopen time. for the most part,
synchronization is not needed, because if a thread has not
synchronized with completion of the dlopen, there is no way it can
meaningfully request access to a slot past the end of the old dtv,
which remains valid for accessing slots which already existed.
however, it is necessary to ensure that, if a thread sees its new dtv
pointer, it sees correct pointers in each of the slots that existed
prior to the dlopen. my understanding is that, on most real-world
coherency architectures including all the ones we presently support, a
built-in consume order guarantees this; however, don't rely on that.
instead, the SYS_membarrier syscall is used to ensure that all threads
see the stores to the slots of their new dtv prior to the installation
of the new dtv. if it is not supported, the same is implemented in
userspace via signals, using the same mechanism as __synccall.
the __tls_get_addr function, variants, and dynamic tlsdesc asm
functions are all updated to remove the fallback paths for claiming
new dynamic tls, and are now all branch-free.
__dl_thread_cleanup is called from the context of an exiting thread
that is not in a consistent state valid for calling application code.
since commit c9f415d7ea2dace5bf77f6518b6afc36bb7a5732, it's possible
(and supported usage) for the allocator to have been replaced by the
application, so __dl_thread_cleanup can no longer call free. instead,
reuse the message buffer as a linked-list pointer, and queue it to be
freed the next time any dynamic linker error message is generated.
when invoking the assembler, arm gcc does not always pass the right
flags to enable use of vfp instruction mnemonics. for C code it
produces, it emits the .fpu directive, but this does not help when
building asm source files, which tlsdesc needs to be. to fix, use an
explicit directive here.
commit 0beb9dfbecad38af9759b1e83eeb007e28b70abb introduced this
regression. it has not appeared in any release.
the indirect function call is a significant portion of the code path
for the dynamic case, and most users are probably building for ISA
levels where it can be omitted.
we could drop at least one register save/restore (lr) with this
change, and possibly another (ip) with some clever shuffling, but it's
not clear whether there's a way to do it that's not more expensive, or
whether avoiding the save/restore would have any practical effect, so
in the interest of avoiding complexity it's omitted for now.
unlike other asm where the baseline ISA is used, these functions are
hot paths and use ISA-level specializations.
call-clobbered vfp registers are saved before calling __tls_get_new,
since there is no guarantee it won't use them. while setjmp/longjmp
have to use hwcap to decide whether to the fpu is in use, since
application code could be using vfp registers even if libc was
compiled as pure softfloat, __tls_get_new is part of libc and can be
assumed not to have access to vfp registers if tlsdesc.S does not.
thus it suffices just to check the predefined preprocessor macros. the
check for __ARM_PCS_VFP is redundant; !__SOFTFP__ must always be true
if the target ISA level includes fpu instructions/registers.
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
this cleans up what had become widespread direct inline use of "GNU C"
style attributes directly in the source, and lowers the barrier to
increased use of hidden visibility, which will be useful to recovering
some of the efficiency lost when the protected visibility hack was
dropped in commit dc2f368e565c37728b0d620380b849c3a1ddd78f, especially
on archs where the PLT ABI is costly.
three ABIs are supported: the default with 68881 80-bit fpu format and
results returned in floating point registers, softfloat-only with the
same format, and coldfire fpu with IEEE single/double only. only the
first is tested at all, and only under qemu which has fpu emulation
basic functionality smoke tests have been performed for the most
common arch-specific breakage via libc-test and qemu user-level
emulation. some sysvipc failures remain, but are shared with other big
endian archs and will be fixed separately.
In TLS variant I the TLS is above TP (or above a fixed offset from TP)
but on some targets there is a reserved gap above TP before TLS starts.
This matters for the local-exec tls access model when the offsets of
TLS variables from the TP are hard coded by the linker into the
executable, so the libc must compute these offsets the same way as the
linker. The tls offset of the main module has to be
If there is no TLS in the main module then the gap can be ignored
since musl does not use it and the tls access models of shared
libraries are not affected.
The previous setup only worked if (tls_align & -GAP_ABOVE_TP) == 0
(i.e. TLS did not require large alignment) because the gap was
treated as a fixed offset from TP. Now the TP points at the end
of the pthread struct (which is aligned) and there is a gap above
it (which may also need alignment).
The fix required changing TP_ADJ and __pthread_self on affected
targets (aarch64, arm and sh) and in the tlsdesc asm the offset to
access the dtv changed too.
analogous to commit 5bf7eba213cacc4c1220627c91c28deff2ffecda, use of
AT_PHDR/PT_PHDR does not actually work to find the program base, and
the method with _DYNAMIC vs PT_DYNAMIC must be used as an alternative.
patch by Shiz, along with testing to confirm that this fixes unwinding
in static PIE.
this could only happen if an incomplete auxv was passed into the
program, but it's better to just initialize the data anyway.
This was missed when writing the port initially.
based on patch submitted by Jaydeep Patil, with minor changes.
patch by Mahesh Bodapati and Jaydeep Patil of Imagination
this eliminates the last need for the SHARED macro to control how
files in the src tree are compiled. the same code is used for both
libc.a and libc.so, with additional code for the dynamic linker (from
the new ldso tree) being added to libc.so but not libc.a. separate .o
and .lo object files still exist for the src tree, but the only
difference is that the .lo files are built as PIC.
in the future, if/when we add dlopen support for static-linked
programs, much of the code in dynlink.c may be moved back into the src
tree, but properly factored into separate source files. in that case,
the code in the ldso tree will be reduced to just the dynamic linker
entry point, self-relocation, and loading of libraries needed by the
like elsewhere, use a weak alias that the dynamic linker will override
with a more complete version capable of handling shared libraries.
the function name is still __-prefixed because it requires an asm
wrapper to pass the caller's address in order for RTLD_NEXT to work.
since this was the last function in dynlink.c still used for static
linking, now the whole file is conditional on SHARED being defined.
the ultimate goal of this change is to get all code used in libc.a out
of dynlink.c, so that the dynamic linker code can be moved to its own
tree and object files in the src tree can all be shared between libc.a
this is possible with the new build system that allows src/*/$(ARCH)/*
files which do not shadow a file in the parent directory, and yields a
more logical organization. eventually it will be possible to remove
arch/*/src from the build system.
if two or more threads accessed tls in a dso that was loaded after
the threads were created, then __tls_get_new could do out-of-bound
memory access (leading to segfault).
accidentally byte count was used instead of element count when
the new dtv pointer was computed. (dso->new_dtv is (void**).)
it is rare that the same dso provides dtv for several threads,
the crash was not observed in practice, but possible to trigger.
commit ad1cd43a86645ba2d4f7c8747240452a349d6bc1 eliminated
preprocessor-level omission of references to the init/fini array
symbols from object files going into libc.so. the references are weak,
and the intent was that the linker would resolve them to zero in
libc.so, but instead it leaves undefined references that could be
satisfied at runtime. normally these references would be harmless,
since the code using them does not even get executed, but some older
binutils versions produce a linking error: when linking a program
against libc.so, ld first tries to use the hidden init/fini array
symbols produced by the linker script to satisfy the references in
libc.so, then produces an error because the definitions are hidden.
ideally ld would have already provided definitions of these symbols
when linking libc.so, but the linker script for -shared omits them.
to avoid this situation, the dynamic linker now provides its own dummy
definitions of the init/fini array symbols for libc.so. since they are
hidden, everything binds at ld time and no references remain in the
dynamic symbol table. with modern binutils and --gc-sections, both
the dummy empty array objects and the code referencing them get
dropped at link time, anyway.
the _init and _fini symbols are also switched back to using weak
definitions rather than weak references since the latter behave
somewhat problematically in general, and the weak definition approach
was known to work well.
the nommu kernel shares memory when it can anyway for private
read-only maps, but semantically the map should be private. this can
make a difference when debugging breakpoints are to be used, in which
case the kernel may need to ensure that the mapping is not shared.
the new behavior matches how the kernel FDPIC loader maps the main
program and/or program interpreter (dynamic linker) binary.
also fix visibility of the glue function used.
this both allows removal of some of the main remaining uses of the
SHARED macro and clears one obstacle to static-linked dlopen support,
which may be added at some point in the future.
specialized single-TLS-module versions of __copy_tls and __reset_tls
are removed and replaced with code adapted from their dynamic-linked
versions, capable of operating on a whole chain of TLS modules, and
use of the dynamic linker's DSO chain (which contains large struct dso
objects) by these functions is replaced with a new chain of struct
tls_module objects containing only the information needed for
implementing TLS. this may also yield some performance benefit
initializing TLS for a new thread when a large number of modules
without TLS have been loaded, since since there is no need to walk
structures for modules without TLS.
use weak definitions that the dynamic linker can override instead of
preprocessor conditionals on SHARED so that the same libc start and
exit code can be used for both static and dynamic linking.
on linux/nommu, non-writable private mappings of files may actually
use memory shared with other processes or the fs cache. the old nommu
loader code (used when mmap with MAP_FIXED fails) simply wrote over
top of the original file mapping, possibly clobbering this shared
memory. no such breakage was observed in practice, but it should have
the new code starts by mapping anonymous writable memory on archs that
might support nommu, then maps load segments over top of it, falling
back to read if MAP_FIXED fails. we use an anonymous map rather than a
writable file map to avoid reading more data from disk than needed.
since pages cannot be loaded lazily on fault, in case of large
data/bss, mapping the full file may read a lot of data that will
subsequently be thrown away when processing additional LOAD segments.
as a result, we cannot skip the first LOAD segment when operating in
these changes affect only non-FDPIC nommu support.
these files are all accepted as legacy arm syntax when producing arm
code, but legacy syntax cannot be used for producing thumb2 with
access to the full ISA. even after switching to UAL, some asm source
files contain instructions which are not valid in thumb mode, so these
will need to be addressed separately.
when a library being loaded has bss (i.e. data segment with
p_memsz>p_filesz), this region needs to be zeroed with a combination
of memset and/or mmap. the regular ELF loader always did this but the
FDPIC code path omitted it, leading to objects in bss having
when determining which module an address belongs to, all function
descriptor ranges must be checked first, in case the allocated memory
falls inside another module's memory range.
dladdr itself must also check addresses against function descriptors
before doing a best-match search against the symbol table. even when
doing the latter (e.g. for code addresses obtained from mcontext_t),
also check whether the best-match was a function, and if so, replace
the result with a function descriptor address. which is the nominal
"base address" of the function and which the caller needs if it
intends to subsequently call the matching function.
since commits 2907afb8dbd4c1d34825c3c9bd2b41564baca210 and
6fc30c2493fcfedec89e45088bea87766a1e3286, __dls2 is no longer called
via symbol lookup, but instead uses relative addressing that needs to
be resolved at link time. on some linker versions, and/or if
-Bsymbolic-functions is not used, the linker may leave behind a
dynamic relocation, which is not suitable for bootstrapping the
dynamic linker, if the reference to __dls2 is marked hidden but the
definition is not actually hidden. correcting the definition to use
hidden visibility fixes the problem.
the static-PIE entry point rcrt1 was likewise affected and is also
fixed by this patch.
lookup the dso an address falls in based on the loadmap and not just a
base/length. fix the main app's fake loadmap used when loaded by a
non-fdpic-aware loader so that it does not cover the whole memory
function descriptor addresses are also matched for future use by
dladdr, but reverse lookups of function descriptors via dladdr have
not been implemented yet. some revisions may be needed in the future
once reclaim_gaps supports fdpic, so that function descriptors
allocated in reclaimed heap space do not get detected as belonging to
the module whose gaps they were allocated in.