__tls_get_addr should not be called with an invalid TLS module id of
0. in practice it probably "works", returning the DTV length as if it
were a pointer, and the callback should probably not inspect
dlpi_tls_data in this case, but it's likely that some real-world
callbacks use a check on dlpi_tls_data being non-null, rather than on
dlpi_tls_modid being nonzero, to conclude that the module has TLS.
dl_iterate_phdr was wrongly reporting the address of the DSO's PT_TLS
image rather than the calling thread's instance of the TLS. the man
page, which is essentially normative for a nonstandard function of
this sort, clearly specifies the latter. it does not clarify where
exactly within/relative-to the image the pointer should point, but the
reasonable thing to do is match the ABI's DTP offset, and this seems
to be what other implementations do.
reportedly the GNU linker can emit such segments, causing spurious
failure to load due to mmap with a length of zero producing EINVAL.
no action is required for such a load map (it's effectively a nop in
the program headers table) so just treat it as always successful.
as the outcome of Austin Group tracker issue #62, future editions of
POSIX have dropped the requirement that fork be AS-safe. this allows
but does not require implementations to synchronize fork with internal
locks and give forked children of multithreaded parents a partly or
fully unrestricted execution environment where they can continue to
use the standard library (per POSIX, they can only portably use
up until recently, taking this allowance did not seem desirable.
however, commit 8ed2bd8bfcb4ea6448afb55a941f4b5b2b0398c0 exposed the
extent to which applications and libraries are depending on the
ability to use malloc and other non-AS-safe interfaces in MT-forked
children, by converting latent very-low-probability catastrophic state
corruption into predictable deadlock. dealing with the fallout has
been a huge burden for users/distros.
while it looks like most of the non-portable usage in applications
could be fixed given sufficient effort, at least some of it seems to
occur in language runtimes which are exposing the ability to run
unrestricted code in the child as part of the contract with the
programmer. any attempt at fixing such contracts is not just a
technical problem but a social one, and is probably not tractable.
this patch extends the fork function to take locks for all libc
singletons in the parent, and release or reset those locks in the
child, so that when the underlying fork operation takes place, the
state protected by these locks is consistent and ready for the child
to use. locking is skipped in the case where the parent is
single-threaded so as not to interfere with legacy AS-safety property
of fork in single-threaded programs. lock order is mostly arbitrary,
but the malloc locks (including bump allocator in case it's used) must
be taken after the locks on any subsystems that might use malloc, and
non-AS-safe locks cannot be taken while the thread list lock is held,
imposing a requirement that it be taken last.
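
the lock-around-fork pattern described above can be sketched as follows. this is an illustration only: the two locks, their names, and the spinlock implementation are assumptions standing in for libc-internal singletons, and the single-threaded shortcut is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for libc-internal singleton locks. */
static atomic_flag subsystem_lock = ATOMIC_FLAG_INIT;
static atomic_flag malloc_lock = ATOMIC_FLAG_INIT;

static void lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin; a real implementation would block */
}

static void unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Fork with all locks held so the child's copy of the protected state is
 * consistent. The malloc lock is taken after the lock of a subsystem that
 * might itself call malloc. */
pid_t fork_with_locks(void)
{
	lock(&subsystem_lock);
	lock(&malloc_lock);
	pid_t pid = fork();
	/* Runs in the parent and in the child: each process releases its
	 * own copy of the locks, in reverse order. */
	unlock(&malloc_lock);
	unlock(&subsystem_lock);
	return pid;
}

/* Fork, let the child take and release both locks, and return the child's
 * exit status (0 on success) or -1 on error. */
int child_can_use_locks(void)
{
	pid_t pid = fork_with_locks();
	if (pid < 0) return -1;
	if (pid == 0) {
		lock(&subsystem_lock);
		lock(&malloc_lock);
		unlock(&malloc_lock);
		unlock(&subsystem_lock);
		_exit(0);
	}
	int status;
	if (waitpid(pid, &status, 0) != pid) return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

without the locking in fork_with_locks, a child forked while another thread held one of the locks would inherit it in the locked state with no thread left to release it.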
this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.
care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
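
a sketch of the macro technique, with assumed names: internal_malloc is a hypothetical bump allocator standing in for a libc-internal malloc, and the undef_* names are deliberately never declared anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical internal allocator: a trivial bump allocator over a static
 * buffer. */
static _Alignas(max_align_t) unsigned char pool[4096];
static size_t pool_used;

void *internal_malloc(size_t n)
{
	size_t a = _Alignof(max_align_t);
	if (n > sizeof pool) return NULL;
	n = (n + a - 1) / a * a; /* keep later allocations aligned */
	if (n > sizeof pool - pool_used) return NULL;
	void *p = pool + pool_used;
	pool_used += n;
	return p;
}

/* Redirect all four names. Only malloc is meant to be used below; the
 * others map to names that are never declared, so a stray use fails to
 * build instead of silently reaching the public allocator. */
#define malloc internal_malloc
#define calloc undef_calloc
#define realloc undef_realloc
#define free undef_free

/* Code after the macros reaches the internal allocator. */
int *make_counter(void)
{
	int *c = malloc(sizeof *c);
	if (c) *c = 0;
	return c;
}

/* Report whether p points into the internal pool. */
int from_pool(const void *p)
{
	uintptr_t q = (uintptr_t)p, lo = (uintptr_t)pool;
	return q >= lo && q < lo + sizeof pool;
}
```

defining all four macros, including the unused ones, is what keeps a later edit that adds a call to free or realloc from quietly binding to the wrong allocator.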
the only place stdio was used here was for reading the ldso path file,
taking advantage of getdelim to automatically allocate and resize the
buffer. the motivation for use here was that, with shared libraries,
stdio is already available anyway and free to use. this has long been
a nuisance to users because getdelim's use of realloc here triggered a
valgrind bug, but removing it doesn't really fix that; on some archs
even calling the valgrind-interposed malloc at this point will crash.
the actual motivation for this change is moving towards getting rid of
use of application-provided malloc in parts of libc where it would be
called with libc-internal locks held, leading to the possibility of
deadlock if the malloc implementation doesn't follow unwritten rules
about which libc functions are safe for it to call. since getdelim is
required to produce a pointer as if by malloc (i.e. that can be passed
to realloc or free), it necessarily must use the public malloc.
instead of performing a realloc loop as the path file is read, first
query its size with fstat and allocate only once. this produces
slightly different truncation behavior when racing with writes to a
file, but neither behavior is or could be made safe anyway; on a live
system, ldso path files should be replaced by atomic rename only. the
change should also reduce memory waste.
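
a minimal sketch of the size-then-read approach, assuming a hypothetical helper named read_whole_fd (not the function used in the dynamic linker):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the whole file behind fd into one buffer
 * sized from fstat, NUL-terminated. Returns NULL on error. If the file
 * grows while being read, the excess is ignored; if it shrinks, the
 * result is shorter than st_size. */
char *read_whole_fd(int fd, size_t *len)
{
	struct stat st;
	assert(len != NULL);
	if (fstat(fd, &st) < 0 || st.st_size < 0) return NULL;
	size_t cap = (size_t)st.st_size;
	char *buf = malloc(cap + 1); /* single allocation, no realloc loop */
	if (!buf) return NULL;
	size_t n = 0;
	while (n < cap) {
		ssize_t r = read(fd, buf + n, cap - n);
		if (r < 0) {
			if (errno == EINTR) continue;
			free(buf);
			return NULL;
		}
		if (r == 0) break; /* file was truncated after fstat */
		n += (size_t)r;
	}
	buf[n] = 0;
	*len = n;
	return buf;
}
```

the truncation behavior noted above shows up here as the two race cases in the comment: bytes appended after the fstat are never seen, and a file that shrinks yields a short result.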
Otherwise lldb doesn't notice the new library and stack traces
containing it get cut off unhelpfully.
commit 188759bbee057aa94db2bbb7cf7f5855f3b9ab53 documented the intent
to allow recursive dlopen based on tracking ctor_visitor, but used a
kernel tid rather than the pthread_t to identify the caller. as a
result, it would not behave as intended under fork by a ctor, where
the child tid would not match.
queue_ctors should not be called with the init_fini_lock held, since
it may longjmp out on allocation failure. this introduces a minor
TOCTOU race with p->constructed, but one already exists further down
anyway, and by design it's okay to run through the queue more than
once anyway. the only reason we bother to check p->constructed at all
is to avoid spurious failure of dlopen when the library is already
fully loaded and constructed.
dtv_copy, canary2, and canary_at_end existed solely to match multiple
ABI and asm-accessed layouts simultaneously. now that pthread_arch.h
can be included before struct __pthread is defined, the struct layout
can depend on macros defined by pthread_arch.h.
this is in preparation for improving behavior of malloc interposition.
this eliminates consumers of malloc_impl.h outside of the malloc
as a result of commit b6a6cd703ffefa6352249fb01f4da28d85d17306,
the REL_NONE case is now redundant.
the bug fixed in commit b82cd6c78d812d38c31febba5a9e57dbaa7919c4 was
mostly masked on arm because __hwcap was zero at the point of the call
from the dynamic linker to __set_thread_area, causing the access to
libc.auxv to be skipped and kuser_helper versions of TLS access and
atomics to be used instead of the armv6 or v7 versions. however, on
kernels with kuser_helper removed for hardening it would crash.
since __set_thread_area potentially uses __hwcap, it must be
initialized before the function is called. move the AT_HWCAP lookup
from stage 3 to stage 2b.
at least gcc 9 broke execution of DT_INIT/DT_FINI for fdpic archs
(presently only sh) by recognizing that the stores to the
compound-literal function descriptor constructed to call them were
dead stores. there's no way to make a "may_alias function", so instead
launder the descriptor through an asm-statement barrier. in practice
just making the compound literal volatile seemed to have worked too,
but this should be less of a hack and more accurately convey the
semantics of what transformations are not valid.
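
one way such an asm-statement barrier could look, as a sketch in GNU C (GCC and Clang). the names store_barrier and struct fdesc_sketch are assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical barrier: an empty asm statement that takes the pointer as
 * an input and clobbers memory. The compiler has to assume the asm may
 * read the pointed-to object, so earlier stores to it cannot be dropped
 * as dead. No instructions are emitted. */
void store_barrier(void *p)
{
	__asm__ __volatile__("" : : "r"(p) : "memory");
}

/* Illustrative two-word function descriptor. */
struct fdesc_sketch { void *addr; void *got; };

/* Build a descriptor and pass it through the barrier before use. */
void *descriptor_addr_after_barrier(void *addr, void *got)
{
	struct fdesc_sketch d = { addr, got };
	store_barrier(&d);
	return d.addr;
}
```

unlike a volatile qualifier on the object, the barrier constrains only the one point where the descriptor escapes to code the compiler cannot see.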
commit 1c84c99913bf1cd47b866ed31e665848a0da84a2 moved the call to
__init_tp above the initialization of libc.auxv, inadvertently
breaking archs where __set_thread_area examines auxv for the sake of
determining the TLS/atomic model needed at runtime. this broke armv6
this interface contract is entirely internal to dynlink.c.
if symbols are being redirected to provide the new time64 ABI, dlsym
must perform matching redirections; otherwise, it would poke a hole in
the magic and return pointers to functions that are not safe to call
from a caller using time64 types.
rather than duplicating a table of redirections, use the time64
symbols present in libc's symbol table to derive the decision for
whether a particular symbol needs to be redirected.
commit ffab43602b5900c86b7040abdda8ccf6cdec95f5 broke this by moving
relocations after not only the allocation of storage for the main
thread's static TLS, but after the copying of the TLS image. thus,
relocation results were not reflected in the main thread's copy. this
could be fixed by calling __reset_tls after relocations, but instead
split the allocation and installation before/after relocations so that
there's not a redundant copy.
due to commit 71af5309874269bcc9e4b84ea716fab33d888c1d, updating of
static_tls_cnt needs to be kept with allocation of static TLS, before
relocations, rather than after installation.
Using common code path for all symbol lookups fixes three dlsym issues:
- st_shndx of STT_TLS symbols was not checked and thus an undefined
tls symbol reference could be incorrectly treated as a definition
(the sysv hash lookup returns undefined symbols, gnu does not, so should
be rare in practice).
- symbol binding was not checked so a hidden symbol may be returned
(in principle STB_LOCAL symbols may appear in the dynamic symbol table
for hidden symbols, but linkers most likely don't produce it).
- mips specific behaviour was not applied (ARCH_SYM_REJECT_UND) so
undefined symbols may be returned on mips.
always_inline is used to avoid relocation performance regression; the
code generation for find_sym should not be affected.
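
The checks listed above can be sketched as a single predicate. This is a simplified illustration, not the find_sym code: the function name is an assumption, the accepted types and bindings are a reduced set, and the mips-specific rule is left out.

```c
#include <assert.h>
#include <elf.h>

/* Hypothetical predicate: decide whether a dynamic symbol table entry may
 * be returned as a definition. */
int sym_is_definition(const Elf64_Sym *sym)
{
	unsigned type = ELF64_ST_TYPE(sym->st_info);
	unsigned bind = ELF64_ST_BIND(sym->st_info);

	assert(sym != 0);
	/* Undefined entries are references, not definitions; this applies
	 * to STT_TLS symbols as well. */
	if (sym->st_shndx == SHN_UNDEF) return 0;
	/* Local binding must never satisfy a lookup from outside. */
	if (bind != STB_GLOBAL && bind != STB_WEAK) return 0;
	/* Accept only types that name something addressable. */
	switch (type) {
	case STT_NOTYPE: case STT_OBJECT: case STT_FUNC:
	case STT_COMMON: case STT_TLS:
		return 1;
	}
	return 0;
}
```

Routing every lookup through one such predicate is what keeps dlsym and relocation processing from disagreeing about which entries count as definitions.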
commit 7a9669e977e5f750cf72ccbd2614f8b72ce02c4c added use of the
symbol reference as the definition, in place of performing a lookup,
for STT_SECTION symbol references that were first found used in FDPIC.
such references may happen in certain other cases, such as
local-dynamic TLS and with relocation types that require a symbol but
that are being used for non-symbolic purposes, like the powerpc
unaligned address relocations.
in all such cases I'm aware of, the symbol referenced is a section
symbol (STT_SECTION); however, the important semantic property is not
its being a section, but rather its binding local (STB_LOCAL). check
the latter instead of the former for greater generality and semantic
R_PPC_UADDR32 (R_PPC64_UADDR64) has the same meaning as R_PPC_ADDR32
(R_PPC64_ADDR64), except that its address need not be aligned. For
powerpc64, BFD ld(1) will automatically convert between ADDR<->UADDR
relocations when the address is/isn't at its native alignment. This
will happen if, for example, there is a pointer in a packed struct.
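
A sketch of how a relocation result can be written to a field that may not be naturally aligned. The helper names are assumptions for illustration; whether a given loader uses memcpy for this is not stated here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write a 64-bit relocation result at an address
 * that may not be 8-byte aligned. memcpy avoids the undefined behavior of
 * dereferencing a misaligned uint64_t pointer. */
void store_addr64(void *where, uint64_t value)
{
	assert(where != 0);
	memcpy(where, &value, sizeof value);
}

/* Matching read, for checking the result. */
uint64_t load_addr64(const void *where)
{
	uint64_t v;
	memcpy(&v, where, sizeof v);
	return v;
}
```

For example, a pointer stored at byte offset 1 of a packed struct is written without touching the bytes on either side of the 8-byte field.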
gold and lld do not currently generate R_PPC64_UADDR64, but pass
through misaligned R_PPC64_ADDR64 relocations from object files,
possibly relaxing them to misaligned R_PPC64_RELATIVE. In both cases
(relaxed or not) this violates the PSABI, which defines the relevant
field type as "a 64-bit field occupying 8 bytes, the alignment of
which is 8 bytes unless otherwise specified."
All three linkers violate the PSABI on 32-bit powerpc, where the only
difference is that the field is 32 bits wide, aligned to 4 bytes.
Currently musl fails to load executables linked by BFD ld containing
R_PPC64_UADDR64, with the error "unsupported relocation type 43".
This change provides compatibility with BFD ld on powerpc64, and any
static linker on either architecture that starts following the PSABI
as a result of commit ffab43602b5900c86b7040abdda8ccf6cdec95f5,
static_tls_cnt is now valid during relocations at program startup, so
it's no longer necessary to condition the check against static_tls_cnt
on this being a runtime (dlopen) relocation.
this is analogous to commit 2f1f51ae7b2d78247568e7fdb8462f3c19e469a4,
and should have been caught at the same time since it was right next
to the code moved in that commit. between final stage 3 reloc_all and
the jump to the main program's entry point, it is not valid to call
any functions which may be interposed by the application; doing so
results in execution of application code before ctors have run, and on
fdpic archs, before the main program's fdpic self-fixups have taken
place, which will produce runaway wrong execution.
commit c8b49b2fbc7faa8bf065220f11963d76c8a2eb93 introduced code that
checked bestsym to determine whether a matching symbol was found, but
bestsym is uninitialized if not. instead use best, consistent with use
in the rest of the function.
simplified from bug report and patch by Cheng Liu.
after commit a48ccc159a5fa061a18419296100ee48a1cd6cc9 removed the use
of _Noreturn on the stage3_func type (which only worked due to it
being defined to the "GNU C" attribute in C99 mode), GCC could no
longer assume that the ends of __dls2 and __dls2b are unreachable, and
produced a warning that a function marked _Noreturn returns.
also, since commit 4390383b32250a941ec616e8bff6f568a801b1c0, the
_Noreturn declaration for __libc_start_main in crt1/rcrt1 has been not
only inconsistent with the definition, but wrong. formally,
__libc_start_main does return, via a (hopefully) tail call to a helper
function after the barrier. incorrect usage of _Noreturn in the
declaration was probably formal UB.
the _Noreturn specifiers were not useful in any of these places, so
remove them all. now, the only remaining usage of _Noreturn is in
public interfaces where _Noreturn is part of their contract.
currently the bfd linker does not seem to create tls segments where
p_vaddr%p_align != 0, but this is valid in ELF and then the runtime
computed tls offset must satisfy
    offset%p_align == (base+p_vaddr)%p_align
and in case of local exec tls (main executable) the smallest such
offset must be used (otherwise it is incompatible with the offset
computed by the static linker). the !TLS_ABOVE_TP case is handled
correctly (the offset is negative then in the formula).
the ldso code for TLS_ABOVE_TP is changed so the static tls offset
of each module satisfies the formula.
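
the congruence can be checked with a small helper. this is a sketch under stated assumptions: the function name is hypothetical, addr stands for base+p_vaddr, start is the first free offset, and align (p_align) is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest offset >= start such that
 * offset % align == addr % align. */
size_t tls_congruent_offset(size_t start, size_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	/* (addr - start) mod align is the distance from start up to the
	 * next offset with the required residue; unsigned wraparound keeps
	 * the subtraction well defined when addr < start. */
	return start + ((addr - start) & (align - 1));
}
```

with start 16, addr 0x1008 and align 16 the result is 24: plain alignment would have produced 16 or 32, neither of which has the residue 8 that the segment's address requires.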
tls_offset should always point to the end of the allocated static tls
area, but this was not handled correctly on "tls variant 1" targets
in the dynamic linker:
after application tls was allocated, tls_offset was aligned up,
potentially wasting tls space. (alignment may be needed at the
beginning of the tls area, not at the end, but that will be fixed
separately as it is unlikely to affect real binaries.)
when static tls was allocated for a shared library, tls_offset was
only updated with the size of the tls segment which does not include
alignment gaps, which can easily happen if the tls size update for
one library leaves tls_offset misaligned for the next one. this can
cause oob access in __copy_tls or arbitrary breakage at tls access.
(the issue was observed on aarch64 with rust binaries)
maintainer's note: commit 9d44b6460ab603487dab4d916342d9ba4467e6b9
removed their use.
this is the first part of a series of patches intended to make
__syscall fully self-contained in the object file produced using
syscall.h, which will make it possible for crt1 code to perform
the (confusingly named) i386 __vsyscall mechanism, which this commit
removes, was introduced before the presence of a valid thread pointer
was mandatory; back then the thread pointer was set up lazily only if
threads were used. the intent was to be able to perform syscalls using
the kernel's fast entry point in the VDSO, which can use the sysenter
(Intel) or syscall (AMD) instruction instead of int $128, but without
inlining an access to the __syscall global at the point of each
syscall, which would incur a significant size cost from PIC setup
everywhere. the mechanism also shuffled registers/calling convention
around to avoid spills of call-saved registers, and to avoid
allocating ebx or ebp via asm constraints, since there are plenty of
broken-but-supported compiler versions which are incapable of
allocating ebx with -fPIC or ebp with -fno-omit-frame-pointer.
the new mechanism preserves the properties of avoiding spills and
avoiding allocation of ebx/ebp in constraints, but does it inline,
using some fairly simple register shuffling, and uses a field of the
thread structure rather than global data for the vdso-provided syscall
for now, the external __syscall function is refactored not to use the
old __vsyscall so it can be kept, but the intent is to remove it too.
this affected the error path where dlopen successfully found and
loaded the requested dso and all its dependencies, but failed to
resolve one or more relocations, causing the operation to fail after
storage for the ctor queue was allocated.
commit 188759bbee057aa94db2bbb7cf7f5855f3b9ab53 wrongly put the free
for the ctor_queue array in the error path inside a loop over each
loaded dso that needed to be backed-out, rather than just doing it
once. in addition, the exit path also observed the ctor_queue pointer
still being nonzero, and would attempt to call ctors on the backed-out
dsos unless the double-free crashed the process first.
together with the previous two commits, this completes restoration of
the property that dynamic-linked apps with no external deps and no tls
have no failure paths before entry.
neither has or can have any dependencies, but since commit
403555690775f7c8806372644f543518e6664e3b, gratuitous zero-length deps
arrays were being allocated for them. use a dummy array instead.
traditionally, we've provided a guarantee that dynamic-linked
applications with no external dependencies (nothing but libc) and no
thread-local storage have no failure paths before the entry point.
normally, thanks to reclaim_gaps, such a malloc will not require a
syscall anyway, but if segment alignment is unlucky, it might. use a
builtin array for this common special case.
in the case where malloc is being replaced, it's not valid to call
malloc between final relocations and main app's crt1 entry point; on
fdpic archs the main app's entry point will not yet have performed the
self-fixups necessary to call its code.
to fix, reorder queue_ctors before final relocations. an alternative
solution would be doing the allocation from __libc_start_init, after
the entry point but before any ctors run. this is less desirable,
since it would leave a call to malloc that might be provided by the
application happening at startup when doing so can be easily avoided.
previously, going way back, there was simply no synchronization here.
a call to exit concurrent with ctor execution from dlopen could cause
a dtor to execute concurrently with its corresponding ctor, or could
cause dtors for newly-constructed libraries to be skipped.
introduce a shutting_down state that blocks further ctor execution,
producing the quiescence the dtor execution loop needs to ensure any
kind of consistency, and that blocks further calls to dlopen so that a
call into dlopen from a dtor cannot deadlock.
better approaches to some of this may be possible, but the changes
here at least make things safe.
previously, shared library constructors at program start and dlopen
time were executed in reverse load order. some libraries, however,
rely on a depth-first dependency order, which most other dynamic
linker implementations provide. this is a much more reasonable, less
arbitrary order, and it turns out to have much better properties with
regard to how slow-running ctors affect multi-threaded programs, and
how recursive dlopen behaves.
this commit builds on previous work tracking direct dependencies of
each dso (commit 403555690775f7c8806372644f543518e6664e3b), and
performs a topological sort on the dependency graph at load time while
the main ldso lock is held and before success is committed, producing
a queue of constructors needed by the newly-loaded dso (or main
application). in the case of circular dependencies, the dependency
chain is simply broken at points where it becomes circular.
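
the ordering described above can be sketched as a depth-first post-order walk. the struct and function names are assumptions, recursion is used only for brevity, and the caller is assumed to supply a queue with room for every library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a loaded library. */
struct lib {
	const char *name;
	struct lib **deps; /* NULL-terminated direct dependencies, or NULL */
	int mark;          /* 0 = unvisited, 1 = visited or in progress */
};

/* Depth-first post-order walk: every dependency is queued before the
 * library that needs it. A library that is already marked is skipped,
 * which also breaks the chain at the point where a cycle would close.
 * Returns the new queue length. */
size_t queue_ctors_sketch(struct lib *l, struct lib **queue, size_t n)
{
	assert(l != NULL && queue != NULL);
	if (l->mark) return n;
	l->mark = 1;
	for (struct lib **d = l->deps; d && *d; d++)
		n = queue_ctors_sketch(*d, queue, n);
	queue[n++] = l;
	return n;
}
```

for an application that needs a and b, where both need c and c in turn names a, the queue comes out as c, a, b, app: the cycle from c back to a is cut because a is already marked when c is visited.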
when the ctor queue is run, the init_fini_lock is held only for
iteration purposes; it's released during execution of each ctor, so
that arbitrarily-long-running application code no longer runs with a
lock held in the caller. this prevents a dlopen with slow ctors in one
thread from arbitrarily delaying other threads that call dlopen.
fully-independent ctors can run concurrently; when multiple threads
call dlopen with a shared dependency, one will end up executing the
ctor while the other waits on a condvar for it to finish.
another corner case improved by these changes is recursive dlopen
(call from a ctor). previously, recursive calls to dlopen could cause
a ctor for a library to be executed before the ctor for its
dependency, even when there was no relation between the calling
library and the library it was loading, simply due to the naive
reverse-load-order traversal. now, we can guarantee that recursive
dlopen in non-circular-dependency usage preserves the desired ctor
execution order properties, and that even in circular usage, at worst
the libraries whose ctors call dlopen will fail to have completed
construction when ctors that depend on them run.
init_fini_lock is changed to a normal, non-recursive mutex, since it
is no longer held while calling back into application code.
this makes calling dlsym on the main app more consistent with the
global symbol table (load order), and is a prerequisite for
dependency-order ctor execution to work correctly with LD_PRELOAD.
commit 403555690775f7c8806372644f543518e6664e3b introduced runtime
realloc of an array that may have been allocated before symbols were
resolved outside of libc, which is invalid if the allocator has been
replaced. track this condition and manually copy if needed.
dlsym with an explicit handle is specified to use "dependency order",
a breadth-first search rooted at the argument. this has always been
implemented by iterating a flattened dependency list built at dlopen
time. however, the logic for building this list was completely wrong
except in trivial cases; it simply used the list of libraries loaded
since a given library, and their direct dependencies, as that
library's dependencies, which could result in misordering, wrongful
omission of deep dependencies from the search, and wrongful inclusion
of unrelated libraries in the search.
further, libraries did not have any recorded list of resolved
dependencies until they were explicitly dlopened, meaning that
DT_NEEDED entries had to be resolved again whenever a library
participated as a dependency of more than one dlopened library.
with this overhaul, the resolved direct dependency list of each
library is always recorded when it is first loaded, and can be
extended to a full flattened breadth-first search list if dlopen is
called on the library. the extension is performed using the direct
dependency list as a queue and appending copies of the direct
dependency list of each dependency in the queue, excluding duplicates,
until the end of the queue is reached. the direct deps remain
available for future use as the initial subarray of the full deps
first-load logic in dlopen is updated to match these changes, and
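
the queue-based extension described above can be sketched as follows. the struct, the function name, and the fixed-capacity output array are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a loaded module. */
struct mod {
	const char *name;
	struct mod **direct; /* NULL-terminated direct dependencies, or NULL */
};

/* Build the flattened breadth-first list for root in out (capacity cap).
 * The list starts as a copy of root's direct dependencies and doubles as
 * the work queue: each entry's direct dependencies are appended, skipping
 * anything already present, until the read index reaches the end.
 * Returns the number of entries, or (size_t)-1 if cap is too small. */
size_t flatten_deps(struct mod *root, struct mod **out, size_t cap)
{
	size_t n = 0;
	assert(root != NULL && out != NULL);
	for (struct mod **d = root->direct; d && *d; d++) {
		if (n == cap) return (size_t)-1;
		out[n++] = *d;
	}
	for (size_t i = 0; i < n; i++) {
		for (struct mod **d = out[i]->direct; d && *d; d++) {
			size_t j = 0;
			while (j < n && out[j] != *d) j++;
			if (j < n) continue; /* duplicate, already queued */
			if (n == cap) return (size_t)-1;
			out[n++] = *d;
		}
	}
	return n;
}
```

because entries are only ever appended, the direct dependencies stay in place at the front of the array, which matches the initial-subarray property mentioned above.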
code introduced in commit 9d44b6460ab603487dab4d916342d9ba4467e6b9
wrongly attempted to read past the end of the currently-installed dtv
to determine if a dso provides new, not-already-installed tls. this
logic was probably leftover from an earlier draft of the code that
wrongly installed the new dtv before populating it.
it would work if we instead queried the new, not-yet-installed dtv,
but instead, replace the incorrect check with a simple range check
against old_cnt. this also catches modules that have no tls at all
with a single condition.
code introduced in commit 9d44b6460ab603487dab4d916342d9ba4467e6b9
wrongly assumed the dso list tail was the right place to find new dtv
storage. however, this is only true if the last-loaded dependency has
tls. the correct place to get it is the dso corresponding to the tls
module list tail. introduce a container_of macro to get it, and use
ultimately, dynamic tls allocation should be refactored so that this
is not an issue. there is no reason to be allocating new dtv space at
each load_library; instead it could happen after all new libraries
have been loaded but before they are committed. such changes may be
made later, but this commit fixes the present regression.
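
for reference, a common formulation of a container_of macro, with hypothetical simplified types showing how a tls module list node maps back to the dso that embeds it. the struct layouts and the helper name are assumptions, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Given a pointer p to member m of a struct of type t, recover a pointer
 * to the enclosing struct. */
#define container_of(p, t, m) ((t *)((char *)(p) - offsetof(t, m)))

/* Hypothetical, simplified types for illustration. */
struct tls_module_sketch {
	struct tls_module_sketch *next;
	size_t size;
};

struct dso_sketch {
	const char *name;
	struct tls_module_sketch tls; /* embedded list node */
	void *new_dtv;
};

/* Map the tail of the tls module list back to its owning dso. */
struct dso_sketch *dso_of_tls_tail(struct tls_module_sketch *tail)
{
	assert(tail != NULL);
	return container_of(tail, struct dso_sketch, tls);
}
```

this only works when the node really is embedded in a struct of the named type, which holds by construction when every list node lives inside a dso.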
the motivation for this change is twofold. first, it gets the fallback
logic out of the dynamic linker, improving code readability and
organization. second, it provides application code that wants to use
the membarrier syscall, which depends on preregistration of intent
before the process becomes multithreaded unless unbounded latency is
acceptable, with a symbol that, when linked, ensures that this
commit 9d44b6460ab603487dab4d916342d9ba4467e6b9 inadvertently
contained leftover logic from a previous approach to the fallback
signaling loop. it had no adverse effect, since j was always nonzero
if the loop body was reachable, but it makes no sense to be there with
the current approach to avoid signaling self.
previously, dynamic loading of new libraries with thread-local storage
allocated the storage needed for all existing threads at load-time,
precluding late failure that can't be handled, but left installation
in existing threads to take place lazily on first access. this imposed
an additional memory access and branch on every dynamic tls access,
and imposed a requirement, which was not actually met, that the
dynamic tlsdesc asm functions preserve all call-clobbered registers
before calling C code to install new dynamic tls on first access.
the x86[_64] versions of this code wrongly omitted saving and
restoring of fpu/vector registers, assuming the compiler would not
generate anything using them in the called C code. the arm and aarch64
versions saved known existing registers, but failed to be future-proof
against expansion of the register file.
now that we track live threads in a list, it's possible to install the
new dynamic tls for each thread at dlopen time. for the most part,
synchronization is not needed, because if a thread has not
synchronized with completion of the dlopen, there is no way it can
meaningfully request access to a slot past the end of the old dtv,
which remains valid for accessing slots which already existed.
however, it is necessary to ensure that, if a thread sees its new dtv
pointer, it sees correct pointers in each of the slots that existed
prior to the dlopen. my understanding is that, on most real-world
coherency architectures including all the ones we presently support, a
built-in consume order guarantees this; however, don't rely on that.
instead, the SYS_membarrier syscall is used to ensure that all threads
see the stores to the slots of their new dtv prior to the installation
of the new dtv. if it is not supported, the same is implemented in
userspace via signals, using the same mechanism as __synccall.
the __tls_get_addr function, variants, and dynamic tlsdesc asm
functions are all updated to remove the fallback paths for claiming
new dynamic tls, and are now all branch-free.
commit a603a75a72bb469c6be4963ed1b55fabe675fe15 removed attribute
const from __errno_location and pthread_self, and the same reasoning
forced arch definitions of __pthread_self to use volatile asm,
significantly impacting code generation and imposing manual caching of
pointers where the impact might be noticeable.
reorder the thread pointer setup and place it across a strong barrier
(symbolic function lookup) so that there is no assumed ordering
between the initialization and the accesses to the thread pointer in
the placement triggered -Wmisleading-indentation warnings if enabled,
and was gratuitously confusing to anyone reading the code.
commit 6ba5517a460c6c438f64d69464fdfc3269a4c91a modified
__tls_get_addr to offset the address by +DTP_OFFSET (0x8000 on
powerpc, mips, etc.) and adjusted the result of DTPREL relocations by
-DTP_OFFSET to compensate, but missed changing the argument setup for
calls to __tls_get_addr from dlsym.
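
the arithmetic behind the mismatch, as a sketch: the names below are hypothetical models of the two sides, and 0x8000 is the value quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define DTP_OFFSET_SKETCH 0x8000 /* value quoted for powerpc, mips, etc. */

/* Offset as a DTPREL relocation would store it: biased by -DTP_OFFSET. */
uintptr_t dtprel_value(uintptr_t offset_in_module)
{
	return offset_in_module - DTP_OFFSET_SKETCH;
}

/* Model of the lookup side: adds the bias back on top of the module's
 * tls block address. */
uintptr_t tls_addr(uintptr_t module_block, uintptr_t biased_offset)
{
	return module_block + biased_offset + DTP_OFFSET_SKETCH;
}
```

a caller that passes a raw, unbiased offset, as the dlsym argument setup did, ends up 0x8000 bytes past the intended address: tls_addr(0x10000, 0x20) gives 0x18020 rather than 0x10020.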