the motivation for this change is twofold. first, it gets the fallback
logic out of the dynamic linker, improving code readability and
organization. second, it provides application code that wants to use
the membarrier syscall, which depends on preregistration of intent
before the process becomes multithreaded unless unbounded latency is
acceptable, with a symbol that, when linked, ensures that this
previously, dynamic loading of new libraries with thread-local storage
allocated the storage needed for all existing threads at load-time,
precluding late failure that can't be handled, but left installation
in existing threads to take place lazily on first access. this imposed
an additional memory access and branch on every dynamic tls access,
and imposed a requirement, which was not actually met, that the
dynamic tlsdesc asm functions preserve all call-clobbered registers
before calling C code to install new dynamic tls on first access.
the x86[_64] versions of this code wrongly omitted saving and
restoring of fpu/vector registers, assuming the compiler would not
generate anything using them in the called C code. the arm and aarch64
versions saved known existing registers, but failed to be future-proof
against expansion of the register file.
now that we track live threads in a list, it's possible to install the
new dynamic tls for each thread at dlopen time. for the most part,
synchronization is not needed, because if a thread has not
synchronized with completion of the dlopen, there is no way it can
meaningfully request access to a slot past the end of the old dtv,
which remains valid for accessing slots which already existed.
however, it is necessary to ensure that, if a thread sees its new dtv
pointer, it sees correct pointers in each of the slots that existed
prior to the dlopen. my understanding is that, on most real-world
coherency architectures including all the ones we presently support, a
built-in consume order guarantees this; however, don't rely on that.
instead, the SYS_membarrier syscall is used to ensure that all threads
see the stores to the slots of their new dtv prior to the installation
of the new dtv. if it is not supported, the same is implemented in
userspace via signals, using the same mechanism as __synccall.
the __tls_get_addr function, variants, and dynamic tlsdesc asm
functions are all updated to remove the fallback paths for claiming
new dynamic tls, and are now all branch-free.
the __synccall mechanism provides stop-the-world synchronous execution
of a callback in all threads of the process. it is used to implement
multi-threaded setuid/setgid operations, since Linux lacks them at the
kernel level, and for some other less-critical purposes.
this change eliminates dependency on /proc/self/task to determine the
set of live threads, which in addition to being an unwanted dependency
and a potential point of resource-exhaustion failure, turned out to be
inaccurate. test cases provided by Alexey Izbyshev showed that it
could fail to reflect newly created threads. due to how the
presignaling phase worked, this usually yielded a deadlock if hit, but
in the worst case it could also result in threads being silently
missed (allowed to continue running without executing the callback).
the hard problem here is unlinking threads from a list when they exit
without creating a window of inconsistency where the kernel task for a
thread still exists and is still executing instructions in userspace,
but is not reflected in the list. the magic solution here is getting
rid of per-thread exit futex addresses (set_tid_address), and instead
using the exit futex to unlock the global thread list.
since pthread_join can no longer see the thread enter a detach_state
of EXITED (which depended on the exit futex address pointing to the
detach_state), it must now observe the unlocking of the thread list
lock before it can unmap the joined thread and return. it doesn't
actually have to take the lock. for this, a __tl_sync primitive is
offered, with a signature that will allow it to be enhanced for quick
return even under contention on the lock, if needed. for now, the
exiting thread always performs a futex wake on its detach_state. a
future change could optimize this out except when there is already a
initial/dynamic variants of detached state no longer need to be
tracked separately, since the futex address is always set to the
global list lock, not a thread-local address that could become invalid
on detached thread exit. all detached threads, however, must perform a
second sigprocmask syscall to block implementation-internal signals,
since locking the thread list with them already blocked is not
the arch-independent C version of __unmapself no longer needs to take
a lock or set up its own futex address to release the lock, since it
must necessarily be called with the thread list lock already held,
guaranteeing exclusive access to the temporary stack.
changes to libc.threads_minus_1 no longer need to be atomic, since
they are guarded by the thread list lock. it is largely vestigial at
this point, and can be replaced with a cheaper boolean indicating
whether the process is multithreaded at some point in the future.
whether signals need to be blocked at thread start, and whether
unblocking is necessary in the entry point function, has historically
depended on intricacies of the cancellation design and on whether
there are scheduling operations to perform on the new thread before
its successful creation can be committed. future changes to track an
AS-safe list of live threads will require signals to be blocked
whenever changes are made to the list, so ...
prior to commits b8742f32602add243ee2ce74d804015463726899 and
40bae2d32fd6f3ffea437fa745ad38a1fe77b27e, a signal mask for the entry
function to restore was part of the pthread structure. it was removed
to trim down the size of the structure, which both saved a small
amount of stack space and improved code generation on archs where
small immediate displacements are less costly than arbitrary ones, by
limiting the range of offsets between the base of the thread
structure, its members, and the thread pointer. these commits moved
the saved mask to a special structure used only when special
scheduling was needed, in which case the pthread_create caller and new
thread had to synchronize with each other and could use this memory to
pass a mask.
this commit partially reverts the above two commits, but instead of
putting the mask back in the pthread structure, it moves all "start
argument" members out of the pthread structure, trimming it down
further, and puts them in a separate structure passed on the new
thread's stack. the code path for explicit scheduling of the new
thread is also changed to synchronize with the calling thread in such
a way as to avoid spurious futex wakes.
prior to linux 2.6.22, futex wait could fail with EINTR even for
non-interrupting (SA_RESTART) signals. this was no problem provided
the caller simply restarted the wait, but sem_[timed]wait is required
by POSIX to return when interrupted by a signal. commit
a113434cd68ce30642c4995b1caadcd084be6f09 introduced this behavior, and
commit c0ed5a201b2bdb6d1896064bec0020c9973db0a1 reverted it based on a
mistaken belief that it was not required. this belief stems from a bug
in the specification: the description requires the function to return
when interrupted, but the errors section marks EINTR as a "may fail"
condition rather than a "shall fail" one.
since there does seem to be significant value in the change made in
commit c0ed5a201b2bdb6d1896064bec0020c9973db0a1, making it so that
programs that call sem_wait without checking for EINTR don't silently
make forward progress without obtaining the semaphore or treat it as a
fatal error and abort, add a behind-the-scenes mechanism in the
__timedwait backend to suppress EINTR in programs that have never
installed interrupting signal handlers, and have sigaction track and
report this state. this way the semaphore code is not cluttered by
workarounds and can be updated (to be done in next commit) to reflect
the high-level logic for conforming behavior.
these changes are based loosely on a patch by Markus Wichmann, with
the main changes being atomic update to flag object and moving the
workaround from sem_timedwait to the __timedwait futex backend.
as explained in commit 6ba5517a460c6c438f64d69464fdfc3269a4c91a, some
archs use an offset (typically -0x8000) with their DTPOFF relocations,
which __tls_get_addr needs to invert. on affected archs, which lack
direct support for large immediates, this can cost multiple extra
instructions in the hot path. instead, incorporate the DTP_OFFSET into
the DTV entries. this means they are no longer valid pointers, so
store them as an array of uintptr_t rather than void *; this also
makes it easier to access slot 0 as a valid slot count.
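a minimal sketch of the difference, not the actual libc code: the names
lookup_plain, lookup_biased and install_biased are hypothetical, and
DTP_OFFSET is assumed to be 0x8000 (the inverse of the "typically
-0x8000" relocation bias). the old scheme undoes the bias on every
lookup; the new one folds it into the slot once, at install time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: relocations are biased by -0x8000, so the
 * lookup has to add 0x8000 back somewhere. */
#define DTP_OFFSET 0x8000

/* Argument passed to the lookup: module id and (biased) offset. */
struct tls_index { size_t module; size_t offset; };

/* Old scheme: slots hold real pointers; the bias is undone per call. */
static void *lookup_plain(void *const *dtv, const struct tls_index *ti)
{
	return (void *)((uintptr_t)dtv[ti->module] + ti->offset + DTP_OFFSET);
}

/* New scheme: the bias is already part of each slot, so slots are plain
 * integers (not valid pointers) and the lookup is a single add. */
static void *lookup_biased(const uintptr_t *dtv, const struct tls_index *ti)
{
	return (void *)(dtv[ti->module] + ti->offset);
}

/* Install a module's TLS block into a biased slot. */
static void install_biased(uintptr_t *dtv, size_t module, void *block)
{
	dtv[module] = (uintptr_t)block + DTP_OFFSET;
}
```

both lookups yield the same address for the same argument; only the
place where the 0x8000 is added differs. slot 0 can hold a plain slot
count in the integer array without any pointer cast.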
commit e75b16cf93ebbc1ce758d3ea6b2923e8b2457c68 left behind cruft in
two places, __reset_tls and __tls_get_new, from back when it was
possible to have uninitialized gap slots indicated by a null pointer
in the DTV. since the concept of null pointer is no longer meaningful
with an offset applied, remove this cruft.
presently there are no archs with both TLSDESC and nonzero DTP_OFFSET,
but the dynamic TLSDESC relocation code is also updated to apply an
inverted offset to its offset field, so that the offset DTV would not
impose a runtime cost in TLSDESC resolver functions.
stack size default is increased from 80k to 128k. this coincides with
Linux's hard-coded default stack for the main thread (128k is
initially committed; growth beyond that up to ulimit is contingent on
additional allocation succeeding) and GNU ld's default PT_GNU_STACK
size for FDPIC, at least on sh.
guard size default is increased from 4k to 8k to reduce the risk of
guard page jumping on overflow, since use of just over 4k of stack is
common (PATH_MAX buffers, etc.).
limit to 8MB/1MB, respectively. since the defaults cannot be reduced
once increased, excessively large settings would lead to an
unrecoverably broken state. this change is in preparation to allow
defaults to be increased via program headers at the linker level.
creation of threads that really need larger sizes needs to be done
with an explicit attribute.
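the intended behavior can be sketched as follows. this is an
illustration only: raise_defaults and the two variables are
hypothetical names, the starting values are borrowed from the
neighbouring message about 128k/8k defaults, and only the 8MB/1MB caps
come from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_STACK_MAX ((size_t)8 << 20)  /* 8MB cap from the text */
#define DEFAULT_GUARD_MAX ((size_t)1 << 20)  /* 1MB cap from the text */

/* Hypothetical process-wide defaults. */
static size_t default_stacksize = 128 * 1024;
static size_t default_guardsize = 8 * 1024;

static size_t clamp_to(size_t v, size_t max) { return v < max ? v : max; }

/* Raise the defaults toward the requested sizes. They never decrease
 * and never exceed the caps, so a bad request cannot leave the process
 * stuck with unusably large defaults. */
static void raise_defaults(size_t stack, size_t guard)
{
	stack = clamp_to(stack, DEFAULT_STACK_MAX);
	guard = clamp_to(guard, DEFAULT_GUARD_MAX);
	if (stack > default_stacksize) default_stacksize = stack;
	if (guard > default_guardsize) default_guardsize = guard;
}
```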
per POSIX, deletion of a key for which some threads still have values
stored is permitted, and newly created keys must initially hold the
null value in all threads. these properties were not met by our
implementation; if a key was deleted with values left and a new key
was created in the same slot, the old values were still visible.
moreover, due to lack of any synchronization in pthread_key_delete,
there was a TOCTOU race whereby a concurrent pthread_exit could
attempt to call a null destructor pointer for the newly orphaned
this commit introduces a solution based on __synccall, stopping the
world to zero out the values for deleted keys, but only does so lazily
when all key slots have been exhausted. pthread_key_delete is split
off into a separate translation unit so that static-linked programs
which only create keys but never delete them will not pull in the
a global rwlock is added to synchronize creation and deletion of keys
with dtor execution. since the dtor execution loop now has to release
and retake the lock around its call to each dtor, checks are made not
to call the nodtor dummy function for keys which lack a dtor.
pthread_atfork.c does not actually include pthread_impl.h and has no
reason to, so it wasn't getting the declaration. move it to libc.h
which is already included by both fork.c and pthread_atfork.c. this
makes more sense anyway since the function has little to do with
pthreads aside from the name.
these were overlooked for various reasons in earlier stages.
commits leading up to this one have moved the vast majority of
libc-internal interface declarations to appropriate internal headers,
allowing them to be type-checked and setting the stage to limit their
visibility. the ones that have not yet been moved are mostly
namespace-protected aliases for standard/public interfaces, which
exist to facilitate implementing plain C functions in terms of POSIX
functionality, or C or POSIX functionality in terms of extensions that
are not standardized. some don't quite fit this description, but are
"internally public" interfaces between subsystems of libc.
rather than create a number of newly-named headers to declare these
functions, and having to add explicit include directives for them to
every source file where they're needed, I have introduced a method of
wrapping the corresponding public headers.
parallel to the public headers in $(srcdir)/include, we now have
wrappers in $(srcdir)/src/include that come earlier in the include
path order. they include the public header they're wrapping, then add
declarations for namespace-protected versions of the same interfaces
and any "internally public" interfaces for the subsystem they
along these lines, the wrapper for features.h is now responsible for
the definition of the hidden, weak, and weak_alias macros. this means
source files will no longer need to include any special headers to
access these features.
over time, it is my expectation that the scope of what is "internally
public" will expand, reducing the number of source files which need to
include *_impl.h and related headers down to those which are actually
implementing the corresponding subsystems, not just using them.
this is not a public interface, and does not even necessarily match
the syscall on all archs that have a syscall by that name.
on archs where it's implemented in C, no action on the source file is
needed; the hidden declaration in pthread_arch.h suffices.
these are not a public interface and are not intended to be callable
from anywhere but the public clone function or other places in libc.
it's already included in all places where these are needed, and aside
from __tls_get_addr, they're all implementation internals.
the wrapper start function that performs scheduling operations is
unreachable if pthread_attr_setinheritsched is never called, so move
it there rather than the pthread_create source file, saving some code
size for static-linked programs.
eliminate the awkward startlock mechanism and corresponding fields of
the pthread structure that were only used at startup.
instead of having pthread_create perform the scheduling operations and
having the new thread wait for them to be completed, start the new
thread with a wrapper start function that performs its own scheduling,
sending the result code back via a futex. this way the new thread can
use storage from the calling thread's stack rather than permanent
fields in the pthread structure.
over time the pthread structure has accumulated a lot of cruft taking
up size. this commit removes unused fields and packs booleans and
other small data more efficiently. changes which would also require
changing code are not included at this time.
non-volatile booleans are packed as unsigned char bitfield members.
the canceldisable and cancelasync fields need volatile qualification
due to how they're accessed from the cancellation signal handler and
cancellable syscalls called from signal handlers. since volatile
bitfield semantics are not clearly defined, discrete char objects are
the pid field is completely removed; it has been unused since commit
the tid field's type is changed to int because its use is as a value
in futexes, which are defined as plain int. it has no conceptual
relationship to pid_t. also, its position is not ABI.
startlock is reduced to a length-1 array. the second element was
presumably intended as a waiter count, but it was never used and made
no sense, since there is at most one waiter.
previously, some accesses to the detached state (from pthread_join and
pthread_getattr_np) were unsynchronized; they were harmless in
programs with well-defined behavior, but ugly. other accesses (in
pthread_exit and pthread_detach) were synchronized by a poorly named
"exitlock", with an ad-hoc trylock operation on it open-coded in
pthread_detach, whose only purpose was establishing protocol for which
thread is responsible for deallocation of detached-thread resources.
instead, use an atomic detach_state and unify it with the futex used
to wait for thread exit. this eliminates 2 members from the pthread
structure, gets rid of the hackish lock usage, and makes rigorous the
trap added in commit 80bf5952551c002cf12d96deb145629765272db0 for
catching attempts to join detached threads. it should also make
attempts to detach an already-detached thread reliably trap.
if the last thread exited via pthread_exit, the logic that marked it
dead did not account for the possibility of it targeting itself via
atexit handlers. for example, an atexit handler calling
pthread_kill(pthread_self(), SIGKILL) would return success
(previously, ESRCH) rather than causing termination via the signal.
move the release of killlock after the determination is made whether
the exiting thread is the last thread. in the case where it's not,
move the release all the way to the end of the function. this way we
can clear the tid rather than spending storage on a dedicated
dead-flag. clearing the tid is also preferable in that it hardens
against inadvertent use of the value after the thread has terminated
but before it is joined.
the tid field in the pthread structure is not volatile, and really
shouldn't be, so as not to limit the compiler's ability to reorder,
merge, or split loads in code paths that may be relevant to
performance (like controlling lock ownership).
however, use of objects which are not volatile or atomic with futex
wait is inherently broken, since the compiler is free to transform a
single load into multiple loads, thereby using a different value for
the controlling expression of the loop and the value passed to the
futex syscall, leading the syscall to block instead of returning.
reportedly glibc's pthread_join was actually affected by an equivalent
issue on s390.
add a separate, dedicated join_futex object for pthread_join to use.
in the original submission of the patch that became commit
7c709f2d4f9872d1b445f760b0e68da89e256b9e, and in subsequent reading of
it by others, it was not clear that the new member had to be inserted
before canary_at_end, or that inserting it at that location was safe.
add comments to document.
In all cases this is just a change from two volatile int to one.
A variant of this new lock algorithm has been presented at SAC'16, see
https://hal.inria.fr/hal-01304108. A full version of that paper is
available at https://hal.inria.fr/hal-01236734.
The main motivation of this is to improve on the safety of the basic lock
implementation in musl. This is achieved by squeezing a lock flag and a
congestion count (= threads inside the critical section) into a single
int. Thereby an unlock operation does exactly one memory
transfer (a_fetch_add) and never touches the value again, but still
detects if a waiter has to be woken up.
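A minimal sketch of the single-int encoding, assuming the sign bit is
the lock flag and the remaining bits are the congestion count. It uses
C11 atomics as a stand-in for the a_cas/a_fetch_add primitives, omits
spinning and futex waits entirely, and the function names are
hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Lock word layout in this sketch: the sign bit (INT_MIN) is the lock
 * flag; the low bits count threads inside the critical section, i.e.
 * the holder plus any registered waiters. */

/* Uncontended acquire: 0 -> locked with a count of one.
 * Returns 1 on success, 0 if the word was not 0. */
static int lock_try(atomic_int *l)
{
	int expected = 0;
	return atomic_compare_exchange_strong(l, &expected, INT_MIN + 1);
}

/* A thread that could not acquire adds itself to the count before it
 * would go to sleep. */
static void lock_register_waiter(atomic_int *l)
{
	atomic_fetch_add(l, 1);
}

/* Release: one atomic add clears the flag and drops the holder's count.
 * The old value alone says whether a waiter must be woken, so the lock
 * word is never read again. Returns nonzero if a wake is needed. */
static int lock_release(atomic_int *l)
{
	int old = atomic_fetch_add(l, -(INT_MIN + 1));
	return old != INT_MIN + 1;
}
```

Because the release never revisits the word after its single add, the
memory holding the lock may be freed by another thread as soon as that
add is visible.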
This is a fix of a use-after-free bug in pthread_detach that had
temporarily been patched. Therefore this patch also reverts
This is also the only place where internal knowledge of the lock
algorithm is used.
The main price for the improved safety is slightly larger code.
Under high congestion, the scheduling behavior will be different
compared to the previous algorithm. In that case, a successful
put-to-sleep may appear out of order compared to the arrival in the
The flag 1<<7 is used in several places for different purposes that are
not always easy to distinguish. Mark those usages that correspond to the
flag that is used by the kernel for futexes.
x32 has another gratuitous difference to all other archs:
it passes an array of 64bit values to __tls_get_addr().
usually it is an array of size_t.
commit 31fb174dd295e50f7c5cf18d31fcfd5fe5a063b7 used
DEFAULT_GUARD_SIZE from pthread_impl.h in a static initializer,
breaking build on archs where its definition, PAGE_SIZE, is not a
constant. instead, just define DEFAULT_GUARD_SIZE as 4096, the minimal
page size on any arch we support. pthread_create rounds up to whole
pages anyway, so defining it to 1 would also work, but a moderately
meaningful value is nicer to programs that use
pthread_attr_getguardsize on default-initialized attribute objects.
the TLS ABI spec for mips, powerpc, and some other (presently
unsupported) RISC archs has the return value of __tls_get_addr offset
by +0x8000 and the result of DTPOFF relocations offset by -0x8000. I
had previously assumed this part of the ABI was actually just an
implementation detail, since the adjustments cancel out. however, when
the local dynamic model is used for accessing TLS that's known to be
in the same DSO, either of the following may happen:
1. the -0x8000 offset may already be applied to the argument structure
passed to __tls_get_addr at ld time, without any opportunity for
2. __tls_get_addr may be used with a zero offset argument to obtain a
base address for the module's TLS, to which the caller then applies
immediate offsets for individual objects accessed using the local
dynamic model. since the immediate offsets have the -0x8000 adjustment
applied to them, the base address they use needs to include the
it would be possible, but more complex, to store the pointers in the
dtv array with the +0x8000 offset pre-applied, to avoid the runtime
cost of adding 0x8000 on each call to __tls_get_addr. this change
could be made later if measurements show that it would help.
i386, x86_64, x32, and powerpc all use TLS for stack protector canary
values in the default stack protector ABI, but the location only
matched the ABI on i386 and x86_64. on x32, the expected location for
the canary contained the tid, thus producing spurious mismatches
(resulting in process termination) upon fork. on powerpc, the expected
location contained the stdio_locks list head, so returning from a
function after calling flockfile produced spurious mismatches. in both
cases, the random canary was not present, and a predictable value was
used instead, making the stack protector hardening much less effective
than it should be.
in the current fix, the thread structure has been expanded to have
canary fields at all three possible locations, and archs that use a
non-default location must define a macro in pthread_arch.h to choose
which location is used. for most archs (which lack TLS canary ABI) the
choice does not matter.
this fixes truncation of error messages containing long pathnames or
the dlerror state was previously required by POSIX to be global. the
resolution of bug 97 relaxed the requirements to allow thread-safe
implementations of dlerror with thread-local state and message buffer.
this global lock allows certain unlock-type primitives to exclude
mmap/munmap operations which could change the identity of virtual
addresses while references to them still exist.
the original design mistakenly assumed mmap/munmap would conversely
need to exclude the same operations which exclude mmap/munmap, so the
vmlock was implemented as a sort of 'symmetric recursive rwlock'. this
turned out to be unnecessary.
commit 25d12fc0fc51f1fae0f85b4649a6463eb805aa8f already shortened the
interval during which mmap/munmap held their side of the lock, but
left the inappropriate lock design and some inefficiency.
the new design uses a separate function, __vm_wait, which does not
hold any lock itself and only waits for lock users which were already
present when it was called to release the lock. this is sufficient
because of the way operations that need to be excluded are sequenced:
the "unlock-type" operations using the vmlock need only block
mmap/munmap operations that are precipitated by (and thus sequenced
after) the atomic-unlock they perform while holding the vmlock.
this allows for a spectacular lack of synchronization in the __vm_wait
There are two main abi variants for thread local storage layout:
(1) TLS is above the thread pointer at a fixed offset and the pthread
struct is below that. So the end of the struct is at known offset.
(2) the thread pointer points to the pthread struct and TLS starts
below it. So the start of the struct is at known (zero) offset.
Assembly code for the dynamic TLSDESC callback needs to access the
dynamic thread vector (dtv) pointer which is currently at the front
of the pthread struct. So in case of (1) the asm code needs to hard
code the offset from the end of the struct which can easily break if
the struct changes.
This commit adds a copy of the dtv at the end of the struct. New members
must not be added after dtv_copy, only before it. The size of the struct
is increased a bit, but there is opportunity for size optimizations.
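The layout constraint can be illustrated with a heavily trimmed,
hypothetical structure; only the placement of dtv and dtv_copy mirrors
the text, and the other members and the helper name are made up for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down thread structure. */
struct thread {
	void **dtv;           /* at the front: known offset for variant (2) */
	int tid;
	void *other_members;  /* new members are added here, before dtv_copy */
	void **dtv_copy;      /* must stay last: known offset from the end */
};

/* Distance of dtv_copy from the end of the structure. While dtv_copy
 * remains the last member, adding members before it leaves this fixed. */
#define DTV_COPY_FROM_END \
	(sizeof(struct thread) - offsetof(struct thread, dtv_copy))

/* Read the dtv given only a pointer to the end of the structure, the
 * way asm for variant (1) would. */
static void **dtv_from_end(void *end)
{
	return *(void ***)((char *)end - DTV_COPY_FROM_END);
}
```

With this arrangement the asm for variant (1) only needs the one
constant, the size of a pointer, rather than the offset of a member
near the front measured from the end.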
the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.
with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.
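a small sketch of the cas pattern under the volatile model. the cas
helper here is a stand-in for a_cas built on the GCC/Clang
__sync_val_compare_and_swap builtin (an assumption made so the example
compiles on its own), and atomic_apply is a hypothetical name.

```c
#include <assert.h>

/* Stand-in for a_cas. Returns the value *p held before the operation. */
static int cas(volatile int *p, int expected, int desired)
{
	return __sync_val_compare_and_swap(p, expected, desired);
}

/* Atomically replace *p with f(*p) and return the old value. Because p
 * is volatile-qualified, the load into tmp is performed exactly once
 * per iteration, so the value handed to f and the value compared by
 * the cas are the same one. */
static int atomic_apply(volatile int *p, int (*f)(int))
{
	int tmp;
	do tmp = *p;
	while (cas(p, tmp, f(tmp)) != tmp);
	return tmp;
}

/* Example update function. */
static int twice_plus_one(int x) { return 2 * x + 1; }
```

without the volatile qualification, nothing in the language stops the
compiler from re-reading *p for the second and third cas arguments,
which is the transformation described above.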
in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard.
previously, the __timedwait function was optionally a cancellation
point depending on whether it was passed a pointer to a cleanup
function and context to register. as of now, only one caller actually
used such a cleanup function (and it may face removal soon); most
callers either passed a null pointer to disable cancellation or a
dummy cleanup function.
now, __timedwait is never a cancellation point, and __timedwait_cp is
the cancellable version. this makes the intent of the calling code
more obvious and avoids ugly dummy functions and long argument lists.
based on patch by Jens Gustedt.
the main difficulty here is handling the difference between start
function signatures and thread return types for C11 threads versus
POSIX threads. pointers to void are assumed to be able to represent
faithfully all values of int. the function pointer for the thread
start function is cast to an incorrect type for passing through
pthread_create, but is cast back to its correct type before calling so
that the behavior of the call is well-defined.
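the casting scheme can be sketched without creating any thread.
run_c11_start is a hypothetical trampoline standing in for the new
thread's entry point; the typedef names are also made up for the
example.

```c
#include <assert.h>
#include <stdint.h>

typedef int (*c11_start_t)(void *);      /* shape of a C11 start function */
typedef void *(*posix_start_t)(void *);  /* shape of a POSIX start function */

/* Receives the start function under the POSIX type, casts it back to
 * its real type before calling (so the call itself is well-defined),
 * and widens the int result to a pointer. */
static void *run_c11_start(posix_start_t disguised, void *arg)
{
	c11_start_t real = (c11_start_t)disguised;
	int ret = real(arg);
	return (void *)(intptr_t)ret;
}

/* Recover the int result from the pointer-typed thread return value.
 * This relies on void * being able to represent every int value. */
static int result_to_int(void *res)
{
	return (int)(intptr_t)res;
}

/* Example C11-style start function. */
static int example_start(void *arg) { return *(int *)arg - 50; }
```

the function pointer is only ever stored and passed under the wrong
type, never called through it.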
changes to the existing threads implementation were kept minimal to
reduce the risk of regressions, and duplication of code that carries
implementation-specific assumptions was avoided for ease and safety of
this is analogous to commit fffc5cda10e0c5c910b40f7be0d4fa4e15bb3f48
which fixed the corresponding issue for mutexes.
the robust list can't be used here because the locks do not share a
common layout with mutexes. at some point it may make sense to simply
incorporate a mutex object into the FILE structure and use it, but
that would be a much more invasive change, and it doesn't mesh well
with the current design that uses a simpler code path for internal
locking and pulls in the recursive-mutex-like code when the flockfile
API is used explicitly.
for unknown syscall commands, the kernel produces ENOSYS, not EINVAL.
the immediate issue that was reported by Jens Gustedt and needed to be
fixed was corruption of the cv/mutex waiter states when switching to
using a new mutex with the cv after all waiters were unblocked but
before they finished returning from the wait function.
self-synchronized destruction was also handled poorly and may have had
race conditions. and the use of sequence numbers for waking waiters
admitted a theoretical missed-wakeup if the sequence number wrapped
through the full 32-bit space.
the new implementation is largely documented in the comments in the
source. the basic principle is to use linked lists initially attached
to the cv object, but detachable on signal/broadcast, made up of nodes
residing in automatic storage (stack) on the threads that are waiting.
this eliminates the need for waiters to access the cv object after
they are signaled, and allows us to limit wakeup to one waiter at a
time during broadcasts even when futex requeue cannot be used.
performance is also greatly improved, roughly doubling in some tests.
basically nothing is changed in the process-shared cond var case,
where this implementation does not work, since processes do not have
access to one another's local storage.
when manipulating the robust list, the order of stores matters,
because the code may be asynchronously interrupted by a fatal signal
and the kernel will then access the robust list in what is essentially
an async-signal context.
previously, aliasing considerations made it seem unlikely that a
compiler could reorder the stores, but proving that they could not be
reordered incorrectly would have been extremely difficult. instead
I've opted to make all the pointers used as part of the robust list,
including those in the robust list head and in the individual mutexes,
in addition, the format of the robust list has been changed to point
back to the head at the end, rather than ending with a null pointer.
this is to match the documented kernel robust list ABI. the null
pointer, which was previously used, only worked because faults during
access terminate the robust list processing.
private-futex uses the virtual address of the futex int directly as
the hash key rather than requiring the kernel to resolve the address
to an underlying backing for the mapping in which it lies. for certain
usage patterns it improves performance significantly.
in many places, the code using futex __wake and __wait operations was
already passing a correct fixed zero or nonzero flag for the priv
argument, so no change was needed at the site of the call, only in the
__wake and __wait functions themselves. in other places, especially
where the process-shared attribute for a synchronization object was
not previously tracked, additional new code is needed. for mutexes,
the only place to store the flag is in the type field, so additional
bit masking logic is needed for accessing the type.
for non-process-shared condition variable broadcasts, the futex
requeue operation is unable to requeue from a private futex to a
process-shared one in the mutex structure, so requeue is simply
disabled in this case by waking all waiters.
for robust mutexes, the kernel always performs a non-private wake when
the owner dies. in order not to introduce a behavioral regression in
non-process-shared robust mutexes (when the owning thread dies), they
are simply forced to be treated as process-shared for now, giving
correct behavior at the expense of performance. this can be fixed by
adding explicit code to pthread_exit to do the right thing for
non-shared robust mutexes in userspace rather than relying on the
kernel to do it, and will be fixed in this way later.
since not all supported kernels have private futex support, the new
code detects EINVAL from the futex syscall and falls back to making
the call without the private flag. no attempt to cache the result is
made; caching it and using the cached value efficiently is somewhat
difficult, and not worth the complexity when the benefits would be
seen only on ancient kernels which have numerous other limitations and
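a minimal linux-only sketch of the uncached fallback, written against
the public syscall() wrapper rather than libc internals. futex_wake is
a hypothetical name. the message above names EINVAL as the failure
code; a later message in this log says unknown commands produce ENOSYS,
so the sketch accepts either, which is an assumption on my part.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wake up to cnt waiters on addr. If priv is nonzero, try the private
 * variant first and retry without the flag when the kernel rejects it.
 * The outcome is deliberately not cached. */
static long futex_wake(volatile int *addr, int cnt, int priv)
{
	int flag = priv ? FUTEX_PRIVATE_FLAG : 0;
	long r = syscall(SYS_futex, addr, FUTEX_WAKE | flag, cnt);
	if (r == -1 && flag && (errno == EINVAL || errno == ENOSYS))
		r = syscall(SYS_futex, addr, FUTEX_WAKE, cnt);
	return r;
}
```

on a kernel with private futex support the retry never runs, so the
only cost on the common path is the errno check after a failed call.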
the motivation for the errno_ptr field in the thread structure, which
this commit removes, was to allow the main thread's errno to keep its
address when lazy thread pointer initialization was used. &errno was
evaluated prior to setting up the thread pointer and stored in
errno_ptr for the main thread; subsequently created threads would have
errno_ptr pointing to their own errno_val in the thread structure.
since lazy initialization was removed, there is no need for this extra
level of indirection; __errno_location can simply return the address
of the thread's errno_val directly. this does cause &errno to change,
but the change happens before entry to application code, and thus is
1. the thread result field was reused for storing a kernel timer id,
but would be overwritten if the application code exited or cancelled
2. low pointer values were used as the indicator that the timer id is
a kernel timer id rather than a thread id. this is not portable, as
mmap may return low pointers under some conditions. instead, use the fact
that pointers must be aligned and kernel timer ids must be
non-negative to map pointers into the negative integer space.
3. signals were not blocked until after the timer thread started, so a
race condition could allow a signal handler to run in the timer thread
when it's not supposed to exist. this is mainly problematic if the
calling thread was the only thread where the signal was unblocked and
the signal handler assumes it runs in that thread.
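the encoding in point 2 can be sketched as below. the function names
are hypothetical, and the sketch assumes pointers are at least 2-byte
aligned and that converting between uintptr_t and intptr_t preserves
the bit pattern, as it does on the usual two's complement targets.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a thread-structure pointer as a negative integer: shift out
 * the low bit (always zero for an aligned pointer) and set the sign
 * bit. */
static intptr_t encode_thread(void *td)
{
	return (intptr_t)((uintptr_t)INTPTR_MIN | ((uintptr_t)td >> 1));
}

/* Kernel timer ids are non-negative, so the sign alone tells a thread
 * pointer apart from a kernel timer id. */
static int is_thread(intptr_t id)
{
	return id < 0;
}

/* Undo the encoding; the left shift discards the sign bit and restores
 * the original pointer bits. */
static void *decode_thread(intptr_t id)
{
	return (void *)((uintptr_t)id << 1);
}
```

unlike the low-pointer test, this does not depend on where mmap
happens to place memory.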
there are several reasons for this change. one is getting rid of the
repetition of the syscall signature all over the place. another is
sharing the constant masks without costly GOT accesses in PIC.
the main motivation, however, is accurately representing whether we
want to block signals that might be handled by the application, or all