| Age | Commit message (Collapse) | Author | Lines | 
|---|
|  | this fixes truncation of error messages containing long pathnames or
symbol names.
the dlerror state was previously required by POSIX to be global. the
resolution of bug 97 relaxed the requirements to allow thread-safe
implementations of dlerror with thread-local state and message buffer. | 
|  |  | 
|  | this global lock allows certain unlock-type primitives to exclude
mmap/munmap operations which could change the identity of virtual
addresses while references to them still exist.
the original design mistakenly assumed mmap/munmap would conversely
need to exclude the same operations which exclude mmap/munmap, so the
vmlock was implemented as a sort of 'symmetric recursive rwlock'. this
turned out to be unnecessary.
commit 25d12fc0fc51f1fae0f85b4649a6463eb805aa8f already shortened the
interval during which mmap/munmap held their side of the lock, but
left the inappropriate lock design and some inefficiency.
the new design uses a separate function, __vm_wait, which does not
hold any lock itself and only waits for lock users which were already
present when it was called to release the lock. this is sufficient
because of the way operations that need to be excluded are sequenced:
the "unlock-type" operations using the vmlock need only block
mmap/munmap operations that are precipitated by (and thus sequenced
after) the atomic-unlock they perform while holding the vmlock.
this allows for a spectacular lack of synchronization in the __vm_wait
function itself. | 
|  | There are two main abi variants for thread local storage layout:
 (1) TLS is above the thread pointer at a fixed offset and the pthread
 struct is below that. So the end of the struct is at known offset.
 (2) the thread pointer points to the pthread struct and TLS starts
 below it. So the start of the struct is at known (zero) offset.
Assembly code for the dynamic TLSDESC callback needs to access the
dynamic thread vector (dtv) pointer which is currently at the front
of the pthread struct. So in case of (1) the asm code needs to hard
code the offset from the end of the struct which can easily break if
the struct changes.
This commit adds a copy of the dtv at the end of the struct. New members
must not be added after dtv_copy, only before it. The size of the struct
is increased a bit, but there is opportunity for size optimizations. | 
|  | the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.
with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.
in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard. | 
|  | previously, the __timedwait function was optionally a cancellation
point depending on whether it was passed a pointer to a cleaup
function and context to register. as of now, only one caller actually
used such a cleanup function (and it may face removal soon); most
callers either passed a null pointer to disable cancellation or a
dummy cleanup function.
now, __timedwait is never a cancellation point, and __timedwait_cp is
the cancellable version. this makes the intent of the calling code
more obvious and avoids ugly dummy functions and long argument lists. | 
|  | based on patch by Jens Gustedt.
the main difficulty here is handling the difference between start
function signatures and thread return types for C11 threads versus
POSIX threads. pointers to void are assumed to be able to represent
faithfully all values of int. the function pointer for the thread
start function is cast to an incorrect type for passing through
pthread_create, but is cast back to its correct type before calling so
that the behavior of the call is well-defined.
changes to the existing threads implementation were kept minimal to
reduce the risk of regressions, and duplication of code that carries
implementation-specific assumptions was avoided for ease and safety of
future maintenance. | 
|  | this is analogous commit fffc5cda10e0c5c910b40f7be0d4fa4e15bb3f48
which fixed the corresponding issue for mutexes.
the robust list can't be used here because the locks do not share a
common layout with mutexes. at some point it may make sense to simply
incorporate a mutex object into the FILE structure and use it, but
that would be a much more invasive change, and it doesn't mesh well
with the current design that uses a simpler code path for internal
locking and pulls in the recursive-mutex-like code when the flockfile
API is used explicitly. | 
|  | for unknown syscall commands, the kernel produces ENOSYS, not EINVAL. | 
|  | the immediate issue that was reported by Jens Gustedt and needed to be
fixed was corruption of the cv/mutex waiter states when switching to
using a new mutex with the cv after all waiters were unblocked but
before they finished returning from the wait function.
self-synchronized destruction was also handled poorly and may have had
race conditions. and the use of sequence numbers for waking waiters
admitted a theoretical missed-wakeup if the sequence number wrapped
through the full 32-bit space.
the new implementation is largely documented in the comments in the
source. the basic principle is to use linked lists initially attached
to the cv object, but detachable on signal/broadcast, made up of nodes
residing in automatic storage (stack) on the threads that are waiting.
this eliminates the need for waiters to access the cv object after
they are signaled, and allows us to limit wakeup to one waiter at a
time during broadcasts even when futex requeue cannot be used.
performance is also greatly improved, roughly double some tests.
basically nothing is changed in the process-shared cond var case,
where this implementation does not work, since processes do not have
access to one another's local storage. | 
|  | when manipulating the robust list, the order of stores matters,
because the code may be asynchronously interrupted by a fatal signal
and the kernel will then access the robust list in what is essentially
an async-signal context.
previously, aliasing considerations made it seem unlikely that a
compiler could reorder the stores, but proving that they could not be
reordered incorrectly would have been extremely difficult. instead
I've opted to make all the pointers used as part of the robust list,
including those in the robust list head and in the individual mutexes,
volatile.
in addition, the format of the robust list has been changed to point
back to the head at the end, rather than ending with a null pointer.
this is to match the documented kernel robust list ABI. the null
pointer, which was previously used, only worked because faults during
access terminate the robust list processing. | 
|  | private-futex uses the virtual address of the futex int directly as
the hash key rather than requiring the kernel to resolve the address
to an underlying backing for the mapping in which it lies. for certain
usage patterns it improves performance significantly.
in many places, the code using futex __wake and __wait operations was
already passing a correct fixed zero or nonzero flag for the priv
argument, so no change was needed at the site of the call, only in the
__wake and __wait functions themselves. in other places, especially
where the process-shared attribute for a synchronization object was
not previously tracked, additional new code is needed. for mutexes,
the only place to store the flag is in the type field, so additional
bit masking logic is needed for accessing the type.
for non-process-shared condition variable broadcasts, the futex
requeue operation is unable to requeue from a private futex to a
process-shared one in the mutex structure, so requeue is simply
disabled in this case by waking all waiters.
for robust mutexes, the kernel always performs a non-private wake when
the owner dies. in order not to introduce a behavioral regression in
non-process-shared robust mutexes (when the owning thread dies), they
are simply forced to be treated as process-shared for now, giving
correct behavior at the expense of performance. this can be fixed by
adding explicit code to pthread_exit to do the right thing for
non-shared robust mutexes in userspace rather than relying on the
kernel to do it, and will be fixed in this way later.
since not all supported kernels have private futex support, the new
code detects EINVAL from the futex syscall and falls back to making
the call without the private flag. no attempt to cache the result is
made; caching it and using the cached value efficiently is somewhat
difficult, and not worth the complexity when the benefits would be
seen only on ancient kernels which have numerous other limitations and
bugs anyway. | 
|  | the motivation for the errno_ptr field in the thread structure, which
this commit removes, was to allow the main thread's errno to keep its
address when lazy thread pointer initialization was used. &errno was
evaluated prior to setting up the thread pointer and stored in
errno_ptr for the main thread; subsequently created threads would have
errno_ptr pointing to their own errno_val in the thread structure.
since lazy initialization was removed, there is no need for this extra
level of indirection; __errno_location can simply return the address
of the thread's errno_val directly. this does cause &errno to change,
but the change happens before entry to application code, and thus is
not observable. | 
|  | 1. the thread result field was reused for storing a kernel timer id,
but would be overwritten if the application code exited or cancelled
the thread.
2. low pointer values were used as the indicator that the timer id is
a kernel timer id rather than a thread id. this is not portable, as
mmap may return low pointers on some conditions. instead, use the fact
that pointers must be aligned and kernel timer ids must be
non-negative to map pointers into the negative integer space.
3. signals were not blocked until after the timer thread started, so a
race condition could allow a signal handler to run in the timer thread
when it's not supposed to exist. this is mainly problematic if the
calling thread was the only thread where the signal was unblocked and
the signal handler assumes it runs in that thread. | 
|  | there are several reasons for this change. one is getting rid of the
repetition of the syscall signature all over the place. another is
sharing the constant masks without costly GOT accesses in PIC.
the main motivation, however, is accurately representing whether we
want to block signals that might be handled by the application, or all
signals. | 
|  | this function is mainly (purely?) for obtaining stack address
information, but we also provide the detach state since it's easy to
do anyway. | 
|  | the issue at hand is that many syscalls require as an argument the
kernel-ABI size of sigset_t, intended to allow the kernel to switch to
a larger sigset_t in the future. previously, each arch was defining
this size in syscall_arch.h, which was redundant with the definition
of _NSIG in bits/signal.h. as it's used in some not-quite-portable
application code as well, _NSIG is much more likely to be recognized
and understood immediately by someone reading the code, and it's also
shorter and less cluttered.
note that _NSIG is actually 65/129, not 64/128, but the division takes
care of throwing away the off-by-one part. | 
|  | this should generate faster and smaller code, especially with inline
syscalls. the conditional with cnt is ugly, but thankfully cnt is
always a constant anyway so it gets evaluated at compile time. it may
be preferable to make separate __wake and __wakeall macros without a
count argument.
priv flag is not used yet; private futex support still needs to be
done at some point in the future. | 
|  | linux's sched_* syscalls actually implement the TPS (thread
scheduling) functionality, not the PS (process scheduling)
functionality which the sched_* functions are supposed to have.
omitting support for the PS option (and having the sched_* interfaces
fail with ENOSYS rather than omitting them, since some broken software
assumes they exist) seems to be the only conforming way to do this on
linux. | 
|  | this mirrors the stdio_impl.h cleanup. one header which is not
strictly needed, errno.h, is left in pthread_impl.h, because since
pthread functions return their error codes rather than using errno,
nearly every single pthread function needs the errno constants.
in a few places, rather than bringing in string.h to use memset, the
memset was replaced by direct assignment. this seems to generate much
better code anyway, and makes many functions which were previously
non-leaf functions into leaf functions (possibly eliminating a great
deal of bloat on some platforms where non-leaf functions require ugly
prologue and/or epilogue). | 
|  | unlike other implementations, this one reserves memory for new TLS in
all pre-existing threads at dlopen-time, and dlopen will fail with no
resources consumed and no new libraries loaded if memory is not
available. memory is not immediately distributed to running threads;
that would be too complex and too costly. instead, assurances are made
that threads needing the new TLS can obtain it in an async-signal-safe
way from a buffer belonging to the dynamic linker/new module (via
atomic fetch-and-add based allocator).
I've re-appropriated the lock that was previously used for __synccall
(synchronizing set*id() syscalls between threads) as a general
pthread_create lock. it's a "backwards" rwlock where the "read"
operation is safe atomic modification of the live thread count, which
multiple threads can perform at the same time, and the "write"
operation is making sure the count does not increase during an
operation that depends on it remaining bounded (__synccall or dlopen).
in static-linked programs that don't use __synccall, this lock is a
no-op and has no cost. | 
|  | this code will not work yet because the necessary relocations are not
supported, and cannot be supported without some internal changes to
how relocation processing works (coming soon). | 
|  | some minor changes to how hard-coded sets for thread-related purposes
are handled were also needed, since the old object sizes were not
necessarily sufficient. things have gotten a bit ugly in this area,
and i think a cleanup is in order at some point, but for now the goal
is just to get the code working on all supported archs including mips,
which was badly broken by linux rejecting syscalls with the wrong
sigset_t size. | 
|  | these could have caused memory corruption due to invalid accesses to
the next field. all should be fixed now; I found the errors with fgrep
-r '__lock(&', which is bogus since the argument should be an array. | 
|  | i originally omitted these (optional, per POSIX) interfaces because i
considered them backwards implementation details. however, someone
later brought to my attention a fairly legitimate use case: allocating
thread stacks in memory that's setup for sharing and/or fast transfer
between CPU and GPU so that the thread can move data to a GPU directly
from automatic-storage buffers without having to go through additional
buffer copies.
perhaps there are other situations in which these interfaces are
useful too. | 
|  | I've been looking for data that would suggest a good default, and
since little has shown up, i'm doing this based on the limited data I
have. the value 80k is chosen to accommodate 64k of application data
(which happens to be the size of the buffer in git that made it crash
without a patch to call pthread_attr_setstacksize) plus the max stack
usage of most libc functions (with a few exceptions like crypt, which
will be fixed soon to avoid excessive stack usage, and [n]ftw, which
inherently uses a fair bit in recursive directory searching).
if further evidence emerges suggesting that the default should be
larger, I'll consider changing it again, but I'd like to avoid it
getting too large to avoid the issues of large commit charge and rapid
address space exhaustion on 32-bit machines. | 
|  |  | 
|  | it's ok to overlap with integer slot 3 on 32-bit because only slots
0-2 are used on process-local barriers. | 
|  | pthread structure has been adjusted to match the glibc/GCC abi for
where the canary is stored on i386 and x86_64. it will need variants
for other archs to provide the added security of the canary's entropy,
but even without that it still works as well as the old "minimal" ssp
support. eventually such changes will be made anyway, since they are
also needed for GCC/C11 thread-local storage support (not yet
implemented).
care is taken not to attempt initializing the thread pointer unless
the program actually uses SSP (by reference to __stack_chk_fail). | 
|  |  | 
|  | eliminate the sequence number field and instead use the counter as the
futex because of the way the lock is held, sequence numbers are
completely useless, and this frees up a field in the barrier structure
to be used as a waiter count for the count futex, which lets us avoid
some syscalls in the best case.
as of now, self-synchronized destruction and unmapping should be fully
safe. before any thread can return from the barrier, all threads in
the barrier have obtained the vm lock, and each holds a shared lock on
the barrier. the barrier memory is not inspected after the shared lock
count reaches 0, nor after the vm lock is released. | 
|  | this implementation is rather heavy-weight, but it's the first
solution i've found that's actually correct. all waiters actually wait
twice at the barrier so that they can synchronize exit, and they hold
a "vm lock" that prevents changes to virtual memory mappings (and
blocks pthread_barrier_destroy) until all waiters are finished
inspecting the barrier.
thus, it is safe for any thread to destroy and/or unmap the barrier's
memory as soon as pthread_barrier_wait returns, without further
synchronization. | 
|  | due to moving waiters from the cond var to the mutex in bcast, these
waiters upon wakeup would steal slots in the count from newer waiters
that had not yet been signaled, preventing the signal function from
taking any action.
to solve the problem, we simply use two separate waiter counts, and so
that the original "total" waiters count is undisturbed by broadcast
and still available for signal. | 
|  | testing revealed that the old implementation, while correct, was
giving way too many spurious wakeups due to races changing the value
of the condition futex. in a test program with 5 threads receiving
broadcast signals, the number of returns from pthread_cond_wait was
roughly 3 times what it should have been (2 spurious wakeups for every
legitimate wakeup). moreover, the magnitude of this effect seems to
grow with the number of threads.
the old implementation may also have had some nasty race conditions
with reuse of the cond var with a new mutex.
the new implementation is based on incrementing a sequence number with
each signal event. this sequence number has nothing to do with the
number of threads intended to be woken; it's only used to provide a
value for the futex wait to avoid deadlock. in theory there is a
danger of race conditions due to the value wrapping around after 2^32
signals. it would be nice to eliminate that, if there's a way.
testing showed no spurious wakeups (though they are of course
possible) with the new implementation, as well as slightly improved
performance. | 
|  | this avoids the "stampede effect" where pthread_cond_broadcast would
result in all waiters waking up simultaneously, only to immediately
contend for the mutex and go back to sleep. | 
|  | it's amazing none of the conformance tests i've run even bothered to
check whether something so basic works... | 
|  | several things are changed. first, i have removed the old __uniclone
function signature and replaced it with the "standard" linux
__clone/clone signature. this was necessary to expose clone to
applications anyway, and it makes it easier to port __clone to new
archs, since it's now testable independently of pthread_create.
secondly, i have removed all references to the ugly ldt descriptor
structure (i386 only) from the c code and pthread structure. in places
where it is needed, it is now created on the stack just when it's
needed, in assembly code. thus, the i386 __clone function takes the
desired thread pointer as its argument, rather than an ldt descriptor
pointer, just like on all other sane archs. this should not affect
applications since there is really no way an application can use clone
with threads/tls in a way that doesn't horribly conflict with and
clobber the underlying implementation's use. applications are expected
to use clone only for creating actual processes, possibly with new
namespace features and whatnot. | 
|  | fix up clone signature to match the actual behavior. the new
__syncall_wait function allows a __synccall callback to wait for other
threads to continue without returning, so that it can resume action
after the caller finishes. this interface could be made significantly
more general/powerful with minimal effort, but i'll wait to do that
until it's actually useful for something. | 
|  | like mutexes and semaphores, rwlocks suffered from a race condition
where the unlock operation could access the lock memory after another
thread successfully obtained the lock (and possibly destroyed or
unmapped the object). this has been fixed in the same way it was fixed
for other lock types.
in addition, the previous implementation favored writers over readers.
in the absence of other considerations, that is the best behavior for
rwlocks, and posix explicitly allows it. however posix also requires
read locks to be recursive. if writers are favored, any attempt to
obtain a read lock while a writer is waiting for the lock will fail,
causing "recursive" read locks to deadlock. this can be avoided by
keeping track of which threads already hold read locks, but doing so
requires unbounded memory usage, and there must be a fallback case
that favors readers in case memory allocation failed. and all of this
must be synchronized. the cost, complexity, and risk of errors in
getting it right is too great, so we simply favor readers.
tracking of the owner of write locks has been removed, as it was not
useful for anything. it could allow deadlock detection, but it's not
clear to me that returning EDEADLK (which a buggy program is likely to
ignore) is better than deadlocking; at least the latter behavior
prevents further data corruption. a correct program cannot invoke this
situation anyway.
the reader count and write lock state, as well as the "last minute"
waiter flag have all been combined into a single atomic lock. this
means all state transitions for the lock are atomic compare-and-swap
operations. this makes establishing correctness much easier and may
improve performance.
finally, some code duplication has been cleaned up. more is called
for, especially the standard __timedwait idiom repeated in all locks. | 
|  | new features:
- FUTEX_WAIT_BITSET op will be used for timed waits if available. this
  saves a call to clock_gettime.
- error checking for the timespec struct is now inside __timedwait so
  it doesn't need to be duplicated everywhere. cond_timedwait still
  needs to duplicate it to avoid unlocking the mutex, though.
- pushing and popping the cancellation handler is delegated to
  __timedwait, and cancellable/non-cancellable waits are unified. | 
|  | previously, stdio used spinlocks, which would be unacceptable if we
ever add support for thread priorities, and which yielded
pathologically bad performance if an application attempted to use
flockfile on a key file as a major/primary locking mechanism.
i had held off on making this change for fear that it would hurt
performance in the non-threaded case, but actually support for
recursive locking had already inflicted that cost. by having the
internal locking functions store a flag indicating whether they need
to perform unlocking, rather than using the actual recursive lock
counter, i was able to combine the conditionals at unlock time,
eliminating any additional cost, and also avoid a nasty corner case
where a huge number of calls to ftrylockfile could cause deadlock
later at the point of internal locking.
this commit also fixes some issues with usage of pthread_self
conflicting with __attribute__((const)) which resulted in crashes with
some compiler versions/optimizations, mainly in flockfile prior to
pthread_create. | 
|  | changing credentials in a multi-threaded program is extremely
difficult on linux because it requires synchronizing the change
between all threads, which have their own thread-local credentials on
the kernel side. this is further complicated by the fact that changing
the real uid can fail due to exceeding RLIMIT_NPROC, making it
possible that the syscall will succeed in some threads but fail in
others.
the old __rsyscall approach being replaced was robust in that it would
report failure if any one thread failed, but in this case, the program
would be left in an inconsistent state where individual threads might
have different uid. (this was not as bad as glibc, which would
sometimes even fail to report the failure entirely!)
the new approach being committed refuses to change real user id when
it cannot temporarily set the rlimit to infinity. this is completely
POSIX conformant since POSIX does not require an implementation to
allow real-user-id changes for non-privileged processes whatsoever.
still, setting the real uid can fail due to memory allocation in the
kernel, but this can only happen if there is not already a cached
object for the target user. thus, we forcibly serialize the syscalls
attempts, and fail the entire operation on the first failure. this
*should* lead to an all-or-nothing success/failure result, but it's
still fragile and highly dependent on kernel developers not breaking
things worse than they're already broken.
ideally linux will eventually add a CLONE_USERCRED flag that would
give POSIX conformant credential changes without any hacks from
userspace, and all of this code would become redundant and could be
removed ~10 years down the line when everyone has abandoned the old
broken kernels. i'm not holding my breath... | 
|  | if thread id was reused by the kernel between the time pthread_kill
read it from the userspace pthread_t object and the time of the tgkill
syscall, a signal could be sent to the wrong thread. the tgkill
syscall was supposed to prevent this race (versus the old tkill
syscall) but it can't; it can only help in the case where the tid is
reused in a different process, but not when the tid is reused in the
same process.
the only solution i can see is an extra lock to prevent threads from
exiting while another thread is trying to pthread_kill them. it should
be very very cheap in the non-contended case. | 
|  |  | 
|  |  | 
|  |  | 
|  | the new approach relies on the fact that the only ways to create
sigset_t objects without invoking UB are to use the sig*set()
functions, or from the masks returned by sigprocmask, sigaction, etc.
or in the ucontext_t argument to a signal handler. thus, as long as
sigfillset and sigaddset avoid adding the "protected" signals, there
is no way the application will ever obtain a sigset_t including these
bits, and thus no need to add the overhead of checking/clearing them
when sigprocmask or sigaction is called.
note that the old code actually *failed* to remove the bits from
sa_mask when sigaction was called.
the new implementations are also significantly smaller, simpler, and
faster due to ignoring the useless "GNU HURD signals" 65-1024, which
are not used and, if there's any sanity in the world, never will be
used. | 
|  | the previous implementation had at least 2 problems:
1. the case where additional threads reached the barrier before the
first wave was finished leaving the barrier was untested and seemed
not to be working.
2. threads leaving the barrier continued to access memory within the
barrier object after other threads had successfully returned from
pthread_barrier_wait. this could lead to memory corruption or crashes
if the barrier object had automatic storage in one of the waiting
threads and went out of scope before all threads finished returning,
or if one thread unmapped the memory in which the barrier object
lived.
the new implementation avoids both problems by making the barrier
state essentially local to the first thread which enters the barrier
wait, and forces that thread to be the last to return. | 
|  | this patch improves the correctness, simplicity, and size of
cancellation-related code. modulo any small errors, it should now be
completely conformant, safe, and resource-leak free.
the notion of entering and exiting cancellation-point context has been
completely eliminated and replaced with alternative syscall assembly
code for cancellable syscalls. the assembly is responsible for setting
up execution context information (stack pointer and address of the
syscall instruction) which the cancellation signal handler can use to
determine whether the interrupted code was in a cancellable state.
these changes eliminate race conditions in the previous generation of
cancellation handling code (whereby a cancellation request received
just prior to the syscall would not be processed, leaving the syscall
to block, potentially indefinitely), and remedy an issue where
non-cancellable syscalls made from signal handlers became cancellable
if the signal handler interrupted a cancellation point.
x86_64 asm is untested and may need a second try to get it right. | 
|  | otherwise we cannot support an application's desire to use
asynchronous cancellation within the callback function. this change
also slightly debloats pthread_create.c. |