|Age||Commit message (Collapse)||Author||Lines|
The result is the same but takes less code.
Note that __execvpe calls getenv which calls __strchrnul so even
using static output the size of the executable won't grow.
the hard problem here is unlinking threads from a list when they exit
without creating a window of inconsistency where the kernel task for a
thread still exists and is still executing instructions in userspace,
but is not reflected in the list. the magic solution here is getting
rid of per-thread exit futex addresses (set_tid_address), and instead
using the exit futex to unlock the global thread list.
since pthread_join can no longer see the thread enter a detach_state
of EXITED (which depended on the exit futex address pointing to the
detach_state), it must now observe the unlocking of the thread list
lock before it can unmap the joined thread and return. it doesn't
actually have to take the lock. for this, a __tl_sync primitive is
offered, with a signature that will allow it to be enhanced for quick
return even under contention on the lock, if needed. for now, the
exiting thread always performs a futex wake on its detach_state. a
future change could optimize this out except when there is already a
initial/dynamic variants of detached state no longer need to be
tracked separately, since the futex address is always set to the
global list lock, not a thread-local address that could become invalid
on detached thread exit. all detached threads, however, must perform a
second sigprocmask syscall to block implementation-internal signals,
since locking the thread list with them already blocked is not
the arch-independent C version of __unmapself no longer needs to take
a lock or setup its own futex address to release the lock, since it
must necessarily be called with the thread list lock already held,
guaranteeing exclusive access to the temporary stack.
changes to libc.threads_minus_1 no longer need to be atomic, since
they are guarded by the thread list lock. it is largely vestigial at
this point, and can be replaced with a cheaper boolean indicating
whether the process is multithreaded at some point in the future.
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
this was added so that posix_spawn and possibly other functionality
could be implemented in terms of vfork, but that turned out to be
unsafe. any such usage needs __clone with proper handling of stack
commits leading up to this one have moved the vast majority of
libc-internal interface declarations to appropriate internal headers,
allowing them to be type-checked and setting the stage to limit their
visibility. the ones that have not yet been moved are mostly
namespace-protected aliases for standard/public interfaces, which
exist to facilitate implementing plain C functions in terms of POSIX
functionality, or C or POSIX functionality in terms of extensions that
are not standardized. some don't quite fit this description, but are
"internally public" interfacs between subsystems of libc.
rather than create a number of newly-named headers to declare these
functions, and having to add explicit include directives for them to
every source file where they're needed, I have introduced a method of
wrapping the corresponding public headers.
parallel to the public headers in $(srcdir)/include, we now have
wrappers in $(srcdir)/src/include that come earlier in the include
path order. they include the public header they're wrapping, then add
declarations for namespace-protected versions of the same interfaces
and any "internally public" interfaces for the subsystem they
along these lines, the wrapper for features.h is now responsible for
the definition of the hidden, weak, and weak_alias macros. this means
source files will no longer need to include any special headers to
access these features.
over time, it is my expectation that the scope of what is "internally
public" will expand, reducing the number of source files which need to
include *_impl.h and related headers down to those which are actually
implementing the corresponding subsystems, not just using them.
previously, a common __posix_spawnx backend was used that accepted an
additional argument for the execve variant to call in the child. this
moderately bloated up the posix_spawn function, shuffling arguments
between stack and/or registers to call a 7-argument function from a
instead, tuck the exec function pointer in an unused part of the
(large) pthread_spawnattr_t structure, and have posix_spawnp duplicate
the attributes and fill in a pointer to __execvpe. the net code size
change is minimal, but the weight is shifted to the "heavier" function
which already pulls in more dependencies.
as a bonus, we get rid of an external symbol (__posix_spawnx) that had
no really good place for a declaration because it shouldn't have
existed to begin with.
without this, it's plausible that assembler or linker could complain
about an unsatisfiable relocation.
syscall.h was chosen as the header to declare it, since its intended
usage is alongside syscalls as a fallback for operations the direct
syscall does not support.
This lets fexecve work even when /proc isn't mounted.
the value 0x7f00 (as if by _exit(127)) is specified only for the case
where the child is created but then fails to exec the shell, since
traditional fork+exec implementations do not admit reporting an error
via errno in this case without additional machinery. it's unclear
whether an implementation not subject to this failure mode needs to
emulate it; one could read the standard as requiring that. if so,
additional code will need to be added to map posix_spawn errors into
the form system is expected to return. but for now, returning -1 to
indicate an error is significantly better behavior than always
reporting failures as if the shell failed to exec after fork.
this is more extensible if we need to consider additional errors, and
more efficient as long as the compiler does not know it can cache the
result of __errno_location (a surprisingly complex issue detailed in
It's better to make execvp continue PATH search on ENOTDIR rather than
issuing an error. Bogus entries should not render rest of PATH invalid.
Maintainer's note: POSIX seems to require the search to continue like
this as part of XBD 8.3 Other Environment Variables. Only errors that
conclusively determine non-existence are candidates for continuing;
otherwise for consistency we have to report the error.
If the syscall fails, errno must be set correctly for the caller.
There's no guarantee that the handlers registered with pthread_atfork
won't clobber errno, so we need to ensure it gets set after they are
the resolution to Austin Group issue #411 defined new semantics for
the posix_spawn dup2 file action in the (previously useless) case
where src and dest fd are equal. future issues will require the dup2
file action to remove the close-on-exec flag. without this change,
passing fds to a child with posix_spawn while avoiding fd-leak races
in a multithreaded parent required a complex dance with temporary fds.
based on patch by Petr Skocik. changes were made to preserve the
80-column formatting of the function and to remove code that became
unreachable as a result of the new functionality.
execvpe stack-allocates a buffer used to hold the full path
(combination of a PATH entry and the program name)
while searching through $PATH, so at least
NAME_MAX+PATH_MAX is needed.
The stack size can be made conditionally smaller
(the current 1024 appears appropriate)
should this larger size be burdensome in those situations.
per POSIX, EINVAL is not a mandatory error, only an optional one. but
reporting unsupported flags allows an application to fallback
gracefully when a requested feature is not supported. this is not
helpful now, but it may be in the future if additional flags are
had this checking been present before, applications would have been
able to check for the newly-added POSIX_SPAWN_SETSID feature (added in
commit bb439bb17108b67f3df9c9af824d3a607b5b059d) at runtime.
this functionality has been adopted for inclusion in the next issue of
POSIX as the result of Austin Group issue #1044.
based on patch by Daurnimator.
nominally the low bits of the trap number on sh are the number of
syscall arguments, but they have never been used by the kernel, and
some code making syscalls does not even know the number of arguments
and needs to pass an arbitrary high number anyway.
sh3/sh4 traditionally used the trap range 16-31 for syscalls, but part
of this range overlapped with hardware exceptions/interrupts on sh2
hardware, so an incompatible range 32-47 was chosen for sh2.
using trap number 31 everywhere, since it's in the existing sh3/sh4
range and does not conflict with sh2 hardware, is a proposed
unification of the kernel syscall convention that will allow binaries
to be shared between sh2 and sh3/sh4. if this is not accepted into the
kernel, we can refit the sh2 target with runtime selection mechanisms
for the trap number, but doing so would be invasive and would entail
since 1.1.0, musl has nominally required a thread pointer to be setup.
most of the remaining code that was checking for its availability was
doing so for the sake of being usable by the dynamic linker. as of
commit 71f099cb7db821c51d8f39dfac622c61e54d794c, this is no longer
necessary; the thread pointer is now valid before any libc code
(outside of dynamic linker bootstrap functions) runs.
this commit essentially concludes "phase 3" of the "transition path
for removing lazy init of thread pointer" project that began during
the 1.1.0 release cycle.
as a result of commit 12e1e324683a1d381b7f15dd36c99b37dd44d940, kernel
processing of the robust list is only needed for process-shared
mutexes. previously the first attempt to lock any owner-tracked mutex
resulted in robust list initialization and a set_robust_list syscall.
this is no longer necessary, and since the kernel's record of the
robust list must now be cleared at thread exit time for detached
threads, optimizing it out is more worthwhile than before too.
the specification for execvp itself is unclear as to whether
encountering a file that cannot be executed due to EACCES during the
PATH search is a mandatory error condition; however, XBD 8.3's
specification of the PATH environment variable clarifies that the
search continues until a file with "appropriate execution permissions"
since it seems undesirable/erroneous to report ENOENT rather than
EACCES when an early path element has a non-executable file and all
later path elements lack any file by the requested name, the new code
stores a flag indicating that EACCES was seen and sets errno back to
EACCES in this case.
the write function is a cancellation point and accesses thread-local
state belonging to the calling thread in the parent process. since
cancellation is blocked for the duration of posix_spawn, this is
probably safe, but it's fragile and unnecessary. making the syscall
directly is just as easy and clearly safe.
the resolution of austin group issue #370 removes the requirement that
posix_spawn fail when the close file action is performed on an
already-closed fd. since there are no other meaningful errors for
close, just ignoring the return value completely is the simplest fix.
the main motivation for this change is to remove the assumption that
the tid of the main thread is also the pid of the process. (the value
returned by the set_tid_address syscall was used to fill both fields
despite it semantically being the tid.) this is historically and
presently true on linux and unlikely to change, but it conceivably
could be false on other systems that otherwise reproduce the linux
only a few parts of the code were actually still using the cached pid.
in a couple places (aio and synccall) it was a minor optimization to
avoid a syscall. caching could be reintroduced, but lazily as part of
the public getpid function rather than at program startup, if it's
deemed important for performance later. in other places (cancellation
and pthread_kill) the pid was completely unnecessary; the tkill
syscall can be used instead of tgkill. this is actually a rather
subtle issue, since tgkill is supposedly a solution to race conditions
that can affect use of tkill. however, as documented in the commit
message for commit 7779dbd2663269b465951189b4f43e70839bc073, tgkill
does not actually solve this race; it just limits it to happening
within one process rather than between processes. we use a lock that
avoids the race in pthread_kill, and the use in the cancellation
signal handler is self-targeted and thus not subject to tid reuse
races, so both are safe regardless of which syscall (tgkill or tkill)
such archs are expected to omit definitions of the SYS_* macros for
syscalls their kernels lack from arch/$ARCH/bits/syscall.h. the
preprocessor is then able to select the an appropriate implementation
for affected functions. two basic strategies are used on a
where the old syscalls correspond to deprecated library-level
functions, the deprecated functions have been converted to wrappers
for the modern function, and the modern function has fallback code
(omitted at the preprocessor level on new archs) to make use of the
old syscalls if the new syscall fails with ENOSYS. this also improves
functionality on older kernels and eliminates the incentive to program
with deprecated library-level functions for the sake of compatibility
with older kernels.
in other situations where the old syscalls correspond to library-level
functions which are not deprecated but merely lack some new features,
such as the *at functions, the old syscalls are still used on archs
which support them. this may change at some point in the future if or
when fallback code is added to the new functions to make them usable
(possibly with reduced functionality) on old kernels.
open is handled specially because it is used from so many places, in
so many variants (2 or 3 arguments, setting errno or not, and
cancellable or not). trying to do it as a function would not only
increase bloat, but would also risk subtle breakage.
this is the first step towards supporting "new" archs where linux
lacks "old" syscalls.
this is the first step in an overhaul aimed at greatly simplifying and
optimizing everything dealing with thread-local state.
previously, the thread pointer was initialized lazily on first access,
or at program startup if stack protector was in use, or at certain
random places where inconsistent state could be reached if it were not
initialized early. while believed to be fully correct, the logic was
fragile and non-obvious.
in the first phase of the thread pointer overhaul, support is retained
(and in some cases improved) for systems/situation where loading the
thread pointer fails, e.g. old kernels.
some notes on specific changes:
- the confusing use of libc.main_thread as an indicator that the
thread pointer is initialized is eliminated in favor of an explicit
- sigaction no longer needs to ensure that the thread pointer is
initialized before installing a signal handler (this was needed to
prevent a situation where the signal handler caused the thread
pointer to be initialized and the subsequent sigreturn cleared it
again) but it still needs to ensure that implementation-internal
thread-related signals are not blocked.
- pthread tsd initialization for the main thread is deferred in a new
manner to minimize bloat in the static-linked __init_tp code.
- pthread_setcancelstate no longer needs special handling for the
situation before the thread pointer is initialized. it simply fails
on systems that cannot support a thread pointer, which are
- pthread_cleanup_push/pop now check for missing thread pointer and
nop themselves out in this case, so stdio no longer needs to avoid
the cancellable path when the thread pointer is not available.
a number of cases remain where certain interfaces may crash if the
system does not support a thread pointer. at this point, these should
be limited to pthread interfaces, and the number of such cases should
be fewer than before.
this is a requirement in the specification that was overlooked.
the va_arg call for the argv-terminating null pointer was missing,
so this pointer was being wrongly used as the environment pointer.
issue reported by Timo Teräs. proposed patch slightly modified to
simplify the resulting code.
the trick here is that sigaction can track for us which signals have
ever had a signal handler set for them, and only those signals need to
be considered for reset. this tracking mask may have false positives,
since it is impossible to remove bits from it without race conditions.
false negatives are not possible since the mask is updated with atomic
operations prior to making the sigaction syscall.
implementation-internal signals are set to SIG_IGN rather than SIG_DFL
so that a signal raised in the parent (e.g. calling pthread_cancel on
the thread executing pthread_spawn) does not have any chance make it
to the child, where it would cause spurious termination by signal.
this change reduces the minimum/typical number of syscalls in the
child from around 70 to 4 (including execve). this should greatly
improve the performance of posix_spawn and other interfaces which use
it (popen and system).
to facilitate these changes, sigismember is also changed to return 0
rather than -1 for invalid signals, and to return the actual status of
implementation-internal signals. POSIX allows but does not require an
error on invalid signal numbers, and in fact returning an error tends
to confuse applications which wrongly assume the return value of
sigismember is boolean.
failures prior to the exec attempt were reported correctly, but on
exec failure, the return value contained junk.
there are several reasons for this. some of them are related to race
conditions that arise since fork is required to be async-signal-safe:
if fork or pthread_create is called from a signal handler after the
fork syscall has returned but before the subsequent userspace code has
finished, inconsistent state could result. also, there seem to be
kernel and/or strace bugs related to arrival of signals during fork,
at least on some versions, and simply blocking signals eliminates the
possibility of such bugs.
I intend to add more Linux workarounds that depend on using these
pathnames, and some of them will be in "syscall" functions that, from
an anti-bloat standpoint, should not depend on the whole snprintf
this is both a minor scheduling optimization and a workaround for a
difficult-to-fix bug in qemu app-level emulation.
from the scheduling standpoint, it makes no sense to schedule the
parent thread again until the child has exec'd or exited, since the
parent will immediately block again waiting for it.
on the qemu side, as regular application code running on an underlying
libc, qemu cannot make arbitrary clone syscalls itself without
confusing the underlying implementation. instead, it breaks them down
into either fork-like or pthread_create-like cases. it was treating
the code in posix_spawn as pthread_create-like, due to CLONE_VM, which
caused horribly wrong behavior: CLONE_FILES broke the synchronization
mechanism, CLONE_SIGHAND broke the parent's signals, and CLONE_THREAD
caused the child's exec to end the parent -- if it hadn't already
crashed. however, qemu special-cases CLONE_VFORK and emulates that
with fork, even when CLONE_VM is also specified. this also gives
incorrect semantics for code that really needs the memory sharing, but
posix_spawn does not make use of the vm sharing except to avoid
momentary double commit charge.
programs using posix_spawn (including via popen) should now work
correctly under qemu app-level emulation.
for the duration of the vm-sharing clone used by posix_spawn, all
signals are blocked in the parent process, including
implementation-internal signals. since __synccall cannot do anything
until successfully signaling all threads, the fact that signals are
blocked automatically yields the necessary safety.
aside from debloating and general simplification, part of the
motivation for removing the explicit lock is to simplify the
synchronization logic of __synccall in hopes that it can be made
async-signal-safe, which is needed to make setuid and setgid, which
depend on __synccall, conform to the standard. whether this will be
possible remains to be seen.
patch by Jens Gustedt.
previously, the intended policy was to use __environ in code that must
conform to the ISO C namespace requirements, and environ elsewhere.
this policy was not followed in practice anyway, making things
confusing. on top of that, Jens reported that certain combinations of
link-time optimization options were breaking with the inconsistent
references; this seems to be a compiler or linker bug, but having it
go away is a nice side effect of the changes made here.
this avoids duplicating the fragile logic for executing an external
program without fork.
read should never return anything but 0 or sizeof ec here, but if it
does, we want to treat any other return as "success". then the caller
will get back the pid and is responsible for waiting on it when it
the proposed change was described in detail in detail previously on
the mailing list. in short, vfork is unsafe because:
1. the compiler could make optimizations that cause the child to
clobber the parent's local vars.
2. strace is buggy and allows the vforking parent to run before the
child execs when run under strace.
the new design uses a close-on-exec pipe instead of vfork semantics to
synchronize the parent and child so that the parent does not return
before the child has finished using its arguments (and now, also its
stack). this also allows reporting exec failures to the caller instead
of giving the caller a child that mysteriously exits with status 127
on exec error.
basic testing has been performed on both the success and failure code
paths. further testing should be done.