|Age||Commit message (Collapse)||Author||Lines|
as noted in commit 05870abeaac0588fb9115cfd11f96880a0af2108, mov lr,pc
is not a valid method for saving the return address in code that might
be built as thumb.
this one is unlikely to matter, since any ISA level that has thumb2
should also have native implementations of atomics that don't involve
kuser_helper, and the affected code is only used on very old kernels
to begin with.
mov lr,pc is not a valid way to save the return address in thumb mode
since it omits the thumb bit. use a chain of bl and bx to emulate blx.
this could be avoided by converting to a .S file with preprocessor
conditions to use blx if available, but the time cost here is
dominated by the syscall anyway.
while making this change, also remove the remnants of support for
pre-bx ISA levels. commit 9f290a49bf9ee247d540d3c83875288a7991699c
removed the hack from the parent code paths, but left the unnecessary
code in the child. keeping it would require rewriting two code paths
rather than one, and is useless for reasons described in that commit.
The only reason we needed to preserve the link register was because we
were using a branch-link instruction to branch to __cp_cancel.
Replacing this with a branch means we can avoid the save/restore as
the link register is no longer modified.
these are not a public interface and are not intended to be callable
from anywhere but the public clone function or other places in libc.
this cleans up what had become widespread direct inline use of "GNU C"
style attributes directly in the source, and lowers the barrier to
increased use of hidden visibility, which will be useful to recovering
some of the efficiency lost when the protected visibility hack was
dropped in commit dc2f368e565c37728b0d620380b849c3a1ddd78f, especially
on archs where the PLT ABI is costly.
__aeabi_read_tp used to call c code, but that was incorrect as the
arm runtime abi specifies special pcs for this function: it is only
allowed to clobber r0, ip, lr and cpsr.
maintainer's note: the old code explicitly saved and restored all
general-purpose registers which are call-clobbered in the normal
calling convention, so it's unlikely that any real-world compilers
produced code that could break. however theoretically they could have
chosen to use floating point registers, in which case the caller's
values of those registers would be clobbered.
commit 610c5a8524c3d6cd3ac5a5f1231422e7648a3791 changed the thread
pointer setup so tp points at the end of the pthread struct on arm,
but failed to update __aeabi_read_tp so it was off by 8.
this broke tls access in code that is compiled with -mtp=soft, which
is the default when target arch is pre armv6k or thumb1.
maintainer's note: no release versions are affected.
binutils commit bada43421274615d0d5f629a61a60b7daa71bc15 tightened
immediate fixup handling in gas in such a way that the final .arch of
an object file must be compatible with the fixups used when the
instruction was assembled; this in turn broke assembling of atomics.s,
at least in thumb mode.
it's not clear whether this should be considered a bug in gas, but
.object_arch is preferable anyway for our purpose here of controlling
the ISA level tag on the object file being produced, and it's the
intended directive for use in object files with runtime code
selection. research by Szabolcs Nagy confirmed that .object_arch is
supported in all relevant versions of binutils and clang's integrated
patch by Reiner Herrmann.
three problems are addressed:
- use of pc arithmetic, which was difficult if not impossible to make
correct in thumb mode on all models, so that relative rather than
absolute pointers to the backends could be used. this was designed
back when there was no coherent model for the early stages of the
dynamic linker before relocations, and is no longer necessary.
- assumption that data (the relative pointers to the backends) can be
accessed at a constant displacement from the code. this will not be
possible on future fdpic subarchs (for cortex-m), so move
responsibility for loading the backend code address to the caller.
- hard-coded arm opcodes using the .word directive. instead, use the
.arch directive to work around the assembler's refusal to assemble
instructions not available (or in some cases, available but just
considered deprecated) in the target isa level. the obscure v6t2
arch is used for v6 code so as to (1) allow generation of thumb2
output if -mthumb is active, and (2) avoid warnings/errors for mcr
barriers that clang would produce if we just set arch to v7-a.
in addition, the __aeabi_read_tp function is moved out of the inner
workings and implemented as an asm wrapper around a C function, so
that asm code does not need to read global data. the asm wrapper
serves to satisfy the ABI calling convention requirements for this
this file's .data section was not aligned, and just happened to get
the correct alignment with past builds. it's likely that the move of
atomic.s from arch/arm/src to src/thread/arm caused the change in
alignment, which broke the atomic and thread-pointer access fragments
on actual armv5 hardware.
this is possible with the new build system that allows src/*/$(ARCH)/*
files which do not shadow a file in the parent directory, and yields a
more logical organization. eventually it will be possible to remove
arch/*/src from the build system.
these files are all accepted as legacy arm syntax when producing arm
code, but legacy syntax cannot be used for producing thumb2 with
access to the full ISA. even after switching to UAL, some asm source
files contain instructions which are not valid in thumb mode, so these
will need to be addressed separately.
the idea of the three-instruction sequence being removed was to be
able to return to thumb code when used on armv4t+ from a thumb caller,
but also to be able to run on armv4 without the bx instruction
available (in which case the low bit of lr would always be 0).
however, without compiler support for generating such a sequence from
C code, which does not exist and which there is unlikely to be
interest in implementing, there is little point in having it in the
asm, and it would likely be easier to add pre-armv4t support via
enhanced linker handling of R_ARM_V4BX than at the compiler level.
removing this code simplifies adding support for building libc in
thumb2-only form (for cortex-m).
in a few places, non-hidden symbols were referenced from asm in ways
that assumed ld-time binding. while these is no semantic reason these
symbols need to be hidden, fixing the references without making them
hidden was going to be ugly, and hidden reduces some bloat anyway.
in the asm files, .global/.hidden directives have been moved to the
top to unclutter the actual code.
calls to __aeabi_read_tp may be generated by the compiler to access
TLS on pre-v6 targets. previously, this function was hard-coded to
call the kuser helper, which would crash on kernels with kuser helper
to fix the problem most efficiently, the definition of __aeabi_read_tp
is moved so that it's an alias for the new __a_gettp. however, on v7+
targets, code to initialize the runtime choice of thread-pointer
loading code is not even compiled, meaning that defining
__aeabi_read_tp would have caused an immediate crash due to using the
default implementation of __a_gettp with a HCF instruction.
fortunately there is an elegant solution which reduces overall code
size: putting the native thread-pointer loading instruction in the
default code path for __a_gettp, so that separate default/native code
paths are not needed. this function should never be called before
__set_thread_area anyway, and if it is called early on pre-v6
hardware, the old behavior (crashing) is maintained.
ideally __aeabi_read_tp would not be called at all on v7+ targets
anyway -- in fact, prior to the overhaul, the same problem existed,
but it was never caught by users building for v7+ with kuser disabled.
however, it's possible for calls to __aeabi_read_tp to end up in a v7+
binary if some of the object files were built for pre-v7 targets, e.g.
in the case of static libraries that were built separately, so this
case needs to be handled.
previously, builds for pre-armv6 targets hard-coded use of the "kuser
helper" system for atomics and thread-pointer access, resulting in
binaries that fail to run (crash) on systems where this functionality
has been disabled (as a security/hardening measure) in the kernel.
additionally, builds for armv6 hard-coded an outdated/deprecated
memory barrier instruction which may require emulation (extremely
slow) on future models.
this overhaul replaces the behavior for all pre-armv7 builds (both of
the above cases) to perform runtime detection of the appropriate
mechanisms for barrier, atomic compare-and-swap, and thread pointer
access. detection is based on information provided by the kernel in
auxv: presence of the HWCAP_TLS bit for AT_HWCAP and the architecture
version encoded in AT_PLATFORM. direct use of the instructions is
preferred when possible, since probing for the existence of the kuser
helper page would be difficult and would incur runtime cost.
for builds targeting armv7 or later, the runtime detection code is not
compiled at all, and much more efficient versions of the non-cas
atomic operations are provided by using ldrex/strex directly rather
than wrapping cas.
The architecture-specific assembly versions of clone did not set errno on
failure, which is inconsistent with glibc. __clone still returns the error
via its return value, and clone is now a wrapper that sets errno as needed.
The public clone has also been moved to src/linux, as it's not directly
related to the pthreads API.
__clone is called by pthread_create, which does not report errors via
errno. Though not strictly necessary, it's nice to avoid clobbering errno
despite documentation that makes it sound a lot different, the only
ABI-constraint difference between TLS variants II and I seems to be
that variant II stores the initial TLS segment immediately below the
thread pointer (i.e. the thread pointer points to the end of it) and
variant I stores the initial TLS segment above the thread pointer,
requiring the thread descriptor to be stored below. the actual value
stored in the thread pointer register also tends to have per-arch
random offsets applied to it for silly micro-optimization purposes.
with these changes applied, TLS should be basically working on all
supported archs except microblaze. I'm still working on getting the
necessary information and a working toolchain that can build TLS
binaries for microblaze, but in theory, static-linked programs with
TLS and dynamic-linked programs where only the main executable uses
TLS should already work on microblaze.
alignment constraints have not yet been heavily tested, so it's
possible that this code does not always align TLS segments correctly
on archs that need TLS variant I.
the code to exit the new thread/process after the start function
returns was mixed up in its syscall convention.
stale state information indicating that a thread was possibly blocked
at a cancellation point could get left behind if longjmp was used to
exit a signal handler that interrupted a cancellation point.
to fix the issue, we throw away the state information entirely and
simply compare the saved instruction pointer to a range of code
addresses in the __syscall_cp_asm function. all the ugly PIC work
(which becomes minimal anyway with this approach) is defered to
cancellation time instead of happening at every syscall, which should
improve performance too.
this commit also fixes cancellation on arm, which was mildly broken
(race condition, not checking cancellation flag once inside the
cancellation point zone). apparently i forgot to implement that. the
new arm code is untested, but appears correct; i'll test and fix it
later if there are problems.
this port assumes eabi calling conventions, eabi linux syscall
convention, and presence of the kernel helpers at 0xffff0f?0 needed
for threads support. otherwise it makes very few assumptions, and the
code should work even on armv4 without thumb support, as well as on
systems with thumb interworking. the bits headers declare this a
little endian system, but as far as i can tell the code should work
equally well on big endian.
some small details are probably broken; so far, testing has been
limited to qemu/aboriginal linux.