Age | Commit message (Collapse) | Author | Lines |
|
same approach as in sqrt.
sqrtl was broken on aarch64, riscv64 and s390x targets because
of missing quad precision support and on m68k-sf because of
missing ld80 sqrtl.
this implementation is written for quad precision and then
edited to make it work for both m68k and x86 style ld80 formats
too, but it is not expected to be optimal for them.
note: using fp instructions for the initial estimate when such
instructions are available (e.g. double prec sqrt or rsqrt) is
avoided because of fenv correctness.
|
|
for targets where long double is different from double.
|
|
same method as in sqrt, this was tested on all inputs against
an sqrtf instruction. (the only difference found was that x86
sqrtf does not signal the x86 specific input-denormal exception
on negative subnormal inputs while the software sqrtf does,
this is fine as it was designed for ieee754 exceptions only.)
there is known faster method:
"Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation"
that computes sqrtf directly via pipelined polynomial evaluation
which allows more parallelism, but the design does not generalize
easily to higher precisions.
|
|
approximate 1/sqrt(x) and sqrt(x) with goldschmidt iterations.
this is known to be a fast method for computing sqrt, but it is
tricky to get right, so added detailed comments.
use a lookup table for the initial estimate, this adds 256bytes
rodata but it can be shared between sqrt, sqrtf and sqrtl.
this saves one iteration compared to a linear estimate.
this is for soft float targets, but it supports fenv by using a
floating-point operation to get the final result. the result
is correctly rounded in all rounding modes. if fenv support is
turned off then the nearest rounded result is computed and
inexact exception is not signaled.
assumes fast 32bit integer arithmetics and 32 to 64bit mul.
|
|
this is actually a functional fix at present, since the C sqrtl does
not support ld80 and just wraps double sqrt. once that's fixed it will
just be an optimization.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The final rounding operation should be done with the correct sign
otherwise huge results may incorrectly get rounded to or away from
infinity in upward or downward rounding modes.
This affected sinh and sinhf which set the sign on the result after
a potentially overflowing mul. There may be other non-nearest rounding
issues, but this was a known long standing issue with large ulp error
(depending on how ulp is defined near infinity).
The fix should have no effect on sinh and sinhf performance but may
have a tiny effect on cosh and coshf.
|
|
Handle when after reduction |y| > pi/4+tiny. This happens in directed
rounding modes because the fast round to int code does not give the
nearest integer. In such cases the reduction may not be symmetric
between x and -x so e.g. cos(x)==cos(-x) may not hold (but polynomial
evaluation is not symmetric either with directed rounding so fixing
that would require more changes with bigger performance impact).
The fix only adds two predictable branches in nearest rounding mode,
simple ubenchmark does not show relevant performance regression in
nearest rounding mode.
The code could be improved: e.g reducing the medium size threshold
such that two step reduction is enough instead of three, and the
single precision case can avoid the issue by doing the round to int
differently, but this fix was kept minimal.
|
|
these did not truncate excess precision in the return value. fixing
them looks like considerable work, and the current C code seems to
outperform them significantly anyway.
long double functions are left in place because they are not subject
to excess precision issues and probably better than the C code.
|
|
this commit is for the sake of reviewable history.
|
|
|
|
analogous to commit 1c9afd69051a64cf085c6fb3674a444ff9a43857 for
atan[2][f].
|
|
for functions implemented in C, this is a requirement of C11 (F.6);
strictly speaking that text does not apply to standard library
functions, but it seems to be intended to apply to them, and C2x is
expected to make it a requirement.
failure to drop excess precision is particularly bad for inverse trig
functions, where a value with excess precision can be outside the
range of the function (entire range, or range for a particular
subdomain), breaking reasonable invariants a caller may expect.
|
|
|
|
|
|
SQRT.fmt exists on MIPS II+ (float), MIPS III+ (double).
ABS.fmt exists on MIPS I+ but only cores with ABS2008 flag in FCSR
implement the required behaviour.
|
|
Both sqrt and sqrtf shifted the signed exponent as signed int to adjust
the bit representation of the result. There are signed right shifts too
in the code but those are implementation defined and are expected to
compile to arithmetic shift on supported compilers and targets.
|
|
lrint in (LONG_MAX, 1/DBL_EPSILON) and in (-1/DBL_EPSILON, LONG_MIN)
is not trivial: rounding to int may be inexact, but the conversion to
int may overflow and then the inexact flag must not be raised. (the
overflow threshold is rounding mode dependent).
this matters on 32bit targets (without single instruction lrint or
rint), so the common case (when there is no overflow) is optimized by
inlining the lrint logic, otherwise the old code is kept as a fallback.
on my laptop an i486 lrint call is asm:10ns, old c:30ns, new c:21ns
on a smaller arm core: old c:71ns, new c:34ns
on a bigger arm core: old c:27ns, new c:19ns
|
|
commit f3ed8bfe8a82af1870ddc8696ed4cc1d5aa6b441 inadvertently removed
labels that were still needed.
|
|
commit 31c5fb80b9eae86f801be4f46025bc6532a554c5 introduced underflow
code paths for the i386 math asm, along with checks on the fpu status
word to skip the underflow-generation instructions if the underflow
flag was already raised. unfortunately, at least one such path, in
log1p, returned with 2 items on the x87 stack rather than just 1 item
for the return value. this is a violation of the ABI's calling
convention, and could cause subsequent floating point code to produce
NANs due to x87 stack overflow. if floating point results are used in
flow control, this can lead to runaway wrong code execution.
rather than reviewing each "underflow already raised" code path for
correctness, remove them all. they're likely slower than just
performing the underflow code unconditionally, and significantly more
complex.
all of this code should be ripped out and replaced by C source files
with inline asm. doing so would preclude this kind of error by having
the compiler perform all x87 stack register allocation and stack
manipulation, and would produce comparable or better code. however
such a change is a much larger project.
|
|
Author: Alex Suykov <alex.suykov@gmail.com>
Author: Aric Belsito <lluixhi@gmail.com>
Author: Drew DeVault <sir@cmpwn.com>
Author: Michael Clark <mjc@sifive.com>
Author: Michael Forney <mforney@mforney.org>
Author: Stefan O'Rear <sorear2@gmail.com>
This port has involved the work of many people over several years. I
have tried to ensure that everyone with substantial contributions has
been credited above; if any omissions are found they will be noted
later in an update to the authors/contributors list in the COPYRIGHT
file.
The version committed here comes from the riscv/riscv-musl repo's
commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by
me for issues found during final review:
- a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc
are not safe to use in separate inline asm fragments)
- a_cas[_p] is fixed to be a memory barrier
- the call from the _start assembly into the C part of crt1/ldso is
changed to allow for the possibility that the linker does not place
them nearby each other.
- DTP_OFFSET is defined correctly so that local-dynamic TLS works
- reloc.h LDSO_ARCH logic is simplified and made explicit.
- unused, non-functional crti/n asm files are removed.
- an empty .sdata section is added to crt1 so that the
__global_pointer reference is resolvable.
- indentation style errors in some asm files are fixed.
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
The underflow exception is signaled if the result is in the subnormal
range even if the result is exact.
code size change: +3421 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
pow rthruput: 102.96 ns/call 33.38 ns/call 3.08x
pow latency: 144.37 ns/call 54.75 ns/call 2.64x
-O3:
pow rthruput: 98.91 ns/call 32.79 ns/call 3.02x
pow latency: 138.74 ns/call 53.78 ns/call 2.58x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
TOINT_INTRINSICS and EXP_USE_TOINT_NARROW cases are unused.
The underflow exception is signaled if the result is in the subnormal
range even if the result is exact (e.g. exp2(-1023.0)).
code size change: -1672 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
exp rthruput: 12.73 ns/call 6.68 ns/call 1.91x
exp latency: 45.78 ns/call 21.79 ns/call 2.1x
exp2 rthruput: 6.35 ns/call 5.26 ns/call 1.21x
exp2 latency: 26.00 ns/call 16.58 ns/call 1.57x
-O3:
exp rthruput: 12.75 ns/call 6.73 ns/call 1.89x
exp latency: 45.91 ns/call 21.80 ns/call 2.11x
exp2 rthruput: 6.47 ns/call 5.40 ns/call 1.2x
exp2 latency: 26.03 ns/call 16.54 ns/call 1.57x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
code size change: +2458 bytes (+1524 bytes with fma).
benchmark on x86_64 before, after, speedup:
-Os:
log2 rthruput: 16.08 ns/call 10.49 ns/call 1.53x
log2 latency: 44.54 ns/call 25.55 ns/call 1.74x
-O3:
log2 rthruput: 15.92 ns/call 10.11 ns/call 1.58x
log2 latency: 44.66 ns/call 26.16 ns/call 1.71x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single
instruction.
code size change: +4588 bytes (+2540 bytes with fma).
benchmark on x86_64 before, after, speedup:
-Os:
log rthruput: 12.61 ns/call 7.95 ns/call 1.59x
log latency: 41.64 ns/call 23.38 ns/call 1.78x
-O3:
log rthruput: 12.51 ns/call 7.75 ns/call 1.61x
log latency: 41.82 ns/call 23.55 ns/call 1.78x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
POWF_SCALE != 1.0 case only matters if TOINT_INTRINSICS is set, which
is currently not supported for any target.
SNaN is not supported, it would require an issignalingf
implementation.
code size change: -816 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
powf rthruput: 95.14 ns/call 20.04 ns/call 4.75x
powf latency: 137.00 ns/call 34.98 ns/call 3.92x
-O3:
powf rthruput: 92.48 ns/call 13.67 ns/call 6.77x
powf latency: 131.11 ns/call 35.15 ns/call 3.73x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
In expf TOINT_INTRINSICS is kept, but is unused, it would require support
for __builtin_round and __builtin_lround as single instruction.
code size change: +94 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
expf rthruput: 9.19 ns/call 8.11 ns/call 1.13x
expf latency: 34.19 ns/call 18.77 ns/call 1.82x
exp2f rthruput: 5.59 ns/call 6.52 ns/call 0.86x
exp2f latency: 17.93 ns/call 16.70 ns/call 1.07x
-O3:
expf rthruput: 9.12 ns/call 4.92 ns/call 1.85x
expf latency: 34.44 ns/call 18.99 ns/call 1.81x
exp2f rthruput: 5.58 ns/call 4.49 ns/call 1.24x
exp2f latency: 17.95 ns/call 16.94 ns/call 1.06x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
code size change: +177 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
log2f rthruput: 11.38 ns/call 5.99 ns/call 1.9x
log2f latency: 35.01 ns/call 22.57 ns/call 1.55x
-O3:
log2f rthruput: 10.82 ns/call 5.58 ns/call 1.94x
log2f latency: 35.13 ns/call 21.04 ns/call 1.67x
|
|
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc,
with minor changes to better fit into musl.
code size change: +289 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
logf rthruput: 8.40 ns/call 6.14 ns/call 1.37x
logf latency: 31.79 ns/call 24.33 ns/call 1.31x
-O3:
logf rthruput: 8.43 ns/call 5.58 ns/call 1.51x
logf latency: 32.04 ns/call 20.88 ns/call 1.53x
|
|
|
|
These are supposed to be used in tail call positions when handling
special cases in new code. (fp exceptions may be raised "naturally"
by the common code path if special casing is more effort.)
This implements the error handling apis used in
https://github.com/ARM-software/optimized-routines
without errno setting.
|
|
Mark atanhi, atanlo, and aT in atanl.c as static, as they're not
intended to be part of the public API.
These are already static in the LDBL_MANT_DIG == 64 code, so this
patch is just making the LDBL_MANT_DIG == 113 code do the same thing.
|
|
fma is only available on recent x86_64 cpus and it is much faster than
a software fma, so this should be done with a runtime check, however
that requires more changes, this patch just adds the code so it can be
tested when musl is compiled with -mfma or -mfma4.
|
|
vfma is available in the vfpv4 fpu and above, the ACLE standard feature
test for double precision hardware fma support is
__ARM_FEATURE_FMA && __ARM_FP&8
we need further checks to work around clang bugs (fixed in clang >=7.0)
&& !__SOFTFP__
because __ARM_FP is defined even with -mfloat-abi=soft
&& !BROKEN_VFP_ASM
to disable the single precision code when inline asm handling is broken.
For runtime selection the HWCAP_ARM_VFPv4 hwcap flag can be used, but
that requires further work.
|
|
These are only available on hard float target and sqrt is not available
in the base ISA, so further check is used.
|
|
These are available in the s390x baseline isa -march=z900.
|
|
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
|
|
|
|
this makes significant differences to codegen on archs with an
expensive PLT-calling ABI; on i386 and gcc 7.3 for example, the sin
and sinf functions no longer touch call-saved registers or the stack
except for pushing outgoing arguments. performance is likely improved
too, but no measurements were taken.
|
|
|