U+00DF ('ß') has had an uppercase form (U+1E9E) available since
Unicode 5.1, but Unicode lacks the case mappings for it due to
stability policy. when I added support for the new character in commit
1a63a9fc30e7a1f1239e3cedcb5041e5ec1c5351, I omitted the mapping in the
lowercase-to-uppercase direction. this choice was not based on any
actual information, only assumptions.
this commit adds bidirectional case mappings between U+00DF and
U+1E9E, and removes the special-case hack that allowed U+00DF to be
identified as lowercase despite lacking a mapping. aside from strong
evidence that this is the "right" behavior for real-world usage of
these characters, several factors informed this decision:
- the other "potentially correct" mapping, to "SS", is not
representable in the C case-mapping system anyway.
- leaving one letter in lowercase form when transforming a string to
uppercase is obviously wrong.
- having a character which is nominally lowercase but which is fixed
under case mapping violates reasonable invariants.
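
as an illustration of the invariants mentioned above, here is a minimal sketch. the sketch_* helpers are hypothetical names made up for this example, not the actual implementation; they cover only ASCII and the U+00DF / U+1E9E pair. the point is that once the mapping is bidirectional, "lowercase" can simply mean "changed by the uppercase mapping", with no special case.

```c
#include <wchar.h>

/* hypothetical helpers for illustration only: ASCII plus the
 * U+00DF <-> U+1E9E pair described in the commit message */
static wint_t sketch_towupper(wint_t wc)
{
    if (wc == 0xdf) return 0x1e9e;                      /* ß -> ẞ */
    if (wc >= 'a' && wc <= 'z') return wc - 'a' + 'A';
    return wc;
}

static wint_t sketch_towlower(wint_t wc)
{
    if (wc == 0x1e9e) return 0xdf;                      /* ẞ -> ß */
    if (wc >= 'A' && wc <= 'Z') return wc - 'A' + 'a';
    return wc;
}

/* a character is lowercase iff the uppercase mapping moves it,
 * and uppercase iff the lowercase mapping moves it */
static int sketch_iswlower(wint_t wc) { return sketch_towupper(wc) != wc; }
static int sketch_iswupper(wint_t wc) { return sketch_towlower(wc) != wc; }
```

with these definitions a round trip through both mappings returns U+00DF to itself, and no nominally lowercase character is fixed under the uppercase mapping.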
isspace can be a bottleneck in a simple parser; inlining it
gives slightly smaller and faster code.
src/locale/pleval.o already had this optimization. the size
change for other libc functions for i386 is:
object file                    old     new   delta
src/internal/intscan.o        2134    2118     -16
src/locale/dcngettext.o       1562    1552     -10
src/network/res_msend.o       1961    1940     -21
src/network/lookup_name.o     2627    2608     -19
src/network/getnameinfo.o     1814    1811      -3
src/network/lookup_serv.o      643     624     -19
src/stdio/vfscanf.o           2675    2663     -12
src/stdlib/atoll.o             117     107     -10
src/stdlib/atoi.o               95      91      -4
src/stdlib/atol.o               95      91      -4
src/time/strptime.o           1515    1503     -12
(TOTALS)                    432451  432321    -130
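
a minimal sketch of what an inlinable isspace can look like, assuming the C locale's set of whitespace characters (space plus '\t' through '\r'). the names sketch_isspace and sketch_skip_space are hypothetical and exist only for this example.

```c
/* hypothetical inline test: true for ' ' and for the five
 * consecutive characters '\t', '\n', '\v', '\f', '\r' (9..13).
 * the unsigned subtraction folds the range check into one compare. */
static inline int sketch_isspace(int c)
{
    return c == ' ' || (unsigned)c - '\t' < 5;
}

/* typical parser use: skip leading whitespace without a function
 * call per character */
static const char *sketch_skip_space(const char *s)
{
    while (sketch_isspace((unsigned char)*s)) s++;
    return s;
}
```

because the whole test is two comparisons, the compiler can expand it at each call site, which is where the small size and speed gains in a tight parsing loop would come from.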
the main practical purposes of this commit are to remove a huge amount
of clutter from the src/locale directory, to cut down on the length of
the $(AR) and $(LD) command lines, and to reduce the amount of space
wasted by object file headers in the static libc.a. build time may
also be reduced, though this has not been measured.
as an additional justification, if there ever were a need for the
behavior of these functions to vary by locale, it would be necessary
for the non-_l versions to call the _l versions, so that linking the
former without the latter would not be possible anyway.
wctype_t was incorrectly "int" rather than "long" on x86_64. not only
is this an ABI incompatibility; it's also a major design flaw if we
ever wanted wctype_t to be implemented as a pointer, which would be
necessary if locales support custom character classes, since int is
too small to store a converted pointer. this commit fixes wctype_t to
be unsigned long on all archs, matching the LSB ABI; this change does
not matter for C code, but for C++ it affects mangling.
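
to illustrate the pointer concern, here is a sketch under stated assumptions: struct sketch_class, sketch_wctype_t and the conversion helpers are all hypothetical, invented for this example. it shows a class handle stored in an unsigned long and converted back; on an LP64 target such as x86_64 the round trip is lossless, whereas a 32-bit int could not hold the pointer.

```c
#include <stdint.h>

/* hypothetical descriptor for a locale-defined character class */
struct sketch_class { const char *name; };

/* hypothetical handle type, using the width chosen in the commit */
typedef unsigned long sketch_wctype_t;

static const struct sketch_class sketch_example = { "example" };

static sketch_wctype_t class_to_handle(const struct sketch_class *c)
{
    return (sketch_wctype_t)(uintptr_t)c;
}

static const struct sketch_class *handle_to_class(sketch_wctype_t h)
{
    return (const struct sketch_class *)(uintptr_t)h;
}

/* nonzero when the handle type is wide enough for any object pointer */
static int handle_is_lossless(void)
{
    return sizeof(sketch_wctype_t) >= sizeof(void *);
}
```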
the same issue applied to wctrans_t. glibc/LSB defines this type as
const __int32_t *, but since no such definition is visible, I've just
expanded the definition, int, everywhere.
it would be nice if these types (which don't vary by arch) could be in
wctype.h, but the OB XSI requirement in POSIX that wchar.h expose some
types and functions from wctype.h precludes doing so. glibc works
around this with some hideous hacks, but trying to duplicate that
would go against the intent of musl's headers.
this way they'll go into .rodata, decreasing memory pressure.
since the correct declaration was not visible, and since the
representation of the types wchar_t and wint_t always match, a
compiler would have to go out of its way to make this bug manifest,
but better to fix it anyway.
unicode character data has both "W" (wide) and "F" (fullwidth) East
Asian width types, but the old table only included the "W" ones. this
omitted U+3000 (ideographic space), all the fullwidth ASCII variants, etc.
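
a toy sketch of the distinction, not the real table: sketch_width is a hypothetical function that hard-codes a few sample ranges from each class and ignores nonspacing and control characters entirely. it only shows that both "W" and "F" characters should report two columns.

```c
/* hypothetical, deliberately incomplete width function */
static int sketch_width(unsigned wc)
{
    /* "F" (fullwidth) samples, the kind the old table missed */
    if (wc == 0x3000) return 2;                  /* ideographic space */
    if (wc >= 0xff01 && wc <= 0xff60) return 2;  /* fullwidth ASCII variants */

    /* "W" (wide) sample */
    if (wc >= 0x4e00 && wc <= 0x9fff) return 2;  /* CJK unified ideographs */

    return 1;  /* everything else is treated as narrow in this sketch */
}
```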
this should be the last major fix needed to support running
glibc-linked conforming POSIX programs with musl in place of glibc, as
long as musl provides the features they need and they don't use
pthread cancellation (which is implemented as c++ exceptions in glibc,
and fundamentally incompatible with musl).
i tried to go with improving the old binary-search-based algorithm,
but between growth in the number of ranges, bad performance, and lack
of confidence in the binary search code's stability under changes in
the table, i decided it was worth the extra 1.8k to have something
clean and maintainable.
also note that, like the alpha and punct tables, there's definitely
room to optimize the nonspacing/wide tables by overlapping subtables.
this is not a high priority, but i've begun looking into how to do it,
and i suspect the table sizes can be roughly halved. if that turns out
to be true, the new, fast, table-based implementation will be roughly
the same size as if i had just extended the old binary search one.
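
for readers unfamiliar with the approach, here is a minimal sketch of a two-level table lookup of the general kind described. it is an assumption-laden illustration, not the actual code: the names, the table layout, and the contents (toy data, filled in at run time and limited to the BMP) are all hypothetical. real tables would be generated offline and declared const.

```c
#include <string.h>

/* first level: one entry per 256-code-point block, giving the index
 * of a 256-bit bitmap; second level: the bitmaps themselves */
static unsigned char sketch_top[256];
static unsigned char sketch_blocks[3][32];

static void sketch_init(void)
{
    unsigned i;
    memset(sketch_top, 0, sizeof sketch_top);        /* default: block 0 */
    memset(sketch_blocks, 0, sizeof sketch_blocks);  /* block 0: all clear */
    memset(sketch_blocks[1], 0xff, 32);              /* block 1: all set */

    /* every fully-set block shares bitmap 1, so 82 blocks cost 32 bytes */
    for (i = 0x4e; i <= 0x9f; i++) sketch_top[i] = 1;

    /* a partially-set block gets its own bitmap (toy data: one bit) */
    sketch_top[0x30] = 2;
    sketch_blocks[2][0] |= 1;                        /* U+3000 */
}

/* constant-time membership test: two table reads, a shift and a mask */
static int sketch_lookup(unsigned wc)
{
    if (wc > 0xffff) return 0;
    return sketch_blocks[sketch_top[wc >> 8]][(wc & 255) >> 3] >> (wc & 7) & 1;
}
```

sharing identical bitmaps between blocks is the simplest form of the size optimization; overlapping partially-identical subtables, as suggested above, would go further.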
also special-case ß (U+00DF) as lowercase even though it does not have
a mapping to uppercase. unicode added an uppercase version of this
character but does not map it, presumably because the uppercase
version is not actually used except for some obscure purpose...
this happened due to their entries in UnicodeData.txt
alpha is defined as unicode property "Alphabetic" plus category Nd
minus ASCII digits minus 2 special-cased Thai punctuation marks
supposedly misclassified by Unicode as letters.
punct is defined as all of unicode except control, alphanumeric, and
the tables were generated by a simple tool based on the code posted
previously to the mailing list. in the future, this and other code
used for maintaining locale/iconv/i18n data will be published either
in the main source repository or in a separate locale data generation