|Age||Commit message (Collapse)||Author||Lines|
commit 7590203c486d9002522019045d34ee3dee0a66f5 omitted static here.
these accept the netbsd/openbsd message catalog file format,
consisting of a sorted list of set headers and a sorted list of
message headers for each set, admitting trivial binary search for
the gnu format was not chosen because it's unusably bad. it does not
admit efficient (log time or better) lookups; rather, it requires
linear search or hash table lookups, and the hash function is awful:
it's literally set_id*msg_id.
Some packages call gettext to format a message to be sent to perror.
If the currently set user locale points to a non-existent .mo file,
open via __map_file in dcngettext will set errno to ENOENT.
Maintainer's notes: Non-modification of errno is a documented part of
the interface contract for the GNU version of this function and likely
other versions. The issue being fixed here seems to be a regression
from commit 1b52863e244ecee5b5935b6d36bb9e6efe84c035, which enabled
setting of errno from __map_file.
commit d88e5dfa8b989dafff4b748bfb3cba3512c8482e inadvertently changed
the argument pased to __get_locale from part (the current ;-delimited
component) to name (the full string).
commit aeeac9ca5490d7d90fe061ab72da446c01ddf746 introduced fail-safe
invariants that creating a locale_t object for the C locale or C.UTF-8
locale will always succeed. extend the guarantee to also cover the
- newlocale(LC_ALL_MASK, "", 0)
- newlocale(LC_ALL_MASK-LC_CTYPE_MASK, "C", 0)
provided that the LANG/LC_* environment variables have not been
changed by the program. these usages are idiomatic for getting the
default locale, and for getting a locale that behaves as the C locale
except for honoring the default locale's character encoding.
unify the code paths for allocated and non-allocated locale objects,
always using a tmp object. this is necessary to avoid clobbering the
base locale object too soon if we allow for the possibility that
looking up an explicitly requested locale name may fail, and makes the
code simpler and cleaner anyway.
eliminate the complex and fragile logic for checking whether one of
the non-allocated locale objects can be used for the result, and
instead just memcmp against each of them.
introduce a new LOC_MAP_FAILED sentinel for errors, since null
pointers for a category's locale map indicate the C locale. at this
time, __get_locale does not fail, so there should be no functional
change by this commit.
there is no good reason to wait to find and process the plural rules
for a translated message file until a gettext form requesting plural
rule processing is used. it just imposes additional synchronization,
here in the form of clunky use of atomics.
it looks like there may also have been a race condition where nplurals
could be seen without plural_rule being seen, possibly leading to null
pointer dereference. if so, this commit fixes it.
this further reduces the number of source files which need to include
libc.h and thereby be potentially exposed to libc global state and
this will also facilitate further improvements like adding an inline
fast-path, if we want to do so later.
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
commits leading up to this one have moved the vast majority of
libc-internal interface declarations to appropriate internal headers,
allowing them to be type-checked and setting the stage to limit their
visibility. the ones that have not yet been moved are mostly
namespace-protected aliases for standard/public interfaces, which
exist to facilitate implementing plain C functions in terms of POSIX
functionality, or C or POSIX functionality in terms of extensions that
are not standardized. some don't quite fit this description, but are
"internally public" interfacs between subsystems of libc.
rather than create a number of newly-named headers to declare these
functions, and having to add explicit include directives for them to
every source file where they're needed, I have introduced a method of
wrapping the corresponding public headers.
parallel to the public headers in $(srcdir)/include, we now have
wrappers in $(srcdir)/src/include that come earlier in the include
path order. they include the public header they're wrapping, then add
declarations for namespace-protected versions of the same interfaces
and any "internally public" interfaces for the subsystem they
along these lines, the wrapper for features.h is now responsible for
the definition of the hidden, weak, and weak_alias macros. this means
source files will no longer need to include any special headers to
access these features.
over time, it is my expectation that the scope of what is "internally
public" will expand, reducing the number of source files which need to
include *_impl.h and related headers down to those which are actually
implementing the corresponding subsystems, not just using them.
locale_impl.h could have been used, but this function is completely
independent of anything else, and preserving that property seems nice.
since this iconv implementation's output is stateless, it's necessary
to know before writing anything to the output buffer whether the
conversion of the current input character will fit.
previously we used a hard-coded table of the output size needed for
each supported output encoding, but failed to update the table when
adding support for conversion to jis-based encodings and again when
adding separate encoding identifiers for implicit-endianness utf-16/32
and ucs-2/4 variants, resulting in out-of-bound table reads and
incorrect size checks. no buffer overflow was possible, but the
affected characters could be converted incorrectly, and iconv could
potentially produce an incorrect return value as a result.
remove the hard-coded table, and instead perform the recursive iconv
conversion to a temporary buffer, measuring the output size and
transferring it to the actual output buffer only if the whole
converted result fits.
this case is handled with a recursive call to iconv using a
specially-constructed conversion descriptor. the constant 0 was used
as the offset for utf-8, since utf-8 appears first in the charmaps
table, but the offset used needs to point into the charmap entry, past
the name/aliases at the beginning, to the byte identifying the
encoding. as a result of this error, junk was produced.
instead, call find_charmap so we don't have to hard-code a nontrivial
offset. with this change, the code has been tested and found to work
in the case of converting the affected hkscs characters to utf-8.
commit 95c6044e2ae85846330814c4ac5ebf4102dbe02c split UTF-32 and
UTF-32BE but neglected to add a case for the former as a destination
encoding, resulting in it wrongly being handled by the default case.
the intent was that the value of the macro be chosen to encode "big
endian" in the low bits, so that no code would be needed, but this was
botched; instead, handle it the way UCS2 is handled.
commit a223dbd27ae36fe53f9f67f86caf685b729593fc added the reverse
conversions to JIS-based encodings, but omitted the check for remining
buffer space in the case where the next character to be written was
single-byte, allowing conversion to continue past the end of the
use of MB_CUR_MAX encoded a hidden dependency on the currently active
locale for the calling thread, whereas nl_langinfo_l is supposed to
report for the locale passed as an argument.
In all cases this is just a change from two volatile int to one.
in the unified code for handling utf-16 and ucs2 output, the check for
ucs2 wrongly looked at the source charset rather than the destination
previously, the charset names without endianness specified were always
interpreted as big endian. unicode specifies that UTF-16 and UTF-32
have BOM-determined endianness if BOM is present, and are otherwise
big endian. since commit 5b546faa67544af395d6407553762b37e9711157
added support for stateful encodings, it is now possible to implement
BOM support via the conversion descriptor state.
for conversions to these charsets, the output is always big endian and
does not have a BOM.
these encodings are still commonly used in messaging protocols and
such. the reverse mapping is implemented as a binary search of a list
of the jis 0208 characters in unicode order; the existing forward
table is used to perform the comparison in the search.
previously, 8-bit codepages could only remap the high 128 bytes; the
low range was assumed/forced to agree with ascii. interpretation of
codepage table headers has been changed so that it's possible to
represent mappings for up to 256 slots (fewer if the initial portion
of the map is elided because it coincides with unicode codepoints).
this requires consuming a bit more of the 10-bit space of characters
that can be represented in 8-bit codepages, but there's still a plenty
left. the size of the legacy_chars table is actually reduced now by
eliding the first 256 entries and considering them to map implicitly
via the identity map.
before these changes, there seem to have been minor bugs/omissions in
codepage table generation, so it's likely that some actual bug fixes
are silently included in this commit. round-trip testing of a few
codepages was performed on the new version of the code, but no
differential testing against the old version was done.
the new version of the code used to generate these tables forces a
newline every 256 entries, whereas at the time these files were
originally generated and committed, it only wrapped them at 80
columns. the new behavior ensures that localized changes to the
tables, if they are ever needed, will produce localized diffs. other
tables including hkscs were already committed in the new format.
binary comparison of the generated object files was performed to
confirm that no spurious changes slipped in.
this implementation aims to match the baseline defined by rfc1468 (the
original mime charset definition) plus the halfwidth katakana
extension included in the whatwg definition of the charset. rejection
of si/so controls and newlines in doublebyte state are not currently
enforced. the jis x 0201 mode is currently interpreted as having the
yen sign and overline character in place of backslash and tilde; ascii
mode has the standard ascii characters in those slots.
assuming pointers obtained from malloc have some nonzero alignment,
repurpose the low bit of iconv_t as an indicator that the descriptor
is a stateless value representing the source and destination character
the special case where mbrtowc returns 0 but consumed 1 byte of input
does not need to be considered, because the short-circuit for low
bytes already covered that case.
short-circuiting low bytes before the switch precluded support for
character encodings that don't coincide with ascii in this range. this
limitation affected iso-2022 encodings, which use the esc byte to
introduce a shift sequence, and things like ebcdic.
this is in preparation to support stateful conversion descriptors,
which are necessarily allocated and thus must be freed in iconv_close.
putting it in a separate TU will avoid pulling in free if iconv_close
is not referenced.
this change is made to avoid having assumptions about the encoding
spread out across the file, and to facilitate future change to a form
that can accommodate allocted, stateful descriptors when needed.
this commit should not produce any functional changes; with the
compiler tested the only change to code generation was minor
reordering of local variables on stack.
since setlocale(cat, NULL) is required to return the setting for the
global locale, there is no standard mechanism to obtain the name of
the currently active thread-local locale set by uselocale. this makes
it impossible for application/library software to load appropriate
translations, etc. unless using the gettext implementation provided by
libc, which has privileged access to libc internals.
to fill this gap, glibc introduced the _NL_LOCALE_NAME macro which can
be used with nl_langinfo to obtain the name. GNU gettext/gnulib code
already use this functionality on glibc, and can easily be adapted to
make use of it on non-glibc systems if it's available; for other
systems they poke at locale implementation internals, which we want to
avoid. this patch provides a compatible interface to the one glibc
commit 97bd6b09dbe7478d5a90a06ecd9e5b59389d8eb9 refactored the table
lookup into a function and introduced an error in index computation.
the error caused garbage to be read from the table if the given charmap
had a non-zero number of elided entries.
Per 1003.1-2008 (2016 ed.), catopen must set errno on failure.
We set errno to EOPNOTSUPP because musl does not currently support
there was missing reverse-conversion logic for the case, handled
specially in the character set tables, where a byte represents a
unicode codepoint with the same value.
this patch adds code to handle the case, and refactors the two-level
10-bit table lookup for legacy character sets into a function to avoid
repeating it yet another time as part of the fix.
often translations will be named only by language, whereas locale
names may also include a territory code, modifier, and codeset
portion. previously, only translations exactly matching the locale
name were loaded. this was a major usability issue, requiring
workarounds like symlinks or tweaking of the locale name.
with these changes, gettext now searches for translations by first
removing the codeset portion of the locale name, then trying the
remainder in full, with modifier (@mod) removed, with territory code
(_XX) removed, and with both removed.
part of the reason gettext lacked support for searching fallbacks
before is that the candidate pathname for a translation file was
constructed on each call and used as the key to lookup an
already-mapped translation file. this was very costly/inefficient. we
now use the tuple of textdomain binding pointer, locale map pointer,
and integer category id as the key for looking up a translation file
based on patch by He X.
when called for LC_ALL, setlocale has to return a string representing
the state of all locale categories. the simplest way to do this was to
always return a delimited list of values for each category, but that's
not friendly in the fairly common case where all categories have the
same setting. He X proposed a patch to check for this case and return
a single name; this patch is a simplified approach to do the same.
use the standard strnlen idiom for cases where lengths greater than an
imposed limit are going to be rejected immediately anyway.
the plural_rule field of allocated msgcat structures was assumed to be
initially-null but was never initialized. for future-proofing, the
nplurals field which was left uninitialized should also be cleared.
likewise, in the binding structure, the active field could be used
uninitialized by a technicality: the a_store which stores the initial
value of 0 may be implemented as a cas operation, which reads the old
rather than fixing these issues individually, just use calloc for both
allocations. this does result in wasteful clearing of name buffers (up
to NAME_MAX+PATH_MAX) before filling them, but since the size if
bounded and the time is dominated by filesystem operations, it really
doesn't matter; simplicity and future-proofing have more value here.
modified from patch submitted by He X.
this loop was only supposed to deactivate other bindings for the same
text domain name, but due to copy-and-paste error, deactivated all
patch by He X.
it was wrongly returning a null pointer instead of an empty string.
commit 844212d94f582c4e3c5055e0a1524931e89ebe76, which did not make it
into any releases, changed nl_langinfo(CODESET) to always return
"UTF-8", even in the byte-based C locale. this was problematic because
application software was found to use the string match for "UTF-8" to
activate its own UTF-8 processing. this both undermines the byte-based
functionality of the C locale, and if mixed with with calls to the
standard multibyte functions, which happened in practice, could result
in severe mis-handling of input.
the motive for the previous change was that, to avoid widespread
compatibility problems, the string returned by nl_langinfo(CODESET)
needs to be accepted by iconv and by third-party character conversion
code. thus, the only remaining choice is "ASCII". this choice
accurately represents the intent that high bytes do not have
individual meaning in the C locale, but it does mean that iconv, when
passed nl_langinfo(CODESET) in the C locale, will produce errors in
cases where mbrtowc would have succeeded. for reference, glibc behaves
similarly in this regard, so I don't think it will be a problem.
per ISO C, CHAR_MAX, not -1, is the value used to indicate that a char
field in struct lconv is unavailable.
patch by Julien Ramseier.
this restores the original behavior prior to the addition of the
byte-based C locale and fixes what is effectively a regression in
musl's property of always providing working UTF-8 support.
commit 1507ebf837334e9e07cfab1ca1c2e88449069a80 introduced the codeset
name "UTF-8-CODE-UNITS" for the byte-based C locale to represent that
the semantic content is UTF-8 but that it is being processed as code
units (bytes) rather than whole multibyte characters. however, many
programs assume that the codeset name is usable with iconv and/or
comes from a set of standard/widely-used names known to the
application. such programs are likely to produce warnings or errors,
run with reduced functionality, or mangle character data when run
explicitly in the C locale.
the standard places basically no requirements for the string returned
by nl_langinfo(CODESET) and how it interacts with other interfaces, so
returning "UTF-8" is permissible. moreover, it seems like the right
thing to do, since the identity of the character encoding as "UTF-8"
is independent of whether it is being processed as bytes of characters
by the standard library functions.
this patch adjusts libc components which use the multibyte functions
internally, and which depend on them operating in a particular
encoding, to make the appropriate locale changes before calling them
and restore the calling thread's locale afterwards. activating the
byte-based C locale without these changes would cause regressions in
stdio and iconv.
in the case of iconv, the current implementation was simply using the
multibyte functions as UTF-8 conversions. setting a multibyte UTF-8
locale for the duration of the iconv operation allows the code to
in the case of stdio, POSIX requires that FILE streams have an
encoding rule bound at the time of setting wide orientation. as long
as all locales, including the C locale, used the same encoding,
treating high bytes as UTF-8, there was no need to store an encoding
rule as part of the stream's state.
a new locale field in the FILE structure points to the locale that
should be made active during fgetwc/fputwc/ungetwc on the stream. it
cannot point to the locale active at the time the stream becomes
oriented, because this locale could be mutable (the global locale) or
could be destroyed (locale_t objects produced by newlocale) before the
stream is closed. instead, a pointer to the static C or C.UTF-8 locale
object added in commit commit aeeac9ca5490d7d90fe061ab72da446c01ddf746
is used. this is valid since categories other than LC_CTYPE will not
affect these functions.
this patch makes the functions which work directly on multibyte
characters treat the high bytes as individual abstract code units
rather than as multibyte sequences when MB_CUR_MAX is 1. since
MB_CUR_MAX is presently defined as a constant 4, all of the new code
added is dead code, and optimizing compilers' code generation should
not be affected at all. a future commit will activate the new code.
as abstract code units, bytes 0x80 to 0xff are represented by wchar_t
values 0xdf80 to 0xdfff, at the end of the surrogates range. this
ensures that they will never be misinterpreted as Unicode characters,
and that all wctype functions return false for these "characters"
without needing locale-specific logic. a high range outside of Unicode
such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's
char16_t also needs to be able to represent conversions of these
bytes, the surrogate range was the natural choice.