summaryrefslogtreecommitdiff
path: root/src/locale
AgeCommit message (Collapse)AuthorLines
2017-03-21search locale name variants for gettext translationsRich Felker-32/+55
often translations will be named only by language, whereas locale names may also include a territory code, modifier, and codeset portion. previously, only translations exactly matching the locale name were loaded. this was a major usability issue, requiring workarounds like symlinks or tweaking of the locale name. with these changes, gettext now searches for translations by first removing the codeset portion of the locale name, then trying the remainder in full, with modifier (@mod) removed, with territory code (_XX) removed, and with both removed. part of the reason gettext lacked support for searching fallbacks before is that the candidate pathname for a translation file was constructed on each call and used as the key to lookup an already-mapped translation file. this was very costly/inefficient. we now use the tuple of textdomain binding pointer, locale map pointer, and integer category id as the key for looking up a translation file mapping. based on patch by He X.
2017-03-21make setlocale return a single name for LC_ALL if all categories matchRich Felker-2/+5
when called for LC_ALL, setlocale has to return a string representing the state of all locale categories. the simplest way to do this was to always return a delimited list of values for each category, but that's not friendly in the fairly common case where all categories have the same setting. He X proposed a patch to check for this case and return a single name; this patch is a simplified approach to do the same.
2017-01-29avoid unbounded strlen in gettext functionsRich Felker-3/+3
use the standard strnlen idiom for cases where lengths greater than an imposed limit are going to be rejected immediately anyway.
2017-01-29fix use of uninitialized pointer in gettext coreRich Felker-2/+2
the plural_rule field of allocated msgcat structures was assumed to be initially-null but was never initialized. for future-proofing, the nplurals field which was left uninitialized should also be cleared. likewise, in the binding structure, the active field could be used uninitialized by a technicality: the a_store which stores the initial value of 0 may be implemented as a cas operation, which reads the old value. rather than fixing these issues individually, just use calloc for both allocations. this does result in wasteful clearing of name buffers (up to NAME_MAX+PATH_MAX) before filling them, but since the size if bounded and the time is dominated by filesystem operations, it really doesn't matter; simplicity and future-proofing have more value here. modified from patch submitted by He X.
2017-01-29fix bindtextdomain logic error deactivating other domainsRich Felker-1/+1
this loop was only supposed to deactivate other bindings for the same text domain name, but due to copy-and-paste error, deactivated all other bindings. patch by He X.
2015-11-10fix return value of nl_langinfo for invalid item argumentsRich Felker-5/+5
it was wrongly returning a null pointer instead of an empty string.
2015-10-01make nl_langinfo(CODESET) always return "ASCII" in byte-based C localeRich Felker-1/+1
commit 844212d94f582c4e3c5055e0a1524931e89ebe76, which did not make it into any releases, changed nl_langinfo(CODESET) to always return "UTF-8", even in the byte-based C locale. this was problematic because application software was found to use the string match for "UTF-8" to activate its own UTF-8 processing. this both undermines the byte-based functionality of the C locale, and if mixed with with calls to the standard multibyte functions, which happened in practice, could result in severe mis-handling of input. the motive for the previous change was that, to avoid widespread compatibility problems, the string returned by nl_langinfo(CODESET) needs to be accepted by iconv and by third-party character conversion code. thus, the only remaining choice is "ASCII". this choice accurately represents the intent that high bytes do not have individual meaning in the C locale, but it does mean that iconv, when passed nl_langinfo(CODESET) in the C locale, will produce errors in cases where mbrtowc would have succeeded. for reference, glibc behaves similarly in this regard, so I don't think it will be a problem.
2015-09-24fix localeconv field value for unavailable valuesRich Felker-14/+15
per ISO C, CHAR_MAX, not -1, is the value used to indicate that a char field in struct lconv is unavailable. patch by Julien Ramseier.
2015-09-09fix breakage in nl_langinfo from previous commitRich Felker-1/+1
2015-09-09make nl_langinfo(CODESET) always return "UTF-8"Rich Felker-2/+1
this restores the original behavior prior to the addition of the byte-based C locale and fixes what is effectively a regression in musl's property of always providing working UTF-8 support. commit 1507ebf837334e9e07cfab1ca1c2e88449069a80 introduced the codeset name "UTF-8-CODE-UNITS" for the byte-based C locale to represent that the semantic content is UTF-8 but that it is being processed as code units (bytes) rather than whole multibyte characters. however, many programs assume that the codeset name is usable with iconv and/or comes from a set of standard/widely-used names known to the application. such programs are likely to produce warnings or errors, run with reduced functionality, or mangle character data when run explicitly in the C locale. the standard places basically no requirements for the string returned by nl_langinfo(CODESET) and how it interacts with other interfaces, so returning "UTF-8" is permissible. moreover, it seems like the right thing to do, since the identity of the character encoding as "UTF-8" is independent of whether it is being processed as bytes of characters by the standard library functions.
2015-06-16byte-based C locale, phase 2: stdio and iconv (multibyte callers)Rich Felker-0/+6
this patch adjusts libc components which use the multibyte functions internally, and which depend on them operating in a particular encoding, to make the appropriate locale changes before calling them and restore the calling thread's locale afterwards. activating the byte-based C locale without these changes would cause regressions in stdio and iconv. in the case of iconv, the current implementation was simply using the multibyte functions as UTF-8 conversions. setting a multibyte UTF-8 locale for the duration of the iconv operation allows the code to continue working. in the case of stdio, POSIX requires that FILE streams have an encoding rule bound at the time of setting wide orientation. as long as all locales, including the C locale, used the same encoding, treating high bytes as UTF-8, there was no need to store an encoding rule as part of the stream's state. a new locale field in the FILE structure points to the locale that should be made active during fgetwc/fputwc/ungetwc on the stream. it cannot point to the locale active at the time the stream becomes oriented, because this locale could be mutable (the global locale) or could be destroyed (locale_t objects produced by newlocale) before the stream is closed. instead, a pointer to the static C or C.UTF-8 locale object added in commit commit aeeac9ca5490d7d90fe061ab72da446c01ddf746 is used. this is valid since categories other than LC_CTYPE will not affect these functions.
2015-06-16byte-based C locale, phase 1: multibyte character handling functionsRich Felker-1/+2
this patch makes the functions which work directly on multibyte characters treat the high bytes as individual abstract code units rather than as multibyte sequences when MB_CUR_MAX is 1. since MB_CUR_MAX is presently defined as a constant 4, all of the new code added is dead code, and optimizing compilers' code generation should not be affected at all. a future commit will activate the new code. as abstract code units, bytes 0x80 to 0xff are represented by wchar_t values 0xdf80 to 0xdfff, at the end of the surrogates range. this ensures that they will never be misinterpreted as Unicode characters, and that all wctype functions return false for these "characters" without needing locale-specific logic. a high range outside of Unicode such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's char16_t also needs to be able to represent conversions of these bytes, the surrogate range was the natural choice.
2015-06-06make static C and C.UTF-8 locales available outside of newlocaleRich Felker-21/+21
2015-06-05fix uselocale((locale_t)0) not to modify localeTimo Teräs-3/+1
commit 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 made the new locale to be assigned unconditonally resulting in crashes later on.
2015-05-27implement fail-safe static locales for newlocaleRich Felker-13/+46
this frees applications which need to make temporary use of the C locale (via uselocale) from the possibility that newlocale might fail. the C.UTF-8 locale is also provided as a static locale. presently they behave the same, but this may change in the future.
2015-05-27rename internal locale file handling locale mapsRich Felker-0/+0
since the __setlocalecat function was removed, the filename __setlocalecat.c no longer made sense.
2015-05-27overhaul locale internals to treat categories roughly uniformlyRich Felker-129/+107
previously, LC_MESSAGES was treated specially as the only category which could be set to a locale name without a definition file, in order to facilitate gettext message translations when no libc locale was available. LC_NUMERIC was completely un-settable, and LC_CTYPE stored a flag intended to be used for a possible future byte-based C locale, instead of storing a __locale_map pointer like the other categories use. this patch changes all categories to be represented by pointers to __locale_map structures, and allows locale names without definition files to be treated as valid locales with trivial definition when used in any category. outwardly visible functional changes should be minor, limited mainly to the strings read back from setlocale and the way gettext handles translations in categories other than LC_MESSAGES. various internal refactoring has also been performed, and improvements in const correctness have been made.
2015-05-27replace atomics with locks in locale-setting codeRich Felker-32/+51
this is part of a general program of removing direct use of atomics where they are not necessary to meet correctness or performance needs, but in this case it's also an optimization. only the global locale needs synchronization; allocated locales referenced with locale_t handles are immutable during their lifetimes, and using atomics to initialize them increases their cost of setup.
2015-05-21remove outdated and misleading comment in iconv.cRich Felker-6/+0
the comment claimed that EUC/GBK/Big5 are not implemented, which has been incorrect since commit 19b4a0a20efc6b9df98b6a43536ecdd628ba4643.
2015-05-21in iconv_open, accept "CHAR" and "" as aliases for "UTF-8"Rich Felker-1/+2
while not a requirement, it's common convention in other iconv implementations to accept "CHAR" as an alias for nl_langinfo(CODESET), meaning the encoding used for char[] strings in the current locale, and also "" as an alternate form. supporting this is not costly and improves compatibility.
2015-05-18fix null pointer dereference in dcngettext under specific conditionsRich Felker-1/+1
if setlocale has not been called, the current locale's messages_name may be a null pointer. the code path where it's assumed to be non-null was only reachable if bindtextdomain had already been called, which is normally not done in programs which do not call setlocale, so the omitted check went unnoticed. patch from Void Linux, with description rewritten.
2015-05-16eliminate costly tricks to avoid TLS access for current locale stateRich Felker-15/+2
the code being removed used atomics to track whether any threads might be using a locale other than the current global locale, and whether any threads might have abstract 8-bit (non-UTF-8) LC_CTYPE active, a feature which was never committed (still pending). the motivations were to support early execution prior to setup of the thread pointer, to partially support systems (ancient kernels) where thread pointer setup is not possible, and to avoid high performance cost on archs where accessing the thread pointer may be very slow. since commit 19a1fe670acb3ab9ead0fe31859ca7d4fe40dd54, the thread pointer is always available, so these hacks are no longer needed. removing them greatly simplifies the affected code.
2015-04-21fix duplocale clobbering of new locale struct with memcpy of oldRich Felker-1/+2
when the non-stub duplocale code was added as part of the locale framework in commit 0bc03091bb674ebb9fa6fe69e4aec1da3ac484f2, the old code to memcpy the old locale object to the new one was left behind. the conditional for the memcpy no longer makes sense, because the conditions are now always-true when it's reached, and the memcpy is wrong because it clobbers the new->messages_name pointer setup just above. since the messages_name and ctype_utf8 members have already been copied, all that remains is the cat[] array. these pointers are volatile, so using memcpy to copy them is formally wrong; use a for loop instead.
2015-03-03make all objects used with atomic operations volatileRich Felker-5/+5
the memory model we use internally for atomics permits plain loads of values which may be subject to concurrent modification without requiring that a special load function be used. since a compiler is free to make transformations that alter the number of loads or the way in which loads are performed, the compiler is theoretically free to break this usage. the most obvious concern is with atomic cas constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be transformed to a_cas(p,*p,f(*p)); where the latter is intended to show multiple loads of *p whose resulting values might fail to be equal; this would break the atomicity of the whole operation. but even more fundamental breakage is possible. with the changes being made now, objects that may be modified by atomics are modeled as volatile, and the atomic operations performed on them by other threads are modeled as asynchronous stores by hardware which happens to be acting on the request of another thread. such modeling of course does not itself address memory synchronization between cores/cpus, but that aspect was already handled. this all seems less than ideal, but it's the best we can do without mandating a C11 compiler and using the C11 model for atomics. in the case of pthread_once_t, the ABI type of the underlying object is not volatile-qualified. so we are assuming that accessing the object through a volatile-qualified lvalue via casts yields volatile access semantics. the language of the C standard is somewhat unclear on this matter, but this is an assumption the linux kernel also makes, and seems to be the correct interpretation of the standard.
2014-09-06fix non-static dummy function that slipped in with locale implementationRich Felker-1/+1
2014-08-13add inline isspace in ctype.h as an optimizationSzabolcs Nagy-8/+0
isspace can be a bottleneck in a simple parser, inlining it gives slightly smaller and faster code src/locale/pleval.o already had this optimization, the size change for other libc functions for i386 is src/internal/intscan.o 2134 2118 -16 src/locale/dcngettext.o 1562 1552 -10 src/network/res_msend.o 1961 1940 -21 src/network/lookup_name.o 2627 2608 -19 src/network/getnameinfo.o 1814 1811 -3 src/network/lookup_serv.o 643 624 -19 src/stdio/vfscanf.o 2675 2663 -12 src/stdlib/atoll.o 117 107 -10 src/stdlib/atoi.o 95 91 -4 src/stdlib/atol.o 95 91 -4 src/time/strptime.o 1515 1503 -12 (TOTALS) 432451 432321 -130
2014-07-31harden locale name handling and prevent slashes in LC_MESSAGESRich Felker-3/+3
the code which loads locale files was already rejecting locale names containing slashes. however, LC_MESSAGES records a locale name even if libc does not have a matching locale file, so that gettext or application code can use the recorded locale name for message translations to languages that libc does not support. this recorded name was not being checked for slashes, meaning that such code could potentially be tricked into directory traversal. in addition, since the value of a locale category is sometimes used as a pathname component by callers, the improved code rejects any value beginning with a dot. this prevents traversal to the parent directory via "..", use of the top-level locale directory via ".", and also avoids "hidden" directories as a side effect. finally, overly long locale names are now rejected (treated as an unrecognized name and thus as an alias for C.UTF-8) rather than being truncated.
2014-07-30plural rule evaluator rewrite for dcngettextSzabolcs Nagy-128/+106
using an operator precedence parser the code size became smaller and it is only slower by about %10 size of old vs new pleval.o on different archs: (with inlined isspace added to pleval.c for now) old: text data bss dec hex filename 828 0 0 828 33c pl.i386.o 1152 0 0 1152 480 pl.arm.o 1704 0 0 1704 6a8 pl.mips.o 1328 0 0 1328 530 pl.ppc.o 992 0 0 992 3e0 pl.x64.o new: text data bss dec hex filename 693 0 0 693 2b5 pl.i386.o 972 0 0 972 3cc pl.arm.o 1276 0 0 1276 4fc pl.mips.o 1087 0 0 1087 43f pl.ppc.o 846 0 0 846 34e pl.x64.o
2014-07-29tweaks to plural rules evaluatorSzabolcs Nagy-54/+44
const parsing, depth accounting and failure handling was changed a bit so the generated code is slightly smaller.
2014-07-29harden dcngettext plural processingRich Felker-2/+3
while the __mo_lookup backend can verify that the translated message ends with a null terminator, is has no way to know nplurals and thus no way to verify that sufficiently many null terminators are present in the string to satisfy all plural forms. the code in dcngettext was already attempting to avoid reading past the end of the mo file mapping, but failed to do so because the strlen call itself could over-read. using strnlen instead allows us to avoid the problem.
2014-07-29harden mo file processing for locale/translationsRich Felker-2/+6
rather than just checking that the start of the string lies within the mapping, also check that the nominal length remains within the mapping, and that the null terminator is present at the nominal length. this ensures that the caller, using the result as a C string, will not read past the end of the mapping. the nominal length is never exposed to the caller, but it's useful internally to find where the null terminator should be without having to restort to linear search via strnlen/memchr.
2014-07-28implement non-default plural rules for ngettext translationsRich Felker-8/+243
the new code in dcngettext was written by me, and the expression evaluator by Szabolcs Nagy (nsz).
2014-07-27implement gettext message translation functionsRich Felker-68/+271
this commit replaces the stub implementations with working message translation functions. translation units are factored so as to prevent pulling in the legacy, non-library-safe functions which use a global textdomain in modern code which is using the versions with an explicit domain argument. bind_textdomain_codeset is also placed in its own file since it should not be needed by most programs. this implementation is still missing some features: the LANGUAGE environment variable (for multiple fallback languages) is not honored, and non-default plural-form rules are not supported. these issues will be addressed in a later commit. one notable difference from the GNU implementation is that there is no default path for loading translation files. in principle one could be added, but since the documented correct usage is to call the bindtextdomain function, a default path is probably unnecessary.
2014-07-26add support for LC_TIME and LC_MESSAGES translationsRich Felker-7/+1
for LC_MESSAGES, translation of strerror and similar literal message functions is supported. for messages in other places (particularly the dynamic linker) that use format strings, translation is not yet supported. in order to make it possible and safe, such messages will need to be refactored to separate the textual content from the format. for LC_TIME, the day and month names and strftime-style format strings provided by nl_langinfo are supported for translation. however there may be limitations, as some of the original C-locale nl_langinfo strings are non-unique and thus perhaps non-suitable as keys. overall, the locale support activated by this commit should not be seen as complete and polished but as a basis for beginning to test locale functionality and implement locales.
2014-07-26add missing yes/no strings to nl_langinfoRich Felker-2/+2
these were removed from the standard but still offered as an extension in langinfo.h, so nl_langinfo should support them.
2014-07-26fix nl_langinfo table for LC_TIME era-related itemsRich Felker-1/+2
due to a skipped slot and missing null terminator, the last few strings were off by one or two slots from their item codes.
2014-07-26implement mo file string lookup for translationsRich Felker-0/+65
the core is based on a binary search; hash table is not used. both native and reverse-endian mo files are supported. all offsets read from the mapped mo file are checked against the mapping size to prevent the possibility of reads outside the mapping. this commit has no observable effects since there are not yet any callers to the message translation code.
2014-07-24implement locale file loading and state for remaining locale categoriesRich Felker-2/+70
there is still no code which actually uses the loaded locale files, so the main observable effect of this commit is that calls to setlocale store and give back the names of the selected locales for the remaining categories (LC_TIME, LC_COLLATE, LC_MONETARY) if a locale file by the requested name could be loaded.
2014-07-24fix locale environment variable logic for empty stringsRich Felker-3/+3
per POSIX (XBD 8.2) LC_*/LANG environment variables set to to the empty string are supposed to be treated as if they were not set at all.
2014-07-02properly pass current locale to *_l functions when used internallyRich Felker-6/+12
this change is presently non-functional since the callees do not yet use their locale argument for anything.
2014-07-02consolidate str[n]casecmp_l into str[n]casecmp source filesRich Felker-13/+0
this is mainly done for consistency with the ctype functions and to declutter the src/locale directory.
2014-07-02consolidate *_l ctype/wctype functions into their non-_l source filesRich Felker-204/+0
the main practical purposes of this commit are to remove a huge amount of clutter from the src/locale directory, to cut down on the length of the $(AR) and $(LD) command lines, and to reduce the amount of space wasted by object file headers in the static libc.a. build time may also be reduced, though this has not been measured. as an additional justification, if there ever were a need for the behavior of these functions to vary by locale, it would be necessary for the non-_l versions to call the _l versions, so that linking the former without the latter would not be possible anyway.
2014-07-02add locale frameworkRich Felker-19/+155
this commit adds non-stub implementations of setlocale, duplocale, newlocale, and uselocale, along with the data structures and minimal code needed for representing the active locale on a per-thread basis and optimizing the common case where thread-local locale settings are not in use. at this point, the data structures only contain what is necessary to represent LC_CTYPE (a single flag) and LC_MESSAGES (a name for use in finding message translation files). representation for the other categories will be added later; the expectation is that a single pointer will suffice for each. for LC_CTYPE, the strings "C" and "POSIX" are treated as special; any other string is accepted and treated as "C.UTF-8". for other categories, any string is accepted after being truncated to a maximum supported length (currently 15 bytes). for LC_MESSAGES, the name is kept regardless of whether libc itself can use such a message translation locale, since applications using catgets or gettext should be able to use message locales libc is not aware of. for other categories, names which are not successfully loaded as locales (which, at present, means all names) are treated as aliases for "C". setlocale never fails. locale settings are not yet used anywhere, so this commit should have no visible effects except for the contents of the string returned by setlocale.
2014-06-10replace all remaining internal uses of pthread_self with __pthread_selfRich Felker-1/+1
prior to version 1.1.0, the difference between pthread_self (the public function) and __pthread_self (the internal macro or inline function) was that the former would lazily initialize the thread pointer if it was not already initialized, whereas the latter would crash in this case. since lazy initialization is no longer supported, use of pthread_self no longer makes sense; it simply generates larger, slower code.
2014-05-13add cp437 and cp850 to available iconv conversionsRich Felker-177/+206
perhaps some additional legacy DOS-era codepages would also be useful to have, but these are the ones for which there has been demand. the size of the diff is due to the fact that legacychars.h is updated in such a way that new characters are inserted into the table in unicode codepoint order; thus other mappings in codepages.h have changed to reflect the new table indices of their characters.
2014-01-23fix an overflow in wcsxfrm when n==0Szabolcs Nagy-2/+4
posix allows zero length destination
2013-12-12include cleanups: remove unused headers and add feature test macrosSzabolcs Nagy-3/+1
2013-11-25remove duplicate includes from dynlink.c, strfmon.c and getaddrinfo.cSzabolcs Nagy-1/+0
2013-08-17remove spurious tmp file present since initial git check-inRich Felker-390/+0
2013-08-17add hkscs/big5-2003/eten extensions to iconv big5Rich Felker-977/+1433
with these changes, the character set implemented as "big5" in musl is a pure superset of cp950, the canonical "big5", and agrees with the normative parts of Unicode. this means it has minor differences from both hkscs and big5-2003: - the range A2CC-A2CE maps to CJK ideographs rather than numerals, contrary to changes made in big5-2003. - C6CD maps to a CJK ideograph rather than its corresponding Kangxi radical character, contrary to changes made in hkscs. - F9FE maps to U+2593 rather than U+FFED. of these differences, none but the last are visually distinct, and the last is a character used purely for text-based graphics, not to convey linguistic content. should there be future demand for strict conformance to big5-2003 or hkscs mappings, the present charset aliases can be replaced with distinct variants. reportedly there are other non-standard big5 extensions in common use in Taiwan and perhaps elsewhere, which could also be added as layers on top of the existing big5 support. there may be additional characters which should be added to the hkscs table: the whatwg standard for big5 defines what appears to be a superset of hkscs.