musl - an implementation of the standard library for Linux-based systems

Age	Commit message	Author	Lines
2026-04-02	fix pathological slowness & incorrect mappings in iconv gb18030 decoder	Rich Felker	-9/+24
	in order to implement the "UTF" aspect of gb18030 (ability to represent arbitrary unicode characters not present in the 2-byte mapping), we have to apply the index obtained from the encoded 4-byte sequence into the set of unmapped characters. this was done by scanning repeatedly over the table of mapped characters, counting off the mapped characters below a running index and adjusting the running index by that count on each iteration. this iterative process eventually leaves us with the value of the Nth unmapped character replacing the index, but depending on which particular character that is, the number of iterations needed to find it can be in the tens of thousands, and each iteration traverses the whole 126x190 table in the inner loop. this can lead to run times exceeding an entire second per character on moderate-speed machines. on top of that, the transformation logic produced wrong results for BMP characters above the surrogate range, as a result of not correctly accounting for it being excluded, and for characters outside the BMP, as a result of a misunderstanding of how gb18030 encodes them. this patch replaces the unmapped character lookup with a single linear search of a list of unmapped ranges. there are only 206 such ranges, and these are permanently assigned and unchangeable as a consequence of the character encoding having to be stable, so a simple array of 16-bit start/length values for each range consumes only 824 bytes, a very reasonable size cost here. this new table accounts for the previously-incorrect surrogate handling, and non-BMP characters are handled correctly by a single offset, without the need for any unmapped-range search. there are still a small number of mappings that are incorrect due to late changes made in the definition of gb18030, swapping PUA codepoints with proper Unicode characters. correcting these requires a postprocessing step that will be added later.
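	Illustrative sketch (not taken from the commit): the range-based lookup described above can be pictured as a walk over {start, length} pairs, subtracting each length from the index until the index falls inside a range. The table contents, the names toy_ranges and nth_unmapped, and the return convention are assumptions made for this example; the real table has 206 entries, and 206 entries x 2 values x 2 bytes gives the 824 bytes quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table: each entry is {first codepoint, run length} for a
 * run of characters with no 2-byte mapping. This is NOT the real data. */
static const uint16_t toy_ranges[][2] = {
    { 0x0080, 36 },
    { 0x00a5, 2 },
    { 0x00a9, 7 },
};

/* Return the n-th (0-based) unmapped codepoint described by the table,
 * or -1 if n is past the end of the last range. */
static long nth_unmapped(const uint16_t (*ranges)[2], size_t count, unsigned n)
{
    assert(ranges != NULL);
    for (size_t i = 0; i < count; i++) {
        if (n < ranges[i][1])
            return (long)ranges[i][0] + (long)n;
        n -= ranges[i][1];   /* skip this whole range in one step */
    }
    return -1;
}
```

	The cost is bounded by the number of ranges, independent of which character is requested, which is the property the message contrasts with the earlier repeated table scans.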
2025-02-12	iconv: harden UTF-8 output code path against input decoder bugs	Rich Felker	-0/+4
	the UTF-8 output code was written assuming an invariant that iconv's decoders only emit valid Unicode Scalar Values which wctomb can encode successfully, thereby always returning a value between 1 and 4. if this invariant is not satisfied, wctomb returns (size_t)-1, and the subsequent adjustments to the output buffer pointer and remaining output byte count overflow, moving the output position backwards, potentially past the beginning of the buffer, without storing any bytes.
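	Illustrative sketch (not the actual patch): the hazard described above is that an error return of (size_t)-1 gets added to the output pointer and subtracted from the remaining count. A hardened output step checks the encoder's result before touching either. encode_utf8 is a hypothetical stand-in for the real encoder, and put_utf8 and its return codes are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder: writes up to 4 bytes and returns the length, or
 * (size_t)-1 if c is not a Unicode scalar value. */
static size_t encode_utf8(unsigned char *s, uint32_t c)
{
    if (c < 0x80) { s[0] = c; return 1; }
    if (c < 0x800) { s[0] = 0xc0 | (c >> 6); s[1] = 0x80 | (c & 0x3f); return 2; }
    if (c - 0xd800 < 0x800 || c > 0x10ffff) return (size_t)-1;
    if (c < 0x10000) {
        s[0] = 0xe0 | (c >> 12);
        s[1] = 0x80 | ((c >> 6) & 0x3f);
        s[2] = 0x80 | (c & 0x3f);
        return 3;
    }
    s[0] = 0xf0 | (c >> 18);
    s[1] = 0x80 | ((c >> 12) & 0x3f);
    s[2] = 0x80 | ((c >> 6) & 0x3f);
    s[3] = 0x80 | (c & 0x3f);
    return 4;
}

/* Returns 0 on success, -1 for an unencodable value, -2 if out of space.
 * On failure, *out and *outb are left unchanged. */
static int put_utf8(unsigned char **out, size_t *outb, uint32_t c)
{
    unsigned char tmp[4];
    assert(out && *out && outb);
    size_t k = encode_utf8(tmp, c);
    if (k == (size_t)-1) return -1;   /* checked before any pointer math */
    if (*outb < k) return -2;
    memcpy(*out, tmp, k);
    *out += k;
    *outb -= k;
    return 0;
}
```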
2025-02-09	iconv: fix erroneous input validation in EUC-KR decoder	Rich Felker	-1/+1
	as a result of incorrect bounds checking on the lead byte being decoded, certain invalid inputs which should produce an encoding error, such as "\xc8\x41", instead produced out-of-bounds loads from the ksc table. in a worst case, the loaded value may not be a valid unicode scalar value, in which case, if the output encoding was UTF-8, wctomb would return (size_t)-1, causing an overflow in the output pointer and remaining buffer size which could clobber memory outside of the output buffer. bug report was submitted in private by Nick Wellnhofer on account of potential security implications.
2025-02-09	iconv: fix erroneous decoding of some invalid ShiftJIS sequences	Rich Felker	-0/+2
	out-of-range second bytes were not handled, leading to wrong character output rather than a reported encoding error. fix based on bug report by Nick Wellnhofer, submitted in private in case the issue turned out to have security implications.
2024-03-02	iconv: fix missing bounds checking for shift_jis decoding	Rich Felker	-0/+1
	the jis0208 table we use is only 84x94 in size, but the shift_jis encoding supports a 94x94 grid. attempts to convert sequences outside of the supported zone resulted in out-of-bounds table reads, misinterpreting adjacent rodata as part of the character table and thereby converting these sequences to unexpected characters.
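	Illustrative sketch (not the actual one-line fix): when the encoding's grid is larger than the table that backs it, both coordinates need to be checked against the table's real dimensions before indexing. The table contents, the name lookup, and the use of 0 as the "no mapping" result are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 84   /* rows actually present in the table */
#define COLS 94

/* Hypothetical toy table; only one cell is filled in. */
static const uint16_t toy_table[ROWS][COLS] = { [0][0] = 0x3000 };

/* Return the table entry, or 0 if (row, col) lies outside the table. */
static unsigned lookup(const uint16_t (*table)[COLS], unsigned row, unsigned col)
{
    assert(table != NULL);
    if (row >= ROWS || col >= COLS)
        return 0;             /* caller would report an encoding error */
    return table[row][col];
}
```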
2024-03-01	iconv: add aliases for GBK	Rich Felker	-1/+1
	these are taken from the IANA registry, restricted to those that match the forms already used for other supported character encodings.
2024-03-01	iconv: add euro symbol to GBK as single byte 0x80	Rich Felker	-0/+4
	this is how it's defined in the cp936 document referenced by the IANA charset registry as defining GBK, and of the mappings defined there, was the only one missing. it is not accepted for GB18030, as GB18030 is a UTF and has its own unique mapping for the euro symbol.
2024-02-29	iconv: add cp932 as an alias for shift_jis	Rich Felker	-1/+1

2018-06-01	fix output size handling for multi-unicode-char big5-hkscs characters	Rich Felker	-5/+13
	since this iconv implementation's output is stateless, it's necessary to know before writing anything to the output buffer whether the conversion of the current input character will fit. previously we used a hard-coded table of the output size needed for each supported output encoding, but failed to update the table when adding support for conversion to jis-based encodings and again when adding separate encoding identifiers for implicit-endianness utf-16/32 and ucs-2/4 variants, resulting in out-of-bound table reads and incorrect size checks. no buffer overflow was possible, but the affected characters could be converted incorrectly, and iconv could potentially produce an incorrect return value as a result. remove the hard-coded table, and instead perform the recursive iconv conversion to a temporary buffer, measuring the output size and transferring it to the actual output buffer only if the whole converted result fits.
2018-06-01	fix iconv mapping of big5-hkscs characters that map to two unicode chars	Rich Felker	-1/+1
	this case is handled with a recursive call to iconv using a specially-constructed conversion descriptor. the constant 0 was used as the offset for utf-8, since utf-8 appears first in the charmaps table, but the offset used needs to point into the charmap entry, past the name/aliases at the beginning, to the byte identifying the encoding. as a result of this error, junk was produced. instead, call find_charmap so we don't have to hard-code a nontrivial offset. with this change, the code has been tested and found to work in the case of converting the affected hkscs characters to utf-8.
2018-05-09	fix iconv conversion to UTF-32 with implicit (big) endianness	Will Dietz	-0/+2
	maintainer's notes: commit 95c6044e2ae85846330814c4ac5ebf4102dbe02c split UTF-32 and UTF-32BE but neglected to add a case for the former as a destination encoding, resulting in it wrongly being handled by the default case. the intent was that the value of the macro be chosen to encode "big endian" in the low bits, so that no code would be needed, but this was botched; instead, handle it the way UCS2 is handled.
2018-05-09	fix iconv buffer overflow converting to legacy JIS-based encodings	Will Dietz	-0/+1
	maintainer's notes: commit a223dbd27ae36fe53f9f67f86caf685b729593fc added the reverse conversions to JIS-based encodings, but omitted the check for remaining buffer space in the case where the next character to be written was single-byte, allowing conversion to continue past the end of the destination buffer.
2017-12-18	fix iconv output of surrogate pairs in ucs2	Rich Felker	-1/+1
	in the unified code for handling utf-16 and ucs2 output, the check for ucs2 wrongly looked at the source charset rather than the destination charset.
2017-12-18	add support for BOM-determined-endian UCS2, UTF-16, and UTF-32 to iconv	Rich Felker	-3/+40
	previously, the charset names without endianness specified were always interpreted as big endian. unicode specifies that UTF-16 and UTF-32 have BOM-determined endianness if BOM is present, and are otherwise big endian. since commit 5b546faa67544af395d6407553762b37e9711157 added support for stateful encodings, it is now possible to implement BOM support via the conversion descriptor state. for conversions to these charsets, the output is always big endian and does not have a BOM.
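	Illustrative sketch (not the commit's code): the rule stated above for UTF-16 input is "use the BOM if present, otherwise assume big endian". The enum names, the function name, and the skip out-parameter are assumptions for this example; how the result is stored in the conversion descriptor state is not shown.

```c
#include <assert.h>
#include <stddef.h>

enum { ORDER_BE = 0, ORDER_LE = 1 };

/* Decide UTF-16 byte order from the start of the input. *skip receives the
 * number of BOM bytes to consume (0 when no BOM is present). */
static int utf16_byte_order(const unsigned char *s, size_t n, size_t *skip)
{
    assert(s && skip);
    *skip = 0;
    if (n >= 2) {
        if (s[0] == 0xfe && s[1] == 0xff) { *skip = 2; return ORDER_BE; }
        if (s[0] == 0xff && s[1] == 0xfe) { *skip = 2; return ORDER_LE; }
    }
    return ORDER_BE;   /* no BOM: default to big endian */
}
```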
2017-11-14	add reverse iconv mappings for JIS-based encodings	Rich Felker	-1/+97
	these encodings are still commonly used in messaging protocols and such. the reverse mapping is implemented as a binary search of a list of the jis 0208 characters in unicode order; the existing forward table is used to perform the comparison in the search.
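	Illustrative sketch (not the commit's code): one way to read the description above is that the reverse list stores only indices into the forward table, sorted by the codepoint each index maps to, so the search compares through the forward table and no second copy of the codepoints is needed. The toy tables and the name reverse_lookup are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy forward table: index -> Unicode codepoint. */
static const uint16_t toy_fwd[] = { 0x3000, 0x3001, 0x3002, 0xff0c, 0xff0e, 0x30fb };

/* Indices into toy_fwd[], ordered by the codepoint they map to. */
static const uint8_t toy_rev[] = { 0, 1, 2, 5, 3, 4 };

/* Return the forward-table index whose entry equals c, or -1 if absent. */
static int reverse_lookup(const uint16_t *fwd, const uint8_t *rev, size_t n, unsigned c)
{
    assert(fwd && rev);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned v = fwd[rev[mid]];   /* compare via the forward table */
        if (v == c) return rev[mid];
        if (v < c) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}
```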
2017-11-13	generalize iconv framework for 8-bit codepages	Rich Felker	-11/+16
	previously, 8-bit codepages could only remap the high 128 bytes; the low range was assumed/forced to agree with ascii. interpretation of codepage table headers has been changed so that it's possible to represent mappings for up to 256 slots (fewer if the initial portion of the map is elided because it coincides with unicode codepoints). this requires consuming a bit more of the 10-bit space of characters that can be represented in 8-bit codepages, but there's still plenty left. the size of the legacy_chars table is actually reduced now by eliding the first 256 entries and considering them to map implicitly via the identity map. before these changes, there seem to have been minor bugs/omissions in codepage table generation, so it's likely that some actual bug fixes are silently included in this commit. round-trip testing of a few codepages was performed on the new version of the code, but no differential testing against the old version was done.
2017-11-10	add iso-2022-jp support (decoding only) to iconv	Rich Felker	-2/+45
	this implementation aims to match the baseline defined by rfc1468 (the original mime charset definition) plus the halfwidth katakana extension included in the whatwg definition of the charset. rejection of si/so controls and newlines in doublebyte state are not currently enforced. the jis x 0201 mode is currently interpreted as having the yen sign and overline character in place of backslash and tilde; ascii mode has the standard ascii characters in those slots.
2017-11-10	add iconv framework for decoding stateful encodings	Rich Felker	-3/+22
	assuming pointers obtained from malloc have some nonzero alignment, repurpose the low bit of iconv_t as an indicator that the descriptor is a stateless value representing the source and destination character encodings.
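	Illustrative sketch (the bit layout here is an assumption, not necessarily the one used in the source): because allocated objects are aligned, a real pointer has a zero low bit, so a value with the low bit set can carry the two encoding identifiers directly. cd_t stands in for iconv_t, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef void *cd_t;   /* stand-in for iconv_t */

/* Pack two small encoding ids into a pointer-sized value with bit 0 set. */
static cd_t make_stateless(unsigned from, unsigned to)
{
    assert(from < 0x8000 && to < 0x8000);
    return (cd_t)(((uintptr_t)from << 16) | ((uintptr_t)to << 1) | 1u);
}

/* Bit 0 set: packed ids. Bit 0 clear: pointer to an allocated state object. */
static int is_stateless(cd_t cd) { return (int)((uintptr_t)cd & 1); }

static unsigned cd_to(cd_t cd)   { return (unsigned)(((uintptr_t)cd >> 1) & 0x7fff); }
static unsigned cd_from(cd_t cd) { return (unsigned)((uintptr_t)cd >> 16); }
```

	A close function built on this scheme would free the descriptor only when is_stateless reports 0.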
2017-11-10	simplify/optimize iconv utf-8 case	Rich Felker	-4/+3
	the special case where mbrtowc returns 0 but consumed 1 byte of input does not need to be considered, because the short-circuit for low bytes already covered that case.
2017-11-10	handle ascii range individually in each iconv case	Rich Felker	-2/+10
	short-circuiting low bytes before the switch precluded support for character encodings that don't coincide with ascii in this range. this limitation affected iso-2022 encodings, which use the esc byte to introduce a shift sequence, and things like ebcdic.
2017-11-10	move iconv_close to its own translation unit	Rich Felker	-5/+0
	this is in preparation to support stateful conversion descriptors, which are necessarily allocated and thus must be freed in iconv_close. putting it in a separate TU will avoid pulling in free if iconv_close is not referenced.
2017-11-10	refactor iconv conversion descriptor encoding/decoding	Rich Felker	-6/+20
	this change is made to avoid having assumptions about the encoding spread out across the file, and to facilitate future change to a form that can accommodate allocated, stateful descriptors when needed. this commit should not produce any functional changes; with the compiler tested the only change to code generation was minor reordering of local variables on stack.
2017-06-20	fix iconv conversions for iso88592-iso885916	Bartosz Brachaczek	-1/+1
	commit 97bd6b09dbe7478d5a90a06ecd9e5b59389d8eb9 refactored the table lookup into a function and introduced an error in index computation. the error caused garbage to be read from the table if the given charmap had a non-zero number of elided entries.
2017-05-27	fix iconv conversions to legacy 8bit encodings	Rich Felker	-9/+12
	there was missing reverse-conversion logic for the case, handled specially in the character set tables, where a byte represents a unicode codepoint with the same value. this patch adds code to handle the case, and refactors the two-level 10-bit table lookup for legacy character sets into a function to avoid repeating it yet another time as part of the fix.
2015-06-16	byte-based C locale, phase 2: stdio and iconv (multibyte callers)	Rich Felker	-0/+6
	this patch adjusts libc components which use the multibyte functions internally, and which depend on them operating in a particular encoding, to make the appropriate locale changes before calling them and restore the calling thread's locale afterwards. activating the byte-based C locale without these changes would cause regressions in stdio and iconv. in the case of iconv, the current implementation was simply using the multibyte functions as UTF-8 conversions. setting a multibyte UTF-8 locale for the duration of the iconv operation allows the code to continue working. in the case of stdio, POSIX requires that FILE streams have an encoding rule bound at the time of setting wide orientation. as long as all locales, including the C locale, used the same encoding, treating high bytes as UTF-8, there was no need to store an encoding rule as part of the stream's state. a new locale field in the FILE structure points to the locale that should be made active during fgetwc/fputwc/ungetwc on the stream. it cannot point to the locale active at the time the stream becomes oriented, because this locale could be mutable (the global locale) or could be destroyed (locale_t objects produced by newlocale) before the stream is closed. instead, a pointer to the static C or C.UTF-8 locale object added in commit aeeac9ca5490d7d90fe061ab72da446c01ddf746 is used. this is valid since categories other than LC_CTYPE will not affect these functions.
2015-05-21	remove outdated and misleading comment in iconv.c	Rich Felker	-6/+0
	the comment claimed that EUC/GBK/Big5 are not implemented, which has been incorrect since commit 19b4a0a20efc6b9df98b6a43536ecdd628ba4643.
2015-05-21	in iconv_open, accept "CHAR" and "" as aliases for "UTF-8"	Rich Felker	-1/+2
	while not a requirement, it's common convention in other iconv implementations to accept "CHAR" as an alias for nl_langinfo(CODESET), meaning the encoding used for char[] strings in the current locale, and also "" as an alternate form. supporting this is not costly and improves compatibility.
2013-08-17	add hkscs/big5-2003/eten extensions to iconv big5	Rich Felker	-4/+33
	with these changes, the character set implemented as "big5" in musl is a pure superset of cp950, the canonical "big5", and agrees with the normative parts of Unicode. this means it has minor differences from both hkscs and big5-2003:
	- the range A2CC-A2CE maps to CJK ideographs rather than numerals, contrary to changes made in big5-2003.
	- C6CD maps to a CJK ideograph rather than its corresponding Kangxi radical character, contrary to changes made in hkscs.
	- F9FE maps to U+2593 rather than U+FFED.
	of these differences, none but the last are visually distinct, and the last is a character used purely for text-based graphics, not to convey linguistic content. should there be future demand for strict conformance to big5-2003 or hkscs mappings, the present charset aliases can be replaced with distinct variants. reportedly there are other non-standard big5 extensions in common use in Taiwan and perhaps elsewhere, which could also be added as layers on top of the existing big5 support. there may be additional characters which should be added to the hkscs table: the whatwg standard for big5 defines what appears to be a superset of hkscs.
2013-08-07	add Big5 charset support to iconv	Rich Felker	-0/+18
	at this point, it is just the common base charset equivalent to Windows CP 950, with no further extensions. HKSCS and possibly other supersets will be added later. other aliases may need to be added too.
2013-08-05	iconv support for legacy Korean encodings	Rich Felker	-0/+38
	like for other character sets, stateful iso-2022 form is not supported yet but everything else should work. all charset aliases are treated the same, as Windows codepage 949, because reportedly the EUC-KR charset name is in widespread (mis?)usage in email and on the web for data which actually uses the extended characters outside the standard 93x94 grid. this could easily be changed if desired. the principle of this converter for handling the giant bulk of rare Hangul syllables outside of the standard KS X 1001 93x94 grid is the same as the GB18030 converter's treatment of non-explicitly-coded Unicode codepoints: sequences in the extension range are mapped to an integer index N, and the converter explicitly computes the Nth Hangul syllable not explicitly encoded in the character map. empirically, this requires at most 7 passes over the grid. this approach reduces the table size required for Korean legacy encodings from roughly 44k to 17k and should have minimal performance impact on real-world text conversions since the "slow" characters are rare. where it does have impact, the cost is merely a large constant time factor.
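	Illustrative sketch (an interpretation of the description above, not the converter's code): to find the Nth value not present in a set, start with the guess base+N, count how many listed values fall in the window just covered, advance the guess by that count, and repeat on the new window until a pass finds nothing. Each pass scans the whole list, which is the "passes over the grid" cost the message mentions. The toy list and the name nth_unlisted are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the n-th (0-based) value >= base that does not appear in listed[].
 * listed[] need not be sorted; every pass scans all of it. */
static unsigned nth_unlisted(const uint16_t *listed, size_t count,
                             unsigned base, unsigned n)
{
    assert(listed != NULL);
    unsigned c = base + n;   /* first guess: nothing needs skipping */
    unsigned d = base;       /* start of the window not yet examined */
    while (d <= c) {
        unsigned k = 0;
        for (size_t i = 0; i < count; i++)
            if (listed[i] - d <= c - d)   /* listed[i] lies in [d, c] */
                k++;
        d = c + 1;
        c += k;              /* step past the listed values just counted */
    }
    return c;
}
```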
2013-06-26	fix iconv conversion to legacy 8bit codepages	Rich Felker	-2/+2
	this seems to have been a simple copy-and-paste error from the code for converting from legacy codepages.
2012-09-06	use restrict everywhere it's required by c99 and/or posix 2008	Rich Felker	-1/+1
	to deal with the fact that the public headers may be used with pre-c99 compilers, __restrict is used in place of restrict, and defined appropriately for any supported compiler. we also avoid the form [restrict] since older versions of gcc rejected it due to a bug in the original c99 standard, and instead use the form *restrict.
2012-06-18	fix multiple iconv bugs reading utf-16/32 and wchar_t	Rich Felker	-8/+8

2012-06-18	fix iconv dest utf-16: unavailable chars must be replaced; EILSEQ is wrong	Rich Felker	-2/+2

2012-06-18	fix erroneous utf-16 encoding with surrogates in iconv	Rich Felker	-0/+1
	apparently this was never tested before.
2012-04-21	fix major breakage in iconv, bogus rejecting of dest charsets	Rich Felker	-1/+1

2011-07-12	gb18030 support in iconv (only from, not to)	Rich Felker	-2/+51
	also support (and restrict to subsets) older chinese sets, and explicitly refuse to convert to cjk (since there's no code for it yet)
2011-07-12	legacy japanese charset support in iconv (only from, not to)	Rich Felker	-0/+47

2011-07-12	simplify iconv and support more legacy codepages	Rich Felker	-352/+54

2011-07-03	iconv was not returning -1 on most failure	Rich Felker	-0/+2
	this broke most uses of iconv in real-world programs, especially glib's iconv wrappers.
2011-04-07	fix breakage due to converting a return type to size_t in iconv...	Rich Felker	-1/+1

2011-03-25	fix all implicit conversion between signed/unsigned pointers	Rich Felker	-11/+11
	sadly the C language does not specify any such implicit conversion, so this is not a matter of just fixing warnings (as gcc treats it) but actual errors. i would like to revisit a number of these changes and possibly revise the types used to reduce the number of casts required.
2011-02-13	use a more-correct integer type, and silence 64-bit warnings as a bonus	Rich Felker	-2/+2

2011-02-12	initial check-in, version 0.5.0	Rich Felker	-0/+568