musl - an implementation of the standard library for Linux-based systems


author	Rich Felker <dalias@aerifal.cx>	2026-03-30 16:00:50 -0400
committer	Rich Felker <dalias@aerifal.cx>	2026-04-02 23:06:33 -0400
commit	67219f0130ec7c876ac0b299046460fad31caabf (patch)
tree	b3b323bd5aba180c1032268d1882b6090f99c1f2 /src/string/wcscat.c
parent	40acb04b2c1291f7d3091c61080109da11eea48b (diff)

fix pathological slowness & incorrect mappings in iconv gb18030 decoder

in order to implement the "UTF" aspect of gb18030 (the ability to represent arbitrary unicode characters not present in the 2-byte mapping), we have to apply the index obtained from the encoded 4-byte sequence into the set of unmapped characters. this was done by scanning repeatedly over the table of mapped characters, counting the mapped characters below a running index and adjusting the running index by that count on each iteration. this iterative process eventually leaves us with the value of the Nth unmapped character replacing the index, but depending on which particular character that is, the number of iterations needed to find it can be in the tens of thousands, and each iteration traverses the whole 126x190 table in the inner loop. this can lead to run times exceeding an entire second per character on moderate-speed machines.

on top of that, the transformation logic produced wrong results for BMP characters above the surrogate range, as a result of not correctly accounting for that range being excluded, and for characters outside the BMP, as a result of a misunderstanding of how gb18030 encodes them.

this patch replaces the unmapped character lookup with a single linear search of a list of unmapped ranges. there are only 206 such ranges, and these are permanently assigned and unchangeable as a consequence of the character encoding having to be stable, so a simple array of 16-bit start/length values for each range consumes only 824 bytes (206 ranges x 2 values x 2 bytes), a very reasonable size cost here. this new table accounts for the previously-incorrect surrogate handling, and non-BMP characters are handled correctly by a single offset, without the need for any unmapped-range search.

there are still a small number of mappings that are incorrect due to late changes made in the definition of gb18030, swapping PUA codepoints with proper Unicode characters. correcting these requires a postprocessing step that will be added later.
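as a rough illustration of the range-based lookup described above, here is a minimal sketch in C. it is not the code from this patch: the function names, the table name and its layout are all assumptions made for the example, and the table is truncated to five entries that are believed to match the first unmapped runs (the real table would hold all 206). the non-BMP offset relies on the 4-byte sequence 90 30 81 30, linear index 189000, encoding U+10000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table, NOT musl's: {first unmapped code point, run length}.
 * Truncated to five illustrative entries; a full table would have 206. */
static const uint16_t unmapped_ranges[][2] = {
    { 0x0080, 36 },
    { 0x00A5,  2 },
    { 0x00A9,  7 },
    { 0x00B2,  5 },
    { 0x00B8, 31 },
};

/* Linear index of a 4-byte sequence, or -1 if a byte is out of range.
 * Bytes 1 and 3 span 0x81..0xFE (126 values), bytes 2 and 4 span '0'..'9'. */
long gb18030_four_byte_index(const unsigned char b[4])
{
    if (b[0] < 0x81 || b[0] > 0xFE || b[1] < 0x30 || b[1] > 0x39 ||
        b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39)
        return -1;
    return (((b[0] - 0x81L) * 10 + (b[1] - 0x30)) * 126
            + (b[2] - 0x81)) * 10 + (b[3] - 0x30);
}

/* Map a linear index to a code point, or -1 if it cannot be mapped
 * (including indices past the end of this truncated table). */
long gb18030_index_to_codepoint(long idx)
{
    if (idx < 0)
        return -1;

    /* Non-BMP: one fixed offset, no range search needed. */
    if (idx >= 189000) {
        long c = idx - 189000 + 0x10000;
        return c <= 0x10FFFF ? c : -1;
    }

    /* BMP: a single linear pass, consuming one run length per step. */
    for (size_t i = 0; i < sizeof unmapped_ranges / sizeof *unmapped_ranges; i++) {
        long len = unmapped_ranges[i][1];
        if (idx < len)
            return unmapped_ranges[i][0] + idx;
        idx -= len;
    }
    return -1;
}

/* Small self-check exercising both paths of the sketch. */
void gb18030_sketch_selfcheck(void)
{
    static const unsigned char first[4]  = { 0x81, 0x30, 0x81, 0x30 };
    static const unsigned char astral[4] = { 0x90, 0x30, 0x81, 0x30 };
    assert(gb18030_index_to_codepoint(gb18030_four_byte_index(first)) == 0x0080);
    assert(gb18030_index_to_codepoint(gb18030_four_byte_index(astral)) == 0x10000);
}
```

the cost of the BMP path is bounded by the number of ranges rather than by the position of the character, which is the property the commit message relies on; storing run lengths instead of cumulative indices keeps every entry within 16 bits.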

Diffstat (limited to 'src/string/wcscat.c')

0 files changed, 0 insertions, 0 deletions

