| author | Rich Felker <dalias@aerifal.cx> | 2026-03-30 16:00:50 -0400 |
|---|---|---|
| committer | Rich Felker <dalias@aerifal.cx> | 2026-04-02 23:06:33 -0400 |
| commit | 67219f0130ec7c876ac0b299046460fad31caabf (patch) | |
| tree | b3b323bd5aba180c1032268d1882b6090f99c1f2 /src/string/wcscat.c | |
| parent | 40acb04b2c1291f7d3091c61080109da11eea48b (diff) | |
| download | musl-67219f0130ec7c876ac0b299046460fad31caabf.tar.gz | |

in order to implement the "UTF" aspect of gb18030 (ability to
represent arbitrary unicode characters not present in the 2-byte
mapping), we have to apply the index obtained from the encoded 4-byte
sequence into the set of unmapped characters. this was done by
scanning repeatedly over the table of mapped characters, counting the
mapped characters below a running index, and adjusting the running
index by that count on each iteration. this iterative process eventually
leaves us with the value of the Nth unmapped character replacing the
index, but depending on which particular character that is, the number
of iterations needed to find it can be in the tens of thousands, and
each iteration traverses the whole 126x190 table in the inner loop.
this can lead to run times exceeding an entire second per character on
moderate-speed machines.

on top of that, the transformation logic produced wrong results for
BMP characters above the surrogate range, as a result of not
correctly accounting for it being excluded, and for characters outside
the BMP, as a result of a misunderstanding of how gb18030 encodes
them.

this patch replaces the unmapped character lookup with a single linear
search of a list of unmapped ranges. there are only 206 such ranges,
and these are permanently assigned and unchangeable as a consequence
of the character encoding having to be stable, so a simple array of
16-bit start/length values for each range consumes only 824 bytes, a
very reasonable size cost here.
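
as a rough illustration of the range lookup described above (a sketch
only, not the code from this patch): the names `struct range`,
`unmapped`, and `nth_unmapped` are hypothetical, and the three table
entries are placeholders standing in for the real 206-entry table. with
two 16-bit fields per entry, 206 entries come to 206 * 4 = 824 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* one unmapped range: first code point and number of code points.
 * hypothetical layout; the patch's actual table may differ. */
struct range { uint16_t start, len; };

/* placeholder entries for illustration only, NOT the real gb18030 data */
static const struct range unmapped[] = {
	{ 0x0080, 36 },
	{ 0x00a5, 2 },
	{ 0x00a9, 7 },
};

/* return the idx-th unmapped code point (counting from 0),
 * or -1 if idx runs past the end of the table */
static long nth_unmapped(unsigned long idx)
{
	for (size_t i = 0; i < sizeof unmapped / sizeof *unmapped; i++) {
		/* idx falls inside this range: offset from its start */
		if (idx < unmapped[i].len)
			return (long)unmapped[i].start + (long)idx;
		/* otherwise skip the whole range in one step */
		idx -= unmapped[i].len;
	}
	return -1;
}
```

each range is visited at most once, so the cost is bounded by the
number of ranges rather than by the position of the character.
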

this new table accounts for the previously-incorrect surrogate
handling, and non-BMP characters are handled correctly by a single
offset, without the need for any unmapped-range search.
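
a minimal sketch of the non-BMP case, assuming the usual gb18030
layout in which the 4-byte sequence 0x90 0x30 0x81 0x30 encodes
U+10000 and the supplementary planes follow linearly from there. the
function names and the `GB4_NONBMP_BASE` macro are hypothetical, and
this is not the code from the patch.

```c
/* linear index of the sequence 0x90 0x30 0x81 0x30 (assumed layout) */
#define GB4_NONBMP_BASE 189000L

/* convert a 4-byte sequence to its linear index, or -1 if any byte
 * is outside the 4-byte form (0x81..0xfe, 0x30..0x39, twice) */
static long gb4_index(unsigned char b1, unsigned char b2,
                      unsigned char b3, unsigned char b4)
{
	if (b1 < 0x81 || b1 > 0xfe || b2 < 0x30 || b2 > 0x39 ||
	    b3 < 0x81 || b3 > 0xfe || b4 < 0x30 || b4 > 0x39)
		return -1;
	return (((b1 - 0x81L) * 10 + (b2 - 0x30)) * 126
	        + (b3 - 0x81)) * 10 + (b4 - 0x30);
}

/* map a linear index in the non-BMP region to a code point with a
 * single offset; -1 if the index is not in U+10000..U+10FFFF */
static long nonbmp_from_index(long idx)
{
	if (idx < GB4_NONBMP_BASE || idx - GB4_NONBMP_BASE > 0xfffffL)
		return -1;
	return 0x10000L + (idx - GB4_NONBMP_BASE);
}
```

under these assumptions, 0xe3 0x32 0x9a 0x35 has index 1237575 and
maps to U+10FFFF, with no table search involved.
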

there are still a small number of mappings that are incorrect due to
late changes made in the definition of gb18030, swapping PUA
codepoints with proper Unicode characters. correcting these requires a
postprocessing step that will be added later.

Diffstat (limited to 'src/string/wcscat.c')
0 files changed, 0 insertions, 0 deletions
