|author||Rich Felker <firstname.lastname@example.org>||2019-10-25 12:33:17 -0400|
|committer||Rich Felker <email@example.com>||2019-10-25 12:39:40 -0400|
overhaul wide character case mapping implementation
the existing implementation of case mappings was very small (typically around 1.5k), but unmaintainable, requiring manual addition of new case mappings with each new edition of Unicode. often, it turned out that newly-added case mappings were not easily representable in the existing tightly-constrained table structures, requiring new hacks to be invented and delaying support for new characters. the new implementation added here follows the pattern used for character class membership, with a two-level table allowing Unicode blocks for which no data is needed to be elided. however, rather than single-bit data, each character maps to a one of up to 6 case-mapping rules available to its block, where 6 is floor(cbrt(256)) and allow 3 characters to be represented per byte (vs 8 with bit tables). blocks that would need more than 6 rules designate one as an exception and let lookup pass into a binary search of exceptional cases for the block. the number 6 was chosen empirically; many blocks would be ok with 4 rules (uncased, lower, upper, possible exceptions), some even just with 2, but the latter are rare and fitting 4 characters per byte rather than 3 does not save significant space. moreover, somewhat surprisingly, there are sufficiently many blocks where even 4 rules don't suffice without a lot of exceptions (blocks where some case pairs are laced, others offset) that originally I was looking at supporting variable-width tables, with 1-, 2-, or 3-bit entries, thereby allowing blocks with 8 rules. as implemented in my experiments, that version was significantly larger and involved more memory accesses/cache lines. improvements in size at the expense of some performance might be possible by utilizing iswalpha data or merging the table of case mapping identity with alphabetic identity. these were explored somewhat when the code was first written, and might be worth revisiting in the future.
Diffstat (limited to 'ldso')
0 files changed, 0 insertions, 0 deletions