musl - an implementation of the standard library for Linux-based systems

author	Szabolcs Nagy <nsz@port70.net>	2014-08-14 22:25:33 +0200
committer	Szabolcs Nagy <nsz@port70.net>	2014-09-13 00:20:55 +0200
commit	ec1aed0a144b3e00e16eeb142c9d13362d6048e7 (patch)
tree	737098a3ffc046d54f8ae470998718087e307dd3 /src/stdlib
parent	bd082916b110c0c49e71bc83ff68dfd88bb8313a (diff)
download	musl-ec1aed0a144b3e00e16eeb142c9d13362d6048e7.tar.gz

rewrite the regex pattern parser in regcomp

The new code is a bit simpler and the generated code is about 1KB
smaller (on i386). The basic design was kept including internal
interfaces, TNFA generation was not touched.

The old tre parser had various issues:

[^aa-z]
negated overlapping ranges in a bracket expression were handled
incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])

a{,2}
missing lower bound in a counted repetition should be an error,
but it was accepted with broken semantics: a{,2} was treated as
a{0,3}, the new parser rejects it

a{999,}
large min count was not rejected (a{5000,} failed with REG_ESPACE
due to reaching a stack limit), the new parser enforces the
RE_DUP_MAX limit

\xff
regcomp used to accept a pattern with illegal sequences in it
(treated them as empty expression so p\xffq matched pq) the new
parser rejects such patterns with REG_BADPAT or REG_ERANGE

[^b-fD-H] with REG_ICASE
old parser turned this into [^b-fB-F] because of the negated
overlapping range issue (see above), the new parser treats it
as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
implementations do case-folding first and negate the character
set later instead of the other way around. (Supporting the posix
way efficiently would require significant changes so it was left
as is, it is unclear if any application actually expects the
posix behaviour, this issue is raised on the austingroup tracker:
http://austingroupbugs.net/view.php?id=872 ).

another case-insensitive matching issue is that unicode case folding
rules can group more than two characters together while towupper
and towlower can only work for a pair of upper and lower case
characters, this is a limitation of POSIX so it is not fixed.

invalid bracket and brace expressions may return different error
codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
of REG_EBRACE) otherwise the new parser should be compatible with
the old one.

regcomp should be able to handle arbitrary pattern input if the
pattern length is limited, the only exception is the use of large
repetition counts (eg. (a{255}){255}) which require an exponential
amount of memory and there is no easy workaround.

Diffstat (limited to 'src/stdlib')

0 files changed, 0 insertions, 0 deletions

