path: root/src/regex/regcomp.c
Age  Commit message  Author  Lines
2015-03-30  regex: fix character class repetitions  Szabolcs Nagy  -0/+5
Internally regcomp needs to copy some iteration nodes before translating the AST into TNFA representation.

Literal nodes were not copied correctly: the class type and list of negated class types were not copied so classes were ignored (in the non-negated case an ignored char class caused the literal to match everything).

This affects iterations when the upper bound is finite, larger than one or the lower bound is larger than one. So eg. the EREs

    [[:digit:]]{2}
    [^[:space:]ab]{1,4}

were treated as

    .{2}
    [^ab]{1,4}

The fix is done with minimal source modification to copy the necessary fields, but the AST preparation and node handling code of tre will need to be cleaned up for clarity.

(cherry picked from commit c498efe117539a9d40d90b588c033316701c4b3e)
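A minimal way to check for this regression from user code, assuming a POSIX `<regex.h>`. The helper name `ere_matches` is made up for this sketch and is not part of the patch:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches anywhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

On a fixed build, `ere_matches("[[:digit:]]{2}", "ab")` should return 0. A build that treats the pattern as `.{2}` would return 1. Likewise `[^[:space:]ab]{1,4}` should not match a lone space, whereas `[^ab]{1,4}` would.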
2015-03-30  fix regcomp handling of backslash followed by high byte  Rich Felker  -4/+1
the regex parser handles the (undefined) case of an unexpected byte following a backslash as a literal. however, instead of correctly decoding a character, it was treating the byte value itself as a character. this was not only semantically unjustified, but turned out to be dangerous on archs where plain char is signed: bytes in the range 252-255 alias the internal codes -4 through -1 used for special types of literal nodes in the AST. analogous to commit 39dfd58417ef642307d90306e1c7e50aaec5a35c in mainline. it's unclear whether the same crash that affected mainline is possible in the older regcomp code in 1.0.x, but conceptually the bug is the same.
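The aliasing described above can be shown in isolation. This sketch only illustrates how a high byte turns into a small negative code when read through a signed type; it is not the parser code or the fix, and both function names are hypothetical:

```c
/* Value seen when a byte is read through a signed char, as happens with
 * plain char on archs where it is signed. Converting an out-of-range value
 * to signed char is implementation-defined; two's-complement wraparound
 * is assumed here. */
static int as_signed_char(unsigned byte)
{
    return (signed char)byte;
}

/* Value seen when the same byte is read as unsigned char instead. */
static int as_unsigned_char(unsigned byte)
{
    return (unsigned char)byte;
}
```

Under that assumption, bytes 252 through 255 come back from `as_signed_char` as -4 through -1, which is the range the commit message says is reserved for special literal nodes.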
2013-12-12  include cleanups: remove unused headers and add feature test macros  Szabolcs Nagy  -1/+0
2013-10-07  fix allocation sizes in regcomp  Szabolcs Nagy  -4/+4
sizeof had incorrect argument in a few places, the size was always large enough so the issue was not critical.
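One common way to avoid this class of mistake is to size an allocation from the pointer being assigned rather than from a type name. The following is a generic sketch of that idiom, not the actual change; `struct node` and `alloc_nodes` are hypothetical:

```c
#include <stdlib.h>

/* Hypothetical node type, not one of TRE's structs. */
struct node {
    int type;
    void *obj;
};

/* `sizeof *p` follows the type of p, so the element size cannot drift
 * out of sync with the pointer it is assigned to. */
static struct node *alloc_nodes(size_t n)
{
    struct node *p = calloc(n, sizeof *p);
    return p;
}
```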
2013-01-15  remove unused "params" related code from regex  Szabolcs Nagy  -20/+11
some structs and functions had reference to the params feature of tre that is not used by the code anymore
2012-09-06  use restrict everywhere it's required by c99 and/or posix 2008  Rich Felker  -1/+1
to deal with the fact that the public headers may be used with pre-c99 compilers, __restrict is used in place of restrict, and defined appropriately for any supported compiler. we also avoid the form [restrict] since older versions of gcc rejected it due to a bug in the original c99 standard, and instead use the form *restrict.
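A rough sketch of the two ideas in this message: a macro that falls back gracefully on pre-c99 compilers, and the pointer form `*restrict` rather than the array form `[restrict]`. `MY_RESTRICT` and `copy_ints` are hypothetical names chosen for illustration (the real headers use `__restrict`, which is reserved for the implementation), and the preprocessor conditions are an assumption, not a copy of the header:

```c
#include <stddef.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define MY_RESTRICT restrict
#elif defined(__GNUC__)
#define MY_RESTRICT __restrict
#else
#define MY_RESTRICT
#endif

/* Pointer form: `int *MY_RESTRICT dst`, as opposed to the array form
 * `int dst[restrict]` that the message says is avoided. */
static void copy_ints(int *MY_RESTRICT dst, const int *MY_RESTRICT src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```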
2012-05-13  remove some no-op end of string tests from regex parser  Rich Felker  -4/+0
these are cruft from the original code which used an explicit string length rather than null termination. i blindly converted all the checks to null terminator checks, without noticing that in several cases, the subsequent switch statement would automatically handle the null byte correctly.
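To illustrate why such a test can be a no-op, here is a toy scanner (not TRE's parser; `count_stars` is a made-up function) in which the switch itself has a case for the null terminator, so a separate end-of-string check before it would add nothing:

```c
/* Hypothetical toy scanner: counts unescaped '*' characters in a
 * null-terminated pattern. */
static int count_stars(const char *s)
{
    int n = 0;

    for (;;) {
        /* No `if (!*s) break;` is needed here: the switch below
         * already handles the terminator. */
        switch (*s) {
        case '\0':
            return n;
        case '\\':
            /* Skip the escaped character, unless the backslash is last. */
            if (s[1])
                s++;
            break;
        case '*':
            n++;
            break;
        default:
            break;
        }
        s++;
    }
}
```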
2012-05-13  another BRE fix: in ^*, * is literal  Rich Felker  -0/+2
i don't understand why this has to be conditional on being in BRE mode, but enabling this code unconditionally breaks a huge number of ERE test cases.
2012-05-07  fix error checking for \ at end of regex (this was broken previously)  Rich Felker  -1/+1
2012-05-07  fix copy and paste error in regex code causing mishandling of \) in BRE  Rich Felker  -1/+1
2012-05-07  fix regex breakage in last commit (failure to handle empty regex, etc.)  Rich Felker  -4/+1
2012-05-07  fix ugly bugs in TRE regex parser  Rich Felker  -60/+31
1. * in BRE is not special at the beginning of the regex or a subexpression. this broke ncurses' build scripts.

2. \\( in BRE is a literal \ followed by a literal (, not a literal \ followed by a subexpression opener.

3. the ^ in \\(^ in BRE is a literal ^ only at the beginning of the entire BRE. POSIX allows treating it as an anchor at the beginning of a subexpression, but TRE's code for checking if it was at the beginning of a subexpression was wrong, and fixing it for the sake of supporting a non-portable usage was too much trouble when just removing this non-portable behavior was much easier.

this patch also moved lots of the ugly logic for empty atom checking out of the default/literal case and into new cases for the relevant characters. this should make parsing faster and make the code smaller. if nothing else it's a lot more readable/logical.

at some point i'd like to revisit and overhaul lots of this code...
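Points 1 and 2 can be exercised from user code with a small helper. This is a sketch assuming a POSIX `<regex.h>`; `bre_matches` is a hypothetical name, and note that C string escaping doubles every backslash in the patterns:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the BRE `pattern` matches anywhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)  /* no REG_EXTENDED: BRE */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the behavior described in point 1, `bre_matches("*a", "*a")` and `bre_matches("\\(*a\\)", "*a")` should return 1, while the same patterns against `"a"` should return 0. For point 2, the C literal `"\\\\(a"` is the regex `\\(a`, which should match the text `\(a` rather than fail to compile with an unmatched subexpression.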
2012-04-13  remove invalid code from TRE  Rich Felker  -14/+0
TRE wants to treat + and ? after a +, ?, or * as special; ? means ungreedy and + is reserved for future use. however, this is non-conformant. although redundant, these redundant characters have well-defined (no-op) meaning for POSIX ERE, and are actually _literal_ characters (which TRE is wrongly ignoring) in POSIX BRE mode. the simplest fix is to simply remove the unneeded nonstandard functionality. as a plus, this shaves off a small amount of bloat.
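The BRE half of this is easy to observe: after `*`, a `?` should be an ordinary character. A minimal sketch, assuming a POSIX `<regex.h>`; `pattern_matches` is a hypothetical helper that takes the `regcomp` flags as a parameter so the same function serves BRE (0) and ERE (`REG_EXTENDED`):

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern`, compiled with `cflags`,
 * matches anywhere in `text`, 0 if not, -1 on a compile error. */
static int pattern_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In BRE mode, `pattern_matches("^a*?$", "aa?", 0)` should return 1 and the same pattern against `"aa"` should return 0, since the `?` has to match literally. In ERE mode the trailing `?` in `^a+?$` is a redundant repetition, so `"aaa"` should still match.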
2012-03-20  upgrade to latest upstream TRE regex code (0.8.0)  Rich Felker  -775/+822
the main practical results of this change are

1. the regex code is no longer subject to LGPL; it's now 2-clause BSD
2. most (all?) popular nonstandard regex extensions are supported

I hesitate to call this a "sync" since both the old and new code are heavily modified. in one sense, the old code was "more severely" modified, in that it was actively hostile to non-strictly-conforming expressions. on the other hand, the new code has eliminated the useless translation of the entire regex string to wchar_t prior to compiling, and now only converts multibyte character literals as needed.

in the future i may use this modified TRE as a basis for writing the long-planned new regex engine that will avoid multibyte-to-wide character conversion entirely by compiling multibyte bracket expressions specific to UTF-8.
2011-06-16  duplicate re_nsub in LSB/glibc ABI compatible location  Rich Felker  -1/+1
2011-02-12  initial check-in, version 0.5.0  [tag: v0.5.0]  Rich Felker  -0/+3362