
regex_options

Definition: include builtins\regex.e

regex_options(integer opts=RE_PIKEVM, integer rErrHand=NULL)
Description: Set regular expression handling options. The default is RE_PIKEVM and nothing else.

The following constants are provided:

global constant RE_PIKEVM           = #001, 
                RE_RECURSIVE        = #002, 
                RE_EARLY_EXIT       = #004, 
                RE_BACKREFERENCES   = #008, 
                RE_CASEINSENSITIVE  = #010,
                RE_DOTMATCHESNL     = #020

RE_PIKEVM (the default) and RE_RECURSIVE are the two available (mutually exclusive) run-time engines.
Elsewhere in these documents, RE_PIKEVM is "the pikevm" whereas RE_RECURSIVE is "the backtrackingvm" or sometimes "the recursivevm".

RE_PIKEVM is a fast deterministic approach with (severely) limited support for backreferences, but it comes with an absolute guarantee that it will not bog down for 100 million+ years on a pathological expression (which is surprisingly easy to trigger with RE_RECURSIVE).

RE_RECURSIVE is suitable in controlled environments, can sometimes find solutions RE_PIKEVM cannot, and is usually fast enough.
Needless to say, however, exposing RE_RECURSIVE directly to the internet risks a severe denial-of-service attack,
and the pikevm is almost always preferable, except when it isn't or gives the wrong results.
Some expressions (see test\t63regex.exw) work fine on the pikevm, but overflow on the backtrackingvm.

It is quite probable that, for any reasonable problem, a regular expression that works on the backtrackingvm can (theoretically) be rewritten as one that works just fine on the pikevm, but I cannot prove that. An example of what I am talking about is given under RE_BACKREFERENCES below.

RE_EARLY_EXIT is poorly defined. It only applies to the pikevm. The precise rules it effectively obeys cannot easily be stated when there is a tie for the shortest/leftmost match, and it will quite likely ride roughshod over any attempt to specify [non-]greedy matching within an ambiguous expression.
Some examples of the differences this setting causes can be found in test\t63regex.exw.
Of course if you only care whether a given expression matches, not where or what, this option could offer a significant performance improvement (then again, the pikevm is pretty fast anyway).
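A minimal sketch of the "only care whether it matches" case, assuming (as the example output at the end of this page suggests) that regex() returns an empty sequence when nothing matches. The pattern and target string are made up for illustration.

```phix
include builtins\regex.e

-- Only interested in whether there is a match, not where or what.
regex_options(RE_PIKEVM+RE_EARLY_EXIT)

-- hypothetical pattern and target, for illustration only
sequence res = regex(`\d+`,"abc123def")
if length(res)!=0 then
    puts(1,"matched\n")
else
    puts(1,"no match\n")
end if

regex_options()     -- back to the default (RE_PIKEVM and nothing else)
```

Any indexes returned while RE_EARLY_EXIT is in force should be treated with suspicion, for the reasons given above.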

RE_BACKREFERENCES is disabled by default: The pikevm only supports entirely unambiguous backreferences, and it is up to you to restrict the regular expressions appropriately, should you attempt that particular configuration.
For example, given the target string "<1>test1<x1><2>test2<y2>test3<3><4>test4<4>", then
`<(\d+)>(.*?)<\1>` on the backtrackingvm will successfully match "<4>test4<4>", but
`<(\d+)>([^<]*?)<\1>` is needed on the pikevm, i.e. replace .* with [^<]*, aka (any char)* with (not '<')*.

In the ambiguous (1st) case, when you get to the final <4> the \1 and the inner group could be:
"1" and "test1<x1><2>test2<y2>test3<3><4>test4", or
"2" and "test2<y2>test3<3><4>test4", or
"3" and "<4>test4", or
"4" and "test4".
Since the pikevm has only one slot to hold a backreference, it will inevitably get it wrong (as shown below).
In contrast, using the unambiguous (2nd) expression, there is no way for "1" to be a valid possibility by the time you get to the final "<4>", instead only that last case would ever be tried there. As an added bonus, the unambiguous expression is guaranteed to be faster, even on the backtrackingvm.

RE_CASEINSENSITIVE causes the regular expression to be compiled as upper case, and individual character matches to apply upper() to the target characters before comparison. Note that this only applies to comparisons between the regular expression and the target; backreferences are unaffected, so this will not allow a target of say "ABCabc" to be matched using backreferences (though of course you can upper(target) beforehand instead).
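The following sketch illustrates both halves of that note. The patterns and targets are hypothetical, RE_RECURSIVE is used for the backreference part purely to sidestep the pikevm restrictions described earlier, and no output is shown since it has not been verified here.

```phix
include builtins\regex.e

-- literal comparison between expression and target: case is ignored
regex_options(RE_PIKEVM+RE_CASEINSENSITIVE)
?regex(`hello`,"Say HELLO")

-- backreferences are compared as-is, so "abc" is not treated as a repeat of "ABC"
regex_options(RE_RECURSIVE+RE_BACKREFERENCES+RE_CASEINSENSITIVE)
string tgt = "ABCabc"
?regex(`(abc)\1`,tgt)           -- not expected to match, per the note above
?regex(`(abc)\1`,upper(tgt))    -- target is now "ABCABC", so \1 can match

regex_options()                 -- back to the default
```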

RE_DOTMATCHESNL suppresses the legacy and often unnecessary behaviour whereby . matches everything except \n, so that . matches any character at all. It changes the way an expression is compiled, rather than the way it is matched.
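A small sketch of the difference, using a made-up pattern and target. Since the option affects compilation, it is set before the expression is used; the comments describe the behaviour expected from the description above rather than verified output.

```phix
include builtins\regex.e

constant re  = `a.b`,       -- hypothetical pattern
         tgt = "a\nb"       -- 'a', newline, 'b'

regex_options(RE_PIKEVM)
?regex(re,tgt)      -- by default . should not match the \n

regex_options(RE_PIKEVM+RE_DOTMATCHESNL)
?regex(re,tgt)      -- with the option set, . should also match the \n
```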

The rErrHand parameter, if supplied, should be the routine_id of a procedure that accepts three arguments, such as procedure Error(string msg, string src, integer idx). When a failure occurs during compilation, that procedure is invoked instead of the default behaviour, which is to display the error on the terminal via printf(1,"%s\n%s^%s\n",{src,repeat(' ',idx-1),msg}).
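A minimal sketch of installing such a handler. The handler body and the deliberately malformed expression are illustrative assumptions; what regex() returns after a compilation failure is not specified here, so the result is simply printed.

```phix
include builtins\regex.e

-- Hypothetical handler: report the problem on one line instead of the default display.
procedure Error(string msg, string src, integer idx)
    printf(1,"regex error at position %d of %s: %s\n",{idx,src,msg})
end procedure

regex_options(RE_PIKEVM,routine_id("Error"))

-- unclosed group, assumed to fail compilation and hence invoke Error()
?regex(`(abc`,"abc")
```

A real handler might instead log the message, or set a flag for the caller to check.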
pwa/p2js: Supported.
Example:
-- The code(/failure) from RE_BACKREFERENCES above in action
include builtins\regex.e

constant tgt = "<1>test1<x1><2>test2<y2>test3<3><4>test4<4>",
         r12 = {`<(\d+)>(.*?)<\1>`,     -- ambiguous
                `<(\d+)>([^<]*?)<\1>`}  -- unambiguous

procedure test(string engine, integer options)
    ?engine
    regex_options(options)
    for i=1 to length(r12) do
        sequence res = regex(r12[i],tgt)
        for j=1 to length(res) by 2 do
            res = append(res,tgt[res[j]..res[j+1]-1])
        end for
        ?res
    end for
end procedure

test("pikevm",RE_BACKREFERENCES)
test("recursivevm",RE_BACKREFERENCES+RE_RECURSIVE)

-- Output:
--  "pikevm"
--  {}
--  {33,44,34,35,36,41,"<4>test4<4>","4","test4"}
--  "recursivevm"
--  {33,44,34,35,36,41,"<4>test4<4>","4","test4"}
--  {33,44,34,35,36,41,"<4>test4<4>","4","test4"}
See Also: routine_id, regex