regex syntax

The following elements are supported (better tutorials can easily be found, and reading test\t63regex.exw may help):

Element Meaning
^ Outside character classes this is an anchor for start of text. eg `^You` matches lines beginning with "You".
At the very start of a character class, negates it. eg `[^q]` matches any character except a `q`.
Use `\^` to mean the literal character.
$ An anchor for the end of text. eg `string$` matches lines ending with "string". Use `\$` to mean the literal character.
. Any character, except `\n`, unless the RE_DOTMATCHESNL option has been set. Use `\.` to mean the literal character.
| Alternation. eg `a|bc|def` matches "a" or "bc" or "def".
\ The backslash is the escape character. Adds meaning to letters and removes meaning from symbols. eg `\d\$` matches say "2$". Use `\\` to mean the literal character.
\b A word boundary. \B is "Not a word boundary".
Note that `\b.*\b` can match "as", " as", " as ", "as ", " as if ", etc, whereas `\b\w*\b` is much closer to matching just "as".
\d The digits `0` through `9`. \D is the negation of that set (ie not a digit).
\w The characters `a` to `z`, `A` to `Z`, `0` to `9`, and `_` (underscore). \W is the negation of that set (not a word character).
\s The white space characters space, tab, newline, and carriage return. \S is the negation of that set (not a whitespace character).
\xHH A character specified as exactly two hexadecimal digits, eg `\x41` is "A". A leading `0` is not optional.
[ ] Defines a character class.
If a `^` immediately follows the opening `[` then the class is negated.
Matches a single character, unless the closing `]` is followed by any of `?*+{}` (see below).
Ranges can be specified with - (a hyphen).
eg `[0-9]` is the same as `\d`, `[a-zA-Z0-9_]` is the same as `\w`, and `[ \r\n\t]` is the same as `\s`.
Use `\]` to include a literal `]`, and `\\` to include a literal `\`. `\r\n\t` are also recognised.
( ) Defines a capture group.
If `?:` immediately follows the opening `(` then the group is non-capturing, otherwise the start and end+1 index pairs of each matching group are returned (see the short sketch after this table).
A `?` immediately after the opening `(` must be followed by a recognised character, currently only ':', and specifically not the =|!|<|> of lookahead|lookbehind|atomic groups, otherwise an error occurs.
The entire expression is automatically enclosed in an outer matching group, sometimes called \0.

Aside: when processing say `(abc)`, ie matching the string "abc", the (’s SAVE stores the index of the `a`, but the )’s SAVE stores the index of the character after the `c`. Saving the caller the effort of subtracting 1 from the even-numbered results would have required splitting the SAVE op into SAVE_OPEN and SAVE_CLOSE, which just didn’t seem worthwhile.
\1..9 A backreference to a previous capture group. NB disabled by default, see RE_BACKREFERENCES. \0 is also [/may one day be] valid (meaning the entire auto-added outer match) in the third parameter of gmatch(). That parameter can also use \1 to \9 even when RE_BACKREFERENCES is disabled, since by that stage they are just references to already captured groups, rather than backreferences that affect behaviour or require any backtracking.
Note that, tempting as it may seem, backreferences do not make regular expressions a sensible choice for parsing html.
? Optional - 0 or 1 occurrences of the preceding term. A trailing ? specifies non-greedy.
* 0 or more occurrences of the preceding term. A trailing ? specifies non-greedy.
+ 1 or more occurrences of the preceding term. A trailing ? specifies non-greedy.
{n} Exactly n (<=1000) occurrences of the preceding term. The arbitrary limit of 1000 just seems sensible.
A trailing ? means non-greedy (but is superfluous in this instance).
{n,} At least n (<=1000) occurrences of the preceding term. A trailing ? specifies non-greedy.
{n,m} At least n and at most m (n<=m<=1000) occurrences of the preceding term. A trailing ? specifies non-greedy.
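
A quick illustration of capture groups and the start and end+1 index pairs described under ( ) above. The exact entry point and result layout may differ slightly (this sketch assumes a regex(pattern,target) call returning a flat list of index pairs) - see test\t63regex.exw for definitive working calls.

    string pat = `(\d+)-(\d+)`,     -- two capture groups of digits, in backtick form
           src = "pages 12-34"
    ?regex(pat,src)     -- assumed to yield something like {7,12, 7,9, 10,12}, ie the
                        -- start and end+1 pairs for \0="12-34", \1="12", \2="34"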

 

Quick summary of PCRE (in)compatibility

For those of you already familiar with regular expressions, while the above table covers a lot of ground, note the following:
(Hopefully these are all somewhat lesser features that you don’t much use anyway.)
Does not support lookahead or lookbehind.
Does not support unicode character ranges, see below.
Does not support posix character classes (such as [:digit:]).
Does not support \Q \E \a (bel) \b (backspace) \c (ctrl) \e (esc) \f (ff) \v (vt).
Does not support single-line or multi-line modes, ie ^|$ always anchor the start|end of the whole text, never positions just after|before a \n|\r.
Does not support atomic grouping [quite probably impossible on the pikevm, but unnecessary anyway].
Does not support possessive quantifiers (whereby a trailing + makes a group atomic, I believe).
Does not support named capture groups (only madmen cut and paste bits of regexes anyway!).
Does not support inline modifiers (or therefore free-spacing mode along with in-text comments).
Does not support subroutines (which are like backreferences but re-use the pattern rather than the match).
Does not support conditionals, or branch resets or code capsules or callouts or version checks...
Regular expressions should be enclosed in backticks to avoid escaping, eg `\s+` is the same as "\\s+".
Backticks should be used instead of the forward slashes of other programming languages.

Quite a scary list, but then again regular expressions can get very scary without my help. Feel free to try implementing anything from that list that you miss.

Note that internally I elected to implement ?*+ as {OPT,min,max} with {min,max} being {0,1}, {0,-1}, {1,-1} respectively, alongside the more obvious {n} as {OPT,n,n}, {n,} as {OPT,n,-1}, and {n,m} as {OPT,n,m}. It was therefore trivial to make it {OPT,min,max,greedy}, if perhaps a little unwise. Any resemblance to the behaviour of greedy and non-greedy in PCRE is entirely coincidental - partly joking, of course, but I will say this: if there is one and only one unambiguous match then obviously it should agree, however greedy/non-greedy strongly implies ambiguity, and when pitting an entirely different algorithm (and a tiny fraction of the number of source code lines) against a well-established giant like PCRE, all compatibility bets in ambiguous cases are off.
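
Spelled out, that mapping reads as follows; this is purely illustrative (OPT stands in for the internal opcode, whose real value is private to the implementation, and -1 means unbounded):

    constant OPT = 0                -- placeholder only, for illustration
    constant q_opt  = {OPT,0, 1},   -- ?
             q_star = {OPT,0,-1},   -- *
             q_plus = {OPT,1,-1},   -- +
             q_n    = {OPT,3, 3},   -- {3}
             q_nup  = {OPT,3,-1},   -- {3,}
             q_nm   = {OPT,3, 5}    -- {3,5}
    -- each of which gains a fourth (greedy) element, cleared by a trailing ?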

So the situation is this:
Q: Does it handle PCRE-like expressions, or at least a fairly good subset of them?   A: Yes, at least I think so.
Q: Does it give exactly the same results in all possible edge cases as PCRE?   A: No, that way lies insanity.
(Then again, even Perl does not always match PCRE perfectly...)

Literal unicode is supported via utf8 only, eg "\u20AC" matches a euro sign, but `[\u20AC]` triggers an error. (Note I have been a little crafty with the use of double quotes and backticks there.) The former simply converts the character to a utf-8 byte stream, but character ranges are always byte-wise, ultimately #00 to #FF only. Instead of [£$\u20AC] you would have to use (£|\$|\u20AC). Obviously if you use a utf8-enabled editor and can key in the euro sign directly, you would get the same deal for alternatives, but the character range would be treated quite wrongly: just as utf32_to_utf8({#20AC}) yields {#E2,#82,#AC}, a class would be treated as [\xE2\x82\xAC] (ie as \xE2|\x82|\xAC).
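
The byte-wise behaviour is easy to demonstrate with the standard conversion routine already mentioned above (a purely illustrative fragment):

    string euro = utf32_to_utf8({#20AC})    -- the three bytes #E2,#82,#AC
    -- hence a character class containing a directly-keyed euro sign behaves as
    -- \xE2|\x82|\xAC (any one of those bytes), not as a single euro character,
    -- whereas the alternation form shown above matches the whole three-byte sequence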

It would probably not be difficult to create a new and improved unicode entry point based on the existing code, perhaps along these lines:
    type utf8or32(object s)  -- string(utf8) or dword_sequence(utf32)
        if not string(s) then
            if not sequence(s) then return false end if
            for i=1 to length(s) do
                integer codepoint = s[i]
                if codepoint<0 or codepoint>#10FFFF
                or (codepoint>=#D800 and codepoint<=#DFFF) then -- (surrogates are not valid code points)
                    return false
                end if
            end for
        end if
        return true
    end type

    function regex32(utf8or32 pat, utf8or32 src)
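        -- (illustrative) normalise both arguments to utf32 dword-sequences, so that
        -- character classes and ranges would operate per code point rather than per byte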
        if string(pat) then pat = utf8_to_utf32(pat) end if
        if string(src) then src = utf8_to_utf32(src) end if
        ...
But obviously I would rather not impose that overhead on everything, or support such. I should also admit that I have never knowingly processed any combining characters in my entire life, and therefore cannot offer any suggestions on how to handle them. The same applies to Letter/Mark/Separator/Symbol/Number/Punctuation/Other unicode sets. I cannot imagine any reason for complicating things by working in utf-16 either.