
Regex: Match regex strings and capture delimiters, pattern and flags.

Goal: To build a regular expression that would identify and capture the sections of any string representing a regex. The resulting captures will always be one of:

  • [delimiter start][pattern][delimiter end][flags]
  • [delimiter start][pattern][delimiter end]
  • [pattern] – The fallback: if no delimiters are found, the entire string is assumed to be a pattern.

Note: This is not going to be used to validate a regex, but simply to break it apart for processing, e.g. by the JavaScript RegExp constructor, which requires a pattern string and a string of flags. It only recognises the forward slash (/) as a delimiter.
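To illustrate why the split is needed, here is a minimal sketch of the constructor call. The `pattern` and `flags` values are hypothetical, hard-coded stand-ins for what a splitter would return for "/abc/gi".

```javascript
// Hypothetical inputs: the parts a splitter would hand back for "/abc/gi".
const pattern = "abc";
const flags = "gi";

// The constructor takes the pattern and the flags as two separate strings.
const fromParts = new RegExp(pattern, flags);

// Passing the delimited string as-is treats the slashes and the
// trailing "gi" as literal text to match.
const fromWhole = new RegExp("/abc/gi");

console.log(fromParts.flags);           // "gi"
console.log(fromWhole.test("ABC"));     // false
console.log(fromWhole.test("/abc/gi")); // true
```

The second constructor call does not throw; it just builds a regex that looks for the literal characters, which is why the string has to be broken apart first.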

It seems like a simple ask until you realise the optional delimiters and flags should only be captured when a delimiter is present at both the beginning and the end, and that the closing delimiter may itself be followed by optional flags. Let’s take a look:

^(?<delim>\/(?=.+(?<delim_end>\/)(?<flags>(?:(?<f>[igsmyu])(?!.*\k<f>.*$)){0,6}$)))?(?<pattern>.*(?=\k<delim>)|(?<!^\/).*(?!\k<delim>))(?:\k<delim>\k<flags>)?$

Code example: A regular expression to match and capture a regular expression.

I know, horror show, right? 🥲 Actually, it’s not that messy once you break it down, and bear in mind I’m using named capture groups, which add to the length.

The regex works in two halves. The first half captures the opening delimiter, but only if a matching delimiter is present at the end; otherwise we count the first ‘/’ as part of the pattern and assume the string doesn’t include delimiters or flags. While we’re ‘looking ahead’ for the end delimiter and optional flags, we capture them too, both to save repetition and because the second half refers back to them when capturing the pattern.

Here is a breakdown of the first half:

^(?<delim>                   // Open delim group.
    \/                       // Capture '/'.
    (?=                      // Positively look-ahead.
        .+                   // Match any char 1+ times.
        (?<delim_end>\/)     // Capture '/' in delim_end group.
        (?<flags>            // Open flags group.
            (?:
                (?<f>        // Open flag group.
                    [igsmyu] // Match any legal regex flag char.
                )
                (?!          // Negatively look-ahead.
                    .*       // Match any char 0+ times.
                    \k<f>    // Match flag by reference.
                    .*$      // Match any char 0+ times, then EOL.
                )
            ){0,6}$          // Capture 0-6 unique flags before EOL.
        )
    )
)?                           // 0-1 times.
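The first half can be tried on its own to see when the opening delimiter is accepted. This is a minimal sketch, assuming a JavaScript engine with named capture groups; `firstHalf` and `probe` are hypothetical names used only for illustration.

```javascript
// First half only, copied from the breakdown above onto a single line.
const firstHalf = /^(?<delim>\/(?=.+(?<delim_end>\/)(?<flags>(?:(?<f>[igsmyu])(?!.*\k<f>.*$)){0,6}$)))?/;

// Hypothetical helper: report what the first half captured for an input.
// The whole group is optional, so exec() always finds a match here.
function probe(input) {
  const { delim, flags } = firstHalf.exec(input).groups;
  return { delim, flags };
}

console.log(probe("/abc/gi")); // delim "/", flags "gi"
console.log(probe("/abc/"));   // delim "/", flags "" (empty string)
console.log(probe("/abc"));    // both undefined: no closing delimiter
console.log(probe("/abc/gg")); // both undefined: "g" is repeated
```

In the last two cases the look-ahead fails, so the optional group is skipped and the leading '/' is left for the second half to treat as part of the pattern.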

In the second half we capture the pattern, using the captures (if any) from the first half to decide whether we need to watch out for delimiters and flags. It works in two parts. Firstly, we try to match all characters up to the already captured delimiter and flags at the end; the limitation is that this won’t match if we don’t have any delimiters. So we add an `OR`: in the second part, as long as the line doesn’t begin with a delimiter, we capture all characters, provided the string doesn’t also end with a delimiter and optional flags.

Part 1 is responsible for capturing the pattern when delimiters are present and part 2 for when they are not.

Here is a breakdown of the second half:

// Part 1.

(?<pattern>                  // Open pattern group.
    .*                       // Match any char 0+ times.
    (?=                      // Positively look-ahead.
        \k<delim>            // Match delimiter by reference.
    )
    |                        // OR.

// Part 2.

    (?<!                     // Negatively look-behind.
        ^\/                  // Match line start and forward-slash.
    )
    .*                       // Match any char 0+ times.
    (?!                      // Negatively look-ahead.
        \k<delim>            // Match delimiter by reference.
    )
)                            // Close pattern group.
(?:
    \k<delim>                // Match delimiter by reference.
    \k<flags>                // Match flags by reference.
)?                           // 0-1 times.
$                            // Match EOL.
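Putting both halves together, a minimal sketch of how the full expression might be used in JavaScript. `REGEX_STRING` and `splitRegexString` are hypothetical names, and the fallback for a non-matching input (for example one containing a line break) is an assumption based on the stated goal of treating the whole string as the pattern.

```javascript
// The full expression from the post, as a regex literal.
const REGEX_STRING = /^(?<delim>\/(?=.+(?<delim_end>\/)(?<flags>(?:(?<f>[igsmyu])(?!.*\k<f>.*$)){0,6}$)))?(?<pattern>.*(?=\k<delim>)|(?<!^\/).*(?!\k<delim>))(?:\k<delim>\k<flags>)?$/;

// Hypothetical helper: split a regex string into pattern and flags.
function splitRegexString(input) {
  const match = REGEX_STRING.exec(input);
  if (match === null) {
    // Assumption: with no match, fall back to the whole string as the pattern.
    return { pattern: input, flags: "" };
  }
  // flags is undefined when no delimiters were captured; default to "".
  const { pattern, flags = "" } = match.groups;
  return { pattern, flags };
}

console.log(splitRegexString("/abc/gi")); // pattern "abc",  flags "gi"
console.log(splitRegexString("/abc/"));   // pattern "abc",  flags ""
console.log(splitRegexString("abc"));     // pattern "abc",  flags ""
console.log(splitRegexString("/abc"));    // pattern "/abc", flags ""

// Feed the parts to the RegExp constructor.
const parts = splitRegexString("/ab+c/gi");
const re = new RegExp(parts.pattern, parts.flags);
console.log(re.test("xABBC")); // true
```

One engine detail worth knowing: in JavaScript a backreference to a group that did not take part in the match matches the empty string, so there the first alternative can also succeed when no delimiter was captured. Other engines may take the second alternative instead; for the inputs above the captured pattern comes out the same either way.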

I went through countless iterations to come up with this and I’m not suggesting it’s final, but I was quite happy it passed my tests and was about 25% the length of the longest iteration I was experimenting with!

I’m not a ‘regexpert’ (yep, come get me) so if you know of a better way, please share the wealth.

Play with the demo on regex101.com
