Two More Cents

#Writing a Regular Expression Engine

The last four months have seen me consumed by my latest side project, kleingrep. It’s a regex engine and command-line tool, written completely from scratch in Go. When I say ‘from scratch’, I mean that all the code was written by me; Russ Cox’s excellent articles [1] on regular expressions and Ken Thompson’s seminal paper on the topic [2] served as the theoretical foundations for my project.

This post accompanies the project. It mostly serves as a walkthrough of my implementation, without any code. If you’re already familiar with regular expressions, feel free to skim through. The final section describes some additional features I added to the engine (and their implementation), which provide feature parity with mainstream regex engines.

#What is a Regular Expression?

#A match pattern in text

Practically speaking, a regular expression (or regex) is a language that defines a match pattern in text. A regex consists of literal characters and special symbols that define additional criteria. A regex consisting solely of literal characters is identical to a naive text search; the real power of regexes lies in the symbols - symbols that you’ve almost certainly used before, without realizing it. If you’ve ever heard the asterisk symbol (*) referred to as the ‘wildcard’, or used $ to go to the end of a line in Vim, or listed all text files in a directory with:

ls *.txt

you’ve used regular expression syntax.

#An example

Assume you are working on a C codebase, and want to search for unsigned integer variable declarations. Specifically, you want to search for variables with the following data types:

uint8_t
uint16_t
uint32_t

Assume, for the sake of the example, that all you had was a simple text search. You could search for these keywords individually and collect the results. But that’s not very efficient, is it?

Now assume you had the ability to search for text matching a regular expression. While there are many implementations of regex, they all specify the | (‘pipe’) symbol. This symbol specifies an alternation - it says ‘match the text to the left or the text to the right’. Using this operator, we can construct an expression:

(uint8_t)|(uint16_t)|(uint32_t)

Notice that we can group words together, just like a mathematical expression. This looks nice, and will give us all the results we want at once, but we can shorten it. Notice that the only difference between the alternatives is the size - 8-bit vs. 16-bit vs. 32-bit.

uint(8|16|32)_t

This is more concise and will give us the same results. We’ve gone from three keywords that find results individually, to one expression that finds all the results we need. This is just a sliver of the true power of regular expressions. Today, they are often used in text extraction and input validation - if you’ve ever had to concoct a password that satisfies the seemingly arbitrary requirements of a sign-up page, you’re dealing with a regular expression. If you’ve had an email address rejected because you missed the @, you’re dealing with a regular expression. Validating phone numbers? Regular expressions.

#A representation of a regular language

This is the theoretical side of the coin. Strictly speaking, a regular expression is a string that defines a Regular Language. Such a language must be recognized by a Finite State Automaton [3], better known as a State Machine. The theory behind regular expressions was developed by the mathematician Stephen Kleene [4] in the 1950s.

#The expression

A strict regular expression must consist solely of the following four components:

Base Case:

  1. A single literal character (e.g. a)

Recursive Cases:

  1. Concatenation - two expressions placed side by side (e.g. ab)
  2. Alternation - a choice between two expressions, denoted by the pipe (e.g. a|b)
  3. Repetition - an expression repeated zero or more times, denoted by the star (e.g. a*)

The ‘star’ operator is sometimes referred to as the Kleene Star, after Stephen Kleene.

Note that the recursive cases can be combined. This allows us to construct expressions like the following:

(foo|bar)(xy)*  <--- Match 'foo' (concatenation) or 'bar' (concatenation, alternation), followed by 'xy' repeated zero or more times (concatenation, repetition)

#The machine

I mentioned earlier that a strict regular expression must be recognized by a State Machine / Automaton. A state machine consists of two things:

  1. A bunch of ‘states’
  2. Transitions between these states

The automaton begins at some predefined start state, and with the 0th character in the string.

At each state, the automaton compares the character in the state with the current character in the string. If they match, the automaton consumes the character and moves on to the next character in the string and the next state(s) in the automaton. If they don’t, it stops traversing that ‘branch’ of the automaton.

An epsilon (ε) state or zero state is one that contains no character. It may specify some condition, but it can also be a free state - the automaton can reach it, perform any required action, and move on.

The automaton is terminated by a final epsilon state. When it reaches this state, it has found a match. This state is marked by drawing two circles around it.

The following diagrams show the construction of simple expressions. Note that a complex expression can be represented as combinations of simple expressions. Similarly, the automaton corresponding to a complex expression can be constructed using the automata of simpler ones.

A Single Character
Concatenation
Alternation
Repetition

#An example

Let’s consider the regex mentioned earlier: (foo|bar)(xy)*. The first component is foo:

foo

The second is bar, constructed the same way:

bar

The parentheses are only used for grouping, so ignoring them gives us foo|bar:

foo|bar


The second part of the regex is xy.

xy

Applying the Kleene star to this gives us (xy)*.

(xy)*

Putting everything together, we arrive at (foo|bar)(xy)*

(foo|bar)(xy)*

Note that we finish off the automaton with a ‘match’ state.

#Useful repetition operators

In addition to the three basic operations, it’s useful to define a couple of others that are built on them. These operators are available in virtually every regex engine, and their construction is straightforward.

  1. One or more repetitions - this is denoted with the plus symbol. Note that:
x+ == xx*

That is, an expression repeating one or more times is equivalent to: the expression occurring once, then zero or more times.

  2. Zero or one repetition - this is denoted with the question-mark symbol. Note that:
x? == (x|) 

That is, we either require the expression, or nothing.
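These rewrites are easy to mechanize. The following toy Go function (a sketch, not kleingrep’s actual code) applies them for the simple case where the quantifier follows a single character; a real parser applies the same rewrite to whole subexpressions:

```go
package main

import "fmt"

// desugar expands '+' and '?' in terms of the basic operators, for the
// simple case where the quantifier follows a single character.
func desugar(re string) string {
	var out []rune
	for _, c := range re {
		switch c {
		case '+': // x+ == xx*: one mandatory copy, then zero or more
			out = append(out, out[len(out)-1], '*')
		case '?': // x? == (x|): the character, or nothing
			x := out[len(out)-1]
			out = append(out[:len(out)-1], '(', x, '|', ')')
		default:
			out = append(out, c)
		}
	}
	return string(out)
}

func main() {
	fmt.Println(desugar("ab+c?")) // abb*(c|)
}
```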

#Converting expressions to automata

While I’ve described the high-level conversion process, it might help to go into more detail on how my engine does it. At its heart, this is a parsing problem: Given some expression, we need to parse it into a format that will allow us to construct the automaton. This usually means a syntax tree, but I chose to use postfix notation instead. The cool thing about postfix notation is that you don’t need a bespoke data structure to represent it: a simple list will do.

#Sidenote: Explicit concatenation operators

So far, I’ve described concatenation as being an implicit process - when you stick two expressions together, they are concatenated. For parsing purposes (especially if we’re parsing to postfix), it helps immensely if we explicitly denote concatenation. This post uses the tilde symbol (~) to denote concatenation; my engine uses Unicode characters from the PUA range instead, so that tildes can be used as regular characters.

Inserting explicit concatenation operators, our regex above is converted into:

(f~o~o|b~a~r)~(x~y)*
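This insertion pass can be sketched in a few lines. The function below is a minimal version that assumes only the basic operators, and uses a literal ~ (unlike my engine):

```go
package main

import (
	"fmt"
	"strings"
)

// insertConcat adds an explicit '~' between any two adjacent tokens
// that are implicitly concatenated. Only the basic operators are handled.
func insertConcat(re string) string {
	var sb strings.Builder
	runes := []rune(re)
	for i, c := range runes {
		if i > 0 {
			prev := runes[i-1]
			// Concatenate when the previous token can end an
			// expression and the current one can start a new one.
			endsExpr := prev != '(' && prev != '|'
			startsExpr := c != ')' && c != '|' && c != '*'
			if endsExpr && startsExpr {
				sb.WriteRune('~')
			}
		}
		sb.WriteRune(c)
	}
	return sb.String()
}

func main() {
	fmt.Println(insertConcat("(foo|bar)(xy)*")) // (f~o~o|b~a~r)~(x~y)*
}
```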

#Parsing

Parsing algorithms and generators are a dime a dozen, so the specific choice of algorithm isn’t too important. I chose the Shunting-Yard algorithm [5], because it was easier to understand. If I were to do it again, I would probably choose Recursive Descent [6], since it’s more powerful and extensible, and maps better to a formal grammar.

One neat side-effect of parsing into postfix is that parentheses are no longer needed. The example above, when converted to postfix, becomes:

fo~o~ba~r~|xy~*~

Take a moment to understand how the expression above relates to this one.
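For reference, here is a bare-bones Shunting-Yard pass that produces exactly this postfix form. It’s a simplified sketch (no escapes, no error handling for malformed input), not my engine’s actual parser:

```go
package main

import (
	"fmt"
	"strings"
)

// toPostfix converts an infix regex (with explicit '~' concatenation
// already inserted) to postfix, using the shunting-yard algorithm.
// Operator precedence, highest first: '*', then '~', then '|'.
func toPostfix(infix string) string {
	prec := map[rune]int{'*': 3, '~': 2, '|': 1}
	var out strings.Builder
	var ops []rune // the operator stack
	for _, c := range infix {
		switch c {
		case '(':
			ops = append(ops, c)
		case ')':
			// Pop operators until the matching '('.
			for len(ops) > 0 && ops[len(ops)-1] != '(' {
				out.WriteRune(ops[len(ops)-1])
				ops = ops[:len(ops)-1]
			}
			ops = ops[:len(ops)-1] // discard the '('
		case '*', '~', '|':
			// Pop operators of greater or equal precedence first.
			for len(ops) > 0 && ops[len(ops)-1] != '(' && prec[ops[len(ops)-1]] >= prec[c] {
				out.WriteRune(ops[len(ops)-1])
				ops = ops[:len(ops)-1]
			}
			ops = append(ops, c)
		default: // literal characters go straight to the output
			out.WriteRune(c)
		}
	}
	for len(ops) > 0 { // flush the remaining operators
		out.WriteRune(ops[len(ops)-1])
		ops = ops[:len(ops)-1]
	}
	return out.String()
}

func main() {
	fmt.Println(toPostfix("(f~o~o|b~a~r)~(x~y)*")) // fo~o~ba~r~|xy~*~
}
```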

If you are using the same approach to parsing, here are a few tips:

  1. Remember that every special character can be escaped. Any special character can be turned into a regular character by putting a backslash in front of it. Conversely, certain regular characters take on a special meaning when escaped. For example, \n and \t.

  2. Try to limit the number of passes. When I wrote my parsing function, efficiency wasn’t a priority (although it should have been). This resulted in a large number of passes through the input string, which can quickly become slow and unwieldy.

  3. Use the type system. This is language-dependent, but if your language provides a good type system, use it.

  4. Provide informative errors. Error detection is a major component of parsing. When you detect a malformed expression, provide as much information as you can. This will help future debugging.

  5. Keep it extensible. This is just good programming advice in general, but its especially relevant here, because you might want to add features to your engine (indeed, we will discuss a few below).

#Generating the automaton

Once we’ve parsed the expression, we can generate the automaton. This step should be much easier than the first, since the bulk of the work has already been done. The following construction algorithm was described in Ken Thompson’s 1968 paper (see [2] below).

For every item in the postfix representation:

  1. If the item is a character, create a state for it, and add that state to the output stack.

  2. If the item is an operator, take the requisite number of items out of the stack (concatenation and alternation are binary operators, the Kleene star is unary), perform the operation on the items, and put the result back in.

That’s it! At the end of this process (assuming you have a syntactically correct expression), your stack should contain 1 item, the start of the machine.

As mentioned above, the other repetition operators can be broken down, so you don’t need any special code to handle them.
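The construction algorithm above fits in a few dozen lines of Go. This is a simplified sketch in the style of Thompson’s construction - only ~, | and * are handled, and the state and fragment types are my own, not kleingrep’s exact code:

```go
package main

import "fmt"

// A state has at most two outgoing arrows. A character state uses c and
// out; a split uses out and out1; a match state has isMatch set.
type state struct {
	c         rune
	out, out1 *state
	isMatch   bool
}

// A fragment is a partially built NFA: a start state, plus the dangling
// out-pointers that still need to be connected to whatever comes next.
type frag struct {
	start *state
	outs  []**state
}

// patch points every dangling arrow in outs at s.
func patch(outs []**state, s *state) {
	for _, p := range outs {
		*p = s
	}
}

// compile builds an NFA from a postfix regex with explicit '~'.
func compile(postfix string) *state {
	var stack []frag
	pop := func() frag {
		f := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		return f
	}
	for _, c := range postfix {
		switch c {
		case '~': // concatenation: join e1's dangling arrows to e2's start
			e2, e1 := pop(), pop()
			patch(e1.outs, e2.start)
			stack = append(stack, frag{e1.start, e2.outs})
		case '|': // alternation: a new split pointing at both alternatives
			e2, e1 := pop(), pop()
			s := &state{out: e1.start, out1: e2.start}
			stack = append(stack, frag{s, append(e1.outs, e2.outs...)})
		case '*': // repetition: a split that loops back into the fragment
			e := pop()
			s := &state{out: e.start}
			patch(e.outs, s)
			stack = append(stack, frag{s, []**state{&s.out1}})
		default: // a literal character becomes a single dangling state
			s := &state{c: c}
			stack = append(stack, frag{s, []**state{&s.out}})
		}
	}
	// One fragment remains; cap it with the final match state.
	e := pop()
	patch(e.outs, &state{isMatch: true})
	return e.start
}

func main() {
	start := compile("fo~o~ba~r~|xy~*~") // postfix for (foo|bar)(xy)*
	fmt.Println(start.out.c == 'f' && start.out1.c == 'b')
}
```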

Here are some tips for this stage of the engine:

  1. A state machine is not always a graph. My engine initially represented the state machine as a graph. This is fine, but a graph is too abstract for a regex automaton. Specifically, a graph has no upper bound on the number of transitions from one state to another, while a state in a regex automaton can connect to either 1 (concatenation) or 2 (alternation) states. No more. This should greatly help with the design of the state machine.

  2. Use pointers, if you can. When attempting to match the automaton with a string (which we’ll see below), it helps if the states are represented as pointers. This makes the process of marking states as ‘visited’ much easier, and brings performance benefits. This is dependent on the language, of course.

  3. Store the outputs of a state. When I say ‘outputs’, I am referring to the arrows that come out of a state. Consider the regex a(b|c|d). For convenience, we can write this as a((b|c)|d). Note that, after matching a, the engine has three options: b, c or d. Each of the corresponding states has an outgoing arrow that should ultimately point to the same match state. Rather than traversing the entire machine to add the match state to b, then c, then d, we could store the outputs for the machine at the start, i.e. at a. This output pointer will get updated every time we add a state to the machine.

  4. Remember that the machine starts at exactly one state, and finishes at exactly one state.

#NFA to DFA

The automaton we have generated here is a Non-deterministic Finite State Automaton or NFA. It is non-deterministic because a state isn’t uniquely identified by the characters consumed to get there. Consider the automaton for aa|ab:

Automaton for aa|ab

From the start, consuming the character a leads us to two states. A Deterministic Finite State Automaton (or DFA) would combine both of these states, reducing the number of states without changing the behavior of the automaton. This article doesn’t cover the NFA-to-DFA conversion, but it’s a useful part of the process that can reduce space- and time-complexity.

#Matching Algorithms

Now that we’ve constructed the automaton, we can get to work matching it with a string.

#Side-note: Greediness

Before diving into matching, note that the default behavior of an expression is to be greedy. This means that, given a repetition, the automaton will match as many characters as it can, while preserving the match.

For example, consider the following expression:

x*x

In English, this means ‘match x zero or more times, then match x’. Suppose our input is the string xxxxxxxxxx. The first part of the expression could just match nothing, and allow the second part to match the first x. But because of greediness, it will match as many as it can while preserving the match, i.e. the first nine x’s. This allows the second part to match the last x and complete the match. If the first part had matched all of them, the second part would have nothing to match, preventing the expression as a whole from matching.

If this sounds confusing, rexegg uses the phrase Greedy, but docile which sums it up pretty well. The automaton has two priorities which, in order, are:

  1. Find a match.
  2. Match as many characters as possible.

#Side-note: Branch order

Take a look at the automaton above for (foo|bar)(xy)*. The first zero state has two options, f and b. While it may seem that these are weighted equally, they are not.

Most regex engines, including mine, follow the Perl convention for matching. According to this convention, the engine prefers the left branch of the alternation over the right. If the left branch finds a match, the right branch isn’t even tried, even if it could lead to a longer match. This breaks greediness, which is unfortunate.

An alternative approach is the POSIX one. This convention states that the longest match is always preferred. In other words, the order of alternation does not matter. The POSIX specification comes with some warts, covered by Russ Cox. While I had initially implemented POSIX-style matching, I decided to switch to Perl-style to avoid these pitfalls.

Now let’s look at the matching algorithms.

#Backtracking

This is the conventional method of matching some text with an automaton. When a backtracking engine is confronted with a split (e.g. via an alternation), it will pick one branch and explore it fully. Upon encountering a failing match, it will ‘backtrack’ to the last split and pick the other branch. This is the depth-first approach, and corresponds to a DFS of a graph.

This algorithm is simple and easy to implement, and lends itself well to a recursive solution. However, it’s vulnerable to catastrophic backtracking, which results in exponentially slow matching speed and stack overflows. Russ Cox shows it better than I ever could.

This approach is implemented by the engines of most mainstream languages, including JavaScript (at least on Firefox), Perl (the de facto standard for regex), Java and PHP.

The takeaway: backtracking is simple and works well for most expressions, but we can do better.

#Thompson’s Matching Algorithm

In the very same 1968 paper, Thompson also described an algorithm to match expressions. His key observation was that, by trying different branches in parallel, the automaton could be matched in a single pass of the input string (as opposed to the unbounded number of passes required by backtracking). As expected, this corresponds to a breadth-first search of a graph.

On the surface, this seems fairly pragmatic - if backtracking represents a depth-first approach, a breadth-first approach must surely exist. However, the implications are immense, especially with regard to worst-case time complexity - this approach is immune to catastrophic backtracking. This approach is less common, and is used by the UNIX tools (grep, awk, sed, etc.), Go’s regexp (itself a derivative of Google’s RE2), Rust’s regex crate, and my own engine.

The algorithm relies on looping through every character in the input string. At every loop iteration, we maintain two lists of states - one is processed during the current iteration (the current list), and the other is processed in the next iteration (the next list). The current list starts off with the first state in the automaton, and the next list starts off empty.

Every state can be one of three things - a character, a split or a match. A character has 1 output, a split has 2, and a match has 0. For every state in the current list:

  1. If it’s a character, we check if it matches the current character in the input string. If it does, we call a function to add the next state (ie. its output) to the next list.
  2. If it’s a match state, we can save the current position in the string - we have found a match. We still continue, though, because of greediness (there could potentially be a longer match).

The function mentioned above (the one that adds states to the next list) is recursive, and does the following:

  1. If the next state already exists in the next list, return the list unchanged. This is the first base case.
  2. If the next state is a character or a match, add it to the next list. This is the second base case.
  3. If the next state is a split, call the function on its ‘top’ branch and then on the ‘bottom’ branch (this order preserves Perl-style matching). This is the recursive case.

At the end of every iteration, the next list is copied into the current list, and the next list is cleared. If the current list is ever empty, the matching function exits.

Notice that this approach isn’t completely free of recursion - however, we’ve relegated it to a smaller function, which (due to the first base case) has a finite upper bound.
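The whole matching loop can be condensed into a short sketch. The version below uses a hand-built automaton for ab*; it assumes ASCII input, and the state type and function names are my own, not kleingrep’s:

```go
package main

import "fmt"

// Minimal state type for this sketch: a character state has c set, a
// split has c == 0 and two outputs, a match state has isMatch set.
type state struct {
	c         rune
	out, out1 *state
	isMatch   bool
}

// addState adds s to list, expanding splits recursively so that only
// character and match states ever appear in a list. The seen map is the
// first base case: a state already in the list is not added again.
func addState(list []*state, s *state, seen map[*state]bool) []*state {
	if s == nil || seen[s] {
		return list
	}
	seen[s] = true
	if s.c == 0 && !s.isMatch { // a split: top branch first (Perl order)
		list = addState(list, s.out, seen)
		return addState(list, s.out1, seen)
	}
	return append(list, s)
}

// match runs Thompson's algorithm, returning the end index of the
// longest match anchored at the start of input, or -1. ASCII only.
func match(start *state, input string) int {
	clist := addState(nil, start, map[*state]bool{})
	last := -1
	for i, ch := range input {
		var nlist []*state
		seen := map[*state]bool{}
		for _, s := range clist {
			if s.isMatch {
				last = i // a match ends here; keep going (greediness)
			} else if s.c == ch {
				nlist = addState(nlist, s.out, seen)
			}
		}
		clist = nlist
		if len(clist) == 0 {
			break
		}
	}
	for _, s := range clist { // a match can also end at end-of-string
		if s.isMatch {
			last = len(input)
		}
	}
	return last
}

func main() {
	// Hand-built automaton for ab*: 'a', then a split looping over 'b'.
	m := &state{isMatch: true}
	split := &state{}
	b := &state{c: 'b'}
	split.out, split.out1 = b, m
	b.out = split
	a := &state{c: 'a', out: split}
	fmt.Println(match(a, "abbb")) // 4: the whole string matches
}
```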

Let’s look at an example of the matching algorithm. Consider the automaton for (foo|bar)(xy)*. Assume we are matching it with the input fooxyxy.

The automaton

(The following section uses C: <states> and N: <states> to denote the current and next list, respectively).

We initialize the current list and next list, as well as our string pointer (denoted sp), which will store the current index (0-indexed) in the string. C: <empty>; N: <empty>; sp=0.

sp=0    str=fooxyxy

First, we need to add the start state. Since it’s a split, we will recursively add its top branch, then its bottom branch. This gives us C: f,b; N: <empty>.

Iterating through C we find f. It matches the current character in the string, so we add its output state. C: f,b; N: o. Next up is b. This doesn’t match, so it’s discarded.

sp=1    str=fooxyxy

Only one character in C, and that’s o. It matches, so we add its output. C: o; N: o.

sp=2    str=fooxyxy

Now things get interesting. The o matches, but its output is a split. So we add the top, then the bottom. The bottom is a match state (I denote it here with m; note that, in this case, it doesn’t signify a literal m character). C: o; N: x, m.

sp=3    str=fooxyxy

First up, we have x. This matches, so we add its output, y.

The next state is a match, indicating that we could stop here. We store the current sp and keep moving on. C: x, m; N: y.

sp=4    str=fooxyxy

Only one character: y. This matches, so we add its output. Since it’s a split, we do the same thing that we did earlier. This gives us - C: y; N: x, m.

sp=5    str=fooxyxy

The x matches, so we add its output. Next, we see another match state. This time, we have come further than we did before, so we clear out the old stored sp and put this one in. C: x, m; N: y.

sp=6    str=fooxyxy

The y matches, so we add its output. This is exactly the same as before. C: y; N: x, m.

sp=7    str=fooxyxy_

We are at the end of the string. The x doesn’t match (because there’s nothing for it to match). We find a match state, and replace the old stored sp with this one. C: x, m; N: <empty>.

sp=8    str=fooxyxy

Nothing in our current state list, so we return. We have matched from the beginning up to (but not including) index 7, as indicated by the stored string pointer.

Running through more simulations like this should convince you that this approach works. Try it out on the regex a*a and the input aaaaa, if you want to be convinced that it’s faster.


Interestingly enough, this approach isn’t completely immune either - a recent paper [7] published by Su et al. discusses a method of generating strings that can cause DoS (denial-of-service) with certain regexes, even on non-backtracking engines. Certain engines (most notably, ripgrep, which relies on Rust’s engine) stand their ground even with these pathological test cases; my engine, on the other hand, crumbles like a cookie. I suspect that aggressive optimizations have a big role to play here, which I haven’t gotten around to yet.

#Sidenote: Branch order revisited

Remember the POSIX vs. Perl discussion? When it finds a match, a POSIX-style engine will process all remaining states in the current list, before moving on to the next iteration in the loop. A Perl-style engine would discard all other states in the current list - since the top branch was added before the bottom branch, any matches that it finds must take precedence over those found by the bottom branch. Russ Cox calls this ‘thread priority’ but I think that’s a bit of a misnomer. The list is never reordered to maintain a rigid priority (like you would see in a priority queue). Instead, it’s implicitly encoded in the construction of the automaton - that’s what makes this approach so elegant.

#Additional Features

This section is optional. You already have all the knowledge you need to build a minimal engine…

But that’s not too fun, is it?

We can’t write the typical gobbledygook associated with regexes, like:

^.*[\x60-\x6f](?=\*\\){,7}

This section covers additional regex features that are implemented by most mainstream engines. Implementing these should allow you to write the fun expressions.

For each feature, we’ll do a quick overview and discuss how it can be implemented.

#Metacharacters

Most engines specify metacharacters - characters that can match a whole group of characters. For example, Perl specifies \d which matches any digit character (0 to 9). The ‘dot’ metacharacter is also very common - a period (.) is used to represent any single character.

To implement these, you will need to expand your definition of a character state. Store a list of characters instead of a single one. When a character state is reached, check if the current character in the input string matches any of the items in the state’s character list.

#Hexadecimal and Octal characters

Most engines also allow you to define arbitrary hexadecimal / octal values. The syntax is \x{ABCDEF} (where ABCDEF is a six-character hex value) for Unicode hexadecimal values and \xFF for ASCII hex values. The usual syntax for octal values is \0123 (where 123 is an octal value, and must be preceded by 0).

This is purely a parsing problem. The automaton constructor should only see the resulting Unicode values, and should treat them as regular characters.
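As a sketch, a parser helper for the two hexadecimal forms might look like this (the function name and error handling are my own, not kleingrep’s):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseHexEscape parses the text just after a "\x", handling both the
// \xFF and \x{10FFFF} forms. It returns the decoded rune and the number
// of characters consumed, so the parser can skip past the escape. The
// automaton constructor never sees the escape, only the resulting rune.
func parseHexEscape(s string) (rune, int, error) {
	if len(s) > 0 && s[0] == '{' { // the \x{...} form
		end := 1
		for end < len(s) && s[end] != '}' {
			end++
		}
		if end == len(s) {
			return 0, 0, fmt.Errorf("unterminated \\x{ escape")
		}
		n, err := strconv.ParseInt(s[1:end], 16, 32)
		return rune(n), end + 1, err
	}
	if len(s) < 2 { // the two-digit \xFF form
		return 0, 0, fmt.Errorf("truncated \\x escape")
	}
	n, err := strconv.ParseInt(s[:2], 16, 32)
	return rune(n), 2, err
}

func main() {
	r, n, _ := parseHexEscape("{1F600}rest")
	fmt.Printf("%c %d\n", r, n)
}
```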

#Lazy quantifiers

We have discussed three quantifiers:

  1. Zero or more (prefer more) (*)
  2. One or more (prefer more) (+)
  3. Zero or one (prefer one) (?)

As mentioned above, the default behavior for these quantifiers is greediness - they will prefer matching more characters.

A lazy quantifier flips this preference - lazy quantifiers will match as few characters as possible, and will only match more if they have to. The typical syntax for lazy quantifiers is as follows:

  1. Zero or more (prefer zero) (*?)
  2. One or more (prefer one) (+?)
  3. Zero or one (prefer zero) (??)

To understand the purpose of lazy quantifiers, take a look at the following regex:

\(.*\)

The goal is to match parenthesized expressions. The opening and closing parentheses are escaped so that they match literally, and the ‘dot’ matches any character. Given the input:

(parenthesized) (text)

Should the regex capture the whole thing ((parenthesized) (text)) or just the first expression ((parenthesized))? A greedy quantifier will give you the former; a lazy quantifier (replace * with *?) will give you the latter.

Implementing lazy quantifiers is actually really easy - flip the order in which you add states when you encounter a split. Instead of adding the top branch before the bottom one, add the bottom one first. This will flip the implicit preference - the automaton will try the shorter alternative, and only try the longer one if it fails.

#Character ranges and classes

A character class is a cleaner way to represent multiple alternations. Character classes are denoted with square brackets.

[abcde] => a|b|c|d|e

Character ranges are extensions of character classes, which further shorten the expression.

[a-e] => a|b|c|d|e

A character class can consist of one or more character ranges, mixed with regular characters.

[xyza-e0-9] => x|y|z|a|b|c|d|e|0|1|2|3|4|5|6|7|8|9

Since these boil down to repeated alternations, the implementation should be straightforward. Most of the work will be done in the parsing stage, and the automaton construction should be left largely untouched. Alternatively, you could encode them as ordinary character states that contain all the elements in the character class. If you implemented metacharacters, your character states should already represent a list of characters, rather than a single one.

#Negated character classes

Character classes can be negated to match any character except those in the class. This is specified by adding a caret (^) to the beginning of the class. For example:

[^xyz] (Match any character except x, y or z)

These are a little harder to implement. Since the number of characters that can be matched is huge (every unicode character except x, y and z), it’s impractical to define each one of them. Instead, you may want to add an ‘exception’ field to the character state. Instead of checking that the current character matches one in the character list, you would check that it doesn’t match one in the exception list.
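A sketch of such an expanded character state, covering both plain classes and negated ones (the type and field names are my own, for illustration):

```go
package main

import "fmt"

// charState sketches an expanded character state: it matches against a
// list of runes, and the negated flag inverts the check, which covers
// classes like [^xyz] without enumerating the entire alphabet.
type charState struct {
	chars   []rune
	negated bool
}

func (s charState) matches(c rune) bool {
	for _, r := range s.chars {
		if r == c {
			return !s.negated
		}
	}
	return s.negated
}

func main() {
	digit := charState{chars: []rune("0123456789")}          // like \d
	notXYZ := charState{chars: []rune("xyz"), negated: true} // like [^xyz]
	fmt.Println(digit.matches('7'), notXYZ.matches('a'), notXYZ.matches('x'))
}
```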

#POSIX character classes

POSIX specifies additional shorthands for character classes. These are mostly alternatives to the Perl metacharacters. Some examples are [[:digit:]] (equivalent to \d in Perl) and [[:upper:]] (equivalent to [A-Z]).

#Finite repetition

Most engines allow you to specify a finite lower and upper bound for repetitions. The typical syntax is {m,n}, where m is the lower bound and n is the upper bound. The upper bound defaults to infinity if it is omitted but the comma is kept; {m} alone means exactly m repetitions. For example:

(xyz){7,9} (Match 'xyz' between 7 and 9 times)

(xyz){9,} (Match 'xyz' at least 9 times)

(xyz){7} (Match 'xyz' exactly 7 times)

Finite repetitions can be expressed through the built-in repetition operators, as explained by Russ Cox in his first article on regular expressions [8]. a{3,5} expands to aaaa?a? and a{5,} expands to aaaaa+ (or aaaaaa*).

With this knowledge, there are two ways to implement this feature:

  1. Expand the expression while parsing, according to the rules mentioned above. The automaton constructor will be completely untouched.
  2. While parsing, ‘tag’ the part of the expression that’s being repeated. The automaton constructor will construct the state, then use the tags as required.

Essentially, method 1 places all of the burden on the parser, while method 2 offloads some of it to the automaton constructor. Regardless, this is mostly a parsing problem.
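Method 1 can be sketched as a small rewriting function (the signature is my own; a real parser would also decide whether sub needs grouping):

```go
package main

import (
	"fmt"
	"strings"
)

// expand rewrites sub{min,max} using only the built-in quantifiers:
// a{3,5} -> aaaa?a?, a{5,} -> aaaaa+. A max of -1 means "no upper
// bound" (this sketch assumes min >= 1 in that case, since sub{0,} is
// just sub*). sub is a single character or an already-parenthesized
// group, so it can be repeated verbatim.
func expand(sub string, min, max int) string {
	var sb strings.Builder
	sb.WriteString(strings.Repeat(sub, min)) // the mandatory copies
	if max < 0 {
		sb.WriteString("+") // the last mandatory copy may repeat
	} else {
		for i := min; i < max; i++ {
			sb.WriteString(sub + "?") // each extra copy is optional
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(expand("a", 3, 5))  // aaaa?a?
	fmt.Println(expand("a", 5, -1)) // aaaaa+
}
```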

#Assertions

An assertion is a check that consumes no characters. It evaluates to a boolean value, which determines whether or not the automaton continues the branch.

Here are some common assertion symbols:

^ = Start of string (true if sp=0)
$ = End of string
\b = Word boundary (a word character (letter, number or underscore) next to a non-word character (anything else))
\B = Non-word boundary (two word or non-word characters next to each other)

For example, the following regex will match int, but not integer or print:

^int$

Since an assertion consumes no characters, it must be handled in the recursive case of the ‘add next state to list’ function. Think of the assertion as a gate - you may only proceed if you pass the assertion. Hence, assertions must be stored as a distinct type of state. We now have the following four kinds of states:

  1. Character states
  2. Split states
  3. Match states
  4. Assertion states

The ‘add next state to list’ function should be updated with the following rule: if the next state is an assertion, evaluate it at the current string position; if it passes, recursively add the assertion’s output, and if it fails, discard that branch.
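The check behind this gate is straightforward to sketch. The function below evaluates the common assertions at a string position (the kind strings are my own encoding, and ASCII input is assumed):

```go
package main

import "fmt"

func isWordChar(c byte) bool {
	return c == '_' || ('0' <= c && c <= '9') ||
		('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

// holds reports whether the assertion holds at index sp of input.
func holds(kind, input string, sp int) bool {
	wordBefore := sp > 0 && isWordChar(input[sp-1])
	wordAfter := sp < len(input) && isWordChar(input[sp])
	switch kind {
	case "^": // start of string
		return sp == 0
	case "$": // end of string
		return sp == len(input)
	case `\b`: // word boundary: word-ness changes at sp
		return wordBefore != wordAfter
	case `\B`: // non-word boundary: word-ness does not change
		return wordBefore == wordAfter
	}
	return false
}

func main() {
	fmt.Println(holds("^", "int", 0), holds("$", "int", 3), holds(`\b`, "a b", 1))
}
```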

#Lookarounds

A lookaround is a more generic assertion. It allows you to check if the substring before or after the current index matches an expression. There are four kinds of lookarounds:

(?=<expr>) = Positive lookahead (true if the substring in front matches expr)
(?!<expr>) = Negative lookahead (true if the substring in front does not match expr)
(?<=<expr>) = Positive lookbehind (true if the substring behind matches expr)
(?<!<expr>) = Negative lookbehind (true if the substring behind does not match expr)

Most of the work will be done in the parsing stage, figuring out which kind of lookaround you have encountered. Since a lookaround is just an assertion, it can be represented as an assertion state containing the text of the expression. When this state is reached, parse and compile the expression into an automaton, and run it on the substring after (or before) the current index. If a match starts (or ends) at the current index, the lookaround passes. If not, it fails.

Note that lookarounds violate the worst-case time complexity of Thompson’s matching algorithm. Since lookarounds can be nested, there is no finite upper bound. Of course, you could detect nested lookarounds and prevent them, but the time complexity will still be worse - you are running an automaton as part of another automaton.

A small improvement can be achieved by pre-compiling the lookaround regex - instead of storing the expression text in the assertion state, store the resulting automaton.

#Capturing Groups

A capturing group is used to store the text matched by a ‘group’ (enclosed in parentheses) of the regex. Take the following regex:

(pine)apple

When matched with the string pineapple, the entire string matches, but the first group only captures pine. This is a very useful tool, and is often a good substitute for a lookaround. If we were only interested in pine, notice that we could have also written the following:

pine(?=apple)

But this is both inefficient and hard to understand.

My implementation of capturing groups borrows from Russ Cox [9] (who borrows it from Rob Pike [10]). The idea is to simulate a backtracking approach using threads. A thread contains a state, along with some ‘thread variables’. One of these thread variables is the content captured by each capturing group of the expression. I just added these variables as part of the state, but you could also create a new type to represent a thread. The current and next lists would then contain threads, rather than states. Whenever a new state (or thread) is created from an existing one, the original’s thread variables are copied into the new one.

On the parsing side, parentheses should be left in the original expression rather than being filtered out (as they usually are with the Shunting-Yard algorithm). The automaton constructor then adds them as a fifth kind of state - a Group state. Group states are zero-states that can be traversed for free, with the caveat that, every time they are encountered, the thread variables of the current thread must be updated. If the group state is an opening parenthesis, the start index of the respective capturing group is set; if it’s a closing parenthesis, the end index is set. Since the thread variables will be carried forward, all future threads will ‘remember’ the location of the capturing group.

At this point, you might also want to think about defining additional types to make matching easier. My engine uses a ‘Group’ type, which contains the start and end index of a group. A ‘Match’ is a list of Groups - the Group at index n represents the contents of the nth capturing group. The Group at index 0 represents the ‘0-group’, i.e. the entire match.

#Backreferences

A backreference is used to refer to the contents matched by a previous capturing group. They are indicated by a backslash followed by a number, which denotes the capturing group. The following regex:

(x|y)\1

translates to ‘Match x or y, then whatever was matched by group 1’. Groups are numbered based on the position of their opening parenthesis.

Backreferences are incredibly interesting, not least because of their theoretical implications. Any pattern language that supports backreferences isn’t regular in the strict sense (i.e. it can describe languages that aren’t regular). From my understanding, this is because such languages need not even be context-free - and being context-free is a prerequisite for being regular. A backreference in a regex depends on what an earlier group matched, hence it relies on context.

As you might have guessed, this was definitely the hardest feature to implement.

Russ mentions that the UNIX tool grep rewrites expressions that contain backreferences. Using his example, (cat|dog)\1 would become (cat|dog)(cat|dog). This has a broader meaning than the original, since it will match catdog and dogcat (in addition to catcat and dogdog). The matching function is then rewritten to return a list of all potential match points, rather than just the longest one. A filter function is then called on the returned match points, to select the one where each backreference matches the contents of its original group.

After pondering over it for weeks, I had all but decided to go with the grep approach, but first I reached out to Russ Cox via email for ideas. Russ suggested an alternate approach, which worked perfectly. With his permission, I have paraphrased his solution below (all credit belongs to him).

First, we create a new type of state to represent a backreference - no rewriting of the expression is required. This backreference state contains the group number that it refers to. It also contains an additional thread variable - how far it has matched into the group that it’s referring to. Naturally, this variable (let’s call it backref) will start at 0 (since the backreference cannot come before the group that it refers to).

When this state is encountered, check if the current character matches the backref-th character of the referred group. If it does, increment backref and add the current state to the next list, so that the next character can be checked. If it doesn’t, the backreference has failed, and the state can be discarded. If all the characters in the referred group have been matched (ie. backref equals the group’s length) then the backreference has succeeded, and we can proceed as we normally would, by adding the state after the backreference state.
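A single step of that logic can be sketched as a function (names and signature are my own, not kleingrep's) that reports which of the three outcomes applies:

```go
package main

import "fmt"

// stepBackref sketches one matching step for a backreference state.
// groupText is the text captured by the referred group; backref is how
// far this thread has matched into it; ch is the current character.
func stepBackref(groupText string, backref int, ch byte) (advance, done, failed bool) {
	if backref == len(groupText) {
		// Whole group re-matched: the backreference succeeds, and we
		// proceed to the state after the backreference state.
		return false, true, false
	}
	if groupText[backref] == ch {
		// One more character matched: increment backref and re-add
		// this state to the next list.
		return true, false, false
	}
	// Mismatch: the backreference has failed; discard this thread.
	return false, false, true
}

func main() {
	adv, _, _ := stepBackref("pine", 0, 'p')
	fmt.Println(adv) // true: 'p' matches, so backref advances to 1
	_, done, _ := stepBackref("pine", 4, 'x')
	fmt.Println(done) // true: all of "pine" has been re-matched
}
```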

This is an astoundingly simple approach, which took me ~50 lines of code to implement. Yet it worked perfectly on all the tests I ran - a true testament to ‘less is more’. It even passes contrived tests like:

(x*)\1

which will only match an even number of x’s.

#Numeric Ranges

This was an interesting feature to implement because, to my knowledge, my engine is the only one that has built-in syntax for it. Suppose you wanted to find all IPv4 addresses in a piece of text. An IPv4 address has the following form:

xxx.xxx.xxx.xxx

Where each xxx is a number from 0 to 255. A simple approach would be the following expression (note that the period has to be escaped):

\d+\.\d+\.\d+\.\d+

We could even restrict this to 3-digit numbers:

\d{3}\.\d{3}\.\d{3}\.\d{3}

But this isn’t foolproof, either. The following text (which is not a valid IPv4 address) matches the expression:

500.899.985.451

To solve such problems, my engine allows the construction of numeric ranges (analogous to character ranges). A numeric range has the following syntax:

<x-y>

Where x and y are both non-negative integers, and y is greater than x. This will match any number between x and y, inclusive. Using this syntax, a better IPv4 address matcher can be constructed:

<0-255>\.<0-255>\.<0-255>\.<0-255>

I wrote a post about this feature, where I cover the algorithm used to implement it.
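For comparison, engines without this syntax have to expand the range into an alternation by hand. The standard expansion of <0-255> (a well-known technique, not kleingrep's algorithm) can be verified with Go's `regexp` package:

```go
package main

import (
	"fmt"
	"regexp"
)

// octet is <0-255> expanded by hand into standard alternation syntax:
// 250-255, 200-249, 100-199, and 0-99.
const octet = `(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])`

var ipv4 = regexp.MustCompile(`^` + octet + `\.` + octet + `\.` + octet + `\.` + octet + `$`)

func main() {
	fmt.Println(ipv4.MatchString("192.168.0.1"))     // true
	fmt.Println(ipv4.MatchString("500.899.985.451")) // false
}
```

The readability gap between this and <0-255>\.<0-255>\.<0-255>\.<0-255> is the motivation for the built-in syntax.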

#Flags

Most engines allow embedding flags inside an expression. These flags control case-sensitivity, multiline mode and more. Since the flags can be embedded anywhere, this is a non-trivial parsing problem. I took the easy route instead, receiving the flags as parameters to my parsing function. The actual implementation of the flags is fairly easy, and shall be left as an exercise for the reader.
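A common way to pass such flags as parameters is a bitmask; the following is a hypothetical sketch of that pattern, not kleingrep's actual API:

```go
package main

import "fmt"

// Flags is a bitmask of matching options, passed to the parser as a
// parameter instead of being embedded in the expression.
type Flags uint

const (
	CaseInsensitive Flags = 1 << iota // like 'i': ignore case
	Multiline                         // like 'm': ^ and $ match at line breaks
)

// has reports whether a given flag is set.
func (f Flags) has(flag Flags) bool { return f&flag != 0 }

func main() {
	opts := CaseInsensitive | Multiline
	fmt.Println(opts.has(CaseInsensitive)) // true
	fmt.Println(Flags(0).has(Multiline))   // false
}
```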

#Paying my dues

If you haven’t noticed already, Russ Cox’s series of articles were a huge inspiration for me. To distill the complex jargon that comes up so often in this field into language that a college sophomore like me can understand is a huge feat - and I can’t thank him enough for doing it. His work on the RE2 engine - to my knowledge, the first mainstream engine to avoid backtracking - was a slap in the face to the backtracking implementations that had dominated up to that point. Today Go, Rust and even V8 (Google Chrome’s JavaScript engine) have regex engines that avoid backtracking. As I mentioned earlier, my implementation of backreferences was completely inspired by him, as well.

Of course, if you asked him, he would likely credit Rob Pike and Ken Thompson; so I must credit them as well. Rob Pike’s work on the sam editor pioneered a lot of the techniques covered here (sidenote: sam also pioneered the use of ‘structural regular expressions’ - an improved way of searching, filtering and replacing text. If you’re interested, this forum post does an excellent job introducing the concept, although it’s focused on Perl).

I could write an entire blog post crediting Ken Thompson for his influence on modern software, but his 1968 paper on regular expressions is, in itself, worthy of a Turing award (speaking of Turing awards, if you ever want to lose your sanity, read the paper adapted from his Turing Award lecture: ‘Reflections on Trusting Trust’11). In 4 pages, most of them taken up by code, Thompson lays out an algorithm for constructing and matching regexes that’s elegant, concise and efficient. Thompson gets the biggest honor of all - I named my automaton constructor function after him.

Stephen Kleene’s work on defining the language of regular expressions underpins everything; the recognition that a regular language can be represented through a finite automaton is the basis for Thompson’s matching algorithm.

Also, a shoutout to my ECE 20875 professor at Purdue - Anuran Makur. He was the one who sparked my interest in regular expressions, and inspired me to start this project.


  1. https://swtch.com/~rsc/regexp/↩︎

  2. Ken Thompson. 1968. Programming Techniques: Regular expression search algorithm. Commun. ACM 11, 6 (June 1968), 419–422. https://doi.org/10.1145/363347.363387↩︎

  3. https://en.wikipedia.org/wiki/Regular_language↩︎

  4. https://en.wikipedia.org/wiki/Regular_expression#History↩︎

  5. https://en.wikipedia.org/wiki/Shunting_yard_algorithm↩︎

  6. https://en.wikipedia.org/wiki/Recursive_descent_parser↩︎

  7. https://www.usenix.org/conference/usenixsecurity24/presentation/su-weihao↩︎

  8. https://swtch.com/~rsc/regexp/regexp1.html↩︎

  9. https://swtch.com/~rsc/regexp/regexp2.html↩︎

  10. http://doc.cat-v.org/plan_9/4th_edition/papers/sam/↩︎

  11. https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf↩︎