Searching for Numeric Ranges with Regular Expressions

#Searching for Numeric Ranges with Regular Expressions

Edit - I recently learned that the generated regex will only work on engines that use POSIX rules for matching. The POSIX rule specifies that, given an alternation (eg. a|b) the engine will find the longest possible match, regardless of the order of the alternation. This means that, under POSIX rules, a|b is completely equivalent to b|a. My engine currently uses POSIX rules, but I plan on implementing the standard PCRE rule. This rule specifies that the left branch of an alternation is tried first. If it succeeds, the right one is ignored. This approach is more well-defined and widely implemented, and leads to fewer bugs.

As far as this article is concerned, the only change you will have to make is reversing the order of alternation of the generated regex. So instead of a|b|c|d, the generated regex must be d|c|b|a, where the longest match is specified first. Since POSIX rules don’t care about the order, they will be unaffected by this change. I have updated the Go playground link at the bottom of the page to generate the reversed order instead.

The rest of the article has been left unedited for posterity.

Last fall, I wrote an NFA-based regex engine, inspired by Russ Cox’s articles. While I’m still ironing out a few QoL issues (stay tuned for the blog post about it), I wanted to discuss a pretty cool feature in my engine - one that I haven’t found in other engines. This is the ability to search for numeric ranges. My engine does this by specifying two numbers, separated by a hyphen and enclosed in angle brackets.

<16-523> # Will match any number between 16 and 523 inclusive

This post aims to be a deep dive into how this feature was implemented.

#Why are numeric ranges hard to match?

Fundamentally, regular expressions describe patterns of text. This is useful if I want to match ‘hello’ in the string ‘hello world’. It’s also useful if I want to match the number ‘5’, because the number simply gets treated as any other string.

Therein lies the problem. A normal regex engine doesn’t assign any semantic meaning to a number. The number gets treated as any other sequence of characters, and loses its meaning as a number.

#What about character ranges?

Character classes allow the user to specify multiple characters that can match at the current position. The characters to match are specified inside square brackets. For example, [abc] matches either a, b or c. Notice that this is just syntactic sugar for an alternation.

Character ranges are an extension of character classes, whereby a contiguous list of characters can be shortened. For example, [a-z] will match any letter from a to z.

[a-z] => a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

Let’s look at what happens when we use a character range to search for a numeric range.

Assume I want to search for a number from 0 to 255 (all 8-bit values). If I construct the following regular expression:

[0-255]

Expanding this regex gives us this:

0|1|2|5|5

The 0-2 is interpreted as “match any character between 0 and 2”, and the two fives are interpreted as “match 5 or 5”. That isn’t even close to what we wanted.

But we can make use of character ranges. For example, matching a single digit number is easy:

[0-5] # Matches a number from 0 to 5

It’s also easy to match certain double-digit ranges:

[4-6][0-9] # Matches a number from 40 to 69

Our goal is to generalize this: Given a numeric range, how do we construct a regex that uses character ranges to match all numbers in the range?

#Constructing a regex to match a numeric range

Let’s assume we are searching for the same range as above (0-255).

First, let’s notice that there are three distinct sets of numbers here:

1-digit numbers (0-9)
2-digit numbers (10-99)
3-digit numbers (100-255).

We see that, in order to match any number in this range, we will have to match any number in the three ‘subranges’. The first is easy to search for:

[0-9]

So is the second one

([1-9][0-9])

Combining the two gives us this:

([0-9])|([1-9][0-9])

This gives a regex that matches any number between 0 and 9, or any number between 10 and 99.

Now let’s focus on the third (and most challenging) subrange: 100-255. Note that the following regex will not match the subrange, because it won’t match numbers like 199.

[1-2][0-5][0-5] # Will not work!

So, let’s break this down even further. We see that this subrange is made of the following subranges (sub-sub-ranges?):

100-199
200-249
250-255

These ranges translate to the following regexes:

1[0-9][0-9]
2[0-4][0-9]
25[0-5]

Combining all of the above regexes gives us the final expression:

([0-9])|([1-9][0-9])|(1[0-9][0-9])|(2[0-4][0-9])|(25[0-5])

In words: Match any number between 0 and 9, or between 10 and 99, or between 100 and 199, or between 200 and 249, or between 250 and 255.

This looks like a mess, but it gets the job done! This regex will match any number from 0 to 255, inclusive. Sticking the entire regex into a group and surrounding it with anchors gives us a more palatable solution.

^(([0-9])|([1-9][0-9])|(1[0-9][0-9])|(2[0-9][4-9])|(25[0-5]))$

The key is to break down the range into subranges that can be expressed using character ranges.

#The algorithm

These subranges can be labeled by the closest multiple of a power of 10, that is less than or equal to the smallest number in the range. In our example above, the subranges can be labeled as follows:

0-9 - 1 x 10⁰
10-99 - 1 x 10¹
100-199 - 1 x 10²
200-249 - 2 x 10²
250-255 - 25 x 10¹

Notice the pattern: We go ‘vertically’ from a power of 0, to 1, to 2. Then we go ‘horizontally’ along the same power (from 100 to 200), then come back down vertically to a power of 1. This is the key to finding these ranges.

#The implementation

Now that we’ve gone over the algorithm, the implementation is actually fairly simple. I created a Go function with the following signature:

func range2regex(start int, end int) string {
    ...
  }

Given the start and end index of a range, it constructs a regex that matches all numbers in the range and returns it as a string. The regex is constructed using the ‘subrange’ method discussed above. When the engine encounters a numeric range (which I have defined as <start-end>), it simply replaces it with the equivalent regex.

I created this snippet on the Go Playground to test this algorithm. The start and end variables determine the range of numbers matched by the generated expression.