Two More Cents

ccat and the power of Regular Expressions

This summer, I worked on a personal project to create a syntax-highlighting tool. It was designed to be a replacement for cat, whenever I wanted to view source code (hence the name: ccat == color cat). While I could have used Vim to view the code, I wanted a tool that was non-interactive, so that I wouldn’t accidentally make changes. Along the way, I came to appreciate the power of regular expressions. In this post, I’d like to describe the tool and the lessons I learned making it.

Picking a language

While this often devolves into pointless bikeshedding, picking a language was a crucial part of my project. I wanted one that was compiled, cross-platform, and able to produce static binaries.

After designing a quick prototype in C (which was less-than-fun), I realized that Go was probably the best language for a tool like this. It ticked all of my boxes. It was compiled, cross-platform and produced static binaries. It’s also very well suited for command-line tools, with built-in support for things like flag-parsing, which might explain its increasing use in CLI applications.

How it works

Broadly speaking, here’s how the tool works:

  1. You provide a file as a command-line argument.
  2. If the file has an extension, a corresponding config file is loaded (of the form <extension>.conf). If it doesn’t, the file is printed as-is.
  3. The config file is applied to the input file, and each character in the file is assigned a color.
  4. Each character is printed out in its color (credit to https://github.com/fatih/color)
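Steps 1 and 2 can be sketched in a few lines of Go. This is only an illustration of the <extension>.conf lookup described above, not ccat's actual code; the helper name configFor is made up for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// configFor (a hypothetical helper) returns the config file name for a
// source file, following the <extension>.conf convention. The boolean is
// false when the file has no extension, in which case the file would be
// printed as-is.
func configFor(path string) (string, bool) {
	ext := strings.TrimPrefix(filepath.Ext(path), ".")
	if ext == "" {
		return "", false
	}
	return ext + ".conf", true
}

func main() {
	for _, p := range []string{"main.c", "server.go", "Makefile"} {
		if conf, ok := configFor(p); ok {
			fmt.Printf("%s -> %s\n", p, conf)
		} else {
			fmt.Printf("%s -> no extension, print as-is\n", p)
		}
	}
}
```

Where the config files live on disk is not covered here, so the sketch only computes the file name.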

I want to focus on step 3. This was the most challenging part of the project, and I spent many hours wondering how to implement it. It was clear that each language would need its own configuration file; I just wasn’t sure what that file should look like. I searched for libraries that would perform grammatical analysis of languages, but decided against them because of their complexity. At some point, I realized that I didn’t actually need such a library for this: the solution lay in regular expressions.

Syntax analysis with regex

It might be my love for UNIX and Linux, but I’ve always been fascinated with regex. The idea of a language to define patterns in text is simple yet brilliant, and it’s no wonder that regular expressions are so prevalent today.

But I’d never thought of them being used to find patterns in code. On the surface, it makes perfect sense: code is text, after all, and it can be searched just like any other text can. Once I realized this, I immediately set to work implementing regex matching in my program. Lucky for me, Go has a built-in regexp package which, while it doesn’t support PCRE, is blazing-fast - exactly what I needed to compete with a tool like cat.

In some ways, I was extremely fortunate that Go’s regex library implemented a specific method:

func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

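It returns a slice with one entry per match of the regex in the given byte slice; each entry holds the start and end indices of the match, followed by the indices of any capture groups. Passing a negative n asks for all matches. This method saved me the pain of implementing something similar myself, and is really the backbone of the entire tool.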
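To see what those indices look like, here is a small sketch that runs the quoted-text pattern over a line of C. The helper name quotedSpans and the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// quoted matches double-quoted text and captures the part between the quotes.
var quoted = regexp.MustCompile(`"(.*?)"`)

// quotedSpans (a hypothetical helper) returns the index pairs for every
// match of the quoted-text pattern in src.
func quotedSpans(src []byte) [][]int {
	// n = -1 asks for every match. Each inner slice is laid out as
	// [matchStart, matchEnd, group1Start, group1End].
	return quoted.FindAllSubmatchIndex(src, -1)
}

func main() {
	src := []byte(`printf("hi"); puts("bye");`)
	for _, loc := range quotedSpans(src) {
		fmt.Printf("match %q at [%d,%d), group %q\n",
			src[loc[0]:loc[1]], loc[0], loc[1], src[loc[2]:loc[3]])
	}
}
```

Because the result is just byte offsets, it maps directly onto the idea of giving every character in the file a color.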

The configuration file

Once I’d figured out the regex aspect of the tool, I could focus on the config file. In fact, it turned out to be easier than I’d expected. Each line of the config file would look like this:

REGEX: COLOR

Where every match of REGEX would be printed in COLOR. For example, my c.conf contains the following lines:

'//.*': GRAY        # Print single-line comments in gray
'"(.*?)"': BLUE     # Print quoted text in blue
'\<(.*?)\>': BLUE   # Print text in angled brackets (e.g. after #include) in blue

These are just a few examples, but I hope they illustrate the simple beauty of such a solution. By treating my source code as plain text, I was able to search for patterns which match elements (strings, comments etc.) of programming languages.
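For the curious, here is one way a line of that form could be turned into a compiled rule. This is a sketch under assumptions: the rule type, the parseRule helper, and the exact line grammar (single-quoted regex, colon, color name, optional # comment) are guesses based on the three lines above, not ccat's real parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a compiled pattern with the color name to print its matches in.
type rule struct {
	re    *regexp.Regexp
	color string
}

// linePattern is an assumed shape for one config line: a single-quoted
// regex, a colon, a color name, and an optional trailing # comment.
var linePattern = regexp.MustCompile(`^'(.*)':\s*(\w+)\s*(?:#.*)?$`)

// parseRule (a hypothetical helper) converts one config line into a rule.
func parseRule(line string) (rule, error) {
	m := linePattern.FindStringSubmatch(line)
	if m == nil {
		return rule{}, fmt.Errorf("malformed config line: %q", line)
	}
	re, err := regexp.Compile(m[1])
	if err != nil {
		return rule{}, err
	}
	return rule{re: re, color: m[2]}, nil
}

func main() {
	lines := []string{
		`'//.*': GRAY        # Print single-line comments in gray`,
		`'"(.*?)"': BLUE     # Print quoted text in blue`,
		`'\<(.*?)\>': BLUE   # Print text in angled brackets (e.g. after #include) in blue`,
	}
	for _, l := range lines {
		r, err := parseRule(l)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-12s -> %s\n", r.re, r.color)
	}
}
```

Using a regex to parse the regex config keeps the whole thing in the same spirit as the tool itself.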

It also means I need to know nothing about a programming language, beyond its basic syntax, to create a config file for it. If the keywords of the language follow a pattern (as they should), then creating a config file is trivial.

Quite elegant, don’t you think?

What I learned

Creating this project showed me that the solution to your problem can often be hiding under your nose. In my case, I use regular expressions with grep fairly often for analyzing log files and extracting data, but I hadn’t considered their use in analyzing source code. It also gave me a greater appreciation for regular expressions, and how a simple set of rules can be used in so many different ways.