Creating a Static Site Generator
“There comes a time in every rightly constructed blogger’s life, when he has a raging desire to build his own Static Site Generator”.
Whether or not that statement is true, I’ve seen enough anecdotal evidence in favor of it. Just look at this thread on Hacker News, for instance. Obviously, this is a subset of bloggers: Only the most tech-savvy ones go as far as to write their own tools.
In any case, a couple of weeks ago, I happened to fall victim to this desire. The end result was a tool I call s4g. It stands for ‘stupid, simple, static site generator’, and it lives up to its name: It definitely is stupid, and it’s relatively easy to use. In this post, I’m going to give you a quick overview of this tool, my motivation behind creating it, and how I use it to write my pages. I will also show a couple of neat things that the tool does, which save me a lot of time.
Goal / Purpose
I wrote this tool for a simple reason: I didn’t want to write raw HTML. That’s how my pages used to be written, and I found it slow, cumbersome and unwieldy. Markdown seemed like a much more elegant way of writing pages but, obviously, web browsers cannot render Markdown.
At the same time, I also wanted to automate a few of the tasks I used to perform manually. This includes adding the published date to my articles, adding an ‘index’ page that contained a list of articles, etc.
Finally, I wanted the tool to work on an arbitrarily large number of files. I wanted the ability to run it on a directory of HTML files (some of which could be nested in other directories), and replicate the folder structure somewhere else.
That gave me the following objectives:
- Convert Markdown files to HTML
- Add a header and footer to the converted HTML files.
- Copy the same folder structure, and copy over any static assets (like images).
- Figure out some way to add metadata (such as the title and published date) to the document.
So what does it do?
At the moment, s4g is a 200-something line shell script. I know that sounds atrocious, but the reason behind this decision is that I rely on Pandoc to convert my Markdown files to HTML.
The tool itself, however, is straightforward and easy to use. It expects a few files and directories to be present in the working directory:
- styles.css - File - Global CSS file, applied to every web page.
- header.html - File - Header file, prepended to all blog posts.
- footer.html - File - Footer file, appended to all blog posts.
- source - Directory - Contains a folder structure with source (i.e. Markdown) files.
- fonts - Directory - Contains a list of fonts, which can be linked to in styles.css.
- files - Directory - Global files and assets, e.g. to link to from the home page.
- pandoc_filters - Directory - List of Lua filters for Pandoc.
As mentioned before, this tool is stupid, and so it’s going to crash if any of these files or directories is missing. I plan to fix this in the near future.
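A pre-flight check would be enough to fail gracefully here. Something like the following sketch could run before any work begins (this is hypothetical, not s4g’s actual code; the setup lines exist only to make the example self-contained):

```shell
# Hypothetical pre-flight check, run before any conversion work begins.
# (Sketch only; not s4g's actual code.)
workdir=$(mktemp -d) && cd "$workdir"
touch styles.css header.html footer.html      # required files
mkdir source fonts files pandoc_filters       # required directories

missing=0
for f in styles.css header.html footer.html; do
    [ -f "$f" ] || { echo "missing file: $f" >&2; missing=1; }
done
for d in source fonts files pandoc_filters; do
    [ -d "$d" ] || { echo "missing directory: $d" >&2; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all required files and directories found"
```

In a real script, a non-zero `missing` would be grounds for an early `exit 1` with a readable error message, rather than a crash halfway through.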
Once it finds everything it needs, s4g gets to work converting your Markdown files to ready-to-display web pages.
The process goes like this:
- Create a copy of the source directory (called output).
- Delete all .md files in the output directory. Any other files, such as images or PDFs, are left untouched.
- Convert all Markdown files in source to HTML files, and copy them into the appropriate locations in output.
- Add a header and footer to each converted HTML document.
By the end of this process, I have a complete, ready website under output.
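The whole pipeline can be sketched in a few lines of shell. This is an illustration of the four steps above, not s4g’s actual code; the Pandoc call is stubbed out with a plain copy when Pandoc isn’t installed, so the sketch runs anywhere:

```shell
# Sketch of the four-step process above (illustrative, not s4g's code).
workdir=$(mktemp -d) && cd "$workdir"
mkdir -p source/posts
printf '# Hello\n' > source/posts/first.md
printf '<html><body>\n' > header.html
printf '</body></html>\n' > footer.html

cp -R source output                        # 1. copy the source tree
find output -name '*.md' -delete           # 2. delete .md files in the copy
find source -name '*.md' | while read -r md; do
    out="output${md#source}"
    out="${out%.md}.html"
    if command -v pandoc >/dev/null 2>&1; then
        pandoc "$md" -o "$out"             # 3. convert Markdown to HTML
    else
        cp "$md" "$out"                    # stand-in when pandoc is absent
    fi
    cat header.html "$out" footer.html > "$out.tmp" \
        && mv "$out.tmp" "$out"            # 4. add the header and footer
done
```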
As you can see, however, the tool is highly opinionated. It does things a certain way, and crashes if you try to do things differently. I created it because it fits well with my workflow, but it’s obviously not a ‘one-size-fits-all’ solution.
Tweaks / Tips
I could have fit the functionality described above into a 100-line script. So why is s4g 200 lines? Because it does a couple of other things in addition to its main job of converting Markdown to HTML:
- Parse metadata (provided in a key: value format) from the top of my Markdown pages, and use that metadata anywhere else in the document.
- Generate a page list containing all of my web pages, sorted from newest to oldest.
Let’s see how it does each of these, because I really think this is the coolest part of the tool.
Metadata
At the top of every Markdown file, I define a list of key-value pairs separated by a colon. These act as ‘variables’ that I can use later on in the document. To reference a variable, I use the key, wrapped in double dollar signs. For example:
page_title: Hello there!
date: Jan 19, 2024
## \$\$page_title\$\$
This post was published on \$\$date\$\$.
(The reason I use \$\$ instead of $$ is that Pandoc treats text between dollar signs as a LaTeX math expression. To get a literal dollar sign, you need to escape it.)
Then, in my script, I replace any instance of $$key$$ with the corresponding value. This is especially helpful when it comes to adding a title for the web page, since the <title> tag is defined in the header file. This means that I cannot set the title directly, since the same header is applied to all pages. Instead, I place a variable inside the tag in the header, and replace it with the page’s title in my script.
That was probably not a very good explanation, so let me show you a quick example:
$ cat header.html
<html>
<head>
<title>$$page_title$$</title>
</head>
<body>
$ cat file.md
page_title: Page 1
This is the first page!
Let’s ignore the footer for the moment. After running the script:
- The metadata is read and parsed, then stripped from the Markdown file.
- The Markdown file is converted to HTML.
- The header is prepended to the resulting HTML file.
- Every instance of $$page_title$$ in the HTML file is replaced with Page 1, giving me this document:
$ cat file.html
<html>
<head>
<title>Page 1</title>
</head>
<body>
This is the first page!
As you can see, this saves me a ton of time, as I can define the title in the same document as my content.
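The parsing and substitution steps can be sketched in Bash. Everything below (the regex, the file names) is an illustrative assumption rather than s4g’s actual code, and it relies on Bash 4+ for associative arrays and on GNU sed’s -i flag:

```shell
#!/bin/bash
# Sketch of metadata parsing and substitution (illustrative; assumes
# Bash 4+ associative arrays and GNU sed's -i flag).
workdir=$(mktemp -d) && cd "$workdir"
cat > file.md <<'EOF'
page_title: Page 1
This is the first page!
EOF
printf '<title>$$page_title$$</title>\n' > header.html

declare -A meta
# Read "key: value" lines from the top of the file until one doesn't match.
while IFS= read -r line; do
    [[ "$line" =~ ^([a-z_]+):\ (.*)$ ]] || break
    meta[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
done < file.md

tail -n +"$(( ${#meta[@]} + 1 ))" file.md > body.md   # strip the metadata
cat header.html body.md > file.html                   # prepend the header
for key in "${!meta[@]}"; do                          # substitute $$key$$
    sed -i "s/\\\$\\\$$key\\\$\\\$/${meta[$key]}/g" file.html
done
```

One caveat with this approach: values containing sed-special characters (such as / or &) would need extra escaping before being used in the substitution.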
Generate a page listing
The other neat feature of this tool is the generation of a ‘page list’, which shows all of my published web pages, sorted by the date on which they were published.
I do this by parsing each web page for a published date (which can be accomplished with a simple grep invocation). I then put this information (the file name and the published date) into an array and sort it. Finally, the array is written to a file, which is used to generate the page list.
The catch here is that most of my articles aren’t written in a single day. Therefore, I cannot hardcode a date when I begin writing, as that wouldn’t reflect the date on which I publish the article. My solution was to add a date: auto metadata tag, which automatically resolves to the date on which the file was last modified (obtained with the date command). The date is therefore updated automatically whenever I modify the file. For files that have a static date (i.e. those that were published before I started using s4g), I can still specify a hardcoded date.
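Put together, the listing logic might look like the sketch below. It is illustrative rather than s4g’s actual code, and it makes two assumptions: dates are written in ISO format so a plain sort orders them correctly, and GNU date’s -r flag is available for the mtime lookup:

```shell
# Sketch of building a newest-first page list (illustrative; assumes
# ISO-formatted dates and GNU date's -r flag for file mtimes).
workdir=$(mktemp -d) && cd "$workdir"
printf 'date: 2024-01-19\n' > old-post.md
printf 'date: auto\n' > new-post.md

: > pagelist.txt
for page in *.md; do
    published=$(grep -m1 '^date:' "$page" | cut -d' ' -f2-)
    if [ "$published" = auto ]; then
        # Resolve "auto" to the file's last-modified date.
        published=$(date -r "$page" '+%Y-%m-%d')
    fi
    printf '%s\t%s\n' "$published" "$page" >> pagelist.txt
done
sort -r pagelist.txt -o pagelist.txt   # newest first
```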
Conclusion
Obviously, this tool is not for general public use. It is much too opinionated and extremely janky. However, I think it goes to show that creating a static site generator isn’t terribly hard, and can teach you a lot about the language you write it in: for example, I had no idea that Bash supported associative arrays until I wrote this project. More importantly, though, it can help make blogging a more fun experience, by allowing you to focus on the content of your posts, rather than the formatting.