A short guide to Pandoc Filters

You may know that I use Pandoc to convert Markdown pages to HTML for this website. One of Pandoc’s best features, in my opinion, is the ability to transform the syntax tree before it is converted to the output format. This is done through filters, and this post will be a quick introduction to them.

This post was motivated by an annoying problem I ran into while writing my previous post. The end of this article contains a summary of the problem, and how I solved it with filters.

#The purpose of a filter

A filter’s job is to transform the document in its intermediate representation. This ‘intermediate representation’ is always a syntax tree, regardless of the input and output document formats. This is the key to Pandoc’s elegance - the ability to represent multiple document formats through a single programmatic interface.

Filters run after the document has been processed into a syntax tree, but before it is converted into the output format. The Pandoc website puts it this way:

INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Here, the reader is the function that converts the input document into an Abstract Syntax Tree (AST), and the writer is the function that converts the AST into the output format. Note that the filter never interacts with the original documents - it only ever sees the AST.

A common use of filters is to replace one fragment of a document with another, for example replacing all bold text in a document with italic text. Every markup language represents these things differently - Markdown uses underscores or asterisks, while HTML uses tags such as <em> and <strong> (or the older <i> and <b>). A docx file has its own representation, and an ODT file has another. The filter doesn’t care - bold elements are always represented as a pandoc.Strong object, and italicized elements are always represented as pandoc.Emph.

#Creating and using filters

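Pandoc accepts filters written in any language that can read and write its JSON representation of the AST, but this post sticks to Lua filters, which Pandoc runs with its built-in Lua interpreter. In a Lua filter, each transformation is written as a Lua function. The choice of Lua is interesting, given that Pandoc is written in Haskell, but I assume it’s because Lua is easy to embed into other languages and has a simple syntax.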

Every function looks for a specific Pandoc object type - the name of the object must match the name of the function - and returns the object that it should be replaced with. The function takes in one parameter - the specific element that is being acted on. The Pandoc page on filters contains a list of recognized objects, as well as constructor functions for these objects.
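As a rough sketch of that naming rule, here is a filter function for Str elements, which hold individual words. The words being swapped are arbitrary and chosen only for illustration; the point is that returning a new element replaces the old one, while returning nothing leaves it alone.

```lua
-- Pandoc calls this once for every Str element (a single word) in the document.
function Str(elem)
    -- "colour" and "color" are placeholder words for this sketch.
    if elem.text == "colour" then
        -- Returning a new element replaces the original one.
        return pandoc.Str("color")
    end
    -- Returning nil leaves the element unchanged.
    return nil
end
```

Any element type without a matching function is simply passed through as it is.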

That’s all there is, really. To demonstrate how easy this is, let’s revisit the earlier example - replacing bolded elements with italicized ones. We are looking for objects of type Strong (requiring our filter function to be named Strong) and we will return an object of type Emph.

According to the Pandoc API, Strong elements have their content stored in the content field. After retrieving the contents of this field, we can wrap it in an Emph element by calling its constructor (the constructor only requires the one parameter) and return the new element.

Putting all this together, we can write the following Lua function. (Lua’s syntax is simple and can be picked up separately.) In this example, elem is a name I chose for the parameter - it has no special meaning; it just represents the element on which the function is called.

function Strong(elem)
    return pandoc.Emph(elem.content)
end

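Once you have a filter, place it in a Lua file and pass the file to Pandoc with the --lua-filter option, for example: pandoc input.md --lua-filter=bold-to-italic.lua -o output.html (the file names here are placeholders).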

#A practical use for filters

Replacing every bolded element with an italic one isn’t very useful. But while writing my previous blog post, I encountered a problem that necessitated the use of a filter.

The problem was this: that post had many pictures that I’d taken on a hike. The pictures were relatively high-resolution, but for practical purposes (I didn’t want the images to take up the entire screen), I showed them in a lower resolution. I still wanted you to be able to view the high-res image, so I decided to make every image a hyperlink to itself - clicking on an image will load the high-res version.

This would have been tedious as hell to do manually, and I didn’t want to resort to JavaScript, so I used a Pandoc filter instead.

Here is the filter code:

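-- Wrap an image in a link that points at the image's own source file.
function wrap_image_in_hyperlink(elem)
    local to_return = elem
    -- Only HTML output needs the link; other formats keep the plain image.
    if FORMAT == "html" then
        -- Link(content, target, title, attr): the image becomes the link's content.
        to_return = pandoc.Link(elem, elem.src, "", elem.attr)
    end
    return to_return
end

-- Pandoc finds handlers by element name, so register the function for Image elements.
return {{Image = wrap_image_in_hyperlink}}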

I verify that the output format is HTML before messing with the element. If it isn’t, the element is returned unchanged. If it is, the element is wrapped in a Link object that links to the source of the image. The attributes of the image are copied over to the link as well, although this isn’t necessary.
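A possible variation, sketched under a couple of assumptions: Pandoc sets FORMAT to the name of the writer in use, so asking for html5 explicitly would not pass an exact comparison with "html". The version below uses a looser pattern match instead, names the function Image so that Pandoc picks it up directly, and leaves out the optional title and attributes. The helper name is_html_output is made up for this example.

```lua
-- Hypothetical helper: true for any writer name containing "html"
-- (for example "html", "html4" or "html5").
local function is_html_output(format)
    return format:match("html") ~= nil
end

-- Named after the element type, so Pandoc calls it for every image.
function Image(elem)
    if is_html_output(FORMAT) then
        -- The image is the link's only content; the target is the image source.
        return pandoc.Link({elem}, elem.src)
    end
    -- Returning nil keeps the image as it is for non-HTML output.
    return nil
end
```

Whether the looser match is what you want depends on which output formats you actually build, so treat it as a starting point rather than a drop-in replacement.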

Not only did I automate this process, I also did it in the “compile-time” stage of converting Markdown to HTML, rather than the “run-time” stage, which would have required modifying the loaded document using JavaScript. Any chance to avoid JavaScript is a win in my book.