Static Site Generation

Creating a Markdown to HTML converter with Regular Expressions in Python

October 8, 2021

Optimizing Workflows ⚙️

When I first began this website, my workflow for getting content from my head to the site looked something like this:

  1. Write content in Microsoft OneNote
  2. Manually write out HTML, copying and pasting content for each HTML tag
  3. Proofread as web page

If I needed to change the content after manually converting it to an HTML page, I had to update both the OneNote document and the HTML page to keep the source content and the HTML in sync.

The thing that I didn't anticipate was that I would be making a lot of small changes during the proofreading phase. Between rephrasing sentences, reorganizing paragraphs, and just plain old spelling mistakes, it took a considerable amount of time to get from a finished first draft to a live web page.

I made three posts like this before I got annoyed enough to automate it.

I wanted a way to go directly from content to finished web page. If there were changes to the content then that should update in the HTML as well. The first thing that I did was move from Microsoft OneNote to Obsidian. This allowed me to create content in Markdown and manage it with GitHub.

If you aren't familiar, Markdown is a way to write formatted text (think text that has hyperlinks, or bold items, or lists; that kind of stuff) as plain text. So if you wanted to write the word "Title" and have it be bold, you would write:


**Title**
  

With that, all of my content was now in an easy-to-handle format. I just had to write a Python script to convert these Markdown files to HTML that matched my website.

Crowded Waters 🌊

If you Google "static site generator" you'll get almost 76 million results. If you Google "Markdown to HTML" you'll get almost 1.7 billion results. It's safe to say that whatever I needed was probably out there already. No need to reinvent the wheel.

However, there's something to be said about making your own tools; doing something yourself. I feel that I gain a better appreciation for when I do have to use a tool if I've created something similar myself.

Plus it just seemed interesting and fun, which was good enough for me.

Python and Regular Expressions 🐍

There's a full-featured Python library for handling Markdown already. If I were going for the cleanest and most robust code possible, I would leverage that. But if I'm going to do it myself, I might as well do it all myself.

So instead I decided to use regular expressions.

Regular expressions (shortened to RegEx) are a syntax for pattern matching in text. You can search for literal characters in a text, like "name", or "23", but the real power of RegEx comes from its ability to search for things where you don't know or care exactly what you're looking for.

Let's say you're searching a document for phone numbers. Well, a phone number may be written as "123-456-7890" or just "1234567890". So a simple regular expression to capture both of those would be:


\d{3}-?\d{3}-?\d{4}
  

Where "\d" means any digit, "{#}" means exactly # repetitions of what precedes it, and "-?" means an optional (?) "-" character.
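
To try that pattern in Python, here is a minimal sketch. The sample sentence and the names PHONE_PATTERN and sample are made up for illustration.

```python
import re

# The phone-number pattern from above, written as a raw string so the
# backslashes reach the RegEx engine unchanged.
PHONE_PATTERN = re.compile(r"\d{3}-?\d{3}-?\d{4}")

# Made-up sample text for illustration.
sample = "Call 123-456-7890 or 1234567890, but not 12-34."

# findall returns every non-overlapping match as a list of strings.
matches = PHONE_PATTERN.findall(sample)
print(matches)  # ['123-456-7890', '1234567890']
```

The r prefix is worth the habit: without it, Python itself tries to interpret sequences like "\d" before the RegEx engine ever sees them.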

If you aren't familiar with regular expressions then check out regex101 where you can try it out for yourself. I won't break down all of the RegEx to explain their meaning, so if you don't understand something then use that website as a reference; they do a great job explaining what each part of the expression does.

The first thing that I wanted to change was any instance of "<" or ">" that I had in my Markdown to the appropriate HTML character code. I often use code snippets, so it's common that those characters pop up in various places of the content that I write. Since HTML uses those as part of its syntax, it's best to replace them with the safe version ("&lt;" and "&gt;").

I don't need any fancy RegEx for this since I'm only searching for the literal characters "<" and ">". In fact, I could use Python's built-in replace() method, but since I'm writing about RegEx, I'll stick to that.

The general idea for using regular expressions is that you provide a search pattern and a replacement pattern. All instances of the search pattern are replaced using the replacement pattern. In this case, the search patterns would be Markdown and the replacement patterns would be HTML.


import re

# Change any < or > to &lt; and &gt;
pattern = "<"
replPattern = "&lt;"
markdownText = re.sub(pattern, replPattern, markdownText)
pattern = ">"
replPattern = "&gt;"
markdownText = re.sub(pattern, replPattern, markdownText)
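
For comparison, the same replacement without RegEx might look like the sketch below. The function name escape_angle_brackets and the sample strings are hypothetical. The standard library's html.escape is included as an alternative because it also converts "&", which the two replacements above leave alone.

```python
import html


def escape_angle_brackets(text: str) -> str:
    """Replace literal < and > using plain string replacement."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


# Made-up code snippet of the kind that might appear in a post.
snippet = "if a < b and b > c:"
escaped = escape_angle_brackets(snippet)
print(escaped)  # if a &lt; b and b &gt; c:

# html.escape also converts "&"; quote=False leaves quotation marks as-is.
full = html.escape("a < b && c > d", quote=False)
print(full)  # a &lt; b &amp;&amp; c &gt; d
```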
  

Capturing the Good Stuff 🕸️

Another powerful aspect of RegEx is the ability to capture selections. In your regular expression you can say, "keep everything that matches this part of the pattern," so that you can save it for later use.
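
As a small sketch of what capturing looks like in Python, here is the earlier phone-number pattern with parentheses added around each run of digits. The sample sentence and variable names are made up for illustration.

```python
import re

# Same phone-number pattern as before, but each run of digits is wrapped
# in parentheses so it becomes a numbered capture group.
pattern = r"(\d{3})-?(\d{3})-?(\d{4})"

match = re.search(pattern, "Reach me at 123-456-7890.")

# group(0) is the whole match; group(1), group(2), ... are the captures.
area_code = match.group(1)
print(area_code)       # 123
print(match.groups())  # ('123', '456', '7890')
```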

Using this idea, I can convert a "Heading level 1" in Markdown to a "Heading level 1" in HTML. The conversion would be going from this Markdown:


# Heading level 1
  

To this HTML:


<h1>Heading level 1</h1>
  

Again, I won't explain all of the regular expression syntax (that can be Googled), so I'll just provide the code snippet for finding and replacing level 1 headings:


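pattern = r"^# (.+)\n"
replPattern = r"<h1>\g<1></h1>\n"
# re.MULTILINE lets ^ match at the start of every line, not just the start of the text
markdownText = re.sub(pattern, replPattern, markdownText, flags=re.MULTILINE)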
  

The main thing to see here is that I capture everything within the parentheses and place it where the "\g<1>" is.

I used a similar pattern for level 2 and 3 headings, so I won't show those for the sake of brevity.
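
If you'd rather not repeat the pattern per level, one possible way to fold levels 1 through 3 into a single substitution is sketched below. This is an illustration, not the approach used in the script; the names HEADING_PATTERN, replace_heading, and convert_headings are hypothetical.

```python
import re

# Group 1 is the run of "#" characters, group 2 is the heading text.
# re.MULTILINE lets ^ and $ match at every line of the text.
HEADING_PATTERN = re.compile(r"^(#{1,3}) (.+)$", flags=re.MULTILINE)


def replace_heading(match: re.Match) -> str:
    """Build the HTML heading tag from the number of # characters."""
    level = len(match.group(1))
    return f"<h{level}>{match.group(2)}</h{level}>"


def convert_headings(markdown_text: str) -> str:
    # re.sub accepts a function as the replacement; it is called per match.
    return HEADING_PATTERN.sub(replace_heading, markdown_text)


sample = "# Title\n## Section\nBody text\n### Detail\n"
converted = convert_headings(sample)
print(converted)
```

Passing a function instead of a replacement string is what makes this work: the heading level is computed from the captured "#" characters rather than hard-coded in three separate patterns.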

✳︎✳︎ Being Bold ✳︎✳︎

One fun regular expression to write was for finding bold items. As I said earlier, you can bold text by sticking two asterisks on either side like **this**.

But there's one problem. The asterisk is used as part of the RegEx syntax. You place one after something you want to find zero or more times. To get around this you have to use an escape character ("\") to tell the RegEx parser that we want the literal character "*". This means our pattern can't be:


**(.+)**
  

Where the "." matches any character and the "+" means one or more of them. Instead, it has to be:


\*\*(.+)\*\*
  

But there's another issue that comes up with this expression. If you have two bold items on the same line, then it will bold everything between the first "**" and the last "**", because "+" is greedy and matches as much as it can. So instead of having your sentence be:

Lorem ipsum dolor sit amet

You get:

Lorem ipsum dolor sit amet

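This is easily fixed by only letting the capture group match characters that aren't asterisks, so the match stops at the first closing "**". (Inside a character class, "[^\*\*]" means the same as "[^*]", so the trade-off is that the bold text itself can't contain an asterisk.) Like so: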


\*\*([^\*\*]+)\*\*
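
To see the difference side by side, here is a small sketch. The sample line, the constant names, and the choice of the <strong> tag are assumptions for illustration. The lazy quantifier ".+?" is a different technique from the negated character class above; it is included because it also stops at the first closing "**" while still allowing single asterisks inside the bold text.

```python
import re

# Made-up sample line with two bold items.
line = "**Lorem** ipsum **dolor** sit amet"

GREEDY = r"\*\*(.+)\*\*"      # first attempt from above
NEGATED = r"\*\*([^*]+)\*\*"  # same behavior as [^\*\*]
LAZY = r"\*\*(.+?)\*\*"       # alternative: match as little as possible
REPL = r"<strong>\g<1></strong>"

greedy = re.sub(GREEDY, REPL, line)
negated = re.sub(NEGATED, REPL, line)
lazy = re.sub(LAZY, REPL, line)

print(greedy)   # <strong>Lorem** ipsum **dolor</strong> sit amet
print(negated)  # <strong>Lorem</strong> ipsum <strong>dolor</strong> sit amet
print(lazy)     # <strong>Lorem</strong> ipsum <strong>dolor</strong> sit amet

# The two fixes differ once the bold text contains a single asterisk.
nested = "**a *b* c**"
print(re.sub(NEGATED, REPL, nested))  # **a *b* c** (no match, unchanged)
print(re.sub(LAZY, REPL, nested))     # <strong>a *b* c</strong>
```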
  

There is another way to bold text in Markdown, and that's by wrapping it in double underscores (__like this__). This is handy because text is italicized with a single asterisk, so if you want text that's bold and italic, you can simply write it as __*example*__. [1]
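
A rough sketch of how both bold styles plus italics could be handled in two passes is shown below. The pattern names, the <strong> and <em> tags, and the sample text are assumptions for illustration, and a real script would need more care (for example, this would also turn a Python name like __init__ bold).

```python
import re

# Pass 1: bold. Group 1 is the delimiter (** or __), and the backreference
# \1 requires the closing delimiter to be the same as the opening one.
BOLD_PATTERN = r"(\*\*|__)(.+?)\1"
# Pass 2: italics, using whatever single asterisks are left over.
ITALIC_PATTERN = r"\*([^*]+)\*"

# Made-up sample text.
text = "__*example*__ and **bold** and *italic*"

text = re.sub(BOLD_PATTERN, r"<strong>\g<2></strong>", text)
text = re.sub(ITALIC_PATTERN, r"<em>\g<1></em>", text)
print(text)
# <strong><em>example</em></strong> and <strong>bold</strong> and <em>italic</em>
```

Running the bold pass first matters: if italics went first, the single-asterisk pattern could start matching inside the "**" delimiters.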

Hyperlinks ↗️

The last part worth highlighting is how hyperlinks get converted.

In Markdown, you write a hyperlink as:


[display text](hyperlink)
    

The problem, just like asterisks and bold text, is that brackets and parentheses have syntactic meaning in RegEx. Let's start off by just finding and capturing the display text that's in the square brackets.

First, we want to find any literal instance of [ and capture everything until we hit a ]. That would look like this:


\[(.+)\]
  

Unfortunately, we run into the same issue as with bold text: if there's more than one ] on a line, then this will capture everything until the last ]. To remedy this, we capture everything until the first ]:


\[([^\]]+)\]
  

The same pattern can be used for the parentheses:


\(([^\)]+)\)
  

Combining them and replacing them with the correct tag in HTML results in this code:


pattern = r"\[([^\]]+)\]\(([^\)]+)\)"
replPattern = r'<a href="\g<2>">\g<1></a>'
markdownText = re.sub(pattern, replPattern, markdownText)
  

That RegEx pattern is quite the sight to behold.
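
If the density bothers you, Python's re.VERBOSE flag lets you spread a pattern over several lines and comment each piece. The sketch below is the same hyperlink pattern in that style; the name LINK_PATTERN and the sample sentence are made up for illustration.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a
# comment, so each piece can be explained in place.
LINK_PATTERN = re.compile(
    r"""
    \[          # literal opening square bracket
    ([^\]]+)    # group 1: display text, anything up to the first ]
    \]          # literal closing square bracket
    \(          # literal opening parenthesis
    ([^)]+)     # group 2: link target, anything up to the first )
    \)          # literal closing parenthesis
    """,
    flags=re.VERBOSE,
)

# Made-up sample sentence with two links.
sample = "Try [regex101](https://regex101.com) or [this](https://example.com)."
result = LINK_PATTERN.sub(r'<a href="\g<2>">\g<1></a>', sample)
print(result)
```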

Wrapping Up 🎁

I wanted this post to be less of a "how-to" guide and more of a "check this out".

The RegEx patterns that I've shown here cover maybe 10% of what my final script contains. A lot of the patterns are similar to what I've already shown, or are very specific to the formatting of my website, so it didn't pay to show everything.

Making this static site generator not only reduced my time spent writing HTML, but also taught me quite a bit about regular expressions; and you can't ask for much more from a personal project than providing utility and expanding your own knowledge.






Footnotes

[1] You can actually write it as ***example***, but that's a bit more difficult to parse.