Django’s templating engine is simpler than you think

I’ve always wondered how Django’s template engine works. How does it go about parsing and making sense of all the markup we write in our templates? I figured it had to involve a Parser and a Lexer, because I paid attention in Compiler class back in school. I also saw a post on SO that prompted me to finally pursue the answer to this long-standing question. So I went digging into Django’s source code on GitHub, trying to find out which parser and lexer the Django core team opted to use for template parsing. I was pleasantly surprised at the answer!

It turns out that, for starters, Django built its own Lexer and Parser! And… it’s all done in about 750 lines of code. Additionally, the implementation is quite elegant, even ingenious. So let’s dig in and discover just how template parsing is accomplished.

Building a node list

Let’s start with a simple example: you have a string of text and a context to go along with it, and you want to render that out.
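Something like this, say (a minimal sketch; the greeting string and the name variable are just placeholders):

```python
from django.template import Context, Template

# Assumes Django settings are already configured (e.g. inside a project,
# or after calling django.conf.settings.configure()).
template = Template("Hello {{ name }}!")
context = Context({"name": "World"})

print(template.render(context))  # -> "Hello World!"
```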

When you construct a Template object, Django stores the text as the source and then proceeds to compile a node list. A node list, as the name implies, is simply a list of nodes. These nodes can be text sequences, control blocks (e.g. for, if) or comments. It’s basically the template broken down into digestible pieces that can be rendered separately. To render the template, each node is rendered in turn and the resulting content is concatenated together. So it’s important to get that node list.

So of course… a lot of things happen under the hood to get that node list together. A lexer and a parser are involved in the process, and both have to be called in turn before the node list is produced.

The purpose of the Lexer is to recognize the different symbols that are permissible in the language. The Lexer is responsible for scanning a stream of characters and outputting a stream of tokens. The tokens are passed to the Parser, whose responsibility is combining the symbols to actually get real work done. So the Lexer just grabs the individual bits and pieces and the parser puts them together to make logical sense of them.

What are the symbols the Lexer is looking for?

  • Text
  • Variable start and end tags {{ and }}
  • Block start and end tags {% and %}
  • Comment start and end tags {# and #}
  • Special symbols used in filters | and :
  • Variable attribute separator . (the dot)

So using the example string with the greeting above, the Lexer  is responsible for extracting text and the parts of the variable. Let’s see how that happens.

The Lexer

The Lexer takes the template source as an input and immediately calls the tokenize  method. Just as a reminder, this is all happening when the Template is initialized with some content. We haven’t gotten to render just yet.

Regular Expressions are awesome!

This is where the genius really starts. Remember the Lexer’s purpose is to break text into a stream of tokens to be passed to the Parser. The way this is done is via the use of an intriguingly simple Regular Expression.

I really thought this would be some super complex expression but it turns out it’s really simple. All it’s saying is look for blocks or variables or comments. It uses the pipe ( | ) character to say “or”.

In order to keep things clean and readable, Django uses re.escape to escape the tag characters. This function backslashes all non-alphanumeric characters; without it, the regular expression would look unwieldy. String formatting is then used to drop each of the escaped delimiters into the pattern. As outlined earlier, the regular expression is really three different parts (blocks, variables and comments). The block regex is {%.*?%} , the variable regex is {{.*?}}  and the comment regex is {#.*?#} . In the .*?  part of each expression, the dot .  matches any single character and the star *  means zero or more occurrences. Adding the ?  says don’t be greedy: match as little as possible, stopping as soon as the ending sequence appears. Which means match {{ , then match anything you want until you see }} .
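Here’s a sketch of how that construction looks, paraphrased from Django’s source (the constant names simply hold the literal tag delimiters):

```python
import re

BLOCK_TAG_START, BLOCK_TAG_END = '{%', '%}'
VARIABLE_TAG_START, VARIABLE_TAG_END = '{{', '}}'
COMMENT_TAG_START, COMMENT_TAG_END = '{#', '#}'

# re.escape backslashes anything non-alphanumeric, so the delimiters can be
# dropped straight into the pattern without breaking it.
tag_re = re.compile('(%s.*?%s|%s.*?%s|%s.*?%s)' % (
    re.escape(BLOCK_TAG_START), re.escape(BLOCK_TAG_END),
    re.escape(VARIABLE_TAG_START), re.escape(VARIABLE_TAG_END),
    re.escape(COMMENT_TAG_START), re.escape(COMMENT_TAG_END),
))
```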

A quick note from the Python docs on non-greedy matching.

If the regular expression <.*> is matched against ‘<H1>title</H1>’, it will match the entire string, and not just ‘<H1>’. Adding ‘?’ after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only ‘<H1>’.

So now we know the Lexer  starts by compiling a regular expression that enables Django to find the start and end tags for variables, blocks and comments. All together, that regular expression would look something like this (if we didn’t care about escaping characters):
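Roughly this (written out by hand here, so treat it as illustrative):

```python
tag_re = re.compile(r'({%.*?%}|{{.*?}}|{#.*?#})')
```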

The next thing that happens is that it uses this regular expression to split the template string. This is where things really start to get interesting!

If you split the string using the regex, you get back a list of strings, with every other item being a matched symbol; in our case, it matched {{ name }} . This is not the normal behaviour for the split method. Usually when you call split on a compiled regex, the matches are not returned in the resulting list. However, here’s another note from the Python re  docs.

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

So the docs say that if the pattern has “capturing parentheses”, the matched expressions are returned in the result. If you remember our regular expression, the whole thing is wrapped in parentheses, meaning our list contains the matched expressions. In other words…

So our list looks like this…
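Using the greeting template from earlier (illustrative output):

```python
>>> tag_re.split("Hello {{ name }}!")
['Hello ', '{{ name }}', '!']
```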

So every other item in the list is a matched expression. The Lexer  now needs to parse this list of symbols to create tokens for the Parser .

Creating tokens

Given we now have a list of matched symbols nicely organized with the text in between, we have enough knowledge to create tokens for the Parser.

Let’s talk briefly about how tokens are modeled by the Django template core. Ok, so no surprise, Django declares a Token  object that is used to represent a token. Each token has a token type, contents, line number and an optional position tuple that stores the beginning and the end of the token. So the class basically looks like this…
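Trimmed down and paraphrased a little, it looks something like this:

```python
class Token:
    def __init__(self, token_type, contents, position=None, lineno=None):
        # token_type is one of TOKEN_TEXT, TOKEN_VAR, TOKEN_BLOCK or TOKEN_COMMENT
        self.token_type = token_type
        self.contents = contents
        self.position = position  # optional (start, end) tuple into the source
        self.lineno = lineno
```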

The token_type  can be a TOKEN_TEXT , TOKEN_VAR , TOKEN_BLOCK  or TOKEN_COMMENT . The Lexer has the job of determining for each bit from the list (split by the regex), which type of Token  to return.

Before we go on, let’s look at a more complex example, so we can follow exactly what’s happening here.
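Here’s a made-up template that mixes a comment, block tags and a variable (the names are just for illustration), along with what the regex split gives back:

```python
>>> source = "{# greeting #}Hello {% if name %}{{ name|upper }}{% else %}stranger{% endif %}!"
>>> tag_re.split(source)
['', '{# greeting #}', 'Hello ', '{% if name %}', '', '{{ name|upper }}',
 '', '{% else %}', 'stranger', '{% endif %}', '!']
```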

Just some quick notes about that resulting list. Because we split the string with the regex, we’re guaranteed that text and matched expressions strictly alternate: the list always starts with a text bit (possibly an empty string), and every other item after that is a matched expression. So let me just re-show that example from earlier:
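```python
>>> # Text sits at the even (0-based) indices, matched expressions at the odd ones
>>> tag_re.split("Hello {{ name }}!")
['Hello ', '{{ name }}', '!']
```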

 

So basically, every text symbol is going to be converted to a Token  of type TOKEN_TEXT . And every TOKEN_TEXT  node is followed by a matched expression, on which we do some further inspection to determine whether it is a TOKEN_VAR , TOKEN_BLOCK  or TOKEN_COMMENT .

So far so good? Well.. the approach is flawed for one reason. Can you guess why? When would we need to break this rule?

 

Now let’s take a look at the full tokenize  method on the Lexer . This is where the template source string gets split using our regex, and then the method loops over all the resulting bits, creating a token from each one.
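Here’s a lightly paraphrased sketch of the Lexer, based on Django’s source around the time of writing:

```python
class Lexer:
    def __init__(self, template_string):
        self.template_string = template_string
        self.verbatim = False  # set while we're inside a {% verbatim %} block

    def tokenize(self):
        """Return a list of tokens from the template string."""
        in_tag = False
        lineno = 1
        result = []
        for bit in tag_re.split(self.template_string):
            if bit:
                result.append(self.create_token(bit, None, lineno, in_tag))
            in_tag = not in_tag
            lineno += bit.count('\n')
        return result
```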

So the __init__  method stores the template string and also sets a verbatim  property to False (more on that later). In the tokenize  method, we start by defining an in_tag  boolean. There’s also a lineno  count and a list of tokens ( result ) that gets returned.

Next, we loop over the bits from the template split by the regular expression, flipping the truthiness of in_tag  on each iteration. The reason this works is that the template string was split by the regex, and as we saw earlier, text and matched symbols strictly alternate (with text first).

Also, the lineno  is incremented by simply counting the number of \n  characters in the bit. So create_token  ends up being called with in_tag = True  for every matched expression and False  otherwise (for unmatched text). So let’s have a look at what create_token  actually does then.

create_token , then, is the function that does all the work of determining what kind of token should be created for each bit; the resulting tokens are what eventually get handed to the Parser.

The first concern the Lexer caters to is Django’s verbatim  feature. Recall that anything inside a {% verbatim %}  block is left alone by the template engine and output exactly as written, which is handy when the content contains syntax (like {{ }} ) that is meant for something other than Django. So the Lexer has to ensure that nothing inside the block gets treated as a tag. To be a little more accurate, a text token will be created from the verbatim content. Which is fine, since we don’t want the Parser to think anything more involved is actually happening; the Parser won’t do any further processing on text tokens.

In order to do that, though, it needs to track the beginning and end of verbatim blocks. So the create_token  function starts with a check that toggles whether or not we are in verbatim mode. Then, if we’re not in verbatim mode (and the bit we’re looking at is a tag), it does the following checks.

  • Check if the token starts with VARIABLE_TAG_START  (i.e. {{ ) and, if so, create a token of type TOKEN_VAR  from the variable name. This is done by removing the {{  and }}  and stripping whitespace from what’s left.
  • Check if the token starts with BLOCK_TAG_START  (i.e. {% ). Then check to see if it’s a verbatim  block. If it is, the verbatim  attribute is set to the corresponding end tag (e.g. endverbatim) so that subsequent calls to create_token  can check whether the content of the block they’re looking at matches it; when it does, verbatim mode is switched off again. Either way, a new token of type TOKEN_BLOCK  is created for this tag.
  • Check for the start of a comment {# . If it finds one, it creates a token of type TOKEN_COMMENT .

If none of these rules are matched, it creates a TOKEN_TEXT  token.
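Putting those checks together, create_token  looks roughly like this (paraphrased and slightly simplified, particularly around comment handling):

```python
def create_token(self, token_string, position, lineno, in_tag):
    """Convert one bit from the regex split into a Token."""
    if in_tag and token_string.startswith(BLOCK_TAG_START):
        # Strip the '{%' and '%}' to get at the block's contents
        block_content = token_string[2:-2].strip()
        if self.verbatim and block_content == self.verbatim:
            # We've hit the matching end tag, so leave verbatim mode
            self.verbatim = False
    if in_tag and not self.verbatim:
        if token_string.startswith(VARIABLE_TAG_START):
            return Token(TOKEN_VAR, token_string[2:-2].strip(), position, lineno)
        elif token_string.startswith(BLOCK_TAG_START):
            if block_content[:9] in ('verbatim', 'verbatim '):
                # Remember which tag ends this verbatim block (e.g. 'endverbatim')
                self.verbatim = 'end%s' % block_content
            return Token(TOKEN_BLOCK, block_content, position, lineno)
        elif token_string.startswith(COMMENT_TAG_START):
            return Token(TOKEN_COMMENT, token_string[2:-2].strip(), position, lineno)
    # Everything else (plain text, and anything inside verbatim) becomes text
    return Token(TOKEN_TEXT, token_string, position, lineno)
```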

So now we have a list of tokens that can be passed directly into our Parser . We just need to call parse  on the parser to make some sense of all the tokens it has been handed. The TOKEN_BLOCK  tokens signify the start of conditionals, loops, template tags and so on, and there are variables mixed in too. The parser has to make sense of all of this and somehow render our content as we wanted.

And that I’m going to cover in Part 2 of this wonderful foray into Django’s template engine.

Hope you enjoyed following me on this journey so far.

Share your thoughts and stay tuned!

jaywhy13 • October 21, 2016

