Code generation the easier way

Following the article on integrating a code generator with CMake, I was asked for more details on how the code generator itself is done.

I'm now using the same structure for my fourth C++ code generator in a row, and have found it convenient to use and to adapt to new projects. Before this design I had never been able to reuse a code generator's structure, which seems to be a common problem judging by the projects I work on. More specifically, reuse is low because these scripts usually rely on hand-written output code.

Template languages vs. printing to a file

The easiest and most straightforward way to generate a file is to use print statements to stdout and redirect the output, or to write lines to a file directly. The advantage of this approach is that it requires no library-specific knowledge.

In contrast, text templates require learning a template language in addition to the support library used to render the templates. While getting started takes some effort, I will argue that templates bring many important benefits to non-trivial generators, most of which come from separating the presentation from the data model.

First, a template's structure matches the structure of the generated file, which makes the generator easier to understand and debug.
This is in contrast to print-style generators, which usually have lots of functions calling each other, making the flow difficult to follow.
The separation also makes changes easier: changing the form of the output only requires template changes, while changing the data that is output mostly requires changes in the model.
Finally, it makes it easy to extend the generator to output multiple, very different files: there just needs to be one template per file.
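
To make the contrast concrete, here is a minimal sketch of a print-style generator. The model and the function names are made up for illustration; the point is only that the shape of the output ends up spread across several functions instead of being visible in one place.

```python
import io

# Hypothetical model: one class with two methods.
MODEL = [
    {"name": "Car", "methods": [
        {"name": "GoVroom", "arguments": [("int", "noiseLevel")]},
        {"name": "Stop", "arguments": []},
    ]},
]

def write_arguments(out, obj, method):
    # The implicit "this" argument comes first, then the declared arguments.
    out.write("%s* this" % obj["name"])
    for arg_type, arg_name in method["arguments"]:
        out.write(", %s %s" % (arg_type, arg_name))

def write_method(out, obj, method):
    out.write("void %s%s(" % (obj["name"], method["name"]))
    write_arguments(out, obj, method)
    out.write(");\n")

def write_header(out, model):
    for obj in model:
        for method in obj["methods"]:
            write_method(out, obj, method)

def generate(model):
    out = io.StringIO()
    write_header(out, model)
    return out.getvalue()

print(generate(MODEL), end="")
```

Even at this size, working out what a generated line looks like means reading three functions in call order.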

These are the same benefits that MVC frameworks bring to web development compared to outputting the HTML page directly (à la simple PHP or CGI scripts). In fact, Jinja2's template language is modelled on the template layer (the V part) of the Django web framework.

It is tempting and fun to make one's own template language [1], but a popular template library will be feature-rich, well documented and have a pleasant template syntax, for no effort. I chose Jinja2 because it targets Python, the scripting language I know best (but not well), and because it is popular, maintained and has great documentation.

Structure of the generator

After parsing command line arguments and handling special arguments like --print-dependencies for build system integration, the generator usually loads a file containing the raw data of the model. The reason for using an external file instead of putting the data in the script is that the data can then be produced by another program and used by several generators. This is especially useful if your coworker happens to like Perl better and is not able to load Python modules.
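
As a rough sketch, the front end of such a generator could look like the following. Everything except the --print-dependencies flag is an assumption made for illustration: the other option names, the TEMPLATES list, and the choice of printing a semicolon-separated list (the form CMake treats as a list).

```python
import argparse
import os

# Hypothetical list of the templates this generator renders.
TEMPLATES = ["Template.cpp", "Template.h"]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a generator front end")
    parser.add_argument("--model", default="model.json",
                        help="data file holding the raw model")
    parser.add_argument("--template-dir", default="templates")
    parser.add_argument("--output-dir", default="output")
    parser.add_argument("--print-dependencies", action="store_true",
                        help="print the files the generator reads, then exit")
    return parser.parse_args(argv)

def input_files(args):
    # Every file whose modification should trigger a regeneration.
    return [args.model] + [os.path.join(args.template_dir, t) for t in TEMPLATES]

def main(argv=None):
    args = parse_args(argv)
    if args.print_dependencies:
        # Assumed convention: one semicolon-separated line on stdout.
        print(";".join(input_files(args)))
        return 0
    # ... load the model and render the templates here ...
    return 0

main(["--model", "api.json", "--print-dependencies"])
```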

There are many great libraries for parsing standard formats in Python: json, the excellent xml.etree.ElementTree module, and the python-yaml (PyYAML) library. After the raw data is loaded, the generator creates a Python object for each data element, then links them together. The linking phase is very useful when your model has cycles, which cannot be represented by the tree-structured standard formats, when you don't control the data and need to sanitize it [2], or just to do sanity checks.
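
The load-then-link idea can be sketched in two passes. The JSON layout and the ObjectType and Method classes below are hypothetical and only serve the example; the data refers to types by name, and the second pass turns those names into object references while checking them.

```python
import json

# Hypothetical raw data: "car" and "engine" refer to each other by name,
# a cycle that a tree-shaped file cannot express directly.
RAW_DATA = """
{
    "objects": [
        {"name": "car", "methods": [{"name": "get engine", "returns": "engine"}]},
        {"name": "engine", "methods": [{"name": "get car", "returns": "car"},
                                       {"name": "start", "returns": "void"}]}
    ]
}
"""

class ObjectType:
    def __init__(self, name):
        self.name = name
        self.methods = []

class Method:
    def __init__(self, name, return_type):
        self.name = name
        self.return_type = return_type  # an ObjectType, or None for void

def parse_and_link(text):
    raw = json.loads(text)

    # Pass 1: create one Python object per data element, indexed by name.
    objects = {}
    for raw_object in raw["objects"]:
        name = raw_object["name"]
        if name in objects:
            raise ValueError("duplicate object: " + name)
        objects[name] = ObjectType(name)

    # Pass 2: resolve names into references, sanity-checking along the way.
    for raw_object in raw["objects"]:
        owner = objects[raw_object["name"]]
        for raw_method in raw_object["methods"]:
            returns = raw_method["returns"]
            if returns != "void" and returns not in objects:
                raise ValueError("unknown type: " + returns)
            owner.methods.append(Method(raw_method["name"], objects.get(returns)))

    return objects

model = parse_and_link(RAW_DATA)
engine = model["car"].methods[0].return_type
print(engine.name, "->", engine.methods[0].return_type.name)
```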

Most model objects have names. I have found it very useful to abstract away the case of a name using a Name class that contains the name split into words, as well as various methods that return it formatted in different cases. This way the model has no notion of case and the choice is left to the template. It is helpful to have a canonical lowercase, space-separated representation of names in data files, as it makes it easy to split and recase the name. Here's the code for a minimal Name class:

class Name:
    def __init__(self, name):
        self.chunks = name.split(' ')

    def CamelChunk(self, chunk):
        return chunk[0].upper() + chunk[1:]

    def canonical_case(self):
        return (' '.join(self.chunks)).lower()

    def concatcase(self):
        return ''.join(self.chunks)

    def camelCase(self):
        return self.chunks[0] + ''.join([self.CamelChunk(chunk) for chunk in self.chunks[1:]])

    def CamelCase(self):
        return ''.join([self.CamelChunk(chunk) for chunk in self.chunks])

    def SNAKE_CASE(self):
        return '_'.join([chunk.upper() for chunk in self.chunks])

    def snake_case(self):
        return '_'.join(self.chunks)
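
To show how the choice of case ends up in the template, here is a small sketch. It carries a trimmed copy of the Name class above so that it runs on its own, and the template text is made up for illustration.

```python
import jinja2

class Name:
    # Trimmed copy of the Name class above, kept only so this snippet is runnable.
    def __init__(self, name):
        self.chunks = name.split(' ')

    def CamelCase(self):
        return ''.join(chunk[0].upper() + chunk[1:] for chunk in self.chunks)

    def SNAKE_CASE(self):
        return '_'.join(chunk.upper() for chunk in self.chunks)

    def snake_case(self):
        return '_'.join(self.chunks)

# Hypothetical template: the same model name is used in three different cases.
template = jinja2.Template(
    "#define MAX_{{ name.SNAKE_CASE() }}_COUNT 16\n"
    "class {{ name.CamelCase() }};\n"
    "{{ name.CamelCase() }}* create_{{ name.snake_case() }}();"
)

print(template.render(name=Name('vertex buffer')))
```

The data file only ever contains "vertex buffer"; switching the generated code to another naming convention is a template-only change.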

Now that the model is built, we can feed it to our templates as a dictionary, along with helper functions that will be available in the templates as additional builtins:

import os
from collections import namedtuple

import jinja2

# Represents a single template render operation. params_dicts is a list of dictionaries
# that are merged before being fed to the template.
FileRender = namedtuple('FileRender', ['template', 'output', 'params_dicts'])

def do_renders(renders, template_dir, output_dir):

    # Create the Jinja2 environment using custom options and loader, see sections below.
    env = jinja2.Environment(loader=PreprocessingLoader(template_dir), trim_blocks=True, lstrip_blocks=True, line_comment_prefix='//*')

    for render in renders:

        # Merge the dictionaries; later ones override earlier ones.
        params = {}
        for param_dict in render.params_dicts:
            params.update(param_dict)

        # Render the template
        output = env.get_template(render.template).render(**params)

        # Output the file, creating directories if needed.
        output_file = os.path.join(output_dir, render.output)
        directory = os.path.dirname(output_file)
        if not os.path.exists(directory):
            os.makedirs(directory)

        with open(output_file, 'w') as outfile:
            outfile.write(output)

# Example use of do_renders
model = parse_and_link_model()

# Few builtin functions are present by default in the templates, so add some.
builtins = {
    'xrange': range,  # Python 3 has no xrange; range is exposed under the old name
    'ord': ord
}

renders = [FileRender('Template.cpp', 'output.cpp', [model, builtins])]
do_renders(renders, 'templates/', 'output/')

Producing files with the correct spacing

It is a good idea to have the generated files be readable, if only to help with debugging and code understanding. However, when using code generators, preserving the correct spacing of the resulting source can be a challenge. Take for example the following model representing classes:

model = {
    'objects': [
        {
            'name': 'Car',
            'methods': [
                {'name': 'GoVroom', 'arguments': [{'type': 'int', 'name': 'noiseLevel'}]},
                {'name': 'Stop', 'arguments': [{'type': 'bool', 'name': 'fast'}, {'type': 'bool', 'name': 'hard'}]}
            ]
        },
        {
            'name': 'Nothing',
            'methods': [{'name': 'DoNothing', 'arguments': []}]
        }
    ]
}

We want to output the following:

void CarGoVroom(Car* this, int noiseLevel);
void CarStop(Car* this, bool fast, bool hard);
void NothingDoNothing(Nothing* this);

The straightforward way to output this is to have the following template and render code:

env = jinja2.Environment(loader=jinja2.FileSystemLoader(''))
print(env.get_template('MyTemplate.tmpl').render(model))

{% for object in objects %}
    {% for method in object.methods%}
        void {{object.name}}{{method.name}}({{object.name}}* this
            {% for argument in method.arguments %}
                , {{argument.type}} {{argument.name}}
            {% endfor %}
        );
    {% endfor%}
{% endfor %}

Which outputs:

void CarGoVroom(Car* this

        , int noiseLevel

);

void CarStop(Car* this

        , bool fast

        , bool hard

);



void NothingDoNothing(Nothing* this

);

This is pretty bad. The first problem is that Jinja2 outputs the indentation and line breaks that the template uses for readability, which completely messes up the output. The built-in solution to this problem is to use the "-" variants of the template control flow directives, which tell Jinja2 to remove, at parse time, all the whitespace preceding ({%-) or following (-%}) the directive. Here's the example template using the "-" variants for some of the directives, and its output:

{% for object in objects %}
    {% for method in object.methods%}
        void {{object.name}}{{method.name}}({{object.name}}* this
            {%- for argument in method.arguments -%}
                , {{argument.type}} {{argument.name}}
            {%- endfor -%}
        );
    {% endfor%}
{% endfor %}

Gives:

void CarGoVroom(Car* this, int noiseLevel);

void CarStop(Car* this, bool fast, bool hard);



void NothingDoNothing(Nothing* this);
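
The effect of the "-" variants can be checked in isolation with a reduced template. This is only a sketch: the function name f and the argument strings are placeholders, not part of the generator.

```python
import jinja2

# Reduced version of the argument loop, indented for readability.
SOURCE = (
    "void f(Car* this\n"
    "    {%- for argument in arguments -%}\n"
    "        , {{ argument }}\n"
    "    {%- endfor -%}\n"
    ");"
)

template = jinja2.Template(SOURCE)

# All the whitespace around the for/endfor tags is stripped when the template
# is parsed, so the declaration collapses onto a single line.
print(template.render(arguments=["int noiseLevel"]))
print(template.render(arguments=[]))
```

Because the stripping happens when the template is parsed and not when it is rendered, the result is also a single line when the argument list is empty.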

The function declarations are now correctly formatted, but there are extra newlines introduced by the control flow blocks. In most cases, when a control flow directive is on a line of its own, we don't want that line to appear in the output. A combination of two jinja2.Environment options does just that:

trim_blocks, which removes the first newline following a block tag.
lstrip_blocks, which removes whitespace from the start of a line up to a block tag.

env = jinja2.Environment(loader=jinja2.FileSystemLoader(''), trim_blocks=True, lstrip_blocks=True)

Gives:

void CarGoVroom(Car* this, int noiseLevel);
void CarStop(Car* this, bool fast, bool hard);
void NothingDoNothing(Nothing* this);
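
The two options can be tried on a small template that does not depend on the model above. This is a sketch: the items list and the template text are invented for the example.

```python
import jinja2

# Hypothetical template with control flow tags on lines of their own.
SOURCE = (
    "{% for item in items %}\n"
    "    {% if item.enabled %}\n"
    "        int {{ item.name }};\n"
    "    {% endif %}\n"
    "{% endfor %}\n"
)

items = [
    {"name": "a", "enabled": True},
    {"name": "b", "enabled": False},
    {"name": "c", "enabled": True},
]

default_env = jinja2.Environment()
trimmed_env = jinja2.Environment(trim_blocks=True, lstrip_blocks=True)

plain = default_env.from_string(SOURCE).render(items=items)
trimmed = trimmed_env.from_string(SOURCE).render(items=items)

# With the options set, lines holding only a block tag vanish from the output,
# but the indentation of the remaining lines is left untouched.
print(repr(trimmed))
print(plain.count("\n"), "newlines without the options,", trimmed.count("\n"), "with them")
```

The rendered lines keep the eight spaces they have in the template, which is the remaining problem.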

The output is now almost correctly spaced, except for the extra indentation introduced by the indentation of the template directives: each generated line really starts with eight spaces, one level per enclosing for block, which the listing above leaves out. Unfortunately Jinja2 doesn't have a built-in way to correct the indentation. Instead I wrote a custom loader (a jinja2.BaseLoader subclass) that processes the template text before it is parsed by Jinja2: it assumes that the template is indented by one level for each directive level, and removes one indentation level per nested directive. That loader is passed to the Jinja2 environment at creation and transparently fixes the indentation level of all the templates.

import os
import re

import jinja2

# A custom Jinja2 template loader that removes the extra indentation
# of the template blocks so that the output is correctly indented.
class PreprocessingLoader(jinja2.BaseLoader):
    def __init__(self, path):
        self.path = path

    def get_source(self, environment, template):
        path = os.path.join(self.path, template)
        if not os.path.exists(path):
            raise jinja2.TemplateNotFound(template)
        mtime = os.path.getmtime(path)
        with open(path) as f:
            source = self.preprocess(f.read())
        return source, path, lambda: mtime == os.path.getmtime(path)

    # Raw strings keep \s from being treated as a string escape sequence.
    blockstart = re.compile(r'{%-?\s*(if|for|block)[^}]*%}')
    blockend = re.compile(r'{%-?\s*end(if|for|block)[^}]*%}')

    def preprocess(self, source):
        lines = source.split('\n')

        # Compute the current indentation level of the template blocks and remove their indentation
        result = []
        indentation_level = 0

        for line in lines:
            # re.split() with one capture group returns 2 * matches + 1 elements: each block
            # start or end adds two, and the trailing chunk of the line adds one.
            numends = (len(self.blockend.split(line)) - 1) // 2
            indentation_level -= numends

            result.append(self.remove_indentation(line, indentation_level))

            numstarts = (len(self.blockstart.split(line)) - 1) // 2
            indentation_level += numstarts

        return '\n'.join(result)

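    def remove_indentation(self, line, n):
        # Strip n indentation levels; one level is either four spaces or one tab.
        for _ in range(n):
            if line.startswith('    '):
                line = line[4:]
            elif line.startswith('\t'):
                line = line[1:]
            else:
                # Only blank lines may be indented less than the current block level.
                assert line.strip() == ''
        return line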

env = jinja2.Environment(loader=PreprocessingLoader(''), trim_blocks=True, lstrip_blocks=True)

With all of the above whitespace fixes, the output is correctly spaced:

void CarGoVroom(Car* this, int noiseLevel);
void CarStop(Car* this, bool fast, bool hard);
void NothingDoNothing(Nothing* this);

Three fixes seem like a lot, but the loader and the block-stripping options only have to be set up once and incur no cognitive burden afterwards.

[1]	something like doing text substitution using Python's str.format keyword arguments and matching @@-prefixed lines for control flow. Don't do that.

[2]	like for my generator that uses Vulkan's horrible XML API definition.