Integrating a code generator with CMake

Motivation

In C / C++ projects we often need to use a code generator to create files that would be very tedious or error-prone to write. There are many use cases for this, such as:

Serialization code, for example for protobufs
Bindings for APIs, like the interfaces that browsers expose to JavaScript (generated from WebIDL files)
Plumbing for a specific program, such as the component-based game logic used in Unvanquished

This post looks at how a code generator can be integrated with the CMake build system, based on my experience integrating the VkCPP code generator. In VkCPP, a Python script parses the vk.xml file describing the Vulkan graphics API and outputs a C++ interface for Vulkan.

But why would we integrate the generator with the build system instead of simply checking the generated files in the version control system?

To start, adding generated files to commits can make the commits much larger and hence harder to review. This is especially true for changes to the code generator itself, as most of the generated files may get rewritten. (It also increases the size of the repository, but that is a minor point.)

Another advantage of the integration is that the code generator will re-run automatically, freeing the developer from having to think about it, and preventing bugs where the generated code and the rest of the codebase are not compatible. This could happen if someone submitted changes to the generator but without the updated generated files.

Finally, integration makes sure that the generator works reliably and easily on all the developers’ platforms, unlike a seldom-run generator that could easily bit-rot or work on only one developer’s machine.

The most basic integration

Two CMake commands allow running scripts at build time: add_custom_command and add_custom_target. Of the two, only add_custom_command allows creating commands that run only when needed. For example, if we wanted to generate files from a protobuf definition, the add_custom_command call could look like the following:

add_custom_command(
    COMMAND ${PYTHON_EXECUTABLE} generate_protobuf.py ${ProtoName}.proto
    DEPENDS generate_protobuf.py ${ProtoName}.proto
    OUTPUT ${ProtoName}.cpp ${ProtoName}.h
    COMMENT "Generating code for ${ProtoName}."
)

The OUTPUT and DEPENDS arguments to add_custom_command tell CMake how to integrate this command in the build dependency graph [1]. The COMMENT argument is the description of the build step shown at build time, and COMMAND is the invocation to run at build time, here a Python script.

Note that we added generate_protobuf.py itself to the dependencies. This ensures that the generated files are updated whenever the script changes.

This integration works well when the generator’s dependencies and outputs are easy to express, such as when a protobuf definition produces a header / implementation pair with the same name.
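To make the "same name" convention concrete, here is a minimal Python sketch of how such a script could derive its output names from its input. The helper name outputs_for and the out_dir parameter are hypothetical and not part of any protobuf tooling; the point is only that the mapping is simple enough to write by hand in the add_custom_command call.

```python
import os


def outputs_for(proto_path, out_dir="."):
    """Return the [source, header] paths generated for one .proto file.

    Hypothetical helper: it mirrors the output list of the
    add_custom_command call above, where Foo.proto yields Foo.cpp and Foo.h.
    """
    stem, ext = os.path.splitext(os.path.basename(proto_path))
    if ext != ".proto":
        raise ValueError("expected a .proto file, got %r" % proto_path)
    return [
        os.path.join(out_dir, stem + ".cpp"),
        os.path.join(out_dir, stem + ".h"),
    ]
```

Because the output names follow mechanically from the input name, the CMake side can spell them out directly without asking the script.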

In VkCPP however, the dependencies and outputs are more complex. The dependencies include several template files and additional configuration files, while the number and the name of the outputs depend on the content of the configuration files.

Invoking a complex generator

When a generator is complex in the sense that its dependencies and outputs are difficult to express, we may not want to manually keep the add_custom_command call and the scripts synchronized as it would be easy to forget to add dependencies or outputs.

Instead we are going to query the script itself, by adding --print-dependencies and --print-outputs command line arguments that make the script print the corresponding semicolon-separated lists of files (the semicolon is the list separator in CMake). To query these at build configuration time, we use the execute_process command as follows:

execute_process(
    COMMAND ${GENERATE_COMMAND} --print-dependencies
    OUTPUT_VARIABLE DEPENDENCIES
    RESULT_VARIABLE RETURN_VALUE
)
if (NOT RETURN_VALUE EQUAL 0)
    message(FATAL_ERROR "Failed to get the dependencies")
endif()

The DEPENDENCIES variable whose name is given for the OUTPUT_VARIABLE argument will contain the standard output of the script, in this case the list of dependencies. We also check that the script didn’t crash and worked correctly by checking that the return value is 0.

Adding support for --print-dependencies is just a matter of gathering the input files and printing them (here the relevant code from the VkCPP generator):

if args.print_dependencies:
    dependencies = set(
        [template_dir + os.path.sep + 'TemplateUtils.h'] +
        [template_dir + os.path.sep + render.template for render in to_render] +
        [os.path.abspath(args.xml_file)]
    )
    sys.stdout.write(';'.join(dependencies))
    return 0

Getting the outputs is done the same way. It is a good idea to put the generator invocation command in a variable so that the same arguments are used in all three script invocations. Once we have the outputs and dependencies, the add_custom_command call looks like the following:

add_custom_command(
    COMMAND ${GENERATE_COMMAND}
    DEPENDS ${DEPENDENCIES} generator.py
    OUTPUT ${OUTPUTS}
    COMMENT "Generating the files."
)
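To show how the two query flags and the normal generation mode can share one entry point, here is a self-contained Python sketch of a generator skeleton. It is not the VkCPP script: the --output-dir option, the single Generated.h output and the cmake_list helper are assumptions made for illustration, and a real generator would derive both lists from its configuration.

```python
import argparse
import os
import sys


def cmake_list(paths):
    """Join paths into a CMake list (semicolon-separated, sorted for stable output)."""
    return ";".join(sorted(paths))


def main(argv):
    parser = argparse.ArgumentParser(description="Hypothetical generator skeleton")
    parser.add_argument("xml_file")
    parser.add_argument("--output-dir", default=".")  # assumed option
    parser.add_argument("--print-dependencies", action="store_true")
    parser.add_argument("--print-outputs", action="store_true")
    args = parser.parse_args(argv)

    # Placeholder lists; a real generator would compute them from its inputs.
    dependencies = {os.path.abspath(args.xml_file)}
    outputs = {os.path.abspath(os.path.join(args.output_dir, "Generated.h"))}

    if args.print_dependencies:
        # No trailing newline, so CMake's OUTPUT_VARIABLE holds a clean list.
        sys.stdout.write(cmake_list(dependencies))
        return 0
    if args.print_outputs:
        sys.stdout.write(cmake_list(outputs))
        return 0

    # Normal mode: write every output that --print-outputs declares.
    for path in outputs:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as generated:
            generated.write("// generated from %s\n" % os.path.basename(args.xml_file))
    return 0

# A real script would end with: sys.exit(main(sys.argv[1:]))
```

Using sys.stdout.write instead of print matters here: a trailing newline would become part of the last list entry unless execute_process is also given OUTPUT_STRIP_TRAILING_WHITESPACE. Computing both lists before branching keeps the declared outputs and the files actually written from drifting apart.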

Rest of the integration

With the code generator invocation done, the integration is not complete just yet and we need a couple more steps:

1) Generated code often needs support code, so we’ll bundle them together in one library. That way everything can be used together in other targets by simply linking the library.
2) To avoid cluttering the main CMakeLists.txt we’ll put the generator invocations in a subdirectory’s CMakeLists.txt. This way only the target names will be visible to the main CMakeLists.txt and not the variables.
3) To avoid cluttering the source tree, we’ll generate the files in the build directory, properly namespaced with the subdirectory.
4) Finally we’ll deal with include paths.

Here is how these steps look when combined, in a simplified version of VkCPP’s CMakeLists.txt:

# This is src/vkcpp/CMakeLists.txt

# Output the generated files in the build directory. Since the generated files include header
# files and we want includes of the form #include "vkcpp/Vulkan.h", we output them in
# an additional vkcpp subdirectory and will add the current binary dir as an include path.
# This handles 3) and part of 4)
set(VKCPP_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR})
set(VKCPP_OUTPUT_DIR ${CMAKE_CURRENT_BINARY_DIR}/vkcpp)

# Here do the execute_process and add_custom_command calls for the generator

# Bundle everything in a library, addresses 1)
add_library(vkcpp STATIC
    ${SUPPORT_CODE}
    ${OUTPUTS}
)

# Make targets linking against VkCPP use the right include directories, handles 4)
# Add the include directory to make #include "vkcpp/Generated.h" work
target_include_directories(vkcpp PUBLIC ${VKCPP_INCLUDE_DIR})
# Add the include directory to make #include "vkcpp/Manual.h" work
# Note that these headers live in src/vkcpp/include/vkcpp/
target_include_directories(vkcpp PUBLIC ${VKCPP_SOURCE_DIR}/include)

# Add an include directory that should only be used when compiling the VkCPP files;
# here it is the C version of the Vulkan API, which users of VkCPP don't need.
target_include_directories(vkcpp SYSTEM PRIVATE ${VKCPP_SOURCE_DIR}/external/vulkan/include)

The main CMakeLists.txt simply uses add_subdirectory:

# This is the root CMakeLists.txt, it handles 2) by using add_subdirectory

add_subdirectory(src/vkcpp)

# Now use the vkcpp target normally
target_link_libraries(MyApp vkcpp)

Conclusion

At this point our code generator runs automatically during the build process and is hidden from the rest of the build system, so our integration is done. What has been presented is based on the VkCPP source code, which you can see on GitHub LINK, although the code might change.

The approach shown here has one limitation, though. We gather the lists of dependencies and outputs at build configuration time only, so if the generator changes and adds more dependencies, the build files won’t know about it. Ideally we would like to reconfigure the build when the generator or one of its dependencies changes. One candidate for this is CMake’s CMAKE_CONFIGURE_DEPENDS directory property, which makes CMake re-run the configure step when a listed file changes; it is not used in the integration shown here.

[1]

This could be implemented by having each output depend on each dependency, but that would be somewhat inefficient and might run the script multiple times. Instead, CMake usually creates a new “stamp” file whose last-modified time corresponds to the beginning of the custom command execution. In Makefile syntax it would look like the following:

custom_command42.stamp: generate_protobuf.py MyProto.proto
    touch custom_command42.stamp
    python generate_protobuf.py MyProto.proto

MyProto.cpp: custom_command42.stamp;
MyProto.h: custom_command42.stamp;
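The decision that Make takes for the stamp rule above can be sketched in a few lines of Python. This is only an illustration of the timestamp comparison, not CMake's or Make's actual implementation, and the function name needs_rerun is made up for the example.

```python
import os


def needs_rerun(stamp, dependencies):
    """Return True if the command guarding `stamp` should run again.

    Mirrors the Makefile rule above: rerun when the stamp file is missing
    or when any dependency was modified more recently than the stamp.
    """
    if not os.path.exists(stamp):
        return True
    stamp_mtime = os.path.getmtime(stamp)
    return any(os.path.getmtime(dep) > stamp_mtime for dep in dependencies)
```

Since every output depends only on the stamp, the script runs once per change to its inputs, no matter how many files it produces.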