tree-sitter-fortran

Fortran grammar for the popular Tree-sitter parsing library

Published: Mar 4, 2025 by Peter Hill

Project info

licence MIT

Tree-sitter is a popular library for parsing programming languages, used for syntax highlighting and code navigation in IDEs, websites, and other tools. Although modern Fortran is still widely used in academia, many tools (syntax highlighting packages, for example) have not kept up with developments in the language and fail to highlight modern Fortran correctly. A Tree-sitter grammar for Fortran was already being developed, and was about 60% complete when we started to contribute.

This project was useful because we could build other tools on top of it, as well as hopefully getting better syntax highlighting in our favourite editors and websites! It was also a classic case of needing to build the tools in order to build the tools. To find what needed adding or fixing in the grammar, I downloaded a large corpus of Fortran codes (around 200 at last count, around 34M lines of code!) and wrote a short Python tool that parses every Fortran file in a directory with the tree-sitter grammar and displays any syntax errors with some surrounding context. This made it very easy to quickly find missing features and edge cases in the existing grammar.
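The tool itself isn't reproduced here, but a minimal sketch of the idea might look like the following. It assumes the `tree-sitter` (0.23 or later) and `tree-sitter-fortran` Python packages; the function names, the file suffixes, and the output format are all made up for illustration and are not taken from the real tool.

```python
from pathlib import Path


def load_fortran_parser():
    """Build a tree-sitter parser for Fortran.

    Assumption: the `tree-sitter` (>= 0.23) and `tree-sitter-fortran`
    packages are installed. They are imported lazily so the helpers
    below can be used (and tested) without them.
    """
    import tree_sitter
    import tree_sitter_fortran

    language = tree_sitter.Language(tree_sitter_fortran.language())
    return tree_sitter.Parser(language)


def find_errors(node):
    """Yield every ERROR or MISSING node under `node`, depth first."""
    if node.type == "ERROR" or node.is_missing:
        yield node
        return  # don't descend: nested errors are usually the same problem
    for child in node.children:
        yield from find_errors(child)


def format_error(lines, row, column, context=2):
    """Return the lines around `row` (0-based) with a caret under `column`.

    Tree-sitter columns are byte offsets, so the caret can drift on
    lines containing non-ASCII characters.
    """
    first = max(0, row - context)
    last = min(len(lines), row + context + 1)
    out = []
    for i in range(first, last):
        out.append(f"{i + 1:6d} | {lines[i]}")
        if i == row:
            out.append(" " * (9 + column) + "^")  # 9 = width of the gutter
    return "\n".join(out)


def report_directory(directory, parser, suffixes=(".f90", ".F90")):
    """Parse each Fortran file under `directory`; return (parsed, failed)."""
    parsed = failed = 0
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        source = path.read_bytes()
        tree = parser.parse(source)
        parsed += 1
        errors = list(find_errors(tree.root_node))
        if not errors:
            continue
        failed += 1
        lines = source.decode("utf-8", errors="replace").splitlines()
        for node in errors:
            row, column = node.start_point
            print(f"{path}:{row + 1}:{column + 1}: syntax error")
            print(format_error(lines, row, column))
    return parsed, failed
```

Calling `report_directory("corpus", load_fortran_parser())` on a directory of sources (the name `corpus` is a placeholder) would print each error location with two lines of context either side, and return the counts needed to work out what fraction of files parse cleanly.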

While some Fortran features were very easy to add to the grammar, a few things are very tricky to parse. For example, the language doesn't have reserved keywords, so it's necessary to manually allow clashes between the identifier AST node and many keywords. Line continuations are also allowed in some strange places (in the middle of other tokens, for instance!) and we don't try to parse all of these. The other particularly tricky thing to support is the preprocessor. While there is no official Fortran preprocessor, most compilers use a fairly standard C preprocessor, which does straightforward textual replacement, and that means it can do basically anything to the source. This isn't possible to support completely in the grammar, but I was able to get a minimal, common subset in.

My work on the grammar took it from being able to parse about 60% of files in the corpus to being able to parse 92%. The majority of the remaining failures either come from preprocessor macros (several projects use something like a DEBUG_WRITE macro which expands to write or nothing, for instance) or are actual syntax errors in the source. Excluding those puts it much closer to 99% of all files.
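To see why a macro like this defeats the grammar, here is a toy illustration of the textual replacement involved. The macro's exact form varies between projects; the function-like `DEBUG_WRITE(args)` shape and the helper name `expand_debug_write` below are hypothetical, and this is a regex stand-in rather than a real C preprocessor.

```python
import re

# Hypothetical macro shape: DEBUG_WRITE(args) becomes `write(*,*) args`
# in debug builds and disappears entirely otherwise.
_MACRO = re.compile(r"\bDEBUG_WRITE\((.*)\)")


def expand_debug_write(source, debug):
    """Mimic the preprocessor's textual replacement, one line at a time."""

    def replace(match):
        return f"write(*,*) {match.group(1)}" if debug else ""

    return "\n".join(_MACRO.sub(replace, line) for line in source.splitlines())


source = 'x = 1\nDEBUG_WRITE("x is", x)\n'
print(expand_debug_write(source, debug=True))
print(expand_debug_write(source, debug=False))
```

Both expansions are valid Fortran, but the unexpanded line `DEBUG_WRITE("x is", x)` is not a statement on its own, and the unexpanded text is all that a grammar working on the raw source ever sees.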

Fortran parsing
