Query-Based Compiler Architectures (ollef.github.io)
117 points by matt_d on June 25, 2020 | 33 comments


I have some negative feelings about this trend (of increased integration in compilers), but I can't quite put my finger on the reason.

Before the language server idea came along, all compilers were pure functions. Dependency management and caching were the responsibility of the build system. A flexible build system could handle things that the language designers haven't thought of, like code generation, mixed-language projects, linking between different languages, etc. Things are very composable and extendable, and everything can be modeled as a DAG.

With language servers and incremental compilation becoming part of some long-running compiler process, the responsibilities are blurred. It all leads to increased integration and less flexibility, and when things break, you will not be able to tell why.

Aren't we giving up too much to get faster recommendations in IDEs?


Contrary to other replies, I think this is a reasonable concern. For example, the traditional C model allowed people to relatively easily write wrappers that do transparent caching (ccache) and parallelism (distcc), which is harder with a more integrated model.

Except:

> Aren't we giving up too much to get faster recommendations in IDEs?

The goal of incremental compilation isn't just IDE recommendations. It's faster compile times. C's system of header files and file-level dependency tracking works for C, but modern C++ severely strains its limits, considering how templates often force you to put all your code in header files. Rust doesn't even have header files and uses a single compilation unit for an entire library, which is much worse… but Rust is working on fine-grained query-based incremental compilation, which might someday let it leapfrog C++ in rebuild times.

Maybe.
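
To make "fine-grained query-based incremental compilation" concrete, here's a minimal sketch of the idea (in Haskell for brevity; the types and names are hypothetical, not rustc's or salsa's actual machinery): each small unit of work is a query, and a query's result is reused whenever a fingerprint of its inputs is unchanged.

    import qualified Data.Map.Strict as Map
    import Data.IORef
    import Data.Hashable (hash)  -- from the "hashable" package

    -- A query names one small unit of work, e.g. "the type of function foo".
    type QueryKey = String

    -- A cached result, together with a fingerprint of the inputs it came from.
    data Entry = Entry { inputHash :: Int, cached :: String }

    type Cache = IORef (Map.Map QueryKey Entry)

    -- Answer a query, reusing the cached result if its inputs are unchanged.
    -- 'fetchInputs' and 'compute' stand in for the real compiler work.
    runQuery :: Cache
             -> (QueryKey -> IO String)        -- gather the inputs this query depends on
             -> (QueryKey -> String -> String) -- the actual work (parse, type-check, ...)
             -> QueryKey
             -> IO String
    runQuery cacheRef fetchInputs compute key = do
      inputs <- fetchInputs key
      let h = hash inputs
      cache <- readIORef cacheRef
      case Map.lookup key cache of
        Just e | inputHash e == h -> pure (cached e)  -- inputs unchanged: reuse
        _ -> do
          let r = compute key inputs                  -- only this query is redone
          modifyIORef' cacheRef (Map.insert key (Entry h r))
          pure r

Editing one function body then invalidates only the queries whose inputs actually changed, instead of the whole compilation unit.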


I wonder if you could approach this like you’d approach designing any other data structure, by first nailing down what operations you are optimizing for.

We could optimize for ease of iterating on the compiler, or for ease of integration into an ecosystem of various build systems. Or for correctness, consistency, compilation speed, etc.

Or, maybe it turns out that saving developers' time is more important than any of these, as long as we keep all of them within reasonable bounds, since, by a factor of 10,000, it's the most common operation you're going to be performing.


I honestly fail to see anything we're giving up.

What makes you assume compilers somehow won't keep supporting the pipeline model? What "less flexibility" have you actually suffered from? When was the last time you had to debug compiler problems yourself?


If the compiler and the build system are so tightly coupled, I am now stuck with that build system.

Specifically, compiling frontend code with a generic build tool like Bazel is torture, because the entire JavaScript ecosystem consists of giant monolithic everything-in-a-box toolkits like webpack.


> If the compiler and the build system are so tightly coupled, I am now stuck with that build system.

This seems like a strawman. Please give me an example of this.

Your example has nothing to do with what most people consider a "compiler" (something that reads in code written in some programming language and transforms it into some lower-level form, either machine code or some other format). Bazel is not a compiler, and it does not have any compiler built in.


Webpack is a highly integrated build system that is tightly coupled to the language and the compilers it drives. For example, you don't have an explicit dependency graph; your graph is implicitly defined by your source code imports together with your random magic build-level configuration. You also don't have build steps; it's just one big ... thing. Once you start doing things the "webpack way" (e.g. using loaders), you lose all chance of using any other build system (e.g. Bazel).

Imho this integration between the build system (webpack) and the compilers (typescript/babel) is bad and accidental. It happened because they were all written in the same language and because it was possible for the build system (webpack) to use the compiler as a library.

I am worried that if we start creating flexible query-based compilers for a language, it will make sense for the queries to be written in the same language, and for the build system to be in the same language too, which will lead to coupling, which will lead to a lack of flexibility.

I hope this makes sense.

As an analogy, try to contrast the unix philosophy with a Spark cluster for data processing workflows. You don't need to be a "unix expert" to extend the unix toolkit, but organizations have Spark teams...


I'm afraid this is the end of our meaningful discussion and we'll have to agree to disagree, since you seem to have a very different definition of what a compiler is than I do.

Webpack doesn't "compile" anything in the traditional compiler sense. From the webpack documentation:

> Note that webpack will not alter any code other than import and export statements.

So, from my perspective, your example isn't relevant to the topic at all, as webpack is not and does not contain a compiler like LLVM or GCC or Java compiler / JVM or Haskell compiler/runtime or any such system, and has no relevance whatsoever to this discussion of "increased integration in compilers" w.r.t. compile servers and IDEs.


The assertion "all compilers were pure functions" is a strange one, because it is almost entirely backwards.

The purity of compilers was abandoned almost immediately: when they started creating a file a.out and writing to that instead of writing binaries to stdout, in the C preprocessor when #include was added, and in the assembler with the .incbin directive. If compilers were pure, there would be zero need for Makefile-style build systems which stat files to see if they have changed.

While it's true that Makefiles and their ilk are modeled as a DAG, the only reason an external file/DAG is actually necessary is impurity in the compilation process.

There have been very few compilers which have even had a relatively pure core (TeX is the only one I can actually think of). Language servers are, if anything, moving compilers to a purer model, simply because sources are sent through some file descriptor rather than having to construct a graph out of filenames.

Long story short, "purity" in the compiler sense means a function from source text -> binary text; "foo.c" is not source text, and a bunch of errors is not binary text.

At least language servers take in source text as input.


> The purity of compilers was abandoned almost immediately: when they started creating a file a.out and writing to that instead of writing binaries to stdout

I don't understand your point. A function doesn't cease to be a function if it sends its output somewhere else.

> and in the C preprocessor when #include was added,

The C preprocessor is not the compiler. It's a macro processor that expands all macros to generate the translation unit, which is the input that compilers use to generate their output.

And that is clearly a function.

> If compilers were pure, there would be zero need for Makefile-style build systems which stat files to see if they have changed.

That assertion makes no sense at all. Compilers take source code as input and output binaries. That's it. The feature you're mentioning is just a convenient trick to cut down build times by avoiding recompiling source files that haven't changed. That's not the responsibility of the compiler. That's a function whose input is the source files' attributes and whose output is a DAG of files, which is used to run a workflow where in each step a compiler is invoked on a specific source file to generate a binary.

It's functions all the way down, but the compiler is just a layer in the middle.
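
As a rough sketch of that layering (stand-in types, not any particular compiler's API): the compiler is the pure function in the middle, and the build system is the stateful layer that looks at file attributes to decide whether to invoke it at all.

    -- Stand-in types for the sketch.
    type TranslationUnit = String   -- preprocessed source text
    type ObjectCode      = String   -- the "binary" output
    type Diagnostic      = String

    -- The compiler proper: a pure function, translation unit in, object code out.
    compile :: TranslationUnit -> Either [Diagnostic] ObjectCode
    compile tu
      | null tu   = Left ["empty translation unit"]
      | otherwise = Right ("object code for: " ++ take 20 tu)

    -- The build system is the layer around it: it compares file attributes
    -- (here, modification times) and only re-runs 'compile' where needed.
    type ModTime = Integer

    needsRebuild :: ModTime -> Maybe ModTime -> Bool
    needsRebuild srcTime objTime = case objTime of
      Nothing -> True            -- no previous output: must compile
      Just t  -> srcTime > t     -- source newer than its output: recompile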

> While it's true that Makefiles and their ilk are modeled as a DAG, the only reason an external file/DAG is actually necessary is impurity in the compilation process.

You have it entirely backwards: build systems exist because compilers are pure functions with specific and isolated responsibilities. Compilers take source code as input and generate binaries as output. That's it. And they are just a component in the whole build system, which is comprised of multiple tools that are designed as pure functions as well.


> I don't understand your point. A function doesn't cease to be a function if it sends its output somewhere else.

I think here lies the miscommunication: I'm talking about pure functions. It doesn't cease to be a function, but it does cease to be a pure one if sending its output somewhere else is done by side effect.


I guess there is pure and then there is pure: pure in the sense of no side effects at all (such as writing to a file), and pure in the sense of not relying on state.


> I think here lies the miscommunication: I'm talking about pure functions. It doesn't cease to be a function, but it does cease to be a pure one if sending its output somewhere else is done by side effect.

No, you're getting it entirely wrong. The input is the translation unit, the output is the binaries. If you understand what a compiler does and what its role in a build process is, then you'll understand it works as a pure function. There are no side effects. Translation units in, binaries out.

I suggest you spend some time with a tiny C or C++ project trying to arrive at an executable by performing all build steps by hand instead of using a Makefile or any form of high-level build system.


> The only reason an external file/DAG is actually necessary is impurity in the compilation process.

But files also make the various pieces of the compiler chain interoperable and allow me to define a DAG. That's exactly why make is so powerful, and that's exactly what I'd hate to lose.

Modern compilers do a lot, and understandably they are trying to avoid writing out partially calculated state to disk (e.g. serializing an AST to disk between stages would be doing work twice). But moving everything into the process means your compiler becomes a walled garden.

You can see this happening in the JavaScript world. Very few people actually know what webpack does. It's a giant black box with an infinite number of switches, and everything is "magic".


I totally agree with this and think it's a valid concern. It's difficult to get the best of both worlds here within the confines of the Unix process model.

The query style compiler isn't by necessity a single process. You could imagine an implementation based on the actor model or almost any other message passing system where the queries are driven externally. That the expedient way to do this is by stuffing everything into a giant process is regrettable.


> The only reason an external file/DAG is actually necessary is impurity in the compilation process.

No compilation process can know about the parts of your project written in another language.


I didn't do a good job of explaining it. The thing is that if your compiler is truly pure, your source input is a node in a graph, the compiler is an edge, and the output is another node. Given two languages with pure compilers, the entire compilation process inherently admits a DAG, rather than having to reconstruct the DAG in order to drive the compilation process.


> I didn't do a good job of explaining it. The thing is that if your compiler is truly pure, your source input is a node in a graph, the compiler is an edge, and the output is another node.

You've just described the role of a compiler in a build system.

> Given two languages with pure compilers, the entire compilation process inherently admits a DAG

That's what you're getting wrong. Building is not the same thing as compiling. A software build is a workflow composed of one or more jobs, where compiling is one of the many types of jobs that are performed. In fact, the process of generating a library or executable from source code is a multi-step process where compiling is only one of the many steps. You get source code, which may be processed by a macro processor to generate the source code to be compiled; then the compiler generates binaries from that source code, which are then handed to the linker, etc etc etc.


What build systems get wrong from a purity perspective is that, from the build system's perspective, the compiler has hidden inputs and outputs.

You cannot infer the shape of the DAG from the inputs and outputs, and so you must construct an artificial mapping to those inputs and outputs.

All I am saying is that with pure compilers this is unnecessary, because the inputs, outputs, and compilation steps readily form a graph.

It's just hard to explain, because our familiar approach (to operating systems in general) is not well suited to it.

A couple of introspective questions: from a Makefile, can you reconstruct the input, send it over the network, and successfully compile it without knowledge specific to the compilation step?

Can you modify the input in memory, and recompile it without saving?

Can you start the final step when the first step has finished?

All of these things should realistically be possible in a pure build process where each step is pure. Some of these questions relate to the input, some to whether the DAG accurately reflects the real DAG underlying the build steps, etc etc etc.

Anyhow, I've said all that I really want to say; that our familiar systems don't readily admit it is a separate topic altogether.


> What build systems get wrong from a purity perspective

Build systems are not getting anything wrong. You're just struggling with a bunch of misconceptions about what a build system is and does, and what role a compiler plays in the whole process.

> is that the compiler has hidden inputs

There are no hidden inputs. There are only the compiler settings and project configs, and even the choice of compiler, which are all managed and set by the build system, and then there is the source code.

That's it.


Leaving aside UNIX compilers with their primitive toolchains, these kinds of ideas go all the way back to the Xerox development environments.

Lucid and IBM already had query-based compiler architectures for their C++ toolchains (Energize C++ and Visual Age for C++ v4).


If you cache only at the file level, then you're going to miss a lot of opportunities and still repeat a lot of work.

You could cache at the function level, or cache each template instantiation separately. You could automatically cache includes for files that change frequently, to avoid parsing headers again and again without having to manually specify precompiled headers (yes, sure, modules should help there soon).
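
As a hypothetical illustration of the header case (sketched in Haskell rather than C++ to keep it short): key a shared cache on a content hash of each header, so an unchanged header is parsed once and reused by every translation unit that includes it, with no manually maintained precompiled headers.

    import qualified Data.Map.Strict as Map
    import Data.IORef
    import Data.Hashable (hash)   -- from the "hashable" package

    type HeaderText = String
    type HeaderAST  = String      -- stand-in for a real parsed representation

    -- Shared across all translation units in a build: content hash -> parsed header.
    type HeaderCache = IORef (Map.Map Int HeaderAST)

    -- Parse a header, or reuse an earlier parse of identical content.
    parseHeaderCached :: HeaderCache -> (HeaderText -> HeaderAST) -> HeaderText -> IO HeaderAST
    parseHeaderCached cacheRef parse text = do
      let key = hash text                    -- unchanged content => same key
      cache <- readIORef cacheRef
      case Map.lookup key cache of
        Just ast -> pure ast                 -- already parsed somewhere: reuse
        Nothing  -> do
          let ast = parse text               -- parse once...
          modifyIORef' cacheRef (Map.insert key ast)
          pure ast                           -- ...reuse for every later include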


"Traditional" compilers (e.g. GCC, GHC, javac, etc.) are essentially single-purpose black boxes: source code goes in, executable comes out.

Usually that source code must be on disk, often it must be arranged in a certain directory structure (e.g. src/module/...), and sometimes it must be in files with particular names (e.g. javac). This forces programmatic use of the compiler to be more complicated, e.g. setting up temporary directories to appease these rules.

That single purpose is a common use case, but certainly not the only one. Traditional compilers typically perform pre-processing, lexing, parsing, precedence resolution, name resolution, macro expansion, type inference, type checking, optimisation, code generation, linking, stripping, etc. all within the same binary (there are some exceptions, e.g. the C pre-processor can also be invoked separately).

In my experience, this is the opposite of composable and extendable! Each of these steps is very useful in its own right, yet we typically have no way to invoke them independently, e.g. to parse code into a structured form; or to infer the type of an expression; or to resolve a name; or to optimise an AST; etc.
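
For a sense of what invoking those steps independently could look like, here is a hedged sketch (hypothetical names and stub bodies; no existing compiler exposes exactly this) of the stages as ordinary library functions that other tools could call and compose:

    type Source = String
    data AST    = AST String   deriving Show
    data Name   = Name String  deriving Show
    data Type   = TyCon String deriving Show
    type Error  = String

    -- Each stage as its own function, usable without the rest of the pipeline.
    parse :: Source -> Either Error AST
    parse src = Right (AST src)                -- stub; a real parser goes here

    resolveName :: AST -> String -> Either Error Name
    resolveName _ n = Right (Name n)           -- stub name resolution

    inferType :: AST -> Name -> Either Error Type
    inferType _ _ = Right (TyCon "Int")        -- stub type inference

    optimise :: AST -> AST
    optimise = id                              -- stub optimiser

    -- Because each stage is an ordinary function, other tools can compose just
    -- the parts they need, e.g. a documentation generator that only wants types
    -- and never touches code generation:
    typeOf :: Source -> String -> Either Error Type
    typeOf src name = do
      ast <- parse src
      n   <- resolveName ast name
      inferType ast n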

To make this composable and extendable in the way you suggest, we would need to make these separate processes, piped together with a build tool (e.g. make, or a helper script). In practice this doesn't happen, but some projects have hooks into their code for extensibility; e.g. GCC can be run with different front- and back-ends, and the "middle-end" can be extended with new passes and plugins (finally!); GHC has some limited plugin functionality, and has a (very flaky!) Haskell API for invoking its different stages; etc.

My point is that the "traditional" world was pretty awful for composability and extendability. From the outside, we had big opaque compiler processes invoked by Make. If we're willing to drop down to the compiler's implementation language, there were some limited facilities to make them do something other than their usual source files -> binary task.

If we look at the post, we see that it's talking about "drop down to the compiler's implementation language" rather than standalone processes with stdio and Makefiles. However, the approach it's talking about is precisely one of pure functions (e.g. `fetchType`) and a flexible build system (Rock), providing composability and extendability. It even says this explicitly, e.g.

> The rules of our compiler, i.e. its "Makefile", then becomes the following function, reusing the functions from above:

Note that the post isn't specifically about LSP; it only mentions "providing editor tooling, e.g. through a language server". It doesn't even talk about long-running processes. As a counter-example, it would be pretty trivial to expose these 'tasks' as standalone commands, piping through stdio, if we really wanted to. So we're not "giving up too much"; we would be gaining composability and extendability!
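
For readers who haven't opened the post: a rough, simplified sketch of the shape it describes (illustrative names only; the post and Rock define the real Query/Task machinery, and Rock adds the caching and dependency tracking that this naive driver omits):

    {-# LANGUAGE GADTs, RankNTypes #-}

    type ModuleName = String
    type Name       = String
    data Type       = TyCon String deriving Show

    -- Each constructor is one question the compiler can be asked.
    data Query a where
      ParsedModule :: ModuleName -> Query [String]      -- "parse this module"
      FetchType    :: ModuleName -> Name -> Query Type  -- "what is the type of this name?"

    -- A Task answers a query, possibly by fetching other queries first.
    newtype Task a = Task { runTask :: (forall b. Query b -> IO b) -> IO a }

    -- The compiler's "Makefile": how each query is answered.
    rules :: Query a -> Task a
    rules (ParsedModule m) = Task $ \_fetch ->
      pure ["definitions of " ++ m]                     -- stand-in for real parsing
    rules (FetchType m n)  = Task $ \fetch -> do
      _defs <- fetch (ParsedModule m)                   -- depend on another query
      pure (TyCon ("TypeOf " ++ n))                     -- stand-in for type checking

    -- A naive driver with no caching; a build system like Rock memoizes
    -- results and records the dependency edges between queries here.
    fetchNaive :: Query a -> IO a
    fetchNaive q = runTask (rules q) fetchNaive

Every use case below (code search, docs, static analysis) is then just another caller of such a fetch function, not something that needs a long-running IDE process.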

As for "faster recommendations in IDEs", that's a complete straw man. The post gives the example of querying for the type of a qualified function name, and a few others (e.g. resolving names). Sure, those would be useful for IDEs, but they would also be useful for many more systems. Some examples, off the top of my head:

- Code search, e.g. searching by type (like Hayoo, but more general and less flaky); finding usages across a package repo (this relies on name resolution)

- Chasing (resolved) name occurrences, e.g. to find downstream projects impacted by a breaking change; or to speed up delta-debugging by only checking commits which change code used by the test.

- Documentation generators can benefit from looking up types.

- Static analysers benefit from name resolution, type inference/lookup, etc.

Personally I've spent years on projects which use compilers for things other than their usual source files -> executable task, and their lack of composability and extendability is painful (my comment history is full of rants about this, mostly regarding GHC!). The approach described in this post would be amazing to see in "real" languages (i.e. those with lots of users and code, where more tooling and automation would provide a lot of benefit). I've often thought about a similar approach to this 'query-based' design, and would love to see things go even further in this direction (e.g. to a Prolog-style database of code)


Reminded me of this lecture from last year:

Responsive compilers - Nicholas Matsakis - PLISS 2019

https://youtube.com/watch?v=N6b44kMS6OM

(Of course it's based on Rust, but the same principles would be applicable elsewhere)


The blog post does cite salsa, which is the framework that was created for Lark, the language that was created to prototype Rust's implementation of query-based compilation.

https://github.com/lark-exploration/lark

https://github.com/salsa-rs/salsa


Really interesting ideas, though I think the content is hampered a bit by the use of what I think is Haskell in the examples. It hinders accessibility when you're having to learn two unrelated things in parallel, and I don't think the average audience for this piece can be expected to know Haskell well enough to follow along effectively.


This is from a blog that has 2 posts. Both discuss a compiler for an experimental dependently typed language. That's a fairly specialized topic.


They are talking about the idea, and their implementation, which is in Haskell.


I think the author linked a Rust example of a query-based compiler: https://github.com/salsa-rs/salsa


Good... but why does the author think modern compilers / language servers DON'T do caching? Or, to put it another way: why does the author think the caching mechanisms in modern compilers / language servers are insufficient? I think the author is proposing a caching mechanism that has finer granularity, and designing the whole system around this idea from day one. But first, many components in LLVM have been doing caching for a long time (e.g. OrcJIT has a pretty decent caching layer, and libTooling also supports incremental parsing with AST caches). Second, what is the memory overhead of this (fine-grained caching) design when it's facing a larger real-world input program (e.g. an OS kernel)? Does it scale well?

I know it's refurbished from an old post, so I probably shouldn't be so harsh. But it would be better to compare old ideas against state-of-the-art work and find some insights, rather than doing pure archaeology.


I am not the author, but work in the same domain. Empirically, existing compilers are impossible to turn into good IDEs, unless the language has header files and forward declarations. Otherwise, you get a decent IDE by doing one of the following:

* writing two compilers (C#, Dart, Java, and, in some sense, every major language supported in JetBrains tooling)

* writing an IDE-oriented compiler from the start (Kotlin & TypeScript)

The two examples where I've heard a batch compiler was successfully used to build a language server are C++ and OCaml (I haven't tried these language servers myself though). Curiously, they both use header files, which means that it's the user who does the fine-grained caching.

I don't see how caching in LLVM is relevant to the task of building LSP servers.

In terms of prior art, I would suggest studying how IntelliJ works, specifically looking at the stubs:

https://www.jetbrains.org/intellij/sdk/docs/basics/indexing_...


There are ways to turn batch compilers into incremental IDE compilers with some tree and effect caching on the periphery of the compiler. You can even go all the way up to the eval level for a full live programming environment. You don’t need module signatures if you are willing to trace dependencies dynamically.

See https://www.microsoft.com/en-us/research/publication/program....


> OrcJIT has a pretty decent caching layer

(Sadly) this is not correct. If you want to cache compiled code between sessions, your only option so far is a handwritten object cache along the lines of this example:

https://github.com/llvm/llvm-project/blob/ae47d158a096abad43...



