Shameless plug for my syntax highlighting library in Rust which uses Sublime Text 3 grammars, which give much richer highlighting and semantic information than Pygments: https://github.com/trishume/syntect
Looking at the Sublime Text 3 documentation [1], I see very little functional difference from Pygments. In fact, they seem remarkably similar. Pygments models a state machine via regex matching and actions, including pushing/popping state, sub-lexers, and so on.
What is the "richer highlighting and semantic information" to which you refer?
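For anyone following along, here is a rough sketch of the model being discussed: regex rules grouped by state, with actions that push or pop a state stack. It uses only Python's standard `re` module. The `RULES` table, the token names and the `tokenize` helper are made up for illustration and are not the API of Pygments, Sublime Text or syntect.

```python
import re

# Hypothetical toy grammar: each state maps to (regex, token type, action) rules.
# The action is None (stay), "#pop" (leave the state), or a state name to push.
RULES = {
    "root": [
        (re.compile(r'"'), "string", "string"),   # opening quote: push "string"
        (re.compile(r"\d+"), "number", None),
        (re.compile(r"[A-Za-z_]\w*"), "name", None),
        (re.compile(r"\s+"), "space", None),
    ],
    "string": [
        (re.compile(r"\\."), "escape", None),     # backslash escape inside a string
        (re.compile(r'"'), "string", "#pop"),     # closing quote: pop back to root
        (re.compile(r'[^"\\]+'), "string", None),
    ],
}


def tokenize(text):
    """Return (token type, text) pairs using the rules of the state on top of the stack."""
    stack = ["root"]
    pos = 0
    tokens = []
    while pos < len(text):
        for pattern, kind, action in RULES[stack[-1]]:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                if action == "#pop":
                    stack.pop()
                elif action is not None:
                    stack.append(action)
                break
        else:
            # No rule matched in the current state: emit one error character.
            tokens.append(("error", text[pos]))
            pos += 1
    return tokens


if __name__ == "__main__":
    for token in tokenize('say "hi\\n" 42'):
        print(token)
```

The same characters get different treatment depending on the current state: a backslash sequence is only an `escape` token while the `string` state is on top of the stack.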
I'm interested in your package, but as a Go programmer. I spent my holidays writing a [colorizer](https://github.com/chmike/clrz) package in Go. I compared many colorizer packages and tried to make something original. I also wanted to support lexers faster than the regex-based ones. I had to stop at the end of the holidays. Do you have documentation on how your lexer works? Does it use regex only? I don't understand Rust and don't want to learn it.
Syntect's grammars are regex-based: they're based on ST3's grammars, which are an extension of the fundamental model used by TextMate/Atom/VSCode. They're stronger than just regexes, though. There's a fancy stack machine with all sorts of features laid on top that allows it to do full parsing of a lot of languages.
The more modern grammars are written in a regex style that makes heavier use of the stack machine and so only uses regexes that can be turned into a DFA. There's ongoing work to use a layer on top of Rust's super-fast DFA-based regex engine to accelerate these grammars: https://github.com/trishume/syntect/pull/34
The problem with using non-regex-based grammars is that you have to write them yourself. Syntect is something like 4000 lines of code, but the grammars it uses total around 35k lines, and that's just the included ones, not the full ecosystem of online grammars. Basically, unless you only want to support a small set of languages, a non-regex-based highlighting library is fairly infeasible for a single hobbyist.
> There's ongoing work to use a layer on top of Rust's super-fast DFA-based regex engine to accelerate these grammars
Will it really be faster? It didn't seem so from that GitHub thread.
As a data point for anyone curious, I'm using Syntect myself in a toy project. With Oniguruma (C NFA regexes), it highlights 200 lines of Rust in 40 ms on a 10 W TDP Celeron, which is all right, but a bit slower than I expected.
It'll hopefully be faster. The Rust regex engine is quite fast, but I haven't done any profiling to figure out why performance was the same in my initial test. There might be something easy to fix.
It's definitely possible to get better performance out of the underlying model, since Sublime does, but they have a custom DFA-based engine that can test regexes in parallel with captures, which Rust's regex engine can't.
Sorry. You are right. I'm close to 55 years old and starting to feel the limit on the number of neurons available. I have to use them sparingly. That is why I enjoy Go so much. It's no judgment of Rust. It's just me. :)
I mean, it wasn't, but no comment is. Judging by all the upvotes people seemed to get value from it.
I think it's interesting to compare syntax highlighting approaches. If syntect was literally the same thing, but only usable in Rust, I wouldn't have commented. But syntect uses a different approach that's better for some use cases, and as demonstrated by Sourcegraph, is useable from a Go program (albeit with a cost), so is a plausible alternative to consider. The tradeoff is of course that it isn't directly in Go, and so may be slower, and also it supports fewer languages out of the box (although you could probably exceed Pygments with all online tmLanguage and sublime-syntax files).
Perfect timing, I've been looking to add syntax highlighting to my blog. Took me about an hour to integrate it this morning. Here's a working example using the excellent blackfriday package.
I would say almost zero is algorithmic, as Chroma very closely adheres to the design of Pygments.
The improvement is due to two factors, with the first being by far the biggest factor:
1. Hugo no longer has to call an external `pygmentize` tool for every highlight. This removes the overhead of the fork/exec, as well as the (not-insignificant) overhead of the Python interpreter starting up.
2. Go is generally a faster language than Python.
The caveat with 2 is that Python can spend large amounts of time in C, e.g. doing regex matching.
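To get a feel for point 1, here is a rough sketch that compares calling a function in-process with spawning a fresh interpreter for every call. It does not run `pygmentize` or Hugo; `highlight_in_process` is a hypothetical stand-in that just uppercases its input, and the child process is a bare Python interpreter doing the same thing. The numbers depend entirely on your machine, so none are quoted here.

```python
import subprocess
import sys
import time

# Hypothetical child command: a fresh interpreter that uppercases stdin.
# It stands in for an external highlighter invoked once per code block.
CHILD_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def highlight_in_process(text):
    """Stand-in for a library call; a real highlighter would do far more work."""
    return text.upper()


def highlight_external(text):
    """Do the same work, but pay for process creation and interpreter startup."""
    result = subprocess.run(
        CHILD_CMD, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


def average_seconds(func, text, runs):
    """Average wall-clock time of func(text) over the given number of runs."""
    start = time.perf_counter()
    for _ in range(runs):
        func(text)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    sample = "fn main() {}\n" * 200
    print("in-process:", average_seconds(highlight_in_process, sample, 1000))
    print("external:  ", average_seconds(highlight_external, sample, 10))
```

Since the work per call is trivial in both cases, whatever gap you see is mostly process and interpreter startup, which is the cost a site generator pays once per highlighted block when it shells out.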
Sourcegraph also wrote a server using syntect which provides an API for highlighting, which they use to power their new server, so you can use it from any language (at a cost): https://about.sourcegraph.com/blog/announcing-sourcegraph-2 https://github.com/sourcegraph/syntect_server