There is a rustls side project called Graviola that's building a fast crypto provider in Rust+ASM. It's taken an interesting approach: starting with an assembly library that's been formally proven correct, and then programmatically translating that into Rust with inline assembly that's easy to build with Rust tooling.
I contributed a number of performance patches to this release of zlib-rs. This was my first time doing perf work on a Rust project, so here are some things I learned:
Even in a project that uses `unsafe` for SIMD and internal buffers, Rust still provided guardrails that made it easier to iterate on optimizations. Abstraction boundaries helped here: a common idiom in the codebase is to cast a raw buffer to a Rust slice for processing, to enable more compile-time checking of lifetimes and array bounds.
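A minimal sketch of that idiom (the function and its name are made up for illustration, not zlib-rs's actual API): the `unsafe` is confined to one cast at the boundary, and everything after it is safe slice code with compiler-checked bounds and lifetimes.

```rust
// Hypothetical example of the "cast a raw buffer to a slice" idiom.
fn checksum(buf: *const u8, len: usize) -> u32 {
    // SAFETY: the caller must guarantee `buf` points to `len` valid bytes.
    let data: &[u8] = unsafe { std::slice::from_raw_parts(buf, len) };
    // From here on, this is ordinary safe code operating on a slice.
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

fn main() {
    let v = [1u8, 2, 3, 4];
    assert_eq!(checksum(v.as_ptr(), v.len()), 10);
    println!("ok");
}
```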
The compiler pleasantly surprised me by doing optimizations I thought I’d have to do myself, such as optimizing away bounds checks for array accesses that could be proven correct at compile time. It also inlined functions aggressively, which enabled it to do common subexpression elimination across functions. Many times, I had an idea for a micro-optimization, but when I looked at the generated assembly I found the compiler had already done it.
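As a toy illustration of the bounds-check elimination (this is a made-up function, not code from zlib-rs): a single length check up front is enough for rustc/LLVM to prove the four indexed accesses are in range, so the per-access bounds checks disappear from the generated assembly.

```rust
// Hypothetical example: one assert lets the compiler elide four bounds checks.
fn sum4(xs: &[u32]) -> u32 {
    assert!(xs.len() >= 4);
    // Each of these would normally carry a bounds check; the assert above
    // lets the compiler prove them all in range at compile time.
    xs[0] + xs[1] + xs[2] + xs[3]
}

fn main() {
    assert_eq!(sum4(&[1, 2, 3, 4, 5]), 10);
    println!("ok");
}
```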
Some of the performance improvements came from better cache locality. I had to use C-style structure declarations in one place to force fields that were commonly used together to inhabit the same cache line. For the rare cases where this is needed, it was helpful that Rust enabled it.
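A sketch of what that looks like, assuming `#[repr(C)]` is the mechanism and with made-up field names (zlib-rs's actual fields differ): Rust's default layout is free to reorder fields, while `#[repr(C)]` pins them to declaration order, so hot fields can be declared adjacently and tend to land on the same cache line.

```rust
// Hypothetical struct showing C-style layout for cache locality.
#[repr(C)]
struct State {
    // Hot fields, read together on every iteration: keep them adjacent.
    bit_buffer: u64,
    bits_used: u8,
    mode: u8,
    // Cold fields follow, out of the hot cache line's way.
    error_message_len: usize,
}

fn main() {
    // With repr(C), field offsets follow declaration order.
    assert_eq!(std::mem::offset_of!(State, bit_buffer), 0);
    assert!(std::mem::offset_of!(State, bits_used)
        < std::mem::offset_of!(State, error_message_len));
    println!("ok");
}
```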
SIMD code is arch-specific and requires unsafe APIs. Hopefully this will get better in the future.
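For concreteness, here's the usual shape of that pattern (a generic sketch, not zlib-rs's actual code): arch-gated intrinsics behind runtime feature detection, with a portable scalar fallback. The intrinsics themselves are `unsafe` and only exist for the target architecture.

```rust
// Sum the bytes of a slice, using SSE2 where available.
fn sum_u8(xs: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: we just verified SSE2 is available at runtime.
            return unsafe { sum_u8_sse2(xs) };
        }
    }
    sum_u8_scalar(xs)
}

fn sum_u8_scalar(xs: &[u8]) -> u32 {
    xs.iter().map(|&b| b as u32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn sum_u8_sse2(xs: &[u8]) -> u32 {
    use std::arch::x86_64::*;
    let mut acc = 0u32;
    let chunks = xs.chunks_exact(16);
    let rem = chunks.remainder();
    for chunk in chunks {
        // psadbw sums the 16 bytes into two u64 lanes; add the lanes.
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let sums = _mm_sad_epu8(v, _mm_setzero_si128());
        let hi = _mm_unpackhi_epi64(sums, sums);
        acc += _mm_cvtsi128_si64(_mm_add_epi64(sums, hi)) as u32;
    }
    acc + sum_u8_scalar(rem)
}

fn main() {
    let data: Vec<u8> = (0u8..=255).collect();
    assert_eq!(sum_u8(&data), 32640);
    println!("ok");
}
```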
Memory-safety in the language was a piece of the project’s overall solution for shipping correct code. Test coverage and auditing were two other critical pieces.
Interesting! I wonder if you have used PGO in the project? Forcing fields to be located next to each other kind of feels like something that PGO could do for you.
I basically did manual PGO because I was also reducing the size of several integer fields at the same time to pack more into each cache line. I’m excited to try out the rustc+LLVM PGO for future optimizations.
The way Chrome achieves this backward-compatibility is by using the SSL Next Protocol Negotiation (NPN) extension during SSL handshaking. When the browser is establishing an SSL session, it mentions to the server that it's willing to speak SPDY (as part of the ClientHello message). If the server also speaks SPDY, it can communicate that fact back to the client. If the client sees that the server supports SPDY, it proceeds to send SPDY messages over the newly established connection once the SSL handshaking is complete. Otherwise, it sends HTTP messages. The cool thing about this approach is that it doesn't add any additional network round trips.
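The decision at the end of that handshake boils down to a preference-list intersection. A rough model in Rust (just the selection logic, nothing resembling actual TLS code; protocol strings and names are illustrative):

```rust
// Simplified model of NPN-style protocol selection: the client walks its own
// preference list and picks the first protocol the server also advertised,
// falling back to plain HTTP.
fn select_protocol<'a>(client_prefs: &[&'a str], server_offers: &[&str]) -> &'a str {
    client_prefs
        .iter()
        .copied()
        .find(|p| server_offers.contains(p))
        .unwrap_or("http/1.1")
}

fn main() {
    // Server advertises SPDY support during the handshake...
    assert_eq!(
        select_protocol(&["spdy/3", "http/1.1"], &["spdy/3", "http/1.1"]),
        "spdy/3"
    );
    // ...or doesn't, so the client falls back to HTTP. Either way, no extra
    // round trip: the exchange rides along with the SSL handshake.
    assert_eq!(select_protocol(&["spdy/3", "http/1.1"], &["http/1.1"]), "http/1.1");
    println!("ok");
}
```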
I've found Ragel ( http://www.complang.org/ragel/ ) to be a good compromise: a Ragel grammar is less error-prone and easier to maintain than a handwritten lexer, yet Ragel still lets you use regular expressions for all the little places near the leaves of a grammar where token rules are easy to express that way. In contrast to most regex APIs, it compiles the state machine at build time rather than runtime, and the generated code can be quite fast (although you have to make a speed-vs-code-size tradeoff).
In my experience, it's useful, even when writing high-level applications, to be aware of the relative cost of low-level operations. The "Numbers Everyone Should Know" slide from this deck is a reasonable starting point: http://research.google.com/people/jeff/stanford-295-talk.pdf To generalize a bit, the small numbers at the top of that chart are mostly a concern for people doing systems programming, but as you progress down the list you'll find operations costly enough to have a noticeable impact on application programs. E.g., if you build a typical web application in a high-level language, your user won't be able to tell if you add a hundred hard-to-predict conditional branches or a hundred L1 cache misses per page view; but if you add a hundred network round trips per page view, they'll observe a measurable slowdown. Similarly, if you're making a game and you want to display graphics at 60 frames per second, you can do quite a lot of computation per frame, but you can't read a file from disk on every frame.
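To put rough numbers on those claims (order-of-magnitude figures in the spirit of that chart, not measurements):

```rust
// Back-of-the-envelope arithmetic with round numbers.
fn main() {
    let l1_miss_ns = 10.0; // ~an L1 miss served from the next cache level
    let round_trip_ns = 500_000.0; // ~0.5 ms within a datacenter

    // A hundred of each per page view:
    let cache_cost_ms = 100.0 * l1_miss_ns / 1_000_000.0;
    let network_cost_ms = 100.0 * round_trip_ns / 1_000_000.0;
    println!("100 L1 misses:   {cache_cost_ms} ms"); // ~0.001 ms: invisible
    println!("100 round trips: {network_cost_ms} ms"); // ~50 ms: very noticeable

    // At 60 fps the budget is ~16.7 ms per frame: room for a great deal of
    // computation, but not for a ~10 ms disk seek.
    let frame_budget_ms = 1000.0 / 60.0;
    assert!(frame_budget_ms < 17.0);
}
```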