Lwan: Experimental, scalable, high performance HTTP server

mmastrac · on Oct 12, 2014

I had to Google "rebimboca da parafuseta" out of curiosity. Apparently it's a Brazilian term roughly analogous to "reticulation of the splines".

acidx · on Oct 12, 2014

Author here. This gave me a chuckle. :)

And, yes, that's pretty much it, although "rebimboca da parafuseta" is less obscure than "reticulation of the splines" (if you're Brazilian, anyway).

It is used to denote a fictitious part, which name or function are unknown, of a car engine or any other machine. It was coined in the 70s TV show, and later used in some ads aired during the same period; since then, it's an expression used for humorous effect. It's not very common these days but is unlikely you'll meet someone down here that never heard it.

mutagen · on Oct 13, 2014

Aha, not unlike the retro-encabulator:

https://www.youtube.com/watch?v=RXJKdh1KZ0w

emmelaich · on Oct 13, 2014

Which is of course a riff on the original turbo-encabulator: https://www.youtube.com/watch?v=Ac7G7xOG2Ag

techdragon · on Oct 13, 2014

Love that video, I may have shared it with more people than I did the original rick roll video lol

kaoD · on Oct 12, 2014

We've got a similar idiom in Spain: "la junta de la trócola" ("trócola's joint", not really translatable, though trócola is apparently a synonym for "polea", "pulley").

It was coined for a cigar brand TV commercial back in the 90s and also alludes to a fictional part, in this case used by a car repairman to fraud a customer into paying more in said commercial.

adrianm · on Oct 12, 2014

Out of curiosity, do you know if this translation is used in the Portuguese versions of SimCity?

acidx · on Oct 12, 2014

If I recall correctly, a literal translation is used ("reticulando os splines"). But it's been years since the last alien invasion.

ufo · on Oct 13, 2014

I dont think it would be an accurate translation in that case. the rebimboca is a fictional object while spline reticulation is a fictional action.

codingbeer · on Oct 12, 2014

For more information on some of the C magic behind this well-written piece of software, check out the authors blog[1]. It should be pretty interesting for any systems programmer.

[1]: http://tia.mat.br/blog/html/index.html

nwmcsween · on Oct 12, 2014

A few issues:

* rawmemchr - it might be faster as it doesn't have to decrement the size_t but this only mildly relevant for lwan as many instances of rawmemchr use are simply rawmemchr(ptr, '\0') which is exactly the same as ptr + strlen(ptr) + 1 and even less optimized.

* pthread_tryjoin_np - __linux__ is defined by gcc not glibc, you should check for __GLIBC__ if you want to use glibc specific functions.

* underscore prefixed functions - pedantic I know but it is reserved for the implementation.

acidx · on Oct 13, 2014

These are easy things to fix: feel free to issue a few pull requests. :)

Regarding rawmemchr(): both are pretty well optimized. Both are implemented in glibc using the same technique (reading a byte at a time until it is aligned, then moving to multibyte reads). strlen() might be faster, yes, considering that the implementation can hardcode some magic numbers. In other words: some micro benchmarks might help decide here.

Regarding __linux__ vs. __GLIBC__: Lwan works with some alternative libcs (such as uClibc), so relying on __GLIBC__ being defined for things like this doesn't seem like a good idea. In any case, since Lwan isn't portable anyway, one can just assume it is always running on Linux and get rid of these #ifdefs.

e12e · on Oct 12, 2014

Any relationship to G-WAN[g]? I can't see any mention of it in the readme, or on the web page (perhaps I didn't look hard enough) -- but there appears to be some resemblance? (Use as a C web framework work-a-like, fast webserver etc)?

[g]http://gwan.com/

acidx · on Oct 12, 2014

Apart from sharing the "wan" suffix and choice of language, there's no relationship whatsoever.

hendzen · on Oct 12, 2014

"Hand crafted HTTP request parser" - hard to see how this really faster (and less bug prone) than generating one with Ragel.

nly · on Oct 12, 2014

Much of the advantage of using a DFA generator like Ragel is lost because the HTTP header grammar is actually ambiguous in several places and can't be streamed. You can use it as a component, but it isn't entirely sufficient on its own.

The HTTP 1.1 RFC requires whitespace be stripped at the end of header values, yet also permits (although deprecates) header folding, giving rise to the following ambiguity (using _'s in place of leading spaces):

    Foo: Hello\r\n
    ______\r\n
    ________\r\n
    ________ world!\r\n
    Bar: smeg

This requires that the parser buffer all the whitespace between 'Hello' and 'world!' (and the RFC doesn't put a standard limit on header value length) just in case 'world!' never comes and the value of the Foo header has to be stripped back to just "Hello"

Here's a related observation from a commit[0] by the Joyent guys, who wrote the streaming parser used by NodeJS: "For http-parser itself to confirm[sic] exactly would involve significant changes in order to synthesize replacement SP octets. Such changes are unlikely to be worth it to support what is an obscure and deprecated feature"

Another example is parsing the Request and Status lines:

    GET <uri> HTTP/1.1

Technically <uri> can't contain spaces, but the RFC says you MAY accept them [RFC7230: 3.5. Message Parsing Robustness] ... which then gives rise to the possibility of <uri> containing the literal string " HTTP/1.1", and ultimately opens up bad user agents that send spaces to header injection.

Resolving these ambiguities require implementing your own buffering, and dropping down to Ragels 'state charts' feature to avoid your semantic actions being munged... which leaves you to design the top level state machine yourself.

[0] https://github.com/joyent/http-parser/commit/5d9c3821729b194...

oever · on Oct 12, 2014

https://www.ietf.org/rfc/rfc2616.txt : "All linear white space, including folding, has the same semantics as SP." So only one space need be counted and any subsequent space or horizontal tab can be ignored. Your example simplifies to:

Foo: Hello World\r\n Bar: smeg

nly · on Oct 12, 2014

It doesn't matter that the obs-fold can be treated as a single space because you don't know that until you've actually reached the fold. Consider:

    Foo: Hello_______world\r\n

Here the whitespace has to be preserved, as per the field-content production.

So RFC2616 allowed:

    Foo: Hello_______\r\n
    ____world\r\n

to be reduced to:

    Foo: Hello world\r\n

Incidentally RFC7230 (it's successor) says something subtly different:

"A server that receives an obs-fold in a request message that is not within a message/http container MUST either reject the message by sending a 400 (Bad Request), preferably with a representation explaining that obsolete line folding is unacceptable, or replace each received obs-fold with one or more SP octets prior to interpreting the field value or forwarding the message downstream."

So now it's been loosened to one or more SP octets... useful, except there's no mention of TAB octets, which can follow a CRLF as part of an obs-fold... so you still can't just remove the CRLF and preserve all the whitespace... preserving tabs would be illegal. So the new rules don't help streaming either. Joyents parser did this regardless (not sure if it still does).

You'll also notice it says the obs-fold can be replaced with a sequence of SP characters... according to the grammar that's the CRLF and the following whitespace, not any whitespace preceding the CRLF. You'd think that would be helpful because it means any whitespace before the CRLF can always be streamed as part of the field value, right? Except...

    Foo: Hello_______\r\n
    ____\r\n

would still simplify to:

    Foo: Hello_______[one or more SP octets]\r\n

and then what? Well, presumably you then have to trim off the trailing whitespace to produce a value of just "Hello"... so all the whitespace you buffered before you reached the CRLF (in case you reached another field-vchar like the 'w' in 'world') has to be discarded.

Retric · on Oct 12, 2014

Assuming your reading this from a buffer and most of the time you don't need to count then store an index to the o in hello followed by ignoring space. Then to back and count it only if you need to. Or better yet if whitespace is not need just copy a space but if it's needed then move those bytes to the next stage and don't bother to count them that go around.

nly · on Oct 12, 2014

... and what happens if your header happens to run off the end of your buffer amongst the spaces that follow 'Hello'? Cookies can be huge, I can construct a header like 'Cookie: x[8000 spaces]y'

Retric · on Oct 13, 2014

This get's into the specifics so you almost need to know the code your dealing with to make a relevant suggestion.

However, the easy option is to default to the slow code path on edge cases which considering it's probably a rare enough it's not important to make fast as long it's bug free. IMO, optimizations are always a balancing act between minimizing the computer's effort and minimizing the coders effort while trying to maintain long term readability. But, you can always keep track of more than one buffer so the option is there.

meowface · on Oct 12, 2014

I am fairly ignorant of Ragel, but to my understanding Ragel is better for assuring correctness rather than performance. I don't think Ragel makes any claims regarding performance.

I could see a specially written parser outperforming it, just like hand written assembly can still sometimes outperform a compiler.

I agree that it's more likely to have bugs, though.

quacker · on Oct 12, 2014

hard to see how this really faster (and less bug prone) than generating one with Ragel.

Hm, I actually think the opposite (except for the point about bugs). A hand-crafted parser can easily be faster than a generic parser because:

* You aren't limited to implementing a regular grammar

* You can guarantee no overhead is introduced by the parsing tool

* You have complete freedom to optimize at the lowest levels

Now, I've never used Ragel, so I'm speaking from my experience with other parser generators (and from the perspective of a programming language implementation). I'd be interested to know if Ragel is different with regard to any of these points.

nly · on Oct 13, 2014

> I'd be interested to know if Ragel is different with regard to any of these points.

It'll do a better job than you will at reducing DFAs to their optimal representation. Imho, with careful use of its pragmas, it produces pretty much optimal code when using -G2 (goto based code generation). Another really nice feature is being able to dump the DFA to a dot file and render it using Graphviz.

djcapelis · on Oct 12, 2014

> "Hand-crafted HTTP request parser"

Uh oh.

zongitsrinzler · on Oct 12, 2014

How does this compare to Nginx?

sauere · on Oct 12, 2014

Pretty good, i'd say about 50% faster on raw request speed. Anyway, it isn't a fair comparison given that nginx's feature set is much larger.

nwmcsween · on Oct 12, 2014

Nice, benchmarks vs the competition would be interesting as well. Also with 10K+ idle connections I would be more worried about kernelspace memory requirements (maybe recommended sysctl.conf changes).

donavanm · on Oct 12, 2014

Nah, its a couple KB per connection. the biggest consumer would be the tcp socket control structs and associated data buffers. Ball park 1.5KB for the structs and another 4-16KB for tcp buffers on a typical internet tcp connection.

nwmcsween · on Oct 12, 2014

but vars controlling how long a tcp sock is held for or if they are reused is controlled my the kernel.

donavanm · on Oct 13, 2014

I think youre talking about timewait states et al. On linux thats dominated by the MSL, which is a compile time constant of 60 seconds. You mentioned sysctls, those are primarily tcp_tw_reuse and tcp_tw_recycle (which is the worlds worst sysctl). Regardless, its a couple KB per connection. How many hundred thousand do you want to support?

abionic · on Oct 13, 2014

Can't spot an OSS license.

What license is aimed for it?

acidx · on Oct 13, 2014

GPLv2 (or later) at the moment. Might change to LGPLv2 (or later) soon, though.

CCs · on Oct 13, 2014

Any chance for MIT or BSD?

We can't use GPL or LGPL at the company, everything is statically linked.

acidx · on Oct 13, 2014

No chance for MIT or BSD, although LGPL with a static linking exception might be doable.

jagger27 · on Oct 13, 2014

Any roadmap for HTTP2 support?

acidx · on Oct 14, 2014

Not planned ATM. Still need to read more about it before giving it a go.

imaginenore · on Oct 12, 2014

Nginx can do 500K to 1 million req/s

http://lowlatencyweb.wordpress.com/2012/03/26/500000-request...

So can Google's compute engine on a $10 instance:

https://news.ycombinator.com/item?id=6804897

http://googlecloudplatform.blogspot.ca/2013/11/compute-engin...

acidx · on Oct 12, 2014

Performance isn't the only thing that you should look for in a web server. Nginx is probably the best choice for most applications, yes, as Lwan lacks lots of important features, real world testing, and community. And I say that having written Lwan: it is, for me, nothing but a toy. :)

OTOH, the beefiest machine I have access to test it is a 4 year old laptop, not a 24-core Xeon.

bhauer · on Oct 12, 2014

The tests at lowlatencyweb.wordpress.com were conducted without network connectivity—the load generator (wrk) was running on the same host as the web server. The results are 500K RPS for localhost connections with standard keep-alive and 1M RPS for localhost connections with pipelining. This is using a server with 24 HT cores and it's not clear to me what the response body was.

Google's Compute Engine test was using 200 virtual servers, but it does include network connectivity. The response body is a single byte. Their blog entry is a celebration of the performance of their load balancer more than a statement about the performance of each VM.

In March, we were able to exceed 1M requests with network connectivity and without pipelining to a single server [1]. Our project is not testing static web servers, so we don't test with plain nginx; but I expect nginx would also exceed 1M RPS in this hardware environment. This was using a server with 40 HT cores and a single-byte response body.

Similarly, a highly tuned web server such as OP's Lwan should be expected to exceed 1M RPS (network-connected) on a 40 HT core server. 1M RPS with small response payloads is fairly easy on modern hardware.

Incidentally, we see 6M+ RPS with pipelining in our Round 9 plaintext results [2].

[1] http://www.techempower.com/blog/2014/03/04/one-million-http-...

[2] http://www.techempower.com/benchmarks/#section=data-r9&hw=pe...

imaginenore · on Oct 12, 2014

If you want to go further, you need to get rid of the OS:

40 million req/sec with Lua

http://highscalability.com/blog/2014/2/13/snabb-switch-skip-...

justincormack · on Oct 12, 2014

Thats not a web server, its doing packet processing, which is a different problem. You could connect a web server to it, but that is not a benchmark of that.