Network protocols, sans I/O (sans-io.readthedocs.io)
123 points by kawera on Aug 7, 2016 | 36 comments


This idea is the basis for one of the popular HTTP libraries in OCaml, cohttp [1]. The core library (see the lib/ directory in the source code) implements the HTTP protocol, but makes no attempt to perform IO. The file lib/s.mli describes the interface that may be used to perform the IO for the library, making use of OCaml's module system to work generically over multiple interfaces. What's interesting about this approach is that it almost "generates the code" for IO once you plug it in to the implementation code (i.e. all you have to do is provide a handful of very basic functions and types and it will make use of those to construct the rest of the library).

[1] - https://github.com/mirage/ocaml-cohttp
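A rough Python analogue of that shape, for readers who don't speak OCaml (all names here are invented for illustration): the protocol core asks for a couple of primitive operations and builds the rest on top of them.

    # Hypothetical sketch: the protocol core is parameterized over IO
    # primitives instead of performing IO itself.
    class HttpCore:
        def __init__(self, read, write):
            self._read = read      # read(n) -> bytes, supplied by the backend
            self._write = write    # write(data) -> None, supplied by the backend

        def get(self, path):
            self._write(("GET %s HTTP/1.1\r\n\r\n" % path).encode())
            return self._read(65536)   # grossly simplified response handling

    # The same core works over sockets, files, test buffers, etc.:
    # HttpCore(sock.recv, sock.sendall)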


So I guess it will integrate well with uvloop (http://magic.io/blog/uvloop-blazing-fast-python-networking/), which is lacking an HTTP parser? :)

May I dream of combining this + uvloop + asyncpg [0], so we can start to have something for building simple web services à la Tornado that leverages Python 3.5?

[0] http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python...


MagicStack already did their own HTTP parser: https://github.com/MagicStack/httptools


I used that approach for my minecraft protocol parsing library (https://github.com/dividuum/fastmc). It's indeed very useful to decouple parsing a protocol from the underlying transport. I used my library both for socket communication as well as to read and write files. I'd love to see more libraries implemented that way. So thanks for posting that link.
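For illustration, the pattern looks something like this (a sketch; `ProtocolParser` and `handle` are stand-ins, not fastmc's actual API):

    parser = ProtocolParser()

    def pump(read_chunk):
        # Drive the same parser from any byte source.
        while True:
            chunk = read_chunk(4096)
            if not chunk:
                break
            for packet in parser.feed(chunk):
                handle(packet)

    pump(sock.recv)                          # bytes from the network...
    pump(open("session.dump", "rb").read)    # ...or from a recorded file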

I imagine it might be difficult sometimes depending on the protocol. IIRC TLS might sometimes require writes while reading.


On the topic of IO: I'm actually a fan of java.io.Reader/Writer and java.io.InputStream/OutputStream. It seems to me that most Java protocols and parsers are quite composable by default due to these simple little abstractions. Granted, they might not be a good fit for async IO, but I'd still argue that they've held up quite well during the past fifteen years. I'll go out on a limb and claim that standard C++ IO hasn't worked out quite as well, although it might just be my skip-the-oo-and-template-everything skepticism. Rust seems closer to Java, although I feel like the jury's still out. I'm not enthusiastic about doing everything asynchronously as in Node. Haskell does seem to get the composability right.

Python has some of Java's thinking through a standard convention on the method names for reading and writing (an interface!), but departures from those simple rules seem common when people get creative. From that perspective, an initiative like this is quite understandable and convenient. Nevertheless, I don't feel the same need in certain other languages.


A little off-topic but IMO one of the more unfortunate decisions by Java was Reader/Writer and InputStream/OutputStream, instead of CharReader/CharWriter and ByteReader/ByteWriter (or similar).


It turns out TLS isn't that bad.

OpenSSL exposes a "memory BIO", which is basically exactly this. You feed writes in, it emits data when it can, and in the meantime it buffers.

It's a bit more limited than that, but generally speaking it's not too bad.
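In Python, the same thing is exposed as `ssl.MemoryBIO` (since 3.5), which lets you run the TLS engine entirely in memory; a minimal sketch:

    import ssl

    incoming = ssl.MemoryBIO()   # ciphertext from the peer goes in here
    outgoing = ssl.MemoryBIO()   # ciphertext for the peer comes out here

    ctx = ssl.create_default_context()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname="example.com")

    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # The engine needs more data: ship outgoing.read() to the peer,
        # feed the reply into incoming.write(), and try again.
        pending = outgoing.read()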


I assume you handle decompression the same way? Feed in data, and the decompressor emits data when it can?
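(For what it's worth, Python's standard library already answers this the same way; `zlib.decompressobj` is a sans-I/O decompressor:)

    import zlib

    decompressor = zlib.decompressobj()

    def feed(chunk):
        # Push compressed bytes in; get back whatever output is ready,
        # which may be b"" if more input is needed.
        return decompressor.decompress(chunk)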


From the description, it sounds like the motivation is the lack of a single way to do I/O in Python. If that's right, then this shows the real advantage of Go's io.Reader/Writer/&c. interfaces, which enable this sort of composability.

From my Python years, I'd think that something built around mandatory read()/write() methods could do much the same.
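Something like this, say (a sketch using `typing.Protocol`, which only arrived later in Python 3.8, purely to show the shape):

    from typing import Protocol

    class Reader(Protocol):
        def read(self, n: int = -1) -> bytes: ...

    class Writer(Protocol):
        def write(self, data: bytes) -> int: ...

    # Files, BytesIO, socket.makefile() wrappers, etc. would all satisfy
    # these structurally, much like Go's io.Reader/io.Writer.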


I think that's a tempting conclusion to draw, but it's not quite right. The standard interfaces of Go reduce the pain, but the design principle ("Don't do I/O in parser or state machines") remains solid.

The reason for this is basically that I/O and protocol logic are separate concerns, and whenever they start influencing each other too much they impose costs on each other.

The best example is actually testing. If your protocol code includes calls to Golang's `Reader`/`Writer` interface methods, it causes a few problems.

The easiest one to see: for each call to `Reader`/`Writer` methods, in addition to testing all possible reads/writes (a protocol concern), you need to test all possible I/O failures (timeouts, closed connections, weird kernel problems) in order to actually cover the complete failure space.

However, if your code doesn't have reads/writes mixed in with protocol logic, your testing scenarios are much easier. Bytes just come in and go out. Reading/writing problems aren't an issue.

This is just basic separation of concerns stuff, but it really does help, even in languages with "blessed" I/O mechanisms.
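Concretely, a test against a sans-I/O parser is nothing but bytes in, events out (`HttpParser` here is a hypothetical stand-in for a library like h11):

    def test_parses_request_line():
        parser = HttpParser()
        events = parser.receive_data(b"GET /foo HTTP/1.1\r\n\r\n")
        assert events[0].method == b"GET"
        # No mock sockets, no timeout scenarios, no flaky network.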


I'm not sure I follow.

Why not test with something that implements the reader or writer interface, but shuffles bytes around in memory? That should alleviate the testing explosion.

And whatever interface you do have must be putting bytes in or getting bytes out of the protocol layers. Why not call that interface reader or writer?

I'm not seeing the distinction.


Sorry, let me be clearer.

The reason that just having an in-memory Reader or Writer doesn't solve the problem is that the failure modes don't match up. An in-memory reader/writer has basically no failure modes beyond ENOMEM. That's why in the no-I/O implementation, this is exactly what we use: write to an in-memory buffer.

Real I/O, on the other hand, has many failure modes. For example, consider timeouts. If your parser does I/O, you need to test timeouts at every location where your parser does I/O. You need to confirm it handles those timeouts appropriately. And you need to decide what "appropriately" means here: do you retry? Do you abort? Do you attempt to unwind that state transition?

All of these are expansions of your state space. This means your protocol parser has to handle this combinatorial explosion of possible outcomes: at every point you have a Read/Write you need to be ready and prepared to handle all possible error conditions that can come out of that.

If your parser does no I/O, though, and only writes to buffers, this problem does not exist. That allows you to have two totally isolated sections of code: one part manipulates bytes in memory (the parser), and another is responsible for getting those bytes to and from the network. Each can be tested separately. If the no-I/O parser needs `n` tests and the I/O layer without the parser needs `m` tests, the combined code would require `n * m` tests to achieve equivalent logical coverage of the possibility space.

Small, isolated components are good.
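To sketch the contrast (hypothetical code on both sides):

    import socket

    # With I/O mixed in, every read site grows its own error handling,
    # and its own tests:
    def read_header_with_io(sock):
        try:
            data = sock.recv(4096)   # timeout? reset? partial read?
        except socket.timeout:
            pass                     # retry? abort? unwind state?
        # ...repeated at every read site in the parser.

    # The sans-I/O version only ever sees bytes:
    def parse_header(buffer):
        line, _, rest = buffer.partition(b"\r\n")
        return line, rest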


Oh. So it is less about I/O vs no-I/O and more about push parsing vs pull parsing.

Because testing with a reader and writer interface lets you test errors too, but now you are talking about error recovery strategies (pull has to pass through or have smarts, push can know nothing).

I agree in many cases push has the advantages being discussed. I just wouldn't have called it no-I/O since that doesn't really have the right connotation.


Python has a standard interface just like Go's `io.{Read,Writ}er`: the `send` and `recv` methods as defined by `socket.socket`. The appeal of "sans I/O" protocol implementations is primarily for asynchronous I/O, which requires a completely different style of interface. The clever thing about Go is not the standardization of a reader/writer interface, but the fact that the goroutine I/O scheduler makes these synchronous interfaces nearly as efficient as asynchronous ones, so there is no need to introduce a separate family of asynchronous interfaces.


"Python has a standard interface just like Go's `io.{Read,Writ}er`: the `send` and `recv` methods as defined by `socket.socket`."

But a File has no "send" and "recv". io.Reader and Writer apply to files, sockets, byte slices, strings, compositions of other Readers or Writers, HTTP request and response bodies, and anything else that strikes my fancy to implement the correct methods. Sockets appear to deliberately not have those methods, as there's this line in the docs: "Note that there are no methods read() or write(); use recv() and send() without flags argument instead."

There's nothing preventing Python from having the equivalent of Reader and Writer. It just doesn't, at least not with the standard library. In fact that's true of a lot of languages; there's nothing preventing that interface from existing, it just doesn't. This is a point in favor of getting the standard library right earlier rather than a point in favor of Go-the-language.


"send" and "recv" for sockets are a Berkeleyism from BSD, where networking was an add-on alongside the existing UNIX kernel. The distinction is historical, not functional. "select" works on both file-like objects and socket-like objects on Linux. Under QNX, "write" and "read" work on sockets, and "send" and "recv" are just alternate functions for "write" and "read".

Whether Python should hide that distinction is not clear, but it could.


How would a standard interface like this support async IO?


It's easy enough to shim an async version to become sync, but impossible to go the other way (modulo gevent).
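For instance, wrapping a hypothetical `fetch_async` coroutine in a blocking facade is a one-liner, while the reverse needs threads or gevent-style trickery:

    import asyncio

    def fetch(url):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(fetch_async(url))   # fetch_async is hypothetical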

So again let me underline what I said before, that this is more a "standard library" problem than a language problem. Conceivably, Python could adopt Readers/Writers into the standard library, but it's late now. Python 2.0 didn't have the requisite stuff in the library to make the idea work, so modern-day penetration would be low. Go and other recent languages benefit from experience, and by putting these things in correctly from day one, can move on to the next problems we'll discover in standard libraries. :)

Sometimes people ask why we need new languages, and I think "get a fresh start on a standard library" is at least halfway to a valid answer. Imagine if Java the language stayed exactly the same, but we could reach into an alternate universe where a brand new batteries-included standard library was written based on the language as it is now. It would still be Java, but it would be a much better Java, with better interoperability between many currently-separate worlds through standardized, sensible interfaces and extension points, taking full advantage of lambdas, probably dropping the stupid synchronization stuff even if it is nominally in the language, etc. I think the end result is still something I wouldn't love, but it would be much better. But one can only dream about that.


Are Reader/Writer async?


Callbacks would be one way.


Can someone point to a good resource for best practices in protocol design? Am currently faced with this problem and would like to avoid having to reinvent the wheel...


Maybe this? https://tools.ietf.org/html/rfc3117

There is a lot of "philosophy" about protocol design, which seems mostly sound. Although I wonder what happened to the protocol mentioned (BXXP or BEEP). I think HTTP was "good enough" as a universal application protocol?

This one also talks about HTTP being wildly successful beyond its original design parameters:

https://tools.ietf.org/html/rfc5218


Gosh, I don't know of anything off the top of my head.

The only real tip I have is to byte-length-encode variable-length requests, because scanning strings sucks.

Having a frame for messages is nice: stick your fixed-width stuff up front, and use fixed-width length indicators. It sucks that you waste a little space with 0's, but avoiding the scan is pretty wonderful.

    GET /foo HTTP/1.1

    GET HTTP 01.10 0004/foo
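Parsing the second form is just slicing at known offsets (a sketch, using the field widths from the example above):

    def parse(frame):                    # frame is the str shown above
        verb = frame[0:3]                # "GET"
        proto = frame[4:8]               # "HTTP"
        version = frame[9:14]            # "01.10"
        path_len = int(frame[15:19])     # "0004" -> 4
        path = frame[19:19 + path_len]   # "/foo"
        return verb, proto, version, path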
Some things need to happen in order (you need to be authenticated before being granted access to resources), but that doesn't necessarily require the client to know the state at each step.

Traditionally you'd have something like

    HELO jfoutz hunter2
        OK
    REQUEST 1
        OK <bytes for resource 1>
    REQUEST 2
        OK <bytes for resource 2>
But really, the server knows if you're authenticated, so I could send

    HELO jfoutz hunter2
    REQUEST 1
    REQUEST 2
and then get back either

    OK
    OK <bytes for resource 1>
    OK <bytes for resource 2>
or

    NO
    NO
    NO
Or whatever. Generally it's just going to be a bunch of request/response pairs. Sometimes you can't make later requests without actually looking at a response: you don't know what CSS or images to request until the HTML is parsed, for example. Usually, you can request a bunch of stuff at once, and it'll work out however it works out.


> The only real tip I have is to byte-length-encode variable-length requests, because scanning strings sucks.

This is really, really good advice to use whenever possible: it means clients can determine a priori how much data they need to read, and perhaps decide whether the length is valid and/or allocate sufficient memory. One of my annoyances with the traditional "almost-text-based" protocols like HTTP, FTP, SMTP, etc. is that parsing them is not trivial and often you need to keep reading until you hit the delimiter or reach an internal limit. In contrast, "read a length, then read length bytes" makes for a simple and efficient implementation.

There are certainly cases when the amount of data cannot be determined ahead of time, and in those cases I'd suggest chunked-length-prefixing; delimiters are really a method of last resort.
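As a sketch, with a 4-byte big-endian prefix (the width, byte order, and limit are all protocol choices):

    import struct

    MAX_FRAME = 1 << 20   # reject absurd lengths before allocating

    def read_frame(read_exact):
        # read_exact(n) is assumed to return exactly n bytes.
        (length,) = struct.unpack("!I", read_exact(4))
        if length > MAX_FRAME:
            raise ValueError("frame too large")
        return read_exact(length)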


Do you feel your use-cases can't be met by existing protocols (or toolkits built on top of them) like HTTP, AMQP, MQTT?


This nicely documents the "why", as does the linked PyCon talk. But neither goes into much detail on the "how", making it difficult to follow suit.


Hi, I'm the linked PyCon speaker!

It'd be good to know what additional information you'd like on the "how". I think at a high level I addressed that in my talk, but if it didn't make it across or you needed more information I'd love to know what you need. Ideally I'll turn this into a blog post at some point so it'd be great to have an idea of what extra info is needed.


You definitely addressed it at a high level (I saw your talk at PyCon), but I'd like to see the low-level details as well.

This might just be a symptom of so few libraries following this pattern. I'd love to see some specific examples of handling various kinds of protocols (not just HTTP) with this approach, to see how it addresses various kinds of protocol components. For instance: variable-length data structures with length prefixes, variable-length data structures with "number of elements in the following array" in the middle, variable-length data structures with a terminator, text structures that require parsing tokens, and so on. Right now, the main documentation for those kinds of patterns seems to be "hope HTTP has a similar pattern and read the corresponding code in h2 or h11".

I'd also love to see some reusable components that make it easier to build such protocol libraries.


Yeah, so that's very reasonable.

In Python-land this is all pretty easy. For example, HTTP/2 is a protocol of the first kind ("variable-length data structures with length prefixes") at the framing layer, which is implemented in a Python package called hyperframe. This uses a combination of the `struct` module and bytestring operations to achieve its results. A similar approach works for the second kind as well.

Basically, in Python this is almost always much easier because struct sizing and memory allocation isn't a concern like it is in a C-like language (though even there, dynamically sized structures and pointers are your friends).
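For flavor, the fixed 9-byte HTTP/2 frame header (24-bit length, type, flags, 31-bit stream ID) comes apart with one `struct` call; this is a simplified sketch, not hyperframe's actual code:

    import struct

    def parse_frame_header(data):
        len_hi, len_lo, frame_type, flags, stream_id = struct.unpack("!HBBBL", data[:9])
        length = (len_hi << 8) | len_lo   # the 24-bit length spans two fields
        stream_id &= 0x7FFFFFFF           # mask off the reserved bit
        return length, frame_type, flags, stream_id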

But I agree, there is a lack of good discussion about "how do I actually do this?" I'd like to elaborate on that at some point for sure, because the reality is that it's remarkably simple.


> Basically, in Python this is almost always much easier because struct sizing and memory allocation isn't a concern like it is in a C-like language (though even there, dynamically sized structures and pointers are your friends).

I definitely don't want C anywhere near parsers for untrusted data, for so many reasons, this among them.

> But I agree, there is a lack of good discussion about "how do I actually do this?" I'd like to elaborate on that at some point for sure, because the reality is that it's remarkably simple.

Perhaps it would help to have some worked examples for some additional protocols?

Would you be interested in collaborating on a Python parser for some non-trivial data structures? I have a collection of such parsers as part of BITS (https://biosbits.org/) that really need reworking to decouple them from I/O, and I suspect the result would make a good article and/or conference talk.


I am the author of one such library for the problem of writing parsers (particularly for binary protocols). The declaration of the protocol structures is separate from anything involving I/O. Not trying to push it too hard, but it is one approach: https://github.com/digidotcom/python-suitcase

There is also Construct which has a different syntax but is similar in many ways: http://construct.readthedocs.io/en/latest/index.html

Both suitcase and construct are definitely better suited to parsing binary protocols; in my line of work, that limitation hasn't been a deal-breaker. With suitcase, at least, I haven't done much work to optimize performance (mostly because if I cared, I wouldn't be using Python).
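For flavor, a length-prefixed frame in Construct looks roughly like this (from memory, so treat the exact API as approximate):

    from construct import Struct, Int16ub, Bytes, this

    Frame = Struct(
        "length" / Int16ub,
        "payload" / Bytes(this.length),
    )

    Frame.parse(b"\x00\x03abc")   # -> Container(length=3, payload=b'abc')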


Both of those look great; thanks for the pointer to them!


I did a networkless approach for decoding DNS packets: https://github.com/spc476/SPCDNS There's no memory allocation (the user supplies the memory), and because it does not bother with the network at all, it's easy to integrate into an existing network framework (I think).


Node.js has pretty elegant read/write stream abstractions for this (https://nodejs.org/api/stream.html#stream_stream).


Has anyone got any examples of this kind of design in C#?


Could we get "Python" in the title?



