Hi, I'm the linked PyCon speaker! It'd be good to know what additional informati...

JoshTriplett · on Aug 7, 2016

You definitely addressed it at a high level (I saw your talk at PyCon), but I'd like to see the low-level details as well.

This might just be a symptom of so few libraries following this pattern. I'd love to see some specific examples of handling various kinds of protocols (not just HTTP) with this approach, to see how it addresses various kinds of protocol components. For instance: variable-length data structures with length prefixes, variable-length data structures with "number of elements in the following array" in the middle, variable-length data structures with a terminator, text structures that require parsing tokens, and so on. Right now, the main documentation for those kinds of patterns seems to be "hope HTTP has a similar pattern and read the corresponding code in h2 or h11".

I'd also love to see some reusable components that make it easier to build such protocol libraries.

Lukasa · on Aug 7, 2016

Yeah, so that's very reasonable.

In Python-land this is all pretty easy. For example, HTTP/2 is a protocol of the first kind ("variable-length data structures with length prefixes") at the framing layer, which is implemented in a Python package called hyperframe. This uses a combination of the `struct` module and bytestring operations to achieve its results. A similar approach works for the second kind as well.
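To make that concrete, here is a minimal sketch of the length-prefix case using `struct` and a `bytearray`. It follows the 9-byte HTTP/2 frame header layout (24-bit length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream ID), but the `FrameBuffer` class and `encode_frame` helper are hypothetical names made up for illustration, not hyperframe's actual API. The parser never touches a socket: the caller feeds it bytes and pulls out whatever complete frames are available.

```python
import struct

# 24-bit length (split into 8 high bits + 16 low bits), type, flags,
# then a 32-bit field holding 1 reserved bit + a 31-bit stream ID.
_HEADER = struct.Struct(">BHBBL")
HEADER_LEN = _HEADER.size  # 9 bytes


class FrameBuffer:
    """Hypothetical sans-I/O frame parser: bytes in, complete frames out."""

    def __init__(self):
        self._buf = bytearray()

    def receive_data(self, data):
        # The caller owns all I/O; we only accumulate what it hands us.
        self._buf += data

    def frames(self):
        # Yield every complete frame currently buffered, then stop.
        while len(self._buf) >= HEADER_LEN:
            hi, lo, ftype, flags, stream = _HEADER.unpack_from(self._buf)
            end = HEADER_LEN + ((hi << 16) | lo)
            if len(self._buf) < end:
                return  # payload incomplete: wait for more data
            payload = bytes(self._buf[HEADER_LEN:end])
            del self._buf[:end]
            yield ftype, flags, stream & 0x7FFFFFFF, payload


def encode_frame(ftype, flags, stream, payload):
    # Hypothetical helper, the inverse of the parser above.
    n = len(payload)
    return _HEADER.pack(n >> 16, n & 0xFFFF, ftype, flags, stream) + payload


wire = encode_frame(0, 1, 1, b"hello") + encode_frame(4, 0, 0, b"")
fb = FrameBuffer()
fb.receive_data(wire[:7])           # not even a full header yet
partial = list(fb.frames())         # no frames available
fb.receive_data(wire[7:])
complete = list(fb.frames())        # both frames available
```

Because the buffer simply keeps partial data until the rest arrives, the same object works unchanged with blocking sockets, an event loop, or a test that feeds it one byte at a time.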

Basically, in Python this is almost always much easier because struct sizing and memory allocation aren't a concern like they are in a C-like language (though even there, dynamically sized structures and pointers are your friends).

But I agree, there is a lack of good discussion about "how do I actually do this?" I'd like to elaborate on that at some point for sure, because the reality is that it's remarkably simple.

JoshTriplett · on Aug 7, 2016

> Basically, in Python this is almost always much easier because struct sizing and memory allocation aren't a concern like they are in a C-like language (though even there, dynamically sized structures and pointers are your friends).

I definitely don't want C anywhere near parsers for untrusted data, for so many reasons, this among them.

> But I agree, there is a lack of good discussion about "how do I actually do this?" I'd like to elaborate on that at some point for sure, because the reality is that it's remarkably simple.

Perhaps it would help to have some worked examples for some additional protocols?

Would you be interested in collaborating on a Python parser for some non-trivial data structures? I have a collection of such parsers as part of BITS (https://biosbits.org/) that really need reworking to decouple them from I/O, and I suspect the result would make a good article and/or conference talk.

posborne · on Aug 7, 2016

I am the author of one such library for writing parsers (particularly for binary protocols). The declaration of the protocol structures is separate from anything involving I/O. Not trying to push it too hard, but it is one approach: https://github.com/digidotcom/python-suitcase

There is also Construct which has a different syntax but is similar in many ways: http://construct.readthedocs.io/en/latest/index.html

Both suitcase and Construct are definitely better suited to parsing binary protocols than text protocols; in my line of work, that limitation hasn't been a deal breaker. With suitcase, at least, I haven't done much work to optimize performance (mostly because if I cared, I wouldn't be using Python).

JoshTriplett · on Aug 8, 2016

Both of those look great; thanks for the pointer to them!

spc476 · on Aug 7, 2016

I took a networkless approach to decoding DNS packets: https://github.com/spc476/SPCDNS (no memory allocation; the user supplies the memory). Because it does not bother with the network at all, it's easy to integrate into an existing network framework (I think).
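For readers following along in Python, here is a rough sketch of the same idea: a decoder that takes a buffer and returns values, with no sockets involved. It is not the SPCDNS API (that library is C), and `parse_questions` and `read_name` are hypothetical names. It assumes the standard 12-byte DNS header and deliberately does not handle name compression. DNS is a handy example because it combines a "number of elements in the following array" count (QDCOUNT), length-prefixed labels, and a zero-byte terminator.

```python
import struct

_HEADER = struct.Struct("!6H")    # ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
_QUESTION = struct.Struct("!HH")  # QTYPE, QCLASS


def read_name(packet, offset):
    """Read length-prefixed labels until the zero-length terminator."""
    labels = []
    while True:
        if offset >= len(packet):
            raise ValueError("truncated name")
        length = packet[offset]
        offset += 1
        if length == 0:
            return ".".join(labels), offset
        if length & 0xC0:
            # Assumption: compression pointers are out of scope here.
            raise ValueError("compression pointers not handled in this sketch")
        label = packet[offset:offset + length]
        if len(label) < length:
            raise ValueError("truncated label")
        labels.append(label.decode("ascii"))
        offset += length


def parse_questions(packet):
    """Return (id, [(name, qtype, qclass), ...]) from a bytes object."""
    if len(packet) < _HEADER.size:
        raise ValueError("truncated header")
    ident, _flags, qdcount, _an, _ns, _ar = _HEADER.unpack_from(packet)
    offset = _HEADER.size
    questions = []
    for _ in range(qdcount):  # the count in the header drives the loop
        name, offset = read_name(packet, offset)
        # unpack_from raises struct.error if the fixed fields are cut short
        qtype, qclass = _QUESTION.unpack_from(packet, offset)
        offset += _QUESTION.size
        questions.append((name, qtype, qclass))
    return ident, questions


query = (_HEADER.pack(0x1234, 0x0100, 1, 0, 0, 0)
         + b"\x07example\x03com\x00"
         + _QUESTION.pack(1, 1))
result = parse_questions(query)  # (0x1234, [("example.com", 1, 1)])
```

Since the functions only ever see `bytes`, whoever owns the UDP socket (or a test fixture) decides how the packet arrives.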