Back

Parsing network streams

I have often wondered what the "right" way to parse network streams is. Unlike parsing a file or an in-memory buffer, parsing from the network involves the annoying call to read(). It's annoying because you don't know how many bytes you'll get back: you could get 1 byte, or you could get 4 KB.

My first stab at getting around this was to always read 1 byte at a time, and make my parsing code a state machine. Parsing with a state machine is "cool" in a weird sort of way, but it's also a pain to maintain and understand, especially for protocols as lenient as HTTP/1.1.
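As a sketch of what that looks like (my example, not code from any real server), here is a toy state machine that consumes one byte at a time while watching for the \r\n\r\n that ends an HTTP/1.1 header section. Even for this tiny task, every new rule means new states and new transitions, which is where the maintenance pain comes from:

```rust
// Toy byte-at-a-time state machine: detect the \r\n\r\n that terminates
// an HTTP/1.1 header section. Hypothetical example, not production code.
#[derive(PartialEq)]
enum State {
    Start,  // somewhere inside a line
    Cr,     // just saw \r
    CrLf,   // just saw \r\n (end of one line)
    CrLfCr, // saw \r\n\r (maybe end of headers)
    Done,   // saw \r\n\r\n
}

// Advance the machine by one byte.
fn feed(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start, b'\r') => State::Cr,
        (State::Cr, b'\n') => State::CrLf,
        (State::CrLf, b'\r') => State::CrLfCr,
        (State::CrLfCr, b'\n') => State::Done,
        (State::Done, _) => State::Done, // stay done
        (_, b'\r') => State::Cr,         // stray \r restarts the pattern
        (_, _) => State::Start,          // any other byte: back inside a line
    }
}

// Run the machine over a buffer and report whether the headers ended.
fn headers_end(input: &[u8]) -> bool {
    let mut state = State::Start;
    for &b in input {
        state = feed(state, b);
    }
    state == State::Done
}
```

Four states just to find a blank line; handling the lenient cases (bare \n line endings, for instance) would mean doubling the transition table.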

On its own, parsing byte-by-byte would be slow, because each call to read() would involve an expensive network syscall. But I was doing this in Rust, whose standard library ships a cheap abstraction for buffered reading (BufReader), so each read hits an in-memory buffer instead of the kernel, making this not such a bad idea.
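To make that concrete, here is a small sketch (my own, with a made-up count_lines function) showing the pattern: wrap any reader, including a TcpStream, in a BufReader, then read one byte at a time without paying a syscall per byte:

```rust
use std::io::{BufReader, Read};

// Feed a byte-at-a-time consumer from any reader. Because of BufReader,
// most of these read() calls are served from an in-memory buffer; the
// underlying source is only hit when the buffer runs dry.
// count_lines is a hypothetical stand-in for a real parser.
fn count_lines<R: Read>(source: R) -> std::io::Result<usize> {
    let mut buffered = BufReader::new(source);
    let mut byte = [0u8; 1];
    let mut lines = 0;
    loop {
        match buffered.read(&mut byte)? {
            0 => break, // EOF
            _ => {
                if byte[0] == b'\n' {
                    lines += 1;
                }
            }
        }
    }
    Ok(lines)
}
```

The same code works on a TcpStream, a File, or a byte slice, since all of them implement Read.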

I have since moved away from that approach. For HTTP/1.1 in particular, I have found that it's much easier to buffer until you see a byte that marks the end of a section. For example, to read the request line, you keep reading from the socket until you see a \r\n (or a bare \n, which lenient parsers accept), then you take that chunk and parse it as a whole.
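A minimal sketch of that idea, assuming the request line is well-formed enough to split on spaces (read_request_line is my own made-up helper, not from the post):

```rust
use std::io::{BufRead, BufReader, Read};

// Buffer until the end-of-line delimiter, then parse the whole chunk.
// read_until appends bytes up to and including b'\n' into `line`.
// Works on a TcpStream or anything else implementing Read.
fn read_request_line<R: Read>(
    source: R,
) -> std::io::Result<(String, String, String)> {
    let mut reader = BufReader::new(source);
    let mut line = Vec::new();
    reader.read_until(b'\n', &mut line)?;

    // Strip the trailing \n and, if present, the \r before it.
    if line.last() == Some(&b'\n') {
        line.pop();
    }
    if line.last() == Some(&b'\r') {
        line.pop();
    }

    // Now the chunk is complete and we can parse it in one go:
    // "METHOD SP request-target SP HTTP-version".
    let text = String::from_utf8_lossy(&line).into_owned();
    let mut parts = text.splitn(3, ' ');
    Ok((
        parts.next().unwrap_or("").to_string(),
        parts.next().unwrap_or("").to_string(),
        parts.next().unwrap_or("").to_string(),
    ))
}
```

The nice property is that the parsing step sees a complete line, so it can be ordinary straight-line code rather than a state machine, and the same loop extends to header lines and (with a length delimiter instead of a byte delimiter) to bodies.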

Until today, I wasn't sure how applicable this approach is to other protocols. To my surprise, Beej, the author of the well-known guide to network programming in C, also recently wrote about exactly this topic. It appears that "buffering until you get a complete message" is exactly how this sort of thing is done.

It's a nice affirmation that if you spend enough time trying to re-invent a wheel, you'll eventually stumble upon the best practices of wheel making. All on your own.