Broccoli Sync Conversation

A number of days ago (I know, I'm an awful human who failed to post this for over a week), myself, Lars, Mark, and Vince discussed Dropbox's article about Broccoli Sync. It wasn't quite what we'd expected but it was an interesting discussion of compression and streamed data.

Vince observed that it was interesting in that it was a way to move storage compression cost to the client edge. This makes sense because decompression (to verify the uploaded content) is cheaper than compression; and also since CPU and bandwidth are expensive, spending the client CPU to reduce bandwidth is worthwhile.

Lars talked about how even in situations where everyone has gigabit data connectivity with no limit on total transit, bandwidth/time is a concern, so it makes sense.

We liked how they determined the right compresison level to use available bandwidth (i.e. not be CPU throttled) but also gain the most compression possible. Their diagram showing relative compression sizes for level 1 vs. 3 vs. 5 suggests that the gain for putting the effort in for 5 rather than 1. It's interesting in that diagram that 'documents' don't compress well but then again it is notable that such documents are likely DEFLATE'd zip files. Basically if the data is already compressed then there's little hope Brotli will gain much.

I raised that it was interesting that they chose Brotli, in part, due to the availability of a pure Rust implementation of Brotli. Lars mentioned that Microsoft and others talk about how huge quantities of C code has unexpected memory safety issues and so perhaps that is related. Daniel mentioned that the document talked about Dropbox having a policy of not running unconstrained C code which was interesting.

Vince noted that in their deployment challenges it seemed like a very poor general strategy to cope with crasher errors; but Daniel pointed out that it might be an over-simplified description, and Mark suggested that it might be sufficient until a fix can be pushed out. Vince agreed that it's plausible this is a tiered/sharded deployment process and thus a good way to smoke out problems.

Daniel found it interesting that their block storage sounds remarkably like every other content-addressible storage and that while they make it clear in the article that encryption, client identification etc are elided, it looks like they might be able to deduplicate between theoretically hostile clients.

We think that the compressed-data plus type plus hash (which we assume also contains length) is an interesting and nice approach to durability and integrity validation in the protocol. And the compressed blocks can then be passed to the storage backend quickly and effectively which is nice for latency.

Daniel raised that he thought it was fun that their rust-brotli library is still workable on Rust 1.12 which is really quite old.

We ended up on a number of tangential discussions, about Rust, about deployment strategies, and so on. While the article itself was a little thin, we certainly had a lot of good chatting around topics it raised.

We'll meet again in a month (on the 28th Sept) so perhaps we'll have a chunkier article next time. (Possibly this and/or related articles)