Monday, July 07, 2008

Protocol Buffers, our serialized structured data, released as Open Source

One of the core pieces of infrastructure at Google is something called Protocol Buffers. We are really pleased to be open sourcing the system, but what are these buffers?
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams, using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
It is probably best to take a peek at some code behind this. The first thing you need to do is define a message type, which can look like the following .proto file:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
There is detailed documentation on this language for you to learn more.
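The number after each field (= 1, = 2, and so on) is a tag that identifies the field in the encoded data, and that is a large part of why an older program can keep reading data written by a newer one. The following is a minimal, self-contained C++ sketch of that idea, assuming a simplified tag-plus-value encoding in the spirit of the Protocol Buffers wire format. It is not the generated code or the library's API: every helper name here (put_varint, get_varint, put_int_field, put_string_field, OldPerson, parse_old_person) is hypothetical and exists only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; not part of the Protocol Buffers library.

// Append v as a base-128 varint: 7 bits per byte, high bit set when more bytes follow.
void put_varint(std::string& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<char>(v));
}

// Read a varint starting at pos and advance pos past it.
std::uint64_t get_varint(const std::string& in, std::size_t& pos) {
    std::uint64_t v = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());
        std::uint8_t b = static_cast<std::uint8_t>(in[pos++]);
        v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return v;
        shift += 7;
    }
}

// Integer field: key = (tag << 3) | 0, followed by the value as a varint.
void put_int_field(std::string& out, std::uint32_t tag, std::uint64_t value) {
    put_varint(out, (static_cast<std::uint64_t>(tag) << 3) | 0);
    put_varint(out, value);
}

// String field: key = (tag << 3) | 2, followed by the length and the raw bytes.
void put_string_field(std::string& out, std::uint32_t tag, const std::string& s) {
    put_varint(out, (static_cast<std::uint64_t>(tag) << 3) | 2);
    put_varint(out, s.size());
    out += s;
}

// An "old" reader that only knows tag 1 (name) and tag 2 (id).
struct OldPerson {
    std::string name;
    std::uint64_t id = 0;
};

OldPerson parse_old_person(const std::string& in) {
    OldPerson p;
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::uint64_t key = get_varint(in, pos);
        std::uint64_t tag = key >> 3;
        std::uint64_t wire_type = key & 7;
        if (wire_type == 0) {
            std::uint64_t v = get_varint(in, pos);
            if (tag == 2) p.id = v;  // integer fields with other tags are ignored
        } else if (wire_type == 2) {
            std::uint64_t len = get_varint(in, pos);
            assert(pos + len <= in.size());
            if (tag == 1) p.name = in.substr(pos, len);
            pos += len;  // fields with unknown tags are skipped by their length
        } else {
            assert(false && "wire type not handled in this sketch");
        }
    }
    return p;
}
```

A writer that also emits an email field with tag 3 (for example put_string_field(bytes, 3, "jdoe@example.com"), with an illustrative address) produces bytes that parse_old_person can still read: it picks out tags 1 and 2 and steps over the field it does not recognize.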

Once you have defined a message type, you run a protocol buffer compiler on the file to create data access classes for your platform of choice (Java, C++, Python in this release).

Then you can easily work with the data, for example in C++:
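Person person;
person.set_name("John Doe");
person.set_id(1234);  // id is a required field; 1234 is an illustrative value
fstream output("myfile", ios::out | ios::binary);
person.SerializeToOstream(&output);  // write the binary encoding to the file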
We sat down with Kenton Varda, a software engineer who worked on the open source effort, to get his take on Protocol Buffers, how we ended up with them, how they compare to other solutions, and more:


  1. "We're sorry, this video is no longer available."

    Will the video be coming back?

  2. Holy fullscreen button batman!

    Haven't seen that on the embedded youtube player before...

  3. If you find this interesting, you may also want to check out Thrift. Thrift was open-sourced by Facebook and is now in the Apache Incubator. It has a serialization format with many of the same features as Protocol Buffers and also includes RPC mechanisms. It also supports a larger number of languages.

  4. OK, so we are still not getting practical relational data structures, but we need more complexity for streaming. Take your time to rationalize SQL into classes, then we'll talk again.

  5. It would be cool to see a detailed comparison of Protocol Buffers vs Thrift...

  6. Protocol Buffers (released under Apache License 2.0, according to the included COPYING) is a refreshing alternative to the entangled systems of Thrift, Ice, etc., for people who already have a solid comm layer. It does one thing and does it right (almost). The code looks good and builds/tests fine. I'm also impressed with the amount of documentation available with the beta release. Thrift (also Apache v2), in comparison, has almost no up-to-date documentation besides the high-level and dated paper. It looks like it was inspired by Protocol Buffers (some of the Facebookers are ex-Googlers). People who seek the full gamut of services should also take a look at Ice by ZeroC (GPLv2), which has excellent documentation (a professionally formatted PDF of 1800+ pages, 8MB).

    IMO, the lack of support for explicit service-level versioning in both Protocol Buffers and Thrift will cause trouble down the road.

  7. Interesting technology. Especially when compared to XML's verbosity, this could result in efficient communications, even across multiple programming languages.

    However, is there any information on the performance of Protocol Buffers versus Java serialization? Which one is faster?

  8. I don't see a scalar value type for date/time. Does Google have a convention for handling dates within Protocol Buffers (e.g., use an int64 to store the number of milliseconds since January 1, 1970, 00:00:00 GMT)?

  9. If you're looking for the video, I found it here: