ProtocolBuffers is an open source project supporting Google's ProtocolBuffer's platform-neutral and language-neutral interprocess-communication (IPC) and serialization framework. It has an Interface Definition Language (IDL) that is used to describe the wire- and file formats; this IDL is then pre-compiled into source code for the target languages (Python, Java and C++ included), which are then used in the applications.
Hadoop 0.23+ requires the protocol buffers JAR (protobufs.jar) to be on the classpath of both clients and servers; the native binaries are required to compile this and later versions of Hadoop.
In comparison with previous IDLs (such as CORBA, DCOM and SunOS RPC), ProtocolBuffers are designed to be
- Simple remote procedure calls (not Object-Oriented communication in the style of CORBA).
- Usable for efficient binary serialization of raw data.
Highly efficient in terms of bandwidth, serialization and deserialization. In a large Hadoop cluster, network bandwidth, especially to and from the NameNode, JobTracker and -in NextGenMapReduce-, the ResourceManager, is precious. An efficient wire format not only saves bandwidth to and from these master nodes, it can reduce load and congestion on the main switching fabric of a large cluster.
- Excellent support for forward versioning, in which a remote service can support older versions of a client.
- Workable support for backward versioning, in which a remote service can support newer versions of a client. This requires more careful programming in the service code.
The protocol is significantly different from the Web Services WS-* stack, that has been criticised by Steve Loughran and Edmund Smith in Rethinking the Java SOAP Stack and RPC under fire in that the WS-* language for describing data XML-Schema, is not completely mappable to the Object-Oriented model of today's languages, yet the WS-* stacks attempt to seamlessly do so, even across languages. Loughran and Smith regard such an O/X mapping to be as insolvable as a perfect O/R Mapping, and hence doomed. Instead SOAP stacks should embrace the XML nature of documents and use mechanisms such as !XPath to directly work with the XML content. No widely used SOAP stack does this, as WS-* developers appear to prefer to write implementation-first code in which the datatypes are written in their native language, the interface specification reverse-engineered from this and then everyone hopes that this specification will be convertable into usable datatypes in other languages, and stable across protocol versions.
ProtocolBuffers and Thrift both require the IDL to be specified first, and have a code generation stage that generates language-specific code from it. Version support is explicitly handled,
One criticism of both ProtocolBuffers and Thrift is that the content is not self-describing; it is expected that the reader has compile-time expectations for the specific datatypes and interfaces, though possibly different versions. Apache Avro does include in-content type declarations and runtime parsing, which is why some organizations using Hadoop consider it a significantly better format for persistent data: it becomes possible to parse files without advance knowledge of their structure.
Hadoop 0.23+ must have Google's ProtocolBuffers for compilation to work. These are native binaries which need to be downloaded, compiled and then installed locally. See BUILDING.txt.
This is a good opportunity to get the GNU C/C++ toolchain installed, which is useful for working on the native code used in the HDFS project.
To install and use ProtocolBuffers
Install the protobuf packages provided they are current enough -see the README file for the current version. If they are too old, uninstall any version you have and follow the instructions.
Local build and installation
you need a copy of GCC 4.1+ including the g++ C++ compiler, make and the rest of the GNU C++ development chain.
- Linux: you need a copy of autoconf installed, which your local package manager will do -along with automake.
Download the version of protocol buffers that the BUILDING.txt recommends from the protocol buffers project.
- unzip it/untar it
cd into the directory that has been created.
- If configure fails with "C++ preprocessor "/lib/cpp" fails sanity check" that means you don't have g++ installed. Install it.
run make to build the libraries.
on a Unix system, after building the libraries, you must install it as root. su to root, then run make install
Testing your Protocol Buffers installation
The test for this is verifying that protoc is on the command line. You should expect something like
$ protoc Missing input file.
You may see the error message
$ protoc protoc: error while loading shared libraries: libprotobuf.so.7: cannot open shared object file: No such file or directory
This is a known issue for Linux, and is caused by a stale cache of libraries. Run ldconfig and try again.