Sunday, 31 July 2016

Google Protocol Buffers Tutorial


What is Serialization?
     Serialization is commonly described as saving (writing) the state of an object to a file. Strictly speaking, it is the process of converting an object from its in-memory Java form into a form that can be stored in a file or sent over a network. You can find a complete tutorial on serialization here: Serialization Tutorial

Problems of Java Serialization
     The default serialization mechanism provided in Java is not very efficient and has a host of well-known problems. Java serialization also doesn’t work well if you want to share data with applications written in C++ or Python.

     Once an instance of a serializable class has been serialized, changing a class that relies on default serialization can make it impossible to deserialize that object from the serialized data stream. If the class does not declare a serialVersionUID, the JVM computes one for it and writes it to the serialized stream, so that at deserialization time it can check whether the compiled class still matches the version that was serialized. Because changing the class typically changes the automatically generated ID, the ID stored in the stream no longer matches the ID of the compiled class, and this mismatch causes an InvalidClassException. So you need to be judicious in deciding whether implementing Serializable is right for a class, since later changes to that class can cause problems in reading persisted data streams back into the JVM.
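     The version check described above can be sketched with plain JDK classes. This is a minimal sketch, not a recommendation: the User class and the helper names (serialize, deserialize, versionId) are made up for illustration. Declaring serialVersionUID explicitly pins the ID that would otherwise be computed from the class structure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerialVersionDemo {
    // Hypothetical class, used only for this illustration.
    static class User implements Serializable {
        // Declared explicitly, so the JVM does not compute an ID from the class structure.
        private static final long serialVersionUID = 1L;
        final String name;
        final int id;

        User(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // Writes the object (and its class version ID) into a byte array.
    static byte[] serialize(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the object back; a version ID mismatch would surface here as an
    // InvalidClassException (an IOException), rethrown as UncheckedIOException.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the version ID the JVM associates with a serializable class.
    static long versionId(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        User copy = (User) deserialize(serialize(new User("alice", 42)));
        System.out.println(copy.name + " " + copy.id + " version=" + versionId(User.class));
    }
}
```

     If the explicit serialVersionUID line were removed, the value returned by versionId would be derived from the class's fields and methods, which is why edits to the class can invalidate previously written streams.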

     Serialization also carries resource overhead (both CPU and I/O) for serializing and deserializing the data, and it adds latency when the data is transmitted over the network.

     Further, serialization is quite slow. XML serialization, moreover, is insecure and consumes a lot of disk space, and it works only on public members and public classes, not on private or internal ones. It therefore compels the developer to expose the class to the outside world.

     Since serialization does not offer any transaction control mechanisms, it is not suitable for applications that need concurrent access unless additional APIs are used.

What is Google protocol buffers?
     Google Protocol Buffers, also known as protobuf, is an efficient alternative for serializing objects. Protobuf is faster and simpler than XML and more compact than JSON. It was designed to be language- and platform-neutral and extensible. Currently, protobuf supports C++, C#, Go, Java, and Python. This tutorial gives an introduction to Google Protocol Buffers (protobuf) in Java.

     Google Protocol Buffers is an open source encoding mechanism for structured data, developed at Google. It is useful for programs that communicate with each other over a wire or that store data. All you have to do is specify a message for each data structure you want to serialize (in a Java-class-like format) using a .proto specification file.

     From that, the protocol buffer compiler (protoc) creates a class that implements automatic encoding and parsing of the protocol buffer data in an efficient binary format. The generated Java class provides setters and getters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports extending the format over time in such a way that code can still read data encoded with the old format. The protobuf API in Java is used to serialize and deserialize Java objects, so you don’t need to worry about any encoding and decoding details.
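     One mechanism behind this compatibility is that every encoded field carries its own number and wire type, so a reader can skip fields it does not know and tolerate fields that are absent. The following is a hand-written sketch of that idea using only the JDK. It is not the protobuf library or protoc output; the Reader helper and the readUser method are hypothetical names, and the field layout assumed is name = 1 (string) and id = 2 (int32).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ForwardCompatSketch {
    // Minimal cursor over an encoded message; a hypothetical helper, not a protobuf API.
    static final class Reader {
        private final byte[] buf;
        private int pos;

        Reader(byte[] buf) {
            this.buf = buf;
        }

        boolean hasMore() {
            return pos < buf.length;
        }

        // Reads a base-128 varint: 7 payload bits per byte, high bit set means "more bytes follow".
        long readVarint() {
            long result = 0;
            for (int shift = 0; shift < 64; shift += 7) {
                byte b = buf[pos++];
                result |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    return result;
                }
            }
            throw new IllegalArgumentException("malformed varint");
        }

        // Reads a length prefix followed by that many bytes.
        byte[] readLengthDelimited() {
            int length = (int) readVarint();
            byte[] value = Arrays.copyOfRange(buf, pos, pos + length);
            pos += length;
            return value;
        }

        // Skips a field this reader does not know, based only on its wire type.
        void skip(int wireType) {
            switch (wireType) {
                case 0: readVarint(); break;           // varint
                case 1: pos += 8; break;               // fixed 64-bit
                case 2: readLengthDelimited(); break;  // string, bytes, nested message
                case 5: pos += 4; break;               // fixed 32-bit
                default: throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
    }

    // Decodes only the fields this "old" reader knows about and ignores the rest.
    static String readUser(byte[] message) {
        Reader in = new Reader(message);
        String name = null;
        int id = 0;
        while (in.hasMore()) {
            long tag = in.readVarint();
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            if (fieldNumber == 1 && wireType == 2) {
                name = new String(in.readLengthDelimited(), StandardCharsets.UTF_8);
            } else if (fieldNumber == 2 && wireType == 0) {
                id = (int) in.readVarint();
            } else {
                in.skip(wireType);
            }
        }
        return name + "#" + id;
    }

    public static void main(String[] args) {
        // name = "Al", id = 150, plus a field number 9 that this reader has never heard of.
        byte[] newer = {0x0a, 0x02, 0x41, 0x6c, 0x10, (byte) 0x96, 0x01, 0x4a, 0x01, 0x78};
        System.out.println(readUser(newer));
    }
}
```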

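Advantages of Google Protocol Buffer
Google's documentation cites the following advantages over XML:
1. Protocol buffer messages are 3 to 10 times smaller than the equivalent XML.
2. Protocol buffer messages are 20 to 100 times faster to process than the equivalent XML.
3. The generated data access classes are easier to use programmatically.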

Protocol Buffer Basics
1. Defining A Message Type
     A .proto file typically starts with a package declaration, which helps to prevent naming conflicts between different projects. In the .proto file you define how you want your data to be structured, using a message format. This file is consumed by the protocol buffer compiler (protoc), which generates a Java class with getter and setter methods so that you can serialize and deserialize Java objects to and from a variety of streams. You can define a message type in a .proto file as follows:
message User {
   required string name = 1;
   required int32 id = 2;
   optional string email = 3;
}
     The message format is very straightforward. Each message type has one or more uniquely numbered fields, and nested message types have their own set of uniquely numbered fields. Value types can be numbers, booleans, strings, bytes, collections (repeated fields) and enumerations (inspired by the Java enum). You can also nest other message types, allowing you to structure your data hierarchically in much the same way JSON allows you to.

2. Specifying Field Types
     Fields can be specified as optional, required, or repeated. Don’t let the type of the field (e.g. enum, int32, float, string) confuse you when using protocol buffers from a particular language such as Python. The field types are just hints to protoc about how to serialize a field's value and produce the encoded format of your message. The encoded format looks like a flattened and compact representation of your object. You would write this specification in exactly the same way whether you are using protocol buffers in Python, Java, or C++.
     In the above example, all the fields are scalar types: two strings and one int. However, you can also specify composite types for your fields, including enumerations and other message types. A scalar type is one of the following: double, float, int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, bool, string, bytes.
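     To make the "flattened" encoded format more concrete, here is a hand-written sketch that encodes the User message from the example above using only the JDK. It is an illustration of the wire format, not the code protoc generates; the class and method names (WireFormatSketch, encodeUser, and so on) are hypothetical. A string field is written as a tag, a length, and the UTF-8 bytes; an int32 field is written as a tag and a varint.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class WireFormatSketch {
    static final int WIRETYPE_VARINT = 0;            // int32, int64, bool, enum, ...
    static final int WIRETYPE_LENGTH_DELIMITED = 2;  // string, bytes, nested messages

    // Base-128 varint: 7 bits per byte, least significant group first.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // The tag combines the field number from the .proto with the wire type.
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    static void writeInt32(ByteArrayOutputStream out, int fieldNumber, int value) {
        writeTag(out, fieldNumber, WIRETYPE_VARINT);
        writeVarint(out, value); // widened to long, so negative values are sign-extended
    }

    static void writeString(ByteArrayOutputStream out, int fieldNumber, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeTag(out, fieldNumber, WIRETYPE_LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes: required string name = 1; required int32 id = 2; optional string email = 3;
    static byte[] encodeUser(String name, int id, String email) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, name);
        writeInt32(out, 2, id);
        if (email != null) {
            writeString(out, 3, email); // an unset optional field is simply left out
        }
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 0a02416c109601 (7 bytes for name "Al" and id 150)
        System.out.println(hex(encodeUser("Al", 150, null)));
    }
}
```

     Note that neither the field names nor the declared types appear in the output; only field numbers, wire types, and values do, which is a large part of why the format is compact.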

3. Assigning Tags
     Each field in the message definition has a unique numbered tag. These tags are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode, including the identifying number and the field's type (you can find out more about this in Protocol Buffer Encoding). Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements.
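     The one-byte and two-byte ranges follow from how the tag is encoded: the field number is shifted left by three bits to make room for the wire type, and the result is written as a varint with 7 payload bits per byte. A small sketch of that arithmetic (the class and method names are made up for illustration):

```java
class TagSizeSketch {
    // Number of bytes a value occupies as a base-128 varint (7 payload bits per byte).
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    // The low 3 bits of a tag hold the wire type, so they never change the byte count
    // for field numbers of 1 or more.
    static int tagSize(int fieldNumber) {
        return varintSize((long) fieldNumber << 3);
    }

    public static void main(String[] args) {
        int[] samples = {1, 15, 16, 2047, 2048};
        for (int fieldNumber : samples) {
            System.out.println("field " + fieldNumber + " -> " + tagSize(fieldNumber) + " byte(s)");
        }
    }
}
```

     Field 15 gives 15 << 3 = 120, which still fits in 7 bits, while field 16 gives 128 and needs a second byte.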

     The smallest tag number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999, as they are reserved for the Protocol Buffers implementation; the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved tags.
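     These limits are easy to express as a check. The following sketch only mirrors the rules stated above; it is not part of protoc or the protobuf API, and the names are hypothetical. It does not cover tags that a particular .proto file has reserved itself.

```java
class FieldNumberCheck {
    static final int MIN_FIELD_NUMBER = 1;
    static final int MAX_FIELD_NUMBER = (1 << 29) - 1; // 536,870,911
    static final int IMPLEMENTATION_RESERVED_FIRST = 19000;
    static final int IMPLEMENTATION_RESERVED_LAST = 19999;

    // True if the number lies in the allowed range and outside the block
    // reserved for the Protocol Buffers implementation.
    static boolean isUsable(int fieldNumber) {
        boolean inRange = fieldNumber >= MIN_FIELD_NUMBER && fieldNumber <= MAX_FIELD_NUMBER;
        boolean reserved = fieldNumber >= IMPLEMENTATION_RESERVED_FIRST
                && fieldNumber <= IMPLEMENTATION_RESERVED_LAST;
        return inRange && !reserved;
    }

    public static void main(String[] args) {
        System.out.println(isUsable(3));      // true
        System.out.println(isUsable(19500));  // false
    }
}
```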

4. Specifying Field Rules
     You specify that message fields are one of the following:
1. required
      A required field must be given a value; otherwise the message is considered uninitialized.

2. optional
     If an optional field is not set, a default value is used. You can also specify your own default value in the .proto file, for example for a field of an enum type such as PhoneType.

3. repeated
     This field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

     For historical reasons, repeated fields of scalar numeric types aren't encoded as efficiently as they could be. New code should use the special option [packed=true] to get a more efficient encoding.
For example:
repeated int32 samples = 4 [packed=true];
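     The difference between the two encodings can be sketched by hand. In the unpacked form every element carries its own tag; in the packed form the tag appears once, followed by a byte length and the elements back to back. The code below uses only the JDK and hypothetical names; it illustrates the wire format and is not the protobuf library.

```java
import java.io.ByteArrayOutputStream;

class PackedEncodingSketch {
    // Base-128 varint: 7 bits per byte, least significant group first.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Unpacked: one (tag, value) pair per element, wire type 0 (varint).
    static byte[] encodeUnpacked(int fieldNumber, int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int value : values) {
            writeVarint(out, (fieldNumber << 3) | 0);
            writeVarint(out, value);
        }
        return out.toByteArray();
    }

    // Packed: a single tag with wire type 2 (length-delimited), then all values.
    static byte[] encodePacked(int fieldNumber, int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (values.length == 0) {
            return out.toByteArray(); // an empty repeated field is not written at all
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int value : values) {
            writeVarint(payload, value);
        }
        writeVarint(out, (fieldNumber << 3) | 2);
        writeVarint(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {3, 270, 86942};
        System.out.println("unpacked: " + hex(encodeUnpacked(4, samples))); // 9 bytes
        System.out.println("packed:   " + hex(encodePacked(4, samples)));   // 8 bytes
    }
}
```

     With only three elements the saving is a single byte, but it grows with the number of elements, since the per-element tag disappears.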

5. Adding More Message Types

     Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages.

6. Adding Comments
     To add comments to your .proto files, use C/C++-style // syntax:
message User {
   required string name = 1;
   required int32 id = 2; // Id of the User
   optional string email = 3; // Email of the User
}
7. What's Generated From Your .proto?
     When you run the protocol buffer compiler on a .proto, the compiler generates the code in your chosen language you'll need to work with the message types you've described in the file, including getting and setting field values, serializing your messages to an output stream, and parsing your messages from an input stream.

     For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.

     For Java, the compiler generates a .java file with a class for each message type, as well as special Builder classes for creating message class instances.

     Python is a little different – the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.
     For Go, the compiler generates a .pb.go file with a type for each message type in your file.
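     To give a feel for the Java case, the following is a much simplified, hand-written imitation of the message-plus-Builder shape for the User message. It is not actual protoc output: real generated classes extend protobuf base classes and contain far more code. The exception type used for a missing required field (IllegalStateException) is an assumption of this sketch.

```java
class GeneratedStyleSketch {
    // Immutable message object: only getters, no setters.
    static final class User {
        private final String name;
        private final int id;
        private final String email; // null when the optional field was never set

        private User(Builder builder) {
            this.name = builder.name;
            this.id = builder.id;
            this.email = builder.email;
        }

        String getName() { return name; }
        int getId() { return id; }
        boolean hasEmail() { return email != null; }
        // An unset optional string reads as the empty string.
        String getEmail() { return email == null ? "" : email; }

        static Builder newBuilder() { return new Builder(); }

        // Mutable builder: setters return the builder so calls can be chained.
        static final class Builder {
            private String name;
            private Integer id;
            private String email;

            Builder setName(String value) { this.name = value; return this; }
            Builder setId(int value) { this.id = value; return this; }
            Builder setEmail(String value) { this.email = value; return this; }

            // Refuses to build a message whose required fields are missing.
            User build() {
                if (name == null || id == null) {
                    throw new IllegalStateException("required fields name and id must be set");
                }
                return new User(this);
            }
        }
    }

    public static void main(String[] args) {
        User user = User.newBuilder().setName("alice").setId(7).build();
        System.out.println(user.getName() + " " + user.getId() + " hasEmail=" + user.hasEmail());
    }
}
```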

8. Enumerations
    When you're defining a message type, you might want one of its fields to only have one of a pre-defined list of values.
For example:
enum PhoneType {
   MOBILE = 0;
   HOME = 1;
   WORK = 2;
}
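     On the Java side, an enumeration like this maps naturally onto a Java enum whose constants carry the numbers from the .proto file. The following hand-written sketch shows the idea; it is not protoc output, and returning null for an unknown number is an assumption of this sketch.

```java
class EnumSketch {
    enum PhoneType {
        MOBILE(0),
        HOME(1),
        WORK(2);

        private final int number;

        PhoneType(int number) {
            this.number = number;
        }

        // The number that is written to the wire for this constant.
        int getNumber() {
            return number;
        }

        // Looks up the constant for a number read from the wire; null if unknown.
        static PhoneType forNumber(int number) {
            for (PhoneType type : values()) {
                if (type.number == number) {
                    return type;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(PhoneType.forNumber(1));      // HOME
        System.out.println(PhoneType.WORK.getNumber());  // 2
    }
}
```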

