Recent Posts

Friday, 22 April 2016

Apache Zookeeper Tutorial

     The formal definition of Apache Zookeeper says that it is a distributed, open-source configuration, synchronization service along with naming registry for distributed applications. Apache Zookeeper is used to manage and coordinate large cluster of machines.
     Distributed applications are difficult to coordinate and work with as they are much more error prone (cause errors) due to huge number of machines attached to network. As many machines are involved, race condition and deadlocks are common problems when implementing distributed applications.

     Race condition occurs when a machine tries to perform two or more operations at a time and this can be taken care by serialization property of Zookeeper.

     Deadlocks are when two or more machines try to access same shared resource at the same time. More precisely they try to access each other’s resources which lead to lock of system as none of the system is releasing the resource but waiting for other system to release it. Synchronization in Zookeeper helps to solve the deadlock.

     Another major issue with distributed application can be partial failure of process, which can lead to inconsistency of data. Zookeeper handles this through atomicity, which means either whole of the process will finish or nothing will persist after failure.

      The Zookeeper framework was originally built at “Yahoo!” for accessing their applications in an easy and robust manner. Later, Apache Zookeeper became a standard for organized service used by Hadoop, HBase, and other distributed frameworks.

Distributed Application
   A Distributed Application can run on multiple systems in a network at a given time (simultaneously) by coordinating among them to complete a particular task in a fast and efficient manner. Normally, complex and time-consuming tasks, which will take hours to complete by a non-distributed application (running in a single system) can be done in minutes by a distributed application by using computing capabilities of all the system involved. 

  The time to complete the task can be further reduced by configuring the distributed application to run on more systems. A group of systems in which a distributed application is running is called a Cluster and each machine running in a cluster is called a Node.

  A distributed application has two parts, Server and Client application. Server applications are actually distributed and have a common interface so that clients can connect to any server in the cluster and get the same result. Client applications are the tools to interact with a distributed application.

    Apache Zookeeper is a service used by a cluster (group of nodes) to coordinate between themselves and maintain shared data with robust synchronization techniques. Zookeeper is itself a distributed application providing services for writing a distributed application. The common services provided by Zookeeper are as follows 

Naming service: Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.

Configuration management: Latest and up-to-date configuration information of the system for a joining node.

Cluster management: Joining / leaving of a node in a cluster and node status at real time.

Leader election: Electing a node as leader for coordination purpose.

Locking and synchronisation service: Locking the data while modifying it. This mechanism helps you in automatic fail recovery while connecting other distributed applications like Apache HBase.

Highly reliable data registry: Availability of data even when one or a few nodes are down.

Architecture of Zookeeper
The following diagram depicts the “Client-Server Architecture” of Zookeeper.
     Clients, one of the nodes in our distributed application cluster, access information from the server. For a particular time interval, every client sends a message to the server to let the sever know that the client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.

    Server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. Gives acknowledgement to client to inform that the server is alive.

     Group of ZooKeeper servers. The minimum number of nodes that is required to form an ensemble is 3.

    Server node which performs automatic recovery if any of the connected node failed. Leaders are elected on service startup.

     Server node which follows leader instruction.

Hierarchical Namespace
     The following diagram depicts the tree structure of ZooKeeper file system used for memory representation. ZooKeeper node is referred as znode. Every znode is identified by a name and separated by a sequence of path (/).

  • In the diagram, first you have a root znode separated by “/”. Under root, you have two logical namespaces config and workers.
  • The config namespace is used for centralized configuration management and the workers namespace is used for naming.
  • Under config namespace, each znode can store upto 1MB of data. This is similar to UNIX file system except that the parent znode can store data as well. The main purpose of this structure is to store synchronized data and describe the metadata of the znode. This structure is called as ZooKeeper Data Model.
     Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the metadata of a znode. It consists of Version number, Action control list (ACL), Timestamp, and Data length.

Version number
     Every znode has a version number, which means every time the data associated with the znode changes, its corresponding version number would also increased. The use of version number is important when multiple zookeeper clients are trying to perform operations over the same znode.

Action Control List (ACL) 
     ACL is basically an authentication mechanism for accessing the znode. It governs all the znode read and write operations.

     Timestamp represents time elapsed from znode creation and modification. It is usually represented in milliseconds. ZooKeeper identifies every change to the znodes from “Transaction ID” (zxid). Zxid is unique and maintains time for each transaction so that you can easily identify the time elapsed from one request to another request.

Data length
     Total amount of the data stored in a znode is the data length. You can store a maximum of 1MB of data.

Types of Znodes
     Znodes are categorized as persistence, sequential, and ephemeral.

Persistence znode
     Persistence znode is alive even after the client, which created that particular znode, is disconnected. By default, all znodes are persistent unless otherwise specified.

Ephemeral znode
     Ephemeral znodes are active until the client is alive. When a client gets disconnected from the ZooKeeper ensemble, then the ephemeral znodes get deleted automatically. For this reason, only ephemeral znodes are not allowed to have a children further. If an ephemeral znode is deleted, then the next suitable node will fill its position. Ephemeral znodes play an important role in Leader election.

Sequential znode
     Sequential znodes can be either persistent or ephemeral. When a new znode is created as a sequential znode, then ZooKeeper sets the path of the znode by attaching a 10 digit sequence number to the original name. For example, if a znode with path /myapp is created as a sequential znode, ZooKeeper will change the path to /myapp0000000001 and set the next sequence number as 0000000002. If two sequential znodes are created concurrently, then ZooKeeper never uses the same number for each znode. Sequential znodes play an important role in Locking and Synchronization.

     Sessions are very important for the operation of ZooKeeper. Requests in a session are executed in FIFO order. Once a client connects to a server, the session will be established and a session id is assigned to the client.
     The client sends heartbeats at a particular time interval to keep the session valid. If the ZooKeeper ensemble does not receive heartbeats from a client for more than the period (session timeout) specified at the starting of the service, it decides that the client died.
     Session timeouts are usually represented in milliseconds. When a session ends for any reason, the ephemeral znodes created during that session also get deleted.

     Watches are a simple mechanism for the client to get notifications about the changes in the ZooKeeper ensemble. Clients can set watches while reading a particular znode. Watches send a notification to the registered client for any of the znode (on which client registers) changes.
     Znode changes are modification of data associated with the znode or changes in the znode’s children. Watches are triggered only once. If a client wants a notification again, it must be done through another read operation. When a connection session is expired, the client will be disconnected from the server and the associated watches are also removed.

Zookeeper - Workflow
     Once a ZooKeeper ensemble starts, it will wait for the clients to connect. Clients will connect to one of the nodes in the ZooKeeper ensemble. It may be a leader or a follower node. Once a client is connected, the node assigns a session ID to the particular client and sends an acknowledgement to the client. If the client does not get an acknowledgment, it simply tries to connect another node in the ZooKeeper ensemble. Once connected to a node, the client will send heartbeats to the node in a regular interval to make sure that the connection is not lost.

     If a client wants to read a particular znode, it sends a read request to the node with the znode path and the node returns the requested znode by getting it from its own database. For this reason, reads are fast in ZooKeeper ensemble.

     If a client wants to store data in the ZooKeeper ensemble, it sends the znode path and the data to the server. The connected server will forward the request to the leader and then the leader will reissue the writing request to all the followers. If only a majority of the nodes respond successfully, then the write request will succeed and a successful return code will be sent to the client. Otherwise, the write request will fail. The strict majority of nodes is called as Quorum.

Nodes in a ZooKeeper Ensemble
     Let us analyze the effect of having different number of nodes in the ZooKeeper ensemble. If we have a single node, then the ZooKeeper ensemble fails when that node fails. It contributes to “Single Point of Failure” and it is not recommended in a production environment.

     If we have two nodes and one node fails, we don’t have majority as well, since one out of two is not a majority. If we have three nodes and one node fails, we have majority and so, it is the minimum requirement. It is mandatory for a ZooKeeper ensemble to have at least three nodes in a live production environment.

     If we have four nodes and two nodes fail, it fails again and it is similar to having three nodes. The extra node does not serve any purpose and so, it is better to add nodes in odd numbers, e.g., 3, 5, 7.

    We know that a write process is expensive than a read process in ZooKeeper ensemble, since all the nodes need to write the same data in its database. So, it is better to have less number of nodes (3, 5 or 7) than having a large number of nodes for a balanced environment.

     The following diagram depicts the ZooKeeper WorkFlow and the subsequent table explains its different components.
     Write process is handled by the leader node. The leader forwards the write request to all the znodes and waits for answers from the znodes. If half of the znodes reply, then the write process is complete.

     Reads are performed internally by a specific connected znode, so there is no need to interact with the cluster.

Replicated Database
     It is used to store data in zookeeper. Each znode has its own database and every znode has the same data at every time with the help of consistency.

     Leader is the Znode that is responsible for processing write requests.

     Followers receive write requests from the clients and forward them to the leader znode.

Request Processor
     Present only in leader node. It governs write requests from the follower node.

Atomic broadcasts
     Responsible for broadcasting the changes from the leader node to the follower nodes.

Zookeeper - Leader Election
Let us analyze how a leader node can be elected in a ZooKeeper ensemble. Consider there are N number of nodes in a cluster. The process of leader election is as follows -
* All the nodes create a sequential, ephemeral znode with the same path, /app/leader_election/guid_.

* ZooKeeper ensemble will append the 10-digit sequence number to the path and the znode created will be /app/leader_election/guid_0000000001, /app/leader_election/guid_0000000002, etc.

* For a given instance, the node which creates the smallest number in the znode becomes the leader and all the other nodes are followers.

* Each follower node watches the znode having the next smallest number. For example, the node which creates znode /app/leader_election/guid_0000000008 will watch the znode /app/leader_election/guid_0000000007 and the node which creates the znode /app/leader_election/guid_0000000007 will watch the znode /app/leader_election/guid_0000000006.

* If the leader goes down, then its corresponding znode /app/leader_electionN gets deleted.

* The next in line follower node will get the notification through watcher about the leader removal.

* The next in line follower node will check if there are other znodes with the smallest number. If none, then it will assume the role of the leader. Otherwise, it finds the node which created the znode with the smallest number as leader.

* Similarly, all other follower nodes elect the node which created the znode with the smallest number as leader.
     Leader election is a complex process when it is done from scratch. But ZooKeeper service makes it very simple. Let us move on to the installation of ZooKeeper for development purpose in the next chapter.

Implementing Leader Election Algorithm in Java
     To implement the leader election algorithm explained in previous section, we have created following three classes (please click on the heading to see the source code on GitHub) - - This class is responsible for interacting with ZooKeeper cluster by connecting to ZooKeeper service, creating, deleting and getting znodes and setting the watch on znodes. - This class represents a process in the leader election. This class is implemented as Runnable and responsible for implementing the leader election algorithm with the help of ZooKeeperService class. - This is a launcher class responsible for starting the thread of ProcessNode implementation. Since Apache ZooKeeper client runs daemon processes for notifying about watchevent, this class uses ExecutorService to start ProcessNode so that program doesn't exit after ProcessNode main thread is finished executing.

      That's it guys. This is all about Apache Zookeeper Tutorial. Let me know your comments and suggestions about this tutorial. Thank you. 

1 comment: