Collaborative Fault Tolerance using JGroups

Collaborative Fault Tolerance Using JGroups

By Jerry Overton, OCI Senior Software Engineer

September 2007


Unlike in other systems, failures in collaborative systems are the norm rather than the exception. Any viable collaborative system must have fault tolerance mechanisms that allow for both the detection and mitigation of individual node failures. These mechanisms can be complicated and very expensive to build.

The JGroups API gives developers easy access to reliable group communication capabilities. This article gives an overview of how to build collaborative fault tolerance by using JGroups to implement portions of the Object Group pattern. The information is based on the experience of writing a Java application that simulates collaborative system failures. Excerpts from that application are used to illustrate key concepts.

Collaborative Systems

In collaborative systems, otherwise autonomous computing nodes cooperate to achieve a common task that would not be possible with any individual node acting alone. Examples include intelligent transport systems, military command and control systems and autonomous manufacturing systems. Although the exact definition of a collaborative system can vary depending on context, this article will focus on three defining characteristics:

Figure 1 is an example of a collaborative system. It shows a network of environmental sensor stations. The nodes in the network are autonomous and spatially distributed across the region shown. Each sensor is capable of recording and reporting its local environmental conditions without the help of any of the other sensor stations.

The task-execution responsibilities are distributed across multiple sensor stations. The system is designed to report the environmental condition of a given region. Each sensor is capable of recording and reporting its local conditions, but to record and report the condition of the entire region requires that all sensor stations cooperate.

The communication links between the sensor stations are decentralized and dynamic. Sensors can enter and leave the network at anytime. Every station is wirelessly connected to every other station, so no single sensor failure can disrupt the overall network connectivity.

Figure 1: Example Collaborative System.

Figure 1 Example Collaborative System

Collaborative systems have special failure characteristics beyond the typical failures of other software systems. In systems with large numbers of nodes, node failures are common. Internal exceptions will cause nodes to become unresponsive in the network. Dynamic network connections, especially wireless links, are prone to temporal or permanent interruptions. These interruptions not only decrease the functionality of the collaborative system, but may also lead to the corruption of individual node states.

For example, sensor stations are exposed to adverse weather, are knocked over and broken easily, and can be expected to run out of power. Anything (people, cars, animals) passing between the line of sight of two sensor stations can cause a temporary loss of communication. If this happens at the right time, a controller in a region may miss a sensor update and become out of touch with the current conditions in the region.

A successful set of fault tolerance strategies for collaborative systems will allow nodes to both detect and mitigate distributed failures. Each node must be capable of detecting when other nodes become unresponsive; each node must be capable of performing in degraded mode when disconnected from the network; and the network must be capable of using node redundancy to compensate for the loss of any particular node.

Simulating a Collaborative System

This section describes the design of a Java application that simulates fault tolerance in collaborative systems using JGroups. Code excerpts from the simulator are used in subsequent sections to illustrate how to use the JGroups API to build failure detection and mitigation in collaborative systems.

Overview of JGroups

JGroups is toolkit of Java APIs for reliable group communication. Developers can use JGroups to join a group, send messages to group members, and receive messages from group members. JGroups communication infrastructure keeps track of the members in every group and updates group members when changes occur -- new members join, existing members leave or crash.

As shown in Figure 2, the JGroups architecture consists of three main parts: (1) the Channel API used to establish communication links, (2) building blocks that group channel functionality into higher-level abstractions and (3) the protocol stack that handles network communication for a given channel.

Figure 2: JGroups Architecture

Figure 2: JGroups Architecture

A process has to create a channel to join a group. The channel is the handle to the group. Clients that want to connect to multiple groups have to create multiple channels. The programming interface for channels is simple and primitive. Building blocks (a more sophisticated interface into a channel) saves the application programmer from having to write tedious, repetitive code. The protocol stack manages the details of sending and receiving messages over the IP network.

A Network Simulator

Figure 3 shows the logical structure of software developed to simulate the behavior of nodes in a collaborative system. A simulation engine (not shown here) launches GroupNodes, each with its own threads of execution. Each node gets its ability to collaborate through an association with a CommStrategy object. The CommStrategy has an association back to its GroupNode in case the GroupNode needs to be notified of events from the CommStrategy.

Figure 3: Logical Structure of Collaborative System Simulator.

Figure 3: Logical Structure of Collaborative System Simulator

In this specific example, the PushPullNode extends the functionality of the standard GroupNode. The PushPullNode gets its ability to collaborate through an association with a PushPullStrategy that extends the functionality of the CommStrategy using JGroup-specific functions defined by the JGroupListener object. The remainder of this article gives an overview of how JGroups was used to add collaborative fault tolerance to the PushPullNode.

Detecting Failures

Participants in a collaborative network need the ability to let other nodes know when they fail and to detect the failures of the other nodes. For example, suppose that, in the collaborative system shown in Figure 1, rain water leaks into the waterproofing of one of the sensors. The sensor shorts and is unable to perform normally.

In this scenario, controllers that rely on that sensor for regular updates will have to be notified of the failure. The sensor is responsible for letting controllers know that it has failed, but if the sensor is sufficiently damaged, the controllers may have to find out about the error by some other means.

The following code is an excerpt from a Java application that simulates collaborative system failures. It shows how the PushPullNode uses the JGroups API to detect distributed failures. For the sake of brevity, the error handling code is not shown. In some cases, the details of a function are replaced with general comments.

  1. public class CommStrategy {
  2. protected JChannel channel;
  4. public CommStrategy(){
  5. = new JChannel();
  6., Boolean.TRUE);
  7. }
  8. public void joinGroup(String groupName) throws Exception {
  10. }
  11. public void sendMessage (String message) throws Exception {
  12., null, message.getBytes());
  13. }
  14. }
  15. public class PushPullStrategy extends CommStrategy implements Receiver {
  17. public void joinGroup (String groupName) throws Exception {
  20. }
  21. public void receive(Message message){
  22. String msg = new String(message.getBuffer());
  23. //write code to process the message;
  24. }
  25. public void suspect(View view){
  26. //check the view for its content and handle appropriately
  27. }
  28. }

The PushPullStrategy extends the CommStrategy. When the PushPullNode creates a PushPullStrategy, the CommStrategy constructor creates a new channel and sets the channel.SUSPECT property to TRUE. The channel.SUSPECT property allows the channel to automatically notify the PushPullNode when any member of its group is suspected of failing. Nodes in a group are suspect when, after a period of time, they fail to respond to health monitoring messages. Both the suspect notification and the node monitoring are provided by the JGroups infrastructure and are transparent to the application developer.

For the network shown in Figure 1, fault tolerant communication between two sensors (A and B) can be simulated using two PushPullNodes. Both sensor PushPullNodes join the same group using the same groupName. If sensor A fails, but is still able to notify other members of that failure, it calls this.CommStrategy.sendMessage(). JGroups calls sensor B's PushPullStrategy.receive(), alerting sensor B of the failure.

If sensor A's error is so severe that it becomes completely unresponsive, JGroups calls the suspect function of sensor B. Sensor B's suspect function checks the view for details about which member is suspected and the nature of the failure. See the JGroups User Manual for more details about the information available in a view.

Mitigating Failures

The most basic strategy for mitigating failures is to keep back-up copies of data or services and use them when a failure occurs. An Object Group uses replicated state to ensure continuity in the event of a failure. Although Object Groups can provide failure times much faster than simply using backups, programming them from scratch can be a daunting task.

The JGroups API makes it very easy to implement the Object Group pattern. The following code extends the CommStrategy and PushPullStrategy to include support for Object Groups.

  1. public class CommStrategy {
  2. protected JChannel channel;
  4. public CommStrategy(){...}
  5. public void joinGroup(String groupName) throws Exception {...}
  6. public void sendMessage (String message) throws Exception {...}
  8. public void groupOperation(){
  9. //domain-specific processing here
  10., 5000);
  11. }
  12. }
  13. public class PushPullStrategy extends CommStrategy implements Receiver {
  14. private byte[] state;
  16. public void joinGroup (String groupName) throws Exception {...}
  17. public void receive(Message message){...}
  18. public void suspect(View view){...}
  20. public byte[] getState(){
  21. //retrieve state appropriately
  22. return this.state;
  23. }
  24. public void setState(byte[] state){
  25. //process and store state appropriately
  26. }
  27. }

Adding Object-Group style support to the PushPullNode using JGroups took only a very small amount of additional code. The new operations added are CommStrategy.groupOperation(), PushPullStrategy.getState() and PushPullStrategy.setState(). The CommStrategy.groupOperation() is a general representation of what would normally be domain-specific functionality. For example, in the sensor network shown in figure 1, the group operation might be getZoneTemp() instead.

The CommStrategy.groupOperation() requests the state of the current group, the PushPullStrategy.getState() retrieves the state and the PushPullStrategy.setState() operation processes and persistently stores the state.

For the network shown in Figure 1, a controller, sensor A and sensor B could be simulated using PushPullNodes. Sensors A and B join the same group representing a physical zone. The controller relies on both sensor A and sensor B to report temperature for a given region. The controller doesn't care which sensor it uses as long as at least one of them is always available. Figure 4 shows the interactions that occur during a group operation.

Figure 4: Collaborations in a Group Operation.

Figure 4: Collaborations in a Group Operation

When the controller wants a temperature reading from the zone, it joins the zone's group and calls this.commstrategy.groupOperation(). JGroups elects a leader (by default based on membership length) within the group and calls getState() on that node (let's assume that sensor A was chosen).

The getState() operation of sensor A takes a temperature reading and sets the reading as the operation's return value. JGroups then calls setState() on the controller, passing it the temperature reading from sensor A. In subsequent requests for the zone temperature, if sensor A becomes unresponsive, JGroups will failover to sensor B.

Complicated group management functions were provided to the programmer implicitly by the JGroups API, and the leadership election and failover is completely transparent to the controller's programmer.


JGroups actually works. It was a straightforward task to use APIs for the multicasting and group management functions needed for collaborative fault tolerance. JGroups has several extensibility features, not mentioned in this article, that allow you to do things like write your own communication protocol or overwrite the default settings of the original protocol.

Unfortunately, some of the examples in the API manual were outdated, and several of the API operations were deprecated. In the course of writing the simulation software, it took some trial and error to compensate for information missing from the manual.

At the time this article was written, JGroups was used in a number of commercial applications. JBoss uses JGroups to provide clustering features for its Java application servers. JGroups is also used by FactoryPMI to provide connection redundancy and load balancing for industrial control systems.