Performance tests of JGroups


Author: Bela Ban
Date: Sept/Oct 2006

Configuration


The tests were conducted in the JBoss lab in Atlanta. The cluster consisted of 8 identical boxes (cluster01-08), where each box was:

The tests were run only on 64-bit systems.

Switches

The following switch was used:

Network

Note that neither the network nor the test machines were reserved exclusively for the performance tests, so some of
the fluctuations are explained by other users/processes using the machines and/or generating traffic on the network.
This was mitigated somewhat by running all tests in the morning hours (8am - 3pm) European time, when the servers in
Atlanta, GA were mostly idle. We also used machines which had no cron jobs (such as CruiseControl) scheduled that could
interfere with the tests.


JVM configuration

The following JVM was used:

SUN: JDK 1.5.0_04-b05 Java HotSpot(TM) Server VM (build 1.5.0_04-b05, mixed mode)

The options used for starting the JVMs were:

SUN: -Xmx500M -Xms500M -XX:CompileThreshold=1000 -XX:+UseParallelGC -XX:+AggressiveHeap -XX:NewRatio=1 -server -Dlog4j.configuration=file:/home/bela/log4j.properties -Dresolve.dns=false -Dbind.address=${MYTESTIP_1}




Design of the tests

The goal of the test is to measure the time it takes to reliably send N messages to all nodes in a cluster. We are currently not interested in measuring the latency of remote group RPCs; this may be done in a future performance test. The test works as follows.
The test is part of JGroups: org.jgroups.tests.perf.Test. The driver is JGroups-independent and can also be used to measure JMS performance and pure UDP or TCP performance (all that needs to be done is to write an implementation of the Transport interface, with a send() and a receive()).
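To illustrate what such an implementation involves, the sketch below shows the approximate shape of the Transport abstraction as just described. Everything beyond send() and a receive callback is an assumption for illustration; consult org.jgroups.tests.perf.Transport in the JGroups sources for the authoritative definition.

// Sketch only: approximate shape of the pluggable transport described above.
// Method names other than send() and receive() are illustrative assumptions.
public interface Transport {
    Object getLocalAddress();                      // identity of this node
    void create(java.util.Properties config) throws Exception; // read config.txt values
    void start() throws Exception;                 // join the group / open sockets
    void stop();                                   // leave the group / close sockets
    void send(Object destination, byte[] payload) throws Exception; // null dest = everyone
    void setReceiver(Receiver r);                  // register the receive() callback
}

public interface Receiver {
    void receive(Object sender, byte[] payload);   // invoked for every received message
}

A new transport (e.g. for JMS) then only has to map send() and the receive() callback onto the underlying messaging API.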

The configuration file (config.txt) is shown below (only the JGroups configuration is shown):

# Class implementing the org.jgroups.tests.perf.Transport interface

transport=org.jgroups.tests.perf.transports.JGroupsTransport
#transport=org.jgroups.tests.perf.transports.UdpTransport
#transport=org.jgroups.tests.perf.transports.TcpTransport

# Number of messages a sender multicasts
num_msgs=1000000

# Message size in bytes.
msg_size=1000

# Expected number of group members.
num_members=2

# Number of senders in the group. Min 1, max num_members.
num_senders=2

# dump stats every n msgs
log_interval=100000

This file must be the same on all nodes, so it is suggested to place it on a shared file system, e.g. an NFS-mounted directory.
The following parameters are used:

transport
An implementation of Transport, for our tests we used the JGroupsTransport
num_msgs
Number of messages to be sent by a sender
msg_size
Number of bytes of a single message
num_members
Number of members. When members are started, they wait until num_members nodes have joined the cluster, and then the senders start sending messages
num_senders
Number of senders (must be less than or equal to num_members). This allows each receiver to compute the total number of messages to be received: num_senders * num_msgs. In the example, we'll receive 2000000 messages (2 * 1M messages)
log_interval
Statistics about sent and received messages are written to stdout (and to a file if -f is used, see below) every log_interval messages

The options for the test driver are:
bela@laptop /cygdrive/c
$ java org.jgroups.tests.perf.Test -help
Test [-help] ([-sender] | [-receiver]) [-config <config file>] [-props <stack config>] [-verbose] [-jmx] [-dump_stats] [-f <filename>]


-sender / -receiver
Whether this process is a sender or a receiver (a sender is always a receiver as well)
-config <file>
Points to the configuration file, e.g. -config /home/bela/config.txt
-props <props>
The JGroups protocol stack configuration. Example: -props c:\fc-fast-minimalthreads.xml. Can be any URL or filename
-verbose
Verbose output
-jmx
Enables JMX instrumentation (requires JVM with JMX MBeanServer, e.g. JDK5). This will cause the VM not to terminate when done. To access the process via JMX, the -Dcom.sun.management.jmxremote property has to be defined and jconsole can be used. For more details see http://wiki.jboss.org/wiki/Wiki.jsp?page=JMX.
-dump_stats
Dumps some JMX statistics after the run, e.g. number of messages sent, number of times blocked etc
-f <filename>
Dumps the number of messages sent, message rate, throughput, current time, and free and total memory to <filename> every log_interval milliseconds. This is the main tool to generate charts of memory behavior or message rate variance
-help
This help
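With the configuration shown above (num_members=2, num_senders=2), both members act as senders, so the same command can be started on each of the two nodes (the paths are illustrative):

java org.jgroups.tests.perf.Test -sender -config /home/bela/config.txt -props /home/bela/fc-fast-minimalthreads.xml -f /home/bela/results.log

Once num_members processes have joined, the senders start sending; to watch a run with jconsole, add -jmx and start the JVM with -Dcom.sun.management.jmxremote as described above.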

The files needed to run the tests are included with the JGroups source distribution as well: JGroups/conf/config.txt is the configuration file, and JGroups/conf/fc-fast-minimalthreads.xml is the stack configuration. However, note that all the test runs below include the JGroups stack configuration used, so that they can be reproduced.

For additional details on how to run the tests refer to http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.


Parameters

The parameters that are varied are:

Results

The plain results are available here for the 4 node cluster, here for the 6 node cluster and here for the 8 node cluster. The OpenOffice spreadsheet containing all results can be obtained here.
The performance tests were run on the 3 clusters, measuring message rate (messages received/sec) and throughput (MB received/sec), as described in http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
The JGroups configurations used are:

Message rate and throughput

The Y axes define message rate and throughput. The message rate is computed as <number of messages sent> * <number of senders> / <time to receive all messages>. For example, a message rate of 35000 for 1K means that a receiver received 35000 1K messages on average per second. On average means that - at the end of a test run - all nodes post their results, which are collected, summed up and divided by the number of nodes. So if we have 2 nodes, and they post message rates of 10000 and 20000, the average message rate will be 15000.
The throughput is simply the average message rate times the message size; e.g. 35000 1K messages/sec correspond to ca. 35MB/sec.
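Expressed as code, the per-node computation described above amounts to the following (a sketch, not the actual test code):

// Sketch (not the actual test code) of the per-node statistics described above
public class PerfStats {
    // messages received per second at one node
    static double messageRate(long numMsgs, int numSenders, double seconds) {
        return numMsgs * numSenders / seconds;
    }
    // bytes received per second = message rate * message size
    static double throughput(double msgRate, int msgSizeBytes) {
        return msgRate * msgSizeBytes;
    }
    public static void main(String[] args) {
        double rate = messageRate(1000000, 2, 57.0); // 2M messages received in 57s
        System.out.println(rate + " msgs/sec, " + throughput(rate, 1000) / 1e6 + " MB/sec");
    }
}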


4 node cluster










For the 4 node cluster, we only measured 1K and 5K messages. The message rate for UDP (using IP multicasting) is relatively stable for 1, 2 and 4 senders. In the TCP case, we can see that the message rate for 1 sender and 1K messages is slightly lower, then catches up for larger messages and more senders. In both cases, the message rate for 5K messages is lower than that for 1K, but the aggregate throughput is about the same.
We get an average message rate of between 35000 and 40000 1K messages/sec and 5000-8000 5K messages/sec for UDP, and between 25000 and 45000 1K messages/sec and 7000-10000 5K messages/sec for TCP.



6 node cluster








Because the entire cluster was not available to us, we could only measure UDP on the 6 node cluster. The numbers shown above somewhat contradict the rest of the tests: they show that message rate and throughput decrease with increasing cluster size. The absolute numbers are also worse than for the 4-node and 8-node clusters! A possible explanation is that the 2 machines which were not available to us might have generated network traffic, negatively affecting our tests.


8 node cluster










The 8 node UDP cluster exhibits similar characteristics to the 4 node cluster: for 1 sender, the results are good, they go down slightly for 4 senders, and go up again for 8 senders. With TCP, this is different: the numbers for 1 sender are lower than for UDP, but go up for 4 and 8 senders and pass UDP at 4 senders.
With UDP, the message rates for 1, 4 and 8 senders (sending 1K messages) don't vary much (23732, 24270 and 28994 messages/sec). The same goes for 2.5K messages. However, there is bigger variation for 5K messages.
For TCP throughput, 1 sender sends messages at around 14MB/sec, 4 senders climb to 36MB/sec and 8 senders to 50MB/sec, regardless of the message size.

Interpretation of results

General

All of the tests were conducted with a relatively conservative configuration, which allowed the tests to run for an extended period of time. For example, the number of credits in flow control was relatively small. Had we set them to a higher value, we would probably have gotten better results, but the tests would not have been able to run for an extended period of time without running out of memory. For instance, the number of credits should be a function of the group size (more credits for more nodes in the cluster); however, we used (more or less) the same configuration for all tests. We will introduce an enhanced version of the flow control protocol which dynamically adapts credits, e.g. based on free memory, loss rate etc, in http://jira.jboss.com/jira/browse/JGRP-2.
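For illustration, flow control is configured via the FC protocol in the stack XML; a conservative setup looks something like the following (the values here are examples, not the exact ones used in the tests):

<!-- Illustrative values, not the test configuration:
     max_credits   = bytes a sender may send to the group before blocking
                     until receivers replenish its credits
     min_threshold = fraction of credits left at which receivers send new credits -->
<FC max_credits="1000000" min_threshold="0.25"/>

Raising max_credits generally improves throughput but lets senders run further ahead of slow receivers, which is exactly the memory trade-off discussed above.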
Another issue is the use of a gigabit ethernet switch. As shown in previous tests (http://www.jgroups.org/javagroupsnew/docs/Perftest.html), JGroups does saturate a 100Mbps switch with clusters of up to 8 nodes, at roughly 11-11.5MB/sec. On the GB switch, where we could receive a maximum of ca. 125MB/sec, the results show that we are far from reaching that number. However, the numbers shown are probably more than good enough for most applications using JGroups, especially if traffic-reducing mechanisms such as field-based replication for HTTP sessions are used.
Some reasons might be (needs further investigation):

Processing of messages and scalability

The test program currently doesn't do any processing at the receiver when a message is received. Instead, a message counter is increased and the message size is added to the total number of bytes received so far. This is of course an ideal situation, and not realistic for normal use, where receivers might unmarshal the message's byte buffer into something meaningful (e.g. an HTTP session) and process it (e.g. updating an HTTP session). (The test, by the way, allows processing time at the receiver to be taken into account by setting the processing_delay value.) As a matter of fact, we have found in other tests (JBossCache, JBoss HTTP session replication) that the majority of the time is spent in marshalling and (especially) unmarshalling (serialization) of application data, rather than in JGroups. Until and unless marshalling times are reduced, the performance of JGroups will not even matter much.
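For example, a single line in config.txt enables such simulated processing (the value here is illustrative, and the unit should be verified against the test sources):

# simulated per-message processing time at the receiver (illustrative value)
processing_delay=100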
What does matter, however, is that (for JGroups 2.4) if we do have some processing time at the receiver, then incoming messages - even from different senders - have to wait in a queue to be processed. So if a receiver R receives messages A.M1 (M1 from A), B.M10, B.M11, B.M12 and A.M2, then B.M{10-12} and A.M2 have to wait until A.M1 has been processed. Then B.M10 is processed while B.M{11-12} and A.M2 keep waiting, and so on.
This will be changed in JGroups 2.5 with the threadless stack, where JGroups can be configured to process messages from different senders concurrently. For example, in the above case, JGroups could be configured to process all messages from A and B in separate threads (say T1 and T2), so T1 would process A.M{1-2} and T2 would process B.M{10-12}. This would allow for concurrent processing of A.M1 and B.M10, which increases performance, compared to the single-queue case above.
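As an illustration of this idea (a sketch, not the JGroups 2.5 API), such per-sender concurrency can be built from one single-threaded executor per sender, so that messages from the same sender are processed in FIFO order while different senders proceed in parallel:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: serializes messages per sender, runs different senders concurrently
public class PerSenderDispatcher {
    private final ConcurrentMap<Object,ExecutorService> queues =
        new ConcurrentHashMap<Object,ExecutorService>();

    public void dispatch(Object sender, Runnable processMessage) {
        ExecutorService queue = queues.get(sender);
        if(queue == null) {
            ExecutorService created = Executors.newSingleThreadExecutor();
            queue = queues.putIfAbsent(sender, created);
            if(queue == null)
                queue = created;    // we registered our queue first
            else
                created.shutdown(); // another thread beat us to it
        }
        queue.execute(processMessage); // FIFO per sender, concurrent across senders
    }

    public void stop() {
        for(ExecutorService queue : queues.values())
            queue.shutdown();
    }
}

In the example above, A.M1 and B.M10 would then be processed in parallel (T1 and T2), while A.M2 still waits for A.M1, and B.M11 for B.M10.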
We will investigate whether the threadless stack increases performance for the performance tests, where no processing is done at the receiver.


UDP versus TCP

As can be seen with the 4 and 8 node clusters, TCP performs better than UDP when more than one sender is sending messages. Since both the UDP (fc-fast-minimalthreads.xml) and TCP (tcp_nio.xml) configurations use flow control (FC), the problem cannot be flow control. There are multiple issues that could contribute to this, some of them being (these need to be investigated further):

Fluctuation of results (reproducibility)

As there was no reservation system for the cluster lab (reservations were made by sending around emails), it is possible that other applications were sometimes using the lab, generating network traffic and consuming CPU cycles, and thus distorting the results. Although we used the lab (based in Atlanta, GA) early in the morning (9am-2pm CET, 3am-8am EST), there is still a chance that a cron job was started, or that someone from a different time zone was using the lab. We usually verified that no other application was running before starting our tests (using last, top, who and ps, for example), but that doesn't prevent another application from starting during a test run.

Conclusion and outlook

We showed that the performance of JGroups is good for clusters ranging from 4 to 8 nodes. TCP scales better than UDP; however, it also creates a mesh of connections (every sender needs to be connected to every other node in the cluster), which may not scale to large clusters.
We intend to measure the performance of larger clusters (16 - 64 nodes) in the next round of benchmarks.