Performance tests JGroups 2.6.4

Author: Bela Ban
Date: August 2008

Configuration

The tests were conducted in the JBoss lab in Atlanta. Each box was:

Dell PE1850 2x 3Ghz EM64T processors
4GB memory
Red Hat Enterprise Linux (RHEL) 4
OS: Linux 2.6.9-42.0.10.smp

The tests were run only on 64-bit systems.

Switches

The following switch was used:

1GBps: Dell Power Connect 5224 (24 port gige L2 managed switch) with jumbo frames enabled. All NICs had MTUs of 1500 bytes. We didn't have the time to switch to jumbo frames.

Network

Note that neither the network nor the test machines were reserved exclusively for the performance tests, so some of
the fluctuations are explained by the fact that other users/processes were using the machines and/or generating traffic on the network. This was mitigated somewhat by running all tests in the AM hours (8am - 3pm) in Europe, when the servers in
Atlanta GA were mostly idle. We also used machine which have no cron jobs (like cruisecontrol etc) scheduled that could interfere with the tests.

JVM configuration

The following JVMs were used:

SUN JDK 1.6.0_03-b05

The options used for starting the JVMs were:
-server -Xmx600M -Xms400M -XX:+UseParallelGC -XX:+AggressiveHeap -XX:CompileThreshold=100 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31 -Dlog4j.configuration=file:/home/bela/log4j.properties -Djgroups.bind_addr=${MYTESTIP_1} -Dcom.sun.management.jmxremote -Dresolve.dns=false

Design of the tests

The goal of the test is to measure the time it takes to reliably send N messages to all nodes in a cluster. (We are currently not interested in measuring remote group RPC, this may be done in a future performance test.) The test works as follows.

All members have access to 2 configuration files: one for defining the number of messages to be sent, the number of cluster members, senders etc, and the other one defining the JGroups protocol stack.
N processes are started. They can be started on the same machine, but for better results there is one JGroups process per box
The processes join the same group, and when N equals the cluster size defined in the configuration file, all senders start sending M messages to the cluster (M is also defined in the configuration file).
Every node in the cluster receives the messages (senders are also receivers). When the first message is received, the timer is started. When the last message is received the timer is stopped. Every member knows how many messages are to be received: <number of senders> * <number of messages>. This allows every receiver to compute the message rate and throughput (we also know the size of each message).

The test is part of JGroups: org.jgroups.tests.perf.Test. The driver is JGroups independent and can also be used to measure JMS performance and pure UDP or TCP performance (all that needs to be done is to write a Transport interface implementation, with a send() and a receive()).

The configuration file (config.txt) is shown below (only the JGroups configuration is shown):
# Class implementing the org.jgroups.tests.perf.Transport interface transport=org.jgroups.tests.perf.transports.JGroupsTransport #transport=org.jgroups.tests.perf.transports.UdpTransport #transport=org.jgroups.tests.perf.transports.TcpTransport # Number of messages a sender sendes num_msgs=1000000 # Message size in bytes msg_size=1000 # Expected number of group members num_members=2 # Number of senders in the group. Min 1, max num_members num_senders=2 # dump stats every n msgs log_interval=100000

This file must be the same for all nodes, so it is suggested to place it in a shared file system, e.g. an NFS mounted directory.
The following parameters are used:

transport	An implementation of Transport
num_msgs	Number of messages to be sent by a sender
msg_size	Number of bytes of a single message
num_members	Number of members. When members are started, they wait until num_members nodes have joined the cluster, and then the senders start sending messages
num_senders	Number of senders (must be less then or equal to num_members). This allows each receiver to compute the total number of messages to be received: num_senders * num_msgs. In the example, we'll receive 2 million 1K messages
log_interval	Output about sent and received messages will be output to stdout (and to file if -f is used (see below)) every log_interval messages

The options for the test driver are:
bela@laptop /cygdrive/c $ java org.jgroups.tests.perf.Test -help Test [-help] ([-sender] | [-receiver]) [-config <config file>] [-props <stack config>] [-verbose] [-jmx] [-dump_stats] [-f <filename>]

-sender / - receiver	Whether this process is a sender or a receiver (a sender is always a receiver as well)
-config <file>	Points to the configuration file, e.g. -config /home/bela/config.txt
-props <props>	The JGroups protocol stack configuration. Example: -props c:\fc-fast-minimalthreads.xml. Can be any URL
-verbose	Verbose output
-jmx	Enables JMX instrumentation (requires JVM with JMX MBeanServer, e.g. JDK5). This will cause the VM not to terminate when done. To access the process via JMX, the -Dcom.sun.management.jmxremote property has to be defined and jconsole can be used. For more details see http://wiki.jboss.org/wiki/Wiki.jsp?page=JMX.
-dump_stats	Dumps some JMX statistics after the run, e.g. number of messages sent, number of times blocked etc
-f <filename>	Dumps number of messages sent, message rate, throughput, current time, free and total memory to <filename> every log_interval milliseonds. This is the main tool to generate charts on memory behavior or message rate variance
-help	This help

The files needed to run te tests are included with JGroups (source distribution) as well: JGroups/conf/config.txt is the test configuration file, and JGroups/conf/{udp,tcp}.xml are the JGroups protocol stack config files. Note that all the test runs will have the JGroups stack configuration file included so that they can be reproduced.

For additional details on how to run the tests refer to http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.

Parameters

The parameters that are varied are:

Cluster size: 4, 6, 8 and 10
Number of senders: N (where N is the cluster size); every node in the cluster is a sender
Number of messages: 1 million (constant, regardless of cluster size)
Message size: 1K, 2.5K and 5K
Switch: 1GB/sec
Protocols: IP multicast-based transport (udp.xml) and TCP-based transport (tcp.xml)
JVMs: SUN JDK6
JGroups 2.6.4.RC1 (CVS Aug 15 2008)

Results

The plain results are available here for the 4 node cluster, here for the 6 node cluster and here for the 8 node cluster. The OpenOffice spreadsheet containing all results can be obtained here.
The performance tests were run on the cluster, measuring message rate (messages received/sec) and throughput (MB received/sec), as described in http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
The JGroups configurations used are:

UDP (udp.xml): uses IP multicast in the transport, with FC as flow control implementation
TCP (tcp.xml): uses multiple TCP connections in the transport

Message rate and throughput

The Y axes define message rate and throughput, and the X axis defines the message size. The message rate is computed as <number of messages sent> * <number of senders> / <time to receive all messages>. For example, a message rate of 35000 for 1K means that a receiver received 35000 1K messages on average per second. On average means that - at the end of a test run - all nodes post their results, which are collected, summed up and divided by the number of nodes. So if we have 2 nodes, and they post message rates of 10000 and 20000, the average messsage rate will be 15000.
The throughput is simply the average message rate times the message size.

4 node cluster

6 node cluster

8 node cluster

10 node cluster

Scalability per node

Scalability per cluster

Interpretation of results

General

All of the tests were conducted with a relatively conservative configuration, which would allow for the tests to be run for an extended period of time. For example, the number of credits in flow control was relatively small. Had we set them to a higher value, we would have probably gotten better results, but the tests would not have been able to run for an extended period of time without running out of memory. For instance, the number of credits should be a function of the group size (more credits for more nodes in the cluster), however, we used (more or less) the same configuration for all tests. We will introduce an enhanced version of the flow control protocol which dynamically adapts credits, e.g. based on free memory, loss rate etc, in http://jira.jboss.com/jira/browse/JGRP-2.

Compared to the previous tests (2007), the message rates and throughput for UDP is now considerably higher than before, in most cases even higher than the TCP based configuration. This is caused by performance modifications in JGroups and by setting core.net.rmem_max to 20000000 (from 131'000). The latter change allows the UDP stack to buffer more datagrams, triggering fewer retransmissions and therefore increasing performance. Note that https://jira.jboss.org/jira/browse/JGRP-806 certainly also made a big difference.

Note that for this test, for administrative reasons, we could not make use of jumbo frames and the MTU was 1500 instead of 9000 for the 2007 test. This likely affects throughput for larger messages.

The throughput of UDP has little variance for 1K messages and ranges from 56 MB/sec on 4 nodes to 51 MB/sec on 10 nodes. This is significantly better than TCP which starts at 49 MB/sec on 4 nodes and drops off to 34 MB/sec on 10 nodes. We can also see that the drop in throughput for UDP is worse for 2.5K messages (from 94 MB/sec on 4 nodes down to 65 MB/sec on 10 nodes) and for 5K messages (from 107 MB/sec on 4 nodes down to 77 MB/sec on 10 nodes). We attribute this to the fact that jumbo frames were not used: large UDP datagrams are fragmented into more IP packets (that need to get retransmitted on a single loss) with an MTU of 1500 than with 9000. TCP doesn't have this issue as it generates IP packets which are under the MTU's size.

Scalability

As can be seen in “Scalability per node”, message rates and throughput degrade slightly as we increase the cluster size. Reasons could be that the switch starts dropping packets when overloaded or flow control (FC) has too few credits (which are statically set, but should be allocated as a function of the cluster size). This will be investigated in one of the next releases.

Note that all numbers for message rates and throughput are shown per node, ie. the number of MBs a single node receives on average per second. If we aggregate those numbers and show the total traffic over the entire cluster, we can see (in “Scalability per cluster”) that we scale pretty well, although not 100% linearly. The figure shows that we can deliver ca. 500'000 1K messages over a cluster of 10 with UDP, and have an aggregate cluster wide throughput of over 800 MB/sec.

Processing of messages and scalability

The test program currently doesn't do any processing at the receiver when a message is received. Instead, a message counter is increased and the size added to the total number of messages received so far. This is of course an ideal situation, and not realistic for normal use, where receivers might unmarshal the message's byte buffer into something meaningful (e.g. an HTTP session modification), and process it (e.g. updating an HTTP session). (The test by the way allows to take processing time at the receiver into account by setting the processing_delay value).

Fluctuation of results (reproducability)

Compared to the 2.4 performance tests, we didn't run the tests multiple times. Instead, we ran each test only once, so the numbers shown are not averaged over multiple runs of a given test. We also didn't adjust the JGroups configs for the different cluster sizes. For example, the number of credits in flow control (FC and SFC) should have been increased with increasing cluster size. However, we wanted to see whether we get reasonable numbers for the same configs run on different clusters.

If you run the perf tests yourself and get even better numbers, we'd be interested to hear from you !

Conclusion and outlook

We showed that performance of JGroups is good for clusters ranging from 4 to 10 nodes. Compared to the 2007 tests, the UDP stack now has better performance than the TCP stack.