Performance tests of JGroups 2.5


Author: Bela Ban
Date: June 2007

Configuration


The tests were conducted in the JBoss lab in Atlanta. Each box was:

The tests were run only on 64-bit systems.

Switches

The following switch was used:

Network

Note that neither the network nor the test machines were reserved exclusively for the performance tests, so some of
the fluctuations are explained by the fact that other users/processes were using the machines and/or generating traffic on
the network. This was mitigated somewhat by running all tests during the morning and early afternoon hours (8am - 3pm) in
Europe, when the servers in Atlanta GA were mostly idle. We also used machines on which no cron jobs (such as cruisecontrol)
were scheduled that could have interfered with the tests.


JVM configuration

The following JVMs were used:
The options used for starting the JVMs were:
-server -Xmn300M -Xmx400M -Xms400M
-XX:+UseParallelGC -XX:+AggressiveHeap -XX:CompileThreshold=100
-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31
-Dlog4j.configuration=file:/home/bela/log4j.properties
-Djgroups.bind_addr=${MYTESTIP_1}
-Dcom.sun.management.jmxremote -Dresolve.dns=false


Design of the tests

The goal of the test is to measure the time it takes to reliably send N messages to all nodes in a cluster. (We are currently not interested in measuring remote group RPC; this may be done in a future performance test.) The test works as follows.
The test is part of JGroups: org.jgroups.tests.perf.Test. The driver itself is independent of JGroups and can also be used to measure JMS performance, or pure UDP or TCP performance; all that needs to be done is to write an implementation of the Transport interface, with a send() and a receive() method, as sketched below.
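
For illustration, here is what such a transport could look like. This is a minimal sketch of a plain UDP multicast transport; the send()/receive() method names follow the description above, but the exact signatures of org.jgroups.tests.perf.Transport are assumed here, not copied from JGroups.

// Illustrative only: a bare-bones UDP multicast transport exposing send() and
// receive(). The real org.jgroups.tests.perf.Transport interface may declare
// different signatures and additional lifecycle methods.
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class SimpleUdpTransport {
    private final InetAddress group;
    private final int port;
    private final MulticastSocket sock;

    public SimpleUdpTransport(String mcastAddr, int mcastPort) throws Exception {
        group = InetAddress.getByName(mcastAddr);
        port = mcastPort;
        sock = new MulticastSocket(port);
        sock.joinGroup(group); // receive traffic sent to the multicast group
    }

    // Sends one message (the payload) to all members of the group
    public void send(byte[] payload) throws Exception {
        sock.send(new DatagramPacket(payload, payload.length, group, port));
    }

    // Blocks until the next message arrives and returns its payload
    public byte[] receive(int maxSize) throws Exception {
        DatagramPacket packet = new DatagramPacket(new byte[maxSize], maxSize);
        sock.receive(packet);
        byte[] payload = new byte[packet.getLength()];
        System.arraycopy(packet.getData(), 0, payload, 0, packet.getLength());
        return payload;
    }

    public void close() {
        try { sock.leaveGroup(group); } catch (Exception ignored) {}
        sock.close();
    }
}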

The configuration file (config.txt) is shown below (only the JGroups configuration is shown):

# Class implementing the org.jgroups.tests.perf.Transport interface

transport=org.jgroups.tests.perf.transports.JGroupsTransport
#transport=org.jgroups.tests.perf.transports.UdpTransport
#transport=org.jgroups.tests.perf.transports.TcpTransport

# Number of messages a sender sends
num_msgs=1000000

# Message size in bytes
msg_size=1000

# Expected number of group members
num_members=2

# Number of senders in the group. Min 1, max num_members
num_senders=2

# dump stats every n msgs
log_interval=1000


This file must be the same for all nodes, so it is suggested to place it in a shared file system, e.g. an NFS-mounted directory.
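
Since the file uses the standard Java properties format (key=value pairs, # comments), a driver can read it with java.util.Properties. A minimal sketch (the class name and default path are illustrative, not part of JGroups):

import java.io.FileInputStream;
import java.util.Properties;

public class ConfigReader {
    public static void main(String[] args) throws Exception {
        // Every node must load identical values for the test to be meaningful
        Properties config = new Properties();
        config.load(new FileInputStream(args.length > 0 ? args[0] : "config.txt"));
        int numMsgs = Integer.parseInt(config.getProperty("num_msgs", "1000000"));
        int msgSize = Integer.parseInt(config.getProperty("msg_size", "1000"));
        int members = Integer.parseInt(config.getProperty("num_members", "2"));
        System.out.println(numMsgs + " msgs of " + msgSize + " bytes, " + members + " members");
    }
}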
The following parameters are used:

transport
An implementation of Transport
num_msgs
Number of messages to be sent by a sender
msg_size
Number of bytes of a single message
num_members
Number of members. When members are started, they wait until num_members nodes have joined the cluster, and then the senders start sending messages
num_senders
Number of senders (must be less than or equal to num_members). This allows each receiver to compute the total number of messages to be received: num_senders * num_msgs. In the example, we'll receive 2 million 1K messages
log_interval
Information about sent and received messages is written to stdout (and to a file if -f is used; see below) every log_interval messages

The options for the test driver are:
bela@laptop /cygdrive/c
$ java org.jgroups.tests.perf.Test -help
Test [-help] ([-sender] | [-receiver]) [-config <config file>] [-props <stack config>] [-verbose] [-jmx] [-dump_stats] [-f <filename>]


-sender / -receiver
Whether this process is a sender or a receiver (a sender is always a receiver as well)
-config <file>
Points to the configuration file, e.g. -config /home/bela/config.txt
-props <props>
The JGroups protocol stack configuration. Example: -props c:\fc-fast-minimalthreads.xml. Can be any URL
-verbose
Verbose output
-jmx
Enables JMX instrumentation (requires JVM with JMX MBeanServer, e.g. JDK5). This will cause the VM not to terminate when done. To access the process via JMX, the -Dcom.sun.management.jmxremote property has to be defined and jconsole can be used. For more details see http://wiki.jboss.org/wiki/Wiki.jsp?page=JMX.
-dump_stats
Dumps some JMX statistics after the run, e.g. number of messages sent, number of times blocked etc
-f <filename>
Dumps the number of messages sent, message rate, throughput, current time, and free and total memory to <filename> every log_interval milliseconds. This is the main tool to generate charts on memory behavior or message rate variance
-help
This help
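
For example, with the config.txt shown above (num_members=2, num_senders=2), both nodes are started as senders; the file paths below are illustrative:

$ java org.jgroups.tests.perf.Test -sender -config /home/bela/config.txt -props /home/bela/udp.xml -f results.txt

Each process waits until num_members nodes have joined the cluster, after which the senders start sending.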

The files needed to run the tests are included with JGroups (source distribution) as well: JGroups/conf/config.txt is the test configuration file, and JGroups/conf/{udp,sfc,tcp}.xml are the JGroups protocol stack configuration files. Note that all the test runs include the JGroups stack configuration file used, so that they can be reproduced.

For additional details on how to run the tests refer to http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.


Parameters

The parameters that are varied are:


Results

The plain results are available here for the 4 node cluster, here for the 6 node cluster and here for the 8 node cluster. The OpenOffice spreadsheet containing all results can be obtained here.
The performance tests were run on the 3 clusters, measuring message rate (messages received/sec) and throughput (MB received/sec), as described in http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
The JGroups configurations used are:

Message rate and throughput

The Y axes show message rate and throughput, and the X axis shows the message size. The message rate is computed as <number of messages sent> * <number of senders> / <time to receive all messages>. For example, a message rate of 35000 for 1K messages means that a receiver received 35000 1K messages per second on average. On average means that, at the end of a test run, all nodes post their results, which are collected, summed up and divided by the number of nodes. So if we have 2 nodes, and they post message rates of 10000 and 20000, the average message rate will be 15000.
The throughput is simply the average message rate multiplied by the message size; e.g. 35000 1K messages per second correspond to roughly 35MB/sec.


4 node cluster






6 node cluster






8 node cluster







Interpretation of results

General

All of the tests were conducted with a relatively conservative configuration, which allowed the tests to run for an extended period of time. For example, the number of credits in flow control was relatively small. Had we set it to a higher value, we would probably have gotten better results, but the tests would not have been able to run for an extended period of time without running out of memory. For instance, the number of credits should be a function of the group size (more credits for more nodes in the cluster); however, we used (more or less) the same configuration for all tests. We will introduce an enhanced version of the flow control protocol which dynamically adapts credits, e.g. based on free memory, loss rate etc., in http://jira.jboss.com/jira/browse/JGRP-2.
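
For illustration, credits are configured in the protocol stack XML; a flow control fragment could look like this (the attribute values are illustrative, not the ones used in these tests):

<FC max_credits="2000000" min_threshold="0.10"/>

Here max_credits is the number of bytes a sender may send to the group before it blocks, waiting for credits from the receivers; raising it tends to improve throughput at the cost of more memory at the receivers.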

The message rate and throughput are much better for the TCP-based configuration than for either SFC or FC, in all 3 cluster configurations (4, 6 and 8 nodes). Although the message rate falls with increasing message size, throughput actually increases slightly with increasing message size in all configurations. Note that for 6 and 8 nodes, TCP has a throughput of around 100MB/sec, which uses 80% of the gigabit switch's capacity (ca. 125MB/sec)! Compared to a previous performance test, we increased the MTU of all NICs to 9K and enabled jumbo frames, which allows us to send bigger ethernet packets and positively affects throughput (though not necessarily message rate).
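
On Linux, jumbo frames are enabled by raising the MTU of the NIC, along the lines of the following (the interface name is illustrative, and the switch has to support jumbo frames as well):

$ ifconfig eth0 mtu 9000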

We attribute the better performance of TCP to the following issues:

The throughput of UDP/SFC ranges from 20MB/sec for 1K messages to 30MB/sec for 5K messages, regardless of cluster size. In the case of TCP, throughput increases more with large messages; however, we noticed that the relative throughput decreases with increasing cluster size. We suspect that throughput for TCP is a function of the cluster size, as we have a mesh of TCP connections from everyone to everyone else: with N nodes, each node maintains N-1 connections, for N*(N-1)/2 connections in total. It will be interesting to see how TCP scales in large clusters, compared to UDP/SFC.


Issues:

Processing of messages and scalability

The test program currently doesn't do any processing at the receiver when a message is received. Instead, a message counter is incremented and the message size is added to the total number of bytes received so far. This is of course an ideal situation, and not realistic for normal use, where receivers might unmarshal the message's byte buffer into something meaningful (e.g. an HTTP session) and process it (e.g. updating an HTTP session). (The test, by the way, allows processing time at the receiver to be taken into account by setting the processing_delay value, as shown below.)
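
For illustration, such a delay is set in config.txt; the value below is made up, and the unit is assumed to be milliseconds:

# Simulated processing time per received message (assumed to be in milliseconds)
processing_delay=10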

JGroups 2.5 should handle receivers that do some processing on reception of a message much better, as we introduced the concurrent stack in 2.5. Preliminary measurements show that the speedup compared to 2.4, (a) for messages which take some time (a few milliseconds) to process and (b) for clusters where everyone sends messages, is close to N, where N is the number of nodes sending messages. We will publish numbers for these types of scenarios later.
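
The key idea of the concurrent stack is that messages from different senders can be delivered concurrently, while messages from the same sender remain ordered. The following is a minimal sketch of that dispatching pattern, not JGroups code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerSenderDispatcher {
    // One single-threaded executor per sender: preserves per-sender ordering
    // while messages from different senders are processed in parallel
    private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();

    public void deliver(String sender, Runnable handler) {
        queues.computeIfAbsent(sender, s -> Executors.newSingleThreadExecutor())
              .execute(handler);
    }
}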


Fluctuation of results (reproducibility)

Compared to the 2.4 performance tests, we didn't run the tests multiple times. Instead, we ran each test only once, so the numbers shown are not averaged over multiple runs of a given test. We also didn't adjust the JGroups configs for the different cluster sizes. For example, the number of credits in flow control (FC and SFC) should have been increased with increasing cluster size. However, we wanted to see whether we get reasonable numbers for the same configs run on different clusters.

If you run the perf tests yourself and get even better numbers (especially for UDP/SFC), we'd be interested to hear from you!

Conclusion and outlook

We showed that the performance of JGroups is good for clusters ranging from 4 to 8 nodes. TCP scales better than UDP; however, it also creates a mesh of connections (every sender needs to be connected to everyone else in the cluster), which may not scale to large clusters.
We intend to measure the performance of larger clusters (16 - 80 nodes) in a later round of benchmarks.