The tests were run only on 64-bit systems.
Switches
The following switch was used:
Network
Note that neither the network nor the test machines were reserved
exclusively for the performance tests, so some of
the fluctuations are explained by the fact that other users/processes
were using the machines and/or generating traffic on
the network. This was mitigated somewhat by running all tests in the AM
hours (8am - 3pm) in Europe, when the servers in
Atlanta GA were mostly idle. We also use machines which have no cron
jobs (like cruisecontrol etc) scheduled that could interfere
with the tests.
JVM configuration
The following JVM was used:
SUN: JDK 1.5.0_04-b05 Java HotSpot(TM) Server VM (build 1.5.0_04-b05,
mixed mode)
The options used for starting the JVMs were:
SUN: -Xmx500M -Xms500M -XX:CompileThreshold=1000 -XX:+UseParallelGC
-XX:+AggressiveHeap -XX:NewRatio=1 -server
-Dlog4j.configuration=file:/home/bela/log4j.properties
-Dresolve.dns=false -Dbind.address=${MYTESTIP_1}
Design of the tests
The goal of the test is to measure the time it takes to reliably send N
messages to all nodes in a cluster. We are currently not interested in
measuring the latency of remote group RPC, this may be done in a future
performance
test. The test works as follows.
- All members have access to 2 configuration files: one for
defining the number of messages to be sent, the number of cluster
members, senders etc, and the other one defining the JGroups protocol
stack.
- N processes are started. They can be started on the same machine,
but for better results there is one JGroups process / box. In the
performance test, we started each member on a different node in the
cluster.
- The processes join the same group, and when N equals the cluster
size defined in the configuration file, all senders start sending M
messages to the cluster (M is also defined in the configuration file).
- Every node in the cluster receives the messages (senders are also
receivers). When the first message is received the clock is started.
When the last message is received the clock is stopped. Every member
knows how many messages are to be received: <number of senders>
* <number of messages>. This allows every receiver to
compute the message rate and throughput (we also know the size of each
message).
The test is part of JGroups: org.jgroups.tests.perf.Test. The driver is
JGroups independent and can also be used to measure JMS performance and
pure UDP or TCP performance (all that needs to be done is to write a
Transport interface implementation, with a send() and a receive()).
The configuration file (config.txt) is shown below (only the JGroups
configuration is shown):
# Class implementing the org.jgroups.tests.perf.Transport interface
transport=org.jgroups.tests.perf.transports.JGroupsTransport
#transport=org.jgroups.tests.perf.transports.UdpTransport
#transport=org.jgroups.tests.perf.transports.TcpTransport
# Number of messages a sender multicasts
num_msgs=1000000
# Message size in bytes.
msg_size=1000
# Expected number of group members.
num_members=2
# Number of senders in the group. Min 1, max num_members.
num_senders=2
# dump stats every n msgs
log_interval=100000
This file must be the same for all nodes, so it is suggested to place
it in a shared file system, e.g. an NFS mounted directory.
The following parameters are used:
transport
|
An implementation of Transport,
for our tests we used the JGroupsTransport
|
num_msgs
|
Number of messages to be sent by
a sender
|
msg_size
|
Number of bytes of a single
message
|
num_members
|
Number of members. When members
are started, they wait until num_members nodes have joined the cluster,
and then the senders start sending messages
|
num_senders
|
Number of senders (must be less
then or equal to num_members). This allows each receiver to compute the
total number of messages to be received: num_senders * num_msgs. In the
example, we'll receive 2000000 messages (2 * 1M messages)
|
log_interval
|
Output about sent and received
messages will be output to stdout (and to file if -f is used (see
below)) every log_interval messages
|
The options for the test driver are:
bela@laptop /cygdrive/c
$ java org.jgroups.tests.perf.Test -help
Test [-help] ([-sender] | [-receiver]) [-config <config file>]
[-props <stack config>] [-verbose] [-jmx] [-dump_stats] [-f
<filename>]
-sender / -
receiver
|
Whether this process is a sender
or a receiver (a sender is always a receiver as well)
|
-config
<file>
|
Points to the configuration
file, e.g. -config /home/bela/config.txt
|
-props
<props>
|
The JGroups protocol stack
configuration. Example: -props c:\fc-fast-minimalthreads.xml. Can be
any URL or filename
|
-verbose
|
Verbose output
|
-jmx
|
Enables JMX instrumentation
(requires JVM with JMX MBeanServer, e.g. JDK5). This will cause the VM
not to terminate when done. To access the process via JMX, the
-Dcom.sun.management.jmxremote property has to be defined and jconsole
can be used. For more details see http://wiki.jboss.org/wiki/Wiki.jsp?page=JMX.
|
-dump_stats
|
Dumps some JMX statistics after
the run, e.g. number of messages sent, number of times blocked etc
|
-f
<filename>
|
Dumps number of messages sent,
message rate, throughput, current time, free and total memory to
<filename> every log_interval milliseonds. This is the main tool
to generate charts on memory behavior or message rate variance
|
-help
|
This help
|
The files needed to run te tests are included with JGroups (source
distribution) as well: JGroups/conf/config.txt is the configuration
file, and JGroups/conf/fc-fast-minimalthreads.xml is the stack.
However, note that all the test runs will have the JGroups stack
configuration file included so that they can be reproduced.
For additional details on how to run the tests refer to
http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
Parameters
The parameters that are varied are:
- Cluster size: 4, 6 (only with UDP) and 8
- Number of senders: 1, N/2 and N (where N is the cluster size)
- Number of messages: this is computed according to the input
parameters of http://wiki.jboss.org/wiki/Wiki.jsp?page=AutomatedPerfTests.
The goal is to have approximately the same number of bytes sent for
each test. So, if we want to send 1M messages and have 8 senders, each
sender will send 125K messages. If we only have 1 sender, that sender
will send 1M messages.
- Message size: 1K, 2.5K and 5K
- Switch: 1Gbps switch (see above)
- Protocols: IP multicast-based transport
(fc-fast-minimalthreads.xml) and TCP-based transport (tcp-nio.xml)
- JVMs: SUN JDK5 (see above)
- JGroups: 2.4
Results
The plain results are available
here for the
4 node
cluster,
here for the 6 node cluster and
here for the 8 node cluster. The OpenOffice
spreadsheet containing all results can be obtained
here.
The performance
tests were run on the 3 clusters, measuring message rate (messages
received/sec) and throughput (MB received/sec), as described in
http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
The JGroups configurations used are:
Message rate and throughput
The Y axes define message rate and throughput. The message rate is
computed as <number of messages sent> * <number of senders>
/ <time to receive all messages>. For example, a message rate of
35000 for 1K means that a receiver received 35000 1K messages on
average per second. On average means that - at the end of a test run -
all nodes post their results, which are collected, summed up and
divided by the number of nodes. So if we have 2 nodes, and they post
message rates of 10000 and 20000, the average messsage rate will be
15000.
The throughput is simply the average message rate times the message
size.
4 node cluster
For the 4 node cluster, we only measured 1K and 5K messages. The
message rate for UDP (using IP multicasting) is relatively stable for
1, 2 and 4 senders. In the TCP case, we can see that the message rate
for 1 sender for 1K messages is slightly lower, then catches up for
larger messages and more senders. In both cases, the message rate for
5K messages is lower than that for 1K, but the aggregate throughput is
s about the same.
We get an average message rate of between 35000 and 40000 1K messages
and 5000-8000 5K messages for UDP, and between 25000 and 45000 1K
messages and 7000-10000 5K messages for TCP.
6 node cluster
For lack of availability of the entire cluster, we could only measure
UDP with the 6 node cluster. The numbers shown above are somewhat
contradictory to the rest of the tests: they show that the message rate
and throughput decreases with increasing cluster size. The absolute
numbers are also worse than for the 4-node and 8-node cluster ! A
possible explanation is that since 2 machines were not available to us,
these 2 machines might have generated network traffic, negatively
affecting our tests.
8 node cluster
The 8 node UDP cluster exhibits similar characteristics as the 4 node
cluster: for 1 sender, the results are good, then they go down slightly
for 4 senders, and go up again for 8 senders. With TCP, this is
different: the numbers for 1 sender are lower than for UDP, but go up
for 4 and 8 senders and pass UDP at 4 senders.
With UDP, the messages rates for 1, 4 and 8 senders (sending 1K
messages) don't vary much (23732, 24270 and 28994 messages/sec). The
same goes for 2.5K messages. However, there is bigger variation for 5K
messages.
For TCP throughput, 1 sender sends messages at around 14MB/sec, 4
senders climb to 36MB/sec and 8 senders to 50MB/sec, regardless of the
message size.
Interpretation of results
General
All of the tests were conducted with a relatively conservative
configuration, which would allow for the tests to be run for an
extended period of time. For example, the number of credits in flow
control was relatively small. Had we set them to a higher value, we
would have probably gotten better results, but the tests would not have
been able to run for an extended period of time without running out of
memory. For instance, the number of credits should be a function of the
group size (more credits for more nodes in the cluster), however, we
used (more or less) the same configuration for all tests. We will
introduce an enhanced version of the flow control protocol which
dynamically adapts credits, e.g. based on free memory, loss rate etc,
in
http://jira.jboss.com/jira/browse/JGRP-2.
Another issue is the use of a gigabit ethernet switch. As shown in
previous tests (
http://www.jgroups.org/javagroupsnew/docs/Perftest.html),
JGroups does saturate a 100Mbps switch with clusters up to 8 nodes,
with roughly 11-11.5MB/sec. However, for the GB switch, where we could
receive a max of ca 125MB/sec, the results show that we are far from
reaching that number. However, the numbers shown are probably more than
good enough for most applications using JGroups, especially if
traffic-reducing mechanisms such as
field-based
replication for HTTP sessions are used.
Some reasons might be (needs further investigation):
- We don't currently use jumbo frames, but
rather the default MTU of 1500 for ethernet. Jumbo frames allow for
large IP packets to be sent on the ethernet. If we have a UDP datagram
of 20K to send, and we have an MTU of 1500 bytes, the IP layer needs to
fragment the datagram into 14 IP packets, whereas if we had an MTU of
9000 bytes, it would have to send only 3 IP packets. So if only one IP
packet is lost, because we have to retransmit the entire datagram, the
IP layer would have to re-send 14 or 3 packets respetively. See also this discussion
for ramifications regarding JGroups.
- With jumbo frames we might be able to essentially transmit more
data at almost the same message rate, so throughput might increase
throughput dramatically. Needs to be investigated though.
- Static flow control in JGroups: FC doesn't dynamically adjust the
number of credits based on the number of nodes in a cluster, but this
is defined statically in the config file. We suspect these settings are
okay for smaller clusters, but would need to be adjusted for larger
clusters. We plan to write a new flow control protocol in JGroups 2.6, which
is based on credits (like the existing one), but also takes into
account latency, message loss and number of retransmissions and free
memory.
- No OS/NIC optimizations: in RHEL (and other Linux distributions),
buffer sizes (IP, network card) for incoming packets, and other
parameters can be configured. For example, larger buffer sizes lead to
less chance of losing packets because of full buffers.
- No switch optimization: we used the GB switch unchanged and
haven't yet investigated switch tuning. For example, if the switch
implements traffic prioritization (e.g. 802.1p), then IP
multicasts have low priority. Same goes for IGMP snooping:
when enabled, it slows the switch/router down because each single
multicast packet has to be examined for whether it is an IGMP packet,
or just a regular multicast packet. If disabled, the multicasts would
be copied to all ports, which doesn't matter in our test because all
ports have 1 multicast member connected.
- Logging: we log performance data to disk every N messages
(default=100000). During the tests we logged to an NFS mounted file
system, which incurs one RPC. It would have been better to log to a
local disk (e.g. /tmp), but we didn't want to change the tests half-way
through
Processing of messages and scalability
The
test
program currently doesn't do any processing at the receiver when a
message is received. Instead, a message counter is increased and the
size added to the total number of messages received so far. This is of
course an ideal situation, and not realistic for normal use, where
receivers might unmarshal the message's byte buffer into something
meaningful (e.g. an HTTP session), and process it (e.g. updating an
HTTP session). (The test by the way allows to take processing time at
the receiver into account by setting the processing_delay value). As a
matter of fact, we have found in other tests (JBossCache, JBoss HTTP
session replication) that the majority of the time is spent in
marshalling and (especially) unmarshalling (serialization) of
application data, rather than in JGroups. Until and unless marshalling
times are reduced, the performance of JGroups will not even matter much.
What does matter, however, is that (for JGroups 2.4) if we do have some
processing time at the receiver, then incoming requests - even from
different senders - have to wait in a queue to be processed. So if a
receiver R receives messages A.M1 (M1 from A), B.M10, B.M11, B.M12 and
A.M2, then B.M{10-12} and A.M2 will have to wait until A.M1 has been
processed. Then B.M10 will be processed and B.M{11-12} and A.M2 will
have to wait, and so on.
This will be changed in
JGroups 2.5 with the
threadless stack, where JGroups can be configured to process
messages from different senders
concurrently.
For example, in the above case, JGroups could be configured to process
all messages from A and B in separate threads (say T1 and T2), so T1
would process A.M{1-2} and T2 would process B.M{10-12}. This would
allow for concurrent processing of A.M1 and B.M10, which increases
performance, compared to the single-queue case above.
We will investigate whether the threadless stack increases performance
for the performance tests, where no processing is done at the receiver.
UDP versus TCP
As can be seen with the 4 and 8 node clusters, TCP has better
performance than UDP when more than one sender is sending messages.
Since both UDP (
fc-fast-minimalthreads.xml)
and TCP (
tcp_nio.xml) use flow
control (FC), the problem cannot be flow control. There are multiple
issues that could contribute to this, some of them being (needs to be
investigated more):
- Switch related
- Switches are generally unfavorable of IP multicast traffic (the
only other traffic which is treated worse is broadcast). We need to
investigate whether the switch implements a traffic shaping like
priority mechanism whereby packets of lesser priority are discarded on
congestion. Note, however, that since the performance test should be
the only application running in the cluster during the time slot
allotted to the tests, this may not be an issue.
- Full/half duplex. We need to make sure that all ports run in
full-duplex mode (verified). We also need to look into whether flow
control is enabled on the switch. This sometimes negatively impacts
flow control implemented on the protocol level, e.g. TCP, or the
application level (JGroups).
- Todo: run the performance tests and compare the statistics on
the switch before and after, e.g. number of packets dropped
- Todo: look into disabling IGMP snooping (which has some
overhead checking all IP multicast packets whether they're IGMP)
- More efficient retransmission in TCP. This is an assumption and
needs to be verified. We need to monitor the loss/retransmission rate
for our tests and measure retransmission time.
Fluctuation of results (reproduceability)
As there was no reservation for the cluster labs (this was done by
sending around emails), it is possible that some other applications
were sometimes using the lab, generating network traffic and using CPU
cycles, and thus distorting the results. Although we used the lab
(based in Atlanta, GA) early in the morning (9am-2pm UST, 3am-8am EST),
there's still a chance that a cron job was started, or someone from a
different time zone was using the lab. We usually verified that no
other application was running before starting our tests (by using last,
top, who, ps for example), but that doesn't prevent some other
application from starting
during our
test runs.
Conclusion and outlook
We showed that performance of JGroups is good for clusters ranging from
4 to 8 nodes. TCP scales better than UDP, however it also creates a
mesh of connections (every sender needs to be connected to everyone
else in the cluster), which may not scale to large clusters.
We intend to measure the performance of larger clusters (16 - 64 nodes)
in the next round of benchmarks.