Currently, I work for a mid-sized high-performance computing (HPC) shop.
For many of the scientific codes we run, communication performance matters -
both in terms of inter-machine (a.k.a., inter-node) bandwidth and latency.
Like most HPC shops, we have some experience with Infiniband, but in recent
years we’ve been using 10 Gbps Ethernet (10gigE) for a cluster interconnect.
Given Ethernet’s prevalence, and general dominance in datacenter networking,
10gigE seems on the surface to be a general win, and a decent choice for
a cluster interconnect (particularly for a user base that historically
prefers gigabit Ethernet for cost reasons).
I’ve designed three 10gigE clusters, two of which are on the current
(November, 2011) Top 500 list. I do
not recommend this. 10gigE has its place, but currently economics favor
Infiniband for high-performance computing. If your code uses MPI, and you
need more cores than you can fit in one compute node (and your code isn’t
embarrassingly parallel - I’ve seen some that could operate nicely over
10 Mbps Ethernet), you should be looking at Infiniband.
Rather than delving into why I’ve been building 10gigE clusters, this page
discusses modern technology that can help you get the most performance from
a high-speed ethernet fabric. Be warned, the content from here on out gets
technical quickly. I’ve likely spent more time than is healthy examining
this space, and doing so requires a fair amount of expertise in TCP, IP,
ethernet, Infiniband (as well as general RDMA theory, and its multiple
incarnations), operating systems, MPI libraries, and several vendors’ product
lines. To quote the xterm source code: “There be dragons here.”
Defining “slow”, and Why Plain TCP/IP is Bad
TCP/IP is great, for most things - but the API pretty much requires kernel
intervention. Your app calls some library, the library fires off a syscall,
and the kernel starts formatting data to go over the wire. Under Linux, a
null syscall has an overhead of around 1000 instructions (if you’ll pardon
the blind assertion), so you can do around 2.5 million syscalls per second
on a 2.5 GHz CPU (using some vague hand-waving to avoid calculating the
effects of load-store queuing and superscalar processors). On paper, with
one 1500-byte frame per syscall, that means a hard max of around 30 Gbps of
throughput - more, with larger frame sizes.
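The back-of-envelope math:

    2.5 GHz ÷ 1000 instructions/syscall ≈ 2.5 million syscalls/sec
    2.5 million syscalls/sec × 1500 bytes × 8 bits/byte = 30 Gbps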
Unfortunately, that’s not reality. First off, a processor will need to do
some data formatting and copying beyond the time to enter the syscall. Second,
data arriving will also trigger syscalls. Some of this can be ameliorated
(e.g., jumbo frames, interrupt coalescing, etc.) but at a cost of tying up a
processor to handle the kernel’s side of the communication. If your application
requires frequent data exchange (like most HPC simulations), the added latency
and processor overhead can greatly degrade performance - even without fully
utilizing the available bandwidth.
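On Linux, those mitigations look something like this (the interface name and values are placeholders; tune for your hardware):

    # enable jumbo frames (9000-byte MTU) on a hypothetical interface eth2
    ip link set eth2 mtu 9000
    # coalesce receive interrupts, trading a little latency for CPU headroom
    ethtool -C eth2 rx-usecs 100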
TOE (TCP Offload Engine) NICs may help, to a limited degree. A TOE will
reduce the CPU’s workload, but won’t significantly reduce overall message
latency - unless the TOE vendor ships a wrapper library to replace the
sockets API (Solarflare does this, for example).
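As a sketch of how such wrappers are typically used - Solarflare’s is called OpenOnload, and the application name here is a placeholder:

    # run an app under OpenOnload, which intercepts the sockets API in
    # user space and bypasses the kernel for supported traffic
    onload ./my_latency_sensitive_app
    # or, roughly equivalently, preload the library by hand
    LD_PRELOAD=libonload.so ./my_latency_sensitive_app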
iWARP
If you need to do RDMA over Ethernet, iWARP is the easiest way to do it. It’s
not quite Infiniband, but many of the various IB-related commands in OFED
will work. Many RDMA apps will work with this, and as iWARP is encapsulated
by TCP/IP it can transit a router. Latency will be higher than RoCE (at least
with both Chelsio and Intel/NetEffect implementations), but still well under
10 μs. iWARP is reasonably stable with recent versions of the
OpenFabrics stack - in-kernel drivers
may not be as stable (including those baked into Red Hat Enterprise Linux 5 and 6).
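For a quick end-to-end sanity check of an iWARP (or RoCE) setup, the rping utility from librdmacm is handy; the addresses here are placeholders:

    # on the server
    rping -s -a 0.0.0.0 -v
    # on the client, pointed at the server's RDMA-capable interface
    rping -c -a 192.168.100.1 -v -C 10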
RoCE
RoCE is RDMA over Converged Ethernet - but Infiniband over Ethernet would be
a more apt description. Strip the GUIDs out of the IB header, replace them
with Ethernet MAC addresses, and send it over the wire. As of this writing,
only Mellanox (www.mellanox.com) makes
RoCE-capable equipment (their ConnectX-2 and ConnectX-3 product lines).
Infiniband is a lossless physical-layer protocol, so RoCE requires lossless
Ethernet. Also, since it’s Ethernet, RoCE cannot transit a router. It’s
strictly a layer-2 protocol, and it needs a complicated layer-2 network, at that.
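Once the driver is loaded, a RoCE adapter presents itself as an ordinary RDMA device. With a reasonably recent OFED, one way to confirm you’re looking at the Ethernet flavor:

    # list RDMA devices and their link layers; a RoCE port reports
    # "link_layer: Ethernet" rather than "InfiniBand"
    ibv_devinfo | grep -E 'hca_id|link_layer'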
Lossless Ethernet: a Quick Review
Ethernet becomes lossless by re-using 802.3x PAUSE frames for explicit flow
control. This is timing-sensitive; a receiver must send a PAUSE soon enough
that it is received and processed before the receive buffer can fill.
Obviously, there are issues stretching this over some distance. Switches
must be internally lossless, and must be able to send PAUSE frames as well
as receive them. Such switches are usually marketed with acronyms like “DCB”
(Data Center Bridging) or “CEE” (Converged Enhanced Ethernet).
Obviously, this coarse-grained approach will pause all traffic over the link -
including any IP or FCoE traffic. As this can have a negative impact on
non-RoCE performance, Cisco has proposed Priority Flow Control (PFC, now
covered under IEEE 802.1Qbb). This
is a PAUSE frame with a special payload, indicating which Ethernet QoS class
should be paused. This is accompanied by other protocols - notably DCBX, the
Data Center Bridging eXchange protocol - to negotiate QoS values on either
end of a link (i.e., between NIC and switch).
Finally, all types of traffic on the link are distinguished by their Ethernet
frame types (i.e., the EtherType field in the frame header): IPv4, IPv6,
FCoE, and RoCE all have different EtherType values.
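For reference: IPv4 uses EtherType 0x0800, IPv6 0x86DD, FCoE 0x8906, and RoCE 0x8915.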
Making RoCE Work
While RoCE is supported by
OFED, as of OFED 1.5.3 it isn’t
completely stable. You’ll want to use Mellanox’s OFED - version 1.5.3 or
higher. Stock OFED will work fine for small tests, but large applications
will have a tendency to crash.
PFC is a pain. The tools to auto-negotiate may not exist for RoCE - the
only documentation I’ve found was limited to FCoE. Avoid it if at all possible.
Somehow, you’ll need to classify RoCE traffic as lossless. Here are some
suggestions, in my order of preference:
1. Discriminate RoCE traffic by EtherType - RoCE packets would be treated
losslessly, and non-RoCE traffic could be dropped (during congestion).
2. Classify ALL traffic as lossless (and deal with the performance impact, if
any, on non-RoCE traffic).
3. Assign a QoS class for lossless traffic. Unfortunately, Mellanox adapters
will only emit a QoS value when they emit a VLAN tag, so you’ll need to do
the following (a configuration sketch follows this list):
- Set a default IB Service Level to match your QoS using
options rdma_cm def_prec2sl=4 in
/etc/modprobe.d (obviously, I’m using the value 4 in this example)
- Configure your Ethernet switch to treat that traffic as lossless
- Create a tagged VLAN device on your RoCE NIC on all connected systems
- Assign those VLAN devices a private IP address
- Stick that IP address in
/etc/mv2.conf, so MVAPICH2 will know what IP address to try for RoCE connections
- Configure all other RDMA-aware applications to use a non-default GID (since VLAN interfaces will appear as additional GID indexes on the Infiniband HCA side of the RoCE adapter)
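A minimal sketch of the host-side pieces, assuming a hypothetical eth2 adapter, VLAN 100, the 192.168.100.0/24 subnet, and QoS class 4:

    # map the default RDMA-CM service level to the lossless QoS class
    echo "options rdma_cm def_prec2sl=4" > /etc/modprobe.d/rdma_cm.conf
    # create a tagged VLAN device and give it a private address
    ip link add link eth2 name eth2.100 type vlan id 100
    ip addr add 192.168.100.1/24 dev eth2.100
    ip link set eth2.100 up
    # tell MVAPICH2 which address to use for RoCE connections
    echo 192.168.100.1 > /etc/mv2.conf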
So you have Cisco Nexus switches…
If you can, stop reading and go buy some Infiniband adapters. You’ll save a
considerable amount of staff time.
Fine. Keep reading. But don’t say I didn’t warn you.
The Nexus 5000-series and the Nexus 7000-series switches are completely
different products. The interface to building lossless queues is different,
the command syntax is different, and different values can be used for lossless
traffic classes on each series of switches. If you have environments with
both, you’ll be picking different QoS values.
The Nexus 7000 platform only supports lossless queuing on the newest “F”
series line cards - the layer-2-only cards that have no routing abilities.
You’ll want to buy those, if you plan on having stable RoCE.
Finally, be wary of ANY firmware updates. We had a functional RoCE
configuration on a Nexus 7000 switch with firmware 5.1(3), using the
third method above. That broke, however, when we upgraded to 5.1(5).
Something changed in the default queuing config, and since you can only build
on the default lossless queue config (rather than nuke it and define your
own), you are subject to changes in the default. In our case, RoCE performance
dropped to 30 Mbps (down from 9.91 Gbps). All wasn’t lost, though - after
the upgrade, all traffic was lossless (except what we’d previously tagged
via QoS, of course). We just stopped using QoS, and now have reliable
Ethernet. Absolutely bizarre.
Making this all work for practical apps
Making this work depends on how RoCE traffic was classified. If RoCE
EtherTypes are lossless, or if all traffic is lossless (options #1 or #2,
above), any RDMA application should just work - the RoCE adapter presents as
an ordinary Infiniband HCA.
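For instance, the bandwidth test from the OFED perftest package should run unmodified over RoCE (the host name is a placeholder):

    # on one node, start the server side
    ib_send_bw
    # on the other node, point the client at the first
    ib_send_bw node01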
If you picked option #3, you’ll need to jump through some extra hoops. First,
set up the def_prec2sl module parameter, the tagged VLAN devices, and
/etc/mv2.conf as described above. At this point, MVAPICH2 applications should
work. For OpenMPI, you’ll need OpenMPI 1.4.4, or 1.5.4 or newer, plus
additional command-line options to set the IB service level and the IP
address: -mca btl_openib_ib_service_level <number> and
-mca btl_openib_ipaddr_include <ipaddr>, respectively.
These can be baked into a config file (like openmpi-mca-params.conf in your
OpenMPI’s share directory). Note that
btl_openib_ipaddr_include can take CIDR notation for a subnet to
match, so you can use the same config file for all nodes in a cluster.
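A sketch, reusing the service level and subnet from the earlier example (both values are assumptions):

    # on the command line
    mpirun -np 16 \
        -mca btl_openib_ib_service_level 4 \
        -mca btl_openib_ipaddr_include 192.168.100.0/24 \
        ./my_app

    # or the equivalent lines in openmpi-mca-params.conf
    btl_openib_ib_service_level = 4
    btl_openib_ipaddr_include = 192.168.100.0/24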
In theory, it may be possible to use RoCE for non-MPI applications - including
kernel-level things like Lustre. I’d only attempt this if options #1 or #2
are in use, though - setting extra VLANs, non-default GIDs, and custom IB
service levels (mapped to Ethernet QoSes) is likely to be hard to integrate
in anything other than OpenMPI and MVAPICH2.
There isn’t a lot of documentation (practically zero, outside of Mellanox)
on RoCE. Any useful links I can find will be added here.