Multicast Performance on ESX

Multicast has always been something of a mystery to network engineers unless they have a very specific reason for using it. Since the financial industry is a heavy user of multicast, I have been fortunate to get my hands very dirty with it throughout my career.

One question that has always vexed our group is how to consolidate our multicast workloads and extend the efficiency gains of virtualization to this segment of our environment. These boxes represent a significant cost, and they are often underutilized in terms of CPU and memory. But because of the nature of the data, it’s difficult to try anything that might degrade performance.

In ESX 5.0, VMware introduced a new technology that is supposed to help alleviate the performance bottlenecks:

  • splitRxMode

I’ll summarize the feature here as described in the Technical Whitepaper:

splitRxMode

In previous versions of ESX, all network receive processing for a queue was performed inside a single context within the VMkernel. splitRxMode allows you to direct each vNIC (configured individually) to use a separate context for receive packet processing.

The whitepaper makes a special note that although splitRxMode improves receive processing for multicast, it incurs a CPU penalty due to the extra per-packet overhead, so it should not be enabled on every machine.
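Since the setting is per vNIC, here is a sketch of what enabling it looks like. The whitepaper summary above does not give the syntax; as I recall from VMware's performance best practices guide, the feature is controlled by an `ethernetX.emuRxMode` entry in the VM's .vmx file and is supported only on VMXNET3 adapters. Treat the variable name, the values, and the adapter index `0` as assumptions to verify against the documentation for your release.

```
# Assumed syntax, recalled from VMware's best practices guide; verify before use.
# "0" in ethernet0 is a placeholder for the vNIC you want to change.
ethernet0.emuRxMode = "1"    # "1" enables splitRxMode, "0" disables it
```

Setting it one vNIC at a time fits the whitepaper's advice: enable it only on the adapters that actually receive heavy multicast traffic.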

Performance

In their testing, VMware reported 10-25% packet loss on a 16Kpps multicast stream once the number of subscriber VMs went past 24. After they enabled splitRxMode, packet loss was < 0.01% all the way up to 32 VMs on the host.

My Take

Even though VMware seems confident that the recent I/O improvements with splitRxMode will increase multicast performance, there are some key considerations here:

  1. 0.01% is still a lot of packet loss. At 16Kpps, that works out to 1.6 lost packets every second (16,000 × 0.0001).
  2. The scenario they tested is for a one-to-many situation (one stream to multiple receivers). What if the packet rate is higher or the number of streams is higher, but the receiver count is low?
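To put the first point in concrete terms, here is a small Python sketch of the arithmetic. The 16Kpps rate and the 0.01% loss figure come from the whitepaper results quoted above; the 100Kpps rate is a hypothetical I picked only to illustrate the "what if the packet rate is higher" question, and it assumes the loss percentage stays the same, which nobody has measured.

```python
def lost_packets_per_second(rate_pps: float, loss_percent: float) -> float:
    """Expected packets lost per second at a given rate and loss percentage."""
    return rate_pps * loss_percent / 100.0


# Figures quoted from the whitepaper: 16Kpps stream, 0.01% loss.
reported = lost_packets_per_second(16_000, 0.01)

# Hypothetical higher-rate stream, assuming the same loss percentage holds.
hypothetical = lost_packets_per_second(100_000, 0.01)

for label, lost in (("16Kpps", reported), ("100Kpps (hypothetical)", hypothetical)):
    print(f"{label}: {lost:.1f} lost per second, about {lost * 3600:.0f} per hour")
```

Even at the reported rate, that adds up to thousands of dropped packets per hour on a single stream, which is why "< 0.01%" is not as reassuring as it sounds for this kind of data.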

Obviously this requires a lot more testing on our part before we’d ever even consider rolling anything to production. If you have any experience in this regard, please feel free to comment and offer any insights/suggestions you might have.

NOTE: This entry is my first foray into technical blogging. I’ve learned a lot from the blogs I’ve read over the years, and I’ve also found that these types of blogs are the absolute best resource for solving real problems. I hope I can contribute something meaningful and perhaps repay some of what I’ve been given.
