What the heck is an IOP (and why do I care)? Disk math, and does it matter?

I’ll start by answering the title question first.  IOP is an acronym standing for Input Output Operation.  It does seem like it should be IOO, but that’s just not the way it worked out.

A related bit of trivia: we generally talk either about total IOPs for a given task, or about a rate, typically IOPs per second, written as IOPS.

With that the Wikipedia portion of today’s discussion is complete.   Let’s move on to why we care about IOPs.

Most frequently the topic comes up in terms of either measuring a disk system’s performance, or attempting to size a disk system for a specific workload or loads.  We want to know not how much throughput a given system needs, but how many discrete reads and writes it’s going to generate in a given unit of time.

The reason we want to know is that a given storage system has a discrete number of IOPS it can deliver.  You can read my article on Disk Physics to get a better understanding of why.

In the old days this was mostly a math problem.   We knew that a 7.2K drive would deliver 60-80 IOPS, a 10K drive would deliver 100-120, and a 15K drive would give us 120-150 IOPS.   We also knew that we had to deal with RAID penalties associated with write operations to storage arrays.  Typical values were 1 IO penalty for RAID1 and 10, and 4 for RAID5 and 50.

The idea here was fairly simple.  If I needed a disk subsystem that would give me 1500 IOPS read, then I needed 10 15K drives to do that (1500/150 = 10).   If I needed 1500 IOPS write in a RAID10 config, then I needed 20 15K drives ((1500 + (1500 * 1))/150 = 20).   The same 1500 IOPS write in a RAID5 config took more spindles because of the RAID penalties, but it was just as easily calculated as 50 drives ((1500+(1500*4))/150 = 50).

That last point, by the way, is why database vendors have always asked that their logs be placed on RAID1 or RAID10 storage.  When writing to RAID5 storage it’s necessary to read the entire RAID stripe, recalculate, and re-write it.  Thus the penalty of 4.

The math got a bit more complicated when we had a mix of reads and writes.   What we have to do there is to calculate the read and write portions separately and then add the result together.  Suppose we had a workload of 3000 IOPS, where 50% was read and 50% was write.  Thus we’d have 1500 IOPS read and 1500 IOPS write.   On a RAID10 system we’d need 10 drives to satisfy the reads, and 20 drives to satisfy the writes.   A total of 30 drives then is needed to satisfy the whole 3000 IOPS workload.
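The old-school spindle math above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function name `drives_needed` and the `WRITE_PENALTY` table are made up for this example, and the penalty values and the 150 IOPS per 15K drive are the rule-of-thumb figures quoted above.

```python
import math

# Extra back-end IOs charged per front-end write, using the figures
# quoted in the text (1 for RAID1/10, 4 for RAID5/50).
WRITE_PENALTY = {"raid10": 1, "raid5": 4}


def drives_needed(read_iops, write_iops, raid_level, iops_per_drive=150):
    """Return the number of spindles needed to carry the workload."""
    penalty = WRITE_PENALTY[raid_level]
    # Reads cost one back-end IO each; writes cost one plus the penalty.
    backend_iops = read_iops + write_iops * (1 + penalty)
    return math.ceil(backend_iops / iops_per_drive)


print(drives_needed(1500, 0, "raid10"))     # 10 (reads only)
print(drives_needed(0, 1500, "raid10"))     # 20 (writes only, RAID10)
print(drives_needed(0, 1500, "raid5"))      # 50 (writes only, RAID5)
print(drives_needed(1500, 1500, "raid10"))  # 30 (50/50 mix, RAID10)
```

The mixed case is just the read and write portions added together, which is why the 3000 IOPS workload lands on 30 drives.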

Those were the old days when we could pretty easily look at a disk subsystem and calculate how much performance it should deliver.  Modern disks however have changed the rules some.

How did they change the rules?   Well, basically they have a way of making IOPs disappear.

Consider for a moment NetApp’s WAFL configuration.   WAFL works by caching write operations to an NVRAM on the controller, and telling the application that the IO is complete.   No physical IO operation has actually taken place.  Now, thus far this sounds like a write-back cache, but here’s the difference.  WAFL doesn’t just perform a “lazy write” of the cached data; it actually waits until it has a series of writes which need to be written to the physical disks, and then it looks for a place on disk where it can write all of those blocks down at once in sequence, thereby taking perhaps 4 or 10 (or more) physical IOPs and combining them into one.   WAFL actually takes this a step further by looking for places on disk where it doesn’t have to read the stripe before writing it, in an attempt to also avoid paying the RAID write penalties.  This last is the reason WAFL performance degrades as the disk array becomes very full; it becomes harder to find unused space.

Another example of vanishing IOPs is Nimble’s CASL filesystem, which expands on what WAFL does by doing two additional things.  First, it compresses all the data as it comes into the array, which further reduces the number of IOPs necessary to write the data.  Second, CASL is based around the idea of having very large FLASH memory based caches so that physical IOPs to spinning disk can be avoided for reads.   The net of this is that write IOPs are reduced and read IOPs are nearly eliminated.    In testing done by Dan Brinkman while he was at Lewan, a Nimble array with 12 7.2K disks was clocked at over 18,000 IOPS.  We know that the physical disks were capable of no more than 960 IOPS (80 * 12 = 960).  This is a testament to how effective CASL is at reducing physical IOPs.

A third example of IO reduction is what Atlantis Computing does in their Ilio and USX products when dealing with persistent data (in-memory volumes are a topic for another day).   Atlantis takes the idea of caching and compression further still by adding inline data deduplication, wherein data is evaluated before being written to determine if an identical block has already been written.   If it’s an identical block then no physical write is actually performed for the block, and the filesystem pointer for that block is merely updated to reflect an additional reference.    Atlantis caches the data (reads and writes) in RAM or on FLASH as well to further reduce physical IO operations.
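To make the inline deduplication idea concrete, here is a toy sketch in Python. It is not how Atlantis (or any vendor) actually implements it; the `DedupStore` class, its fields, and the choice of SHA-256 as the block fingerprint are all assumptions made for illustration. The point is only that an identical block costs a pointer update rather than a physical write.

```python
import hashlib


class DedupStore:
    """Toy block store that physically writes each unique block only once."""

    def __init__(self):
        self.blocks = {}          # fingerprint -> data (stands in for the disk)
        self.refcount = {}        # fingerprint -> number of logical references
        self.pointers = {}        # logical block address -> fingerprint
        self.physical_writes = 0  # how many blocks actually hit "disk"

    def write(self, lba, data):
        fp = hashlib.sha256(data).hexdigest()
        old = self.pointers.get(lba)
        if old == fp:
            return  # same content already stored at this address
        if old is not None:
            self._release(old)
        if fp not in self.blocks:
            # First time this content is seen: one real write.
            self.blocks[fp] = data
            self.refcount[fp] = 0
            self.physical_writes += 1
        # Duplicate or not, the logical block now points at the content.
        self.refcount[fp] += 1
        self.pointers[lba] = fp

    def read(self, lba):
        return self.blocks[self.pointers[lba]]

    def _release(self, fp):
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            del self.refcount[fp]
            del self.blocks[fp]


store = DedupStore()
block = b"A" * 4096
for lba in range(10):
    store.write(lba, block)      # ten logical writes of identical data
store.write(10, b"B" * 4096)     # one write of different data
print(store.physical_writes)     # 2
```

Eleven logical writes turn into two physical ones here. A real array would also have to deal with hash collisions, persistence, and crash consistency, none of which this sketch attempts.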

The extreme case of this is the all-flash storage array (or subsystem), which is available from many vendors these days (Compellent, NetApp, Cisco, Atlantis, VMware vSAN, all offer all flash options and there are many more options as well).   All flash arrays eliminate physical disk IO by eliminating the physical disks. They’ve made the FLASH cache tier so large that there is no longer any need to store the data on a spinning drive.  There is still an upper bound for these arrays but it’s tied to controllers and bandwidth rather than the physics of the storage medium.

So what’s the net of all this?

The first part is that storage has gotten smarter and more efficient by making better use of CPUs and memory, which lets it deliver higher performance and better data density with fewer spinning drives.

The second part of the answer is that the old-school disk math around how many IOPS you need and how many spindles (spinning disks) will be required is largely obsolete.  Unless you’re building an old-school storage array or using internal disks in your server, the storage is probably doing something to reduce and/or eliminate physical disk IOPs on your behalf.  That makes the idea that you can judge the performance of the storage by the number and type of drives it uses pretty much false: a case of not being able to judge the book by its cover.

You’ll need to discuss your workload with your storage vendor and determine how the array is going to handle your data and then rely on the vendor to size their solution properly for your need.

Why Hyperconverged Architectures Win

Much has been made recently by the likes of Nutanix, Simplivity, Atlantis, and even VMware (vSAN, EVO|RAIL) about the benefits of hyper-converged architecture.

I thought I’d take a few moments and weigh in on why I think that these architectures will eventually win in the virtualized datacenter.

First, I encourage you to read my earlier blogs on the evolution of storage technology and from that I’ll make two statements.   1.) physical storage has not changed and 2.) what differentiates one storage array vendor from another is not the hardware but the software their arrays run.

Bold statements I know, but bear with me for the moment and let’s agree that spinning disk is no longer evolving, and that all storage array vendors are basically using the same parts – x86 processors, Seagate, Fujitsu, Western Digital hard disks, and Intel, Micron, Sandisk or Samsung flash.   What makes them unique is the way they put the parts together and the software that makes it all work.

This is most easily seen in the many storage companies whose physical product is really a Supermicro chassis (x86 server) with a mix of components inside.   We’ve seen this with Whiptail (Cisco), LeftHand (HP), Compellent (Dell), Nutanix, and many others.  The power of this is evidenced by the fact that the first 3 were purchased by major server vendors and then transitioned to their own hardware.   Why was this possible?   Because the product was really about software running on the servers and not about the hardware (servers, disks) itself.

Now, let’s consider the economics of storage in the datacenter.  The cheapest disks, and thus the cheapest storage in the datacenter, are those that go inside the servers.   It’s often a factor of 5-10x less expensive to put a given disk into a server than it is to put it into a dedicated storage array.   This is because of the additional qualification and in some cases custom firmware that goes onto the drives that are certified for the arrays, and the subsequent reduction in volume associated with a given drive only being deployed into a single vendor’s gear.  The net is that the drives for the arrays carry a premium price.

So, storage is about software, and hard disks in servers are cheaper.   It makes sense then to bring these items together.    We see this in products like VMware vSAN, and Atlantis USX.  These products let you choose your own hardware and then add software to create storage.

The problem with a roll-your-own storage solution is that it’s necessary to do validation on the configuration and components you use.   Will it scale?  Do you have the right controllers? Drivers? Firmware? What about the ratios of CPU, Memory, Disk, and Flash?  And of course there is the support question if it doesn’t all work together.  If you want the flexibility to custom configure then the option is there.   But it can be simpler if you want it to be.

So here enters the hyper-converged appliance.   The idea is that vendors combine commodity hardware in validated configurations with software to produce an integrated solution with a single point of contact for support.  If a brick provides 5TB of capacity and you need 15TB, buy 3 bricks.   Need more later?  Add another brick.   It’s like Legos for your datacenter, just snap together the bricks.

This approach removes the need to independently size RAM, Disk, and CPU; it also removes the independent knowledge domains for storage and compute.  It leverages the economy of scale of server components and provides the “easy button” for your server architecture, simplifying the install, configuration, and management of your infrastructure.

Software also has the ability to evolve very rapidly.   Updates to existing deployments do not require new hardware.

Today the economics for hyper-converged appliances have fallen short of delivering on the price point.   While they use the inexpensive hardware the software has always been priced at a premium.

The potential is there, but the software has been sold in low volumes with the vendors emphasizing OpEx savings. As competition in this space heats up we will see the price points come down.   As volumes and competition increase the software companies will be willing to sell for less.

This will drive down the cost, eventually making legacy architectures cost prohibitive due to the use of proprietary (and thus low volume) components.  Traditional storage vendors who are based on commodity components will be more competitive, but being “just storage” will make their solutions more complicated to deploy, scale, and maintain.   The more proprietary the hardware the lower the volume and higher cost.

For these reasons – cost, complexity, and the ability of software to evolve – we will see hyper-converged, building block architectures eventually take over the datacenter.  The change is upon us.

Are you ready to join the next wave?   Reach out to your Lewan account executive and ask about next generation datacenters today.   We’re ready to help.

Disk Physics

Today’s topic – Disk Performance.  A warning to the squeamish ..Math ahead.

Throughput refers to the amount of data read or written per unit of time.  It is generally measured in units like Megabytes per second (MB/s) or Gigabytes per hour (GB/h).  Often when dealing with networks we see Kilobits per second (Kb/s) or Megabits per second (Mb/s).   Note that the abbreviations of some of those units look similar; pay attention to the capitalization because the differences are a factor of at least 8x.

It’s easy to talk about a hard drive, or a backup job, or even a network interface providing 250MB/s throughput and understand that if I have 500GB of data that it’s going to take a little over a half hour to transfer the data. (500GB * 1024MB/GB / 250MB/s / 3600s/h = 0.57h)
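The transfer-time arithmetic is simple enough to put in a small helper. The function name `transfer_hours` is just something made up for this sketch; it uses the same 1024MB/GB conversion as the calculation above.

```python
def transfer_hours(data_gb, throughput_mb_per_s):
    """Hours needed to move data_gb of data at a steady throughput."""
    seconds = data_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


print(round(transfer_hours(500, 250), 2))  # 0.57
```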

Throughput is by far the most talked about and in general most understood measure of performance.   By the same token when people ask about network performance they often go to speedtest.com and tell me that “my network is fast, because I get 15Mb/s download.”  I agree that’s a pretty decent throughput, but that’s not the only measure of  performance that’s important.

A second measure is Response Time (or latency).   This is a measure of how long it takes a request to complete.   In the network world we think about this being how long it takes a message to arrive at its destination after being sent.   In the disk world we think about how long it takes from when we request an IO operation until the system completes it.  Disk latency (and network latency) are often measured in direct units of time – milliseconds (ms) or microseconds (us), and occasionally in seconds (s).   Hopefully you never see IT technology latency measured in hours or days unless you’re using an RFC1149 circuit.

The combination of request response time and throughput, together with the size of the request (amount of data moved at a time), yields a metric which amounts to how many requests can be completed per unit of time.  We see this most often in the disk world as I/O operations per second, or IOPS.   We talk about IOPS as a way of thinking about how “fast” a given disk system is, but it’s arguably more of a measure of workload capability than either latency or throughput; however, both latency and throughput contribute to the maximum IO operations per second a given disk can manage.

For example – if we have a hard disk with a maximum physical throughput of 125MB/sec, which is capable of processing requests at a rate of 80 requests per second, what is the throughput of the drive if my workload consists of 4KB reads and writes?  Well, in theory at 125MB/sec throughput the drive could process 125MB/s * 1024KB/MB / 4KB/IO = 32,000 IO/s.   Hold on, the drive is only capable of 80 IOPS, so the maximum physical throughput won’t be achieved.  80IO/s * 4KB/IO = 320KB/s.   If we wanted to maximize this drive’s throughput we need to increase the size (or payload) of the IO requests.  Ideally we’d perform reads and writes in blocks equal to the maximum throughput divided by the maximum IO rate (125MB/s / 80IO/s = 1.5625MB).
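Put another way, the drive delivers whichever is smaller: what the media can stream, or what the IO rate times the request size adds up to. Here is a minimal sketch of that reasoning; the function names are made up for the example.

```python
def effective_throughput_kb(max_mb_per_s, max_iops, io_size_kb):
    """Throughput in KB/s, capped by both the media rate and the IOPS limit."""
    media_limit_kb = max_mb_per_s * 1024
    iops_limit_kb = max_iops * io_size_kb
    return min(media_limit_kb, iops_limit_kb)


def ideal_io_size_mb(max_mb_per_s, max_iops):
    """Request size at which the IOPS limit and the media limit meet."""
    return max_mb_per_s / max_iops


print(effective_throughput_kb(125, 80, 4))  # 320
print(ideal_io_size_mb(125, 80))            # 1.5625
```

With 4KB requests the IOPS limit wins by a wide margin, which is why the drive only moves 320KB/s.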

This last trick by the way is what many vendors use to improve the performance of relatively “slow” hard disks; referred to as IO coalescing they take many small IO operations and buffer them until they can perform one large physical IO.
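As a rough illustration of IO coalescing, the sketch below merges buffered requests that touch adjacent blocks into larger ones. It is a toy model, not any vendor's algorithm: the `(start_block, block_count)` request format and the `coalesce` function are assumptions made for this example.

```python
def coalesce(requests):
    """Merge (start_block, block_count) requests that are adjacent or overlap."""
    merged = []
    for start, count in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This request begins where the previous one ends (or inside it),
            # so extend the previous request instead of issuing a new IO.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


# Five buffered 4KB writes (8 blocks of 512 bytes each), four of them adjacent.
pending = [(0, 8), (16, 8), (8, 8), (24, 8), (100, 8)]
print(coalesce(pending))  # [(0, 32), (100, 8)]
```

Five small physical IOs become two, one of them a single 16KB sequential write.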

What governs the drive’s maximum IOPS is actually a function of multiple factors.

Physical media throughput is one of them – which is governed by the physical block size (often 512bytes), the number of blocks per track on the platter, the number of platters, and the rotational velocity of the drive (typically measured in revolutions per minute – RPM).   The idea here being that the drive can only transfer data to or from the platters at the rate at which the data is moving under the heads.   If we have a drive spinning at 7200RPM, with say 100 blocks per track, and 512bytes/block and a single platter/head we have a drive with a maximum physical media transfer rate of 512B/block * 100blocks/track * 7200 tracks/minute / 60seconds/minute / 1024bytes/KB / 1024KB/MB = 5.86MB/s.  Under no circumstances can the physical throughput of the drive exceed this value, because the data simply isn’t passing under the head any faster.

To improve this value you can add more heads (with two platters and 4 heads this drive could potentially move 23.4MB/s). You can increase the number of blocks per track (with 200 blocks per track and one head the drive would have a throughput of 11.7MB/s). Or you can increase the velocity at which the drive spins (at 36,000 RPM this drive would move 29.3MB/sec).  As you can see though this maximum throughput is governed by the physical characteristics of the drive.
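The media transfer rate formula can be written as a short function so the three "what if" cases are easy to try. This is a sketch of the simplified model used here (one track passing under each head per revolution, all heads transferring at once); the function name and parameters are made up for the example, and the results are rounded to two decimals.

```python
def media_rate_mb_per_s(rpm, blocks_per_track, bytes_per_block=512, heads=1):
    """Maximum rate at which data passes under the heads, in MB/s."""
    revs_per_second = rpm / 60
    bytes_per_second = bytes_per_block * blocks_per_track * revs_per_second * heads
    return bytes_per_second / (1024 * 1024)


print(round(media_rate_mb_per_s(7200, 100), 2))           # 5.86  baseline
print(round(media_rate_mb_per_s(7200, 100, heads=4), 2))  # 23.44 four heads
print(round(media_rate_mb_per_s(7200, 200), 2))           # 11.72 denser tracks
print(round(media_rate_mb_per_s(36000, 100), 2))          # 29.3  faster spin
```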

A second factor impacting the IOPS is the question of how long it takes to position the head to read or write a particular block from the disk.  IO operations start at a given block and then proceed to read or write subsequent sequential blocks until the size of the IO request has been fulfilled.  So on our sample drive above a 4KB request is going to read or write 8 adjacent (sequential) blocks.  We know what the physical transfer rate for the drive is, but how long does it take to physically move the mechanism so that the 8 blocks we care about will pass under the head?   Two things have to happen: first we have to position the head over the right track, and then we have to wait for the right block to pass under the head.  This is the combination of “seek time” and “rotational latency”.   Our 7200RPM drive completes one revolution every 120th of a second (7200RPM / 60 seconds/minute = 120 revolutions/second), which is 0.00833 seconds or 8.33 milliseconds.  On average then every IO operation will wait half a revolution, about 4.17ms, to start performing IO after the heads are aligned.  Again we can reduce the rotational latency by spinning the drive faster.  Seek time (how long it takes to align the heads) varies by drive, but if it takes 6ms then the average physical access time for the drive would be 10.17ms.   Drives which are physically smaller will have to move the heads shorter distances and will have lower seek times, and therefore lower access times.  Larger drives, or drives with heavier head assemblies (more heads), will have higher seek times. For a given drive you can typically look up the manufacturer’s specs to see what the average seek time is.  So, let’s say that it typically takes 10ms to position a head and read a block; then our drive could potentially position the head 100 times per second.  That means the maximum IOPS for this drive is 100 per second.
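The same reasoning in code: average rotational latency is half a revolution, access time is seek plus rotational latency, and the IOPS ceiling is how many of those fit in a second. The function names are made up for this sketch, and the 6ms seek time is the example value from the text, not a spec for any particular drive.

```python
def rotational_latency_ms(rpm):
    """Average wait for the target block: half of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def max_iops(rpm, avg_seek_ms):
    """Upper bound on random IOs per second for a single spindle."""
    access_ms = avg_seek_ms + rotational_latency_ms(rpm)
    return 1000 / access_ms


print(round(rotational_latency_ms(7200), 2))  # 4.17
print(round(max_iops(7200, 6.0)))             # 98
```

Rounding the access time to 10ms gives the 100 IOPS figure used above.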

So, IOPS is governed by physics, and throughput (from media) is governed by physics.   What else is a factor?   There are additional latencies introduced by controllers, interfaces, and other electronics.  Generally these are fairly small, measured in microseconds (us), and relative to the latencies we’ve talked about they generally become negligible.   The other side is physical interface throughput.   ATA-133, for instance, had a maximum throughput on the channel of 133MB/s, where Ultra-160 SCSI was limited to 160MB/s.  The maximum throughput of a given drive will be limited to the throughput of the channel to which it’s attached.  The old ATA and SCSI interfaces noted earlier also attached multiple devices to a channel, which limited the sum of all devices to the bandwidth of the channel.  Newer SAS and SATA architectures generally dedicate a channel per device; however, the use of expanders serves to connect multiple devices to the same channel.  The net of this is that if you have 10 devices at 25MB/sec throughput each connected to a channel with a maximum throughput of 125MB/sec, then the maximum throughput you’ll see is 125MB/sec.
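The shared-channel limit is just a cap on the sum of the device rates. A tiny sketch, with a made-up function name:

```python
def channel_throughput(device_rates_mb_per_s, channel_limit_mb_per_s):
    """Aggregate throughput of devices sharing one channel, in MB/s."""
    return min(sum(device_rates_mb_per_s), channel_limit_mb_per_s)


print(channel_throughput([25] * 10, 125))  # 125 (channel is the bottleneck)
print(channel_throughput([25] * 4, 125))   # 100 (devices are the bottleneck)
```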

So that covers what governs the speed of a hard disk.   Some may ask “what makes SSD’s so fast?”  The short answer is that it’s because they aren’t spinning magnetic devices, and therefore don’t have the same physical limits.   The long answer is a topic for another blog.

A brief history of storage

I’ve told several groups I’ve spoken to recently that “disk storage hasn’t gotten faster in 15 years.” Often that statement is met with some disbelief. I thought I’d take a few paragraphs and explain my reasoning.

First – Let’s cover some of the timeline of the evolution of spinning disk storage.

  • 7200 RPM HDD introduced by Seagate in 1992
  • 10,000 RPM HDD introduced by Seagate in 1996
  • 15,000 RPM HDD introduced by Seagate in 2000
  • Serial ATA introduced in 2002
  • Serial Attached SCSI introduced 2004
  • 15,000 RPM SAS HDD ships in 2005

So, my argument starts with the idea that this is 2015, and the “fastest” hard disk I can buy today is still only 15,000 RPM, and those have been shipping since 2000.  Yes, capacities have gotten larger, data densities greater, but they have not increased in rotational speed, and hence have not significantly increased in terms of IOPS.

To be fair, the performance of a drive is a function of several variables; rotational latency (the time for a platter to complete one revolution) is just one measure.  Head seek time is another measure, as is the number of bits which pass under the head(s) in a straight line per second.

Greater data densities will increase the amount of data on a given cylinder for a drive, and thus increase the amount of data that can be read or written per revolution – So you could argue that throughput may have increased as a function of greater density.   But only if you don’t have to re-position the head, and only if you are reading most of a full cylinder.  I also submit that the greater densities have led to drives having fewer platters and thus fewer heads.  This leads to my conclusion that the reduction in drive size mostly offsets any significant increased throughput due to the greater densities.

Today we’re seeing a tendency towards 2.5″ and sometimes even 1.8″ drives.   These form factors have a potential to increase IO potential by decreasing seek times for the heads.   Basically the smaller drive has a shorter head stroke distance and thus potentially will take less time to move the head between tracks.   The theory is sound, but unfortunately the seek latency is typically much lower than the rotational latency; so the head gets there faster, but is still waiting for the proper sector to arrive as the disk spins.

Interestingly some manufacturers used to take advantage of a variable number of sectors per track and recognized that the outer tracks held more sectors.  For this reason they would use the outer 1/3 of the platter for “fast track” operations looking to minimize the head seek time and maximize the sequential throughput.   Again a sound theory, but the move from 3.5″ to 2.5″ drives eliminates this faster 1/3 of the platter.  Again, negating any gains we may have made.

Another interesting trend in disk storage is a movement to phase out 15,000RPM drives.  These disks are much more power hungry, and thus produce more heat than their slower (10,000RPM and 7,200RPM) counterparts.   Heat eventually equates to failure.  Likewise the rest of the tolerances in the faster drives are much tighter.   This results in faster drives having shorter service lives and being more expensive.   For those reasons (and the availability of flash memory) many storage vendors are looking to discontinue shipping of 15,000RPM disks. A 10K drive has only 66% of the IOP potential of a 15K drive.

So I submit that any gains we’ve had in the last 15 years in spinning disk performance have largely been offset by the changes in form factor.   Spinning disk hasn’t gotten faster in 15 years.  The moves towards 2.5″ and 10K drives could arguably suggest that disks are actually getting slower.

So IO demands for performance are getting greater.   VDI, Big Data Analytics, Consolidation and other trends demand more data and faster response times.  How do we address this?  Many would say the answer is flash memory, often in the form of Solid State Disk (SSD).

SSD storage is not exactly new

  • 1991 SanDisk sold a 20MB SSD for $1000
  • 1995 M-Systems introduced Flash based SSD
  • 1999 BiTMICRO announced a 18GB SSD
  • 2007 Fusion IO PCIe @ 320GB and 100,000 IOPS
  • 2008 EMC offers SSD in Symmetrix DMX
  • 2008 SUN Storage 7000 offers SSD storage
  • 2009 OCZ demonstrates a 1TB flash SSD
  • 2010 Seagate offers Hybrid SSD/7.2K HDD
  • 2014 IBM announces X6 with Flash on DIMM

But Flash memory isn’t without its flaws.

We know that a given flash device has a finite lifespan measured in write-cycles.  This means that every time you write to a flash device you’re wearing it out.  Much like turning on a light bulb, each time you change the state of a bit you’ve consumed a cycle. Do it enough and you’ll eventually consume them all.

Worse is that the smaller the storage cells used for flash are (and thus the greater the memory density) the shorter the lifespan.   This means that the highest capacity flash drives will sustain the fewest number of writes per cell.   Of course they have more cells so there is an argument that the drive may actually sustain a larger total number of writes before all the cells are burned out.

But… Flash gives us fantastic performance.   And in terms of dollars per IOP flash has a much lower cost than spinning disk.

DRAM memory (volatile) hasn’t gone anywhere either – in fact it keeps increasing in density and dropping in cost per GB.  DRAM doesn’t have the wear limit issue of Flash, nor the latencies associated with Disk. However it suffers from its inability to store data without power. If DRAM doesn’t have its charges refreshed periodically (every few milliseconds) it will lose whatever it’s storing.

Spinning disk capacities keep growing, and getting cheaper.  In December of 2014 Engadget announced that Seagate was now shipping 8TB hard disks for $260.

So the ultimate answer (for today) is that we need to use Flash or DRAM for performance and spinning disk (which doesn’t wear out from being written to, or forget everything when the lights go out) for capacity and data integrity.  Thus the best overall value comes from solutions which combine technologies to their best use.  The best options don’t ask you to create pools of storage of each type, but allow you to create unified storage pools which automatically store data optimally based on how it’s being used.

This is the future of storage.

vGPU, vSGA, vDGA, software – Why do I care?

I want to take a moment and talk for a second about an oft mentioned but little understood new feature of vSphere 6.  Specifically NVIDIA’s vGPU technology.

First, we need to know that vGPU is a feature of vSphere Enterprise Plus edition, which means it’s also included in vSphere for Desktops.  But if this sounds like something you need and you’re running Standard or Enterprise, now might be a good time to think about upgrading, and taking advantage of the trade up promotions.

Many folks think “I don’t need 3D for my environment.  We only run Office.”  If that’s you, then please take a good close look at what your physical desktop’s GPU is doing while you run Office 2013, especially PowerPoint.  Nearly every device sold since ~1992 has had some form of hardware based graphics acceleration.  Not so your VMs.   Software expects this.  Your users will demand it.

With that, let’s talk about what we’ve had for a while as relates to 3D with vSphere.  Understand you can take advantage of these features regardless of what form of Desktop or Application virtualization you choose to deploy because it’s a feature of the virtual machine.

No 3D Support – I mention this because it is an option.  You can configure a VM where 3D support is disabled.  Here an application that needs 3D either has to provide its own software rendering, or it will simply error out.  If you know your App doesn’t use any 3D rendering at all this is an option to ensure that no CPU time or memory is taken up trying to provide the support.  No vSphere drivers are required.

Software 3D – Ok, here we recognize that DirectX and OpenGL are part of the stack and that some applications are going to use them.  VMware builds support into their VGA driver (part of VMware Tools) that can render a subset of the API’s (DX9, OpenGL 1.2) in software, using the VM’s CPU.  This works for a set of apps that need a little 3D to run, and we aren’t concerned about the system CPU doing the work.  No hardware ties here as long as you can live with the limited API support and performance. No vSphere drivers are required.  No particular limits on how many VM’s can do this beyond running out of CPU.

vSGA – or Virtual Shared Graphics – In this mode the software implementation above gets a boost by putting a supported hardware graphics accelerator into the vSphere host.  The API support is the same because it’s still the VMware VGA driver in the VM, but it hands off the rendering to an Xorg instance in the service console which in turn does the rendering on the physical card.   This mode does require a supported ESXi .vib driver, provided by the card manufacturer.   That means you can’t just use any card, but have to buy one specifically for your server and which has a driver.  NVIDIA and AMD provide these for their server centric GPU cards.  The upper bound on VMs is determined by the amount of video memory you assign to the VMs and the amount of memory on your card.

vDGA – or Virtual Dedicated Graphics – In this mode we do a PCI pass-through for a GPU card to a given virtual machine.  This means that the driver for the GPU resides inside the virtual machine and VMware’s VGA driver is not used.  This is a double (or triple) edged sword.   Having the native driver in the VM ensures that the VM has the full power and compatibility of the card, including the latest APIs supported by the driver (DX11, OpenGL 4, etc.).  But having the card assigned to a single VM means no other VMs can use it.  It also means that the VM can’t move off its host (no vMotion, no HA, no DRS). This binding between the PCI device and the VM also prevents you from using View Composer or XenDesktop’s MCS, though Citrix’s Provisioning Services (PVS) can be made to work.  So this gives great performance and unmatched compatibility but at a pretty significant cost.  It also means that we do not want a driver installed for ESXi since we’re only going to pass-through the device.  That means you can use pretty much any GPU you want.  Your limit on how many VMs per host is tied to how many cards you can squeeze into the box.

All of the above are available in vSphere 5.5, with most of it actually working under vSphere 5.1.   I’ve said if you care about your user experience you want to have vSGA as a minimum requirement and consider vDGA for anyone who’s running apps that clearly “need” 3D support, though vDGA’s downside has had a way of pushing it out of high volume deployments.

Ok so what’s new?   The answer is NVIDIA vGPU.  The first thing to be aware of is that this is an NVIDIA technology, not VMware.  That means you won’t see vGPU supporting AMD (or anyone else’s) cards any time soon.  Those folks will need to come up with their own version.   NVIDIA also only supports this with their GRID cards (not GeForce or Quadro). So you’ve got to have the right card, in the right server.   Sorry, that’s how it is.  It’s only fair to mention that vGPU first came out for XenServer about two years ago, and came out for vSphere with vSphere 6.0.  So while it’s new to vSphere, it’s not exactly new to the market.

So what makes this different?   vGPU is a combination of an ESXi driver .vib and some additional services that make up the GRID manager.  This allows dynamic partitioning of GPU memory and works in concert with a GRID enabled driver in the virtual machine.   The end result is that the VM runs a native NVIDIA driver with full API support (DX11, OpenGL 4) and has direct access (no Xorg) to the GPU hardware, but is only allowed to use a defined portion of the GPU’s memory.  Shared access to the GPU’s compute resources is governed by the GRID manager.   Net-Net is that you can get performance nearly identical to vDGA without the PCI pass-through and its accompanying downsides.  vMotion remains a ‘no’ but VMware HA and DRS do work.  Composer does work, and MCS works.  And, if you set your VM to use only 1/16th of the GPU’s memory then you have the potential to share the GPU amongst 16 virtual machines.  Set it to 1/2 or 1/4 and get more performance (more video RAM) but at a lower VM density.

So why does this matter?   It means we get performance and compatibility for graphics applications (and PowerPoint!) and an awesome (as in better than physical) user experience while gaining back much of what drove us to the virtual environment in the first place.  No more choosing between a great experience and management methods, HA, and DR. Now we can have it all.

If you’re using graphics, you want vGPU!  And if you’re running Windows apps, you’re probably using graphics!

Documenting a Citrix Environment – The easy way

Do you ever find yourself thinking – “I wish I had better documentation of my Citrix environment” or “I wish my documentation was more up to date”?

Well, it turns out the internet – or more specifically Carl Webster @CarlWebster – has a solution for you.

Carl has written a large number of scripts for documenting these environments (and many of the surrounding technologies like Active Directory, DHCP, VMware vSphere, NetScaler, XenServer, etc.).  Best of all he gives these scripts away for free.

Take a look at http://carlwebster.com/where-to-get-copies-of-the-documentation-scripts/ and I’ll bet you’ll be amazed at how fast you can have some fantastic documentation.

Media Agent Networking

I get a lot of questions about the best way to configure networking for backup media agents or media servers in order to get the best throughput.    I thought a discussion of how the networking (and link aggregation) works would help shed some light.

Client to Media Agent:
In general we consider the media agents to be the ‘sink’ for data flows during backup from clients.  This data flow originates (typically) from many clients destined for a single media agent.   Environments with multiple media agents can be thought of as multiple single-agent configs.

The nature of this is that we have many flows from many sources destined for a single sink.  If we want to utilize multiple network interfaces on the sink (media agent), it is important that the switch to which it is attached be able to distribute the data across those interfaces.  By definition, then, we must be in a switch-assisted link aggregation scenario, meaning that the switch must be configured to utilize LACP or a similar protocol.   The server must also be configured to utilize the same method of teaming.

Why can’t we use adaptive load balancing (ALB) or other non-switch-assisted methods?  The issue is that the decision of which member of a link aggregation group a packet is transmitted over is made by the device transmitting the packet.  In the scenario above the bulk of the data is being transmitted from the switch to the media agent, therefore the switch must be configured to support spreading the traffic across multiple physical ports.  ALB and other non-switch-assisted aggregation methods will not allow the switch to do this, and will therefore result in the switch using only one member of the aggregation group to send data.  The net result is that the total throughput is restricted to that of a single link.

So, if you want to bond multiple 1GbE interfaces to support traffic from your clients to the media agent the use of LACP or similar switch assisted link aggregation is critical.

Media Agent to IP Storage:
Now from the media agent to storage we consider that most traffic will originate at the media agent and be destined for the storage.  There’s really not much in the way of many-to-one or one-to-many relationships here; it’s all one-to-one.  The first question is always “will LACP or ALB help?”  The answer is probably no.  Why is that?

First understand that the media agent is typically connected to a switch, and the storage is typically attached to the same or another switch.  Therefore we have two hops we need to address: MA to switch, and switch to storage.

ALB does a very nice job of spreading transmitted packets from the MA to the switch across multiple physical ports.  Unfortunately all of these packets are destined for the same IP and MAC address (the storage).  So while the packets are received by the switch on multiple physical ports, they are all going to the same destination and thus leave the switch on the same port.   If the MA is attached via 1GbE and the storage via 10GbE this may be fine.  If it’s 1GbE down to the storage then the bandwidth will be limited to that.

But didn’t I just say in the client section that LACP (switch assisted aggregation) would address this?  Yes and no.  LACP can spread traffic across multiple links even if it has the same destination, but only if it comes from multiple sources.  The reason is that LACP uses either an IP or MAC based hash algorithm to decide which member of an aggregation group a packet should be transmitted on.  That means that all packets originating from MAC address X and going to MAC address Y will always go down the same group member.  The same is true for source IP X and destination IP Y.   This means that while LACP may help balance traffic from multiple hosts going to the same storage, it can’t solve the problem of a single host going to a single storage target.
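Here is a rough Python sketch of why that happens.  The hash below (XOR of the last octet of the source and destination MAC, modulo the number of links) is a simplified stand-in I made up for illustration; real switches offer their own hash policies, and all of the MAC addresses are invented.

```python
def pick_member(src_mac: str, dst_mac: str, link_count: int) -> int:
    """Pick the aggregation group member a frame is sent on.

    Simplified illustration only: XOR the last octet of each MAC address
    and take the result modulo the number of links in the group.
    """
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % link_count


# Invented addresses for the example.
STORAGE = "00:11:22:33:44:10"
MEDIA_AGENT = "00:11:22:33:44:21"
HOSTS = [
    "00:11:22:33:44:31",
    "00:11:22:33:44:32",
    "00:11:22:33:44:33",
    "00:11:22:33:44:34",
]

# One host talking to one target: every frame hashes to the same link.
single_flow = {pick_member(MEDIA_AGENT, STORAGE, 2) for _ in range(1000)}

# Several hosts talking to the same target: the flows spread out.
many_flows = {pick_member(host, STORAGE, 2) for host in HOSTS}

print("links used by one host:", sorted(single_flow))
print("links used by four hosts:", sorted(many_flows))
```

The input to the hash never changes for a given source/destination pair, so no amount of traffic from that one pair will ever use the second link.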

By the way, this is a big part of the reason we don’t see many iSCSI storage vendors using a single IP for their arrays.  By giving the arrays multiple IP’s it becomes possible to spread the network traffic across multiple physical switch ports and network ports on the array.  Combine that with using multiple IP’s on the media agent host and multi-path IO (MPIO) software and now the host can talk to the array across all combinations of source and destination IPs (and thus physical ports) and fully utilize all the available bandwidth.
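As a small illustration of the “all combinations” point, this Python sketch lists the paths available with two IPs on the host and two on the array.  The addresses are invented (taken from a documentation range), and this only counts the paths; it is not how any MPIO driver is actually configured.

```python
from itertools import product

# Invented example addresses.
HOST_IPS = ["192.0.2.11", "192.0.2.12"]
ARRAY_IPS = ["192.0.2.101", "192.0.2.102"]

# Every source/destination pair is a distinct path MPIO could use.
paths = list(product(HOST_IPS, ARRAY_IPS))

for src, dst in paths:
    print(f"{src} -> {dst}")
print(f"{len(paths)} paths in total")
```

Each pair has a different source and destination, so the traffic can land on different physical ports at both ends.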

MPIO works great for iSCSI block storage.  What about CIFS (or NFS) based storage?   Unfortunately MPIO sits down low in the storage stack, and isn’t part of the network filing (requester) stack used by CIFS and NFS.  Which means that MPIO can’t help.    Worse, with the NFS and CIFS protocols the target storage is always defined by a single IP address or DNS name.  So having multiple IP’s on the array in and of itself doesn’t help either.

So what can we do for CIFS (or NFS)?  Well, if you create multiple share points (shares) on the storage, and bind each to a separate IP address you can create a situation where each share has isolated bandwidth.  And by accessing the shares in parallel you can aggregate that bandwidth (between the switch and the storage).  To aggregate between the host and switch you must force traffic to originate from specific IP’s or use LACP to spread the traffic across multiple host interfaces.  You could simulate MPIO type behavior by using routing tables to map a host IP to an array IP one-to-one.    It can be done but there is no ‘easy’ button.

So as we wrap this up what do I recommend for media agent networking?   And IP storage?
On the front end – aggregate interfaces with LACP.
On the back end – use iSCSI and MPIO rather than CIFS/NFS.  Or use 10GbE if you want/need CIFS/NFS

Sizing a Tape Library

So it seems like an easy question – how do I decide how large a tape library I need? It’s one I get a lot so I thought I’d devote a few minutes to the topic.

Let’s assume I have 30TB of data in my datacenter, and I’m going to do a full backup once per week.  We’re also going to assume that my daily incrementals are 5% of the size of my full (1.5TB) and that I write all of those to tape as well.

Now, let’s assume (like many of my customers, and my own shop years back) that I want to ship tapes offsite twice a week, say on Tuesday and Friday.  And that full backups are all staged to run over the weekend.  I do not want to keep any tapes in the library which contain data when I ship.

Based on this we can compute the amount of data I need to ship on Tuesday and Friday; and how fast I need to write data to tape in order to be ready to ship it.

On Tuesday I want to ship 1 full backup (30TB) + 3 incrementals (1.5TB x 3); a total of 34.5TB of data.   On Friday I’m shipping just 3 incrementals (4.5TB).

For Tuesday’s shipment we have a window which probably starts Saturday morning (say 6am) and runs till Tuesday morning (say 6am) to complete the tape writing.  This window is 72 hours in length, and will thus require a rate of just under 500GB/hour to complete the tape out.

For Friday’s shipment we have a window which could start at say 6pm Tuesday and needs to complete by 6am Friday.  This window is 60 hours, and represents a minimum throughput of 0.075TB/hour (75GB/hour).

Based on this we’re probably not too concerned about the Friday shipment, and we’ll concentrate on making things happen for Tuesday.
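As a quick check of the two rates, here is a minimal Python sketch of the arithmetic.  It assumes 1TB = 1000GB, and the function name is just something I made up for the example.

```python
GB_PER_TB = 1000  # assumption: decimal units


def required_gb_per_hour(data_tb: float, window_hours: float) -> float:
    """Minimum sustained write rate needed to finish inside the window."""
    return data_tb * GB_PER_TB / window_hours


# Tuesday: 1 full (30TB) + 3 incrementals (1.5TB each), 72 hour window.
tuesday_rate = required_gb_per_hour(30 + 3 * 1.5, 72)

# Friday: 3 incrementals, 60 hour window.
friday_rate = required_gb_per_hour(3 * 1.5, 60)

print(f"Tuesday: {tuesday_rate:.0f}GB/hour")
print(f"Friday:  {friday_rate:.0f}GB/hour")
```

Tuesday works out to roughly 479GB/hour, which I round up to 500GB/hour for planning, and Friday to 75GB/hour.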

First – how many drives do I need?

Well – do I need backward compatibility with older tape?  LTO can read back two generations and write back one.  So an LTO6 drive can read LTO4 and write LTO5.  If I need to read my old LTO2 tapes, then I need to restrict myself to deploying LTO4 drives.

LTO6 can write data at a rate of 560GB/hour before compression, LTO5 writes at 490GB/hour, and LTO4 writes at 420GB/hour.  If you’re working with something older than that you’ll have to do the math yourself.

Amount of Data / Backup Window Size / Tape Drive Throughput = Number of Drives (round up)

So for the 500GB/hour target I’d need 2 LTO4 drives, 2 LTO5 drives, or 1 LTO6 drive.  I might want to consider adding a drive or two as ‘spares’ in case one breaks or is in use for (gasp) data restore operations.
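The drive count formula can be sketched in a few lines of Python.  This uses the native write rates quoted above and the rounded 500GB/hour planning target (amount of data divided by the backup window); the names are mine, purely for illustration.

```python
import math

# Native write rates in GB/hour, as quoted in the text.
DRIVE_RATE_GB_PER_HOUR = {"LTO4": 420, "LTO5": 490, "LTO6": 560}


def drives_needed(required_gb_per_hour: float, generation: str) -> int:
    """Required rate / drive throughput, rounded up to whole drives."""
    return math.ceil(required_gb_per_hour / DRIVE_RATE_GB_PER_HOUR[generation])


TARGET_GB_PER_HOUR = 500  # rounded planning target for Tuesday's shipment

drive_counts = {
    gen: drives_needed(TARGET_GB_PER_HOUR, gen) for gen in DRIVE_RATE_GB_PER_HOUR
}
print(drive_counts)
```

Note how close LTO5 is to the line: at 490GB/hour a single drive would only just miss the 500GB/hour target, so rounding up buys some safety margin.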

Next up – how many slots do I need?

LTO6 holds 2.5TB on a tape, LTO5 holds 1.5TB, and LTO4 holds 0.8TB.  Again if you’re working with something older than that you’ll need to look up the numbers. Manufacturers will also quote compressed capacities which are roughly 2x native.  I find that while data will compress some, 1.5x is probably more realistic.  I’m going to use that assumption in my calculations below.

Size of Data / Compression Factor / Tape Capacity = Number of tapes. (round up)

34.5TB is going to occupy 10 LTO6 tapes, 16 LTO5 tapes, or 29 LTO4 tapes.
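Here is the same tape count calculation as a short Python sketch, using the native capacities and the 1.5x compression assumption from above.  The function name is made up for the example.

```python
import math

# Native capacities in TB per tape, as quoted in the text.
TAPE_CAPACITY_TB = {"LTO4": 0.8, "LTO5": 1.5, "LTO6": 2.5}


def tapes_needed(data_tb: float, generation: str, compression: float = 1.5) -> int:
    """Size of data / compression factor / tape capacity, rounded up."""
    return math.ceil(data_tb / compression / TAPE_CAPACITY_TB[generation])


tape_counts = {gen: tapes_needed(34.5, gen) for gen in TAPE_CAPACITY_TB}
print(tape_counts)
```

Change the compression argument to 1.0 if you would rather plan on native capacity only.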

At this point we know enough to size the library.

For LTO6 I need a library with 10 slots, and 1 drive.

For LTO5 I need a library with 16 slots and 2 drives.

For LTO4 I need a library with 29 slots and 2 drives.

Personally I like to add at least 10-15% to my slots and (as noted earlier) an extra drive or two. This provides headroom for growth and some inefficient tape use.

Based on this it seems that a library with 12-14 slots and 2 LTO6 drives would work well.  Maybe one with 24 slots and 3 LTO5 drives.

A last factor to consider (not really sizing) is how tapes will be loaded and unloaded from the library. If I’m going to pull 30 tapes at a time, I really don’t want a library with only a single I/E slot. Ideally the I/E slot count equals the number of tapes expected to be unloaded at a time; at a minimum it should be large enough to keep down the number of “return trips” to the library for each unload/load session.

That covers the scenario I described at the beginning.  I know some will ask about keeping tapes in the library.  That’s also something you can calculate based on the total amount of data (Number of Fulls * Size of Full + Number of Incrementals * Size of Incremental) you want to keep and dividing by the size of the tape.  That gives you a minimum slot count.   For this scenario I’d add about 25% to the number of slots for growth and “partial tapes”.     The good news is that if you’re keeping your tapes in the library, then you don’t have to worry about IE ports.
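For the keep-everything-in-the-library case, a minimal Python sketch of that calculation looks like this.  The retention numbers (4 fulls and 24 incrementals on LTO6) are hypothetical, picked only to have something to plug in, and the function name is mine.

```python
import math


def library_slots(fulls, full_tb, incrementals, incremental_tb,
                  tape_tb, headroom=0.25):
    """Return (minimum slots, slots with headroom) for retained data."""
    retained_tb = fulls * full_tb + incrementals * incremental_tb
    minimum = math.ceil(retained_tb / tape_tb)
    recommended = math.ceil(minimum * (1 + headroom))
    return minimum, recommended


# Hypothetical retention: 4 fulls of 30TB and 24 incrementals of 1.5TB,
# written to LTO6 at its native 2.5TB per tape.
minimum_slots, recommended_slots = library_slots(4, 30, 24, 1.5, 2.5)
print(minimum_slots, recommended_slots)
```

That is 156TB retained, so at least 63 slots, or about 79 once the 25% for growth and partial tapes is added.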

With that I know pretty well how big my library must be.  Now I can go shopping and find a device I like.

VMworld Wrapup

It’s been an exciting week in San Francisco hearing about the latest and greatest from VMware and partners.  I’m going to try to capture some of the highlights from the show and my thoughts about what we’re going to be seeing coming down the pipe.

Software Defined Data Center (SDDC) – SDDC was the overriding topic and concept in San Francisco this week.  The idea is that all things in your datacenter (Storage, Networking, Compute) should be abstracted from the “things” that they are and be managed as logical entities to enable flexibility, consolidation, and agility.  This is a huge concept but one VMware sees as the next leap for enterprise and service provider datacenters.  With vSphere and vCloud Director VMware has a good start around the compute side of SDDC and at this show they introduced their vision for software defined networking (SDN) and software defined storage (SDS).

Software Defined Networking (SDN) – SDN isn’t really a new concept, there has been discussion of things like VXLAN for a while now.  VMware introduced their vision for SDN at the show in the form of the NSX platform, along with a laundry list of networking partners who are supporting the platform.  Citrix even announced support of NSX with their NetScaler Controller for NSX.

Software Defined Storage (SDS) – SDS had a handful of interesting announcements this week, but not (yet) any shipping product.   Where VMware is going with this is the idea that storage is configured into the environment and self-describes its capabilities (performance, replication, snapshots, etc.).  When a virtual machine is created administrators will tag it with information which describes its requirements, and based on a policy engine vSphere will select appropriate storage options for placement of the VM.  As the storage is reconfigured vSphere will detect changes; if a VM’s needs are changed vSphere will detect that as well, and in both cases the environment will react accordingly to make sure that the needs of each VM are met.

SDS will see its first real products in the form of vSphere vFlash (an SSD based read cache) and VMware Virtual SAN (vSAN).  With each of these products you’ll see the per-VM policies being applied.  vFlash uses SSD drives installed in the individual hosts to provide high-performance read caching of VM data, and will be available as an enterprise plus edition feature in vSphere 5.5. vSAN is a local storage based hybrid distributed storage technology leveraging SSD and SAS/SATA storage in ESXi hosts to provide high performance and low cost storage for virtual machines.  vSAN is expected to be available somewhere in the first half of 2014.  Additional time was given to the concept of Virtual Volumes (vVols) wherein storage arrays will integrate directly with vSphere without the intermediate layer of LUNs and filesystems.  Virtual disks will be provisioned directly to array storage based on the requirements of the VM.  Like NSX, vVols were introduced with a lengthy list of partners who are actively working with VMware to define and bring this technology to market.

The final dimension of the SDDC will be management, delivered in the form of the vCenter suite of products (vCOPS, vCD, vCAC, vCOM).   With these tools to monitor the virtualized environment and ensure that resources are used efficiently enterprises will be able to ensure that they maximize the value of their infrastructure investments.

Today much of this discussion is vision, and not yet product.  Actual product announcements for shipping code were pretty incremental, but the changes coming in the future will be dramatic.  It’s going to be important to consider how investments made today will support the SDDC of the future.

Stay tuned, lots of big things coming from VMware!