Media Agent Networking

I get a lot of questions about the best way to configure networking for backup media agents or media servers in order to get the best throughput.    I thought a discussion of how the networking (and link aggregation) works would help shed some light.

Client to Media Agent:
In general we consider the media agents to be the ‘sink’ for data flows during backup from clients.  This data flow typically originates from many clients and is destined for a single media agent.  Environments with multiple media agents can be thought of as multiple single-agent configs.

The nature of this is that we have many flows from many sources destined for a single sink.  It is important, then, if we want to utilize multiple network interfaces on the sink (media agent), that the switch to which it is attached be able to distribute the data across those interfaces.  By definition we must be in a switch-assisted network link aggregation scenario, meaning the switch must be configured to utilize LACP or a similar protocol, and the server must be configured to utilize the same method of teaming.

Why can’t we use adaptive load balancing (ALB) or other non-switch-assisted methods?  The issue is that the decision of which member of a link aggregation group a packet is transmitted over is made by the device transmitting the packet.  In the scenario above the bulk of the data is being transmitted from the switch to the media agent, therefore the switch must be configured to spread that traffic across multiple physical ports.  ALB and other non-switch-assisted aggregation methods will not allow the switch to do this and will therefore result in the switch using only one member of the aggregation group to send data.  The net result is that total throughput is restricted to that of a single link.

So, if you want to bond multiple 1GbE interfaces to support traffic from your clients to the media agent, the use of LACP or similar switch-assisted link aggregation is critical.

Media Agent to IP Storage:
Now from the media agent to storage, most traffic originates at the media agent and is destined for the storage.  There is really not much in the way of many-to-one or one-to-many relationships here; it’s all one-to-one.  The first question is always “will LACP or ALB help?”  The answer is probably no.  Why is that?

First, understand that the media agent is typically connected to a switch, and the storage is typically attached to the same or another switch.  Therefore we have two hops to address: MA to switch, and switch to storage.

ALB does a very nice job of spreading transmitted packets from the MA to the switch across multiple physical ports.  Unfortunately, all of these packets are destined for the same IP and MAC address (the storage).  So while the packets are received by the switch on multiple physical ports, they are all going to the same destination and thus leave the switch on the same port.   If the MA is attached via 1GbE and the storage via 10GbE this may be fine.  If it’s 1GbE down to the storage, then the bandwidth will be limited to that single link.

But didn’t I just say in the client section that LACP (switch-assisted aggregation) would address this?  Yes and no.  LACP can spread traffic across multiple links even if it has the same destination, but only if it comes from multiple sources.  The reason is that LACP uses either an IP- or MAC-based hash algorithm to decide which member of an aggregation group a packet should be transmitted on.  That means all packets originating from MAC address X and going to MAC address Y will always go down the same group member.  The same is true for source IP X and destination IP Y.   This means that while LACP may help balance traffic from multiple hosts going to the same storage, it can’t solve the problem of a single host going to a single storage target.
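
To make the hash behavior concrete, here is a minimal Python sketch of the idea.  The exact hash policy varies by switch and NIC vendor (layer 2, layer 2+3, layer 3+4, and so on), so the CRC-based function and the addresses below are purely illustrative, not any particular vendor’s algorithm.

    import zlib

    def pick_member(src, dst, num_links):
        # Toy stand-in for an LACP transmit hash: a given src/dst pair
        # always maps to the same member link.
        return zlib.crc32((src + dst).encode()) % num_links

    LINKS = 4  # a 4-port aggregation group

    # Many clients backing up to one media agent: different source addresses
    # can hash to different members, so the switch has a chance to spread the load.
    for client in ("10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"):
        print(client, "-> media agent on link", pick_member(client, "10.0.0.50", LINKS))

    # One media agent writing to one storage target: the pair never changes,
    # so every packet of the flow lands on the same member link.
    print("media agent -> storage on link", pick_member("10.0.0.50", "10.0.0.60", LINKS))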

By the way, this is a big part of the reason we don’t see many iSCSI storage vendors using a single IP for their arrays.  By giving the arrays multiple IPs it becomes possible to spread the network traffic across multiple physical switch ports and network ports on the array.  Combine that with multiple IPs on the media agent host and multipath I/O (MPIO) software, and now the host can talk to the array across all combinations of source and destination IPs (and thus physical ports) and fully utilize all the available bandwidth.
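
Reusing the toy hash from the previous sketch, the following hypothetical addresses show why multiple IPs on both ends matter: MPIO establishes a session for every host/array IP pair, so the switch has several distinct flows to hash across its ports instead of just one.

    import zlib
    from itertools import product

    def pick_member(src, dst, num_links):
        # Same toy transmit hash as before; illustrative only.
        return zlib.crc32((src + dst).encode()) % num_links

    host_ips = ("192.168.10.21", "192.168.10.22")    # two NICs on the media agent
    array_ips = ("192.168.10.31", "192.168.10.32")   # two iSCSI portals on the array

    # Four source/destination pairs means four distinct flows, giving the
    # switch (and the array) the opportunity to use multiple physical ports.
    for src, dst in product(host_ips, array_ips):
        print(src, "->", dst, "hashes to port", pick_member(src, dst, 4))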

MPIO works great for iSCSI block storage.  What about CIFS (or NFS) based storage?   Unfortunately MPIO sits low in the block storage stack and isn’t part of the network file access (requester) stack used by CIFS and NFS, which means MPIO can’t help.    Worse, with the NFS and CIFS protocols the target storage is always defined by a single IP address or DNS name, so having multiple IPs on the array doesn’t, in and of itself, help either.

So what can we do for CIFS (or NFS)?  Well, if you create multiple share points (shares) on the storage and bind each to a separate IP address, you create a situation where each share has isolated bandwidth, and by accessing the shares in parallel you can aggregate that bandwidth (between the switch and the storage).  To aggregate between the host and the switch you must either force traffic to originate from specific IPs or use LACP to spread the traffic across multiple host interfaces.  You could simulate MPIO-type behavior by using routing tables to map a host IP to an array IP one-to-one.    It can be done, but there is no ‘easy’ button.

So as we wrap this up, what do I recommend for media agent networking and IP storage?
On the front end – aggregate interfaces with LACP.
On the back end – use iSCSI and MPIO rather than CIFS/NFS, or use 10GbE if you want/need CIFS/NFS.

We’re Platinum! Lewan Earns Highest CommVault Partner Certification

Lewan is excited to announce the addition of a new and very notable certification to our arsenal of expertise—CommVault Platinum Partner Certification. Platinum status is CommVault’s highest and most prestigious partner designation.

Lewan is the largest and only CommVault Platinum Partner in the Rocky Mountain region, and just one of two companies to earn this prestigious certification in the west. Our team is CommVault Certified for Administration and Engineering enabling our experts to guide your project from start to finish.

CommVault is recognized by the Street and leading analysts as the best of the best in the data management industry.

Simpana Virtualize Me (CommVault P2V Conversion w/1-Touch)

 

I recently wrote an article on using 1-Touch to restore a physical machine into a VM, essentially performing a P2V conversion as part of a system recovery. It’s since been called to my attention that there can be an easier way.

Simpana Service Pack 4 introduced a new feature combining the functions of 1-Touch with a P2V tool. The end result is that it can be very easy to restore a failed physical machine into a VMware VM. A word of caution here: the functionality requires that 1-Touch is installed from the SP4 install CD set. If your 1-Touch install is from an earlier set you must uninstall and re-install using the SP4 media (no, you can’t just patch to SP4).

You must also be using boot .ISO files generated from the above 1-Touch installation, and copy them onto a datastore on your ESXi servers.  You must also have backed up the system you want to restore in its entirety so that there is a complete operating system to restore.

With those caveats stated, let’s talk about using the new “Virtualize Me” feature.

Start Virtualize Me Wizard

media_1323058613173.png

Select the server to create a VM from, right click and select Virtualize Me.

Specify the Destination

media_1323061435107.png

Select your vCenter server from your installed Virtual Server Agent instances; once it is selected you can browse for the datastore and boot .ISO file.

media_1323662650196.png

Enter the information about where to create the new VM. You may also want to investigate the settings under ‘Advanced’ to resize the VM to something different from the original.

If you have Virtual Server agents installed you should be able to browse for all the required information.

Select “Immediate” to run this job now, then click Ok.

media_1323662722829.png

The job controller will show that the job has been submitted.

media_1323746204210.png

On your vCenter server you can see that the VM has been created, and is powered on.

media_1323746214251.png

Opening the console of the VM you can observe the progress of the restore job.

media_1323746221700.png

You can also watch the progress of the restore from the CommCell Console.

media_1323746228688.png

After the restore process completes, the VM will reboot. It’s going to do some setup (courtesy of the sysprep mini-setup).

media_1323746235204.png

When faced with the login error, go ahead and log in as the local administrator. Don’t panic when it flashes a sysprep dialog and logs you out … this is expected.

media_1323746244402.png

The mini-setup will continue on the next boot.

media_1323746252541.png

Log in again when prompted.

media_1323746260878.png

After the login you should find yourself on a functioning, restored system. Note the AD Domain membership.

At this point you probably want to do some additional cleanup, install VMware tools, etc – but the restore/P2V process is complete.

 

CommVault 1-Touch Bare Metal Restore to VMware VM

I was working with a customer recently who wanted to configure 1-Touch to be used for bare metal recovery for some older servers into their VMware environment. After working through the process I thought it would be best if I took a few minutes and documented the process that we’d used as well as a couple tips that will make it easier for folks in the future.

In this example we will be restoring a backup from a physical machine into a VM using the 1-Touch boot CD in offline mode. The original machine was Windows Server 2008 R2, thus we will be using the 64-bit 1-Touch CD.

This section assumes that 1-Touch has already been installed and the 1-Touch CDs have been generated and made available to the VMware infrastructure as .ISO files.

Building the Recovery VM

wpid1572-media_1323041407452.png

Generally you build this like you would build any other VM, but there are a couple things we want to pay attention to. First is the network adapter.

While in general the VMXNET3 adapter is preferred for VMware VMs, in this instance we want to specify the E1000 adapter because drivers for this are embedded in the Windows distribution, and thus in the 1-Touch recovery disk. You can change the adapter on the VM later if you’re so inclined, but using the E1000 for the restore will make things easier.

wpid1573-media_1323041557045.png

The second item is the SCSI controller. Again, while the SAS adapter is generally preferable, our restore is going to be easier if we select the LSI Logic Parallel adapter (for reasons we’ll discuss in a few minutes). Note this can be made to work with the SAS adapter, but it’ll require more editing of files; and like the network adapter, you can change it later after the OS is restored, so this makes things easier.

wpid1574-media_1323041688421.png

Pay attention to the creation of Disks. You need the same number and approximate size (larger, not smaller) as the original machine. It is possible to do a dissimilar volume configuration, but you’ll spend more time fiddling, and it will still insist that all volumes be at least as big as the original.

wpid1575-media_1323041980699.png

Power on the VM and mount the appropriate 1-Touch ISO. In this case we’re using the 1Touch_X64.iso file.

1-Touch Recovery

wpid1576-media_1323042223956.png

Booting the VM from the 1-Touch CD will start the interactive process. When prompted pick your favorite language and then click ok.

wpid1577-media_1323042296398.png

Click Next at the welcome screen. (Note: if you are asked about providing a configuration file, click Cancel to go into interactive mode.)

wpid1578-media_1323042344582.png

Verify that your disks have all been detected, then select yes and then click next.

wpid1579-media_1323042618944.png

Fill out the CommServe and client information, then click the “Get Clients” button to get the list of clients from the CommServe.

wpid1580-media_1323042817807.png

Select the client you want to recover, then click next.

wpid1581-media_1323042862700.png

Review the Summary then click next.

wpid1582-media_1323043025329.png

Select the backup set, and point in time for the recovery; then provide a CommCell username and password for the restore. Then click next.

wpid1583-media_1323043161561.png

You’ll see the ‘please wait while processing’ message … at this point you may want to watch a CommCell Console session.

wpid1584-media_1323043228923.png

You’ll see that a restore job has been created. You’ll need to watch this job for the restore to complete.

wpid1585-media_1323051629430.png

Now that the 1-Touch details have been restored we need to deal with disk mapping. In this case we will leave it set to ‘similar disk mapping’ and disable the mini-setup (uncheck the box that is checked by default). Click OK on the warning about devices not matching (we’ll fix that later), then click Next.

wpid1586-media_1323051769461.png

Don’t exclude anything (unless you really know you want to). Click Next.

wpid1587-media_1323051813461.png

This is a review of the original disks. Click next.

wpid1588-media_1323051892760.png

Review or reconfigure the network binding correctly for this restore. Then click Ok.

wpid1589-media_1323052041995.png

Right click on the disk name and initialize if necessary. Then click Done.

wpid1590-media_1323052122700.png

Map the volumes to the disk by right clicking on the source, and mapping to the destination disk.

wpid1591-media_1323052173386.png
wpid1592-media_1323052220228.png

Repeat above for each volume. Once all volumes are mapped to the destination, click Ok.

The system will format the disks and start the restore.

wpid1593-media_1323052294387.png

You can also observe the restore progress again at the CommCell Console.

wpid1594-media_1323052525774.png

In general this is a good point to go get a cup of coffee, or lunch, or…whatever. This is the point where the entire backup set for the machine is going to be restored, so if the machine has any size to it this could take a while.

wpid1595-media_1323052803844.png

Time to map the drivers to the target system. CommVault does not automatically discover/remap these for you, so you need to tell the 1-Touch CD which drivers for LAN and Mass Storage Device (MSD) need to be used.

We know we need an Intel Pro1000 LAN driver and an LSI Parallel SCSI (non-SAS) driver. Click on the browse buttons to the right, and look for the proper drivers under C:\Windows\System32\DriverStore\FileRepository.

(note that clicking the “more” button will give a lot more detail to aid in finding the correct driver)

wpid1596-media_1323053013646.png

For reference, this is the directory containing the proper driver for the Pro1000 adapter we specified for the VM.

wpid1597-media_1323053063937.png

Select the .inf file (the extension is suppressed), then click Open.

wpid1598-media_1323053219097.png

Now we need to do the same for the Mass Storage Device. Looking at the detail behind the ‘more’ button will help us confirm that we need the LSI_SCSI device, and the PnP device IDs that are expected. Make note of these IDs; we’ll need them again in a minute (it might be worth copying them to the clipboard in the VM now).

Click the browse button and go find the LSI_SCSI driver.

wpid1599-media_1323053374320.png

This is the directory containing the LSI_SCSI driver. Browse into the directory.

wpid1601-media_1323053498084.png

If you try to just use the driver as-is, you’ll get the following error because the device IDs in the file don’t match closely enough for 1-Touch’s satisfaction. To address this we need to edit the .inf file a little bit.

wpid1600-media_1323053435449.png

Right click the .inf file and select open with.

wpid1602-media_1323053581419.png

Accept the default of Notepad and click OK.

wpid1603-media_1323053638382.png

Scroll down to the section of the file which lists the device IDs. Unfortunately the IDs being requested by 1-Touch are longer than those in the file, so to make this work we’ll add the extended IDs that 1-Touch is looking for.

Below each section, paste in the IDs you copied from the details window, and edit them to match the line above.

wpid1604-media_1323053866559.png

The modified file is shown. Save the file, then click OK.

As an aside, this is the reason we picked the LSI_SCSI controller for our restore rather than the LSI_SAS controller. The SAS driver has the same issue, but there are many (many) more IDs to be updated when using that driver. It’s easier to edit the simpler file, then go back later, add a SAS-based secondary disk to the VM, and let Windows auto-install the SAS driver. Once that’s done you can change the adapter to SAS for the primary disk if you really want to be using the virtual SAS controller.

wpid1605-media_1323053921421.png

If the file was modified correctly, you can now click ok to continue.

wpid1606-media_1323053957526.png

The registry merge section here has to do with updating the drivers on the system. These changes are what we needed to map the drivers for.

wpid1607-media_1323054008668.png

Click Ok to the “restore completed successfully” dialog. The system will then reboot. This would be a really good time to eject the 1-Touch CD.

wpid1608-media_1323054078158.png

On reboot you may see this message. Don’t panic. Remember that at the time the backup you’re restoring was made, the server was powered on and running. This is ok, just start windows normally.

wpid1609-media_1323054219575.png

Be patient and let the machine boot up. This might take a bit, particularly if the original system had a lot of hardware management agents, which will probably be none too happy in their new home. When the machine is ready go ahead and log in. It might be best to use a local credential (rather than a Domain account).

Also don’t be surprised if you log in and are immediately logged off – drivers are being discovered and installed at this point and the machine may want to reboot a time or two.

wpid1610-media_1323054487511.png

Before trying to fix the broken devices, this is a really good time to install VMware tools. After that you should be able to remove any broken devices from the restored system.

So, install VMware Tools, then clean up any dead devices. Then uninstall any old hardware management software that doesn’t belong in a VM (some of it may need to be disabled if it won’t uninstall). This cleanup will vary from system to system.

That said, once the cleanup is done, you have recovered your physical system into a VM by way of the 1-Touch feature.

VMware Consolidated Backup and User Account Control

Working with a customer recently I got to spend some quality time troubleshooting the VMware Consolidated Backup framework.  Generally VCB is a very straightforward install and it pretty much “just works” – which made my recent experience very atypical (in my experience anyway).

Here’s the setup – we have a group of ESX servers and a Windows Server 2008 Standard 64-bit system, all attached to the same Fibre Channel SAN, with everything zoned properly.  VCB is installed on the 2K8 system.  The 2K8 OS sees all of the VMFS LUNs which are presented to it.  We are using Win2K8’s native MPIO stack.

Running VCB Mounter in SAN mode returns an error that there is no path to the device where the VM is stored.  Running it in NBD mode works great… except that it passes all of the traffic over the network, which is not desirable.

Again, diskpart and the Disk Management MMC see all of the LUNs with no issues.

VCB’s vcbSanDbg.exe utility, however, sees no storage.  None at all.

We tried various options – newer and older versions of the VCB framework (by the way, only the latest 1.5 U1 version of VCB is supported on Win2K8).  We tried various ways of presenting the storage.  We even tried presenting some iSCSI storage, thinking maybe it was an issue with the system’s HBAs.

OK, if you’ve read the subject of this post then you already know the answer.  In case you didn’t, here it is – the system has User Account Control (UAC) enabled.   The user we’re running the framework as is a local administrator on the proxy, but that’s not enough to allow it to properly enumerate the disk devices.   In order for the VCB framework to work you either have to run it in a command window with the “run as administrator” option, or turn off UAC on the server.  The former can be a little tricky to accomplish if you’re wanting to run the framework from inside a backup application, while the latter seems to be the most common approach.

That’s it.  Turn off UAC and reboot the computer.  Now VCB works great.

Why is my backup running slow?

Backup systems, while a necessary part of any well managed IT system, are often a large source of headaches for IT staff. One of the biggest issues with any backup system is poor performance. It is often assumed that performance is related to the efficiency of the backup software or the performance capabilities of backup hardware. There are, however, many places within the entire backup infrastructure that could create a bottleneck.
Weekly and nightly backups tend to place a much higher load on systems than normal daily activities. For example, a standard file server may access around 5% of its files during the course of a day, but a full backup reads every file on the system. Backups put strain on all components of a system, from the storage through the internal buses to the network. A weakness in any component along the path can cause performance problems. Starting with the backup client itself, let’s look at some of the issues which could impact backup performance.

  • File size and file system tuning
  • Small Files

A file system with many small files is generally slower to back up than one with the same amount of data in fewer large files. Generally, systems with home directories and other shares which house user files will take longer to back up than database servers and systems with fewer large files. The primary reason for this is the overhead involved in opening and closing each file.
In order to read a file the operating system must first acquire the proper locks, then access the directory information to ascertain where the data is located on the physical disk. After the data is read, additional processing is required to release those locks and close the file. If the amount of time required to read one block of data is x, then it takes a minimum of 2-3x to perform the open operations and x to perform the close. The best-case scenario, therefore, would require 4x to open, read and close a one-block file. A 100-block file would require 103x. A file system with 4 files of 100 blocks each will require around 412x to back up. The same amount of data stored in 400 one-block files would require 1600x, or about 4 times as much time.
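
The arithmetic above can be summarized in a short Python sketch; the open and close costs (2x and x) are the best-case figures from the paragraph, not measured values.

    # Best-case costs, in units of x (the time to read one block of data).
    OPEN_COST, CLOSE_COST = 2, 1

    def backup_time(num_files, blocks_per_file):
        # Every file pays the open/close overhead on top of its read time.
        return num_files * (OPEN_COST + blocks_per_file + CLOSE_COST)

    print(backup_time(1, 1))     # 4x: open, read, and close a one-block file
    print(backup_time(1, 100))   # 103x: a single 100-block file
    print(backup_time(4, 100))   # 412x: 400 blocks stored in 4 large files
    print(backup_time(400, 1))   # 1600x: the same 400 blocks in 400 small files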

So, what is the solution? Multiple strategies exist which can help alleviate the situation.
With synthetic full backups, only the changed files are copied from the client to the backup server (as with an incremental backup), and a new full is generated on the backup server from the previous full backup and the subsequent incrementals. A synthetic full strategy requires multiple tape drives at a minimum, and disk-based backup is recommended. Adequate server I/O performance is a must as well, since the creation of the synthetic full requires a large number of read and write operations.
Another strategy can be to use storage level snapshots to present the data to the backup server. The snapshot will relieve the load from the client but will not speed up the overall backup as the open/close overhead still exists. It just has been moved to a different system. Snapshots can also be problematic if the snapshot is not properly synchronized with the original server. Backup data can be corrupted if open files are included in the snapshot.
Some backup tools allow for block level backups of file systems. This removes the performance hit due to small files but requires a full file system recovery to another server in order to extract a single file.
Continuous Data Protection (CDP) is a method of writing the changes within a file system to another location either in real time or at regular, short intervals. CDP overcomes the small file issue by only copying the changed blocks but requires reasonable bandwidth and may put an additional load on the server.
Moving older, seldom accessed files to a different server via file system archiving tools will speed up the backup process while also reducing required investment in expensive infrastructure for unused data.

  • Fragmentation

A system with a lot of fragmentation can take longer to back up as well. If large files are broken up into small pieces a read of that file will require multiple seek operations as opposed to a sequential operation if the file has no fragmentation.
File systems with a large amount of fragmentation should regularly be run through some sort of de-fragmentation process, since fragmentation can impact both system and backup performance.

  • Client throughput

In some cases a client system may be perfectly suited for the application but not have adequate internal bandwidth for good backup performance. A backup operation requires a large amount of disk read operations which are passed along a system’s internal bus to the network interface card (NIC). Any slow device along the path from the storage itself, through the host bus adapter, the system’s backplane and the NIC can cause a bottleneck.
Short of replacing the client hardware the solution to this issue is to minimize the effect on the remainder of the backup infrastructure. Strategies such as backup to disk before copying to tape (D2D2T) or multiplexing limit the adverse effects of a slow backup on tape performance and life. In some cases a CDP strategy might be considered as well.

  • Network throughput

Network bandwidth and latency can also affect the performance of a backup system. A very common issue arises when either a client or media server has connected to the network but the automatic configuration has set the connection to a lower speed or incorrect duplex. Using 1Gb/sec hardware has no advantage when the port is incorrectly set to 10Mb/half duplex.
Remote sites can also cause problems, as those sites often utilize much slower connections than local links. Synthetic full backups can alleviate the problem, but may not be ideal if there is a high daily change rate. CDP is often a good fit, as long as the change rate does not exceed the available bandwidth. In many cases a remote media server with deduplicated disk replicated to the main site is the most efficient method for remote sites.

  • Media server throughput

Like each client system, the media server can have internal bandwidth issues. When designing a backup solution, be certain that systems used for backup servers have adequate performance characteristics to meet requirements. Often a site will choose an out-of-production server to become the backup system. While such systems may seem adequate, in many cases these obsolete servers are not up to the task of meeting the performance needs of a backup server.
In some cases a single media server cannot provide adequate throughput to complete the backups within required windows. In these cases multiple media servers are recommended. Most enterprise class backup software allows for sharing of tape and disk media and can automatically load balance between media servers. In such cases multiple media servers allow for both performance and availability advantages.

  • Storage network

When designing the Storage Area Network (SAN), be certain that the link bandwidth matches the requirements of attached devices. A single LTO-4 tape drive writes data at 120MB/sec. In network bandwidth terms this is equivalent to 1.2Gb/sec. If this tape drive is connected to an older 1Gb SAN, the network will not be able to write at tape speeds. In many cases multiple drives are connected to a single Fibre Channel link. This is not an issue if the link allows for at least the bandwidth of the total of the connected devices. The rule of thumb for modern LTO devices and 4Gb Fibre Channel is to put no more than 4 LTO-3 or 2 LTO-4 drives on a single link.
For disk-based backup media, be certain that the underlying network infrastructure (LAN for network-attached or iSCSI disk, and SAN for Fibre Channel) can support the required bandwidth. If a network-attached disk system can handle 400MB/sec writes but is connected to a single 1Gb/sec LAN, it will only be able to write up to the network speed, 100MB/sec. In such a case, 4 separate 1Gb connections will be required to meet the disk system’s capabilities.
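
A quick back-of-the-envelope check in Python, using the rule of thumb that 1Gb of network or Fibre Channel bandwidth moves roughly 100MB/sec of payload; the device speeds are the figures quoted above.

    MB_PER_GBIT = 100  # rough payload throughput per 1Gb of link bandwidth

    def links_needed(device_mb_per_sec, link_gbit):
        # How many links of a given speed it takes to keep the device busy.
        link_mb = link_gbit * MB_PER_GBIT
        return -(-device_mb_per_sec // link_mb)  # ceiling division

    print(links_needed(120, 1))   # LTO-4 at 120MB/sec outruns a single 1Gb link
    print(links_needed(120, 4))   # one 4Gb FC link handles a single LTO-4 drive
    print(links_needed(400, 1))   # a 400MB/sec disk target needs 4 x 1Gb connections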

  • Storage devices

The final stage of any backup is the write of data to the backup device. While these devices are usually not the source of performance problems there may be some areas of concern. When analyzing a backup system for performance, be sure to take into account the capabilities of the target devices. A backup system with 1Gb throughput throughout the system with a single LTO-1 target will never exceed the 15MB/sec (150Mb/sec) bandwidth of that device.

  • Disk

For disk systems, the biggest performance issue is the write capability of each individual disk and the number of disks (spindles) within the system. A single SATA disk can write between 75 and 100 MB/sec. An array with 10 SATA drives can, therefore, be expected to write between 750MB/sec and 1GB/sec. RAID processing overhead and inline deduplication processing will limit the speed, so expect the real performance to be somewhat lower, as much as 50% less than the raw disk performance depending on the specific system involved. When deciding on a disk subsystem, be sure to evaluate the manufacturer’s performance specifications.
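
As a rough sketch of that estimate (the per-disk speed and the overhead factor are the ballpark figures from the paragraph above, not a substitute for the manufacturer’s numbers):

    def array_write_estimate(spindles, per_disk_mb=100, overhead=0.5):
        # Raw streaming write speed, then derated for RAID and inline dedupe.
        raw = spindles * per_disk_mb
        return raw, raw * (1 - overhead)

    raw, realistic = array_write_estimate(10)   # a 10-spindle SATA array
    print(raw, "MB/sec raw,", realistic, "MB/sec after ~50% overhead")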

  • Tape

With modern high speed tape subsystems the biggest problem is not exceeding the device’s capability but not meeting the write speed. A tape device performs best when the tape is passing the heads at full speed. If data is not streamed to the tape device at a sufficient rate to continuously write, the tape will have to stop while the drive’s buffer is filled with enough data to perform the next write. In order to get up to speed, the tape must rewind a small amount and then restart. Such activity is referred to as “shoe shining” and drastically reduces the life of both the tape and the drive.
Techniques such as multiplexing (intermingling backup data from multiple clients) can alleviate the problem but be certain that the last, slow client is not still trickling data to the tape after all other backup jobs have completed. In most cases D2D2T is the best solution, provided that the disk can be read fast enough to meet the tape’s requirements.

  • Conclusion

In most backup systems there are multiple components which cause performance issues. Be certain to investigate each stage of the backup process and analyze all potential causes of poor performance.