Media Agent Networking

I get a lot of questions about the best way to configure networking for backup media agents or media servers in order to get the best throughput. I thought a discussion of how the networking (and link aggregation) works would help shed some light.

Client to Media Agent:
In general we consider the media agents to be the ‘sink’ for data flows during backup from clients. This data flow typically originates from many clients destined for a single media agent. Environments with multiple media agents can be thought of as multiple single-agent configurations.

The nature of this is that we have many flows from many sources destined for a single sink. If we want to utilize multiple network interfaces on the sink (media agent), it is important that the switch to which it is attached be able to distribute the data across those interfaces. By definition, then, we must be in a switch-assisted link aggregation scenario, meaning that the switch must be configured to use LACP or a similar protocol. The server must also be configured to use the same teaming method.

Why can’t we use adaptive load balancing (ALB) or other non-switch-assisted methods? The issue is that the decision of which member of a link aggregation group a packet is transmitted over is made by the device transmitting the packet. In the scenario above the bulk of the data is being transmitted from the switch to the media agent, therefore the switch must be configured to support spreading the traffic across multiple physical ports. ALB and other non-switch-assisted aggregation methods will not allow the switch to do this and will therefore result in the switch using only one member of the aggregation group to send data. The net result is that the total throughput is restricted to that of a single link.

So, if you want to bond multiple 1GbE interfaces to support traffic from your clients to the media agent, the use of LACP or similar switch-assisted link aggregation is critical.

Media Agent to IP Storage:
Now, from the media agent to storage, we consider that most traffic will originate at the media agent and be destined for the storage. There is not much in the way of many-to-one or one-to-many relationships here; it’s all one-to-one. The first question is always “will LACP or ALB help?” The answer is probably no. Why is that?

First, understand that the media agent is typically connected to a switch, and the storage is typically attached to the same or another switch. Therefore we have two hops to address: MA to switch, and switch to storage.

ALB does a very nice job of spreading transmitted packets from the MA to the switch across multiple physical ports. Unfortunately, all of these packets are destined for the same IP and MAC address (the storage). So while the packets are received by the switch on multiple physical ports, they are all going to the same destination and thus leave the switch on the same port. If the MA is attached via 1GbE and the storage via 10GbE this may be fine. If it’s 1GbE down to the storage, then the bandwidth will be limited to that of a single link.

But didn’t I just say in the client section that LACP (switch-assisted aggregation) would address this? Yes and no. LACP can spread traffic across multiple links even if it has the same destination, but only if it comes from multiple sources. The reason is that LACP uses either an IP- or MAC-based hash algorithm to decide which member of an aggregation group a packet should be transmitted on. That means that all packets originating from MAC address X and going to MAC address Y will always go down the same group member. The same is true for source IP X and destination IP Y. This means that while LACP may help balance traffic from multiple hosts going to the same storage, it can’t solve the problem of a single host going to a single storage target.
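To make that hash behavior concrete, here is a rough Python sketch of the idea. The hash shown (an XOR of the low MAC octets) is purely illustrative; real switches use vendor-specific algorithms, but the pinning effect is the same:

```python
# Illustrative sketch of how an LACP-style hash pins a flow to one
# link-aggregation group member. Real hardware hashes differ, but any
# deterministic hash of (source, destination) has the same property.

def lag_member(src_mac: str, dst_mac: str, group_size: int) -> int:
    """Pick a group member from the source/destination MAC pair."""
    src_low = int(src_mac.split(":")[-1], 16)   # low octet of source MAC
    dst_low = int(dst_mac.split(":")[-1], 16)   # low octet of destination MAC
    return (src_low ^ dst_low) % group_size

storage = "00:50:56:00:00:10"                   # single storage MAC
clients = [f"00:50:56:00:00:{i:02x}" for i in range(1, 5)]

# Many clients -> one sink: flows spread across the two members.
print([lag_member(c, storage, 2) for c in clients])    # [1, 0, 1, 0]

# One media agent -> one storage target: every packet hashes the same
# way, so only one member ever carries the traffic.
ma = "00:50:56:00:00:20"
print({lag_member(ma, storage, 2) for _ in range(100)})  # {0}
```

No matter how many packets the single MA-to-storage flow sends, they all land on the same member, which is exactly why LACP cannot widen a one-to-one pipe.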

By the way, this is a big part of the reason we don’t see many iSCSI storage vendors using a single IP for their arrays. By giving the arrays multiple IPs it becomes possible to spread the network traffic across multiple physical switch ports and network ports on the array. Combine that with multiple IPs on the media agent host and multi-path I/O (MPIO) software, and now the host can talk to the array across all combinations of source and destination IPs (and thus physical ports) and fully utilize all the available bandwidth.

MPIO works great for iSCSI block storage. What about CIFS (or NFS) based storage? Unfortunately, MPIO sits low in the storage stack and isn’t part of the network file-sharing (redirector) stack used by CIFS and NFS, which means that MPIO can’t help. Worse, with the NFS and CIFS protocols the target storage is always defined by an IP address or DNS name, so having multiple IPs on the array in and of itself doesn’t help either.

So what can we do for CIFS (or NFS)? Well, if you create multiple share points (shares) on the storage and bind each to a separate IP address, you can create a situation where each share has isolated bandwidth. By accessing the shares in parallel you can aggregate that bandwidth (between the switch and the storage). To aggregate between the host and switch you must force traffic to originate from specific IPs or use LACP to spread the traffic across multiple host interfaces. You could simulate MPIO-type behavior by using routing tables to map a host IP to an array IP one-to-one. It can be done, but there is no ‘easy’ button.

So, as we wrap this up, what do I recommend for media agent networking and IP storage?
On the front end – aggregate interfaces with LACP.
On the back end – use iSCSI and MPIO rather than CIFS/NFS, or use 10GbE if you want/need CIFS/NFS.

Asigra Linux Restore with ‘sudo’

Conduct an Asigra restore to a UNIX or Linux server using sudo credentials

Verify that the user is listed in the /etc/sudoers file on the restore target system

media_1373309867237.png

The sudo utility allows users root level access to some (or all) subsystems without requiring users to know the root password. Please look at documentation for the sudo utility for more information.

From Asigra restore dialog, choose files to be restored

media_1373309875752.png

Select Alternate location and click on ‘>>’

media_1373310026490.png

Enter the server name or IP address for the restore target and check both “Ask for credentials” and “‘sudo’ as alternate user”

media_1373310083088.png

Enter username and password for user configured in /etc/sudoers file

media_1373309230681.png

Enter “root” and the same password as in the previous step

media_1373309513645.png

Do NOT enter the ‘root’ password. The sudo utility uses the regular user’s password.

Select restore location and truncate path, if required

media_1373309543033.png
media_1373309558317.png

Accept defaults

media_1373309569710.png

Restore in progress…

media_1373310480356.png

Verify restore completed

media_1373310743301.png

Using Asigra DS-Client Logs

How to understand backup operations using the DS-Client logs

For Lewan Managed Data Protection customers wanting additional information beyond what is available in the daily or weekly reports, the Asigra software provides the ability to look at the DS-Client activity logs. This post assumes that the user has installed or been given access to the DS-User interface and is able to connect to their DS-Client server.
A previous blog post (http://blog.lewan.com/2012/03/29/asigra-ds-user-installation-and-log-file-viewing/) addressed the installation of the DS-User along with some basics on the activity logs. This post will provide additional detail regarding the data provided by the activity logs.

Open the DS-User interface and connect to the appropriate DS-Client

media_1370017433779.png

From the menus select “Logs” and open the “Activity Log”

media_1370017447703.png

Set the parameters for logs desired

media_1370017475203.png

By default the system will display all logs for the current and previous days. For this exercise only backup activity will be required. The date and time range as well as specific nodes (backup clients) or backup sets can also be selected.
Once all options have been set, click the “Find” button to locate the specified logs.

Backup windows

media_1370017539988.png

For each set backed up, the start time, end time and total duration of the backup job can be observed. Each column can be sorted to assist in viewing.

Online Data Changed

media_1370018486463.png

The column labeled “Online” indicates the total size of changed files for the backup, that is, the total amount of space used by all files which had any change since the last backup session. For example, a server with a 30 GB database which has daily updates and 4 new 1 MB documents would show 32,216,449,024 (30 GB + 4 MB). This is the amount of data copied from the backup client to the DS-Client.

Data Transmitted to the cloud

media_1370018516767.png

The column labeled “Transmitted…” shows the actual amount of data changed and copied to the cloud-based device. This is the amount of data contained in changed blocks from all of the changed files, after compression and encryption. If, in the example above, the database file only had 1 MB of changed blocks, the Transmitted column would contain a number similar to 5,242,880 (roughly 5 MB).
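The byte counts used in these two examples come straight from binary units, and can be checked with a little arithmetic (this is just a sketch of the figures above, not Asigra’s internal accounting):

```python
# Verify the example byte counts: "Online" counts whole changed files,
# "Transmitted" counts only the changed blocks actually sent.

GB = 1024 ** 3   # bytes per GB (binary)
MB = 1024 ** 2   # bytes per MB (binary)

online = 30 * GB + 4 * MB       # 30 GB database + four new 1 MB documents
print(online)                   # 32216449024

transmitted = 1 * MB + 4 * MB   # 1 MB of changed blocks + the 4 new documents
print(transmitted)              # 5242880, roughly 5 MB
```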

Determining error and warning causes

media_1370019239770.png

In some cases a backup set will show a status of “Completed with Errors” or “Completed with Warnings”. In most cases the errors and warnings are inconsequential, but they should still be reviewed.
Select the line containing the backup set in question and click on the “Event Log” button.

Backup Session Event Log

media_1370018398623.png

Each event in the backup session is listed in the log. Errors are flagged with a red ‘X’ and warnings with a yellow ‘!’. Selecting the event will show the detail. In the example shown above, a file is listed for which the backup user does not have permission to read. Other common errors are due to a file being used by another process, or a file which has been moved or deleted between the initial scan of the file system and the attempt to access it for backup.
In some cases there will be a large number of “The network name cannot be found.” errors. These usually indicate a problem with the network connection between the DS-Client and the backup target, but could be caused by a reboot of the backup target or other connectivity issues.

For our Managed Data Protection customers, the Lewan Operations team checks backup sets for errors on a daily basis and will correct any critical issues.

Additional analysis

media_1370020250336.png

The activity log can also be saved to a file (text or Excel spreadsheet) for additional analysis. Right-click anywhere in the activity log and select “Save As”. Use the resulting dialog to configure the location and file type.

SAVE! SAVE! SAVE!

Have you ever been in the middle of an important document when your computer crashes or freezes, and realized you didn’t save it? You are not alone! I think we’ve all been there a time or two…or three! But there may be hope for us yet! The AutoRecover (AutoSave) feature is here to help!

Depending on the Microsoft Office version that you have, AutoRecover (AutoSave) is a feature that will recover a document if your computer loses power or if a program error occurs while you are working in a document.

This feature will create a file similar to “AutoRecover Save of &lt;file name&gt;.doc.” When the program is restarted, the application will search your system for any of these files. Some documents may not be recoverable, especially if you’ve never saved the document before. And even if you have saved the document, you may lose recent changes. Once the program you were working in is restarted, it will automatically attempt to open the AutoRecover files. If it does so successfully, you will be able to save the document.

This is a truly valuable feature that Microsoft Office offers. But in order for it to work, it must be enabled. To view instructions on how to enable this feature, go to: http://office.microsoft.com/en-us/word-help/automatically-save-and-recover-office-files-HP010140729.aspx.

Lewan Achieves Veeam Gold/Platinum Partner Status

Our Enterprise Solutions team has been hard at work in the lab, training to become solutions experts for Veeam Data Protection and Backup tools. Their hard work, dedication and opportunity to train with Veeam’s technical team has earned Lewan the recognition of Gold Partner status in the Veeam ProPartner Program. Congratulations!

View all of Lewan’s Vendor Partners: http://www.lewan.com/vendorpartners

Asigra File Recovery

Self service recovery of one or more files from Asigra file system backup.

In the DS-User Interface, Open restore dialog for backup set

media_1330636107098.png

Select recovery objects

media_1330636203533.png

For restore of most recent backup of a directory, browse to specific folder and select that folder.
For other options select Show Files or Advanced

Selection of an individual file

media_1330636222955.png

Advanced selection options

media_1330636256503.png

Default is the latest generation of data (most recent backup)
See Asigra DS-Client documentation for detailed description of options

Selection of a backup session for selective data option

media_1330636265951.png

Data selected from specific backup session

media_1330636271712.png

Note that not all data will be included in any given backup session. The “From” date may need to be moved back in order to pick up files which had not changed at the time of a particular session.

Restore files to original location

media_1330636318699.png

Use this option with caution, as more recent versions of files may be overwritten

Alternate restore options

media_1330636383356.png

The server, destination share and path may all be changed.
Note that by default the entire directory structure is restored below the destination. The full path is noted in the bottom pane. The path can be truncated (from the top directory shown in red) by incrementing the counter on the right.

Overwrite warning pop-up

media_1330636389349.png

Keep default performance options

media_1330636393563.png

Other restore options

media_1330636404531.png

Normally these are left at the default.

Restore progress window

media_1330638166441.png

Restored files

media_1330636526139.png

Note full path of restore

To document the restore open the Activity Log

media_1333641354850.png

Select the parameters to locate the specific restore job

media_1333641382015.png

Select the restore job and open the detailed log

media_1333641393788.png
media_1333641398887.png

Right-Click on the log entries and select “Save As…”

media_1333641424187.png

Select a directory location and appropriate name for the log file

media_1333641453260.png

Asigra DS-User Installation and Log file viewing

 

How to install Asigra DS-User and view log files to determine failed objects.

Run setup from Asigra DS-Client

wpid1752-media_1329860906667.png

Request the specific path from the Lewan ROC.

Allow the install if a UAC prompt pops up

wpid1753-media_1329860921890.png

Install the proper Java DS-User for workstation platform

wpid1754-media_1329860945693.png

These instructions assume the Java version of DS-User. The Windows native version can also be installed by selecting the proper DS-Client option and installing only the DS-User GUI.

Follow the various screens to install. Use all default values.

wpid1755-media_1329860973925.png
wpid1756-media_1329860987714.png

Windows 7 workstations may show an Operating System incompatibility. This can be ignored.

wpid1757-media_1329861000969.png
wpid1758-media_1329861011808.png
wpid1759-media_1329861022010.png
wpid1760-media_1329861032841.png

Do not change the port number!

wpid1761-media_1329861046636.png
wpid1762-media_1329861061679.png
wpid1763-media_1329861072211.png

Once installation completes, exit the installer

wpid1764-media_1329861087629.png

Open DS-User from the Windows “Start” menu

wpid1765-media_1329861109484.png
wpid1766-media_1329861120787.png

Select Setup -> Initialization

wpid1771-media_1329861559819.png

Add DS-Client’s IP address

wpid1767-media_1329861257386.png
wpid1768-media_1329861276472.png

Enter IP Address of DS-Client

Refresh DS-Client list and select DS-Client

wpid1769-media_1329861346309.png

In the far left pane, click refresh and the DS-Client should appear. Select the DS-Client.

Log in with Windows Domain credentials

wpid1770-media_1329861363569.png

If the workstation is not logged in as a Domain user, enter the proper credentials

Upgrade DS-User if necessary

wpid1772-media_1329861580771.png

When Asigra components are updated, new versions are automatically pushed out. The default DS-User installation will likely be out of date. Select the upgrade button to update the software. Once the software is updated, DS-User will restart. Complete the previous step to select the DS-Client and log in.

View Event log

wpid1773-media_1329861764496.png

The event log will show all events relating to activities, including the reasons backup jobs completed “With Errors”

Filter for date range and to only show Errors

wpid1774-media_1329861887321.png

Typical error shown

wpid1775-media_1329861930963.png

The most common reason a backup job will fail is due to an open file. If a file fails regularly there may be a need to use a different backup strategy or to exclude that file from backup. Lewan Managed Data Protection staff regularly monitors events and will take steps to reduce errors on backup jobs.

 

VMware Backups using NetBackup 7

Configuring NetBackup 7 for VMware backup (using vStorage API)

Configure VMware backup host in Netbackup

wpid745-media_1268169003053.png

Right-click on the master server and select “Properties”

wpid746-media_1268169032231.png

Add VMware Backup Host

wpid747-media_1268169068352.png
wpid748-media_1268169100577.png
wpid749-media_1268169133115.png

Configure Credentials on vCenter

wpid750-media_1268169159126.png
wpid751-media_1268169176234.png
wpid752-media_1268169196136.png
wpid753-media_1268169227149.png
wpid754-media_1268169251033.png

Create the backup policy for Virtual Machine Backup

wpid755-media_1268169558017.png
wpid756-media_1268169599531.png
wpid757-media_1268169638301.png
wpid768-media_1269641712723.png
The parameters shown are not the defaults but reflect a configuration that seems to be optimal for a test environment. Your mileage may vary.
These specific parameters have been changed from the defaults:
Client Name Selection determines how Virtual Machines are identified to NetBackup. The VM Display name option matches the VM name as identified in vCenter.
Transfer type determines how VM data is transferred to the NetBackup host. The san option uses a Fibre Channel or iSCSI SAN (note: LUNs containing VMware datastores must be presented to the NetBackup host). The nbd option resorts to a network copy should the san option fail.
Existing snapshot handling, when set to Remove NBU, will remove stray NetBackup snapshots from VMs if encountered, but ignore all other snapshots.
wpid758-media_1268169703636.png
wpid759-media_1268169735435.png
wpid760-media_1268169765662.png
wpid761-media_1268169782724.png

Configure remaining backup policy options based on backup windows etc.

wpid762-media_1268169805026.png
wpid763-media_1268169823906.png
wpid764-media_1268169840241.png
wpid765-media_1268169856070.png

If options need to be changed (’cuz mine didn’t work in your environment 😉), change them on the policy’s attributes window

wpid766-media_1269640695368.png
wpid767-media_1269640972034.png

Why is my backup running slow?

Backup systems, while a necessary part of any well managed IT system, are often a large source of headaches for IT staff. One of the biggest issues with any backup system is poor performance. It is often assumed that performance is related to the efficiency of the backup software or the performance capabilities of backup hardware. There are, however, many places within the entire backup infrastructure that could create a bottleneck.
Weekly and nightly backups tend to place a much higher load on systems than normal daily activities. For example, a standard file server may access around 5% of its files during the course of a day, but a full backup reads every file on the system. Backups put strain on all components of a system, from the storage through the internal buses to the network. A weakness in any component along the path can cause performance problems. Starting with the backup client itself, let’s look at some of the issues which could impact backup performance.

  • File size and file system tuning
  • Small Files

A file system with many small files is generally slower to back up than one with the same amount of data in fewer large files. Generally, systems with home directories and other shares which house user files will take longer to back up than database servers and systems with fewer, larger files. The primary reason for this is the overhead involved in opening and closing a file.
In order to read a file, the operating system must first acquire the proper locks, then access the directory information to ascertain where the data is located on the physical disk. After the data is read, additional processing is required to release those locks and close the file. If the amount of time required to read one block of data is x, then it takes a minimum of 2-3x to perform the open operations and x to perform the close. The best case scenario, therefore, would require 4x to open, read and close a 1-block file. A 100-block file would require 103x. A file system with four 100-block files will require around 412x to back up. The same amount of data stored in 400 1-block files would require 1600x, or about 4 times as much time.
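This cost model is easy to sketch in a few lines of Python (using the best-case open cost of 2x from above; the unit x is the time to read one block):

```python
# Back-of-the-envelope model of per-file backup overhead: opening a
# file costs 2x (best case), each block read costs x, closing costs x.

def backup_cost(files: int, blocks_per_file: int,
                open_cost: int = 2, close_cost: int = 1) -> int:
    """Total cost, in units of x, to back up `files` files."""
    per_file = open_cost + blocks_per_file + close_cost
    return files * per_file

print(backup_cost(1, 1))       # 4x for a single 1-block file
print(backup_cost(4, 100))     # four 100-block files -> 412x
print(backup_cost(400, 1))     # same data as 400 1-block files -> 1600x
```

The open/close overhead, not the data volume, is what makes the small-file layout roughly four times slower.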

So, what is the solution? Multiple strategies exist which can help alleviate the situation.
The use of synthetic full backups copies only the changed files from the client to the backup server (as with an incremental backup); a new full is then generated on the backup server from the previous full backup and the subsequent incrementals. A synthetic full strategy requires multiple tape drives at a minimum, and disk-based backup is recommended. Adequate server I/O performance is a must as well, since the creation of the synthetic full requires a large number of read and write operations.
Another strategy can be to use storage level snapshots to present the data to the backup server. The snapshot will relieve the load from the client but will not speed up the overall backup as the open/close overhead still exists. It just has been moved to a different system. Snapshots can also be problematic if the snapshot is not properly synchronized with the original server. Backup data can be corrupted if open files are included in the snapshot.
Some backup tools allow for block level backups of file systems. This removes the performance hit due to small files but requires a full file system recovery to another server in order to extract a single file.
Continuous Data Protection (CDP) is a method of writing the changes within a file system to another location either in real time or at regular, short intervals. CDP overcomes the small file issue by only copying the changed blocks but requires reasonable bandwidth and may put an additional load on the server.
Moving older, seldom accessed files to a different server via file system archiving tools will speed up the backup process while also reducing required investment in expensive infrastructure for unused data.

  • Fragmentation

A system with a lot of fragmentation can take longer to back up as well. If large files are broken up into small pieces, a read of a file will require multiple seek operations, as opposed to a sequential read if the file has no fragmentation.
File systems with a large amount of fragmentation should regularly utilize some sort of de-fragmentation process, as fragmentation can impact both system and backup performance.

  • Client throughput

In some cases a client system may be perfectly suited for the application but not have adequate internal bandwidth for good backup performance. A backup operation requires a large amount of disk read operations which are passed along a system’s internal bus to the network interface card (NIC). Any slow device along the path from the storage itself, through the host bus adapter, the system’s backplane and the NIC can cause a bottleneck.
Short of replacing the client hardware the solution to this issue is to minimize the effect on the remainder of the backup infrastructure. Strategies such as backup to disk before copying to tape (D2D2T) or multiplexing limit the adverse effects of a slow backup on tape performance and life. In some cases a CDP strategy might be considered as well.

  • Network throughput

Network bandwidth and latency can also affect the performance of a backup system. A very common issue arises when either a client or media server has connected to the network but the automatic configuration has set the connection to a lower speed or incorrect duplex. Using 1Gb/sec hardware has no advantage when the port is incorrectly set to 10Mb/sec half duplex.
Remote sites can also cause problems, as those sites often utilize much slower connections than local ones. Synthetic full backups can alleviate the problem, but may not be ideal if there is a high daily change rate. CDP is often a good fit, as long as the change rate does not exceed the available bandwidth. In many cases a remote media server with deduplicated disk replicated to the main site is the most efficient method for remote sites.

  • Media server throughput

Like each client system, the media server can have internal bandwidth issues. When designing a backup solution, be certain that systems used for backup servers have adequate performance characteristics to meet requirements. Often a site will choose an out-of-production server to become the backup system. While such a system may seem a good fit on paper, in many cases obsolete servers are not up to the task.
In some cases a single media server cannot provide adequate throughput to complete the backups within required windows. In these cases multiple media servers are recommended. Most enterprise class backup software allows for sharing of tape and disk media and can automatically load balance between media servers. In such cases multiple media servers allow for both performance and availability advantages.

  • Storage network

When designing the Storage Area Network (SAN), be certain that the link bandwidth matches the requirements of attached devices. A single LTO-4 tape drive writes data at 120MB/sec. In network bandwidth terms this is equivalent to 1.2Gb/sec. If this tape drive is connected to an older 1Gb SAN, the network will not be able to write at tape speeds. In many cases multiple drives are connected to a single Fibre Channel link. This is not an issue if the link allows for at least the total bandwidth of the connected devices. The rule of thumb for modern LTO devices and 4Gb Fibre Channel is to put no more than 4 LTO-3 or 2 LTO-4 drives on a single link.
For disk based backup media, be certain that the underlying network infrastructure (LAN for network attached or iSCSI disk, and SAN for Fibre Channel) can support the required bandwidth. If a network attached disk system can handle 400MB/sec writes but is connected to a single 1Gb/sec LAN, it will only be able to write at the network speed, 100MB/sec. In such a case, 4 separate 1Gb connections will be required to meet the disk system’s capabilities.
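The conversions behind these examples can be sketched quickly. Note that 1/2/4Gb Fibre Channel uses 8b/10b encoding, so each payload byte occupies roughly 10 bits on the wire, which is why 120MB/sec is treated here as about 1.2Gb/sec (these figures ignore protocol overhead and are approximations only):

```python
# Rough device-throughput to link-bandwidth conversion, assuming
# roughly 10 line bits per payload byte (8b/10b encoding).

def mbs_to_line_rate_gbps(mb_per_sec: float, bits_per_byte: int = 10) -> float:
    """Convert MB/sec of device throughput to Gb/sec on the wire."""
    return mb_per_sec * bits_per_byte / 1000

print(mbs_to_line_rate_gbps(120))   # LTO-4 drive: 1.2 Gb/sec -> saturates a 1Gb link
print(mbs_to_line_rate_gbps(400))   # 400 MB/sec disk system: 4.0 Gb/sec,
                                    # i.e. at least four 1Gb LAN connections
```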

  • Storage devices

The final stage of any backup is the write of data to the backup device. While these devices are usually not the source of performance problems, there may be some areas of concern. When analyzing a backup system for performance, be sure to take into account the capabilities of the target devices. A backup system with 1Gb throughput throughout, but with a single LTO-1 target, will never exceed the 15MB/sec (150Mb/sec) bandwidth of that device.

  • Disk

For disk systems the biggest performance issue is the write capability of each individual disk and the number of disks (spindles) within the system. A single SATA disk can write between 75 and 100 MB/sec. An array with 10 SATA drives can, therefore, be expected to write between 750MB/sec and 1GB/sec. RAID processing overhead and inline deduplication processing will limit the speed, so expect the real performance to be somewhat lower, as much as 50% less than the raw disk performance, depending on the specific system involved. When deciding on a disk subsystem, be sure to evaluate the manufacturer’s performance specifications.
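As a rough sketch of that spindle math (the per-disk speeds and the 50% overhead factor are the assumptions from the paragraph above, not measured values; always check the manufacturer’s specifications):

```python
# Estimate usable array write throughput from spindle count, per-disk
# write speed, and a derating factor for RAID/deduplication overhead.

def array_write_estimate(disks: int, per_disk_mbs: float,
                         overhead_factor: float = 0.5) -> float:
    """Estimated usable write throughput in MB/sec."""
    return disks * per_disk_mbs * overhead_factor

# 10 SATA spindles at 75-100 MB/sec raw, derated by up to 50%:
print(array_write_estimate(10, 75))    # ~375 MB/sec low end
print(array_write_estimate(10, 100))   # ~500 MB/sec high end
```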

  • Tape

With modern high speed tape subsystems the biggest problem is not exceeding the device’s capability but not meeting the write speed. A tape device performs best when the tape is passing the heads at full speed. If data is not streamed to the tape device at a sufficient rate to continuously write, the tape will have to stop while the drive’s buffer is filled with enough data to perform the next write. In order to get up to speed, the tape must rewind a small amount and then restart. Such activity is referred to as “shoe shining” and drastically reduces the life of both the tape and the drive.
Techniques such as multiplexing (intermingling backup data from multiple clients) can alleviate the problem but be certain that the last, slow client is not still trickling data to the tape after all other backup jobs have completed. In most cases D2D2T is the best solution, provided that the disk can be read fast enough to meet the tape’s requirements.

  • Conclusion

In most backup systems there are multiple components which cause performance issues. Be certain to investigate each stage of the backup process and analyze all potential causes of poor performance.