Backup systems, while a necessary part of any well managed IT system, are often a large source of headaches for IT staff. One of the biggest issues with any back system is poor performance. It is often assumed that performance is related to the efficiency of the backup software or the performance capabilities of backup hardware. There are, however, many places within the entire backup infrastructure that could create a bottleneck.
Weekly and nightly backups tend to place a much higher load on systems than normal daily activities. For example a standard file server may access around 5% of its files during the course of a day but a full backup reads every file on the system. Backups put strain on all components of a system from the storage through the internal buses to the network. A weakness in any component along the path can cause performance problems. Starting with the backup client itself, let’s look at some of the issues which could impact backup performance.
- File size and file system tuning
A file system with many small files is generally slower to back up than one with the same amount of data in fewer large files. Generally systems with home directories and other shares which house user files will take longer to back up than database servers and systems with fewer large files. The primary reason for this is due to the overhead involved in opening and closing a file
In order to read a file the operating system must first acquire the proper locks then access the directory information to ascertain where the data is located on the physical disk. After the data is read, additional processing is required to release those locks and close the file. If the amount of time required to read on block of data is x, then it is a minimum of 2-3x to perform the open operations and x to perform the close. The best case scenario, therefore, would require 4x to open, read and close a 1 block file. A 100 block file would require 103x. A file system with a 4 100 block files will require around 412x to back up. The same amount of data stored in 400 1 block files would require 1600x or about 4 times as much time.
So, what is the solution? Multiple strategies exist which can help alleviate the situation.
The use of synthetic full backups only copies the changed files from the client to the backup server (as with an incremental backup) and a new full is generated on the backup server from the previous full backup and the subsequent incrementals. A synthetic full strategy at a minimum requires multiple tape drives and disk based backup is recommended. Adequate server I/O performance is a must as well since the creation of the synthetic full requires a large number of read and write operations.
Another strategy can be to use storage level snapshots to present the data to the backup server. The snapshot will relieve the load from the client but will not speed up the overall backup as the open/close overhead still exists. It just has been moved to a different system. Snapshots can also be problematic if the snapshot is not properly synchronized with the original server. Backup data can be corrupted if open files are included in the snapshot.
Some backup tools allow for block level backups of file systems. This removes the performance hit due to small files but requires a full file system recovery to another server in order to extract a single file.
Continuous Data Protection (CDP) is a method of writing the changes within a file system to another location either in real time or at regular, short intervals. CDP overcomes the small file issue by only copying the changed blocks but requires reasonable bandwidth and may put an additional load on the server.
Moving older, seldom accessed files to a different server via file system archiving tools will speed up the backup process while also reducing required investment in expensive infrastructure for unused data.
A system with a lot of fragmentation can take longer to back up as well. If large files are broken up into small pieces a read of that file will require multiple seek operations as opposed to a sequential operation if the file has no fragmentation.
File systems with a large amount of fragmentation should regularly utilize some sort of de-fragmentation process which can impact both system and backup performance.
In some cases a client system may be perfectly suited for the application but not have adequate internal bandwidth for good backup performance. A backup operation requires a large amount of disk read operations which are passed along a system’s internal bus to the network interface card (NIC). Any slow device along the path from the storage itself, through the host bus adapter, the system’s backplane and the NIC can cause a bottleneck.
Short of replacing the client hardware the solution to this issue is to minimize the effect on the remainder of the backup infrastructure. Strategies such as backup to disk before copying to tape (D2D2T) or multiplexing limit the adverse effects of a slow backup on tape performance and life. In some cases a CDP strategy might be considered as well.
Network bandwidth and latency can also affect the performance of a backup system. A very common issue arises when either a client or media server has connected to the network but the automatic configuration has set the connection to a lower speed or incorrect duplex. Using 1Gb/sec hardware has no advantage when the port is incorrectly set to 10Mb/half duplex.
Remote sites can also cause problems as those sites often utilize much slower speeds than local connections. Synthetic full backups can alleviate the problem but if there is a high daily change rate may not be ideal. CDP is often a good fit, as long as the change rate does not exceed the available bandwidth. In many cases a remote media server with deduplicated disk replicated to the main site is the most efficient method for remote sites.
Like each client system the media server can have internal bandwidth issues. When designing a backup solution be certain that systems used for backup servers have adequate performance characteristics to meet requirements. Often a site will choose an out of production server to become the backup system. While such systems usually meet the performance needs of a backup server, in many cases obsolete servers are not up to the task.
In some cases a single media server cannot provide adequate throughput to complete the backups within required windows. In these cases multiple media servers are recommended. Most enterprise class backup software allows for sharing of tape and disk media and can automatically load balance between media servers. In such cases multiple media servers allow for both performance and availability advantages.
When designing the Storage Area Network (SAN) be certain that the link bandwidth matches the requirements of attached devices. A single LTO-4 tap drive writes data at 120MB/sec. In network bandwidth terms this is equivalent to 1.2Gb/sec. If this tape drive is connected to an older 1Gb SAN, the network will not be able to write at tape speeds. In many cases multiple drives are connected to a single Fibre Channel link. This is not an issue if the link allows for at least the bandwidth of the total of the connected devices. The rule of thumb for modern LTO devices and 4Gb Fibre Channel is to put no more than 4 LTO-3 and no more than 2 LTO-4 drives on a single link.
For disk based backup media, be certain that the underlying network infrastructure (LAN for network attached or iSCSI disk and SAN for Fibre Channel) can support the required bandwidth. If a network attached disk system can handle 400MB/sec writes but is connected to a single 1Gb/sec LAN it will only be able to write up to the network speed, 100MB./sec. In such a case, 4 separate 1Gb connections will be required to meet the disk system’s capabilities.
The final stage of any backup is the write of data to the backup device. While these devices are usually not the source of performance problems there may be some areas of concern. When analyzing a backup system for performance, be sure to take into account the capabilities of the target devices. A backup system with 1Gb throughput throughout the system with a single LTO-1 target will never exceed the 15MB/sec (150Mb/sec) bandwidth of that device.
For disk systems the biggest performance issues is the write capability of each individual disk and the number of disks (spindles) within the system. A single SATA disk can write between 75 and 100 MB/sec. An array with 10 SATA drives can, therefore, expect to be able to write between 750MB/sec and 1GB/sec. RAID processing overhead and inline deduplication processing will limit the speed so except the real performance to be somewhat lower, as much as 50% less than the raw disk performance depending on the specific system involved. When deciding on a disk subsystem, be sure to evaluate the manufacturer’s performance specifications.
With modern high speed tape subsystems the biggest problem is not exceeding the device’s capability but not meeting the write speed. A tape device performs best when the tape is passing the heads at full speed. If data is not streamed to the tape device at a sufficient rate to continuously write, the tape will have to stop while the drive’s buffer is filled with enough data to perform the next write. In order to get up to speed, the tape must rewind a small amount and then restart. Such activity is referred to as “shoe shining” and drastically reduces the life of both the tape and the drive.
Techniques such as multiplexing (intermingling backup data from multiple clients) can alleviate the problem but be certain that the last, slow client is not still trickling data to the tape after all other backup jobs have completed. In most cases D2D2T is the best solution, provided that the disk can be read fast enough to meet the tape’s requirements.
In most backup systems there are multiple components which cause performance issues. Be certain to investigate each stage of the backup process and analyze all potential causes of poor performance.