Technology
Traditional Deduplication
The number of terabytes managed per storage administrator is growing. Most of this growth is in unstructured data: email, photos, videos, web pages, and other digital content that is not stored in databases. The growth of unstructured data has consistently outpaced both the increase in disk sizes and the decrease in individual disk prices. As a consequence, end users and the storage industry alike have been looking for new ways to get out ahead of storage growth, contain costs, and simplify management.
How Backup Stacks Up
Block deduplication is a relatively new technology that is now routinely used to address data growth. However, its methodology tends to yield the best results for backups and structured data. Block-level dedupe works when there are multiple duplicate versions of the same file because it looks at the file’s raw bits, the 0s and 1s. When a document is backed up over and over again, the 0s and 1s stay the same because the file is simply duplicated. Block dedupe can identify the duplicate copies because their sequences of 0s and 1s are exactly the same.
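To make the mechanism concrete, here is a minimal sketch of block-level deduplication. It assumes fixed-size 4 KB blocks and SHA-256 fingerprints; the block size, the fingerprint function, and the names dedupe and restore are illustrative assumptions, not a description of any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real systems vary


def dedupe(files: dict[str, bytes], block_size: int = BLOCK_SIZE):
    """Store each distinct block once and keep a per-file list of fingerprints."""
    store: dict[bytes, bytes] = {}          # fingerprint -> block contents
    recipes: dict[str, list[bytes]] = {}    # file name -> ordered fingerprints
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            fingerprint = hashlib.sha256(block).digest()
            store.setdefault(fingerprint, block)  # only the first copy is kept
            recipe.append(fingerprint)
        recipes[name] = recipe
    return store, recipes


def restore(store: dict[bytes, bytes], recipe: list[bytes]) -> bytes:
    """Rebuild a file by concatenating its blocks in recipe order."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Eight distinct 4 KB blocks stand in for one document that is backed up twice.
document = b"".join(hashlib.sha256(bytes([i])).digest() * 128 for i in range(8))
backups = {"monday.bak": document, "tuesday.bak": document}

store, recipes = dedupe(backups)
references = sum(len(recipe) for recipe in recipes.values())
print(len(store), "unique blocks stored for", references, "block references")
# prints: 8 unique blocks stored for 16 block references
```

Because the second backup is bit-for-bit identical to the first, every one of its blocks is already in the store, so it adds only a list of fingerprints.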
Online data is different. It has few exact duplicates; instead, many files share a lot of similar content without being identical. Furthermore, the majority of files contributing to storage growth are already compressed by their applications: images and video (JPEG, MPEG, TIFF, GIF, PNG), compound documents (zip, email, HTML, web pages, PDFs), and Microsoft Office files (PowerPoint, Word, Excel, SharePoint, etc.). Block deduplication isn’t effective on already compressed files because compressing a file changes its 0s and 1s from the original format, so files that were similar before compression no longer share identical sequences of bits.
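The effect of compression can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: it uses zlib as a stand-in for application-level compression, 64-byte blocks to keep the sample small, and a hypothetical helper named shared_block_fraction. It measures how many blocks of an edited file already exist in the original, before and after both are compressed.

```python
import hashlib
import zlib


def shared_block_fraction(a: bytes, b: bytes, block_size: int = 64) -> float:
    """Fraction of b's fixed-size blocks whose fingerprint also occurs in a."""
    seen = {
        hashlib.sha256(a[i:i + block_size]).digest()
        for i in range(0, len(a), block_size)
    }
    blocks_b = [b[i:i + block_size] for i in range(0, len(b), block_size)]
    if not blocks_b:
        return 0.0
    hits = sum(hashlib.sha256(block).digest() in seen for block in blocks_b)
    return hits / len(blocks_b)


# Two nearly identical files: the second differs from the first in one byte.
original = "".join(f"record {i}: value {i * i}\n" for i in range(5000)).encode()
edited = b"X" + original[1:]

raw_fraction = shared_block_fraction(original, edited)
compressed_fraction = shared_block_fraction(
    zlib.compress(original), zlib.compress(edited)
)

print(f"shared blocks, uncompressed: {raw_fraction:.1%}")
print(f"shared blocks, compressed:   {compressed_fraction:.1%}")
```

Uncompressed, only the block containing the edit differs. After compression, the one-byte change typically alters the encoded bit stream from that point on, so few if any blocks line up, and a block-level deduplicator sees two almost entirely different files.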