May 24, 2018
For many e-discovery professionals, deduplication is one part of the data processing workflow that remains veiled in mystery despite being integral to the process. There are many ways to perform deduplication, and each processing platform has its own method. Regardless of the platform, deduplication always involves creating a hash value for each document in a data set, comparing each hash value against all others in the designated set, "removing" duplicative documents, and populating duplicate item fields for tracking purposes. We wanted to lift the veil on some of the finer points of deduplication.
Creating hash values. The hash value is a unique identifier for a document. It is generated upon data ingestion regardless of whether deduplication is enabled, and it is used during deduplication to identify identical documents. Many hashing algorithms exist, but three are most commonly used in e-discovery – MD5, SHA1, and SHA256.
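To make the three algorithms concrete, here is a minimal sketch using Python's standard `hashlib` module. It hashes the raw bytes of a hypothetical document; note that real processing platforms hash a platform-specific combination of metadata and content, not just the raw file bytes.

```python
import hashlib

# Illustrative only: hash the raw bytes of a sample "document" with the
# three algorithms most commonly used in e-discovery.
document_bytes = b"Example document content"

md5_hash = hashlib.md5(document_bytes).hexdigest()        # 128-bit digest -> 32 hex chars
sha1_hash = hashlib.sha1(document_bytes).hexdigest()      # 160-bit digest -> 40 hex chars
sha256_hash = hashlib.sha256(document_bytes).hexdigest()  # 256-bit digest -> 64 hex chars

print(md5_hash)
print(sha1_hash)
print(sha256_hash)
```

Whichever algorithm is used, the same input bytes always produce the same hash value, which is what makes the hash usable as a document identifier.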
Since hash values are generated by the processing software, it is possible that the same document processed in two different platforms could have different hash values. This is because each platform chooses which fields, metadata, and content are included in the hash computation. This is an important consideration when comparing or migrating data across platforms.
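A small hypothetical sketch illustrates why this happens. The two "platforms" below are invented for illustration; each feeds a different selection of fields into the hash, so the same email produces two different hash values.

```python
import hashlib

# Hypothetical email record; the platform names and field choices below are
# illustrative, not any vendor's actual hashing formula.
email = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "subject": "Quarterly report",
    "body": "Please see the attached report.",
}

def platform_a_hash(doc):
    # Platform A includes sender, recipient, subject, and body.
    data = "|".join([doc["from"], doc["to"], doc["subject"], doc["body"]])
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def platform_b_hash(doc):
    # Platform B includes only subject and body.
    data = "|".join([doc["subject"], doc["body"]])
    return hashlib.md5(data.encode("utf-8")).hexdigest()

# Same document, different hash inputs -> different hash values.
print(platform_a_hash(email))
print(platform_b_hash(email))
```

Because the hash inputs differ, hash values from one platform cannot be compared directly against hash values from another.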
Comparing hash values. Before comparing document hash values, we must first specify the scope of the deduplication. There are three primary deduplication scopes – Global, Custodial, and No Deduplication. Global deduplication compares the new document against all documents previously processed in a case or project. Custodial deduplication compares the new document only to previously processed documents that share the same custodian. No deduplication, as the name suggests, does not compare new documents against previously processed documents at all.
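The three scopes can be sketched in a few lines. This is a simplified model, assuming each document is just a (custodian, hash) pair, and is not any platform's actual implementation.

```python
def deduplicate(docs, scope):
    """Partition (custodian, hash) pairs into kept documents and duplicates
    under the given scope: "global", "custodial", or "none"."""
    seen_global = set()
    seen_by_custodian = {}
    kept, removed = [], []
    for custodian, doc_hash in docs:
        if scope == "global":
            is_dupe = doc_hash in seen_global
        elif scope == "custodial":
            is_dupe = doc_hash in seen_by_custodian.get(custodian, set())
        else:  # "none": every document is kept
            is_dupe = False
        seen_global.add(doc_hash)
        seen_by_custodian.setdefault(custodian, set()).add(doc_hash)
        (removed if is_dupe else kept).append((custodian, doc_hash))
    return kept, removed

# Bob holds two copies of the document Alice also holds ("h1").
docs = [("Alice", "h1"), ("Bob", "h1"), ("Bob", "h1"), ("Bob", "h2")]
```

Running this over the sample set shows the difference: global deduplication keeps only the first copy of "h1" anywhere (2 kept, 2 removed), custodial deduplication keeps one copy of "h1" per custodian (3 kept, 1 removed), and no deduplication keeps all four.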
The comparison of hash values may occur at different stages of the processing workflow, depending on the platform in which the data is processed. While global deduplication improves review efficiency by further reducing the reviewable item count, it also impacts other areas of processing and review, such as searching.
Document type. Since each processing platform determines its own method of hashing documents, there is typically some variance in which fields and content are included in the data used to generate the hash. This variance carries over to document type as well. Email files are hashed using a combination of metadata, header information, and content, while loose files are typically hashed based on some combination of content and metadata. While understanding the minute differences between the two is not necessary, being aware that a difference exists may prove useful during inquiries and data investigations.
Filtering and searching. Filtering and searching may also be impacted by deduplication if documents normally returned in the filter or search are removed during deduplication. This tends to occur more often in global deduplication, as documents for a new custodian may be removed as duplicates of previously processed documents from a different custodian. If a search or filter is targeted at the new custodian, the result set may be incomplete due to the deduplication. This showcases the importance of the duplicate custodian and all-custodians fields.
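One way to see why those tracking fields matter is the sketch below. It is a hypothetical model, assuming global deduplication and an "all custodians" field recorded per hash: when Bob's copy is removed as a duplicate of Alice's, Bob is still recorded on the surviving document, so a search scoped to Bob can still surface it.

```python
from collections import defaultdict

surviving = {}                      # hash -> custodian of the surviving copy
all_custodians = defaultdict(set)   # hash -> every custodian who held a copy

def ingest(custodian, doc_hash):
    """Globally deduplicate one document, recording every custodian."""
    all_custodians[doc_hash].add(custodian)
    if doc_hash not in surviving:
        surviving[doc_hash] = custodian
        return "kept"
    return "removed as duplicate"

def search_by_custodian(custodian):
    # Searching the all-custodians field finds documents even when the
    # surviving copy belongs to someone else.
    return sorted(h for h, holders in all_custodians.items() if custodian in holders)

ingest("Alice", "h1")
ingest("Bob", "h1")   # removed globally, but Bob is still tracked on "h1"
```

Without the all-custodians field, a search limited to Bob's documents would miss "h1" entirely, even though Bob held a copy.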
Additionally, since some fields are not taken into account when creating the document's hash value, it is possible that some documents identified as duplicates are not exact matches. This can result when fields such as BCC are omitted from the hash.
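The BCC case can be sketched directly. In this hypothetical example, BCC is deliberately left out of the hash input, so two emails that differ only in their BCC field produce identical hash values and deduplicate as "identical" documents.

```python
import hashlib

# Hypothetical field selection: BCC is deliberately excluded from the hash.
HASH_FIELDS = ["from", "to", "subject", "body"]

def dedup_hash(email):
    data = "|".join(email.get(field, "") for field in HASH_FIELDS)
    return hashlib.md5(data.encode("utf-8")).hexdigest()

email_a = {"from": "a@example.com", "to": "b@example.com",
           "subject": "Hi", "body": "Hello.", "bcc": ""}
email_b = dict(email_a, bcc="c@example.com")  # same message, different BCC

# Same hash -> treated as duplicates, despite the differing BCC field.
print(dedup_hash(email_a) == dedup_hash(email_b))
```

This is why a document removed as a "duplicate" may not be byte-for-byte identical to its surviving counterpart.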
While we are all familiar with the benefits of using deduplication in e-discovery processing, hopefully you now have a better understanding of the deduplication process and some of the finer points that may arise when employing it. This post only scratches the surface, and there are many resources available for those interested in taking a deeper look into this complex process.
Brandy Dorris is a Senior Encompass Content Analytics Platform (EnCAP) Analyst who specializes in data processing, deduplication and culling. She is based in Encompass's Nashville office and has worked in electronic discovery for 5 years.
These materials have been prepared for informational purposes only and are not legal advice. This information is not intended to create, and receipt of it does not constitute, an attorney-client relationship. Internet subscribers and online readers should not act upon this information without seeking professional counsel.