Reviewing the thousands of electronic documents requested or produced during discovery in litigation or in an investigation takes time and, as we all know, time is money. Leveraging technology goes a long way toward cutting down on time and costs, but regardless of whether review is done manually or with the help of technology, the first step in reducing document review costs is culling data sets down to only potentially relevant documents. When we tackle large document reviews for our clients, we rely on three techniques to cull massive data sets and limit our review to the documents most likely to be relevant to the project.
Leveraging Metadata Fields
Metadata fields are a goldmine for culling redundant and irrelevant data. Useful fields include file type, file size, email domain, date, and custodian.
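As a rough sketch of this kind of metadata culling, consider the plain-Python filter below. In practice your review platform applies these as field filters; the field names, file types, and domain list here are hypothetical examples, not values from any particular matter.

```python
# Hypothetical metadata records, one per document. Field names are
# illustrative; real loadfiles vary by platform and processing vendor.
docs = [
    {"doc_id": "D1", "file_type": "docx", "email_domain": "acme.com"},
    {"doc_id": "D2", "file_type": "exe",  "email_domain": ""},
    {"doc_id": "D3", "file_type": "msg",  "email_domain": "acme.com"},
    {"doc_id": "D4", "file_type": "msg",  "email_domain": "news.example.com"},
]

IRRELEVANT_TYPES = {"exe", "dll", "tmp"}      # system/program files
IRRELEVANT_DOMAINS = {"news.example.com"}     # e.g. bulk-mail senders

def survives_cull(doc):
    """Keep a document only if no metadata field flags it as irrelevant."""
    return (doc["file_type"] not in IRRELEVANT_TYPES
            and doc["email_domain"] not in IRRELEVANT_DOMAINS)

review_set = [d for d in docs if survives_cull(d)]
```

The same pattern extends to date ranges and custodian lists: each field narrows the set before anyone opens a document.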
File Name or Email Subject
If your metadata contains file name or email subject fields, it’s easy to use these fields to filter out irrelevant data. For example, many companies send out weekly newsletters that might be irrelevant to your case, and the newsletters usually carry the same email subject. While going through the documents in your database, noting any generic and irrelevant titles can help cut down on document review volume.
Other irrelevant data that may be identified via email subject fields includes out-of-office notices, travel itineraries, meeting requests, conference agendas, HR and administrative documents, and “do not reply” emails.
You can also leverage file names to identify irrelevant documents. Scroll through the file names and pick out key words that you can search in the file name field to pull up additional irrelevant documents to cull. Be sure to sample the results to confirm that everything pulled is actually irrelevant.
Additionally, file names can be helpful in finding duplicates and near duplicates. Oftentimes the file names of duplicate documents are identical, and near duplicates are marked by version numbers.
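The subject-line technique above can be sketched as a simple keyword flagger with a quality-control sample. The phrases and function names below are hypothetical; any review platform would do this with a field search plus a random sample.

```python
import random

# Generic, case-insensitive subject phrases noted during review.
# These are examples only; build your own list from your database.
GENERIC_SUBJECTS = ["out of office", "weekly newsletter", "meeting request",
                    "do not reply", "travel itinerary"]

def flag_generic(docs, phrases=GENERIC_SUBJECTS):
    """Return documents whose email subject matches any generic phrase."""
    hits = []
    for doc in docs:
        subject = doc.get("email_subject", "").lower()
        if any(p in subject for p in phrases):
            hits.append(doc)
    return hits

def sample_for_qc(hits, k=25, seed=42):
    """Pull a random sample of flagged docs to confirm they are irrelevant."""
    return random.Random(seed).sample(hits, min(k, len(hits)))
```

The sampling step matters: mass-culling on a keyword without spot-checking is how privileged or relevant material slips out of the review set.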
Extracted Text Size
Extracted text size is not a metadata field that is always populated; however, scripts can be used to populate it. In Relativity, for instance, a pre-made script in the application library populates the size of the extracted text in kilobytes. This is helpful because culling by file size alone can be deceiving: files with no content can still carry a large file size. Using the extracted text to determine the size of the document gives you another data point for filtering out empty or trivial data.
To use this culling method, try filtering your extracted text size for 0 or another small value and examine the resulting documents. I suggest using extracted text size in conjunction with a filter for small file size, and I highly recommend reviewing a statistical sample before concluding that the values you filtered by are appropriate for culling entire sets of search results.
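A minimal sketch of the combined filter, assuming each record carries its extracted text (in Relativity the library script populates an equivalent size field for you; the threshold values below are illustrative, not recommendations):

```python
def extracted_text_kb(text):
    """Size of the extracted text in kilobytes (UTF-8 bytes / 1024)."""
    return len(text.encode("utf-8")) / 1024

def empty_or_trivial(docs, max_text_kb=0.05, max_file_kb=10):
    """Cull candidates: tiny extracted text AND a small file size.

    Requiring both guards against files whose native is large but whose
    text failed to extract, which may need re-processing, not culling.
    """
    return [d for d in docs
            if extracted_text_kb(d["extracted_text"]) <= max_text_kb
            and d["file_size_kb"] <= max_file_kb]
```

As with any cull, sample the output of `empty_or_trivial` before removing the documents from review.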
Clustering
Clustering is a feature that groups similar documents by content and concepts, and many document review platforms have clustering capabilities. Running clustering on your documents enables you to find common themes without looking at a single document: the database highlights the key concepts for each cluster. Once you view the concepts, you can identify clusters of documents that are not relevant to the matter and can be culled.
Another way to find irrelevant documents with clustering is to examine unclustered documents. Often, these documents were not pulled into cluster groups because they do not have enough searchable or relevant data. Groups of unclustered documents are often irrelevant and can be culled.
Clustering may also be leveraged during the document review phase. Setting up a dashboard with cluster visualization for responsive documents lets you quickly see whether certain clusters are consistently marked not responsive. Clicking on the cluster you want to analyze breaks down the responsiveness coding for that cluster. If a cluster contains mostly non-responsive documents, you can drill into it and review the remaining untagged documents to determine whether they too are irrelevant.
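The workflow above can be sketched in a few lines. The clustering itself is done by your platform; this hypothetical helper only consumes its output, surfacing top terms per cluster (a crude stand-in for the platform's concept labels) and collecting unclustered documents (here marked with a cluster of `None`) as cull candidates.

```python
from collections import Counter

def cluster_concepts(docs, top_n=3):
    """Top terms per cluster, as a rough proxy for each cluster's concepts."""
    terms = {}
    for d in docs:
        if d["cluster"] is None:
            continue  # unclustered documents contribute no concepts
        terms.setdefault(d["cluster"], Counter()).update(d["text"].lower().split())
    return {cid: [w for w, _ in c.most_common(top_n)]
            for cid, c in terms.items()}

def unclustered(docs):
    """Documents the platform could not cluster: frequent cull candidates."""
    return [d for d in docs if d["cluster"] is None]
```

Scanning the concept lists tells you which clusters to cull wholesale; the unclustered pile gets its own sampled review.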
Near Duplicate Analysis
Nowadays, using deduplication is standard operating procedure to cull datasets before document review, but using near duplicate analysis can also be very helpful.
Near duplicate analysis first sorts documents by size, largest to smallest, then designates the largest as the principal document against which all other documents are compared and ranked by similarity percentage. Documents similar to the principal document are grouped with it.
The textual near duplicate similarity percentage gives you an accurate measure of how similar the documents in a group are: the higher the percentage, the more similar the document. As you cull data and identify documents that are not relevant, you can use the near duplicates feature to pull in all documents with a high textual similarity percentage and mass code them. It is important to confirm that the documents are similar enough to the principal document to apply the same coding.
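The grouping pass described above can be approximated with the standard library's `difflib`. This is a sketch, not how any particular platform implements near-duplicate detection: real engines use faster shingling or hashing, and the 0.90 threshold below is illustrative.

```python
from difflib import SequenceMatcher

def near_duplicate_groups(docs, threshold=0.90):
    """Group documents around principals, largest text first.

    The largest ungrouped document becomes the principal; every document
    whose similarity ratio to it meets the threshold joins its group.
    """
    remaining = sorted(docs, key=lambda d: len(d["text"]), reverse=True)
    groups = []
    while remaining:
        principal, rest = remaining[0], remaining[1:]
        group, leftover = [principal], []
        for d in rest:
            ratio = SequenceMatcher(None, principal["text"], d["text"]).ratio()
            (group if ratio >= threshold else leftover).append(d)
        groups.append(group)
        remaining = leftover
    return groups
```

Once a principal document is coded, the rest of its group inherits the same coding after a similarity check, which is exactly the mass-coding shortcut described above.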