In “A Beginner’s Guide to Key Predictive Coding Terms: Part 1,” we shared the first set of terms commonly used in predictive coding, both to help you navigate the field and to give you a firm foundation for understanding what service providers are talking about. In this post, we look at six additional terms commonly heard when predictive coding is up for discussion.
Recall

One of the two key metrics for measuring predictive-coding accuracy, recall is the percentage of documents belonging to a category (responsive, privileged, hot, etc.) that were correctly identified. It is a measure of completeness: of all the documents that should have been found, what percentage were found? Academic studies have measured attorney recall at below 50%. To the author’s knowledge, the only specific recall level ever approved by judicial order is 75%.
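In standard information-retrieval terms, recall is true positives divided by the total number of documents that actually belong to the category (true positives plus false negatives). A minimal sketch, with illustrative numbers:

```python
def recall(true_positives, false_negatives):
    """Of all documents that should have been found, what fraction were found?"""
    return true_positives / (true_positives + false_negatives)

# Hypothetical example: the software finds 8,000 of 10,000 truly responsive documents.
print(recall(8_000, 2_000))  # 0.8, i.e. 80% recall
```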
Precision

The second key metric for measuring predictive-coding accuracy, precision refers to the percentage of documents identified as belonging to a category that do, in fact, belong to the category. It is a measure of over-inclusiveness: of all the documents identified as responsive to a tag, what percentage actually were responsive?
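Precision, by contrast, divides true positives by everything the software tagged (true positives plus false positives). A minimal sketch, with illustrative numbers:

```python
def precision(true_positives, false_positives):
    """Of all documents tagged as responsive, what fraction actually were?"""
    return true_positives / (true_positives + false_positives)

# Hypothetical example: 8,000 of 9,500 documents tagged responsive truly are.
print(round(precision(8_000, 1_500), 3))  # 0.842, i.e. ~84% precision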
Error Rate

The rate at which predictive-coding software incorrectly categorizes documents. The error rate can be misleading, which is why recall and precision are the preferred metrics. Imagine a document production from a population of 1 million documents, of which 1% (10,000) are relevant, and predictive-coding software that identifies every document as nonresponsive. The error rate would be an impressive-sounding 1% even though the software missed all 10,000 relevant documents.
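The arithmetic behind this worked example can be spelled out directly (figures taken from the text):

```python
# The example from the text: 1,000,000 documents, 10,000 (1%) relevant,
# and software that wrongly tags every document as nonresponsive.
total = 1_000_000
relevant = 10_000
misclassified = relevant  # every relevant document is the software's only error

error_rate = misclassified / total
true_recall = 0 / relevant  # zero relevant documents were actually found

print(error_rate)  # 0.01 -> a misleading "1% error rate"
print(true_recall)  # 0.0 -> recall exposes the complete failure
```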
Prevalence

The proportion of documents in a population that belong to a given category. For example, in the context of production, suppose that 1 million documents have been collected, of which 100,000 are actually relevant and must be produced. Prevalence would be 10% (100,000 / 1,000,000). The lower the prevalence, the larger the training set generally must be.
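The calculation in the example is a simple ratio:

```python
def prevalence(relevant_count, total_count):
    """Fraction of a document population belonging to a given category."""
    return relevant_count / total_count

# The example from the text: 100,000 relevant documents out of 1,000,000 collected.
print(prevalence(100_000, 1_000_000))  # 0.1, i.e. 10% prevalence
```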
Control Set

The control set is a random sample that attempts to represent the universe of documents the predictive-coding software will analyze. The control set is used to measure accuracy. Some software providers, such as Dagger Analytics, allow parties to use part or all of the training set as the control set, reducing the number of documents requiring review. The control set is typically a few hundred to a few thousand documents, depending on prevalence and the desired margins of error.
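One common way to see why control sets land in the hundreds-to-thousands range is the textbook sample-size formula for estimating a proportion at a given confidence level. This is a generic statistical sketch with illustrative inputs, not any particular vendor's methodology:

```python
import math

def sample_size(prevalence, margin_of_error, z=1.96):
    """Simple random-sample size to estimate a proportion at ~95% confidence
    (z = 1.96). A textbook approximation, not a vendor-specific method."""
    p = prevalence
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# Hypothetical example: 10% prevalence, +/-2% margin of error.
print(sample_size(0.10, 0.02))  # 865 documents
```

Note how the required sample grows as the margin of error tightens, consistent with the text's point that the desired margins of error drive control-set size.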
Clustering

A type of Technology Assisted Review that uses linguistic analysis to group documents relating to the same concept (for example, documents dealing with marketing). Groups can then be sampled and bulk-categorized (e.g., as responsive or non-responsive), analyzed for inconsistent coding, or assigned to reviewers with the appropriate expertise.
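Real clustering tools rely on linguistic and statistical models; the toy sketch below, with made-up documents and keywords, only illustrates the idea of grouping documents by concept so each group can be sampled or bulk-handled:

```python
from collections import defaultdict

# Hypothetical mini-corpus and concepts, purely for illustration.
docs = {
    1: "quarterly marketing campaign budget",
    2: "marketing strategy for the new product",
    3: "patent filing deadline",
    4: "patent license negotiation",
}
concepts = ["marketing", "patent"]

# Group document IDs under each concept keyword they mention.
clusters = defaultdict(list)
for doc_id, text in docs.items():
    for concept in concepts:
        if concept in text:
            clusters[concept].append(doc_id)

print(dict(clusters))  # {'marketing': [1, 2], 'patent': [3, 4]}
```

A reviewer could then bulk-tag the "patent" group or route the "marketing" group to a reviewer with the relevant expertise.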
Written by: Bruce Ellis Fein
Bruce Ellis Fein pioneered the field of legal predictive coding from its inception, starting in 2007 as co-founder of Backstop LLP (subsequently acquired by Consilio) and more recently with Dagger Analytics. Before entering the data analytics field, he was a litigation associate at Kellogg Hansen Todd Figel & Frederick and Sullivan & Cromwell.
This article was originally published in the Chronicle of E-Discovery.