Predictive coding is increasingly used, has been embraced by courts, and is even more defensible under the amended Federal Rules of Civil Procedure. Because of the time and expense it saves, it is now indispensable to large-scale document review and production. Below is a beginner’s guide to some key terms used in the predictive coding field.
Predictive coding is the use of technology to categorize documents for relevance, privilege, “hotness” and the like according to their text and metadata. Predictive coding involves software that “learns” from the attorney coding of documents. The software analyzes this training data to discern indicia of relevance, privilege, etc. and classifies every document in the data set accordingly. That classification may be binary (yes or no) or may take the form of a probability that the document belongs to each category.
Technology-assisted review (TAR), a.k.a. computer-assisted review (CAR), is a broad term encompassing the use of any technology to assist in document review. Predictive coding is a type of TAR. Other types include keyword searching, mass Boolean searching, deduping, near-deduping, email threading and concept clustering. The various types of TAR may be combined.
The lodestar for all eDiscovery, Proportionality is the principle that the burden of production must correspond to the importance of the evidence and the stakes of the case. Until December 1, 2015, Proportionality was embodied in Fed. R. Civ. P. 26(b)(2)(C)(iii) (permitting a court to limit discovery where “the burden or expense of the proposed discovery outweighs its likely benefit . . .”), but now it resides in Rule 26(b)(1) (limiting the scope of discovery to matter that is relevant “and proportional to the needs of the case, considering” various factors). Predictive coding generally advances Proportionality by making it less burdensome to separate responsive from nonresponsive documents.
Promulgated by the influential Sedona Conference in July 2008 and endorsed in numerous judicial opinions, the Cooperation Proclamation calls for “cooperative, collaborative, transparent” eDiscovery. Courts have invoked the Cooperation Proclamation to encourage parties to accept and negotiate the use of predictive coding.
A party wishing to use predictive coding may seek the other side’s agreement to a protocol, endorsed by the court, that describes how the party will conduct predictive coding. Key topics to address include the following:
- how to identify and gather the universe of documents for analysis (including whether and when to use keyword filtering);
- how exemplars of relevant and irrelevant documents will be identified and fed to the software;
- how to set accuracy benchmarks (such as recall and precision) and how to verify those benchmarks;
- whether to permit the clawback of inadvertently produced privileged documents without waiver (in conjunction with Fed. R. Evid. 502); and
- what role, if any, the requesting party will have in reviewing the training data.
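The accuracy benchmarks mentioned in the list above can be illustrated with a short calculation. Recall is the share of truly relevant documents that the software found; precision is the share of documents the software flagged that are truly relevant. The sketch below is only an illustration: the function name and the ten-document validation sample are hypothetical, and a real protocol would specify its own sampling and verification method.

```python
def recall_and_precision(predicted, actual):
    """Compare software predictions with attorney coding on a validation sample.

    predicted, actual: equal-length lists of booleans (True = relevant).
    Returns (recall, precision); a ratio with a zero denominator is reported as 0.0.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)       # flagged and relevant
    false_pos = sum(1 for p, a in pairs if p and not a)  # flagged but not relevant
    false_neg = sum(1 for p, a in pairs if not p and a)  # relevant but missed

    found_or_missed = true_pos + false_neg
    flagged = true_pos + false_pos
    recall = true_pos / found_or_missed if found_or_missed else 0.0
    precision = true_pos / flagged if flagged else 0.0
    return recall, precision


# Hypothetical validation sample of ten documents.
predicted = [True, True, True, True, True, False, False, False, False, False]
actual = [True, True, True, False, False, True, False, False, False, False]

recall, precision = recall_and_precision(predicted, actual)
print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.75, precision=0.60
```

In this sample the software found three of the four relevant documents (recall of 75%), and three of the five documents it flagged were relevant (precision of 60%).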
“Training data,” a.k.a. “training documents,” “tagging,” “tagging data,” “seed set”
Documents that reviewers, usually attorneys, have categorized (“tagged”) for relevance, privilege, legal issues, overall “hotness” and the like. Software uses the training data to build a model that predictively codes the documents in the data set.
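To make the idea of software “learning” from training data concrete, here is a minimal sketch of one common text-classification technique, a naive Bayes model with add-one smoothing. It is an assumption chosen for illustration, not a description of how any particular predictive coding product works. The function names and the four-document seed set are hypothetical, and the sketch assumes the seed set contains at least one relevant and one irrelevant document.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(tagged_docs):
    """Build a model from (text, is_relevant) pairs coded by reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in tagged_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_probability(model, text):
    """Estimate the probability that an uncoded document is relevant."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from how common each label is in the seed set.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen word/label pairs above zero.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Convert the two log scores into a probability for the "relevant" label.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


# Hypothetical seed set: attorney-tagged examples (True = relevant).
seed_set = [
    ("merger pricing agreement draft", True),
    ("pricing discussion with competitor", True),
    ("lunch order for friday", False),
    ("office holiday party schedule", False),
]
model = train(seed_set)

for doc in ("competitor pricing memo", "friday lunch schedule"):
    probability = relevance_probability(model, doc)
    is_relevant = probability >= 0.5  # binary coding derived from the probability
    print(f"{doc!r}: probability={probability:.2f}, relevant={is_relevant}")
```

The first uncoded document scores about 0.86 and the second about 0.11, which shows both forms of classification: a probability for each document and, once a cutoff is chosen, a binary yes-or-no call. The 0.5 cutoff here is arbitrary.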
See, e.g., Da Silva Moore v. Publicis Groupe, No. 11-cv-01279, 2012 WL 607412, at *11 (S.D.N.Y. Feb. 24, 2012).
See also, “A Beginner’s Guide to Key Predictive Coding Terms: Part 2”
Written by: Bruce Ellis Fein
Bruce Ellis Fein pioneered the field of legal predictive coding from its inception, starting in 2007 as co-founder of Backstop LLP (subsequently acquired by Consilio) and more recently with Dagger Analytics. Before entering the data analytics field, he was a litigation associate at Kellogg Hansen Todd Figel & Frederick and Sullivan & Cromwell.
This article was published initially in the Chronicle of E-Discovery.