The removal of near-duplicate records from a dataset in order to avoid repetitive review. Near-deduplication can encompass removal of all documents relating to the same subject matter, such as a recurring and irrelevant sales report. It can also encompass text analysis to remove documents that differ only immaterially, such as documents bearing different or multiple Bates numbers.
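
As an illustration of the text-analysis approach, the following is a minimal sketch, not a description of any particular review platform. It assumes that Bates numbers look like a short run of capital letters followed by four or more digits, and that two documents count as near-duplicates when the Jaccard similarity of their three-word shingles meets a threshold of 0.8. The function names (`normalize`, `shingles`, `jaccard`, `near_dedupe`), the Bates pattern, and the threshold are all hypothetical choices made for this example.

```python
import re

# Assumed Bates-number shape: 2-6 capital letters, optional separator, 4+ digits.
# Real productions use many formats; adjust the pattern to the matter at hand.
BATES_PATTERN = re.compile(r"\b[A-Z]{2,6}[-_ ]?\d{4,}\b")


def normalize(text: str) -> str:
    """Drop Bates-like tokens, lowercase, and collapse whitespace."""
    without_bates = BATES_PATTERN.sub(" ", text)
    return " ".join(without_bates.lower().split())


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def near_dedupe(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not too similar to one already kept."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, earlier) < threshold for earlier in kept):
            kept.append(doc)
    return kept


docs = [
    "Weekly sales report for the northeast region. ABC-000123",
    "Weekly sales report for the northeast region. ABC-000124 ABC-000125",
    "Minutes of the board meeting held in March.",
]
unique_docs = near_dedupe(docs)
```

In this sketch the first two documents differ only in their Bates numbers, so the second is treated as a near-duplicate and dropped, while the unrelated third document is kept. The pairwise comparison is quadratic in the number of documents; larger collections would typically need an approximate technique such as MinHash, which is outside the scope of this example.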