US-12619666-B2 - Integrated document scoring and prioritization systems and methods for enhanced document review in e-discovery
Abstract
Integrated document scoring and prioritization systems and methods for enhanced document review in e-discovery are disclosed herein. An example includes combining a first vector of scores from one algorithm and a second vector of scores from another algorithm, both corresponding to a set of documents. Ground truth labels indicating document responsiveness are also acquired. A system calculates blending weights through a supervised learning algorithm, considering the scores and ground truth labels. These weights are then employed to combine the vector scores, yielding final ranking scores for each document. The system culminates in the creation of a sorted document list, where documents with higher final ranking scores are prioritized, signifying their increased relevance within the dataset.
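The blending step summarized in the abstract can be illustrated with a minimal sketch. It assumes both score vectors are already on a comparable scale, that labels are 1 for responsive and 0 for non-responsive, and that the supervised learner is an ordinary least-squares fit of two weights with no intercept. That learner is chosen here for brevity and is not necessarily the one used in any embodiment. The names `fit_blend_weights` and `rank_documents` and all sample numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_blend_weights(s1: Sequence[float], s2: Sequence[float],
                      labels: Sequence[int]) -> Tuple[float, float]:
    """Return (w1, w2) minimizing sum((w1*a + w2*b - y)**2) over reviewed docs."""
    # Entries of the 2x2 normal equations (X^T X) w = X^T y.
    a11 = sum(a * a for a in s1)
    a12 = sum(a * b for a, b in zip(s1, s2))
    a22 = sum(b * b for b in s2)
    b1 = sum(a * y for a, y in zip(s1, labels))
    b2 = sum(b * y for b, y in zip(s2, labels))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Assumption: fall back to equal weights when the system is singular.
        return 0.5, 0.5
    # Cramer's rule for the 2x2 system.
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def rank_documents(s1: Sequence[float], s2: Sequence[float],
                   weights: Tuple[float, float]) -> List[int]:
    """Return document indices sorted by blended score, highest first."""
    w1, w2 = weights
    final = [w1 * a + w2 * b for a, b in zip(s1, s2)]
    return sorted(range(len(final)), key=lambda i: final[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for five documents; only the first four are reviewed.
    scores_a = [0.9, 0.2, 0.6, 0.1, 0.7]
    scores_b = [0.8, 0.3, 0.1, 0.2, 0.9]
    reviewed = [0, 1, 2, 3]
    labels = [1, 0, 0, 0]
    weights = fit_blend_weights([scores_a[i] for i in reviewed],
                                [scores_b[i] for i in reviewed], labels)
    # Weights fitted on reviewed documents are applied to the whole set.
    print(weights, rank_documents(scores_a, scores_b, weights))
```

Fitting on the reviewed subset and then scoring every document follows the description's remark that the weight calculation can be limited to documents that have been reviewed.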
Inventors
- Jan Stadermann
Assignees
- OPEN TEXT INC.
Dates
- Publication Date: 20260505
- Application Date: 20240517
Claims (20)
- 1. A method for enhancing document ranking in an electronic discovery system, comprising: feeding an annotated seed set of documents into a first predictive algorithm and a second predictive algorithm of a predictive engine; receiving a first vector of scores from the first predictive algorithm of the predictive engine, the first vector of scores having been generated by the first predictive algorithm based on a first machine learning model that the first predictive algorithm developed during training, the first machine learning model configured to adapt and refine its scoring process over time to determine document importance, each score corresponding to a document in a set of documents, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; receiving a second vector of scores from the second predictive algorithm of the predictive engine, the first predictive algorithm and the second predictive algorithm having operated in the predictive engine, the second vector of scores having been generated by the second predictive algorithm based on a second machine learning model that the second predictive algorithm developed during training, the second machine learning model configured to adapt and refine its scoring process over time to determine the document importance, each score corresponding to a document in the same set of documents, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; (i) receiving ground truth labels indicating if each document of the set of documents is responsive or non-responsive; (ii) calculating weights for blending the first vector of scores and the second vector of scores, where the weights are determined from a training process; (iii) combining the first vector of scores and the second vector of scores using the calculated weights to generate a final ranking score for each document of the set of documents; (iv) computing a ranked list of documents based on final ranking scores, wherein higher final ranking scores indicate documents of higher relevance; (v) picking top N documents from the ranked list to form a batch for further review; (vi) comparing the batch against the ground truth labels to identify responsive documents in the batch; (vii) annotating documents in the batch according to the ground truth labels; (viii) refining the first machine learning model and the second machine learning model by feeding the annotated documents back into the first predictive algorithm and the second predictive algorithm; and repeating steps (i)-(viii) until M responsive documents are identified, where M represents a predefined threshold.
- 2. The method of claim 1, wherein the first predictive algorithm is a Support Vector Machine (SVM).
- 3. The method of claim 1, wherein the second predictive algorithm is a Document Relation Engine (DRE).
- 4. The method of claim 1, wherein the training process includes utilizing a supervised learning algorithm to estimate the weights using the first vector of scores, the second vector of scores, and the ground truth labels.
- 5. The method of claim 4, wherein the supervised learning algorithm used to estimate the weights is a multi-layer perceptron neural network.
- 6. The method of claim 1, further comprising normalizing the scores in the first and second vectors to a predetermined range of values before calculating the weights.
- 7. The method of claim 1, wherein the ground truth labels indicate a value of one for the responsive documents and a value of zero for non-responsive documents.
- 8. The method of claim 1, wherein the final ranking score for each document is calculated by minimizing a squared error between the final ranking score and the ground truth labels.
- 9. The method of claim 1, wherein the final ranking score for each document is calculated by summing the scores for the responsive documents and non-responsive documents separately and then calculating a ratio of those sums.
- 10. The method of claim 1, wherein the final ranking score for each document is calculated based on positions of the documents in the first and second vectors of scores, with documents assigned higher positions for higher relevance.
- 11. The method of claim 10, further comprising using a softmax function to normalize a sum of the positions when calculating the weights.
- 12. A method comprising: feeding an annotated seed set of documents into a first predictive algorithm and a second predictive algorithm of a predictive engine; generating scores for each document of a set of documents using the first predictive algorithm based on a first machine learning model that the first predictive algorithm developed during training, the first machine learning model configured to adapt and refine its scoring process over time to determine document importance, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; generating scores for each document of the set of documents using the second predictive algorithm based on a second machine learning model that the second predictive algorithm developed during training, the second machine learning model configured to adapt and refine its scoring process over time to determine the document importance, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; (i) calculating weighted averages of the generated scores using the first predictive algorithm and the generated scores using the second predictive algorithm, based on weights that are determined from a training process; (ii) computing a ranked list of documents based on the calculated weighted averages; (iii) picking top N documents from the ranked list to form a batch for further review; (iv) comparing the batch against ground truth labels to identify responsive documents in the batch; (v) annotating documents in the batch according to the ground truth labels; (vi) refining the first machine learning model and the second machine learning model by feeding the annotated documents back into the first predictive algorithm and the second predictive algorithm; and repeating steps (i-vi) until M responsive documents are identified, where M represents a predefined threshold.
- 13. A system, comprising: a processor and memory for storing instructions, the processor executing the instructions to: feed an annotated seed set of documents into a first predictive algorithm and a second predictive algorithm of a predictive engine; apply the first predictive algorithm to generate scores for each document of a set of documents, the first predictive algorithm based on a first machine learning model that the first predictive algorithm developed during training, the first machine learning model configured to adapt and refine its scoring process over time to determine document importance, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; apply the second predictive algorithm to generate scores for each document of the set of documents, the second predictive algorithm based on a second machine learning model that the second predictive algorithm developed during training, the second machine learning model configured to adapt and refine its scoring process over time to determine the document importance, each score being generated based on historical human or machine decisions and leveraging past document reviews to identify patterns and relationships for determining the document importance; (i) calculate weighted averages of the generated scores of the first predictive algorithm and the generated scores of the second predictive algorithm, based on weights that are determined from a training process; (ii) compute a ranked list of documents based on the calculated weighted averages; (iii) pick top N documents from the ranked list to form a batch for further review; (iv) compare the batch against ground truth labels to identify responsive documents in the batch; (v) annotate documents in the batch according to the ground truth labels; (vi) refine the first machine learning model and the second machine learning model by feeding the annotated documents back into the first predictive algorithm and the second predictive algorithm; and repeat steps (i-vi) until M responsive documents are identified, where M represents a predefined threshold.
- 14. The system of claim 13, wherein the first predictive algorithm is a Support Vector Machine (SVM) and the second predictive algorithm is a Document Relation Engine (DRE).
- 15. The system of claim 13, wherein the calculating the weighted averages includes applying a supervised learning algorithm to estimate the weights using the generated scores and the ground truth labels, wherein the supervised learning algorithm is a multi-layer perceptron neural network.
- 16. The system of claim 13, further comprising normalizing a first vector of scores generated by the first predictive algorithm and a second vector of scores generated by the second predictive algorithm to a predetermined range of values before calculating the weighted averages.
- 17. The system of claim 13, wherein the ground truth labels indicate a value of one for responsive documents and a value of zero for non-responsive documents.
- 18. The system of claim 13, wherein the calculating the weighted averages is performed by minimizing a squared error between the weighted averages and the ground truth labels.
- 19. The system of claim 13, wherein the weighted averages are calculated by summing the scores for responsive documents and non-responsive documents separately and then calculating a ratio of those sums, and wherein the weighted averages are calculated based on positions of the set of documents in scores generated by the first predictive algorithm and scores generated by the second predictive algorithm, with documents assigned higher positions for higher relevance.
- 20. The system of claim 19, further comprising using a softmax function to normalize a sum of the positions when calculating the weighted averages.
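As a reading aid, the iterative procedure recited in claims 1, 12, and 13 can be sketched as a loop: score, normalize, blend, rank, take the top N, label the batch, and repeat until M responsive documents are found. Everything named here is hypothetical: the scorers are stand-in callables rather than an SVM or DRE, `separation_weights` is a simple placeholder for the trained weight estimate, and restricting each batch to unreviewed documents and counting seed documents toward M are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer takes the current annotations (doc id -> 0/1) and returns one score
# per document; passing annotations in stands for "feeding annotated documents back".
Scorer = Callable[[Dict[int, int]], List[float]]


def min_max(scores: Sequence[float]) -> List[float]:
    """Normalize scores to [0, 1], one possible predetermined range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def separation_weights(s1: Sequence[float], s2: Sequence[float],
                       annotations: Dict[int, int]) -> Tuple[float, float]:
    """Placeholder weight estimate: favor the scorer that better separates labels."""
    def gap(scores: Sequence[float]) -> float:
        pos = [scores[d] for d, y in annotations.items() if y == 1]
        neg = [scores[d] for d, y in annotations.items() if y == 0]
        if not pos or not neg:
            return 0.0
        return max(0.0, sum(pos) / len(pos) - sum(neg) / len(neg))
    g1, g2 = gap(s1), gap(s2)
    total = g1 + g2
    return (0.5, 0.5) if total == 0 else (g1 / total, g2 / total)


def review_loop(scorer_a: Scorer, scorer_b: Scorer,
                oracle: Callable[[int], int], seed: Dict[int, int],
                n_docs: int, batch_size: int, target: int) -> Dict[int, int]:
    """Repeat score/blend/rank/review until `target` responsive docs are found."""
    annotations = dict(seed)
    while sum(annotations.values()) < target and len(annotations) < n_docs:
        s1 = min_max(scorer_a(annotations))
        s2 = min_max(scorer_b(annotations))
        w1, w2 = separation_weights(s1, s2, annotations)
        final = [w1 * a + w2 * b for a, b in zip(s1, s2)]
        # Assumption: only unreviewed documents are eligible for the next batch.
        ranked = sorted((i for i in range(n_docs) if i not in annotations),
                        key=lambda i: final[i], reverse=True)
        for doc in ranked[:batch_size]:
            annotations[doc] = oracle(doc)  # label from ground truth / reviewer
    return annotations


if __name__ == "__main__":
    truth = [1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical ground truth
    a = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.1, 0.2]
    b = [0.5, 0.2, 0.6, 0.1, 0.2, 0.9, 0.3, 0.1]
    # Static scorers for brevity; real models would retrain on the annotations.
    result = review_loop(lambda ann: a, lambda ann: b, lambda d: truth[d],
                         seed={0: 1, 1: 0}, n_docs=8, batch_size=2, target=3)
    print(sorted(result.items()))
```

Passing the annotations into each scorer keeps the refinement step visible in the loop without committing to any particular model.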
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
N/A.
FIELD
The present disclosure pertains to document review systems and methods, and more specifically, but not by way of limitation, to integrated document scoring and prioritization systems and methods for enhanced document review in e-discovery.
SUMMARY
A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
One general aspect includes a method for enhancing document ranking in an electronic discovery system. The method also includes receiving a first vector of scores from a first algorithm, each score corresponding to a document in a set of documents. The method also includes receiving a second vector of scores from a second algorithm, each score corresponding to the same set of documents. The method also includes receiving ground truth labels indicating whether each document is responsive or non-responsive. It will be understood that ongoing human review can comprise all or part of the ground truth. Thus, the weight calculation can be limited to the documents that have been reviewed. Once weights have been calculated, new scores can be assigned to all or a portion of the documents. The method also includes calculating weights for blending the first vector of scores and the second vector of scores, where the weights are determined based on a training process. The method also includes combining the first vector of scores and the second vector of scores using the calculated weights to generate a final ranking score for each document.
The method also includes creating a sorted list of documents based on the final ranking scores, where higher final ranking scores indicate documents of higher relevance. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The method where the first algorithm is a support vector machine (SVM). The second algorithm is a document relation engine (DRE). The training process includes utilizing a supervised learning algorithm to estimate the weights using the scores and ground truth labels. The supervised learning algorithm used to estimate the weights is a multi-layer perceptron neural network. The method may include normalizing the scores in the first and second vectors to a predetermined range of values before calculating the weights. The ground truth labels indicate a value of one for responsive documents and a value of zero for non-responsive documents. The final ranking score for each document is calculated by minimizing a squared error between the final ranking score and the ground truth score. The final ranking score for each document is calculated by summing the scores for responsive documents and non-responsive documents separately and then calculating a ratio of those sums. The final ranking score for each document is calculated based on the positions of the documents in the first and second vectors, with documents assigned higher positions for higher relevance. The method may include using a softmax function to normalize a sum of positions when calculating the weights. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
One general aspect includes a method for enhancing document ranking in an electronic discovery system.
The method also includes receiving a plurality of documents for review. The method also includes generating scores for each document using a first algorithm. The method also includes generating scores for each document using a second algorithm. The method also includes assigning ground truth values (once generated from human review) indicating document responsiveness or non-responsiveness to each document. The method also includes calculating weighted averages of the scores generated based on weights. The method also includes sorting the plurality of documents based on the calculated weighted averages. The method also includes selecting documents for review from the sorted plurality of documents based on sorted positions. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
One general aspect includes a system. The system also includes a processor and memory for storing instructions, the processor executing the instructions to