US-20260127906-A1 - SYSTEMS AND METHODS FOR IDENTIFYING DUPLICATE DOCUMENTS AND DETECTING MISREPRESENTATION

Abstract

Systems, methods, and non-transitory computer readable media configured for identifying duplicate and misrepresented documents are provided. At least one processor may retrieve, from a first source, a first document, and may retrieve, from a second source, a second document. The processor may process each document. The processor may determine a cosine similarity between a first set of numbers and a second set of numbers, and whether the cosine similarity exceeds a first threshold. The processor may determine a number of words in common between the two documents, and whether the number of words in common exceeds a second threshold. The processor may determine a number of sentences in common between the two documents, and whether that number exceeds a third threshold. Responsive to a determination that the first threshold, the second threshold, or the third threshold is exceeded, the processor may set a flag indicating that the second document is a duplicate.
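
The three checks described in the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation. It assumes a simple bag-of-words count vector for each document, counts shared words and sentences by direct set intersection, and uses placeholder threshold values; the function names (`tokenize`, `sentences`, `cosine_similarity`, `is_duplicate`) and the default thresholds are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sentences(text: str) -> set[str]:
    """Split on sentence-ending punctuation and normalize each sentence."""
    parts = re.split(r"[.!?]+", text)
    return {" ".join(tokenize(p)) for p in parts if tokenize(p)}


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(first: str, second: str,
                 cos_threshold: float = 0.8,      # assumed value
                 word_threshold: int = 100,       # assumed value
                 sentence_threshold: int = 3) -> bool:  # assumed value
    """Flag the second document if any one of the three thresholds is exceeded."""
    vec_a, vec_b = Counter(tokenize(first)), Counter(tokenize(second))
    cos = cosine_similarity(vec_a, vec_b)
    common_words = len(vec_a.keys() & vec_b.keys())
    common_sentences = len(sentences(first) & sentences(second))
    return (cos > cos_threshold
            or common_words > word_threshold
            or common_sentences > sentence_threshold)
```

Because the checks are combined with a logical OR, a short document that is copied verbatim is flagged by its cosine similarity alone, even though it shares few words and sentences in absolute terms.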

Inventors

Joshua Raymond Stewart
John Glenn Wilkinson, III

Assignees

THE PNC FINANCIAL SERVICES GROUP, INC.

Dates

Publication Date: 20260507
Application Date: 20251218

Claims (20)

1-13. (canceled)
14. A system comprising: a memory storing instructions; and at least one processor configured to execute the stored instructions to: retrieve, from a first source, a first performance evaluation; retrieve, from a second source, a second performance evaluation; process the first performance evaluation and the second performance evaluation, wherein processing includes cleaning, tokenizing, and vectorizing the first performance evaluation and the second performance evaluation; determine a cosine similarity between a first set of numbers and a second set of numbers, the first set of numbers corresponding to one or more sentences in the first performance evaluation and the second set of numbers corresponding to one or more sentences in the second performance evaluation, wherein each number in the first set of numbers corresponds to a word in the first performance evaluation and each number in the second set of numbers corresponds to a word in the second performance evaluation; determine whether the cosine similarity exceeds a first threshold; determine a number of words in the second performance evaluation; determine whether the number of words in the second performance evaluation is below a second threshold; determine a performance review rating for the second performance evaluation; determine whether the performance review rating is below a third threshold; and responsive to a determination that the cosine similarity exceeds the first threshold, the number of words in the second performance review is below a second threshold, or the performance review rating is below the third threshold: set a flag to indicate that the second performance evaluation requires further review.
15. The system of claim 14, wherein the at least one processor is further configured to: iterate the processing, determining, and flag setting for each of a plurality of second performance evaluations retrieved from the second source, until the second source no longer contains any second performance evaluations to process.
16-24. (canceled)
25. The system of claim 14, wherein the first threshold is at least 0.5.
26. The system of claim 14, wherein the second threshold is 50.
27. The system of claim 14, wherein determining a performance review rating includes: extracting text data associated with the second performance evaluation; extracting numerical data associated with the second performance evaluation; analyzing the extracted text data and numerical data using at least one of: an ANN algorithm; a KNN algorithm; optical character recognition; or natural language processing; and assigning the second performance evaluation a performance review rating based on the extracted and analyzed data.
28. The system of claim 27, wherein the performance review rating is a scaled score ranging from 1 to 5.
29. The system of claim 14, wherein the third threshold is 2.
30. The system of claim 14, wherein the at least one processor is configured to adjust each of the first threshold, the second threshold, and the third threshold in response to an increase or decrease in the number of performance evaluations retrieved from the first source.
31. The system of claim 14, wherein the first source contains performance evaluations submitted during at least one previous review period.
32. The system of claim 14, wherein the second source contains performance evaluations submitted during a current review period.
33. The system of claim 14, wherein the at least one processor is further configured to: send the set flag for display on a graphical user interface of a user device.
35. The system of claim 14, wherein cleaning further includes: removing malicious scripts; removing metadata; or removing malware from each of the first performance evaluation and the second performance evaluation.
36. The system of claim 14, wherein tokenizing the first performance evaluation and the second performance evaluation further includes substituting a sensitive data element with a non-sensitive data element using at least one of: word tokenization, character tokenization, or subword tokenization.
37. The system of claim 43, wherein the sensitive data element includes personal identifying information.
38. The system of claim 14, wherein the at least one processor is configured to vectorize each of the first and second performance evaluations using at least one of: a bag-of-words model; a term frequency-inverse document frequency model; a paragraph vector model; or one-hot encoding.
39. A method comprising: retrieving, from a first source, a first performance evaluation; retrieving, from a second source, a second performance evaluation; processing the first performance evaluation and the second performance evaluation, wherein processing includes cleaning, tokenizing, and vectorizing the first performance evaluation and the second performance evaluation; determining a cosine similarity between a first set of numbers and a second set of numbers, the first set of numbers corresponding to one or more sentences in the first performance evaluation and the second set of numbers corresponding to one or more sentences in the second performance evaluation, wherein each number in the first set of numbers corresponds to a word in the first performance evaluation and each number in the second set of numbers corresponds to a word in the second performance evaluation; determining whether the cosine similarity exceeds a first threshold; determining a number of words in the second performance evaluation; determining whether the number of words in the second performance evaluation is below a second threshold; determining a performance review rating for the second performance evaluation; determining whether the performance review rating is below a third threshold; and responsive to a determination that the cosine similarity exceeds the first threshold, the number of words in the second performance review is below a second threshold, or the performance review rating is below the third threshold: setting a flag indicating that the second performance evaluation requires further review.
40. The method of claim 39, further comprising the steps of: iterating the processing, determining, and flag setting for each of a plurality of second performance evaluations retrieved from the second source, until the second source no longer contains any second performance evaluations to process.
41. A non-transitory computer readable medium having stored instructions, which when executed, cause at least one processor to perform operations comprising: retrieving, from a first source, a first performance evaluation; retrieving, from a second source, a second performance evaluation; processing the first performance evaluation and the second performance evaluation, wherein processing includes cleaning, tokenizing, and vectorizing the first performance evaluation and the second performance evaluation; determining a cosine similarity between a first set of numbers and a second set of numbers, the first set of numbers corresponding to one or more sentences in the first performance evaluation and the second set of numbers corresponding to one or more sentences in the second performance evaluation, wherein each number in the first set of numbers corresponds to a word in the first performance evaluation and each number in the second set of numbers corresponds to a word in the second performance evaluation; determining whether the cosine similarity exceeds a first threshold; determining a number of words in the second performance evaluation; determining whether the number of words in the second performance evaluation is below a second threshold; determining a performance review rating for the second performance evaluation; determining whether the performance review rating is below a third threshold; and responsive to a determination that the cosine similarity exceeds the first threshold, the number of words in the second performance review is below a second threshold, or the performance review rating is below the third threshold: set a flag indicating that the second performance evaluation requires further review.
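
The flagging logic recited in claims 14 and 15 can be sketched in Python as follows, using the threshold values from claims 25, 26, and 29. This is only an illustration under stated assumptions: the `Evaluation` class and the `vectorize`, `cosine`, and `review` functions are hypothetical names, the vectorization is a plain bag-of-words count, each current evaluation is compared against every previous evaluation with the highest similarity kept, and the rating is assumed to be supplied already on the 1 to 5 scale rather than derived as in claim 27.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

FIRST_THRESHOLD = 0.5   # cosine similarity; claim 25 recites "at least 0.5"
SECOND_THRESHOLD = 50   # word count, per claim 26
THIRD_THRESHOLD = 2     # rating on the 1 to 5 scale, per claims 28 and 29


@dataclass
class Evaluation:
    text: str
    rating: float  # assumption: rating is precomputed, not derived here
    flagged: bool = False
    reasons: list[str] = field(default_factory=list)


def vectorize(text: str) -> Counter:
    """Bag-of-words count vector (one of the options listed in claim 38)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def review(previous: list[Evaluation],
           current: list[Evaluation]) -> list[Evaluation]:
    """Check every current evaluation and flag those needing further review."""
    prev_vectors = [vectorize(e.text) for e in previous]
    for ev in current:  # iterate until the second source is exhausted
        vec = vectorize(ev.text)
        best = max((cosine(vec, p) for p in prev_vectors), default=0.0)
        if best > FIRST_THRESHOLD:
            ev.reasons.append("similar to a previous evaluation")
        if sum(vec.values()) < SECOND_THRESHOLD:
            ev.reasons.append("too short")
        if ev.rating < THIRD_THRESHOLD:
            ev.reasons.append("low rating")
        ev.flagged = bool(ev.reasons)  # any single condition sets the flag
    return current
```

Recording the reason alongside the flag is a design choice of this sketch, not something the claims require; it would let a reviewer see which of the three conditions triggered the flag.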

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on and claims benefit of priority of U.S. Provisional Patent Application No. 63/645,480, filed May 10, 2024, the contents of which are incorporated herein in their entirety.

TECHNICAL FIELD

The present disclosure relates generally to systems and methods for detecting potential misrepresentation. More specifically, but without limitation, this disclosure relates to systems and methods for detecting suspect resumes submitted by potential job candidates.

BACKGROUND

Businesses have long focused on how to identify, deter, and remediate suspicious activity for their customers. As technologies change, bad actors seek new ways of obtaining confidential information from organizations, beyond that of customer data. One potential entry point is the employment process. While safeguards such as background checks and I-9 forms exist, there is currently no satisfactory way to identify misleading applicants early in the recruiting process. What is needed is a way to identify applicant activity that requires further investigation.

Accordingly, some embodiments of this disclosure are directed to extracting data from recently submitted resumes and comparing them to all historical resumes. Consistent with this disclosure, tools may verify whether the new resume is duplicative of another resume in the historical resume data repository not associated with the same application. Disclosed embodiments may also apply to potentially fake employers, schools, and IP addresses.

Relatedly, unscrupulous managers may provide subpar feedback to employees, which may be evidenced by duplicative or cursory performance evaluations. What is needed is a method of evaluating performance evaluation feedback to determine quality, which may allow for streamlined further review of particular performance evaluations.

SUMMARY

One aspect of the present disclosure is directed to a system that may include a memory storing instructions and at least one processor configured to execute the instructions to perform operations. Another aspect is directed to a method. Yet another aspect is directed to a non-transitory computer readable medium.

In each aspect, processor operations may include retrieving, from a first source, a first document; retrieving, from a second source, a second document; processing the first document and the second document, wherein processing comprises cleaning, tokenizing, and vectorizing each of the first document and the second document; determining a cosine similarity between a first set of numbers and a second set of numbers, the first set of numbers corresponding to one or more sentences in the first document and the second set of numbers corresponding to one or more sentences in the second document, wherein each number in the first set of numbers corresponds to a word in the first document and each number in the second set of numbers corresponds to a word in the second document; determining whether the cosine similarity exceeds a first threshold; determining, based on the cosine similarity, a number of words in common between the first document and the second document; determining whether the number of words in common exceeds a second threshold; determining, based on the cosine similarity, a number of sentences in common between the first document and the second document; determining whether the number of sentences in common exceeds a third threshold; and responsive to a determination that the cosine similarity exceeds the first threshold, the number of words in common exceeds the second threshold, or the number of sentences in common exceeds the third threshold: setting a flag that indicates that the second document is a duplicate.

Another aspect of the present disclosure is directed to a system.
The system may include a memory storing instructions and at least one processor configured to execute the instructions to perform operations. The operations may include retrieving, from a first source, a first performance evaluation; retrieving, from a second source, a second performance evaluation; processing the first performance evaluation and the second performance evaluation, wherein processing comprises cleaning, tokenizing, and vectorizing each of the first performance evaluation and the second performance evaluation; determining a cosine similarity between a first set of numbers and a second set of numbers, the first set of numbers corresponding to one or more sentences in the first performance evaluation and the second set of numbers corresponding to one or more sentences in the second performance evaluation, wherein each number in the first set of numbers corresponds to a word in the first performance evaluation and each number in the second set of numbers corresponds to a word in the second performance evaluation; determining whether the cosine similarity exceeds a first threshold; determining a number of words in the second performance evalu