US-12625911-B1 - Enhanced string match matrix generation

Abstract

A method for remediating duplication errors associated with first and second documents includes obtaining a first document and a second document, generating a corresponding first document string set and a second document string set, generating a second document string subset that is a proper subset of the second document string set, dividing the second document string subset into second document substrings, transforming the second document substrings into a synthetic substring set, converting the first document string set into first document embeddings in vector space, converting the synthetic substrings into synthetic embeddings, generating closest embedding sets of one or more first document embeddings, generating, using a respective one of the closest embedding sets and a linking string generator, a linking string defining an association between the respective substring and one or more respective first document portions, and generating string match matrix data.

Inventors

Christopher Davis
Christopher Ziolkowski

Assignees

U.S. BANK NATIONAL ASSOCIATION

Dates

Publication Date: 20260512
Application Date: 20251125

Claims (20)

1 . A method for remediating a duplication error, the method comprising, with a set of one or more processors: obtaining a first document string set of first document strings associated with a first document, wherein the first document comprises first document text content; obtaining a second document string set of second document strings associated with a second document, wherein the second document comprises: second document first text content; second document second text content; and second document image content; obtaining a second document string subset of the second document strings, wherein the second document string subset is based on the second document second text content; dividing the second document string subset into second document substrings; generating a synthetic substring set of synthetic substrings based on one or more of items of content selected from a group consisting of: the second document first text content, the second document second text content, and the second document image content; embedding the first document string set into vector space as first document embeddings using an embedding function; embedding the synthetic substring set into the vector space as synthetic embeddings using the embedding function; generating closest embedding sets, wherein the generating includes, for respective synthetic embeddings of the synthetic embeddings: comparing in the vector space using an embedding similarity algorithm the respective synthetic embeddings with respective first document embeddings of the first document embeddings to generate embedding similarity values; and forming a closest embedding set that includes each of the first document embeddings that have an embedding similarity value to the respective synthetic embedding that satisfies a predetermined similarity value threshold or has a highest embedding similarity value; for each respective second document substring of the second document substrings: identifying an associated closest 
embedding set of the closest embedding sets that is associated with the respective second document substring; identifying a matching synthetic second document substring associated with the associated closest embedding set; and identifying a plurality of matching first document strings of the first document strings that are associated with the associated closest embedding set; generating, a string match data structure comprising: first portions including the second document substrings; and second portions comprising one or more first document strings identified based on the associated closest embedding set; detecting a duplication error based on the string match data structure; and transmitting a request regarding the duplication error, wherein the request includes the string match data structure, and requests initiating a remediation action selected from a group consisting of: deletion of the first document, hiding the first document, locking the first document, de-indexing the first document from search results, deprioritizing the first document in search results, redirecting links to the first document to the second document, granting privileges associated with the second document, allocating privileges associated with the second document.
2 . The method of claim 1 , wherein the string match data structure further comprises third portions comprising a linking string associated with a respective second document substring and a respective first document string.
3 . The method of claim 1 , further comprising: initiating the remediation action.
4 . The method of claim 1 , further comprising: receiving a response to a confirmation request regarding the duplication error, wherein initiating the remediation action is responsive to receiving the response.
5 . The method of claim 1 , further comprising: after obtaining one of but not both the first document and the second document: generating, using a generative artificial intelligence system, synthetic strings based on the obtained document; and identifying, by the one or more processors, the other of the first document or the second document based on a similarity to the synthetic strings.
6 . The method of claim 1 , further comprising: generating, using a string generation engine, a synthetic string; for each respective document of a set of documents that includes the second document: generating a similarity score based on a similarity between the synthetic string and the respective document; and identifying the second document as having a highest similarity score or as being above a threshold similarity score.
7 . The method of claim 1 , wherein the first document has a first style; wherein the second document has a second style different from the first style; wherein obtaining the first document comprises: generating, using a string generation engine, a synthetic string based on the second document substrings and mimicking the first style rather than the second style; and for each respective document of a document set that includes the first document: generating a similarity score describing a similarity between the synthetic string and the respective document; and identifying the first document based on the first document having a highest similarity score or the similarity score being above a threshold similarity score.
8 . The method of claim 1 , wherein the first document includes first document text content and first document image content; and wherein obtaining the first document string set includes generating at least some of the first document strings based on the first document image content.
9 . The method of claim 1 , wherein at least one of the second portions includes image content based on the second document image content.
10 . The method of claim 1 , wherein the second document comprises one or more privileged portions relevant to the first document.
11 . The method of claim 1 , wherein the second document substrings comprise structured substrings corresponding to unstructured strings of the first document.
12 . The method of claim 1 , wherein obtaining the first document includes: receiving a document generation contextual construct; and generating, using a string generation engine, the first document based at least in part on the document generation contextual construct.
13 . The method of claim 1 , further comprising: generating a document generation contextual construct using a predetermined list of document objectives, wherein the predetermined list of document objectives is determined based on a frequency of one or more of the predetermined list of document objectives in a plurality of heterogeneously privileged document repositories.
14 . The method of claim 1 , wherein obtaining the second document comprises: identifying a target component description associated with a target component; retrieving one or more documents including the second document; embedding into vector space, using an embedding function, the one or more documents and the target component description; detecting satisfaction of a similarity condition between a location of the second document in vector space and the target component description in vector space; and responsive to detecting the satisfaction, determining to obtain the second document.
15 . The method of claim 14 , wherein generating the target component description includes: receiving a document generation contextual construct comprising one or more strings describing the target component and one or more configurations; and generating, based at least in part on the document generation contextual construct, a first set of structured strings describing the target component.
16 . The method of claim 1 , wherein obtaining the first document comprises: receiving the second document comprising the second document string subset of second document strings defining one or more privileged features; generating, using a string generation engine, one or more third sets of synthetic strings based on the second document; retrieving one or more additional documents, including the first document, each additional document comprising a respective additional document string set, with the respective additional document string set for the first document being the first document string set; converting the one or more third sets of synthetic strings into one or more third embeddings; converting each of the respective additional document string sets into one or more fourth embeddings; in vector space, comparing the one or more third embeddings and each of the one or more fourth embeddings; and detecting, based on the comparison, satisfaction of a similarity condition between at least one of the one or more third embeddings and at least one of the fourth embeddings associated with the first document.
17 . A computer program product comprising at least one non-transitory computer readable medium comprising computer executable instructions that, when executed by one or more processors, are configured to: obtain a first document string set of first document strings associated with a first document, wherein the first document comprises first document text content; obtain a second document string set of second document strings associated with a second document, wherein the second document comprises: second document first text content; second document second text content; and second document image content; obtain a second document string subset of the second document strings, wherein the second document string subset is based on the second document second text content; divide the second document string subset into second document substrings; generate a synthetic substring set of synthetic substrings based on one or more of items of content selected from a group consisting of: the second document first text content, the second document second text content, and the second document image content; embed the first document string set into vector space as first document embeddings using an embedding function; embed the synthetic substring set into the vector space as synthetic embeddings using the embedding function; generate closest embedding sets, wherein the generating includes, for respective synthetic embeddings of the synthetic embeddings: compare in the vector space using an embedding similarity algorithm the respective synthetic embeddings with respective first document embeddings of the first document embeddings to generate embedding similarity values; and form a closest embedding set that includes each of the first document embeddings that have an embedding similarity value to the respective synthetic embedding that satisfies a predetermined similarity value threshold or has a highest embedding similarity value; for each respective second document substring of 
the second document substrings: identify an associated closest embedding set of the closest embedding sets that is associated with the respective second document substring; identify a matching synthetic second document substring associated with the associated closest embedding set; and identify a plurality of matching first document strings of the first document strings that are associated with the associated closest embedding set; generate, a string match data structure comprising: first portions including the second document substrings; and second portions comprising one or more first document strings identified based on the associated closest embedding set; detect a duplication error based on the string match data structure; and transmit a request regarding the duplication error, wherein the request includes the string match data structure, and requests initiating a remediation action selected from a group consisting of: deletion of the first document, hiding the first document, locking the first document, de-indexing the first document from search results, deprioritizing the first document in search results, redirecting links to the first document to the second document, granting privileges associated with the second document, allocating privileges associated with the second document.
18 . The computer program product of claim 17 , the computer executable instructions, when executed by the one or more processors, being further configured to: obtain a third document comprising a third document string; generate, using a generative artificial intelligence system, third document synthetic strings based on the third document string; and identify one of the first document or the second document based on a similarity of the identified document to one or more of the third document synthetic strings.
19 . A system comprising a set of one or more processors and a set of at least one non-transitory computer readable medium storing computer executable instructions that, when executed by the set of one or more processors, are configured to cause the system to: obtain a first document string set of first document strings associated with a first document, wherein the first document comprises first document text content; obtain a second document string set of second document strings associated with a second document, wherein the second document comprises: second document first text content; second document second text content; and second document image content; obtain a second document string subset of the second document strings, wherein the second document string subset is based on the second document second text content; divide the second document string subset into second document substrings; generate a synthetic substring set of synthetic substrings based on one or more of items of content selected from a group consisting of: the second document first text content, the second document second text content, and the second document image content; embed the first document string set into vector space as first document embeddings using an embedding function; embed the synthetic substring set into the vector space as synthetic embeddings using the embedding function; generate closest embedding sets, wherein the generating includes, for respective synthetic embeddings of the synthetic embeddings: compare in the vector space using an embedding similarity algorithm the respective synthetic embeddings with respective first document embeddings of the first document embeddings to generate embedding similarity values; and form a closest embedding set that includes each of the first document embeddings that have an embedding similarity value to the respective synthetic embedding that satisfies a predetermined similarity value threshold or has a highest embedding similarity 
value; for each respective second document substring of the second document substrings: identify an associated closest embedding set of the closest embedding sets that is associated with the respective second document substring; identify a matching synthetic second document substring associated with the associated closest embedding set; and identify a plurality of matching first document strings of the first document strings that are associated with the associated closest embedding set; generate, a string match data structure comprising: first portions including the second document substrings; and second portions comprising one or more first document strings identified based on the associated closest embedding set; detect a duplication error based on the string match data structure; and transmit a request regarding the duplication error, wherein the request includes the string match data structure, and requests initiating a remediation action selected from a group consisting of: deletion of the first document, hiding the first document, locking the first document, de-indexing the first document from search results, deprioritizing the first document in search results, redirecting links to the first document to the second document, granting privileges associated with the second document, allocating privileges associated with the second document.
20 . The system of claim 19 , the computer executable instructions, when executed by the set of one or more processors, being further configured to: obtain a third document comprising a third document string; generate, using a generative artificial intelligence system, third document synthetic strings based on the third document string; and identify one or both of the first document or the second document based on a similarity to at least one of the third document synthetic strings.
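
The closest-embedding-set step recited in claims 1, 17, and 19 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claims do not name an embedding function or a similarity algorithm, so the bag-of-words `toy_embed` function and the cosine similarity used here are assumptions chosen only to keep the example self-contained. The names `closest_embedding_set` and `build_string_match`, the dictionary keys, and the threshold values are likewise hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Assumed stand-in for the embedding function: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_embedding_set(synthetic: str, first_strings: list[str],
                          threshold: float) -> list[int]:
    """Indices of first document strings that satisfy the threshold,
    or the single most similar string when none does."""
    query = toy_embed(synthetic)
    sims = [cosine(query, toy_embed(s)) for s in first_strings]
    hits = [i for i, v in enumerate(sims) if v >= threshold]
    if hits or not sims:
        return hits
    return [max(range(len(sims)), key=sims.__getitem__)]


def build_string_match(second_substrings: list[str],
                       synthetic_substrings: list[str],
                       first_strings: list[str],
                       threshold: float) -> list[dict]:
    """Pair each second document substring with the first document strings
    found through its synthetic substring (hypothetical row layout)."""
    rows = []
    for sub, synth in zip(second_substrings, synthetic_substrings):
        idx = closest_embedding_set(synth, first_strings, threshold)
        rows.append({
            "second_substring": sub,                          # first portion
            "first_strings": [first_strings[i] for i in idx],  # second portion
        })
    return rows


first = [
    "the widget rotates about a fixed axle",
    "payment is due within thirty days",
    "the widget includes a steel axle",
]
rows = build_string_match(["a steel axle"], ["a widget having a steel axle"],
                          first, threshold=0.6)
print(rows)
```

In this sketch, a row whose second portion closely tracks its first portion is the kind of evidence a later duplication check could inspect. How that check is performed is not specified in the claims and is not modeled here.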

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 19/314,179, filed Aug. 29, 2025, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

Various embodiments of the present disclosure address technical challenges related to efficient and accurate comparison of structured and unstructured text data and visual content across heterogeneous document repositories. Traditional data analysis techniques often struggle with comparing and matching content across different formats, structures, and access privileges. This task is hindered by several technical challenges, including the difficulty of semantically comparing structured and unstructured text, identifying similarities across large-scale information repositories with varying access controls, and efficiently detecting and remediating duplicate or near-duplicate content. Additionally, traditional approaches to managing information repositories have struggled to efficiently handle the rapid proliferation of AI-generated content, leading to increased storage requirements, slower access times, and potential propagation of errors or inconsistencies across documents. These limitations have made it challenging to maintain data integrity, optimize storage utilization, and ensure proper access controls across large-scale, heterogeneous document collections.

Applicant has identified several problems associated with managing such systems and with remedying vulnerabilities of such systems and processes. Through applied ingenuity, the inventors have developed solutions to the aforementioned problems and more, many of which are described with respect to embodiments herein.

SUMMARY

Embodiments of the present disclosure are directed to various systems, computer readable media, and computer-implemented methods for enhanced string match matrix generation, AI training and validation, and database deduplication.
In some embodiments disclosed herein, a method for remediating duplication errors associated with first and second documents includes, with a set of one or more processors, obtaining a first document, wherein the first document comprises first document text content. The method further includes obtaining a second document, wherein the second document comprises second document first text content, second document second text content having delimiters in the form of one or more symbols or tags, and second document image content. The method further includes generating, using a string generation engine, a first document string set of first document strings associated with the first document. The method further includes generating, using the string generation engine, a second document string set of second document strings associated with the second document. The method further includes generating, based on the second document second text content but not on the second document first text content, a second document string subset of the second document strings. The method further includes dividing the second document string subset into second document substrings based at least on the delimiters. 
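
The delimiter-based division described above can be sketched in a few lines of Python. The summary states only that the delimiters take the form of one or more symbols or tags. The specific delimiters used here (semicolons and simple markup tags) and the function name `split_on_delimiters` are assumptions for illustration.

```python
import re

# Assumed delimiters: a semicolon symbol, or a simple markup tag such as <br/>.
DELIMITER_PATTERN = re.compile(r";|</?[A-Za-z][^>]*>")


def split_on_delimiters(strings: list[str]) -> list[str]:
    """Divide each string of the subset into substrings at the delimiters,
    dropping empty pieces and surrounding whitespace."""
    substrings = []
    for s in strings:
        for piece in DELIMITER_PATTERN.split(s):
            piece = piece.strip()
            if piece:
                substrings.append(piece)
    return substrings


subset = ["a housing; a shaft mounted in the housing<br/>a seal"]
print(split_on_delimiters(subset))
```

A production system would presumably take its delimiter set from the document format itself, so the pattern above should be read as a placeholder.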
The method further includes generating a synthetic substring set of synthetic substrings, wherein the generating includes, for each respective second document substring of the second document substrings: synthesizing, using a multimodal generative artificial intelligence model, first related content from the second document first text content based on relevance to the respective second document substring, synthesizing, using the multimodal generative artificial intelligence model, second related content from the second document second text content based on relevance to the respective second document substring, synthesizing, using the multimodal generative artificial intelligence model, third related content from the second document image content based on a relevance to the respective second document substring, and generating, for inclusion in the synthetic substring set, a respective synthetic substring associated with the respective second document substring by providing the first related content, the second related content, and the third related content as input to the multimodal generative artificial intelligence model. The method further includes embedding the first document string set into vector space as first document embeddings using an embedding function. The method further includes embedding the synthetic substring set into the vector space as synthetic embeddings using the embedding function. The method further includes generating closest embedding sets, wherein the generating includes, for each respective synthetic embedding of the synthetic embeddings: comparing in the vector space using an embedding similarity algorithm the respective synthetic embedding with each respective first document embedding of the first document embeddings to generate embedding similarity values, each respective embedding similarity e