US-20260127230-A1 - MULTI-STAGE DOCUMENT STORAGE FOR FACILITATING CODE-BASED RETRIEVAL
Abstract
Various embodiments of the present disclosure provide automated data storage and code-based retrieval techniques for improving search engine performance. The techniques apply a multi-stage machine learned autonomous coding pipeline to generate a search template for an input document by extracting a text segment from the input document, executing a coding query to receive a query response, determining that the query response is a null query response, and responsive to the null query response, generating a segment embedding based on the text segment, generating a text-code pair for the text segment based on an embedding similarity score between the segment embedding and a code embedding, storing the text-code pair within the text-code datastore, and modifying the search template by storing a code of the text-code pair in association with the text segment.
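The storage stage summarized above can be illustrated with a minimal sketch. This is not the disclosed implementation: the bag-of-words `embed` function is a toy stand-in for the machine learned encoder model, cosine similarity stands in for the embedding similarity score, and the names `hashed_id`, `code_for_segment`, and `build_search_template`, the dictionary layouts, and the sample codes in the usage note are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def embed(text):
    """Toy encoder (assumption): sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hashed_id(segment):
    """Hashed query identifier for a normalized text segment."""
    return hashlib.sha256(segment.strip().lower().encode("utf-8")).hexdigest()


def code_for_segment(segment, datastore, code_definitions):
    """Query the text-code datastore; on a null response, fall back to embeddings."""
    key = hashed_id(segment)
    hit = datastore.get(key)  # coding query
    if hit is not None:
        return hit[1]  # previously stored text-code pair
    # Null query response: score the segment against every code definition.
    seg_emb = embed(segment)
    best = max(
        code_definitions,
        key=lambda c: cosine(seg_emb, embed(code_definitions[c])),
    )
    datastore[key] = (segment, best)  # store the new text-code pair
    return best


def build_search_template(document_id, segments, datastore, code_definitions):
    """Store each code in association with its text segment."""
    entries = [
        {"text_segment": s, "code": code_for_segment(s, datastore, code_definitions)}
        for s in segments
    ]
    return {"document_id": document_id, "entries": entries}
```

For example, with hypothetical code definitions `{"A1": "knee replacement surgery", "B2": "cardiac stress test"}`, the segment "total knee replacement" misses an empty datastore, is paired with `A1` by similarity, and is served from the datastore on any later lookup.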
Inventors
- Zahra Mahmoodzadeh Poornaki
- Fazlolah MOHAGHEGH
- Jagadish Venkataraman
- Hamid Reza HASSANZADEH
Assignees
- OPTUM, INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20241104
Claims (20)
- 1 . A computer-implemented method comprising: receiving, by one or more processors, an input document for storage within a search repository associated with a search engine; generating and storing, by the one or more processors, a search template for the input document by extracting a text segment from the input document, wherein the search template is associated with a document identifier and comprises a set of structured response entries to improve a retrieval speed of the search engine, and wherein the search template is indexed within the search repository based on the document identifier; providing, by the one or more processors and to a text-code datastore of the search repository, a coding query determined based on the text segment to receive a query response; determining, by the one or more processors, that the query response is a null query response; responsive to the null query response, (i) generating, by the one or more processors and using a machine learned encoder model, a segment embedding based on the text segment, (ii) generating, by the one or more processors, a text-code pair for the text segment based on an embedding similarity score between the segment embedding and a code embedding, (iii) storing, by the one or more processors, the text-code pair within the text-code datastore, and (iv) modifying, by the one or more processors, the search template by storing a code of the text-code pair in association with the text segment; receiving, by the one or more processors, a user query for the search repository that comprises an input code and the document identifier, wherein the input code is associated with contextual data; and responsive to receiving the user query: identifying, by the one or more processors, an authorization criteria based on a comparison between the input code and the search template, and providing, by the one or more processors, a user query response based on the authorization criteria.
- 2 . The computer-implemented method of claim 1 , wherein: (i) the text segment is one of a set of text segments stored within the search template, (ii) the set of text segments is respectively associated with a set of previously identified codes, and (iii) the computer-implemented method further comprises: detecting a duplicate code based on a comparison between the code and the set of previously identified codes, and generating a data redundancy flag based on the duplicate code.
- 3 . (canceled)
- 4 . (canceled)
- 5 . (canceled)
- 6 . The computer-implemented method of claim 1 , further comprising: applying the authorization criteria to the contextual data to determine an authorization decision, wherein the user query response comprises the authorization decision.
- 7 . The computer-implemented method of claim 1 , wherein the code is one of a set of codes defined within a coding domain and the computer-implemented method further comprises: receiving a code update message that identifies a code modification to a code definition of the code; and responsive to the code update message, removing the text-code pair from the text-code datastore and the search template; generating a modified code embedding for the code based on the code modification; regenerating the text-code pair for the text segment based on an embedding similarity score between the segment embedding and the modified code embedding; restoring the text-code pair within the text-code datastore; and modifying the search template by restoring the code of the text-code pair in association with the text segment.
- 8 . The computer-implemented method of claim 1 , wherein the text-code datastore comprises a hash table that comprises a set of historical text-code pairs respectively indexed by a set of hashed identifiers.
- 9 . The computer-implemented method of claim 8 , wherein providing the coding query comprises: generating, using a hashing model, a hashed query identifier by hashing the text segment; and executing the coding query with the hashed query identifier.
- 10 . A system comprising one or more processors and at least one memory storing processor-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving an input document for storage within a search repository associated with a search engine; generating and storing a search template for the input document by extracting a text segment and authorization criteria corresponding to the text segment from the input document, wherein the search template is associated with a document identifier and comprises a set of structured response entries to improve a retrieval speed of the search engine, and wherein the search template is indexed within the search repository based on the document identifier; providing, to a text-code datastore of the search repository, a coding query determined based on the text segment to receive a query response; determining that the query response is a null query response; and responsive to the null query response, (i) generating, using a machine learned encoder model, a segment embedding based on the text segment, (ii) generating a text-code pair for the text segment based on an embedding similarity score between the segment embedding and a code embedding, (iii) storing the text-code pair within the text-code datastore, and (iv) modifying the search template by storing a code of the text-code pair in association with the text segment; receiving a user query for the search repository that comprises an input code and the document identifier, wherein the input code is associated with contextual data; and responsive to receiving the user query: identifying an authorization criteria based on a comparison between the input code and the search template, and providing a user query response based on the authorization criteria.
- 11 . The system of claim 10 , wherein: (i) the text segment is one of a set of text segments stored within the search template, (ii) the set of text segments is respectively associated with a set of previously identified codes, and (iii) the operations further comprise: detecting a duplicate code based on a comparison between the code and the set of previously identified codes, and generating a data redundancy flag based on the duplicate code.
- 12 . (canceled)
- 13 . (canceled)
- 14 . (canceled)
- 15 . The system of claim 10 , wherein the operations further comprise: applying the authorization criteria to the contextual data to determine an authorization decision, wherein the user query response comprises the authorization decision.
- 16 . The system of claim 10 , wherein the code is one of a set of codes defined within a coding domain and the operations further comprise: receiving a code update message that identifies a code modification to a code definition of the code; and responsive to the code update message, removing the text-code pair from the text-code datastore and the search template; generating a modified code embedding for the code based on the code modification; regenerating the text-code pair for the text segment based on an embedding similarity score between the segment embedding and the modified code embedding; restoring the text-code pair within the text-code datastore; and modifying the search template by restoring the code of the text-code pair in association with the text segment.
- 17 . The system of claim 10 , wherein the text-code datastore comprises a hash table that comprises a set of historical text-code pairs respectively indexed by a set of hashed identifiers.
- 18 . One or more non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving an input document for storage within a search repository associated with a search engine; generating and storing a search template for the input document by extracting a text segment from the input document, wherein the search template is associated with a document identifier and comprises a set of structured response entries to improve a retrieval speed of the search engine, and wherein the search template is indexed within the search repository based on the document identifier; providing, to a text-code datastore of the search repository, a coding query determined based on the text segment to receive a query response; determining that the query response is a null query response; and responsive to the null query response, (i) generating, using a machine learned encoder model, a segment embedding based on the text segment, (ii) generating a text-code pair for the text segment based on an embedding similarity score between the segment embedding and a code embedding, (iii) storing the text-code pair within the text-code datastore, and (iv) modifying the search template by storing a code of the text-code pair in association with the text segment; receiving a user query for the search repository that comprises an input code and the document identifier, wherein the input code is associated with contextual data; and responsive to receiving the user query: identifying an authorization criteria based on a comparison between the input code and the search template, and providing a user query response based on the authorization criteria.
- 19 . The one or more non-transitory computer-readable storage media of claim 19 , wherein the text-code datastore comprises a hash table that comprises a set of historical text-code pairs respectively indexed by a set of hashed identifiers.
- 20 . The one or more non-transitory computer-readable storage media of claim 19 , wherein executing the coding query comprises: generating, using a hashing model, a hashed query identifier by hashing the text segment; and executing the coding query with the hashed query identifier.
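The code update flow recited in claims 7 and 16 can be sketched as follows. This is a hedged illustration only: the bag-of-words `embed` function is a toy stand-in for the machine learned encoder model, the function name `apply_code_update` and the dictionary layouts are hypothetical, and re-scoring each affected segment against every code definition (rather than only the modified one) is an assumption of this sketch.

```python
import math
from collections import Counter


def embed(text):
    """Toy encoder (assumption): sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def apply_code_update(code, new_definition, code_definitions, datastore, template):
    """Remove, regenerate, and restore every text-code pair that uses `code`."""
    code_definitions[code] = new_definition  # the code modification
    # Collect affected keys first so the datastore is not mutated while iterating.
    stale = [key for key, (_, c) in datastore.items() if c == code]
    for key in stale:
        segment, _ = datastore.pop(key)  # remove the stale pair
        seg_emb = embed(segment)
        # Embeddings are recomputed here, so the modified definition is used.
        best = max(
            code_definitions,
            key=lambda c: cosine(seg_emb, embed(code_definitions[c])),
        )
        datastore[key] = (segment, best)  # restore the regenerated pair
        for entry in template["entries"]:
            if entry["text_segment"] == segment:
                entry["code"] = best  # restore the code in the search template
```

Keeping the datastore key unchanged means a later coding query for the same text segment resolves to the regenerated pair without another embedding pass.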
Description
BACKGROUND
Traditional search engines may process queries by identifying textual matches between the query and candidate search results. While this allows for searching within natural language documents, the search results are inconsistent and lead to inefficiencies and potential errors in downstream decision-making processes. This prevents the use of existing search engines for the identification and application of complex authorization criteria defined by natural language documents, which in turn prevents automating both the interpretation of those documents and the application of the authorization criteria defined therein to user queries. Due to these challenges, human expertise is traditionally leveraged to extract relevant information from natural language documents and apply the extracted information to specific use cases, which is time-consuming, prone to subjective interpretation leading to inconsistent outcomes, and impractical at scale. These technical challenges are compounded when queries are related to standardized coding systems that require consistent resolutions for different textual expressions of the same code. Traditional automated search engines have attempted to address these challenges through basic keyword matching or rule-based approaches. However, these techniques fail to comprehensively address the nuances of natural language, variations in code terminology, and the evolving nature of coding systems.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts a block diagram of an example architecture in accordance with some embodiments of the present disclosure.
FIG. 2 depicts a block diagram of an example predictive data analysis computing entity in accordance with some embodiments of the present disclosure.
FIG. 3 depicts a block diagram of an example client computing entity in accordance with some embodiments of the present disclosure.
FIG. 4 depicts a dataflow diagram of an example multi-stage autonomous coding pipeline and compatible code-based retrieval techniques in accordance with some embodiments of the present disclosure.
FIG. 5 depicts a flowchart diagram of an example process for implementing a first, storage, stage of the multi-stage autonomous coding pipeline in accordance with some embodiments of the present disclosure.
FIG. 6 depicts a flowchart diagram of an example process for implementing a code-based retrieval technique in accordance with some embodiments of the present disclosure.
FIG. 7 depicts a flowchart diagram of an example process for implementing a second, modification, stage of the multi-stage autonomous coding pipeline in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION
Various embodiments of the present disclosure solve technical challenges with traditional query systems by leveraging a multi-stage autonomous coding pipeline to enable improved code-based retrieval systems. The multi-stage autonomous coding pipeline comprises a storage stage, a maintenance stage, and/or a query resolution stage that collectively enable reliable code-based retrieval systems through a series of data transformation, storage, and/or monitoring operations. By doing so, the techniques of the present disclosure enable code-based retrieval systems capable of improved responses at the cost of fewer computing resources and/or less time.
To overcome performance deficiencies with traditional search engines, the multi-stage autonomous coding pipeline augments adaptive datastores (e.g., lookup tables for dynamically changing and increasing data) with embedding techniques to efficiently transform text to a searchable code template that is tailored to code-based retrieval.
By doing so, the techniques of the present disclosure provide a search repository that may be autonomously augmented as new text-based documents are created and/or as codes and/or their definitions change over time. This allows for code-based retrieval techniques that may improve response and/or prediction consistency, while addressing the nuances of and/or diversity in natural language, variations in code terminology, and/or the evolving nature of coding systems. This, in turn, enables code-based retrieval with improved accuracy, precision, false negative rate, recall, and/or F1 score, as well as reduced processing times and/or resource consumption, compared to traditional search engines.
In a first stage, the storage stage, the multi-stage autonomous coding pipeline implements a series of operations to efficiently transform a text-based document into a search template designed for code-based retrieval systems. To do so, some embodiments of the present disclosure provide a branched processing technique that leverages a unique combination of natural language processing, machine learning, and local retrieval techniques to generate improved search templates and code-based mappings with enhanced efficiency and accuracy. The branched processing technique selectively applies (e.g.,