US-20260127392-A1 - MULTI-STAGE MACHINE LEARNING AUTOMATED CODING PIPELINE

US 20260127392 A1

Abstract

Various embodiments of the present disclosure provide machine learning architectures for improving predictive functionality of a computer. The techniques apply a multi-stage machine learning automated coding pipeline to a coding domain to generate a code prediction for a text segment. During a first stage, the techniques may include inputting a text segment from a file to a machine learning encoder to generate a text segment vector and extracting a subset of searching codes from a vector data store based on a comparison between the text segment vector and a plurality of code vectors within the vector data store. During a second stage, the techniques may include generating a generative model prompt based on the subset of searching codes and inputting the generative model prompt to a generative model to generate a code prediction for the text segment.
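The two stages summarized above can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not the disclosed implementation: the toy bag-of-characters encoder substitutes for the machine learning encoder, a cosine distance substitutes for whatever comparison the vector data store performs, and the function and code names are invented.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    # Stand-in for the machine learning encoder: a toy bag-of-characters
    # embedding, normalized so that a dot product is a cosine similarity.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def extract_searching_codes(segment_vec, code_store, k=2):
    # Stage 1: rank the stored code vectors by cosine distance to the text
    # segment vector and keep the k nearest as the subset of searching codes.
    # code_store is a list of (code, vector) pairs, i.e. the vector data store.
    dists = [(1.0 - float(segment_vec @ vec), code) for code, vec in code_store]
    return [code for _, code in sorted(dists)[:k]]

def build_prompt(segment, searching_codes):
    # Stage 2: fold the short-listed codes into a generative model prompt;
    # the prompt text would then be sent to the generative model.
    options = "\n".join(f"- {c}" for c in searching_codes)
    return f"Select the code that best matches {segment!r}:\n{options}"
```

In this sketch the retrieval stage shrinks the prediction scope before any generative model is invoked, which is the core of the two-stage design.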

Inventors

  • Zahra Mahmoodzadeh Poornaki
  • Jennifer M. LATHROP
  • Fazlolah MOHAGHEGH
  • Jagadish Venkataraman
  • Hamid Reza HASSANZADEH
  • Shannon BUTKUS
  • Michele R. O'CONNELL
  • Teresa I. ANTHONY
  • Jag KANCHIRAKKOL
  • Jennifer A. GARVAS

Assignees

  • OPTUM, INC.

Dates

Publication Date
2026-05-07
Application Date
2024-11-01

Claims (20)

  1. A computer-implemented method comprising: generating, by one or more processors and a machine learning encoder, a text segment vector using a text segment from a file; determining, by the one or more processors, a subset of searching codes from a set of codes based on determining a distance between the text segment vector and a first code vector associated with a first searching code of the subset of searching codes; generating, by the one or more processors, a generative model prompt based on the subset of searching codes; and generating, by the one or more processors and a generative model, a code prediction for the text segment based on the generative model prompt.
  2. The computer-implemented method of claim 1, wherein: (i) a set of code vectors is previously generated for the set of codes using the machine learning encoder, (ii) the first code vector is generated by the machine learning encoder based on the first searching code, and (iii) the first searching code is a first code of the set of codes.
  3. The computer-implemented method of claim 1, further comprising storing the code prediction in a lookup table in association with the text segment as a first code-text pair.
  4. The computer-implemented method of claim 3, further comprising: receiving a code update message identifying (a) one or more code modifications to one or more codes of the set of codes or (b) one or more additional codes to the set of codes; and in response to the code update message, at least one of: removing one or more code-text pairs from the lookup table that respectively correspond to the one or more codes indicated by the one or more code modifications; generating, by the machine learning encoder, one or more additional code vectors using the one or more additional codes; or storing the one or more additional codes in a vector data store in association with the one or more additional code vectors.
  5. The computer-implemented method of claim 3, further comprising: receiving a code request that identifies at least one of the file or the text segment from among at least one of structural text or at least another text segment in the file; requesting an associated code from the lookup table based on the text segment and receiving the code request; receiving a null response responsive to requesting the associated code; and in response to the null response, generating the text segment vector.
  6. The computer-implemented method of claim 5, further comprising: determining that the file does not contain natural language text; determining a conversion component associated with a file type associated with the file; and detecting the text segment from the file based on processing the file using the conversion component.
  7. The computer-implemented method of claim 1, wherein generating the generative model prompt based on the subset of searching codes comprises: receiving a prompt template, and modifying the prompt template to indicate the subset of searching codes.
  8. The computer-implemented method of claim 7, wherein the prompt template comprises a few-shot prompt and the generative model comprises a question answering (Q/A) large language model (LLM).
  9. The computer-implemented method of claim 1, wherein the number of codes within the subset of searching codes is determined based on a tunable code threshold.
  10. A system comprising: one or more processors; and at least one memory storing processor-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: generating, using a machine learning encoder, a text segment vector using a text segment; determining a subset of searching codes from a set of codes based on determining a distance between the text segment vector and a first code vector associated with a first searching code of the subset of searching codes; generating a generative model prompt based on the subset of searching codes; and generating, using a generative model, a code prediction for the text segment based on the generative model prompt.
  11. The system of claim 10, wherein the text segment is extracted from a natural language document.
  12. The system of claim 10, wherein: (i) a set of code vectors is previously generated for the set of codes using the machine learning encoder, (ii) the first code vector is generated by the machine learning encoder based on the first searching code, and (iii) the first searching code is a first code of the set of codes.
  13. The system of claim 10, wherein the operations further comprise storing the code prediction in a lookup table in association with the text segment as a first code-text pair.
  14. The system of claim 13, wherein the operations further comprise: receiving a code update message identifying (a) one or more code modifications to one or more codes of the set of codes or (b) one or more additional codes to the set of codes; and in response to the code update message, at least one of: removing one or more code-text pairs from the lookup table that respectively correspond to the one or more codes indicated by the one or more code modifications; generating, by the machine learning encoder, one or more additional code vectors using the one or more additional codes; or storing the one or more additional codes in a vector data store in association with the one or more additional code vectors.
  15. The system of claim 13, wherein the operations further comprise: receiving a code request that identifies at least one of a file or the text segment from among at least one of structural text or at least another text segment in the file; requesting an associated code from the lookup table based on the text segment and receiving the code request; receiving a null response responsive to requesting the associated code; and in response to the null response, generating the text segment vector.
  16. The system of claim 15, wherein the operations further comprise: determining that the file does not contain natural language text; determining a conversion component associated with a file type associated with the file; and detecting the text segment from the file based on processing the file using the conversion component.
  17. The system of claim 10, wherein to generate the generative model prompt based on the subset of searching codes the operations further comprise: receiving a prompt template, and modifying the prompt template to indicate the subset of searching codes.
  18. The system of claim 17, wherein the prompt template comprises a few-shot prompt and the generative model comprises a Q/A LLM.
  19. One or more non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: generating, using a machine learning encoder, a text segment vector using a text segment; determining a subset of searching codes from a set of codes based on determining a distance between the text segment vector and a first code vector associated with a first searching code of the subset of searching codes; generating a generative model prompt based on the subset of searching codes; and generating, using a generative model, a code prediction for the text segment based on the generative model prompt.
  20. The one or more non-transitory computer-readable storage media of claim 19, wherein the text segment is extracted from a natural language document.
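Claims 3 through 5 describe a lookup table consulted before the pipeline runs: a hit returns a cached code-text pair, a null response triggers generation of the text segment vector, and a code update message (claim 4) invalidates stale pairs. A minimal sketch of that cache layer, with hypothetical class and method names:

```python
class CodeLookupTable:
    """Hypothetical cache of code-text pairs consulted before the pipeline."""

    def __init__(self):
        self._pairs = {}  # text segment -> previously predicted code

    def get(self, segment):
        # Returns None (the "null response") when no code-text pair exists,
        # which is the caller's cue to generate the text segment vector.
        return self._pairs.get(segment)

    def store(self, segment, code):
        # Store a fresh code prediction in association with its text segment.
        self._pairs[segment] = code

    def apply_code_update(self, modified_codes):
        # On a code update message, remove code-text pairs whose code was
        # modified so that stale predictions are never served.
        modified = set(modified_codes)
        self._pairs = {s: c for s, c in self._pairs.items() if c not in modified}
```

When `get` returns `None`, the caller would fall back to the full encoder-plus-generative-model pipeline and then `store` the resulting prediction for reuse.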

Description

BACKGROUND

Various embodiments of the present disclosure address technical challenges related to machine learning technology, including the application of machine learning in automated coding processes. In various domains, standardized codes (e.g., a sequence of characters, numerals) may be used to designate an actionable insight with respect to an entity. Such codes are traditionally used to improve computer understanding by translating natural language text to a recognizable code. The efficacy of codes within a particular domain is limited by a computer's capability of translating natural language text to a recognizable code. This task is hindered by several technical challenges presented by codes that (i) are adapted over time, (ii) are defined by multiple different parties, or (iii) correlate to a plurality of variations of natural language text. Traditionally, these technical challenges are addressed by using tables that statically translate natural language text to an associated code. These tables include lists of codes for each natural language term or phrase encountered by the developer of the table. While helpful for codes with minimal variations of natural language text, the processing resources and time expense of maintaining static lists is prohibitive for codes that are associated with highly individualized variations of natural language text. This leads to diverging coding tables, rather than a universal table, each individually maintained by participants in a coding domain. Each of the diverging coding tables is limited to a portion of a universal code space and may include mappings that diverge from one another. This leads to inconsistent parsing of natural language text that negatively impacts the accuracy and reliability of downstream processes.
Moreover, when codes are modified, diverging coding tables may require individualized modifications that require a prohibitive amount of computing resources and further erode the similarities between the tables. Various embodiments of the present disclosure make important contributions to traditional coding technologies by addressing these technical challenges, among others.

BRIEF SUMMARY

Various embodiments of the present disclosure provide machine learning model architectures and pipelines that improve the functionality of a computer through machine learning processes that address the technical challenges discussed herein. To do so, some embodiments of the present disclosure present a multi-stage machine learning automated coding pipeline that streamlines and optimizes the translation of natural language text segments to codes defined within a coding domain. The multi-staged technique leverages embedding technologies to generate semantic representations that may be used to implement a first relevancy filter to extract a short-listed set of codes for consideration by subsequent stages of the multi-staged technique. By doing so, the multi-staged technique reduces a prediction scope of an autonomous coding process to a size that is manageable by model architectures that traditionally underperform in autonomous coding processes. This allows the use of new model architectures, such as a question-answering (Q/A) large language model (LLM), within an autonomous coding pipeline, among other technical advantages as described herein.
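The prompt-generation stage (claims 7 and 8) receives a prompt template and modifies it to indicate the subset of searching codes. The few-shot template below is a hypothetical illustration; the example pairs and code identifiers are invented placeholders, not codes from any real coding domain.

```python
# Hypothetical few-shot prompt template for a Q/A LLM: worked examples
# prime the model, and {codes}/{segment} are filled in per request.
FEW_SHOT_TEMPLATE = """You map text segments to standardized codes.
Example: segment: "routine wellness visit" -> code: C-010
Example: segment: "annual eye exam" -> code: C-025

Candidate codes: {codes}
segment: "{segment}" -> code:"""

def make_generative_prompt(template, segment, searching_codes):
    # Modify the received template to indicate the subset of searching
    # codes, producing the prompt passed to the generative model.
    return template.format(codes=", ".join(searching_codes), segment=segment)
```

Because the candidate list is the short-listed subset from the retrieval stage, the LLM answers a constrained multiple-choice question rather than searching the universal code space.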
In some embodiments, the techniques (e.g., hardware, software, machine-learned model(s), computer-implemented method(s), system(s), and/or one or more non-transitory computer-readable media) may comprise inputting, by one or more processors, a text segment from a file, such as a natural language document, to a machine learning encoder to generate a text segment vector; extracting, by the one or more processors, a subset of searching codes from a vector data store based on a comparison between the text segment vector and a plurality of code vectors within the vector data store; generating, by the one or more processors, a generative model prompt based on the subset of searching codes; and inputting, by the one or more processors, the generative model prompt to a generative model to generate a code prediction for the text segment.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example overview of an architecture in accordance with some embodiments of the present disclosure.
FIG. 2 depicts an example predictive data analysis computing entity in accordance with some embodiments of the present disclosure.
FIG. 3 depicts an example client computing entity in accordance with some embodiments of the present disclosure.
FIGS. 5A-B depict operational examples of a code request in accordance with some embodiments of the present disclosure.
FIG. 4 depicts a dataflow diagram of a multi-layered autonomous coding framework in accordance with some embodiments of the present disclosure.
FIG. 6 depicts a flowchart diagram of an example process for implementing a multi-layered autonomous coding framework in accordance with some embodiments of the present disclosure.