US-20260127192-A1 - DOMAIN-SPECIFIC RETRIEVAL LANGUAGE MODELS

Abstract

Various examples, systems, and methods are disclosed relating to domain-specific document retrieval that incorporates custom vocabulary integration and embedding model updates. A computing system can extract multiple segments from a collection of documents and generate queries that correspond to at least one segment. The computing system can identify terms that satisfy a uniqueness criterion and input the terms into a tokenizer to create a vocabulary dataset. The vocabulary dataset, the document segments, and the queries can be used to update an embedding model to support retrieval and semantic alignment within private documents.
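The pipeline summarized in the abstract can be sketched in outline: segment the documents, find terms the tokenizer does not already know, and carry both forward for query generation and embedding updates. This is a minimal illustration, not the disclosed implementation; the helper names (`segment_documents`, `find_unique_terms`), the paragraph-break marker, and the frequency cutoff are assumptions.

```python
from collections import Counter

def segment_documents(documents, marker="\n\n"):
    """Split each document into portions at a marker (e.g., paragraph breaks)."""
    return [p.strip() for doc in documents for p in doc.split(marker) if p.strip()]

def find_unique_terms(portions, tokenizer_vocab, min_count=2):
    """Terms that recur in the corpus but are absent from the tokenizer
    vocabulary -- one simple reading of the 'uniqueness criterion'."""
    counts = Counter(w for p in portions for w in p.lower().split())
    return sorted(t for t, c in counts.items()
                  if c >= min_count and t not in tokenizer_vocab)

docs = ["The QZX-9 valve regulates flow.\n\nReplace the QZX-9 seal yearly."]
portions = segment_documents(docs)
terms = find_unique_terms(portions, tokenizer_vocab={"the", "valve", "seal"})
```

Here the internal part number `QZX-9` recurs but is missing from the base vocabulary, so it becomes a candidate for the vocabulary dataset.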

Inventors

  • Jiaheng Huang

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-07
Application Date
2024-11-15
Priority Date
2024-11-01

Claims (20)

  1. One or more processors comprising processing circuitry to: input, to a tokenizer, one or more terms that satisfy a uniqueness criterion to cause the tokenizer to tokenize the one or more terms into a vocabulary dataset, the one or more terms corresponding with a domain and extracted from a plurality of documents; extract, from the plurality of documents, a plurality of portions of the plurality of documents comprising the one or more terms corresponding with the domain; generate, based at least on the plurality of portions, a plurality of queries corresponding to the plurality of portions; and update an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
  2. The one or more processors of claim 1, wherein the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment content of the plurality of documents into the plurality of portions.
  3. The one or more processors of claim 1, wherein the plurality of queries corresponding to the plurality of portions are generated by a large language model (LLM) trained to generate the plurality of queries based on content and context of the extracted plurality of portions.
  4. The one or more processors of claim 3, wherein the content comprises textual information in the plurality of portions, and wherein the context comprises an association of the plurality of queries with the plurality of portions.
  5. The one or more processors of claim 3, wherein the generation of the plurality of queries comprises prompting the LLM with a plurality of instructions based on the content and context of the plurality of portions and corresponding with at least one parameter.
  6. The one or more processors of claim 1, wherein the one or more terms of the plurality of documents are extracted by an LLM trained to identify a plurality of segments of data based on the uniqueness criterion.
  7. The one or more processors of claim 6, wherein the extraction of the one or more terms comprises prompting the LLM with a plurality of instructions to identify a plurality of terms in the plurality of documents and corresponding with the uniqueness criterion.
  8. The one or more processors of claim 1, wherein the uniqueness criterion comprises a plurality of frequencies of the one or more terms being below a threshold frequency in a vocabulary of the tokenizer.
  9. The one or more processors of claim 8, wherein the threshold frequency corresponds to a frequency of occurrence or a frequency of co-occurrence, and wherein the threshold frequency is set based on a plurality of occurrences of a plurality of domain-specific terms within the plurality of documents.
  10. The one or more processors of claim 1, wherein the embedding model comprises a transformer model trained to convert a plurality of textual inputs into a plurality of continuous vector representations based on processing a plurality of tokens through a plurality of multi-layer attention mechanisms to encode a plurality of semantic relationships between the one or more terms.
  11. The one or more processors of claim 1, wherein the one or more processors are comprised in at least one of: a system implementing one or more large language models (LLMs); a system implementing one or more small language models (SLMs); a system implementing one or more vision language models (VLMs); a system for generating synthetic data; a system for generating synthetic data using AI; a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing remote operations; a system for performing real-time streaming; a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more multi-model language models; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  12. A system comprising: one or more processors to execute operations comprising: extract, from a plurality of documents, a plurality of portions of the plurality of documents comprising one or more terms corresponding with a domain; generate, based at least on the plurality of portions, a plurality of queries corresponding to the plurality of portions; input, to a tokenizer, the one or more terms that satisfy a uniqueness criterion to cause the tokenizer to tokenize the one or more terms into a vocabulary dataset, the one or more terms extracted from the plurality of documents; and update an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
  13. The system of claim 12, wherein the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment content of the plurality of documents into the plurality of portions.
  14. The system of claim 12, wherein the plurality of queries corresponding to the plurality of portions are generated by a large language model (LLM) trained to generate the plurality of queries based on content and context of the extracted plurality of portions.
  15. The system of claim 14, wherein the content comprises textual information in the plurality of portions, and wherein the context comprises an association of the plurality of queries with the plurality of portions.
  16. The system of claim 14, wherein the generation of the plurality of queries comprises prompting the LLM with a plurality of instructions based on the content and context of the plurality of portions and corresponding with at least one parameter.
  17. The system of claim 12, wherein the one or more terms of the plurality of documents are extracted by an LLM trained to identify a plurality of segments of data based on the uniqueness criterion.
  18. The system of claim 17, wherein the extraction of the one or more terms comprises prompting the LLM with a plurality of instructions to identify a plurality of terms in the plurality of documents and corresponding with the uniqueness criterion.
  19. The system of claim 12, wherein the uniqueness criterion comprises a plurality of frequencies of the one or more terms being below a threshold frequency in a vocabulary of the tokenizer.
  20. A method, comprising: inputting, using one or more processors, one or more terms that satisfy a uniqueness criterion to cause the one or more processors to tokenize the one or more terms into a vocabulary dataset, the one or more terms corresponding with a domain and extracted from a plurality of documents; extracting, using the one or more processors from the plurality of documents, a plurality of portions of the plurality of documents comprising the one or more terms corresponding with the domain; generating, using the one or more processors based at least on the plurality of portions, a plurality of queries corresponding to the plurality of portions; and updating, using the one or more processors, an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
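Claims 1, 12, and 20 all culminate in updating an embedding model from the generated queries, the extracted portions, and the vocabulary dataset. One common way to assemble such data for an embedding update is as (query, positive portion) pairs with in-batch negatives; the sketch below shows only that pairing step, and the helper name and dictionary layout are hypothetical, not taken from the disclosure.

```python
def build_training_pairs(queries, portions):
    """Pair each generated query with its source portion as a positive
    example; the remaining portions act as in-batch negatives."""
    return [
        {"query": q, "positive": p,
         "negatives": [o for o in portions if o is not p]}
        for q, p in zip(queries, portions)
    ]

pairs = build_training_pairs(
    queries=["how to reset the controller", "what does error E7 mean"],
    portions=["Hold the reset button for five seconds.",
              "Error E7 indicates a sensor fault."],
)
```

In practice the vocabulary dataset would first extend the tokenizer so domain terms map to dedicated tokens, after which pairs like these could drive a contrastive fine-tuning objective on the embedding model; the specific objective is an assumption here.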

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to International Application No. PCT/CN2024/129231, filed Nov. 1, 2024, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

Improving the accuracy and performance of document retrieval in text-based information retrieval systems presents challenges. Some traditional methods rely on generic text retrieval models without support for specialized vocabularies or internal document terminology, leading to inefficiencies and limited retrieval performance in private environments. This approach can result in inadequate retrieval accuracy. Current systems are not configured and/or trained to identify associations between private terms and relevant document content, resulting in inconsistent query responses when processing domain-specific language. Additionally, conventional methods rely on static, pre-trained tokenizers with vocabulary limited to publicly available terms, resulting in inefficiencies and degraded retrieval performance due to lack of recognition of private terminology. This approach can lead to redundant processing and failure to process document-specific terms effectively. Current methods are inadequate for handling terminology updates over time, which increases the complexity of maintaining retrieval relevance across evolving document collections. Challenges in implementing neural networks for embedding-based retrieval models create inefficiencies, affecting the accuracy and computational efficiency of text retrieval in domain-specific environments.

SUMMARY

Implementations of the present disclosure relate to systems and methods for improving text retrieval in document collections using embedding models trained with domain-specific vocabularies.
Systems and methods are disclosed that can use machine learning models, such as large language models (LLMs), combined with automated term extraction and query generation to improve retrieval accuracy across private documents. For example, systems and methods in accordance with the present disclosure can extract domain-specific terms from documents and utilize the terms to generate representative queries. This technical solution can output vector embeddings that capture semantic relationships between private terms, aligning retrieval outputs with the internal document vocabulary. Additionally, the systems and methods can update retrieval criteria based at least on parameters such as term frequency, relevance, or other probabilistic measures, improving retrieval alignment with private vocabularies. By updating embeddings to include newly identified terms based on these parameters, the systems and methods improve retrieval performance without extensive manual intervention. This update allows the retrieval model to maintain accuracy and efficiency across evolving private document collections.

Some implementations relate to one or more processors including processing circuitry. The processing circuitry inputs, to a tokenizer, one or more terms that satisfy a uniqueness criterion to cause the tokenizer to tokenize the one or more terms into a vocabulary dataset. In some implementations, the one or more terms correspond with a domain and are extracted from a plurality of documents. The processing circuitry extracts, from the plurality of documents, a plurality of portions of the plurality of documents including the one or more terms corresponding with the domain. The processing circuitry generates, based at least on the plurality of portions, a plurality of queries corresponding to the plurality of portions. The processing circuitry updates an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
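The uniqueness criterion described above and in claim 8 keys on a term's frequency in the tokenizer's vocabulary falling below a threshold. One practical proxy for a term being rare or absent in a sub-word vocabulary is how badly the tokenizer fragments it; the sketch below uses that reading, and the toy tokenizer and `max_known_pieces` parameter are illustrative assumptions, not a specific library API.

```python
def satisfies_uniqueness(term, tokenize, max_known_pieces=1):
    """Treat a term as domain-specific when the tokenizer does not know it
    as a single unit, i.e., it fragments into many sub-word pieces."""
    return len(tokenize(term)) > max_known_pieces

# Toy tokenizer: keeps known words whole, splits unknown terms per character.
known = {"flow", "valve", "seal"}
toy_tokenize = lambda t: [t] if t in known else list(t)

assert satisfies_uniqueness("qzx9", toy_tokenize)       # fragments -> unique
assert not satisfies_uniqueness("valve", toy_tokenize)  # known -> not unique
```

Terms passing the check would then be fed to the tokenizer to extend the vocabulary dataset before the embedding model is updated.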
In some implementations, the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment content of the plurality of documents into the plurality of portions. In some implementations, the plurality of queries corresponding to the plurality of portions are generated by a large language model (LLM) trained to generate the plurality of queries based on content and context of the extracted plurality of portions. In some implementations, the content includes textual information in the plurality of portions. In some implementations, the context includes an association of the plurality of queries with the plurality of portions. In some implementations, the generation of the plurality of queries includes prompting the LLM with a plurality of instructions based on the content and context of the plurality of portions and corresponding with at least one parameter. In some implementations, the one or more terms of the plurality of documents are extracted by an LLM trained to identify a plurality of segments of data based on the uniqueness criterion. In some implementations, the extraction of the one or more terms includes prompting the LLM with a plurality of instructions to identify a plurality of terms in the plurality of documents and corresponding with the uniqueness criterion.
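The prompted query-generation step described above might be implemented by constructing an instruction prompt per portion and sending it to an LLM. The prompt wording, the `n_queries` parameter, and the function name below are illustrative assumptions; only the overall pattern (instructions grounded in the portion's content and context) comes from the disclosure.

```python
def build_query_prompt(portion, n_queries=3):
    """Instruction prompt asking an LLM to produce search queries for which
    the given document portion is the correct answer."""
    return (
        f"You are given a portion of an internal document.\n"
        f"Write {n_queries} search queries a user might issue for which this "
        f"portion is the correct answer. Use the document's own terminology.\n\n"
        f"Portion:\n{portion}\n"
    )

prompt = build_query_prompt("Hold the reset button for five seconds.")
```

The resulting (query, portion) pairs then serve as training data when updating the embedding model.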