CN-121996741-A - Domain-specific search language model

Abstract

The present disclosure relates to domain-specific search language models. Various examples, systems, and methods are disclosed relating to domain-specific document retrieval that combines custom vocabulary integration with embedding-model updating. A computing system may extract a plurality of segments from a set of documents and generate queries corresponding to the segments. The computing system may identify terms that meet uniqueness criteria and input those terms into a tokenizer to create a vocabulary dataset. The vocabulary dataset, document segments, and queries can then be used to update an embedding model, supporting retrieval and semantic alignment within private document collections.
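
As a rough orientation, the pipeline the abstract describes (segment documents, filter unique terms against a known vocabulary, generate queries, collect training pairs) can be sketched in plain Python. Every function name, marker, and threshold below is an illustrative assumption, not taken from the patent.

from collections import Counter

def segment(documents, marker="\n\n"):
    """Split each document into portions at a structural marker (assumed here to be a blank line)."""
    return [p for doc in documents for p in doc.split(marker) if p.strip()]

def extract_unique_terms(portions, known_vocab, max_freq=2):
    """Keep terms that are rare in the corpus and absent from the existing
    tokenizer vocabulary -- one plausible reading of the 'uniqueness criteria'."""
    counts = Counter(tok for p in portions for tok in p.split())
    return [t for t, c in counts.items() if c <= max_freq and t not in known_vocab]

def build_training_pairs(portions, make_query):
    """Pair each portion with a generated query (make_query would call an LLM)."""
    return [(make_query(p), p) for p in portions]

if __name__ == "__main__":
    docs = ["Internal spec for the QZ-9 flange.\n\nThe QZ-9 torque limit is 42 Nm."]
    portions = segment(docs)
    terms = extract_unique_terms(portions, known_vocab={"the", "for", "is"})
    pairs = build_training_pairs(portions, make_query=lambda p: "What does this passage cover?")
    print(terms)
    print(pairs)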

Inventors

  • HUANG JIAHENG

Assignees

  • NVIDIA Corporation (辉达公司)

Dates

Publication Date
2026-05-08
Application Date
2025-10-30
Priority Date
2024-11-01

Claims (20)

  1. One or more processors, the one or more processors including processing circuitry, the processing circuitry configured to: input one or more terms meeting uniqueness criteria into a tokenizer, such that the tokenizer tokenizes the one or more terms into a vocabulary dataset, the one or more terms corresponding to a domain and extracted from a plurality of documents; extract, from the plurality of documents, a plurality of portions of the plurality of documents that contain the one or more terms corresponding to the domain; generate a plurality of queries corresponding to the plurality of portions based at least on the plurality of portions; and update an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
  2. The one or more processors of claim 1, wherein the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment content of the plurality of documents into the plurality of portions.
  3. The one or more processors of claim 1, wherein the plurality of queries corresponding to the plurality of portions are generated by a large language model (LLM) trained to generate the plurality of queries based on extracted content and context of the plurality of portions.
  4. The one or more processors of claim 3, wherein the content comprises text information in the plurality of portions, and wherein the context comprises an association of the plurality of queries with the plurality of portions.
  5. The one or more processors of claim 3, wherein generating the plurality of queries includes prompting the large language model (LLM) with a plurality of instructions based on the content and context of the plurality of portions, the plurality of instructions corresponding to at least one parameter.
  6. The one or more processors of claim 1, wherein the one or more terms are extracted from the plurality of documents by a large language model (LLM) trained to identify a plurality of data segments based on the uniqueness criteria.
  7. The one or more processors of claim 6, wherein extracting the one or more terms includes prompting the large language model (LLM) with a plurality of instructions to identify a plurality of terms in the plurality of documents that correspond to the uniqueness criteria.
  8. The one or more processors of claim 1, wherein the uniqueness criteria include a plurality of frequencies of the one or more terms being below a threshold frequency in the vocabulary of the tokenizer.
  9. The one or more processors of claim 8, wherein the threshold frequency corresponds to an occurrence frequency or a co-occurrence frequency, and wherein the threshold frequency is set based on multiple occurrences of a plurality of domain-specific terms in the plurality of documents.
  10. The one or more processors of claim 1, wherein the embedding model comprises a transformer model trained to convert a plurality of text inputs into a plurality of continuous vector representations by processing a plurality of tokens to encode a plurality of semantic relationships between the one or more terms through a plurality of multi-layer attention mechanisms.
  11. The one or more processors of claim 1, wherein the one or more processors are included in at least one of: a system implementing one or more large language models (LLMs); a system implementing one or more small language models (SLMs); a system implementing one or more vision language models (VLMs); a system for generating synthetic data; a system for generating synthetic data using AI; a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing remote operations; a system for performing real-time streaming; a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more multimodal language models; a system including one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  12. A system comprising: one or more processors configured to perform operations comprising: extracting, from a plurality of documents, a plurality of portions of the plurality of documents containing one or more terms corresponding to a domain; generating a plurality of queries corresponding to the plurality of portions based at least on the plurality of portions; inputting the one or more terms meeting uniqueness criteria into a tokenizer to cause the tokenizer to tokenize the one or more terms into a vocabulary dataset, the one or more terms extracted from the plurality of documents; and updating an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
  13. The system of claim 12, wherein the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment content of the plurality of documents into the plurality of portions.
  14. The system of claim 12, wherein the plurality of queries corresponding to the plurality of portions are generated by a large language model (LLM) trained to generate the plurality of queries based on extracted content and context of the plurality of portions.
  15. The system of claim 14, wherein the content comprises text information in the plurality of portions, and wherein the context comprises an association of the plurality of queries with the plurality of portions.
  16. The system of claim 14, wherein generating the plurality of queries comprises prompting the large language model (LLM) with a plurality of instructions based on the content and context of the plurality of portions, the plurality of instructions corresponding to at least one parameter.
  17. The system of claim 12, wherein the one or more terms are extracted from the plurality of documents by a large language model (LLM) trained to identify a plurality of data segments based on the uniqueness criteria.
  18. The system of claim 17, wherein extracting the one or more terms comprises prompting the large language model (LLM) with a plurality of instructions to identify a plurality of terms in the plurality of documents that correspond to the uniqueness criteria.
  19. The system of claim 12, wherein the uniqueness criteria include that a plurality of frequencies of the one or more terms are below a threshold frequency in the vocabulary of the tokenizer.
  20. A method comprising: inputting, using one or more processors, one or more terms meeting a uniqueness criterion into a tokenizer, such that the tokenizer tokenizes the one or more terms into a vocabulary dataset, the one or more terms corresponding to a domain and extracted from a plurality of documents; extracting, using the one or more processors, from the plurality of documents, portions of the plurality of documents that contain the one or more terms corresponding to the domain; generating, using the one or more processors, a plurality of queries corresponding to the plurality of portions based at least on the plurality of portions; and updating, using the one or more processors, an embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset.
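
Claims 1 and 8 describe extending a tokenizer's vocabulary with terms whose frequency falls below a threshold in the tokenizer's existing vocabulary. A minimal sketch of that step, assuming a Hugging Face transformers tokenizer; the patent does not name a library, and the model name and term list here are illustrative.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Terms that met the uniqueness criteria (e.g., below-threshold frequency
# in the tokenizer's existing vocabulary) -- hypothetical examples.
domain_terms = ["QZ-9", "flangelock", "torqmeter"]

# Add only terms the tokenizer does not already know, then resize the model's
# token-embedding matrix so the new rows can be trained during the update step.
num_added = tokenizer.add_tokens([t for t in domain_terms if t not in tokenizer.get_vocab()])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))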

Description

Domain-specific search language model

Cross Reference to Related Applications

The present application claims priority from international application PCT/CN2024/129231, filed on November 1, 2024, the disclosure of which is incorporated herein by reference in its entirety.

Background

Improving the accuracy and performance of document retrieval in text-based information retrieval systems is challenging. Some conventional approaches rely on a generic text retrieval model that does not support specialized vocabulary or internal document proper nouns, resulting in inefficiency and limited retrieval performance in private environments. Such approaches may yield insufficient retrieval accuracy. Current systems are not configured and/or trained to identify associations between private terms and related document content, resulting in inconsistent query responses when dealing with domain-specific language. Furthermore, conventional approaches rely on static, pre-trained tokenizers whose vocabularies are limited to publicly available terms, resulting in inefficiency and reduced retrieval performance because private terminology goes unrecognized. This can produce redundant processing and an inability to handle document-specific terms efficiently. Current approaches are also ill-suited to handling terms that change over time, which increases the complexity of maintaining search relevance over an evolving document collection. These challenges in deploying neural networks within embedding-based retrieval models create inefficiencies that affect the accuracy and computational efficiency of text retrieval in domain-specific environments.

Disclosure of Invention

Embodiments of the present disclosure relate to systems and methods for improving text retrieval over a document collection using an embedding model trained with domain-specific vocabulary. The disclosed systems and methods may use machine learning models (e.g., Large Language Models (LLMs)) in conjunction with automatic term extraction and query generation to improve retrieval accuracy across private documents. For example, systems and methods according to the present disclosure may extract domain-specific terms from documents and generate representative queries using those terms. The disclosed techniques can output vector embeddings that capture semantic relationships between private terms, so that retrieval output is aligned with the internal document vocabulary. In addition, the systems and methods may update retrieval criteria based at least on parameters such as term frequency, relevance, or other probability metrics, thereby improving retrieval alignment with the private vocabulary. By updating the embeddings to include newly identified terms based on these parameters, the systems and methods improve retrieval performance without extensive manual intervention. Such updating enables the retrieval model to maintain accuracy and efficiency over an evolving private document collection. Some implementations relate to one or more processors including processing circuitry. The processing circuitry inputs one or more terms that meet uniqueness criteria (e.g., proprietary-term criteria) into a tokenizer, such that the tokenizer tokenizes the one or more terms into a vocabulary dataset. In some implementations, the one or more terms correspond to a domain and are extracted from a plurality of documents.
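
A hedged sketch of the query-generation step described above, in which an LLM is prompted with instructions built from a portion's content and context: call_llm is a stand-in for whatever model endpoint is used, and the prompt wording is illustrative rather than quoted from the disclosure.

def build_query_prompt(portion: str, num_queries: int = 3) -> str:
    # Instructions built from the portion's content; "preserve domain terms"
    # reflects the disclosure's goal of aligning queries with private vocabulary.
    return (
        "You are generating search queries for a private document collection.\n"
        f"Write {num_queries} short queries a user might issue that this passage answers.\n"
        "Preserve any domain-specific terms exactly as written.\n\n"
        f"Passage:\n{portion}\n\nQueries:"
    )

def generate_queries(portions, call_llm):
    """Return (query, portion) pairs; the pairing supplies the 'context'
    association between queries and portions used for training."""
    pairs = []
    for portion in portions:
        for query in call_llm(build_query_prompt(portion)).splitlines():
            if query.strip():
                pairs.append((query.strip(), portion))
    return pairs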
The processing circuitry extracts, from the plurality of documents, portions of the plurality of documents that include one or more terms corresponding to the domain. The processing circuitry generates a plurality of queries corresponding to the plurality of portions based at least on the plurality of portions. The processing circuitry updates the embedding model based at least on the plurality of queries, the plurality of portions, and the vocabulary dataset. In some implementations, the plurality of portions of the plurality of documents are extracted by segmenting the plurality of documents based on at least one marker to segment the content of the plurality of documents into the plurality of portions. In some implementations, the plurality of queries corresponding to the plurality of portions are generated by a Large Language Model (LLM) trained to generate the plurality of queries based on the extracted content and context of the plurality of portions. In some implementations, the content includes text information in the plurality of portions. In some implementations, the context includes associations of the plurality of queries with the plurality of portions. In some implementations, generating the plurality of queries includes prompting a Large Language Model (LLM) with a plurality of instructions corresponding to at least one parameter based on the content and context of the plurality of portions. In some implementations, the one or more terms in the plurality of documents are extracted by a Large Language Model (LLM) that is trained to identify the plurality of data segments based on the uniqueness criteria.
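
The final step updates the embedding model on the generated (query, portion) pairs. The patent does not specify a training objective; a contrastive in-batch-negatives loss, shown here with the sentence-transformers library and an illustrative base model and data, is one standard way such an update could be performed.

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Hypothetical (query, portion) pairs, e.g., produced by the query-generation step.
pairs = [
    ("what is the QZ-9 torque limit", "The QZ-9 torque limit is 42 Nm per internal spec rev C."),
    ("QZ-9 flange material", "The QZ-9 flange is machined from 7075-T6 aluminum."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
examples = [InputExample(texts=[q, p]) for q, p in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=2)

# In-batch negatives: each query is pulled toward its own portion and pushed
# away from the other portions in the batch, aligning queries with the
# private-document vocabulary the tokenizer was extended with.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)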