US-20260127152-A1 - GENERATING PROMPT EXAMPLES FOR EXTRACTING ENTITIES FROM DOCUMENTS
Abstract
An illustrative embodiment provides a computer-implemented method. The method comprises using a processor set to receive a number of documents from a plurality of data sources, wherein the number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents. The processor set splits the number of documents into a number of chunks. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
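The splitting and indexing steps summarized in the abstract might look like the following minimal sketch. The chunk size, the toy bag-of-words "embedding," and the index layout are illustrative assumptions, not details from the disclosure, which does not specify a particular chunking or indexing scheme:

```python
from collections import Counter

CHUNK_WORDS = 50  # illustrative chunk size; the disclosure does not fix one


def split_into_chunks(doc_id, text, size=CHUNK_WORDS):
    """Split a document's text into fixed-size word chunks."""
    words = text.split()
    return [
        {"doc_id": doc_id, "chunk_id": i // size, "text": " ".join(words[i:i + size])}
        for i in range(0, len(words), size)
    ]


def embed(text):
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())


def index_chunks(documents, annotations=None):
    """Index all chunks; chunks from annotated documents carry their entity labels."""
    annotations = annotations or {}
    index = []
    for doc_id, text in documents.items():
        for chunk in split_into_chunks(doc_id, text):
            chunk["vector"] = embed(chunk["text"])
            chunk["entities"] = [e for e in annotations.get(doc_id, []) if e in chunk["text"]]
            index.append(chunk)
    return index


docs = {
    "d1": "Acme Corp reported strong revenue growth this quarter in its filing.",
    "d2": "The target company disclosed earnings in its annual report.",
}
# d1 plays the role of the annotated first subset; d2 the unannotated second subset.
indexed = index_chunks(docs, annotations={"d1": ["Acme Corp"]})
```

Here chunks from the annotated document keep their entity labels in the index, mirroring the claim language that chunks with annotated text are indexed according to the index structure.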
Inventors
- Shreyansh Sharma
- Vishal Minhas
Assignees
- S&P GLOBAL INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20241104
Claims (20)
- 1 . A computer implemented method, comprising: receiving, by a processor set, a number of documents from a plurality of data sources, wherein the number of documents comprises a first subset of documents and a second subset of documents; annotating, by the processor set, entities in the first subset of documents from the number of documents; splitting, by the processor set, the number of documents into a number of chunks, wherein each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents; indexing, by the processor set, the number of chunks according to an index structure to generate a number of indexed chunks, wherein chunks with annotated text from the first subset of documents are indexed according to the index structure; generating, by the processor set, a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents, wherein each prompt example from the number of prompt examples corresponds to an entity in a second document from the second subset of documents, and wherein the number of indexed chunks are enriched by replacing entities in the number of indexed chunks using existing dictionary based mapping of entities to create variations; and generating, by the processor set, a prompt for extracting entities in the second document from the second subset of documents using a large language model based on the number of prompt examples.
- 2 . The computer implemented method of claim 1 , wherein the generating, by the processor set, the number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents comprises: fetching, by the processor set, a number of first indexed chunks in the number of indexed chunks for a first document from the second subset of documents; identifying, by the processor set, a third subset of documents in the first subset of documents for the number of indexed chunks based on a first vector searching technique; identifying, by the processor set, a number of annotated chunks for the third subset of documents from the number of indexed chunks for entities in the number of first indexed chunks; identifying, by the processor set, a number of second indexed chunks from the number of first indexed chunks based on a second vector searching technique between the number of first indexed chunks and the number of annotated chunks; and creating, by the processor set, a number of first prompt examples for the first document from the second subset of documents based on the number of second indexed chunks.
- 3 . The computer implemented method of claim 1 , further comprising: extracting, by the processor set, entities in a second document from the second subset of documents by feeding the prompt into a deep learning model.
- 4 . The computer implemented method of claim 3 , further comprising: outputting, by the processor set, the entities in the second document from the second subset of documents and positions of the entities in the second document from the second subset of documents.
- 5 . The computer implemented method of claim 1 , wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
- 6 . The computer implemented method of claim 1 , further comprising: finetuning, by the processor set, the large language model using the number of indexed chunks.
- 7 . The computer implemented method of claim 1 , wherein the number of indexed chunks is stored in a vector database.
- 8 . A computer system, comprising: a processor set; a set of one or more computer-readable storage media; and program instructions stored on the set of one or more storage media to cause the processor set to perform operations comprising: receiving a number of documents from a plurality of data sources, wherein the number of documents comprises a first subset of documents and a second subset of documents; annotating entities in the first subset of documents from the number of documents; splitting the number of documents into a number of chunks, wherein each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents; indexing the number of chunks according to an index structure to generate a number of indexed chunks, wherein chunks with annotated text from the first subset of documents are indexed according to the index structure; generating a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents, wherein each prompt example from the number of prompt examples corresponds to an entity in a second document from the second subset of documents, and wherein the number of indexed chunks are enriched by replacing entities in the number of indexed chunks using existing dictionary based mapping of entities to create variations; and generating a prompt for extracting entities in the second document from the second subset of documents using a large language model based on the number of prompt examples.
- 9 . The computer system of claim 8 , wherein the generating the number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents comprises: fetching a number of first indexed chunks in the number of indexed chunks for a first document from the second subset of documents; identifying a third subset of documents in the first subset of documents for the number of indexed chunks based on a first vector searching technique; identifying a number of annotated chunks for the third subset of documents from the number of indexed chunks for entities in the number of first indexed chunks; identifying a number of second indexed chunks from the number of first indexed chunks based on a second vector searching technique between the number of first indexed chunks and the number of annotated chunks; and creating a number of first prompt examples for the first document from the second subset of documents based on the number of second indexed chunks.
- 10 . The computer system of claim 8 , wherein the operations further comprise: extracting entities in a second document from the second subset of documents by feeding the prompt into a deep learning model.
- 11 . The computer system of claim 10 , wherein the operations further comprise: outputting the entities in the second document from the second subset of documents and positions of the entities in the second document from the second subset of documents.
- 12 . The computer system of claim 8 , wherein the operations further comprise: finetuning the large language model using the number of indexed chunks.
- 13 . The computer system of claim 8 , wherein the number of indexed chunks is stored in a vector database.
- 14 . A computer program product comprising: a set of one or more computer-readable storage media; program instructions stored in the set of one or more storage media to perform operations comprising: receiving, by a processor set, a number of documents from a plurality of data sources, wherein the number of documents comprises a first subset of documents and a second subset of documents; annotating, by the processor set, entities in the first subset of documents from the number of documents; splitting, by the processor set, the number of documents into a number of chunks, wherein each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents; indexing, by the processor set, the number of chunks according to an index structure to generate a number of indexed chunks, wherein chunks with annotated text from the first subset of documents are indexed according to the index structure; generating, by the processor set, a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents, wherein each prompt example from the number of prompt examples corresponds to an entity in a second document from the second subset of documents, and wherein the number of indexed chunks are enriched by replacing entities in the number of indexed chunks using existing dictionary based mapping of entities to create variations; and generating, by the processor set, a prompt for extracting entities in the second document from the second subset of documents using a large language model based on the number of prompt examples.
- 15 . The computer program product of claim 14 , wherein the generating, by the processor set, the number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents comprises: fetching, by the processor set, a number of first indexed chunks in the number of indexed chunks for a first document from the second subset of documents; identifying, by the processor set, a third subset of documents in the first subset of documents for the number of indexed chunks based on a first vector searching technique; identifying, by the processor set, a number of annotated chunks for the third subset of documents from the number of indexed chunks for entities in the number of first indexed chunks; identifying, by the processor set, a number of second indexed chunks from the number of first indexed chunks based on a second vector searching technique between the number of first indexed chunks and the number of annotated chunks; and creating, by the processor set, a number of first prompt examples for the first document from the second subset of documents based on the number of second indexed chunks.
- 16 . The computer program product of claim 14 , wherein the operations further comprise: extracting, by the processor set, entities in a second document from the second subset of documents by feeding the prompt into a deep learning model.
- 17 . The computer program product of claim 16 , wherein the operations further comprise: outputting, by the processor set, the entities in the second document from the second subset of documents and positions of the entities in the second document from the second subset of documents.
- 18 . The computer program product of claim 14 , wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
- 19 . The computer program product of claim 14 , wherein the operations further comprise: finetuning, by the processor set, the large language model using the number of indexed chunks.
- 20 . The computer system of claim 8 , wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
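The independent claims recite that the indexed chunks are enriched by replacing entities using an existing dictionary-based mapping of entities to create variations. A minimal sketch of that step, assuming a hypothetical variant dictionary (the mapping contents and the simple string replacement are illustrative assumptions):

```python
# Hypothetical dictionary mapping each known entity to variants of the same type.
ENTITY_VARIANTS = {
    "Acme Corp": ["Acme Corporation", "ACME"],
    "New York": ["NYC"],
}


def enrich_chunk(text, entities):
    """Create chunk variations by swapping annotated entities for dictionary variants."""
    variations = []
    for entity in entities:
        for variant in ENTITY_VARIANTS.get(entity, []):
            variations.append({
                "text": text.replace(entity, variant),
                "entities": [variant if e == entity else e for e in entities],
            })
    return variations


variants = enrich_chunk("Acme Corp opened an office in New York.", ["Acme Corp", "New York"])
```

Each variation keeps the annotation aligned with the substituted surface form, so one annotated chunk yields several enriched indexed chunks for example generation.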
Description
BACKGROUND INFORMATION
1. Field
The present disclosure relates generally to generating prompt examples, and more specifically to generating prompt examples for extracting entities from documents.
2. Background
Prompts refer to the input text or questions provided to a deep learning model, such as a large language model, to generate a response. A prompt serves as a starting point for the model's processing and can be as simple as a word or a phrase, or as complex as detailed instructions or questions. Prompts are critical in guiding how deep learning models such as large language models generate content because they set the context and define the scope of the response. In other words, large language models rely heavily on prompts to interpret user intent, and users can ensure that large language models produce useful results by clearly framing questions or instructions in the prompts.
SUMMARY
An illustrative embodiment provides a computer-implemented method. The method comprises using a processor set to receive a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents.
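Claim 2 details how the prompt-example generation step just described may work: a first vector search over the indexed chunks identifies similar annotated documents, and a second vector search between the target document's chunks and those documents' annotated chunks selects the examples. A minimal sketch, assuming cosine similarity over toy vectors; the chunk contents, vectors, and top-k parameters are all illustrative assumptions:

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy indexed chunks: (doc_id, vector, text, entities). One chunk per document here.
annotated_chunks = [
    ("a1", [1.0, 0.1, 0.0], "Acme Corp posted record revenue.", ["Acme Corp"]),
    ("a2", [0.0, 1.0, 0.2], "Globex Inc filed for an IPO.", ["Globex Inc"]),
]
target_chunks = [("t1", [0.9, 0.2, 0.0], "Initech announced quarterly revenue.", None)]


def build_prompt_examples(target_chunks, annotated_chunks, top_docs=1, top_chunks=1):
    # First vector search: rank annotated documents by similarity to the target chunks.
    doc_scores = {}
    for _, tvec, _, _ in target_chunks:
        for doc, avec, _, _ in annotated_chunks:
            doc_scores[doc] = max(doc_scores.get(doc, 0.0), cosine(tvec, avec))
    similar_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]
    # Second vector search: pick the closest annotated chunks from those documents.
    candidates = [c for c in annotated_chunks if c[0] in similar_docs]
    examples = []
    for _, tvec, _, _ in target_chunks:
        ranked = sorted(candidates, key=lambda c: cosine(tvec, c[1]), reverse=True)
        for _, _, text, entities in ranked[:top_chunks]:
            examples.append({"text": text, "entities": entities})
    return examples


examples = build_prompt_examples(target_chunks, annotated_chunks)
```

Restricting the second search to chunks from the most similar annotated documents keeps the retrieved examples topically close to the target document before they are placed in the prompt.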
The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
Another illustrative embodiment provides a computer system. The system comprises a processor set, a set of one or more computer-readable storage media, and program instructions stored on the set of one or more storage media to cause the processor set to perform operations comprising receiving a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
Another illustrative embodiment provides a computer program product. The computer program product comprises a set of one or more computer-readable storage media, and program instructions stored in the set of one or more storage media to perform operations comprising using a processor set to receive a number of documents from a plurality of data sources.
The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples. The features and functions can be achieved independently in various embodiments of the present disclosure or may be combined in yet other embodiments.
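Claim 5 states that the generated prompt further comprises general instructions for the large language model, plus definitions and an entity schema alongside the prompt examples. A sketch of how such a prompt might be assembled; the layout, field names, and example content are illustrative assumptions, since the disclosure does not specify a prompt format:

```python
def assemble_prompt(instructions, entity_schema, examples, target_text):
    """Assemble an extraction prompt: instructions, schema, few-shot examples, target."""
    lines = [instructions, "", "Entity schema:"]
    for name, definition in entity_schema.items():
        lines.append(f"- {name}: {definition}")
    lines.append("")
    for ex in examples:
        lines.append(f"Text: {ex['text']}")
        lines.append(f"Entities: {ex['entities']}")
        lines.append("")
    lines.append(f"Text: {target_text}")
    lines.append("Entities:")  # the large language model completes from here
    return "\n".join(lines)


prompt = assemble_prompt(
    instructions="Extract all named entities from the text below.",
    entity_schema={"ORG": "a company or organization name"},
    examples=[{"text": "Acme Corp posted record revenue.", "entities": ["Acme Corp"]}],
    target_text="Initech announced quarterly revenue.",
)
```

The resulting string would then be fed to the large language model, whose completion after the final "Entities:" line would be parsed as the extracted entities for the target document.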