US-20260127175-A1 - MACHINE LEARNING-BASED PROCESSING OF QUERIES USING STRUCTURED DATA CONVERTED TO NATURAL LANGUAGE DOCUMENTS

US 20260127175 A1

Abstract

An apparatus comprises at least one processing device configured to obtain, at an application interface, a query directed to structured data, to convert the structured data into natural language documents, to associate the natural language documents with context and differentiating entities in a hierarchical database defining a hierarchy of the differentiating and context entities, and to determine a tenant boundary of the query specifying at least one of the context entities associated with at least one of the differentiating entities in the hierarchical database. The at least one processing device is further configured to generate a prompt identifying the query and a subset of the natural language documents selected based at least in part on the determined tenant boundary, to process the prompt utilizing at least one machine learning model to generate an answer for the query, and to provide, via the application interface, the generated answer to the query.
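The flow in the abstract — natural language documents tied to context entities under differentiating entities in a hierarchy, a tenant boundary determined for the query, and a prompt built from only the in-boundary documents — can be sketched as follows. This is a minimal illustration, not the patented implementation; every class, field, and entity name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalIndex:
    """Toy stand-in for the hierarchical database: maps each
    differentiating entity (e.g. a customer) to the context entities
    (e.g. orders) beneath it, and each context entity to the natural
    language document produced from the structured data."""
    hierarchy: dict = field(default_factory=dict)  # differentiating -> {context ids}
    documents: dict = field(default_factory=dict)  # context id -> document text

    def add(self, differentiating, context_id, document):
        self.hierarchy.setdefault(differentiating, set()).add(context_id)
        self.documents[context_id] = document

    def tenant_boundary(self, differentiating):
        # The tenant boundary is the set of context entities associated
        # with the identified differentiating entity.
        return self.hierarchy.get(differentiating, set())

def build_prompt(query, index, differentiating):
    """Select only in-boundary documents and assemble the model prompt."""
    boundary = index.tenant_boundary(differentiating)
    context = "\n".join(index.documents[c] for c in sorted(boundary))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

index = HierarchicalIndex()
index.add("customer-1", "order-100", "Customer 1 placed order 100 for two laptops.")
index.add("customer-2", "order-200", "Customer 2 placed order 200 for one monitor.")
prompt = build_prompt("What did I order?", index, "customer-1")
# Only customer-1's document crosses the tenant boundary into the prompt;
# customer-2's data never reaches the machine learning model.
```

The design point of interest is that isolation is enforced at retrieval time, before prompt construction, rather than relying on the model to withhold out-of-tenant data.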

Inventors

  • Thirumaleshwara Adyanadka Shama
  • Shibi Panikkar

Assignees

  • DELL PRODUCTS L.P.

Dates

Publication Date
2026-05-07
Application Date
2024-11-01

Claims (20)

  1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured:
     to obtain, at an application interface, a query, the query being directed to structured data in one or more structured data sources;
     to convert at least a portion of the structured data extracted from the one or more structured data sources into one or more natural language documents;
     to associate the one or more natural language documents with one or more context entities and one or more differentiating entities in a hierarchical database defining a hierarchy of the one or more differentiating entities and the one or more context entities;
     to store the one or more natural language documents in a vectorized format in a vector database, the one or more natural language documents being associated with vector indexes in the hierarchical database, the vector indexes being associated with respective ones of the one or more context entities in the hierarchical database;
     to determine a tenant boundary of the query, the tenant boundary specifying at least one of the one or more context entities associated with at least one of the one or more differentiating entities in the hierarchical database;
     to generate a prompt for processing utilizing at least one machine learning model, the prompt identifying the query and a subset of the one or more natural language documents selected based at least in part on the determined tenant boundary, wherein generating the prompt comprises retrieving the vectorized formats of the identified subset of the one or more natural language documents from the vector database and utilizing the vectorized formats of the identified subset of the one or more natural language documents in generating the prompt;
     to process the prompt utilizing the at least one machine learning model to generate an answer for the query; and
     to provide, via the application interface, the generated answer to the query.
  2. The apparatus of claim 1 wherein the one or more structured data sources comprise at least one of an online transaction processing database and an online analytical processing database.
  3. The apparatus of claim 1 wherein converting the portion of the structured data extracted from the one or more structured data sources into the one or more natural language documents utilizes a natural language linker configuration template, the natural language linker configuration template specifying an ordering for creating natural language from content in two or more columns of the structured data.
  4. The apparatus of claim 3 wherein the natural language linker configuration template further specifies, for at least one of the two or more columns of the structured data, at least one of a suffix and a prefix to be appended to the content of the at least one of the two or more columns of the structured data.
  5. The apparatus of claim 3 wherein the natural language linker configuration template further specifies whether respective ones of the two or more columns of the structured data represent at least one of the one or more differentiating entities.
  6. The apparatus of claim 3 wherein converting the portion of the structured data from the one or more structured data sources into the one or more natural language documents further comprises tuning an output of the natural language linker configuration template utilizing a large language model.
  7. The apparatus of claim 3 wherein converting the portion of the structured data from the one or more structured data sources into the one or more natural language documents further comprises testing an output of the natural language linker configuration template utilizing one or more test queries.
  8. (canceled)
  9. (canceled)
  10. The apparatus of claim 1 wherein determining the tenant boundary of the query comprises determining said at least one of the one or more differentiating entities in the hierarchical database based at least in part on identifying a source that submitted the query to the application interface.
  11. The apparatus of claim 1 wherein determining the tenant boundary of the query comprises determining said at least one of the one or more differentiating entities in the hierarchical database based at least in part on identifying a user that is logged in to the application interface.
  12. The apparatus of claim 1 wherein determining the tenant boundary of the query comprises determining said at least one of the one or more context entities in the hierarchical database based at least in part on applying natural language processing to text of the query.
  13. The apparatus of claim 1 wherein the at least one machine learning model comprises a generative artificial intelligence model.
  14. The apparatus of claim 1 wherein the at least one machine learning model comprises a large language model.
  15. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device:
     to obtain, at an application interface, a query, the query being directed to structured data in one or more structured data sources;
     to convert at least a portion of the structured data extracted from the one or more structured data sources into one or more natural language documents;
     to associate the one or more natural language documents with one or more context entities and one or more differentiating entities in a hierarchical database defining a hierarchy of the one or more differentiating entities and the one or more context entities;
     to store the one or more natural language documents in a vectorized format in a vector database, the one or more natural language documents being associated with vector indexes in the hierarchical database, the vector indexes being associated with respective ones of the one or more context entities in the hierarchical database;
     to determine a tenant boundary of the query, the tenant boundary specifying at least one of the one or more context entities associated with at least one of the one or more differentiating entities in the hierarchical database;
     to generate a prompt for processing utilizing at least one machine learning model, the prompt identifying the query and a subset of the one or more natural language documents selected based at least in part on the determined tenant boundary, wherein generating the prompt comprises retrieving the vectorized formats of the identified subset of the one or more natural language documents from the vector database and utilizing the vectorized formats of the identified subset of the one or more natural language documents in generating the prompt;
     to process the prompt utilizing the at least one machine learning model to generate an answer for the query; and
     to provide, via the application interface, the generated answer to the query.
  16. The computer program product of claim 15 wherein converting the portion of the structured data extracted from the one or more structured data sources into the one or more natural language documents utilizes a natural language linker configuration template, the natural language linker configuration template specifying an ordering for creating natural language from content in two or more columns of the structured data.
  17. The computer program product of claim 15 wherein determining the tenant boundary of the query comprises determining said at least one of the one or more differentiating entities in the hierarchical database based at least in part on identifying a user that is logged in to the application interface.
  18. A method comprising:
     obtaining, at an application interface, a query, the query being directed to structured data in one or more structured data sources;
     converting at least a portion of the structured data extracted from the one or more structured data sources into one or more natural language documents;
     associating the one or more natural language documents with one or more context entities and one or more differentiating entities in a hierarchical database defining a hierarchy of the one or more differentiating entities and the one or more context entities;
     storing the one or more natural language documents in a vectorized format in a vector database, the one or more natural language documents being associated with vector indexes in the hierarchical database, the vector indexes being associated with respective ones of the one or more context entities in the hierarchical database;
     determining a tenant boundary of the query, the tenant boundary specifying at least one of the one or more context entities associated with at least one of the one or more differentiating entities in the hierarchical database;
     generating a prompt for processing utilizing at least one machine learning model, the prompt identifying the query and a subset of the one or more natural language documents selected based at least in part on the determined tenant boundary, wherein generating the prompt comprises retrieving the vectorized formats of the identified subset of the one or more natural language documents from the vector database and utilizing the vectorized formats of the identified subset of the one or more natural language documents in generating the prompt;
     processing the prompt utilizing the at least one machine learning model to generate an answer for the query; and
     providing, via the application interface, the generated answer to the query;
     wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
  19. The method of claim 18 wherein converting the portion of the structured data extracted from the one or more structured data sources into the one or more natural language documents utilizes a natural language linker configuration template, the natural language linker configuration template specifying an ordering for creating natural language from content in two or more columns of the structured data.
  20. The method of claim 18 wherein determining the tenant boundary of the query comprises determining said at least one of the one or more differentiating entities in the hierarchical database based at least in part on identifying a user that is logged in to the application interface.
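Claims 3 through 5 describe a natural language linker configuration template that orders the content of structured-data columns, optionally attaches a prefix or suffix to each column's content, and flags which columns represent differentiating entities. A minimal sketch of such a template, assuming a list-of-dicts format (the field names `column`, `order`, `prefix`, `suffix`, and `is_differentiating` are hypothetical, not taken from the disclosure):

```python
def link_row(row, template):
    """Render one structured-data row as a natural language sentence,
    collecting the values of columns flagged as differentiating entities."""
    parts = []
    differentiating = {}
    for spec in sorted(template, key=lambda s: s["order"]):
        value = str(row[spec["column"]])
        # Per claim 4: optional prefix/suffix around the column content.
        parts.append(spec.get("prefix", "") + value + spec.get("suffix", ""))
        # Per claim 5: flag columns that represent differentiating entities.
        if spec.get("is_differentiating"):
            differentiating[spec["column"]] = row[spec["column"]]
    return " ".join(parts), differentiating

# Per claim 3: the template specifies an ordering for creating natural
# language from content in two or more columns.
template = [
    {"column": "customer", "order": 1, "prefix": "Customer ",
     "is_differentiating": True},
    {"column": "quantity", "order": 2, "prefix": "ordered "},
    {"column": "product",  "order": 3, "suffix": "."},
]
sentence, diff = link_row(
    {"customer": "Acme", "quantity": 3, "product": "servers"}, template)
# sentence == "Customer Acme ordered 3 servers."
```

Claims 6 and 7 add that the template's output may then be tuned by a large language model and validated against test queries, steps deliberately omitted from this sketch.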

Description

BACKGROUND

As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. Information processing systems may be used to process, compile, store and communicate various types of information, including through the use of artificial intelligence (AI) and machine learning (ML). Large language models (LLMs) are a type of AI system that uses ML algorithms to process vast amounts of natural language text data. LLMs may be used to perform various natural language processing (NLP) tasks, including text classification, text summarization, text generation, named entity recognition, text sentiment analysis, and question answering.

SUMMARY

Illustrative embodiments of the present disclosure provide techniques for machine learning-based processing of queries using structured data converted to natural language documents.

In one embodiment, an apparatus comprises at least one processing device comprising a processor coupled to a memory. The at least one processing device is configured to obtain, at an application interface, a query, the query being directed to structured data in one or more structured data sources. The at least one processing device is also configured to convert at least a portion of the structured data extracted from the one or more structured data sources into one or more natural language documents, to associate the one or more natural language documents with one or more context entities and one or more differentiating entities in a hierarchical database defining a hierarchy of the one or more differentiating entities and the one or more context entities, and to determine a tenant boundary of the query, the tenant boundary specifying at least one of the one or more context entities associated with at least one of the one or more differentiating entities in the hierarchical database. The at least one processing device is further configured to generate a prompt for processing utilizing at least one machine learning model, the prompt identifying the query and a subset of the one or more natural language documents selected based at least in part on the determined tenant boundary, to process the prompt utilizing the at least one machine learning model to generate an answer for the query, and to provide, via the application interface, the generated answer to the query.

These and other illustrative embodiments include, without limitation, methods, apparatus, networks, systems and processor-readable storage media.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an information processing system configured for machine learning-based processing of queries using structured data converted to natural language documents in an illustrative embodiment.
FIG. 2 is a flow diagram of an exemplary process for machine learning-based processing of queries using structured data converted to natural language documents in an illustrative embodiment.
FIG. 3 shows an online transaction processing database with a single order for a customer in an illustrative embodiment.
FIG. 4 shows a system for large language model-based processing of natural language queries in an illustrative embodiment.
FIG. 5 shows a system flow for online prompt-driven analytical processing of structured data using a large language model in an illustrative embodiment.
FIG. 6 shows a system for online prompt-driven analytical processing of natural language queries using structured data converted into natural language documents with an extract, transform to document, load process in an illustrative embodiment.
FIG. 7 shows a natural language linkers configurator for data from an online transaction processing database in an illustrative embodiment.
FIG. 8 shows an online prompt-driven analytical processing data modeling tool implementing a hierarchical database and a vector database in an illustrative embodiment.
FIG. 9 shows a system flow for online prompt-driven analytical processing of natural language queries using structured data converted into natural language documents with an extract, transform to document, load process in an illustrative embodiment.
FIG. 10 shows a natural language document produced using an extract, transform to document, load process with and without large language model-based tuning in an illustrative embodiment.
FIG. 11 shows an online transaction processing database with multiple orders for a customer in an illustrative embodiment.
FIG. 12 shows a natural language document produced using an extract, transform to document, load process with large language model-based tuning in an illustrative embodiment.
FIG. 13 shows a query and answer produced using a large language model which utilizes an online prompt-driven analytical processing data modeling tool in an illustrative embodiment.
FIG. 14 shows an online transaction processing database with orders for multiple customers in an illustrative embodiment.
FIG. 15 shows an example of online prompt
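The vector-database step recited in the claims — documents stored in vectorized form, vector indexes tied to context entities, and only in-boundary vectors consulted at query time — can be sketched with a toy bag-of-words "embedding". A real deployment would use a learned embedding model and a vector store; every name and data structure below is an assumption for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, vector_db, boundary, top_k=2):
    """Rank only documents whose vector index lies inside the tenant
    boundary; out-of-boundary vectors are never scored."""
    q = embed(query)
    in_scope = [(cosine(q, vec), idx)
                for idx, (vec, _) in vector_db.items()
                if idx in boundary]
    in_scope.sort(reverse=True)
    return [vector_db[idx][1] for _, idx in in_scope[:top_k]]

# vector index -> (stored vector, natural language document)
vector_db = {
    "v1": (embed("order 100 two laptops"), "Order 100: two laptops."),
    "v2": (embed("order 200 one monitor"), "Order 200: one monitor."),
}
# The tenant boundary resolved for this query admits only index "v1".
docs = retrieve("laptops in my order", vector_db, boundary={"v1"})
```

Filtering by the boundary before similarity scoring, rather than after, keeps out-of-tenant documents from ever entering the candidate set for the prompt.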