US-12625875-B2 - Semantic search interface for data repositories


Abstract

A method provides visual analysis of datasets. A system receives a natural language search query that is directed to data repositories including data sources and data visualizations. The system parses search tokens to determine if the natural language search query contains analytic intents. The system also determines if the search tokens match fields in one or more data sources, using a semantic search. When (i) the search tokens match fields in the one or more data sources and (ii) the natural language search query contains analytic intents, the system generates and displays visualization responses. When (i) the search tokens do not match fields in the data sources or (ii) the natural language search query does not contain the analytic intents, the system displays pre-authored content from the data visualizations.
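
The branching described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation: the names `Response` and `route_query` are hypothetical, and the two boolean inputs stand in for the results of the semantic field match and the intent parsing, which are assumed to be computed elsewhere.

```python
from enum import Enum


class Response(Enum):
    """Hypothetical labels for the two response types named in the abstract."""
    GENERATED_VISUALIZATION = "generated_visualization"
    PRE_AUTHORED_CONTENT = "pre_authored_content"


def route_query(tokens_match_fields: bool, has_analytic_intent: bool) -> Response:
    """Choose a response type from the two determinations.

    A visualization is generated only when BOTH conditions hold;
    if either one fails, pre-authored content is shown instead.
    """
    if tokens_match_fields and has_analytic_intent:
        return Response.GENERATED_VISUALIZATION
    return Response.PRE_AUTHORED_CONTENT
```

Note that the second branch is the logical complement of the first: a query whose tokens match fields but carries no analytic intent, and a query with intent but no matching fields, both fall back to pre-authored content.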

Inventors

Vidya Raghavan Setlur
Arjun Srinivasan
Andriy Kanyuka

Assignees

SALESFORCE, INC.

Dates

Publication Date: 20260512
Application Date: 20240130

Claims (19)

1. A method of visual analysis of datasets, comprising: at a computing system having one or more processors and memory storing one or more programs configured for execution by the one or more processors: receiving a natural language search query that is directed to a plurality of data repositories comprising a plurality of data sources and one or more data visualizations; parsing search tokens corresponding to the natural language search query to determine if the natural language search query contains one or more analytic intents; determining if the search tokens match data fields in one or more data sources of the plurality of data sources, using a semantic search, wherein the semantic search comprises: indexing each of the plurality of data repositories and their metadata to obtain indices; and performing a federated search to determine if the search tokens match fields in the one or more data sources of the plurality of data sources, based on the indices; in accordance with a determination that (i) the search tokens match fields in the one or more data sources and (ii) the natural language search query contains one or more analytic intents, generating and displaying one or more visualization responses; and in accordance with a determination that (i) the search tokens do not match fields in the plurality of data sources or (ii) the natural language search query does not contain the one or more analytic intents, displaying pre-authored content from the one or more data visualizations.
2. The method of claim 1, further comprising: obtaining the search tokens using a federated query search that distributes a query to multiple search repositories and combines results into a single, consolidated search result.
3. The method of claim 1, wherein the one or more analytic intents is selected from the group consisting of: grouping, aggregation, correlation, filter and limits, temporal, and geospatial.
4. The method of claim 1, wherein parsing the search tokens further comprises identifying data fields and data values along with the one or more analytic intents based on the plurality of data sources and their metadata.
5. The method of claim 4, wherein identifying data fields and data values comprises comparing N-grams corresponding to the search tokens to available data fields for syntactic similarities and semantic similarities.
6. The method of claim 5, wherein the syntactic similarities are identified using Levenshtein distance and the semantic similarities are identified using Wu-Palmer similarity scores.
7. The method of claim 1, wherein the indexing comprises: for each data repository and visualization context with associated metadata, representing each file as a respective document vector; and storing N-gram string tokens from the document vectors to support partial and exact matches.
8. The method of claim 1, wherein performing the federated search comprises: obtaining a query vector corresponding to the search tokens; encoding the query vector into query string tokens using an encoder that was used to generate the indices; and selecting a predetermined number of candidate document vectors from document vectors for each data repository and visualization context with associated metadata, based on an amount of overlap between the query string tokens and document string tokens for the document vectors.
9. The method of claim 8, further comprising: ranking the predetermined number of candidate document vectors using a scoring function that scores documents based on the search tokens appearing in each document, regardless of their proximity within the document.
10. The method of claim 1, further comprising: generating and displaying the one or more visualization responses based on data fields, data values, and the one or more analytical intents in the natural language search query.
11. The method of claim 1, further comprising: in accordance with a determination that (i) the semantic search returns a matching data source for the natural language query and (ii) the search tokens do not resolve to valid data fields and data values within the data source, displaying suggested queries for the data source.
12. The method of claim 11, further comprising: generating the suggested queries using a template-based approach based on a combination of data fields from the data source and data interestingness metrics.
13. The method of claim 1, further comprising: generating and displaying the one or more visualization responses using three encoding channels (x, y, and color) and four mark types (bar, line, point, and geo-shape), thereby supporting dynamic generation of bar charts, line charts, scatterplots, and maps that cover a range of analytic intents.
14. The method of claim 1, further comprising: determining mark types of the one or more visualization responses based on mappings between visual encodings and data types of data fields.
15. The method of claim 1, further comprising: generating and displaying a dynamic text summary describing the one or more visualization responses using one or more statistical computations and a large language model; providing, to the large language model, a prompt containing a statistical description that is extracted from the one or more visualization responses using a predefined set of heuristics; and in response to providing the prompt, receiving the dynamic text summary from the large language model.
16. The method of claim 15, wherein the prompt corresponds to (i) minimum/maximum and average values for a bar chart, and (ii) the Pearson's correlation coefficient for scatterplots.
17. The method of claim 1, further comprising, before receiving the natural language search query: receiving user selection of a data source; presenting a graphical user interface for analysis of data in the selected data source; and providing three search options including: (i) a question-and-answer search for interpreting analytical intent within the selected data source; (ii) an exploratory search for document-based information retrieval of indexed visualization content for the selected data source; and (iii) a design search that uses visualization metadata for the selected data source.
18. A computer system for visual analysis of datasets, comprising: one or more processors; and memory; wherein the memory stores one or more programs configured for execution by the one or more processors, and the one or more programs comprising instructions for: receiving a natural language search query that is directed to a plurality of data repositories comprising a plurality of data sources and one or more data visualizations; parsing search tokens corresponding to the natural language search query to determine if the natural language search query contains one or more analytic intents; determining if the search tokens match data fields in one or more data sources of the plurality of data sources, using a semantic search, wherein the semantic search comprises: indexing each of the plurality of data repositories and their metadata to obtain indices; and performing a federated search to determine if the search tokens match fields in the one or more data sources of the plurality of data sources, based on the indices; in accordance with a determination that (i) the search tokens match fields in the one or more data sources and (ii) the natural language search query contains one or more analytic intents, generating and displaying one or more visualization responses; and in accordance with a determination that (i) the search tokens do not match fields in the plurality of data sources or (ii) the natural language search query does not contain the one or more analytic intents, displaying pre-authored content from the one or more data visualizations.
19. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer system having one or more processors, and memory, the one or more programs comprising instructions for: receiving a natural language search query that is directed to a plurality of data repositories comprising a plurality of data sources and one or more data visualizations; parsing search tokens corresponding to the natural language search query to determine if the natural language search query contains one or more analytic intents; determining if the search tokens match data fields in one or more data sources of the plurality of data sources, using a semantic search, wherein the semantic search comprises: indexing each of the plurality of data repositories and their metadata to obtain indices; and performing a federated search to determine if the search tokens match fields in the one or more data sources of the plurality of data sources, based on the indices; in accordance with a determination that (i) the search tokens match fields in the one or more data sources and (ii) the natural language search query contains one or more analytic intents, generating and displaying one or more visualization responses; and in accordance with a determination that (i) the search tokens do not match fields in the plurality of data sources or (ii) the natural language search query does not contain the one or more analytic intents, displaying pre-authored content from the one or more data visualizations.
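
The syntactic half of the N-gram comparison recited in claims 5 and 6 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the claimed implementation: the function names, the maximum N-gram length of 3, and the 0.25 normalized-distance threshold are all hypothetical choices made for illustration. The semantic half (Wu-Palmer similarity) is omitted here because it requires a lexical taxonomy such as WordNet.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ngrams(tokens: list[str], max_n: int = 3) -> list[str]:
    """All contiguous token N-grams up to max_n, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def match_fields(tokens: list[str], fields: list[str],
                 max_ratio: float = 0.25) -> dict[str, str]:
    """Map each field name to its closest query N-gram.

    A field is matched only if the edit distance, normalized by the
    longer string's length, is at most max_ratio (an assumed threshold).
    """
    grams = ngrams([t.lower() for t in tokens])
    matches: dict[str, str] = {}
    for field in fields:
        target = field.lower()
        best: tuple[str, float] | None = None
        for gram in grams:
            ratio = levenshtein(gram, target) / max(len(gram), len(target))
            if ratio <= max_ratio and (best is None or ratio < best[1]):
                best = (gram, ratio)
        if best is not None:
            matches[field] = best[0]
    return matches
```

For example, with the tokens `["average", "unit", "prise", "by", "region"]` and the fields `["Unit Price", "Region", "Order Date"]`, the misspelled bigram "unit prise" is one edit away from "unit price" and so still resolves to the `Unit Price` field, while `Order Date` has no N-gram close enough and is left unmatched.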

Description

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 63/457,367, filed Apr. 5, 2023, entitled “Semantic Search Interface for Data Repositories,” which is incorporated by reference herein in its entirety. This application claims priority to U.S. Provisional Application Ser. No. 63/461,237, filed Apr. 21, 2023, entitled “Semantic Search Interface for Data Repositories,” which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The disclosed implementations relate generally to data visualizations, and more specifically to systems, methods, and user interfaces for semantic search of data repositories.

BACKGROUND

User expectations of search interfaces are evolving. Search engines are increasingly expected to answer questions along with providing contextually relevant content that helps address a searcher's goal. Current keyword-based search methods are mostly designed for content retrieval. Their main underlying drawback is limited support for structured query types that generally expect focused and specific responses. Natural language (NL) question and answering (Q&A) interfaces, on the other hand, support more fact-finding inquiry but do not support content or document discovery and retrieval. With an increase in the number of data repositories on the web, including structured data in the form of relational databases, files, and knowledge graphs, there is a plethora of information that supports the blend of generating responses to fact-finding questions with document retrieval. Along similar lines, data repositories and visualization tools host hundreds or thousands of visualizations representing a wide range of datasets, making them rich platforms for knowledge sharing and consumption.

Search plays a pivotal role in these repositories, providing people the ability to winnow in on content they are interested in (e.g., charts on a specific topic, charts showing data trends and bespoke visualizations such as Sankey diagrams, or charts authored by a particular person). Current search systems tend to rely on document-retrieval techniques to provide relevant search results for a given query. However, the challenge with data repositories lies in the sparseness of searchable text within them; data sources and charts often have limited text information in the form of titles, captions, and textual data values, for example. There is a need to explore alternative ways to index and search for content based on this limited availability of textual information.

Another challenge is that current search features for data repositories offer limited expressivity in specifying search queries, restricting users to predominantly keyword search for content based on the visualizations' titles and authors. In contrast, other contemporary search interfaces, such as general web search, image and video search, and social networking sites, enable users to find and discover content through a rich combination of textual content (e.g., keywords or topics covered in a website), visual features within the content (e.g., looking for images with a specific background color), dates (e.g., viewing videos from the recent week), geographic locations (e.g., limiting search to zip codes or cities), and even different types of media (e.g., searching for similar images through reverse image search).

Designing expressive search interfaces for data repositories requires gaining a deeper empirical understanding of people's search requirements, given the current limitations of these systems. For instance, what goals do people have in mind when using search in the context of data repositories? How do people formulate their search queries? Is text alone a sufficient modality for search? If not, what are complementary/alternative modalities? What supporting metadata do people want to query for or use to filter the search results?

SUMMARY

Accordingly, there is a need for systems, methods and interfaces for semantic search of data repositories. Some implementations bridge the gap between two contrasting search paradigms, keyword-based search methods and natural language (NL) question and answering (Q&A) interfaces, based on a hybrid methodology called semantic search. Semantic search applies user intent and the meaning (e.g., semantics) of words and phrases to determine the right content that might not be present immediately in the text (the keywords themselves) but is closely tied to what the searcher wants. The information retrieval technique goes beyond simple keyword matching by using information such as entity recognition, word disambiguation, and relationship extraction to interpret the searcher's intent in the queries. For example, keyword search can find documents with the query “French press,” while queries such as “How do I quickly make strong coffee?” or “manual coffee brewing methods” are better served by semantic search to produce targeted responses. Some implementations provide hybrid