US-20260127228-A1 - Progressing Search Instances in Weak Search Signal Instances

Abstract

Systems and methods for progressing search instances in weak signal instances can include obtaining a visual query, determining a plurality of initial search results, determining the visual query has weak search signals based on the plurality of initial search results or search intent ambiguity, generating a model-generated output to prompt the user, and providing the model-generated output for display with a subset of search results. The model-generated output can include a model-generated prompt that provides an interface for a user to provide additional inputs for additional search query clarity.

Inventors

BELINDA LUNA ZENG
Zhihao Li
David Ping Chou
Mingcen Gao
Sowmya Sree Bhagavatula
Harshit Kharbanda
Dounia Berrada
Sundeep Vaddadi
Jieming Yu
Kaan YÜCER
Michael Oh
Christopher James Kelley
Louis Wang

Assignees

GOOGLE LLC

Dates

Publication Date: 20260507
Application Date: 20251006

Claims (20)

1. A computing system for search interface prompting, the system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a multimodal query, wherein the multimodal query comprises a text input and an image input; processing the multimodal query to determine a plurality of initial search results, wherein the plurality of initial search results are determined to be responsive to the multimodal query; determining the multimodal query comprises weak search signals based on the image input and the plurality of initial search results; in response to determining the multimodal query comprises weak search signals, processing the image input to generate a prompt, wherein the prompt is associated with a query clarification request; providing the prompt for display with the plurality of initial search results in a search results interface; receiving a user input via the search results interface; and processing the multimodal query and the user input to determine a plurality of second search results.
2. The system of claim 1, wherein determining the multimodal query comprises weak search signals based on the image input and the plurality of initial search results comprises: processing the multimodal query and at least a subset of the plurality of initial search results with a vision language model to generate a responsiveness score associated with how responsive the plurality of initial search results are to the multimodal query; and determining the responsiveness score is below a threshold score.
3. The system of claim 1, wherein determining the multimodal query comprises weak search signals based on the image input and the plurality of initial search results comprises: processing the image input and the plurality of initial search results to determine an object identification for an object depicted in the image input is ambiguous based on candidate object identifications associated with the plurality of initial search results.
4. The system of claim 1, wherein determining the multimodal query comprises weak search signals based on the image input and the plurality of initial search results comprises: processing the image input to determine the image input comprises an image quality below a quality threshold; and determining one or more similarity measures between the image input and at least a subset of the plurality of initial search results are below a similarity threshold.
5. The system of claim 1, wherein the operations further comprise: providing the plurality of second search results for display in a search results interface, wherein the search results interface comprises a query input box, a search results panel, and a knowledge panel comprising information obtained from a curated knowledge database.
6. The system of claim 1, wherein the operations further comprise: processing the multimodal query, data associated with the user input, and the plurality of second search results with a generative model to generate a model-generated response, wherein the model-generated response is generated to be responsive to the text input; and providing the model-generated response for display with the plurality of second search results.
7. The system of claim 1, wherein processing the image input to generate the prompt comprises: processing the image input with an image classification model to generate a plurality of predicted classification labels and a plurality of confidence scores associated with the plurality of predicted classification labels; determining each of the plurality of confidence scores are below a threshold confidence score; and generating a prompt based on a subset of the plurality of predicted classification labels determined to have a highest probability based on the plurality of confidence scores.
8. The system of claim 1, wherein processing the image input to generate the prompt comprises: processing the image input with an image classification model to generate a plurality of predicted classification labels and a plurality of confidence scores associated with the plurality of predicted classification labels; determining a first score and a second score of the plurality of confidence scores are similar; obtaining first object details associated with a first object classification of the plurality of predicted classification labels, wherein the first object classification is associated with the first score; obtaining second object details associated with a second object classification of the plurality of predicted classification labels, wherein the second object classification is associated with the second score; determining a differentiating feature between the first object details and the second object details; and generating the prompt based on the differentiating feature.
9. The system of claim 1, wherein processing the image input to generate the prompt comprises: processing the multimodal query and the plurality of initial search results with a generative model to generate a model-generated overview, wherein the model-generated overview is descriptive of a model understanding of the multimodal query and the plurality of initial search results; and generating the prompt based on the model-generated overview.
10. The system of claim 1, wherein the operations further comprise: processing the multimodal query and the plurality of initial search results with a generative model to generate a model-generated overview, wherein the model-generated overview is descriptive of a model understanding of the multimodal query and the plurality of initial search results; and wherein providing the prompt for display with the plurality of initial search results in the search results interface comprises: providing the prompt and the model-generated overview for display with the plurality of initial search results in a search results interface.
11. A computer-implemented method, the method comprising: obtaining, by a computing system comprising one or more processors, a multimodal query, wherein the multimodal query comprises a text input and an image input; processing, by the computing system, the multimodal query to determine a plurality of initial search results, wherein the plurality of initial search results are determined to be responsive to the multimodal query; determining, by the computing system, the multimodal query comprises weak search signals based on the image input and the plurality of initial search results; in response to determining the multimodal query comprises weak search signals, processing, by the computing system, the image input with an object detection model to generate one or more object detections; processing, by the computing system, the image input and the one or more object detections to generate a prompt, wherein the prompt is associated with a query clarification request; providing, by the computing system, the prompt for display with the plurality of initial search results in a search results interface; receiving, by the computing system, a user input via the search results interface; and processing, by the computing system, the multimodal query and the user input to determine a plurality of second search results.
12. The method of claim 11, wherein determining the multimodal query comprises weak search signals comprises at least one of: processing the image input to determine the image input comprises an image quality below a quality threshold; or determining a responsiveness score for the plurality of initial search results is below a response threshold.
13. The method of claim 11, wherein the prompt comprises a plurality of selectable images, wherein the plurality of selectable images are obtained based on the one or more object detections.
14. The method of claim 13, wherein processing the image input and the one or more object detections to generate the prompt comprises: processing at least one of the image input or the one or more object detections to determine a plurality of image search results; and generating the plurality of selectable images based on the plurality of image search results.
15. The method of claim 13, wherein providing, by the computing system, the prompt for display with the plurality of initial search results in the search results interface comprises: providing the plurality of selectable images for display in a carousel interface.
16. The method of claim 13, wherein the user input is descriptive of a selection of a particular image of the plurality of selectable images.
17. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: obtaining a visual query, wherein the visual query comprises an image input; processing the visual query to determine a plurality of initial search results, wherein the plurality of initial search results are determined to be responsive to the visual query; determining a search intent of the visual query is ambiguous based on at least one of the image input and the plurality of initial search results; in response to determining the search intent of the visual query is ambiguous, processing the image input with an image classification model to generate an image classification; generating a plurality of prompts based on the image classification, wherein the plurality of prompts comprise a plurality of suggested data processing actions; providing the plurality of prompts for display with the plurality of initial search results in a search results interface; receiving a user input via the search results interface; and processing the multimodal query and the user input to determine a plurality of second search results.
18. The one or more non-transitory computer-readable media of claim 17, wherein the image classification comprises a text-focused image classification; wherein generating the plurality of prompts based on the image classification comprises: processing the image input with an optical character recognition model to generate text data descriptive of text within the image input; determining a plurality of text data processing actions based on the image classification being descriptive of the image input being text-focused; and generating the plurality of prompts based on the plurality of text data processing actions.
19. The one or more non-transitory computer-readable media of claim 18, wherein the user input comprises a selection of a particular prompt associated with a particular text data processing action of the plurality of text data processing actions; and wherein processing the multimodal query and the user input to determine a plurality of second search results comprises: processing the text data with a search engine to determine a plurality of web search results; and processing the particular prompt and the text data with a generative language model to generate a model-generated response, wherein the plurality of second search results comprises the plurality of web search results and the model-generated response.
20. The one or more non-transitory computer-readable media of claim 17, wherein the image classification comprises a text-focused image classification; wherein generating the plurality of prompts based on the image classification comprises: processing the image input with an optical character recognition model to generate text data descriptive of text within the image input; processing the text data and the plurality of initial search results with a generative language model to generate the plurality of prompts.
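
Illustrative sketch (not part of the claims): the prompt-generation logic recited in claims 7 and 8 can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The `Prediction` type, the `build_prompt` function, the threshold values, the prompt wording, and the order in which the two checks run are all hypothetical; the claims do not specify them. The classification model itself is not shown, and its labels, confidence scores, and object details are passed in as plain data.

```python
from dataclasses import dataclass, field


@dataclass
class Prediction:
    """One predicted classification label with its confidence score (hypothetical type)."""
    label: str
    confidence: float
    # Hypothetical object details: attribute name -> value.
    details: dict = field(default_factory=dict)


def build_prompt(predictions, conf_threshold=0.5, similarity_margin=0.05, top_k=3):
    """Return a clarification prompt string, or None when no prompt is needed.

    All thresholds are illustrative assumptions, not values from the source.
    """
    ranked = sorted(predictions, key=lambda p: p.confidence, reverse=True)
    if not ranked:
        return None

    # Claim 8 style: the top two scores are similar, so ask about a
    # feature whose value differs between the two candidate objects.
    if len(ranked) >= 2:
        first, second = ranked[0], ranked[1]
        if abs(first.confidence - second.confidence) <= similarity_margin:
            shared = sorted(first.details.keys() & second.details.keys())
            for feature in shared:
                if first.details[feature] != second.details[feature]:
                    return (f"Is the {feature} {first.details[feature]} "
                            f"or {second.details[feature]}?")

    # Claim 7 style: every score is below the threshold, so offer the
    # most probable labels as options.
    if all(p.confidence < conf_threshold for p in ranked):
        options = ", ".join(p.label for p in ranked[:top_k])
        return f"Which of these best matches your image: {options}?"

    return None
```

For example, two candidates labeled "mug" and "cup" with confidences 0.41 and 0.39 and a `handle` detail of "present" versus "absent" would yield a prompt asking about the handle, while a single dominant label with confidence 0.9 would yield no prompt.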

Description

PRIORITY CLAIM

The present application is based on and claims priority to U.S. Provisional Application No. 63/714,919, having a filing date of Nov. 1, 2024. Applicant claims priority to and the benefit of such application and incorporates such application herein by reference in its entirety.

FIELD

The present disclosure relates generally to generating prompts and/or hedged answers with a generative model for particular search instances. More particularly, the present disclosure relates to generating prompts and/or hedged answers with a generative model when the search instance is determined to have weak search signals.

BACKGROUND

Understanding the world at large can be difficult. Whether an individual is trying to understand what the object in front of them is, trying to determine where else the object can be found, and/or trying to determine where an image on the internet was captured, text searching alone can be difficult. In particular, users may struggle to determine which words to use. Additionally, the words may not be descriptive enough and/or abundant enough to generate desired results. Moreover, low-quality images can be difficult for search systems to process. Search intent ambiguity can compound such difficulties.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One example aspect of the present disclosure is directed to a computing system for search interface prompting. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining a multimodal query. The multimodal query can include a text input and an image input.
The operations can include processing the multimodal query to determine a plurality of initial search results. The plurality of initial search results can be determined to be responsive to the multimodal query. The operations can include determining the multimodal query includes weak search signals based on the image input and the plurality of initial search results. The operations can include processing the image input to generate a prompt in response to determining the multimodal query includes weak search signals. The prompt can be associated with a query clarification request. The operations can include providing the prompt for display with the plurality of initial search results in a search results interface. The operations can include receiving a user input via the search results interface and processing the multimodal query and the user input to determine a plurality of second search results.

In some implementations, determining the multimodal query includes weak search signals based on the image input and the plurality of initial search results can include processing the multimodal query and at least a subset of the plurality of initial search results with a vision language model to generate a responsiveness score associated with how responsive the plurality of initial search results are to the multimodal query and determining the responsiveness score is below a threshold score. Determining the multimodal query includes weak search signals based on the image input and the plurality of initial search results can include processing the image input and the plurality of initial search results to determine an object identification for an object depicted in the image input is ambiguous based on candidate object identifications associated with the plurality of initial search results.
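
By way of illustration only, the two weak-signal determinations described above (a responsiveness score below a threshold, and an ambiguous object identification among the candidates associated with the initial search results) might be combined roughly as in the following sketch. It is a minimal sketch under stated assumptions. The function name `has_weak_signals`, the threshold values, the reason strings, and the "no single candidate label dominates" test for ambiguity are hypothetical and are not taken from the disclosure. The responsiveness score is assumed to have already been produced elsewhere (for example, by a vision language model) and is passed in as a number.

```python
from collections import Counter


def has_weak_signals(responsiveness_score, candidate_labels,
                     score_threshold=0.6, dominance_threshold=0.5):
    """Return (is_weak, reason) for a query and its initial search results.

    responsiveness_score: assumed precomputed score in [0, 1].
    candidate_labels: one candidate object identification per initial result.
    Both thresholds are illustrative assumptions.
    """
    # Check 1: the initial results are not responsive enough to the query.
    if responsiveness_score < score_threshold:
        return True, "low_responsiveness"

    # Check 2: the candidate identifications disagree, i.e. the most common
    # label accounts for less than the required share of the results.
    if candidate_labels:
        _, top_count = Counter(candidate_labels).most_common(1)[0]
        if top_count / len(candidate_labels) < dominance_threshold:
            return True, "ambiguous_object"

    return False, "ok"
```

Returning a reason alongside the boolean is a design choice of this sketch; it would let a downstream step choose which kind of clarification prompt to generate.
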
In some implementations, determining the multimodal query includes weak search signals based on the image input and the plurality of initial search results can include processing the image input to determine the image input includes an image quality below a quality threshold and determining one or more similarity measures between the image input and at least a subset of the plurality of initial search results are below a similarity threshold.

The operations can include providing the plurality of second search results for display in a search results interface. The search results interface can include a query input box, a search results panel, and a knowledge panel including information obtained from a curated knowledge database. The operations can include processing the multimodal query, data associated with the user input, and the plurality of second search results with a generative model to generate a model-generated response. The model-generated response can be generated to be responsive to the text input. The operations can include providing the model-generated response for display with the plurality of second search results.

In some implementations, processing the image input to generate the pro