US-12619656-B2 - Automated tool for determining and providing information about dwellings satisfying search criteria specified using multiple data modes
Abstract
Techniques are described for performing automated operations related to determining and providing information about dwellings for searches with search criteria combining data of multiple modes, such as at least free-form natural language text and one or more images. In some situations, the described techniques include training machine learning (“ML”) model(s) to encode semantic information about dwellings from multiple data modes into corresponding vector-based embeddings, using the trained ML model(s) to generate vector embeddings for dwellings in one or more geographical areas to represent dwelling data of multiple data modes, using the trained ML model(s) to generate vector embeddings for a search query with multiple search criteria including data of multiple modes, and determining one or more matching target dwellings for the query by matching generated vector embeddings of candidate dwellings to the generated vector embedding(s) for the query, with information about matching target dwelling(s) then further used.
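The matching step summarized above can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `Dwelling` class, the `cosine_distance`, `matches`, and `search` helpers, the choice of cosine distance, and the 0.3 threshold values are all assumptions made for the example, and the embeddings are hand-written toy vectors rather than the output of a trained ML model.

```python
import math
from dataclasses import dataclass, field


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


@dataclass
class Dwelling:
    """Hypothetical record: one text embedding plus one embedding per image."""
    dwelling_id: str
    text_embedding: list
    image_embeddings: list = field(default_factory=list)


def matches(dwelling, query_text, query_images,
            text_threshold=0.3, image_threshold=0.3):
    """True if the text embedding is close to the query text embedding AND
    every query image embedding is close to at least one dwelling image."""
    if cosine_distance(dwelling.text_embedding, query_text) > text_threshold:
        return False
    return all(
        any(cosine_distance(img, q) <= image_threshold
            for img in dwelling.image_embeddings)
        for q in query_images
    )


def search(dwellings, query_text, query_images, **thresholds):
    """Return the ids of candidate dwellings that match the query."""
    return [d.dwelling_id for d in dwellings
            if matches(d, query_text, query_images, **thresholds)]


# Toy 3-dimensional embeddings (assumed values, for illustration only).
candidates = [
    Dwelling("A", [1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    Dwelling("B", [0.9, 0.1, 0.0], [[1.0, 0.0, 0.0]]),
    Dwelling("C", [0.0, 1.0, 0.0], [[0.0, 1.0, 0.0]]),
]
print(search(candidates, [1.0, 0.05, 0.0], [[0.0, 0.9, 0.1]]))
```

In this toy data, dwelling B passes the text check but has no image near the query image, and dwelling C fails the text check, so only dwelling A would be returned. A production system would more plausibly use an approximate nearest-neighbor index than a linear scan; the scan is used here only to keep the logic visible.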
Inventors
- Hubert Vijay Arokiasamy
- Dushyant R. Maheshwary
- Piali Syam
- Alexey Serba
- Sandesh Satish
- Ashwani Kapoor
- Supriya Anand
- Jyoti Prakash Maheswari
- Rajendra Waman Shioramwar
Assignees
- MFTB Holdco, Inc.
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-06-05
Claims (20)
- 1 . A system comprising: one or more hardware processors of one or more computing devices; and one or more memories with stored instructions that, when executed by at least one of the one or more hardware processors, cause at least one computing device of the one or more computing devices to perform automated operations including at least: generating, for a plurality of dwellings that each has an associated textual description and multiple associated images of portions of that dwelling, and using one or more machine learning models that are trained to capture at least one of semantic relationships between words or semantic content of images and to convert high-dimensional data into low-dimensional vectors that preserve data content, respective vector-based embeddings that represent the plurality of dwellings, wherein the vector-based embeddings include, for each of the plurality of dwellings, a dwelling text vector-based embedding that represents semantic content from the associated textual description of that dwelling, and one or more dwelling image vector-based embeddings that each encodes a semantic representation of contents of one or more of the multiple associated images for that dwelling; generating, by a client device that is one of the one or more computing devices, and in response to a request received from a user of the client device for information about target dwellings that satisfy multiple specified search criteria, a user query that includes the multiple search criteria and that further includes automatically added information about a determined geographical location of the client device, the multiple search criteria including a textual dwelling characterization specified using a sequence of freeform terms, and further including one or more indicated images depicting one or more representative dwelling portions; receiving the user query from the client device; generating, using the one or more trained machine learning models, multiple additional 
vector-based embeddings for the user query that represent further semantic content of at least some of the user query, the multiple additional vector-based embeddings including a query text vector-based embedding that represents semantic content from at least some of the textual dwelling characterization, and including one or more query image vector-based embeddings that each represents semantic content from at least one of the one or more indicated images; determining at least one target dwelling of the plurality of dwellings that is located in one or more geographical areas associated with the determined geographical location and that satisfies the multiple search criteria, including determining for each of the at least one target dwellings that the respective dwelling text vector-based embedding for that target dwelling matches the query text vector-based embedding for the user query, and further determining that, for each of the one or more query image vector-based embeddings, each of the at least one target dwellings has at least one respective dwelling image vector-based embedding that matches that query image vector-based embedding for the user query; and providing, in response to the user query, information about the determined at least one target dwelling.
- 2 . The system of claim 1 wherein the determining that a dwelling text vector-based embedding for a target dwelling matches the query text vector-based embedding includes determining that that dwelling text vector-based embedding differs from that query text vector-based embedding by at most a first defined threshold amount, and wherein the determining that a dwelling image vector-based embedding for a target dwelling matches a query image vector-based embedding includes determining that that dwelling text vector-based embedding differs from that query image vector-based embedding by at most a second defined threshold amount.
- 3 . The system of claim 2 wherein the first and second defined threshold amounts are a same amount, and wherein the determining that a dwelling text vector-based embedding differs from a query text vector-based embedding by at most a first defined threshold amount and the determining that a dwelling text vector-based embedding differs from a query image vector-based embedding by at most a second defined threshold amount includes measuring a distance between a pair of vector-based embeddings and determining that the measured distance is below a defined distance-based threshold.
- 4 . The system of claim 1 wherein the associated textual description for each of the plurality of dwellings includes a textual narrative describing that dwelling using freeform text and includes a plurality of keyword-value pairs to describe attributes of that dwelling, and wherein the stored instructions include software instructions that, when executed by the one or more hardware processors, cause the one or more computing devices to perform the generating, for each of the plurality of dwellings, of the respective dwelling text vector-based embedding for that dwelling to encode a semantic representation of contents of at least the textual narrative and the plurality of keyword-value pairs included in the textual description for that dwelling.
- 5 . The system of claim 1 wherein the providing of the information about the determined at least one target dwelling includes presenting the information about the determined at least one target dwelling in a displayed graphical user interface.
- 6 . The system of claim 1 wherein the dwelling text vector-based embedding for each of the plurality of dwellings further encodes a further semantic representation of contents of additional information about that dwelling obtained from one or more public sources of data about dwellings, and wherein the textual dwelling characterization of the user query includes text matching the additional information for one or more of the at least one target dwellings.
- 7 . The system of claim 1 wherein the multiple associated images for each of the at least one target dwellings include at least one image of a first type that shows a ground-level view of an exterior of at least one side of that target dwelling and at least one image of a second type that shows an overhead view including all of that target dwelling and at least one image of a third type that shows an interior view of at least one room of that target dwelling, wherein the generated respective vector-based embeddings for the plurality of dwellings includes, for each of the target dwellings, a separate dwelling image vector-based embedding for each of the multiple associated images for that target dwelling that is generated using the one or more trained machine learning models, wherein the multiple search criteria of the user query include multiple indicated images of at least two of the first and second and third types, wherein the generating of the one or more query image vector-based embeddings includes generating, using the one or more trained machine learning models, a separate query image vector-based embedding for each of the multiple indicated images, and wherein the determining of the at least one target dwelling includes determining that each target dwelling has, for each separate query image vector-based embedding, one or more dwelling image vector-based embeddings that differs from that separate query image vector-based embedding by at most a defined threshold amount.
- 8 . The system of claim 1 wherein the automated operations further include, for each of the at least one target dwellings, generating, for each of the multiple associated images of that target dwelling and using the one or more trained machine learning models, a respective textual description of contents of that associated image, and wherein the determining of the at least one target dwelling further includes determining that each target dwelling has at least one associated image whose respective textual description matches at least some of the textual dwelling characterization.
- 9 . The system of claim 1 wherein the at least one target dwellings each further has an associated video showing at least some of that dwelling, wherein at least one dwelling image vector-based embeddings for each of the at least one target dwellings each encodes a semantic representation of contents of multiple image frames from the associated video for that target dwelling, and wherein the determining of the at least one target dwelling includes determining, for each of the at least one target dwellings, that the at least one dwelling image vector-based embedding for that target dwelling differs from one of the one or more query image vector-based embeddings by at most a defined threshold amount.
- 10 . The system of claim 1 wherein the received user query includes a video with a plurality of image frames, and wherein the automated operations further include selecting at least one of the plurality of image frames to be the one or more indicated images.
- 11 . The system of claim 1 wherein the at least one target dwellings each further has one or more additional associated pieces of media of one or more other media types different from images and text, wherein the generated respective vector-based embeddings further includes, for each of the at least one target dwellings, one or more further dwelling media vector-based embeddings for each of the one or more additional pieces of media associated with that dwelling that encodes a further semantic representation of contents of that additional piece of media and that are generated using the one or more trained machine learning models, wherein the multiple search criteria further include at least one additional indicated piece of media of at least one of the one or more other media types, wherein the multiple additional vector-based embeddings for the user query include an additional query media vector-based embedding for each of the at least one additional indicated pieces of media that is generated using the one or more trained machine learning models, and wherein the determining of the at least one target dwelling further includes determining for each of the at least one target dwellings that the associated one or more additional pieces of media for that target dwelling match the at least one additional indicated pieces of media by each having, for each of the at least one additional indicated pieces of media, at least one dwelling media vector-based embedding that differs from the query media vector-based embedding for that indicated piece of media by at most a defined threshold amount.
- 12 . The system of claim 1 wherein the generating of the multiple additional vector-based embeddings for the user query further includes obtaining additional information specific to a user that supplies the user query, and further encoding a further semantic representation in at least one of the multiple additional vector-based embeddings of contents of the additional information specific to the user.
- 13 . The system of claim 1 wherein the user query is received from a client device, wherein the providing of the information about the determined at least one target dwelling includes transmitting search results that include the information about the determined at least one target dwelling over one or more computer networks to the client device for display on the client device, and wherein the automated operations further include, before the generating of the respective vector-based embeddings for the plurality of dwellings, training the one or more machine learning models using at least one of positive examples each having two or more first real estate phrases that are semantically similar and negative examples each having two or more second real estate phrases that are not semantically similar, or positive examples each having two or more first real estate images that are similar and negative examples each having two or more second real estate images that are not similar.
- 14 . A non-transitory computer-readable medium having stored contents that cause one or more computing devices to perform automated operations, the automated operations including at least: generating, by the one or more computing devices, and using one or more machine learning models that are trained to capture at least one of semantic relationships between words or semantic content of images and to convert high-dimensional data into low-dimensional vectors that preserve data content, vector-based embeddings for a plurality of buildings each having an associated textual description and multiple associated images of portions of that building, wherein the vector-based embeddings include, for each of the plurality of buildings, one or more building vector-based embeddings that encode semantic information of at least some of the associated textual description for that building and semantic information of contents of one or more of the multiple associated images for that building; receiving, by the one or more computing devices and after the generating of the vector-based embeddings for the plurality of buildings, a user query from a client device for information about target buildings that satisfy multiple specified search criteria, the multiple search criteria including a textual building characterization specified using a sequence of freeform terms and further including one or more indicated images depicting one or more representative building portions and further including a geographical location determined and supplied by the client device; generating, by the one or more computing devices and using the one or more trained machine learning models, one or more query vector-based embeddings for the user query that encode semantic information of at least some of the textual building characterization and semantic information of contents of at least one of the one or more indicated images; determining, by the one or more computing devices, at least one target building 
of the plurality of buildings for the user query that is located in one or more geographical areas associated with the geographical location, including determining that building vector-based embeddings for the at least one target buildings match the one or more query vector-based embeddings for the user query; and providing, by the one or more computing devices and in response to the user query, information about the determined at least one target building.
- 15 . The non-transitory computer-readable medium of claim 14 wherein the generating of the vector-based embeddings for the plurality of buildings further includes generating, for each of the plurality of buildings and using the one or more trained machine learning models, a building text vector-based embedding that encodes a semantic representation of contents of at least some of the associated textual description for that building, and one or more building image vector-based embeddings that each encodes a semantic representation of contents of one or more of the multiple associated images for that building, wherein the generating of the one or more query vector-based embeddings for the user query further includes generating, using the one or more trained machine learning models, a query text vector-based embedding that encodes a semantic representation of contents of at least some of the textual building characterization, and one or more query image vector-based embeddings that each encodes a semantic representation of contents of at least one of the one or more indicated images, and wherein the determining that building vector-based embeddings for the at least one target buildings match the one or more query vector-based embeddings includes determining that building text vector-based embeddings for the at least one target buildings differ from that query text vector-based embedding by at most a defined threshold amount, and that building image vector-based embeddings for the at least one target buildings differ from the one or more query image vector-based embeddings by at most the defined threshold amount.
- 16 . The non-transitory computer-readable medium of claim 15 wherein the stored contents include software instructions that, when executed by the one or more computing devices, cause the one or more computing devices to perform the automated operations and further perform, before the generating of the vector-based embeddings for the plurality of dwellings, training the one or more machine learning models using at least one of positive examples each having two or more first real estate phrases that are semantically similar and negative examples each having two or more second real estate phrases that are not semantically similar, or positive examples each having two or more first real estate images that are similar and negative examples each having two or more second real estate images that are not similar, wherein the determining of the at least one target building includes determining multiple target buildings, wherein the determining that building text vector-based embeddings for the at least one target buildings differ from that query text vector-based embedding by at most a defined threshold amount includes determining that one or more of the multiple target buildings each has an associated building text vector-based embedding that differs from the query text vector-based embedding by at most a defined distance using a defined distance metric, and wherein the determining that building image vector-based embeddings for the at least one target buildings differ from the one or more query image vector-based embeddings by at most the defined threshold amount includes determining that at least one of the multiple target buildings each has an associated building image vector-based embedding that differs from one of the query image vector-based embeddings by at most the defined distance using the defined distance metric.
- 17 . A computer-implemented method comprising: generating, by one or more computing devices using one or more machine learning models that are trained to capture at least one of semantic relationships between words or semantic content of images and to convert high-dimensional data into low-dimensional vectors that preserve data content, vector-based embeddings for a plurality of dwellings in one or more geographical areas, wherein each of the dwellings has an associated textual description and multiple associated images of portions of that dwelling, and wherein the vector-based embeddings include, for each of the plurality of dwellings, a dwelling text vector-based embedding that encodes a semantic representation of contents of at least some of the associated textual description for that dwelling, and one or more dwelling image vector-based embeddings that each encodes a semantic representation of contents of one or more of the multiple associated images for that dwelling; receiving, by the one or more computing devices and after the generating of the vector-based embeddings for the plurality of dwellings, a user query from a client device for information about target dwellings that are in at least one of the one or more geographical areas including a geographical location supplied by the client device and that satisfy multiple specified search criteria, the multiple search criteria including a textual dwelling characterization specified using a sequence of freeform terms submitted via a natural language interface, and further including one or more indicated images depicting one or more representative dwelling portions; generating, by the one or more computing devices using the one or more trained machine learning models, multiple additional vector-based embeddings for the user query that encode additional semantic representations of the user query, the multiple additional vector-based embeddings including a query text vector-based embedding that encodes a 
semantic representation of contents of at least some of the textual dwelling characterization, and one or more query image vector-based embeddings that each encodes a semantic representation of contents of at least one of the one or more indicated images; determining, by the one or more computing devices, one or more first dwellings of the plurality of dwellings that are located in the one or more geographical areas and whose associated textual descriptions match the textual dwelling characterization by each having a dwelling text vector-based embedding that differs from the query text vector-based embedding by at most a first defined threshold amount; determining, by the one or more computing devices, one or more second dwellings of the plurality of dwellings that are located in the one or more geographical areas and whose associated images match the one or more indicated images by each having at least one dwelling image vector-based embedding that differs from at least one query image vector-based embedding by at most a second defined threshold amount; determining, by the one or more computing devices, at least one target dwelling of the plurality of dwellings that satisfies the multiple search criteria, including identifying that the at least one target dwelling is part of both the one or more first dwellings and the one or more second dwellings; and presenting, by the one or more computing devices, information about the determined at least one target dwelling as part of response information to the user query.
- 18 . The computer-implemented method of claim 17 wherein the determining of the one or more first dwellings includes measuring, for each of the dwelling text vector-based embeddings of the one or more first dwellings, a first distance between the query text vector-based embedding and that dwelling text vector-based embedding that is below a first distance-based threshold, and wherein the determining of the one or more second dwellings includes measuring, for each of the at least one dwelling image vector-based embeddings of the one or more second dwellings, a second distance between one of the query image vector-based embeddings and that dwelling image vector-based embedding that is below a second distance-based threshold.
- 19 . The computer-implemented method of claim 18 wherein the first and second defined threshold amounts are a same amount, wherein the first and second distance-based thresholds are a same distance, and wherein the method further comprises, before the generating of the vector-based embeddings for the plurality of dwellings, training the one or more machine learning models using at least one of positive examples each having two or more first real estate phrases that are semantically similar and negative examples each having two or more second real estate phrases that are not semantically similar, or positive examples each having two or more first real estate images that are similar and negative examples each having two or more second real estate images that are not similar.
- 20 . The computer-implemented method of claim 17 wherein the generating of the multiple additional vector-based embeddings for the user query further includes obtaining additional information specific to a user that supplies the user query, and further encoding a further semantic representation in at least one of the multiple additional vector-based embeddings of contents of the additional information specific to the user.
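Claims 13, 16, and 19 recite training the machine learning model(s) with positive examples (pairs of real estate phrases or images that are similar) and negative examples (pairs that are not similar), but do not name a loss function. As one possible reading, the sketch below evaluates a standard margin-based contrastive loss over such labeled pairs. The function names, the Euclidean distance, the margin of 1.0, and the toy two-dimensional embeddings are all assumptions made for illustration; no actual encoder or optimizer is shown.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, margin=1.0):
    """Mean margin-based contrastive loss.

    `pairs` is a list of (embedding_a, embedding_b, is_similar) triples.
    Similar pairs are penalized by their squared distance; dissimilar
    pairs are penalized only when they are closer than `margin`.
    """
    if not pairs:
        raise ValueError("at least one training pair is required")
    total = 0.0
    for emb_a, emb_b, is_similar in pairs:
        d = euclidean_distance(emb_a, emb_b)
        if is_similar:
            total += d ** 2
        else:
            total += max(0.0, margin - d) ** 2
    return total / len(pairs)


# Hypothetical embeddings for three phrases under two candidate encoders.
# Positive pair: "open floor plan" / "open-concept layout"
# Negative pair: "open floor plan" / "detached garage"
good_encoder_pairs = [
    ([0.9, 0.1], [0.8, 0.2], True),   # similar phrases placed close together
    ([0.9, 0.1], [0.0, 1.0], False),  # dissimilar phrases placed far apart
]
poor_encoder_pairs = [
    ([0.9, 0.1], [0.0, 1.0], True),   # similar phrases placed far apart
    ([0.9, 0.1], [0.8, 0.2], False),  # dissimilar phrases placed close together
]
good_loss = contrastive_loss(good_encoder_pairs)
poor_loss = contrastive_loss(poor_encoder_pairs)
print(good_loss < poor_loss)
```

A training loop would adjust the encoder's parameters to reduce this loss, pulling embeddings of similar phrases or images together and pushing dissimilar ones at least `margin` apart, which is what makes the distance-threshold matching recited in claims 2, 3, and 18 meaningful.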
Description
TECHNICAL FIELD
The following disclosure relates generally to techniques for automatically determining and providing information about dwellings in response to search criteria specified using data of multiple modes, such as to automatically respond to a dwelling-related search query that has multiple modes of data including at least free-form natural language text and one or more images by generating and using encoded representations of semantic content of the data of the multiple data modes to identify matching dwellings.
BACKGROUND
An abundance of information is available to users on a wide variety of topics from a variety of sources. For example, portions of the World Wide Web (“the Web”) are akin to an electronic library of documents and other data resources distributed over the Internet, with billions of documents available, including groups of documents directed to various specific topic areas (e.g., buildings of various types). In addition, various other information is available via other communication mediums. However, existing search engines and other techniques for identifying information of interest suffer from various problems. Non-exclusive examples include a difficulty in understanding natural language requests, difficulty in providing accurate information that is specific to a particular topic of interest, difficulty in limiting information requests to approved topics, etc.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1A-1C are network diagrams illustrating an example system for performing described techniques, including automatically responding to a dwelling-related search query using multiple modes of data including at least free-form natural language text and one or more images by generating and using encoded representations of semantic content of the data of the multiple data modes to identify matching dwellings.
FIG. 1D illustrates examples of non-exclusive types of building description information.
FIGS. 2A-2E illustrate examples of performing described techniques, including automatically responding to dwelling-related search queries using multiple modes of data including at least free-form natural language text and one or more images by generating and using encoded representations of semantic content of the data of the multiple data modes to identify matching dwellings.
FIG. 3 is a block diagram illustrating an example of a computing system for use in performing described techniques, including automatically responding to a dwelling-related search query using multiple modes of data including at least free-form natural language text and one or more images by generating and using encoded representations of semantic content of the data of the multiple data modes to identify matching dwellings.
FIG. 4 illustrates a flow diagram of an example embodiment of an Automated Dwelling Information Retrieval Using Multi-Modal Search (“ADIRUMMS”) system routine.
FIG. 5 illustrates a flow diagram of an example embodiment of an ADIRUMMS Candidate Dwelling Evaluator/Selector component routine.
FIG. 6 illustrates a flow diagram of an example embodiment of a client device routine.
DETAILED DESCRIPTION
The present disclosure describes techniques for using computing devices to perform automated operations involving automatically determining and providing information about dwellings in response to search criteria using data of multiple modes, such as to automatically respond to a dwelling-related search query that has multiple modes of data including at least free-form natural language text and one or more images by generating and using encoded representations of semantic content of the data of the multiple data modes to identify matching dwellings.
In at least some embodiments, the described techniques include training one or more machine learning (“ML”) models to encode semantic information about dwellings from multiple data modes into corresponding vector-based embeddings (also referred to herein as “vector embeddings”), and then using the trained ML model(s) to generate dwelling vector embeddings for some or all dwellings in one or more geographical areas, such as to encode textual data about a dwelling from a textual narrative description of the dwelling as well as dwelling information in one or more other textual forms (e.g., a plurality of keyword-value pairs to describe attributes of the dwelling), to encode visual data about the dwelling from one or more images, and to optionally encode data about the dwelling from data of one or more other media data modes (e.g., videos, audio clips, etc.). After the generation of the dwelling vector embeddings, the described techniques may include receiving a search query that specifies multiple search criteria with data of multiple modes (e.g., a textual characterization of target dwellings of interest using a sequence of multiple free-form natural language terms including narrative text and optionally one or more keywords with associated values, one or more represen