US-20260127459-A1 - SYSTEM AND METHOD TO IMPROVE RESULTS OF AN ELECTRONIC SEARCH OF PREEXISTING ELECTRONIC OBJECTS
Abstract
Embodiments described herein include systems and methods for enhancing the outcomes of electronic searches targeting preexisting electronic objects, each previously segmented into data chunks and encoded into representative numeric vectors via a first embedder, with data stored in a knowledge base. The system includes a second embedder that translates queries into corresponding numeric vectors compatible with the first embedder's outputs. A search engine retrieves data records from the knowledge base by comparing similarity scores between query and object vectors. A processor introduces a bias to these vectors based on a calculated distance function, resulting in biased vectors that refine search results. An output device presents the processed search outcomes, optimizing relevance by leveraging biased vectors within a constrained data record set.
Inventors
- Oleg Vasilyev
- Shelly SCHWARTZ
- John Bohannon
Assignees
- Primer Technologies, Inc.
Dates
- Publication Date
- 20260507
- Application Date
- 20251031
Claims (17)
- 1 . A system for improving results of electronic search of preexisting electronic objects, each of the preexisting electronic objects having been split into one or more data chunks and each of the one or more data chunks having been encoded into a first numeric vector representative of the data contained in that data chunk by a first embedder, each of the one or more data chunks and its representative first numeric vector stored in one or more respective data records in a knowledge base, the electronic search results being based on a query, the system comprising: a second embedder that encodes the query into a second numeric vector representative of data associated with the query, wherein the second embedder works well with the first embedder; a search engine operably connected to the second embedder and to the knowledge base to retrieve a limited set of data records from the knowledge base based on similarity between the first numeric vector of the data record and the second numeric vector; a processor that (i) biases the first numeric vector, E, in each of the data records in the limited set of data records to a biased first numeric vector, E′, as a function of a distance, d, between the first numeric vector and the second numeric vector, E_Q, wherein E′ = (1−c)*E + c*f(d)*E_Q, f(d) = 1/(1 + |d/D|^s); c is a number between 0 and 1, D is a number between about 3.0 and about 5.5, and s is a number between about 3.5 and about 57.2; and an output device operably connected to the processor to receive the results of the electronic search as a function of each of the biased first numeric vectors in the limited set of data records.
- 2 . The system according to claim 1 wherein the processor further (ii) clusters each of the data records in the limited set of data records into two or more result clusters based on the biased first numeric vector of that data record, and (iii) provides information representative of each of one or more of the two or more result clusters.
- 3 . The system according to claim 2 wherein the processor further (iv) reduces a dimensionality of the biased first numeric vectors in the limited set of data records prior to clustering.
- 4 . The system according to claim 3 wherein the dimensionality reduction is performed with UMAP.
- 5 . The system according to claim 4 further comprising a language model constructed to respond to the query based on the information representative of each of the one or more of the two or more result clusters.
- 6 . The system according to claim 2 further comprising a language model constructed to respond to the query based on the information representative of each of the one or more of the two or more result clusters.
- 7 . The system according to claim 1 further comprising a language model constructed to respond to the query based on the results of the electronic search as a function of each of the biased first numeric vectors in the limited set of data records.
- 8 . The system according to claim 1 wherein the search engine uses either cosine or distance similarity between the second numeric vector and the first numeric vector contained in each of the plurality of data records stored in the knowledge base to retrieve the limited set of data records.
- 9 . The system according to claim 1 wherein the first embedder and the second embedder begin as the same model.
- 10 . The system according to claim 9 wherein the first and second embedders are trained or tuned as a single model.
- 11 . A method for improving results of electronic search of preexisting electronic objects, each of the preexisting electronic objects having been split into one or more data chunks and each of the one or more data chunks having been encoded into a first numeric vector representative of the data contained in that data chunk by a first embedder, each of the one or more data chunks and its representative first numeric vector stored in one or more respective data records in a knowledge base, the electronic search results being based on a query, the method comprising: (a) encoding the query into a second numeric vector representative of the data associated with the query using a second embedder, wherein the second embedder works well with the first embedder; (b) retrieving a potentially relevant data record from the knowledge base based on similarity between the second numeric vector and the first numeric vector of the potentially relevant data record; (c) storing the potentially relevant data record in a temporary data structure; (d) repeating tasks (b) through (c) until the temporary data structure contains a limited set of data records; (e) biasing, using a processor, the first numeric vector, E, in each of the potentially relevant data records in the limited set of data records to a biased first numeric vector, E′, as a function of a distance, d, between the first numeric vector and the second numeric vector, E_Q, wherein E′ = (1−c)*E + c*f(d)*E_Q, f(d) = 1/(1 + |d/D|^s); c is a number between 0 and 1, D is a number between about 3.0 and about 5.5, and s is a number between about 3.5 and about 57.2; and (f) outputting search results selected from the limited set of data records based on the biased first numeric vectors.
- 12 . The method according to claim 11 further comprising clustering each of the potentially relevant data records in the temporary data structure into two or more result clusters based on the biased first numeric vector of each potentially relevant data record, wherein outputting search results is further based on one or more of the two or more result clusters.
- 13 . The method according to claim 12 further comprising reducing a dimensionality of the biased first numeric vector in the limited set of data records prior to clustering.
- 14 . The method according to claim 13 wherein reducing the dimensionality of the biased first numeric vector is performed using UMAP.
- 15 . The method according to claim 14 further comprising responding to the query using a language model constructed to respond to the query based on the information representative of each of the one or more of the two or more result clusters.
- 16 . The method according to claim 11 further comprising responding to the query using a language model constructed to respond to the query based on the limited set of data records.
- 17 . The method according to claim 11 further comprising training the second embedder alongside the first embedder.
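By way of non-limiting illustration, the retrieval and biasing steps recited in claims 1 and 11 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the claims do not fix the distance metric, so Euclidean distance is assumed for d; cosine similarity is used for retrieval, as one of the options in claim 8; and the function names (`retrieve_limited_set`, `falloff`, `bias_vector`), the record layout (a `(chunk_text, vector)` tuple), and the parameter values used in any call are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Record = Tuple[str, Vector]  # hypothetical layout: (data chunk, first numeric vector E)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Assumed distance d between E and E_Q (the claims leave the metric open)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_limited_set(records: List[Record], e_q: Vector, k: int) -> List[Record]:
    """Return the k records whose vectors are most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine_similarity(r[1], e_q), reverse=True)
    return ranked[:k]


def falloff(d: float, D: float, s: float) -> float:
    """f(d) = 1 / (1 + |d/D|^s), as recited in claims 1 and 11."""
    try:
        return 1.0 / (1.0 + abs(d / D) ** s)
    except OverflowError:
        # |d/D|^s exceeded the float range, so f(d) is effectively zero.
        return 0.0


def bias_vector(e: Vector, e_q: Vector, c: float, D: float, s: float) -> List[float]:
    """E' = (1 - c) * E + c * f(d) * E_Q, computed component by component."""
    weight = c * falloff(euclidean_distance(e, e_q), D, s)
    return [(1.0 - c) * x + weight * q for x, q in zip(e, e_q)]
```

Under these assumptions, f(d) is close to 1 when d is much smaller than D, equals 0.5 when d equals D, and approaches 0 when d is much larger than D, so only vectors already near the query are pulled appreciably toward E_Q. A larger s makes the transition around D sharper.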
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority under 35 USC 119 to Provisional Patent Application No. 63/715,392 entitled “System And Method To Improve Results Of An Electronic Search Of Preexisting Electronic Objects” filed Nov. 1, 2024, the disclosure of which is hereby expressly incorporated by reference in its entirety.
BACKGROUND
A plurality of sources have generated, and continue to generate, images, articles, graphs, and other similar electronic objects that are used in providing information about one or more particular events or entities. These sources include companies (e.g., annual reports, marketing materials, published reports, SEC filings, web content), governmental entities, institutions of higher education (e.g., academic articles), non-fiction books, private think tanks, news organizations (e.g., ABC, BBC, CBS, CNN, CSPAN, The Financial Times, FoxNews, NBC, The New York Times, Newsweek Magazine, NPR, PBS, The San Francisco Chronicle, The Wall Street Journal, The Washington Post), and social media (e.g., Facebook, Instagram, TED Talks, TikTok, X), among other possible sources. The events may include financial events, scientific finds, product introductions, world news events, and local news events, among other possible events. The entities may include companies, countries, groups, organizations, and people, among other possible entities. The electronic objects generated may include various facts, images, or other data that can be used in providing a reader or viewer with information about the particular event or entity. Different preexisting electronic objects may provide similar, even redundant, information about a particular event or entity. Some electronic objects may provide different, even potentially incremental, information about the particular event or entity. Electronic objects may be created by converting printed materials into electronically readable form.
The generated electronic objects may be stored on one or more data servers accessible via one or more computer networks. These one or more computer networks may be private (i.e., accessible only to a select group of users) or public. Each of these computer networks may comprise one or more local area or wide area networks. One exemplary computer network may be the Internet. Another exemplary computer network may be the internal document database of a company, firm, or organization. In view of the foregoing, the number of electronic objects available for consideration is immense and growing larger over time. These electronic objects are preexisting in the sense that they are created before someone or some process accesses them for review or consideration. For more than a decade, people have been using electronic search engines (such as Google® and Microsoft Bing®) to retrieve potentially relevant electronic objects from the plurality of sources across the Internet. Most often, text-based search queries (e.g., “current automobile recalls”) are fed into the interface of the electronic search engine using a keyboard or speech-to-text conversion utility. Electronic search engines conduct searches in real time, with indexes, or with some combination of these two approaches. “Indexing” generally refers to automatic pre-accessing, parsing, and storage of data representative of each electronic object encountered by a mechanism of the search engine. These search engine mechanisms that automatically pre-access and parse electronic objects are often referred to as “crawlers” because they automatically traverse all of the electronic objects stored on the various data servers accessible via the computer network associated with the electronic search engine. Where the computer network is the Internet, they are alternatively referred to as web crawlers.
Electronic search engines have a ranking or sorting algorithm that determines the arrangement and order of presentation of each of the potentially relevant electronic objects based on the relevancy (or similarity) of each electronic object to the search query. Depending upon the nature of the concept being searched, the electronic search engine may return pages upon pages of potentially relevant electronic objects. As one would expect, current ranking algorithms rank electronic objects with similar content similarly; as such, the electronic objects that populate the first pages of the search results often contain largely similar (i.e., redundant) information. Human users may work around this redundancy problem by reviewing the electronic objects returned across the first handful of search result pages until the user is satisfied that sufficient information has been uncovered, or until the user becomes frustrated because the top-ranked results returned by the search engine failed to provide some or all of the information desired from the search. Oftentimes, this frustration is followed by a subsequent attempt by the human user to electronically search the topic anew, usually using d