US-12619636-B2 - Entity linking and filtering using efficient search tree and machine learning representations
Abstract
Methods, systems, and computer-readable storage media for a ML system that reduces a number of target items from consideration as potential matches to a query item using token embeddings and a search tree.
Inventors
- Sundeep Gullapudi
- Rajesh Vellore ARUMUGAM
- Matthias Frank
- Wei Xia
Assignees
- SAP SE
Dates
- Publication Date
- 20260505
- Application Date
- 20250109
Claims (9)
- 1. A computer-implemented method for matching a query item to one or more target items using machine learning (ML) models, the method being executed by one or more processors and comprising: prior to matching a query item to one or more target items of a superset of target items during inference, providing a set of target items from the superset of target items by: for a first query item token of the query item: identifying, within a search space, at least one target item token that is similar to the first query item token, and associating the first query item token with a revised search space within a tracker, the tracker comprising an array data structure that is initialized with a set of null values, associating the first query item token with the revised search space within the tracker comprises replacing a null value with a search space index indicating where the first query item token was found in the search space, and wherein the revised search space is provided in a queue of search spaces, a length of the queue being defined by a window parameter; determining, based on the length between the items being within a window size of the window parameter, a set of matched item tokens indicating one of a match and a partial match between a query item token and a target item token; defining the set of target items from the set of matched item tokens; and executing inference to match the query item to one or more target items in the set of target items to provide inference results, the inference results indicating a match between the query item and at least one target item in the set of target items.
- 2. The method of claim 1, wherein the search space is defined within a search tree comprising a set of nodes, each node representative of a respective target item token in a set of target item tokens.
- 3. The method of claim 1, wherein the revised search space comprises a search sub-space of the search space.
- 4. The method of claim 1, further comprising determining that no target item tokens represented in the revised search space is similar to a second query item token, and in response, comparing a second query item token embedding to target item token embeddings of target items tokens included within an alternative search space present in the queue.
- 5. The method of claim 4, further comprising, for the second query item token comparing a second query item token embedding to target item token embeddings of target items tokens included within the revised search space.
- 6. The method of claim 1, wherein a first query item token embedding is determined by a first ML model, and the at least one target item token that is similar to the first query item token is identified by comparing the first query item token embedding to target item token embeddings of target items tokens included within the search space.
- 7. The method of claim 6, wherein, during a training phase, the first ML model is fine-tuned based on sets of perturbations, each set of perturbations corresponding to an item token.
- 8. The method of claim 1, wherein executing inference to match the query item to one or more target items in the set of target items to provide inference results comprises processing the query item and target items of the set of target items through a second ML model that outputs the inference results.
- 9. The method of claim 1, wherein a number of target items in the set of target items being less than a number of target items in the superset of target items.
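For illustration only, the tracker and the windowed queue of search spaces recited in claims 1 to 4 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Node`, `build_tree`, `same_token`, and `reduce_targets` are hypothetical, the search tree is assumed to be a token trie, the similarity test is a case-insensitive string comparison standing in for the embedding comparison of claim 6, and the value written into the tracker is assumed to be the queue position of the search space in which the token was found.

```python
from collections import deque


class Node:
    """One target item token in the search tree; children are the tokens that follow it."""

    def __init__(self, token=""):
        self.token = token
        self.children = {}     # token text -> Node
        self.item_ids = set()  # target items whose token sequence passes through this node


def build_tree(target_items):
    """Build a token trie from {item_id: [token, ...]} (assumed tree layout)."""
    root = Node()
    for item_id, tokens in target_items.items():
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, Node(tok))
            node.item_ids.add(item_id)
    return root


def same_token(a, b):
    """Assumed similarity test; the claims compare token embeddings instead."""
    return a.lower() == b.lower()


def reduce_targets(query_tokens, root, window=2, similar=same_token):
    tracker = [None] * len(query_tokens)  # array initialized with null values
    queue = deque([root], maxlen=window)  # search spaces; newest on the right
    candidates = set()
    for i, q in enumerate(query_tokens):
        # Try the newest (revised) search space first, then older alternatives.
        for pos in range(len(queue) - 1, -1, -1):
            hit = next((c for c in queue[pos].children.values()
                        if similar(q, c.token)), None)
            if hit is not None:
                tracker[i] = pos           # replace the null with a search space index
                candidates |= hit.item_ids
                queue.append(hit)          # children of the hit form the revised sub-space
                break
    return tracker, candidates


tree = build_tree({1: ["ACME", "Corp", "Berlin"], 2: ["ACME", "Ltd"], 3: ["Globex", "Inc"]})
tracker, candidates = reduce_targets(["acme", "xyz", "corp"], tree)
```

In this toy run the unmatched token "xyz" leaves its tracker slot null, so `tracker` is `[0, None, 1]`, and the candidate set `{1, 2}` is smaller than the three-item superset. Because the queue is a `deque` with `maxlen=window`, the oldest search space is dropped automatically once the window is full.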
Description
CLAIM OF PRIORITY
This application claims priority under 35 USC § 120 to U.S. patent application Ser. No. 17/723,586, filed on Apr. 19, 2022, the entire contents of which are hereby incorporated by reference.
BACKGROUND
Enterprises continuously seek to improve and gain efficiencies in their operations. To this end, enterprises employ software systems to support execution of operations. Recently, enterprises have embarked on the journey of so-called intelligent enterprise, which includes automating tasks executed in support of enterprise operations using machine learning (ML) systems. For example, one or more ML models are each trained to perform some task based on training data. Trained ML models are deployed, each receiving input (e.g., a computer-readable document) and providing output (e.g., classification of the computer-readable document) in execution of a task (e.g., document classification task).
ML systems can be used in a variety of problem spaces. An example problem space includes autonomous systems that are tasked with matching items of one entity to items of another entity. Examples include, without limitation, matching questions to answers, people to products, bank statements to invoices, and bank statements to customer accounts.
SUMMARY
Implementations of the present disclosure are directed to a machine learning (ML) system for matching a query item to one or more target items. More particularly, implementations of the present disclosure are directed to a ML system that reduces a number of target items from consideration as potential matches to a query item using token embeddings and a search tree.
In some implementations, actions include:
- receiving query item text associated with a query item that is to be matched to one or more target items of a superset of target items, the query item text including one or more query item tokens;
- for a first query item token of the query item text: determining, by a first ML model, a first query item token embedding; comparing the first query item token embedding to target item token embeddings of target item tokens included within a search space to identify at least one target item token that is sufficiently similar to the first query item token; and associating the first query item token with a revised search space within a tracker;
- determining a set of matched item tokens based on the tracker, the set of matched item tokens indicating one of a match and a partial match between a query item token and a target item token;
- defining a set of target items from the set of matched item tokens, a number of target items in the set of target items being less than a number of target items in the superset of target items; and
- providing inference results by processing the query item and target items of the set of target items through a second ML model, the inference results indicating a match between the query item and at least one target item in the set of target items.
Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
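The two-stage flow summarized above (embed tokens, keep only target items with a sufficiently similar token, then run a second model on the reduced set) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `embed` uses character-trigram counts as a stand-in for the first ML model, `second_stage` uses an averaged similarity score as a stand-in for the second ML model, the `0.5` threshold is arbitrary, and the filter scans target items directly instead of walking a search tree.

```python
import math
from collections import Counter


def embed(token):
    """Stand-in for the first ML model: counts of character trigrams."""
    padded = f"##{token.lower()}##"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_targets(query_tokens, targets, threshold=0.5):
    """Keep target items having at least one token sufficiently similar to a query token."""
    q_vecs = [embed(q) for q in query_tokens]
    kept = {}
    for item_id, tokens in targets.items():
        t_vecs = [embed(t) for t in tokens]
        if any(cosine(q, t) >= threshold for q in q_vecs for t in t_vecs):
            kept[item_id] = tokens
    return kept


def second_stage(query_tokens, candidates):
    """Stand-in for the second ML model: pick the candidate with the best mean token match."""
    def score(tokens):
        t_vecs = [embed(t) for t in tokens]
        best = [max(cosine(embed(q), t) for t in t_vecs) for q in query_tokens]
        return sum(best) / len(best)

    if not candidates or not query_tokens:
        return None
    return max(candidates, key=lambda item_id: score(candidates[item_id]))


targets = {1: ["ACME", "Corporation"], 2: ["Globex", "Inc"], 3: ["Acme", "Ltd"]}
query = ["acmee", "corporation"]
reduced = filter_targets(query, targets)  # first stage: shrink the superset
match = second_stage(query, reduced)      # second stage: inference on the reduced set
```

With this toy data the misspelled token "acmee" is still close enough to "ACME" to keep items 1 and 3, item 2 is dropped before the second stage runs, and item 1 is selected as the match. The point of the design is that the more expensive second model only sees the reduced set.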
These and other implementations can each optionally include one or more of the following features:
- the search space is defined within a search tree that includes a set of nodes, each node representative of a respective target item token in a set of target item tokens;
- the revised search space includes a search sub-space of the search space;
- the search space is provided in a queue of search spaces, a length of the queue being defined by a window parameter;
- actions further include, for a second query item token of the query item text, determining, by the first ML model, a second query item token embedding, and comparing the second query item token embedding to target item token embeddings of target item tokens included within the revised search space;
- actions further include determining that no target item token represented in the revised search space is sufficiently similar to the second query item token, and in response, comparing the second query item token embedding to target item token embeddings of target item tokens included within an alternative search space present in the queue; and
- during a training phase, the first ML model is fine-tuned based on sets of perturbations, each set of perturbations corresponding to an item token.
The present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein. The present disclosure further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when execu