US-12619641-B2 - Finding similar documents using least frequent terms

US12619641B2US 12619641 B2US12619641 B2US 12619641B2US-12619641-B2

Abstract

A computer-implemented method for performing a document search action includes receiving a search request identifying a sample document. A plurality of terms occurring within the sample document is identified. Each term is scored based on a global term frequency of that term within a collection of documents. Each term is then weight based on proximity, within the sample document, of that term to at least one of the other terms of the first plurality of terms. A document search query is created to include a proximity-limiting clause restricting results to include documents that have a highest-weighted term of the first plurality of terms within a threshold distance of another term of the first plurality of terms. The document search query is performed, thereby resulting in identification of a first resulting document.

Inventors

Ge Wang
Samuel J. Shelton
Bhavanesh RENGARAJAN
Nicholas James Robinson
Sundar KAMESWARAN
Venkata Rao SALADI

Assignees

MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date: 20260505
Application Date: 20240308

Claims (20)

1 . A search system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to, in response to receiving a first search request from a computing device of a user: identify a first term occurring within a sample document stored in a storage device; assign a score to the first term based at least in part on a global term frequency of the first term within at least one other document; weight the score for the first term based on proximity, within the sample document, of the first term to a second term and based on the second term having a score that is within a threshold of the score of the first term; create a document search query that includes a proximity-limiting clause restricting a result of the document search query to include at least one document that has the first term within a threshold distance of the second term; cause the document search query to be performed, resulting in identifications of first resulting documents; remove a duplicate document from the first resulting documents, resulting in at least one filtered resulting document; cause the at least one filtered resulting document to be displayed on a graphical user interface via the computing device; in response to receiving a user input from the computing device, identifying a selection of at least one selected document of the at least one filtered resulting document; in response to receiving a second search request from the computing device, initiating a search action based on the at least one selected document, wherein the second search request is different from the first search request; and performing the search action using the at least one selected document to obtain a result.
2 . The search system of claim 1 , wherein the assigning further comprises: identifying a global term frequency (GTF) table that defines frequency buckets, each frequency bucket of the frequency buckets includes an associated global frequency value and an associated plurality of words; searching the GTF table for a frequency bucket having the first term; and assigning the associated global frequency value of the frequency bucket as the score for the first term.
3 . The search system of claim 1 , wherein the instructions are further operative to: generate a similarity score for a first document of the first resulting documents, the similarity score being based on a comparison between the first document and the sample document; and remove the first document from the first resulting documents based at least in part on the similarity score of the first document.
4 . The search system of claim 1 , wherein the instructions are further operative to remove the duplicate document from the first resulting documents by: determining that the duplicate document and a first document of the first resulting documents have a same hash value; and removing the duplicate document from the first resulting documents based on the determining.
5 . The search system of claim 1 , wherein the instructions are further operative to: determine a local term frequency of the first term based at least in part on a frequency of occurrence of the first term within the sample document, wherein scoring the first term includes dividing the local term frequency of the first term by the global term frequency of the first term, thereby generating a term frequency-inverse term global frequency (TF-ITGF) value for the first term.
6 . The search system of claim 1 , wherein the instructions are further operative to: identify first and second email-type resulting documents resulting from the document search query, wherein the second email-type resulting document includes at least some content that is included in the first email-type resulting document; and remove the second email-type resulting document from results of the document search query based on determining that the second email-type resulting document includes at least some content that is included in the first email-type resulting document.
7 . The search system of claim 1 , wherein removing the duplicate document from the first resulting documents comprises removing the duplicate document based on a hash value of the duplicate document.
8 . The search system of claim 1 , wherein removing the duplicate document from the first resulting documents comprises identifying identical documents within the first resulting documents.
9 . A computer-implemented method for performing a document search, the method comprising: in response to receiving a first search request from a computing device of a user: identifying a first term occurring within a sample document stored in a storage device; assigning, by a computer system, a score to the first term based at least in part on a global term frequency of the first term within at least one other document; weighting, by the computer system, the score for the first term based on proximity, within the sample document, of the first term to a second term and based on the second term having a score that is within a threshold of the score of the first term; creating, by the computer system, a document search query that includes a proximity-limiting clause restricting a result of the document search query to include at least one document that has the first term within a threshold distance of the second term; causing the document search query to be performed by the computer system, resulting in identification of first resulting documents; removing, by the computer system, a duplicate document from the first resulting documents, resulting in at least one filtered resulting document; causing the at least one filtered resulting document to be displayed on a graphical user interface via the computing device; in response to receiving a user input from the computing device, identifying a selection of at least one selected document of the at least one filtered resulting document; in response to receiving a second search request from the computing device, initiating a search action based on the at least one selected document, wherein the second search request is different from the first search request; and performing, by the computer system, the search action using the at least one selected document to obtain a result.
10 . The method of claim 9 , wherein the assigning further comprises: identifying a global term frequency (GTF) table that defines frequency buckets, each frequency bucket of the frequency buckets includes an associated global frequency value and an associated plurality of words; searching the GTF table for a frequency bucket having the first term; and assigning the associated global frequency value of the frequency bucket as the score for the first term.
11 . The method of claim 9 , the method further comprising: generating a similarity score for a first document of the first resulting documents, the similarity score being based on a comparison between the first document and the sample document; and removing the first document from the first resulting documents based at least in part on the similarity score of the first document.
12 . The method of claim 9 , wherein removing the duplicate document from the first resulting documents comprises: determining that the duplicate document and a first document of the first resulting documents have matching hash values; and removing the duplicate document from the first resulting documents based on the determining.
13 . The method of claim 9 , further comprising: determining a local term frequency of the first term based at least in part on a frequency of occurrence of the first term within the sample document, wherein scoring the first term includes dividing the local term frequency of the first term by the global term frequency of the first term, thereby generating a term frequency-inverse term global frequency (TF-ITGF) value for the first term.
14 . The method of claim 9 , further comprising: identifying first and second email-type resulting documents resulting from the document search query, wherein the second email-type resulting document includes at least some content that is included in the first email-type resulting document; and removing the second email-type resulting document from results of the document search query based on determining that the second email-type resulting document includes at least some content that is included in the first email-type resulting document.
15 . A computer storage media having computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: in response to receiving a first search request from a computing device of a user: identifying a first term occurring within a sample document stored in a storage device; assigning a score to the first term based at least in part on a global term frequency of the first term within at least one other document; weighting the score for the first term based on proximity, within the sample document, of the first term to a second term and based on the second term having a score that is within a threshold of the score of the first term; creating a document search query that includes a proximity-limiting clause restricting a result of the document search query to include at least one document that has the first term within a threshold distance of the second term; causing the document search query to be performed, resulting in identification of first resulting documents; removing a duplicate document from the first resulting documents, resulting in at least one filtered resulting document; causing the at least one filtered resulting document to be displayed on a graphical user interface via the computing device; in response to receiving a user input from the computing device, identifying a selection of at least one selected document of the at least one filtered resulting document; in response to receiving a second search request from the computing device, initiating a search action based on the at least one selected document, wherein the second search request is different from the first search request; and performing the search action using the at least one selected document to obtain a result.
16 . The computer storage media of claim 15 , wherein scoring the first term further comprises: identifying a global term frequency (GTF) table that defines frequency buckets, each frequency bucket of the frequency buckets includes an associated global frequency value and an associated plurality of words; searching the GTF table for a frequency bucket having the first term; and assigning the associated global frequency value of the frequency bucket as the score for the first term.
17 . The computer storage media of claim 15 , the operations further comprising: generating a similarity score for a first document of the first resulting documents, the similarity score being based on a comparison between the first document and the sample document; and removing the first document from the first resulting documents based at least in part on the similarity score of the first document.
18 . The computer storage media of claim 15 , removing the duplicate document from the first resulting documents comprises: determining that the duplicate document and a first document of the first resulting documents have a same hash value; and removing the duplicate document from the first resulting documents based on the determining.
19 . The computer storage media of claim 15 , the operations further comprising: determining a local term frequency of the first term based at least in part on a frequency of occurrence of the first term within the sample document, wherein scoring the first term includes dividing the local term frequency of the first term by the global term frequency of the first term, thereby generating a term frequency-inverse term global frequency (TF-ITGF) value for the first term.
20 . The computer storage media of claim 15 , the operations further comprising: identifying first and second email-type resulting documents resulting from the document search query, wherein the second email-type resulting document includes at least some content that is included in the first email-type resulting document; and removing the second email-type resulting document from results of the document search query based on determining that the second email-type resulting document includes at least some content that is included in the first email-type resulting document.

Description

BACKGROUND Finding similar or related content within an organization is a key challenge for data investigators. An entire investigation can often hinge on the ability to find relevant content and communications. Organizations spend time and resources to identify and collect potentially relevant content to address security investigations, regulatory or policy requirements, or in litigation, for example. Finding documents by similarity is a technical problem in information retrieval and analysis. While some search techniques such as file hash or search term matching have been developed, these existing techniques often exhibit poor performance. SUMMARY The disclosed examples are described in detail below with reference to the accompanying drawing figures listed below. The following summary is provided to illustrate some examples disclosed herein. The following is not meant, however, to limit all examples to any particular configuration or sequence of operations. Example solutions for performing a document search include: identifying terms occurring within a sample document; scoring each of the terms based at least in part on a global term frequency of the term within a collection of documents; weighting the score for each term based on proximity, within the sample document, of the term to at least one of the other terms; creating a document search query that includes a proximity-limiting clause restricting a result of the document search query to include documents that have a highest-weighted term of the first plurality of terms within a threshold distance of another term of the first plurality of terms; causing the document search query to be performed on a collection of documents, thereby resulting in identification of a first resulting document. BRIEF DESCRIPTION OF THE DRAWINGS The disclosed examples are described in detail below with reference to the accompanying drawing figures listed below: FIG. 1 illustrates an example search system that facilitates searching a corpus of documents based on similarity to one or more sample documents; FIG. 2 illustrates an example email thread tree constructed for an email thread of conversation occurring between multiple participants that may be used to reduce the number of documents presented in response to a document search, such as the search action of FIG. 1; FIG. 3 illustrates an anatomy of the contents of the example email shown in FIG. 2; FIG. 4 is a flowchart illustrating exemplary operations that may be performed by the system such as that of FIG. 1 for performing a document search action using a sample document; and FIG. 5 is a block diagram of an example computing device (e.g., a computer storage device) for implementing aspects disclosed herein. Corresponding reference characters indicate corresponding parts throughout the drawings. Any of the drawings may be combined into a single example or embodiment. DETAILED DESCRIPTION In some examples, a search system is provided that facilitates searching a corpus of documents given one or more “sample documents” that are identified as being pertinent to the search. More specifically, an example search system allows the investigator to initiate a search by identifying a sample document (e.g., a text-inclusive document, such as an email, a web page, a word processing document, a PDF, or the like). A search device analyzes this sample document for “term frequency” as well as “term global frequency”. Term frequency refers to calculating the relative frequency of each term within the document, as a measure of how important each term is to the sample document. Term global frequency refers to a measure of how common or uncommon a given term typically occurs across a particular set of documents, such as via inverse global frequency. In some examples, the search device uses “frequency bins” to calculate term global frequency, where each frequency bin includes a particular frequency value and a plurality of words that share that particular frequency value. The search device uses the term frequency and term global frequency to identify and rank terms that are significant in the sample document and that are also uncommon words. The search device also evaluates each top term based on its distance, within the sample document, to each of the other top terms (e.g., as identified by the term frequency rankings). More specifically, the search device calculates a distance-weighted aggregation of inverse global frequency for each particular top term from neighbors of all its occurrences (e.g., based on a threshold distance from the term to the neighbor top term within the example document). As such, this additional ranking provides a proximity-ranked list of terms that considers both the term frequency and the proximity of such terms to each other. The search device uses the proximity-ranked list of terms to generate a bespoke query that can be used during a search of a corpus of documents. In examples, the search device generates a Bo