US-12626063-B2 - Forming a hypothesis set from sentences across documents representative of different stances taken across the documents

Abstract

Provided are a computer program product, system, and method for forming a hypothesis set from sentences across documents representative of different stances taken across the documents. Sentences from the documents are clustered into a plurality of clusters. Sentences in a cluster of the clusters have stance scores with respect to other sentences in the cluster that satisfy a stance criteria. At least one similarity group of sentences is formed in the clusters having similarity scores satisfying a similarity criteria. Sentences are selected from the similarity groups in the clusters based on stance scores of the sentences in a similarity group. A hypothesis set is formed of the selected sentences in the similarity groups. Stance scores are determined of sentences in the documents with the sentences in the hypothesis set to determine stances of the documents with respect to the sentences in the hypothesis set.
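The last two steps of the abstract, forming the hypothesis set and scoring documents against it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the precomputed `variance` mapping, the `is_similar` predicate, the signed score convention (positive for pro, negative for con), and the use of the mean as the aggregate statistic are all hypothetical choices made for the example. The claims only require a pro-con variance threshold, a non-similarity check against sentences already selected, and some aggregate statistic per document.

```python
from statistics import mean
from typing import Callable, Dict, List


def form_hypothesis_set(
    candidates: List[str],
    variance: Dict[str, float],
    is_similar: Callable[[str, str], bool],
    variance_threshold: float,
) -> List[str]:
    """Greedily keep candidates with a high pro-con variance score that are
    not similar to a sentence already in the hypothesis set."""
    hypotheses: List[str] = []
    for sentence in candidates:
        # Skip sentences whose pro-con variance does not exceed the threshold.
        if variance[sentence] <= variance_threshold:
            continue
        # Skip sentences similar to one that was already selected.
        if any(is_similar(sentence, chosen) for chosen in hypotheses):
            continue
        hypotheses.append(sentence)
    return hypotheses


def document_stance_scores(
    document: List[str],
    hypotheses: List[str],
    stance_score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Aggregate sentence-level stance scores into one score per hypothesis.

    Assumption: the aggregate statistic is the mean, and the document
    contains at least one sentence (mean raises on an empty list).
    """
    return {
        h: mean([stance_score(sentence, h) for sentence in document])
        for h in hypotheses
    }
```

In practice `stance_score` and `is_similar` would be backed by the machine-learned stance detector and similarity group generator the document describes; here they are plain callables so the control flow can be read in isolation.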

Inventors

Futoshi Iwama
MD MARUF HOSSAIN
Mikio Takeuchi

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 20260512
Application Date: 20230721

Claims (17)

1. A computer program product for extracting hypotheses from documents for a stance measurement, the computer program product comprising a computer readable storage medium having computer readable program code embodied therein that is executable to perform operations, the operations comprising: determining, with a stance detector, machine stance scores of sentences; clustering the sentences from the documents into a plurality of clusters, wherein the sentences in a cluster of the clusters have stance scores with respect to other sentences in the cluster that satisfy a stance criteria; forming, with a similarity group generator, at least one similarity group of the sentences in each of a plurality of the clusters, wherein each similarity group in one of the clusters has the sentences in one of the clusters having similarity scores satisfying a similarity criteria, wherein a plurality of similarity groups are formed from the sentences in one of the clusters, and wherein the stance detector and the similarity group generator are implemented using machine learning algorithms; selecting sentences from similarity groups in the clusters based on stance scores of the sentences in the similarity groups; forming a hypothesis set of the selected sentences in the similarity groups; and determining the stance scores of sentences in the documents with the sentences in the hypothesis set to determine stances of the documents with respect to the sentences in the hypothesis set.
2. The computer program product of claim 1, wherein the forming the hypothesis set further comprises: for a candidate selected sentence of the selected sentences, adding the candidate selected sentence to the hypothesis set in response to the candidate selected sentence having a pro-con variance score greater than a pro-con variance score threshold and the candidate selected sentence is not similar to a selected sentence already included in the hypothesis set.
3. The computer program product of claim 1, wherein the stance criteria comprises a stance score threshold, such that a stance score exceeding the stance score threshold indicates one sentence in a cluster has a clear pro or con stance with respect to another sentence in the cluster, and wherein the similarity criteria comprises a similarity score threshold, such that a similarity score exceeding the similarity score threshold indicates sentences that are similar.
4. The computer program product of claim 1, wherein the operations further comprise: determining, for the clusters, a pro-con variance score with respect to each sentence s_i in a cluster and at least one other sentence s_j in the cluster as a difference of a pro stance score between s_i and s_j and a con stance between s_i and s_j, wherein the forming the hypothesis set comprises only including sentences having a pro-con variance score exceeding a pro-con variance score threshold.
5. The computer program product of claim 1, wherein each pair of sentences in one cluster having a stance score satisfying the stance criteria includes one sentence having a strong stance score with respect to another sentence in the cluster.
6. The computer program product of claim 1, wherein the determining the stance scores of sentences in the documents with the sentences in the hypothesis set comprises: for each document of the documents, determining an aggregate statistic of stance scores of each sentence in the document with respect to each sentence in the hypothesis set, wherein the aggregate statistic of the stance scores with respect to one sentence in the hypothesis set comprises a document stance score for the document with respect to the sentence in the hypothesis set.
7. A system for extracting hypotheses from documents for a stance measurement, comprising: a processor; and a computer readable storage medium having computer readable program code embodied therein that is executable to perform operations, the operations comprising: determining, with a stance detector, stance scores of sentences; clustering the sentences from the documents into a plurality of clusters, wherein the sentences in a cluster of the clusters have stance scores with respect to other sentences in the cluster that satisfy a stance criteria; forming, with a similarity group generator, at least one similarity group of the sentences in each of a plurality of the clusters, wherein each similarity group in one of the clusters has the sentences in one of the clusters having similarity scores satisfying a similarity criteria, wherein a plurality of similarity groups are formed from the sentences in one of the clusters, and wherein the stance detector and the similarity group generator are implemented using machine learning algorithms; selecting sentences from similarity groups in the clusters based on stance scores of the sentences in the similarity groups; forming a hypothesis set of the selected sentences in the similarity groups; and determining the stance scores of sentences in the documents with the sentences in the hypothesis set to determine stances of the documents with respect to the sentences in the hypothesis set.
8. The system of claim 7, wherein the forming the hypothesis set further comprises: for a candidate selected sentence of the selected sentences, adding the candidate selected sentence to the hypothesis set in response to the candidate selected sentence having a pro-con variance score greater than a pro-con variance score threshold and the candidate selected sentence is not similar to a selected sentence already included in the hypothesis set.
9. The system of claim 7, wherein the operations further comprise: determining, for the clusters, a pro-con variance score with respect to each sentence s_i in a cluster and at least one other sentence s_j in the cluster as a difference of a pro stance score between s_i and s_j and a con stance between s_i and s_j, wherein the forming the hypothesis set comprises only including sentences having a pro-con variance score exceeding a pro-con variance score threshold.
10. The system of claim 7, wherein the determining the stance scores of sentences in the documents with the sentences in the hypothesis set comprises: for each document of the documents, determining an aggregate statistic of stance scores of each sentence in the document with respect to each sentence in the hypothesis set, wherein the aggregate statistic of the stance scores with respect to one sentence in the hypothesis set comprises a document stance score for the document with respect to the sentence in the hypothesis set.
11. A method for extracting hypotheses from documents for a stance measurement, comprising: determining, with a stance detector, stance scores of sentences; clustering the sentences from the documents into a plurality of clusters, wherein the sentences in a cluster of the clusters have stance scores with respect to other sentences in the cluster that satisfy a stance criteria; forming, with a similarity group generator, at least one similarity group of the sentences in each of a plurality of the clusters, wherein each similarity group in one of the clusters has the sentences in one of the clusters having similarity scores satisfying a similarity criteria, wherein a plurality of similarity groups are formed from the sentences in one of the clusters, and wherein the stance detector and the similarity group generator are implemented using machine learning algorithms; selecting sentences from similarity groups in the clusters based on stance scores of the sentences in the similarity groups; forming a hypothesis set of the selected sentences in the similarity groups; and determining the stance scores of sentences in the documents with the sentences in the hypothesis set to determine stances of the documents with respect to the sentences in the hypothesis set.
12. The method of claim 11, wherein the forming the hypothesis set further comprises: for a candidate selected sentence of the selected sentences, adding the candidate selected sentence to the hypothesis set in response to the candidate selected sentence having a pro-con variance score greater than a pro-con variance score threshold and the candidate selected sentence is not similar to a selected sentence already included in the hypothesis set.
13. The method of claim 11, wherein the stance criteria comprises a stance score threshold, such that a stance score exceeding the stance score threshold indicates one sentence in a cluster has a clear pro or con stance with respect to another sentence in the cluster, and wherein the similarity criteria comprises a similarity score threshold, such that a similarity score exceeding the similarity score threshold indicates sentences that are similar.
14. The method of claim 11, further comprising: determining, for the clusters, a pro-con variance score with respect to each sentence s_i in a cluster and at least one other sentence s_j in the cluster as a difference of a pro stance score between s_i and s_j and a con stance between s_i and s_j, wherein the forming the hypothesis set comprises only including sentences having a pro-con variance score exceeding a pro-con variance score threshold.
15. The method of claim 11, wherein each pair of sentences in one cluster having a stance score satisfying a stance criteria includes one sentence having a strong stance score with respect to another sentence in the cluster.
16. The method of claim 11, wherein the determining the stance scores of sentences in the documents with the sentences in the hypothesis set comprises: for each document of the documents, determining an aggregate statistic of stance scores of each sentence in the document with respect to each sentence in the hypothesis set, wherein the aggregate statistic of the stance scores with respect to one sentence in the hypothesis set comprises a document stance score for the document with respect to the sentence in the hypothesis set.
17. The method of claim 11, wherein the clustering the sentences comprises: for N documents d_1 . . . d_N, perform for document d_i: determine stance scores for sentence s_j of a plurality of sentences in d_i and a sentence s_k of a plurality of sentences in d_i′, for i=1 to N and i′=(i+1) mod N; forming a graph of nodes representing sentences in the documents with edges between pairs of sentences s_j in d_i and s_k in d_i′ having stance scores exceeding a stance score criteria, wherein the clusters are comprised of nodes interconnected by the edges in the graph.
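The graph-based clustering recited in the last claim can be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes zero-based document indices (so the paired document is `(i + 1) % n`), treats the stance score criteria as a simple threshold, finds the interconnected nodes with a union-find structure, and drops sentences that have no edge. The names `cluster_by_stance` and `toy_score` are hypothetical, and `toy_score` is plain word overlap standing in for a trained stance detector.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Node = Tuple[int, int]  # (document index, sentence index), zero-based


def cluster_by_stance(
    docs: List[List[str]],
    stance_score: Callable[[str, str], float],
    threshold: float,
) -> List[List[Node]]:
    """Link sentences of document i and document (i + 1) mod N whose stance
    score exceeds the threshold, then return the connected components."""
    n = len(docs)
    parent: Dict[Node, Node] = {
        (i, j): (i, j) for i, doc in enumerate(docs) for j in range(len(doc))
    }

    def find(node: Node) -> Node:
        # Follow parent links to the component root, halving paths as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for i in range(n):
        nxt = (i + 1) % n
        if nxt == i:  # a single document has no partner document
            continue
        for j, k in product(range(len(docs[i])), range(len(docs[nxt]))):
            if stance_score(docs[i][j], docs[nxt][k]) > threshold:
                # An edge in the graph: merge the two components.
                parent[find((i, j))] = find((nxt, k))

    components: Dict[Node, List[Node]] = {}
    for node in list(parent):
        components.setdefault(find(node), []).append(node)
    # Assumption: a sentence with no edges does not form a cluster.
    return sorted(sorted(c) for c in components.values() if len(c) > 1)


def toy_score(a: str, b: str) -> float:
    """Stand-in scorer (word overlap); NOT a real stance detector."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)
```

Each returned cluster is a list of `(document index, sentence index)` pairs, which can then be passed to the similarity grouping step.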

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer program product, system, and method for forming a hypothesis set from sentences across documents representative of different stances taken across the documents.

2. Description of the Related Art

Sentiment analysis is a subset of natural language processing (NLP) capabilities that provides high-level filters to explore and evaluate the sentiment of data. Sentiment analysis of user-entered text is used to construct an enhanced perspective on the voice of the user or creator of the text. Sentiment analysis has also been used to score documents and extract indicators from them. An example of the use of sentiment analysis is the central bank sentiment index (CBSI), which is derived from central bank documents.

Other techniques to determine positions and stances taken in documents include Latent Dirichlet Allocation (LDA), which is a topic modeling technique to extract abstract topics that occur in a collection of documents. Another technique to determine positions in documents is key point analysis, which extracts a set of concise and high-level statements from a given collection of arguments, representing the gist of these arguments.

SUMMARY

Provided are a computer program product, system, and method for forming a hypothesis set from sentences across documents representative of different stances taken across the documents. Sentences from the documents are clustered into a plurality of clusters. Sentences in a cluster of the clusters have stance scores with respect to other sentences in the cluster that satisfy a stance criteria. At least one similarity group of sentences is formed in the clusters having similarity scores satisfying a similarity criteria. Sentences are selected from the similarity groups in the clusters based on stance scores of the sentences in a similarity group. A hypothesis set is formed of the selected sentences in the similarity groups. Stance scores are determined of sentences in the documents with the sentences in the hypothesis set to determine stances of the documents with respect to the sentences in the hypothesis set.

Further provided are a computer program product, system, and method for forming a hypothesis set from sentences across documents representative of different stances taken across the documents. For N documents d_1 . . . d_N, perform for document d_i, for each sentence s_j in d_i, determine a stance score between sentence s_j and each sentence s_k in d_(i+1) d_i, for i=1 to N that satisfies a stance score criteria. A graph of nodes is formed representing sentences in the documents, with edges between each pair of sentences having stance scores exceeding the stance score criteria. Cluster groups are formed, wherein each cluster group is comprised of nodes interconnected by the edges in the graph. At least one similarity group of similar sentences is formed in each cluster. A hypothesis set is formed of sentences in the similarity groups to form a set of hypotheses across the documents that are dissimilar, in that they are from different similarity groups, and that indicate a threshold likelihood of agreement or disagreement with another sentence in the cluster.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of a computing environment in which stance scores are used to form a hypothesis set of sentences in the documents to distinguish documents based on stances of sentences in the documents.
FIG. 2 illustrates an embodiment of stance cluster information.
FIG. 3 illustrates an embodiment of operations to process a set of documents to extract a hypothesis set to distinguish the documents based on stances from the documents.
FIG. 4 illustrates an example of performing the operations of FIG. 3 on documents.
FIG. 5 illustrates an embodiment of operations to form a stance cluster of sentences in the documents that have related stances.
FIG. 6 illustrates an example of performing the operations of FIG. 5 on documents.
FIG. 7 illustrates a computing environment in which the components of FIG. 1 may be implemented.

DETAILED DESCRIPTION

Described embodiments provide improvements to computer technology for determining a hypothesis set of sentences that provides insight into a set of documents, by extracting from the document set candidate hypothesis sentences that can distinguish each document based on the stances from each document to the hypotheses. Described embodiments provide techniques to determine the hypothesis set from sentences that captures the differences in stances of individual documents in the document set. Described embodiments further improve the process by automatically extracting hypothesis sentences using stance measurements. Described embodiments determine a hypothesis set by, first, clustering the sentences that have a clear stance with each other using a stance detector to detect clear stances and then compute the variance of stance (