CN-121278091-B - Method for generating text abstract, text processing method, device, equipment, medium and program product

Abstract

The embodiments of the application provide a method for generating a text abstract, a text processing method, an apparatus, a device, a medium and a program product. The method for generating a text abstract comprises: obtaining a text to be processed; performing cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster; determining a first parameter set and a second parameter set for each sentence cluster in the at least one sentence cluster, wherein the first parameter set of a target sentence cluster is the set formed by the position weights of the sentences in the target sentence cluster, and the second parameter set of the target sentence cluster is the set formed by the semantic distances between each sentence in the target sentence cluster and the cluster centroid of the target sentence cluster; determining a representative sentence of each sentence cluster according to its first parameter set and second parameter set; and combining the representative sentences of the at least one sentence cluster to obtain the abstract of the text to be processed.

Inventors

Zhu Runsu
LAN YONGHUI
GUO JUNNING
CHEN SHAOQIONG
DUAN CHANGLONG

Assignees

深圳市智城软件技术服务有限公司
深圳市智慧城市科技发展集团有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-05

Claims (15)

1. A method for generating a text abstract, the method comprising: acquiring a text to be processed and a preset balance coefficient; performing cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster; determining a first parameter set and a second parameter set for each sentence cluster in the at least one sentence cluster, wherein the first parameter set of a target sentence cluster is the set formed by the position weights, in the text to be processed, of the sentences in the target sentence cluster, the second parameter set of the target sentence cluster is the set formed by the semantic distances between each sentence in the target sentence cluster and the cluster centroid of the target sentence cluster, and the target sentence cluster is any sentence cluster in the at least one sentence cluster; determining a representative sentence of each sentence cluster in the at least one sentence cluster according to the balance coefficient, the first parameter set of the sentence cluster and the second parameter set of the sentence cluster, wherein the representative sentence of each sentence cluster is the sentence with the lowest comprehensive score in that sentence cluster, and the comprehensive score of each sentence is calculated from the position weight of the sentence in the text to be processed and the semantic distance between the sentence and the cluster centroid of the sentence cluster to which it belongs; and combining the representative sentences of the at least one sentence cluster in the order in which they appear in the text to be processed to obtain the abstract of the text to be processed.
2. The method of claim 1, wherein determining the representative sentence of each sentence cluster in the at least one sentence cluster according to the respective first parameter set and the respective second parameter set of each sentence cluster comprises: calculating a first product of the balance coefficient and a second parameter of a target sentence, wherein the target sentence is any sentence in the text to be processed, the second parameter of the target sentence is the semantic distance between the target sentence and a target cluster centroid, and the target cluster centroid is the cluster centroid of the sentence cluster to which the target sentence belongs; calculating a first difference of 1 minus a first parameter of the target sentence, wherein the first parameter of the target sentence is the position weight of the target sentence in the text to be processed; calculating a second difference of 1 minus the balance coefficient; calculating a second product of the first difference and the second difference; calculating the sum of the first product and the second product to obtain the comprehensive score of the target sentence; and determining the sentence with the lowest comprehensive score in the target sentence cluster as the representative sentence of the target sentence cluster.
3. The method of claim 2, wherein the first parameter of the target sentence is determined by: acquiring the total number of sentences in the text to be processed; calculating a third difference of a target index minus 1, wherein the target index is the index of the target sentence, the index of the first sentence in the text to be processed is 1, the indexes of any two adjacent sentences in the text to be processed differ by 1, and a sentence that appears earlier in the text to be processed always has a smaller index than a sentence that appears later; calculating a fourth difference of the total number of sentences minus the target index; calculating the minimum of the third difference and the fourth difference; calculating a fifth difference of the total number of sentences minus 1; calculating a first quotient of the fifth difference divided by 2; calculating a first sum of the first quotient and a very small positive number; calculating a second quotient of the minimum divided by the first sum; and calculating a sixth difference of 1 minus the second quotient to obtain the first parameter of the target sentence.
4. The method of claim 2, wherein the second parameter of the target sentence is determined by: acquiring a first embedding vector of the target sentence and a second embedding vector of the target cluster centroid; calculating the Euclidean distance between the first embedding vector and the second embedding vector to obtain a target Euclidean distance between the target sentence and the target cluster centroid; obtaining the minimum Euclidean distance and the maximum Euclidean distance in a target Euclidean distance set, wherein the target Euclidean distance set is the set formed by the Euclidean distances between each sentence in the sentence cluster to which the target sentence belongs and the target cluster centroid; calculating a sixth difference of the target Euclidean distance minus the minimum Euclidean distance; calculating a seventh difference of the maximum Euclidean distance minus the target Euclidean distance; calculating a second sum of the seventh difference and a very small positive number; and calculating a third quotient of the sixth difference divided by the second sum to obtain the second parameter of the target sentence.
5. The method of claim 1, wherein performing cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster comprises: acquiring a preset constraint coefficient and the total number of sentences in the text to be processed; calculating a fourth quotient of the total number of sentences divided by the constraint coefficient; rounding the fourth quotient down to obtain the number of clusters; and clustering all sentences in the text to be processed using the number of clusters as a constraint to obtain that number of sentence clusters.
6. The method of claim 5, wherein clustering all sentences in the text to be processed using the number of clusters as a constraint to obtain that number of sentence clusters comprises: selecting, from the text to be processed, a number of different sentences equal to the number of clusters as initial cluster centroids; and clustering all sentences in the text to be processed with the selected initial cluster centroids as a reference to obtain that number of sentence clusters.
7. The method of claim 5, wherein clustering all sentences in the text to be processed using the number of clusters as a constraint to obtain that number of sentence clusters comprises: performing at least two rounds of an initial cluster centroid selection step to obtain at least two initial cluster centroid sets, wherein the initial cluster centroid selection step selects, from the text to be processed, a number of different sentences equal to the number of clusters as initial cluster centroids, each round of the initial cluster centroid selection step yields one initial cluster centroid set, and the initial cluster centroid sets in the at least two initial cluster centroid sets differ from one another; performing cluster analysis on all sentences in the text to be processed with each initial cluster centroid set in the at least two initial cluster centroid sets as a reference to obtain at least two sentence clusters, wherein one sentence cluster in the at least two sentence clusters corresponds to one initial cluster centroid set in the at least two initial cluster centroid sets, and the at least two sentence clusters and the at least two initial cluster centroid sets are in one-to-one correspondence; determining the concentration of each of the at least two sentence clusters; and taking the sentence clusters with the highest concentration as the sentence clusters, in the number of clusters, finally obtained by the cluster analysis.
8. An information extraction method, comprising: acquiring a target text to be processed and an extraction prompt word; if the extraction prompt word indicates that deep information of the target text is to be extracted, generating an abstract of the target text according to the method for generating a text abstract of any one of claims 1 to 7; inputting the abstract of the target text and the extraction prompt word into an information extraction model for information extraction to obtain first target information output by the information extraction model; and storing the first target information in a structured form.
9. The method of claim 8, wherein the method further comprises: if the extraction prompt word indicates that shallow information of the target text is to be extracted, determining suspected fragments of the target text that are suspected to contain the shallow information; combining all the suspected fragments in the order in which they appear in the target text to obtain a simplified text; inputting the simplified text and the extraction prompt word into an information extraction model for information extraction to obtain second target information output by the information extraction model; and storing the second target information in a structured form.
10. The method of claim 8, wherein the method further comprises: if the extraction prompt word indicates that both deep information and shallow information of the target text are to be extracted, generating an abstract of the target text according to the method for generating a text abstract of any one of claims 1 to 7, and determining suspected fragments of the target text that are suspected to contain the shallow information; combining all the suspected fragments in the order in which they appear in the target text to obtain a simplified text; inputting the abstract of the target text, the simplified text and the extraction prompt word into an information extraction model for information extraction to obtain first target information and second target information output by the information extraction model; and storing the first target information and the second target information in a structured form.
11. An apparatus for generating a text abstract, the apparatus comprising: a first acquisition module, configured to acquire a text to be processed and a preset balance coefficient; a first clustering module, configured to perform cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster; a first determining module, configured to determine a first parameter set and a second parameter set for each sentence cluster in the at least one sentence cluster, wherein the first parameter set of a target sentence cluster is the set formed by the position weights, in the text to be processed, of the sentences in the target sentence cluster, the second parameter set of the target sentence cluster is the set formed by the semantic distances between each sentence in the target sentence cluster and the cluster centroid of the target sentence cluster, and the target sentence cluster is any sentence cluster in the at least one sentence cluster; a second determining module, configured to determine a representative sentence of each sentence cluster in the at least one sentence cluster according to the balance coefficient, the first parameter set of the sentence cluster and the second parameter set of the sentence cluster, wherein the representative sentence of each sentence cluster is the sentence with the lowest comprehensive score in that sentence cluster, and the comprehensive score of each sentence is calculated from the position weight of the sentence in the text to be processed and the semantic distance between the sentence and the cluster centroid of the sentence cluster to which it belongs; and a first combination module, configured to combine the representative sentences of the at least one sentence cluster in the order in which they appear in the text to be processed to obtain the abstract of the text to be processed.
12. An information extraction apparatus, the apparatus comprising: a second acquisition module, configured to acquire a target text to be processed and an extraction prompt word; a first generation module, configured to generate an abstract of the target text according to the method for generating a text abstract of any one of claims 1 to 7 if the extraction prompt word indicates that deep information of the target text is to be extracted; a first extraction module, configured to input the abstract of the target text and the extraction prompt word into an information extraction model for information extraction to obtain first target information output by the information extraction model; and a first storage module, configured to store the first target information in a structured form.
13. An electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method for generating a text abstract of any one of claims 1 to 7 or the steps of the information extraction method of any one of claims 8 to 11.
14. A readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method for generating a text abstract of any one of claims 1 to 7 or the steps of the information extraction method of any one of claims 8 to 11.
15. A computer program product comprising a program or instructions which, when executed by a processor, implement the steps of the method for generating a text abstract of any one of claims 1 to 7 or the steps of the information extraction method of any one of claims 8 to 11.
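
The arithmetic recited in claims 2 to 5 can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function and parameter names are assumptions, and `EPS` is an assumed value for the "very small positive number", which the claims do not fix. `normalized_distance` follows the wording of claim 4 literally, so its denominator is the maximum distance minus the target distance.

```python
import math

# Assumed stand-in for the "very small positive number"; the claims give no value.
EPS = 1e-9


def position_weight(index: int, total: int, eps: float = EPS) -> float:
    """Claim 3: 1 - min(index - 1, total - index) / ((total - 1) / 2 + eps).

    `index` is 1-based, so the first sentence has index 1.
    """
    steps_to_nearest_end = min(index - 1, total - index)
    return 1.0 - steps_to_nearest_end / ((total - 1) / 2 + eps)


def normalized_distance(d: float, d_min: float, d_max: float, eps: float = EPS) -> float:
    """Claim 4 as worded: (d - d_min) / ((d_max - d) + eps).

    `d` is the Euclidean distance from a sentence to its cluster centroid;
    `d_min` and `d_max` are the extremes of that distance within the cluster.
    """
    return (d - d_min) / ((d_max - d) + eps)


def comprehensive_score(balance: float, distance: float, weight: float) -> float:
    """Claim 2: balance * distance + (1 - weight) * (1 - balance)."""
    return balance * distance + (1.0 - weight) * (1.0 - balance)


def cluster_count(total: int, constraint: float) -> int:
    """Claim 5: total number of sentences divided by the constraint coefficient, rounded down."""
    return math.floor(total / constraint)
```

Under these definitions, and assuming a balance coefficient between 0 and 1, a sentence at the very start or end of the text that is also the closest member to its cluster centroid scores 0, the lowest possible value, so it would be chosen as the representative of its cluster.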

Description

Method for generating text abstract, text processing method, device, equipment, medium and program product

Technical Field

The present application relates to the field of computers, and more particularly to a method for generating a text abstract, a text processing method, an apparatus, a device, a medium and a program product.

Background

With the wide application of natural language processing technology in government, law, enterprise management, finance, health care and education, efficiently processing the information in long texts (such as policy documents, legal documents, industry reports, medical documents and academic papers) has become a core requirement of each industry. For example, government agencies need to distill the core content of long policy documents quickly to support policy interpretation and the handling of public business, legal professionals need to capture key information from massive volumes of legal documents to improve judicial efficiency, and medical institutions need to extract the core content of lengthy medical documents and medical records to accelerate drug development and disease research.

However, processing long text manually is time-consuming and labor-intensive, and key information is easily missed or misread because of subjective factors, which hinders downstream uses of the information (such as entity attribute extraction, data storage and query). Without an efficient technique for generating text abstracts, downstream processing has to operate directly on the long text, which consumes substantial computing resources and is inefficient. Thus, there is a need for a technique for automatically generating text abstracts.

Disclosure of Invention

The embodiments of the application aim to provide a method for generating a text abstract, a text processing method, an apparatus, a device, a medium and a program product, which can automatically generate a text abstract to a certain extent.
A first aspect of an embodiment of the present application provides a method for generating a text abstract, the method including: acquiring a text to be processed; performing cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster; determining a first parameter set and a second parameter set for each sentence cluster in the at least one sentence cluster, wherein the first parameter set of a target sentence cluster is the set formed by the position weights, in the text to be processed, of the sentences in the target sentence cluster, the second parameter set of the target sentence cluster is the set formed by the semantic distances between each sentence in the target sentence cluster and the cluster centroid of the target sentence cluster, and the target sentence cluster is any sentence cluster in the at least one sentence cluster; determining a representative sentence of each sentence cluster in the at least one sentence cluster according to the first parameter set and the second parameter set of that sentence cluster; and combining the representative sentences of the at least one sentence cluster in the order in which they appear in the text to be processed to obtain the abstract of the text to be processed.
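
The final two steps of the first aspect, choosing one representative sentence per cluster and combining the representatives in their original order, can be sketched in Python as below. This is a hedged illustration only: the function name `assemble_abstract`, its parameters and the tie-breaking rule are assumptions, and the cluster labels and per-sentence scores are taken as already computed by earlier steps (a lower score is treated as better, as recited in claim 1).

```python
from collections import defaultdict


def assemble_abstract(sentences, labels, scores, separator=" "):
    """Pick the lowest-scoring sentence of each cluster and join the picks in text order.

    sentences: the sentences of the text, in their original order
    labels:    the cluster id assigned to each sentence by the clustering step
    scores:    the comprehensive score of each sentence (lower is better)
    """
    if not (len(sentences) == len(labels) == len(scores)):
        raise ValueError("sentences, labels and scores must have the same length")

    # Group sentence positions by cluster id.
    members = defaultdict(list)
    for position, label in enumerate(labels):
        members[label].append(position)

    # One representative per cluster: the member with the lowest score.
    # Ties go to the earlier sentence (an assumption; the source does not address ties).
    picks = [
        min(positions, key=lambda p: (scores[p], p))
        for positions in members.values()
    ]

    # Restore the order of appearance in the text before joining.
    return separator.join(sentences[p] for p in sorted(picks))


example = assemble_abstract(
    ["A.", "B.", "C.", "D."],
    labels=[0, 1, 0, 1],
    scores=[0.9, 0.2, 0.1, 0.5],
)
```

In this toy input, cluster 0 contains "A." and "C." and cluster 1 contains "B." and "D."; the lower-scoring members are "C." and "B.", which are then emitted in text order.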
A second aspect of an embodiment of the present application provides an information extraction method, including: acquiring a target text to be processed and an extraction prompt word; if the extraction prompt word indicates that deep information of the target text is to be extracted, generating an abstract of the target text according to the method for generating a text abstract of the first aspect; inputting the abstract of the target text and the extraction prompt word into an information extraction model for information extraction to obtain first target information output by the information extraction model; and storing the first target information in a structured form.

A third aspect of an embodiment of the present application provides an apparatus for generating a text abstract, the apparatus including: a first acquisition module, configured to acquire a text to be processed; a first clustering module, configured to perform cluster analysis on all sentences in the text to be processed to obtain at least one sentence cluster; a first determining module, configured to determine a first parameter set and a second parameter set for each sentence cluster in the at least one sentence cluster, wherein the first parameter set of a target sentence cluster is the set formed by the position weights, in the text to be processed, of the sentences in the target sentence cluster, the second parameter set of the target sentence cluster is the set formed by the semantic distances between each sentence in the target sentence cluster and the cluster centroid of the target sentence cluster, and the target sentence cluster is any s