CN-116484000-B - Financial text multi-label classification method and device


Abstract

The invention discloses a multi-label classification method and device for financial texts. The method comprises: obtaining a plurality of sentences of a financial text to be classified with multiple labels; and inputting the sentences into a pre-established financial text multi-label classification model to obtain a plurality of categories corresponding to each sentence. The model is trained cyclically, taking a small number of category-labeled samples as the initial sample set, until all unlabeled sentences have been labeled; the category-labeled samples obtained in each cycle are added to the sample set used to train the model in the next cycle. Unlabeled sentences are labeled by comparing the model's classification results with the results of an unsupervised preliminary clustering centered on the existing category-labeled samples, and during training the sample set data are dynamically adjusted by adaptively adjusting the Lawl and Lcal loss functions to train the model iteratively. The invention improves the accuracy of multi-label classification of financial texts.

Inventors

Zeng Juru
Li Sheng
Zhu Xun
Zhang Xiaoguang
Ma Xuejun
Liu Yueke

Assignees

银清科技有限公司

Dates

Publication Date: 20260508
Application Date: 20230424

Claims (13)

1. A method for multi-label classification of financial texts, comprising: acquiring a plurality of sentences corresponding to a financial text to be classified with multiple labels; and inputting the plurality of sentences into a pre-established financial text multi-label classification model, and identifying a plurality of categories corresponding to each sentence; wherein the financial text multi-label classification model is obtained by taking category-labeled samples, fewer in number than a preset sample number, as an initial sample set and cyclically executing an operation of training the financial text multi-label classification model until category labels have been assigned to all unlabeled sentences; the category-labeled samples obtained in each cycle are added to the sample set of the next cycle to train the financial text multi-label classification model; the unlabeled sentences are labeled by comparing the classification results of the financial text multi-label classification model with the results of an unsupervised preliminary clustering of the unlabeled sentences centered on the category-labeled samples fewer in number than the preset sample number; and during training the sample set data are dynamically adjusted by adaptively adjusting two loss functions, Lawl and Lcal, to iteratively train the model, wherein Lawl stands for Attention Wrong Loss and gives a higher weight to misclassified samples, and Lcal gives different weights to different labels.
2. The method of claim 1, further comprising pre-training to generate the financial text multi-label classification model as follows: acquiring a plurality of unlabeled sentences and a plurality of category-labeled samples corresponding to historical financial texts, wherein the number of category-labeled samples is less than a preset sample number; clustering the plurality of unlabeled sentences using an unsupervised clustering model to obtain an unsupervised cluster label corresponding to each unlabeled sentence, wherein the center of each cluster is a category-labeled sample; and taking the plurality of category-labeled samples as an initial sample set, cyclically executing the operation of training the financial text multi-label classification model until category labels have been assigned to all unlabeled sentences, to obtain the final trained financial text multi-label classification model, the following operations being executed in each cycle: training the financial text multi-label classification model of the current cycle using the category-labeled sample set of the current cycle, wherein the category-labeled sample set of the current cycle consists of the initial sample set and the category-labeled samples obtained in each cycle; for each category-labeled sample, finding the target unlabeled sentences whose distance to that sample is less than the distance threshold of the current cycle, and performing the following data labeling operation on them: inputting the target unlabeled sentences into the financial text multi-label classification model of the current cycle to obtain a model classification label for each target unlabeled sentence, and, when the unsupervised cluster label of a target unlabeled sentence is consistent with its model classification label, assigning that sentence the label of the corresponding category-labeled sample, thereby obtaining the category-labeled samples of the current cycle; and, when unlabeled sentences corresponding to the financial texts are detected to remain, adding the category-labeled samples obtained in the current cycle to the category-labeled sample set of the next cycle, increasing each distance threshold, and repeating the operation of training the financial text multi-label classification model, until no unlabeled sentences corresponding to the financial texts are detected to remain, whereby the final financial text multi-label classification model is obtained.
3. The method of claim 2, wherein training the financial text multi-label classification model of the current cycle using the category-labeled sample set of the current cycle comprises: dynamically adjusting the sample set data by adaptively adjusting the two loss functions Lawl and Lcal to iteratively train the financial text multi-label classification model of the current cycle.
4. The method of claim 3, wherein dynamically adjusting the sample set data by adaptively adjusting the two loss functions Lawl and Lcal to iteratively train the financial text multi-label classification model of the current cycle comprises: executing the following steps in each iteration until a preset iteration termination condition is met and the financial text multi-label classification model is obtained: obtaining a first weight processing result by adjusting the Lawl loss function to give higher weights to misclassified sample set data; on the basis of the first weight processing result, obtaining a second weight processing result by adjusting the Lcal loss function to give different weights to different labels of the sample set data; and training the model using the sample set data corresponding to the second weight processing result, until the preset iteration termination condition is met, to obtain the financial text multi-label classification model.
5. The method of any one of claims 1 to 4, wherein the Lawl loss function is: (formula not reproduced in the source text); where, for a dataset containing N samples $\{(x^1, y^1), \ldots, (x^N, y^N)\}$ with $y^k = [y_1^k, \ldots, y_i^k] \in \{0, 1\}^i$, the corresponding output of the financial text multi-label classification model is $z^k = [z_1^k, \ldots, z_i^k] \in \mathbb{R}$, and $p_i^k = \sigma(z_i^k)$ is calculated using the sigmoid function; k denotes the k-th sample, i is the category index of the sample, p is the category probability obtained by the sigmoid function, x represents the input financial text, y represents the set of category indices corresponding to the input financial text, and $\alpha \ge 0$ is a user-defined parameter.
6. The method of any one of claims 1 to 4, wherein the Lcal loss function is: (formula not reproduced in the source text); where, for a dataset containing N samples $\{(x^1, y^1), \ldots, (x^N, y^N)\}$ with $y^k = [y_1^k, \ldots, y_i^k] \in \{0, 1\}^i$, the corresponding output of the financial text multi-label classification model is $z^k = [z_1^k, \ldots, z_i^k] \in \mathbb{R}$, and $p_i^k = \sigma(z_i^k)$ is calculated using the sigmoid function; for a label with overall frequency $n_i$, $\alpha \ge 0$ and $\beta \in [0, 1]$ are user-defined parameters, i is the subscript of n, $n_i$ represents the probability of occurrence of the i-th category, k denotes the k-th sample, i is the category index of the sample, p is the category probability obtained by the sigmoid function, x represents the input financial text, and y represents the set of category indices corresponding to the input financial text.
7. A financial text multi-label classification device, comprising: an acquiring unit for acquiring a plurality of sentences corresponding to a financial text to be classified with multiple labels; and a multi-label classifying unit for inputting the plurality of sentences into a pre-established financial text multi-label classification model, and identifying a plurality of categories corresponding to each sentence; wherein the financial text multi-label classification model is obtained by taking category-labeled samples, fewer in number than a preset sample number, as an initial sample set and cyclically executing an operation of training the financial text multi-label classification model until category labels have been assigned to all unlabeled sentences; the category-labeled samples obtained in each cycle are added to the sample set of the next cycle to train the financial text multi-label classification model; the unlabeled sentences are labeled by comparing the classification results of the financial text multi-label classification model with the results of an unsupervised preliminary clustering of the unlabeled sentences centered on the category-labeled samples fewer in number than the preset sample number; and during training the sample set data are dynamically adjusted by adaptively adjusting two loss functions, Lawl and Lcal, to iteratively train the model, wherein Lawl stands for Attention Wrong Loss and gives a higher weight to misclassified samples, and Lcal gives different weights to different labels.
8. The apparatus of claim 7, further comprising a training unit for pre-training to generate the financial text multi-label classification model according to the following method: acquiring a plurality of unlabeled sentences and a plurality of category-labeled samples corresponding to historical financial texts, wherein the number of category-labeled samples is less than a preset sample number; clustering the plurality of unlabeled sentences using a clustering method with designated centers to obtain an unsupervised cluster label corresponding to each unlabeled sentence; and taking the plurality of category-labeled samples as an initial sample set, cyclically executing the operation of training the financial text multi-label classification model until labels have been assigned to all unlabeled sentences, to obtain the final trained financial text multi-label classification model, wherein each cycle comprises: training the financial text multi-label classification model of the current cycle using the category-labeled sample set of the current cycle, wherein the category-labeled sample set of the current cycle consists of the initial sample set and the category-labeled samples obtained in each cycle; for each unlabeled sentence whose distance to a labeled sample is less than the distance threshold of the current cycle for that labeled sample, determining a model-identified category label using the financial text multi-label classification model, and, when the unsupervised cluster label of the unlabeled sentence is consistent with its model-identified category label, assigning it the label of that labeled sample, thereby obtaining the labeled samples of the current cycle; and, when unlabeled sentences corresponding to the financial texts are detected to remain, adding the category-labeled samples obtained in the current cycle to the category-labeled sample set of the next cycle, increasing each distance threshold, and repeating the operation of training the financial text multi-label classification model, until no unlabeled sentences corresponding to the financial texts are detected to remain, so as to obtain the sample training data of the financial text multi-label classification model.
9. The apparatus of claim 8, wherein the training unit is specifically configured to dynamically adjust the sample set data by adaptively adjusting the two loss functions Lawl and Lcal to iteratively train the financial text multi-label classification model of the current cycle.
10. The apparatus of claim 9, wherein dynamically adjusting the sample set data by adaptively adjusting the two loss functions Lawl and Lcal to iteratively train the financial text multi-label classification model of the current cycle comprises: executing the following steps in each iteration until a preset iteration termination condition is met and the financial text multi-label classification model is obtained: obtaining a first weight processing result by adjusting the Lawl loss function to give higher weights to misclassified sample set data; on the basis of the first weight processing result, obtaining a second weight processing result by adjusting the Lcal loss function to give different weights to different labels of the sample set data; and training the model using the sample set data corresponding to the second weight processing result, until the preset iteration termination condition is met, to obtain the financial text multi-label classification model.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 6 when executing the computer program.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 6.
13. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any of claims 1 to 6.
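
The cyclic labeling procedure recited in claim 2 can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the claimed implementation: sentences are assumed to be already embedded as numeric vectors, each sentence carries a single label for brevity (the claimed model is multi-label), and the names `nearest_seed`, `CentroidModel`, and `self_train`, the nearest-centroid stand-in classifier, and the `threshold`, `step`, and `max_cycles` parameters are all hypothetical.

```python
import numpy as np

def nearest_seed(X, seeds):
    """Return (index of nearest seed, distance to it) for every row of X."""
    d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

class CentroidModel:
    """Hypothetical stand-in classifier: predicts the label of the nearest class centroid."""
    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        idx, _ = nearest_seed(X, self.centroids_)
        return self.labels_[idx]

def self_train(seed_X, seed_y, unlabeled_X, threshold=0.5, step=0.5, max_cycles=50):
    # Preliminary clustering: each unlabeled sentence joins the cluster of its
    # nearest labeled seed, so the seeds act as fixed cluster centers.
    seed_idx, seed_dist = nearest_seed(unlabeled_X, seed_X)
    cluster_label = seed_y[seed_idx]

    train_X, train_y = seed_X.copy(), seed_y.copy()
    pending = np.ones(len(unlabeled_X), dtype=bool)  # sentences still unlabeled

    for _ in range(max_cycles):  # bound added here so the sketch always terminates
        if not pending.any():
            break
        model = CentroidModel().fit(train_X, train_y)
        # Candidates: still unlabeled and within the current distance threshold.
        near = pending & (seed_dist < threshold)
        agree = np.zeros_like(pending)
        if near.any():
            # Accept a candidate only if model prediction and cluster label agree.
            agree[near] = model.predict(unlabeled_X[near]) == cluster_label[near]
        train_X = np.vstack([train_X, unlabeled_X[agree]])
        train_y = np.concatenate([train_y, cluster_label[agree]])
        pending &= ~agree
        threshold += step  # widen the search radius for the next cycle

    final_model = CentroidModel().fit(train_X, train_y)
    return final_model, train_X, train_y, pending
```

In this sketch a sentence whose model prediction never agrees with its cluster label stays in `pending`; the `max_cycles` bound is an addition of the sketch to guarantee termination and is not taken from the claims.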

Description

Financial text multi-label classification method and device

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a financial text multi-label classification method and device.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

The goal of a multi-label classification task is to assign input data to multiple categories. For example, "an increase in the non-performing loan ratio" affects both the "asset quality" and the "management quality" of a bank. Similarly, "inflating deposits and loans" and "handling bill business without a real trade background" involve several aspects at once: "profitability", "asset quality", and "management quality".

Most existing classification models for financial texts are single-label models. Existing multi-label classification models can only handle simple tasks, that is, tasks in which the number of label categories is small, the semantics are explicit, and the labels appear in the sentences themselves. Their effectiveness is therefore low when they are deployed to specific scenarios in which the categories are numerous and implied semantics must be inferred; for example, existing classification models perform poorly on the classification of commercial banks, and some models fail to meet practical classification requirements because they support few categories and are hard to extend. Multi-label classification differs from multi-class classification in that, in the former, categories can coexist, whereas the latter selects one category out of many. As a result, the classification results of existing financial text multi-label classification schemes are inaccurate.
Disclosure of Invention

An embodiment of the invention provides a financial text multi-label classification method for improving the accuracy of multi-label classification of financial texts, the method comprising: acquiring a plurality of sentences corresponding to a financial text to be classified with multiple labels; and inputting the plurality of sentences into a pre-established financial text multi-label classification model, and identifying a plurality of categories corresponding to each sentence. The financial text multi-label classification model is obtained by taking category-labeled samples, fewer in number than a preset sample number, as an initial sample set and cyclically executing an operation of training the model until category labels have been assigned to all unlabeled sentences. The category-labeled samples obtained in each cycle are added to the sample set of the next cycle to train the model; the unlabeled sentences are labeled by comparing the classification results of the model with the results of an unsupervised preliminary clustering of the unlabeled sentences based on the category-labeled samples fewer in number than the preset sample number; and during training the sample set data are dynamically adjusted by adaptively adjusting the two loss functions Lawl and Lcal to train the model iteratively.
An embodiment of the invention also provides a financial text multi-label classification device for improving the accuracy of multi-label classification of financial texts, the device comprising: an acquiring unit for acquiring a plurality of sentences corresponding to a financial text to be classified with multiple labels; and a multi-label classifying unit for inputting the plurality of sentences into a pre-established financial text multi-label classification model, and identifying a plurality of categories corresponding to each sentence. The financial text multi-label classification model is obtained by taking category-labeled samples, fewer in number than a preset sample number, as an initial sample set and cyclically executing an operation of training the model until category labels have been assigned to all unlabeled sentences. The category-labeled samples obtained in each cycle are added to the sample set of the next cycle to train the model; the unlabeled sentences are labeled by comparing the classification results of the model with the results of an unsupervised preliminary clustering of the unlabeled sentences based on the category-labeled samples fewer in number than the preset sample number; and during training the sample set data are dynamically adjusted by adaptively adjusting the two loss functions Lawl and Lcal to train the model iteratively.
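
The two-stage weighting attributed to Lawl and Lcal (higher weights for misclassified samples, then different weights for different labels) can be illustrated with a re-weighted binary cross-entropy. The formulas of Lawl and Lcal are not reproduced in this text, so the sketch below is an assumption made purely for illustration and is not the patented loss: the function name `weighted_bce`, the factor `1 + alpha * wrong`, and the per-label factor `label_freq ** (-beta)` are hypothetical choices; only the symbols α ≥ 0, β ∈ [0, 1], the label frequency n_i, and p = σ(z) come from the claims.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce(z, y, label_freq, alpha=1.0, beta=0.5, eps=1e-12):
    """Illustrative re-weighted binary cross-entropy (NOT the patented formulas).

    z: (N, C) model outputs, y: (N, C) targets in {0, 1},
    label_freq: (C,) positive overall label frequencies n_i.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    n = np.asarray(label_freq, dtype=float)

    p = sigmoid(z)  # p_i^k = sigma(z_i^k)
    bce = -(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

    # First weighting (assumed form): entries the model currently gets wrong
    # are scaled by 1 + alpha, all others keep weight 1.
    wrong = (p >= 0.5) != (y == 1.0)
    w_wrong = 1.0 + alpha * wrong

    # Second weighting (assumed form): per-label weight n_i^(-beta), so that
    # rarer labels contribute more when beta > 0.
    w_label = n ** (-beta)

    return float(np.mean(bce * w_wrong * w_label))
```

With `alpha=0` and `beta=0` both factors equal 1 and the function reduces to plain binary cross-entropy, which gives a convenient baseline for checking the effect of each weighting separately.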