CN-115982600-B - Matching model training method, equipment and medium


Abstract

Embodiments of the disclosure provide a matching model training method, equipment and medium, relating to the field of computer technology. The method comprises: clustering the original data in an unlabeled target data set with a clustering algorithm to obtain a plurality of data clusters, wherein the target data set is obtained by combining two unlabeled candidate data sets; splicing every two pieces of original data in the plurality of data clusters to obtain N pieces of spliced data; combining every two pieces of spliced data among the N pieces of spliced data to obtain K data groups; determining the label value of each data group from how the original data in that data group were clustered; and training an untrained matching model on the K data groups and the label value of each data group until a trained matching model is obtained. Labels are thus assigned to unlabeled data based on clustering and data source, the result is refined through iterative training, and a trained matching model with higher accuracy is finally obtained.

Inventors

HUANG YUYAO
FU WEIWEI
WANG YAN
ZHANG YIN

Assignees

China Telecom Corporation Limited (中国电信股份有限公司)

Dates

Publication Date: 20260505
Application Date: 20221228

Claims (10)

1. A matching model training method, the method comprising: clustering the original data in a target data set through a clustering algorithm to obtain a plurality of data clusters, wherein the target data set is obtained by combining two unlabeled candidate data sets; splicing every two pieces of original data in the plurality of data clusters to obtain N pieces of spliced data; combining every two pieces of spliced data in the N pieces of spliced data to obtain K data groups, and determining a label value of each data group according to the data clustering condition corresponding to the original data in the K data groups; and training an untrained matching model according to the K data groups and the label value of each data group until the matching loss function value corresponding to the untrained matching model is smaller than a first preset value and the similarity result obtained by the untrained matching model when predicting on the original data in the plurality of data clusters is larger than a second preset value, thereby obtaining a trained matching model.
2. The matching model training method according to claim 1, wherein combining every two pieces of spliced data in the N pieces of spliced data to obtain K data groups, and determining the label value of each data group according to the data clustering condition corresponding to the original data in the K data groups, comprises: in the process of combining every two pieces of spliced data, evaluating each data group obtained and determining the label value corresponding to that data group; wherein the evaluation for any one data group of the K data groups is as follows: if the original data in the two pieces of spliced data in the data group all belong to the same candidate data set, determining the label value of the data group according to the data clustering condition corresponding to the four pieces of original data in the data group; and if any two pieces of original data in the two pieces of spliced data in the data group belong to different candidate data sets, determining the label value of the data group according to the candidate data set condition and the data clustering condition corresponding to the four pieces of original data in the data group.
3. The matching model training method according to claim 2, wherein the data group comprises first spliced data and second spliced data, the first spliced data comprises first original data positioned in the front half when spliced and second original data positioned in the rear half when spliced, and the second spliced data comprises third original data and fourth original data; wherein, if the original data in the two pieces of spliced data in the data group belong to the same candidate data set, determining the label value of the data group according to the data clustering condition corresponding to the four pieces of original data in the data group comprises: if the first original data and the second original data in the first spliced data belong to the same data cluster, and the original data in the second spliced data and the original data in the first spliced data belong to the same data cluster, determining the label value of the data group as a first numerical value; if the first original data and the second original data in the first spliced data belong to the same data cluster, and either the third original data and the first original data belong to the same data cluster while the fourth original data and the second original data do not, or the third original data and the first original data do not belong to the same data cluster while the fourth original data and the second original data do, determining the label value of the data group as a second numerical value; if the first original data and the second original data in the first spliced data belong to the same data cluster, the third original data and the first original data do not belong to the same data cluster, and the fourth original data and the second original data do not belong to the same data cluster, determining the label value of the data group as a third numerical value; if the first original data and the second original data in the first spliced data do not belong to the same data cluster, the third original data and the first original data belong to the same data cluster, and the fourth original data and the second original data belong to the same data cluster, determining the label value of the data group as the first numerical value; if the first original data and the second original data in the first spliced data do not belong to the same data cluster, and either the third original data and the first original data belong to the same data cluster while the fourth original data and the second original data do not, or the third original data and the first original data do not belong to the same data cluster while the fourth original data and the second original data do, determining the label value of the data group as the second numerical value; and if the first original data and the second original data in the first spliced data do not belong to the same data cluster, the third original data and the first original data do not belong to the same data cluster, and the fourth original data and the second original data do not belong to the same data cluster, determining the label value of the data group as the third numerical value.
4. The matching model training method according to claim 2, wherein the data group comprises first spliced data and second spliced data, the first spliced data comprises first original data positioned in the front half when spliced and second original data positioned in the rear half when spliced, and the second spliced data comprises third original data and fourth original data; wherein, if any two pieces of original data in the two pieces of spliced data in the data group belong to different candidate data sets, determining the label value of the data group according to the candidate data set condition and the data clustering condition corresponding to the four pieces of original data in the data group comprises: if the first original data and the third original data belong to the same candidate data set, the second original data and the fourth original data do not belong to the same candidate data set, and the first original data and the third original data belong to the same data cluster, determining the label value of the data group as a second numerical value; if the first original data and the third original data do not belong to the same candidate data set, the second original data and the fourth original data belong to the same candidate data set, and the second original data and the fourth original data belong to the same data cluster, determining the label value of the data group as the second numerical value; if the first original data and the third original data belong to the same candidate data set, the second original data and the fourth original data do not belong to the same candidate data set, and the first original data and the third original data do not belong to the same data cluster, determining the label value of the data group as a third numerical value; if the first original data and the third original data do not belong to the same candidate data set, the second original data and the fourth original data belong to the same candidate data set, and the second original data and the fourth original data do not belong to the same data cluster, determining the label value of the data group as the third numerical value; and if the first original data and the third original data do not belong to the same candidate data set, and the second original data and the fourth original data do not belong to the same candidate data set, determining the label value of the data group as the third numerical value.
5. The matching model training method of claim 1, wherein the untrained matching model comprises two fully connected layers and an untrained matching sub-model; and training the untrained matching model according to the K data groups and the label value of each data group comprises: iteratively training the untrained matching model on the K data groups and the label value of each data group, wherein one iteration of training comprises: extracting samples to be trained from the K data groups, inputting the samples to be trained sequentially into the two fully connected layers, and training the self-representation of the original data to obtain a first output result; inputting the first output result into the untrained matching sub-model, and training the correlation between the original data in the K data groups to obtain a second output result; determining a matching loss function value according to the label value corresponding to the samples to be trained and the second output result; adjusting network parameters of the untrained matching model according to the matching loss function value until the matching loss function value is smaller than the first preset value, to obtain an intermediate matching model; predicting on the original data in the plurality of data clusters with the intermediate matching model to obtain a similarity result; and if the similarity result is larger than the second preset value, obtaining the trained matching model.
6. The matching model training method of claim 1 or 5, wherein the matching loss function is as follows: [formula not reproduced in this text], wherein the symbols of the formula represent, respectively: the matching loss function; the label value corresponding to a sample to be trained extracted from the K data groups; and the second output result obtained by training the untrained matching model.
7. The matching model training method according to claim 5, wherein predicting on the original data in the plurality of data clusters according to the intermediate matching model to obtain the similarity result comprises: for the plurality of data clusters, selecting any one data cluster, determining the original data corresponding to the cluster center of that data cluster, taking the original data corresponding to the cluster center as first data to be predicted, and taking the other original data in that data cluster, excluding the original data corresponding to the cluster center, as second data to be predicted; splicing the first data to be predicted with itself to obtain first spliced data to be predicted; splicing the second data to be predicted with itself to obtain second spliced data to be predicted; taking the first spliced data to be predicted and the second spliced data to be predicted as one group of spliced data to be predicted; determining a plurality of groups of spliced data to be predicted from the plurality of data clusters; inputting the plurality of groups of spliced data to be predicted into the intermediate matching model for prediction to obtain the similarity corresponding to each group; and averaging the similarities corresponding to the plurality of groups of spliced data to be predicted to determine the similarity result.
8. The matching model training method of claim 1, wherein after the trained matching model is obtained, the method further comprises: acquiring first data to be matched and second data to be matched; splicing the first data to be matched with itself to obtain first spliced data to be matched; splicing the second data to be matched with itself to obtain second spliced data to be matched; and inputting the first spliced data to be matched and the second spliced data to be matched into the trained matching model to determine the matching similarity between the first spliced data to be matched and the second spliced data to be matched.
9. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of claims 1-8 by executing the executable instructions.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1-8.
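The similarity check of claim 7 and the two-threshold stopping condition of claims 1 and 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `build_prediction_groups`, `similarity_result`, `training_finished` and the `predict` callable are hypothetical, and forming one prediction group per non-center member of each cluster is an assumed reading of "determining a plurality of groups of spliced data to be predicted".

```python
from statistics import mean

def build_prediction_groups(clusters):
    """Build groups of spliced data to be predicted.

    `clusters` is a list of (center, members) tuples, where `members`
    holds the other original data of the cluster (center excluded).
    Assumption: one group is formed per non-center member.
    """
    groups = []
    for center, members in clusters:
        first_spliced = (center, center)        # center spliced with itself
        for member in members:
            second_spliced = (member, member)   # member spliced with itself
            groups.append((first_spliced, second_spliced))
    return groups

def similarity_result(clusters, predict):
    """Average the similarity predicted for every group.

    `predict` is a hypothetical stand-in for the intermediate matching
    model: it takes two spliced pairs and returns a similarity score.
    """
    groups = build_prediction_groups(clusters)
    return mean(predict(first, second) for first, second in groups)

def training_finished(loss, similarity, first_preset, second_preset):
    """Stop only when the loss is below the first preset value AND the
    averaged similarity is above the second preset value."""
    return loss < first_preset and similarity > second_preset
```

Because members of a cluster are expected to resemble their cluster center, a high averaged similarity indicates that the intermediate model agrees with the clustering that produced the labels.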

Description

Matching model training method, equipment and medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to a matching model training method, equipment and medium.

Background

With the development of computer technology, more and more methods are being applied in everyday life. A matching model, for example, is mainly used to study the relationship between two pieces of text, and is widely used in scenarios such as text question answering, recommendation, intelligent customer service, dialogue quality inspection, and database question answering.

In the related art, a matching model is mainly trained on labeled training data. Such data is hard to obtain, and labeling training data takes considerable time and effort and is therefore costly. How to obtain a matching model from unlabeled training data is thus a problem to be solved urgently.

It should be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure provides a matching model training method, equipment and medium that can train a matching model on unlabeled data so that the matching model predicts more accurately.

Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by practice of the disclosure.
In a first aspect, embodiments of the present disclosure provide a matching model training method, the method comprising: clustering the original data in a target data set through a clustering algorithm to obtain a plurality of data clusters, wherein the target data set is obtained by combining two unlabeled candidate data sets; splicing every two pieces of original data in the plurality of data clusters to obtain N pieces of spliced data; combining every two pieces of spliced data in the N pieces of spliced data to obtain K data groups, and determining a label value of each data group according to the data clustering condition corresponding to the original data in the K data groups; and training an untrained matching model according to the K data groups and the label value of each data group until the matching loss function value corresponding to the untrained matching model is smaller than a first preset value and the similarity result obtained by the untrained matching model when predicting on the original data in the plurality of data clusters is larger than a second preset value, thereby obtaining a trained matching model.
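The splicing and combining steps of the first aspect can be illustrated with a short Python sketch. It is only a sketch under assumptions: the function names are hypothetical, "every two pieces" is read as unordered pairs without repetition, and the clustering step itself is omitted (any clustering algorithm could attach a cluster id to each record beforehand).

```python
from itertools import combinations

def merge_candidates(candidate_a, candidate_b):
    """Combine two unlabeled candidate data sets into one target data set,
    remembering which candidate set each record came from."""
    return [("A", x) for x in candidate_a] + [("B", x) for x in candidate_b]

def splice_and_group(records):
    """Splice every two records, then combine every two spliced pairs.

    Assumption: pairs are unordered and without repetition, so for n
    records N = n*(n-1)/2 and K = N*(N-1)/2.
    """
    spliced = list(combinations(records, 2))   # N pieces of spliced data
    groups = list(combinations(spliced, 2))    # K data groups
    return spliced, groups

target = merge_candidates(["q1", "q2"], ["r1", "r2"])
spliced, groups = splice_and_group(target)
print(len(spliced), len(groups))  # 6 15
```

Under this reading K grows roughly as n^4 / 8 for n original records, which is one reason a training loop would draw samples from the K data groups rather than enumerate all of them in every iteration.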
In a possible embodiment, combining every two pieces of spliced data in the N pieces of spliced data to obtain K data groups, and determining the label value of each data group according to the data clustering condition corresponding to the original data in the K data groups, comprises: in the process of combining every two pieces of spliced data, evaluating each data group obtained and determining the label value corresponding to that data group; wherein the evaluation for any one data group of the K data groups is as follows: if the original data in the two pieces of spliced data in the data group all belong to the same candidate data set, determining the label value of the data group according to the data clustering condition corresponding to the four pieces of original data in the data group; and if any two pieces of original data in the two pieces of spliced data in the data group belong to different candidate data sets, determining the label value of the data group according to the candidate data set condition and the data clustering condition corresponding to the four pieces of original data in the data group.
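The label decision described above, with the detailed cases taken from claims 3 and 4, can be condensed into a small Python sketch. Several things here are assumptions rather than part of the source: the names `Item`, `position_matches` and `label_group` are hypothetical, the numeric values 1.0, 0.5 and 0.0 are placeholders for the unspecified first, second and third numerical values, and the case in which both position pairs share a candidate data set while the two positions come from different candidate data sets is not enumerated in the claims, so the sketch simply applies the same counting rule to it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    source: str   # candidate data set the original data came from
    cluster: int  # data cluster assigned by the clustering step

# Placeholder values (assumption): the source only names a first,
# second and third numerical value without stating them here.
FIRST_VALUE, SECOND_VALUE, THIRD_VALUE = 1.0, 0.5, 0.0

def position_matches(a: Item, b: Item) -> bool:
    """Two items at the same splice position match when they share both
    the candidate data set and the data cluster."""
    return a.source == b.source and a.cluster == b.cluster

def label_group(first_spliced, second_spliced) -> float:
    """Label a data group of two spliced pairs (d1, d2) and (d3, d4).

    Counting matches of d1 with d3 and of d2 with d4 reproduces the
    enumerated cases: two matches give the first value, exactly one
    gives the second value, none gives the third value.
    """
    (d1, d2), (d3, d4) = first_spliced, second_spliced
    matches = int(position_matches(d1, d3)) + int(position_matches(d2, d4))
    return {2: FIRST_VALUE, 1: SECOND_VALUE, 0: THIRD_VALUE}[matches]
```

For example, with `a = Item("A", 0)` and `c = Item("B", 0)`, the call `label_group((a, c), (a, a))` yields the second value: the front halves agree in candidate data set and cluster, while the rear halves come from different candidate data sets.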
In one possible embodiment, the data group comprises first spliced data and second spliced data, wherein the first spliced data comprises first original data positioned in the front half when spliced and second original data positioned in the rear half when spliced; if the original data in the two pieces of spliced data in the data group belong to the same candidate data set, determining the label value of the data group according to the data clustering condition corresponding to the four pieces of original data in the data group comprises: if the first original data and the second original data in the first spliced data belong to the same data cluster, and the original data in the second spliced data and the original data in the first spliced data belong to the same data cluster, determining the label value of the data group as a first numerical value; if the first original data and the second original data in the first spliced data belong to the same da