CN-121998128-A - Model training method and device, data processing method and device and electronic equipment

CN121998128A

Abstract

The invention provides a model training method and device, a data processing method and device, and electronic equipment. The model training method comprises: determining a first mask text and a second mask text with different language attribution types; processing training data pairs in a first training data set based on a second mask probability to obtain a third mask text and a fourth mask text; inputting the first, second, third, and fourth mask texts into the model to obtain a first predicted text, a second predicted text, and a third predicted text, respectively; and adjusting the weights of a second encoder and a mask language model prediction head based on the first, second, and third predicted texts and the training data pairs. Texts with different language attributions are thereby aligned in language semantics, so that when searching, users of different languages can obtain accurate results without first translating the text to be searched into the language attribution of the search library.

Inventors

  • CAI SHIQING
  • LI ZHIFEI

Assignees

  • 出门问问创新科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-04

Claims (10)

  1. A method of model training, the model comprising a first encoder, a second encoder, a cross decoder, and a mask language model prediction head, the method comprising: determining a first training data set comprising a plurality of training data pairs; processing training data pairs of the first training data set based on a first mask probability to obtain a first mask text and a second mask text with different language attribution types; inputting the first mask text into the first encoder to obtain a first encoding vector and a first mask text vector; inputting the second mask text into the second encoder to obtain a second encoding vector and a second mask text vector; processing training data pairs in the first training data set based on a second mask probability to obtain a third mask text and a fourth mask text with different language attribution types; inputting one of the first encoding vector and the second encoding vector, and one of the third mask text and the fourth mask text, into the cross decoder to obtain a third mask text vector; inputting the first mask text vector into the mask language model prediction head to obtain a first predicted text; inputting the second mask text vector into the mask language model prediction head to obtain a second predicted text; inputting the third mask text vector into the mask language model prediction head, the output of which is a third predicted text; and adjusting the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, the third predicted text, and the training data pairs.
  2. The method of claim 1, wherein: the first mask probability is less than the second mask probability; the language attribution type of text the first encoder can process is different from that of text the second encoder can process; the language attribution type of the first mask text is the same as that of the text the first encoder can process; the language attribution type of the second mask text is the same as that of the text the second encoder can process; and each training data pair comprises two training texts with different language attributions and the same semantics.
  3. The method of claim 1, wherein inputting one of the first encoding vector and the second encoding vector, and one of the third mask text and the fourth mask text, into the cross decoder to obtain the third mask text vector comprises: splicing the first encoding vector and the third mask text to obtain a first spliced text; splicing the second encoding vector and the fourth mask text to obtain a second spliced text; exchanging or not exchanging the encoding-vector parts of the first spliced text and the second spliced text, and exchanging or not exchanging the mask-text parts of the first spliced text and the second spliced text, to obtain input texts for the cross decoder; and inputting the input texts into the cross decoder to obtain the third mask text vector.
  4. The method of claim 1, wherein adjusting the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, the third predicted text, and the training data pair comprises: determining a first loss based on the first predicted text and a first training text in the training data pair having the same language attribution type as the first predicted text, wherein the first predicted text comprises predictions of the text in the first training text that is masked based on the first mask probability; determining a second loss based on the second predicted text and a second training text in the training data pair having the same language attribution type as the second predicted text, wherein the second predicted text comprises predictions of the text in the second training text that is masked based on the first mask probability; determining a third loss based on the third predicted text and the training data pair, wherein the third predicted text comprises predictions of the text in the first training text or the second training text that is masked based on the second mask probability; determining a first training loss based on the first loss, the second loss, and the third loss; and adjusting the weights of the second encoder and the mask language model prediction head based on the first training loss.
  5. The method of claim 1, wherein after adjusting the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, and the training data pair, the method further comprises: determining a second training data set comprising at least one training sample, each sample comprising a training text, a first related text, and a plurality of second related texts; inputting the training text, the first related text, and the plurality of second related texts into the first encoder or the second encoder based on their language attribution types, to obtain a training text feature corresponding to the training text, a first related feature corresponding to the first related text, and a second related feature corresponding to each second related text; and adjusting the weights of the second encoder based on the training text feature, the first related feature, and all of the second related features; wherein, in each sample, the degree of relevance between the first related text and the training text is greater than a first threshold, and the degree of relevance between each second related text and the training text is less than a second threshold.
  6. The method of claim 5, wherein adjusting the weights of the second encoder based on the training text feature, the first related feature, and all of the second related features comprises: determining a first relevance based on the training text feature and the first related feature; determining a second relevance based on the training text feature, the first related feature, and all of the second related features; determining a second training loss based on the first relevance and the second relevance; and adjusting the weights of the second encoder based on the second training loss.
  7. A data processing method, wherein a first encoder and a second encoder are implemented based on a model obtained by training according to any one of claims 1 to 6, the method comprising: determining the language attribution type of data to be retrieved; transmitting the data to be retrieved to the first encoder in response to the language attribution type of the data to be retrieved being a first language attribution type; transmitting the data to be retrieved to the second encoder in response to the language attribution type of the data to be retrieved being a second language attribution type; and determining matching data corresponding to the data to be retrieved based on the embedding value output by the first encoder or the embedding value output by the second encoder; wherein the language attribution types processed by the first encoder and the second encoder are different.
  8. A model training apparatus, the apparatus comprising: a first determining unit configured to determine a first training data set comprising a plurality of training data pairs; a first masking unit configured to process training data pairs of the first training data set based on a first mask probability to obtain a first mask text and a second mask text with different language attribution types; an encoding unit configured to input the first mask text into a first encoder to obtain a first encoding vector and a first mask text vector, and to input the second mask text into a second encoder to obtain a second encoding vector and a second mask text vector; a second masking unit configured to process training data pairs in the first training data set based on a second mask probability to obtain a third mask text and a fourth mask text with different language attribution types; an input unit configured to input one of the first encoding vector and the second encoding vector, and one of the third mask text and the fourth mask text, into a cross decoder to obtain a third mask text vector; a prediction unit configured to input the first mask text vector into a mask language model prediction head to obtain a first predicted text, to input the second mask text vector into the mask language model prediction head to obtain a second predicted text, and to input the third mask text vector into the mask language model prediction head, which outputs a third predicted text; and a first adjusting unit configured to adjust the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, the third predicted text, and the training data pairs.
  9. A data processing apparatus, wherein a first encoder and a second encoder are implemented based on a model obtained by training according to any one of claims 1 to 6, the apparatus comprising: a language attribution unit configured to determine the language attribution type of data to be retrieved; a first transmission unit configured to transmit the data to be retrieved to the first encoder in response to the language attribution type of the data to be retrieved being a first language attribution type; a second transmission unit configured to transmit the data to be retrieved to the second encoder in response to the language attribution type of the data to be retrieved being a second language attribution type; and a retrieval unit configured to determine matching data corresponding to the data to be retrieved based on the embedding value output by the first encoder or the embedding value output by the second encoder; wherein the language attribution types processed by the first encoder and the second encoder are different.
  10. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6, or to perform the method of claim 7.
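The query-routing step of claims 7 and 9 can be sketched as follows. This is an illustrative sketch only: the CJK-range language heuristic and the toy encoder stand-ins are assumptions for demonstration, not anything specified in the patent.

```python
# Sketch of claim-7-style routing: a query is sent to the encoder whose
# language attribution type matches the query's. The detector and the
# encoder stand-ins below are hypothetical placeholders.

def language_attribution(text: str) -> str:
    """Crude detector: 'zh' if any CJK character appears, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def first_encoder(text: str) -> list:
    # Stand-in for the first (e.g. English-side) encoder: a toy embedding.
    return [ord(c) % 97 for c in text]

def second_encoder(text: str) -> list:
    # Stand-in for the second (e.g. Chinese-side) encoder.
    return [ord(c) % 101 for c in text]

def encode_query(text: str) -> list:
    """Route the data to be retrieved to the matching encoder (claims 7/9)."""
    if language_attribution(text) == "en":
        return first_encoder(text)
    return second_encoder(text)
```

Because both encoders are trained to embed semantically equivalent texts nearby, the embedding returned here can be matched against a search library of either language without translating the query first.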

Description

Model training method and device, data processing method and device, and electronic equipment

Technical Field

The disclosure relates to the technical field of large models, and in particular to a model training method, a model training device, a data processing method, a data processing device, and electronic equipment.

Background

In the related art, most retrieval models are trained on a corpus of a single language (or language attribution). At inference time, a retrieval text of one language attribution type is generally used to search for similar texts in a database of the same language attribution, so the performance of cross-language text retrieval is poor.

Disclosure of Invention

The disclosure provides a model training method, a model training device, a data processing method, a data processing device, and electronic equipment, so as to at least solve the above technical problems in the prior art.

According to a first aspect of the present disclosure, there is provided a model training method, the model comprising a first encoder, a second encoder, a cross decoder, and a mask language model prediction head, the method comprising: determining a first training data set comprising a plurality of training data pairs; processing training data pairs of the first training data set based on a first mask probability to obtain a first mask text and a second mask text with different language attribution types; inputting the first mask text into the first encoder to obtain a first encoding vector and a first mask text vector; inputting the second mask text into the second encoder to obtain a second encoding vector and a second mask text vector; processing training data pairs in the first training data set based on a second mask probability to obtain a third mask text and a fourth mask text with different language attribution types; inputting one of the first encoding vector and the second encoding vector, and one of the third mask text and the fourth mask text, into the cross decoder to obtain a third mask text vector; inputting the first mask text vector into the mask language model prediction head to obtain a first predicted text; inputting the second mask text vector into the mask language model prediction head to obtain a second predicted text; inputting the third mask text vector into the mask language model prediction head, the output of which is a third predicted text; and adjusting the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, the third predicted text, and the training data pairs.

In the above scheme, the first mask probability is less than the second mask probability; the language attribution type of text the first encoder can process is different from that of text the second encoder can process; the language attribution type of the first mask text is the same as that of the text the first encoder can process; the language attribution type of the second mask text is the same as that of the text the second encoder can process; and each training data pair comprises two training texts with different language attributions and the same semantics.
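The two-stage masking described above (a lighter first mask probability for the encoder inputs and a heavier second mask probability for the cross-decoder inputs) can be sketched as follows. The specific probabilities, the `[MASK]` token, and the toy sentence pair are illustrative assumptions, not values taken from the disclosure.

```python
import random

MASK = "[MASK]"
FIRST_MASK_PROB = 0.15   # lighter masking for encoder inputs (claim 2: first < second)
SECOND_MASK_PROB = 0.5   # heavier masking for cross-decoder inputs

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

rng = random.Random(0)
# One training data pair: same semantics, different language attributions.
pair_en = "the cat sat on the mat".split()
pair_zh = list("猫坐在垫子上")

first_mask_text = mask_tokens(pair_en, FIRST_MASK_PROB, rng)    # -> first encoder
second_mask_text = mask_tokens(pair_zh, FIRST_MASK_PROB, rng)   # -> second encoder
third_mask_text = mask_tokens(pair_en, SECOND_MASK_PROB, rng)   # -> cross decoder
fourth_mask_text = mask_tokens(pair_zh, SECOND_MASK_PROB, rng)  # -> cross decoder
```

The lightly masked texts give each encoder a near-complete view of its own language, while the heavily masked texts force the cross decoder to reconstruct one language largely from the other language's encoding vector, which is what drives the cross-lingual alignment.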
In the above scheme, inputting one of the first encoding vector and the second encoding vector, and one of the third mask text and the fourth mask text, into the cross decoder to obtain the third mask text vector comprises: splicing the first encoding vector and the third mask text to obtain a first spliced text; splicing the second encoding vector and the fourth mask text to obtain a second spliced text; exchanging or not exchanging the encoding-vector parts of the first spliced text and the second spliced text, and exchanging or not exchanging the mask-text parts of the first spliced text and the second spliced text, to obtain input texts for the cross decoder; and inputting the input texts into the cross decoder to obtain the third mask text vector.

In the above solution, adjusting the weights of the second encoder and the mask language model prediction head based on the first predicted text, the second predicted text, and the training data pair includes: determining a first loss based on the first predicted text and a first training text in the training data pair having the same language attribution type as the first predicted text, wherein the first predicted text comprises predictions of the text in the first training text that is masked based on the first mask probability; determining a second loss based on the second predicted text and a second training text in the training data pair having the same language attribution type as the second predicted text, wherein the second predicted text comprises predictions of the text in the second training text that is masked based on the first mask probability
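The splicing-and-exchange construction of the cross-decoder input described above can be sketched as follows. Plain lists stand in for real encoding vectors and token sequences, and the function name is a hypothetical label; this is a minimal sketch of the combinatorial step only, not of the decoder itself.

```python
# Sketch of the claim-3 splicing step: each heavily masked text is concatenated
# with an encoding vector, and the vector parts and/or mask-text parts may be
# exchanged between the two language sides before entering the cross decoder.

def build_cross_decoder_inputs(first_vec, second_vec, third_masked, fourth_masked,
                               exchange_vectors=False, exchange_texts=False):
    """Return the two spliced inputs for the cross decoder."""
    v1, v2 = (second_vec, first_vec) if exchange_vectors else (first_vec, second_vec)
    t1, t2 = (fourth_masked, third_masked) if exchange_texts else (third_masked, fourth_masked)
    first_spliced = list(v1) + list(t1)
    second_spliced = list(v2) + list(t2)
    return first_spliced, second_spliced
```

Exchanging the parts pairs an encoding vector of one language with masked text of the other, so the decoder must recover the masked tokens from cross-lingual context.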