CN-115935190-B - Training set acquisition method and device for a semantic similarity model, and computer device
Abstract
The application provides a training set acquisition method, device, and computer device for a semantic similarity model. The method comprises: constructing a plurality of similar training samples according to each similar text group of an original corpus; for each first natural language text, determining, according to the similar text group to which it belongs, each target language text in the original corpus that is dissimilar to it; respectively calculating the literal similarity between the first natural language text and each target language text; constructing M dissimilar training samples based on the M target language texts with the highest literal similarity; and taking each similar training sample and each dissimilar training sample as a training set for acquiring the semantic similarity model. The scheme of the application can improve the generalization and detection accuracy of the semantic similarity model.
Inventors
- Deng Jiayang
- Lin Jialiang
Assignees
- 唯品会(广州)软件有限公司 (Vipshop (Guangzhou) Software Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-12-29
Claims (7)
- 1. A training set acquisition method for a semantic similarity model, the method comprising: constructing a plurality of similar training samples according to each similar text group of an original corpus, wherein each similar text group comprises a plurality of natural language texts that are pairwise semantically similar, and each similar training sample comprises a first natural language text and a second natural language text belonging to the same similar text group; for each first natural language text, determining, according to the similar text group to which the first natural language text belongs, each target language text in the original corpus that is dissimilar to the first natural language text, respectively calculating the literal similarity between the first natural language text and each target language text, and constructing M dissimilar training samples based on the M target language texts with the highest literal similarity, wherein M is a preset positive integer; acquiring a preset number of dissimilar training samples, wherein the number of dissimilar training samples is determined according to a preset sample proportion; for each first natural language text, randomly selecting N-M target language texts from the target language texts corresponding to the first natural language text, and constructing N-M dissimilar training samples based on the randomly selected N-M target language texts, wherein N is the number of dissimilar training samples and N is larger than M; and taking each similar training sample and each dissimilar training sample as a training set for acquiring the semantic similarity model; wherein the step of randomly selecting N-M target language texts from the target language texts corresponding to the first natural language text comprises: randomly selecting a plurality of target language texts from the target language texts corresponding to the first natural language text; and de-duplicating the randomly selected target language texts against the M target language texts with the highest literal similarity to obtain the N-M target language texts.
- 2. The training set acquisition method for a semantic similarity model according to claim 1, wherein the step of determining each target language text in the original corpus that is semantically dissimilar to the first natural language text according to the similar text group to which the first natural language text belongs comprises: taking each natural language text in the original corpus that belongs to a different similar text group from the first natural language text as a target language text that is semantically dissimilar to the first natural language text.
- 3. The training set acquisition method for a semantic similarity model according to claim 1 or 2, wherein the step of constructing a plurality of similar training samples according to each similar text group of the original corpus comprises: for each similar text group, combining the natural language texts belonging to that group in pairs to obtain a plurality of similar training samples.
- 4. The training set acquisition method for a semantic similarity model according to claim 1 or 2, wherein the step of respectively calculating the literal similarity between the first natural language text and each of the target language texts comprises: respectively calculating the edit distance between the first natural language text and each target language text, wherein the edit distance reflects the literal similarity.
- 5. A training set acquisition device for a semantic similarity model, the device comprising: a similar training sample construction module, configured to construct a plurality of similar training samples according to each similar text group of an original corpus, wherein each similar text group comprises a plurality of natural language texts that are pairwise semantically similar, and each similar training sample comprises a first natural language text and a second natural language text belonging to the same similar text group; a first dissimilar training sample construction module, configured to determine, according to the similar text group to which the first natural language text belongs, each target language text in the original corpus that is semantically dissimilar to the first natural language text, respectively calculate the literal similarity between the first natural language text and each target language text, and construct M dissimilar training samples based on the M target language texts with the highest literal similarity, wherein M is a preset positive integer; a number acquisition module, configured to acquire a preset number of dissimilar training samples, wherein the number of dissimilar training samples is determined according to a preset sample proportion; a second dissimilar training sample construction module, configured to, for each first natural language text, randomly select N-M target language texts from the target language texts corresponding to the first natural language text, and construct N-M dissimilar training samples based on the randomly selected N-M target language texts, wherein N is the number of dissimilar training samples and N is larger than M; and a training set acquisition module, configured to take each similar training sample and each dissimilar training sample as a training set for acquiring the semantic similarity model; wherein the second dissimilar training sample construction module comprises: a random selection unit, configured to randomly select a plurality of target language texts from the target language texts corresponding to the first natural language text; and a de-duplication unit, configured to de-duplicate the randomly selected target language texts against the M target language texts with the highest literal similarity to obtain the N-M target language texts.
- 6. A storage medium having computer readable instructions stored therein which, when executed by one or more processors, cause the one or more processors to perform the steps of the training set acquisition method for a semantic similarity model according to any one of claims 1 to 4.
- 7. A computer device, comprising one or more processors and a memory, wherein the memory stores computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the training set acquisition method for a semantic similarity model according to any one of claims 1 to 4.
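The sampling pipeline in the claims above — pairwise positives within each group, M hard negatives by literal similarity, then N-M random negatives de-duplicated against the hard ones — can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_training_set`, the tuple output format, and the use of `difflib`'s ratio as a stand-in literal-similarity measure (the embodiments use edit distance) are all assumptions.

```python
import itertools
import random
from difflib import SequenceMatcher

def build_training_set(groups, m, n, seed=0):
    """Sketch of the claimed pipeline.

    groups: list of similar-text groups; texts in the same group are
            semantically similar, texts in different groups are not.
    m, n:   per-text hard-negative count and total negative count (m < n).
    Returns (text_a, text_b, label) tuples, label 1 = similar, 0 = dissimilar.
    """
    rng = random.Random(seed)
    similar, dissimilar = [], []
    # Similar samples: combine texts within each group in pairs (claim 3).
    for group in groups:
        similar.extend((a, b, 1) for a, b in itertools.combinations(group, 2))
    for gi, group in enumerate(groups):
        # Target texts: every text from a different group (claim 2).
        targets = [t for gj, g in enumerate(groups) if gj != gi for t in g]
        for first in group:
            # M hard negatives: highest literal similarity to `first`
            # (stand-in metric: difflib ratio; edit distance in claim 4).
            ranked = sorted(targets,
                            key=lambda t: SequenceMatcher(None, first, t).ratio(),
                            reverse=True)
            hard = ranked[:m]
            dissimilar.extend((first, t, 0) for t in hard)
            # N-M random negatives, de-duplicated against the hard negatives.
            pool = [t for t in targets if t not in hard]
            rng.shuffle(pool)
            dissimilar.extend((first, t, 0) for t in pool[:n - m])
    return similar + dissimilar
```

Mixing similarity-ranked hard negatives with random negatives is what gives the claimed balance: the hard ones teach the model that literal overlap does not imply semantic similarity, while the random ones keep the negative distribution broad.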
Description
Training set acquisition method and device of semantic similarity model and computer equipment

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a training set acquisition method, a device, a storage medium, and a computer device for a semantic similarity model.

Background

A semantic similarity model detects whether a plurality of differently expressed natural language texts convey similar meanings. To acquire such a model, a corresponding training set needs to be constructed in advance. The training set comprises a plurality of groups of semantic training data, each group comprising two differently expressed natural language texts and a labeling result indicating whether their semantics are the same. After the training set is constructed, it can be used to train an initial model to obtain the semantic similarity model. In practical applications, however, current semantic similarity models suffer from low detection accuracy.

Disclosure of Invention

The present application aims to overcome at least one of the above technical drawbacks, in particular the low detection accuracy of the prior art.
In a first aspect, the present application provides a training set acquisition method for a semantic similarity model, the method comprising: constructing a plurality of similar training samples according to each similar text group of an original corpus, wherein each similar text group comprises a plurality of natural language texts that are pairwise semantically similar, and each similar training sample comprises a first natural language text and a second natural language text belonging to the same similar text group; for each first natural language text, determining, according to the similar text group to which the first natural language text belongs, each target language text in the original corpus that is dissimilar to the first natural language text, respectively calculating the literal similarity between the first natural language text and each target language text, and constructing M dissimilar training samples based on the M target language texts with the highest literal similarity, wherein M is a preset positive integer; and taking each similar training sample and each dissimilar training sample as a training set for acquiring a semantic similarity model.

In one embodiment, before the step of taking each of the similar training samples and each of the dissimilar training samples as a training set for acquiring a semantic similarity model, the method further comprises: acquiring a preset number of dissimilar training samples, wherein the number of dissimilar training samples is determined according to a preset sample proportion; and, for each first natural language text, randomly selecting (N-M) target language texts from the target language texts corresponding to the first natural language text, and constructing (N-M) dissimilar training samples based on the randomly selected (N-M) target language texts, wherein N is the number of dissimilar training samples and N is larger than M.
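The "literal similarity" used to rank target texts is, in one embodiment below, reflected by the edit distance between two texts. A minimal dynamic-programming sketch is shown here; the function name and signature are illustrative, not taken from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance: the minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`.
    A smaller distance corresponds to a higher literal similarity."""
    # Row-by-row DP over the standard (len(a)+1) x (len(b)+1) table,
    # keeping only the previous row to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

Ranking target texts by ascending edit distance and taking the first M yields the hard negatives of the first aspect: texts that look most like the first natural language text on the surface while belonging to a different similar text group.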
In one embodiment, the step of randomly selecting (N-M) target language texts from the target language texts corresponding to the first natural language text comprises: randomly selecting a plurality of target language texts from the target language texts corresponding to the first natural language text; and de-duplicating the randomly selected target language texts against the M target language texts with the highest literal similarity to obtain the (N-M) target language texts.

In one embodiment, the step of determining, in the original corpus, each target language text that is semantically dissimilar to the first natural language text according to the similar text group to which the first natural language text belongs comprises: taking each natural language text in the original corpus that belongs to a different similar text group from the first natural language text as a target language text that is semantically dissimilar to the first natural language text.

In one embodiment, the step of constructing a plurality of similar training samples according to each similar text group of the original corpus comprises: for each similar text group, combining the natural language texts belonging to that group in pairs to obtain a plurality of similar training samples.

In one embodiment, the step of respectively calculating the literal similarity between the first natural language text and each of the target language texts comprises: respectively calculating the edit distance between the first natural language text and each target language text, wherein the edit distance reflects the literal similarity.

In a second aspect, the present applic