CN-115358329-B - Text reconstruction model training method and device based on video assistance

CN 115358329 B

Abstract

The invention discloses a text reconstruction model training method and device based on video assistance. The method comprises: training a text model to be trained according to all determined training samples to obtain a trained text model, wherein, during training, the input to the feature fusion layer of the text model comprises, for any training sample, the text feature vector corresponding to the sample text in that training sample and a pre-generated video feature vector corresponding to that sample text; judging whether the trained text model has converged; and, if so, determining the trained text model to be the text reconstruction model. Because the text reconstruction model is trained with the assistance of video feature vectors, a user can quickly produce video text through the text reconstruction model without repeated manual correction, the matching degree between the generated video text and the video is improved, and the user's video-text production requirements are met.

Inventors

  • HUANG YUYAN
  • CHEN CHANGXIN

Assignees

  • 有米科技股份有限公司 (Youmi Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2022-08-25

Claims (9)

  1. A text reconstruction model training method based on video assistance, the method comprising: determining a target training sample set, wherein the target training sample set comprises a plurality of target training samples and each target training sample comprises at least a sample text; performing a model training operation on a text model to be trained according to all the target training samples to obtain a trained text model, and judging whether the trained text model has converged, wherein, when the model training operation is performed, for any target training sample the input to a feature fusion layer of the text model to be trained comprises the text feature vector corresponding to the sample text in that target training sample and a pre-generated video feature vector corresponding to that sample text; and, when the judgment result is yes, determining the trained text model to be a text reconstruction model, the text reconstruction model being used to supplement the text content of a target text material for which a text is to be generated, so as to generate a text matching the target text material; wherein performing the model training operation on the text model to be trained according to all the target training samples comprises: inputting all the target training samples into the text model to be trained, and performing a text vector conversion operation on the sample text of each target training sample through an embedding layer of the text model to be trained to obtain the text feature vector corresponding to each target training sample; for each input target training sample, performing a fusion operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector through the feature fusion layer of the text model to be trained to obtain a fused feature vector corresponding to the target training sample; and, for each input target training sample, performing predictive reconstruction on the masked vector content in the fused feature vector corresponding to the target training sample through a predictive reconstruction layer of the text model to be trained to obtain the predicted reconstructed vector content corresponding to the target training sample.
  2. The method for training a text reconstruction model based on video assistance according to claim 1, wherein performing the text vector conversion operation on the sample text of each target training sample through the embedding layer of the text model to be trained to obtain the text feature vector corresponding to each target training sample comprises: for each input target training sample, performing a word splitting operation on the sample text in the target training sample through the embedding layer of the text model to be trained to obtain all target words of the sample text, and performing a word vector conversion operation on all the target words of the sample text to obtain all word feature vectors corresponding to the target training sample; for each input target training sample, performing a stitching operation on all the word feature vectors corresponding to the target training sample to obtain all sentence feature vectors corresponding to the target training sample, and determining a pending text feature vector corresponding to the target training sample according to all the sentence feature vectors corresponding to the target training sample; and, for each input target training sample, masking, according to a preset masking parameter, the vector content matching the masking parameter in the pending text feature vector corresponding to the target training sample to obtain the text feature vector corresponding to the target training sample.
  3. The method for training a text reconstruction model based on video assistance according to claim 2, wherein, for each input target training sample, performing the fusion operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector corresponding to the target training sample through the feature fusion layer of the text model to be trained to obtain the fused feature vector corresponding to the target training sample comprises: for each input target training sample, performing a splicing operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector through the feature fusion layer of the text model to be trained to obtain a spliced feature vector corresponding to the target training sample, performing a first dimension transformation operation on the spliced feature vector corresponding to the target training sample to obtain a transformed feature vector corresponding to the target training sample, and performing a vector average operation on the transformed feature vector corresponding to the target training sample according to predetermined video feature parameters to obtain an average feature vector corresponding to the target training sample as the fused feature vector corresponding to the target training sample.
  4. The method according to claim 2 or 3, wherein, before performing, for each input target training sample, the fusion operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector corresponding to the target training sample through the feature fusion layer of the text model to be trained to obtain the fused feature vector corresponding to the target training sample, the method further comprises: acquiring the video feature vectors, pre-generated by the embedding layer, corresponding to the sample texts in each target training sample; judging, according to first dimension feature information of the text feature vectors corresponding to the sample texts in all the target training samples and second dimension feature information of the video feature vectors corresponding to the sample texts, whether the text feature vectors corresponding to all the target training samples match the video feature vectors corresponding to the sample texts; when the judgment result is negative, determining, from the video feature vectors corresponding to the sample texts in all the target training samples, all pending video feature vectors that do not match their corresponding text feature vectors, and performing a second dimension transformation operation on all the pending video feature vectors according to the first dimension feature information of the text feature vectors corresponding to all the pending video feature vectors to obtain all the transformed pending video feature vectors; and updating the video feature vectors corresponding to the sample texts in all the target training samples according to all the transformed pending video feature vectors, and then triggering, for each input target training sample, the fusion operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector through the feature fusion layer of the text model to be trained to obtain the fused feature vector corresponding to the target training sample.
  5. The method for training a text reconstruction model based on video assistance according to claim 4, wherein, for each input target training sample, performing predictive reconstruction on the masked vector content in the fused feature vector corresponding to the target training sample through the predictive reconstruction layer of the text model to be trained to obtain the predicted reconstructed vector content corresponding to the target training sample comprises: performing a vector order transformation operation on the masked vector content in the fused feature vector corresponding to each target training sample to update the masked vector content in the fused feature vector corresponding to each target training sample; and, for each target training sample, extracting semantic feature information of the target training sample according to the masked vector content in the fused feature vector corresponding to the target training sample, performing a vector order recovery operation on the masked vector content in the fused feature vector corresponding to the target training sample according to the semantic feature information of the target training sample so as to update the masked vector content again, and performing predictive reconstruction on the masked vector content in the fused feature vector corresponding to the target training sample according to the semantic feature information of the target training sample to obtain the predicted reconstructed vector content corresponding to the target training sample.
  6. The method for training a text reconstruction model based on video assistance according to claim 5, wherein judging whether the trained text model has converged comprises: obtaining a distance regression loss parameter, calculated by the predictive reconstruction layer, between the predicted reconstructed vector content corresponding to each target training sample and the corresponding pending text feature vector, and determining a target reconstruction loss value corresponding to the target training sample set according to the distance regression loss parameters corresponding to all the target training samples; judging whether the target reconstruction loss value is smaller than or equal to a preset reconstruction loss threshold; when the judgment result is yes, determining that the trained text model has converged; and, when the judgment result is negative, determining that the trained text model has not converged.
  7. A text reconstruction model training apparatus based on video assistance, for performing the text reconstruction model training method based on video assistance of any one of claims 1-6, the apparatus comprising: a determining module, configured to determine a target training sample set, wherein the target training sample set comprises a plurality of target training samples and each target training sample comprises at least a sample text; a training module, configured to perform a model training operation on a text model to be trained according to all the target training samples to obtain a trained text model, wherein, when the model training operation is performed, for any target training sample the input to a feature fusion layer of the text model to be trained comprises the text feature vector corresponding to the sample text in that target training sample and a pre-generated video feature vector corresponding to that sample text; and a judging module, configured to judge whether the trained text model has converged; wherein the determining module is further configured to determine, when the judgment result of the judging module is yes, the trained text model to be a text reconstruction model, the text reconstruction model being used to supplement the text content of a target text material for which a text is to be generated, so as to generate a text matching the target text material.
  8. A text reconstruction model training apparatus based on video assistance, the apparatus comprising: a memory storing executable program code; and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the text reconstruction model training method based on video assistance of any one of claims 1-6.
  9. A computer storage medium storing computer instructions which, when invoked, perform the text reconstruction model training method based on video assistance of any one of claims 1-6.
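Read as an algorithm, claims 1-5 describe a masked-reconstruction training step: embed the sample text and mask part of it, project the pre-generated video feature vector to the text dimension when the dimensions disagree, splice the two modalities, apply a dimension transform, and average them into a fused vector that the predictive reconstruction layer then completes. The following NumPy sketch is a minimal illustration of that forward pass only; the array shapes, the random masking ratio, the `tanh` transform, and the projection matrix are all assumptions for demonstration, not details fixed by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_and_mask(token_ids, emb_table, mask_ratio=0.15):
    """Embedding layer (claim 2): token ids -> word feature vectors,
    then mask a subset of positions. Treating the 'masking parameter'
    as a simple random ratio is an assumption."""
    vecs = emb_table[token_ids]                      # (seq_len, dim)
    n_mask = max(1, int(len(token_ids) * mask_ratio))
    masked_pos = rng.choice(len(token_ids), size=n_mask, replace=False)
    masked = vecs.copy()
    masked[masked_pos] = 0.0                         # zero out masked content
    return masked, masked_pos, vecs

def fuse(text_vecs, video_vec, proj):
    """Feature fusion layer (claims 3-4): project the video vector to the
    text dimension when shapes disagree (the 'second dimension
    transformation'), splice it onto the text vectors, apply a dimension
    transform, and average the two modalities."""
    if video_vec.shape[-1] != text_vecs.shape[-1]:
        video_vec = video_vec @ proj                 # hypothetical projection
    spliced = np.vstack([text_vecs, video_vec[None, :]])
    transformed = np.tanh(spliced)                   # stand-in for the
                                                     # first dimension transform
    # average each text row with the appended video row
    fused = (transformed[:-1] + transformed[-1]) / 2.0
    return fused

# --- illustrative run with made-up sizes ---
vocab, dim, vid_dim = 50, 8, 12
emb_table = rng.normal(size=(vocab, dim))
proj = rng.normal(size=(vid_dim, dim))

token_ids = np.array([3, 17, 42, 5, 9])
masked, masked_pos, original = embed_and_mask(token_ids, emb_table)
fused = fuse(masked, rng.normal(size=vid_dim), proj)
print(fused.shape)   # one fused vector per text position: (5, 8)
```

The predictive reconstruction layer would then be trained to recover `original` at the positions in `masked_pos` from `fused`, which is the loss-bearing step that claim 6 measures.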

Description

Text reconstruction model training method and device based on video assistance

Technical Field

The invention relates to the technical field of model training, and in particular to a text reconstruction model training method and device based on video assistance.

Background

With the rapid development of the video production industry, video production occupies an increasingly important position in the domestic advertising market and has become the most common and effective advertising means for enterprises of all kinds. Video production seldom proceeds without the design of video texts (such as video scripts): vivid and engaging video texts make the produced video more creative and thereby create a better marketing effect for enterprises. At present, video texts are mainly generated by producers editing them with fixed video-text production templates. Practice shows, however, that this generation mode, which relies on human editing, requires producers to repeatedly revise the video text according to their own production experience, so the production cycle of the video text is too long and the matching degree between the generated video text and the video is low. It is therefore particularly important to provide a method that can quickly generate video text matching a video.

Disclosure of the Invention

The technical problem to be solved by the invention is to provide a text reconstruction model training method and device based on video assistance, which help a user quickly produce video text through the text reconstruction model and help improve the matching degree between the generated video text and the video, thereby meeting the user's video-text production requirements.
In order to solve the above technical problem, a first aspect of the invention discloses a text reconstruction model training method based on video assistance, comprising: determining a target training sample set, wherein the target training sample set comprises a plurality of target training samples and each target training sample comprises at least a sample text; performing a model training operation on a text model to be trained according to all the target training samples to obtain a trained text model, and judging whether the trained text model has converged, wherein, when the model training operation is performed, for any target training sample the input to a feature fusion layer of the text model to be trained comprises the text feature vector corresponding to the sample text in that target training sample and a pre-generated video feature vector corresponding to that sample text; and, when the judgment result is yes, determining the trained text model to be a text reconstruction model, the text reconstruction model being used to supplement the text content of a target text material for which a text is to be generated, so as to generate a text matching the target text material.
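The convergence judgment in the first aspect (detailed in claim 6) amounts to aggregating a per-sample distance regression loss into a target reconstruction loss for the whole sample set and comparing it against a preset reconstruction loss threshold. A hedged sketch of that check, assuming mean squared distance as the loss metric and a simple mean as the aggregation (the patent fixes neither):

```python
import numpy as np

def reconstruction_loss(predicted, target):
    """Distance regression loss between the predicted reconstructed
    vector content and the pending (pre-mask) text feature vector.
    Mean squared distance is an assumption, not specified by the patent."""
    return float(np.mean((predicted - target) ** 2))

def has_converged(per_sample_losses, loss_threshold=0.01):
    """Claim 6: aggregate per-sample losses into the target reconstruction
    loss for the sample set and compare it with the preset threshold."""
    target_loss = sum(per_sample_losses) / len(per_sample_losses)
    return target_loss <= loss_threshold

# illustrative per-sample losses for two training samples
losses = [reconstruction_loss(np.zeros(4), np.full(4, 0.05)),
          reconstruction_loss(np.zeros(4), np.full(4, 0.1))]
print(has_converged(losses))   # True: average loss 0.00625 <= 0.01
```

If the check fails, training simply continues on the sample set until the aggregated loss drops below the threshold.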
In an optional implementation manner of the first aspect of the invention, performing the model training operation on the text model to be trained according to all the target training samples to obtain the trained text model comprises: inputting all the target training samples into the text model to be trained, and performing a text vector conversion operation on the sample text of each target training sample through an embedding layer of the text model to be trained to obtain the text feature vector corresponding to each target training sample; for each input target training sample, performing a fusion operation on the text feature vector corresponding to the target training sample and the pre-generated video feature vector through a feature fusion layer of the text model to be trained to obtain a fused feature vector corresponding to the target training sample; and, for each input target training sample, performing predictive reconstruction on the masked vector content in the fused feature vector corresponding to the target training sample through a predictive reconstruction layer of the text model to be trained to obtain the predicted reconstructed vector content corresponding to the target training sample. In an optional implementation manner of the first aspect of the invention, performing the text vector conversion operation on the sample text of each target training sample through the embedding layer of the text model to be trained to obtain the text feature vector corresponding to each target training sample comprises: for each input target training sample, performing a word splitting operation on the sample text in the target training sample through the embedding layer of the text model to be trained to obtain all target words of the sample text, and performing word vector