CN-122023618-A - Seamless brand integration method in text-to-video generation and terminal device
Abstract
The invention discloses a seamless brand integration method in text-to-video generation and a terminal device, belongs to the technical field of video generation, and addresses the problem that model training and inference in existing T2V services require a huge investment of computing resources while service providers find it difficult to recoup that investment and turn a profit. The method comprises: S1, constructing a brand knowledge base containing a plurality of brand files, receiving a text prompt, and querying the brand knowledge base for the target brand file corresponding to the text prompt; S2, generating a brand integration strategy according to the text prompt and the target brand file, and optimizing the text prompt according to the brand integration strategy to obtain an optimized prompt; S3, performing a multidimensional evaluation of the optimized prompt and determining a target prompt according to the evaluation result; and S4, generating a target video using the target prompt and the target brand file. The method is used for brand integration in text-to-video generation.
Inventors
- WU BAOYUAN
- ZHU ZIHAO
Assignees
- The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-27
Claims (10)
- 1. A method of seamless brand integration in text-to-video generation, the method comprising: S1, constructing a brand knowledge base containing a plurality of brand files, receiving a text prompt, and querying the brand knowledge base for the target brand file corresponding to the text prompt; S2, generating a brand integration strategy according to the text prompt and the target brand file, and optimizing the text prompt according to the brand integration strategy to obtain an optimized prompt; S3, performing a multidimensional evaluation of the optimized prompt, and determining a target prompt according to the evaluation result; and S4, generating a target video by using the target prompt and the target brand file.
- 2. The method of claim 1, wherein constructing a brand knowledge base containing a plurality of brand files in S1 specifically comprises: generating a plurality of test prompts according to an initial file of a brand, and generating a corresponding test video according to the initial file and each test prompt; if the proportion of test videos meeting a preset requirement among all test videos of the brand is greater than or equal to a preset ratio, storing the initial file of the brand in the brand knowledge base as the brand file of that brand; and if the proportion is smaller than the preset ratio, determining a brand adapter according to the initial file, and storing the initial file together with the brand adapter in the brand knowledge base as the brand file of that brand.
- 3. The method according to claim 2, wherein determining the brand adapter according to the initial file specifically comprises: generating a plurality of comprehensive prompts according to the initial file, wherein each comprehensive prompt comprises the brand name and a trigger token; generating a plurality of training videos according to the reference image in the initial file and the comprehensive prompts, and forming a training data set from the comprehensive prompts and the corresponding training videos; and training a text-to-video model with the training data set to obtain the brand adapter.
- 4. The method according to claim 3, wherein generating a plurality of training videos according to the reference image in the initial file and the plurality of comprehensive prompts specifically comprises: inputting the reference image in the initial file and each comprehensive prompt into an image model to generate an initial frame corresponding to that comprehensive prompt; and inputting the initial frame corresponding to each comprehensive prompt into an image-to-video model to generate the training video corresponding to that comprehensive prompt.
- 5. The method according to claim 1, wherein S3 specifically comprises: performing a four-dimensional evaluation of the optimized prompt on semantic fidelity, brand clarity, integration naturalness and generation quality to obtain an evaluation result; if the scores of all four dimensions in the evaluation result are greater than or equal to a preset threshold, taking the optimized prompt as the target prompt; and if the evaluation result contains a modification suggestion, modifying the optimized prompt according to the modification suggestion to obtain the target prompt.
- 6. The method according to claim 5, wherein modifying the optimized prompt according to the modification suggestion to obtain the target prompt specifically comprises: if the modification suggestion is a re-optimization suggestion, taking the optimized prompt as the text prompt, optimizing the text prompt according to the brand integration strategy and the re-optimization suggestion to obtain a new optimized prompt, and re-executing S3 until the target prompt is obtained; and if the modification suggestion is a strategy-regeneration suggestion, taking the optimized prompt as the text prompt and re-executing S2 and S3 until the target prompt is obtained.
- 7. The method according to claim 1, wherein S4 specifically comprises: inputting the target prompt and the target brand file into a text-to-video model to generate the target video.
- 8. The method of claim 7, wherein, prior to S4, the method further comprises: if the target brand file contains the brand adapter, loading the brand adapter into the text-to-video model.
- 9. The method according to claim 1, wherein, after S4, the method further comprises: collecting user feedback on the target video; and extracting structured experience from the feedback and storing the structured experience in the brand knowledge base.
- 10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the seamless brand integration method in text-to-video generation of any one of claims 1 to 9.
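The S1–S4 flow of claim 1, gated by the four-dimensional check of claim 5, can be illustrated with a short Python sketch. Everything below is a hypothetical stand-in, not the patented implementation: the function names, the keyword-based brand retrieval, the fixed evaluation scores, and the brand/prompt data are all assumptions for illustration.

```python
# Hypothetical sketch of the claimed S1-S4 pipeline; all names are illustrative.

def query_brand_file(knowledge_base, prompt):
    """S1: return the brand file whose keywords best match the text prompt."""
    best, best_hits = None, 0
    for brand_file in knowledge_base:
        hits = sum(1 for kw in brand_file["keywords"] if kw in prompt)
        if hits > best_hits:
            best, best_hits = brand_file, hits
    return best

def optimize_prompt(prompt, brand_file):
    """S2: fold a (here trivial) brand integration strategy into the prompt."""
    strategy = f"show the {brand_file['name']} product naturally in the scene"
    return f"{prompt}, {strategy}"

def evaluate_prompt(optimized, threshold=0.7):
    """S3: stand-in for the four-dimensional evaluation of claim 5.

    Real scores would come from an evaluator model; fixed values here.
    """
    scores = {"semantic_fidelity": 0.9, "brand_clarity": 0.8,
              "integration_naturalness": 0.85, "generation_quality": 0.75}
    passed = all(s >= threshold for s in scores.values())
    return passed, scores

def generate_video(target_prompt, brand_file):
    """S4: placeholder for the text-to-video model call."""
    return {"prompt": target_prompt, "brand": brand_file["name"]}

kb = [{"name": "AcmeCola", "keywords": ["drink", "beach", "picnic"]}]
prompt = "a family picnic on the beach with a cold drink"
bf = query_brand_file(kb, prompt)                  # S1
opt = optimize_prompt(prompt, bf)                  # S2
ok, _ = evaluate_prompt(opt)                       # S3
video = generate_video(opt, bf) if ok else None    # S4
```

In a real deployment each step would call a model (a retriever for S1, an LLM for S2/S3, the T2V model for S4); the control flow above only mirrors the claimed sequence.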
Description
Seamless brand integration method in text-to-video generation and terminal device
Technical Field
The invention relates to a seamless brand integration method in text-to-video generation and a terminal device, and belongs to the technical field of video generation.
Background
The rapid development of Text-to-Video (T2V) technology opens up unprecedented opportunities for automated content creation, particularly in advertisement production. However, existing T2V services face a serious commercialization dilemma. On the one hand, model training and inference require a huge investment of computing resources, including the purchase and maintenance of large-scale GPU clusters, the collection and processing of massive training data, and the continuous cost of model optimization and iteration. On the other hand, the lack of an effective monetization channel makes it difficult for service providers to recoup their investment and turn a profit. Traditional intrusive advertisements, such as spot advertisements and pop-up advertisements, severely disrupt the user experience and cause user churn, ultimately compromising the sustainability of the service.
Disclosure of Invention
The invention provides a seamless brand integration method in text-to-video generation and a terminal device, which can solve the problem that model training and inference in existing T2V services require a huge investment of computing resources while an effective monetization path is lacking, so that service providers find it difficult to recoup the investment and turn a profit.
In one aspect, the present invention provides a method of seamless brand integration in text-to-video generation, the method comprising: S1, constructing a brand knowledge base containing a plurality of brand files, receiving a text prompt, and querying the brand knowledge base for the target brand file corresponding to the text prompt; S2, generating a brand integration strategy according to the text prompt and the target brand file, and optimizing the text prompt according to the brand integration strategy to obtain an optimized prompt; S3, performing a multidimensional evaluation of the optimized prompt, and determining a target prompt according to the evaluation result; and S4, generating a target video by using the target prompt and the target brand file. Optionally, constructing a brand knowledge base containing a plurality of brand files in S1 specifically includes: generating a plurality of test prompts according to an initial file of a brand, and generating a corresponding test video according to the initial file and each test prompt; if the proportion of test videos meeting a preset requirement among all test videos of the brand is greater than or equal to a preset ratio, storing the initial file of the brand in the brand knowledge base as the brand file of that brand; and if the proportion is smaller than the preset ratio, determining a brand adapter according to the initial file, and storing the initial file together with the brand adapter in the brand knowledge base as the brand file of that brand.
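The knowledge-base admission test described above amounts to a pass-ratio threshold: brands the base model already renders well are stored as-is, and the rest get a dedicated adapter. A minimal sketch follows; the function names, the 0.8 threshold, the adapter stub, and the brand data are all assumptions for illustration (an actual adapter would be a fine-tuned module such as a LoRA trained as described for claim 3).

```python
# Hypothetical sketch of brand-knowledge-base construction; all names illustrative.

def train_brand_adapter(initial_file):
    """Placeholder for fine-tuning a brand adapter on the brand's data."""
    return {"adapter_for": initial_file["name"]}

def build_brand_entry(initial_file, passes, ratio_threshold=0.8):
    """`passes` holds one boolean per test video: did it meet the preset requirement?

    If the pass ratio reaches the threshold, the base model already renders the
    brand faithfully and the initial file alone is stored; otherwise a brand
    adapter is trained and stored alongside it.
    """
    pass_ratio = sum(passes) / len(passes)
    entry = {"file": initial_file}
    if pass_ratio < ratio_threshold:
        entry["adapter"] = train_brand_adapter(initial_file)
    return entry

knowledge_base = [
    # 4/5 test videos pass -> stored without an adapter
    build_brand_entry({"name": "AcmeCola"}, passes=[True, True, True, True, False]),
    # 1/4 test videos pass -> adapter trained and stored with the file
    build_brand_entry({"name": "ZetaShoes"}, passes=[True, False, False, False]),
]
```

The threshold value and the pass/fail judgment per test video are left to the deployment; the patent only fixes the comparison against a preset ratio.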
Optionally, determining the brand adapter according to the initial file specifically includes: generating a plurality of comprehensive prompts according to the initial file, wherein each comprehensive prompt comprises the brand name and a trigger token; generating a plurality of training videos according to the reference image in the initial file and the comprehensive prompts, and forming a training data set from the comprehensive prompts and the corresponding training videos; and training a text-to-video model with the training data set to obtain the brand adapter. Optionally, generating a plurality of training videos according to the reference image in the initial file and the plurality of comprehensive prompts specifically includes: inputting the reference image in the initial file and each comprehensive prompt into an image model to generate an initial frame corresponding to that comprehensive prompt; and inputting the initial frame corresponding to each comprehensive prompt into an image-to-video model to generate the training video corresponding to that comprehensive prompt. Optionally, step S3 specifically includes: performing a four-dimensional evaluation of the optimized prompt on semantic fidelity, brand clarity, integration naturalness and generation quality to obtain an evaluation result; if the scores of all four dimensions in the evaluation result are greater than or equal to a preset threshold, taking the optimized prompt as the target prompt; and if the evaluation result contains a modification suggestion, modifying the optimized prompt according to the modification suggestion to obtain