
CN-122027870-A - Intelligent video synthesis method oriented to multi-modal content conversion

CN122027870A

Abstract

The invention provides an intelligent video synthesis method oriented to multi-modal content conversion. The method comprises: acquiring video demand data and performing semantic analysis on it to generate a video intelligent synthesis data set; performing multi-path material retrieval based on the video intelligent synthesis data set, pruning synchronously in combination with a preset dynamic pruning strategy, and generating a video synthesis material candidate set and a key frame demand set; generating intelligent synthesis video clips based on the key frame demand set and performing quality screening on them to generate a candidate video synthesis clip set; performing beam-search-guided intelligent synthesis on the candidate video synthesis clip set and the video synthesis material candidate set to generate a candidate intelligent synthesis video set; and performing quality evaluation on the candidate intelligent synthesis video set based on a preset video quality assessment mechanism, performing secondary synthesis according to the assessment result, and generating an optimal synthesized video. The method thereby improves the efficiency, accuracy and reliability of automatic video production.

Inventors

  • WANG GUAN

Assignees

  • 维迈科技股份有限公司 (Weimai Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-04-07

Claims (10)

  1. An intelligent video synthesis method oriented to multi-modal content conversion, characterized by comprising the following steps: acquiring video demand data, and performing uncertainty-aware semantic analysis on the video demand data to generate a video intelligent synthesis data set; performing multi-path material retrieval based on the video intelligent synthesis data set, pruning synchronously in combination with a preset dynamic pruning strategy, and generating a video synthesis material candidate set and a key frame demand set; generating intelligent synthesis video clips based on the key frame demand set, and performing quality screening on the intelligent synthesis video clips to generate a candidate video synthesis clip set; performing beam-search-guided intelligent video synthesis according to the candidate video synthesis clip set and the video synthesis material candidate set to generate a candidate intelligent synthesis video set; and performing quality evaluation on the candidate intelligent synthesis video set based on a preset video quality evaluation mechanism, performing secondary synthesis according to the evaluation result, and generating an optimal synthesized video.
  2. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 1, wherein acquiring video demand data and performing uncertainty-aware semantic analysis on the video demand data to generate a video intelligent synthesis data set comprises: the video demand data at least comprises original description text data and original image data; performing semantic analysis on the original description text data to obtain a semantic feature data set; extracting features from the original image data to generate visual semantic units; performing semantic matching and spatio-temporal alignment based on the semantic feature data set and the visual semantic units to obtain a semantic matching result; performing, by a preset rule checker, rule conflict sensing and confidence optimization according to the semantic matching result to generate an optimized feature set; and constructing the video intelligent synthesis data set based on the semantic feature data set, the visual semantic units and the optimized feature set.
  3. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 1, wherein performing multi-path material retrieval based on the video intelligent synthesis data set and pruning synchronously in combination with a preset dynamic pruning strategy to generate a video synthesis material candidate set and a key frame demand set comprises: parsing the video intelligent synthesis data set to obtain deterministic demand data and exploratory demand data, wherein the deterministic demand data and the exploratory demand data at least comprise corresponding semantic features and image data; acquiring a first video synthesis material candidate set based on the deterministic demand data; acquiring a second video synthesis material candidate set based on the exploratory demand data; and packaging the first video synthesis material candidate set and the second video synthesis material candidate set into the video synthesis material candidate set for output.
  4. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 3, wherein acquiring a first video synthesis material candidate set based on the deterministic demand data comprises: for the deterministic demand data, directly adopting the corresponding image data as video synthesis materials, and packaging the video synthesis materials of all the deterministic demand data into the first video synthesis material candidate set.
  5. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 3, wherein acquiring a second video synthesis material candidate set based on the exploratory demand data comprises: generating exploratory demand units based on the exploratory demand data; generating a video material query vector for each exploratory demand unit by synchronously combining its corresponding semantic features with a text-to-image model; performing multi-path video material query based on the video material query vector to obtain an original video synthesis material candidate set for the exploratory demand unit; and performing quality evaluation and dynamic pruning on the original video synthesis material candidate set according to the preset dynamic pruning strategy to construct the second video synthesis material candidate set.
  6. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 5, further comprising: defining the image frames that do not reach a preset qualification condition after the dynamic pruning as key frames, and generating the key frame demand set according to the semantic features corresponding to the key frames.
  7. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 1, wherein generating intelligent synthesis video clips based on the key frame demand set and performing quality screening on the intelligent synthesis video clips to generate a candidate video synthesis clip set comprises: generating key frames with a text-to-image diffusion model based on the key frame demand set to acquire a key frame image set; performing joint optimization on the key frame image set with a latent diffusion model to generate primary intelligent synthesis video clips; performing a consistency check on the primary intelligent synthesis video clips based on a preset fact-checking mechanism to obtain a consistency check result; if the consistency check result is factually inconsistent, fine-tuning the parameters of the key frame generation and intelligent synthesis video clip generation stages according to the consistency check result; after the parameters are fine-tuned, regenerating optimized intelligent synthesis video clips according to the key frame demand set until the consistency check result is factually consistent; and packaging all the intelligent synthesis video clips into the candidate video synthesis clip set for output.
  8. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 7, wherein the preset fact-checking mechanism comprises: the preset fact-checking mechanism is represented as a loop flow in which at least an agent A, an agent B and an agent C work cooperatively; the agent A is formed by combining a plurality of neural networks and is used for performing logical consistency analysis and obtaining a logical consistency analysis result; the agent B is used for performing root cause analysis according to the logical consistency analysis result and generating a parameter adjustment instruction set according to the root cause analysis result; and the agent C is used for performing parameter fine-tuning according to the parameter adjustment instruction set.
  9. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 1, wherein performing beam-search-guided intelligent video synthesis according to the candidate video synthesis clip set and the video synthesis material candidate set to generate a candidate intelligent synthesis video set comprises: generating a narrative logic backbone based on the video intelligent synthesis data set, and filling the first video synthesis material candidate set contained in the video synthesis material candidate set into the narrative logic backbone; running a beam search algorithm to search for video synthesis material sequences based on the candidate video synthesis clip set and the second video synthesis material candidate set contained in the video synthesis material candidate set; during the running of the beam search algorithm, performing scoring and sequence pruning in combination with a preset sequence scoring function, wherein the preset sequence scoring function at least comprises two dimensions, local matching degree and transition fluency; and after the beam search algorithm finishes running, generating a candidate video synthesis material sequence set, and generating the candidate intelligent synthesis video set based on the candidate video synthesis material sequence set.
  10. The intelligent video synthesis method oriented to multi-modal content conversion according to claim 1, wherein performing quality evaluation on the candidate intelligent synthesis video set based on a preset video quality evaluation mechanism and performing secondary synthesis according to the evaluation result to generate an optimal synthesized video comprises: performing video quality evaluation on each candidate intelligent synthesis video in the candidate intelligent synthesis video set based on a preset multi-dimensional video quality evaluation matrix to generate multi-dimensional video quality scores; performing deviation tracing according to the actual multi-dimensional video quality evaluation matrix corresponding to each candidate intelligent synthesis video to obtain a deviation tracing result; uploading the multi-dimensional video quality scores and the deviation tracing results corresponding to each candidate intelligent synthesis video to a human-computer interaction interface, where a user performs secondary adjustment according to the interface data to generate a secondary adjustment instruction; and performing secondary synthesis on the candidate intelligent synthesis video set based on the secondary adjustment instruction to generate the optimal synthesized video.
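The beam-search composition of claim 9 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the slot/candidate representation and the two scoring callables (standing in for the claimed "local matching degree" and "transition fluency" dimensions) are hypothetical placeholders.

```python
def beam_search_sequence(slots, candidates, beam_width,
                         score_local, score_transition):
    """Search for the best material sequence over narrative slots.

    At each slot, every surviving partial sequence is extended with each
    candidate, scored by local match plus transition score to the previous
    material, and only the beam_width best partial sequences are kept
    (the claim's "sequence pruning").
    """
    beams = [([], 0.0)]  # (partial sequence, cumulative score)
    for slot in slots:
        extended = []
        for seq, score in beams:
            for cand in candidates[slot]:
                s = score + score_local(cand, slot)
                if seq:  # transition fluency applies from the 2nd slot on
                    s += score_transition(seq[-1], cand)
                extended.append((seq + [cand], s))
        extended.sort(key=lambda pair: pair[1], reverse=True)
        beams = extended[:beam_width]  # prune to the beam width
    return beams[0][0]
```

With toy numeric "materials", `score_local` rewarding larger values and `score_transition` penalizing jumps, the search trades off both dimensions exactly as the claimed sequence scoring function does.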

Description

Intelligent video synthesis method oriented to multi-modal content conversion

Technical Field

The invention relates to the technical field of data processing, and in particular to an intelligent video synthesis method oriented to multi-modal content conversion.

Background

With the rapid development of artificial intelligence technology, automatic video content generation has become an important research direction, especially in fields with high requirements on timeliness and mass production, such as news, education and business presentation. Conventional automatic video synthesis methods, however, often lack the ability to perceive and handle uncertainty. Video synthesis requirements typically contain a large amount of implicit information and admit multiple interpretations, yet the prior art generally outputs a definite, but possibly erroneous, intermediate result and propagates the error downstream, so that the final product deviates from expectations. A video intelligent synthesis method is therefore needed that deeply fuses multi-modal information, perceives and quantifies uncertainty, and realizes collaborative intelligent search and generation, so as to improve the efficiency, accuracy and reliability of automatic video production.
Disclosure of Invention

In view of the above problems, according to a first aspect of the present invention, an embodiment of the present invention provides an intelligent video synthesis method oriented to multi-modal content conversion, the method comprising: acquiring video demand data, and performing uncertainty-aware semantic analysis on the video demand data to generate a video intelligent synthesis data set; performing multi-path material retrieval based on the video intelligent synthesis data set, pruning synchronously in combination with a preset dynamic pruning strategy, and generating a video synthesis material candidate set and a key frame demand set; generating intelligent synthesis video clips based on the key frame demand set, and performing quality screening on the intelligent synthesis video clips to generate a candidate video synthesis clip set; performing beam-search-guided intelligent video synthesis according to the candidate video synthesis clip set and the video synthesis material candidate set to generate a candidate intelligent synthesis video set; and performing quality evaluation on the candidate intelligent synthesis video set based on a preset video quality evaluation mechanism, performing secondary synthesis according to the evaluation result, and generating an optimal synthesized video.
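The five-stage flow above can be sketched as a plain orchestration function to make the data flow between stages concrete. This is only a sketch: the patent specifies no implementation for any stage, so every callable here is a hypothetical placeholder, and the retry loop is a simplified stand-in for the claimed secondary synthesis (which, per claim 10, is actually driven by user adjustment instructions).

```python
def synthesize_video(demand_data, semantic_analyze, retrieve_and_prune,
                     generate_clips, beam_compose, evaluate, max_rounds=3):
    """Run the five claimed stages; re-run composition ("secondary
    synthesis") while the evaluator rejects the best candidate."""
    dataset = semantic_analyze(demand_data)                 # stage 1: semantic analysis
    materials, keyframe_reqs = retrieve_and_prune(dataset)  # stage 2: retrieval + pruning
    clips = generate_clips(keyframe_reqs)                   # stage 3: clip generation
    best, accepted = None, False
    for _ in range(max_rounds):                             # stages 4-5, with retry
        candidates = beam_compose(clips, materials)         # stage 4: beam-guided composition
        best, accepted = evaluate(candidates)               # stage 5: quality evaluation
        if accepted:
            break
    return best
```

Passing the stages in as callables keeps the sketch agnostic about how each one is realized, which mirrors how the claims treat the stages as independently specified steps.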
As a further aspect of the present invention, acquiring video demand data and performing uncertainty-aware semantic analysis on the video demand data to generate a video intelligent synthesis data set comprises: the video demand data at least comprises original description text data and original image data; performing semantic analysis on the original description text data to obtain a semantic feature data set; extracting features from the original image data to generate visual semantic units; performing semantic matching and spatio-temporal alignment based on the semantic feature data set and the visual semantic units to obtain a semantic matching result; performing, by a preset rule checker, rule conflict sensing and confidence optimization according to the semantic matching result to generate an optimized feature set; and constructing the video intelligent synthesis data set based on the semantic feature data set, the visual semantic units and the optimized feature set.

As a further aspect of the present invention, performing multi-path material retrieval based on the video intelligent synthesis data set and pruning synchronously in combination with a preset dynamic pruning strategy to generate a video synthesis material candidate set and a key frame demand set comprises: parsing the video intelligent synthesis data set to obtain deterministic demand data and exploratory demand data, wherein the deterministic demand data and the exploratory demand data at least comprise corresponding semantic features and image data; acquiring a first video synthesis material candidate set based on the deterministic demand data; acquiring a second video synthesis material candidate set based on the exploratory demand data; and packaging the first video synthesis material candidate set and the second video synthesis material candidate set into the video synthesis material candidate set for output.
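The deterministic/exploratory split and the dynamic pruning described above can be illustrated with a small routing function. A minimal sketch under stated assumptions: the dict-based demand-unit representation, the `retrieve` and `quality` callables, and the threshold-plus-top-k pruning rule are all hypothetical, since the patent does not fix a concrete pruning strategy.

```python
def split_and_retrieve(units, retrieve, quality, threshold, keep_top):
    """Route demand units as in the retrieval stage.

    Deterministic units (those carrying their own image) are adopted
    directly as materials (the first candidate set). Exploratory units
    trigger retrieval; retrieved candidates below the quality threshold
    are pruned and only the keep_top best survive (dynamic pruning,
    the second candidate set). Units left with no qualified candidate
    become key frame requirements, to be generated later.
    """
    first_set, second_set, keyframe_reqs = [], {}, []
    for unit in units:
        if unit.get("image") is not None:          # deterministic demand
            first_set.append(unit["image"])
            continue
        raw = retrieve(unit["semantics"])          # exploratory demand
        kept = sorted((c for c in raw if quality(c) >= threshold),
                      key=quality, reverse=True)[:keep_top]
        if kept:
            second_set[unit["id"]] = kept
        else:                                      # nothing qualified: key frame
            keyframe_reqs.append(unit["semantics"])
    return first_set, second_set, keyframe_reqs
```

The branch producing `keyframe_reqs` corresponds to claim 6: frames that fail the qualification condition after pruning are carried forward as key frame requirements rather than discarded.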
As a further aspect of the present invention, acquiring a first video synthesis material candidate set based on the deterministic demand data comprises: for the deterministic demand data, directly adopting the corresponding image data as video synthesis materials, and packaging the video synthesis materials of all the deterministic demand data into a first video com