CN-121985200-A - Virtual movie generation method based on VR large space technology and related equipment thereof

CN121985200A

Abstract

The application belongs to the technical field of video processing and relates to a virtual movie generation method based on VR large-space technology and related equipment thereof. The method extracts key action frames through encoding, extracts all description verbs from the video description text, generates paired visual-encoding-result and description-verb combination tokens according to the temporal correspondence between the key action frames and the description verbs, screens out a target number of retained combination tokens, generates key frames only for the most important parts of the video, and generates the target virtual movie. This guarantees a multi-dimensional realistic experience of the virtual movie while devoting more encoding and decoding resources to key-action-frame processing and reducing the processing of non-key action frames, thereby improving the generation efficiency of the target virtual movie.

Inventors

  • LIAO XINGGUO
  • CHEN ZHE
  • LU TINGTING
  • CHEN ZHONG
  • LI JINSONG

Assignees

  • 都市圈(上海)信息科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-03

Claims (10)

  1. A virtual movie generation method based on VR large-space technology, characterized by comprising the following steps: obtaining virtual movie generation materials, wherein the movie generation materials comprise a source video, a video description text and a speech synthesis package; performing key-action-frame encoding extraction on the source video to obtain visual encoding results corresponding to each key action frame; performing part-of-speech classification extraction on the video description text to obtain all description verbs contained in the video description text; generating paired visual-encoding-result and description-verb combination tokens according to the temporal correspondence between the key action frames and the description verbs; inputting the paired combination tokens into a preset selection model and screening out a target number of retained combination tokens; extracting all description verbs in the retained combination tokens, inputting those description verbs and the speech synthesis package into a preset speech encoding component, and generating a text-speech encoding result; obtaining a target visual encoding result according to all description verbs in the retained combination tokens; performing multi-modal encoding fusion on the text-speech encoding result and the target visual encoding result to obtain a multi-modal encoding fusion result; and inputting the multi-modal encoding fusion result into a preset decoding component and decoding it to generate the video stream contained in the target virtual movie.
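The overall pipeline of claim 1 can be sketched as a sequence of pluggable stages. This is an illustrative skeleton only: every function name, field name, and data shape below is an assumption for exposition, not an identifier from the patent.

```python
from dataclasses import dataclass

@dataclass
class GenerationMaterials:
    source_video: str           # path to the source video
    description_text: str       # video description text
    speech_synthesis_pack: str  # path to the speech-synthesis package

def generate_virtual_movie(materials: GenerationMaterials,
                           extract_key_frames, extract_verbs,
                           pair_tokens, select_tokens,
                           encode_speech, fetch_visual, fuse, decode):
    """Run the claim-1 stages in order; each stage is injected as a callable."""
    # 1. Key-action-frame encoding of the source video.
    visual_encodings = extract_key_frames(materials.source_video)
    # 2. Part-of-speech extraction of description verbs.
    verbs = extract_verbs(materials.description_text)
    # 3. Pair visual encodings with verbs by temporal correspondence.
    paired = pair_tokens(visual_encodings, verbs)
    # 4. Screen out a target number of retained combination tokens.
    retained = select_tokens(paired)
    # 5. Text-speech encoding of the retained verbs.
    tts_encoding = encode_speech(retained, materials.speech_synthesis_pack)
    # 6. Target visual encoding for the retained verbs.
    target_visual = fetch_visual(retained)
    # 7. Multi-modal fusion, then 8. decoding into the output video stream.
    return decode(fuse(tts_encoding, target_visual))
```

Injecting the stages as callables keeps each claimed step (encoding, selection, fusion, decoding) independently replaceable, which mirrors the "preset component/model" wording of the claim.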
  2. The virtual movie generation method based on VR large-space technology as set forth in claim 1, wherein the step of performing key-action-frame encoding extraction on the source video to obtain visual encoding results corresponding to each key action frame specifically includes: cutting the source video into frames according to a preset cutting frame rate; comparing all resulting adjacent frame images one by one and identifying all key action frames; and inputting the identified key action frames, in frame order, into a preset visual dynamic-feature encoding component based on the SlowFast model to obtain the visual encoding results corresponding to each key action frame.
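The adjacent-frame comparison step of claim 2 could be realized with a simple pixel-difference heuristic, sketched below. This covers only the comparison step, not the SlowFast encoding; the grayscale-array input and the threshold value are illustrative assumptions.

```python
import numpy as np

def detect_key_action_frames(frames, diff_threshold=10.0):
    """Mark frame i as a key action frame when it differs strongly from
    frame i-1, measured by mean absolute pixel difference.

    frames: list of 2-D grayscale numpy arrays, in frame order.
    Returns the indices of the detected key action frames."""
    key_indices = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32) -
                      frames[i - 1].astype(np.float32)).mean()
        if diff > diff_threshold:
            key_indices.append(i)
    return key_indices
```

A production system would likely use the claim-3 object-detection route instead, but a differencing pass like this is a cheap first filter before the heavier SlowFast encoding.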
  3. The virtual movie generation method based on VR large-space technology as set forth in claim 2, wherein the step of comparing all adjacent frame images one by one to identify all key action frames comprises: inputting all adjacent frame images obtained by the frame cutting into a preset image object detection model; extracting image features of those adjacent frame images with the image object detection model; obtaining all static image objects and dynamically changing image objects output by the image object detection model according to the feature extraction result; and identifying all key action frames based on the dynamically changing image objects.
  4. The virtual movie generation method based on VR large-space technology as set forth in claim 1, wherein the step of generating paired visual-encoding-result and description-verb combination tokens according to the temporal correspondence between the key action frames and the description verbs specifically includes: obtaining the playback timestamp of each key action frame in the source video; extracting the marked playback interval pre-annotated for each description verb in the video description text; determining, from the timestamps and the marked playback intervals, the key action frames contained in each description verb's marked playback interval; sorting the key action frames contained in each description verb's marked playback interval by frame order to obtain the key-action-frame sequence corresponding to each description verb; taking each description verb as a token text identifier, taking the visual encoding results of the key action frames in its key-action-frame sequence as encoding sub-segments, and splicing the encoding sub-segments in frame order to obtain the spliced encoding sub-segments corresponding to each description verb; and generating the paired visual-encoding-result and description-verb combination tokens from the spliced encoding sub-segments and the corresponding token text identifiers.
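The interval-membership pairing of claim 4 can be sketched as follows. The representation of frames as (timestamp, encoding) pairs and of verb annotations as (start, end) intervals is an assumption for illustration.

```python
def build_combination_tokens(key_frames, verb_intervals):
    """Pair each description verb with the encoding sub-segments of the
    key action frames falling inside its marked playback interval.

    key_frames: list of (timestamp, encoding) pairs.
    verb_intervals: dict mapping verb -> (start, end) marked interval.
    Returns dict mapping verb -> encoding sub-segments spliced in frame
    order (frame order follows timestamp order)."""
    tokens = {}
    for verb, (start, end) in verb_intervals.items():
        segments = [enc for ts, enc in sorted(key_frames)
                    if start <= ts <= end]
        if segments:  # verbs with no matching key frames yield no token
            tokens[verb] = segments
    return tokens
```

Each dict entry corresponds to one combination token: the verb is the token text identifier and the value is its spliced encoding sub-segments.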
  5. The virtual movie generation method based on VR large-space technology as set forth in claim 4, wherein the step of inputting the paired combination tokens into a preset selection model and screening out a target number of retained combination tokens specifically includes: identifying the selection dimensions provided by the selection model; according to the selection dimensions, extracting query-dimension representations from the description verbs and key-dimension representations from the visual encoding results; using the query-dimension description verbs and the key-dimension visual encoding results as attention parameters and performing attention computation with a preset attention mechanism to obtain the attention weight of each description verb for each visual encoding result; and, based on those attention weights, screening out the target number of retained combination tokens in order of attention weight from high to low.
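The query/key attention screening of claim 5 might look like the sketch below, where each verb embedding acts as a query and its paired visual encoding as a key. The embedding dimensions, the scaled-dot-product scoring, and the softmax normalization are all illustrative assumptions about the "preset attention mechanism".

```python
import numpy as np

def select_retained_tokens(verb_queries, visual_keys, target_pairs):
    """verb_queries: (n, d) array of query-dimension verb embeddings.
    visual_keys: (n, d) array of key-dimension visual encodings, row i
    paired with verb i. Returns the indices of the retained pairs,
    highest attention weight first."""
    d = verb_queries.shape[1]
    # Scaled dot-product score of each aligned (query, key) pair.
    scores = (verb_queries * visual_keys).sum(axis=1) / np.sqrt(d)
    # Softmax over pairs (numerically stabilized).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Retain the target number of pairs, from high weight to low.
    return list(np.argsort(-weights)[:target_pairs])
```

Because softmax is monotonic, the ranking is the same as ranking by raw score; the normalized weights are kept here only to match the claim's "attention weight" wording.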
  6. The virtual movie generation method based on VR large-space technology as set forth in claim 5, wherein after the step of obtaining a target visual encoding result according to all description verbs in the retained combination tokens, the method further comprises: generating haptic action instructions for all description verbs in the retained combination tokens with a preset haptic instruction generation component; fusing the haptic action instructions and the target visual encoding result to obtain a visual encoding fusion result containing the haptic action instructions, specifically by obtaining the spliced encoding sub-segments corresponding to each description verb, splicing the haptic action instruction corresponding to each description verb as an independent prefix or suffix onto the corresponding spliced encoding sub-segments, and generating the visual encoding fusion result containing the haptic action instructions; and replacing the target visual encoding result with the visual encoding fusion result containing the haptic action instructions.
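The prefix/suffix splicing of claim 6 reduces to a per-verb list concatenation, sketched below. The string form of the haptic instruction is a hypothetical placeholder; a real system would emit structured actuator commands.

```python
def fuse_haptic_instructions(token_segments, haptic_instructions,
                             as_prefix=True):
    """Splice each verb's haptic action instruction onto its spliced
    encoding sub-segments as an independent prefix or suffix.

    token_segments: dict verb -> list of encoding sub-segments.
    haptic_instructions: dict verb -> haptic instruction token."""
    fused = {}
    for verb, segments in token_segments.items():
        instruction = haptic_instructions.get(verb)
        if instruction is None:
            fused[verb] = list(segments)       # no haptic cue for this verb
        elif as_prefix:
            fused[verb] = [instruction] + list(segments)
        else:
            fused[verb] = list(segments) + [instruction]
    return fused
```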
  7. The virtual movie generation method based on VR large-space technology as set forth in claim 1, wherein the step of performing multi-modal encoding fusion on the text-speech encoding result and the target visual encoding result to obtain a multi-modal encoding fusion result specifically includes: recognizing all description verbs in the retained combination tokens corresponding to the text-speech encoding result as a first recognition result; recognizing all description verbs in the retained combination tokens corresponding to the target visual encoding result as a second recognition result; and performing multi-modal encoding fusion on the text-speech encoding result and the target visual encoding result according to the correspondence between the description verbs in the first and second recognition results and the order of the description verbs in the video description text.
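The verb-aligned fusion of claim 7 can be sketched as matching the two modalities on their shared description verb and emitting the pairs in description-text order. Representing each modality as a verb-keyed dict is an assumption for illustration.

```python
def multimodal_fuse(tts_by_verb, visual_by_verb, verb_order):
    """Align text-speech and visual encodings by their shared description
    verb, ordered by the verb's position in the video description text.

    tts_by_verb / visual_by_verb: dict verb -> encoding for that modality.
    verb_order: description verbs in their order of appearance.
    Returns a list of (verb, tts_encoding, visual_encoding) triples."""
    fused = []
    for verb in verb_order:
        # Fuse only verbs present in both recognition results.
        if verb in tts_by_verb and verb in visual_by_verb:
            fused.append((verb, tts_by_verb[verb], visual_by_verb[verb]))
    return fused
```

The resulting ordered triples are what the claim's preset decoding component would consume to render the final video stream.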
  8. A virtual movie generation device based on VR large-space technology, characterized by comprising: a movie generation material obtaining module, configured to obtain virtual movie generation materials, the movie generation materials comprising a source video, a video description text and a speech synthesis package; a key-action-frame encoding module, configured to perform key-action-frame encoding extraction on the source video to obtain visual encoding results corresponding to each key action frame; a description verb extraction module, configured to perform part-of-speech classification extraction on the video description text to obtain all description verbs contained in it; a paired combination token generation module, configured to generate paired visual-encoding-result and description-verb combination tokens according to the temporal correspondence between the key action frames and the description verbs; a retained combination token screening module, configured to input the paired combination tokens into a preset selection model and screen out a target number of retained combination tokens; a text-speech encoding module, configured to extract all description verbs in the retained combination tokens, input those description verbs and the speech synthesis package into a preset speech encoding component, and generate a text-speech encoding result; a target visual encoding obtaining module, configured to obtain a target visual encoding result according to all description verbs in the retained combination tokens; a multi-modal encoding fusion module, configured to perform multi-modal encoding fusion on the text-speech encoding result and the target visual encoding result to obtain a multi-modal encoding fusion result; and an encoding fusion result decoding module, configured to input the multi-modal encoding fusion result into a preset decoding component and decode it to generate the video stream contained in the target virtual movie.
  9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, implement the steps of the virtual movie generation method based on VR large-space technology of any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the virtual movie generation method based on VR large-space technology as claimed in any one of claims 1 to 7.

Description

Virtual movie generation method based on VR large-space technology and related equipment thereof

Technical Field

The application relates to the technical field of video processing, and in particular to a virtual movie generation method based on VR large-space technology and related equipment thereof, applied to the simulated generation of experience videos such as virtual promotional movies and virtual driving-collision experiences.

Background

Virtual movie generation based on VR (Virtual Reality) large-space technology has wide application prospects in 4D and 5D movie production, virtual driving, flight, and similar fields. For example, in a virtual driving experience scenario, multiple segments of filmed road footage are re-created into a complete simulated road section, which a driver can then use for simulated driving. Likewise, in a virtual flight scenario, a flight-turbulence experience is created by combining aircraft acceleration, deceleration, and atmospheric-pressure changes: the filmed in-flight footage is re-created into a complete simulated flight channel used, for example, for pilot training.

In the prior art, virtual video generation based on VR large-space technology still relies on conventional whole-movie feature encoding, decoding, and feature fusion, which consumes considerable computing resources. Some improved approaches generate virtual video by simply scaling and adjusting the source video, but they cannot focus on the important frames to be generated. Therefore, current virtual movie generation based on VR large-space technology suffers both from high computing-resource consumption and from an inability to focus on important generated frames.
Disclosure of Invention

The embodiments of the application aim to provide a virtual movie generation method based on VR large-space technology and related equipment thereof, so as to solve the current problems of high computing-resource consumption and the inability to focus on important generated frames in virtual movie generation based on VR large-space technology.

In a first aspect, an embodiment of the application provides a virtual movie generation method based on VR large-space technology, adopting the following technical scheme. The method comprises the following steps: obtaining virtual movie generation materials, wherein the movie generation materials comprise a source video, a video description text and a speech synthesis package; performing key-action-frame encoding extraction on the source video to obtain visual encoding results corresponding to each key action frame; performing part-of-speech classification extraction on the video description text to obtain all description verbs contained in the video description text; generating paired visual-encoding-result and description-verb combination tokens according to the temporal correspondence between the key action frames and the description verbs; inputting the paired combination tokens into a preset selection model and screening out a target number of retained combination tokens; extracting all description verbs in the retained combination tokens, inputting those description verbs and the speech synthesis package into a preset speech encoding component, and generating a text-speech encoding result; obtaining a target visual encoding result according to all description verbs in the retained combination tokens; performing multi-modal encoding fusion on the text-speech encoding result and the target visual encoding result to obtain a multi-modal encoding fusion result; and inputting the multi-modal encoding fusion result into a preset decoding component and decoding it to generate the video stream contained in the target virtual movie.

In a second aspect, an embodiment of the application further provides a virtual movie generation device based on VR large-space technology, adopting the following technical scheme. The device includes: a movie generation material obtaining module, configured to obtain virtual movie generation materials, the movie generation materials comprising a source video, a video description text and a speech synthesis package; a key-action-frame encoding module, configured to perform key-action-frame encoding extraction on the source video to obtain visual encoding results corresponding to each key action frame; a description v