CN-120434373-B - Video generation method, apparatus, electronic device, storage medium, and program product
Abstract
Embodiments of the disclosure provide a video generation method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product. A coarse three-dimensional world of a target scene is generated as a set of three-dimensional Gaussian spheres, and multi-angle offset view angles of the same coordinate position are automatically searched for and used as new view angles. The image data of the new view angles are stitched and then restored, realizing a multi-view joint restoration mechanism that ensures geometric and semantic consistency of the image data across view angles, that is, consistency of scene structure, texture, and semantic information under different view angles, so that a continuous, consistent, high-quality three-dimensional world is constructed. The method solves the multi-view inconsistency of traditional methods, improves the image quality of new view angles under large-scale view-angle transformation, eliminates image-quality problems such as floaters, artifacts, and structural distortion, makes the generated images more stable and realistic, improves spatial consistency, and enhances the user's sense of immersion and experience.
Inventors
- Zhu Zheng
- Ni Chaojun
- Huang Guan
Assignees
- 北京极佳视界科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20250623
Claims (12)
- 1. A method of video generation, the method comprising: generating a first three-dimensional world of a target scene, wherein the first three-dimensional world is a three-dimensional Gaussian sphere set, and any three-dimensional Gaussian sphere comprises a position coordinate parameter, a covariance parameter, and a spherical harmonic coefficient; determining image data of a target view angle group based on the first three-dimensional world, wherein the target view angle group comprises multi-angle offset view angles of the same coordinate position and corresponds to observation pose data not used in the process of establishing the first three-dimensional world of the target scene; performing stitching processing on the image data of the target view angle group to obtain stitched panoramic image data; performing image restoration processing on the stitched panoramic image data to obtain restored panoramic image data; and optimizing the first three-dimensional world based on the restored panoramic image data to obtain a second three-dimensional world, so as to generate a target three-dimensional video based on the second three-dimensional world; wherein the generating a first three-dimensional world of a target scene includes: in response to receiving any scene image and scene description prompt text, extracting an image semantic feature distribution of the scene image and a text description feature distribution of the scene description prompt text; for any pixel in the scene image, determining a three-dimensional coordinate value of the pixel based on the image semantic feature distribution; creating a first three-dimensional Gaussian sphere for the pixel based on its three-dimensional coordinate value to obtain a first three-dimensional Gaussian sphere set covering all pixels of the scene image; and adjusting and optimizing Gaussian sphere parameters in the first three-dimensional Gaussian sphere set through the text description feature distribution to obtain a second Gaussian sphere set, wherein the second Gaussian sphere set forms the first three-dimensional world.
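The per-pixel Gaussian initialisation described in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the three-dimensional coordinate value of each pixel is available as a depth map, uses a hypothetical pinhole camera with focal length `f`, and initialises an isotropic covariance and a degree-0 spherical-harmonic (colour) coefficient per pixel.

```python
import numpy as np

def init_gaussians(depth, rgb, f=500.0):
    """Back-project each pixel to 3D and initialise one Gaussian per pixel.

    depth : (H, W) per-pixel depth, a stand-in for the coordinate values the
            claim derives from the image semantic feature distribution
    rgb   : (H, W, 3) colours, used as degree-0 spherical-harmonic coefficients
    f     : assumed pinhole focal length in pixels (illustrative value)
    """
    H, W = depth.shape
    cx, cy = W / 2.0, H / 2.0
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    positions = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # position coordinate parameter
    scales = np.full((H * W, 3), 0.01)                       # isotropic covariance (diagonal scales)
    sh_dc = rgb.reshape(-1, 3).astype(float)                 # degree-0 SH coefficient
    return {"pos": positions, "scale": scales, "sh": sh_dc}
```

In the patent these parameters would then be adjusted through the text description feature distribution; here they are only initialised.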
- 2. The method of claim 1, wherein the determining image data of a target view angle group based on the first three-dimensional world, the target view angle group including multi-angle offset view angles of the same coordinate position, comprises: receiving any coordinate value input by a user, and searching for the coordinate value in a known view angle set, wherein the known view angle set records the observation pose data used in the process of establishing the first three-dimensional world of the target scene; determining the coordinate value as a first view angle in response to the coordinate value not being found in the known view angle set; determining at least one adjacent offset view angle of the first view angle according to a preset offset rule, wherein the at least one adjacent offset view angle is a second view angle, and the first view angle and the second view angle form the target view angle group; and performing image rendering on the first view angle and the second view angle to obtain a plurality of rendered images, and determining the plurality of rendered images as the image data of the target view angle group.
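The view-angle search and offset generation of claim 2 can be sketched as follows, under stated simplifications: camera poses are reduced to 2-D coordinates, and the "preset offset rule" is assumed to be a fixed set of small displacements around the queried position. Both choices are illustrative, not the patent's.

```python
import math

def target_view_group(query, known_views,
                      offsets=((0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)),
                      tol=1e-6):
    """Build the target view angle group for a queried camera position.

    query       : (x, y) coordinate input by the user (2-D for brevity)
    known_views : iterable of (x, y) positions used when building the first 3-D world
    offsets     : assumed 'preset offset rule' -- fixed displacements around the query
    """
    for kv in known_views:
        if math.dist(kv, query) < tol:
            return []  # coordinate was already observed; no new view group needed
    first = query                                              # first view angle
    second = [(first[0] + dx, first[1] + dy) for dx, dy in offsets]  # second view angles
    return [first] + second                                    # the target view angle group
```

Rendering each pose in the returned group would then produce the image data of the target view angle group.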
- 3. The method according to claim 1, wherein the performing stitching processing on the image data of the target view angle group to obtain stitched panoramic image data includes: extracting features from the image data of each view angle in the target view angle group to obtain image features of each view angle, and constructing image feature pairs between the view angles; performing alignment processing on the image data of the target view angle group through a preset image alignment algorithm based on the image features to obtain aligned image data; and performing stitching processing on the aligned image data by using a preset image stitching algorithm to obtain the stitched panoramic image data.
- 4. The method according to claim 3, wherein the performing stitching processing on the aligned image data by using a preset image stitching algorithm to obtain the stitched panoramic image data includes: performing frequency-domain decomposition on the aligned image data by using the preset image stitching algorithm to obtain frequency-domain signal features of the image data of each view angle in the aligned image data; and performing weighted fusion calculation on the frequency-domain signal features of the image data of each view angle according to a preset weighting rule to obtain the stitched panoramic image data.
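The frequency-domain weighted fusion of claim 4 can be illustrated with a two-band blend: a box blur acts as the low-pass filter, low frequencies from the two aligned images are mixed smoothly by a per-pixel weight, and high frequencies are taken from whichever image dominates. The box filter and the two-band split are stand-ins for whatever decomposition and weighting rule the preset stitching algorithm actually uses.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable-style box blur serving as the low-pass filter."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def two_band_blend(a, b, w):
    """Blend two aligned grayscale images in the spirit of the claim:
    low frequencies are fused by the smooth weight w, high frequencies
    are taken from the image that dominates at each pixel."""
    low_a, low_b = box_blur(a), box_blur(b)
    high_a, high_b = a - low_a, b - low_b
    low = w * low_a + (1 - w) * low_b          # weighted fusion of the low band
    high = np.where(w >= 0.5, high_a, high_b)  # hard selection in the high band
    return low + high
```

A smooth weight ramp across the overlap region avoids visible seams while the hard high-band selection preserves texture detail.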
- 5. The method according to claim 1, wherein the performing image restoration processing on the stitched panoramic image data to obtain restored panoramic image data comprises: inputting the stitched panoramic image data into a video restoration model, determining a region to be restored in the stitched panoramic image data through the video restoration model, and performing restoration processing on the region to be restored to obtain the restored panoramic image data.
- 6. The method according to claim 5, wherein the inputting the stitched panoramic image data into a video restoration model, determining a region to be restored in the stitched panoramic image data through the video restoration model, and performing restoration processing on the region to be restored to obtain restored panoramic image data includes: inputting the stitched panoramic image data into the video restoration model, and performing mask positioning on the stitched panoramic image data through the video restoration model to determine the region to be restored in the stitched panoramic image data; performing spatial restoration processing on the region to be restored through the video restoration model to obtain a first restored image; and performing time-sequence restoration processing on the first restored image through the video restoration model to obtain a second restored image, and determining the second restored image as the restored panoramic image data.
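The two-stage restoration of claim 6, spatial first and time-sequence second, can be sketched with simple numerical stand-ins: iterative neighbour diffusion fills masked pixels spatially, and an exponential moving average over the frame axis stands in for time-sequence restoration. Neither is the patent's video restoration model; the sketch only shows the order and role of the two stages.

```python
import numpy as np

def spatial_fill(frame, mask, iters=50):
    """First restored image: diffuse neighbouring values into masked
    pixels (a crude stand-in for the model's spatial restoration)."""
    out = frame.astype(float).copy()
    for _ in range(iters):
        avg = (np.roll(out, 1, 0) + np.roll(out, -1, 0)
               + np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 4.0
        out[mask] = avg[mask]  # only masked pixels are updated
    return out

def temporal_smooth(frames, alpha=0.5):
    """Second restored image: exponential moving average across frames,
    standing in for the model's time-sequence restoration."""
    out = [frames[0]]
    for f in frames[1:]:
        out.append(alpha * f + (1 - alpha) * out[-1])
    return np.stack(out)
```

Applying `spatial_fill` per frame and then `temporal_smooth` over the sequence mirrors the claimed ordering: repair each panorama, then enforce consistency over time.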
- 7. The method of claim 5, further comprising: inputting a training data set into a preset video restoration network, wherein the training data set comprises a plurality of first video frame images carrying mask marks and second video frame images corresponding to the first video frame images, the first video frame images are video images having defect features in the regions marked by the mask marks, the second video frame images are video images without the defect features, and the preset video restoration network is a video processing network based on a diffusion model; processing the first video frame images carrying the mask marks by using the preset video restoration network to obtain predicted video images; calculating a preset distance between each predicted video image and the corresponding second video frame image to obtain a prediction loss value; and iteratively optimizing network parameters of the preset video restoration network according to the prediction loss value to obtain the video restoration model.
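The training loop of claim 7 (predict from masked frames, measure a preset distance to the clean frames, iteratively optimise) can be sketched with a toy linear network standing in for the diffusion-based restoration network; the L2 distance and the plain gradient step are illustrative choices, not the patent's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the diffusion-based restoration network:
# a single linear map from masked frames to clean frames.
W = rng.normal(0.0, 0.1, size=(16, 16))

def train_step(W, masked, clean, lr=0.05):
    pred = masked @ W                        # predicted video image
    err = pred - clean
    loss = float(np.mean(err ** 2))          # preset distance -> prediction loss value
    grad = masked.T @ err / masked.shape[0]  # gradient of the loss w.r.t. W
    return W - lr * grad, loss               # iterative parameter optimisation

clean = rng.normal(size=(64, 16))                 # second video frame images (no defects)
mask = (rng.random((64, 16)) > 0.3).astype(float) # mask marks
masked = clean * mask                             # first video frame images (defect regions zeroed)

losses = []
for _ in range(200):
    W, loss = train_step(W, masked, clean)
    losses.append(loss)
```

In the patent the network is a video diffusion model and the frames are real renders; only the predict/loss/update cycle is shown here.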
- 8. The method of claim 7, further comprising: initializing a three-dimensional world reconstruction model, and acquiring a rendered video frame image set based on the three-dimensional world reconstruction model, wherein the rendered video frame image set is a combination of video frame image sequences rendered from the three-dimensional worlds of a plurality of scenes; in response to the three-dimensional world reconstruction model being iteratively optimized according to a preset number of iterations, acquiring, according to the preset number of iterations and a preset sampling step length, a third video frame image and a fourth video frame image rendered by the three-dimensional world reconstruction model during the iterative optimization, wherein the third video frame image is a video frame image output before the preset number of iterations is reached, and the fourth video frame image is a video frame image output when the preset number of iterations is reached; determining regions with defect features in the third video frame image by comparing the third video frame image with the fourth video frame image, and performing mask marking on the regions with defect features through a preset occlusion mask mechanism to obtain a third video frame image carrying a mask identifier; and determining the third video frame image carrying the mask identifier and the fourth video frame image as a second data pair, wherein the second data pairs corresponding to at least one frame image in the rendered video frame image set form the training data set.
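The training-pair construction of claim 8 can be sketched as follows: defect regions are located by thresholding the per-pixel difference between an early-iteration render (the third video frame image) and the final render (the fourth video frame image). The threshold value is an assumption, and the "preset occlusion mask mechanism" is reduced to a boolean mask for illustration.

```python
import numpy as np

def defect_mask(early_frame, final_frame, thresh=0.1):
    """Mark pixels where an early-iteration render disagrees with the
    final render; these are treated as defect regions (floaters,
    artifacts) and become the mask marks of the training pair."""
    diff = np.abs(early_frame.astype(float) - final_frame.astype(float))
    if diff.ndim == 3:          # reduce colour channels if present
        diff = diff.mean(axis=-1)
    return diff > thresh        # boolean occlusion-style mask

def make_training_pair(early_frame, final_frame, thresh=0.1):
    """Assemble one second data pair: defect frame + mask as the input,
    final frame as the clean target."""
    mask = defect_mask(early_frame, final_frame, thresh)
    return {"masked_input": early_frame, "mask": mask, "target": final_frame}
```

Collecting such pairs over every sampled frame of every scene yields the training data set consumed by the loop of claim 7.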
- 9. A video generation apparatus, the apparatus comprising: a three-dimensional world generation module configured to generate a first three-dimensional world of a target scene, wherein the first three-dimensional world is a three-dimensional Gaussian sphere set, and any three-dimensional Gaussian sphere comprises a position coordinate parameter, a covariance parameter, and a spherical harmonic coefficient; a multi-view image determining module configured to determine image data of a target view angle group based on the first three-dimensional world, wherein the target view angle group comprises multi-angle offset view angles of the same coordinate position and corresponds to observation pose data not used in the process of establishing the first three-dimensional world of the target scene; an image stitching module configured to perform stitching processing on the image data of the target view angle group to obtain stitched panoramic image data; an image restoration module configured to perform image restoration processing on the stitched panoramic image data to obtain restored panoramic image data; and a reverse optimization module configured to optimize the first three-dimensional world based on the restored panoramic image data to obtain a second three-dimensional world, so as to generate a target three-dimensional video based on the second three-dimensional world; wherein the three-dimensional world generation module is specifically configured to: in response to receiving any scene image and scene description prompt text, extract an image semantic feature distribution of the scene image and a text description feature distribution of the scene description prompt text; for any pixel in the scene image, determine a three-dimensional coordinate value of the pixel based on the image semantic feature distribution; create a first three-dimensional Gaussian sphere for the pixel based on its three-dimensional coordinate value to obtain a first three-dimensional Gaussian sphere set covering all pixels of the scene image; and adjust and optimize Gaussian sphere parameters in the first three-dimensional Gaussian sphere set through the text description feature distribution to obtain a second Gaussian sphere set, wherein the second Gaussian sphere set forms the first three-dimensional world.
- 10. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method of any one of claims 1 to 8.
- 11. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
- 12. A computer program product comprising computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
Description
Video generation method, apparatus, electronic device, storage medium, and program product
Technical Field
The present disclosure relates to the field of video generation technology, and in particular to a video generation method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Existing 3D (three-dimensional) world generation methods mainly generate multi-view images or panoramic views from a single image or a text description and convert them into a 3D spatial representation. Representative methods such as Text2Room (a text-driven 3D space construction tool) and LucidDreamer (a 3D scene generation technique capable of producing a high-quality 3D scene from text or image cues) generate multi-perspective images from the input image and text and construct a preliminary 3D scene from them. Other 3D world generation methods use a pre-trained text-to-panorama diffusion model to generate a scene image and reconstruct the 3D world from it. Although these methods can generate 3D spaces that support a degree of interaction, their training is supervised by only a small number of viewing angles. When a user moves the viewpoint over a large range or explores forward in the generated 3D world, high-quality image representation cannot be maintained, and problems such as floaters, artifacts, and structural distortion readily occur, which limits the immersive experience.
Disclosure of Invention
To solve the above technical problems in the related art, embodiments of the present disclosure provide a video generation method, apparatus, electronic device, computer-readable storage medium, and computer program product.
According to a first aspect of embodiments of the present disclosure, there is provided a video generation method, the method including: generating a first three-dimensional world of a target scene, wherein the first three-dimensional world is a three-dimensional Gaussian sphere set, and any three-dimensional Gaussian sphere comprises a position coordinate parameter, a covariance parameter, and a spherical harmonic coefficient; determining image data of a target view angle group based on the first three-dimensional world, the target view angle group including multi-angle offset view angles of the same coordinate position; performing stitching processing on the image data of the target view angle group to obtain stitched panoramic image data; performing image restoration processing on the stitched panoramic image data to obtain restored panoramic image data; and optimizing the first three-dimensional world based on the restored panoramic image data to obtain a second three-dimensional world, and generating a target three-dimensional video based on the second three-dimensional world.
As an optional embodiment, the generating the first three-dimensional world of the target scene, the first three-dimensional world being a three-dimensional Gaussian sphere set and any three-dimensional Gaussian sphere including a position coordinate parameter, a covariance parameter, and a spherical harmonic coefficient, includes: in response to receiving any scene image and scene description prompt text, extracting an image semantic feature distribution of the scene image and a text description feature distribution of the scene description prompt text; for any pixel in the scene image, determining a three-dimensional coordinate value of the pixel based on the image semantic feature distribution; creating a first three-dimensional Gaussian sphere for the pixel based on its three-dimensional coordinate value to obtain a first three-dimensional Gaussian sphere set covering all pixels of the scene image; and adjusting and optimizing Gaussian sphere parameters in the first three-dimensional Gaussian sphere set through the text description feature distribution to obtain a second Gaussian sphere set, wherein the second Gaussian sphere set forms the first three-dimensional world.
As an alternative embodiment, the determining, based on the first three-dimensional world, image data of a target view angle group, the target view angle group including multi-angle offset view angles of the same coordinate position, includes: receiving any coordinate value input by a user, and searching for the coordinate value in a known view angle set, wherein the known view angle set records the observation pose data used in the process of establishing the first three-dimensional world of the target scene; determining the coordinate value as a first view angle in response to the coordinate value not being found in the known view angle set; determining at least one adjacent offset view angle of the first view angle according to a preset offset rule, wherein the at least one adjacent offset view angle is a second view angle, and the first view angle and the second view angle are the target v