
CN-122023637-A - Image generation method, model training method, device and storage medium

CN 122023637 A

Abstract

The embodiments of the disclosure provide an image generation method, a model training method, a device, and a storage medium. The image generation method comprises: obtaining a plurality of view images of a target object under different view angles, together with a plurality of pose information and a plurality of depth images in one-to-one correspondence with the view images; determining, based on the view images and the depth images, a plurality of mask images in one-to-one correspondence with the view images, wherein a target area contained in each mask image indicates the area of the target object in the corresponding view image; training an initial image generation model based on the view images, the depth images, the mask images, and the pose information to obtain a trained image generation model, wherein the initial image generation model is a neural radiance field model; and generating, according to the trained image generation model, a target view image whose view angle differs from the view angles of the view images.

Inventors

  • BI YUEFENG

Assignees

  • BOE Technology Group Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2024-10-30

Claims (13)

  1. An image generation method, comprising: obtaining a plurality of view images, a plurality of pose information in one-to-one correspondence with the view images, and a plurality of depth images in one-to-one correspondence with the view images, wherein the view images are images of a target object under different view angles; determining a plurality of mask images in one-to-one correspondence with the view images based on the view images and the depth images, wherein a target area contained in each mask image indicates the area of the target object in the corresponding view image; training an initial image generation model based on the plurality of view images, the plurality of depth images, the plurality of mask images, and the plurality of pose information to obtain a trained image generation model, wherein the initial image generation model is a neural radiance field model; and generating a target view image according to the trained image generation model, wherein the view angle of the target view image is different from the view angles of the plurality of view images.
  2. The method of claim 1, wherein the determining a plurality of mask images in one-to-one correspondence with the plurality of view images based on the plurality of view images and the plurality of depth images comprises: inputting the plurality of view images into a first feature extraction network to obtain a plurality of first feature maps in one-to-one correspondence with the plurality of view images; inputting the plurality of depth images into a second feature extraction network to obtain a plurality of second feature maps in one-to-one correspondence with the plurality of depth images; inputting the plurality of first feature maps and the plurality of second feature maps into a feature fusion network to obtain a plurality of fused feature maps in one-to-one correspondence with the plurality of view images; inputting the plurality of fused feature maps into a prediction network to obtain a plurality of initial mask images in one-to-one correspondence with the plurality of view images; inputting the plurality of initial mask images into an optimization network to obtain a plurality of optimized mask images in one-to-one correspondence with the plurality of view images; and determining the plurality of mask images based on the plurality of optimized mask images.
  3. The method of claim 2, wherein the first feature extraction network comprises a first codec network comprising a first encoder and a first decoder, the input of the first encoder being the plurality of view images, the output of the first encoder being the input of the first decoder, and the output of the first decoder being the plurality of first feature maps; and the second feature extraction network comprises a second codec network comprising a second encoder and a second decoder, wherein the input of the second encoder is the plurality of depth images, the output of the second encoder is the input of the second decoder, and the output of the second decoder is the plurality of second feature maps.
  4. The method according to claim 2 or 3, wherein the determining the plurality of mask images based on the plurality of optimized mask images comprises: performing binarization processing on the plurality of optimized mask images to obtain a plurality of binary images; and determining the plurality of binary images as the plurality of mask images.
  5. The method of claim 2 or 3, wherein the optimization network comprises a residual refinement module.
  6. The method of claim 1, wherein the training the initial image generation model based on the plurality of view images, the plurality of depth images, the plurality of mask images, and the plurality of pose information to obtain a trained image generation model comprises: determining sampling information of the target object in the plurality of view images under a world coordinate system according to the plurality of view images, the plurality of mask images, and the plurality of pose information, wherein the sampling information of the target object in each view image under the world coordinate system comprises position information and view information, under the world coordinate system, of a plurality of sampling points on each ray corresponding to each pixel point of the area where the target object is located, and each ray takes a camera center as a starting point and passes through the camera center and the pixel point of the area where the target object is located; performing position encoding on the sampling information of the target object in the plurality of view images under the world coordinate system to obtain position-encoded sampling information of the target object in the plurality of view images, wherein the position-encoded sampling information comprises position-encoded position information and position-encoded view information; inputting the position-encoded sampling information of the target object in the plurality of view images into a neural network in the initial image generation model to obtain voxel information of each ray corresponding to the target object in the plurality of view images, wherein the voxel information of each ray comprises a volume density and a color corresponding to each of the plurality of sampling points on the ray; inputting the voxel information of each ray corresponding to the target object in the plurality of view images into a rendering network in the initial image generation model to obtain a predicted rendered image corresponding to each view image; and adjusting parameters of the neural network in the initial image generation model based on a reconstruction loss until the trained image generation model is obtained, wherein the reconstruction loss is calculated using a preset loss function from each view image, the depth image corresponding to each view image, and the predicted rendered image corresponding to each view image.
  7. The method of claim 6, wherein the determining sampling information of the target object in the plurality of view images under a world coordinate system based on the plurality of view images, the plurality of mask images, and the plurality of pose information comprises: acquiring coordinates of pixel points of the area where the target object is located in the plurality of view images according to the plurality of view images and the target areas contained in the plurality of mask images; calculating coordinates, under a camera coordinate system, of the pixel points of the area where the target object is located in the plurality of view images according to the coordinates of those pixel points and the intrinsic parameters in the plurality of pose information corresponding to the plurality of view images; determining sampling information of the target object in the plurality of view images under the camera coordinate system according to the coordinates, under the camera coordinate system, of the pixel points of the area where the target object is located, wherein the sampling information of the target object in each view image under the camera coordinate system comprises position information and view information, under the camera coordinate system, of a plurality of sampling points on each ray corresponding to each pixel point of the area where the target object is located; and determining the sampling information of the target object in the plurality of view images under the world coordinate system based on the extrinsic parameters in the plurality of pose information corresponding to the plurality of view images and the sampling information of the target object in the plurality of view images under the camera coordinate system.
  8. The method according to claim 6 or 7, wherein the adjusting parameters of the neural network in the initial image generation model based on the reconstruction loss until the trained image generation model is obtained comprises: determining a color loss based on a difference between each view image and the corresponding predicted rendered image; determining a depth loss based on the depth image corresponding to each view image and each ray of the target object in each view image; determining the reconstruction loss based on the color loss and the depth loss; and adjusting the parameters of the neural network in the initial image generation model based on the reconstruction loss until a preset training ending condition is met, so as to obtain the trained image generation model, wherein the preset training ending condition comprises the reconstruction loss being smaller than a preset threshold or the number of adjustments reaching a preset number.
  9. The method of claim 8, wherein the determining a color loss based on a difference between each view image and the corresponding predicted rendered image comprises: determining the color loss based on the difference between each view image and the corresponding predicted rendered image according to the following formula: L_color = ‖Ĉ − C_g.t.‖₂, where Ĉ is the color information of the predicted rendered image, C_g.t. is the color information of the view image, L_color denotes the color loss, and ‖·‖₂ denotes the L2 norm.
  10. The method of claim 8, wherein the determining a depth loss based on the depth image corresponding to each view image and each ray of the target object in each view image comprises: determining the depth loss based on the depth image corresponding to each view image and the termination distance of each ray of the target object in each view image according to the following formula: L_depth = KL[D̂ ‖ h(t)], where h(t) denotes the termination distance distribution of the ray, D̂ denotes the depth information of the depth image, KL[·] denotes the Kullback-Leibler divergence, and L_depth denotes the depth loss.
  11. A model training method, comprising: obtaining a plurality of view images, a plurality of pose information in one-to-one correspondence with the view images, and a plurality of depth images in one-to-one correspondence with the view images, wherein the view images are images of a target object under different view angles; determining a plurality of mask images in one-to-one correspondence with the view images based on the view images and the depth images, wherein a target area contained in each mask image indicates the area of the target object in the corresponding view image; and training an initial image generation model based on the plurality of view images, the plurality of depth images, the plurality of mask images, and the plurality of pose information to obtain a trained image generation model, wherein the initial image generation model is a neural radiance field model, and the trained image generation model is a three-dimensional model of the target object.
  12. A computer device comprising a processor and a memory storing a computer program executable on the processor, wherein the processor is arranged to perform the steps of the image generation method according to any one of claims 1 to 10 or the steps of the model training method according to claim 11.
  13. A non-transitory computer-readable storage medium storing computer-executable instructions arranged to perform the steps of the image generation method according to any one of claims 1 to 10 or the model training method according to claim 11.
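The back-projection chain in claim 7, from pixel coordinates through the camera coordinate system to world-space rays, can be sketched as follows. The intrinsic matrix K, the extrinsics R and t, and the sample depths are hypothetical values for illustration; the patent does not specify them.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])      # camera intrinsic parameters (assumed)
R = np.eye(3)                            # camera-to-world rotation (assumed)
t = np.array([0.0, 0.0, 2.0])            # camera center in world coordinates (assumed)

def pixel_to_world_ray(u, v, K, R, t):
    # Pixel -> camera coordinates: direction of the ray through pixel (u, v).
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera -> world coordinates via the extrinsic rotation.
    d_world = R @ d_cam
    d_world /= np.linalg.norm(d_world)
    return t, d_world                    # ray origin (camera center) and unit direction

origin, direction = pixel_to_world_ray(320.0, 240.0, K, R, t)
# Sampling points along the ray, at assumed depths between 0.5 and 3.0.
samples = origin + np.linspace(0.5, 3.0, 8)[:, None] * direction
```

Here the principal point (320, 240) back-projects onto the optical axis, so the ray direction is (0, 0, 1) in world coordinates.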
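The reconstruction loss of claims 8 to 10 combines a color term and a depth term. A minimal numpy sketch follows, using the L2 color norm of claim 9 but a simple squared-error stand-in for the KL-divergence depth term of claim 10; the 0.1 weight and the stopping thresholds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def color_loss(view_img, rendered_img):
    # L_color = || C_hat - C_g.t. ||_2 (claim 9)
    return np.linalg.norm(rendered_img - view_img)

def depth_loss(depth_img, ray_termination):
    # Simplified stand-in for the KL-based depth term of claim 10:
    # penalize rays whose termination distance disagrees with the depth image.
    return np.mean((depth_img - ray_termination) ** 2)

def reconstruction_loss(view_img, rendered_img, depth_img, ray_term, w_depth=0.1):
    # Combined reconstruction loss (claim 8); the weighting is assumed.
    return color_loss(view_img, rendered_img) + w_depth * depth_loss(depth_img, ray_term)

def should_stop(loss, step, loss_threshold=1e-3, max_steps=10000):
    # Preset training-ending condition of claim 8: loss below a threshold,
    # or the number of parameter adjustments reaching a preset count.
    return loss < loss_threshold or step >= max_steps

img = np.ones((4, 4, 3))
d = np.full((4, 4), 2.0)
loss = reconstruction_loss(img, img, d, d)   # identical inputs -> zero loss
```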

Description

Image generation method, model training method, device and storage medium

Technical Field

Embodiments of the present disclosure relate to, but are not limited to, the field of artificial intelligence, and in particular to an image generation method, a model training method, a device, and a storage medium.

Background

In a variety of application scenarios, such as autonomous driving, gaming, virtual reality, and augmented reality, it is often necessary to render images of a scene from new view angles. However, in some technologies, generating a new-view image consumes substantial computing resources, is inefficient, and produces poor rendering results.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

In a first aspect, an embodiment of the present disclosure provides an image generation method, which includes: obtaining a plurality of view images, a plurality of pose information in one-to-one correspondence with the view images, and a plurality of depth images in one-to-one correspondence with the view images, wherein the view images are images of a target object under different view angles; determining a plurality of mask images in one-to-one correspondence with the view images based on the view images and the depth images, wherein a target area included in each mask image indicates the area of the target object in the corresponding view image; training an initial image generation model based on the view images, the depth images, the mask images, and the pose information to obtain a trained image generation model, wherein the initial image generation model is a neural radiance field model; and generating a target view image according to the trained image generation model, wherein the view angle of the target view image is different from the view angles of the view images.
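The neural radiance field training summarized above rests on two standard steps: position encoding of sampled points and volume rendering of per-ray densities and colors. A minimal numpy sketch of both follows; the frequency count, sample densities, colors, and spacing are illustrative assumptions.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    # NeRF-style position encoding: each coordinate is mapped to sin/cos
    # values at geometrically spaced frequencies 2^k * pi.
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = x[..., None] * freqs                          # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                  # (..., D * 2L)

def render_ray(sigmas, colors, deltas):
    # Volume rendering of one ray: alpha-composite the per-sample volume
    # densities (sigma) and colors emitted by the neural network.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # transmittance
    weights = trans * alphas
    return weights @ colors                                # composited RGB

pts = np.zeros((2, 3))                                     # two 3-D sample points
enc = positional_encoding(pts, num_freqs=4)                # shape (2, 24)

sigmas = np.array([0.5, 1.0, 2.0])                         # assumed densities
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.full(3, 0.1)                                   # assumed sample spacing
pixel = render_ray(sigmas, colors, deltas)                 # predicted pixel color
```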
In some exemplary embodiments, the determining a plurality of mask images corresponding to the plurality of view images one-to-one based on the plurality of view images and the plurality of depth images includes inputting the plurality of view images into a first feature extraction network to obtain a plurality of first feature maps corresponding to the plurality of view images one-to-one, inputting the plurality of depth images into a second feature extraction network to obtain a plurality of second feature maps corresponding to the plurality of depth images one-to-one, inputting the plurality of first feature maps and the plurality of second feature maps into a feature fusion network to obtain a plurality of fused feature maps corresponding to the plurality of view images one-to-one, inputting the plurality of fused feature maps into a prediction network to obtain a plurality of initial mask images corresponding to the plurality of view images one-to-one, inputting the plurality of initial mask images into an optimization network to obtain a plurality of optimized mask images corresponding to the plurality of view images one-to-one, and determining the plurality of mask images based on the plurality of optimized mask images.

In some exemplary embodiments, the first feature extraction network comprises a first codec network comprising a first encoder and a first decoder, the input of the first encoder being the plurality of view images, the output of the first encoder being the input of the first decoder, and the output of the first decoder being the plurality of first feature maps; and the second feature extraction network comprises a second codec network comprising a second encoder and a second decoder, the input of the second encoder being the plurality of depth images, the output of the second encoder being the input of the second decoder, and the output of the second decoder being the plurality of second feature maps.
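The mask-determination data flow above (feature extraction from the view and depth images, fusion, prediction, refinement, then binarization) can be sketched as follows. Toy per-pixel linear maps stand in for the unspecified encoder-decoder, fusion, prediction, and refinement networks, and all shapes, weights, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(images, weight):
    # "Feature extraction" stand-in: a per-pixel linear map over channels.
    return images @ weight                                    # (N, H, W, Cin) -> (N, H, W, Cf)

views  = rng.random((4, 8, 8, 3))                             # 4 view images
depths = rng.random((4, 8, 8, 1))                             # 4 depth images, one per view

f_rgb   = extract_features(views,  rng.random((3, 16)))       # first feature maps
f_depth = extract_features(depths, rng.random((1, 16)))       # second feature maps
fused   = np.concatenate([f_rgb, f_depth], axis=-1)           # feature fusion network

logits  = fused @ rng.random((32, 1))                         # prediction network
initial = 1.0 / (1.0 + np.exp(-logits))                       # initial soft masks

# "Optimization network": a residual refinement step on the soft masks.
refined = np.clip(initial + 0.1 * (initial - 0.5), 0.0, 1.0)

# Binarization of the optimized masks yields the final mask images.
masks = (refined >= 0.5).astype(np.uint8)
```

Each mask image ends up in one-to-one correspondence with a view image, with 1 marking the target region and 0 the background.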
In some exemplary embodiments, the determining the plurality of mask images based on the plurality of optimized mask images includes performing binarization processing on the plurality of optimized mask images to obtain a plurality of binary images, and determining the plurality of binary images as the plurality of mask images.

In some exemplary embodiments, the optimization network includes a residual refinement module.

In some exemplary embodiments, training an initial image generation model based on the view images, the depth images, the mask images, and the pose information to obtain a trained image generation model includes determining sampling information of the target object in the view images in a world coordinate system according to the view images, the mask images, and the pose information, wherein the sampling information of the target object in each view image in the world coordinate system includes position information and view information of a plurality of sampling points on each ray corresponding to each pixel point of an area where the target object is located in the world coordinate system,