CN-121982167-A - Illumination-aware video generation method, device and storage medium based on renderer-agent reasoning
Abstract
The invention relates to an illumination-aware video generation method, device, and storage medium based on renderer-agent reasoning. Through natural-language instructions, the method lets a user accurately and independently (in a decoupled manner) control the geometric layout, illumination conditions, and camera trajectory during video generation. A renderer agent converts the text into structured three-dimensional scene parameters; a rendering engine then produces a two-dimensional scene proxy consisting of multi-channel layers for diffuse, glossy, and rough reflection; and a lightweight proxy encoder and adapter inject these physical illumination attributes into a video diffusion model as strong conditioning signals. The method achieves end-to-end generation of physically consistent video from text: the generated video exhibits accurate shadows, reflections, and ambient occlusion while retaining realistic visual texture, improving both the precision and the degree of automation with which the scene's physical properties can be controlled. It is applicable to fields such as film previsualization, game asset production, and virtual production.
Inventors
- Shi Baixin
- Cai Ziqi
- Yang Taoyu
- Chang Zheng
- Li Si
- Weng Shuchen
- Jiang Han
- Lv Tianlei
- Pan Xiangyu
Assignees
- Peking University
- Beishi Computing (Tianjin) Information Technology Co., Ltd.
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-29
Claims (8)
- 1. An illumination-aware video generation method based on renderer-agent reasoning, characterized by comprising the following steps: S1, acquiring a text description instruction input by a user; S2, reasoning over the text description instruction with a renderer agent to construct a coarse three-dimensional scene representation comprising a geometric layout, illumination conditions, and a camera trajectory; S3, performing layered rendering based on the coarse three-dimensional scene representation with a three-dimensional rendering engine to generate a two-dimensional scene proxy composed of multi-channel rendering layers, wherein the two-dimensional scene proxy carries physical illumination attribute information; S4, feeding the two-dimensional scene proxy into a pre-trained video diffusion model as a conditional guidance signal; S5, generating a final video with the video diffusion model, wherein the final video preserves the physical illumination and geometric structure defined by the two-dimensional scene proxy while exhibiting realistic visual texture.
- 2. The illumination-aware video generation method based on renderer-agent reasoning of claim 1, wherein step S2 specifically comprises: S2.1, parsing the text description instruction with a large language model to construct a scene graph, wherein the scene graph defines object categories and their spatial relations; S2.2, retrieving corresponding three-dimensional model assets from a preset three-dimensional asset library according to the scene graph, and laying them out according to the spatial relations to generate the scene geometry; S2.3, parsing the lighting-atmosphere description in the text description instruction, and retrieving or generating a matching high dynamic range (HDR) environment map as the illumination conditions; S2.4, parsing the shot description in the text description instruction, and generating a time-varying camera pose sequence as the camera trajectory.
- 3. The illumination-aware video generation method based on renderer-agent reasoning of claim 1, wherein the two-dimensional scene proxy of S3 is a stack of temporally consecutive image sequences, and the multi-channel rendering layers comprise a diffuse channel (Diffuse Pass), a glossy GGX channel (Glossy GGX Pass), and a rough GGX channel (Rough GGX Pass); the diffuse channel captures low-frequency ambient illumination information, the glossy GGX channel captures high-frequency specular reflection information at low roughness, and the rough GGX channel captures mid-frequency reflection information at high roughness.
- 4. The illumination-aware video generation method based on renderer-agent reasoning of claim 1, wherein step S4 specifically comprises: downsampling and extracting features from the two-dimensional scene proxy with a lightweight proxy encoder to generate scene proxy features; and injecting the scene proxy features into intermediate layers of the video diffusion model with a proxy adapter; the proxy adapter computes a residual from the scene proxy features and the intermediate features of the video diffusion model, and adds the residual to the original video features through a learnable zero-initialized neural network (a minimal code sketch of this encoder-adapter mechanism follows the claims).
- 5. The illumination-aware video generation method based on renderer-agent reasoning of claim 1, wherein the video diffusion model is trained with a three-stage progressive strategy: a) freezing the backbone parameters of the video diffusion model and training only the proxy encoder and the proxy adapter, so that the model learns to translate the scene proxy into control signals; b) unfreezing low-rank adaptation (LoRA) layers in the video diffusion model and fine-tuning them jointly with the proxy encoder and adapter to balance controllability and generation quality; c) mixing real-world video data and synthetic video data in the training set and jointly fine-tuning the model to strengthen its generalization to diverse lighting phenomena.
- 6. The illumination-aware video generation method based on renderer-agent reasoning of claim 5, wherein the synthetic video data is constructed by: a) selecting three-dimensional assets with physically based materials to build a virtual scene; b) illuminating the virtual scene with diverse HDR environment maps, and applying a time-varying rotation to each HDR environment map to simulate dynamic illumination (a sketch of this environment-map rotation follows the claims); c) rendering from a moving camera viewpoint to obtain synthetic videos with dynamic illumination changes and their corresponding two-dimensional scene proxies.
- 7. An illumination-aware video generation device based on renderer-agent reasoning, comprising: an instruction acquisition module for acquiring the text description instruction; an agent reasoning module for constructing, through semantic parsing, a coarse three-dimensional scene containing geometric, illumination, and camera parameters; a proxy rendering module for generating, based on a physically based rendering engine, a two-dimensional scene proxy image sequence containing diffuse, glossy, and rough reflection channels; and a video generation module, comprising a proxy encoder, a proxy adapter, and a video diffusion backbone network, for generating physically illumination-consistent video content based on the two-dimensional scene proxy.
- 8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
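The following is a minimal PyTorch sketch of the encoder-adapter mechanism in claims 3-4. All names, channel counts, and layer shapes here are illustrative assumptions rather than the patent's actual implementation; only the overall pattern follows the claims: the stacked render passes are downsampled by a lightweight encoder, and the resulting features are added to the diffusion model's intermediate features through a zero-initialized layer (a ControlNet-style design), so the residual is exactly zero when training begins.

```python
import torch
import torch.nn as nn

class ProxyEncoder(nn.Module):
    """Downsamples the stacked scene-proxy passes into a feature map.

    Assumed input: diffuse (3) + glossy GGX (3) + rough GGX (3) = 9 channels
    per frame, with frames folded into the batch dimension.
    """
    def __init__(self, in_channels: int = 9, feat_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1),
        )

    def forward(self, proxy: torch.Tensor) -> torch.Tensor:
        # proxy: (batch * frames, 9, H, W) -> (batch * frames, C, H/8, W/8)
        return self.net(proxy)

def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the adapter starts as an identity."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ProxyAdapter(nn.Module):
    """Adds a learned residual computed from the proxy features onto the
    diffusion model's intermediate features via a zero-initialized projection."""
    def __init__(self, feat_channels: int = 320):
        super().__init__()
        self.proj = nn.Conv2d(feat_channels, feat_channels, 1)
        self.zero_out = zero_module(nn.Conv2d(feat_channels, feat_channels, 1))

    def forward(self, video_feat: torch.Tensor, proxy_feat: torch.Tensor) -> torch.Tensor:
        # Residual is exactly zero at initialization, so the pretrained
        # backbone's behavior is undisturbed at the start of training.
        return video_feat + self.zero_out(self.proj(proxy_feat))
```

The zero initialization is what makes stage a) of claim 5 safe: with the backbone frozen and the residual starting at zero, the encoder and adapter can be trained from scratch without degrading the pretrained model's generations.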
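Claim 6's dynamic-illumination trick — rotating the HDR environment map over time — reduces, for an equirectangular map, to a horizontal circular shift of pixel columns. A minimal NumPy sketch, in which the frame count and rotation range are assumptions:

```python
import numpy as np

def rotate_envmap(hdr: np.ndarray, angle_deg: float) -> np.ndarray:
    """Yaw-rotate an equirectangular HDR map of shape (H, W, 3).

    In an equirectangular projection, rotating the environment about the
    vertical axis is a circular shift of the pixel columns.
    """
    h, w, _ = hdr.shape
    shift = int(round(w * angle_deg / 360.0)) % w
    return np.roll(hdr, shift, axis=1)

def envmap_sequence(hdr: np.ndarray, num_frames: int, total_deg: float = 90.0):
    """Yield one rotated environment map per frame to simulate moving light."""
    for t in range(num_frames):
        yield rotate_envmap(hdr, total_deg * t / max(num_frames - 1, 1))
```

Each rotated map would light the virtual scene for one frame, and the renderer would export both the beauty frame and its multi-channel scene proxy, yielding paired training data with known, dynamic illumination.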
Description
Illumination-aware video generation method, device and storage medium based on renderer-agent reasoning

Technical Field

The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to an illumination-aware video generation method, device, and storage medium based on renderer-agent reasoning.

Background

In recent years, with the development and popularization of artificial intelligence, video generation technology based on diffusion models has advanced remarkably. However, existing data-driven models still face significant challenges in controllability. In particular, existing video generation models often struggle to control the physical properties of a scene in a decoupled manner: it is difficult for a user to precisely control the illumination direction, shadow casting, material reflection properties, and exact object layout. Although some work enhances control by introducing 3D bounding boxes or camera trajectories, these methods mostly ignore the core physical element of illumination. As a result, light-and-shadow inconsistencies (such as misplaced highlights or missing shadows) frequently appear in the generated videos, which fail to meet the strict physical-realism requirements of professional fields such as film production and virtual production.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an illumination-aware video generation method, device, and storage medium based on renderer-agent reasoning. By introducing a "renderer agent" and a "scene proxy", explicit 3D physical constraints are injected into the video generation process, producing video with high physical realism. The invention solves the technical problem through the following technical scheme.

An illumination-aware video generation method based on renderer-agent reasoning comprises the following steps: S1, acquiring a text description instruction input by a user; S2, reasoning over the text description instruction with a renderer agent to construct a coarse three-dimensional scene representation comprising a geometric layout, illumination conditions, and a camera trajectory; S3, performing layered rendering based on the coarse three-dimensional scene representation with a three-dimensional rendering engine to generate a two-dimensional scene proxy composed of multi-channel rendering layers, wherein the two-dimensional scene proxy carries physical illumination attribute information; S4, feeding the two-dimensional scene proxy into a pre-trained video diffusion model as a conditional guidance signal; S5, generating a final video with the video diffusion model, wherein the final video preserves the physical illumination and geometric structure defined by the two-dimensional scene proxy while exhibiting realistic visual texture.
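As a concrete illustration of step S2.1, the renderer agent's first move is to prompt a large language model for a structured scene graph. The prompt wording, the JSON schema, and the `call_llm` helper below are hypothetical placeholders, not the patent's implementation; any LLM client that returns text could be substituted:

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SceneObject:
    category: str          # used to query the 3D asset library (S2.2)
    relation: str          # spatial relation, e.g. "left_of"
    anchor: Optional[str]  # the object this relation is relative to

SCENE_GRAPH_PROMPT = """Extract a scene graph from the instruction below.
Return JSON: {{"objects": [{{"category": ..., "relation": ..., "anchor": ...}}],
"lighting": "<atmosphere description>", "camera": "<shot description>"}}
Instruction: {instruction}"""

def parse_instruction(instruction: str, call_llm: Callable[[str], str]) -> dict:
    """Turn a free-form user instruction into a structured scene graph."""
    raw = call_llm(SCENE_GRAPH_PROMPT.format(instruction=instruction))
    graph = json.loads(raw)
    graph["objects"] = [
        SceneObject(category=o["category"],
                    relation=o.get("relation", "none"),
                    anchor=o.get("anchor"))
        for o in graph["objects"]
    ]
    return graph
```

In this sketch, the `lighting` and `camera` fields would feed steps S2.3 (HDR environment-map retrieval) and S2.4 (camera trajectory generation) respectively.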
Moreover, S2 specifically comprises: S2.1, parsing the text description instruction with a large language model to construct a scene graph, wherein the scene graph defines object categories and their spatial relations; S2.2, retrieving corresponding three-dimensional model assets from a preset three-dimensional asset library according to the scene graph, and laying them out according to the spatial relations to generate the scene geometry; S2.3, parsing the lighting-atmosphere description in the text description instruction, and retrieving or generating a matching high dynamic range (HDR) environment map as the illumination conditions; S2.4, parsing the shot description in the text description instruction, and generating a time-varying camera pose sequence as the camera trajectory.

The two-dimensional scene proxy of S3 is a stack of temporally consecutive image sequences; the multi-channel rendering layers comprise a diffuse channel (Diffuse Pass), a glossy GGX channel (Glossy GGX Pass), and a rough GGX channel (Rough GGX Pass). The diffuse channel captures low-frequency ambient illumination information, the glossy GGX channel captures high-frequency specular reflection information at low roughness, and the rough GGX channel captures mid-frequency reflection information at high roughness.

Moreover, S4 specifically comprises: downsampling and extracting features from the two-dimensional scene proxy with a lightweight proxy encoder to generate scene proxy features; and injecting the scene proxy features into intermediate layers of the video diffusion model with a proxy adapter; the proxy adapter computes a residual from the scene proxy features and the intermediate features of the video diffusion model, and adds the residual to the original video features through a learnable zero-initialized neural network.
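The time-varying camera pose sequence of step S2.4 can likewise be sketched minimally. The orbit path, radius, height, and sweep angle below are illustrative assumptions; a real system would derive the trajectory from the parsed shot description:

```python
import numpy as np

def look_at(eye: np.ndarray, target: np.ndarray,
            up: np.ndarray = np.array([0.0, 0.0, 1.0])) -> np.ndarray:
    """Build a 4x4 world-to-camera pose looking from `eye` toward `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    # Camera looks down -z in camera coordinates (OpenGL convention).
    pose[:3, :3] = np.stack([right, true_up, -forward])
    pose[:3, 3] = -pose[:3, :3] @ eye
    return pose

def orbit_trajectory(num_frames: int, radius: float = 4.0, height: float = 1.5):
    """Yield one camera pose per frame, orbiting the origin through 90 degrees."""
    target = np.zeros(3)
    for t in range(num_frames):
        angle = np.deg2rad(90.0 * t / max(num_frames - 1, 1))
        eye = np.array([radius * np.cos(angle), radius * np.sin(angle), height])
        yield look_at(eye, target)
```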