CN-121999118-A - Data processing method, apparatus, electronic device, storage medium, and program product
Abstract
The application provides a data processing method, an apparatus, an electronic device, a storage medium, and a program product. The method comprises: in response to a request for three-dimensional reconstruction of a target scene, and for any one of a plurality of target cameras shooting the scene, screening, from a plurality of camera metadata corresponding to that target camera, a preset number of historical camera metadata whose shooting moments lie before a target time point and are closest to the target time point; and determining, according to the historical camera metadata corresponding to each target camera, predicted image frames of a plurality of view angles corresponding to the target time point and an alignment matrix corresponding to each predicted image frame. The predicted image frames of the plurality of view angles and their corresponding alignment matrices are used for three-dimensional reconstruction of the target scene corresponding to the target time point. Accordingly, the accuracy of the predicted image frames can be improved, spatial-alignment errors can be reduced, and the accuracy of the three-dimensional reconstruction of the target scene can be improved.
Inventors
- CHEN CHUNHAO
- ZHAO HANHAN
- LIANG BIN
- YU YAOLONG
- ZHANG RUI
Assignees
- Yongjiang Laboratory (甬江实验室)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-05
Claims (12)
- 1. A method of data processing, comprising: in response to a request for three-dimensional reconstruction of a target scene, for any one of a plurality of target cameras shooting the target scene, screening, from a plurality of camera metadata corresponding to that target camera, a preset number of historical camera metadata whose shooting moments lie before a target time point and are closest to the target time point, wherein each camera metadata comprises an image frame and a shooting moment; and determining, according to the historical camera metadata corresponding to each target camera, predicted image frames of a plurality of view angles corresponding to the target time point and an alignment matrix corresponding to each predicted image frame, wherein the predicted image frames of the plurality of view angles and their corresponding alignment matrices are used for three-dimensional reconstruction of the target scene corresponding to the target time point, and each view angle is the predicted shooting view angle of the corresponding target camera at the target time point.
- 2. The method of claim 1, wherein each camera metadata further comprises identification information of the camera capturing the image frame, and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the historical camera metadata corresponding to each target camera comprises: for any historical camera metadata, performing feature extraction on the image frame in the historical camera metadata to obtain a corresponding feature image; performing position coding on the camera identification information and the shooting moment in the historical camera metadata, respectively, to obtain an identification-information coding matrix and a shooting-moment coding matrix; and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the feature image, the identification-information coding matrix, and the shooting-moment coding matrix corresponding to each historical camera metadata.
- 3. The method of claim 2, wherein the dimensions of the identification-information coding matrix and the shooting-moment coding matrix are the same as those of the feature image, and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the feature image, the identification-information coding matrix, and the shooting-moment coding matrix corresponding to each historical camera metadata comprises: for any historical camera metadata, summing the identification-information coding matrix and the shooting-moment coding matrix corresponding to the historical camera metadata to obtain a matrix sum; and inputting the feature image and matrix sum corresponding to each historical camera metadata into a first time-sequence model to determine the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame.
- 4. The method of claim 1, wherein the three-dimensional reconstruction of the target scene corresponding to the target time point comprises: for each predicted image frame, inputting the predicted image frame into a depth estimation algorithm to obtain a corresponding depth map, wherein the depth map indicates the depth corresponding to each pixel in the predicted image frame; generating a point cloud corresponding to the target scene according to the alignment matrix and the depth map corresponding to each predicted image frame; generating a three-dimensional model according to the point cloud corresponding to the target scene; and mapping the predicted image frames of the plurality of view angles corresponding to the target time point onto the three-dimensional model to add textures and details to the three-dimensional model.
- 5. The method of claim 1, wherein responding to the request for three-dimensional reconstruction of the target scene comprises: acquiring request information for target scene data sent by a client, wherein the request information indicates identification information of each of the plurality of target cameras and a time range corresponding to the target scene data; screening, according to the identification information of each target camera, the camera metadata corresponding to each target camera from camera metadata corresponding to a plurality of cameras; and performing the screening for each target time point within the time range and for any one of the plurality of target cameras shooting the target scene; and the method further comprises: sending the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame to the client, so that the client performs three-dimensional reconstruction of the target scene corresponding to the target time point.
- 6. The method of claim 5, wherein sending the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame to the client comprises: splicing the predicted image frames of the plurality of view angles corresponding to the target time point into a target image frame; combining the alignment matrices corresponding to the predicted image frames into a target matrix; and packing and compressing the target image frame and the target matrix corresponding to the target time point, and sending the compressed data packet to the client.
- 7. The method of claim 6, further comprising: acquiring a network state at the current moment sent by the client; and determining a compression ratio for the current period according to the network state at the current moment and a plurality of historical network states; correspondingly, packing and compressing the target image frame and the target matrix corresponding to the target time point and sending the compressed data packet to the client comprises: packing the target image frame and the target matrix corresponding to the target time point, compressing them according to the compression ratio of the current period, and sending the compressed data packet to the client.
- 8. The method of claim 7, wherein determining the compression ratio for the current period according to the network state at the current moment and the plurality of historical network states comprises: screening, from the plurality of historical network states, a preset number of target network states closest to the current moment; and inputting the network state at the current moment, the target network states, and the compression ratio of the previous period into a second time-sequence model to determine the compression ratio of the current period.
- 9. A data processing apparatus, comprising: a screening module, configured to, in response to a request for three-dimensional reconstruction of a target scene, for any one of a plurality of target cameras shooting the target scene, screen, from a plurality of camera metadata corresponding to that target camera, a preset number of historical camera metadata whose shooting moments lie before a target time point and are closest to the target time point, wherein each camera metadata comprises an image frame and a shooting moment; and a determining module, configured to determine, according to the historical camera metadata corresponding to each target camera, predicted image frames of a plurality of view angles corresponding to the target time point and an alignment matrix corresponding to each predicted image frame, wherein the predicted image frames of the plurality of view angles and their corresponding alignment matrices are used for three-dimensional reconstruction of the target scene corresponding to the target time point, and each view angle is the predicted shooting view angle of the corresponding target camera at the target time point.
- 10. An electronic device, comprising a processor and a memory communicatively coupled to the processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 8.
- 11. A computer-readable storage medium having computer-executable instructions stored therein which, when executed by a processor, implement the method of any one of claims 1 to 8.
- 12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
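Claims 2 and 3 encode the camera's identity and the shooting moment into matrices that share the feature image's dimensions, then sum them before the first time-sequence model consumes them. The patent does not specify the encoding; the sketch below assumes a Transformer-style sinusoidal positional encoding tiled to the feature map's shape, and all names are illustrative, not from the patent:

```python
import numpy as np

def positional_encoding(value, rows, cols):
    """Sinusoidal position code for a scalar (a camera index or a shooting
    moment), tiled to the feature image's (rows, cols) dimensions."""
    i = np.arange(cols)
    freq = 1.0 / (10000.0 ** (2 * (i // 2) / cols))
    row = np.where(i % 2 == 0, np.sin(value * freq), np.cos(value * freq))
    return np.tile(row, (rows, 1))

# Claim 3's "matrix sum": identity code plus shooting-moment code,
# both shaped like a hypothetical 4x8 feature image.
rows, cols = 4, 8
matrix_sum = positional_encoding(2, rows, cols) + positional_encoding(0.25, rows, cols)
```

Because both encodings match the feature image's dimensions, the matrix sum can be combined element-wise with the feature image before entering the temporal model, which is what the shared-dimension requirement in claim 3 enables.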
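Claim 4's reconstruction step turns each predicted frame's depth map into a point cloud by back-projecting pixels through the frame's alignment matrix. A minimal NumPy sketch, assuming a pinhole intrinsic matrix `K` and a 4x4 camera-to-world alignment matrix (both names are assumptions; the patent names neither):

```python
import numpy as np

def depth_to_points(depth, K, alignment):
    """Back-project a depth map into a world-space point cloud using
    camera intrinsics K and the frame's 4x4 alignment matrix."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3)
    cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)  # camera space
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])     # homogeneous
    return (alignment @ cam_h).T[:, :3]                      # world space
```

Merging the per-view point clouds, meshing them into a model, and projecting the predicted frames back onto it as texture would complete the pipeline the claim describes.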
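Claims 6 and 7 splice the per-view predicted frames into one target image frame, combine the alignment matrices into one target matrix, and pack and compress the pair before transmission. The patent does not name a serialization or compression scheme; this sketch uses Python's standard `pickle` and `zlib`, with `level` standing in for the period's compression ratio:

```python
import pickle
import zlib
import numpy as np

def pack_and_compress(frames, matrices, level=6):
    """Splice per-view frames side by side into a target image frame,
    combine the alignment matrices into one target matrix, then pack
    and compress the packet for the client."""
    target_image = np.concatenate(frames, axis=1)   # horizontal splice
    target_matrix = np.stack(matrices)              # one combined tensor
    payload = pickle.dumps({"image": target_image, "matrices": target_matrix})
    return zlib.compress(payload, level)
```

Sending one spliced frame plus one combined matrix per target time point keeps the per-period payload a single packet, which is what makes the per-period compression ratio of claim 7 straightforward to apply.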
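Claim 8 screens the historical network states closest to the current moment and feeds them, with the current state and the previous period's ratio, into a second time-sequence model. The model itself is unspecified; the sketch below substitutes a trivial moving-average heuristic purely to illustrate the data flow (the bandwidth-to-ratio mapping and all field names are invented for illustration):

```python
def current_compression_ratio(current_state, history, prev_ratio, preset_count=3):
    """Blend the current network state with the preset_count most recent
    historical states; a moving-average stand-in for the patent's second
    time-sequence model. States carry a bandwidth estimate; higher
    bandwidth maps to a lower compression ratio (illustrative only)."""
    recent = sorted(history, key=lambda s: s["time"], reverse=True)[:preset_count]
    bandwidths = [s["bandwidth"] for s in recent] + [current_state["bandwidth"]]
    avg_bw = sum(bandwidths) / len(bandwidths)
    target = min(0.9, max(0.1, 10.0 / avg_bw))  # clamp to a sane range
    return 0.5 * prev_ratio + 0.5 * target       # smooth against last period
```

Smoothing against the previous period's ratio mirrors the claim's use of the prior ratio as a model input, avoiding abrupt quality swings when the network state jitters.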
Description
Data processing method, apparatus, electronic device, storage medium, and program product

Technical Field

The present application relates to the field of data processing technologies, and in particular to a data processing method, apparatus, electronic device, storage medium, and program product.

Background

With the rapid development of computer vision technology, multi-view image alignment has become a key technology, especially in fields such as autonomous driving, intelligent monitoring, and virtual reality. Multi-view image alignment includes temporal alignment and spatial alignment. Temporal alignment means ensuring that images from different view angles are synchronized in time. For example, in an autonomous driving or monitoring system a target object may be moving; ideally, the images of the multiple view angles are captured at the same time point, but in practice the capture moments of the different views may deviate by tens of milliseconds. When the shooting camera or the shot object is moving quickly, a deviation of tens of milliseconds produces large differences between the captured images. If spatial alignment is performed on images with such time deviations, the spatial-alignment error is large and motion blur or ghosting occurs, so the synthesized image or three-dimensional model is unclear, affecting the visual effect and subsequent analysis.

Disclosure of Invention

The application provides a data processing method, apparatus, electronic device, storage medium, and program product, which are used to improve the accuracy of three-dimensional reconstruction of a target scene.
In a first aspect, an embodiment of the present application provides a data processing method, including: in response to a request for three-dimensional reconstruction of a target scene, for any one of a plurality of target cameras shooting the target scene, screening, from a plurality of camera metadata corresponding to that target camera, a preset number of historical camera metadata whose shooting moments lie before a target time point and are closest to the target time point, wherein each camera metadata comprises an image frame and a shooting moment; and determining, according to the historical camera metadata corresponding to each target camera, predicted image frames of a plurality of view angles corresponding to the target time point and an alignment matrix corresponding to each predicted image frame, wherein the predicted image frames of the plurality of view angles and their corresponding alignment matrices are used for three-dimensional reconstruction of the target scene corresponding to the target time point, and each view angle is the predicted shooting view angle of the corresponding target camera at the target time point.
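The screening step of the first aspect, selecting the preset number of historical metadata shot before and closest to the target time point, can be sketched as follows. This is a minimal illustration; the record layout and all names are assumptions, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class CameraMetadata:
    camera_id: str
    timestamp: float  # shooting moment, in seconds
    frame: object     # the image frame (e.g. an ndarray)

def screen_history(metadata, target_time, preset_count):
    """Keep only metadata shot before target_time, then take the
    preset_count entries closest to it (the first aspect's screening)."""
    earlier = [m for m in metadata if m.timestamp < target_time]
    earlier.sort(key=lambda m: target_time - m.timestamp)  # closest first
    return earlier[:preset_count]
```

Running this once per target camera yields the per-camera history that the prediction step consumes.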
Optionally, each camera metadata further includes identification information of the camera capturing the image frame, and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the historical camera metadata corresponding to each target camera includes: for any historical camera metadata, performing feature extraction on the image frame in the historical camera metadata to obtain a corresponding feature image; performing position coding on the camera identification information and the shooting moment in the historical camera metadata, respectively, to obtain an identification-information coding matrix and a shooting-moment coding matrix; and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the feature image, the identification-information coding matrix, and the shooting-moment coding matrix corresponding to each historical camera metadata. Optionally, the dimensions of the identification-information coding matrix and the shooting-moment coding matrix are the same as those of the feature image, and determining the predicted image frames of the plurality of view angles corresponding to the target time point and the alignment matrix corresponding to each predicted image frame according to the feature image, the identification-information coding matrix, and the shooting-moment coding matrix corresponding to each historical camera metadata includes: for any historical camera metadata, summing the identification-information coding matrix and the shooting-moment coding matrix corresponding to the historical camera metadata to obtain a matrix sum; and inputting the feature image and matrix sum corresponding to each historical camera metadata into a first time-sequence model, and