CN-121982190-A - Method for fusing monitoring video and 3D virtual scene
Abstract
The invention relates to a method for fusing a monitoring video with a 3D virtual scene, and belongs to the technical field of image processing and computer vision. The method first constructs a three-dimensional digital scene and completes spatial calibration of the camera; it then detects and corrects camera offset through real-time video analysis, and identifies and tracks targets in the picture. The 2D pixel coordinates of each target are converted into spatial coordinates in the 3D scene through back-projection calculation, with positioning accuracy improved by a multi-view intersection technique. Finally, based on the converted 3D coordinate data, the video picture is fusion-rendered in a rendering engine, and an interactive three-dimensional scene is output for situation awareness and behavior analysis. The invention effectively solves the problems of inaccurate registration between virtual and real space, accumulation of dynamic errors, and unnatural rendering in the prior art, and realizes high-precision, dynamic fusion of the monitoring video and the virtual scene.
Inventors
- XUE CHAO
- TANG BO
- SONG ZHENDONG
- CHEN LEI
- WANG XING
- ZHAO JINBIAO
Assignees
- 天津天地伟业信息系统集成有限公司 (Tianjin Tiandy Information Systems Integration Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-11-15
Claims (8)
- 1. A method for fusing a monitoring video and a 3D virtual scene, characterized by comprising the following steps: S1, constructing a three-dimensional digital scene, namely acquiring or creating a 3D scene file of a target area; S2, spatially calibrating the camera, namely determining the position and attitude parameters of the camera in the 3D scene constructed in step S1; S3, real-time video analysis and target tracking, namely processing a real-time video stream, detecting whether the camera has shifted and correcting it, identifying and tracking dynamic targets in the picture, and outputting the 2D pixel coordinates and a unique ID of each dynamic target; S4, converting 2D to 3D coordinates, namely converting the 2D target coordinates obtained in step S3 into spatial coordinates in the 3D scene through back-projection mapping based on the calibration parameters of step S2, and optimizing the coordinates by combining multi-view data; S5, virtual-real fusion rendering, namely combining the 3D coordinate data obtained in step S4 with the 3D scene of step S1 in a rendering engine, and rendering the video picture or target to generate a fused three-dimensional scene; and S6, outputting the fused scene generated in step S5, supporting three-dimensional situation awareness, behavior analysis, and human-machine interaction for the user.
- 2. The method according to claim 1, wherein in step S1, a 3D scene file containing a geometric mesh and texture maps is constructed by means of lidar scanning, oblique photogrammetry, BIM model import, or manual modeling, and the file is imported into a rendering engine.
- 3. The method according to claim 1, wherein step S2 includes a manual calibration mode or an automatic calibration mode: in the manual calibration mode, at least 4 pairs of corresponding feature points are selected from the 3D scene and the real video picture respectively, and the extrinsic and intrinsic parameters of the camera are solved through a PnP algorithm (an illustrative sketch follows the claims); in the automatic calibration mode, feature points of the 3D scene and the video picture are automatically matched by a computer vision algorithm to calculate the parameters, or a positioning and orientation module is mounted when the camera is installed to directly obtain the spatial pose of the camera.
- 4. The method according to claim 1, characterized in that step S3 comprises the following sub-steps: S3.1, camera offset correction, namely comparing static scene features in the real-time video with the features recorded during calibration, detecting and correcting any camera offset, and updating the calibration parameters (see the offset-detection sketch after the claims); S3.2, target detection, namely identifying targets in the video picture with a deep learning model and outputting the 2D pixel coordinates and bounding box of each target; and S3.3, target tracking, namely assigning unique IDs to the detected targets with a multi-target tracking algorithm and forming cross-frame motion trajectories.
- 5. The method according to claim 1, characterized in that step S4 comprises the following sub-steps: S4.1, single-view back-projection mapping, namely, for a single camera, converting the 2D pixel coordinates of a target within its field of view into a ray in the 3D scene through back-projection calculation, and obtaining the preliminary 3D-space coordinates of the target by computing the intersection of the ray with the 3D scene geometry; and S4.2, multi-view intersection optimization, namely, when the same target is captured by two or more calibrated cameras, intersecting the rays from the different viewpoints through triangulation to obtain more accurate 3D spatial coordinates of the target (see the triangulation sketch after the claims).
- 6. The method according to claim 5, wherein the core calculation of the back-projection mapping and ray-triangle intersection in step S4.1 includes: defining a ray that starts from the camera's optical center and passes through the pixel point; computing the intersection of the ray with a specific triangle face in the 3D scene; and determining, from the barycentric coordinates and the ray parameter, the precise position P_w of the intersection point in the 3D world coordinate system (see the ray-triangle sketch after the claims).
- 7. The method according to claim 1, wherein in step S5, the rendering engine receives the real-time 3D coordinate data stream output in step S4 through an application programming interface, and draws and fuses the dynamic targets from the video into the 3D scene of step S1 at their 3D coordinates, thereby realizing synchronized rendering of the virtual and the real.
- 8. The method according to claim 1, wherein in step S6, the output fused scene is an interactive three-dimensional visual interface in which the user can perform viewpoint switching, target query, trajectory playback, and alarm management operations.
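The following Python sketches are illustrative aids, not part of the claims. First, a minimal sketch of the manual calibration mode of claim 3. It assumes OpenCV's solvePnP as the PnP solver (the patent does not name a library) and known camera intrinsics, since a standard PnP solver recovers the extrinsics given the intrinsics; all point and parameter values are hypothetical:

```python
# Minimal sketch of claim 3's manual calibration mode (assumptions: OpenCV's
# solvePnP as the PnP solver, known intrinsics; all values are hypothetical).
import numpy as np
import cv2

# At least 4 pairs of corresponding feature points: 3D points picked in the
# scene model (here, four coplanar corners of a wall) and their 2D pixel
# locations in the video frame.
object_points = np.array([[0.0, 0.0, 0.0],
                          [4.2, 0.0, 0.0],
                          [4.2, 0.0, 3.1],
                          [0.0, 0.0, 3.1]], dtype=np.float64)
image_points = np.array([[312.0, 410.0],
                         [980.0, 395.0],
                         [955.0, 120.0],
                         [290.0, 105.0]], dtype=np.float64)

# Placeholder intrinsic matrix (fx, fy, cx, cy) and zero lens distortion.
K = np.array([[1200.0, 0.0, 640.0],
              [0.0, 1200.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)            # rotation matrix (world -> camera)
cam_center = (-R.T @ tvec).ravel()    # camera optical center in world coords
print("camera position in the 3D scene:", cam_center)
```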
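Next, a sketch of the camera-offset check of sub-step S3.1 (claim 4). The claim only requires comparing static scene features in the live video against the calibration-time features; the ORB-plus-homography scheme below is one plausible realization, not the method prescribed by the patent:

```python
# One plausible realization of S3.1's offset detection (assumptions: ORB
# features and a RANSAC homography; the patent does not fix the algorithm).
import numpy as np
import cv2

def detect_offset(reference_gray, current_gray, shift_threshold_px=2.0):
    """Compare static scene features of the current frame against the
    calibration-time reference frame; return the estimated pixel shift,
    0.0 if within tolerance, or None if the check is inconclusive."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_ref, des_ref = orb.detectAndCompute(reference_gray, None)
    kp_cur, des_cur = orb.detectAndCompute(current_gray, None)
    if des_ref is None or des_cur is None:
        return None  # not enough texture to judge

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_ref, des_cur)
    if len(matches) < 8:
        return None

    src = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_cur[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return None

    # The translation component of the homography approximates the drift of
    # the static background; a large value triggers recalibration (re-run PnP).
    shift = float(np.hypot(H[0, 2], H[1, 2]))
    return shift if shift > shift_threshold_px else 0.0
```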
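The back-projection and ray-triangle intersection of claims 5 (S4.1) and 6 match the classic Möller–Trumbore formulation, which solves for the ray parameter t and the barycentric coordinates (u, v) and then recovers the world-space intersection point P_w. A self-contained numpy sketch follows; the pixel_to_ray helper is a hypothetical illustration built from the calibrated intrinsics K and extrinsics (R, camera center):

```python
# Möller–Trumbore ray-triangle intersection, matching claim 6's description:
# solve for the ray parameter t and barycentric coordinates (u, v), then
# recover the world-space intersection point P_w.
import numpy as np

def pixel_to_ray(px, py, K, R, cam_center):
    """Back-project pixel (px, py) to a world-space ray (claim 5, S4.1)."""
    d_cam = np.linalg.inv(K) @ np.array([px, py, 1.0])
    d_world = R.T @ d_cam                    # camera -> world rotation
    return cam_center, d_world / np.linalg.norm(d_world)

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Return (t, P_w) if the ray origin + t*direction hits triangle
    (v0, v1, v2) in front of the camera, else None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det        # ray parameter
    if t <= eps:                       # intersection behind the optical center
        return None
    return t, origin + t * direction   # P_w in world coordinates
```

In practice the ray would be tested against the scene mesh through a spatial acceleration structure (e.g. a BVH) rather than triangle by triangle; the function above shows only the per-triangle core that claim 6 describes.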
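Finally, a sketch of the multi-view intersection optimization of sub-step S4.2 (claim 5). Since rays back-projected from different cameras rarely intersect exactly, the least-squares point closest to all rays is a common choice and is assumed here; the patent itself says only "triangulation":

```python
# Multi-view intersection for S4.2 (assumption: least-squares point nearest
# to all back-projected rays; the patent only specifies "triangulation").
import numpy as np

def intersect_rays(origins, directions):
    """Given rays (origin o_i, unit direction d_i) from two or more calibrated
    cameras observing the same target, return the 3D point minimizing the sum
    of squared distances to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        # Distance of point x to a ray is |(I - d d^T)(x - o)|, so the normal
        # equations accumulate the projector onto the plane perpendicular to d.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Example: two cameras seeing the same target (hypothetical values).
target = intersect_rays(
    origins=[np.array([0.0, 0.0, 2.5]), np.array([10.0, 0.0, 2.5])],
    directions=[np.array([0.5, 1.0, -0.1]), np.array([-0.5, 1.0, -0.1])],
)
print("refined 3D coordinates:", target)
```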
Description
Method for fusing monitoring video and 3D virtual scene

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a method for fusing a monitoring video and a 3D virtual scene.

Background

Video monitoring technology originated in the public security field in the 1970s, initially centered on the analog closed-circuit television (CCTV) system, in which basic image acquisition and storage were realized through video cameras, video cassette recorders (VCRs), and other devices. Such systems relied on coaxial cable to transmit analog signals, offered only a single function, were easily interfered with, and could meet only simple security requirements. In the 1990s, digital technology drove monitoring into its second generation, the digital video monitoring era. Its landmark device, the Digital Video Recorder (DVR), integrates video recording, storage, remote control, and other functions, replacing the numerous devices of a traditional analog system and supporting long-duration, high-quality recording and multi-channel management. The maturing of digital encoding and decoding technologies (such as MPEG) significantly improved image compression efficiency, and the application of network transmission protocols enabled remote access, preliminarily breaking regional limitations. In the 21st century, networking and intelligence became the core trends. The third-generation, fully digital network video monitoring system is based on IP cameras, transmits digital signals directly over Ethernet or wireless networks, and combines embedded Web server technology to realize plug-and-play operation and large-scale networking. Intelligent analysis technology introduces computer vision algorithms, supports target detection, behavior recognition (such as boundary crossing and trajectory tracking), and abnormal-event early warning, and shifts monitoring from 'passive recording' to 'active defense'. Currently, video monitoring further integrates AI, Internet of Things, and big data technologies; features high definition (4K/8K), low delay, and multi-system integration (such as access control and alarms); is widely applied in scenes such as logistics loading and unloading and industrial production; and optimizes operational safety and efficiency through real-time monitoring and data analysis. In the future, with the popularization of 5G and edge computing, distributed intelligent monitoring will further enhance real-time performance and coordination capability.

Meanwhile, the technology of integrating video monitoring into virtual 3D scenes (commonly called video fusion or live-action three-dimensional fusion) is an important development direction in fields such as smart cities, security monitoring, and emergency management. The technology realizes a visual presentation combining the virtual and the real by spatially aligning and superposing a real-time video stream on a three-dimensional geographic information model (such as BIM, CIM, or oblique photography). However, the existing technology has the following problems. First, spatial registration and alignment precision (coordinate-system difficulties): 1. video surveillance equipment (especially common cameras) often lacks high-precision spatial positioning information (such as precise latitude and longitude, pitch angle, and yaw angle), making it difficult to accurately map its field of view into a three-dimensional scene; 2. the coordinate systems of different data sources (such as GPS, RTK, laser point clouds, and oblique photography) are inconsistent, requiring complicated conversion and calibration during fusion. Second, dynamic error accumulation: after a camera is installed, small displacements can occur due to environmental factors (such as wind, vibration, and temperature differences), invalidating the originally calibrated parameters and affecting the stability of long-term use. Third, distortion under non-ideal imaging conditions: wide-angle and fisheye lenses exhibit severe image distortion, and without effective correction the video picture cannot be accurately matched to the 3D model.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method for fusing a monitoring video and a 3D virtual scene, which opens up the whole pipeline from camera calibration, dynamic target tracking, and 2D-to-3D coordinate conversion to final fusion rendering, effectively solving the problems of inaccurate spatial registration, dynamic error accumulation, and unnatural rendering effects. The invention solves the technical problems by adopting the following technical scheme: a method for fusing a monitoring video and a 3D virtual scene comprises the following steps: S1, constructing a three-dimensional digital