CN-121074953-B - Light-weight human body detection and ranging method and system suitable for stage lamp

Abstract

The invention discloses a light-weight human body detection and ranging method and system suitable for a stage lamp. They address the low detection precision of traditional human body detection methods, which are strongly affected by the environment and therefore prone to missed detections. The method comprises: obtaining left and right visible light images and a scene infrared image acquired by a binocular camera; correcting distortion of the left and right visible light images and the scene infrared image using camera calibration parameters; using an improved YOLO11n model to output a plurality of frame regression distributions from the corrected left visible light image and the corrected scene infrared image; generating left image candidate detection frames by combining the frame regression distributions with preset anchor frames; applying non-maximum suppression to obtain left image accurate detection frames, expanding them by pixels, and outputting a left image target region of interest; combining the left image target region of interest with the corrected right visible light image to generate a right image target region of interest; and determining the distance between the human body and the camera by a stereo matching algorithm.

Inventors

PENG YINGRU
Mei Jiaying
Request for anonymity
FANG YI
LI XIAOMIN
Request for anonymity

Assignees

Guangdong University of Technology (广东工业大学)
Guangzhou Haoyang Electronic Co., Ltd. (广州市浩洋电子股份有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2025-09-30

Claims (10)

1. A light-weight human body detection and ranging method suitable for a stage lamp, characterized by comprising the following steps: acquiring left and right visible light images and a scene infrared image captured by a binocular camera; performing distortion correction on the left and right visible light images and the scene infrared image based on camera calibration parameters of the binocular camera, and outputting corrected left and right visible light images and a corrected scene infrared image; generating a left image target region of interest from the corrected left visible light image of the corrected left and right visible light images and the corrected scene infrared image by using an improved YOLO11n model; generating a right image target region of interest from the left image target region of interest and the corrected right visible light image of the corrected left and right visible light images; and detecting the human body distance from the right image target region of interest and the left image target region of interest by using a stereo matching algorithm, and outputting the distance between the human body and the binocular camera; wherein the improved YOLO11n model comprises a backbone network based on adaptive multi-modal fusion, a neck network and a detection head; the backbone network based on adaptive multi-modal fusion comprises a convolution module, a bottleneck convolution module, a convolution module with parallel spatial attention, a multi-modal attention splicing module and a spatial pyramid pooling module; a target multi-modal fusion feature map comprises a first fusion feature map, a second fusion feature map and a third fusion feature map; and performing adaptive multi-modal fusion on the corrected left visible light image and the corrected scene infrared image by using the backbone network based on adaptive multi-modal fusion, and outputting the target multi-modal fusion feature map, comprises the following steps: performing a convolution operation on each of the corrected left visible light image and the corrected scene infrared image through two convolution modules connected in series, and outputting a first left image convolution feature map and a first scene convolution feature map; performing cross-stage convolution on each of the first left image convolution feature map and the first scene convolution feature map by using the bottleneck convolution module, and outputting a first left image cross-stage convolution feature map and a first scene cross-stage convolution feature map; inputting the first left image cross-stage convolution feature map and the first scene cross-stage convolution feature map into the multi-modal attention splicing module for mid-stage fusion, and outputting a first left image mid-stage fusion feature map; performing a convolution operation on the first left image mid-stage fusion feature map and the first scene cross-stage convolution feature map through the convolution module, and outputting a second left image convolution feature map and a second scene convolution feature map; and obtaining the target multi-modal fusion feature map based on the second left image convolution feature map and the second scene convolution feature map.
2. The light-weight human body detection and ranging method suitable for a stage lamp according to claim 1, wherein generating the left image target region of interest from the corrected left visible light image of the corrected left and right visible light images and the corrected scene infrared image by using the improved YOLO11n model comprises: performing adaptive multi-modal fusion on the corrected left visible light image and the corrected scene infrared image by using the backbone network based on adaptive multi-modal fusion, and outputting the target multi-modal fusion feature map; performing multi-scale feature fusion on the target multi-modal fusion feature map through the neck network to generate a multi-scale fusion feature map; taking the multi-scale fusion feature map as the input of the detection head, and outputting a plurality of frame regression distributions; generating a plurality of left image candidate detection frames from preset anchor frames and the plurality of frame regression distributions; screening the plurality of left image candidate detection frames by using a non-maximum suppression algorithm, outputting left image accurate detection frames, performing pixel expansion on the left image accurate detection frames, and outputting left image initial regions of interest; counting the left image initial regions of interest, and comparing their number with a preset region-of-interest number threshold; if the number of left image initial regions of interest is greater than the preset region-of-interest number threshold, calculating the distances between adjacent left image initial regions of interest, and comparing each distance with a preset pixel threshold; merging the adjacent left image initial regions of interest whose distance is smaller than the preset pixel threshold, and outputting the left image target region of interest; and if the number of left image initial regions of interest is less than or equal to the preset region-of-interest number threshold, taking the left image initial regions of interest as the left image target regions of interest.
3. The light-weight human body detection and ranging method suitable for a stage lamp according to claim 2, wherein performing adaptive multi-modal fusion on the corrected left visible light image and the corrected scene infrared image by using the backbone network based on adaptive multi-modal fusion, and outputting the target multi-modal fusion feature map, further comprises: performing cross-stage convolution on each of the second left image convolution feature map and the second scene convolution feature map by using the bottleneck convolution module, and outputting a second left image cross-stage convolution feature map and a second scene cross-stage convolution feature map; inputting the second left image cross-stage convolution feature map and the second scene cross-stage convolution feature map into the multi-modal attention splicing module for mid-stage fusion, outputting the first fusion feature map, performing a convolution operation on the first fusion feature map and the second scene cross-stage convolution feature map through the convolution module, and outputting a third left image convolution feature map and a third scene convolution feature map; performing cross-stage convolution on each of the third left image convolution feature map and the third scene convolution feature map by using the bottleneck convolution module, and outputting a third left image cross-stage convolution feature map and a third scene cross-stage convolution feature map; inputting the third left image cross-stage convolution feature map and the third scene cross-stage convolution feature map into the multi-modal attention splicing module for mid-stage fusion, outputting the second fusion feature map, performing a convolution operation on the second fusion feature map and the third scene cross-stage convolution feature map through the convolution module, and outputting a fourth left image convolution feature map and a fourth scene convolution feature map; performing cross-stage convolution on each of the fourth left image convolution feature map and the fourth scene convolution feature map by using the bottleneck convolution module, and outputting a fourth left image cross-stage convolution feature map and a fourth scene cross-stage convolution feature map; inputting the fourth left image cross-stage convolution feature map and the fourth scene cross-stage convolution feature map into the multi-modal attention splicing module for mid-stage fusion, and outputting a second left image mid-stage fusion feature map; pooling the second left image mid-stage fusion feature map by using the spatial pyramid pooling module, and outputting a pooled feature map; and taking the pooled feature map as the input of the convolution module with parallel spatial attention, and outputting the third fusion feature map.
4. The light-weight human body detection and ranging method suitable for a stage lamp according to claim 3, wherein the multi-modal attention splicing module comprises a convolution layer, a ReLU activation function layer and a Sigmoid activation function layer, and inputting the first left image cross-stage convolution feature map and the first scene cross-stage convolution feature map into the multi-modal attention splicing module for mid-stage fusion, and outputting the first left image mid-stage fusion feature map, comprises: splicing (concatenating) the first left image cross-stage convolution feature map and the first scene cross-stage convolution feature map, and outputting a spliced feature map; performing a convolution operation on the spliced feature map by using a convolution layer to generate a first convolution feature map; performing a nonlinear transformation on the first convolution feature map through the ReLU activation function layer, and outputting a transformed feature map; performing a convolution operation on the transformed feature map by using a convolution layer, and outputting a second convolution feature map; taking the second convolution feature map as the input of the Sigmoid activation function layer, and outputting a convolution feature probability map; and multiplying the convolution feature probability map by the first left image cross-stage convolution feature map to generate the first left image mid-stage fusion feature map.
5. The light-weight human body detection and ranging method suitable for a stage lamp according to claim 1, wherein detecting the human body distance from the right image target region of interest and the left image target region of interest by using the stereo matching algorithm, and outputting the distance between the human body and the binocular camera, comprises: extracting multi-scale fusion features from each of the right image target region of interest and the left image target region of interest to generate left image region-of-interest features and right image region-of-interest features; constructing a 3D cost volume from the left image region-of-interest features and the right image region-of-interest features; performing cost aggregation based on the left image region-of-interest features and the 3D cost volume to generate an aggregated cost volume; determining a multiplied cost volume based on the aggregated cost volume; performing disparity regression on the multiplied cost volume to generate initial region-of-interest disparity values; performing disparity refinement on the initial region-of-interest disparity values based on the left image target region of interest to generate disparity values of corresponding pixels in the left and right images; and calculating the distance between the human body and the binocular camera from the disparity values of the corresponding pixels in the left and right images.
6. The light-weight human body detection and ranging method suitable for a stage lamp according to claim 5, wherein the distance between the human body and the binocular camera is calculated as: $Z = \frac{f \cdot B}{x_l - x_r} = \frac{f \cdot B}{d}$; wherein $Z$ is the depth value of the human body, that is, the distance between the human body and the binocular camera; $f$ is the focal length of the binocular camera; $B$ is the baseline of the binocular camera; $x_l$ is the abscissa of the human body on the imaging plane of the left camera; $x_r$ is the abscissa of the human body on the imaging plane of the right camera; and $d$ is the disparity value of the corresponding pixel in the left and right images.
7. A light-weight human body detection and ranging system suitable for a stage lamp, applied to the light-weight human body detection and ranging method suitable for a stage lamp according to claim 1, characterized by comprising: an acquisition module, configured to acquire left and right visible light images and a scene infrared image captured by a binocular camera; a correction module, configured to perform distortion correction on the left and right visible light images and the scene infrared image based on camera calibration parameters of the binocular camera, and to output corrected left and right visible light images and a corrected scene infrared image; a left image region-of-interest generation module, configured to generate a left image target region of interest from the corrected left visible light image of the corrected left and right visible light images and the corrected scene infrared image by using an improved YOLO11n model; a right image region-of-interest generation module, configured to generate a right image target region of interest from the left image target region of interest and the corrected right visible light image of the corrected left and right visible light images; and a human body detection module, configured to detect the human body distance from the right image target region of interest and the left image target region of interest by using a stereo matching algorithm, and to output the distance between the human body and the binocular camera.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the light-weight human body detection and ranging method suitable for a stage lamp according to any one of claims 1 to 6.
9. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed, implements the light-weight human body detection and ranging method suitable for a stage lamp according to any one of claims 1 to 6.
10. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, wherein the program instructions, when executed by a computer, cause the computer to perform the light-weight human body detection and ranging method suitable for a stage lamp according to any one of claims 1 to 6.
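
The screening step of claim 2 relies on non-maximum suppression. As a minimal sketch, the Python below shows the common greedy variant over axis-aligned frames. The (x1, y1, x2, y2) frame layout, the separate score list, and the 0.5 overlap threshold are illustrative assumptions; the claims do not state which NMS variant or threshold is used.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit frames by descending score and keep a frame only
    if it does not overlap an already kept frame by more than the threshold.
    Returns the indices of the kept frames."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    confidences = [0.9, 0.8, 0.7]
    print(nms(candidates, confidences))
```

In this toy input the second frame overlaps the first heavily and is suppressed, while the distant third frame survives.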
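
The pixel expansion and merging logic of claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims do not define how the distance between adjacent regions is measured, so the sketch uses the horizontal gap between regions sorted by their left edge. The names `expand_box`, `merge_rois`, `margin`, `max_count` and `pixel_gap` are hypothetical.

```python
from typing import List, Tuple

Roi = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels


def expand_box(box: Roi, margin: int, width: int, height: int) -> Roi:
    """Grow a detection frame by `margin` pixels on every side,
    clamped to the image bounds, to obtain an initial region of interest."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width - 1, x2 + margin), min(height - 1, y2 + margin))


def merge_rois(rois: List[Roi], max_count: int, pixel_gap: int) -> List[Roi]:
    """If there are more regions than `max_count`, merge neighbours whose
    horizontal gap is below `pixel_gap`; otherwise return them unchanged."""
    if not rois or len(rois) <= max_count:
        return list(rois)
    ordered = sorted(rois, key=lambda r: r[0])  # left to right
    merged = [ordered[0]]
    for roi in ordered[1:]:
        last = merged[-1]
        gap = roi[0] - last[2]  # negative when the regions overlap
        if gap < pixel_gap:
            # Replace the last region with the union of both regions.
            merged[-1] = (min(last[0], roi[0]), min(last[1], roi[1]),
                          max(last[2], roi[2]), max(last[3], roi[3]))
        else:
            merged.append(roi)
    return merged


if __name__ == "__main__":
    rois = [(0, 0, 50, 100), (55, 0, 100, 100), (300, 0, 350, 100)]
    print(merge_rois(rois, max_count=2, pixel_gap=20))
```

Merging nearby regions reduces the number of crops passed to the stereo matching stage, which is presumably why the count threshold gates the merge.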
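
The multi-modal attention splicing module of claim 4 follows the sequence splice, convolution, ReLU, convolution, Sigmoid, multiply. The sketch below reproduces that sequence in plain Python on small nested lists. It assumes 1x1 convolutions without bias and a [row][col][channel] layout; the claims do not give kernel sizes or channel counts, and the weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
Weights = List[List[float]]           # indexed as [out_channel][in_channel]


def conv1x1(fmap: FeatureMap, weights: Weights) -> FeatureMap:
    """1x1 convolution: a per-pixel linear map over the channel vector."""
    return [[[sum(w * v for w, v in zip(w_row, px)) for w_row in weights]
             for px in row] for row in fmap]


def relu(fmap: FeatureMap) -> FeatureMap:
    return [[[max(0.0, v) for v in px] for px in row] for row in fmap]


def _sigmoid(v: float) -> float:
    # Numerically stable form that avoids overflow for large |v|.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def sigmoid(fmap: FeatureMap) -> FeatureMap:
    return [[[_sigmoid(v) for v in px] for px in row] for row in fmap]


def attention_splice(left: FeatureMap, scene: FeatureMap,
                     w1: Weights, w2: Weights) -> FeatureMap:
    """Gate the left-image features with a probability map computed
    from the spliced left-image and scene (infrared) features."""
    # Splice: concatenate the two maps along the channel axis.
    spliced = [[pl + ps for pl, ps in zip(rl, rs)] for rl, rs in zip(left, scene)]
    first = conv1x1(spliced, w1)        # first convolution feature map
    transformed = relu(first)           # transformed feature map
    second = conv1x1(transformed, w2)   # second convolution feature map
    prob = sigmoid(second)              # convolution feature probability map
    # Element-wise product with the left-image features.
    return [[[p * v for p, v in zip(pp, pl)] for pp, pl in zip(rp, rl)]
            for rp, rl in zip(prob, left)]


if __name__ == "__main__":
    left = [[[1.0, 2.0]]]    # 1x1 spatial size, 2 channels
    scene = [[[3.0, 4.0]]]
    w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # 4 -> 2 channels
    w2 = [[1.0, -1.0], [0.5, 0.5]]                     # 2 -> 2 channels
    print(attention_splice(left, scene, w1, w2))
```

Because the Sigmoid output lies strictly between 0 and 1, the module can only attenuate the left-image features, with the amount of attenuation driven by both modalities.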
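
The quantities listed in claim 6 (focal length, baseline, left and right abscissas, disparity) correspond to the standard triangulation relation for a rectified stereo pair, in which depth equals focal length times baseline divided by disparity. The sketch below applies that relation. The function name, the choice of pixels for the focal length and metres for the baseline, and the numeric values are illustrative assumptions.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         x_left: float, x_right: float) -> float:
    """Depth of a point seen at abscissa x_left in the left image and
    x_right in the right image of a rectified stereo pair.

    The disparity is d = x_left - x_right (in pixels); the depth is
    focal_px * baseline_m / d, in the same unit as the baseline.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        # Zero or negative disparity means no valid match (or a point at infinity).
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity


if __name__ == "__main__":
    # Hypothetical camera: 800 px focal length, 12 cm baseline.
    z = depth_from_disparity(focal_px=800.0, baseline_m=0.12,
                             x_left=420.0, x_right=388.0)
    print(f"estimated distance: {z:.2f} m")
```

With these hypothetical values the disparity is 32 pixels and the estimated distance is about 3 m. Since depth varies inversely with disparity, a one-pixel matching error matters far more for distant people than for nearby ones.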

Description

Light-weight human body detection and ranging method and system suitable for stage lamp

Technical Field

The invention relates to the technical field of computer vision and image processing, and in particular to a light-weight human body detection and ranging method and system suitable for a stage lamp.

Background

Stage lamps that use a laser lamp as their core component have become key equipment for creating immersive visual effects, owing to the high brightness and high directivity of laser light. However, the high power density of a laser lamp (especially one with a power of 300 mW or more) poses clear safety risks to the human body: a laser beam that directly strikes the eyes can burn retinal photoreceptor cells and cause temporary or permanent vision impairment; a beam that stays focused on the skin for a long time can cause local thermal burns; and the glare of a strong laser can interfere with the visual judgment of performers and staff, increasing the probability of stage accidents. The laser lamp must therefore strictly avoid areas occupied by people during operation, so that the projection range and the range of personnel movement do not intersect. This is the core requirement of safe stage operation.

The existing human body detection method detects human body heat signals with an infrared sensor, converts them into electrical signals, and compares the electrical signals with a preset human body infrared radiation intensity threshold to complete human body detection. However, a stage scene contains many interference sources whose infrared radiation characteristics resemble those of the human body. In addition, when the stage ambient temperature is high or low, the temperature difference between the human body and the environment shrinks and the intensity of the human body infrared signal received by the sensor drops sharply, so missed detections occur easily and the detection precision is low.

Disclosure of Invention

The invention provides a light-weight human body detection and ranging method and system suitable for a stage lamp, which address the low detection precision of traditional human body detection methods, which are strongly affected by the environment and therefore prone to missed detections.

The first aspect of the invention provides a light-weight human body detection and ranging method suitable for a stage lamp, which comprises the following steps: acquiring left and right visible light images and a scene infrared image captured by a binocular camera; performing distortion correction on the left and right visible light images and the scene infrared image based on camera calibration parameters of the binocular camera, and outputting corrected left and right visible light images and a corrected scene infrared image; generating a left image target region of interest from the corrected left visible light image of the corrected left and right visible light images and the corrected scene infrared image by using an improved YOLO11n model; generating a right image target region of interest from the left image target region of interest and the corrected right visible light image of the corrected left and right visible light images; and detecting the human body distance from the right image target region of interest and the left image target region of interest by using a stereo matching algorithm, and outputting the distance between the human body and the binocular camera.

Optionally, the improved YOLO11n model comprises a backbone network based on adaptive multi-modal fusion, a neck network and a detection head, and generating the left image target region of interest from the corrected left visible light image of the corrected left and right visible light images and the corrected scene infrared image by using the improved YOLO11n model comprises the following steps: performing adaptive multi-modal fusion on the corrected left visible light image and the corrected scene infrared image by using the backbone network based on adaptive multi-modal fusion, and outputting a target multi-modal fusion feature map; performing multi-scale feature fusion on the target multi-modal fusion feature map through the neck network to generate a multi-scale fusion feature map; taking the multi-scale fusion feature map as the input of the detection head, and outputting a plurality of frame regression distributions; generating a plurality of left image candidate detection frames from preset anchor frames and the plurality of frame regression distributions; screening the plurality of left image candidate detection frames by using a non-maximum suppression algorithm, outputting a left image accurate detection fra