
CN-122023780-A - Method, equipment and medium for detecting target visual element in image

CN 122023780 A

Abstract

The present disclosure provides a method, an apparatus, and a medium for detecting a target visual element in an image, and relates to the technical field of image processing. The method includes acquiring an original image to be detected and obtaining a detection result output by a pre-trained target visual element detection model for the original image. The target visual element detection model comprises a backbone network, a neck network, and a detection head connected in sequence. The backbone network comprises at least a plurality of global context feature enhancement modules, which aggregate global context information through a linear-complexity attention mechanism to suppress noise features in received image data; the backbone network performs at least noise suppression on the original image to obtain a plurality of first feature maps of different scales. The neck network performs feature fusion on the plurality of first feature maps of different scales to obtain a plurality of second feature maps of different scales, and the detection head generates a detection result corresponding to each second feature map.

Inventors

  • HAO MING
  • CHEN ZHE
  • LI JILONG
  • XU LINGYAN
  • SHI YUHAI

Assignees

  • 国家广播电视总局广播电视科学研究院 (Academy of Broadcasting Science, National Radio and Television Administration)

Dates

Publication Date
2026-05-12
Application Date
2026-04-02

Claims (10)

  1. A method for detecting a target visual element in an image, the method comprising: acquiring an original image to be detected; and obtaining a detection result output by a pre-trained target visual element detection model according to the original image; wherein the detection result comprises position information and categories of target visual elements in the original image; the target visual element detection model comprises a backbone network, a neck network, and a detection head connected in sequence; the backbone network comprises at least a plurality of global context feature enhancement modules, the global context feature enhancement modules being configured to aggregate global context information through a linear-complexity attention mechanism to suppress noise features in received image data; the backbone network is configured to perform at least noise suppression on the original image to obtain a plurality of first feature maps of different scales; the neck network is configured to perform feature fusion on the plurality of first feature maps of different scales to obtain a plurality of second feature maps of different scales; and the detection head is configured to generate a detection result corresponding to each second feature map.
  2. The method of claim 1, wherein the backbone network further comprises a plurality of dynamic receptive field attention capture modules, the plurality of global context feature enhancement modules and the plurality of dynamic receptive field attention capture modules are alternately connected in series, and the dynamic receptive field attention capture modules are configured to simulate receptive fields of different scales to dynamically accommodate element size variations in the image.
  3. The method of claim 1, wherein each global context feature enhancement module comprises a plurality of global enhancement blocks, a bottleneck layer in one-to-one correspondence with each global enhancement block, and a connection layer; the plurality of global enhancement blocks are sequentially connected in series, each bottleneck layer receives the output of its corresponding global enhancement block, the connection layer receives the output of each bottleneck layer and the input of the global context feature enhancement module, and the output of the connection layer is the output of the global context feature enhancement module; the activation function in each global enhancement block is a SiLU activation function, and the SiLU activation function is used to reduce the computational complexity to linear.
  4. The method of claim 2, wherein each dynamic receptive field attention capture module comprises a plurality of parallel dilated convolution branches with different dilation rates, a first pointwise convolution in parallel with the dilated convolution branches, a concatenation layer receiving the output of each dilated convolution branch, a second pointwise convolution receiving the output of the concatenation layer, and a summation module summing the output of the second pointwise convolution with the output of the first pointwise convolution; the output of the summation module is the output of the dynamic receptive field attention capture module, the input of each dilated convolution branch is the input of the dynamic receptive field attention capture module, and each dilated convolution branch comprises a parameter-free, energy-function-based attention block.
  5. The method of claim 1, wherein the neck network comprises a multi-scale visual semantic adaptive fusion module for performing weighted feature fusion on the plurality of first feature maps of different scales to obtain the plurality of second feature maps of different scales.
  6. The method of claim 1, wherein the pre-trained target visual element detection model is pre-trained by: obtaining a training sample set, wherein the training sample set comprises a plurality of groups of training samples, one training sample comprises a sample image and a sample label corresponding to the sample image, the sample image contains a target visual element, and the sample label comprises category and position information corresponding to the target visual element in the sample image; and training a target visual element initial detection model according to the training sample set and a preset boundary regression loss function based on a geometric convex hull constraint to obtain the trained target visual element detection model; wherein the boundary regression loss function based on the geometric convex hull constraint comprises a convex hull penalty term describing the void area, other than the prediction box and the ground-truth box, within the minimum enclosing convex hull wrapping the prediction box and the ground-truth box, and/or a diagonal alignment auxiliary term describing the distance between corner points on a diagonal of the prediction box and the corresponding corner points on the corresponding diagonal of the ground-truth box; the prediction box is a position description box corresponding to the position information of the target visual element detected in the sample image by the target visual element initial detection model, and the ground-truth box is a position description box corresponding to the position information of the target visual element in the sample label.
  7. The method of claim 6, wherein the geometric convex hull constraint-based boundary regression loss function is:

     L = 1 - I/U + α·(C - U)/C + (d₁² + d₂²)/(w² + h²)

     wherein L is the loss value; I is the area of the intersection of the ground-truth box and the prediction box; α is a penalty coefficient for adjusting the weights; C is the area of the minimum enclosing convex hull wrapping the prediction box and the ground-truth box; U is the area of the union of the ground-truth box and the prediction box; d₁ is the Euclidean distance between a preset upper corner point of the prediction box and the preset upper corner point of the ground-truth box; d₂ is the Euclidean distance between a preset lower corner point of the prediction box and the preset lower corner point of the ground-truth box, the preset upper corner point and the preset lower corner point being corner points on a diagonal; w is the width of the minimum enclosing convex hull; and h is the height of the minimum enclosing convex hull (a code sketch of this loss follows the claims).
  8. The method according to claim 1, wherein the method further comprises: mapping the position information and the category of the target visual element into the original image to generate an annotated image with a marker box; and/or, in the case where the detection result indicates that the original image contains a target visual element and the original image is a frame of a video to be reviewed, recording a time code of the original image.
  9. An electronic device, comprising a memory for storing computer instructions and a processor for invoking the computer instructions from the memory to perform the method of any one of claims 1-8.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
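For concreteness, the following is a minimal PyTorch sketch of the boundary regression loss as reconstructed in claim 7. It assumes axis-aligned (x1, y1, x2, y2) boxes, for which the minimum enclosing convex hull reduces to the enclosing rectangle, and it takes the upper-left and lower-right corners as the preset diagonal corner points; the function name and the default penalty coefficient are illustrative, not taken from the patent.

```python
import torch

def convex_hull_box_loss(pred, target, alpha=1.0, eps=1e-7):
    """Hedged sketch of the claim-7 loss: an IoU term, a convex-hull
    penalty, and a diagonal corner-alignment term. pred and target are
    (N, 4) tensors of (x1, y1, x2, y2) boxes; for axis-aligned boxes the
    minimum enclosing convex hull is simply the enclosing rectangle."""
    # Intersection area I
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = iw * ih

    # Union area U
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter

    # Enclosing rectangle: area C, width w, height h
    w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    hull = w * h

    # Squared Euclidean distances d1^2, d2^2 between the corner points on
    # one diagonal (upper-left and lower-right) of the two boxes
    d1_sq = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2_sq = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2

    iou = inter / (union + eps)
    hull_penalty = alpha * (hull - union) / (hull + eps)    # void area inside the hull
    diag_align = (d1_sq + d2_sq) / (w ** 2 + h ** 2 + eps)  # diagonal alignment term
    return (1.0 - iou + hull_penalty + diag_align).mean()
```

With the diagonal alignment term dropped and alpha fixed at 1, the expression reduces to the familiar GIoU loss, which is consistent with the claim's description of the convex hull penalty as the void area inside the minimum enclosing hull.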

Description

Method, equipment and medium for detecting target visual element in image

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a medium for detecting a target visual element in an image.

Background

With the rapid development of the broadcast television and network audiovisual industry, massive volumes of video need to be monitored and reviewed in real time. Identifying target visual elements in video frames is central to ensuring safe broadcasting and industry supervision; a target visual element is a specific visual object that needs to be identified or reviewed. Currently, when a video frame was captured in a low-light environment such as night, directly recognizing a target visual element in the frame is generally difficult.

To reduce this difficulty, a serial architecture of image enhancement followed by target detection is generally adopted. Specifically, an independent front-end image enhancement algorithm (such as a traditional Retinex-theory algorithm, or a deep-learning model such as a generative adversarial network or a low-light image enhancement network) is introduced to perform brightness enhancement, denoising, and contrast enhancement on the video frame, and the processed frame is then input into a general-purpose target detection model to identify the target visual elements in it. However, most existing front-end image enhancement algorithms are optimized for subjective human visual perception (for example, pursuing smoothness and uniform brightness) and aim to generate a visually clearer intermediate image by repairing the visual defects of the input image. As a result, such algorithms often misidentify the image features of tiny target visual elements as noise and erase them during repair, so tiny target visual elements in the video frame can no longer be recognized.

Disclosure of the Invention

It is an object of the present disclosure to provide a new solution for detecting target visual elements in an image.
According to a first aspect of the present disclosure, there is provided a method of detecting a target visual element in an image, the method comprising: acquiring an original image to be detected; and obtaining a detection result output by a pre-trained target visual element detection model according to the original image. The detection result comprises position information and categories of target visual elements in the original image. The target visual element detection model comprises a backbone network, a neck network, and a detection head connected in sequence. The backbone network comprises at least a plurality of global context feature enhancement modules, which aggregate global context information through a linear-complexity attention mechanism to suppress noise features in received image data; the backbone network performs at least noise suppression on the original image to obtain a plurality of first feature maps of different scales. The neck network performs feature fusion on the plurality of first feature maps of different scales to obtain a plurality of second feature maps of different scales, and the detection head generates a detection result corresponding to each second feature map.

Optionally, the backbone network further includes a plurality of dynamic receptive field attention capture modules; the global context feature enhancement modules and the dynamic receptive field attention capture modules are alternately connected in series, and the dynamic receptive field attention capture modules simulate receptive fields of different scales to dynamically adapt to element size changes in the image.

Optionally, each global context feature enhancement module includes a plurality of global enhancement blocks connected in series, a bottleneck layer in one-to-one correspondence with each global enhancement block, and a connection layer; each bottleneck layer receives the output of its corresponding global enhancement block, the connection layer receives the output of each bottleneck layer together with the input of the module, and the output of the connection layer is the output of the module. The activation function in each global enhancement block is a SiLU activation function, and the SiLU activation function is used to reduce the computational complexity to linear. A code sketch of these modules appears below.
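To make the two backbone modules concrete, here is a minimal PyTorch sketch under stated assumptions: the linear-complexity attention is stood in for by a standard kernel-based linear attention, the parameter-free energy-function attention block of claim 4 is assumed to follow the SimAM formulation, and all class names, channel counts, and dilation rates are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimAMAttention(nn.Module):
    """Parameter-free, energy-function-based attention (a SimAM-style
    stand-in for the attention block described in claim 4)."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2   # per-pixel deviation
        v = d.sum(dim=(2, 3), keepdim=True) / n           # per-channel variance
        energy = d / (4 * (v + self.eps)) + 0.5
        return x * torch.sigmoid(energy)                  # reweight pixels by energy

class GlobalEnhancementBlock(nn.Module):
    """Global enhancement block: kernel-based linear attention aggregates
    global context in time linear in H*W, followed by a SiLU projection."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, hgt, wid = x.shape
        q, k, v = self.qkv(x).flatten(2).transpose(1, 2).chunk(3, dim=2)  # (B, HW, C) each
        q, k = F.elu(q) + 1, F.elu(k) + 1                     # positive kernel feature maps
        kv = k.transpose(1, 2) @ v                            # (B, C, C) global context summary
        z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2)    # (B, HW, 1) normalizer
        out = ((q @ kv) / (z + 1e-6)).transpose(1, 2).reshape(b, c, hgt, wid)
        return F.silu(self.proj(out)) + x                     # residual keeps the block stable

class GlobalContextEnhancementModule(nn.Module):
    """Claim-3 layout: N serially connected global enhancement blocks, one 1x1
    bottleneck per block, and a connection layer that concatenates all
    bottleneck outputs with the module input."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(GlobalEnhancementBlock(channels) for _ in range(num_blocks))
        self.bottlenecks = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(num_blocks))
        self.fuse = nn.Conv2d(channels * (num_blocks + 1), channels, 1)  # connection layer

    def forward(self, x):
        outs, y = [x], x
        for blk, bn in zip(self.blocks, self.bottlenecks):
            y = blk(y)
            outs.append(bn(y))
        return self.fuse(torch.cat(outs, dim=1))

class DynamicReceptiveFieldModule(nn.Module):
    """Claim-4 layout: parallel 3x3 dilated-convolution branches with different
    dilation rates (each ending in SimAM attention), concatenated and fused by
    a pointwise convolution, then summed with a parallel pointwise shortcut."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.SiLU(),
                SimAMAttention(),
            )
            for d in dilations
        )
        self.pw_shortcut = nn.Conv2d(channels, channels, 1)               # first pointwise conv
        self.pw_fuse = nn.Conv2d(channels * len(dilations), channels, 1)  # second pointwise conv

    def forward(self, x):
        fused = self.pw_fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return fused + self.pw_shortcut(x)                                # summation module
```

Per claim 2, instances of GlobalContextEnhancementModule and DynamicReceptiveFieldModule would alternate in series along the backbone, with the neck network fusing the multi-scale first feature maps they produce.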