CN-115908853-B - Target detection method, device, equipment and readable storage medium

Abstract

The application discloses a target detection method, apparatus, device and readable storage medium. The method comprises obtaining a video frame and a target detection model, wherein the target detection model comprises a Backbone module, a Rep-PAN Neck module and a Head module, and the Backbone module is an EFFICIENTREP BACKBONE module to which a DGMN module is added after the stem module. The Backbone module is used to predict a transformation matrix of a dynamic filter and an affinity relation of each feature node of the video frame; random walk sampling is performed based on the transformation matrix to extract detail features and key features; the Rep-PAN Neck module is used to fuse the detail features and the key features; and the Head module is used to generate a detection result corresponding to the video frame. The application can thereby extract more detailed features and the features needing particular attention, improving the accuracy and reliability of target detection by the target detection model.
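The four-stage pipeline in the abstract (stem downsampling, DGMN refinement into detail and key features, Rep-PAN fusion, Head prediction) can be sketched as below. This is a purely illustrative toy in Python: every function name, the 50/50 feature split, and the score threshold are assumptions of this sketch, not the patent's actual computations.

```python
# Toy sketch of the abstract's pipeline:
# stem downsampling -> DGMN split into detail/key features -> fusion -> head.

def stem_downsample(frame, factor=2):
    """Downsample a 2-D frame (list of rows) by striding rows and columns."""
    return [row[::factor] for row in frame[::factor]]

def dgmn_refine(feature_map):
    """Stand-in for DGMN message passing: split each node's value into a
    'detail' part and a 'key' (important) part."""
    detail = [[v * 0.5 for v in row] for row in feature_map]
    key = [[v - v * 0.5 for v in row] for row in feature_map]
    return detail, key

def rep_pan_fuse(detail, key):
    """Fuse detail and key features element-wise (the Neck's role)."""
    return [[d + k for d, k in zip(dr, kr)] for dr, kr in zip(detail, key)]

def head_predict(fused, threshold=4.0):
    """Emit (row, col, score) detections above a threshold, or an empty
    list (the 'empty set' detection result named in the claims)."""
    return [(i, j, v) for i, row in enumerate(fused)
            for j, v in enumerate(row) if v > threshold]

frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
fmap = stem_downsample(frame)          # [[1, 3], [9, 11]]
detail, key = dgmn_refine(fmap)
detections = head_predict(rep_pan_fuse(detail, key))
print(detections)                      # [(1, 0, 9.0), (1, 1, 11.0)]
```

The split-then-fuse shape mirrors the claim language only loosely; in the real model these stages operate on convolutional feature tensors, not scalar grids.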

Inventors

  • XIE ZHIQIANG
  • BU CHANGHAO
  • ZHANG JIDONG
  • ZHAO ZIYING
  • HUANG YUMING
  • DENG CHEN
  • DING JIAWEN
  • Zou Jingchen

Assignees

  • Tianyi Digital Life Technology Co., Ltd. (天翼数字生活科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2022-12-20

Claims (9)

  1. A target detection method, comprising: obtaining a video frame and a target detection model, wherein the target detection model comprises a Backbone module, a feature fusion Rep-PAN Neck module and a prediction Head module, and the Backbone module is a backbone feature extraction EFFICIENTREP BACKBONE module to which a dynamic graph message passing DGMN module is added after a stem module; predicting a transformation matrix of a dynamic filter and an affinity relation of each feature node of the video frame by using the Backbone module, and performing random walk sampling based on the transformation matrix to extract detail features and key features; fusing the detail features and the key features by using the Rep-PAN Neck module to obtain fusion data; and analyzing the fusion data by using the Head module to generate a detection result corresponding to the video frame, wherein the detection result is an empty set or a video frame marked with a region and a category where a target object is located, and the target object is an object needing to be focused on in a scene corresponding to the video frame; wherein predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node of the video frame by using the Backbone module comprises: downsampling the video frame by using the stem module to obtain a sampling feature map; and predicting, by using the DGMN module, the dynamic association between the feature nodes based on the sampling feature map, and jointly predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node based on the dynamic association between the feature nodes.
  2. The method according to claim 1, wherein the DGMN module includes a uniform walk sampler and a random walk sampler; and predicting, by the DGMN module, the dynamic association between the feature nodes based on the sampling feature map includes: performing feature node sampling on the sampling feature map based on the uniform walk sampler and the random walk sampler by using the DGMN module, and returning a node index; and predicting the dynamic association between the feature nodes based on a self-attention mechanism and the node index.
  3. The method according to claim 1, wherein obtaining the target detection model comprises: acquiring a plurality of training video frames and an initial target detection model under each scene; marking, in each training video frame, an area where a training object corresponding to the training video frame is located and a training class corresponding to the training object, to obtain training images corresponding to the initial target detection model, wherein a training image is a training video frame containing a marking label; and iteratively training the initial target detection model by using each training image to obtain a trained initial target detection model, wherein the trained initial target detection model is the latest target detection model.
  4. The method according to claim 3, wherein marking, in each training video frame, the area where the training object corresponding to the training video frame is located and the training class corresponding to the training object, to obtain the training images corresponding to the initial target detection model, comprises: determining, according to the scene corresponding to each training video frame, the training category to be focused on in the training video frame, and searching the training video frame for the training object corresponding to the training category; and marking, on each training video frame in which a training object exists, the training category of the training object and the area where the training object is located, to obtain a training image corresponding to the initial target detection model.
  5. The method according to claim 3, wherein iteratively training the initial target detection model by using each training image to obtain the trained initial target detection model comprises: dividing the training images proportionally to obtain a training set, a testing set and a verification set; and performing iterative training on the initial target detection model by using the training set, the testing set and the verification set to obtain the trained initial target detection model.
  6. The method according to claim 5, wherein performing iterative training on the initial target detection model by using the training set, the testing set and the verification set to obtain the trained initial target detection model comprises: sequentially performing iterative training on the initial target detection model by using the training images in the training set, recording the iteration count, and obtaining an initial target detection model after each round of iterative training; generating an identifier for each iteratively trained initial target detection model, wherein the identifiers correspond one-to-one to the iteratively trained initial target detection models; calculating the average precision of the initial target detection model after each round of iterative training by using the verification set, until the iteration count is greater than a preset first threshold or the average precision is greater than a preset second threshold, and taking the identifier of the most recently obtained iteratively trained initial target detection model as a target identifier; repeatedly returning to the step of sequentially performing iterative training on the initial target detection model by using the training images in the training set, so as to obtain a plurality of target identifiers; and selecting, by using the testing set, the iteratively trained initial target detection model with the highest average precision from among the iteratively trained initial target detection models corresponding to the target identifiers, as the trained initial target detection model.
  7. A target detection apparatus, comprising: an acquisition unit, configured to obtain a video frame and a target detection model, wherein the target detection model comprises a Backbone module, a feature fusion Rep-PAN Neck module and a prediction Head module, and the Backbone module is a backbone feature extraction EFFICIENTREP BACKBONE module to which a dynamic graph message passing DGMN module is added after a stem module; an extraction unit, configured to predict a transformation matrix of a dynamic filter and an affinity relation of each feature node of the video frame by using the Backbone module, and to perform random walk sampling based on the transformation matrix to extract detail features and key features; a fusion unit, configured to fuse the detail features and the key features by using the Rep-PAN Neck module to obtain fusion data; and an analysis unit, configured to analyze the fusion data by using the Head module and to generate a detection result corresponding to the video frame, wherein the detection result is an empty set or a video frame marked with a region and a category where a target object is located, and the target object is an object needing to be focused on in a scene corresponding to the video frame; wherein predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node of the video frame by using the Backbone module comprises: downsampling the video frame by using the stem module to obtain a sampling feature map; and predicting, by using the DGMN module, the dynamic association between the feature nodes based on the sampling feature map, and jointly predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node based on the dynamic association between the feature nodes.
  8. A target detection device, comprising a memory and a processor, wherein the memory is configured to store a program, and the processor is configured to execute the program to implement the steps of the target detection method according to any one of claims 1 to 6.
  9. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the target detection method according to any one of claims 1 to 6.
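The stopping-and-selection logic of claim 6 can be sketched as follows: keep iterating until either the iteration count exceeds a preset first threshold or the validation average precision (AP) exceeds a preset second threshold, record the identifier of the latest model at each stopping point, and finally pick the identifier whose model scores the highest AP on the test set. The `validate` and `test_ap` callables below are hypothetical stand-ins for real evaluators; no actual training occurs.

```python
# Hedged sketch of claim 6's training loop and final model selection.

def train_until_threshold(validate, max_iters, ap_target):
    """Iteratively 'train'; stop when the iteration count exceeds max_iters
    or the validation AP exceeds ap_target. Returns the identifier (here
    simply the iteration index) of the most recently trained model."""
    iters = 0
    while True:
        iters += 1
        if iters > max_iters or validate(iters) > ap_target:
            return iters

def select_best(identifiers, test_ap):
    """Among the recorded target identifiers, choose the model whose
    test-set AP is highest (claim 6's final selection step)."""
    return max(identifiers, key=test_ap)

# Toy run: validation AP grows by 0.1 per iteration, so the AP threshold
# (0.9) is first exceeded at iteration 10. Among three recorded
# identifiers, a fictitious test-set scorer prefers the one nearest 12.
ident = train_until_threshold(lambda i: i / 10, max_iters=100, ap_target=0.9)
best = select_best([ident, 12, 8], test_ap=lambda i: -abs(i - 12))
print(ident, best)  # 10 12
```

In the claim, the outer "repeatedly returning" step would rerun `train_until_threshold` several times to collect the list of target identifiers passed to `select_best`.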

Description

Target detection method, device, equipment and readable storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a target detection method, apparatus, device, and readable storage medium.

Background

Target detection is one of the most basic problems in the field of computer vision; its task is to find the target objects that need attention in an image or video and to determine their category and position. Because of this basic nature, target detection has a great many applications in fields such as driverless cars, medical imaging, urban security, identity verification, robot navigation, and motion analysis. In recent years, the YOLOv6 network has been proposed for target detection. The YOLOv6 network consists of a backbone feature extraction EFFICIENTREP BACKBONE module, a feature fusion Rep-PAN Neck module and a prediction Head module, and outperforms other algorithms of the same series in terms of accuracy and speed on the COCO dataset. However, although the YOLOv6 network is superior to other networks of the YOLO series in target detection, its feature-capturing capability is limited, so the accuracy of the YOLOv6 network, i.e., of the target detection network, when performing target detection is still not high enough.

Disclosure of Invention

In view of the above, the present application provides a target detection method, apparatus, device and readable storage medium for improving the accuracy of a target detection network when performing target detection.
In order to achieve the above object, the following solutions are proposed: A target detection method, comprising: obtaining a video frame and a target detection model, wherein the target detection model comprises a Backbone module, a feature fusion Rep-PAN Neck module and a prediction Head module, and the Backbone module is a backbone feature extraction EFFICIENTREP BACKBONE module to which a dynamic graph message passing DGMN module is added after a stem module; predicting a transformation matrix of a dynamic filter and an affinity relation of each feature node of the video frame by using the Backbone module, and performing random walk sampling based on the transformation matrix to extract detail features and key features; fusing the detail features and the key features by using the Rep-PAN Neck module to obtain fusion data; and analyzing the fusion data by using the Head module to generate a detection result corresponding to the video frame, wherein the detection result is an empty set or a video frame marked with the region and category where the target object is located, and the target object is an object needing to be focused on in a scene corresponding to the video frame. Optionally, predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node of the video frame by using the Backbone module includes: downsampling the video frame by using the stem module to obtain a sampling feature map; and predicting, by using the DGMN module, the dynamic association between the feature nodes based on the sampling feature map, and jointly predicting the transformation matrix of the dynamic filter and the affinity relation of each feature node based on the dynamic association between the feature nodes.
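The sampling-plus-attention step above can be sketched as follows: a uniform walk sampler and a random-walk sampler return node indices over a flattened feature map, and a softmax self-attention over the sampled neighbours yields a node's association weights. This is purely illustrative; the patent does not give these exact formulas, and the 1-D chain walk and dot-product scoring are assumptions of the sketch.

```python
import math
import random

def uniform_sample(num_nodes, stride):
    """Uniform walk sampler: every `stride`-th node index."""
    return list(range(0, num_nodes, stride))

def random_walk_sample(num_nodes, steps, seed=0):
    """Random walk over a 1-D chain of nodes, returning visited indices
    (a seeded toy stand-in for the patent's random walk sampler)."""
    rng = random.Random(seed)
    idx, visited = 0, [0]
    for _ in range(steps):
        idx = max(0, min(num_nodes - 1, idx + rng.choice([-1, 1])))
        visited.append(idx)
    return visited

def affinity(query, keys):
    """Softmax self-attention weights of one query node over sampled
    key nodes: the 'dynamic association' between feature nodes."""
    scores = [query * k for k in keys]
    m = max(scores)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

features = [i / 10 for i in range(8)]            # flattened toy feature map
idx = uniform_sample(len(features), 3)           # node indices [0, 3, 6]
weights = affinity(features[2], [features[i] for i in idx])
```

The weights sum to one, and larger feature products receive larger attention weights, which is the qualitative behaviour the DGMN affinity prediction relies on.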
Optionally, the DGMN module includes a uniform walk sampler and a random walk sampler; and predicting, by the DGMN module, the dynamic association between the feature nodes based on the sampling feature map includes: performing feature node sampling on the sampling feature map based on the uniform walk sampler and the random walk sampler by using the DGMN module, and returning a node index; and predicting the dynamic association between the feature nodes based on a self-attention mechanism and the node index. Optionally, acquiring the target detection model includes: acquiring a plurality of training video frames and an initial target detection model under each scene; marking, in each training video frame, an area where a training object corresponding to the training video frame is located and a training class corresponding to the training object, to obtain training images corresponding to the initial target detection model, wherein a training image is a training video frame containing a marking label; and iteratively training the initial target detection model by using each training image to obtain a trained initial target detection model, wherein the trained initial target detection model is the latest target detection model. Optionally, marking an area where a training object corresponding to the training video frame is located in each training video frame, and marking a training class corresponding to the training object, so as to obtain a