CN-121861509-B - Multitasking unified target detection method based on task prompt guidance

CN 121861509 B

Abstract

The application relates to a multi-task unified target detection method based on task prompt guidance, which introduces a learnable task token: each input modality is assigned a unique task token as an identifier, enabling dynamic task adaptation without modifying the network architecture, so that visible-light, thermal-infrared, and visible-light/thermal-infrared bimodal target detection tasks can be handled simultaneously in a single model. Unlike traditional multi-task models, tasks are distinguished by input modality, and efficient cross-modal processing is achieved by integrating task token embedding with pixel-level fusion. The framework minimizes computational redundancy and resource waste, providing an efficient, flexible, and unified solution for multi-task target detection on unmanned aerial vehicles.

Inventors

  • Kuai Yangliu
  • Sun Jiazhi
  • Li Dongdong

Assignees

  • National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科技大学)

Dates

Publication Date
2026-05-12
Application Date
2025-08-05

Claims (7)

  1. A task-prompt-guided multi-task unified target detection method, characterized by comprising the following steps: obtaining an image to be processed, wherein the image to be processed is a visible-light image, a thermal-infrared image, or a visible-light and thermal-infrared bimodal image; inputting the image to be processed into a model input layer to obtain position-encoded features, wherein the model input layer preprocesses the image, performs a Patch Embedding operation, then applies positional encoding directly for a single-modality image, or performs weighted summation followed by positional encoding for a bimodal image; inputting the position-encoded features and a task token into a backbone network to obtain final encoded features, wherein the backbone network adopts the Vision Transformer used in the ViTDet detection algorithm, encodes the position-encoded image features with encoder layers based on a windowed multi-head attention mechanism, and encodes the concatenation of the corresponding image patch features and the task token with encoder layers based on a global multi-head attention mechanism, so that the final encoded features comprise both the image encoding features and the task token; the task token is a learnable embedding vector, each type of image to be processed corresponds to one target detection task, each target detection task is assigned its own task token, and all task tokens share the same dimension as the image patch embedding; respectively upsampling and downsampling the final encoded features to obtain four feature maps of different scales, and inputting the four feature maps into an FPN (Feature Pyramid Network) for multi-scale fusion to generate a feature pyramid; processing the feature pyramid with a region proposal network to obtain candidate regions; and extracting fixed-size features for the candidate regions from the feature pyramid via a RoI Align operation, and processing the extracted features with a Fast R-CNN detection head to obtain target classification and regression-predicted bounding boxes.
  2. The task-prompt-guided multi-task unified target detection method of claim 1, wherein inputting the image to be processed into the model input layer to obtain position-encoded features comprises: when the image to be processed is a visible-light image or a thermal-infrared image, normalizing the image to a fixed size, obtaining its representation through the Patch Embedding operation, and adding the positional encoding to that representation to obtain the position-encoded features; and when the image to be processed is a visible-light and thermal-infrared bimodal image, normalizing the image to a fixed size, obtaining the representations of the visible-light image and the thermal-infrared image through the Patch Embedding operation, performing a weighted summation of the two representations, and adding the positional encoding to obtain the position-encoded features.
  3. The task-prompt-guided multi-task unified target detection method of claim 1, wherein the backbone network comprises 12 encoder layers, each encoder layer comprising a multi-head attention block and a feed-forward network, wherein the multi-head attention blocks of the 4th, 8th, and 12th encoder layers employ a global multi-head attention mechanism and the multi-head attention blocks of the other encoder layers employ a windowed multi-head attention mechanism.
  4. The task-prompt-guided multi-task unified target detection method of claim 3, wherein inputting the position-encoded features and the task token into the backbone network comprises: inputting the position-encoded features into the backbone network and processing them through the first, second, and third encoder layers to obtain first encoded features; concatenating the first encoded features with the task token and inputting the result into the fourth encoder layer to obtain second encoded features; processing the image portion of the second encoded features through the fifth, sixth, and seventh encoder layers to obtain third encoded features; concatenating the third encoded features with the task-token portion of the second encoded features and inputting the result into the eighth encoder layer to obtain fourth encoded features; processing the image portion of the fourth encoded features through the ninth, tenth, and eleventh encoder layers to obtain fifth encoded features; and concatenating the fifth encoded features with the task-token portion of the fourth encoded features and inputting the result into the twelfth encoder layer to obtain the final encoded features, which comprise image features and a task feature.
  5. The task-prompt-guided multi-task unified target detection method of claim 1, wherein the Fast R-CNN detection head comprises two parallel branches: one branch predicts the class probability of a target through a fully connected layer and a softmax function, and the other branch predicts the exact position and size of the bounding box through regression.
  6. The task-prompt-guided multi-task unified target detection method of claim 1, wherein for visible-light images, thermal-infrared images, and visible-light and thermal-infrared bimodal images alike, the dimension of the features after concatenating the image patch features with the task token is (N+1)×D, wherein N represents the number of image patches and D represents the patch embedding dimension.
  7. The task-prompt-guided multi-task unified target detection method of claim 1, wherein the task-prompt-guided multi-task unified target detection model comprises a model input layer, a backbone network, an FPN, a region proposal network, RoI Align, and a Fast R-CNN detection head; the loss function of the model during training consists of two parts, an RPN loss and a detection head loss, optimized jointly across tasks; the RPN loss comprises a candidate box position prediction loss, which uses an L1 loss to optimize candidate box positions, and a candidate box classification loss, which uses a cross-entropy loss to optimize candidate box classification; and the detection head loss comprises a bounding box position prediction loss, which uses an L1 loss to optimize bounding box positions, and a bounding box classification loss, which uses a cross-entropy loss to optimize bounding box classification.
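The token routing described in claims 3 and 4 can be sketched as follows. This is a minimal illustration, assuming ViT-B dimensions (N = 196 patches, D = 768) and using identity functions in place of the real windowed/global attention blocks, so that only the splice-and-split data flow of the task token is shown:

```python
import numpy as np

N, D = 196, 768                 # number of image patches and embedding dim (assumed ViT-B values)
GLOBAL_LAYERS = {4, 8, 12}      # encoder layers with global attention (claim 3)

def encoder_layer(x):
    """Placeholder for a real windowed/global multi-head attention block;
    the identity is used here so only the token routing is illustrated."""
    return x

def backbone(patch_tokens, task_token):
    x = patch_tokens                                        # (N, D) image tokens
    for layer in range(1, 13):
        if layer in GLOBAL_LAYERS:
            # splice the (possibly updated) task token onto the image tokens (claim 4)
            x = encoder_layer(np.vstack([x, task_token]))   # (N+1, D)
            x, task_token = x[:N], x[N:]                    # split image part from task part
        else:
            x = encoder_layer(x)                            # windowed attention: image tokens only
    # final encoded features: image features plus the task feature
    return np.vstack([x, task_token])

rng = np.random.default_rng(0)
tokens = rng.standard_normal((N, D))
rgb_task_token = rng.standard_normal((1, D))   # one learnable token per task, same dim as patches
out = backbone(tokens, rgb_task_token)
print(out.shape)                               # (197, 768): N patch features plus one task token
```

In a real implementation the task token for the active modality would be selected from a small learnable embedding table before each forward pass; the rest of the network is untouched, which is what allows a single model to serve all three tasks.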

Description

Multitasking unified target detection method based on task prompt guidance

Technical Field

The invention belongs to the technical field of target detection and relates to a multi-task unified target detection method based on task prompt guidance.

Background

Object detection is a central task in computer vision: automatically identifying the class of specific objects (e.g., pedestrians, vehicles) in an image or video and determining their precise location, typically represented by a bounding box. In recent years, the rapid development of deep learning, especially the introduction of the Transformer architecture, has significantly improved performance in this field. Unmanned Aerial Vehicles (UAVs), with their excellent maneuverability and efficient data acquisition capability, have further promoted UAV-based target detection and provide a new solution for detection in complex environments. However, when a drone operates in low light or severe weather, a single-modality visible-light image often does not provide enough information to detect objects accurately. To overcome this limitation, target detection methods based on the fusion of visible light and thermal infrared have been developed. Multi-modal data fusion combines the complementary characteristics of visible-light and thermal-infrared images and markedly improves the accuracy and robustness of target detection in complex scenes. Currently, existing algorithms typically design separate models for the visible-light, thermal-infrared, and visible-light/thermal-infrared bimodal detection tasks, as shown in fig. 1. For example, a feature extraction module and a detector must be designed separately for the visible-light detection task, and likewise for the thermal-infrared and bimodal detection tasks.
This design has the following defects: 1) high parameter redundancy and low resource utilization, since each task requires training an independent model; 2) limited cross-modal learning, since a model trained on a single-task dataset can hardly exploit the relevance among tasks, so generalization across modalities is weak; and 3) a redundant training and validation workflow, since each task must still be trained and validated independently despite only slight differences in framework design.

Disclosure of Invention

Aiming at the problems of the traditional methods, the invention provides a multi-task unified target detection method based on task prompt guidance, which achieves dynamic task adaptation without modifying the network architecture and can simultaneously handle visible-light, thermal-infrared, and visible-light/thermal-infrared bimodal target detection tasks within a single model. To achieve the above object, the embodiments of the present invention adopt the following technical scheme. In one aspect, a task-prompt-guided multi-task unified target detection method is provided, the method comprising: obtaining an image to be processed, wherein the image to be processed is a visible-light image, a thermal-infrared image, or a visible-light and thermal-infrared bimodal image; and inputting the image to be processed into a model input layer to obtain position-encoded image features, wherein the model input layer preprocesses the image, performs a Patch Embedding operation, then applies positional encoding directly for a single-modality image, or performs weighted summation followed by positional encoding for a bimodal image.
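The input-layer behavior for the single-modality and bimodal cases can be sketched as below. This is a minimal illustration, assuming a 224×224 input, 16×16 patches, D = 768, a fixed fusion weight `alpha`, and random stand-ins for the learned projection and positional encoding (all of these would be learned parameters in the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

P, D = 16, 768                                          # patch size and embedding dim (assumed)
W_proj = rng.standard_normal((P * P * 3, D)) * 0.02     # stand-in for learned Patch Embedding weights
POS = rng.standard_normal(((224 // P) ** 2, D)) * 0.02  # stand-in for learned positional encoding

def patch_embed(img):
    """Split an HxWx3 image into PxP patches and project each to a D-dim vector."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)            # (N, P*P*C), N = number of patches
    return patches @ W_proj                             # (N, D)

def input_layer(rgb=None, tir=None, alpha=0.5):
    """Single-modality images are position-encoded directly; for the bimodal case the
    two patch embeddings are combined by weighted summation first. `alpha` is an
    illustrative fixed fusion weight (hypothetical; the patent does not fix its value)."""
    if rgb is not None and tir is not None:
        tokens = alpha * patch_embed(rgb) + (1 - alpha) * patch_embed(tir)
    else:
        tokens = patch_embed(rgb if rgb is not None else tir)
    return tokens + POS                                 # add positional encoding last

img_rgb = rng.random((224, 224, 3))
img_tir = rng.random((224, 224, 3))
print(input_layer(rgb=img_rgb).shape)                   # (196, 768): 14x14 patches
print(input_layer(rgb=img_rgb, tir=img_tir).shape)      # (196, 768): same shape after fusion
```

Because the weighted summation happens before positional encoding, both the single-modality and bimodal paths hand the backbone an identically shaped token sequence, which is what lets one network serve all three tasks.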
The position-encoded features and the task token are input into a backbone network to obtain final encoded features, wherein the backbone network adopts the Vision Transformer used in the ViTDet detection algorithm: encoder layers based on a windowed multi-head attention mechanism encode the position-encoded image features, and encoder layers based on a global multi-head attention mechanism encode the concatenation of the corresponding image patch features and the task token, yielding final encoded features that comprise both the image encoding features and the task token. The final encoded features are respectively upsampled and downsampled to obtain four feature maps of different scales, which are input into the FPN for multi-scale fusion to generate a feature pyramid. The feature pyramid is processed by a region proposal network (RPN) to obtain candidate regions. Fixed-size features are then extracted for the candidate regions from the feature pyramid via
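The up-/downsampling step that turns the single backbone output into four scales can be sketched as follows. This is a minimal illustration, assuming a stride-16 backbone feature on a 224×224 input and nearest-neighbour resampling (the actual model would use learned up-/downsampling layers, in the spirit of ViTDet's simple feature pyramid):

```python
import numpy as np

def resize_nn(x, scale):
    """Nearest-neighbour rescale of a (C, H, W) feature map: repeat pixels to
    upsample when scale > 1, take a strided subsample when scale < 1."""
    if scale >= 1:
        s = int(scale)
        return x.repeat(s, axis=1).repeat(s, axis=2)
    s = int(1 / scale)
    return x[:, ::s, ::s]

def four_scale_features(feat):
    """From the single stride-16 backbone output, build four maps at strides
    4, 8, 16, and 32 by upsampling (x4, x2) and downsampling (x1/2)."""
    return [resize_nn(feat, 4), resize_nn(feat, 2), feat, resize_nn(feat, 0.5)]

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 14, 14))   # (C, H/16, W/16) for a 224x224 input (assumed C=256)
for level in four_scale_features(feat):
    print(level.shape)                      # (256,56,56) (256,28,28) (256,14,14) (256,7,7)
```

The four maps would then be fed to the FPN for multi-scale fusion, and the resulting pyramid to the RPN and RoI Align as described above.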