CN-121982283-A - RK3588-based lightweight multimodal infrared small target detection method


Abstract

The invention discloses a lightweight multimodal infrared small target detection method based on the RK3588, relating to the technical field of image processing. The method comprises: (1) constructing and training a two-layer collaborative detection framework consisting of the domestic open-source multimodal large model Qwen2.5-VL and the object detection expert model YOLO-V11; (2) designing a two-part prompt consisting of a generalized instruction and expert information; (3) configuring the PC environment required for Rockchip RK3588 model conversion and for development of the offline inference program; (4) converting the trained collaborative detection model into the offline model format specific to the RK3588 platform; (5) implementing image preprocessing, visual encoding, cross-modal fusion, text generation, and target parsing; and (6) completing environment configuration on the RK3588 edge device and deploying the offline model files and the inference program to the edge platform. The invention detects infrared small targets in real time and with high precision on a domestic edge chip, and offers advantages such as a high recognition rate, a low false alarm rate, and strong real-time performance.

Inventors

WANG LING
ZHENG CHENGWEN
YAN HE
CHEN BINBIN
JIN ZHIXIN
WEI QINGQING

Assignees

Nanjing University of Aeronautics and Astronautics (南京航空航天大学)

Dates

Publication Date: 20260505
Application Date: 20251226

Claims (7)

1. A lightweight multimodal infrared small target detection method based on RK3588, characterized by comprising the following steps:
Step 1, adopting the Qwen2.5-VL multimodal large model as a base and integrating the YOLO-V11 object detection expert model to construct a two-layer collaborative detection network architecture;
Step 2, designing a two-part prompt template composed of a generalized task instruction and expert auxiliary information, wherein the generalized task instruction comprises a task definition for infrared small target detection, the set of target classes to be detected, and the output format requirements for the absolute coordinates of target bounding boxes;
Step 3, based on the two-part prompt template designed in step 2, converting an infrared small target image dataset into an instruction fine-tuning dataset comprising image paths, generalized task instructions, expert auxiliary information, and structured answers; taking the instruction fine-tuning dataset as training samples, applying LoRA low-rank adaptation to perform instruction fine-tuning of the Qwen2.5-VL multimodal large model in the two-layer collaborative detection network architecture, and updating the weight parameters of the LoRA adapter until the Qwen2.5-VL multimodal large model converges, to obtain a trained collaborative detection model;
Step 4, performing lightweight processing on the trained collaborative detection model, and restructuring the lightweight collaborative detection model into (a) a visual perception and encoding offline model, adapted to the RK3588 edge device, that integrates the YOLO-V11 object detection expert model and the Qwen2.5-VL visual encoder module, and (b) a language decoder offline model of the Qwen2.5-VL multimodal large model, adapted to the RK3588 edge device;
Step 5, loading the visual perception and encoding offline model and the language decoder offline model on the RK3588 edge device to execute two-layer collaborative detection network inference: using the visual perception and encoding offline model to obtain, for an image in the infrared small target image dataset, a preliminary detection result and a visual encoding vector of the infrared small target; formatting the preliminary detection result into a complete text prompt according to the two-part prompt template of step 2; combining the complete text prompt with the visual encoding vector to construct a multimodal token sequence; inputting the multimodal token sequence into the language decoder offline model to generate structured text comprising the target category and the absolute coordinates of the target bounding box; and parsing the structured text to obtain the final infrared small target detection result.
2. The RK3588-based lightweight multimodal infrared small target detection method of claim 1, wherein the detection capability of YOLO-V11 is used to assist the semantic understanding of infrared images by Qwen2.5-VL.
3. The RK3588-based lightweight multimodal infrared small target detection method of claim 1, wherein in step 1, constructing the two-layer collaborative detection architecture specifically comprises:
Step 11, selecting the Qwen2.5-VL multimodal large model as the base, and selecting the YOLO-V11 object detection expert model;
Step 12, configuring the YOLO-V11 object detection expert model, loaded with pre-trained weights, as a front-end visual perception module that performs forward propagation on an input infrared image to extract candidate target regions and outputs a preliminary detection result comprising target categories and the absolute coordinates of target bounding boxes;
Step 13, configuring the Qwen2.5-VL multimodal large model as a back-end semantic reasoning module that fuses the visual features of the infrared small target image with the expert prior information provided by the preliminary detection result;
Step 14, transferring the domain detection experience of the YOLO-V11 object detection expert model to the Qwen2.5-VL multimodal large model through feature mapping alignment and a causal language modeling loss constraint.
4. The RK3588-based lightweight multimodal infrared small target detection method of claim 1, wherein in step 2, designing the two-part prompt template consisting of the generalized task instruction and the expert auxiliary information comprises:
Step 21, constructing the generalized task instruction: defining the task type of infrared small target detection in the instruction text, enumerating the preset set of infrared target classes to be detected, defining the element order of the bounding box coordinate array and the origin position of the image coordinate axes, and standardizing the output format of the Qwen2.5-VL multimodal large model;
Step 22, constructing the expert auxiliary information: establishing a guidance prompt template, formatting the target categories and the absolute coordinates of the target bounding boxes output by the YOLO-V11 object detection expert model into a text sequence, and embedding the text sequence into the guidance prompt template to form an explicit visual prompt for the Qwen2.5-VL multimodal large model;
Step 23, concatenating the generalized task instruction and the expert auxiliary information in a preset order to form a complete prompt input sequence that guides the Qwen2.5-VL multimodal large model to focus on the key detection regions in the infrared image.
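The prompt assembly of steps 21 to 23 can be sketched as follows. This is a minimal illustration, not the claimed template: the class names, the wording of the instruction, the JSON output convention, and every function name here are hypothetical, since the source does not give the prompt text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One preliminary detection from the expert detector (hypothetical type)."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in absolute pixels


# Hypothetical class set; the source does not enumerate the classes.
CLASSES = ["aircraft", "drone", "ship"]


def build_instruction(classes: List[str]) -> str:
    """Step 21: task definition, class set, coordinate order and origin."""
    return (
        "Task: detect infrared small targets in the image.\n"
        f"Target classes: {', '.join(classes)}.\n"
        'Output format: a JSON list of objects with keys "label" and "bbox", '
        "where bbox is [x1, y1, x2, y2] in absolute pixels and the origin is "
        "the top-left corner of the image."
    )


def build_expert_info(detections: List[Detection]) -> str:
    """Step 22: format the detector output as a text hint."""
    if not detections:
        return "Expert detector hints: no candidate targets were found."
    lines = [
        f"- {d.label} at [{d.box[0]}, {d.box[1]}, {d.box[2]}, {d.box[3]}]"
        for d in detections
    ]
    return "Expert detector hints:\n" + "\n".join(lines)


def build_prompt(classes: List[str], detections: List[Detection]) -> str:
    """Step 23: concatenate instruction and expert hints in a fixed order."""
    return build_instruction(classes) + "\n\n" + build_expert_info(detections)
```

Keeping the instruction part fixed and only regenerating the expert part per image means the same template can serve both dataset construction and edge inference.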
5. The RK3588-based lightweight multimodal infrared small target detection method of claim 1, wherein step 3 comprises:
Step 31, uniformly scaling the images in the infrared small target image dataset to a preset resolution, and converting the relative coordinate labels in YOLO format into an absolute coordinate format suitable for input to the Qwen2.5-VL multimodal large model;
Step 32, constructing the instruction fine-tuning dataset in JSONL format, wherein each sample comprises a user field and an assistant field; the user field encapsulates the image path, the generalized task instruction, and the expert auxiliary information, and the assistant field encapsulates the ground-truth annotation data, comprising target categories marked with special tokens and the absolute coordinates of target bounding boxes, to guide the Qwen2.5-VL multimodal large model to generate structured text that meets the format requirements;
Step 33, configuring the LoRA adapter: injecting low-rank adapters into the attention projection layers and the feed-forward network layers of the Qwen2.5-VL multimodal large model, setting the rank, scaling factor, and dropout rate of the adapters, and freezing the pre-trained weights of the Qwen2.5-VL multimodal large model and of the YOLO-V11 object detection expert model;
Step 34, inputting the instruction fine-tuning dataset into the Qwen2.5-VL multimodal large model in the two-layer collaborative detection network to perform mixed-precision training with a BF16 mixed-precision strategy, updating the parameters of the low-rank adapters by minimizing the causal language modeling loss until the Qwen2.5-VL multimodal large model converges, to obtain the trained LoRA adapter weight parameters; the LoRA adapter weights, the pre-trained weights of the Qwen2.5-VL multimodal large model, and the pre-trained weights of the YOLO-V11 object detection expert model together constitute the weights of the collaborative detection model.
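Steps 31 and 32 can be illustrated with a short sketch that converts a normalized YOLO label to absolute pixel corners and writes one JSONL sample. The key names inside the user field, the `<label>` and `<box>` markers standing in for the special tokens, and the function names are all assumptions made for illustration; the source names only the user and assistant fields.

```python
import json


def yolo_to_absolute(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x, center y, width, height,
    all in [0, 1]) to absolute pixel corners [x1, y1, x2, y2]."""
    x1 = round((cx - w / 2) * img_w)
    y1 = round((cy - h / 2) * img_h)
    x2 = round((cx + w / 2) * img_w)
    y2 = round((cy + h / 2) * img_h)

    def clamp(value, upper):
        # Keep coordinates inside the image after rounding.
        return max(0, min(upper, value))

    return [clamp(x1, img_w), clamp(y1, img_h), clamp(x2, img_w), clamp(y2, img_h)]


def make_sample(image_path, instruction, expert_info, targets):
    """Build one instruction fine-tuning sample.

    targets is a list of (label, [x1, y1, x2, y2]) ground-truth pairs.
    The <label>/<box> markers are placeholders for the special tokens.
    """
    answer = "".join(
        f"<label>{label}</label><box>{box}</box>" for label, box in targets
    )
    return {
        "user": {
            "image": image_path,
            "instruction": instruction,
            "expert_info": expert_info,
        },
        "assistant": answer,
    }


def to_jsonl(samples):
    """Serialize samples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)
```

For a 640x512 image, `yolo_to_absolute(0.5, 0.5, 0.1, 0.2, 640, 512)` returns `[288, 205, 352, 307]`.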
6. The RK3588-based lightweight multimodal infrared small target detection method of claim 5, wherein the lightweight processing in step 4 comprises:
Step 41, merging the LoRA adapter weight parameters obtained in the training stage with the pre-trained weight parameters of the Qwen2.5-VL multimodal large model, and saving the merged Qwen2.5-VL multimodal large model weights in FP16 half-precision format;
Step 42, jointly exporting the Qwen2.5-VL visual encoder module of the merged Qwen2.5-VL multimodal large model and the YOLO-V11 object detection expert model as a single ONNX general-format model, and using the RKNN-Toolkit toolchain to convert the ONNX general-format model into an RKNN offline model adapted to the RK3588 NPU hardware architecture, the RKNN offline model being the visual perception and encoding offline model that integrates the Qwen2.5-VL visual encoder module and the YOLO-V11 object detection expert model;
Step 43, using the RKLLM-Toolkit toolchain to configure a W8A8 quantization strategy for the language decoder module of the merged Qwen2.5-VL multimodal large model, compressing the weights and activation values of the language decoder module into an 8-bit fixed-point format, and generating a language decoder RKLLM offline model adapted to the RK3588 NPU hardware architecture.
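To make the W8A8 idea in step 43 concrete, the sketch below quantizes both weights and activations to 8-bit integers with a symmetric per-tensor scale and accumulates the product in integers. This is a generic textbook scheme chosen for illustration; the source does not describe the quantization algorithm that RKLLM-Toolkit applies, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the integer range [-127, 127].

    Returns the quantized integers and the scale such that
    value is approximately q * scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate real values."""
    return [q * scale for q in quantized]


def int8_dot(q_weights, w_scale, q_acts, a_scale):
    """Dot product with 8-bit weights and 8-bit activations (W8A8).

    The multiply-accumulate runs on integers; the two scales are applied
    once at the end to recover a real-valued result.
    """
    acc = sum(w * a for w, a in zip(q_weights, q_acts))
    return acc * w_scale * a_scale
```

Each element is reproduced to within half a quantization step, which is why outliers in a tensor (a large `max_abs`) cost precision for every other element.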
7. The RK3588-based lightweight multimodal infrared small target detection method of claim 1, wherein in step 5, loading the visual perception and encoding offline model and the language decoder offline model on the RK3588 edge device to execute two-layer collaborative detection network inference comprises:
Step 51, building the RK3588 edge inference main program: first initializing the RKNN and RKLLM runtime environments, and preprocessing the input infrared small target image to match the input dimension requirements of the visual perception and encoding offline model;
Step 52, calling the RKNN-Runtime interface to load and run the visual perception and encoding RKNN offline model, performing forward computation on the preprocessed infrared small target image, and simultaneously obtaining the preliminary detection result and the visual encoding vector of the infrared small target;
Step 53, formatting the preliminary detection result obtained in step 52 into a complete text prompt according to the two-part prompt template, and converting the text prompt into a token sequence with a tokenizer;
Step 54, linearly concatenating the visual encoding vector output in step 52 and the token sequence generated in step 53 along the sequence length dimension to construct a multimodal token sequence that includes the real-time expert detection data;
Step 55, calling the RKLLM-Runtime interface to load and run the language decoder RKLLM offline model, and performing autoregressive decoding on the multimodal token sequence to generate, token by token, a structured text stream comprising the target category and the absolute coordinates of the target bounding box;
Step 56, parsing the target category and the absolute coordinates of the target bounding box from the structured text stream, de-normalizing the absolute coordinate values and mapping them back to the coordinate system of the original infrared image, and outputting the final infrared small target detection result.
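The post-processing of step 56 can be sketched as below. The sketch assumes, for illustration only, that the decoder emits a JSON list of `{"label", "bbox"}` objects and that preprocessing was a plain resize to the model input size (no letterboxing); neither detail is stated in the source, and the function names are hypothetical.

```python
import json
import re


def parse_detections(text):
    """Extract (label, [x1, y1, x2, y2]) pairs from generated text.

    Looks for the outermost JSON list in the text and silently skips
    malformed entries, since generated text is not guaranteed to be valid.
    """
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    results = []
    for item in items:
        if not isinstance(item, dict):
            continue
        label, box = item.get("label"), item.get("bbox")
        if (
            isinstance(label, str)
            and isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
        ):
            results.append((label, [float(v) for v in box]))
    return results


def map_to_original(box, model_size, original_size):
    """Rescale a box from model-input pixels to original-image pixels.

    model_size and original_size are (width, height) tuples.
    """
    scale_x = original_size[0] / model_size[0]
    scale_y = original_size[1] / model_size[1]
    x1, y1, x2, y2 = box
    mapped = [x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y]
    limits = [original_size[0], original_size[1]] * 2
    # Round to whole pixels and clamp to the original image bounds.
    return [max(0, min(lim, round(v))) for v, lim in zip(mapped, limits)]
```

With a 448x448 model input and an 896x672 original image, a box `[100, 50, 110, 60]` maps to `[200, 75, 220, 90]`.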

Description

RK3588-based lightweight multimodal infrared small target detection method

Technical Field

The invention relates to the technical fields of image processing, artificial intelligence, and embedded systems, and in particular to a lightweight multimodal infrared small target detection method based on the RK3588.

Background

In the fields of national defense security and aerospace defense, real-time and accurate detection and identification of infrared small targets is a core link in situational awareness and threat response. By virtue of its strong operating capability, strong resistance to electromagnetic interference, good concealment, and long operating range, infrared imaging plays an irreplaceable role at the strategic and tactical levels, for example in missile early warning, space target monitoring, precision guidance, and border and coastal defense reconnaissance. However, high-frame-rate infrared small target detection faces serious challenges in practice. First, the target features are weak: a target usually occupies only tens of pixels in the image, has a low signal-to-noise ratio, and lacks distinct shape and texture information, so key information is difficult to extract effectively. Second, the background noise is complex and variable: infrared image backgrounds often contain complex scenes such as the ground, cloud layers, and buildings, and natural or man-made interference such as background clutter and thermal radiation further reduces target discriminability. Third, the real-time requirements are severe: the motion characteristics of high-speed maneuvering targets require detection, tracking, and identification in real time at extremely high frame rates, which places extreme demands on the computational efficiency of the algorithm. In response to these challenges, existing technical approaches fall into two main categories.
The first category comprises traditional algorithms based on hand-crafted feature extraction, including detection methods based on filtering, the human visual system, or image data structure, such as max-median filtering, two-dimensional least mean square filtering, local contrast mechanisms, and sparse representation. These methods have a small computational load and good real-time performance, but they depend on manually designed features, generalize poorly in complex dynamic scenes, struggle to suppress background interference, and are prone to high false alarm and missed detection rates. The second category comprises detection algorithms based on deep learning, such as single-stage detection models represented by the YOLO series and two-stage detection models represented by the R-CNN series. These methods extract deep semantic features automatically through neural networks and improve detection precision and robustness to some extent, but they still have limitations: when processing extremely small targets, the target features are easily lost as they propagate through the deeper layers of the network; and deep learning models depend on large-scale, high-quality annotated data, whereas samples in the infrared domain are scarce and costly to annotate, which restricts further improvement of model performance. In recent years, large models, by virtue of their huge parameter scale and generalized training on massive datasets, have acquired fully parameterized memory of information and the intelligent characteristics of self-learning, self-reasoning, and self-generation, which gives them the distinctive advantage of strong scene generalization in small target detection and recognition.
However, deploying large models on the edge side still faces significant bottlenecks. Mainstream international large models generally suffer from being technical black boxes and from heavy dependence on computing power, while the adoption of domestic large models on the edge side is limited by hardware resource constraints and by the challenges of adapting to domestic hardware. In edge computing scenarios, the inherently huge computational complexity and memory footprint of billion-parameter models are fundamentally at odds with the limited computing resources and memory bandwidth of domestic embedded chips. Therefore, how to achieve efficient embedded deployment of a lightweight domestic large model on a domestic AI chip platform through model quantization, operator optimization, and algorithm-chip co-design, while preserving detection precision and generalization capability, has become a core technical problem to be solved. The existing infrared small target detection technology has the defects of insufficient generalization capability in complex scenes, limited capability of traditional deep learning models to extract features of tiny targets, and the limitation that a large-scale multimodal model is difficult to