CN-121982673-A - Target instance detection method based on target scene discrimination consistency

Abstract

The invention provides a target instance detection method based on target scene discrimination consistency, relating to the field of cross-scene target detection. By using a target-scene model to evaluate source-scene samples, the method effectively screens out the source-scene samples with positive transfer value during training and keeps invalid or harmful samples out of model training. Combined with a teacher model that generates stable pseudo labels, it reduces noise interference during training, effectively suppresses negative transfer in cross-scene training, improves the model's detection accuracy and training stability in the target scene, and has good engineering application value.

Inventors

  • LIU ZHITONG
  • LI YUXI
  • JI DONG
  • WEI YANGJIE
  • LIU SONGRAN

Assignees

  • Northeastern University (东北大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (9)

  1. A target instance detection method based on target scene discrimination consistency, characterized by comprising the following steps: in an autonomous driving scene, collecting a plurality of target road scene images through a vehicle-mounted camera to form a target domain image set; completing detection-frame annotation for target instances of preset categories in the target road scene images, all the target road scene images together with the corresponding detection frames and category information forming a target domain annotated data set; training a target detection network on the target domain annotated data set to obtain initial parameters, initializing two target detection networks of identical structure with the initial parameters, and taking the two networks respectively as the initial student model and the initial teacher model; performing statistical analysis on the target road scene images in the target domain image set to obtain target scene statistical information and target feature statistical information; acquiring a plurality of source domain road scene images to form a source domain image set; calculating a total inconsistency score for each source domain road scene image in the source domain image set based on the teacher model, the target scene statistical information and the target feature statistical information; screening the source domain road scene images in the source domain image set according to the total inconsistency scores, constructing a source domain pseudo-label data set, and thereby determining a final teacher model and a final student model; and acquiring an image to be detected and inputting it into the final teacher model or the final student model to obtain the detection frames and category information of the target instances in the image to be detected.
  2. The target instance detection method based on target scene discrimination consistency according to claim 1, wherein performing statistical analysis on the target road scene images in the target domain image set to obtain target scene statistical information and target feature statistical information includes: performing pixel-level processing on each target road scene image in the target domain image set, and computing statistics of the pixel-level processed data over the target domain image set to obtain the target scene statistical information, which comprises at least a color statistical range, an illumination statistical range, a blur intensity distribution and a noise intensity distribution; and performing feature-level processing on each target road scene image in the target domain image set, and computing statistics of the feature-level processed data over the target domain image set to obtain the target feature statistical information.
  3. The target instance detection method based on target scene discrimination consistency according to claim 2, wherein performing pixel-level processing on each target road scene image in the target domain image set, and computing statistics of the pixel-level processed data over the target domain image set to obtain the target scene statistical information, includes: normalizing each pixel value of the target road scene image in the RGB color space to obtain a normalized image, and calculating the mean and standard deviation of the R, G and B channels over the spatial dimensions; computing quantile statistics of the RGB means over all target road scene images in the target domain image set to obtain a quantile statistical range of the RGB mean, and computing quantile statistics of the RGB variances over all target road scene images to obtain a quantile statistical range of the RGB variance, the two ranges together forming the color statistical range; converting the target road scene image from the RGB color space to the HSV color space to obtain an HSV color space image, taking the V channel of the HSV color space image as the brightness component, normalizing the brightness component, and calculating its mean and standard deviation over the spatial dimensions; computing quantile statistics of the brightness means over all target road scene images to obtain a quantile statistical range of the brightness mean, and computing quantile statistics of the brightness variances to obtain a quantile statistical range of the brightness variance, the two ranges together forming the illumination statistical range; converting the target road scene image into a grayscale map, applying the Laplacian operator to the grayscale map to obtain a response map, and calculating the variance of the response map as a blur index; computing quantile statistics of the blur indices over all target road scene images to obtain the blur intensity distribution; applying a small-scale Gaussian blur to the grayscale map and subtracting the blurred result to obtain a high-frequency residual; separately computing the horizontal and vertical gradient components of the grayscale map with the Sobel operator, further deriving the gradient magnitude, and taking the pixel region whose gradient magnitude is smaller than a preset gradient threshold as the flat region; within the flat region, calculating the standard deviation of the high-frequency residual as a noise-intensity proxy index; and computing quantile statistics of the proxy indices over all target road scene images to obtain the noise intensity distribution.
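The patent publishes no reference code; the following is a minimal NumPy-only sketch of the per-image pixel-level statistics described in claim 3 (RGB and brightness mean/standard deviation, a Laplacian-variance blur index, and a flat-region noise proxy). All function names, the 3x3 kernels and the gradient threshold are our own assumptions; quantile statistics would then be taken over these per-image values across the whole target domain image set.

```python
import numpy as np

def conv2(img, k):
    """'Same'-size 2-D cross-correlation with edge padding."""
    kh, kw = k.shape
    pad = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(img)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * pad[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def gaussian_blur3(img):
    """Small-scale 3x3 Gaussian blur."""
    g = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], float) / 16.0
    return conv2(img, g)

def pixel_level_stats(img_rgb, grad_thresh=0.1):
    """img_rgb: HxWx3 uint8 image. Returns the per-image statistics of claim 3."""
    x = img_rgb.astype(np.float64) / 255.0           # normalize to [0, 1]
    rgb_mean = x.mean(axis=(0, 1))                   # per-channel spatial mean
    rgb_std = x.std(axis=(0, 1))                     # per-channel spatial std

    # Brightness: the V channel of HSV is the per-pixel max over R, G, B.
    v = x.max(axis=2)
    v_mean, v_std = v.mean(), v.std()

    # Blur index: variance of the Laplacian response of the grayscale map.
    gray = x @ np.array([0.299, 0.587, 0.114])
    lap = conv2(gray, np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float))
    blur_index = lap.var()

    # Noise proxy: std of the high-frequency residual (gray minus a small
    # Gaussian blur), restricted to flat regions of low Sobel gradient magnitude.
    residual = gray - gaussian_blur3(gray)
    gx = conv2(gray, np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float))
    gy = conv2(gray, np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float))
    flat = np.hypot(gx, gy) < grad_thresh            # assumed threshold
    noise_proxy = residual[flat].std() if flat.any() else 0.0

    return rgb_mean, rgb_std, v_mean, v_std, blur_index, noise_proxy
```

The color range, illumination range, blur intensity distribution and noise intensity distribution of the claim would be obtained by applying, e.g., `np.quantile` to these six values collected over every image in the target domain image set.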
  4. The target instance detection method based on target scene discrimination consistency according to claim 2, wherein performing feature-level processing on each target road scene image in the target domain image set, and computing statistics of the feature-level processed data over the target domain image set to obtain the target feature statistical information, includes: inputting the target domain image set into the target detection network and obtaining, through the feature extraction backbone network and the feature fusion module, a first intermediate semantic feature of each target road scene image; and calculating the mean and standard deviation of the first intermediate semantic features over all target road scene images, the mean and the standard deviation together forming the target feature statistical information.
  5. The target instance detection method based on target scene discrimination consistency according to claim 1, wherein calculating the total inconsistency score of each source domain road scene image based on the teacher model, the target scene statistical information and the target feature statistical information, screening the source domain road scene images, constructing the source domain pseudo-label data set, and thereby determining the final teacher model and the final student model, comprises: setting an initial iteration number m = 0 as the current iteration number, taking the initial teacher model as the teacher model at the current iteration number and the initial student model as the student model at the current iteration number; constructing, with the teacher model at the current iteration number and according to the target scene statistical information and the target feature statistical information, a reference prediction result for each source domain road scene image and a target disturbance prediction result set comprising a plurality of disturbance prediction results; calculating the total inconsistency score of each source domain road scene image based on the reference prediction result and the target disturbance prediction result set; screening the source domain road scene images in the source domain image set according to their total inconsistency scores, constructing a source domain pseudo-label data set, sampling a source domain pseudo-label batch from the source domain pseudo-label data set, and sampling a target domain sample batch from the target domain annotated data set; updating the student model parameters and the teacher model parameters based on the two batches to obtain the updated student model and the updated teacher model; calculating the target domain performance index value of the updated student model on a validation set of the target domain annotated data set; setting an allowable performance fluctuation threshold and judging, against the target domain performance index value at the previous iteration number m-1, whether the forward migration determination condition is satisfied, the condition being satisfied when the current performance index value does not fall below the previous value by more than the threshold, and not satisfied otherwise; setting a termination condition and judging whether the current iteration satisfies it: A. if the termination condition is satisfied, taking the updated student model as the final student model and the updated teacher model as the final teacher model; B. if the termination condition is not satisfied but the forward migration determination condition is satisfied, incrementing the current iteration number by 1, taking the updated student model as the student model at the new iteration number and the updated teacher model as the teacher model at the new iteration number, and returning to the step of calculating the total inconsistency score of each source domain road scene image; C. if the termination condition is not satisfied and the forward migration determination condition is also not satisfied, incrementing the current iteration number by 1, taking the corresponding student model and teacher model as the student model and teacher model at the new iteration number, and returning to the step of calculating the total inconsistency score of each source domain road scene image; the termination condition being that the current iteration number reaches a preset maximum number of rounds, or that the forward migration determination condition has not been satisfied for a preset number of consecutive rounds.
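Claim 5's outer loop (train a round, check a forward-migration condition against the previous round's target-domain metric, stop at a round limit or after repeated failures) can be sketched as below. The hook functions, the `epsilon`/`patience` parameters and the exact acceptance rule are assumptions; this translation of the claim leaves the model-rollback behavior of branch C unspecified, so the sketch simply counts rejected rounds.

```python
def self_training_loop(train_step, evaluate, epsilon=0.01,
                       max_rounds=10, patience=3):
    """Outer loop of claim 5: run one training round, evaluate on the
    target-domain validation set, treat the round as forward-migrating if
    performance did not drop by more than `epsilon`, and stop either at
    `max_rounds` or after `patience` consecutive rejected rounds."""
    prev = evaluate()                 # metric of the initial student model
    fails = 0
    for m in range(max_rounds):
        train_step(m)                 # screen sources, pseudo-label, update
                                      # student and EMA teacher (claims 6-9)
        score = evaluate()
        if score >= prev - epsilon:   # forward-migration condition holds
            fails = 0
        else:                         # negative transfer this round
            fails += 1
            if fails >= patience:
                break
        prev = score
    return prev
```

In practice `train_step` would perform the inconsistency scoring, screening, and parameter updates of claims 6-9, and `evaluate` would compute a detection metric such as mAP on the held-out target-domain validation split.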
  6. The target instance detection method based on target scene discrimination consistency according to claim 5, wherein constructing, with the teacher model at the current iteration number and according to the target scene statistical information and the target feature statistical information, the reference prediction result of a source domain road scene image and the target disturbance prediction result set, comprises: inputting the source domain road scene image into the teacher model at the current iteration number to obtain a reference prediction result comprising, for each of its instances, a detection frame, a class prediction probability distribution and a confidence score; perturbing the source domain road scene image through the color statistical range and the illumination statistical range to obtain a color-illumination disturbance view, the color and brightness perturbation operator being parameterized by a target domain pixel statistical parameter set sampled from the color statistical range and the illumination statistical range; perturbing the source domain road scene image through the blur intensity distribution to obtain a blur disturbance view, the blur perturbation operator being parameterized by a blur intensity parameter sampled from the blur intensity distribution; perturbing the source domain road scene image through the noise intensity distribution to obtain a noise disturbance view, the noise perturbation operator being parameterized by a noise intensity parameter sampled from the noise intensity distribution; inputting the color-illumination disturbance view, the blur disturbance view and the noise disturbance view respectively into the teacher model at the current iteration number to obtain a color-illumination disturbance prediction result, a blur disturbance prediction result and a noise disturbance prediction result; inputting the source domain image set into the teacher model at the current iteration number and obtaining, through the feature extraction backbone network and the feature fusion module, a second intermediate semantic feature of each source domain road scene image; calculating the mean and standard deviation of the second intermediate semantic features over all source domain road scene images, the mean and standard deviation forming the source domain feature distribution statistics; performing channel-level recalibration on the second intermediate semantic features with the source domain feature statistics and the target feature statistics to obtain recalibrated features, namely standardizing each second intermediate semantic feature with the source domain mean and standard deviation, with a stabilization term preventing division by zero, and rescaling it with the target domain standard deviation and mean; inputting the recalibrated features into the detection head of the teacher model at the current iteration number to obtain a feature-level disturbance prediction result; acquiring the set of detection frames from the reference prediction result and, for the inner image region of each detection frame, randomly erasing a sub-region to obtain an occlusion disturbance image, and inputting it into the teacher model at the current iteration number to obtain an occlusion disturbance prediction result; performing an affine transformation on the inner image region of each detection frame to obtain a scale disturbance image, and inputting it into the teacher model at the current iteration number to obtain a scale disturbance prediction result; the color-illumination, blur, noise, feature-level, occlusion and scale disturbance prediction results forming the target disturbance prediction result set, wherein any disturbance prediction result in the set comprises, for each of its instances, a detection frame, a class prediction probability distribution and a confidence score.
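Two of claim 6's operations can be made concrete with a short NumPy sketch: the color-illumination perturbation operator that pulls an image's channel statistics toward values sampled from the target-domain ranges, and the channel-level recalibration of source features toward the target feature statistics. Function names, the range format and the stabilization constant are assumptions.

```python
import numpy as np

def perturb_color_illumination(img, mean_range, std_range, rng):
    """Shift/rescale a normalized RGB image (HxWx3, values in [0, 1]) so its
    per-channel mean and std land at values sampled from the target-domain
    color/illumination statistical ranges (a sketch of claim 6's operator)."""
    target_mean = rng.uniform(*mean_range, size=3)
    target_std = rng.uniform(*std_range, size=3)
    mu = img.mean(axis=(0, 1))
    sd = img.std(axis=(0, 1)) + 1e-6          # stabilization term
    return (img - mu) / sd * target_std + target_mean

def recalibrate(feat, mu_src, sd_src, mu_tgt, sd_tgt, eps=1e-6):
    """Channel-level recalibration of claim 6: standardize features with the
    source-domain statistics, then rescale to the target-domain statistics."""
    return (feat - mu_src) / (sd_src + eps) * sd_tgt + mu_tgt
```

The recalibrated features would then be fed directly into the teacher's detection head to produce the feature-level disturbance prediction result.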
  7. The target instance detection method based on target scene discrimination consistency according to claim 5, wherein calculating the total inconsistency score of each source domain road scene image based on the reference prediction result and the target disturbance prediction result set comprises: constructing, from the reference prediction result and each disturbance prediction result, a matched pair set, a first detection frame set and a second detection frame set; processing the reference prediction result and any one disturbance prediction result in the target disturbance prediction result set respectively with the non-maximum suppression (NMS) algorithm based on a preset confidence threshold and an IoU threshold, to obtain a processed reference prediction result and a processed disturbance prediction result, each comprising, for every instance, a detection frame, a class prediction probability distribution and a confidence score; calculating the intersection-over-union (IoU) between the detection frames of the processed reference prediction result and those of the processed disturbance prediction result; among all IoU values of a given reference detection frame with the disturbance detection frames, acquiring the candidate frames whose IoU exceeds the threshold, selecting from them the frame with the maximum IoU as the target detection frame, forming a matched pair of the target detection frame with the i-th reference detection frame, and thereby obtaining the matched pair set, the first detection frame set of reference frames that failed to match, and the second detection frame set of disturbance frames that failed to match; calculating the total inconsistency score of each source domain road scene image from the matched pair set and the two unmatched detection frame sets as follows: calculating a classification difference metric as the Jensen-Shannon divergence between the class prediction probability distributions of the two detection frames in each matched pair, averaged over the number of matched pairs; calculating a localization difference metric from the IoU of the two detection frames in each matched pair; calculating a missing-or-new difference metric from the numbers of matched and unmatched detection frames, with a stabilization term preventing division by zero; computing the single-view inconsistency as a loss-weighted combination of the classification difference metric, the localization difference metric and the missing-or-new difference metric; and performing a view-weighted summation of the single-view inconsistencies over the K disturbance prediction results to obtain the total inconsistency score of each source domain road scene image.
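Claim 7's score can be illustrated with a small self-contained sketch. The Jensen-Shannon classification term, a 1 − IoU localization term and an unmatched-count term follow the claim's structure; the loss weights, the no-match penalty and all names are assumptions, since the symbols and exact formulas were lost in this translation.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def iou(b1, b2):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def single_view_inconsistency(matches, n_unmatched_ref, n_unmatched_pert,
                              lam_cls=1.0, lam_loc=1.0, lam_cnt=1.0):
    """matches: list of (p_ref, p_pert, box_ref, box_pert) matched pairs.
    Returns the weighted sum of classification, localization and
    missing-or-new difference terms for one perturbation view."""
    if matches:
        d_cls = np.mean([js_divergence(p, q) for p, q, _, _ in matches])
        d_loc = np.mean([1.0 - iou(b1, b2) for _, _, b1, b2 in matches])
    else:
        d_cls = d_loc = 1.0        # assumed penalty when nothing matches
    total = len(matches) + n_unmatched_ref + n_unmatched_pert
    d_cnt = (n_unmatched_ref + n_unmatched_pert) / (total + 1e-9)
    return lam_cls * d_cls + lam_loc * d_loc + lam_cnt * d_cnt
```

The total score of an image would then be a view-weighted sum of `single_view_inconsistency` over the K perturbation views of claim 6; low-score images are the ones claim 8 keeps.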
  8. The target instance detection method based on target scene discrimination consistency according to claim 5, wherein screening the source domain road scene images in the source domain image set according to their total inconsistency scores to construct the source domain pseudo-label data set comprises: sorting the total inconsistency scores of all source domain road scene images in ascending order, and taking the first N source domain road scene images as target images to form a high-value source domain subset; inputting each target image of the high-value source domain subset into the teacher model at the current iteration number to obtain a first prediction result comprising, for its q-th instance, a detection frame, a class prediction probability distribution and a confidence score; for each detection frame, acquiring the category corresponding to the maximum value of its class prediction probability distribution as the pseudo label of that detection frame; removing the instances whose confidence scores are smaller than a preset confidence threshold to obtain a second prediction result; processing the detection frames belonging to the same category in the second prediction result with the non-maximum suppression (NMS) algorithm based on the preset confidence threshold and the IoU threshold, to obtain a third prediction result; among all detection frames of the third prediction result, taking as final detection frames those whose IoU with a detection frame of the reference prediction result is greater than or equal to a preset threshold and whose category is the same, the final detection frames and the corresponding pseudo labels forming the final pseudo-label set; the target images and their final pseudo-label sets together composing the source domain pseudo-label data set.
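The confidence-filter plus per-class NMS steps of claim 8 can be sketched as follows. The greedy NMS, the box format and all thresholds and names are assumptions; a real implementation would also apply the claim's final agreement check against the reference prediction result.

```python
def iou(b1, b2):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def select_pseudo_labels(predictions, conf_thresh=0.5, iou_thresh=0.5):
    """predictions: list of (box, class_id, score) from the teacher model.
    Drop low-confidence detections, then run greedy per-class NMS: a box is
    kept only if it does not overlap an already-kept box of the same class
    by more than iou_thresh."""
    kept = []
    cands = sorted((p for p in predictions if p[2] >= conf_thresh),
                   key=lambda p: -p[2])
    for box, cls, score in cands:
        if all(cls != k[1] or iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, cls, score))
    return kept
```

The surviving detections, each labeled with the arg-max category of its class probability distribution, would form the final pseudo-label set of one high-value source image.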
  9. The target instance detection method based on target scene discrimination consistency according to claim 5, wherein updating the student model parameters and the teacher model parameters based on the source domain pseudo-label batch and the target domain sample batch to obtain the updated student model and the updated teacher model comprises: jointly optimizing the student model parameters at the current iteration number on the source domain pseudo-label batch and the target domain sample batch so as to minimize an objective function, obtaining the optimized student model parameters and thereby the updated student model, wherein the objective function is the target detection loss (including classification loss, regression loss and confidence loss) averaged over the target domain sample batch against the detection frames and categories of the target road scene images, plus the source domain pseudo-label training weight coefficient times the same detection loss averaged over the source domain pseudo-label batch; and updating the teacher model parameters from the optimized student model parameters in an exponential moving average (EMA) manner, the updated teacher parameters being the EMA momentum coefficient times the previous teacher parameters plus one minus that coefficient times the optimized student parameters, thereby obtaining the updated teacher model.
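The two update rules of claim 9 reduce to a few lines. As a minimal sketch with assumed names, the teacher follows the student by exponential moving average, and the student's objective is the target-batch loss plus a weighted source pseudo-label loss:

```python
def ema_update(teacher, student, alpha=0.999):
    """Claim 9's EMA rule, per parameter:
    theta_T <- alpha * theta_T + (1 - alpha) * theta_S."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def joint_objective(target_losses, source_losses, lam=1.0):
    """Shape of claim 9's objective: mean supervised detection loss on the
    target batch plus lam times the mean pseudo-label loss on the source
    pseudo-label batch."""
    lt = sum(target_losses) / len(target_losses)
    ls = sum(source_losses) / len(source_losses)
    return lt + lam * ls
```

With `alpha` close to 1 the teacher changes slowly, which is what stabilizes the pseudo labels it generates for the next round of claim 5's loop.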

Description

Target instance detection method based on target scene discrimination consistency

Technical Field

The invention relates to the field of cross-scene target detection, and in particular to a target instance detection method based on target scene discrimination consistency.

Background

In cross-scene deployment, a target detection model often suffers performance degradation due to differences in imaging conditions, background textures, target appearances, etc. between the source and target domains; such problems are commonly studied under domain-adaptive object detection (DAOD). Existing DAOD methods mainly follow technical paths such as feature alignment, adversarial learning, reconstruction-based methods, and distillation/self-training. Teacher-student self-training (pseudo-labeling) has been a common scheme in recent years: a teacher model generates pseudo labels in the target domain to supervise student model training, combined with weak/strong data augmentation, alignment losses and an EMA update mechanism to mitigate the pseudo-label noise caused by domain shift; for example, Adaptive Teacher organizes the cross-domain detection training process around a teacher-student structure. Around the problem of pseudo-label quality, some methods further introduce finer-grained quality modeling and learning signals: Harmonious Teacher points out that classification confidence alone is insufficient to characterize pseudo-label reliability, so the harmony of classification and localization information must be incorporated into pseudo-label quality modeling, and Contrastive Mean Teacher combines mean-teacher self-training with contrastive learning to alleviate the training instability caused by noisy pseudo labels.
At the engineering level, related patent schemes mostly adopt the combined form of a teacher-student structure, EMA updates, and pseudo-label filtering or evaluation. For example, CN116091886A measures and filters pseudo labels through a strong/weak dual-branch structure, and CN118038163A computes losses separately through a source domain detection head and a target domain detection head while introducing a multi-scale masked adversarial alignment module; in the source-free domain-adaptive target detection direction, there are also schemes that correct teacher predictions and improve pseudo-label quality by constructing target domain category prototypes, for example CN117636086A. Although the above techniques improve cross-domain target detection performance to some extent, significant shortcomings remain. First, existing DAOD pipelines generally lack an explicit, quantifiable evaluation and screening mechanism for whether source domain samples or instances transfer forward to the target domain, which easily triggers negative transfer when the source and target domains differ significantly. Second, in the pseudo-label self-training route, pseudo-label quality evaluation still depends on a single index or an empirical threshold, so mismatches between classification and localization information and error accumulation occur easily; although auxiliary filtering can be performed through structures such as strong/weak branches (e.g. CN116091886A), this brings engineering problems such as increased structural complexity and heightened threshold sensitivity.
In addition, in the joint optimization of a fully annotated source domain and an unannotated target domain, the model easily over-relies on the source domain supervised signal and weakens target domain optimization, which degrades the quality of the teacher-generated pseudo labels and reduces detection accuracy; this problem is explicitly pointed out in the background of CN118038163A. Meanwhile, many DAOD methods still rely on a target domain validation or test set for model selection during training and evaluation, and lack a deployable unsupervised model selection and training termination mechanism for real deployment scenarios without target domain annotations. Moreover, existing self-training paradigms mostly focus on foreground pseudo-label-driven learning, ignoring the discriminative information that may be contained in background or difficult regions, and noisy pseudo labels are prone to error accumulation.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a target instance detection method based on target scene discrimination consistency, which comprises the following steps: in an autonomous driving scene, collecting a plurality of target road scene images through a vehicle-mounted camera to form a target domain image set; completing detection-frame annotation for target instances of a preset category