CN-121811166-B - Weak supervision target detection method and system based on component mining and integral reconstruction


Abstract

The application relates to a weakly supervised target detection method and system based on component mining and integral reconstruction. The method comprises: mining a plurality of target components of an image-level class label; performing cross-image clustering on detection areas extracted from training images to obtain a plurality of clusters; calculating the distance between the average visual embedding of each cluster and the text embedding of each target component, and determining the minimum distance corresponding to each average visual embedding; when the minimum distance is smaller than a first threshold, determining the target component corresponding to the minimum distance as the component label of each detection area in the corresponding cluster; mapping the component labels of the detection areas to the training images to obtain a labeled image set; and outputting a component detection frame of each component in an image to be detected by using a component detector, and performing instance reconstruction based on the component detection frames to obtain a detection frame of each instance to be detected in the image to be detected. The component detector is trained on a plurality of labeled image sets, and the instances of the component labels differ between the labeled image sets. The method can improve detection accuracy.

Inventors

JIANG LE
LI SHUCHENG
LV FENG
LIU MINGLIU
YANG SHUAI
YI CHEN

Assignees

Central South University (中南大学)

Dates

Publication Date: 20260512
Application Date: 20260312

Claims (8)

1. A weakly supervised target detection method based on component mining and global reconstruction, the method comprising: S1, mining a plurality of target components of an image-level class label, and performing cross-image clustering on detection areas extracted from a plurality of training images to obtain a plurality of clusters, wherein the class labels of all instances in the plurality of training images are consistent with the image-level class label; S2, calculating the distance between the average visual embedding of each cluster and the text embedding of each target component, determining the minimum distance corresponding to each average visual embedding, and, when the minimum distance is smaller than a first threshold, determining the target component corresponding to the minimum distance as the component label of each detection area in the cluster corresponding to the minimum distance; S3, mapping the component labels of the detection areas to the training images to obtain a labeled image set with the component labels; S4, outputting a component detection frame of each component in an image to be detected by using a component detector, and performing instance reconstruction based on the component detection frames to obtain a detection frame of each instance to be detected in the image to be detected, wherein the component detector is obtained by training on a plurality of labeled image sets, and the instances of the component labels of the labeled image sets are inconsistent; wherein mining the target components in step S1 comprises: determining an image-level class label and a prompt for mining the components of the image-level class label, wherein the image-level class label is text information; inputting the image-level class label and the prompt into a language model, and outputting candidate components corresponding to the image-level class label; calculating a first cosine distance between a candidate component and a parent node of the candidate component based on a first text embedding of the candidate component and a second text embedding of the parent node; and when the first cosine distance is larger than a second threshold and a second cosine distance between the candidate component and a sibling node is smaller than the second threshold, eliminating the candidate component or the sibling node to obtain the target components of the image-level class label; wherein the first cosine distance is calculated as $D_{vis}(i, p) = 1 - \frac{t_i \cdot t_p}{\lVert t_i \rVert \, \lVert t_p \rVert}$, where $D_{vis}(i, p)$ is the first cosine distance, $t_i$ is the first text embedding of candidate component $i$, and $t_p$ is the second text embedding of parent node $p$.
2. The method according to claim 1, wherein obtaining the plurality of clusters in step S1 comprises: acquiring a plurality of training images, and extracting a plurality of detection areas from the plurality of training images; inputting each detection area and each training image into a visual language model, and generating a visual vector of each detection area through an image encoder in the visual language model; and performing cross-image clustering analysis with a clustering algorithm based on the visual vector of each detection area to obtain a plurality of clusters, wherein each cluster comprises at least one detection area.
3. The method according to claim 2, wherein obtaining the average visual embedding of each cluster in step S2 comprises: removing outlier detection areas and noise detection areas from each cluster to obtain a core cluster of the cluster; and calculating the average visual embedding of each cluster according to the visual vector of each detection area in its core cluster; wherein the average visual embedding of each cluster is calculated as $\bar{v}_{c_j} = \frac{1}{|\hat{c}_j|} \sum_{r \in \hat{c}_j} v_r$, where $\bar{v}_{c_j}$ is the average visual embedding of cluster $c_j$, $\hat{c}_j$ is the core cluster of cluster $c_j$, and $v_r$ is the visual vector of detection region $r$ in the core cluster.
4. The method according to claim 1, wherein outputting the component detection frame of each component in the image to be detected by using the component detector comprises: training, based on the detection areas of each class of component label in the plurality of labeled images, a binary-classification component detector for each class of component label to obtain trained binary-classification component detectors, and outputting the component detection frame of each component in the image to be detected by using the trained binary-classification component detectors; or training a multi-class component detector based on the detection areas of each class of component label in the plurality of labeled images to obtain a trained multi-class component detector, and outputting the component detection frame of each component in the image to be detected by using the trained multi-class component detector.
5. The method according to claim 1, wherein performing instance reconstruction based on the component detection frames to obtain the detection frame of each instance to be detected in the image to be detected comprises: determining first component detection frames belonging to the same instance to be detected according to the edge weights among the component detection frames, and reconstructing the detection frame of each instance to be detected according to the first component detection frames corresponding to that instance; wherein $P_{ij}$ is the edge weight between the component detection frame of component $B_i$ and the component detection frame of component $B_j$, and is calculated from the following quantities: the intersection-over-union (IoU) ratio between the component detection frame of component $B_i$ and the component detection frame of component $B_j$; $L_{total}$, the length of the line connecting the center point of the component detection frame of component $B_i$ and the center point of the component detection frame of component $B_j$; the length of that line lying inside the component detection frame of component $B_i$; and the length of that line lying inside the component detection frame of component $B_j$.
6. The method according to claim 5, wherein determining the first component detection frames belonging to the same instance to be detected, and reconstructing the detection frame of each instance to be detected according to the first component detection frames corresponding to that instance, comprises: when the edge weight between two component detection frames is greater than a preset third threshold, determining that the two component detection frames are first component detection frames corresponding to the same instance to be detected; determining the maximum abscissa, the minimum abscissa, the maximum ordinate and the minimum ordinate among the first component detection frames corresponding to each instance to be detected; and reconstructing the detection frame of each instance to be detected based on the maximum abscissa, the minimum abscissa, the maximum ordinate and the minimum ordinate corresponding to that instance.
7. The method according to claim 1, wherein the method further comprises: outputting a component detection frame of each component in an image to be detected by using a component detector, and outputting a plurality of anchor points of the image to be detected by using a pre-trained base weakly supervised model, wherein the component detector is obtained by training on a plurality of labeled image sets, and the instances of the component labels of the labeled image sets are inconsistent; taking each anchor point as a center, determining each component detection frame within a preset radius of the anchor point as a second component detection frame corresponding to a first instance to be detected in the image to be detected, and reconstructing the detection frame of each first instance to be detected based on the second component detection frames corresponding to that instance; and determining, from the component detection frames other than the second component detection frames and according to the edge weights among them, third component detection frames corresponding to a second instance to be detected in the image to be detected, and reconstructing the detection frame of each second instance to be detected according to the third component detection frames corresponding to that instance.
8. A weakly supervised target detection system based on component mining and global reconstruction for performing the method of any of claims 1-7, the system comprising: a mining module, used for mining a plurality of target components of an image-level class label, and performing cross-image clustering on detection areas extracted from a plurality of training images to obtain a plurality of clusters, wherein the class labels of all instances in the plurality of training images are consistent with the image-level class label; an alignment module, used for calculating the distance between the average visual embedding of each cluster and the text embedding of each target component, determining the minimum distance corresponding to each average visual embedding, and, when the minimum distance is smaller than a first threshold, determining the target component corresponding to the minimum distance as the component label of each detection area in the cluster corresponding to the minimum distance; a mapping module, used for mapping the component labels of the detection areas to the training images to obtain a labeled image set with the component labels; and a detection module, used for outputting a component detection frame of each component in an image to be detected by using a component detector, and performing instance reconstruction based on the component detection frames to obtain a detection frame of each instance to be detected in the image to be detected, wherein the component detector is obtained by training on a plurality of labeled image sets, and the instances of the component labels of the labeled image sets are inconsistent.
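As an illustration of the grouping and reconstruction recited in claims 5 and 6, the following is a minimal Python sketch, not the patented implementation. It assumes the pairwise edge weights have already been computed and are passed in as a dictionary, since the exact edge-weight formula is not reproduced here. It also assumes that grouping is transitive (connected components found with a union-find structure) and that a component detection frame with no edge above the threshold forms an instance on its own; neither assumption is stated in the claims. The names `group_boxes` and `reconstruct_instances` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def group_boxes(num_boxes: int,
                edge_weights: Dict[Tuple[int, int], float],
                third_threshold: float) -> List[List[int]]:
    """Group box indices whose edge weight exceeds the third threshold.

    Assumption: membership is transitive, so groups are the connected
    components of the thresholded graph (union-find).
    """
    parent = list(range(num_boxes))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), weight in edge_weights.items():
        if weight > third_threshold:
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(num_boxes):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def reconstruct_instances(boxes: List[Box],
                          edge_weights: Dict[Tuple[int, int], float],
                          third_threshold: float) -> List[Box]:
    """Rebuild one instance box per group from the extreme coordinates."""
    instances = []
    for group in group_boxes(len(boxes), edge_weights, third_threshold):
        x_min = min(boxes[i][0] for i in group)
        y_min = min(boxes[i][1] for i in group)
        x_max = max(boxes[i][2] for i in group)
        y_max = max(boxes[i][3] for i in group)
        instances.append((x_min, y_min, x_max, y_max))
    return instances


# Hypothetical example: two overlapping component frames and one distant frame.
component_boxes = [(0, 0, 10, 10), (8, 5, 20, 15), (100, 100, 110, 110)]
weights = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.05}
instance_boxes = reconstruct_instances(component_boxes, weights, 0.5)
```

In this example the first two component frames exceed the threshold and merge into one instance frame spanning their minimum and maximum coordinates, while the third frame stays separate.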

Description

Weak supervision target detection method and system based on component mining and integral reconstruction

Technical Field

The application relates to the technical field of computer vision and pattern recognition, in particular to a weakly supervised target detection method and system based on component mining and integral reconstruction.

Background

With the development of deep learning, fully supervised target detection has made remarkable progress, but it relies on a large number of accurate manual labels, such as bounding boxes and instance masks, and the data acquisition cost is extremely high. The goal of weakly supervised object detection (WSOD) is to train the target detector with only image-level class labels, thereby reducing reliance on cumbersome and expensive bounding box labeling.

At present, existing WSOD methods generally adopt a multi-instance learning framework, in which image-level labels are put in one-to-one correspondence with features extracted from the images. A framework constructed in this way ignores the fact that an image-level label is composed of semantically meaningful components. Because alignment is performed only against image-level labels, the detector is trained on image-level labels alone, so it is difficult to learn richer and more discriminative feature representations, and more accurate detection results cannot be output. For example, application CN114648665A only puts the image-level labels in one-to-one correspondence with the features extracted from the image, so the trained weakly supervised detection model has difficulty learning richer and more discriminative feature representations and cannot output a more accurate detection frame.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a weakly supervised target detection method and system based on component mining and overall reconstruction that can output accurate detection results.

A weakly supervised target detection method based on component mining and global reconstruction, the method comprising: S1, mining a plurality of target components of an image-level class label, and performing cross-image clustering on detection areas extracted from a plurality of training images to obtain a plurality of clusters, wherein the class labels of all instances in the plurality of training images are consistent with the image-level class label; S2, calculating the distance between the average visual embedding of each cluster and the text embedding of each target component, determining the minimum distance corresponding to each average visual embedding, and, when the minimum distance is smaller than a first threshold, determining the target component corresponding to the minimum distance as the component label of each detection area in the cluster corresponding to the minimum distance; S3, mapping the component labels of the detection areas to the training images to obtain a labeled image set with the component labels; S4, outputting a component detection frame of each component in an image to be detected by using a component detector, and performing instance reconstruction based on the component detection frames to obtain a detection frame of each instance to be detected in the image to be detected, wherein the component detector is obtained by training on a plurality of labeled image sets, and the instances of the component labels of the labeled image sets are inconsistent.
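The alignment in step S2 can be illustrated with a minimal Python sketch, offered under stated assumptions rather than as the method's actual implementation. It assumes the distance is a cosine distance (one minus cosine similarity), which step S2 itself does not specify, and that all embeddings are non-zero vectors of equal length. The function names, the cluster identifiers and the toy two-dimensional embeddings are hypothetical; in practice the vectors would come from the image and text encoders of a visual language model.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed distance: 1 - cosine similarity (vectors must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_embedding(vectors: List[Sequence[float]]) -> List[float]:
    """Average visual embedding of the detection areas in one cluster."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def label_clusters(clusters: Dict[str, List[Sequence[float]]],
                   component_text_embeddings: Dict[str, Sequence[float]],
                   first_threshold: float) -> Dict[str, Optional[str]]:
    """Assign each cluster its nearest component, or None if too far."""
    labels: Dict[str, Optional[str]] = {}
    for cluster_id, vectors in clusters.items():
        center = mean_embedding(vectors)
        # Minimum distance over all target components for this cluster.
        best_component, best_distance = min(
            ((name, cosine_distance(center, text_vec))
             for name, text_vec in component_text_embeddings.items()),
            key=lambda pair: pair[1],
        )
        # Only accept the label when the minimum distance is below threshold.
        labels[cluster_id] = (
            best_component if best_distance < first_threshold else None
        )
    return labels


# Hypothetical toy data: three clusters, two mined components.
toy_clusters = {
    "c0": [[1.0, 0.0], [0.9, 0.1]],
    "c1": [[0.0, 1.0]],
    "c2": [[-1.0, -1.0]],
}
toy_components = {"wheel": [1.0, 0.0], "door": [0.0, 1.0]}
cluster_labels = label_clusters(toy_clusters, toy_components, 0.2)
```

Clusters whose minimum distance is not below the first threshold receive no component label, so their detection areas would simply not contribute to the labeled image set in step S3.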
According to the method, a plurality of target components of the image-level class label are mined; cross-image clustering is performed on detection areas extracted from a plurality of training images to obtain a plurality of clusters; the distance between the average visual embedding of each cluster and the text embedding of each target component is calculated, and the minimum distance corresponding to each average visual embedding is determined; when the minimum distance is smaller than a first threshold, the target component corresponding to the minimum distance is determined as the component label of each detection area in the cluster corresponding to the minimum distance; and the component labels of the detection areas are mapped to the training images to obtain a labeled image set with the component labels. Because the method takes into account that an image-level label is composed of semantically meaningful components, it aligns the component labels, rather than the image-level class label, with the detection areas. The component detector can therefore learn richer and more discriminative feature representations when outputting the component detection frame of each component in the image to be detected, so that an accurate detection frame can be obtained for each instance to be detected in the image to be detected.