CN-121982634-A - Cross-modal re-identification method and system

CN 121982634 A

Abstract

The application discloses a cross-modal re-identification method and system. The method employs a multi-branch feature extraction network: a shared trunk first extracts basic features of an image, and three functionally decoupled parallel branches then learn, in parallel, modality-invariant shared features, deep semantic features, and key shallow detail features. Fusing these three complementary features yields a final feature representation with comprehensive information, enabling high-precision mutual retrieval between visible-light and infrared images. The method addresses the difficulty that, owing to the large modality gap, visible-light and infrared images can rarely achieve semantic alignment and detail preservation at the same time. Through the framework of a shared trunk coordinated with three decoupled branches, the distribution difference between the visible-light and infrared modalities is reduced while the accuracy with which the re-identification system re-identifies a queried target under day-night alternation, illumination changes, and complex scenes is improved.

Inventors

  • ZHOU YIMIN
  • CHEN SITAO

Assignees

  • Shenzhen Institute of Advanced Technology (深圳先进技术研究院)

Dates

Publication Date
2026-05-05
Application Date
2025-12-31

Claims (10)

  1. A cross-modal re-identification method, the method comprising: S1, receiving an image to be queried and inputting it into a multi-branch feature extraction network; S2, extracting basic features of the image to be queried through a shared trunk of the feature extraction network, processing the basic features through three functionally decoupled parallel branches in the feature extraction network, and outputting shared features, deep specific features, and shallow detail features; S3, fusing the shared features, the specific features, and the detail features to obtain a final feature representation of the image to be queried; S4, performing similarity calculation and ranking between the final feature representation and the feature representations in a pre-generated gallery feature library, wherein the feature representations in the gallery feature library are obtained by the multi-branch feature extraction network extracting and fusing features from a cross-modal gallery composed of visible-light and infrared images; and S5, outputting a re-identification retrieval result according to the ranking result (a minimal code sketch of this pipeline follows the claims).
  2. The method according to claim 1, wherein the gallery feature library in step S4 is pre-generated by: inputting the images of the cross-modal gallery into the multi-branch feature extraction network; extracting the shared features, the deep specific features, and the shallow detail features, and fusing them to obtain a final feature representation for each cross-modal gallery image; and storing the final feature representations of the cross-modal gallery images to complete construction of the gallery feature library (see the gallery and retrieval sketch after the claims).
  3. The method according to claim 1, wherein the multi-branch feature extraction network in step S1 is pre-trained by: step S10, constructing a network comprising the shared trunk and the three functionally decoupled parallel branches; step S11, inputting training batches comprising paired visible-light and infrared images into the network, extracting basic features from the images through the shared trunk, processing them respectively through the three functionally decoupled parallel branches, and synchronously outputting shared features, deep specific features, and shallow detail features; step S12, feeding the shared features, the deep specific features, and the shallow detail features into a joint loss function to compute a total loss value, wherein the joint loss function is a weighted sum of a metric learning loss and a modal alignment loss, the metric learning loss acts on all three features and constrains the intra-class distances of same-identity features and the inter-class distances of different-identity features, and the modal alignment loss acts on the shared features and reduces the feature distribution difference between the visible-light and infrared modalities; step S13, based on the total loss value, synchronously optimizing all parameters of the shared trunk and the three functionally decoupled parallel branches through back-propagation; and step S14, repeating steps S11 to S13 until the network converges (see the training-step sketch after the claims).
  4. The method according to claim 3, wherein the modal alignment loss in step S12 is calculated by: based on the visible-light and infrared features within the shared features, computing the pairwise distance distribution of the shared features and measuring the consistency between the two modal distributions with a symmetric KL divergence to obtain a first alignment loss; based on the classification probabilities output by the classifiers of the three functionally decoupled parallel branches for the visible-light and infrared images, measuring the consistency of the two modalities' category prediction distributions with a bidirectional KL divergence to obtain a second alignment loss; and taking a weighted sum of the first and second alignment losses to obtain the final modal alignment loss value (see the alignment-loss sketch after the claims).
  5. The method according to claim 1, wherein the processing by the three functionally decoupled parallel branches in step S2 comprises: processing the input basic features through the shared branch and obtaining modality-invariant shared features by means of a gradient reversal layer; sequentially applying instance normalization and attention-mechanism weighting to the input basic features through the deep specific branch to obtain modality-specific deep semantic features; and tapping basic features from the shallow stages of the shared trunk through the shallow detail branch and sequentially applying instance normalization and attention-mechanism screening to them to obtain shallow detail features.
  6. The method according to claim 1, wherein step S3 comprises: aligning the three feature vectors, namely the shared features, the deep specific features, and the shallow detail features, in the feature dimension; concatenating the three aligned feature vectors along the feature dimension in the order shared features, deep specific features, shallow detail features to form a fused high-dimensional feature vector; and applying a linear transformation to the high-dimensional feature vector, or projecting it to a designated target dimension through a fully connected layer, to obtain the final feature representation of the image to be queried.
  7. The method according to claim 1, wherein step S4 comprises: calculating the cosine similarity between the final feature representation of the image to be queried and each feature representation in the gallery feature library to obtain a similarity score table; and, based on the similarity score table, ordering all feature representations in the gallery feature library by their similarity scores from high to low to generate a ranked list.
  8. The method according to claim 1, wherein step S5 comprises: outputting, as the final retrieval result, the gallery images corresponding to the feature representations in the ranked list whose ranks are above a preset ranking threshold.
  9. A cross-modal re-identification system, the system comprising: a feature extraction module for receiving an image to be queried and inputting it into a multi-branch feature extraction network, extracting basic features of the image to be queried through a shared trunk of the feature extraction network, processing the basic features through three functionally decoupled parallel branches in the feature extraction network, and simultaneously outputting shared features, deep specific features, and shallow detail features; a gallery feature library pre-storing feature representations, wherein the feature representations are obtained by the multi-branch feature extraction network extracting and fusing features from a cross-modal gallery composed of visible-light and infrared images; and a retrieval matching module for performing similarity calculation and ranking between the final feature representation obtained by the feature extraction module and the feature representations in the gallery feature library, and outputting a re-identification retrieval result according to the ranking result.
  10. The system of claim 9, wherein the feature extraction module comprises: a network input unit for receiving and preprocessing the image to be queried and inputting it into the multi-branch feature extraction network; a shared trunk unit, composed of the shared trunk of the multi-branch feature extraction network, for extracting the basic features of the image to be queried; a parallel branch processing unit, composed of the three functionally decoupled parallel branches, for processing the basic features in parallel and synchronously outputting the shared features, the deep specific features, and the shallow detail features; and a feature fusion unit for fusing the three features output by the parallel branch processing unit to obtain the final feature representation of the image to be queried.
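
To make the claimed architecture concrete, the following is a minimal PyTorch sketch of the multi-branch extractor of claims 1, 5, and 6. The ResNet-50 trunk, the 512-dimensional branch outputs, the 1x1-conv attention block, and all module names are illustrative assumptions; the patent specifies neither a backbone nor dimensions, and the gradient reversal layer is wired to a modality discriminator in the usual adversarial fashion.

```python
# Minimal sketch of the multi-branch extractor (claims 1, 5, 6).
# Backbone, dimensions, and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models
from torch.autograd import Function


class GradReverse(Function):
    """Gradient reversal layer: identity forward, negated (scaled) gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def channel_attention(channels, reduction=16):
    """1x1-conv gating block, a stand-in for the patent's unspecified attention mechanism."""
    return nn.Sequential(
        nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
        nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())


class MultiBranchExtractor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Shared trunk; the detail branch taps its shallow stages (claim 5).
        self.shallow = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                     resnet.maxpool, resnet.layer1, resnet.layer2)
        self.deep = nn.Sequential(resnet.layer3, resnet.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)

        # Branch 1: shared features; a modality discriminator behind the GRL
        # pushes them to be modality-invariant.
        self.shared_head = nn.Linear(2048, feat_dim)
        self.modality_disc = nn.Linear(feat_dim, 2)
        # Branch 2: deep modality-specific features (IN + attention weighting).
        self.deep_in = nn.InstanceNorm2d(2048, affine=True)
        self.deep_att = channel_attention(2048)
        self.deep_head = nn.Linear(2048, feat_dim)
        # Branch 3: shallow detail features (IN + attention screening).
        self.detail_in = nn.InstanceNorm2d(512, affine=True)
        self.detail_att = channel_attention(512)
        self.detail_head = nn.Linear(512, feat_dim)
        # Claim 6: concatenate the three aligned vectors, then project.
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)

    def forward(self, x, grl_lambda=1.0):
        shallow = self.shallow(x)          # shallow basic features
        deep = self.deep(shallow)          # deep basic features
        base = self.pool(deep).flatten(1)

        f_shared = self.shared_head(base)
        mod_logits = self.modality_disc(GradReverse.apply(f_shared, grl_lambda))
        d = self.deep_in(deep)
        f_deep = self.deep_head(self.pool(d * self.deep_att(d)).flatten(1))
        s = self.detail_in(shallow)
        f_detail = self.detail_head(self.pool(s * self.detail_att(s)).flatten(1))

        fused = self.fuse(torch.cat([f_shared, f_deep, f_detail], dim=1))
        return fused, f_shared, f_deep, f_detail, mod_logits
```

At inference only the fused vector is used for retrieval; the three branch outputs and the modality logits are returned so that the training losses sketched below can act on them separately.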
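Claim 4's modal alignment loss can be sketched as follows. How the pairwise-distance distributions are normalized (a softmax over negated distances here) and the assumption of equal-size paired batches are interpretive choices; for brevity, a single classifier's logits per modality stand in for the three branch classifiers named in the claim.

```python
# Sketch of the two-part modal alignment loss (claim 4). Normalization
# choices and the single stand-in classifier are assumptions.
import torch
import torch.nn.functional as F


def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two probability distributions."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    return 0.5 * ((p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1))


def modal_alignment_loss(f_vis, f_ir, logits_vis, logits_ir, w1=1.0, w2=1.0):
    # First term: pairwise-distance distributions of the shared features of
    # each modality, turned into distributions and compared by symmetric KL.
    d_vis = torch.cdist(f_vis, f_vis).flatten()
    d_ir = torch.cdist(f_ir, f_ir).flatten()
    loss_dist = sym_kl(F.softmax(-d_vis, dim=0), F.softmax(-d_ir, dim=0))

    # Second term: bidirectional KL between the batch-averaged category
    # prediction distributions of the two modalities.
    q_vis = F.softmax(logits_vis, dim=1).mean(0)
    q_ir = F.softmax(logits_ir, dim=1).mean(0)
    loss_pred = sym_kl(q_vis, q_ir)

    # Weighted sum of the two alignment terms.
    return w1 * loss_dist + w2 * loss_pred
```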
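A single training step matching claim 3's joint loss might look like the sketch below, reusing `MultiBranchExtractor` and `modal_alignment_loss` from the previous sketches. The triplet loss as the metric-learning term, the rolled-batch negatives, the loss weights, and the Adam optimizer are all assumptions; the GRL discriminator loss from claim 5 is omitted for brevity.

```python
# One training step with claim 3's joint loss: a metric-learning term over
# all three branch features plus the alignment term, combined as a weighted
# sum and back-propagated through trunk and branches together.
import torch
import torch.nn as nn

NUM_IDS = 395                                   # hypothetical identity count
triplet = nn.TripletMarginLoss(margin=0.3)      # metric-learning term (assumed)

model = MultiBranchExtractor()                  # from the architecture sketch
id_head = nn.Linear(512, NUM_IDS)               # assumed stand-in ID classifier
opt = torch.optim.Adam(list(model.parameters()) + list(id_head.parameters()),
                       lr=3e-4)


def train_step(vis, ir, alpha=1.0, beta=0.5):
    """vis/ir: paired batches, row i of each holding the same identity and
    distinct identities across rows, so a roll-by-one yields negatives."""
    opt.zero_grad()
    out_v = model(vis)   # (fused, shared, deep, detail, modality logits)
    out_i = model(ir)
    metric = 0.0
    for fv, fi in zip(out_v[1:4], out_i[1:4]):
        # Pull same-identity cross-modal pairs together, push a rolled
        # (different-identity) sample away -- a crude negative choice.
        metric = metric + triplet(fv, fi, fi.roll(1, dims=0))
    align = modal_alignment_loss(out_v[1], out_i[1],
                                 id_head(out_v[1]), id_head(out_i[1]))
    loss = alpha * metric + beta * align        # weighted-sum joint loss
    loss.backward()                             # optimize trunk + branches
    opt.step()
    return loss.item()
```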
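Finally, gallery construction (claim 2) and cosine-similarity retrieval with a rank cutoff (claims 7 and 8) reduce to a few lines; `top_k` plays the role of the preset ranking threshold and, like the function names, is an assumption.

```python
# Gallery construction (claim 2) and cosine-similarity ranking (claims 7, 8).
import torch
import torch.nn.functional as F


@torch.no_grad()
def build_gallery(model, gallery_images):
    """Extract and store the fused representation of every gallery image."""
    model.eval()
    feats = [model(img.unsqueeze(0))[0] for img in gallery_images]
    return F.normalize(torch.cat(feats, dim=0), dim=1)


@torch.no_grad()
def retrieve(model, query_image, gallery_feats, top_k=10):
    """Rank gallery entries by cosine similarity; return top-k indices/scores."""
    model.eval()
    q = F.normalize(model(query_image.unsqueeze(0))[0], dim=1)
    scores = (q @ gallery_feats.t()).squeeze(0)   # cosine similarity scores
    order = scores.argsort(descending=True)       # high-to-low ranked list
    return order[:top_k], scores[order[:top_k]]
```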

Description

Cross-modal re-identification method and system

Technical Field

The application relates to the technical field of image processing, and in particular to a cross-modal re-identification method and system.

Background

In the field of intelligent monitoring, visible light and infrared are the two core imaging modalities. Visible-light imaging provides abundant color and texture information in the daytime, but its effectiveness drops sharply at night or in low-illumination environments; infrared imaging depends on the thermal radiation of objects, is unaffected by visible light, and can image stably in complete darkness, but the resulting images lack color and have blurrier texture details. To exploit the advantages of both modalities, related technologies typically deploy dual-spectrum capture devices that record visible light and infrared simultaneously. At the algorithm level, mainstream methods aim to align and fuse the data of the two modalities and mainly follow three technical paths: (1) feature-level alignment, which reduces modality differences through a dual-stream network and loss functions but underuses details; (2) image-level translation, which performs modality transfer through generative adversarial networks but suffers from artifacts and unstable training; and (3) simple dual-stream feature concatenation, which has an intuitive structure but fails to effectively decouple shared and specific features. These technologies fail to achieve effective decoupling and collaborative learning of modality-shared and modality-specific features, which makes it difficult for re-identification to balance cross-modal consistency with the discriminability of each modality, ultimately restricting recognition performance in complex real-world scenes.

Disclosure of Invention

By providing a cross-modal re-identification method and system, the embodiments of the application address the difficulty that, owing to the large modality gap, visible-light and infrared images can rarely achieve semantic alignment and detail preservation at the same time. Through the framework of a shared trunk coordinated with three decoupled branches, the distribution difference between the visible-light and infrared modalities is reduced while the accuracy with which the re-identification system re-identifies a target to be retrieved under day-night alternation, illumination changes, and complex scenes is improved.
The embodiment of the application provides a cross-modal re-identification method, comprising the following steps: S1, receiving an image to be queried and inputting it into a multi-branch feature extraction network; S2, extracting basic features of the image to be queried through a shared trunk of the feature extraction network, processing the basic features through three functionally decoupled parallel branches in the feature extraction network, and outputting shared features, deep specific features, and shallow detail features; S3, fusing the shared features, the specific features, and the detail features to obtain a final feature representation of the image to be queried; S4, performing similarity calculation and ranking between the final feature representation and the feature representations in a pre-generated gallery feature library, wherein the feature representations in the gallery feature library are obtained by the multi-branch feature extraction network extracting and fusing features from a cross-modal gallery composed of visible-light and infrared images; and S5, outputting a re-identification retrieval result according to the ranking result. Optionally, the gallery feature library in step S4 is pre-generated by: inputting the images of the cross-modal gallery into the multi-branch feature extraction network; extracting the shared features, the deep specific features, and the shallow detail features, and fusing them to obtain a final feature representation for each cross-modal gallery image; and storing the final feature representations of the cross-modal gallery images to complete construction of the gallery feature library. Optionally, the multi-branch feature extraction network in step S1 is pre-trained by: step S10, constructing a network comprising the shared trunk and the three functionally decoupled parallel branches; step S11, inputting training batches comprising paired visible-light and infrared images into the network, extracting basic features from the images through the shared trunk, processing them respectively through the three functionally decoupled parallel branches, and synchronously outputting shared features, deep specific features, and shallow detail features; step S12, feeding the shared features, the deep specific features, and the shallow detail features into a joint loss function to compute a total loss value, wherein the joint loss function is a weighted sum