CN-121982470-A - Multi-view late fusion classification method and system based on information consistency and complementarity priori
Abstract
The invention provides a multi-view late fusion classification method and system guided by a double-hierarchy information consistency priori, and relates to the technical field of computer vision and pattern recognition. The invention designs a sample-level and view-level double-layer consistency priori information module as the main building block of a fusion network. In a multi-view classification task, the shared semantics of the same sample are extracted from the multi-view features, and the global semantic structure of all samples characterized in a single view is utilized to apply a semantic consistency constraint on cross-view features and to suppress deviating views, so as to obtain a more stable and more discriminative multi-view fusion representation.
Inventors
- GUO QINGBEI
- ZHANG WENXIN
Assignees
- University of Jinan (济南大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-03
Claims (3)
- 1. A multi-view late fusion classification method based on an information consistency and complementarity priori, characterized by comprising the following steps: S1, designing a sample-level and view-level double-layer consistency priori information module as the main building block of a fusion network; in a multi-view classification task, extracting the shared semantics of the same sample from the multi-view features, and utilizing the global semantic structure of all samples characterized in a single view to apply a semantic consistency constraint on cross-view features and to suppress deviating views, so as to obtain a more stable and more discriminative multi-view fusion representation; S11, the sample-level and view-level double-layer consistency priori information module consists of greedy pairing, one-dimensional convolution and attention fusion networks; the input multi-view features H ∈ R^(V×N×D), where V is the number of views, N is the number of samples, and D is the dimension of the input view features, are fed into the priori information module; S12, the sample-level consistency information priori module calculates the similarity between the views of the same sample, so as to construct the consistency semantic information of that sample across views by fusing its most similar views; specifically, the sample-level priori information module first normalizes the feature of sample n in view v (n denotes the n-th sample, v the view in which the current sample is located), and calculates the normalized inner products between the normalized view features to obtain a per-sample similarity matrix between views; a greedy pairing strategy then searches for the view pair with maximum similarity: among the unpaired views, the current maximum of the off-diagonal elements of the similarity matrix is found, the two corresponding views are recorded as a high-similarity view pair, and the corresponding rows and columns of those two views are masked so that they do not participate in subsequent searches; this step is repeated on the remaining views until no new view pair can be found; if V is even and all views participate in pairing, V/2 view pairs are finally obtained, while if V is odd, one view remains unpaired; after greedy pairing is completed, the encoded features of each selected view pair are concatenated along the channel dimension to obtain paired feature vectors of dimension 2D (n denotes the n-th sample, k the k-th concatenated high-similarity view pair); when the number of views V is odd, the remaining view feature is concatenated with itself so that its dimension is also 2D; all pairing vectors are stacked by rows to obtain the paired features of the sample set; finally, the paired features are input into a subsequent aggregation branch, in which each pairing vector first undergoes a lightweight channel projection through a one-dimensional convolution layer that maps the paired features into a unified latent representation space; a linear scaling and a softmax over the pairing dimension then learn a normalized weight for each pairing vector, and the transformed paired features are weighted and summed, compressing them into a sample-level priori representation; finally, the priori vectors of all samples are stacked along the sample dimension to obtain the sample-level information consistency priori matrix Z,
providing a unified semantic anchor point for each sample across the multiple views; S13, the view-level consistency information priori module calculates the similarity among the samples within the same view, so as to construct the global semantic information of that view by fusing the most similar samples; specifically, the view-level priori information module first normalizes the feature of the i-th sample of view v (v denotes the v-th view, i the i-th sample in the current view), and calculates the normalized inner products among the samples of the same view to obtain a similarity matrix between the samples; a greedy pairing strategy then searches for the sample pair with maximum similarity: among the unpaired samples, the current maximum of the off-diagonal elements of the similarity matrix is found, the two corresponding samples are marked as a high-similarity sample pair, and their corresponding rows and columns are masked to prevent them from participating in subsequent searches; this step is repeated on the remaining samples until no new sample pair can be found; if N is even and all samples participate in pairing, N/2 sample pairs are obtained, while if N is odd, one sample remains unpaired; after greedy pairing is completed, the encoded features of each selected sample pair are concatenated along the channel dimension to obtain pairing vectors of dimension 2D (v denotes the v-th view, k the k-th concatenated high-similarity sample pair); when the number of samples N in the view is odd, the remaining sample feature is concatenated with itself so that its dimension is also 2D; all pairing vectors are stacked by rows to obtain the view-level paired features; finally, the paired features are input into a subsequent aggregation branch, in which each pairing vector first undergoes a lightweight channel projection through a one-dimensional convolution layer that maps the paired features into a unified latent representation space; a linear scaling and a softmax over the pairing dimension then learn a normalized weight for each pairing vector, and the transformed paired features are weighted and summed, compressing them into a view-level priori representation; finally, the priori vectors of all views are stacked along the view dimension to obtain the view-level information consistency priori matrix, which serves as a view-level semantic priori in the subsequent multi-head attention fusion module and is cooperatively constrained with the sample-level priori matrix Z, so that the multi-view features are doubly aligned and information-consistent in both the sample dimension and the view dimension; S2, directly injecting the sample-level and view-level information prioris into a multi-head attention fusion module, wherein the multi-head attention fusion module uses the original view features as queries and the sample-level and view-level priori information as keys and values, and calculates attention weights on a plurality of attention heads in parallel, so as to model the global relations between samples in the sample dimension and the global semantics of each view in the view dimension; the output of each attention head is concatenated in the channel dimension and projected through a feedforward network to obtain a fused intermediate representation; S3, to further reduce the distance between similar samples in the feature space, introducing
a contrast learning module with a memory queue, which enhances the model's ability to model the relations between samples by constructing positive and negative sample pairs; in the training process, the features output by the multi-head attention are taken as queries, the corresponding sample-level priori is taken as the positive sample, and the other sample-level prioris in the same batch together with the historical samples stored in the memory queue are taken as negative samples; the optimization goal of the contrast learning is to minimize the InfoNCE loss, which optimizes the sample-level and view-level priori consistency by increasing the similarity between the query feature and the positive sample feature under the same index and decreasing the similarity between the query feature and the other samples and the historical sample features; the contrast losses calculated over all views are averaged to obtain the multi-view contrast learning loss of the batch.
- 2. The multi-view late fusion classification method based on the information consistency and complementarity priori according to claim 1, wherein the sample-level and view-level consistency priori information module in S1 comprises three stages: greedy pairing, one-dimensional convolution dimension reduction, and attention fusion, wherein the pair with the highest similarity is selected by greedy pairing, a dimension reduction operation is carried out through the one-dimensional convolution to capture the relation between adjacent samples, and finally the shared semantics of the same sample in the multi-view features and the global semantic information of the same view are extracted through the attention fusion operation.
- 3. A multi-view late fusion classification system based on an information consistency and complementarity priori, characterized by comprising a sample-level and view-level double-layer consistency priori module, a multi-view attention fusion module, and a contrast learning module with a memory queue, wherein the double-layer consistency priori module extracts the shared semantics of the same sample from the multi-view features and utilizes the global semantic structure of all samples characterized in a single view to apply a semantic consistency constraint on cross-view features and suppress deviating views, so that a more stable and more discriminative multi-view fusion representation is obtained; the multi-view attention fusion module injects the sample-level and view-level prioris into the multi-head attention and aligns the features with a lightweight one-dimensional convolution and a residual correction, so that the stability and discriminability of the fusion are improved; and the contrast learning module with the memory queue constructs positive sample pairs from the multi-head attention output features and the sample-level priori representations and expands the negative samples by means of the memory queue, so that the consistency is stabilized and enhanced.
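As an illustration only, the greedy pairing and aggregation described in claim 1 (stage S12) can be sketched in NumPy. The projection matrix `w_proj` and the scoring used for the softmax weights are illustrative stand-ins for the learned one-dimensional convolution and linear scaling of the patent, not the actual trained layers:

```python
import numpy as np

def greedy_pairs(sim):
    """Greedily select highest-similarity pairs from a symmetric similarity
    matrix, masking each chosen row/column (claim 1, S12)."""
    sim = sim.copy().astype(float)
    np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
    pairs, active = [], set(range(sim.shape[0]))
    while len(active) >= 2:
        idx = sorted(active)
        sub = sim[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        a, b = idx[i], idx[j]
        pairs.append((min(a, b), max(a, b)))
        active -= {a, b}                    # mask both members
    leftover = active.pop() if active else None
    return pairs, leftover

def sample_level_prior(h, w_proj=None):
    """h: (V, D) features of ONE sample across V views.
    Pair views greedily, concatenate each pair to 2D, project with a shared
    linear map (stand-in for the 1-D convolution), then softmax-weight and
    sum over pairs to get the sample-level prior vector."""
    V, D = h.shape
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)   # L2 normalization
    sim = hn @ hn.T                                     # view-view similarity
    pairs, leftover = greedy_pairs(sim)
    cat = [np.concatenate([h[a], h[b]]) for a, b in pairs]
    if leftover is not None:                            # odd V: self-concat to 2D
        cat.append(np.concatenate([h[leftover], h[leftover]]))
    P = np.stack(cat)                                   # (K, 2D) paired features
    if w_proj is None:                                  # illustrative projection
        w_proj = np.eye(2 * D)[:D].T                    # (2D, D)
    U = P @ w_proj                                      # (K, D) latent space
    logits = U.sum(axis=1)                              # stand-in linear scaling
    w = np.exp(logits - logits.max())
    w /= w.sum()                                        # softmax weights
    return (w[:, None] * U).sum(axis=0)                 # prior vector for sample n
```

Stacking `sample_level_prior` over all N samples yields the matrix Z of claim 1; swapping the roles of views and samples gives the view-level variant of S13.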
Description
Multi-view late fusion classification method and system based on information consistency and complementarity priori Technical Field The invention relates to the technical field of computer vision and pattern recognition, in particular to a multi-view late fusion classification method and system based on an information consistency and complementarity priori. Background Multi-view data is widely adopted in tasks such as image classification; the different views are usually derived from different observation angles or feature description modes and can characterize the attribute information of the same object from multiple aspects, thereby improving the completeness of the semantic expression and the reliability of the classification results. Multi-view classification methods generally perform feature extraction on each view separately and fuse the multi-view information at the feature level or the decision level, among which the late fusion mode is widely applied due to its flexible structure and easy extensibility. Existing methods mostly integrate multi-view information through feature splicing, weighted fusion or correlation modeling, which makes it difficult to effectively describe the shared and differing information among the multiple views; the fusion result is therefore easily unstable when the view quality is low or the view information conflicts, so that the overall classification performance is limited. To address these issues, some research attempts to introduce attention mechanisms or feature alignment strategies to adaptively adjust the contribution of different views. However, in the absence of explicit structural constraints or a priori guidance, such methods are susceptible to local anomalous features, and there is still a risk of overfitting in scenes with limited sample sizes, making it difficult to maintain a stable fusion effect in complex application environments.
Disclosure of Invention The invention provides a multi-view late fusion classification method based on an information consistency and complementarity priori, which comprises the following steps: S1, designing a sample-level and view-level double-layer consistency priori information module as the main building block of a fusion network; in a multi-view classification task, extracting the shared semantics of the same sample from the multi-view features, and utilizing the global semantic structure of all samples characterized in a single view to apply a semantic consistency constraint on cross-view features and to suppress deviating views, so as to obtain a more stable and more discriminative multi-view fusion representation; S2, directly injecting the sample-level and view-level information prioris into a multi-head attention fusion module, wherein the multi-head attention fusion module uses the original view features as queries and the sample-level and view-level priori information as keys and values, and calculates attention weights on a plurality of attention heads in parallel, so as to model the global relations between samples in the sample dimension and the global semantics of each view in the view dimension; the output of each attention head is concatenated in the channel dimension and projected through a feedforward network to obtain a fused intermediate representation; S3, to further reduce the distance between similar samples in the feature space, introducing a contrast learning module with a memory queue, wherein the module enhances the model's ability to model the relations between samples by constructing positive and negative sample pairs; in the training process, the features output by the multi-head attention are used as queries, the corresponding sample-level priori is used as the positive sample, and the other sample-level prioris in the same batch and the historical samples stored in the memory queue are used as negative samples.
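The priori-guided multi-head attention of S2 can be sketched as follows. This is a minimal NumPy sketch under simplifying assumptions: the learned query/key/value and feedforward projections are omitted, the heads simply split the channel dimension, and the residual correction mentioned in the claims is modeled as a plain additive skip connection:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prior_guided_attention(x, z_sample, z_view, n_heads=2):
    """S2 sketch: view features x (N, D) act as queries; the sample-level
    prior z_sample (N, D) and the view-level prior z_view (V, D) are stacked
    and used as keys and values. Heads split the channel dimension; the head
    outputs are concatenated back and added to x as a residual correction."""
    N, D = x.shape
    kv = np.concatenate([z_sample, z_view], axis=0)     # (N+V, D) keys/values
    dh = D // n_heads
    outs = []
    for h in range(n_heads):
        q = x[:, h * dh:(h + 1) * dh]                   # (N, dh) query slice
        k = kv[:, h * dh:(h + 1) * dh]                  # (N+V, dh) key slice
        v = k                                           # values share the slice
        attn = softmax(q @ k.T / np.sqrt(dh), axis=1)   # (N, N+V) weights
        outs.append(attn @ v)                           # (N, dh) head output
    fused = np.concatenate(outs, axis=1)                # (N, D) channel concat
    return x + fused                                    # residual correction
```

In the patent, a feedforward projection would follow the concatenation; it is left out here to keep the sketch focused on how the two priori matrices enter as keys and values.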
The optimization objective of the contrast learning is to minimize the InfoNCE loss; the sample-level and view-level consistency prioris are optimized by increasing the similarity between the query features and the positive sample features under the same index and decreasing the similarity between the query features and the other sample and historical sample features, and the contrast losses calculated over all views are averaged to obtain the multi-view contrast learning loss of the batch. Preferably, for the sample-level and view-level double-layer consistency priori information module in S1, the module is composed of greedy pairing, one-dimensional convolution and attention fusion networks; the input multi-view features H ∈ R^(V×N×D), where V is the number of views, N is the number of samples, and D is the dimension of the input view features, are fed into the priori information module. Preferably, for the sample-level consistency information priori module in S1, similarity
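The InfoNCE objective with a memory queue described in S3 can be sketched as below. This is a minimal NumPy sketch: the temperature `tau` and the queue length are illustrative hyperparameters, and the per-view averaging mentioned in the text would simply average this loss across views:

```python
import numpy as np

def info_nce_with_queue(q, z_pos, queue, tau=0.07):
    """InfoNCE loss per S3: q (B, D) are the multi-head-attention outputs used
    as queries; z_pos (B, D) are the matching sample-level priors (positives);
    queue (Q, D) holds historical priors serving as extra negatives. Same-batch
    priors at a different index also act as negatives (off-diagonal terms)."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    pn = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    kn = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits_batch = qn @ pn.T / tau                    # (B, B): diagonal = positives
    logits_queue = qn @ kn.T / tau                    # (B, Q): all negatives
    logits = np.concatenate([logits_batch, logits_queue], axis=1)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_prob[idx, idx].mean()                 # positives sit on the diagonal

def update_queue(queue, new_keys, max_len=1024):
    """FIFO memory-queue update: enqueue the newest priors, drop the oldest."""
    queue = np.concatenate([queue, new_keys], axis=0)
    return queue[-max_len:]
```

A lower loss when the queries already coincide with their sample-level priors, compared with random queries, is exactly the behavior the S3 objective rewards.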