CN-118918924-B - Cross-modal voice and face association method based on heterogeneous hash network
Abstract
The invention belongs to the technical field of communication and discloses a cross-modal voice and face association method based on a heterogeneous hash network. The specific technical scheme is as follows: the invention provides a cross-modal deep residual alignment network based on hash learning. The network maps input features into hash codes of fixed length, which effectively reduces feature dimensionality and memory use, while hashing greatly improves computational efficiency. The deep network, which integrates a residual structure, has stronger feature representation capability and good generalization. By minimizing a loss function, the Hamming distances between voice and face features of the same identity become closer in the hash space while the distances between voice and face features of different identities become farther, thereby achieving feature alignment of voice and face and completing cross-modal matching, verification and retrieval tasks. The accuracy of the model on cross-modal matching, retrieval and verification tasks is significantly improved over the prior art.
Inventors
- LIANG YANXIA
- ZHANG HUANHUAN
- LIU XIN
- JIANG JING
- WANG FUPING
- JIA TONG
- WANG HUAN
- XU YIKANG
Assignees
- Xi'an University of Posts and Telecommunications (西安邮电大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-08-12
Claims (5)
- 1. A cross-modal voice and face association method based on a heterogeneous hash network, characterized by comprising the following specific steps:
  Step one, constructing a data set: the network is trained and tested with the preprocessed data set.
  Step two, loading the data set: a sampler is added so that diversified samples are received during training; each iteration contains multiple samples of different identities, and each identity contains a fixed number of instances. During network training, the data is presented at each iteration in the form of tuples, specifically expressed as:
  $$T=\{(v_i, f_i, l_i)\}_{i=1}^{batch}$$
  where $v_i$ represents a speech segment, $f_i$ represents a face image, and $l_i$ is the label; if $l_i=l_j$, then $(v_i,f_i)$ and $(v_j,f_j)$ belong to the same person, and if $l_i\neq l_j$ they belong to individuals of different identities; $batch$ represents the number of tuples in each iteration of the network.
  Step three, establishing a voice feature extraction network $A_v(\cdot)$: an ECAPA-TDNN model pre-trained on the VoxCeleb data set is used as the voice feature extraction network; for each voice segment, its Mel-spectrum features are first extracted and normalized, then used as the input of the ECAPA-TDNN model to extract features and output the final voice features.
  Step four, constructing a face feature extraction network $A_f(\cdot)$: an Inception-V1 network pre-trained on the VGGFace data set is used as the face feature extraction network; a face picture passed through the Inception-V1 network yields a 512-dimensional vector as the final face feature.
  Step five, constructing a cross-modal feature alignment network, specifically expressed as:
  $$\Phi(\cdot,\theta)=\{\phi_v(\cdot,\theta_v),\ \phi_f(\cdot,\theta_f)\}$$
  The voice features and face features obtained in the feature extraction stage are the input of the network; the cross-modal feature alignment network automatically routes the voice features and face features into the corresponding feature alignment sub-networks, and several fully connected layers $FC(\cdot)$ are used to achieve feature alignment. Except for the last layer, each fully connected layer is followed by a batch normalization layer $BN(\cdot)$, expressed as:
  $$\hat{x}=\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y=\gamma\hat{x}+\beta$$
  where $x$ represents the features after passing through the fully connected layer, $\mu$ represents the mean of all sample features within a batch on each feature channel, $\sigma^2$ represents the variance of all sample features within a batch on each feature channel, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$ and $\beta$ are learnable parameters.
  A hash function layer $\mathrm{sign}(\cdot)$ is added as the last layer of the cross-modal feature alignment network, so the final output of the cross-modal feature alignment network is:
  $$B_v=\mathrm{sign}(\phi_v(F_v,\theta_v))$$
  $$B_f=\mathrm{sign}(\phi_f(F_f,\theta_f))$$
  where $\phi_v$ represents the voice feature alignment sub-network and $\phi_f$ represents the face feature alignment sub-network in the constructed cross-modal feature alignment network, $F_v$ is the voice feature obtained through the feature extraction network, $F_f$ is the face feature obtained through the feature extraction network, $\theta_v$ and $\theta_f$ represent the parameters learned by the two sub-networks, and $B_v$ and $B_f$ are the finally obtained hash codes of the two modalities.
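For illustration, the identity-balanced sampler of step two can be sketched as below. This is a minimal sketch, assuming a PK-style sampling scheme (p identities per batch, k instances per identity); the function name `pk_batches` and the values of p and k are illustrative, not taken from the patent:

```python
import random
from collections import defaultdict

def pk_batches(labels, p=4, k=4, seed=0):
    """Yield batches of sample indices containing p distinct identities
    with k instances each, as the sampler of step two requires."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_id[lab].append(idx)
    ids = list(by_id)
    rng.shuffle(ids)
    for start in range(0, len(ids) - p + 1, p):
        batch = []
        for identity in ids[start:start + p]:
            # Draw a fixed number of instances per identity.
            batch.extend(rng.sample(by_id[identity], min(k, len(by_id[identity]))))
        yield batch

# Toy labels: 8 identities with 6 samples each -> batches of 4 ids x 4 instances.
labels = [i for i in range(8) for _ in range(6)]
for b in pk_batches(labels):
    print(b)
```

Sampling a fixed number of instances per identity guarantees that every batch contains valid anchor-positive-negative combinations for the triplet loss of claim 4.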
- 2. The method of claim 1, wherein in step one the data set includes the speech segments and face images of 1225 identities and is expressed as $D=\{d_i\}_{i=1}^{N}$, where $i$ represents the $i$-th identity and $N=1225$; each identity is internally represented as $d_i=\{\{v_i^j\}_{j=1}^{m},\{f_i^j\}_{j=1}^{n}\}$, where $v_i^j$ represents the $j$-th speech segment of the $i$-th identity, with $j$ ranging from 1 to $m$, i.e. the $i$-th identity has $m$ speech samples in total; similarly, $f_i^j$ represents the $j$-th face image frame of the $i$-th identity, with $j$ ranging from 1 to $n$, i.e. the $i$-th identity has $n$ face image frames.
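A hypothetical on-disk layout matching this indexing (directory names and per-identity counts are placeholders, not from the patent):

```python
m, n = 5, 10  # speech segments and face frames per identity (illustrative counts)
dataset = {
    i: {
        "voice": [f"voice/{i:04d}/{j:03d}.wav" for j in range(1, m + 1)],
        "face":  [f"face/{i:04d}/{j:03d}.jpg" for j in range(1, n + 1)],
    }
    for i in range(1, 1226)  # 1225 identities
}
assert len(dataset) == 1225
```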
- 3. The method for associating cross-modal voice with a face based on a heterogeneous hash network according to claim 2, wherein in step five the batch-normalized features of the first layer of the network are connected with the features of the last layer, fusing shallow-layer and deep-layer features; the early layers are trained separately for the two modalities, while the later layers share weights between the two modalities; through the network, the original 512-dimensional face features and 192-dimensional voice features are mapped into a 128-dimensional space.
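A minimal PyTorch sketch of the alignment network described by claims 1 and 3, assuming a single modality-specific early block, a single shared later block, and a residual sum as the skip connection; the hidden width and activation are assumptions, and `torch.sign` is used as the hash layer consistent with the ±1 codes of claim 5:

```python
import torch
import torch.nn as nn

class ResidualHashAligner(nn.Module):
    """Sketch: modality-specific early layers, weight-shared later layers,
    a skip connection fusing shallow and deep features, and a sign(.) hash
    layer mapping 192-d voice / 512-d face features to 128-bit codes."""
    def __init__(self, voice_dim=192, face_dim=512, mid=256, hash_len=128):
        super().__init__()
        # Early layers: trained separately for each modality.
        self.voice_early = nn.Sequential(nn.Linear(voice_dim, mid), nn.BatchNorm1d(mid), nn.ReLU())
        self.face_early = nn.Sequential(nn.Linear(face_dim, mid), nn.BatchNorm1d(mid), nn.ReLU())
        # Later layers: weights shared across the two modalities.
        self.shared = nn.Sequential(nn.Linear(mid, mid), nn.BatchNorm1d(mid), nn.ReLU())
        self.head = nn.Linear(mid, hash_len)  # last FC layer: no BN, per claim 1

    def forward(self, x, modality):
        early = self.voice_early(x) if modality == "voice" else self.face_early(x)
        deep = self.shared(early)
        fused = deep + early  # residual fusion of shallow and deep features
        # torch.sign maps to {-1, 0, +1}; exact zeros are rare in practice.
        return torch.sign(self.head(fused))

net = ResidualHashAligner()
codes_v = net(torch.randn(8, 192), "voice")  # -> (8, 128) codes in {-1, +1}
codes_f = net(torch.randn(8, 512), "face")
```

Because sign(·) has zero gradient almost everywhere, implementations typically train on a relaxed output (e.g., tanh) and binarize only afterwards; the claims do not specify how this is handled.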
- 4. The method of cross-modal voice and face association based on a heterogeneous hash network according to claim 3, wherein a loss function is constructed as follows: N triplets are constructed in one batch according to a triplet construction method; the N triplets are derived from the hash codes generated by the cross-modal feature alignment network and are used to calculate the triplet loss, which optimizes the feature representation and enhances the discriminative capability of the model:
  $$L_{tri}=\sum_{i=1}^{N}\max\big(0,\ d(a_i,p_i)-d(a_i,n_i)+margin\big)$$
  where $d(a_i,p_i)$ represents the Hamming distance between the anchor and the positive sample, $d(a_i,n_i)$ represents the Hamming distance between the anchor and the negative sample, and $margin$ is a hyperparameter used to control the distance gap between positive and negative sample pairs. In addition, based on the characteristics of the hash codes, an inter-modal similarity loss is proposed; the real similarity matrix $S$ is constructed as:
  $$S_{ij}=\begin{cases}1,&l_i=l_j\\-1,&l_i\neq l_j\end{cases}$$
  where $S_{ij}$ represents the similarity of two samples: if the labels of the two samples are identical the similarity is 1, and if the labels of the two samples are not identical the similarity is $-1$. The inner product of the two hash vectors is used to calculate the predicted similarity matrix $Y$:
  $$Y_{ij}=\frac{1}{k}\,(b_i^v)^{\top}b_j^f$$
  where $k$ is the length of the hash code, $b_i^v$ represents the hash code of the $i$-th speech sample obtained through network mapping, and $b_j^f$ represents the hash code of the $j$-th face sample obtained through network mapping. The similarity loss is expressed as:
  $$L_{sim}=\lVert S-Y\rVert^2$$
  where $\lVert\cdot\rVert$ represents the distance between the two matrices measured with the Euclidean method. The overall loss function is:
  $$L=L_{tri}+L_{sim}$$
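A minimal sketch of these two loss terms, assuming relaxed (pre-sign) codes and a squared-Euclidean proxy for the Hamming distance; function names and the margin value are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, pos, neg, margin=0.5):
    """Triplet loss on (relaxed) hash codes; margin controls the gap
    between positive and negative pairs."""
    d_ap = (anchor - pos).pow(2).sum(dim=1)  # Euclidean proxy for Hamming distance
    d_an = (anchor - neg).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()

def similarity_loss(b_voice, b_face, labels_v, labels_f):
    """Inter-modal similarity loss: S_ij = +1 for same identity, -1 otherwise;
    predicted similarity Y_ij = (1/k) <b_i^v, b_j^f>."""
    k = b_voice.size(1)  # hash code length
    S = torch.where(labels_v[:, None] == labels_f[None, :], 1.0, -1.0)
    Y = b_voice @ b_face.t() / k
    return (S - Y).pow(2).sum()

# Toy usage with relaxed codes of length k = 128: the anchor is a voice code,
# the positive is the same-identity face code, the negative a different one.
bv = torch.tanh(torch.randn(8, 128))
bf = torch.tanh(torch.randn(8, 128))
labels = torch.arange(8)
loss = triplet_loss(bv, bf, bf.roll(1, dims=0)) + similarity_loss(bv, bf, labels, labels)
```

The Euclidean proxy is exact up to scale for binary codes: for $a,b\in\{-1,+1\}^k$, $\lVert a-b\rVert^2=2(k-\langle a,b\rangle)=4\,d_H(a,b)$.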
- 5. The method for associating cross-modal speech with a face based on a heterogeneous hash network according to claim 4, wherein a mini-batch training method is selected to train the cross-modal feature alignment network, one batch of data being used for training in each iteration. The feature extraction networks feed the two modal features into their respective feature extraction sub-networks to extract features; the extracted features are used by the cross-modal feature alignment network to learn hash codes; after each iteration, i.e. after the loss function is calculated, the network parameters are updated by back-propagation so as to minimize the loss function. The hash codes of the two modalities learned in each iteration satisfy:
  $$B\in\{-1,+1\}^{n\times k}$$
  where $B$ is the hash code matrix of voice or face obtained by network mapping, $n$ is the batch size and represents the number of samples processed in each iteration, each sample, whether a voice sample or a face sample, is mapped to a hash code of length $k$, and each dimension of each hash code takes the value $+1$ or $-1$.
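A self-contained sketch of this mini-batch training procedure; the optimizer, learning rate, tanh relaxation of sign(·), and the synthetic stand-in features are assumptions, not details from the claims:

```python
import torch
import torch.nn as nn

# Stand-in alignment heads: 192-d voice and 512-d face features -> 128-bit codes.
phi_v = nn.Sequential(nn.Linear(192, 128))
phi_f = nn.Sequential(nn.Linear(512, 128))
opt = torch.optim.Adam(list(phi_v.parameters()) + list(phi_f.parameters()), lr=1e-4)

for step in range(3):                          # each iteration consumes one batch
    voice = torch.randn(16, 192)               # batch of n = 16 samples
    face = torch.randn(16, 512)
    labels = torch.randint(0, 4, (16,))
    bv = torch.tanh(phi_v(voice))              # relaxed codes in (-1, 1)
    bf = torch.tanh(phi_f(face))
    k = bv.size(1)
    S = torch.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    loss = (S - bv @ bf.t() / k).pow(2).sum()  # similarity loss term
    opt.zero_grad()
    loss.backward()                            # back-propagate and update parameters
    opt.step()

B_v, B_f = torch.sign(bv), torch.sign(bf)      # final codes: B in {-1, +1}^(n x k)
```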
Description
Cross-modal voice and face association method based on heterogeneous hash network

Technical Field

The invention belongs to the technical field of communication, and particularly relates to a cross-modal voice and face association method based on a heterogeneous hash network.

Background

In modern society, identity recognition is a central need in many fields such as security, criminal investigation, surveillance and personal services. Conventional recognition methods, such as those based on passwords, identity documents or biometric features (fingerprint, facial recognition, etc.), satisfy the need for identity verification to some extent but can be limited in certain cases, for example in scenes lacking physical contact or other information. Existing cross-modal voice-face matching algorithms mainly extract features with a feature extraction network, map the two modal features to the same feature space with one or two fully connected layers in the feature alignment stage, and achieve intra-class compactness and inter-class separation by minimizing a loss function. However, using few fully connected layers may cause the model to overfit the training data, with insufficient generalization capability and insufficiently effective feature alignment. In addition, when the original features are mapped to a space of common dimension through the network, each sample feature is represented as a high-dimensional floating-point vector, which increases memory occupation and computational burden when processing large-scale data sets.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a cross-modal deep residual alignment network based on hash learning. The network maps input features into hash codes of fixed length, effectively reducing feature dimensionality and memory use, while hashing greatly improves computational efficiency. The deep network integrating a residual structure has stronger feature representation capability and good generalization; it makes the Hamming distances of voice and face features of the same identity closer in the hash space and the distances of voice and face features of different identities farther, thereby achieving feature alignment of voice and face and completing cross-modal matching, verification and retrieval tasks.

In order to achieve this purpose, the technical scheme adopted by the invention is a cross-modal voice and face association method based on a heterogeneous hash network, comprising the following specific steps.

A data set is constructed, and the network is trained and tested with the preprocessed VoxCeleb data set. The voice-face data set includes the speech segments and face images of 1225 identities, represented as $D=\{d_i\}_{i=1}^{N}$, where $i$ represents the $i$-th identity and $N=1225$. Each identity is internally expressed as $d_i=\{\{v_i^j\}_{j=1}^{m},\{f_i^j\}_{j=1}^{n}\}$, where $v_i^j$ represents the $j$-th speech segment of the $i$-th identity, with $j$ ranging from 1 to $m$, i.e. the $i$-th identity has $m$ speech samples in total; similarly, $f_i^j$ represents the $j$-th face image frame of the $i$-th identity, with $j$ ranging from 1 to $n$, i.e. the $i$-th identity has $n$ face image frames. For a fair comparison, the data set is divided into a training set, a test set and a validation set according to the division method of previous work.
The data set is loaded. Since the triplet loss is subsequently needed to back-propagate and optimize the network parameters, a sufficient number of triplets (anchor, positive, negative) must be selected in each iteration, so a sampler is added to ensure that the model receives diversified samples during training; each iteration contains multiple samples of different identities, and each identity has a fixed number of instances, so that more effective triplets can be picked for the subsequent computation of the triplet loss. During network training, at each iteration the data is presented in the form of tuples, in particular $T=\{(v_i,f_i,l_i)\}_{i=1}^{batch}$, where $v$ represents a voice segment, $f$ represents a face image, and $l$ is a label; if $l_i=l_j$, then $(v_i,f_i)$ and $(v_j,f_j)$ belong to the same person, and if $l_i\neq l_j$ they belong to persons of different identities; $batch$ represents the number of tuples in each iteration of the network. A voice feature extraction network $A_v(\cdot)$ is constructed, using an ECAPA-TDNN network pre-trained on the VoxCeleb data set as the voice feature extraction network: for each voice segment, its MFCC features are first extracted and normalized as input to the ECAPA-TDNN network to extract features, and a 192-dimensional vector is output as the final voice feature. A face feature extraction network $A_f(\cdot)$ is constructed, using an Inception-V1 network pre-trained on the VGGFace data set as the face feature extraction network: a face picture passed through the network yields a 512-dimensional vector as the final face feature. A cross-modal feature alignment network $\Phi(\cdot,\theta)=\{\phi_v(\cdot,\theta_v),\ \phi_f(\cdot,\theta_f)\}$ is constructed.
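As a concrete illustration of the voice front-end, the following minimal sketch extracts and normalizes MFCC features with torchaudio; the sample rate, clip length and MFCC settings are assumptions, and the pre-trained ECAPA-TDNN encoder itself is only referenced in a comment:

```python
import torch
import torchaudio

# Stand-in for a 3-second, 16 kHz speech segment.
waveform = torch.randn(1, 16000 * 3)

# Extract MFCC features and normalize them per feature channel.
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)(waveform)
mfcc = (mfcc - mfcc.mean(dim=-1, keepdim=True)) / (mfcc.std(dim=-1, keepdim=True) + 1e-5)

# mfcc now has shape (1, 40, frames); per the description, these normalized
# features are fed to the pre-trained ECAPA-TDNN network, which outputs a
# 192-dimensional vector as the final voice feature.
print(mfcc.shape)
```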