CN-122024341-A - Voiceprint face dynamic binding method and system based on visual distance and auditory distance fusion
Abstract
The invention discloses a voiceprint-face dynamic binding method and system based on the fusion of visual distance and auditory distance, and relates to the technical fields of computer vision and speech signal processing. The method comprises: synchronously collecting a video stream and an audio stream; detecting a plurality of faces in the video stream and obtaining the visual distance of each face; determining, through lip-movement detection, candidate face regions and their lip-movement start timestamps; performing sound-source separation on the audio stream to obtain candidate voiceprint segments and their voiceprint start timestamps; pairing the candidate face regions with the candidate voiceprint segments to construct candidate pairs; for each candidate pair, calculating an auditory distance from the difference between the lip-movement and voiceprint start times combined with the speed of sound; and determining binding pairs from all candidate pairs by comparing the difference between the visual distance and the auditory distance of each candidate pair, thereby achieving dynamic association of voiceprints and faces. The invention can bind a speaker's voiceprint to the corresponding face dynamically, accurately, and in real time without pre-training a binding model.
Inventors
- MO WEI
Assignees
- 广东松山职业技术学院
Dates
- Publication Date
- 20260512
- Application Date
- 20260130
Claims (8)
- 1. A voiceprint-face dynamic binding method based on the fusion of visual distance and auditory distance, characterized by comprising the following steps: acquiring a video stream captured by a camera device and an audio stream captured by a microphone array; performing face detection on the video stream, and determining a plurality of face regions and the visual distance corresponding to each; performing lip-movement detection on the plurality of face regions, and determining candidate face regions in which lip movement occurs and their corresponding lip-movement start timestamps; performing sound-source separation on the audio stream, and determining at least one candidate voiceprint segment and its corresponding voiceprint start timestamp; pairing the candidate face regions with the candidate voiceprint segments to construct at least one face-voiceprint candidate pair; for each candidate pair, calculating an auditory distance from the time difference between the corresponding lip-movement start timestamp and voiceprint start timestamp combined with the speed of sound; and determining binding pairs from all candidate pairs according to the distance difference between the visual distance of the candidate face region in each candidate pair and the corresponding auditory distance, so as to complete voiceprint-face binding.
- 2. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 1, wherein performing lip-movement detection on the plurality of face regions and determining candidate face regions in which lip movement occurs and their corresponding lip-movement start timestamps comprises: extracting a corresponding mouth-region image sequence for each face region; analyzing the mouth-region image sequence to detect the start moment of a mouth-opening action; and determining a face region for which a start moment is detected as a candidate face region, and determining the video-frame timestamp corresponding to that start moment as the lip-movement start timestamp of the candidate face region.
- 3. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 2, wherein analyzing the mouth-region image sequence to detect the start moment of a mouth-opening action comprises: performing optical-flow calculation on the mouth-region image sequence to obtain optical-flow feature changes between consecutive frames; and when the optical-flow feature change exceeds a preset threshold, determining the time of the corresponding video frame as the start moment of the mouth-opening action.
- 4. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 1, wherein performing sound-source separation on the audio stream and determining at least one candidate voiceprint segment and its corresponding voiceprint start timestamp comprises: performing sound-source separation on the audio stream to obtain at least one separated audio signal; performing voice activity detection on each audio signal to locate the start point of a speech signal; and determining the speech-signal segment beginning at the start point as a candidate voiceprint segment, and determining the timestamp corresponding to the start point as the voiceprint start timestamp of that candidate voiceprint segment.
- 5. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 1, wherein determining binding pairs from all candidate pairs comprises: taking the distance difference as a matching cost value; optimally assigning the candidate face regions to the candidate voiceprint segments with the Hungarian algorithm based on the matching cost values; and, according to the optimal assignment result, determining each candidate pair formed by a candidate face region and a candidate voiceprint segment that correspond to each other as a binding pair.
- 6. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 1, wherein calculating the auditory distance in combination with the speed of sound comprises: acquiring temperature information and/or humidity information of the current environment; determining the current speed of sound based on the temperature information and/or humidity information; and calculating the auditory distance from the current speed of sound.
- 7. The voiceprint-face dynamic binding method based on visual distance and auditory distance fusion of claim 1, further comprising: when no binding pair is successfully determined from all candidate pairs, outputting a binding-failure indication signal, and recording an association between the identification information of the corresponding candidate face region and the identification information of the candidate voiceprint segment.
- 8. A voiceprint-face dynamic binding system based on visual distance and auditory distance fusion, characterized by comprising: a data acquisition unit configured to acquire a video stream captured by a camera device and an audio stream captured by a microphone array; a vision processing unit configured to perform face detection on the video stream, determine a plurality of face regions and the visual distance corresponding to each, perform lip-movement detection on the plurality of face regions, and determine candidate face regions in which lip movement occurs and their corresponding lip-movement start timestamps; an audio processing unit configured to perform sound-source separation on the audio stream and determine at least one candidate voiceprint segment and its corresponding voiceprint start timestamp; a candidate-pair generating unit configured to pair the candidate face regions with the candidate voiceprint segments to construct at least one face-voiceprint candidate pair; an auditory distance calculating unit configured to calculate, for each candidate pair, an auditory distance from the time difference between the corresponding lip-movement start timestamp and voiceprint start timestamp combined with the speed of sound; and a binding decision unit configured to determine binding pairs from all candidate pairs according to the distance difference between the visual distance of the candidate face region in each candidate pair and the corresponding auditory distance, so as to complete voiceprint-face binding.
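The lip-movement onset step of claims 2 and 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses the mean absolute inter-frame difference of the mouth region as a crude stand-in for the optical-flow feature change the claims describe (a real implementation would use dense optical flow, e.g. OpenCV's `calcOpticalFlowFarneback`, and threshold the mean flow magnitude), and the function name, frame format, and threshold value are all illustrative assumptions.

```python
import numpy as np

def lip_onset_frame(mouth_frames, threshold=1.0):
    """Return the index of the first frame whose inter-frame motion measure
    exceeds `threshold` (the mouth-opening start moment), or None.

    `mouth_frames` is a sequence of equally sized grayscale images
    (2-D arrays) cropped to one face's mouth region. The motion measure
    here is the mean absolute pixel difference between consecutive frames,
    a simple proxy for the optical-flow feature change of claim 3.
    """
    for i in range(1, len(mouth_frames)):
        prev = mouth_frames[i - 1].astype(np.float32)
        curr = mouth_frames[i].astype(np.float32)
        motion = float(np.mean(np.abs(curr - prev)))
        if motion > threshold:
            # Claim 2: the timestamp of this video frame becomes the
            # lip-movement start timestamp of the candidate face region.
            return i
    return None
```

The returned frame index would then be converted to a lip-movement start timestamp using the video frame rate.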
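The voice-activity step of claim 4 can likewise be sketched with a simple short-time-energy detector that locates the first audio frame whose energy exceeds a threshold and converts that frame index into a voiceprint start timestamp. The function name, frame length, sample rate, and energy threshold are illustrative assumptions; the patent does not specify a particular VAD technique.

```python
import numpy as np

def voiceprint_start_timestamp(audio, sample_rate=16000,
                               frame_len=512, energy_thresh=0.01):
    """Return the speech onset time (seconds) in `audio`, or None.

    `audio` is a 1-D float array holding one separated source signal,
    e.g. one output of the sound-source-separation stage of claim 4.
    """
    n_frames = len(audio) // frame_len
    for f in range(n_frames):
        frame = audio[f * frame_len:(f + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # short-time energy of the frame
        if energy > energy_thresh:
            return f * frame_len / sample_rate  # frame index -> seconds
    return None
```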
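The auditory-distance calculation of claims 1 and 6 reduces to multiplying the lip-to-voiceprint onset delay by the current speed of sound. A minimal sketch, assuming the standard dry-air linear approximation c ≈ 331.3 + 0.606·T m/s (T in degrees Celsius); the humidity correction mentioned in claim 6 is omitted here because its effect is small, and the function names are illustrative.

```python
def speed_of_sound(temp_c):
    """Approximate speed of sound in dry air (m/s) at `temp_c` degrees Celsius."""
    return 331.3 + 0.606 * temp_c

def auditory_distance(lip_onset_s, voiceprint_onset_s, temp_c=20.0):
    """Distance (m) implied by the propagation delay between the visual
    lip-movement onset and the acoustic voiceprint onset (claim 1).

    Light reaches the camera effectively instantly, so the microphone's
    voiceprint onset lags the lip-movement onset by distance / c.
    """
    delay = voiceprint_onset_s - lip_onset_s  # seconds; >= 0 for a true pair
    return speed_of_sound(temp_c) * delay
```

For example, a 10 ms onset delay at 20 degrees Celsius implies a speaker roughly 3.4 m from the microphone.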
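Finally, the binding decision of claims 1, 5, and 7 can be sketched as an assignment problem: build the matrix of |visual distance − auditory distance| costs over all face-voiceprint candidate pairs and solve it with the Hungarian algorithm, here via SciPy's `linear_sum_assignment`. The rejection threshold `max_diff` and all distance values are illustrative assumptions; the patent does not fix a threshold.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bind_faces_to_voiceprints(visual_dists, auditory_dists, max_diff=1.0):
    """Return a list of (face_idx, voiceprint_idx) binding pairs.

    visual_dists[i]   : visual distance of candidate face i (m)
    auditory_dists[j] : auditory distance of candidate voiceprint j (m)

    Pairs whose distance difference exceeds `max_diff` are rejected,
    corresponding to the binding-failure path of claim 7.
    """
    # Cost matrix: absolute visual/auditory distance difference (claim 5).
    cost = np.abs(np.subtract.outer(np.asarray(visual_dists, dtype=float),
                                    np.asarray(auditory_dists, dtype=float)))
    # Hungarian algorithm: optimal one-to-one assignment minimizing total cost.
    rows, cols = linear_sum_assignment(cost)
    return [(int(i), int(j)) for i, j in zip(rows, cols)
            if cost[i, j] <= max_diff]
```

With faces at visual distances [2.0, 5.0] m and voiceprints at auditory distances [4.9, 2.1] m, the assignment pairs face 0 with voiceprint 1 and face 1 with voiceprint 0, even though a greedy nearest match on raw indices would be wrong.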
Description
Voiceprint face dynamic binding method and system based on visual distance and auditory distance fusion

Technical Field
The invention relates to the technical fields of computer vision and speech signal processing, and in particular to a voiceprint-face dynamic binding method and system based on the fusion of visual distance and auditory distance.

Background
In practical application scenarios such as intelligent education, video conferencing, and multi-person collaborative work, multiple people often speak alternately or even simultaneously. Although standalone face recognition and voiceprint recognition technologies are relatively mature, automatically, accurately, and in real time associating a specific speaking voiceprint with the corresponding face in the absence of prior knowledge, i.e. solving the "who is speaking" problem, remains a key bottleneck for accurate speech transcription, personalized interaction, and advanced human-computer interaction applications. Prior-art multi-modal fusion methods mainly perform identity recognition or verification using the intrinsic modal features of faces and voiceprints. They perform well when data quality is high and the modalities are well aligned, but in complex scenes where multiple concurrent sound sources and faces are mixed in time and space, direct feature matching often produces binding errors or failures due to modal misalignment and feature cross-interference, and an effective mechanism for fast, coarse-grained association at the signal level is lacking.
In addition, existing deep-learning-based cross-modal association studies generally map the two modalities into a shared embedding space to learn semantic correlation. Such methods not only rely on large amounts of paired training data, but also suffer from high computational complexity and poor real-time performance when handling real-time, streaming, many-to-many dynamic binding scenarios.

Disclosure of Invention
To solve the problems in the prior art, the invention provides a voiceprint-face dynamic binding method and system based on the fusion of visual distance and auditory distance, which can bind a speaker's voiceprint to the corresponding face dynamically, accurately, and in real time without training a binding model in advance. To this end, the invention provides a voiceprint-face dynamic binding method based on the fusion of visual distance and auditory distance, comprising the following steps: acquiring a video stream captured by a camera device and an audio stream captured by a microphone array; performing face detection on the video stream, and determining a plurality of face regions and the visual distance corresponding to each; performing lip-movement detection on the plurality of face regions, and determining candidate face regions in which lip movement occurs and their corresponding lip-movement start timestamps; performing sound-source separation on the audio stream, and determining at least one candidate voiceprint segment and its corresponding voiceprint start timestamp; pairing the candidate face regions with the candidate voiceprint segments to construct at least one face-voiceprint candidate pair; for each candidate pair, calculating an auditory distance from the time difference between the corresponding lip-movement start timestamp and voiceprint start timestamp combined with the speed of sound; and determining binding pairs from all candidate pairs according to the distance difference between the visual distance of the candidate face region in each candidate pair and the corresponding auditory distance, so as to complete voiceprint-face binding.

Optionally, performing lip-movement detection on the plurality of face regions and determining candidate face regions in which lip movement occurs and their corresponding lip-movement start timestamps includes: extracting a corresponding mouth-region image sequence for each face region; analyzing the mouth-region image sequence to detect the start moment of a mouth-opening action; and determining a face region for which a start moment is detected as a candidate face region, and determining the video-frame timestamp corresponding to that start moment as the lip-movement start timestamp of the candidate face region. Optionally, analyzing the mouth-region image sequence to detect the start moment of a mouth-opening action includes: performing optical-flow calculation on the mouth-region image sequence to obtain optical-flow feature changes between consecutive frames; and when the optical-flow feature change exceeds a preset threshold, determining the