
CN-121999785-A - Speaker tracking method based on audio-visual dual mode and related equipment

CN 121999785 A

Abstract

The application discloses an audio-visual dual-mode speaker tracking method and related equipment, and relates to the technical field of data processing. The method comprises: in response to a speaker tracking instruction, acquiring audio information and video information; performing voiceprint comparison based on a historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the historical database comprises a plurality of voice IDs and corresponding face IDs; if such a voice ID exists, determining the speaker based on it; and if it does not exist, performing lip-movement recognition based on the video information to determine the speaker. By comprehensively using multi-dimensional information such as sound-source comparison and lip-movement detection, and by fusing features across modalities, the application rapidly confirms the speaker's identity and tracks the speaker stably in complex environments.
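
The abstract's pipeline (voiceprint lookup first, lip-movement recognition as the fallback, then registration of a new voice ID and face ID pairing) can be summarised as a small decision routine. The following Python sketch is illustrative only; the embedding, matching, lip-movement and registration functions are injected as placeholders because the patent does not specify concrete implementations, and none of the names below are the patent's API.

```python
from typing import Callable, Dict, Optional

import numpy as np

# Minimal sketch of the dual-mode decision flow described in the abstract:
# try a voiceprint match against the historical database first, fall back to
# lip-movement recognition if no voice ID is found, and register the new
# voice ID / face ID pairing afterwards. All helper names are illustrative.

def track_speaker(
    audio: np.ndarray,
    video: np.ndarray,
    voice_to_face: Dict[str, str],                       # historical database: voice ID -> face ID
    embed: Callable[[np.ndarray], np.ndarray],           # voiceprint extractor (e.g. ECAPA-TDNN)
    match_voice: Callable[[np.ndarray], Optional[str]],  # returns a stored voice ID, or None
    lips_face_id: Callable[[np.ndarray], str],           # lip-movement recognition over the video
    register_voice: Callable[[np.ndarray], str],         # enrols a voiceprint, returns a new voice ID
) -> str:
    """Return the face ID of the current speaker."""
    voiceprint = embed(audio)
    voice_id = match_voice(voiceprint)                   # voiceprint comparison against the database
    if voice_id is not None:                             # known speaker: reuse the stored pairing
        return voice_to_face[voice_id]
    face_id = lips_face_id(video)                        # fallback: find whose lips are moving
    new_voice_id = register_voice(voiceprint)            # voiceprint registration -> new voice ID
    voice_to_face[new_voice_id] = face_id                # associate the new voice ID with the face ID
    return face_id
```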

Inventors

  • Cao Changgang
  • Zhang Yujin
  • Luo Qinhan
  • Yu Fan

Assignees

  • Chengdu Vhd Technology Co., Ltd. (成都维海德科技有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-01-30

Claims (10)

  1. An audio-visual dual-mode speaker tracking method, characterized by comprising the following steps: in response to a speaker tracking instruction, acquiring audio information and video information; performing voiceprint comparison based on a historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the historical database comprises a plurality of voice IDs and face IDs corresponding to speakers; if the voice ID exists, determining the speaker based on the voice ID; and if it does not exist, performing lip-movement recognition based on the video information to determine the speaker.
  2. The audio-visual dual-mode speaker tracking method according to claim 1, wherein the step of performing voiceprint comparison based on a historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database further comprises: determining the source direction of the audio information based on the audio information and a preset DOA algorithm; determining a plurality of persons to be voiceprint-compared based on the source direction; and performing voiceprint comparison based on the historical database, the audio information and the plurality of persons to be voiceprint-compared, to determine whether a voice ID corresponding to the speaker exists in the historical database.
  3. The audio-visual dual-mode speaker tracking method according to claim 2, wherein the step of performing voiceprint comparison based on the historical database, the audio information and the plurality of persons to be voiceprint-compared to determine whether a voice ID corresponding to the speaker exists in the historical database further comprises: performing voiceprint comparison, based on the historical database, the audio information and the plurality of persons to be voiceprint-compared, by using a preset voiceprint feature comparison algorithm to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the preset voiceprint feature comparison algorithm is [formula not reproduced in the source text], where A and B are the two voiceprints being compared.
  4. The audio-visual dual-mode speaker tracking method according to claim 1, wherein the step of performing lip-movement recognition based on the video information to determine the speaker further comprises: performing feature extraction based on the video information by using a face key-point detection algorithm to obtain N consecutive frames of face data and lip key-point information, wherein the value of N is preset; and processing the N consecutive frames of face data and lip key-point information based on a preset classification network to perform lip-movement recognition and determine the speaker, wherein the preset classification network has an EfficientNet structure.
  5. The audio-visual dual-mode speaker tracking method according to claim 1, wherein, after the step of performing lip-movement recognition based on the video information to determine the speaker, the method further comprises: determining voiceprint features corresponding to the speaker based on the audio information, performing voiceprint registration based on the voiceprint features to obtain a new voice ID, and determining a face ID corresponding to the speaker; and associating the new voice ID with the face ID, and updating the historical database based on the associated new voice ID and face ID.
  6. The audio-visual dual-mode speaker tracking method according to claim 5, wherein the step of determining voiceprint features corresponding to the speaker based on the audio information further comprises: obtaining sample data and performing a data augmentation operation on the sample data to obtain augmented sample data, wherein the data augmentation operation comprises random cropping, addition of background noise, speech-rate adjustment, volume adjustment and SpecAugment; and training the current ECAPA-TDNN model based on the augmented sample data and a preset loss function to obtain a preset ECAPA-TDNN model, and processing the audio data by using the preset ECAPA-TDNN model to obtain the voiceprint features corresponding to the speaker, wherein the preset loss function is an additive angular margin loss function.
  7. An audio-visual dual-mode speaker tracking device, comprising: an acquisition module configured to acquire audio information and video information in response to a speaker tracking instruction; a voiceprint comparison module configured to perform voiceprint comparison based on a historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the historical database comprises a plurality of voice IDs and face IDs corresponding to speakers; and a determining module configured to determine the speaker based on the voice ID if it exists, and to perform lip-movement recognition based on the video information to determine the speaker if it does not exist.
  8. An audio-visual dual-mode speaker tracking device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program is configured to implement the steps of the audio-visual dual-mode speaker tracking method according to any one of claims 1 to 6.
  9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the audio-visual dual-mode speaker tracking method according to any one of claims 1 to 6.
  10. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the steps of the audio-visual dual-mode speaker tracking method according to any one of claims 1 to 6.
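
Claim 3 refers to a preset voiceprint feature comparison algorithm over two voiceprints A and B, but the formula itself is not reproduced in the available text. A common choice for comparing speaker embeddings is cosine similarity; the sketch below assumes that choice together with a simple acceptance threshold, and is not taken from the patent.

```python
from typing import Dict, Optional

import numpy as np

# Hedged sketch of the voiceprint comparison step: the claimed formula is not
# reproduced, so cosine similarity between two voiceprint embeddings A and B
# is assumed here. The threshold value is illustrative, not from the patent.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint vectors A and B."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def find_voice_id(
    query: np.ndarray,
    enrolled: Dict[str, np.ndarray],      # voice ID -> enrolled voiceprint embedding
    threshold: float = 0.6,
) -> Optional[str]:
    """Return the best-matching enrolled voice ID, or None if nothing clears the threshold."""
    best_id, best_score = None, threshold
    for voice_id, embedding in enrolled.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_id, best_score = voice_id, score
    return best_id
```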

Description

Speaker tracking method based on audio-visual dual mode and related equipment

Technical Field

The application relates to the technical field of data processing, and in particular to an audio-visual dual-mode speaker tracking method and related equipment.

Background

In scenarios such as teleconferencing and live-broadcast teaching, existing speaker positioning technologies suffer from blurred pictures, scattered focus and low interaction efficiency. Current mainstream schemes fall into two types: methods based on visual features rely on a camera to capture facial features for tracking, but cannot automatically identify the start of speech or switch between speaking targets; methods based on auditory features use a microphone array to analyse the sound-source direction, but are insufficiently stable in noisy environments or multi-source scenes and are difficult to associate accurately with a specific person's identity. Single-modality techniques therefore struggle to satisfy the requirements of quick response and accurate positioning at the same time, and tracking accuracy drops significantly in complex environments.

Disclosure of Invention

The main purpose of the application is to provide an audio-visual dual-mode speaker tracking method and related equipment, aiming to solve the technical problem of how to track a speaker accurately in a complex environment. To achieve the above object, the application provides an audio-visual dual-mode speaker tracking method, which comprises: in response to a speaker tracking instruction, acquiring audio information and video information; performing voiceprint comparison based on a historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the historical database comprises a plurality of voice IDs and face IDs corresponding to speakers; if the voice ID exists, determining the speaker based on the voice ID; and if it does not exist, performing lip-movement recognition based on the video information to determine the speaker. In an embodiment, the step of performing voiceprint comparison based on the historical database and the audio information to determine whether a voice ID corresponding to the speaker exists in the historical database further includes: determining the source direction of the audio information based on the audio information and a preset DOA algorithm; determining a plurality of persons to be voiceprint-compared based on the source direction; and performing voiceprint comparison based on the historical database, the audio information and the plurality of persons to be voiceprint-compared, to determine whether a voice ID corresponding to the speaker exists in the historical database. In an embodiment, the step of performing voiceprint comparison based on the historical database, the audio information and the plurality of persons to be voiceprint-compared to determine whether a voice ID corresponding to the speaker exists in the historical database further includes: performing voiceprint comparison, based on the historical database, the audio information and the plurality of persons to be voiceprint-compared, by using a preset voiceprint feature comparison algorithm to determine whether a voice ID corresponding to the speaker exists in the historical database, wherein the preset voiceprint feature comparison algorithm is [formula not reproduced in the source text], where A and B are the two voiceprints being compared.
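
The embodiments above rely on a preset DOA (direction-of-arrival) algorithm to locate the sound source before narrowing the set of candidates for voiceprint comparison. The patent does not name a specific DOA method, so the sketch below assumes GCC-PHAT time-delay estimation for a single two-microphone pair, a common baseline; the microphone spacing and speed of sound are illustrative parameters.

```python
import numpy as np

# Hedged sketch of the direction-of-arrival step: the patent only refers to a
# "preset DOA algorithm", so GCC-PHAT time-delay estimation between two
# microphones is assumed here as an illustrative baseline.

def gcc_phat_delay(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the delay (seconds) of `sig` relative to `ref` using GCC-PHAT."""
    n = sig.size + ref.size
    spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    spec /= np.abs(spec) + 1e-12                       # PHAT weighting
    cc = np.fft.irfft(spec, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)

def doa_angle_deg(sig: np.ndarray, ref: np.ndarray, fs: int,
                  mic_distance: float, c: float = 343.0) -> float:
    """Source angle (degrees from broadside) for a two-mic pair `mic_distance` metres apart."""
    tau = gcc_phat_delay(sig, ref, fs)
    return float(np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0))))
```
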
In an embodiment, the step of performing lip-movement recognition based on the video information to determine the speaker further includes: performing feature extraction based on the video information by using a face key-point detection algorithm to obtain N consecutive frames of face data and lip key-point information, wherein the value of N is preset; and processing the N consecutive frames of face data and lip key-point information based on a preset classification network to perform lip-movement recognition and determine the speaker, wherein the preset classification network has an EfficientNet structure. In an embodiment, after the step of performing lip-movement recognition based on the video information to determine the speaker, the method further includes: determining voiceprint features corresponding to the speaker based on the audio information, performing voiceprint registration based on the voiceprint features to obtain a new voice ID, and determining a face ID corresponding to the speaker; and associating the new voice ID with the face ID, and updating the historical database based on the associated new voice ID and face ID. In an embodiment, the step of determining, based on the audio information, the voiceprint features corresponding to the speaker further includes: obtaining sample data, and performing a data augmentation operation on the sample data to obtain augmented sample data, wherein the data augmentation operation comprises random cropping, addition of background noise, speech-rate adjustment, volume adjustment and SpecAugment.
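
The lip-movement embodiment above classifies N consecutive frames of face data and lip key points with an EfficientNet-structured network. As a lightweight stand-in for that classifier, the sketch below flags a speaker by the fluctuation of a mouth-opening ratio computed from lip key points over a window of frames; the landmark indices are hypothetical, and this heuristic is an illustration rather than the claimed network.

```python
import numpy as np

# Lightweight stand-in for the claimed EfficientNet-based lip-movement
# classifier: flag speech when the mouth-opening ratio, computed from lip key
# points, fluctuates across N consecutive frames. The landmark indices below
# assume a hypothetical layout (mouth corners plus inner upper/lower lip
# points); real face key-point models number the points differently.

def mouth_open_ratio(lips: np.ndarray) -> float:
    """lips: (K, 2) lip key points for one frame; vertical opening over mouth width."""
    left, right = lips[0], lips[4]                     # assumed mouth corners
    upper, lower = lips[2], lips[6]                    # assumed inner upper/lower lip points
    opening = np.linalg.norm(upper - lower)
    width = np.linalg.norm(left - right) + 1e-6
    return float(opening / width)

def is_speaking(lip_frames: np.ndarray, threshold: float = 0.03) -> bool:
    """lip_frames: (N, K, 2) lip key points over N consecutive frames."""
    ratios = np.array([mouth_open_ratio(frame) for frame in lip_frames])
    return bool(np.std(ratios) > threshold)            # fluctuating opening -> likely speaking
```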