US-12626715-B2 - Role separation method, electronic device, and computer storage medium

Abstract

Embodiments of the present application provide a role separation method, an electronic device, and a computer storage medium. The role separation method includes: acquiring sound source information of target voice data and a voiceprint feature of the target voice data; determining, according to the sound source information, at least one candidate position corresponding to a sound source position; calculating a similarity between a voiceprint feature of a role corresponding to the at least one candidate position and the voiceprint feature of the target voice data; and determining a target role corresponding to the target voice data according to the similarity. By means of the embodiments of the present application, the accuracy of the role separation is improved.

Inventors

Wei Ju

Assignees

ALIBABA DAMO (HANGZHOU) TECHNOLOGY CO., LTD.

Dates

Publication Date: 20260512
Application Date: 20230109
Priority Date: 20220110

Claims (11)

1. A role separation method, implemented by a processor of an electronic device, comprising: acquiring i) sound source information of target voice data from sound waves received by a microphone acoustically coupled to the electronic device and dividing the target voice data into data frames for a time period and ii) a voiceprint feature of the target voice data by performing feature extraction on the target voice data using a neural network; selecting, from a plurality of positions for respective speakers, a plurality of candidate positions according to a sound source position indicated by the sound source information; determining, from the plurality of candidate positions, at least one candidate position based on an azimuth of each of the plurality of candidate positions relative to the sound source position; calculating a similarity between a voiceprint feature of a role corresponding to the at least one candidate position and the voiceprint feature of the target voice data; determining a target role corresponding to the target voice data according to the similarity; determining whether a number of frames of the data frames of the target voice data is larger than a preset frame number; in a case where the number of frames of the target voice data is larger than the preset frame number and the target voice data is not first voice data, calculating, according to the sound source information, an azimuth change difference value between the sound source position and a position, in existing positions, which has the closest azimuth to the sound source position, wherein the azimuth change difference value is a size of an angle from the sound source position to a reference and from the position which has the closest azimuth to the reference; and in a case where the azimuth change difference value is larger than a preset change difference value, determining the existing positions, other than the position which has the closest azimuth to the sound source position, as the candidate positions; and in a case where the azimuth change difference value is not larger than the preset change difference value, determining the position, which has the closest azimuth to the sound source position, as the candidate position.
2. The method of claim 1, wherein the determining the target role corresponding to the target voice data according to the similarity, comprises: determining a role, from roles corresponding to the at least one candidate position, whose voiceprint feature corresponds to a largest similarity, as the target role.
3. The method of claim 1, further comprising in a case where the target voice data is the first voice data, generating a new position as a candidate position according to the sound source information of the target voice data.
4. The method of claim 1, wherein the determining the target role corresponding to the target voice data according to the similarity, comprises: in a case where the target voice data is not the first voice data, calculating, according to the sound source information, an azimuth change difference value between the sound source position and a position, in existing positions, which has the closest azimuth to the sound source position; in a case where the azimuth change difference value is less than or equal to a preset change difference value and the similarity is larger than a preset similarity, determining a role corresponding to the similarity as the target role; and in a case where the azimuth change difference value is less than or equal to the preset change difference value and the similarity is less than or equal to the preset similarity, calculating similarities between voiceprint features corresponding to other positions within a region where the candidate position is located and the voiceprint feature of the target voice data, and determining a role corresponding to a voiceprint feature with a similarity larger than the preset similarity as the target role.
5. The method of claim 4, further comprising: in a case where each of the similarities for the voiceprint features corresponding to the other positions within the region where the candidate position is located is less than or equal to the preset similarity, calculating similarities between voiceprint features corresponding to positions within other regions and the voiceprint feature of the target voice data, and determining a role corresponding to a voiceprint feature with a similarity larger than the preset similarity as the target role; and in a case where each of the similarities for the voiceprint features corresponding to the positions within the other regions is less than or equal to the preset similarity, generating a new role, as the target role, for the target voice data.
6. The method of claim 1, further comprising: in a case where the number of the frames of the target voice data is less than or equal to the preset frame number, determining candidate voice data closest to an azimuth for the target voice data according to historical voice data of the sound source information; and calculating an azimuth difference between the target voice data and the candidate voice data, and in a case where the azimuth difference is less than a preset threshold value, determining a role corresponding to the candidate voice data as the target role.
7. The method of claim 1, further comprising: recording a corresponding relationship between the target role and a candidate position with a highest voiceprint feature similarity; determining whether candidate positions in multiple pieces of target voice data corresponding to the target role have changed, according to the corresponding relationship; and in a case where the candidate positions in the multiple pieces of target voice data corresponding to the target role have changed, determining position change information of the target role according to the change.
8. The method of claim 1, further comprising: determining a space partition to which a sound source position indicated by the sound source information belongs, and determining at least one candidate position corresponding to the sound source position in the space partition, wherein the space partition is one of multiple space regions formed after a physical space, where a speaker corresponding to the target voice data is located, is spatially divided according to a preset angle.
9. The method of claim 8, wherein the determining the at least one candidate position corresponding to the sound source position in the space partition, comprises: determining whether there is a candidate position corresponding to the sound source position in the space partition; in a case where there is the candidate position corresponding to the sound source position in the space partition, determining the candidate position as the candidate position corresponding to the sound source position in the space partition; and in a case where there is not the candidate position corresponding to the sound source position in the space partition, creating a candidate position in the space partition according to the sound source position.
10. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other via the communication bus; and the memory is configured for storing at least one executable instruction, and the at least one executable instruction causes the processor to perform operations corresponding to the role separation method of claim 1.
11. A non-transitory computer storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the role separation method of claim 1.
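
The azimuth-based candidate selection recited in claims 1 and 3 can be illustrated with a minimal Python sketch. It is not the claimed implementation: representing each position as a single azimuth in degrees, the helper names `angular_difference` and `select_candidates`, and the 15-degree threshold are all assumptions made for illustration, and the frame-count check of claim 1 is assumed to have already passed.

```python
from typing import List


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two azimuths, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def select_candidates(
    source_azimuth: float,
    existing_azimuths: List[float],
    max_change_deg: float = 15.0,  # hypothetical "preset change difference value"
) -> List[float]:
    """Pick candidate positions for a voice segment from known positions."""
    # First voice data: no position exists yet, so a new one is generated.
    if not existing_azimuths:
        return [source_azimuth]

    # Index of the existing position whose azimuth is closest to the source.
    closest_idx = min(
        range(len(existing_azimuths)),
        key=lambda i: angular_difference(source_azimuth, existing_azimuths[i]),
    )
    change = angular_difference(source_azimuth, existing_azimuths[closest_idx])

    if change > max_change_deg:
        # Large azimuth change: every existing position except the closest one.
        return [p for i, p in enumerate(existing_azimuths) if i != closest_idx]
    # Small azimuth change: only the closest existing position.
    return [existing_azimuths[closest_idx]]
```

For example, with existing positions at 10, 90 and 200 degrees, a source at 12 degrees keeps only the 10-degree position, while a source at 60 degrees (30 degrees from its nearest position) yields the 10-degree and 200-degree positions.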

Description

The present application claims priority to Chinese patent application No. 202210023782.5, filed with the Chinese Patent Office on Jan. 10, 2022 and entitled “ROLE SEPARATION METHOD, ELECTRONIC DEVICE, AND COMPUTER STORAGE MEDIUM”, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

Embodiments of the present application relate to the technical field of voice processing, and in particular to a role separation method, an electronic device, and a computer storage medium.

BACKGROUND

In many application scenes, for example, a conference scene, a voice communication scene, etc., in order to feed back role information of a speaker to a user, it is necessary to determine an identity or role of the speaker according to voice data of the speaker. Usually, the voice data of different roles may be distinguished according to voiceprint features of the different roles.
However, in the process of implementing the above role separation, if voiceprint features of two speakers are relatively similar, a large error will be generated during the role separation and wrong information will be fed back to the user.

SUMMARY

In view of this, embodiments of the present application provide a role separation solution to solve a part or all of the above problems.
According to a first aspect of an embodiment of the present application, a role separation method is provided, which includes: acquiring sound source information of target voice data and a voiceprint feature of the target voice data; determining, according to the sound source information, at least one candidate position corresponding to a sound source position; calculating a similarity between a voiceprint feature of a role corresponding to the at least one candidate position and the voiceprint feature of the target voice data; and determining a target role corresponding to the target voice data according to the similarity.
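The similarity and role-determination steps of the first aspect can be sketched as follows. This is an illustrative sketch only: the text above does not name a similarity measure, so cosine similarity is an assumption, as are the function names `cosine_similarity` and `assign_role` and the representation of a voiceprint feature as a list of floats. The candidate roles are assumed to have already been narrowed down by sound source position.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two voiceprint vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_role(
    target_voiceprint: Sequence[float],
    candidate_roles: Dict[str, Sequence[float]],
) -> Tuple[Optional[str], float]:
    """Return the candidate role with the largest similarity to the target.

    `candidate_roles` maps a role label to the voiceprint stored for the
    role at one of the candidate positions.
    """
    best_role: Optional[str] = None
    best_similarity = -1.0
    for role, voiceprint in candidate_roles.items():
        similarity = cosine_similarity(target_voiceprint, voiceprint)
        if similarity > best_similarity:
            best_role, best_similarity = role, similarity
    return best_role, best_similarity
```

Restricting `candidate_roles` to roles at positions near the sound source is what lets two speakers with similar voiceprints but different seats be told apart: the similar-sounding role elsewhere in the room is never compared in the first place.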
According to a second aspect of an embodiment of the present application, a role separation apparatus is provided, which includes: an acquisition module, configured for acquiring sound source information of target voice data and a voiceprint feature of the target voice data; a candidate module, configured for determining, according to the sound source information, at least one candidate position corresponding to a sound source position; a similarity module, configured for calculating a similarity between a voiceprint feature of a role corresponding to the at least one candidate position and the voiceprint feature of the target voice data; and a role separation module, configured for determining a target role corresponding to the target voice data according to the similarity.
According to a third aspect of an embodiment of the present application, an electronic device is provided, which includes: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other via the communication bus; and the memory is configured for storing at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the role separation method in the first aspect.
According to a fourth aspect of an embodiment of the present application, a computer storage medium is provided, which stores a computer program, wherein the computer program, when executed by a processor, implements the role separation method in the first aspect.
According to the role separation solution provided by the embodiments of the present application, sound source information of target voice data and a voiceprint feature of the target voice data are acquired, at least one candidate position corresponding to a sound source position is determined according to the sound source information, a similarity between a voiceprint feature of a role corresponding to the candidate position and the voiceprint feature of the target voice data is calculated, and a target role corresponding to the target voice data is determined according to the similarity. Because the candidate positions are first filtered according to the sound source position indicated by the sound source information, the amount of computation is reduced. The similarity between the voiceprint feature of the role corresponding to each candidate position and the voiceprint feature of the target voice data is then calculated, and the target role is determined according to that similarity. The solution thus takes into account both the sound source position and the voiceprint feature, which leads to higher accuracy of role separation.

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions of the embodiments of the present application or the prior art more clearly, the accompanying drawings to be used in the description of the embodiments or the prior art will be described briefly below. Evidently, the accompanying drawings described below are merely drawings of some embodiments recite