CN-114005451-B - Sound object determining method, sound object determining device, computing equipment and medium
Abstract
The invention discloses a sound object determining method, a sound object determining apparatus, computing equipment, and a medium. The method comprises: obtaining a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of a second target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, the sounding objects of the audio frames in the first target audio segment are identical to those of the audio frames in the second target audio segment, the first target audio segment is at least a part of the second target audio segment, and N is an integer greater than or equal to 1; and determining the target sounding object of the second target audio segment according to the target voiceprint features. The accuracy of determining the sounding object can thereby be improved.
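The flow summarized in the abstract can be pictured with a minimal sketch. All names, the position-similarity formula, and the 0.8 threshold below are illustrative assumptions; the patent does not specify concrete values or APIs, and `extract_voiceprint` and `lookup_speaker` are hypothetical stand-ins for the voiceprint extraction and database matching steps that the claims refine:

```python
# Minimal illustrative sketch of the abstract's flow. The position-similarity
# formula, the threshold value, and all names are hypothetical assumptions.

def positions_match(target_pos, first_pos, threshold=0.8):
    # Toy 1-D similarity: 1.0 for identical positions, decaying with distance.
    similarity = 1.0 / (1.0 + abs(target_pos - first_pos))
    return similarity >= threshold

def determine_sounding_object(target_pos, first_pos, segment,
                              extract_voiceprint, lookup_speaker):
    # If the new frame's position matches the previous segment's position,
    # assume the speaker has not changed and skip voiceprint extraction.
    if positions_match(target_pos, first_pos):
        return "unchanged"
    # Otherwise fall back to the voiceprint features of the audio segment.
    target_voiceprint = extract_voiceprint(segment)
    return lookup_speaker(target_voiceprint)
```

The key design point, per the abstract, is that voiceprint matching is only invoked when the position check already suggests a speaker change, which is what improves accuracy over voiceprint recognition alone.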
Inventors
- ZHENG SIQI
- WANG XIANLIANG
- SUO HONGBIN
Assignees
- Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20200728
Claims (18)
- 1. A sounding object determining method, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of a second target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, and the sounding objects of the audio frames in the first target audio segment are identical to those of the audio frames in the second target audio segment; and determining a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
- 2. The method of claim 1, wherein the target position information comprises relative position information between the second sounding object and a preset audio collector.
- 3. The method of claim 1, wherein, before extracting the target voiceprint features of the second target audio segment upon determining that the target position information does not match the first position information corresponding to the first target audio segment, the method further comprises: filtering the position information of the sounding objects corresponding to the audio frames in the first target audio segment to obtain filtered position information; and determining the first position information based on the filtered position information.
- 4. The method of claim 1, wherein determining the sounding object corresponding to the first voiceprint feature in the preset sound database as the target sounding object of the second target audio segment comprises: in the case that a first voiceprint feature whose matching degree meets a first preset matching condition exists in the preset sound database, determining the sounding object corresponding to the first voiceprint feature as the target sounding object of the second target audio segment; the method further comprises: in the case that the matching degree between the first voiceprint feature and the target voiceprint feature meets a second preset matching condition, updating the voiceprint feature corresponding to the target sounding object in the preset sound database with the target voiceprint feature; wherein the matching degree required by the second preset matching condition is larger than the matching degree required by the first preset matching condition.
- 5. The method of claim 4, wherein, in the case that the matching degree between the target position information and the first position information is smaller than the first preset position matching degree threshold and larger than the second preset position matching degree threshold, the first preset matching condition comprises that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than the first preset voiceprint matching degree threshold; in the case that the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the first preset matching condition comprises that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than the third preset voiceprint matching degree threshold, and the second preset matching condition comprises that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a fourth preset voiceprint matching degree threshold; wherein the fourth preset voiceprint matching degree threshold is larger than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than a second preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold.
- 6. The method of claim 4, further comprising: in the case that the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature meets a third preset matching condition, storing the correspondence between the target voiceprint feature and the sounding object corresponding to the target voiceprint feature in the preset sound database, and determining the sounding object corresponding to the target voiceprint feature as the target sounding object of the second target audio segment; wherein the third preset matching condition is used for representing that the voiceprint features in the preset sound database do not match the target voiceprint feature.
- 7. The method of claim 6, wherein, in the case that the matching degree between the target position information and the first position information is smaller than the first preset position matching degree threshold and larger than the second preset position matching degree threshold, the third preset matching condition comprises that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a fifth preset voiceprint matching degree threshold; in the case that the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the third preset matching condition comprises that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a sixth preset voiceprint matching degree threshold; wherein the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold.
- 8. A sounding object determining method, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of the first target audio segment, wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, and the first sounding object is the sounding object of the audio frame preceding the target audio frame; and determining a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
- 9. A method of determining a starting point of sounding content, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; and determining, upon determining that the target position information does not match first position information corresponding to a first target audio segment, the target audio frame as the starting point of the sounding content of the second sounding object; wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, the first sounding object is the sounding object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame; the starting point of the sounding content of the second sounding object is used for determining the sounding object of a second target audio segment, and the sounding object of the second target audio segment is determined based on the sounding object corresponding to a first voiceprint feature in a preset sound database, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold, and the sounding objects of the audio frames in the second target audio segment are the same as those of the audio frames in the first target audio segment.
- 10. A method of changing a sounding object identification, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of a second target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, and the sounding objects of the audio frames in the first target audio segment are the same as those of the audio frames in the second target audio segment; determining a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold; and changing the identification of the second sounding object and the identification of the target sounding object, wherein the identifications are used for representing the sounding states of the sounding objects.
- 11. A session record generation method, comprising: acquiring a target audio frame uttered by a second sounding object in audio session data and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment in the audio session data, extracting target voiceprint features of a second target audio segment in the audio session data, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame; determining a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold; and associating the target sounding object with the text content corresponding to the second target audio segment to obtain the session record of the target sounding object.
- 12. A sounding object determining apparatus, comprising: an acquisition module configured to acquire a target audio frame uttered by a second sounding object and target position information of the second sounding object; an extraction module configured to extract target voiceprint features of a second target audio segment upon determining that the target position information does not match first position information corresponding to a first target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, the sounding objects of the audio frames in the first target audio segment are identical to those of the audio frames in the second target audio segment, and the first target audio segment is at least a part of the second target audio segment; and a first determining module configured to determine a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
- 13. A sounding object determining apparatus, comprising: an acquisition module configured to acquire a target audio frame uttered by a second sounding object and target position information of the second sounding object; an extraction module configured to extract target voiceprint features of a first target audio segment upon determining that the target position information does not match first position information corresponding to the first target audio segment, wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, the first sounding object is the sounding object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame; and a first determining module configured to determine a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
- 14. A sounding content starting point determining apparatus, comprising: an acquisition module configured to acquire a target audio frame uttered by a second sounding object and target position information of the second sounding object; and a first determining module configured to determine, upon determining that the target position information does not match first position information corresponding to a first target audio segment, the target audio frame as the starting point of the sounding content of the second sounding object; wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, the first sounding object is the sounding object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame; the starting point of the sounding content of the second sounding object is used for determining the sounding object of a second target audio segment, and the sounding object of the second target audio segment is determined based on the sounding object corresponding to a first voiceprint feature in a preset sound database, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold, and the sounding objects of the audio frames in the second target audio segment are the same as those of the audio frames in the first target audio segment.
- 15. A sounding object identification changing apparatus, comprising: an acquisition module configured to acquire a target audio frame uttered by a second sounding object and target position information of the second sounding object; an extraction module configured to extract target voiceprint features of a second target audio segment upon determining that the target position information does not match first position information corresponding to a first target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame; a first determining module configured to determine a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold; and a changing module configured to change the identification of the second sounding object and the identification of the target sounding object, the identifications being used for representing the sounding states of the sounding objects.
- 16. A session record generation apparatus, comprising: an acquisition module configured to acquire a target audio frame uttered by a second sounding object in audio session data and target position information of the second sounding object; an extraction module configured to extract target voiceprint features of a second target audio segment in the audio session data upon determining that the target position information does not match first position information corresponding to a first target audio segment in the audio session data, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame; a first determining module configured to determine a sounding object corresponding to a first voiceprint feature in a preset sound database as a target sounding object of the second target audio segment, wherein, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, or, when the matching degree between the target position information and the first position information is smaller than the second preset position matching degree threshold, the matching degree between the first voiceprint feature and the target voiceprint feature is larger than a third preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold; and an association module configured to associate the target sounding object with the text content corresponding to the second target audio segment to obtain the session record of the target sounding object.
- 17. A computing device, comprising a processor and a memory storing computer program instructions; wherein the processor, when executing the computer program instructions, implements the method of any one of claims 1-11.
- 18. A computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of claims 1-11.
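The two-tier threshold scheme recited in claims 1, 5 and 7 can be sketched as follows. The concrete numeric values and the dictionary-based database below are illustrative assumptions, since the claims only fix the ordering of the thresholds (second position threshold below the first, first voiceprint threshold below the third):

```python
# Hypothetical sketch of the two-tier threshold scheme in claims 1, 5 and 7.
# All numeric values are invented for illustration; the patent only fixes
# their ordering (POS_T2 < POS_T1 and VP_T1 < VP_T3).

POS_T1 = 0.8   # first preset position matching degree threshold
POS_T2 = 0.4   # second preset position matching degree threshold
VP_T1 = 0.6    # first preset voiceprint threshold (mild position mismatch)
VP_T3 = 0.85   # third preset voiceprint threshold (strong position mismatch)

def required_voiceprint_threshold(pos_match):
    """Pick the voiceprint threshold according to how badly positions mismatch."""
    if POS_T2 < pos_match < POS_T1:
        return VP_T1      # mild mismatch: a moderate voiceprint match suffices
    if pos_match < POS_T2:
        return VP_T3      # strong mismatch: demand a stricter voiceprint match
    return None           # positions match well enough: no voiceprint check

def match_speaker(pos_match, voiceprint_scores):
    """voiceprint_scores: {speaker_id: matching degree vs. target voiceprint}."""
    threshold = required_voiceprint_threshold(pos_match)
    if threshold is None:
        return "unchanged"
    candidates = {s: d for s, d in voiceprint_scores.items() if d > threshold}
    if candidates:
        return max(candidates, key=candidates.get)  # best-matching enrolled speaker
    return "new-speaker"  # per claims 6/7: nothing matched, so enroll a new entry
```

The rationale is that a strong position mismatch is stronger evidence of a speaker change, so a correspondingly stricter voiceprint match is demanded before attributing the segment to an enrolled speaker.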
Description
Sound object determining method, sound object determining device, computing equipment and medium. Technical Field: The present invention relates to the field of data processing, and in particular to a method, an apparatus, a computing device, and a medium for determining a sounding object. Background: Conversational speech occurs in many situations, such as daily life, meetings, and telephone conversations. In practical applications, to analyze a speech signal more accurately, not only speech recognition but also speaker separation is required, so as to determine the sounding object of each part of the speech. Once the sounding object of the speech is determined, a much wider range of applications becomes possible. For example, in a large multi-person conference room, performing speaker separation on the voices in the conference allows the conference record to be completed quickly, recording the content uttered by each speaker. Currently, determining the sounding object is mostly achieved through voiceprint recognition technology, but its accuracy is low. Therefore, a sounding object determining method with high accuracy is highly desirable. Disclosure of Invention: The embodiments of the invention provide a sounding object determining method, apparatus, computing device, and medium, which can alleviate the problem of low accuracy in determining sounding objects.
According to a first aspect of the embodiments of the present invention, there is provided a sounding object determining method, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of a second target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, and the sounding objects of the audio frames in the first target audio segment are identical to those of the audio frames in the second target audio segment; and determining a target sounding object of the second target audio segment according to the target voiceprint features. According to a second aspect of the embodiments of the present invention, there is provided a sounding object determining method, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of the first target audio segment, wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, and the first sounding object is the sounding object of the audio frame preceding the target audio frame; and determining a sounding object of the first target audio segment according to the target voiceprint features.
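The "determining according to the target voiceprint features" step of the first aspect is refined in claim 4 by a two-condition scheme: a looser condition to accept an enrolled speaker, and a stricter one to also refresh that speaker's stored voiceprint. A hypothetical sketch, where the threshold values, the cosine-style `similarity` callback, and the averaging update rule are all illustrative assumptions not specified by the patent:

```python
# Hypothetical sketch of the voiceprint-database update in claim 4.
# Threshold values and the averaging update are illustrative assumptions;
# the claim only requires UPDATE_T's condition to be stricter than MATCH_T's.

MATCH_T = 0.6    # first preset matching condition: accept the speaker
UPDATE_T = 0.9   # second preset matching condition: also refresh the database

def identify_and_update(db, target_vp, similarity):
    """db: {speaker_id: stored voiceprint vector}. Returns a speaker id or None."""
    best_id, best_sim = None, 0.0
    for speaker_id, stored_vp in db.items():
        sim = similarity(stored_vp, target_vp)
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    if best_id is None or best_sim <= MATCH_T:
        return None                      # no enrolled speaker matched
    if best_sim > UPDATE_T:
        # Very confident match: blend the new voiceprint into the stored one,
        # so the database tracks gradual drift in a speaker's voice.
        db[best_id] = [(a + b) / 2 for a, b in zip(db[best_id], target_vp)]
    return best_id
```

Returning `None` corresponds to the enrollment path of claims 6 and 7, where an unmatched voiceprint is stored as a new sounding object.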
According to a third aspect of the embodiments of the present invention, there is provided a sounding content starting point determining method, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; and determining, upon determining that the target position information does not match first position information corresponding to a first target audio segment, the target audio frame as the starting point of the sounding content of the second sounding object; wherein the first target audio segment comprises all continuous audio frames uttered by a first sounding object, the first sounding object is the sounding object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame. According to a fourth aspect of the embodiments of the present invention, there is provided a method for changing a sounding object identification, comprising: acquiring a target audio frame uttered by a second sounding object and target position information of the second sounding object; if the target position information does not match first position information corresponding to a first target audio segment, extracting target voiceprint features of a second target audio segment, wherein the first target audio segment comprises the first N audio frames preceding the target audio frame, and the sounding objects of the audio frames in the first target audio segment are the same as those of the audio frames in the second target audio segment; determining a target sounding object of the second target audio segment according to the target voiceprint features; and changing the identification of the second sounding object and the identification of the target sounding object, wherein the identification is used for representing the sounding state of the sounding object.