CN-115312057-B - Conference interaction method, conference interaction device, computer equipment and storage medium
Abstract
The application relates to a conference interaction method and apparatus, a computer device, a storage medium, and a computer program product, and is based on artificial-intelligence voice recognition technology. The method comprises: when a conference is in a non-speaking state awaiting voice wake-up at a local end, acquiring wake-up trigger voice data of the local end in response to a voice wake-up trigger event; performing wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; performing voiceprint matching between the wake-up trigger voice data and standard voiceprint data of a speaking-right account associated with the local end to obtain a voiceprint matching result; switching the conference into a speaking state at the local end when the wake-up word detection result indicates that the target wake-up word is included and the voiceprint matching result indicates a match; and, while the conference is in the speaking state at the local end, sending conference speaking voice data collected by the local end to a conference client of the conference. The method can avoid voice leakage and improve conference security.
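The gating described in the abstract can be illustrated with a minimal sketch. It is not the applicant's implementation: the names `SpeakState`, `LocalEnd`, `on_wake_attempt`, and `on_audio` are hypothetical, the wake-up word and voiceprint checks are assumed to be computed elsewhere and passed in as booleans, and the automatic return to the non-speaking state follows claim 8, with the 5-second default chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SpeakState(Enum):
    NON_SPEAKING = "non_speaking"  # awaiting voice wake-up; nothing is sent
    SPEAKING = "speaking"          # collected audio is sent to the conference


@dataclass
class LocalEnd:
    """Hypothetical model of the speaking state held at the local end."""
    state: SpeakState = SpeakState.NON_SPEAKING
    silence_s: float = 0.0     # time elapsed since speech was last collected
    end_after_s: float = 5.0   # assumed speaking-end duration, not from the source

    def on_wake_attempt(self, has_wake_word: bool, voiceprint_ok: bool) -> SpeakState:
        # Both conditions must hold: target wake-up word AND voiceprint match.
        if self.state is SpeakState.NON_SPEAKING and has_wake_word and voiceprint_ok:
            self.state = SpeakState.SPEAKING
            self.silence_s = 0.0
        return self.state

    def on_audio(self, frame_s: float, has_speech: bool) -> bool:
        """Return True if this audio frame should be sent to the conference client."""
        if self.state is not SpeakState.SPEAKING:
            return False  # non-speaking state: audio never leaves the local end
        if has_speech:
            self.silence_s = 0.0
            return True
        self.silence_s += frame_s
        if self.silence_s >= self.end_after_s:
            # No speech for long enough: fall back to the non-speaking state.
            self.state = SpeakState.NON_SPEAKING
            return False
        return True
```

In this sketch a wake-up word spoken by someone whose voiceprint does not match leaves the state unchanged, which is the property the abstract relies on to avoid voice leakage.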
Inventors
- YU ZIQIANG
Assignees
- 腾讯科技(深圳)有限公司
Dates
- Publication Date: 20260421
- Application Date: 20220815
- Priority Date: 20220815
Claims (20)
- 1. A conference interaction method, the method comprising: in response to a trigger operation on a conference entrance, entering a conference associated with the conference entrance and setting the conference to a non-speaking state awaiting voice wake-up at a local end, and, in response to a voice wake-up trigger event, acquiring wake-up trigger voice data of the local end, wherein the non-speaking state is a state in which the local end is not supported to speak for the conference; performing wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; performing voiceprint matching between the wake-up trigger voice data and standard voiceprint data of a speaking-right account associated with the local end to obtain a voiceprint matching result; when the wake-up word detection result indicates that the wake-up trigger voice data includes a target wake-up word and the voiceprint matching result indicates a match, switching the conference into a speaking state at the local end, wherein the speaking state is a state in which the local end is supported to speak for the conference; while the conference is in the speaking state at the local end, collecting conference speaking voice data of the local end and sending the conference speaking voice data to a conference client of the conference, wherein the conference speaking voice data is speaking data sent by the local end for the conference; when it is determined, from conference arrangement information of the conference, that the participant member account of the next speaker belongs to the speaking-right account associated with the local end, switching the conference from the non-speaking state to the speaking state at the local end; giving a speaking prompt at the local end through perceivable speaking prompt information, so as to prompt the participant member account of the next speaker to speak directly at the local end; when the participant member account of the next speaker finishes speaking, switching the conference from the speaking state to the non-speaking state at the local end; when the participant member account of the next speaker does not belong to the speaking-right account associated with the local end, triggering speaking-right configuration for the participant member account of the next speaker so as to grant that account the speaking right; and when the participant member account of the next speaker belongs to the speaking-right account associated with the local end, giving a speaking prompt at the local end through perceivable speaking prompt information, so as to prompt the participant member account of the next speaker to trigger wake-up by voice and switch to the speaking state at the local end.
- 2. The method of claim 1, wherein performing wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result comprises: extracting features from the wake-up trigger voice data to obtain initial audio features of the wake-up trigger voice data; performing differential mapping on the initial audio features to obtain identification audio features; and performing wake-up word detection based on the identification audio features to obtain the wake-up word detection result.
- 3. The method of claim 2, wherein extracting features from the wake-up trigger voice data to obtain initial audio features of the wake-up trigger voice data comprises: performing time-domain preprocessing on the wake-up trigger voice data to obtain intermediate voice data; performing frequency-domain conversion on the intermediate voice data to obtain an energy spectrum of the intermediate voice data; performing frequency conversion on the energy spectrum through a filter bank to obtain a power spectrum; and performing a discrete transform on the power spectrum to obtain the initial audio features of the wake-up trigger voice data.
- 4. The method of claim 2, wherein performing differential mapping on the initial audio features to obtain identification audio features comprises: performing at least one difference operation on the initial audio features to obtain the identification audio features; and wherein performing wake-up word detection based on the identification audio features to obtain the wake-up word detection result comprises: obtaining a wake-up word detection model, the wake-up word detection model being trained on historical trigger voice data carrying wake-up word labels; and performing wake-up word detection on the identification audio features through the wake-up word detection model to obtain the wake-up word detection result output by the wake-up word detection model.
- 5. The method of claim 4, wherein performing wake-up word detection on the identification audio features through the wake-up word detection model to obtain the wake-up word detection result output by the wake-up word detection model comprises: sequentially performing feature processing on the identification audio features through at least two stages of feature processing layer structures in the wake-up word detection model to obtain intermediate audio features; performing wake-up word classification based on the intermediate audio features through a classification layer structure in the wake-up word detection model to obtain a classification probability distribution; and obtaining the wake-up word detection result based on the classification probability distribution.
- 6. The method of claim 1, wherein performing voiceprint matching between the wake-up trigger voice data and standard voiceprint data of the speaking-right account associated with the local end to obtain a voiceprint matching result comprises: extracting wake-up trigger voiceprint features from the wake-up trigger voice data; determining standard voiceprint features of the standard voiceprint data of the speaking-right account associated with the local end; and performing voiceprint feature matching between the wake-up trigger voiceprint features and the standard voiceprint features to obtain the voiceprint matching result.
- 7. The method of claim 1, wherein acquiring wake-up trigger voice data of the local end in response to a voice wake-up trigger event comprises: collecting voice data at the local end; performing silence detection based on voice features of the voice data collected at the local end to obtain a silence detection result; and when the silence detection result indicates that voice wake-up is triggered at the local end, acquiring the wake-up trigger voice data of the local end from the voice data collected at the local end.
- 8. The method according to claim 1, further comprising: when the duration for which no conference speaking voice data of the local end has been collected satisfies a speaking-end duration condition, switching the conference from the speaking state at the local end to the non-speaking state at the local end.
- 9. The method according to any one of claims 1 to 8, further comprising: entering the conference and displaying a conference information area associated with the conference; displaying, in the conference information area, a conference state identifier representing the non-speaking state; when voice data is detected at the local end, prompting, through perceivable wake-up trigger prompt information, that voice state switching is to be triggered by wake-up trigger voice data; and when the conference is switched to the speaking state at the local end, switching the displayed conference state identifier to represent the speaking state.
- 10. The method according to any one of claims 1 to 8, further comprising: in response to a conference setting trigger operation for the conference, displaying a conference setting operation area of the conference, the conference setting operation area including a wake-up word setting item; and in response to a wake-up word setting operation triggered through the wake-up word setting item, determining the target wake-up word according to the wake-up word setting operation.
- 11. The method according to any one of claims 1 to 8, further comprising: in response to a participant member configuration operation for the conference at the local end, obtaining a local participant member account of the conference at the local end; determining the speaking-right account associated with the local end based on the local participant member account; and establishing an association between standard voiceprint data of the local participant member account and the speaking-right account.
- 12. A conference interaction device, the device comprising: a wake-up trigger voice acquisition module, configured to enter a conference associated with a conference entrance in response to a trigger operation on the conference entrance, set the conference to a non-speaking state awaiting voice wake-up at a local end, and acquire wake-up trigger voice data of the local end in response to a voice wake-up trigger event, wherein the non-speaking state is a state in which the local end is not supported to speak for the conference; a wake-up word detection module, configured to perform wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; a voiceprint matching module, configured to perform voiceprint matching between the wake-up trigger voice data and standard voiceprint data of a speaking-right account associated with the local end to obtain a voiceprint matching result; a voice state switching module, configured to switch the conference into a speaking state at the local end when the wake-up word detection result indicates that the wake-up trigger voice data includes a target wake-up word and the voiceprint matching result indicates a match, wherein the speaking state is a state in which the local end is supported to speak for the conference; a speaking voice processing module, configured to collect conference speaking voice data of the local end and send the conference speaking voice data to a conference client of the conference while the conference is in the speaking state at the local end, wherein the conference speaking voice data is speaking data sent by the local end for the conference; wherein the voice state switching module is further configured to switch the conference from the non-speaking state to the speaking state at the local end when it is determined, from conference arrangement information of the conference, that the participant member account of the next speaker in the conference belongs to the speaking-right account associated with the local end; a speaking prompt module, configured to give a speaking prompt at the local end through perceivable speaking prompt information, so as to prompt the participant member account of the next speaker to speak directly at the local end; wherein the voice state switching module is further configured to switch the conference from the speaking state to the non-speaking state at the local end when the participant member account of the next speaker finishes speaking; and wherein, when the participant member account of the next speaker does not belong to the speaking-right account associated with the local end, a participant member configuration module is triggered to perform speaking-right configuration for the participant member account of the next speaker so as to grant that account the speaking right, and when the participant member account of the next speaker belongs to the speaking-right account associated with the local end, the speaking prompt module gives a speaking prompt at the local end through perceivable speaking prompt information, so as to prompt the participant member account of the next speaker to trigger wake-up by voice and switch to the speaking state at the local end.
- 13. The apparatus of claim 12, wherein the wake-up word detection module is further configured to extract features from the wake-up trigger voice data to obtain initial audio features of the wake-up trigger voice data, perform differential mapping on the initial audio features to obtain identification audio features, and perform wake-up word detection based on the identification audio features to obtain the wake-up word detection result.
- 14. The apparatus of claim 13, wherein the wake-up word detection module is further configured to perform time-domain preprocessing on the wake-up trigger voice data to obtain intermediate voice data, perform frequency-domain conversion on the intermediate voice data to obtain an energy spectrum of the intermediate voice data, perform frequency conversion on the energy spectrum through a filter bank to obtain a power spectrum, and perform a discrete transform on the power spectrum to obtain the initial audio features of the wake-up trigger voice data.
- 15. The apparatus of claim 13, wherein the wake-up word detection module is further configured to perform at least one difference operation on the initial audio features to obtain the identification audio features, obtain a wake-up word detection model trained on historical trigger voice data carrying wake-up word labels, and perform wake-up word detection on the identification audio features through the wake-up word detection model to obtain the wake-up word detection result output by the wake-up word detection model.
- 16. The apparatus of claim 15, wherein the wake-up word detection module is further configured to sequentially perform feature processing on the identification audio features through at least two stages of feature processing layer structures in the wake-up word detection model to obtain intermediate audio features, perform wake-up word classification based on the intermediate audio features through a classification layer structure in the wake-up word detection model to obtain a classification probability distribution, and obtain the wake-up word detection result based on the classification probability distribution.
- 17. The apparatus of claim 12, wherein the voiceprint matching module is further configured to extract wake-up trigger voiceprint features from the wake-up trigger voice data, determine standard voiceprint features of the standard voiceprint data of the speaking-right account associated with the local end, and perform voiceprint feature matching between the wake-up trigger voiceprint features and the standard voiceprint features to obtain the voiceprint matching result.
- 18. The apparatus of claim 12, wherein the wake-up trigger voice acquisition module is further configured to collect voice data at the local end, perform silence detection based on voice features of the voice data collected at the local end to obtain a silence detection result, and, when the silence detection result indicates that voice wake-up is triggered at the local end, acquire the wake-up trigger voice data of the local end from the voice data collected at the local end.
- 19. The apparatus of claim 12, wherein the apparatus further comprises: an ending switching module, configured to switch the conference from the speaking state at the local end to the non-speaking state at the local end when the duration for which no conference speaking voice data of the local end has been collected satisfies a speaking-end duration condition.
- 20. The apparatus according to any one of claims 12 to 19, further comprising: an information area display module, configured to enter the conference and display a conference information area associated with the conference; a state identifier display module, configured to display, in the conference information area, a conference state identifier representing the non-speaking state; a wake-up prompt module, configured to prompt, through perceivable wake-up trigger prompt information, that voice state switching is to be triggered by wake-up trigger voice data when voice data is detected at the local end; and a state identifier switching module, configured to switch the displayed conference state identifier to represent the speaking state when the conference is switched to the speaking state at the local end.
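The differential mapping recited in claims 2 and 4 can be illustrated with a short sketch. The claims only require "at least one difference operation"; the central-difference formula, the repeated-edge padding, the concatenation of the results onto each frame, and the names `delta` and `identification_features` are all assumptions made for illustration, not details taken from the claims.

```python
from typing import List

Frame = List[float]  # one frame of initial audio features (e.g. cepstral coefficients)


def delta(frames: List[Frame]) -> List[Frame]:
    """First-order difference over time: (next - previous) / 2, with edge frames repeated."""
    n = len(frames)
    out: List[Frame] = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def identification_features(initial: List[Frame], orders: int = 2) -> List[Frame]:
    """Append differences of order 1..orders to each frame of the initial features."""
    parts = [initial]
    for _ in range(orders):
        # Each higher-order difference is the difference of the previous order.
        parts.append(delta(parts[-1]))
    # Concatenate, per frame, the initial features and every difference order.
    return [sum((p[t] for p in parts), []) for t in range(len(initial))]
```

With `orders=2`, each output frame is three times as wide as the input frame, which gives a downstream detection model access to how the features change over time rather than only their instantaneous values.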
Description
Conference interaction method, conference interaction device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a conference interaction method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology, conferences take increasingly diverse forms. A conference is no longer limited to participants gathering in a single conference room; remote audio and video network conferences make cross-region meetings possible, which facilitates people's work and life. In an online network conference, a participant who needs to speak turns on the microphone. Once the microphone is on, however, everything the participant says is transmitted, including speech unrelated to the conference content, which causes voice leakage and compromises the security of the conference.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a conference interaction method, apparatus, computer device, computer-readable storage medium, and computer program product that can avoid voice leakage and improve conference security.
In a first aspect, the present application provides a conference interaction method.
The method comprises the following steps: when the conference is in a non-speaking state awaiting voice wake-up at the local end, acquiring wake-up trigger voice data of the local end in response to a voice wake-up trigger event; performing wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; performing voiceprint matching between the wake-up trigger voice data and standard voiceprint data of the speaking-right account associated with the local end to obtain a voiceprint matching result; when the wake-up word detection result indicates that the wake-up trigger voice data includes a target wake-up word and the voiceprint matching result indicates a match, switching the conference into a speaking state at the local end; and, while the conference is in the speaking state at the local end, collecting conference speaking voice data of the local end and sending the conference speaking voice data to a conference client of the conference.
In a second aspect, the application further provides a conference interaction device. The device comprises: a wake-up trigger voice acquisition module, configured to acquire wake-up trigger voice data of the local end in response to a voice wake-up trigger event when the conference is in a non-speaking state awaiting voice wake-up at the local end; a wake-up word detection module, configured to perform wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; a voiceprint matching module, configured to perform voiceprint matching between the wake-up trigger voice data and standard voiceprint data of the speaking-right account associated with the local end to obtain a voiceprint matching result; a voice state switching module, configured to switch the conference into a speaking state at the local end when the wake-up word detection result indicates that the wake-up trigger voice data includes a target wake-up word and the voiceprint matching result indicates a match; and a speaking voice processing module, configured to collect conference speaking voice data of the local end and send the conference speaking voice data to a conference client of the conference while the conference is in the speaking state at the local end.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps: when the conference is in a non-speaking state awaiting voice wake-up at the local end, acquiring wake-up trigger voice data of the local end in response to a voice wake-up trigger event; performing wake-up word detection on the wake-up trigger voice data to obtain a wake-up word detection result; performing voiceprint matching between the wake-up trigger voice data and standard voiceprint data of the speaking-right account associated with the local end to obtain a voiceprint matching result; when the wake-up word detection result indicates that the wake-up trigger voice data includes a target wake-up word and the voiceprint matching result indicates a match, switching the conference into a speaking state at the local end; and, while the conference is in the speaking state at the local end, collecting conference speaking voice data of the local end and sending the conference speaking voice data to a conference client of th