CN-116013313-B - Distributed voice interaction method, system and distributed center

CN116013313B

Abstract

Embodiments of this application disclose a distributed voice interaction method, a distributed voice interaction system, and a distributed center. The method comprises: receiving wake-word segment audio features and wake-word time ranges uploaded by wake-up devices, wherein the wake-up devices are those intelligent voice devices, among a plurality of intelligent voice devices, that have been awakened by a wake word, and the wake-word segment audio features comprise a first time-difference-of-arrival (TDOA) feature and a first audio quality feature; selecting, according to the first TDOA features and the wake-word time ranges, the wake-up device with the earliest wake-word arrival time as the responding device; and selecting, according to the first audio quality features, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage. This ensures that the intelligent voice device closest to the user always responds, so the response better matches the user's expectation, and that the captured recognition-utterance audio is of higher quality, yielding a more accurate recognition result.

Inventors

  • CAO SHENGHONG
  • MA FENG

Assignees

  • IFLYTEK CO., LTD. (科大讯飞股份有限公司)

Dates

Publication Date
2026-05-05
Application Date
2022-12-30

Claims (12)

  1. A distributed voice interaction method, the method comprising: receiving wake-word segment audio features and a wake-word time range uploaded by wake-up devices, wherein the wake-up devices are intelligent voice devices, among a plurality of intelligent voice devices, that have been awakened by a wake word, and the wake-word segment audio features comprise a first time-difference-of-arrival (TDOA) feature and a first audio quality feature; selecting, according to the first TDOA feature and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device; and selecting, according to the first audio quality feature, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage.
  2. The method of claim 1, wherein there are a plurality of wake-up devices and a plurality of first TDOA features, and selecting, according to the first TDOA features and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device comprises: extracting, from each of the plurality of first TDOA features, the TDOA sub-feature of a common time period according to the wake-word time ranges respectively corresponding to the plurality of first TDOA features, to obtain a plurality of TDOA sub-features; inputting the TDOA sub-features into a preset first decision model to obtain a first decision result, wherein the first decision model is trained on a plurality of pieces of first audio training data with known arrival time differences, and the first audio training data contain wake words; and selecting, according to the first decision result, the wake-up device with the earliest wake-word arrival time as the responding device.
  3. The method of claim 1, wherein there are a plurality of wake-up devices and a plurality of first audio quality features, and selecting, according to the first audio quality features, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage comprises: inputting the plurality of first audio quality features into a preset second decision model to obtain a second decision result, wherein the second decision model is trained on a plurality of pieces of second audio training data with known audio quality, and the second audio training data contain wake words; and selecting, according to the second decision result, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage.
  4. The method of claim 1, further comprising constructing a first extraction model and the first decision model by the following process: constructing a first training set, wherein the first training set comprises a plurality of pieces of first audio training data, each piece of first audio training data comprises audio from N arrays, the audio from each array contains a wake word, the arrival time differences of the wake word across the N arrays are known, and N ≥ 2; inputting the audio of each array in each piece of first audio training data separately into a first initial extraction model to obtain N second TDOA features corresponding to that piece of first audio training data; randomly setting p of the N second TDOA features corresponding to each piece of first audio training data to zero to obtain p first zeroed features, wherein 0 ≤ p ≤ N-2; and inputting the p first zeroed features and the N-p non-zeroed second TDOA features corresponding to each piece of first audio training data into a first initial decision model, taking the resulting N×1 vector as the training target, and performing joint iterative training on the first initial extraction model and the first initial decision model to obtain the trained first extraction model and first decision model, wherein the first extraction model is used to extract the first TDOA features, and the first decision model is used to select the wake-up device with the earliest wake-word arrival time.
  5. The method of claim 1, further comprising constructing a second extraction model and the second decision model by the following process: constructing a second training set, wherein the second training set comprises a plurality of pieces of second audio training data, each piece of second audio training data comprises audio from M arrays, the audio from each array contains a wake word, the audio quality of the wake word across the M arrays is known, and M ≥ 2; inputting the audio of each array in each piece of second audio training data separately into a second initial extraction model to obtain M second audio quality features corresponding to that piece of second audio training data; randomly setting q of the M second audio quality features corresponding to each piece of second audio training data to zero to obtain q second zeroed features, wherein 0 ≤ q ≤ M-2; and inputting the q second zeroed features and the M-q non-zeroed second audio quality features corresponding to each piece of second audio training data into a second initial decision model, taking the resulting M×1 vector as the training target, and performing joint iterative training on the second initial extraction model and the second initial decision model to obtain the trained second extraction model and second decision model, wherein the second extraction model is used to extract the first audio quality features, and the second decision model is used to select the wake-up device with the best wake-word audio quality.
  6. The method of claim 1, further comprising: performing time synchronization across the plurality of intelligent voice devices, wherein the synchronization error is smaller than d/c, where d denotes the distance resolution between the intelligent voice devices and c denotes the speed of sound in air.
  7. A distributed center, comprising: a receiving unit configured to receive wake-word segment audio features and wake-word time ranges uploaded by wake-up devices, wherein the wake-up devices are intelligent voice devices, among a plurality of intelligent voice devices, that have been awakened by a wake word, and the wake-word segment audio features comprise a first time-difference-of-arrival (TDOA) feature and a first audio quality feature; a first selection unit configured to select, according to the first TDOA feature and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device; and a second selection unit configured to select, according to the first audio quality feature, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage.
  8. A distributed voice interaction system, comprising a plurality of intelligent voice devices and a distributed center, wherein: the distributed center is configured to receive wake-word segment audio features and wake-word time ranges uploaded by wake-up devices, wherein the wake-up devices are intelligent voice devices, among the plurality of intelligent voice devices, that have been awakened by a wake word, and the wake-word segment audio features comprise a first time-difference-of-arrival (TDOA) feature and a first audio quality feature; the distributed center is further configured to select, according to the first TDOA feature and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device; and the distributed center is further configured to select, according to the first audio quality feature, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage.
  9. The system of claim 8, wherein the intelligent voice device is configured to determine the wake-word time range when awakened by the wake word; the intelligent voice device is further configured to determine wake-word segment audio from the collected audio data according to the wake-word time range; the intelligent voice device is further configured to input the wake-word segment audio into a preset first extraction model to obtain the first TDOA feature, wherein the first extraction model is trained on a plurality of pieces of first audio training data with known arrival time differences, and the first audio training data contain wake words; and the intelligent voice device is further configured to upload the first TDOA feature and the wake-word time range to the distributed center.
  10. The system of claim 8, wherein the intelligent voice device is configured to input audio data collected in real time, frame by frame, into a preset first extraction model to obtain a plurality of first TDOA sub-features, wherein the first extraction model is trained on a plurality of pieces of first audio training data with known arrival time differences, and the first audio training data contain wake words; the intelligent voice device is further configured to determine the wake-word time range when awakened by the wake word; the intelligent voice device is further configured to determine the first TDOA feature from the plurality of first TDOA sub-features according to the wake-word time range; and the intelligent voice device is further configured to upload the first TDOA feature and the wake-word time range to the distributed center.
  11. The system of claim 8, wherein the intelligent voice device is configured to determine the wake-word time range when awakened by the wake word; the intelligent voice device is further configured to determine wake-word segment audio from the collected audio data according to the wake-word time range; the intelligent voice device is further configured to input the wake-word segment audio into a preset second extraction model to obtain the first audio quality feature, wherein the second extraction model is trained on a plurality of pieces of second audio training data with known audio quality, and the second audio training data contain wake words; and the intelligent voice device is further configured to upload the first audio quality feature to the distributed center.
  12. The system of claim 8, wherein the intelligent voice device is configured to input audio data collected in real time, frame by frame, into a preset second extraction model to obtain a plurality of first audio quality sub-features, wherein the second extraction model is trained on a plurality of pieces of second audio training data with known audio quality, and the second audio training data contain wake words; the intelligent voice device is further configured to determine the wake-word time range when awakened by the wake word; the intelligent voice device is further configured to determine the first audio quality feature from the plurality of first audio quality sub-features according to the wake-word time range; and the intelligent voice device is further configured to upload the first audio quality feature to the distributed center.
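The synchronization bound in claim 6 can be checked numerically: to resolve inter-device distances to within d metres from wake-word arrival times, device clocks must agree to better than d/c seconds. A minimal sketch, where the distance resolution of 0.1 m and the sound speed of 343 m/s are illustrative assumptions rather than values from the patent:

```python
# Sketch of the clock-synchronization requirement in claim 6:
# arrival-time comparisons can only resolve distances to within
# d metres if clocks agree to better than d / c seconds.

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C (assumed)

def max_sync_error(distance_resolution_m: float) -> float:
    """Upper bound on the allowable clock error between devices."""
    return distance_resolution_m / SPEED_OF_SOUND

# For a 0.1 m distance resolution, clocks must be synchronized
# to better than about 0.29 ms.
bound = max_sync_error(0.1)
print(f"{bound * 1e3:.3f} ms")
```

Tighter distance resolution scales the requirement linearly: resolving 1 cm would demand sub-30-microsecond synchronization, which is why the claim states the bound as d/c rather than a fixed number.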

Description

Distributed voice interaction method, system and distributed center

Technical Field

This application relates to the technical field of voice interaction, and in particular to a distributed voice interaction method, a distributed voice interaction system, and a distributed center.

Background

With the popularization of intelligent voice devices, multiple intelligent voice devices may operate simultaneously in the same environment. For example, in a home, appliances such as a television, an air conditioner, a refrigerator, and a washing machine may all have intelligent voice interaction functionality. Because several intelligent voice devices coexist in the same environment, they may all respond to a user's voice command within a short time, producing a "one call, multiple responses" phenomenon. At present, the intelligent voice device with the largest signal energy during the wake-word period is selected as the responding device, and that responding device is also used directly as the pickup device for the recognition stage, so as to avoid the "one call, multiple responses" phenomenon. However, this approach depends too heavily on signal energy during the wake-word period; in practice it is affected by factors such as noise and speaker orientation, so the intelligent voice device closest to the user may fail to be selected as the responding device. Moreover, because the responding device is used directly as the pickup device for the recognition stage, the quality of the audio captured in the recognition stage cannot be guaranteed, and the voice recognition result suffers.
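The energy-based selection criticized above can be sketched as follows; the function names, the use of RMS as the energy measure, and the toy sample values are illustrative assumptions, not from the patent:

```python
import math

def rms_energy(samples: list[float]) -> float:
    """Root-mean-square energy of one device's wake-word audio window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pick_by_energy(windows: dict[str, list[float]]) -> str:
    """Prior approach: the device with the largest wake-word signal
    energy both responds and picks up recognition audio. Noise or
    speaker orientation can make this pick a farther device."""
    return max(windows, key=lambda dev: rms_energy(windows[dev]))

# Toy example: the nearby TV records a quiet signal because the user
# faces away, while a farther device sits next to a noise source.
windows = {
    "tv":     [0.10, -0.12, 0.11],
    "fridge": [0.30, -0.28, 0.31],
}
print(pick_by_energy(windows))  # picks "fridge", not the closer "tv"
```

This is exactly the failure mode the background describes: a single scalar energy comparison cannot distinguish "closest to the user" from "loudest at the microphone".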
Disclosure of Invention

In view of the above, embodiments of this application disclose a distributed voice interaction method, a distributed voice interaction system, and a distributed center, which ensure that the intelligent voice device closest to the user always responds to the user and that the captured recognition utterances have better audio quality. The technical solutions provided by the embodiments of this application are as follows.

In a first aspect, an embodiment of this application provides a distributed voice interaction method, the method comprising: receiving wake-word segment audio features and a wake-word time range uploaded by wake-up devices, wherein the wake-up devices are intelligent voice devices, among a plurality of intelligent voice devices, that have been awakened by a wake word, and the wake-word segment audio features comprise a first time-difference-of-arrival (TDOA) feature and a first audio quality feature; selecting, according to the first TDOA feature and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device; and selecting, according to the first audio quality feature, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage.
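The distributed center's two selections in the first aspect can be sketched minimally as follows. The data shapes and field names are assumptions, and the patent's trained decision models are replaced here by a direct min/max over already-derived scalar values:

```python
from dataclasses import dataclass

@dataclass
class WakeReport:
    """One wake-up device's upload, reduced to illustrative scalars."""
    device_id: str
    arrival_time: float   # wake-word arrival time derived from the TDOA feature
    audio_quality: float  # higher = better wake-word audio quality

def select_devices(reports: list[WakeReport]) -> tuple[str, str]:
    """Return (responding_device, pickup_device): the earliest
    wake-word arrival responds to the user; the best audio quality
    picks up audio for the recognition stage. They may differ."""
    responder = min(reports, key=lambda r: r.arrival_time).device_id
    pickup = max(reports, key=lambda r: r.audio_quality).device_id
    return responder, pickup

reports = [
    WakeReport("tv", 0.012, 0.80),       # closest, so it responds
    WakeReport("speaker", 0.015, 0.95),  # best audio, so it picks up
]
print(select_devices(reports))  # → ('tv', 'speaker')
```

The key design point the method makes is visible here: decoupling the responding role (proximity, via TDOA) from the pickup role (audio quality), instead of assigning both to whichever device had the most signal energy.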
In one possible implementation, there are a plurality of wake-up devices and a plurality of first TDOA features, and selecting, according to the first TDOA features and the wake-word time range, the wake-up device with the earliest wake-word arrival time as the responding device comprises: extracting, from each of the plurality of first TDOA features, the TDOA sub-feature of a common time period according to the wake-word time ranges respectively corresponding to the plurality of first TDOA features, to obtain a plurality of TDOA sub-features; inputting the TDOA sub-features into a preset first decision model to obtain a first decision result, wherein the first decision model is trained on a plurality of pieces of first audio training data with known arrival time differences, and the first audio training data contain wake words; and selecting, according to the first decision result, the wake-up device with the earliest wake-word arrival time as the responding device. In one possible implementation, there are a plurality of wake-up devices and a plurality of first audio quality features, and selecting, according to the first audio quality features, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage comprises: inputting the plurality of first audio quality features into a preset second decision model to obtain a second decision result, wherein the second decision model is trained on a plurality of pieces of second audio training data with known audio quality, and the second audio training data contain wake words; and selecting, according to the second decision result, the wake-up device with the best wake-word audio quality as the pickup device for the recognition stage. In one possible implementation, the method further includes: The first extracti