CN-122024710-A - Configuration method, configuration device, electronic equipment, readable storage medium and computer program product for voice activity detection duration

CN122024710A

Abstract

The disclosure provides a method, an apparatus, an electronic device, a readable storage medium, and a computer program product for configuring a voice activity detection duration, relating to the field of speech processing, and in particular to the technical fields of speech recognition, voice interaction, and vehicle-mounted operating systems. The method comprises: determining, by using a trained large model, first semantic information of a user's current input information, wherein the first semantic information comprises at least a sentence structure of text corresponding to the user's current input information; determining a word count score of the user's current input information by using a predefined mapping relation, based on the word count of that text; performing numerical processing on the first semantic information to obtain a result value corresponding to the first semantic information; determining a semantic score of the user's current input information based on the result value and the word count score; and configuring a voice activity detection duration of the user's current input information based on the semantic score.
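
The scoring flow summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the word-count mapping table, the completeness values (0.2 and 1.0), the 0.6/0.4 weights, the linear score-to-duration formula, and all function names are hypothetical, and the large-model analysis is replaced by a boolean supplied by the caller.

```python
from bisect import bisect_right

# Hypothetical predefined mapping: word-count boundaries -> word count score.
WORD_COUNT_BREAKS = [3, 6, 10]             # assumed bucket boundaries
WORD_COUNT_SCORES = [0.2, 0.5, 0.8, 1.0]   # assumed score for each bucket


def word_count_score(word_count: int) -> float:
    """Look up the word count score in the assumed mapping table."""
    return WORD_COUNT_SCORES[bisect_right(WORD_COUNT_BREAKS, word_count)]


def result_value(structure_complete: bool) -> float:
    """Numeric value for sentence-structure completeness (values assumed)."""
    return 1.0 if structure_complete else 0.2


def semantic_score(structure_complete: bool, word_count: int,
                   w_result: float = 0.6, w_words: float = 0.4) -> float:
    """Combine the result value and the word count score (weights assumed)."""
    return (w_result * result_value(structure_complete)
            + w_words * word_count_score(word_count))


def vad_duration_ms(score: float, min_ms: int = 300, max_ms: int = 1200) -> int:
    """Map a higher semantic score to a shorter detection duration."""
    score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_ms - score * (max_ms - min_ms))


if __name__ == "__main__":
    # A complete 12-word utterance versus an incomplete 2-word utterance.
    print(vad_duration_ms(semantic_score(True, 12)))
    print(vad_duration_ms(semantic_score(False, 2)))
```

With these assumed values, the complete 12-word utterance yields 300 ms and the incomplete 2-word utterance yields 1020 ms, so the system waits longer when the user has probably not finished speaking.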

Inventors

ZHOU WENHUAN

Assignees

北京百度网讯科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260212

Claims (14)

1. A method for configuring a voice activity detection duration, comprising: determining first semantic information of a user's current input information by using a trained large model, wherein the first semantic information comprises at least a sentence structure of text corresponding to the user's current input information; determining a word count score of the user's current input information by using a predefined mapping relation based on a word count of the text corresponding to the user's current input information; performing numerical processing on the first semantic information to obtain a result value corresponding to the first semantic information, including: determining a first intermediate value characterizing the integrity of the sentence structure, and determining the result value corresponding to the first semantic information based on the first intermediate value; determining a semantic score of the user's current input information based on the result value and the word count score; and configuring a voice activity detection duration of the user's current input information based on the semantic score, wherein the voice activity detection duration and the semantic score are inversely related.
2. The method of claim 1, wherein the determining a first intermediate value characterizing the integrity of the sentence structure comprises: in response to determining that the text corresponding to the user's current input information has a predefined structurally incomplete sentence pattern, taking a first value as the first intermediate value; otherwise, taking a second value as the first intermediate value, wherein the first value is smaller than the second value.
3. The method of claim 1 or 2, wherein the first semantic information further comprises current intent information, entity information, and slot filling information corresponding to the user's current input information, and wherein the performing numerical processing on the first semantic information to obtain a result value corresponding to the first semantic information further comprises: determining a second intermediate value characterizing a confidence level of the current intent information; determining a third intermediate value characterizing the integrity of the entity information; and determining a fourth intermediate value characterizing a slot filling rate of the slot filling information; and wherein the determining the result value corresponding to the first semantic information based on the first intermediate value comprises: determining the result value based on the first intermediate value and at least one of the second intermediate value, the third intermediate value, and the fourth intermediate value.
4. The method of claim 3, wherein the determining a third intermediate value characterizing the integrity of the entity information comprises: determining an expected entity of the user's current input information by using the large model; in response to determining that the expected entity is present in the user's current input information, taking a third value as the third intermediate value; otherwise, taking a fourth value as the third intermediate value, wherein the third value is larger than the fourth value.
5. The method of claim 3 or 4, further comprising: in response to determining that there is historical dialogue information associated with the user's current input information, determining, based on the historical dialogue information, expected intent information for the historical dialogue information by using the large model; determining a matching degree between the current intent information and the expected intent information based on a semantic similarity algorithm; and updating the value of the semantic score to a weighted sum of the semantic score and the matching degree.
6. The method of any one of claims 1-5, wherein the configuring a voice activity detection duration of the user's current input information based on the semantic score comprises: in response to determining that the value of the semantic score is greater than a score threshold, taking a first duration as the voice activity detection duration; otherwise, taking a second duration as the voice activity detection duration, wherein the first duration is smaller than the second duration.
7. The method of claim 6, further comprising: in response to determining that the matching degree is greater than or equal to a matching degree threshold and the semantic score is greater than or equal to the score threshold, updating the voice activity detection duration to a third duration, wherein the third duration is less than the first duration.
8. The method of claim 6, further comprising: in response to determining that the word count is less than a word count threshold, updating the voice activity detection duration to a fourth duration, wherein the fourth duration is greater than the second duration.
9. A method for training a large model, wherein the large model is used to implement the configuration method of any one of claims 1-8, the training method comprising: acquiring sample user input information, a sample prompt word, and sample semantic information corresponding to the sample input text; inputting the sample user input information and the sample prompt word into the large model to obtain second semantic information; determining a loss value of the large model based on the second semantic information and the sample semantic information; and adjusting parameters of the large model based on the loss value until the loss value is smaller than a loss threshold.
10. An apparatus for configuring a voice activity detection duration, comprising: a first semantic information determining module configured to determine first semantic information of a user's current input information by using a trained large model, wherein the first semantic information comprises at least a sentence structure of text corresponding to the user's current input information; a word count score determining module configured to determine a word count score of the user's current input information by using a predefined mapping relation based on a word count of the text corresponding to the user's current input information; a result value determining module configured to perform numerical processing on the first semantic information to obtain a result value corresponding to the first semantic information, including: determining a first intermediate value characterizing the integrity of the sentence structure, and determining the result value corresponding to the first semantic information based on the first intermediate value; a semantic score calculating module configured to determine a semantic score of the user's current input information based on the result value and the word count score; and a detection duration configuration module configured to configure a voice activity detection duration of the user's current input information based on the semantic score, wherein the voice activity detection duration and the semantic score are inversely related.
11. An apparatus for training a large model, wherein the training apparatus is for use in the configuration apparatus of claim 10, the training apparatus comprising: a sample acquisition module configured to acquire sample user input information, a sample prompt word, and sample semantic information corresponding to the sample input text; a second semantic information determining module configured to input the sample user input information and the sample prompt word into the large model to obtain second semantic information; a loss value determining module configured to determine a loss value of the large model based on the second semantic information and the sample semantic information; and a parameter adjusting module configured to adjust parameters of the large model based on the loss value until the loss value is smaller than a loss threshold.
12. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
13. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
14. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-9.
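
The rules in claims 2 to 8 can be sketched as a small decision procedure. This is a minimal illustration under stated assumptions: every duration, threshold, and weight below is a hypothetical placeholder chosen only to satisfy the orderings the claims state (third < first < second < fourth), the equal weighting of the four intermediate values is assumed, the matching degree is assumed to come from a separate semantic-similarity step, and the precedence given to the word-count rule when claims 7 and 8 both apply is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VadRules:
    """Hypothetical settings; the claims fix only the orderings noted below."""
    first_ms: int = 500         # used when the score exceeds the threshold
    second_ms: int = 900        # first < second (claim 6)
    third_ms: int = 300         # third < first (claim 7)
    fourth_ms: int = 1200       # fourth > second (claim 8)
    score_threshold: float = 0.6
    match_threshold: float = 0.8
    word_count_threshold: int = 3
    match_weight: float = 0.3   # assumed weight in the claim 5 weighted sum


def result_value(structure_complete: bool, intent_confidence: float,
                 entity_present: bool, slot_fill_rate: float) -> float:
    """Claims 2-4: combine four intermediate values (equal weights assumed)."""
    first = 1.0 if structure_complete else 0.2  # first value < second value
    third = 1.0 if entity_present else 0.3      # third value > fourth value
    return (first + intent_confidence + third + slot_fill_rate) / 4


def choose_duration_ms(score: float, word_count: int,
                       match: Optional[float] = None,
                       rules: VadRules = VadRules()) -> int:
    """Pick a detection duration from the semantic score and optional match."""
    if match is not None:
        # Claim 5: update the score to a weighted sum of score and match.
        score = (1 - rules.match_weight) * score + rules.match_weight * match
    # Claim 6: a high score selects the shorter of the two base durations.
    duration = rules.first_ms if score > rules.score_threshold else rules.second_ms
    # Claim 7: a strong match plus a sufficient score shortens it further.
    if (match is not None and match >= rules.match_threshold
            and score >= rules.score_threshold):
        duration = rules.third_ms
    # Claim 8: very short inputs get the longest duration (precedence assumed).
    if word_count < rules.word_count_threshold:
        duration = rules.fourth_ms
    return duration
```

For example, under these assumed settings `choose_duration_ms(0.9, 8)` selects the first duration, while the same score with a matching degree of 0.9 selects the shorter third duration, and any input under three words falls back to the fourth duration.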

Description

Configuration method, configuration device, electronic equipment, readable storage medium and computer program product for voice activity detection duration

Technical Field

The present disclosure relates to the field of speech processing, in particular to the technical fields of speech recognition, voice interaction, and vehicle-mounted operating systems, and more particularly to a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for configuring a voice activity detection duration.

Background

Voice activity detection is a core component of a voice interaction system, and the voice activity detection duration is a key parameter of it that directly affects how accurately the system detects speech. Particularly in intelligent dialogue scenarios based on intelligent vehicle-mounted terminals, customer service robots, voice assistants, and the like, if the voice activity detection duration is fixed or determined by a single factor, the voice interaction system has difficulty adapting to users' brief pauses, such as pausing to think or to take a breath, which easily causes speech to be cut off prematurely or responses to be delayed. For example, if the detection duration is set too short, the system is likely to judge that the user has finished speaking before the user has finished expressing themselves, which damages the completeness and accuracy of the recognized speech; conversely, if the detection duration is set too long, the system is slow to respond after the user has finished speaking, which reduces interaction speed. A reasonably configured voice activity detection duration therefore keeps the dialogue natural and fluent and balances response speed against speech completeness in real-time interaction.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for configuring a voice activity detection duration.
According to one aspect of the disclosure, a method for configuring a voice activity detection duration is provided, the method comprising: determining first semantic information of a user's current input information by using a trained large model, wherein the first semantic information comprises at least a sentence structure of text corresponding to the user's current input information; determining a word count score of the user's current input information by using a predefined mapping relation based on a word count of the text corresponding to the user's current input information; performing numerical processing on the first semantic information to obtain a result value corresponding to the first semantic information, including determining a first intermediate value characterizing the integrity of the sentence structure and determining the result value corresponding to the first semantic information based on the first intermediate value; determining a semantic score of the user's current input information based on the result value and the word count score; and configuring a voice activity detection duration of the user's current input information based on the semantic score, wherein the voice activity detection duration and the semantic score are inversely related.
According to a second aspect of the disclosure, a method for training a large model is provided, wherein the large model is used to implement the above method for configuring a voice activity detection duration, the training method comprising: acquiring sample user input information, a sample prompt word, and sample semantic information corresponding to sample input text; inputting the sample user input information and