US-12620391-B2 - Method, device and computer program product for controlling a functional unit of a vehicle

Abstract

Multi-part keywords for activating a speech operating system are recognized by receiving a phonetic sequence and checking the phonetic sequence for the presence of a first part of the keyword. Only if the first part is recognized, the phonetic sequence is checked for the presence of a second part of the keyword. The speech operating system for carrying out an action is activated only if the second part is recognized.

Inventors

Markus Leder
Sabrina Lamberth-Cocca

Assignees

Mercedes-Benz Group AG

Dates

Publication Date: 20260505
Application Date: 20221007
Priority Date: 20211019

Claims (11)

1. A method comprising: receiving, by a microphone of a vehicle, sound signals; converting, by a speech operating system of the vehicle in a first speech recognition phase, the received sound signals into a phonetic sequence in a stream of symbols; storing the stream of symbols in a ring memory; determining, by the speech operating system, that the stream of symbols stored in the ring memory includes a first part of a multi-part keyword; initiating, only when it is determined the stream of symbols stored in the ring memory includes the first part of the multi-part keyword, a second speech recognition phase comprising checking the stream of symbols stored in the ring memory for presence of a second part of the multi-part keyword until an earliest of a first condition is satisfied or a second and third condition are satisfied, wherein the first condition is recognizing the second part of the multi-part keyword, the second condition is expiration of a predetermined amount of time since the determination that the stream of symbols stored in the ring memory includes the first part of the multi-part keyword and the third condition is a number of phonetic symbols of the stream of symbols reaches a predetermined limit; determining, by the speech operating system before the second and third conditions are satisfied, that the stream of symbols stored in the ring memory includes the second part of the multi-part keyword; and sending, by the speech operating system responsive to the determining the stream of symbols stored in the ring memory includes the second part of the multi-part keyword, a command to at least one other control system of the vehicle to actuate at least one function of the vehicle.
2. The method of claim 1, wherein the determination that the stream of symbols stored in the ring memory includes the first and the second part of the multi-part keyword is confirmed as recognized as soon as a degree of similarity of a phonetic sequence with a first and second saved comparison symbol sequence lies above a predetermined threshold value.
3. The method of claim 1, wherein successive identical phonetic symbols in the phonetic sequence are considered as a single phonetic symbol.
4. The method of claim 1, wherein the second part of the multi-part keyword is determined following a most recent first part of the multi-part keyword.
5. The method of claim 1, wherein the received phonetic sequence is smoothed out by placing a continuous viewing window above the phonetic sequence and only phonetic symbols in the viewing window are used for the determining that the stream of symbols includes the first and second parts of the multi-part keyword, wherein a number of the phonetic symbols in the viewing window exceeds a predetermined frequency threshold within the viewing window.
6. A method comprising: receiving, by a microphone of a vehicle, sound signals; initiating a first speech recognition phase comprising converting, by a speech operating system of the vehicle in a first speech recognition phase, the received sound signals into a phonetic sequence in a stream of symbols; storing the stream of symbols in a ring memory; and determining, by the speech operating system, that the stream of symbols stored in the ring memory includes a first part of a multi-part keyword; initiating, only when it is determined the stream of symbols stored in the ring memory includes the first part of the multi-part keyword, a second speech recognition phase, wherein the second speech recognition phase comprises checking the stream of symbols stored in the ring memory for presence of a second part of the multi-part keyword until an earliest of a first condition is satisfied or a second and third condition are satisfied, wherein the first condition is recognizing the second part of the multi-part keyword, the second condition is expiration of a predetermined amount of time since the determination that the stream of symbols stored in the ring memory includes the first part of the multi-part keyword and the third condition is a number of phonetic symbols of the stream of symbols reaches a predetermined limit; determining, by the speech operating system, that the second and third conditions are satisfied and the second part of the multi-part keyword is not in the stream of symbols stored in the ring memory and in response returning to the first speech recognition phase until it is determined the first part of the multi-part keyword is in the stream of symbols stored in the ring memory and the second part of the multi-part keyword is in the stream of symbols before the second and third conditions are satisfied; and sending, by the speech operating system responsive to the determining the stream of symbols stored in the ring memory includes the second part of the multi-part keyword, a command to at least one other control system of the vehicle to actuate at least one function of the vehicle.
7. The method of claim 6, wherein the determination that the stream of symbols stored in the ring memory includes the first and the second part of the multi-part keyword is confirmed as recognized as soon as a degree of similarity of a phonetic sequence with a first and second saved comparison symbol sequence lies above a predetermined threshold value.
8. The method of claim 6, wherein successive identical phonetic symbols in the phonetic sequence are considered as a single phonetic symbol.
9. The method of claim 6, wherein the second part of the multi-part keyword is determined following a most recent first part of the multi-part keyword.
10. The method of claim 6, wherein the received phonetic sequence is smoothed out by placing a continuous viewing window above the phonetic sequence and only phonetic symbols in the viewing window are used for the determining that the stream of symbols includes the first and second parts of the multi-part keyword, wherein a number of the phonetic symbols in the viewing window exceeds a predetermined frequency threshold within the viewing window.
11. The method of claim 6, further comprising: determining, during the second speech recognition phase, that the stream of symbols stored in the ring memory includes the first part of the multi-part keyword; and restarting the second speech recognition phase using the first part of the multi-part keyword to determine whether the stream of symbols includes the second part of the multi-part keyword, and resetting the predetermined amount of time and the predetermined limit.

Description

BACKGROUND AND SUMMARY OF THE INVENTION

Exemplary embodiments of the present invention relate to a method and a device for determining a multi-part keyword in a phonetic sequence in a speech utterance of a user.

US 2018/0342237 A1 discloses a method for recognizing a keyword in which a sound sequence is received on a device and a determination of a keyword is carried out. If a keyword can be determined, the phonetic sequence is sent to an external server. In a further step, a piece of text derived from the phonetic sequence by the external server is received, said text being examined for a match with the keyword.

Exemplary embodiments of the present invention are directed to an alternative method and a device for recognizing, in particular, a two-part keyword.

In the method according to the invention, the received phonetic sequence is checked in a first step or a first search phase for the presence of a first part of the keyword. Here, a stream of sounds emanating from a user is converted to a stream of phonetic symbols, i.e., to a phonetic sequence, which is compared to a comparison symbol sequence. Only if the first part of the keyword is recognized is a check of the phonetic sequence for the presence of a second part of the keyword carried out in a further step, i.e., in a second search phase, and if the second part is also recognized, an activation of the speech operating system is carried out. Accordingly, the recognition of the two-part keyword is implemented in such a way that a first part of the keyword, also known as the differentiator, is initially searched for in the phonetic sequence, and, as soon as this is recognized, only the second part of the keyword, also known as the body, is searched for within limits determined by time and data volumes.
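
The two search phases described above can be pictured as a small state machine. The following Python sketch is illustrative only and is not taken from the patent: the class name `TwoPhaseDetector`, its parameters, the default limits, the exact string matching, and the choice to abandon the second phase as soon as either limit is reached are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TwoPhaseDetector:
    """Hypothetical two-phase keyword spotter over a stream of phonetic symbols."""
    first_part: str             # the "differentiator", one character per symbol
    second_part: str            # the "body"
    max_seconds: float = 2.0    # assumed time limit for the second phase
    max_symbols: int = 20       # assumed symbol limit for the second phase
    buffer: deque = field(default_factory=lambda: deque(maxlen=64))  # ring memory
    phase: int = 1
    started_at: float = 0.0
    symbols_seen: int = 0       # symbols received since the first part matched

    def _tail(self, n: int) -> str:
        # Last n symbols currently held in the ring memory.
        return "".join(list(self.buffer)[-n:])

    def _back_to_phase_one(self) -> None:
        self.phase, self.symbols_seen = 1, 0

    def feed(self, symbol: str, now: float) -> bool:
        """Consume one symbol; return True when both parts have been recognized."""
        self.buffer.append(symbol)
        if self.phase == 1:
            # Phase 1: search only for the first part.
            if self._tail(len(self.first_part)) == self.first_part:
                self.phase, self.started_at, self.symbols_seen = 2, now, 0
            return False
        # Phase 2: search only for the second part, within the limits.
        self.symbols_seen += 1
        enough = self.symbols_seen >= len(self.second_part)
        if enough and self._tail(len(self.second_part)) == self.second_part:
            self._back_to_phase_one()
            return True
        timed_out = now - self.started_at > self.max_seconds
        too_many = self.symbols_seen >= self.max_symbols
        if timed_out or too_many:  # assumption: either limit ends phase 2
            self._back_to_phase_one()
        return False


# Only the final symbol of "xxhebot" completes the two-part keyword.
detector = TwoPhaseDetector(first_part="he", second_part="bot")
hits = [detector.feed(s, now=0.1 * i) for i, s in enumerate("xxhebot")]
```

A real system would replace the exact tail comparison with a similarity measure and would receive symbols from an acoustic front end rather than from a string.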
An activation of the speech operating system comprises an action; for example, upon recognizing the keyword, a vehicle function is triggered or a dialogue is established in which the user is asked, with regard to the entertainment system, which services or functions they want to activate or what information they would like. Thus, the speech control system sending commands to other control systems in order to actuate vehicle functions is also to be understood as implementing an action.

Recognizing the keyword in two steps reduces the frequent erroneous recognitions that occur when the keyword, or parts of it, is used in a context that differs from the activation intention. Recognition of the first and second part of the keyword always has a certain error rate. When the two parts have to be recognized independently of each other, this error rate is much lower than when the keyword is to be recognized as a whole. If the error rate of the first part of the keyword is 0.2 and the error rate of the second part of the keyword is 0.1, for example, then this results in a total error rate of 0.02 (0.2 × 0.1, treating the two errors as independent). In contrast, recognizing the keyword as a whole has a much higher error rate, for example 0.1. In comparison to recognizing the entire keyword, the separate recognition of the parts of the keyword in successive phases thus leads to a lower error rate.

In a modified embodiment, the first and the second part of the keyword are each confirmed as recognized as soon as a degree of similarity of the phonetic sequence to a first and second stored comparison symbol sequence, respectively, lies above a predetermined threshold value. The degree of similarity is determined using methods known from the prior art for the string distance, for example the minimum edit distance.

In a further embodiment, successive identical phonetic symbols of an incoming phonetic sequence are considered to be a single sound.
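
The similarity threshold and the merging of repeated symbols can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the normalization of the edit distance by the longer sequence length, the threshold value of 0.8, and the assumption that the stored comparison sequence contains no repeated adjacent symbols are all chosen for the example.

```python
from itertools import groupby


def collapse_runs(symbols: str) -> str:
    """Treat successive identical phonetic symbols as a single symbol."""
    return "".join(key for key, _ in groupby(symbols))


def edit_distance(a: str, b: str) -> int:
    """Minimum edit (Levenshtein) distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def similarity(candidate: str, reference: str) -> float:
    """Similarity in [0, 1]; this normalization is an assumption."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, reference) / longest


def is_recognized(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    """Confirm a keyword part once the similarity lies above the threshold."""
    # The reference is assumed to be stored without repeated adjacent symbols.
    return similarity(collapse_runs(candidate), reference) > threshold


# A drawn-out pronunciation collapses to the same symbols as the reference.
stretched_ok = is_recognized("heeeyy", "hey")
unrelated_ok = is_recognized("xyz", "hey")
```

Collapsing runs before measuring the distance means a drawn-out vowel does not count as several insertions against the stored comparison sequence.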
Advantageously, pronunciation durations of different lengths of individual sounds, for example vowels, of a word are thus compensated for and erroneous recognition of the keywords is reduced.

In a further development of the invention, a predetermined time after recognizing the first part of the keyword, the check for the second part is terminated, and the search for the first part is continued. The time is to be chosen in such a way that even a very slowly pronounced second part of the keyword is not cut off by the expiry of the predetermined time. Advantageously, if the user pauses after pronouncing the first part of the keyword and later accidentally pronounces a word corresponding to the second part of the keyword, an interpretation as the keyword and an unwanted activation of the speech operating system are avoided. As a result of the termination, a recognition of a newly spoken multi-part keyword is made possible.

According to a further additional or alternative embodiment of the present invention, the incoming phonetic symbols are continuously buffered in a memory, wherein the quantity of the phonetic symbols stored in the buffer is determined as soon as the first part of the keyword is recognized. If the number of stored phonetic symbols reaches a predetermined limit, i.e., the upper limit,