
CN-122029599-A - Internal speech iterative learning loop


Abstract

Methods and systems for iteratively training a user and an ML model to produce accurate internal speech output are disclosed. The methods and systems access an ML model and perform a first training iteration in which EMG data corresponding to internal speech is processed by the ML model to decode the EMG data into a set of predicted phonemes, phoneme sounds, words, or phrases. The methods and systems present the set of predicted phonemes, phoneme sounds, words, or phrases to a user and form a first training dataset that includes the set of predicted phonemes, phoneme sounds, words, or phrases, the EMG data, and a specified set of phonemes, phoneme sounds, words, or phrases as ground-truth information. The methods and systems update parameters of the ML model based on the first training dataset before starting a second training iteration.
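The iterative loop in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the synthetic EMG features, the toy least-squares decoder, and names such as `training_iteration` and `N_PHONEMES` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PHONEMES = 4    # size of the specified phoneme set (assumption)
N_FEATURES = 16   # per-window EMG feature vector length (assumption)

def training_iteration(W, specified, history):
    """One loop iteration: collect EMG, decode, present predictions,
    form the training dataset with ground-truth labels, refit the model."""
    # 1. Collect EMG features for the specified phonemes (synthetic here;
    #    a real system would record from surface electrodes).
    emg = rng.normal(size=(len(specified), N_FEATURES))
    emg[np.arange(len(specified)), specified] += 2.0  # learnable cue
    # 2. Decode the EMG data into predicted phonemes.
    predicted = (emg @ W).argmax(axis=1)
    # 3. Presenting `predicted` to the user ends the iteration; the user
    #    adapts their internal speech based on this feedback.
    # 4. Form the training dataset: the EMG data plus the specified
    #    phonemes as ground-truth labels.
    history.append((emg, specified))
    X = np.vstack([e for e, _ in history])
    Y = np.eye(N_PHONEMES)[np.concatenate([s for _, s in history])]
    # 5. Update model parameters before the next iteration (least squares
    #    stands in for a gradient update on a neural decoder).
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W, predicted

W = rng.normal(scale=0.1, size=(N_FEATURES, N_PHONEMES))  # random init
specified = rng.integers(0, N_PHONEMES, size=32)
history = []
for _ in range(5):  # repeated training iterations
    W, predicted = training_iteration(W, specified, history)

accuracy = (predicted == specified).mean()
```

Both "sides" of the loop improve together: the model is refit on accumulated data at each iteration, while the user (simulated here by a fixed feature cue) receives feedback after every pass.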

Inventors

  • MARCOS JIMENEZ
  • Mel Meishulam
  • Asif Zifu

Assignees

  • Snap Inc.

Dates

Publication Date
2026-05-12
Application Date
2024-10-01
Priority Date
2023-10-10

Claims (20)

  1. A method, comprising: accessing a machine learning (ML) model; performing a first training iteration in which electromyography (EMG) data corresponding to internal speech generated by a user for a set of specified phonemes, phoneme sounds, words, or phrases is processed by the ML model to decode the EMG data into a set of predicted phonemes, phoneme sounds, words, or phrases; presenting the set of predicted phonemes, phoneme sounds, words, or phrases to the user to end the first training iteration; forming a first training dataset comprising the set of predicted phonemes, phoneme sounds, words, or phrases, the EMG data, and the set of specified phonemes, phoneme sounds, words, or phrases as ground-truth information; and updating one or more parameters of the ML model based on the first training dataset before a second training iteration is started.
  2. The method of claim 1, further comprising: generating instructions for the user to generate the internal speech corresponding to the set of specified phonemes, phoneme sounds, words, or phrases; and collecting the EMG data corresponding to the internal speech generated by the user for the set of specified phonemes, phoneme sounds, words, or phrases.
  3. The method of any of claims 1-2, further comprising: performing a second training iteration in which additional EMG data corresponding to additional internal speech generated by the user for an additional set of specified phonemes, phoneme sounds, words, or phrases is processed by the ML model with the updated one or more parameters to decode the additional EMG data into an additional set of predicted phonemes, phoneme sounds, words, or phrases; and presenting the additional set of predicted phonemes, phoneme sounds, words, or phrases to the user to end the second training iteration.
  4. The method of claim 3, further comprising: forming a second training dataset comprising the additional set of predicted phonemes, phoneme sounds, words, or phrases, the additional EMG data, and the additional set of specified phonemes, phoneme sounds, words, or phrases as additional ground-truth information; and updating the one or more parameters of the ML model based on the second training dataset before a third training iteration is started.
  5. The method of any one of claims 1 to 4, wherein the ML model comprises a convolutional neural network (CNN) or a transformer model.
  6. The method of claim 5, wherein the CNN comprises two or more convolutional two-dimensional (2D) layers, with max pooling and dropout after each of the two or more convolutional 2D layers, followed by two or more fully connected layers.
  7. The method of any of claims 1 to 6, wherein updating one or more parameters of the ML model comprises updating one or more weights of the ML model.
  8. The method of any of claims 1-7, wherein the set of specified phonemes, phoneme sounds, words, or phrases comprises a sequence of phonemes, phoneme sounds, words, or phrases; wherein forming the first training dataset comprises applying weights to portions of the EMG data corresponding to each of the set of specified phonemes, phoneme sounds, words, or phrases; and wherein a first portion of the EMG data corresponding to a later word or phrase in the sequence is associated with a higher weight than a second portion of the EMG data corresponding to an earlier word or phrase in the sequence.
  9. The method of claim 8, wherein the user generates internal speech for the later word after generating internal speech for the earlier word.
  10. The method of any of claims 1-9, wherein the ML model learns from EMG signals captured from the user as the user learns how to generate internal speech in a manner that is accurately decoded by the ML model at each of a plurality of training iterations including the first training iteration and the second training iteration.
  11. The method of any one of claims 1 to 10, wherein accessing the ML model comprises initializing the ML model with random parameters to generate random feedback.
  12. The method of any of claims 1 to 11, wherein accessing the ML model comprises: capturing an EMG data training set generated by the user speaking a set of training phonemes, phoneme sounds, words, or phrases; and initially training the ML model based on the EMG data training set and the corresponding set of training phonemes, phoneme sounds, words, or phrases.
  13. The method of any of claims 1 to 12, wherein accessing the ML model comprises: capturing an EMG data training set generated by a set of users speaking a set of training phonemes, phoneme sounds, words, or phrases; and initially training the ML model based on the EMG data training set and the corresponding set of training phonemes, phoneme sounds, words, or phrases.
  14. The method of any one of claims 1 to 13, further comprising: receiving input from the user indicating a request to terminate execution of a training iteration; and responsive to receiving the input, applying the ML model that has been trained in a plurality of training iterations including the first training iteration and the second training iteration to control operation of a user system.
  15. The method of any one of claims 1 to 14, further comprising: determining that a prediction made by the ML model meets a stopping criterion comprising one or more preset thresholds; and responsive to determining that the prediction made by the ML model meets the stopping criterion comprising the one or more preset thresholds, automatically terminating execution of a training iteration and controlling operation of a user system by applying the ML model that has been trained in a plurality of training iterations including the first training iteration and the second training iteration.
  16. The method of any one of claims 1 to 15, further comprising: capturing a first portion of the EMG data corresponding to a first word or phrase of the set of specified phonemes, phoneme sounds, words, or phrases; decoding, by the ML model, the first portion of the EMG data into a predicted word or phrase of the set of predicted phonemes, phoneme sounds, words, or phrases; and presenting the predicted word or phrase to the user while capturing a second portion of the EMG data corresponding to a second word or phrase of the set of specified phonemes, phoneme sounds, words, or phrases, wherein each word or phrase of the set of specified phonemes, phoneme sounds, words, or phrases is identical.
  17. The method of any one of claims 1 to 16, further comprising: converting the EMG data into one or more images depicting a spectrogram comprising a matrix representing the EMG data, the one or more images each corresponding to a different electrode of a set of electrodes used to collect the EMG data; and processing, by the ML model, the one or more images depicting the spectrogram to generate the set of predicted phonemes, phoneme sounds, words, or phrases.
  18. The method of any one of claims 1 to 17, further comprising: establishing a secure connection between a mobile device and an EMG device, the mobile device executing an interactive application; and receiving, by the mobile device, the EMG data from the EMG device over the secure connection, wherein the mobile device detects the presence of the internal speech.
  19. A system, comprising: at least one processor; and at least one memory component having instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: accessing a machine learning (ML) model; performing a first training iteration in which electromyography (EMG) data corresponding to internal speech generated by a user for a set of specified phonemes, phoneme sounds, words, or phrases is processed by the ML model to decode the EMG data into a set of predicted phonemes, phoneme sounds, words, or phrases; presenting the set of predicted phonemes, phoneme sounds, words, or phrases to the user to end the first training iteration; forming a first training dataset comprising the set of predicted phonemes, phoneme sounds, words, or phrases, the EMG data, and the set of specified phonemes, phoneme sounds, words, or phrases as ground-truth information; and updating one or more parameters of the ML model based on the first training dataset before a second training iteration is started.
  20. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: accessing a machine learning (ML) model; performing a first training iteration in which electromyography (EMG) data corresponding to internal speech generated by a user for a set of specified phonemes, phoneme sounds, words, or phrases is processed by the ML model to decode the EMG data into a set of predicted phonemes, phoneme sounds, words, or phrases; presenting the set of predicted phonemes, phoneme sounds, words, or phrases to the user to end the first training iteration; forming a first training dataset comprising the set of predicted phonemes, phoneme sounds, words, or phrases, the EMG data, and the set of specified phonemes, phoneme sounds, words, or phrases as ground-truth information; and updating one or more parameters of the ML model based on the first training dataset before a second training iteration is started.
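Claim 17 describes converting multi-electrode EMG data into spectrogram images, one per electrode, for the ML model to process. A minimal sketch of such a conversion follows; the short-time Fourier transform with a Hann window, the frame length and hop size, and the helper name `emg_to_spectrograms` are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def emg_to_spectrograms(emg, n_fft=64, hop=32):
    """Convert multi-electrode EMG into one spectrogram matrix per
    electrode. `emg` has shape (electrodes, samples); each output
    image has shape (frequency bins, time frames)."""
    window = np.hanning(n_fft)
    images = []
    for channel in emg:  # one image per electrode, as in claim 17
        frames = []
        for start in range(0, len(channel) - n_fft + 1, hop):
            segment = channel[start:start + n_fft] * window
            frames.append(np.abs(np.fft.rfft(segment)))  # magnitudes
        # transpose so rows are frequency bins, columns are time frames
        images.append(np.array(frames).T)
    return images

# Synthetic 4-electrode recording (assumption: 1 s at 1 kHz).
rng = np.random.default_rng(0)
emg = rng.normal(size=(4, 1000))
images = emg_to_spectrograms(emg)
```

Each resulting matrix can then be fed to an image-oriented model such as the CNN of claim 6 (stacked 2D convolutional layers with max pooling and dropout, followed by fully connected layers).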

Description

Internal Speech Iterative Learning Loop

Priority Claim

The present application claims the benefit of priority from U.S. patent application Ser. No. 18/484,243, filed 10/2023, which is incorporated herein by reference in its entirety.

Technical Field

The present disclosure relates to electromyography (EMG) voice systems, and to interactive applications and/or extended reality (XR) devices, such as augmented reality (AR) devices and/or virtual reality (VR) devices.

Background

Some electronically enabled devices include various input interfaces that allow a user to communicate with other users. Such input interfaces include a voice message interface that enables a user to send verbal messages to other people. Other input interfaces include text input into which users type their desired messages. These types of input interfaces require movement by the user, such as moving facial muscles to produce speech for a verbal message, or moving fingers to select different keys on a keyboard.

Brief Description of the Drawings

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To facilitate recognition of the discussion of any particular element or act, the most significant digit or digits in a reference numeral refer to the figure number in which that element was first introduced. Some non-limiting examples are shown in the figures of the accompanying drawings, in which:

FIG. 1 is a diagrammatic representation of a networking environment in which the present disclosure may be deployed, according to some examples. FIG. 2 is a diagrammatic representation of a messaging system having both client-side and server-side functionality, in accordance with some examples. FIG. 3 is a diagrammatic representation of a data structure maintained in a database, in accordance with some examples. FIG. 4 is a diagrammatic representation of a message, according to some examples. FIG. 5 is a diagrammatic representation of a user wearing an EMG communication device, according to some examples. FIG. 6 is a diagrammatic representation of an EMG voice detection system, according to some examples. FIG. 7 is an illustrative output of an EMG voice detection system, according to some examples. FIG. 8 is a flowchart illustrating example operations of an EMG voice detection system, according to some examples. FIG. 9 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein, according to some examples. FIG. 10 is a block diagram illustrating a software architecture in which an example may be implemented. FIG. 11 illustrates a system in which a head-wearable device may be implemented, according to some examples.

Detailed Description

The following description includes systems, methods, techniques, sequences of instructions, and computer program products embodying illustrative examples of the present disclosure. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of the various examples. It will be apparent, however, to one skilled in the art that the examples may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques have not necessarily been shown in detail.

Some conventional non-invasive brain-computer interfaces (BCIs) use electroencephalogram (EEG) sensors. Such systems detect neural signals in the brain of a user and decode the neural signals into various operations. These systems can be cumbersome to deploy and difficult to place accurately on the user's head. Other non-invasive computer interfaces utilize electromyography (EMG) electrodes that detect electrical signals associated with muscle activity.
Such systems rely on measurements of muscle activity (as captured by the EMG signal). Of particular interest for BCI is the use of surface EMG to discriminate and identify sub-audible speech signals generated with relatively little or no acoustic output. Speech-related EMG signals can be measured at various locations across the face and neck, including on the sides of the subject's throat, near the throat, and under the chin. Speaking is a motor activity associated with visible muscle movement, but thinking speech is not a motor activity. Internal speech or imagined speech refers to the voluntary behavior of saying something silently, such as vividly imagining talking, with the tongue, mouth, and/or facial muscles motionless or seldom moving, and not intended to be understood by another person. Specifically, when a person wants to speak a word or phrase, the person's brain generates a neural signal and supplies the neural signal to a corresponding speech-generating muscle, such as the larynx, throat, tongue, and the like. Subthreshold muscle activation, also known as subthr