CN-113971948-B - Representation method, voice recognition device and electronic equipment
Abstract
The embodiments of this application provide a representation method, a speech recognition method, a speech recognition device, and an electronic device. The representation method comprises: obtaining data to be processed, the data to be processed being one of speech data to be processed, text data to be processed, and image data to be processed; obtaining a feature vector generated by feature extraction on a data vector corresponding to the data to be processed; converting the feature vector through a filter to obtain a triplet used for self-attention computation, the triplet comprising a query vector, a key vector, and a value vector; and obtaining a corresponding network representation according to the query vector, the key vector, and the value vector. The technical scheme provided by the embodiments of this application directly reduces the number of parameters of the self-attention mechanism, and thereby the number of parameters of a neural network applying that mechanism, i.e., the memory the network occupies, which facilitates wider deployment of the neural network.
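As a minimal sketch (not the patent's actual implementation), the filter-based triplet generation summarized above might look like the following, assuming the two filters are small causal 1-D filters shared across feature dimensions, so that each filter's parameter count (its number of taps) is far below the feature dimension, and the value vector is produced by plain assignment:

```python
import numpy as np

def context_filter(x, w):
    """Causal 1-D filter shared across feature dimensions.

    x: (T, d) sequence of feature vectors; w: (k,) filter taps.
    The filter has only k parameters, far fewer than a full
    d x d linear projection matrix would require.
    """
    T, d = x.shape
    k = len(w)
    padded = np.vstack([np.zeros((k - 1, d)), x])  # causal left-padding
    out = np.zeros_like(x)
    for t in range(T):
        # weighted sum over the current frame and the k-1 previous frames
        out[t] = sum(w[i] * padded[t + i] for i in range(k))
    return out

def make_triplet(x, w_query, w_key):
    q = context_filter(x, w_query)  # first filter -> query vectors
    k = context_filter(x, w_key)    # second filter -> key vectors
    v = x                           # value vectors by assignment: zero parameters
    return q, k, v
```

Here only the taps of two small filters replace the three full projection matrices of conventional self-attention; the function names and the choice of a shared causal filter are illustrative assumptions.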
Inventors
- Luo Haoneng
- Zhang Shiliang
Assignees
- Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2020-07-23
Claims (15)
- 1. A speech recognition method, comprising: acquiring a speech feature vector generated by feature extraction on a data vector corresponding to speech to be processed; converting the speech feature vector through a filter to obtain a triplet used for self-attention computation, wherein the triplet comprises a query vector, a key vector, and a value vector, the filter comprises a first filter and a second filter, the number of parameters of the first filter or of the second filter is at least one order of magnitude smaller than the number of dimensions of the query vector, any two vectors of the triplet are determined based on the first filter and the second filter respectively, and the remaining vector of the triplet is determined from the speech feature vector by an assignment operation; performing self-attention computation according to the query vector, the key vector, and the value vector to obtain a network representation corresponding to the speech feature vector, wherein the network representation is determined based on a splicing result, and the splicing result is obtained by computing on the split query vector, the split key vector, and the split value vector and splicing the computation results; and recognizing the speech to be processed according to the network representation.
- 2. The method of claim 1, wherein converting the speech feature vector through a filter to obtain a triplet for self-attention computation comprises: converting the speech feature vector through two filters respectively to obtain two vectors of the triplet; and performing an assignment operation on the speech feature vector to obtain the remaining vector of the triplet.
- 3. The method of claim 2, wherein the two vectors obtained by converting the speech feature vector through the two filters are the query vector and the key vector of the triplet; and performing the assignment operation on the speech feature vector comprises assigning the speech feature vector, as a value, to the value vector of the triplet.
- 4. The method of claim 1, wherein the method is applied to a speech recognition model comprising a speech encoder; acquiring the speech feature vector comprises: obtaining, by the speech recognition model, the data vector corresponding to the speech to be processed, performing feature extraction on the data vector by a feature extraction part of the speech recognition model to generate the speech feature vector, and inputting the speech feature vector into the speech encoder; the triplet for self-attention computation is obtained by converting the speech feature vector through the filter in the speech encoder; and the self-attention computation on the query vector, the key vector, and the value vector is performed by a self-attention layer in the speech encoder to obtain the corresponding network representation.
- 5. The method of claim 1, wherein the method is applied to a speech recognition model comprising a speech encoder and a speech decoder; acquiring the speech feature vector comprises: obtaining, by the speech decoder in the speech recognition model, a speech encoding vector output by the speech encoder, and performing feature extraction on the speech encoding vector to generate the speech feature vector; the triplet for self-attention computation is obtained by converting the speech feature vector through the filter in the speech decoder; and the self-attention computation on the query vector, the key vector, and the value vector is performed by a self-attention layer in the speech decoder to obtain the corresponding network representation.
- 6. A representation method based on a self-attention mechanism, comprising: acquiring data to be processed, the data to be processed being one of speech data to be processed, text data to be processed, and image data to be processed; acquiring a feature vector generated by feature extraction on a data vector corresponding to the data to be processed; converting the feature vector through a filter to obtain a triplet used for self-attention computation, wherein the triplet comprises a query vector, a key vector, and a value vector, the filter comprises a first filter and a second filter, the number of parameters of the first filter or of the second filter is at least one order of magnitude smaller than the number of dimensions of the query vector, any two vectors of the triplet are determined based on the first filter and the second filter respectively, and the remaining vector of the triplet is determined by an assignment operation on the feature vector; and performing self-attention computation according to the query vector, the key vector, and the value vector to obtain a network representation corresponding to the feature vector, wherein the network representation is determined based on a splicing result, and the splicing result is obtained by computing on the split query vector, the split key vector, and the split value vector and splicing the computation results.
- 7. The method of claim 6, wherein when the data to be processed is speech data to be processed, the corresponding feature vector is a speech feature vector; when the data to be processed is text data to be processed, the corresponding feature vector is a text feature vector; or when the data to be processed is image data to be processed, the corresponding feature vector is an image feature vector.
- 8. The method of claim 6, wherein converting the feature vector through a filter to obtain a triplet for self-attention computation comprises: converting the feature vector through two filters respectively to obtain two vectors of the triplet; and performing an assignment operation on the feature vector to obtain the remaining vector of the triplet.
- 9. The method of claim 8, wherein the two vectors obtained by converting the feature vector through the two filters are the query vector and the key vector of the triplet; and performing the assignment operation on the feature vector comprises assigning the feature vector, as a value, to the value vector of the triplet.
- 10. The method of any one of claims 6-9, wherein the filter is a neural network model that models context information.
- 11. The method of claim 10, wherein the neural network model is based on a feedforward sequential memory network (FSMN) for modeling context information.
- 12. A representation device based on a self-attention mechanism, comprising: an acquisition module, configured to acquire data to be processed, the data to be processed being one of speech data to be processed, text data to be processed, and image data to be processed, and to acquire a feature vector generated by feature extraction on a data vector corresponding to the data to be processed; a vector generation module, configured to convert the feature vector through a filter to obtain a triplet used for self-attention computation, wherein the triplet comprises a query vector, a key vector, and a value vector, the filter comprises a first filter and a second filter, the number of parameters of the first filter or of the second filter is at least one order of magnitude smaller than the number of dimensions of the query vector, any two vectors of the triplet are determined based on the first filter and the second filter respectively, and the remaining vector of the triplet is determined by an assignment operation on the feature vector; and a representation generation module, configured to perform self-attention computation according to the query vector, the key vector, and the value vector to obtain a network representation corresponding to the feature vector, wherein the network representation is determined based on a splicing result, and the splicing result is obtained by computing on the split query vector, the split key vector, and the split value vector and splicing the computation results.
- 13. A speech recognition apparatus, comprising: a speech feature vector acquisition module, configured to acquire a speech feature vector generated by feature extraction on a data vector corresponding to speech to be processed; a vector generation module, configured to convert the speech feature vector through a filter to obtain a triplet used for self-attention computation, wherein the triplet comprises a query vector, a key vector, and a value vector, the filter comprises a first filter and a second filter, the number of parameters of the first filter or of the second filter is at least one order of magnitude smaller than the number of dimensions of the query vector, any two vectors of the triplet are determined based on the first filter and the second filter respectively, and the remaining vector of the triplet is determined from the speech feature vector by an assignment operation; a representation generation module, configured to perform self-attention computation according to the query vector, the key vector, and the value vector to obtain a network representation corresponding to the speech feature vector, wherein the network representation is determined based on a splicing result, and the splicing result is obtained by computing on the split query vector, the split key vector, and the split value vector and splicing the computation results; and a recognition module, configured to recognize the speech to be processed according to the network representation.
- 14. An electronic device, comprising: one or more processors; and a computer-readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method of any one of claims 1-5 or the representation method of any one of claims 6-11.
- 15. A computer-readable medium storing a computer program which, when executed by a processor, implements the speech recognition method of any one of claims 1-5 or the representation method of any one of claims 6-11.
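The "splicing result" recited in the claims above — computing on the split query, key, and value vectors and then splicing (concatenating) the per-split results — can be sketched as follows. This is an illustrative multi-head-style implementation under the assumption of scaled dot-product attention per split; the function names are not from the patent:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def spliced_attention(q, k, v, n_splits):
    """Split Q/K/V along the feature dimension, run scaled dot-product
    self-attention on each split, then splice (concatenate) the results."""
    T, d = q.shape
    dh = d // n_splits
    pieces = []
    for h in range(n_splits):
        qs = q[:, h * dh:(h + 1) * dh]
        ks = k[:, h * dh:(h + 1) * dh]
        vs = v[:, h * dh:(h + 1) * dh]
        weights = softmax(qs @ ks.T / np.sqrt(dh))  # (T, T) attention weights
        pieces.append(weights @ vs)                 # (T, dh) per-split result
    return np.concatenate(pieces, axis=1)           # splicing result: (T, d)
```

Each row of `weights` sums to 1, so every output position is a convex combination of the value vectors within its split.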
Description
Representation method, voice recognition device and electronic equipment

Technical Field

The embodiments of this application relate to the field of artificial intelligence, and in particular to a representation method, a speech recognition method, a device, and an electronic device.

Background

The self-attention mechanism (self-attention) is a network framework in deep learning that is widely applied in fields such as natural language processing and speech recognition. The self-attention mechanism allows a neural network model to better correlate context, which in turn makes processing results more accurate. In existing self-attention mechanisms, an input feature vector is generally linearly transformed by a linear transformation matrix to obtain the Query, Key, and Value corresponding to that feature vector; then, according to the Query, Key, and Value of each feature vector, all features are queried to obtain the network representation after the self-attention mechanism is applied. However, the self-attention mechanism has a large number of parameters, i.e., it occupies a large amount of memory when applied, so it is not easy to popularize, especially on devices with little memory or cache.

Disclosure of Invention

The application aims to provide a representation method, a speech recognition method, a device, and an electronic device, so as to at least solve or alleviate the above problems.
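The conventional scheme described in the background above — three linear transformation matrices producing Query, Key, and Value, followed by querying all features — can be sketched like this. The dimensions are illustrative assumptions, not values from the patent:

```python
import numpy as np

d = 8                                    # feature dimension (illustrative)
rng = np.random.default_rng(0)
x = rng.standard_normal((5, d))          # five input feature vectors

# Conventional self-attention: three full d x d linear transformation
# matrices produce Query, Key, and Value -- 3 * d * d parameters in total.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
query, key, value = x @ W_q, x @ W_k, x @ W_v

# Each position queries all positions to build its network representation.
scores = query @ key.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
representation = weights @ value                 # (5, d)

print("projection parameters:", 3 * d * d)       # grows quadratically with d
```

The `3 * d * d` projection parameters are exactly the cost the filter-based scheme of this application is designed to avoid.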
According to a first aspect of the embodiments of this application, a representation method based on a self-attention mechanism is provided. The representation method comprises: obtaining data to be processed, the data to be processed being one of speech data to be processed, text data to be processed, and image data to be processed; obtaining a feature vector generated by feature extraction on a data vector corresponding to the data to be processed; converting the feature vector through a filter to obtain a triplet used for self-attention computation, the triplet comprising a query vector, a key vector, and a value vector; and performing self-attention computation according to the query vector, the key vector, and the value vector to obtain a corresponding network representation. According to a second aspect of the embodiments of this application, a speech recognition method is provided, comprising: obtaining a speech feature vector generated by feature extraction on a data vector corresponding to speech to be processed; converting the speech feature vector through a filter to obtain a triplet used for self-attention computation, the triplet comprising a query vector, a key vector, and a value vector; performing self-attention computation according to the query vector, the key vector, and the value vector to obtain a corresponding network representation; and recognizing the speech to be processed according to the network representation.
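To make the parameter-count claim concrete, here is a back-of-the-envelope comparison; the feature dimension and filter size are assumed values chosen only for illustration:

```python
# Illustrative parameter counts only; d and kernel_size are assumptions.
d = 256          # dimension of the query vector
kernel_size = 8  # taps in each shared 1-D filter

standard_params = 3 * d * d       # three d x d projection matrices
filter_params = 2 * kernel_size   # two filters; the value vector is
                                  # assigned directly, costing nothing

print(standard_params, filter_params)
assert kernel_size * 10 <= d      # each filter is at least one order of
                                  # magnitude below the query dimension
```

Under these assumed sizes, 16 filter parameters stand in for 196,608 projection parameters, which is the memory saving the abstract describes.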
According to a third aspect of the embodiments of this application, a representation device based on a self-attention mechanism is provided, comprising an acquisition module, a vector generation module, and a representation generation module. The acquisition module is configured to acquire data to be processed, the data to be processed being one of speech data to be processed, text data to be processed, and image data to be processed, and to acquire a feature vector generated by feature extraction on a data vector corresponding to the data to be processed. The vector generation module is configured to convert the feature vector through a filter to obtain a triplet used for self-attention computation, the triplet comprising a query vector, a key vector, and a value vector. The representation generation module is configured to perform self-attention computation according to the query vector, the key vector, and the value vector to obtain a corresponding network representation. According to a fourth aspect of the embodiments of this application, a speech recognition device is provided, comprising a speech feature vector determining module, a vector generation module, a representation generation module, and a recognition module. The speech feature vector determining module is configured to obtain a speech feature vector generated by feature extraction on a data vector corresponding to speech to be processed; the vector generation module is configured to convert the speech feature vector through a filter to obtain a triplet used for self-attention computation, the triplet comprising a query vector, a key vector, and a value vector; the representation generation module is configured to perform self-attention computation according to the query vector, the key vector, and the value vector to obtain a corresponding network representation; and the recognition module is configured to recognize the speech to be processed according to the network representation.