EP-4305544-B1 - REGULARIZING WORD SEGMENTATION
Inventors
- RAMABHADRAN, BHUVANA
- XU, Hainan
- AUDHKHASI, KARTIK
- HUANG, YINGHUI
Dates
- Publication Date: 2026-05-06
- Application Date: 2022-03-24
Claims (5)
- A computer-implemented method (600) of subword segmentation executed on data processing hardware (710), the method comprising: receiving, as input to a subword segmentation routine (300), an input word (302) to be segmented into a plurality of subword units (119); and executing the subword segmentation routine (300) to segment the input word (302) into the plurality of subword units (119) by: accessing a trained vocabulary set (350) of subword units (119); and selecting the plurality of subword units (119) from the input word (302) by greedily finding a longest subword unit from the input word (302) that is present in the trained vocabulary set (350) until an end of the input word (302) is reached, wherein selecting the plurality of subword units (119) comprises, for each corresponding position of a plurality of different positions of the input word (302): identifying all possible candidate subword units (119) from the input word (302) at the corresponding position that are present in the trained vocabulary set (350); and randomly sampling from all of the possible candidate subword units (119) by assigning a 1-p probability to a longest one of the possible candidate subword units (119) and dividing a rest of the p probability evenly among all of the possible candidate subword units (119) from the input word (302) at the corresponding position.
- The method (600) of claim 1, further comprising, prior to executing the subword segmentation routine (300), creating a misspelling of the input word (302) by independently deleting, using a pre-specified probability, each character from the input word (302).
- The method (600) of claim 1 or 2, further comprising, prior to executing the subword segmentation routine (300), creating a misspelling of the input word (302) by: pre-specifying a probability for swapping an order of adjacent character-pairs; and for each adjacent character-pair in the input word (302), swapping the order of the characters of the adjacent character-pair in the input word (302) based on the pre-specified probability.
- The method (600) of claim 3, wherein any given character in the input word (302) participates in at most one swap.
- A system (100) comprising: data processing hardware (710); and memory hardware (720) in communication with the data processing hardware (710) and storing instructions that, when executed on the data processing hardware (710), cause the data processing hardware (710) to perform the method of any one of the preceding claims.
Description
TECHNICAL FIELD

This disclosure relates to regularizing word segmentation.

BACKGROUND

Automated speech recognition (ASR) systems have evolved from multiple models (e.g., acoustic, pronunciation, and language models) where each model had a dedicated purpose to integrated models where a single neural network is used to directly map an audio waveform (i.e., input sequence) to an output sentence (i.e., output sequence). This integration has resulted in a sequence-to-sequence approach, which generates a sequence of words or graphemes when given a sequence of audio features. With an integrated structure, all components of a model may be trained jointly as a single end-to-end (E2E) neural network. Here, an E2E model refers to a model whose architecture is constructed entirely of a neural network. A fully neural network functions without external and/or manually designed components (e.g., finite state transducers, a lexicon, or text normalization modules). Additionally, when training E2E models, these models generally do not require bootstrapping from decision trees or time alignments from a separate system.

Xiao et al., "Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units," 2018 (2018-11-26), 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, 2018, pp. 146-150, doi: 10.1109/ISCSLP.2018.8706675, describe an end-to-end automatic speech recognition system, which employs subword units in a hybrid CTC-Attention based system. The subword units are obtained by the byte-pair encoding (BPE) compression algorithm. Schuster et al., "Japanese and Korean voice search," 2012 (2012-03-05), IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 5149-5152, doi: 10.1109/ICASSP.2012.6289079, describe a technique to learn word units from large amounts of data automatically and incrementally by running a greedy algorithm.
SUMMARY

The matter for protection is defined by the claims. One aspect of the disclosure provides a computer-implemented method for subword segmentation. The computer-implemented method, when executed on data processing hardware, causes the data processing hardware to perform operations that include receiving an input word to be segmented into a plurality of subword units. The operations also include executing a subword segmentation routine to segment the input word into the plurality of subword units by accessing a trained vocabulary set of subword units and selecting the plurality of subword units from the input word by greedily finding a longest subword unit from the input word that is present in the trained vocabulary set until an end of the input word is reached. Selecting the plurality of subword units includes, for each corresponding position of a plurality of different positions of the input word: identifying all possible candidate subword units from the input word at the corresponding position that are present in the trained vocabulary set; and randomly sampling from all of the possible candidate subword units by assigning a 1-p probability to a longest one of the possible candidate subword units and dividing a rest of the p probability evenly among all of the possible candidate subword units from the input word at the corresponding position.

Implementations of the disclosure may include one or more of the following optional features. The operations may further include, prior to executing the subword segmentation routine, creating a misspelling of the input word by independently deleting, using a pre-specified probability, each character from the input word.
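The randomized segmentation procedure described above can be sketched as follows. This is an illustrative implementation only; the function name, parameters, and the reading that the longest candidate receives 1-p in addition to its even share of p are assumptions, not language from the claims.

```python
import random


def segment_word(word, vocab, p=0.1, rng=random):
    """Greedily segment `word` into subwords from `vocab`, with sampling.

    At each position, all vocabulary subwords starting there are candidates.
    The longest candidate is assigned probability 1 - p, and the remaining
    probability mass p is divided evenly among all candidates (assumed here
    to include the longest one as well).
    """
    subwords = []
    i = 0
    while i < len(word):
        # All candidate subword units at position i present in the vocabulary.
        candidates = [word[i:j] for j in range(i + 1, len(word) + 1)
                      if word[i:j] in vocab]
        if not candidates:
            raise ValueError(f"no subword in vocabulary matches {word[i:]!r}")
        # Longest candidate gets 1 - p; the rest of p is split evenly.
        longest_idx = max(range(len(candidates)),
                          key=lambda k: len(candidates[k]))
        weights = [p / len(candidates)] * len(candidates)
        weights[longest_idx] += 1 - p
        choice = rng.choices(candidates, weights=weights)[0]
        subwords.append(choice)
        i += len(choice)
    return subwords
```

With p = 0 the routine reduces to plain greedy longest-match segmentation; with p > 0 the same word can segment differently across training epochs, which is the regularization effect the disclosure targets.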
In some examples, the operations also include, prior to executing the subword segmentation routine, creating a misspelling of the input word by pre-specifying a probability for swapping an order of adjacent character-pairs, and for each adjacent character-pair in the input word, swapping the order of the characters of the adjacent character-pair in the input word based on the pre-specified probability. Here, any given character in the input word participates in at most one swap. In some implementations, the operations also include receiving a training example comprising audio data characterizing an utterance of the input word and processing the audio data to generate, for output by a speech recognition model, a speech recognition result for the utterance of the input word. Here, the speech recognition result includes a sequence of hypothesized sub-word units each output from the speech recognition model at a corresponding output step. In these implementations, the operations further include determining a supervised loss term based on the sequence of hypothesized sub-word units and the plurality of subword units selected from the input word by the subword segmentation routine and updating parameters of the speech recognition model based on the supervised loss term. In some examples, the speech recognition model includes a Recurrent N
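The two misspelling-creation steps (independent character deletion, and adjacent-pair swapping with at most one swap per character) can be sketched as below. Function and parameter names are illustrative assumptions, not taken from the disclosure.

```python
import random


def random_deletions(word, p_del=0.05, rng=random):
    """Independently delete each character with pre-specified probability p_del."""
    return "".join(ch for ch in word if rng.random() >= p_del)


def random_adjacent_swaps(word, p_swap=0.05, rng=random):
    """Swap each adjacent character-pair with pre-specified probability p_swap.

    Any given character participates in at most one swap: after a swap at
    position i, the scan skips past position i + 1.
    """
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so neither character swaps again
        else:
            i += 1
    return "".join(chars)
```

Applying these perturbations before the segmentation routine exposes the model to plausible misspellings, so segmentation (and the downstream recognizer) is trained against slightly noisy word forms.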