KR-102962503-B1 - AUTOMATIC SPEECH RECOGNITION DEVICE AND CONTROL METHOD THEREOF

Abstract

The present invention relates to an apparatus, and a control method thereof, for performing denormalization on speech recognition text. More specifically, in a method for controlling speech recognition text of a speech recognition device, at least one denormalization model is stored for each category; speech recognition text is received as input; a category predictor predicts the category of the input text; a denormalization model corresponding to the predicted category is selected from among the stored denormalization models; and denormalization is performed on the input speech recognition text based on the selected model.

Inventors

  • 장윤정
  • 박종세

Assignees

  • 주식회사 카카오엔터프라이즈
  • 주식회사 에이엑스지

Dates

Publication Date
2026-05-07
Application Date
2022-02-22

Claims (10)

  1. A method for controlling a speech recognition device that performs denormalization on speech recognition text, the method comprising: training a category prediction unit, wherein the category prediction unit is based on a Deep Neural Network (DNN) artificial neural network algorithm; receiving, by the category prediction unit, speech recognition text as input; predicting, by the category prediction unit, a category of the input speech recognition text from among a plurality of categories; selecting, by a model selection unit, a denormalization model corresponding to the predicted category from among a plurality of denormalization models, wherein the plurality of denormalization models are Weighted Finite State Transducer (WFST) models based on probabilistic models; and performing, by the selected denormalization model, denormalization on the input speech recognition text, wherein a plurality of corpora are collected corresponding to the plurality of categories, respectively, and the plurality of denormalization models are generated corresponding to the plurality of corpora, respectively, wherein training data consists of at least one set, each set comprising a text to be analyzed and a correct-answer category for that text, and the training is performed based on the training data, the training comprising: when one set of the training data is input, performing morphological analysis on the text to be analyzed included in the set; converting the morphological analysis result into a vector sequence; and updating parameters of the artificial neural network so that the distance between vector sequences corresponding to the same correct-answer category becomes smaller.
  2. (Deleted)
  3. (Deleted)
  4. (Deleted)
  5. (Deleted)
  6. A speech recognition device that performs denormalization on speech recognition text, comprising: a memory storing a plurality of denormalization models, the plurality of denormalization models being Weighted Finite State Transducer (WFST) models based on probabilistic models; and a processor configured to execute at least one instruction, wherein the processor trains a category prediction unit based on training data consisting of at least one set, each set comprising a text to be analyzed and a correct-answer category for that text; when one set of the training data is input, the processor performs morphological analysis on the text to be analyzed included in the set, converts the morphological analysis result into a vector sequence, and updates parameters of the artificial neural network so that the distance between vector sequences corresponding to the same correct-answer category becomes smaller; the category prediction unit is based on a Deep Neural Network (DNN) artificial neural network algorithm, receives speech recognition text as input, and predicts a category of the input speech recognition text from among a plurality of categories; the processor selects a denormalization model corresponding to the predicted category from among the stored plurality of denormalization models, and denormalization of the input speech recognition text is performed based on the selected denormalization model, wherein a plurality of corpora are collected corresponding to the plurality of categories, respectively, and the stored plurality of denormalization models are generated corresponding to the plurality of corpora, respectively.
  7. (Deleted)
  8. (Deleted)
  9. (Deleted)
  10. (Deleted)
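The control flow recited in claims 1 and 6 (predict a category with a DNN, select the matching per-category denormalization model, apply it) can be sketched as follows. This is a minimal illustration under stated assumptions: the regex rules stand in for the probabilistic WFST models, and the keyword check stands in for the DNN-based category prediction unit; none of these stand-ins appear in the patent itself.

```python
import re

# Hypothetical stand-ins for the per-category WFST denormalization
# models of claims 1 and 6 (simple regex rules, purely illustrative).
DENORM_MODELS = {
    "datetime": lambda t: re.sub(r"three thirty", "3:30", t),
    "plain": lambda t: t,  # leave the text fully spelled out
}

def predict_category(text: str) -> str:
    """Toy stand-in for the DNN-based category prediction unit."""
    return "datetime" if "thirty" in text else "plain"

def denormalize(text: str) -> str:
    category = predict_category(text)   # predict category of input text
    model = DENORM_MODELS[category]     # select the matching model
    return model(text)                  # apply denormalization

print(denormalize("meet at three thirty"))  # meet at 3:30
```

In the claimed design, each model is trained on its own category-specific corpus, so selection by predicted category is what lets one device serve formats with differing output requirements.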

Description

Automatic speech recognition device and control method thereof

The present invention relates to a speech recognition device and a control method thereof, and more specifically, to a speech recognition device and a control method that perform denormalization on text produced as a result of speech recognition.

Recently, speech recognition technology has been applied across many sectors of society. Its scope of use is expanding well beyond electronic devices such as smartphones to include call centers, meeting minutes, and video.

Speech recognition technology converts human speech into text. The recognized result is determined by the format of the text that constitutes the language model. If all training text is written in Korean, the speech recognition results will be output in Korean; conversely, if data containing a mixture of numerals and English letters is used, the results will contain numerals or English letters accordingly.

As speech recognition technology has recently been applied to various services, the requirements for output formatting vary with the type and purpose of each service. Some services require that information such as dates and times be displayed with high readability, while others require the entire output to be written in Korean. However, modifying text corpora or reconstructing language models every time such a requirement arises is very difficult.

A prior-art reference (U.S. Patent Application Publication US2009/0157385) relates to performing Inverse Text Normalization (ITN) in a speech recognition system, proposing a method for converting spoken-form text generated by a speech-to-text conversion engine into a format suitable for on-screen display. However, this prior art performs the conversion uniformly, without considering the varying requirements of different types of services.
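The service-dependent formatting requirement described above can be illustrated with a toy inverse-text-normalization contrast. The rules and "services" below are assumptions for illustration only, not the patent's WFST grammars:

```python
# The same spoken-form recognition output rendered two ways, one per
# service requirement: high-readability symbols vs. all-Korean script.
SPOKEN = "두 시 삼십 분"  # spoken-form Korean for "two thirty"

def itn_readable(text: str) -> str:
    """Hypothetical service A: render times as digits and symbols."""
    return text.replace("두 시 삼십 분", "2:30")

def itn_korean(text: str) -> str:
    """Hypothetical service B: keep everything fully written out."""
    return text  # spoken form is already all-Korean

print(itn_readable(SPOKEN))  # 2:30
print(itn_korean(SPOKEN))    # 두 시 삼십 분
```

Maintaining one such rule set per service is exactly the burden the invention targets: instead of rebuilding language models, a per-category denormalization model is selected at output time.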
Accordingly, there is a need for research on technology that allows language models to be applied selectively depending on the type of service.

FIG. 1 is a conceptual diagram illustrating the network structure of a Weighted Finite State Transducer (WFST) model used in an embodiment of the present invention.
FIG. 2 is a conceptual diagram of matching between training data and a WFST model according to an embodiment of the present invention.
FIG. 3 illustrates an example of the output of a plurality of WFST models according to an embodiment of the present invention.
FIG. 4 is a block diagram of a speech recognition device (100) according to an embodiment of the present invention.
FIG. 5 is a flowchart in which a speech recognition device (100) according to an embodiment of the present invention receives a category from a user and selects a denormalization model.
FIG. 6 is a flowchart of text denormalization through category prediction according to an embodiment of the present invention.
FIG. 7 is a block diagram of a learning unit (409) according to an embodiment of the present invention.
FIG. 8 is a conceptual diagram of training data (801) according to an embodiment of the present invention.
FIG. 9 is a flowchart of the learning process of a learning unit (409) according to an embodiment of the present invention.
FIG. 10 is a conceptual diagram of the learning process of a learning unit (409) according to an embodiment of the present invention.
FIG. 11 is an interface state diagram for reflecting user intent according to an embodiment of the present invention.
FIG. 12 illustrates an example in which a speech recognition device (100) according to an embodiment of the present invention changes training data based on the user's intent.
FIG. 13 is a diagram illustrating the configuration of a speech recognition device (100) according to one embodiment.

Hereinafter, embodiments disclosed in this specification will be described in detail with reference to the attached drawings. Identical or similar components will be assigned the same reference numerals regardless of drawing symbols, and redundant descriptions thereof will be omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably solely for ease of drafting the specification and do not in themselves carry distinct meanings or roles. Furthermore, in describing the embodiments disclosed in this specification, detailed descriptions of related prior art will be omitted where it is determined that they could obscure the essence of the disclosed embodiments. The attached drawings are intended only to facilitate understanding of the embodiments disclosed in this specification; the technical concept disclosed herein is not limited by the attached drawings, and it should be understood to include all modifications, equivalents, and substitutes.