KR-102962223-B1 - DEVICE AND METHOD FOR IMPROVING MEDICAL SPEECH-TO-TEXT ACCURACY WITH VISION-LANGUAGE PRE-TRAINING MODEL
Abstract
The present disclosure provides an apparatus and a method for improving medical speech recognition performance using a vision-language pre-training model. The apparatus may include a speech conversion unit configured to convert an input speech signal into text information; and a multi-modal speech processing unit configured to output corrected text information based on the text information and medical image information. According to the present disclosure, a new method can be provided that corrects the text information produced by various freely available speech-to-text (STT) modules so that it is specialized for the medical domain, without requiring additional training for medical speech recognition or the large amounts of data such training would demand.
Inventors
- 예종철
- 허재영
- 박상준
- 이정은
Assignees
- 한국과학기술원
Dates
- Publication Date: 20260512
- Application Date: 20230811
Claims (17)
- A device for improving medical speech recognition performance, comprising: a speech conversion unit configured to convert an input speech signal into text information; and a multi-modal speech processing unit configured to output corrected text information based on the text information and medical image information, wherein the multi-modal speech processing unit comprises: an image encoder configured to encode the medical image information; a multi-modal encoder configured to perform cross-attention between the text information from the speech conversion unit and the medical image information from the image encoder using a vision-language pre-training (VLP) model; and a multi-modal decoder configured to generate the corrected text information reflecting visual semantics using a multi-modal representation generated through the cross-attention.
- (Deleted)
- (Deleted)
- The device of claim 1, wherein the multi-modal speech processing unit further comprises a momentum teacher module for improving the multi-modal representation.
- The device of claim 1, wherein the VLP model is configured to perform cross-modal contrastive (CMC) learning and intra-modal contrastive (IMC) learning so that image and text features lie in the same embedding space.
- The device of claim 5, wherein the VLP model is configured to be optimized based on masked language modeling (MLM), masked image modeling (MIM), CMC loss, IMC loss, and image-text matching (ITM) loss.
- The device of claim 1, wherein the multi-modal speech processing unit is trained to predict the next word according to an auto-regressive language model.
- The device of claim 1, wherein the multi-modal speech processing unit is configured to be fine-tuned by performing random removal, random replacement, and random insertion.
- A method for improving medical speech recognition performance, performable by a computing device, the method comprising: converting, in a speech conversion unit, an input speech signal into text information; and outputting, in a multi-modal speech processing unit, corrected text information based on the text information and medical image information, wherein the multi-modal speech processing unit comprises an image encoder, a multi-modal encoder, and a multi-modal decoder, and outputting the corrected text information comprises: encoding the medical image information in the image encoder; performing, in the multi-modal encoder, cross-attention between the text information from the speech conversion unit and the medical image information from the image encoder using a vision-language pre-training (VLP) model; and generating, in the multi-modal decoder, the corrected text information reflecting visual semantics using a multi-modal representation generated through the cross-attention.
- (Deleted)
- (Deleted)
- The method of claim 9, wherein generating the corrected text information comprises improving the multi-modal representation using a momentum teacher model.
- The method of claim 9, wherein the VLP model is configured to perform cross-modal contrastive (CMC) learning and intra-modal contrastive (IMC) learning so that image and text features lie in the same embedding space.
- The method of claim 13, wherein the VLP model is configured to be optimized based on masked language modeling (MLM), masked image modeling (MIM), CMC loss, IMC loss, and image-text matching (ITM) loss.
- The method of claim 9, wherein the multi-modal speech processing unit is trained to predict the next word according to an auto-regressive language model.
- The method of claim 9, wherein the multi-modal speech processing unit is fine-tuned by performing random removal, random replacement, and random insertion.
- A computer program stored on a computer-readable medium, comprising computer-executable instructions for executing the method according to any one of claims 9 and 12 through 16.
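The multi-modal encoder recited in claim 1 performs cross-attention in which text features query encoded image features. The numpy sketch below illustrates the single-head form of this mechanism; the projection matrices, dimensions, and single-head structure are illustrative assumptions, not values disclosed in the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_tokens, d_k=64, rng=None):
    """Text queries attend over image keys/values (single head).

    text_tokens:  (T, d) features derived from the STT transcript
    image_tokens: (V, d) features from the image encoder
    Returns (T, d) image-conditioned text features.
    """
    rng = rng or np.random.default_rng(0)
    d = text_tokens.shape[1]
    # Hypothetical projections; in the claimed device these weights
    # would be learned inside the multi-modal encoder.
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = text_tokens @ W_q                      # queries from text
    K = image_tokens @ W_k                     # keys from image
    V = image_tokens @ W_v                     # values from image
    attn = softmax(Q @ K.T / np.sqrt(d_k))     # (T, V) attention map
    return attn @ V                            # visually grounded text features

# Toy shapes: 5 text tokens, 49 image patches, 32-dim features.
out = cross_attention(np.ones((5, 32)), np.ones((49, 32)))
```

The multi-modal decoder of claim 1 would consume such image-conditioned text features when generating the corrected transcript.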
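The cross-modal contrastive (CMC) learning recited in claims 5 and 13 is commonly realized as a symmetric InfoNCE objective over matched image-text pairs; applying the same form within a single modality yields intra-modal contrast (IMC). The sketch below is a minimal illustration under that assumption; the temperature value and exact loss form are not taken from the patent.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched image/text pairs together.

    img_emb, txt_emb: (N, d) L2-normalised embeddings of N matched pairs.
    Matched pairs sit on the diagonal of the similarity matrix; the loss
    drives diagonal similarities above off-diagonal ones in both
    image-to-text and text-to-image directions.
    """
    logits = img_emb @ txt_emb.T / temperature   # (N, N) similarities
    labels = np.arange(len(logits))              # matched pair = diagonal

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # cross-entropy on diagonal

    return 0.5 * (ce(logits) + ce(logits.T))
```

A well-aligned embedding space (matched pairs most similar) yields a near-zero loss, while mismatched pairs yield a large one.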
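The fine-tuning recited in claims 8 and 16 corrupts text by random removal, random replacement, and random insertion, so the decoder can learn to recover clean text from STT-like errors. A simple token-level augmenter illustrating these three operations is sketched below; the probabilities and the uniform split among operations are illustrative assumptions.

```python
import random

def corrupt(tokens, vocab, p=0.15, rng=None):
    """Randomly delete, replace, or insert tokens to mimic STT errors.

    Pairs of (corrupted sequence, clean sequence) can then fine-tune a
    correction model as a denoiser. `p` is the total corruption rate,
    split evenly among the three operations.
    """
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p / 3:
            continue                         # random removal
        elif r < 2 * p / 3:
            out.append(rng.choice(vocab))    # random replacement
        elif r < p:
            out.append(tok)
            out.append(rng.choice(vocab))    # random insertion
        else:
            out.append(tok)                  # keep token unchanged
    return out

sample = corrupt("no acute cardiopulmonary abnormality".split(),
                 vocab=["opacity", "effusion", "normal"])
```

With `p=0.0` the function returns the input unchanged, which makes the corruption rate easy to anneal during training.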
Description
Device and method for improving medical speech-to-text recognition performance using a vision-language pre-training model

The present disclosure relates to speech recognition technology, and more specifically to an apparatus and method for improving medical speech recognition performance using a vision-language pre-training model.

Automatic Speech Recognition (ASR) is a technology that enables a machine to recognize spoken language and convert it into text. By analyzing speech patterns, ASR determines which words were pronounced and transcribes them. The technology is widely used in voice-controlled devices and virtual assistants, and makes communication more convenient by enabling hands-free interaction. One of the most widespread applications of ASR is Speech-to-Text (STT) models, which convert speech into text in real time. Traditionally, probabilistic models such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) have been used for speech-to-text conversion. Recent advances in deep learning enable more accurate and sophisticated STT models; for example, Google, Microsoft, Facebook, and Baidu have introduced STT models built on deep learning techniques such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Open-source speech recognition toolkits such as Kaldi and Julius are also used for speech recognition in various fields.

Medical STT applications are typically used to convert spoken medical instructions into text, thereby reducing congestion and increasing workflow efficiency. However, medical terminology and language are complex and nuanced, so STT models trained on general language may fail to transcribe them with precise meaning. Furthermore, training STT models for a specific medical domain is resource-intensive and time-consuming because it requires large volumes of medical speech-to-text data.
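The transcription errors described above are conventionally quantified by word error rate (WER), one of the metrics the present disclosure later compares in FIG. 13. A minimal word-level WER sketch, computed as the Levenshtein edit distance over words normalized by reference length:

```python
def word_error_rate(ref, hyp):
    """Levenshtein distance over words, normalised by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
          for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                       # deletion
                           dp[i][j - 1] + 1,                       # insertion
                           dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Character error rate (CER) follows the same recurrence over characters instead of words.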
Additionally, medical datasets present privacy and security issues and are difficult to access.

FIG. 1 is an exemplary block diagram illustrating a computing device for improving medical speech recognition performance using a vision-language pre-training model according to one embodiment of the present disclosure. FIG. 2 is an exemplary block diagram showing functional modules of the computing device of FIG. 1 according to one embodiment of the present disclosure. FIG. 3 is an exemplary diagram visually illustrating the functional modules and input/output signals of FIG. 2. FIG. 4 is an exemplary diagram illustrating a vision-language pre-training (VLP) model according to one embodiment of the present disclosure. FIG. 5 is an exemplary diagram showing a multi-modal medical speech module (MMSM) model according to one embodiment of the present disclosure. FIG. 6 is an exemplary diagram showing a text-only speech recognition model. FIG. 7 is an exemplary diagram illustrating a training strategy for fine-tuning an MMSM according to one embodiment of the present disclosure. FIG. 8 is an exemplary flowchart illustrating a method for improving medical speech recognition performance using a vision-language pre-training model according to one embodiment of the present disclosure. FIG. 9a is an exemplary diagram showing comparison results between the method of the present disclosure and other models in terms of clinical significance. FIG. 9b is an exemplary diagram illustrating comparison results between the method of the present disclosure and other models for long and complex sentences. FIG. 10a is an exemplary diagram illustrating correction results of the method of the present disclosure and other models for text obtained using the Google STT system. FIG. 10b is an exemplary diagram illustrating correction results of the method of the present disclosure and other models for text obtained using the Julius STT system. FIG. 11 is an exemplary diagram showing quantitative comparison results of the method of the present disclosure and other models. FIGS. 12a and 12b are exemplary diagrams showing comparisons of clinical evaluations in successful and failed cases, respectively. FIG. 13 is an exemplary diagram showing comparison results of WER and CER measured using three different STT systems at different noise levels.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same reference numerals whenever possible, even when shown in different drawings. Furthermore, in describing the present invention, detailed descriptions of related known components or functions are omitted where they could obscure the essence of the invention. Various aspects o