KR-20260065411-A - Dialect Speech Classification System and Method

KR 20260065411 A

Abstract

A dialect speech classification system according to a disclosed embodiment includes: a database storing first set data containing a plurality of voice data; and a processor that trains an artificial intelligence model and classifies dialect types through the trained artificial intelligence model. The processor includes: a voice data conversion unit that converts the first set data into an image; a model learning unit that trains the artificial intelligence model to classify the first set data into a plurality of dialect types based on the first set data and the image generated by the voice data conversion unit; and a dialect type classification unit that classifies the dialect type of second set data input by a user into one of the plurality of dialect types based on the artificial intelligence model trained by the model learning unit.

Inventors

  • 권용
  • 반능기
  • 차병래

Assignees

  • 한양대학교 산학협력단 (Hanyang University Industry-University Cooperation Foundation)

Dates

Publication Date
2026-05-08
Application Date
2024-11-01

Claims (13)

  1. A dialect speech classification system comprising: a database storing first set data including a plurality of voice data; and a processor that trains an artificial intelligence model and classifies dialect types through the trained artificial intelligence model, wherein the processor includes: a voice data conversion unit that converts the first set data into an image; a model learning unit that trains the artificial intelligence model to classify the first set data into a plurality of dialect types based on the first set data and the image generated by the voice data conversion unit; and a dialect type classification unit that classifies the dialect type of second set data input by a user into one of the plurality of dialect types based on the artificial intelligence model trained by the model learning unit.
  2. The dialect speech classification system of claim 1, wherein the voice data conversion unit preprocesses the first set data by scrubbing and vectorizing it, and converts the preprocessed first set data into an image.
  3. The dialect speech classification system of claim 2, wherein the voice data conversion unit converts the preprocessed first set data into a spectrogram and a Mel spectrogram, respectively.
  4. The dialect speech classification system of claim 1, wherein the model learning unit trains the artificial intelligence model based on data obtained by merging the spectrogram and the Mel spectrogram generated by the voice data conversion unit.
  5. The dialect speech classification system of claim 4, wherein the model learning unit trains the artificial intelligence model based on data obtained by concatenating the spectrogram and the Mel spectrogram generated by the voice data conversion unit.
  6. The dialect speech classification system of claim 1, wherein the voice data conversion unit preprocesses the second set data input by the user and converts the preprocessed second set data into a spectrogram and a Mel spectrogram, respectively, and wherein the dialect type classification unit takes as input data obtained by merging or concatenating the spectrogram and the Mel spectrogram converted from the second set data, and classifies the second set data into one of the plurality of dialect types through the artificial intelligence model trained by the model learning unit.
  7. A dialect speech classification method comprising: storing a plurality of voice data received from outside as first set data; preprocessing the first set data; converting the preprocessed first set data into an image; training an artificial intelligence model to classify the first set data into a plurality of dialect types based on the first set data and the converted image; and classifying second set data input by a user into one of the plurality of dialect types based on the trained artificial intelligence model.
  8. The dialect speech classification method of claim 7, wherein the preprocessing comprises scrubbing and vectorizing the first set data.
  9. The dialect speech classification method of claim 7, wherein the converting comprises converting the preprocessed first set data into a spectrogram image and a Mel-spectrogram image, respectively, through spectrogram analysis and Mel-spectrogram analysis.
  10. The dialect speech classification method of claim 7, wherein the training comprises training the artificial intelligence model based on data obtained by merging the spectrogram and the Mel spectrogram generated by converting the first set data.
  11. The dialect speech classification method of claim 10, wherein the training comprises training the artificial intelligence model based on data obtained by concatenating the spectrogram and the Mel spectrogram generated by converting the first set data.
  12. The dialect speech classification method of claim 7, wherein the classifying comprises classifying the second set data into any one of the plurality of dialect types through the trained artificial intelligence model, based on data obtained by merging or concatenating the spectrogram and the Mel spectrogram.
  13. A dialect speech classification system comprising: a database storing an artificial intelligence model and first set data including a plurality of data; a processor that preprocesses the first set data, converts the preprocessed first set data into an image, and trains the artificial intelligence model to classify the first set data into a plurality of dialect types based on the first set data and the converted image; and an output unit that outputs a result of classifying second set data input by a user into one of the plurality of dialect types based on the trained artificial intelligence model.
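Claims 4 and 5 distinguish "merging" from "concatenating" the two images before they are fed to the model. One plausible reading, sketched below with NumPy, is channel stacking versus joining along the frequency axis; the array shapes and the interpretation itself are illustrative assumptions, since the patent does not specify dimensions or an implementation.

```python
import numpy as np

# Hypothetical inputs: one spectrogram image and one Mel-spectrogram image
# per utterance, each of shape (n_bins, n_frames). Values are placeholders.
spec = np.random.rand(128, 64)       # spectrogram image
mel_spec = np.random.rand(128, 64)   # Mel-spectrogram image

# "Merging" read as channel stacking: a single 2-channel input tensor,
# the usual format for a CNN (channels, height, width).
merged = np.stack([spec, mel_spec], axis=0)              # shape (2, 128, 64)

# "Concatenating" read as joining along the frequency axis: a single
# 1-channel image twice as tall.
concatenated = np.concatenate([spec, mel_spec], axis=0)  # shape (256, 64)
```

Either array could then serve as the input to the CNN described later in the specification; which combination performs better is an empirical question the claims leave open.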

Description

Dialect Speech Classification System and Method

The disclosed embodiment relates to a dialect speech classification system and method capable of classifying dialect speech into multiple types by converting dialect speech data into images and training an artificial intelligence model.

Speech recognition technology enables a computer to recognize spoken human voice and convert it into text or execute commands. It analyzes voice data to identify words, sentences, and meanings, and is primarily based on natural language processing and deep learning. Although speech recognition is applied in many fields, including voice-based virtual assistants, automobiles, home appliances, and medical services, its recognition rate for dialects is relatively low compared to standard language, so dialect speakers may experience inconvenience when using it.

Dialect speech recognition is a far more challenging task than standard-language recognition, owing not only to differences in pronunciation, intonation, vocabulary, and grammar, but also to data scarcity, individual variation, noise, and dialect change. While dialect speech recognition requires models trained on diverse dialects and robust to dialect change, technology for classifying dialect speech is needed even more urgently.

The neural network architectures primarily used in artificial intelligence are convolutional neural networks (CNNs) and recurrent neural networks (RNNs); CNNs are suited to spatial data such as images, while RNNs are suited to time-series data such as speech. Although a CNN is an architecture designed for image processing, it offers the following advantages when applied to speech recognition and processing.
First, it effectively extracts local patterns, allowing the detailed features of a speech signal to be analyzed thoroughly; second, its invariance keeps performance robust under variations in pronunciation or in the speech signal; and third, it supports parallel processing, enabling rapid handling of large volumes of speech data.

To apply voice data to a CNN, the voice data must first be converted into an image format. One method is to generate a spectrogram: an image that visually represents how frequency components change over time by combining the waveform and spectrum characteristics of the voice data. It can be used to analyze voice data visually or for speech recognition and processing.

A waveform is a graph of signal amplitude over time; it shows changes in sound intensity or loudness, but provides no direct frequency information. A spectrum decomposes a signal into its frequency components and shows the intensity of each, but since it discards time, it does not reveal how the speech signal evolves. A spectrogram combines the two: by analyzing the waveform and spectrum together, it visually represents the interaction between time and frequency.

Another method of converting speech data into images is to generate a Mel spectrogram. Human hearing responds non-linearly to frequency, and the Mel scale is a non-linear transformation of the frequency axis that mimics this characteristic. A Mel spectrogram is a spectrogram mapped onto the Mel scale; because it uses a frequency scale that reflects human auditory characteristics, it is well suited to speech processing and analysis.
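The two conversions described above can be sketched in a few lines of Python. The snippet below computes a linear-frequency power spectrogram with SciPy and then projects it onto a small bank of triangular Mel filters built from the standard Mel formula m = 2595·log10(1 + f/700). The signal, frame parameters, and filter count are illustrative assumptions, not values taken from the patent, which names the transformations but not an implementation.

```python
import numpy as np
from scipy.signal import spectrogram

def hz_to_mel(f):
    # Standard Mel mapping: non-linear compression of the frequency axis.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale, mapped back to
    # linear-frequency FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, center, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, center):
            fb[i - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fb[i - 1, k] = (hi - k) / max(hi - center, 1)
    return fb

sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440.0 * t)  # synthetic stand-in for dialect speech

# Linear-frequency power spectrogram: (n_fft // 2 + 1) bins per frame.
freqs, times, S = spectrogram(audio, fs=sr, nperseg=512, noverlap=256)

# Mel spectrogram: project the 257 linear bins onto 40 Mel bands.
mel_S = mel_filterbank(40, 512, sr) @ S
```

Both arrays are two-dimensional (frequency by time) and can be rendered or fed to a CNN as images, which is exactly the role they play in the claimed system.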
Figure 1 schematically illustrates the disclosed dialect speech classification system. Figure 2 is a control block diagram of the disclosed dialect speech classification system. Figure 3 is a flowchart of a method for receiving voice data and classifying dialect types. Figure 4 is a flowchart of the learning and inference process by which the dialect type classification unit classifies speech by dialect type. Figures 5 and 6 illustrate an example of the flowchart of Figure 4.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the attached drawings. The technical concept of the present invention, however, is not limited to the embodiments described herein and may be embodied in other forms; rather, the embodiments introduced herein are provided so that the disclosure is thorough and complete and sufficiently conveys the concept of the invention to those skilled in the art. In this specification, when a component is described as being on another component, it