US-20260128045-A1 - SPEECH RECOGNITION SYSTEM AND RELATED SPEECH RECOGNITION METHOD


Abstract

The present invention provides a voice recognition method, which includes the steps of: receiving a plurality of voice signals; performing a time-domain to frequency-domain conversion operation on the plurality of voice signals to generate a plurality of frequency-domain signals; using a morphological filter to perform filtering operations on the plurality of frequency-domain signals to generate a plurality of filtered backgrounds; generating a plurality of initial voice fingerprints according to the plurality of filtered backgrounds, wherein each of the plurality of initial voice fingerprints comprises times and frequency points corresponding to a plurality of peaks in the corresponding filtered background; generating at least one voice fingerprint according to the plurality of initial voice fingerprints; and storing the at least one voice fingerprint in a memory for subsequent voice recognition operation.

Inventors

Ying-Ying Chao

Assignees

REALTEK SEMICONDUCTOR CORP.

Dates

Publication Date: 20260507
Application Date: 20251020
Priority Date: 20241107

Claims (10)

1. A voice recognition system, comprising: a processing circuit; and a memory; wherein the processing circuit is configured to perform steps of: receiving a plurality of voice signals; performing a time-domain to frequency-domain conversion operation on the plurality of voice signals to generate a plurality of frequency-domain signals; using a morphological filter to perform filtering operations on the plurality of frequency-domain signals to generate a plurality of filtered backgrounds; generating a plurality of initial voice fingerprints according to the plurality of filtered backgrounds, wherein each of the plurality of initial voice fingerprints comprises times and frequency points corresponding to a plurality of peaks in the corresponding filtered background; generating at least one voice fingerprint according to the plurality of initial voice fingerprints; and storing the at least one voice fingerprint in the memory for subsequent voice recognition operation.
2. The voice recognition system of claim 1, wherein the step of generating the at least one voice fingerprint according to the plurality of initial voice fingerprints comprises: performing similarity matching on at least two of the plurality of initial voice fingerprints to determine the times and frequency points corresponding to a plurality of peaks of the at least one voice fingerprint.
3. The voice recognition system of claim 2, wherein the step of performing the similarity matching on the at least two of the plurality of initial voice fingerprints to determine the times and frequency points corresponding to the plurality of peaks of the at least one voice fingerprint comprises: searching for multiple peaks with matching time intervals in the at least two of the plurality of initial voice fingerprints, and determining the time and frequency points corresponding to the multiple peaks of the at least one voice fingerprint according to the time and frequency points corresponding to the multiple peaks with matching time intervals in the at least two of the plurality of initial voice fingerprints.
4. The voice recognition system of claim 1, wherein the plurality of voice signals are generated by a user speaking a same keyword at different times.
5. The voice recognition system of claim 1, wherein the processing circuit further performs steps of: receiving a specific voice signal; performing the time-domain to frequency-domain conversion operation on the specific voice signal to generate a specific frequency-domain signal; using the morphological filter to perform the filtering operation on the specific frequency-domain signal to generate a specific filtered background; generating a target voiceprint to be recognized according to the specific filtered background; and determining whether the target voiceprint matches the at least one voiceprint.
6. A voice recognition method, comprising: receiving a plurality of voice signals; performing a time-domain to frequency-domain conversion operation on the plurality of voice signals to generate a plurality of frequency-domain signals; using a morphological filter to perform filtering operations on the plurality of frequency-domain signals to generate a plurality of filtered backgrounds; generating a plurality of initial voice fingerprints according to the plurality of filtered backgrounds, wherein each of the plurality of initial voice fingerprints comprises times and frequency points corresponding to a plurality of peaks in the corresponding filtered background; generating at least one voice fingerprint according to the plurality of initial voice fingerprints; and storing the at least one voice fingerprint in a memory for subsequent voice recognition operation.
7. The voice recognition method of claim 6, wherein the step of generating the at least one voice fingerprint according to the plurality of initial voice fingerprints comprises: performing similarity matching on at least two of the plurality of initial voice fingerprints to determine the times and frequency points corresponding to a plurality of peaks of the at least one voice fingerprint.
8. The voice recognition method of claim 7, wherein the step of performing the similarity matching on the at least two of the plurality of initial voice fingerprints to determine the times and frequency points corresponding to the plurality of peaks of the at least one voice fingerprint comprises: searching for multiple peaks with matching time intervals in the at least two of the plurality of initial voice fingerprints, and determining the time and frequency points corresponding to the multiple peaks of the at least one voice fingerprint according to the time and frequency points corresponding to the multiple peaks with matching time intervals in the at least two of the plurality of initial voice fingerprints.
9. The voice recognition method of claim 6, wherein the plurality of voice signals are generated by a user speaking a same keyword at different times.
10. The voice recognition method of claim 6, further comprising: receiving a specific voice signal; performing the time-domain to frequency-domain conversion operation on the specific voice signal to generate a specific frequency-domain signal; using the morphological filter to perform the filtering operation on the specific frequency-domain signal to generate a specific filtered background; generating a target voiceprint to be recognized according to the specific filtered background; and determining whether the target voiceprint matches the at least one voiceprint.
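
As an illustration of the similarity matching recited in claims 2, 3, 7, and 8, the following minimal Python sketch shows one possible reading of "searching for multiple peaks with matching time intervals" in two initial voice fingerprints. It is not taken from the application. The names `Peak`, `interval_keys`, `merge_fingerprints`, and the `max_gap` parameter are hypothetical, and requiring the paired peaks to share the same frequency points in addition to the same time interval is an assumption made here for concreteness.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[int, int]  # (time frame index, frequency bin); hypothetical layout


def interval_keys(peaks: List[Peak], max_gap: int) -> Dict[tuple, list]:
    """Index every pair of peaks by (f1, f2, time interval between them)."""
    ordered = sorted(peaks)  # sorted by time, then frequency
    table = defaultdict(list)
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_gap:
                break  # later peaks are even further away in time
            if dt > 0:  # skip pairs that share the same time frame
                table[(f1, f2, dt)].append(((t1, f1), (t2, f2)))
    return table


def merge_fingerprints(fp_a: List[Peak], fp_b: List[Peak],
                       max_gap: int = 20) -> List[Peak]:
    """Keep the peaks of fp_a that belong to a pair whose time interval
    (and, by assumption, frequency points) also occurs in fp_b."""
    keys_a = interval_keys(fp_a, max_gap)
    keys_b = interval_keys(fp_b, max_gap)
    kept = set()
    for key in keys_a.keys() & keys_b.keys():
        for first, second in keys_a[key]:
            kept.add(first)
            kept.add(second)
    return sorted(kept)
```

Because only the interval between peaks is compared, two utterances of the same keyword can start at different absolute times and still agree, which fits the case in claims 4 and 9 where the user speaks the same keyword at different times.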

Description

BACKGROUND OF THE INVENTION

1. FIELD OF THE INVENTION

The present invention relates to a speech recognition system.

2. DESCRIPTION OF THE PRIOR ART

Due to the uniqueness of each individual's voice, many electronic devices in recent years have used voiceprints to identify users. However, the uniqueness of a voice depends not only on differences in the user's vocal structure but also on the user's age and health. Therefore, traditional speech recognition devices typically first perform loudness normalization and Voice Activity Detection (VAD) on the received audio signal, then detect the segments that include speech and perform feature extraction to improve the accuracy of speech recognition. However, the aforementioned loudness normalization and VAD processes increase the design and manufacturing costs of speech recognition devices.

SUMMARY OF THE INVENTION

Therefore, one of the objectives of the present invention is to provide a speech recognition system that can accurately perform speech recognition without the need for loudness normalization and/or VAD operations, in order to solve the problems described in the prior art.

According to one embodiment of the present invention, a voice recognition system comprising a processing circuit and a memory is disclosed.
The processing circuit is configured to perform steps of: receiving a plurality of voice signals; performing a time-domain to frequency-domain conversion operation on the plurality of voice signals to generate a plurality of frequency-domain signals; using a morphological filter to perform filtering operations on the plurality of frequency-domain signals to generate a plurality of filtered backgrounds; generating a plurality of initial voice fingerprints according to the plurality of filtered backgrounds, wherein each of the plurality of initial voice fingerprints comprises times and frequency points corresponding to a plurality of peaks in the corresponding filtered background; generating at least one voice fingerprint according to the plurality of initial voice fingerprints; and storing the at least one voice fingerprint in the memory for subsequent voice recognition operation.

According to one embodiment of the present invention, a voice recognition method comprises the steps of: receiving a plurality of voice signals; performing a time-domain to frequency-domain conversion operation on the plurality of voice signals to generate a plurality of frequency-domain signals; using a morphological filter to perform filtering operations on the plurality of frequency-domain signals to generate a plurality of filtered backgrounds; generating a plurality of initial voice fingerprints according to the plurality of filtered backgrounds, wherein each of the plurality of initial voice fingerprints comprises times and frequency points corresponding to a plurality of peaks in the corresponding filtered background; generating at least one voice fingerprint according to the plurality of initial voice fingerprints; and storing the at least one voice fingerprint in a memory for subsequent voice recognition operation.
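
The first steps summarized above (time-domain to frequency-domain conversion, morphological filtering, and peak extraction) can be sketched in Python with NumPy as follows. This is a rough illustration under stated assumptions, not the disclosed implementation. The function names and the parameters `frame_len`, `hop`, `size`, and `rel_threshold` are hypothetical. The text up to this point does not state how the dilation and erosion results are combined into a filtered background, so the sketch assumes their difference (the morphological gradient) as a stand-in.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def spectrogram(signal, frame_len=256, hop=128):
    """Time-domain to frequency-domain conversion: magnitude of a
    Hann-windowed short-time FFT, shape (frames, frequency bins)."""
    samples = np.asarray(signal, dtype=float)
    frames = sliding_window_view(samples, frame_len)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))


def _neighborhood(spec, size, reduce_fn, pad_value):
    """Apply reduce_fn over a size x size window centered on every
    time-frequency point. size must be odd so the shape is preserved."""
    pad = size // 2
    padded = np.pad(spec, pad, mode="constant", constant_values=pad_value)
    return reduce_fn(sliding_window_view(padded, (size, size)), axis=(2, 3))


def initial_fingerprint(signal, size=5, rel_threshold=0.5):
    """Return a list of (time frame, frequency bin) peaks."""
    spec = spectrogram(signal)
    dilation = _neighborhood(spec, size, np.max, -np.inf)  # local maximum
    erosion = _neighborhood(spec, size, np.min, np.inf)    # local minimum
    # ASSUMPTION: the filtered background is taken as dilation - erosion.
    contrast = dilation - erosion
    # A peak is a point that equals its local maximum and stands out
    # from its surroundings by more than a fraction of the largest contrast.
    mask = (spec == dilation) & (contrast > rel_threshold * contrast.max())
    times, bins = np.nonzero(mask)
    return [(int(t), int(f)) for t, f in zip(times, bins)]
```

With these hypothetical defaults, a pure 1 kHz tone sampled at 8 kHz falls in frequency bin 32 (1000 × 256 / 8000), so every reported peak lies in that bin, while an all-zero input yields no peaks because the contrast test is strict.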
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the process of establishing a voiceprint in the speech recognition system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the filtering operation of a processed frequency-domain signal using a morphological filter to generate a filtered background.
FIG. 4 is a schematic diagram showing the dilation background generated by the morphological filter.
FIG. 5 is a schematic diagram showing the erosion background generated by the morphological filter.
FIG. 6 is a schematic diagram showing how the morphological filter generates a filtered background based on the dilation background and erosion background.
FIG. 7 is a schematic diagram showing the generation of multiple voiceprints based on multiple initial voiceprints.
FIG. 8 is a flowchart showing the process of speech recognition in the speech recognition system according to an embodiment of the present invention.
FIG. 9 is a schematic diagram showing the process of determining whether the voiceprint to be recognized matches a voiceprint stored in memory according to an embodiment of the present invention.

DETAILED DESCRIPTION

FIG. 1 is a schematic diagram of a speech recognition system 100 according to an embodiment of the present invention. As shown in FIG. 1, the speech recognition system 100 includes a radio device 110, an audio interface 120, a processing circuit 130, and a memory 140. In this embodiment, the speech recognition system 100