US-12625936-B2 - System and method for highly accurate voice-based biometric authentication


Abstract

The present disclosure provides a system and a method for voice-based authentication, which involves receiving voice data of a user enunciating a predetermined sequence of speech elements, wherein the predetermined sequence of speech elements is configured to capture a defined range of voiced sounds produced by the user; extracting voice features from the received voice data; deriving a voice signature for the user based on the extracted voice features, wherein the voice signature is representative of at least one of tonal, timbral, and temporal characteristics of the user's voiced speech; storing the derived voice signature in a database; receiving a verification voice sample of the user enunciating a predetermined sub-set of speech elements; comparing the verification voice sample with the stored voice signature; and authenticating the user based on the comparison.
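For orientation, the enroll/verify flow the abstract describes can be sketched minimally as follows. The class and method names and the cosine-similarity comparison are illustrative assumptions, not details from the patent; the claims below recite a GMM-based comparison instead (sketched after the claims).

```python
# Illustrative sketch of the abstract's enroll/verify flow. All names and
# the cosine-similarity comparison are assumptions, not from the patent.
import numpy as np

class VoiceAuthenticator:
    def __init__(self, threshold: float = 0.8):
        self.signatures = {}          # stands in for the signature database
        self.threshold = threshold    # predetermined match threshold

    def enroll(self, user_id: str, features: np.ndarray) -> None:
        # Derive and store a voice signature from features extracted while
        # the user enunciates the full predetermined sequence.
        self.signatures[user_id] = features.mean(axis=0)

    def verify(self, user_id: str, features: np.ndarray) -> bool:
        # Compare features from the verification sub-set against the stored
        # signature and authenticate based on the comparison.
        sig = self.signatures[user_id]
        probe = features.mean(axis=0)
        score = sig @ probe / (np.linalg.norm(sig) * np.linalg.norm(probe))
        return score >= self.threshold
```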

Inventors

  • Anurag Goel
  • Sairam Sankaranarayanan

Assignees

  • TURANT INC.

Dates

Publication Date
2026-05-12
Application Date
2024-06-20

Claims (16)

  1. A voice authentication system comprising a server comprising one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the server to: receive voice data of a user enunciating a predetermined sequence of speech elements, wherein the predetermined sequence of speech elements is configured to capture a defined range of voiced sounds produced by the user; pre-emphasize the received voice data; frame the pre-emphasized voice data into overlapping time frames; apply a window function to each time frame to reduce boundary discontinuities; perform a Fourier transform on each windowed time frame to convert to a frequency domain representation; apply a set of band-pass filters modelling human auditory perception to the frequency domain representation; perform a decorrelation transform to derive Mel-Frequency Cepstral Coefficients (MFCCs), wherein the MFCCs constitute extracted voice features; derive a voice signature for the user by performing statistical modelling of the extracted voice features using a Gaussian Mixture Model (GMM), wherein the voice signature is representative of at least one of tonal, timbral, and temporal characteristics of the user's voiced speech; store the derived voice signature in a database; receive a verification voice sample of the user enunciating a predetermined sub-set of speech elements; process the verification voice sample using the same pre-emphasis, framing, windowing, Fourier transform, band-pass filtering, decorrelation transform, and MFCC extraction steps to extract verification voice features; compare the extracted verification voice features with the stored voice signature; and authenticate the user based on the comparison.
  2. The voice authentication system according to claim 1, wherein the predetermined sequence of speech elements comprises a series of spoken numerals.
  3. The voice authentication system according to claim 2, wherein the series of spoken numerals includes numerals from 0 to 9.
  4. The voice authentication system according to claim 1, wherein the instructions for comparing the verification voice sample cause the server to: extract voice features from the verification voice sample; compare the extracted voice features against the stored voice signature and a universal background model representing average voice characteristics; and determine a match score based on the comparison.
  5. The voice authentication system according to claim 4, wherein the instructions further cause the server to authenticate the user when the match score exceeds a predetermined threshold.
  6. The voice authentication system according to claim 1, wherein the instructions further cause the server to: measure a response timing of the user enunciating the predetermined sub-set of speech elements; determine whether the response timing exceeds a predetermined threshold; re-prompt the user to enunciate the predetermined sub-set of speech elements again if the response timing exceeds the predetermined threshold; and fail the authentication of the user if the response timing exceeds the predetermined threshold after a predefined number of re-prompts.
  7. The voice authentication system according to claim 1, wherein the system is configured to provide voice authentication services to third-party systems via an application programming interface (API).
  8. The voice authentication system according to claim 1, wherein the memory further stores instructions that, when executed by the one or more processors, cause the server to implement a microservices-based architecture for the voice authentication system, the microservices-based architecture comprising a plurality of independent modules for performing different tasks in the voice authentication process.
  9. A method for authenticating a user by voice in a voice authentication system, the method comprising: receiving, by a server comprising one or more processors, voice data of a user enunciating a predetermined sequence of speech elements, wherein the predetermined sequence of speech elements is configured to capture a defined range of voiced sounds produced by the user; pre-emphasizing, by the server, the received voice data; framing, by the server, the pre-emphasized voice data into overlapping time frames; applying, by the server, a window function to each time frame to reduce boundary discontinuities; performing, by the server, a Fourier transform on each windowed time frame to convert to a frequency domain representation; applying, by the server, a set of band-pass filters modelling human auditory perception to the frequency domain representation; performing, by the server, a decorrelation transform to derive Mel-Frequency Cepstral Coefficients (MFCCs), wherein the MFCCs constitute extracted voice features; deriving, by the server, a voice signature for the user by performing statistical modelling of the extracted voice features using a Gaussian Mixture Model (GMM), wherein the voice signature is representative of at least one of tonal, timbral, and temporal characteristics of the user's voiced speech; storing, by the server, the derived voice signature in a database; receiving, by the server, a verification voice sample of the user enunciating a predetermined sub-set of speech elements; processing, by the server, the verification voice sample using the same pre-emphasis, framing, windowing, Fourier transform, band-pass filtering, decorrelation transform, and MFCC extraction steps to extract verification voice features; comparing, by the server, the extracted verification voice features with the stored voice signature; and authenticating, by the server, the user based on the comparison.
  10. The method according to claim 9, wherein the predetermined sequence of speech elements comprises a series of spoken numerals.
  11. The method according to claim 10, wherein the series of spoken numerals includes numerals from 0 to 9.
  12. The method according to claim 9, further comprising: extracting, by the server, voice features from the verification voice sample; comparing, by the server, the extracted voice features against the stored voice signature and a universal background model representing average voice characteristics; and determining, by the server, a match score based on the comparison.
  13. The method according to claim 12, further comprising authenticating, by the server, the user when the match score exceeds a predetermined threshold.
  14. The method according to claim 9, further comprising: measuring, by the server, a response timing of the user enunciating the predetermined sub-set of speech elements; determining, by the server, whether the response timing exceeds a predetermined threshold; re-prompting, by the server, the user to enunciate the predetermined sub-set of speech elements again if the response timing exceeds the predetermined threshold; and failing, by the server, the authentication of the user if the response timing exceeds the predetermined threshold after a predefined number of re-prompts.
  15. The method according to claim 9, further comprising providing, by the server, voice authentication services to third-party systems via an application programming interface (API).
  16. The method according to claim 9, further comprising implementing, by the server, a microservices-based architecture for the voice authentication system, the microservices-based architecture comprising a plurality of independent modules for performing different tasks in the voice authentication process.
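Claims 1 and 9 recite a standard MFCC front end: pre-emphasis, overlapping frames, a window function, a Fourier transform, mel-scale band-pass filtering, and a decorrelation transform. A minimal NumPy/SciPy sketch follows; the parameter values (0.97 pre-emphasis, 25 ms frames with a 10 ms hop, a Hamming window, 26 mel filters, 13 coefficients) are common defaults assumed for illustration, not values taken from the patent.

```python
# Minimal sketch of the MFCC front end in claims 1 and 9. Parameter
# values are common defaults, assumed here, not taken from the patent.
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular band-pass filters spaced evenly on the mel scale,
    # modelling human auditory perception.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, n_filters=26, n_coeffs=13):
    # Assumes `signal` is a 1-D float array at least one frame long.
    # 1. Pre-emphasize to boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Frame into overlapping 25 ms windows with a 10 ms hop.
    flen, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + max(0, (len(emphasized) - flen) // hop)
    idx = np.arange(flen)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3. Window each frame to reduce boundary discontinuities.
    frames *= np.hamming(flen)
    # 4. Fourier transform -> power spectrum (frequency domain).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. Mel band-pass filtering, then log compression.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 6. DCT as the decorrelation transform, keeping the first coefficients.
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```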
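Claims 1, 4, 5, 12, and 13 recite deriving the voice signature by GMM modelling and scoring verification features against both the stored signature and a universal background model. A minimal sketch using scikit-learn's `GaussianMixture`; the component count, the zero threshold, and the use of an average log-likelihood ratio are conventional GMM-UBM practice assumed for illustration, not details from the patent.

```python
# Minimal GMM-UBM sketch for claims 1, 4, and 5. Component count,
# threshold, and log-likelihood-ratio scoring are assumed conventions.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(mfccs: np.ndarray, n_components: int = 16) -> GaussianMixture:
    # Statistically model the user's MFCC frames; the fitted GMM serves
    # as the stored voice signature.
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    return gmm.fit(mfccs)

def match_score(mfccs, speaker_gmm, ubm) -> float:
    # Average per-frame log-likelihood ratio between the claimed speaker's
    # model and a universal background model of average voice characteristics.
    return float(np.mean(speaker_gmm.score_samples(mfccs)
                         - ubm.score_samples(mfccs)))

def authenticate(mfccs, speaker_gmm, ubm, threshold: float = 0.0) -> bool:
    # Accept when the match score exceeds the predetermined threshold.
    return match_score(mfccs, speaker_gmm, ubm) > threshold
```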
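Claims 6 and 14 add a response-timing check: the user is re-prompted when the response is too slow, and authentication fails after a predefined number of re-prompts. A minimal sketch, where `prompt_user` is a hypothetical callback that prompts for the sub-set of speech elements and returns the measured response time in seconds:

```python
# Minimal sketch of the timing check in claims 6 and 14. `prompt_user`
# is a hypothetical callback, assumed for illustration.
def check_response_timing(prompt_user, timeout_s: float = 5.0,
                          max_reprompts: int = 2) -> bool:
    for attempt in range(max_reprompts + 1):
        response_time = prompt_user()  # measure timing of the enunciation
        if response_time <= timeout_s:
            return True   # timing acceptable; proceed to voice comparison
    return False          # fail authentication after exhausting re-prompts
```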

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority under 35 U.S.C. § 119(e) based on U.S. Provisional Patent Application No. 63/628,176, filed on Jun. 29, 2023, and entitled "System and method for highly accurate voice-based biometric authentication", which is hereby incorporated herein by reference in its entirety.

FIELD OF INVENTION

The present disclosure relates generally to the field of biometric authentication. In particular, the present disclosure pertains to a system and a method for voice-based biometric authentication to facilitate secure and efficient identification and verification of individuals based on their unique voice signatures. More specifically, the present disclosure pertains to language-independent, number-based voice biometric authentication.

BACKGROUND

In an increasingly interconnected world, there is a need for effective and reliable methods of user authentication. Traditional forms of authentication, such as password-based or PIN-based methods, have been widely adopted, but their weaknesses are well known. They are susceptible to breaches due to their static nature and their reliance on the user's memory, making them prone to being forgotten, guessed, or obtained via phishing attempts or brute-force attacks.

Addressing these weaknesses, various forms of biometric authentication have been introduced. Biometrics refer to the physiological or behavioral attributes of a person that can be measured and used for identification and authentication. These include fingerprints, iris patterns, facial features, gait, and voice. While biometric authentication offers enhanced security compared to traditional methods, the practical implementation of many such systems is hampered by the need for specialized hardware, user discomfort, or privacy concerns.

Among the various biometric modalities, voice-based authentication, also known as voice biometrics, is increasingly recognized for its potential to provide secure and user-friendly identity verification across digital platforms. Voice biometrics capitalizes on the uniqueness of an individual's voice. This uniqueness arises from the individual's physical characteristics, such as the shape and size of the throat and mouth, and behavioral aspects, such as accent, speed of speech, and emphasis on certain syllables. With the proliferation of smart devices and voice-interactive systems, voice biometrics has the opportunity to become an integral part of security protocols in sectors ranging from telecommunications to national defense. In general, voice-based authentication provides a dynamic, multifactor authentication mechanism that can significantly increase the security of a system across various digital platforms.

Despite its advantages, conventional voice authentication technology faces significant challenges. Current voice authentication systems typically use a range of speech elements and complex algorithms to improve accuracy and reliability. These systems may apply techniques like noise filtering, voice activity detection, and dynamic feature extraction to enhance performance under varied conditions. Most rely on a combination of hardware and software to preprocess and analyze the voice data, using statistical models like Gaussian Mixture Models (GMMs) to compare current voice samples with previously stored voice signatures.
Such conventional techniques for voice authentication, which involve free-flow speech, are limited in their accuracy, often not exceeding 92%. Further, the dependency on specific linguistic content and the need for continuous calibration against background models limit their applicability across different languages and dialects. Additionally, the complex preprocessing and feature extraction processes require substantial computational resources, which can hinder the scalability and efficiency of these systems, especially in resource-constrained environments.

In light of these challenges, there exists a need for an improved voice authentication system that can provide high accuracy and reliability while catering to the inherent variability of human speech and environmental factors. Such a system should be capable of functioning effectively across multiple languages without the need for linguistic calibration, and it should simplify the authentication process to facilitate wider adoption in commercial and security-sensitive applications. The present disclosure aims to provide systems and methods for voice data processing and authentication that address the limitations of existing technologies.

SUMMARY

In an aspect of the present disclosure, a voice authentication system is provided. The voice authentication system comprises a server including one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the server to: receive voice data of a user enunciating a predetermined sequence of speech elements, wherein the predetermined sequence of speech elements is configured to capture a defined range of voiced sounds produced by the user.