US-12620321-B1 - Systems and methods for automated fine-grained speech scoring

Abstract

A speech in response to a prompt is accessed. The speech is provided to a speech recognition module that is configured to generate a text transcript of the speech. Speech features are extracted from the speech. Similarly, text features are extracted from the text transcript. Both speech features and text features are vector representations of the speech. The two features are concatenated into one vector representation that captures both perceptual and linguistic components of the speech. The concatenated vector is provided to a speech scoring model. The speech scoring model simultaneously provides a holistic score as well as fine-grained scores to the speech based on the concatenated features.
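
The concatenation and simultaneous scoring steps summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the trait keys, the tiny two-dimensional feature vectors, and all weights and biases are hypothetical, and each score is modeled as a plain linear regression over the combined vector, which is only one way such a scoring model could be realized.

```python
from typing import Dict, List, Sequence, Tuple

# A head is a (weights, bias) pair; one head per score (assumed layout).
Head = Tuple[Sequence[float], float]


def concatenate(speech_features: Sequence[float],
                text_features: Sequence[float]) -> List[float]:
    """Join the speech and text vectors end to end into one combined vector."""
    return list(speech_features) + list(text_features)


def linear_head(weights: Sequence[float], bias: float,
                features: Sequence[float]) -> float:
    """One regression layer: weighted sum of the features plus a bias."""
    if len(weights) != len(features):
        raise ValueError("weight and feature dimensions must match")
    return sum(w * x for w, x in zip(weights, features)) + bias


def score_speech(features: Sequence[float],
                 heads: Dict[str, Head]) -> Dict[str, float]:
    """Apply every head to the same combined vector, yielding all scores at once."""
    return {trait: linear_head(w, b, features) for trait, (w, b) in heads.items()}


# Hypothetical feature vectors standing in for real model outputs.
speech_features = [0.5, 1.0]
text_features = [2.0, 0.0]
combined = concatenate(speech_features, text_features)

# Hypothetical, hand-picked parameters; a trained model would learn these.
heads: Dict[str, Head] = {
    "holistic": ([1, 1, 1, 1], 0.0),
    "delivery": ([2, 0, 0, 0], 1.0),
    "language_use": ([0, 0, 1, 0], 0.5),
    "topic_development": ([0, 1, 0, 1], 0.0),
}
scores = score_speech(combined, heads)
print(scores)
```

All four heads read the same combined vector, so the holistic score and the fine-grained scores come out of a single pass rather than from separate models.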

Inventors

Seongjin Park
Rutuja Ubale

Assignees

EDUCATIONAL TESTING SERVICE

Dates

Publication Date: 2026-05-05
Application Date: 2024-08-13

Claims (20)

1. A computer-implemented method comprising: accessing a speech in response to a prompt; extracting, with a speech representation module that comprises a first transformer model, speech features from the speech as vector representations; providing the speech to a speech recognition module; generating, with the speech recognition module, a text transcript of the speech; extracting, with a text representation module that comprises a second transformer model, text features from the text transcript as vector representations; concatenating, with a concatenation module, the speech features and the text features into a combined vector representation; and simultaneously providing, with a speech scoring module, a holistic score and fine-grained scores as scalar values based on the concatenated features.
2. The computer-implemented method of claim 1, wherein the fine-grained scores comprise delivery score, language use score, and topic development score.
3. The computer-implemented method of claim 1, wherein the speech representation module is implemented on a self-supervised transformer model.
4. The computer-implemented method of claim 3, wherein the speech features are extracted from the speech representation module by applying global pooling to a last hidden layer.
5. The computer-implemented method of claim 1, wherein the speech is transcribed into the text transcript, which comprises the speech's content, word-level and phone-level features, hesitation markers, and repetitions.
6. The computer-implemented method of claim 1, wherein the speech recognition module is trained on both native and non-native speech.
7. The computer-implemented method of claim 1, wherein the text representation module is implemented on a bidirectional transformer model.
8. The computer-implemented method of claim 7, wherein the text features are extracted from a final hidden layer of the text representation module.
9. The computer-implemented method of claim 7, wherein text of the prompt is provided to the text representation module.
10. The computer-implemented method of claim 9, wherein a vector representation of the prompt is extracted from the text representation module.
11. The computer-implemented method of claim 10, wherein the vector representation of the prompt is concatenated with the text features.
12. The computer-implemented method of claim 1, wherein the speech scoring model comprises four regression layers.
13. The computer-implemented method of claim 12, wherein each regression layer predicts a specific score based on the concatenated features.
14. A system comprising: one or more data processors; and a computer-readable medium encoded with instructions for commanding the one or more data processors to execute steps of a process, the steps comprising: accessing a speech in response to a prompt; extracting, with a speech representation module that comprises a first transformer model, speech features from the speech; providing the speech to a speech recognition module; generating, with the speech recognition module, a text transcript of the speech; extracting, with a text representation module that comprises a second transformer model, text features from the text transcript; concatenating, with a concatenation module, the speech features and the text features; and simultaneously providing, with a speech scoring module, a holistic score and fine-grained scores based on the concatenated features.
15. A computer-implemented method comprising: accessing a speech in response to a prompt; extracting speech features from the speech as vector representations; generating a text transcript of the speech; extracting text features from the text transcript as vector representations; concatenating the speech features and the text features into a combined vector representation; and simultaneously providing a holistic score and fine-grained scores as scalar values based on the concatenated features, wherein the extracting steps and the scoring step are implemented by neural network language models.
16. The computer-implemented method of claim 1, wherein the holistic score comprises a singular numerical score that represents an evaluation of the entire speech, without providing insights into particular strengths and weaknesses of the speech, and wherein the fine-grained scores comprise multiple numerical scores for specific aspects of speech.
17. The computer-implemented method of claim 12, wherein each regression layer is configured to predict numerical scores along a continuous scale for various aspects of speech by applying a linear transformation to the corresponding concatenated features and assigning weights to particular concatenated features.
18. The computer-implemented method of claim 7, wherein the bidirectional transformer model is configured to capture the meanings and contextual relationships of words, phrases, and sentences of the text transcript in both the forward and backward directions.
19. The computer-implemented method of claim 1, wherein the text transcript is tokenized into vector representations and passed through multiple layers of the text representation module to refine the vector representations based on additional context and textual relationships.
20. The computer-implemented method of claim 19, wherein a classification token is added at the beginning of the tokenized text transcript and continuously updated as the text transcript is passed through the layers of the text representation module to reflect the additional information captured at each layer.
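
Claims 4, 11, and 20 describe how the feature vectors are formed before scoring. The sketch below illustrates those steps under stated assumptions: the claims say only "global pooling", so mean pooling is assumed here; the literal "[CLS]" string follows a common convention for classification tokens and is not named in the claims; using the classification token's final-layer vector as the text feature is also an assumption; and every function name and numeric value is hypothetical.

```python
from typing import List, Sequence


def global_mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors over time into one fixed-size vector.

    Mean pooling is an assumption; the claims do not fix the pooling type.
    """
    if not hidden_states:
        raise ValueError("need at least one frame")
    n_frames = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(frame[d] for frame in hidden_states) / n_frames
            for d in range(dim)]


def prepend_classification_token(tokens: Sequence[str],
                                 cls_token: str = "[CLS]") -> List[str]:
    """Add a classification token at the start of the tokenized transcript."""
    return [cls_token, *tokens]


def classification_vector(final_layer: Sequence[Sequence[float]]) -> List[float]:
    """Take the final-layer vector at position 0, where the token was prepended."""
    return list(final_layer[0])


# Hypothetical last hidden layer of a speech model: 3 frames, 2 dimensions.
speech_last_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
speech_features = global_mean_pool(speech_last_hidden)

# Hypothetical tokenized transcript and final-layer outputs of a text model.
tokens = prepend_classification_token(["i", "think"])
text_final_layer = [[0.25, 0.75], [0.1, 0.2], [0.3, 0.4]]
text_features = classification_vector(text_final_layer)

# Hypothetical prompt vector, concatenated with the text features (claim 11).
prompt_vector = [1.0]
text_with_prompt = text_features + prompt_vector

# Combined vector handed to the scoring heads.
combined = speech_features + text_with_prompt
print(tokens, combined)
```

Pooling turns a variable number of audio frames into a vector of fixed length, which is what lets it be concatenated with the text and prompt vectors regardless of how long the speech is.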

Description

RELATED APPLICATION

This application claims priority to U.S. Provisional Application No. 63/519,633, filed on Aug. 15, 2023, the entirety of which is incorporated by reference herein.

TECHNICAL FIELD

The subject matter described herein relates to speech scoring, and more particularly to automated speech scoring.

BACKGROUND

Effective communication in any language can be facilitated through proficient speaking skills. Proficient speaking skills allow the speaker to clearly and accurately articulate their ideas. One way to improve proficiency in speaking skills is through detailed feedback. Spoken communication comprises key traits, such as delivery, language use, and topic development, that each contribute to the overall quality of the spoken communication. These traits can be individually evaluated to provide granular feedback to the speaker. Such granular feedback helps the speaker identify their strengths and weaknesses so that they can focus on specific areas of improvement. Therefore, a speech scoring model that can provide fine-grained speech scores identifying specific areas of strength and weakness can help a speaker improve their speaking skills.

SUMMARY

A speech in response to a prompt is accessed. The speech is provided to a speech recognition module that is configured to generate a text transcript of the speech. Speech features are extracted from the speech. Similarly, text features are extracted from the text transcript. Both speech features and text features are vector representations of the speech. The two features are concatenated into one vector representation that captures both perceptual and linguistic components of the speech. The concatenated vector is provided to a speech scoring model. The speech scoring model simultaneously provides a holistic score as well as fine-grained scores to the speech based on the concatenated features.

DESCRIPTION OF DRAWINGS

FIG. 1 illustrates an exemplary system for fine-grained speech scoring.

FIG. 2 illustrates further details of an exemplary speech scoring model that is configured to evaluate a speech in order to provide a holistic score, as well as a delivery score, a language use score, and a topic development score.

FIG. 3 illustrates further details of an exemplary feature extractor that is configured to extract both speech and text features from a speech.

FIG. 4 illustrates an exemplary system for training a speech recognition module using both native and non-native speech.

FIG. 5 illustrates an exemplary system for training a speech scoring module to evaluate a speech and simultaneously provide multiple scores.

FIG. 6 is an exemplary process flow diagram for automated fine-grained speech scoring.

FIGS. 7A-7C depict example systems for implementing the approaches described herein for automated fine-grained speech scoring.

DETAILED DESCRIPTION

Receiving feedback on their speech is one of the primary ways in which a speaker can improve their speaking skills. It is especially helpful when the feedback is granular, rather than when it is a singular, holistic score that represents an assessment of the entire speech. Granular feedback is more helpful than a holistic score because it helps the speaker identify specific areas of strength and weakness. Speech comprises a plurality of distinct traits that contribute to the overall quality of the speech, so it is helpful to identify with particularity which traits the speaker should focus on improving. For example, these traits may include both perceptual and linguistic components such as delivery, language use, and topic development.

Especially in the context of learning a new language, receiving feedback on multiple traits can be particularly helpful because the language learner can focus on specific aspects to improve.
For example, if the granular feedback indicates that the language learner's speech is lacking proper pronunciation, the language learner can focus on improving their pronunciation instead of focusing on other aspects in which they are already proficient.

Computer-implemented systems and methods as described herein are directed to automated fine-grained speech scoring. In embodiments, the systems and methods herein are configured to evaluate a speech in order to provide a holistic score, as well as fine-grained scores across key traits such as delivery, language use, and topic development. Systems and methods herein take advantage of both text and speech features extracted from a speech, so that the holistic evaluation and fine-grained scores take into consideration both the perceptual and linguistic components of the speech.

FIG. 1 shows an exemplary system for fine-grained speech scoring. Speech 100 is a verbal response to a prompt. For example, speech 100 may be a language learner's verbal answer to a question on a language assessment test. In another example, speech 100 may be an oral presentation on a given topic. In embodiments, speech 100 may be a raw audio recording of the verbal response in