EP-4742237-A1 - TEXT CONVERSION METHOD FOR VOICE DATA, INFORMATION PROCESSING DEVICE, AND NON-TRANSITORY STORAGE MEDIUM

Abstract

A text conversion method for voice data that is executed by an information processing device includes: acquiring voice data; detecting a specific expression in the voice data and feature information relevant to vocalization of the specific expression; converting the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and outputting text information relevant to the voice data.

Inventors

  • MORISHITA, HIROFUMI

Assignees

  • TOYOTA JIDOSHA KABUSHIKI KAISHA

Dates

Publication Date
2026-05-13
Application Date
2025-10-29

Claims (7)

  1. A text conversion method for voice data that is executed by an information processing device, the text conversion method comprising: acquiring voice data; detecting a specific expression in the voice data and feature information relevant to vocalization of the specific expression; converting the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and outputting text information relevant to the voice data.
  2. The text conversion method according to claim 1, further comprising converting the specific expression into the corresponding standard language, based on a pair of the detected specific expression and the detected feature information and a conversion rule between non-standard language and standard language.
  3. The text conversion method according to claim 1, wherein the feature information relevant to the vocalization is voice tone information.
  4. The text conversion method according to claim 1, wherein the specific expression includes dialect and slang.
  5. The text conversion method according to claim 1, wherein: the voice data is voice in a business talk relevant to a predetermined provision object; and the text conversion method includes specifying regionality information corresponding to a speaking person, from the voice data, and presenting a suggestion relevant to the predetermined provision object, based on the regionality information.
  6. An information processing device (10) comprising one or more processors configured to: acquire voice data; detect a specific expression in the voice data and feature information relevant to vocalization of the specific expression; convert the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and output text information relevant to the voice data.
  7. A non-transitory storage medium storing instructions that are executable by one or more processors and that cause the one or more processors to perform functions comprising: acquiring voice data; detecting a specific expression in the voice data and feature information relevant to vocalization of the specific expression; converting the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and outputting text information relevant to the voice data.
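The claimed pipeline of claims 1 to 4 (detect a specific expression together with a vocalization feature such as voice tone, then convert it via a rule between non-standard and standard language) can be illustrated with a minimal sketch. Everything here is a hypothetical illustration: the rule table, the tone labels, and the example dialect word are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the claimed method (claims 1-4).
# A conversion rule maps a (specific expression, vocalization feature)
# pair to standard language; the tone label disambiguates expressions
# whose meaning depends on how they are spoken.

CONVERSION_RULES = {
    # (detected expression, tone feature) -> standard-language replacement
    ("okini", "falling"): "thank you",   # illustrative dialect sense
    ("okini", "rising"): "especially",   # same surface form, other sense
}

def detect(voice_tokens):
    """Detect specific expressions and their vocalization features.

    Here voice_tokens stands in for pre-recognized (word, tone) pairs;
    a real system would derive the tone feature from pitch/prosody in
    the acquired voice data.
    """
    known = {expr for (expr, _tone) in CONVERSION_RULES}
    return [(w, t) for (w, t) in voice_tokens if w in known]

def convert(voice_tokens):
    """Replace each detected expression with its standard language."""
    out = []
    for word, tone in voice_tokens:
        out.append(CONVERSION_RULES.get((word, tone), word))
    return " ".join(out)

print(convert([("okini", "falling"), ("goodbye", "flat")]))
# → "thank you goodbye"
```

Keying the rule table on the pair (expression, feature), rather than on the expression alone, mirrors claim 2's "pair of the detected specific expression and the detected feature information" and lets one surface form map to different standard-language renderings.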

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present disclosure relates to a text conversion method for voice data, an information processing device, and a non-transitory storage medium.

2. Description of Related Art

A technology for analyzing the content of a business talk is known. For example, Japanese Unexamined Patent Application Publication No. 2019-28910 (JP 2019-28910 A) discloses a dialogue analysis system for checking that a sales person has explained the matters that should be explained, and has not said the matters that should not be said, in a business talk with a customer. Further, "Toyama Dialect Recognition and Conversion to Standard Japanese via Deep Learning" by Horimoto et al. (The 38th Annual Conference of the Japanese Society for Artificial Intelligence (2024)) discloses a voice recognition technology for the Toyama dialect.

SUMMARY OF THE INVENTION

JP 2019-28910 A shows a technology for analyzing the content of a business talk by machine learning, but neither JP 2019-28910 A nor Horimoto et al. mentions the transcription of the voice in a business talk or the like, that is, a text conversion technology for voice data. In particular, there is room for improvement in transcription of voice data that includes non-standard language, such as dialects and accents. Meanwhile, improving the text conversion technology for voice data is desirable for the analysis of, and feedback on, the content of business talks and the like. The present disclosure provides such a text conversion technology for voice data.
A text conversion method for voice data that is executed by an information processing device according to a first aspect of the present disclosure includes: acquiring voice data; detecting a specific expression in the voice data and feature information relevant to vocalization of the specific expression; converting the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and outputting text information relevant to the voice data.

An information processing device according to a second aspect of the present disclosure includes one or more processors configured to: acquire voice data; detect a specific expression in the voice data and feature information relevant to vocalization of the specific expression; convert the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and output text information relevant to the voice data.

A non-transitory storage medium according to a third aspect of the present disclosure stores instructions that are executable by one or more processors and that cause the one or more processors to perform functions including: acquiring voice data; detecting a specific expression in the voice data and feature information relevant to vocalization of the specific expression; converting the specific expression into corresponding standard language, based on the detected specific expression and the detected feature information; and outputting text information relevant to the voice data.

With an embodiment of the present disclosure, the text conversion technology for voice data is improved.

BRIEF DESCRIPTION OF THE DRAWINGS

Features, advantages, and technical and industrial significance of exemplary embodiments of the invention will be described below with reference to the accompanying drawings, in which like signs denote like elements, and wherein:

FIG. 1 is a block diagram showing a schematic configuration of a system according to the embodiment; and
FIG. 2 is a flowchart showing the operation of an information processing device.

DETAILED DESCRIPTION OF EMBODIMENTS

An embodiment of the present disclosure will be described below.

Overview of Embodiment

The overview and configuration of a system 1 according to the embodiment will be described with reference to FIG. 1. The system 1 according to the embodiment includes an information processing device 10 and a terminal device 20. The information processing device 10 and the terminal device 20 are communicably connected to a network 30 including a mobile body communication network and the internet, for example. The information processing device 10 is a server device that is installed in a data center, for example. For example, the information processing device 10 is a server that belongs to a cloud computing system or another computing system. The number of information processing devices 10 included in the system 1 is one in the example shown in FIG. 1, but is not limited to this. The system 1 may include two or more information processing devices 10.
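Claim 5 adds specifying regionality information corresponding to the speaking person from the voice data and presenting a suggestion relevant to the predetermined provision object. The following sketch illustrates that step; the expression-to-region table and the suggestion catalogue are invented for illustration and do not appear in the patent.

```python
# Hypothetical sketch of claim 5: infer the speaker's region from the
# dialect expressions detected in the voice data, then pick a
# region-appropriate suggestion for the provision object.

EXPRESSION_REGION = {"okini": "Kansai", "dara": "Toyama"}  # illustrative
SUGGESTIONS = {                                            # illustrative
    "Kansai": "compact model",
    "Toyama": "all-wheel-drive model",
}

def suggest(detected_expressions, default="standard model"):
    """Return a suggestion based on the first expression whose region
    is known; fall back to a default when no regionality is found."""
    for expr in detected_expressions:
        region = EXPRESSION_REGION.get(expr)
        if region is not None:
            return SUGGESTIONS.get(region, default)
    return default
```

In other words, the dialect detection that the method already performs for text conversion doubles as a regionality signal, which is what lets the suggestion step reuse the same detected expressions.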