CN-116246192-B - Subtitle display method, device and equipment
Abstract
The embodiments of this specification disclose a subtitle display method, apparatus, and device. The method comprises: establishing a long connection with a first device; receiving, through the long connection, a target voice stream to be converted that is collected by the first device; performing text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream; performing correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determining, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle. With this subtitle display method, a long connection can be established with the first device, which improves resource utilization as well as the efficiency and accuracy of text display for the target voice stream.
Inventors
- CHI HAIBO
- ZHOU JIAN
- WANG HONGBIN
- HAO ZHENGPENG
Assignees
- 马上消费金融股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20211208
Claims (11)
- 1. A subtitle display method, the method comprising: establishing a long connection with a first device, and receiving, through the long connection, a target voice stream to be converted that is collected by the first device; performing text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and performing correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determining, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle; wherein determining the target text display data based on the first text data and the second text data comprises: if corrected text data does not completely contain intermediate text data, determining text display data corresponding to a target voice segment based on a pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data, wherein the first text data is determined based on the intermediate text data, the second text data is determined based on the corrected text data, the corrected text data is obtained by correcting the intermediate text data corresponding to the target voice segment in the target voice stream, the intermediate text data is obtained by performing text conversion processing on a sub-voice segment in the target voice segment, and the target voice segment is a segment of the target voice stream that contains voice data; if the corrected text data completely contains the intermediate text data, determining the intermediate text data as the text display data corresponding to the target voice segment; and determining, based on the text display data corresponding to the target voice segment, the target text display data corresponding to the target voice stream as the subtitle.
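The decision logic in claim 1 hinges on a containment check between the corrected text and the intermediate text. A minimal Python sketch of that check follows; the names `DummySemanticModel`, `merge`, and `determine_display_text` are illustrative assumptions, since the patent does not specify the semantic analysis model's interface.

```python
class DummySemanticModel:
    """Hypothetical stand-in for the pre-trained semantic analysis model."""

    def merge(self, intermediate, corrected):
        # A real model would reconcile the two texts semantically; this
        # toy version simply prefers the corrected text.
        return corrected


def determine_display_text(intermediate_text, corrected_text, model):
    """Choose the display text for one target voice segment (claim 1)."""
    if intermediate_text in corrected_text:
        # The corrected text completely contains the intermediate text,
        # so the intermediate text is used directly as display data.
        return intermediate_text
    # Otherwise the semantic analysis model reconciles the two versions.
    return model.merge(intermediate_text, corrected_text)
```

The containment test stands in for "the corrected text data completely contains the intermediate text data"; a production system might instead align the texts token by token.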
- 2. The method of claim 1, further comprising: returning the target text display data to the first device through the long connection, thereby triggering the first device to display the target text display data corresponding to the target voice stream as the subtitle.
- 3. The method of claim 1, further comprising: establishing a long connection with a second device, and sending the target text display data to the second device through the long connection established with the second device, thereby triggering the second device to display the target text display data corresponding to the target voice stream.
- 4. The method of claim 1, wherein performing text conversion processing on the target voice stream to obtain the first text data corresponding to the target voice stream comprises: performing text conversion processing on the target voice stream at a first time interval to obtain the first text data corresponding to the target voice stream.
- 5. The method of claim 4, wherein performing text conversion processing on the target voice stream at the first time interval to obtain the first text data comprises: receiving, through the long connection, target parameters sent by the first device, wherein the target parameters comprise format parameters and sampling parameters of the target voice stream; verifying the target voice stream based on the target parameters, and determining, based on the verification result, whether text conversion can be performed on the target voice stream; and if text conversion can be performed on the target voice stream, performing text conversion processing on the target voice stream at the first time interval to obtain the first text data corresponding to the target voice stream.
- 6. The method of claim 5, wherein performing text conversion processing on the target voice stream at the first time interval to obtain the first text data comprises: determining a target format conversion algorithm based on the target parameters, wherein the target format conversion algorithm is used to perform format conversion on the target voice stream; performing format conversion on the target voice stream based on the target format conversion algorithm to obtain a format-converted target voice stream; and performing text conversion processing on the format-converted target voice stream at the first time interval to obtain the first text data corresponding to the target voice stream.
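Claims 5 and 6 describe verifying the stream's format and sampling parameters and, if supported, selecting a format conversion algorithm before text conversion. The sketch below illustrates that dispatch; the parameter keys, supported sets, and converter registry are assumptions for illustration only, not taken from the patent.

```python
# Hypothetical supported parameter sets (illustrative assumptions).
SUPPORTED = {"format": {"pcm", "wav"}, "sample_rate": {8000, 16000}}


def verify_stream(params):
    """Return True if the voice stream's parameters allow text conversion."""
    return (params.get("format") in SUPPORTED["format"]
            and params.get("sample_rate") in SUPPORTED["sample_rate"])


def pick_format_converter(params):
    """Choose a target format conversion algorithm from the parameters."""
    converters = {
        "wav": lambda chunk: chunk,            # already in the target format
        "pcm": lambda chunk: b"RIFF" + chunk,  # toy stand-in for a PCM->WAV step
    }
    return converters[params["format"]]
```

A real service would validate far more (channel count, bit depth, codec) and perform genuine resampling rather than this toy byte prefix.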
- 7. The method of claim 1, wherein determining the intermediate text data comprises: receiving, through the long connection, target parameters sent by the first device, wherein the target parameters comprise format parameters and sampling parameters of the target voice stream; determining a target text conversion model based on the target parameters, wherein the target text conversion model is used to convert the target voice stream into text data; and performing text conversion processing on each sub-voice segment based on the target text conversion model to obtain the intermediate text data corresponding to the target voice segment.
- 8. The method of claim 1, wherein determining the text display data corresponding to the target voice segment based on the pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data comprises: acquiring a first voice segment in the target voice stream, wherein the first voice segment comprises a preceding target voice segment of the target voice segment and/or a subsequent target voice segment of the target voice segment; and determining the text display data corresponding to the target voice segment based on the pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data, wherein the pre-trained semantic analysis model is obtained by training, based on historical intermediate text data and historical corrected text data, a model constructed with a preset semantic analysis algorithm.
- 9. A subtitle display apparatus, the apparatus comprising: a connection establishment module configured to establish a long connection with a first device and receive, through the long connection, a target voice stream to be converted that is collected by the first device; a data conversion module configured to perform text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and to perform correction processing on the first text data to obtain second text data corresponding to the target voice stream; and a data determination module configured to determine, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle; wherein the data determination module is configured to: if corrected text data does not completely contain intermediate text data, determine text display data corresponding to a target voice segment based on a pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data, wherein the first text data is determined based on the intermediate text data, the second text data is determined based on the corrected text data, the corrected text data is obtained by correcting the intermediate text data corresponding to the target voice segment in the target voice stream, the intermediate text data is obtained by performing text conversion processing on a sub-voice segment in the target voice segment, and the target voice segment is a segment of the target voice stream that contains voice data; if the corrected text data completely contains the intermediate text data, determine the intermediate text data as the text display data corresponding to the target voice segment; and determine, based on the text display data corresponding to the target voice segment, the target text display data corresponding to the target voice stream as the subtitle.
- 10. A subtitle display device, the subtitle display device comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: establish a long connection with a first device, and receive, through the long connection, a target voice stream to be converted that is collected by the first device; perform text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and perform correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determine, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle; wherein determining the target text display data based on the first text data and the second text data comprises: if corrected text data does not completely contain intermediate text data, determining text display data corresponding to a target voice segment based on a pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data, wherein the first text data is determined based on the intermediate text data, the second text data is determined based on the corrected text data, the corrected text data is obtained by correcting the intermediate text data corresponding to the target voice segment in the target voice stream, the intermediate text data is obtained by performing text conversion processing on a sub-voice segment in the target voice segment, and the target voice segment is a segment of the target voice stream that contains voice data; if the corrected text data completely contains the intermediate text data, determining the intermediate text data as the text display data corresponding to the target voice segment; and determining, based on the text display data corresponding to the target voice segment, the target text display data corresponding to the target voice stream as the subtitle.
- 11. A storage medium for storing computer-executable instructions that, when executed by a processor, implement the following: establishing a long connection with a first device, and receiving, through the long connection, a target voice stream to be converted that is collected by the first device; performing text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and performing correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determining, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle; wherein determining the target text display data based on the first text data and the second text data comprises: if corrected text data does not completely contain intermediate text data, determining text display data corresponding to a target voice segment based on a pre-trained semantic analysis model, the intermediate text data corresponding to the target voice segment, and the corrected text data, wherein the first text data is determined based on the intermediate text data, the second text data is determined based on the corrected text data, the corrected text data is obtained by correcting the intermediate text data corresponding to the target voice segment in the target voice stream, the intermediate text data is obtained by performing text conversion processing on a sub-voice segment in the target voice segment, and the target voice segment is a segment of the target voice stream that contains voice data; if the corrected text data completely contains the intermediate text data, determining the intermediate text data as the text display data corresponding to the target voice segment; and determining, based on the text display data corresponding to the target voice segment, the target text display data corresponding to the target voice stream as the subtitle.
Description
Subtitle display method, device and equipment Technical Field The present document relates to the field of computer technology, and in particular to a method, an apparatus, and a device for displaying subtitles. Background With the rapid development of computer technology, the demand for real-time subtitle display keeps growing. For example, when watching a live class or participating in a video conference, the presenter's speech needs to be converted into subtitles for display so that users can grasp the explained content more intuitively. Typically, a voice stream is divided into a plurality of voice segments, a connection with the server is established for each voice segment, and the voice segments are sent to the server over their respective connections for text conversion processing, yielding the text data corresponding to the voice stream; real-time subtitle display is then realized from that text data. However, when the volume of voice conversion is large, this approach requires a large number of connections between the devices (including voice acquisition devices, subtitle display devices, and the like) and the server. Because many connections are frequently established and the voice stream must be divided into a plurality of segments that are recognized separately, obtaining the complete subtitle takes a long time, making it difficult to keep the voice and the subtitle synchronized and lowering the accuracy of the displayed text. Disclosure of Invention The embodiments of this specification aim to provide a technical solution that improves resource utilization as well as the efficiency and accuracy of text display for voice data.
To achieve the above, the embodiments of this specification are implemented as follows. An embodiment of this specification provides a subtitle display method comprising the following steps: establishing a long connection with a first device, and receiving, through the long connection, a target voice stream to be converted that is collected by the first device; performing text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and performing correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determining, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle. An embodiment of this specification provides a subtitle display apparatus, the apparatus comprising: a connection establishment module configured to establish a long connection with a first device and receive, through the long connection, a target voice stream to be converted that is collected by the first device; a data conversion module configured to perform text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and to perform correction processing on the first text data to obtain second text data corresponding to the target voice stream; and a data determination module configured to determine, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle.
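The summary above describes converting the stream into first (intermediate) text data at a first time interval and correcting it into second text data as chunks keep arriving over one long connection. One way to picture that pipeline is the sketch below; the `recognize` and `correct` callables are hypothetical stand-ins for the text-conversion and correction services, and the interval is counted in chunks for simplicity.

```python
def stream_subtitles(chunks, interval, recognize, correct):
    """Yield (first_text, second_text) pairs as voice chunks arrive
    over the long connection.

    `recognize` and `correct` are hypothetical stand-ins for the
    text-conversion and correction steps described in the summary.
    """
    buffer = []
    for i, chunk in enumerate(chunks, start=1):
        buffer.append(chunk)
        if i % interval == 0:  # the "first time interval", in chunks
            first_text = recognize(buffer)     # first text data
            second_text = correct(first_text)  # second (corrected) text data
            yield first_text, second_text
```

Because the connection stays open, every interval reuses the same channel instead of paying a new connection setup, which is the resource-utilization gain the background section motivates.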
An embodiment of the present disclosure provides a subtitle display device, where the subtitle display device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: establish a long connection with a first device, and receive, through the long connection, a target voice stream to be converted that is collected by the first device; perform text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and perform correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determine, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle. The embodiments of this specification also provide a storage medium for storing computer-executable instructions that, when executed, implement the following: establishing a long connection with a first device, and receiving, through the long connection, a target voice stream to be converted that is collected by the first device; performing text conversion processing on the target voice stream to obtain first text data corresponding to the target voice stream, and performing correction processing on the first text data to obtain second text data corresponding to the target voice stream; and determining, based on the first text data and the second text data, target text display data corresponding to the target voice stream as the subtitle.