CN-121985185-A - Subtitle generating method, smart playback device, storage medium, and computer program
Abstract
Embodiments of the disclosure provide a subtitle generating method, a smart playback device, a storage medium, and a computer program. The method comprises: inputting audio information in audio-video content into an ASR model for speech recognition to obtain text information output by the ASR model in a streaming manner, wherein the ASR model is trained on a punctuation-removed corpus and the text information contains no punctuation marks; determining the audio time information corresponding to each piece of text information streamed by the ASR model; determining, from the audio time information of adjacent pieces of text information, the target text information at which a speech pause occurs and the duration of the speech pause following the target text information; determining the corresponding punctuation mark from the speech pause duration; and inserting the determined punctuation mark after the target text information to form subtitle content comprising the text information and the punctuation marks. Embodiments of the disclosure can improve the real-time performance of subtitle generation and the user's audio-video experience while preserving speech recognition accuracy and subtitle readability.
Inventors
- DU JIACHENG
- Shui Wen
- Cao Yigu
Assignees
- 晶晨半导体(深圳)有限公司 (Amlogic Semiconductor (Shenzhen) Co., Ltd.)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-08
Claims (18)
- 1. A subtitle generating method, applied to a smart playback device, comprising: inputting audio information in audio-video content into an ASR model for speech recognition to obtain text information output by the ASR model in a streaming manner, wherein the ASR model is trained on a punctuation-removed corpus and the text information contains no punctuation marks; determining audio time information corresponding to each piece of text information streamed by the ASR model, wherein the audio time information reflects the time at which the text information occurs in the audio information; determining, from the audio time information of adjacent pieces of text information, the target text information at which a speech pause occurs and the speech pause duration following the target text information; and determining the corresponding punctuation mark from the speech pause duration, and inserting the determined punctuation mark after the target text information to form subtitle content comprising the text information and the punctuation marks, wherein different punctuation marks correspond to different speech pause durations.
- 2. The method of claim 1, wherein determining the target text information of a speech pause from the audio time information of adjacent text information comprises: determining the audio time interval between adjacent pieces of text information from their audio time information; judging whether a speech pause exists between the adjacent pieces of text information based on the numerical relation between the audio time interval and a speech pause decision threshold; and, if a speech pause exists, determining the earlier of the two adjacent pieces of text information as the target text information, wherein the speech pause duration following the target text information is the audio time interval between the target text information and the subsequent text information.
- 3. The method of claim 2, further comprising: statistically processing the audio time intervals between a plurality of pairs of adjacent text information output by the ASR model to obtain statistical characteristics of those intervals; and determining the speech pause decision threshold based on a functional or mapping relation between the statistical characteristics and the speech pause decision threshold; wherein the plurality of pairs of adjacent text information comprises: all adjacent text information output by the ASR model for the audio information in the audio-video content; or the adjacent text information corresponding to a preset number of pieces of text information most recently output by the ASR model; or the adjacent text information corresponding to the text information output by the ASR model within a preset time range.
- 4. The method of claim 3, wherein the statistical characteristics comprise an average value and/or a standard deviation of the audio time intervals of the plurality of pairs of adjacent text information; and/or wherein judging whether a speech pause exists between adjacent pieces of text information based on the numerical relation between the audio time interval and the speech pause decision threshold comprises: determining that a speech pause exists if the audio time interval between the adjacent pieces of text information is greater than or equal to the speech pause decision threshold; and determining that no speech pause exists if the audio time interval is less than the speech pause decision threshold.
- 5. The method of claim 1, wherein determining the corresponding punctuation mark from the speech pause duration comprises: determining the punctuation mark matching the speech pause duration according to its numerical relation to the speech pause duration thresholds corresponding to different punctuation marks, wherein different punctuation marks correspond to different speech pause duration thresholds.
- 6. The method of claim 5, wherein the punctuation marks are divided into the period and punctuation marks other than the period; the speech pause duration threshold corresponding to the period is a fixed value set based on statistics of sentence-end pause durations in historical audio and/or on empirical values, while the speech pause duration thresholds of at least some punctuation marks other than the period are dynamically adjusted based on speech rate characteristics of the audio of the audio-video content already played by the smart playback device.
- 7. The method of claim 6, wherein the at least some punctuation marks comprise the comma, the method further comprising: dividing the time over which the smart playback device has played audio of the audio-video content into a plurality of historical time intervals, and determining the historical speech rate of each interval to obtain a plurality of historical speech rates; computing a weighted average of the historical speech rates to obtain an average speech rate, wherein the weight of a historical speech rate is negatively correlated with the time distance of its historical interval from the current moment; and determining the speech pause duration threshold corresponding to the comma based on a mapping or functional relation between the average speech rate and that threshold, wherein the threshold is negatively correlated with the average speech rate.
- 8. The method of claim 7, wherein determining the punctuation mark matching the speech pause duration according to the speech pause duration thresholds corresponding to different punctuation marks comprises: if the speech pause duration is greater than or equal to the threshold corresponding to the comma and less than the threshold corresponding to the period, determining the matching punctuation mark to be a comma and inserting it after the target text information; and if the speech pause duration is greater than or equal to the threshold corresponding to the period, determining the matching punctuation mark to be a period and inserting it after the target text information; wherein the threshold corresponding to the period is larger than the threshold corresponding to the comma.
- 9. The method of claim 1, wherein determining the audio time information corresponding to each piece of text information streamed by the ASR model comprises: for the current text information output by the ASR model, aligning the current text information with the corresponding audio information on the audio time axis; and determining, from the aligned time correspondence, the time point of the current text information in the audio information to obtain the corresponding audio time information, wherein the time point comprises an audio start time and/or an audio end time.
- 10. The method of any one of claims 1-9, wherein the text information streamed by the ASR model comprises words streamed by the ASR model, a word being composed of a plurality of characters, and wherein the audio time information corresponding to the text information comprises word-level timestamps representing the times at which the words occur in the audio information.
- 11. The method of claim 1, wherein the training process of the ASR model is performed cooperatively by a training server and a test device and is constrained by at least speech recognition accuracy and a real-time index, so that the ASR model deployed to the smart playback device meets both the speech recognition accuracy requirement and the real-time requirement of the smart playback device; wherein the training server is configured to: preset ASR model structure parameters used to construct an ASR model that streams text information corresponding to the input audio and containing no punctuation marks; train the ASR model based on a training data set of the punctuation-removed corpus and the preset ASR model structure parameters, and send the trained ASR model to the test device; after receiving the real-time index of the ASR model fed back by the test device, judge whether the real-time index meets the real-time requirement of the smart playback device; if it does not, adjust the ASR model structure parameters to reduce the parameter scale of the ASR model, retrain the ASR model on the training data set of the punctuation-removed corpus, and send the retrained ASR model to the test device for real-time index evaluation; and if it does, deploy the ASR model to the smart playback device; and wherein the test device is a test prototype or test platform corresponding to the smart playback device, and is configured to receive and load the trained ASR model issued by the training server, perform speech recognition on input audio with the ASR model, evaluate the real-time index of the ASR model, and feed the real-time index back to the training server.
- 12. The method of claim 11, wherein the training server determines that the real-time index of the ASR model meets the real-time requirement of the smart playback device if the real-time index meets the numerical requirement of a preset real-time index threshold, the preset threshold matching the smart playback device's real-time requirement for the ASR model; and/or wherein the training server adjusts the ASR model structure parameters according to a predefined adaptive adjustment strategy for model structure parameters to reduce the parameter scale of the ASR model, the strategy adjusting the model structure parameters by a preset rule based on the degree by which the real-time index of the ASR model deviates from the preset real-time index threshold, a larger deviation corresponding to a larger reduction in parameter scale.
- 13. The method of claim 1, further comprising: obtaining a sub-graph scheduling strategy and a subtitle generation computation graph, wherein the subtitle generation computation graph comprises the model computation graph of the ASR model, or the model computation graph of the ASR model together with the computation graph of a punctuation prediction program that supplements punctuation marks for the text information streamed by the ASR model; and mapping each sub-graph of the subtitle generation computation graph to a corresponding computing unit for execution according to the sub-graph scheduling strategy, wherein the strategy is determined based on the quantization sensitivity of each sub-graph and the heterogeneous computing power information of the smart playback device, and represents the mapping between sub-graphs and computing units on the smart playback device and the execution relations between sub-graphs.
- 14. The method of claim 13, wherein the sub-graph scheduling strategy is generated by a server and issued to the smart playback device, the server generating the strategy based on the quantization sensitivity of each sub-graph and the heterogeneous computing power information of the smart playback device, and iteratively adjusting it based on its running effect on the smart playback device; the generation process comprising: obtaining the subtitle generation computation graph of the smart playback device; partitioning the subtitle generation computation graph into a plurality of sub-graphs; evaluating the quantization sensitivity of each sub-graph; determining the sub-graph scheduling strategy based on the quantization sensitivity of each sub-graph and the heterogeneous computing power information of the smart playback device, and issuing it to the smart playback device; obtaining running effect data for the sub-graph scheduling strategy on the smart playback device; and adjusting the strategy according to the running effect data until the processing speed and subtitle accuracy reflected by the data meet expectations, yielding a final sub-graph scheduling strategy that is fixedly deployed on the smart playback device.
- 15. The method of claim 1, further comprising: monitoring in real time the buffered amount of audio awaiting processing by the ASR model; if the buffered amount exceeds a preset upper limit, switching the operation mode of the processor of the smart playback device to a high-performance mode; and if the buffered amount falls below a preset lower limit, switching the operation mode of the processor to a balanced mode or an energy-saving mode; wherein the audio duration of the audio awaiting processing reflects the buffered amount and is positively correlated with it.
- 16. A smart playback device comprising a memory storing computer-executable instructions and a processor that invokes the computer-executable instructions to perform the subtitle generating method of any one of claims 1-15.
- 17. A storage medium storing computer-executable instructions which, when executed by a processor, implement the subtitle generating method according to any one of claims 1-15.
- 18. A computer program comprising computer-executable instructions which, when executed by a processor, implement the subtitle generating method according to any one of claims 1-15.
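The pause-decision rule of claims 2-4 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the claims only require a speech pause decision threshold derived from some functional or mapping relation on the mean and/or standard deviation of recent audio time intervals, so the `mean + k * stdev` form, the `k` and `fallback` values, and all function names here are assumptions.

```python
from statistics import mean, stdev

def pause_threshold(gaps, k=1.0, fallback=0.3):
    """Speech pause decision threshold from statistical characteristics
    (mean, standard deviation) of recent inter-word gaps, per claims 3-4.
    k and fallback (seconds) are assumed, illustrative values."""
    if len(gaps) < 2:
        return fallback
    return mean(gaps) + k * stdev(gaps)

def detect_pauses(words):
    """words: (text, start_s, end_s) tuples from the ASR stream, in order.
    Returns (target_text, pause_duration) for every adjacent pair whose
    gap is >= the threshold, applying claim 4's >= / < decision rule."""
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    thr = pause_threshold(gaps)
    return [(cur[0], gap)
            for cur, gap in zip(words, gaps)
            if gap >= thr]
```

For example, in a stream where most inter-word gaps are around 0.05 s, a single 0.6 s gap rises above the statistical threshold and marks the preceding word as the target text information.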
Description
Subtitle generating method, smart playback device, storage medium, and computer program

Technical Field

Embodiments of the disclosure relate to the field of computer technology, and in particular to a subtitle generating method, a smart playback device, a storage medium, and a computer program.

Background

Smart playback devices are widely used in scenarios such as home entertainment and commercial display. A smart playback device is a terminal device capable of receiving, decoding, and playing audio-video content, such as a set-top box, a smart television, or a projector. When such a device is in use, its subtitle generating function can produce subtitle content from the audio information in the audio-video content and display it to the user, helping the user understand the content. Generating subtitles accurately and in real time on a smart playback device is therefore of significant value. At present, smart playback devices mainly rely on an ASR (Automatic Speech Recognition) model to perform speech recognition on the audio information in audio-video content and output the corresponding recognition result so as to generate subtitles. Subtitle generation, however, requires not only high recognition accuracy but also high real-time performance: if the ASR model outputs recognition results with large delay, subtitle display lags behind the currently playing audio, affecting the user's understanding and viewing of the content. How to improve the real-time performance of subtitle generation on smart playback devices has thus become a technical problem to be solved by those skilled in the art.
Disclosure of Invention

In view of this, embodiments of the present disclosure provide a subtitle generating method, a smart playback device, a storage medium, and a computer program, so as to improve the real-time performance of subtitle generation on a smart playback device. To achieve the above object, embodiments of the present disclosure provide the following technical solutions. In a first aspect, an embodiment of the present disclosure provides a subtitle generating method, applied to a smart playback device, comprising: inputting audio information in audio-video content into an ASR model for speech recognition to obtain text information output by the ASR model in a streaming manner, wherein the ASR model is trained on a punctuation-removed corpus and the text information contains no punctuation marks; determining audio time information corresponding to each piece of text information streamed by the ASR model, wherein the audio time information reflects the time at which the text information occurs in the audio information; determining, from the audio time information of adjacent pieces of text information, the target text information at which a speech pause occurs and the speech pause duration following the target text information; and determining the corresponding punctuation mark from the speech pause duration, and inserting the determined punctuation mark after the target text information to form subtitle content comprising the text information and the punctuation marks, wherein different punctuation marks correspond to different speech pause durations. In a second aspect, an embodiment of the present disclosure provides a smart playback device, including a memory and a processor, wherein the memory stores computer-executable instructions and the processor invokes them to perform the subtitle generating method according to the first aspect.
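Given word-level timestamps, the first-aspect flow of detecting a pause after a word and mapping its duration to a punctuation mark might be sketched as follows. This is a simplified illustration with fixed, hypothetical thresholds: in the disclosure the comma threshold would be adapted to the speech rate and the period threshold fixed from historical statistics, and the use of English punctuation and space-joined words here is purely for readability of the example.

```python
def punctuate(words, comma_thr=0.3, period_thr=0.8):
    """Assemble subtitle text from streamed, punctuation-free words with
    (start_s, end_s) audio times, inserting a comma or period after a word
    according to the following pause. comma_thr < period_thr, so longer
    pauses map to 'stronger' punctuation, matching claim 8's rule."""
    parts = []
    for i, (text, start, end) in enumerate(words):
        parts.append(text)
        if i + 1 < len(words):
            pause = words[i + 1][1] - end  # gap to the next word's start
            if pause >= period_thr:
                parts.append(". ")
            elif pause >= comma_thr:
                parts.append(", ")
            else:
                parts.append(" ")
    return "".join(parts).strip()
```

A 0.4 s pause thus yields a comma and a 0.9 s pause a period, while sub-threshold gaps leave the words unpunctuated.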
In a third aspect, an embodiment of the present disclosure provides a storage medium storing computer-executable instructions that, when executed by a processor, implement the subtitle generating method of the first aspect. In a fourth aspect, an embodiment of the present disclosure provides a computer program including computer-executable instructions that, when executed by a processor, implement the subtitle generating method of the first aspect. The subtitle generating method provided by embodiments of the disclosure inputs the audio information in audio-video content into an ASR model for speech recognition to obtain text information streamed by the ASR model, the ASR model being trained on a punctuation-removed corpus so that the text information contains no punctuation marks. It can be seen that in the embodiments of the present disclosure, the ASR model performs speech recognition on audio information to stream text information that does not contain punctuation, i.e., the ASR model does not need to output punctuation.
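The recency-weighted speech-rate average and the comma threshold it drives (claim 7) could be sketched like this. Exponential decay weighting and the inverse-proportional threshold form are assumptions; the claim only requires that interval weights fall with distance from the current moment and that the threshold is negatively correlated with the average speech rate.

```python
def average_speech_rate(intervals, decay=0.5):
    """intervals: (characters_spoken, seconds) per historical time window
    of played audio, oldest first. Returns a weighted average rate in
    chars/s, where older windows get exponentially smaller weights
    (the assumed realization of claim 7's negative correlation)."""
    n = len(intervals)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest -> 1.0
    rates = [chars / secs for chars, secs in intervals]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)

def comma_pause_threshold(avg_rate, base=0.9, ref_rate=3.0):
    """Comma pause-duration threshold, negatively correlated with the
    average speech rate: faster speech -> shorter required pause.
    base and ref_rate are hypothetical constants."""
    return base * ref_rate / avg_rate
```

With two equal-length windows at 3 and 4 chars/s, the more recent window dominates the average, and a higher average rate shrinks the comma threshold accordingly.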