
CN-115659957-B - Video subtitle wrongly written character detection method, apparatus, device and storage medium

CN 115659957 B

Abstract

The application discloses a method, apparatus, device, and storage medium for detecting wrongly written characters in video subtitles. Subtitle text is recognized in a video that contains the user's lip movements and/or sign language, and a lip-image sequence and/or a sign-language image sequence is extracted from the video. Text-modality features are extracted from the subtitle text, lip-modality features from the lip-image sequence, and sign-language-modality features from the sign-language image sequence; the lip and/or sign-language features serve as visual-modality features. The visual-modality features and text-modality features are fused, and the real text contained in the video is determined from the fused features. Because the visual modality (lip shape and/or sign language) is fused with the text modality of the subtitle text, the predicted real text is more accurate than a prediction from text alone; comparing this real text with the subtitle text then yields the wrongly-written-character detection result, greatly improving detection accuracy.
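To make the final step of this pipeline concrete, the comparison between the predicted real text and the recognized subtitle text (detailed in claim 10 below) reduces to a position-wise character match. The Python sketch below illustrates that step only; the function name and the assumption that the two texts are already aligned to the same length are illustrative choices of ours, not specifics from the patent.

    from typing import List, Tuple

    def find_typos(subtitle: str, real_text: str) -> List[Tuple[int, str, str]]:
        """Position-wise comparison of the recognized subtitle text against
        the predicted real text; mismatching characters are typo candidates.
        Assumes both strings have been aligned to the same length."""
        return [
            (i, sub_ch, real_ch)
            for i, (sub_ch, real_ch) in enumerate(zip(subtitle, real_text))
            if sub_ch != real_ch
        ]

    # Toy example: one wrongly written character at index 5.
    print(find_typos("life os good", "life is good"))  # -> [(5, 'o', 'i')]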

Inventors

  • XU ZIHANG
  • YANG ZIQING
  • CUI YIMING
  • WANG SHIJIN

Assignees

  • iFLYTEK CO., LTD. (科大讯飞股份有限公司)

Dates

Publication Date
2026-05-05
Application Date
2022-10-28

Claims (15)

  1. A method for detecting wrongly written characters in video subtitles, comprising: acquiring a video containing subtitles and user lip-shape and/or sign-language images matched with the subtitles; recognizing subtitle text in the video, extracting the user's lip-movement process in the video as a lip-image sequence, and/or extracting the user's sign-language action process in the video as a sign-language image sequence; extracting text-modality features of the subtitle text, extracting lip-modality features of the lip-image sequence, and/or extracting sign-language-modality features of the sign-language image sequence, wherein the lip-modality features and/or the sign-language-modality features serve as visual-modality features; fusing the visual-modality features and the text-modality features to obtain fused features; determining the real text contained in the video based on the fused features; and comparing the real text with the subtitle text to obtain a wrongly-written-character detection result for the video subtitles.
  2. The method of claim 1, wherein the visual-modality features and the text-modality features are each in vector form, and fusing the visual-modality features and the text-modality features to obtain fused features comprises: fusing the visual-modality features and the text-modality features in vector form by a gated-fusion approach to obtain the fused features.
  3. The method of claim 2, further comprising, after fusing the visual-modality features and the text-modality features in vector form by the gated-fusion approach to obtain the fused features: adding the fused features and the text-modality features to obtain residual fused features as the final fused features.
  4. The method of claim 2, further comprising, before fusing the visual-modality features and the text-modality features in vector form by the gated-fusion approach: performing a representation offset and a nonlinear transformation on the visual-modality features to obtain processed visual-modality features.
  5. The method of claim 1, wherein extracting and fusing the visual-modality features and the text-modality features, and determining the real text contained in the video based on the fused features, are performed by a pre-trained video text recognition model; the video text recognition model is configured to extract lip-modality features of an input lip-image sequence and/or sign-language-modality features of an input sign-language image sequence as visual-modality features, extract text-modality features of the input subtitle text, fuse the visual-modality features and the text-modality features, and predict the real text contained in the video based on the fused features.
  6. The method of claim 5, wherein the video text recognition model comprises an image processing module, a text processing module, a multimodal fusion module, and an output module; the image processing module is configured to extract lip-modality features of an input lip-image sequence and/or sign-language-modality features of an input sign-language image sequence, the lip-modality features and/or sign-language-modality features serving as visual-modality features; the text processing module is configured to extract text-modality features of the input subtitle text; the multimodal fusion module is configured to fuse the visual-modality features and the text-modality features to obtain fused features; and the output module is configured to determine the real text contained in the video based on the fused features.
  7. The method of claim 6, wherein the multimodal fusion module comprises: a feature editing module configured to perform a representation offset and a nonlinear transformation on the visual-modality features to obtain processed visual-modality features; a gated fusion module configured to fuse the processed visual-modality features and the text-modality features by a gated-fusion approach to obtain fused features; and a residual connection module configured to add the fused features and the text-modality features to obtain residual fused features as the final fused features (see the fusion sketch after the claims).
  8. The method of claim 6, wherein the image processing module comprises: an image normalization module configured to normalize the input lip-image sequence and/or sign-language image sequence to obtain a processed lip-image sequence and/or sign-language image sequence; an image feature extraction module configured to extract visual-modality features from the processed lip-image sequence and/or sign-language image sequence; and a linear transformation module configured to linearly transform the dimensionality of the visual-modality features so as to output visual-modality features with the same dimensionality as the text-modality features.
  9. The method of claim 6, wherein the text processing module comprises: a text preprocessing module configured to pad the input subtitle text to a set length with set characters and determine a feature representation of the edited subtitle text; and a text-modality feature extraction module configured to encode the feature representation of the subtitle text to obtain the text-modality features of the subtitle text.
  10. The method according to any one of claims 1-9, wherein comparing the real text with the subtitle text to obtain the wrongly-written-character detection result of the video subtitles comprises: matching whether characters inconsistent with the real text exist in the subtitle text, and if so, taking the inconsistent characters in the subtitle text as wrongly written characters of the video subtitles.
  11. The method according to any one of claims 1-9, further comprising, after comparing the real text with the subtitle text to obtain the wrongly-written-character detection result of the video subtitles: deleting each identified wrongly written character from the subtitle text to obtain an edited text with that character deleted; computing, with a pre-trained language model, the perplexity of the subtitle text and of the edited text respectively; and if the perplexity of the edited text is smaller than the perplexity of the subtitle text and the absolute value of the difference between the two is larger than a set threshold, keeping the wrongly written character in the final detection result, and otherwise removing it from the final detection result.
  12. The method according to any one of claims 1-9, further comprising, after comparing the real text with the subtitle text to obtain the wrongly-written-character detection result of the video subtitles: determining the position of each wrongly written character in the video frame; and marking the wrongly written character in the video frame according to that position.
  13. An apparatus for detecting wrongly written characters in video subtitles, comprising: a video acquisition unit configured to acquire a video containing subtitles and user lip-shape and/or sign-language images matched with the subtitles; a video preprocessing unit configured to recognize subtitle text in the video, extract a lip-image sequence from the user's lip-movement process in the video, and/or extract a sign-language image sequence from the user's sign-language action process in the video; a feature extraction unit configured to extract text-modality features of the subtitle text and lip-modality features of the lip-image sequence and/or sign-language-modality features of the sign-language image sequence, the lip-modality and/or sign-language-modality features serving as visual-modality features; a feature fusion unit configured to fuse the visual-modality features and the text-modality features to obtain fused features; a real text determination unit configured to determine the real text contained in the video based on the fused features; and a wrongly-written-character determination unit configured to compare the real text with the subtitle text to obtain a wrongly-written-character detection result for the video subtitles.
  14. A device for detecting wrongly written characters in video subtitles, comprising a memory and a processor, wherein the memory is configured to store a program, and the processor is configured to execute the program to implement the steps of the video subtitle wrongly-written-character detection method according to any one of claims 1-12.
  15. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video subtitle wrongly-written-character detection method according to any one of claims 1-12.
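Claims 2-4 and 7 describe the fusion stage as three steps: a representation offset plus nonlinear transformation on the visual features, a gated fusion of the two modalities, and a residual connection back to the text-modality features. The claims do not fix the exact gating formula, so the PyTorch sketch below is one common instantiation under assumptions of ours (a learned sigmoid gate over the concatenated features, tanh as the nonlinearity), not the patent's definitive design.

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """One plausible reading of claims 2-4 and 7: offset and nonlinearly
        transform the visual features, gate them against the text features,
        then add a residual connection back to the text features."""

        def __init__(self, dim: int):
            super().__init__()
            self.offset = nn.Linear(dim, dim)    # representation offset (claim 4)
            self.gate = nn.Linear(2 * dim, dim)  # gate computed from both modalities

        def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            v = torch.tanh(self.offset(visual))                         # offset + nonlinearity
            g = torch.sigmoid(self.gate(torch.cat([v, text], dim=-1)))  # gating (claim 2)
            fused = g * v + (1.0 - g) * text                            # gated mixture
            return fused + text                                         # residual fusion (claim 3)

    # Toy usage: batch of 2 sequences, length 5, feature dimension 16.
    fusion = GatedFusion(dim=16)
    out = fusion(torch.randn(2, 5, 16), torch.randn(2, 5, 16))
    print(out.shape)  # torch.Size([2, 5, 16])

The residual connection keeps the text-modality features as the default signal, so weak or noisy visual features degrade the model gracefully toward text-only behavior.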

Description

Video subtitle wrongly written character detection method, apparatus, device and storage medium

Technical Field

The present application relates to the field of natural language processing, and in particular to a method, apparatus, device, and storage medium for detecting wrongly written characters in video subtitles.

Background

With the development of information technology, media platforms have entered an era characterized by diversified forms of information transmission and multi-point transmission sources. Currently available multimedia information includes self-media presenters delivering expertise or discussing social topics on camera, online video conferences provided by various kinds of video conferencing software, and so on. Multimedia information of each type generally includes images of the user during the presentation, and corresponding subtitles are also provided to improve communication efficiency. In addition, to help hearing-impaired people access the information, dedicated sign-language interpreters are also provided alongside the subtitles in part of such multimedia videos.

Limited by the carelessness of subtitle makers or the immaturity of current subtitle generation technology, a large number of video subtitles on video platforms contain wrongly written characters, and such errors also appear regularly in subtitles generated in real time by video conferencing software. This phenomenon seriously harms both the accuracy of information transfer and the reach of cultural dissemination. Checking and correcting these texts purely by hand consumes a great deal of manpower and time. Today, with the active development of artificial intelligence, and especially thanks to progress in natural language processing technology, various text error detection and correction systems have been developed to help people check and fix text efficiently. Taking video subtitles as an example, an existing error correction system generally recognizes the video subtitles, performs error correction on the subtitle text based on its context, locates possible errors, and returns them to the user. Because such existing approaches rely on plain text information alone, their detection accuracy for wrongly written characters is limited.

Disclosure of Invention

In view of the above problems, the present application provides a method, apparatus, device, and storage medium for detecting wrongly written characters in video subtitles, so as to improve detection accuracy.
The specific scheme is as follows.

In a first aspect, a method for detecting wrongly written characters in video subtitles is provided, including: acquiring a video containing subtitles and user lip-shape and/or sign-language images matched with the subtitles; recognizing subtitle text in the video, extracting the user's lip-movement process in the video as a lip-image sequence, and/or extracting the user's sign-language action process in the video as a sign-language image sequence; extracting text-modality features of the subtitle text, extracting lip-modality features of the lip-image sequence, and/or extracting sign-language-modality features of the sign-language image sequence, wherein the lip-modality features and/or the sign-language-modality features serve as visual-modality features; fusing the visual-modality features and the text-modality features to obtain fused features; determining the real text contained in the video based on the fused features; and comparing the real text with the subtitle text to obtain a wrongly-written-character detection result for the video subtitles.

In a second aspect, an apparatus for detecting wrongly written characters in video subtitles is provided, including: a video acquisition unit configured to acquire a video containing subtitles and user lip-shape and/or sign-language images matched with the subtitles; a video preprocessing unit configured to recognize subtitle text in the video, extract a lip-image sequence from the user's lip-movement process in the video, and/or extract a sign-language image sequence from the user's sign-language action process in the video; a feature extraction unit configured to extract text-modality features of the subtitle text and lip-modality features of the lip-image sequence and/or sign-language-modality features of the sign-language image sequence, the lip-modality and/or sign-language-modality features serving as visual-modality features; a feature fusion unit configured to fuse the visual-modality features and the text-modality features to obtain fused features; a real text determination unit configured to determine the real text contained in the video based on the fused features; and a wrongly-written-character determination unit configured to compare the real text with the subtitle text to obtain a wrongly-written-character detection result for the video subtitles.
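The post-filtering step of claim 11 can likewise be sketched compactly: a suspected wrongly written character is confirmed only if deleting it lowers the language-model perplexity of the sentence by more than a set threshold. The sketch below keeps the language model abstract as a string-scoring callable; the scorer shown is a toy stand-in, whereas the patent presupposes a pre-trained language model, whose perplexity would typically be computed as the exponential of the mean token cross-entropy.

    from typing import Callable, List

    def filter_typos(
        subtitle: str,
        typo_positions: List[int],
        perplexity: Callable[[str], float],
        threshold: float = 0.5,
    ) -> List[int]:
        """Keep a suspected wrongly written character only if deleting it
        lowers the perplexity of the sentence by more than `threshold`."""
        base_ppl = perplexity(subtitle)
        kept = []
        for pos in typo_positions:
            edited = subtitle[:pos] + subtitle[pos + 1:]  # delete the suspect character
            edited_ppl = perplexity(edited)
            if edited_ppl < base_ppl and abs(base_ppl - edited_ppl) > threshold:
                kept.append(pos)
        return kept

    # Toy stand-in scorer: penalizes strings containing the character 'q'.
    toy_ppl = lambda s: 10.0 + 5.0 * s.count("q")
    print(filter_typos("a qick test", [2], toy_ppl))  # -> [2]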