CN-116311200-B - Text attribute identification method and device, electronic equipment and storage medium


Abstract

Embodiments of the disclosure provide a text attribute identification method and device, an electronic device, and a storage medium, and relate to the field of computer technology. In the method, text line related information is obtained for the text lines in each of at least two consecutive video frames of a video to be identified. A pre-trained video attribute identification model performs feature encoding on the text line related information of each video frame to obtain the text line features of each text line in that frame. The same model then derives text line inter-frame features for each text line from the text line features of the at least two video frames, and attribute identification is performed on these inter-frame features to determine the text attribute of each text line in the video frame to be identified. Because the analysis takes the inter-frame features of text lines across multiple consecutive video frames into account, the accuracy of text attribute identification in video is improved to some extent.

Inventors

LI CONG
XIA KUI
LI JIAHONG

Assignees

北京达佳互联信息技术有限公司

Dates

Publication Date: 20260512
Application Date: 20230203

Claims (9)

1. A method for identifying text attributes, the method comprising: acquiring text line related information of text lines in each video frame based on text lines in at least two consecutive video frames in a video to be identified, wherein the text line related information comprises text information; performing feature encoding on the text line related information of the text lines in each video frame through a pre-trained video attribute identification model to obtain text line features corresponding to each text line in each video frame; acquiring, through the video attribute identification model, text line inter-frame features corresponding to each text line based on the text line features corresponding to the at least two video frames; and performing attribute identification on the text line inter-frame features to determine the text attribute corresponding to each text line in the video frame to be identified; wherein the method further comprises: setting consecutive first position identifiers for target information of all text lines in the video frame, wherein the target information comprises the text information; and wherein the acquiring, through the video attribute identification model, text line inter-frame features corresponding to each text line based on the text line features corresponding to the at least two video frames comprises: acquiring, through a second processing layer in the video attribute identification model, the first position identifiers corresponding to the text information of the text lines in the at least two video frames; and, for any text line in any video frame, determining the text line inter-frame features of the text line according to the text line features corresponding to the text line, a display duration, and the first position identifier corresponding to the text information, wherein the display duration represents how long the text line appears in the at least two video frames.
2. The method of claim 1, wherein the text line related information further comprises image information and position information, and wherein the acquiring text line related information of text lines in each video frame based on text lines in at least two consecutive video frames in the video to be identified comprises: for any video frame in the at least two video frames, acquiring local image features of the area where each text line in the video frame is located as the image information; acquiring the text content of each text line in the video frame as the text information; and acquiring the position coordinates of each text line in the video frame as the position information.
3. The method according to claim 2, wherein the performing feature encoding on the text line related information of the text lines in each video frame through the pre-trained video attribute identification model to obtain the text line features corresponding to each text line in each video frame comprises: generating, through a first processing layer in the video attribute identification model, first splicing information of each text line based on the position information and the text information of the text line, and second splicing information of each text line based on the position information and the image information of the text line; splicing the first splicing information and the second splicing information of each text line to obtain the splicing information of the text line; and generating the text line features of each text line based on the splicing information of the text line.
4. The method according to claim 3, wherein the method further comprises: for any video frame, determining, through the first processing layer, a text sequence identifier for the text information and an image sequence identifier for the image information of each text line in the video frame, wherein the text sequence identifier and the image sequence identifier of the same text line are identical; the target information further comprises the image information; the generating the first splicing information of each text line based on the position information and the text information of the text line comprises: splicing the position information of the text line, the text information, the first position identifier of the text information, and the text sequence identifier to obtain the first splicing information of the text line; and the generating the second splicing information of each text line based on the position information and the image information of the text line comprises: splicing the position information of the text line, the image information, the first position identifier of the image information, and the image sequence identifier to obtain the second splicing information of the text line.
5. The method according to claim 1, wherein the determining the text line inter-frame features of the text line according to the text line features, the display duration, and the first position identifier corresponding to the text information comprises: encoding the display duration and the first position identifier corresponding to the text information separately to obtain a display duration vector and a first position identifier vector; and splicing the text line features, the display duration vector, and the first position identifier vector to obtain the text line inter-frame features.
6. The method of claim 1, wherein the video attribute identification model is trained by: training a basic model based on a first specified training task and a first sample video, and training the basic model based on a second specified training task and each video frame in the first sample video, to obtain a recognition model to be trained, wherein the second specified training task comprises a text content recovery task, an image area recovery task, and/or an image area and text content alignment task; taking text lines in at least two second sample video frames as input of the recognition model to be trained, and acquiring the text attributes predicted by the recognition model to be trained; adjusting parameters of the recognition model to be trained based on the predicted text attributes and the text attribute labels of the text lines in the at least two second sample video frames; and, when the recognition model to be trained reaches a stop condition, determining the recognition model to be trained that has reached the stop condition as the video attribute identification model.
7. A text attribute recognition device, the device comprising: a first acquisition module, configured to acquire text line related information of text lines in each video frame based on text lines in at least two consecutive video frames in a video to be identified; a first encoding module, configured to perform feature encoding on the text line related information of the text lines in each video frame through a pre-trained video attribute identification model to obtain text line features corresponding to each text line in each video frame; a second acquisition module, configured to acquire, through the video attribute identification model, text line inter-frame features corresponding to each text line based on the text line features corresponding to the at least two video frames; and a first recognition module, configured to perform attribute identification on the text line inter-frame features and determine the text attribute corresponding to each text line in the video frame to be identified; wherein the device further comprises: a first setting module, configured to set consecutive first position identifiers for target information of all text lines in the video frame, wherein the target information comprises the text information; and wherein the second acquisition module specifically comprises: a fourth acquisition sub-module, configured to acquire, through a second processing layer in the video attribute identification model, the first position identifiers of the text lines in the at least two video frames; and a second determining module, configured to determine, for any text line in any video frame, the text line inter-frame features of the text line according to the text line features, the display duration, and the first position identifier corresponding to the text line, wherein the display duration represents how long the text line appears in the video frames.
8. An electronic device, comprising: A processor; a memory for storing the processor-executable instructions; Wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
9. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the method of any of claims 1 to 6.
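The inter-frame feature construction recited in claims 1 and 5 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claims do not say how text lines are matched across frames or how the display duration and first position identifier are encoded, so matching by identical text and the sinusoidal `encode_scalar` encoder are assumptions, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def assign_position_ids(frame_lines: Sequence[str]) -> Dict[str, int]:
    """Give the text lines of one frame consecutive identifiers 0, 1, 2, ..."""
    return {text: idx for idx, text in enumerate(frame_lines)}


def display_durations(frames: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Count in how many of the given frames each text line appears.

    Assumption: two lines are the same line if their text is identical.
    """
    counts: Dict[str, int] = {}
    for frame_lines in frames:
        for text in set(frame_lines):
            counts[text] = counts.get(text, 0) + 1
    return counts


def encode_scalar(value: float, dim: int = 4) -> List[float]:
    """Hypothetical encoder: sinusoidal vector of even length `dim`."""
    vec: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        vec.extend([math.sin(value * freq), math.cos(value * freq)])
    return vec


def inter_frame_feature(text_line_feature: Sequence[float],
                        display_duration: int,
                        position_id: int,
                        dim: int = 4) -> List[float]:
    """Splice the text line feature, duration vector and position vector."""
    return (list(text_line_feature)
            + encode_scalar(display_duration, dim)
            + encode_scalar(position_id, dim))


# Two consecutive frames; "Breaking news" is visible in both.
frames = [["Breaking news", "Subscribe"], ["Breaking news"]]
durations = display_durations(frames)
position_ids = assign_position_ids(frames[0])
feature = inter_frame_feature([0.1, 0.2, 0.3],
                              durations["Breaking news"],
                              position_ids["Breaking news"])
```

Here the display duration is measured in frames and the resulting vector is a plain concatenation ("splicing"); a trained model would more plausibly use learned embeddings for both quantities.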

Description

Text attribute identification method and device, electronic equipment and storage medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to a text attribute identification method, a device, an electronic device and a storage medium.

Background

With the development of computer technology, short videos have become popular with users because they are timely, fast, dynamic and entertaining. As a result, a growing number of services, such as video searching and video file extraction, rely on the attributes of the text content in a video. In the related art, text attribute recognition is mainly performed on text in a single-frame image. Because only the text information of a single frame is analyzed, the text attributes obtained by this approach have low accuracy.

Disclosure of Invention

The disclosure provides a text attribute identification method, a device, an electronic device and a storage medium, so as to at least solve the problem of how to improve the accuracy of text attribute identification.
The technical scheme of the present disclosure is as follows: According to a first aspect of an embodiment of the present disclosure, there is provided a text attribute identification method, including: acquiring text line related information of text lines in each video frame based on text lines in at least two consecutive video frames in a video to be identified; performing feature encoding on the text line related information of the text lines in each video frame through a pre-trained video attribute identification model to obtain text line features corresponding to each text line in each video frame; acquiring, through the video attribute identification model, text line inter-frame features corresponding to each text line based on the text line features corresponding to the at least two video frames; and performing attribute identification on the text line inter-frame features to determine the text attribute corresponding to each text line in the video frame to be identified.
Optionally, the text line related information includes text information, image information and position information, and the acquiring text line related information of text lines in each video frame based on text lines in at least two consecutive video frames in the video to be identified includes: for any video frame in the at least two video frames, acquiring local image features of the area where each text line in the video frame is located as the image information; acquiring the text content of each text line in the video frame as the text information; and acquiring the position coordinates of each text line in the video frame as the position information.
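As a rough illustration of the three kinds of text line related information described above, the following Python sketch bundles text, position and image information for each detected line. It assumes an upstream detector has already produced `(text, box)` pairs and uses the mean pixel intensity of the line's region as a stand-in for the local image features; the class, the functions and the box convention are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (x0, y0, x1, y1) with exclusive right and bottom edges (assumed convention)
Box = Tuple[int, int, int, int]


@dataclass
class TextLineInfo:
    text: str                   # text information: content of the line
    position: Box               # position information: coordinates in the frame
    image_feature: List[float]  # image information: local image features


def local_image_feature(frame: Sequence[Sequence[float]], box: Box) -> List[float]:
    """Stand-in feature: mean intensity of the pixels inside the box."""
    x0, y0, x1, y1 = box
    pixels = [p for row in frame[y0:y1] for p in row[x0:x1]]
    return [sum(pixels) / len(pixels)] if pixels else [0.0]


def collect_text_line_info(frame: Sequence[Sequence[float]],
                           detections: Sequence[Tuple[str, Box]]) -> List[TextLineInfo]:
    """Build the text line related information for every line in one frame."""
    return [TextLineInfo(text, box, local_image_feature(frame, box))
            for text, box in detections]


# A tiny 2x4 grayscale frame with one detected text line on the right half.
frame = [[0, 0, 10, 10],
         [0, 0, 30, 30]]
infos = collect_text_line_info(frame, [("hello", (2, 0, 4, 2))])
```

In practice the image feature would come from a visual backbone applied to the cropped region; the single-number mean is used here only to keep the sketch self-contained.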
Optionally, the performing feature encoding on the text line related information of the text lines in each video frame through a pre-trained video attribute identification model to obtain the text line features corresponding to each text line in each video frame includes: generating, through a first processing layer in the video attribute identification model, first splicing information of each text line based on the position information and the text information of the text line, and second splicing information of each text line based on the position information and the image information of the text line; splicing the first splicing information and the second splicing information of each text line to obtain the splicing information of the text line; and generating the text line features of each text line based on the splicing information of the text line.
Optionally, the method further comprises: for any video frame, determining, through the first processing layer, a text sequence identifier for the text information and an image sequence identifier for the image information of each text line in the video frame, wherein the text sequence identifier and the image sequence identifier of the same text line are identical; and setting consecutive first position identifiers for target information of all text lines in the video frame, wherein the target information comprises the text information and the image information. The generating the first splicing information of each text line based on the position information and the text information of the text line comprises: splicing the position information of the text line, the text information, the first position identifier of the text information, and the text sequence identifier to obtain the first splicing information of the text line; the generating the second splicing information of each text line based on the position information and the image information of the text line comprises splicing the
position info