EP-3951617-B1 - VIDEO DESCRIPTION INFORMATION GENERATION METHOD, VIDEO PROCESSING METHOD, AND CORRESPONDING DEVICES

EP 3951617 B1

Inventors

  • WANG, Bairui
  • MA, Lin
  • JIANG, Wenhao
  • LIU, Wei

Dates

Publication Date
20260506
Application Date
20200316

Claims (9)

  1. A method for generating video description information in order to provide a textual understanding of a video, performed by a processor of an electronic device when the processor executes instructions stored in a memory of the electronic device, the method comprising:
     obtaining a video feature sequence on a frame level corresponding to the video, the video feature sequence being a sequence formed by combining video features of all video image frames, wherein the video features are extracted from each video image frame of the video by using a convolutional neural network (S101);
     generating a global part-of-speech sequence feature of the video according to the video feature sequence (S102), the global part-of-speech sequence feature being a feature of a sequence of parts of speech, the parts of speech each comprising an attribute of a character, a word, or a phrase;
     generating natural language description information of the video according to the global part-of-speech sequence feature and the video feature sequence (S103); and
     automatically outputting the natural language description information,
     wherein the generating a global part-of-speech sequence feature of the video according to the video feature sequence (S102) comprises: determining a fused feature of the video according to the video feature sequence; and generating the global part-of-speech sequence feature of the video based on the fused feature of the video by utilizing a first neural network,
     wherein the determining a fused feature of the video according to the video feature sequence comprises: determining weights corresponding to time instants of the first neural network, wherein the weights are weights of frame features in the video feature sequence; and fusing the frame features in the video feature sequence respectively according to the weights corresponding to the time instants, to obtain the fused features of the video that correspond to the time instants,
     wherein the generating natural language description information of the video according to the global part-of-speech sequence feature and the video feature sequence (S103) comprises: determining a fused feature of the video according to the video feature sequence; and generating the natural language description information of the video according to the global part-of-speech sequence feature and the fused feature of the video by utilizing a second neural network,
     wherein the time instants of the first neural network are different points in time at which fused features $\phi_t(Z)$ are inputted to the first neural network, the time instants being timesteps t in a recurrent neural network, RNN, with each fused feature $\phi_t(Z)$ being input at timestep t;
     wherein the determining weights corresponding to time instants of the first neural network comprises: obtaining a weight corresponding to a current time instant according to a part-of-speech sequence feature determined at a previous time instant and the frame features in the video feature sequence, wherein the weight of the i-th frame feature in the video feature sequence that corresponds to the time instant t is $\alpha_i^t$, to obtain fused features of the video corresponding to the time instants:
     $$\phi_t(Z^H) = \sum_{i=1}^{m} \alpha_i^t h_i$$
     where $\phi_t(Z^H)$ represents a fused feature obtained at the time instant t of the first neural network, $h_i$ represents the i-th frame feature, and $\alpha_i^t$ represents a weight dynamically allocated to the i-th frame feature at the time instant t according to an attention mechanism, which satisfies:
     $$\sum_{i=1}^{m} \alpha_i^t = 1$$
     wherein the second neural network obtains the global part-of-speech sequence feature and the fused feature of the video as inputs and outputs the natural language description information, and the generating the natural language description information of the video according to the global part-of-speech sequence feature and the fused feature of the video by utilizing a second neural network comprises: obtaining prediction guided information corresponding to a current time instant in the global part-of-speech sequence feature according to word information corresponding to a previous time instant and the global part-of-speech sequence feature; obtaining word information corresponding to the current time instant according to the fused feature of the video and the prediction guided information by utilizing the second neural network; and generating the natural language description information of the video according to word information corresponding to time instants; and
     wherein the obtaining prediction guided information corresponding to a current time instant in the global part-of-speech sequence feature according to word information determined at a previous time instant and the global part-of-speech sequence feature comprises: obtaining the prediction guided information corresponding to the current time instant in the global part-of-speech sequence feature according to the word information determined at the previous time instant and the global part-of-speech sequence feature by utilizing a context guided network, wherein the prediction guided information $g_t$ is expressed as:
     $$g_t = f(s_{t-1}, \text{global part-of-speech sequence feature})$$
     where $s_{t-1}$ represents word information predicted at the previous time instant, and f represents a function of the context guided network.
  2. The method according to claim 1, wherein the video feature sequence includes time series information and is obtained by: extracting time series information of the video feature sequence according to a time series relationship between the video features in the video feature sequence in a time direction, and fusing the extracted time series information with the video feature sequence.
  3. The method according to claim 1, wherein the first neural network is a long short-term memory network.
  4. The method according to claim 1, wherein the determining a fused feature of the video according to the video feature sequence comprises: determining weights corresponding to time instants of the second neural network, wherein the weights are weights of frame features in the video feature sequence; and fusing the frame features in the video feature sequence respectively according to the weights corresponding to the time instants, to obtain the fused features of the video that correspond to the time instants.
  5. A video processing method based on natural language description information of a video, performed by an electronic device, the method comprising: obtaining natural language description information of a video, wherein the natural language description information of the video is obtained by utilizing the method according to any one of claims 1 to 4 (S201); and processing the video based on the natural language description information (S202).
  6. The video processing method according to claim 5, wherein the processing the video comprises at least one of: video classification, video retrieval, and generating prompt information corresponding to the video.
  7. The video processing method according to claim 6, further comprising: in a case that the processing the video comprises the video classification and after the natural language description information of the video is obtained, inputting the obtained natural language description information of the video to a classification network, outputting a text feature of the natural language description information, inputting the text feature to a classifier of the classification network, and outputting a classification result of the video; and in a case that the processing the video comprises generating the prompt information corresponding to the video and after the natural language description information of the video is obtained, converting the obtained natural language description information into audio information as the prompt information corresponding to the video.
  8. An electronic device, comprising a processor and a memory, wherein the memory stores instructions, and the instructions, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 7.
  9. A computer-readable storage medium, storing a computer instruction, a program, a code set, or an instruction set, wherein the computer instruction, the program, the code set, or the instruction set, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 7.
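
The attention-based fusion recited in claims 1 and 4 assigns each frame feature $h_i$ a weight $\alpha_i^t$ at timestep t, with the weights summing to one, and forms the fused feature $\phi_t(Z^H)$ as the weighted sum. The following is a minimal, non-limiting sketch of such a fusion in PyTorch; the module name, the layer sizes, and the use of the previous recurrent state as the attention query are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionFusion(nn.Module):
    """Illustrative attention fusion: phi_t = sum_i alpha_i^t * h_i with
    sum_i alpha_i^t = 1 (softmax).  Layer sizes and query choice are assumptions."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, attn_dim)    # projects frame features h_i
        self.query_proj = nn.Linear(hidden_dim, attn_dim)  # projects the recurrent state from t-1
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats: torch.Tensor, prev_state: torch.Tensor):
        # frame_feats: (batch, m, feat_dim) -- the frame-level video feature sequence
        # prev_state:  (batch, hidden_dim)  -- e.g. the RNN state from the previous timestep
        energy = torch.tanh(self.frame_proj(frame_feats)
                            + self.query_proj(prev_state).unsqueeze(1))  # (batch, m, attn_dim)
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)         # (batch, m), sums to 1
        fused = torch.bmm(alpha.unsqueeze(1), frame_feats).squeeze(1)    # (batch, feat_dim)
        return fused, alpha

if __name__ == "__main__":
    # Example with assumed sizes: 2 videos, m = 26 frames, 1536-d CNN features, 512-d RNN state
    fusion = TemporalAttentionFusion(feat_dim=1536, hidden_dim=512)
    phi_t, alpha_t = fusion(torch.randn(2, 26, 1536), torch.randn(2, 512))
    print(phi_t.shape, alpha_t.sum(dim=1))  # (2, 1536); weights sum to 1 per video
```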
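Claim 1 further recites a context guided network that produces prediction guided information $g_t = f(s_{t-1}, \text{global part-of-speech sequence feature})$, which, together with the fused video feature, drives the second (word-generating) neural network. The sketch below illustrates one decoding step under stated assumptions: the gated form of f, the use of an LSTM cell as the second network, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ContextGuidedDecoderStep(nn.Module):
    """One decoding timestep: g_t = f(s_{t-1}, pos_feat) guides word prediction.
    The gated form of f and the LSTMCell decoder are illustrative assumptions."""

    def __init__(self, word_dim: int, pos_dim: int, feat_dim: int, hidden_dim: int, vocab: int):
        super().__init__()
        self.guide = nn.Sequential(                       # context guided network f(.)
            nn.Linear(word_dim + pos_dim, pos_dim),
            nn.Sigmoid(),
        )
        self.decoder = nn.LSTMCell(word_dim + pos_dim + feat_dim, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab)

    def forward(self, prev_word_emb, pos_feat, fused_feat, state):
        # prev_word_emb: embedding of the word predicted at t-1 (s_{t-1})
        # pos_feat:      global part-of-speech sequence feature
        # fused_feat:    attention-fused video feature for timestep t
        gate = self.guide(torch.cat([prev_word_emb, pos_feat], dim=-1))
        g_t = gate * pos_feat                             # prediction guided information
        h, c = self.decoder(torch.cat([prev_word_emb, g_t, fused_feat], dim=-1), state)
        logits = self.word_out(h)                         # scores over the vocabulary at t
        return logits, (h, c)
```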
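Claim 2 obtains the video feature sequence by extracting time series information along the time direction and fusing it back into the sequence. One plausible realization, assumed purely for illustration, runs a bidirectional LSTM over the frame features and concatenates its outputs with the original features.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Illustrative temporal encoding for claim 2: a BiLSTM extracts time series
    information along the frame axis; its output is fused (concatenated) with
    the original frame-level features.  The BiLSTM choice is an assumption."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, m, feat_dim) frame-level CNN features in time order
        temporal, _ = self.rnn(frame_feats)                # (batch, m, 2 * hidden_dim)
        return torch.cat([frame_feats, temporal], dim=-1)  # fused feature sequence
```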
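Claim 7 feeds the generated description into a classification network that first produces a text feature and then classifies the video from that feature. A minimal sketch follows, assuming a mean-pooled bag-of-embeddings text encoder and a linear classifier; neither component is specified by the claim.

```python
import torch
import torch.nn as nn

class DescriptionClassifier(nn.Module):
    """Illustrative classification network for the generated description:
    token embeddings are mean-pooled into a text feature, then classified."""

    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices of the natural language description
        text_feature = self.embed(token_ids).mean(dim=1)  # (batch, embed_dim) text feature
        return self.classifier(text_feature)              # classification result of the video
```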

Description

This application claims priority to Chinese Patent Application No. 201910263207.0, titled "METHOD AND APPARATUS FOR GENERATING VIDEO DESCRIPTION INFORMATION, AND VIDEO PROCESSING METHOD AND APPARATUS", filed with the China National Intellectual Property Administration on April 2, 2019.

FIELD

Embodiments of the present disclosure relate to the technical field of image processing, and particularly to a method and an apparatus for generating video description information, and a video processing method and apparatus.

BACKGROUND

Artificial Intelligence (AI) is a theory, a method, a technology, and an application system in which a digital computer, or a machine controlled by a digital computer, simulates, extends, and expands human intelligence to perceive an environment, obtain knowledge, and use knowledge so as to obtain an optimal result. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new type of intelligent machine capable of responding in a manner similar to human intelligence. Artificial intelligence focuses on studying the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline that relates to a wide range of fields, including both hardware technologies and software technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interaction systems, and electromechanical integration. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

Computer vision (CV) technology is a science that studies how to cause a machine to "see"; that is, a camera and a computer are used in place of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on a target, and graphic processing is performed so that the computer produces an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to establish artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and further include biometric recognition technologies such as face recognition and fingerprint recognition.

Natural language processing (NLP) is an important research direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; studies in this field therefore relate to natural languages, that is, the languages people use in daily life, and natural language processing is closely related to linguistics. Natural language processing technologies generally include text processing, semantic understanding, machine translation, question answering by robots, and knowledge graphs.

Against the background of the steady development of the Internet and big data, demand for multimedia information is growing explosively, so that traditional information processing technologies cannot meet the needs of multimedia data in tasks such as labeling and description. Describing a video, an image, or the like in natural language is simple for humans but a difficult task for machines: it requires machines to overcome the semantic difficulty of image understanding and to correctly integrate computer vision and natural language processing. Research in this direction has attracted extensive attention and can be effectively applied in the fields of security, home furnishing, medical treatment, and teaching.

The conventional technology can already implement automatic description of a video by a machine to a certain extent. However, in the conventional technology, extracted frame-level features of a video are converted into video-level features, and then the video-level features directly function as an input of a decoder network to obta