
CN-116246211-B - First-view video description generation method based on a spatio-temporal grouping attention mechanism

CN116246211B

Abstract

The invention provides a first-view (egocentric) video description generation method based on a spatio-temporal grouping attention mechanism. The method extracts a feature map from an input first-view video as region features; a position coding module outputs the position codes corresponding to the region features, and the region features are updated by adding the position codes to them. A spatial grouping attention module merges the height and width dimensions of the region features into a single spatial dimension and computes grouping attention twice over all features along the spatial dimension, fully modeling the interactions among them. The spatial and temporal dimensions are then exchanged, and a temporal grouping attention module computes grouping attention twice over all features along the temporal dimension. Its output is averaged over the spatial dimension to obtain the first-view video feature code. The method supplies the decoder with video features carrying richer spatio-temporal information, adapts to the jitter and inter-frame variation present in first-view video, and improves the quality of the generated description sentences.
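Read as tensor operations, the encoding pipeline above amounts to a few reshapes around the two grouping attention stages. Below is a minimal shape walk-through, assuming PyTorch and illustrative sizes (L = 8 key frames, a 7 × 7 feature grid, D = 512); the attention stages themselves are sketched after the claims:

```python
# Hypothetical shape walk-through of the encoder; sizes are illustrative.
import torch

L, H, W, D = 8, 7, 7, 512
x = torch.randn(L, H, W, D)        # region features from the CNN backbone

# Merge the height and width dimensions into one spatial dimension.
x = x.reshape(L, H * W, D)         # (L, H*W, D)

# ... spatial grouping attention acts along the H*W axis here (twice) ...

# Exchange the spatial and temporal dimensions.
x = x.permute(1, 0, 2)             # (H*W, L, D)

# ... temporal grouping attention acts along the L axis here (twice) ...

# Average over the spatial dimension to obtain the video feature code.
video_code = x.mean(dim=0)         # (L, D)
print(video_code.shape)            # torch.Size([8, 512])
```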

Inventors

  • WANG SHISEN
  • PAN LILI
  • LI HONGLIANG
  • HE NAIYU
  • ZHOU YUXUAN
  • XIE JINGJING
  • LIANG YUE
  • MENG FANMAN
  • WU QINGBO
  • XU LINFENG

Assignees

  • University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date
2026-05-05
Application Date
2023-03-15

Claims (3)

  1. A first-view video description generation method based on a spatio-temporal grouping attention mechanism, comprising the steps of: uniformly downsampling an input first-view video to obtain key frames, scaling each key frame to a set size, and inputting the key frames to a pre-trained ResNet model and a position coding module; the ResNet model outputs the feature map of its last convolutional layer to form the region features of the first-view video, the position coding module outputs the position codes corresponding to the region features, the region features are updated by adding the position codes to them, and the updated region features are input to a spatial grouping attention module; the spatial grouping attention module receives the input region features, merges their height and width dimensions into a single spatial dimension, and computes grouping attention twice over all features along the spatial dimension, completing the spatial information interaction of the region features; the spatial and temporal dimensions of the region features output by the spatial grouping attention module are exchanged, and the region features with exchanged dimensions are output to a temporal grouping attention module; the temporal grouping attention module receives the input region features and computes grouping attention twice over all features along the temporal dimension, completing the temporal information interaction of the region features; the region features output by the temporal grouping attention module are averaged over the spatial dimension to obtain the code of the first-view video features; the code of the first-view video features is input to a decoder, which generates a word sequence for the video content, thereby producing the video description sentence; the spatial grouping attention module and the temporal grouping attention module have the same structure, each comprising a k-means module, a multi-head attention module, a fully connected layer, a GELU activation layer, and a layer normalization module; for an input of N × D video features X, where N is the number of features and D is the dimension of each feature, the k-means module clusters the input video features X into g groups according to a preset group number g, yielding grouping centers of size g × D; the multi-head attention module takes the input features X as the query Q and the grouping centers as the key K and value V, and computes multi-head attention to update the video features; the updated video features undergo a non-linear transformation through a fully connected layer and a GELU activation layer, the originally input video features X are added back through a residual connection, and the result is finally processed by the layer normalization module and output (code sketches of the feature-extraction step and of this grouping attention block follow the claims).
  2. The method of claim 1, wherein the position coding module computes the position information of each region feature of the first-view video using learnable parameters as position coding tables: a frame coding table of size L × D, a row coding table of size H × D, and a column coding table of size W × D, where L is the number of frames, H is the height of the region features, W is their width, and D is the dimension of each feature; for the feature at the i-th frame, j-th row, and k-th column of the region features of the first-view video, the position code is the sum of the i-th entry of the frame coding table, the j-th entry of the row coding table, and the k-th entry of the column coding table; updating the region features by adding them to the result of their position coding is then implemented as: updated feature = feature + position code (a code sketch of this module follows the claims).
  3. The method of claim 1, wherein the decoder is a long short-term memory network (LSTM), a gated recurrent unit network (GRU), or a Transformer network based on the self-attention mechanism.
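The first step of claim 1 (uniform downsampling to key frames, scaling, and taking the last convolutional feature map of a pre-trained ResNet as region features) might be sketched as follows; the frame count, input size, and ResNet-50 variant are assumptions the claim does not fix:

```python
# A sketch of key-frame sampling and region-feature extraction, assuming
# PyTorch/torchvision; num_frames and the ResNet-50 variant are assumptions.
import torch
import torchvision.models as models

def extract_region_features(video: torch.Tensor, num_frames: int = 8) -> torch.Tensor:
    """video: (T, 3, H, W) decoded frames, already scaled to a set size."""
    idx = torch.linspace(0, video.size(0) - 1, num_frames).long()  # uniform sampling
    frames = video[idx]                                            # key frames
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
    backbone.eval()
    with torch.no_grad():
        fmap = backbone(frames)        # (num_frames, 2048, h, w): last conv feature map
    return fmap.permute(0, 2, 3, 1)    # (L, h, w, D) region features

video = torch.randn(64, 3, 224, 224)   # stand-in for decoded, resized video frames
feats = extract_region_features(video)
print(feats.shape)                     # torch.Size([8, 7, 7, 2048])
```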
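Claim 1's grouping attention block (k-means grouping of the inputs, multi-head attention with the inputs as queries and the group centers as keys and values, a fully connected layer with GELU, a residual connection, and layer normalization) might look like the following; the k-means initialization and iteration count, the head count, and the layer sizes are assumptions:

```python
# A sketch of the grouping attention block from claim 1, assuming PyTorch.
# The naive k-means (random init, fixed iterations) is an assumption.
import torch
import torch.nn as nn

def kmeans_centers(x: torch.Tensor, g: int, iters: int = 10) -> torch.Tensor:
    """Cluster N x D features into g centers with plain Lloyd iterations."""
    centers = x[torch.randperm(x.size(0))[:g]].clone()   # random initialization
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)   # nearest-center assignment
        for k in range(g):
            members = x[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers

class GroupingAttention(nn.Module):
    def __init__(self, dim: int, groups: int, heads: int = 8):
        super().__init__()
        self.groups = groups
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, D). Aggregate into g grouping centers of size (g, D).
        centers = kmeans_centers(x.detach(), self.groups)
        q, kv = x.unsqueeze(0), centers.unsqueeze(0)     # add a batch dimension
        updated, _ = self.attn(q, kv, kv)                # X as Q, centers as K and V
        out = self.act(self.fc(updated.squeeze(0)))      # fully connected + GELU
        return self.norm(out + x)                        # residual + layer norm

x = torch.randn(49, 512)                  # N = 49 spatial features, D = 512
block = GroupingAttention(dim=512, groups=8)
y = block(x)                              # same shape as the input: (49, 512)
```

Because the residual connection keeps the output the same size as the input, two such blocks stack directly, matching the twice-computed grouping attention in each module of claim 1.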
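Claim 2's position coding can be read as three learnable tables, one each for the frame index, the row, and the column, whose entries are summed and added to the region features. A minimal sketch, assuming PyTorch; the table names and zero initialization are illustrative:

```python
# A sketch of the factorized position coding from claim 2, assuming PyTorch.
import torch
import torch.nn as nn

class FactorizedPositionCoding(nn.Module):
    """The code for position (i, j, k) is frame_pe[i] + row_pe[j] + col_pe[k]."""
    def __init__(self, L: int, H: int, W: int, D: int):
        super().__init__()
        self.frame_pe = nn.Parameter(torch.zeros(L, 1, 1, D))  # L x D table
        self.row_pe = nn.Parameter(torch.zeros(1, H, 1, D))    # H x D table
        self.col_pe = nn.Parameter(torch.zeros(1, 1, W, D))    # W x D table

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, H, W, D) region features; broadcasting sums the three tables.
        return x + self.frame_pe + self.row_pe + self.col_pe

pe = FactorizedPositionCoding(L=8, H=7, W=7, D=512)
x = torch.randn(8, 7, 7, 512)
print(pe(x).shape)  # torch.Size([8, 7, 7, 512])
```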

Description

First-view video description generation method based on a spatio-temporal grouping attention mechanism

Technical Field

The invention relates to video description generation technology, and in particular to a first-view video description generation technology based on a spatio-temporal grouping attention mechanism.

Background

Video description generation is a computer vision task that abstracts the features of a given video, converts them into natural language, and summarizes and reformulates the visual content in a structured way. In short, given a piece of video, the computer outputs a textual description of that video. The technology belongs to the field of video understanding and has broad application prospects. For example, with the rise of Internet short-video platforms, the number of online videos has grown rapidly, and demand for video understanding technology has grown with it: platforms classify and filter videos and help Internet users grasp video content through textual descriptions. In video surveillance systems, there is likewise an increasing demand for automatically generating textual descriptions of video, so that abnormal events can be monitored, recorded, and reported in time. In daily life, video description technology can also assist visually impaired people: wearable visual-assistance devices can provide a video description service, allowing visually impaired users to perceive their surroundings through spoken text.

From video feature coding to description sentence generation, many effective models and methods have been designed, greatly improving model performance and the quality of generated sentences. Most video description frameworks adopt an encoder-decoder structure: the encoder learns a compressed video representation from multi-modal features using methods such as convolutional neural networks (CNN) and recurrent neural networks (RNN), and the decoder generates a sentence word by word from the representation learned by the encoder, mainly using RNN-based and Transformer-based models. Many variants of CNNs, RNNs, and Transformers appear in video description systems: feature vectors are generated from image and video spatial data with a CNN and fed through a fully connected layer into an RNN/Transformer architecture to generate the word sequence. Most studies in this area have attempted to describe events using third-person video data, but describing finer-grained activities (e.g., cooking) has proven difficult, because such activities require more detail than third-person video provides. First-view (egocentric) video can capture finer-grained information about the camera wearer's activity from a closer viewpoint. However, unlike third-person video, first-view video suffers from motion blur, focus blur, and severe inter-frame variation. Most existing methods use a CNN to extract global video features; although these features contain rich semantic information, their spatial information is lost.
Disclosure of the Invention

In view of the severe inter-frame variation of first-view video and the loss of spatial information in the global features adopted by current mainstream video description generation methods, the invention provides a more reliable video feature coding method to improve description accuracy. The technical scheme adopted by the invention to solve this technical problem is a first-view video description generation method based on a spatio-temporal grouping attention mechanism, comprising the following steps: uniformly downsampling an input first-view video to obtain key frames, scaling each key frame to a set size, and inputting the key frames to a pre-trained ResNet model and a position coding module; the ResNet model outputs the feature map of its last convolutional layer to form the region features of the first-view video, the position coding module outputs the position codes corresponding to the region features, the region features are updated by adding the position codes to them, and the updated region features are input to the spatial grouping attention module; the spatial grouping attention module receives the input region features, merges their height and width dimensions into a single spatial dimension, and computes grouping attention twice over all features along the spatial dimension, completing the spatial information interaction of the region features; the spatial and temporal dimensions of the region features output by the spatial grouping attention module are exchanged, and the region features with exchanged dimensions are output to the temporal grouping attention module; the temporal grouping attention module receives the input region features and computes grouping attention twice over all features along the temporal dimension; the region features output by the temporal grouping attention module are averaged over the spatial dimension to obtain the code of the first-view video features; and the code is input to a decoder to generate a word sequence for the video content, thereby producing the video description sentence.