CN-116503784-B - Short video event detection method and system based on deep dynamic semantic association

Abstract

The invention discloses a short video event detection method and system based on deep dynamic semantic association. Frame importance information is acquired from the visual features and reinforced, yielding differentiated frame importance scores and a short video feature representation weighted by those scores. According to the attention characteristics of the feature space, the weighted feature representation guides attention learning to produce an inter-frame self-attention-enhanced feature representation. A specific graph representation of the short video sample is then constructed, and a hidden attribute relevance learning network based on dynamic graph convolution learns the hidden attributes of complex events and the relevance among them, producing a short video feature representation with latent semantic information awareness. Event category scores are obtained from this representation, completing the short video event detection task. The invention provides a new approach to the short video event detection problem and effectively strengthens feature representation capability.

Inventors

JING PEIGUANG
SONG XIAOYI
SU YUTING

Assignees

Tianjin University (天津大学)

Dates

Publication Date: 20260512
Application Date: 20230505

Claims (7)

1. A short video event detection method based on deep dynamic semantic association, characterized by comprising the following steps:
collecting short videos, extracting visual features of the short videos, and, based on those visual features, acquiring and reinforcing frame importance information of the visual features to obtain differentiated frame importance scores and a short video feature representation weighted by the frame importance scores;
according to the attention characteristics of the feature space, using the short video feature representation weighted by the frame importance scores to guide attention learning, and, in cooperation with the inherent relevance between frames and the feature space, obtaining an inter-frame self-attention-enhanced feature representation;
regarding the hidden attributes of short video events as nodes and the degree of association between hidden attributes as edges, and constructing a specific graph representation of the short video sample;
obtaining event category scores from the short video feature representation, and completing the short video event detection task;
wherein the process of acquiring and reinforcing the frame importance information of the visual features to obtain the differentiated frame importance scores and the weighted short video feature representation comprises: reinforcing the frame importance information to the greatest extent through a joint structure of a variational autoencoder and a generative adversarial network, thereby obtaining the differentiated frame importance scores and the short video feature representation weighted by the frame importance scores;
the short video feature representation weighted by the frame importance scores is the element-wise product of (i) the frame importance scores updated by the embedded joint structure of the variational autoencoder and the generative adversarial network, expanded along the feature dimension, and (ii) the original visual features extracted from the short video, where T is the number of key frames of the short video, B is the number of short video samples, and D is the number of feature dimensions;
and wherein the process of obtaining a short video feature representation with latent semantic information awareness comprises: constructing a hidden attribute activation mapping unit to capture a hidden attribute response matrix; inputting the hidden attribute response matrix into a dynamic relevance unit; the dynamic relevance unit obtaining relevance features among the hidden attributes by constructing a static graph and a dynamic graph; and finally obtaining a feature representation with latent semantic information relevance.
2. The short video event detection method based on deep dynamic semantic association according to claim 1, wherein the expression of the inter-frame self-attention-enhanced feature representation is: $F = W_o\,\mathrm{Concat}(G_1, G_2, \ldots, G_L)$, where $F$ is the inter-frame self-attention-enhanced feature representation, $L$ is the number of heads of the multi-head attention mechanism, $\mathrm{Concat}(\cdot)$ denotes the concatenation of matrices, $W_o$ is a weight parameter to be learned, $G_l$ denotes the $l$-th head, and $D_v = D/L$ is the scaling factor.
3. The short video event detection method based on deep dynamic semantic association according to claim 1, wherein the expression of the short video feature representation with latent semantic information awareness is defined in terms of the following quantities: two activation functions; a convolution layer used for dimension conversion; the static association representation of the hidden attributes; a dynamic graph association matrix and a dynamic weight update matrix; the dynamic association representation of the hidden attributes; a global representation; the feature representation containing latent semantic relevance; the number of hidden attributes; the number of feature dimensions obtained after training the static part; and the number of feature dimensions obtained after training the dynamic part.
4. The short video event detection method based on deep dynamic semantic association according to claim 1, wherein obtaining event category scores from the short video feature representation comprises: passing the short video feature representation with latent semantic information awareness through a global average pooling layer and a normalized exponential function (softmax) to obtain the event category scores; that is, the event category score is the normalized exponential function applied to the globally average-pooled feature representation, where GAP denotes the global average pooling layer.
5. A short video event detection system based on deep dynamic semantic association, characterized by comprising:
a frame importance evaluation module, used for collecting short videos, extracting visual features of the short videos, and, based on those visual features, acquiring and reinforcing frame importance information of the visual features to obtain differentiated frame importance scores and a short video feature representation weighted by the frame importance scores;
an inter-frame self-attention enhancement module, connected with the frame importance evaluation module, used for guiding attention learning with the short video feature representation weighted by the frame importance scores according to the attention characteristics of the feature space, and, in cooperation with the inherent relevance between frames and the feature space, obtaining an inter-frame self-attention-enhanced feature representation;
a latent semantic information perception module, connected with the inter-frame self-attention enhancement module, used for regarding the hidden attributes of short video events as nodes and the degree of association between hidden attributes as edges, and constructing a specific graph representation of the short video sample;
a category score calculation module, connected with the latent semantic information perception module, used for obtaining event category scores from the short video feature representation and completing the short video event detection task;
wherein the process of acquiring and reinforcing the frame importance information of the visual features to obtain the differentiated frame importance scores and the weighted short video feature representation comprises: reinforcing the frame importance information to the greatest extent through a joint structure of a variational autoencoder and a generative adversarial network, thereby obtaining the differentiated frame importance scores and the short video feature representation weighted by the frame importance scores;
the short video feature representation weighted by the frame importance scores is the element-wise product of (i) the frame importance scores updated by the embedded joint structure of the variational autoencoder and the generative adversarial network, expanded along the feature dimension, and (ii) the original visual features extracted from the short video, where T is the number of key frames of the short video, B is the number of short video samples, and D is the number of feature dimensions;
and wherein the process of obtaining a short video feature representation with latent semantic information awareness comprises: constructing a hidden attribute activation mapping unit to capture a hidden attribute response matrix; inputting the hidden attribute response matrix into a dynamic relevance unit; the dynamic relevance unit obtaining relevance features among the hidden attributes by constructing a static graph and a dynamic graph; and finally obtaining a feature representation with latent semantic information relevance.
6. The short video event detection system based on deep dynamic semantic association according to claim 5, wherein the frame importance evaluation module comprises an indication vector calculation unit, a weight updating unit, an indicator, an encoder, a decoder, a discriminator and a weight distribution unit; the indication vector calculation unit is used for generating the initial importance weights of the short video key frames; the weight updating unit and the indicator work cooperatively to update the importance weights; the encoder and the decoder together form a variational autoencoder for mining the latent importance information of the sample, while the decoder and the discriminator together form a generative adversarial network, and the feedback values learned by the discriminator act on the weight updating unit and the indicator to guide the updating of the importance weights.
7. The short video event detection system based on deep dynamic semantic association according to claim 5, wherein the latent semantic information perception module comprises a hidden attribute activation mapping unit and a dynamic relevance unit; the hidden attribute activation mapping unit is used for capturing a hidden attribute response matrix and inputting it into the dynamic relevance unit; the dynamic relevance unit is used for obtaining relevance features among the hidden attributes by constructing a static graph and a dynamic graph, and finally obtaining a feature representation with latent semantic information relevance.
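
As a rough illustration of the scoring step described in claim 4, the following Python sketch applies global average pooling followed by a normalized exponential function (softmax). The tensor layout is an assumption, since the claims do not state it: the sketch takes the latent-semantic feature representation to have shape (B, C, D) with one row per event category, pools over the last axis, and normalizes over the C categories. All function names are hypothetical.

```python
import numpy as np

def global_average_pool(Z):
    """Average over the feature axis. Assumed layout: Z has shape (B, C, D)."""
    return Z.mean(axis=-1)  # -> (B, C)

def softmax(z, axis=-1):
    """Normalized exponential function, shifted by the max for numerical stability."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def event_category_scores(Z):
    """Event category scores as softmax(GAP(Z)); one probability per category."""
    return softmax(global_average_pool(Z), axis=-1)
```

Under this layout each row of the result sums to 1, and the predicted event for a sample is the index of its largest score.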

Description

Short video event detection method and system based on deep dynamic semantic association

Technical Field

The invention belongs to the technical field of multimedia and big data analysis, and particularly relates to a short video event detection method and system based on deep dynamic semantic association.

Background

With the rapid development of the short video industry, short video content analysis, typified by short video event detection, has received increasing attention. Short video event detection helps address the problem of short video supervision, supporting the sustained and healthy development of the industry. However, as the number of short videos grows and their information becomes more complex and varied, how to use existing short video information to quickly and efficiently retrieve the short videos that users need remains an open problem. Artificial intelligence techniques, typified by deep learning, have developed rapidly in many fields and are widely used in video information processing. Applying them to short video event detection can both advance the field of computer vision and improve user experience, and therefore has research value as well as practical application value.

Disclosure of Invention

To address the above problem, the invention provides a short video event detection method and system based on deep dynamic semantic association.
The short video event detection method based on deep dynamic semantic association comprises the following steps: collecting short videos, extracting visual features of the short videos, and, based on those visual features, acquiring and reinforcing frame importance information of the visual features to obtain differentiated frame importance scores and a short video feature representation weighted by the frame importance scores; according to the attention characteristics of the feature space, using the short video feature representation weighted by the frame importance scores to guide attention learning, and, in cooperation with the inherent relevance between frames and the feature space, obtaining an inter-frame self-attention-enhanced feature representation; regarding the hidden attributes of short video events as nodes and the degree of association between hidden attributes as edges, and constructing a specific graph representation of the short video sample; and obtaining event category scores from the short video feature representation, completing the short video event detection task.
Preferably, the process of acquiring and reinforcing the frame importance information of the visual features to obtain the differentiated frame importance scores and the weighted short video feature representation comprises: reinforcing the frame importance information to the greatest extent through a joint structure of a variational autoencoder and a generative adversarial network, thereby obtaining the differentiated frame importance scores and the short video feature representation weighted by the frame importance scores. The weighted short video feature representation is the element-wise product (⊙) of the frame importance scores, updated by the embedded joint structure of the variational autoencoder and the generative adversarial network and expanded along the feature dimension, with the original visual features extracted from the short video, where T is the number of key frames of the short video, B is the number of short video samples, and D is the number of feature dimensions.
Preferably, the expression of the inter-frame self-attention-enhanced feature representation is: $F = W_o\,\mathrm{Concat}(G_1, G_2, \ldots, G_L)$, where $F$ is the inter-frame self-attention-enhanced feature representation, $L$ is the number of heads of the multi-head attention mechanism, $\mathrm{Concat}(\cdot)$ denotes the concatenation of matrices, $W_o$ is a weight parameter to be learned, $G_l$ denotes the $l$-th head, and $D_v = D/L$ is the scaling factor.
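
The two expressions above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the features are laid out as (B, T, D), each head is computed as standard scaled dot-product attention with division by the square root of D_v, the weighted features feed the queries, keys and values alike, and the projections Wq, Wk, Wv are hypothetical names that do not appear in the text. With row-vector features the output projection W_o is applied on the right, which corresponds to the left-multiplied form in the expression.

```python
import numpy as np

def weight_frames(X, scores):
    """Element-wise product of expanded frame importance scores and features.

    X: (B, T, D) visual features; scores: (B, T) frame importance scores.
    """
    expanded = np.broadcast_to(scores[:, :, None], X.shape)  # expand along D
    return expanded * X

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, L):
    """Inter-frame self-attention with L heads; returns F of shape (B, T, D)."""
    B, T, D = X.shape
    if D % L != 0:
        raise ValueError("D must be divisible by the number of heads L")
    Dv = D // L  # per-head dimension, D_v = D / L

    def split_heads(M):  # (B, T, D) -> (B, L, T, Dv)
        return M.reshape(B, T, L, Dv).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    # Assumed head definition: scaled dot-product attention across frames.
    attn = softmax(Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dv), axis=-1)  # (B, L, T, T)
    G = attn @ V  # heads G_1..G_L stacked as (B, L, T, Dv)
    concat = G.transpose(0, 2, 1, 3).reshape(B, T, D)  # Concat(G_1, ..., G_L)
    return concat @ Wo  # output projection W_o
```

A typical call would first compute `Xw = weight_frames(X, scores)` and then pass `Xw` to `multi_head_self_attention`, so that frames with low importance scores contribute proportionally less to the attention computation.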
Preferably, the process of obtaining a short video feature representation with latent semantic information awareness comprises: constructing a hidden attribute activation mapping unit to capture a hidden attribute response matrix; inputting the hidden attribute response matrix into a dynamic relevance unit, which obtains relevance features among the hidden attributes by constructing a static graph and a dynamic graph; and finally obtaining a feature representation with latent semantic information relevance. Preferably, the expression of the short video