CN-121999405-A - Video labeling method, device, equipment and medium
Abstract
The application discloses a video annotation method, apparatus, device, and medium. The method comprises: obtaining an original video to be annotated and preprocessing it into an image frame sequence; extracting, for each image frame in the sequence, entity structured data for each entity contained in the frame to obtain an entity identification sequence, which contains a structured data set corresponding to each image frame; combining the structured data sets in the entity identification sequence to identify a scene tag for each image frame and a behavior state for each entity contained therein, thereby obtaining a spatiotemporal knowledge graph; and combining the entity identification sequence with the spatiotemporal knowledge graph to generate a video annotation file. Manual annotation is therefore unnecessary, which reduces annotation cost and improves annotation efficiency; and because the spatiotemporal knowledge graph is constructed from the structured data sets of multiple image frames, it represents entity associations across different image frames, improving the accuracy and ensuring the quality of the video annotation data.
Inventors
- ZHENG QINGFANG
- WU ZHIHONG
- WU SHAOCONG
- XIAO WEIWEI
- HOU KUI
- LI GUANGWEI
- WANG ZHIJIA
- QI CHONGYING
- XING PANPAN
Assignees
- Peng Cheng Laboratory (鹏城实验室)
- Unit 93209 of the Chinese People's Liberation Army (中国人民解放军93209部队)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-11
Claims (12)
- 1. A video annotation method, comprising: acquiring an original video to be annotated, and preprocessing the original video to obtain an image frame sequence; extracting, for each image frame in the image frame sequence, entity structured data of each entity contained in the image frame to obtain an entity identification sequence, wherein the entity identification sequence contains a structured data set corresponding to each image frame; combining the structured data sets in the entity identification sequence, and identifying a scene tag corresponding to each image frame and a behavior state corresponding to each entity contained therein, to obtain a spatiotemporal knowledge graph; and combining the entity identification sequence and the spatiotemporal knowledge graph to generate a video annotation file.
- 2. The method according to claim 1, wherein the entity structured data includes an entity identification result, target mask information, and key-point coordinates, and extracting, for each image frame in the image frame sequence, the entity structured data of each entity contained in the image frame to obtain an entity identification sequence comprises: identifying, for each image frame in the image frame sequence, an entity bounding box and an entity category label corresponding to each entity in the image frame; performing, for each image frame in the image frame sequence, pixel-level instance segmentation on each entity in the image frame to obtain target mask information corresponding to each entity in each image frame; performing, for each image frame in the image frame sequence, pose recognition on the target entity in the image frame to obtain key-point coordinates corresponding to each entity in each image frame; and obtaining entity structured data corresponding to each entity in each image frame based on the entity bounding box, the entity category label, the target mask information, and the key-point coordinates corresponding to each entity in each image frame, and constructing the entity identification sequence by combining the entity structured data corresponding to each entity in each image frame.
- 3. The method of claim 2, wherein identifying, for each image frame in the image frame sequence, an entity bounding box and an entity category label corresponding to each entity in the image frame comprises: performing, for each image frame in the image frame sequence, feature extraction on the image frame to obtain initial image features corresponding to each image frame; acquiring a preset query vector, and computing the preset query vector against the initial image features corresponding to each image frame through a cross-attention mechanism to obtain intermediate image features corresponding to each image frame; and identifying the entity bounding box and the entity category label corresponding to each entity based on the intermediate image features corresponding to each image frame (a minimal sketch of this step appears after the claims).
- 4. The method according to claim 2, wherein performing, for each image frame in the image frame sequence, pixel-level instance segmentation on each entity in the image frame to obtain the target mask information corresponding to each entity in each image frame comprises: performing feature extraction on each image frame in the image frame sequence through a convolutional neural network to obtain low-level visual features corresponding to each image frame; decoding the low-level visual features with a pixel-level decoder to obtain a multi-scale feature map; acquiring a preset entity query vector, and inputting the preset entity query vector and the multi-scale feature map into an adaptive decoder for multi-layer cross-attention computation to obtain at least one entity category and an entity mask vector associated with each entity category; and generating a high-resolution feature map from the multi-scale feature map, and performing a dot-product operation between each entity mask vector and the high-resolution feature map to obtain the target mask information corresponding to each entity in each image frame (see the sketch after the claims).
- 5. The method according to claim 2, wherein performing, for each image frame in the image frame sequence, pose recognition on the target entity in the image frame to obtain key-point coordinates corresponding to each entity in each image frame comprises: inputting each image frame in the image frame sequence into a high-resolution network; representing each image frame through a high-resolution backbone network of the high-resolution network to obtain high-resolution features corresponding to each image frame, representing each image frame through a low-resolution branch network parallel to the high-resolution backbone network to obtain low-resolution features corresponding to each image frame, and fusing the high-resolution features and the low-resolution features of each image frame to obtain high-resolution fusion features corresponding to each image frame; acquiring predefined key-point information, and combining the high-resolution fusion features with the predefined key-point information through a convolutional network to generate a heat map corresponding to each preset key point, wherein the heat map contains a pixel value for each key pixel; and determining the key-point coordinates from the pixel values of the key pixels contained in each heat map (see the sketch after the claims).
- 6. The method according to claim 1, wherein combining the structured data sets in the entity identification sequence and identifying a scene tag corresponding to each image frame and a behavior state corresponding to each entity contained therein, to obtain a spatiotemporal knowledge graph, comprises: identifying, based on the structured data sets in the entity identification sequence, the behavior state corresponding to each entity contained in each image frame and the scene label corresponding to each image frame; constructing an event node corresponding to each image frame by combining the scene label of the image frame and the behavior states of the entities contained therein; determining temporal relations among the event nodes corresponding to the image frames, and computing a causal weight between any two temporally adjacent event nodes according to those temporal relations; and chaining temporally adjacent event nodes according to the temporal relations and the causal weights to obtain the spatiotemporal knowledge graph (see the sketch after the claims).
- 7. The method of claim 1, wherein combining the entity identification sequence and the spatiotemporal knowledge graph to generate a video annotation file comprises: performing annotation data verification on the entity identification sequence and the spatiotemporal knowledge graph to obtain a verification result; when the verification result indicates a pass, determining the entity identification sequence and the spatiotemporal knowledge graph as verified annotation data; and packaging the verified annotation data to obtain the video annotation file.
- 8. The method of claim 7, wherein performing annotation data verification on the entity identification sequence and the spatiotemporal knowledge graph to obtain a verification result comprises: determining a first confidence of each structured data set in the entity identification sequence and a second confidence of the spatiotemporal knowledge graph through a Monte Carlo algorithm; performing information cross-comparison between the entity identification sequence and the spatiotemporal knowledge graph to obtain a comparison result; when the first confidence and the second confidence are both greater than or equal to a preset confidence threshold, and the comparison result shows that the entity identification sequence and the spatiotemporal knowledge graph are consistent, determining that the verification result is a pass; and when the first confidence or the second confidence is less than the preset confidence threshold, or the comparison result shows that the entity identification sequence and the spatiotemporal knowledge graph are inconsistent, determining that the verification result is a failure (see the sketch after the claims).
- 9. The method of claim 7, further comprising: when the verification result is a failure, determining an abnormal time node corresponding to the abnormal data in target data, wherein the target data is at least one of the entity identification sequence and the spatiotemporal knowledge graph; collecting, from the original video, at least one target image frame at other time nodes adjacent to the abnormal time node; performing entity identification on the target image frames to obtain a target structured data set corresponding to each target image frame, and identifying a target behavior state corresponding to each entity based on the target structured data sets; verifying the target structured data sets and the target behavior states to obtain a secondary verification result; and when the secondary verification result indicates that the target structured data sets and the target behavior states pass verification, determining that the verification result is a pass.
- 10. A video annotation apparatus, comprising: an acquisition unit configured to acquire an original video to be annotated and preprocess the original video to obtain an image frame sequence; an extraction unit configured to extract, for each image frame in the image frame sequence, entity structured data of each entity contained in the image frame to obtain an entity identification sequence, wherein the entity identification sequence contains a structured data set corresponding to each image frame; a construction unit configured to combine the structured data sets in the entity identification sequence and identify a scene tag corresponding to each image frame and a behavior state corresponding to each entity contained therein, to obtain a spatiotemporal knowledge graph; and a generating unit configured to combine the entity identification sequence and the spatiotemporal knowledge graph to generate a video annotation file.
- 11. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the video annotation method according to any one of claims 1 to 9 when executing the computer program.
- 12. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the video annotation method of any of claims 1 to 9.
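The query-based detection step of claim 3 resembles a DETR-style decoder head. The following PyTorch sketch is illustrative only: the module name, dimensions, query count, and head layout are assumptions, not details disclosed in the patent.

```python
import torch
import torch.nn as nn

class QueryDetectionHead(nn.Module):
    """Learned queries cross-attend to per-frame image features; linear heads
    then predict an entity category label and a bounding box per query."""

    def __init__(self, num_queries: int = 100, dim: int = 256, num_classes: int = 80):
        super().__init__()
        # the claim's "preset query vector", modeled here as learned embeddings
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for a "no entity" class
        self.box_head = nn.Linear(dim, 4)                  # (cx, cy, w, h), normalized

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (B, H*W, dim) flattened "initial image features"
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        inter, _ = self.cross_attn(q, frame_feats, frame_feats)  # intermediate features
        return self.class_head(inter), self.box_head(inter).sigmoid()

head = QueryDetectionHead()
logits, boxes = head(torch.randn(1, 49, 256))  # e.g. a flattened 7x7 feature grid
```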
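The dot-product step at the end of claim 4 can be written compactly. A minimal sketch, assuming mask vectors of dimension C and a C-channel high-resolution feature map (all shapes are illustrative, not taken from the patent):

```python
import torch

def entity_masks(mask_vectors: torch.Tensor, highres_feats: torch.Tensor) -> torch.Tensor:
    """Combine one mask vector per entity query with a high-resolution feature
    map via a dot product to obtain a per-entity binary mask.

    mask_vectors:  (B, Q, C)    one embedding per entity query
    highres_feats: (B, C, H, W) feature map upsampled from the multi-scale map
    returns:       (B, Q, H, W) boolean masks (the claim's target mask information)
    """
    logits = torch.einsum("bqc,bchw->bqhw", mask_vectors, highres_feats)
    return logits.sigmoid() > 0.5

masks = entity_masks(torch.randn(1, 10, 256), torch.randn(1, 256, 120, 160))
```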
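Claim 5's final step reads key-point coordinates off per-key-point heat maps. A common way to do this, shown here as an assumption rather than the patent's exact rule, is to take the peak pixel of each map:

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (K, H, W), one heat map per predefined key point, each pixel
    holding a confidence value. Returns (K, 2) integer (x, y) peak coordinates."""
    k, h, w = heatmaps.shape
    flat_peak = heatmaps.reshape(k, -1).argmax(dim=1)    # index of the hottest pixel
    return torch.stack([flat_peak % w, flat_peak // w], dim=1)  # (x, y) per key point

coords = heatmaps_to_keypoints(torch.rand(17, 64, 48))  # e.g. 17 human body key points
```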
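Claim 6 chains per-frame event nodes into a graph with causal weights on temporally adjacent pairs. The patent does not disclose the causal-weight formula, so the sketch below substitutes a simple stand-in: the share of shared entities whose behavior state persists between the two nodes.

```python
from dataclasses import dataclass

@dataclass
class EventNode:
    frame_index: int
    scene_label: str
    behaviors: dict  # entity id -> behavior state, e.g. {"person_1": "walking"}

def chain_events(nodes: list) -> list:
    """Connect temporally adjacent event nodes in series; the weight is a
    stand-in for the claim's causal weight, not the patent's formula."""
    edges = []
    for a, b in zip(nodes, nodes[1:]):  # temporally adjacent pairs
        shared = set(a.behaviors) & set(b.behaviors)
        stable = sum(a.behaviors[e] == b.behaviors[e] for e in shared)
        weight = stable / len(shared) if shared else 0.0
        edges.append((a.frame_index, b.frame_index, weight))
    return edges

graph = chain_events([
    EventNode(0, "street", {"person_1": "walking", "car_1": "moving"}),
    EventNode(1, "street", {"person_1": "walking", "car_1": "stopped"}),
])  # -> [(0, 1, 0.5)]
```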
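Claim 8's Monte Carlo confidence can be approximated with Monte Carlo dropout; this is one plausible reading, not necessarily the patent's algorithm. A minimal sketch:

```python
import torch
import torch.nn as nn

def mc_confidence(model: nn.Module, x: torch.Tensor, n_samples: int = 20) -> float:
    """Run the model repeatedly with dropout left active and score confidence
    as 1 minus the mean predictive spread. (A production version would enable
    only the dropout layers, since .train() also switches batch-norm stats.)"""
    model.train()  # keeps nn.Dropout stochastic at inference time
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    return 1.0 - preds.std(dim=0).mean().item()

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 5))
passed = mc_confidence(net, torch.randn(1, 16)) >= 0.9  # preset confidence threshold
```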
Description
Video labeling method, device, equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a video annotation method, apparatus, device, and medium.

Background

Video data, as a carrier of dynamic real-world records, carries spatiotemporally continuous behavior patterns, environment changes, and multi-modal interaction information, and is a key entry point for artificial intelligence (AI) models to understand complex physical laws and human actions. Video annotation provides learnable basic data for downstream task models through structural analysis of video content (such as target category identification, target motion trajectory annotation, and action temporal-type annotation). In the related art, video data is annotated frame by frame either manually or by a single detection model. However, manual annotation is inefficient and costly, and subjective bias during manual annotation makes the resulting data inconsistent; single-model automatic annotation relies on detection over individual video frames and can produce false or missed detections for long-range, occluded, or motion-blurred frames, reducing the accuracy of the video annotation data and affecting its quality.

Disclosure of Invention

The embodiments of the present application provide a video annotation method, apparatus, device, and medium, which improve the efficiency of video data annotation, improve the accuracy of the video annotation data, and ensure its quality.

In a first aspect, the present application provides a video annotation method, including: acquiring an original video to be annotated, and preprocessing the original video to obtain an image frame sequence; extracting, for each image frame in the image frame sequence, entity structured data of each entity contained in the image frame to obtain an entity identification sequence, wherein the entity identification sequence contains a structured data set corresponding to each image frame; combining the structured data sets in the entity identification sequence, and identifying a scene tag corresponding to each image frame and a behavior state corresponding to each entity contained therein, to obtain a spatiotemporal knowledge graph; and combining the entity identification sequence and the spatiotemporal knowledge graph to generate a video annotation file, as sketched below.
In a second aspect, the present application provides a video annotation apparatus, including: an acquisition unit configured to acquire an original video to be annotated and preprocess the original video to obtain an image frame sequence; an extraction unit configured to extract, for each image frame in the image frame sequence, entity structured data of each entity contained in the image frame to obtain an entity identification sequence, wherein the entity identification sequence contains a structured data set corresponding to each image frame; a construction unit configured to combine the structured data sets in the entity identification sequence and identify a scene tag corresponding to each image frame and a behavior state corresponding to each entity contained therein, to obtain a spatiotemporal knowledge graph; and a generating unit configured to combine the entity identification sequence and the spatiotemporal knowledge graph to generate a video annotation file.

In some embodiments, the extraction unit is further configured to: identify, for each image frame in the image frame sequence, the entity bounding box and entity category label corresponding to each entity in the image frame, obtaining an entity identification result corresponding to each entity in each image frame; perform, for each image frame in the image frame sequence, pixel-level instance segmentation on each entity in the image frame to obtain target mask information corresponding to each entity in each image frame; perform, for each image frame in the image frame sequence, pose recognition on the target entity in the image frame to obtain key-point coordinates corresponding to each entity in each image frame; and obtain entity structured data corresponding to each entity in each image frame based on the entity bounding box, the entity category label, the target mask information, and the key-point coordinates corresponding to each entity in each image frame, and construct the entity identification sequence by combining the entity structured data corresponding to each entity in each image frame.

In some embodiments, the extraction unit is further configured to: perform, for each image frame in the image frame sequence, feature extraction on the image frame to obtain initial image features