WO-2026092679-A1 - VIDEO ANNOTATION METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
Abstract
A video annotation method, comprising: automatically determining, by means of keyframe matching, a keyframe containing a target object from a video to be annotated; segmenting the video using the keyframe as a video segmentation point; and separately performing target tracking processing on the two image sequences located before and after the keyframe in the video. Thus, without changing the working principle of the target tracking algorithm, the method eliminates the need to separately determine whether the first frame of the video to be annotated contains the target object, thereby automating video annotation and effectively improving the efficiency and accuracy of video annotation.
Inventors
- XU, Bowen
- WANG, Meng
Assignees
- 网易灵动(杭州)科技有限公司
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-10-31
- Priority Date: 2024-11-01
Claims (10)
- A video annotation method, comprising: determining, based on a standard keyframe of a target object, an image frame matching the standard keyframe from a video to be annotated as a keyframe of the target object in the video to be annotated, wherein the standard keyframe includes position marker information of the target object; combining the standard keyframe, the keyframe, and a first image sequence to obtain a first sub-video, and combining the standard keyframe, the keyframe, and a second image sequence to obtain a second sub-video, wherein the first image sequence represents an image sequence obtained by reversing the image frames in the video to be annotated that precede the keyframe, and the second image sequence represents an image sequence composed of the image frames in the video to be annotated that follow the keyframe; performing target tracking processing on the first sub-video and the second sub-video respectively, using the target object as the tracking target, to obtain target tracking processing results corresponding to the first sub-video and the second sub-video respectively; and recombining the target tracking processing results corresponding to the first sub-video and the second sub-video according to the order of the image frames in the video to be annotated, to obtain a target tracking result of the target object in the video to be annotated.
- The video annotation method according to claim 1, further comprising: determining, based on the video shooting scene of the video to be annotated, a target image frame containing the target object from a plurality of original videos corresponding to the video shooting scene; and inputting the target image frame into an image segmentation model, labeling the target object contained in the target image frame by means of the image segmentation model, and outputting the labeled target image frame as the standard keyframe of the target object.
- The video annotation method according to claim 1, wherein recombining the target tracking processing results corresponding to the first sub-video and the second sub-video according to the order of the image frames in the video to be annotated comprises: obtaining the target tracking processing result of the first image sequence from the target tracking processing result corresponding to the first sub-video; obtaining the target tracking processing result of the second image sequence from the target tracking processing result corresponding to the second sub-video; obtaining the target tracking processing result of the keyframe from the target tracking processing result corresponding to the first sub-video or the second sub-video; and adjusting, according to the order of the image frames in the video to be annotated, the order of the image frames in the target tracking processing result of the first image sequence, the target tracking processing result of the keyframe, and the target tracking processing result of the second image sequence, to obtain the target tracking result of the target object in the video to be annotated.
- The video annotation method according to claim 1, further comprising: when the target object includes a plurality of entity objects of different categories, determining, based on the target tracking results corresponding to the plurality of entity objects in the video to be annotated, whether there is an abnormal image frame in which the position markers in the target tracking results of the plurality of entity objects overlap; and if the abnormal image frame is determined to exist, performing segmentation prediction on the local image region of the abnormal image frame in which the position markers overlap, and determining, based on the segmentation prediction result, the entity object to which each pixel unit in the local image region belongs.
- The video annotation method according to claim 4, wherein performing segmentation prediction on the local image region of the abnormal image frame in which the position markers overlap comprises: for each of a plurality of sub-image regions contained in the local image region, determining, based on the region boundary enclosing the sub-image region, the entity object whose position marker is connected to the region boundary as the segmentation prediction result of the sub-image region.
- The video annotation method according to claim 4, wherein performing segmentation prediction on the local image region of the abnormal image frame in which the position markers overlap further comprises: inputting the local image region into an image segmentation model, and performing segmentation prediction on the entity objects contained in the local image region by means of the image segmentation model, to obtain the segmentation prediction result of the local image region.
- The video annotation method according to claim 4, wherein performing segmentation prediction on the local image region of the abnormal image frame in which the position markers overlap further comprises: for each of a plurality of sub-image regions contained in the local image region, determining, based on the region boundary enclosing the sub-image region, the entity object whose position marker is connected to the region boundary as the segmentation prediction result of the sub-image region; inputting the local image region into an image segmentation model, and performing segmentation prediction on the entity objects contained in the local image region by means of the image segmentation model, to obtain the segmentation prediction result of the local image region; if the segmentation prediction result of the sub-image region matches the segmentation prediction result of the local image region, determining the entity object to which the sub-image region belongs based on the segmentation prediction result of the sub-image region; and if the segmentation prediction result of the sub-image region does not match the segmentation prediction result of the local image region, removing the sub-image region from the local image region.
- A video annotation apparatus, comprising: a matching module, configured to determine, based on a standard keyframe of a target object, an image frame matching the standard keyframe from a video to be annotated as a keyframe of the target object in the video to be annotated, wherein the standard keyframe includes position marker information of the target object; a grouping module, configured to combine the standard keyframe, the keyframe, and a first image sequence to obtain a first sub-video, and to combine the standard keyframe, the keyframe, and a second image sequence to obtain a second sub-video, wherein the first image sequence represents an image sequence obtained by reversing the image frames in the video to be annotated that precede the keyframe, and the second image sequence represents an image sequence composed of the image frames in the video to be annotated that follow the keyframe; a target tracking module, configured to perform target tracking processing on the first sub-video and the second sub-video respectively, using the target object as the tracking target, to obtain target tracking processing results corresponding to the first sub-video and the second sub-video respectively; and a recombination module, configured to recombine the target tracking processing results corresponding to the first sub-video and the second sub-video according to the order of the image frames in the video to be annotated, to obtain a target tracking result of the target object in the video to be annotated.
- An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory via the bus; and the machine-readable instructions, when executed by the processor, perform the steps of the video annotation method according to any one of claims 1 to 7.
- A computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of the video annotation method as described in any one of claims 1 to 7.
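The overlap check of claim 4 can be illustrated with a short sketch. This is not part of the patent text: the function names (`boxes_overlap`, `find_abnormal_frames`), the `(x1, y1, x2, y2)` box format, and the layout of `tracks` are all assumptions made for illustration; the disclosure does not specify how position markers are represented.

```python
def boxes_overlap(box_a, box_b) -> bool:
    """Axis-aligned intersection test on (x1, y1, x2, y2) position markers."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def find_abnormal_frames(tracks: dict) -> list:
    """Return frame indices where the position markers of two different
    entity objects overlap ("abnormal image frames" in claim 4).

    `tracks` maps an entity name to its per-frame list of boxes
    (None where the entity is absent from a frame).
    """
    names = list(tracks)
    n_frames = len(next(iter(tracks.values())))
    abnormal = []
    for f in range(n_frames):
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                a, b = tracks[names[i]][f], tracks[names[j]][f]
                if a and b and boxes_overlap(a, b):
                    abnormal.append(f)  # record each abnormal frame once
                    break
            else:
                continue
            break
    return abnormal
```

Each abnormal frame found this way would then be passed to the segmentation-prediction step of claims 5 to 7 to decide which entity each overlapping pixel unit belongs to.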
Description
A video annotation method, apparatus, device, and storage medium

Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202411554570.5, filed on November 1, 2024, entitled "A video annotation method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.

Technical Field
This disclosure relates to the field of image processing technology, and more specifically, to a video annotation method, apparatus, device, and storage medium.

Background
In scenarios such as excavator operation and loader material handling, it is often necessary to record the working process of the vehicles in the scene on video. However, in addition to the target objects related to the video shooting task, such as the working vehicles and the materials being shoveled, the video often contains background elements unrelated to the video shooting task, such as trees and warehouses. Therefore, in order to improve the data quality of the video data, users need to label the target objects contained in the video data so that they can more intuitively and quickly locate the labeled target objects in the annotated video. Currently, target tracking algorithms can be used to track specified objects in videos and mark their positions within each frame. However, since these algorithms typically only work on image sequences in which the target object appears in the first frame and the images were captured consecutively, they cannot track video data in which the target object does not appear in the first frame. This hinders the automation of video annotation, resulting in reduced efficiency and accuracy.
Summary of the Invention
According to one aspect of this disclosure, a video annotation method is provided, comprising: determining, based on standard keyframes of a target object, image frames matching the standard keyframes from a video to be annotated as keyframes of the target object in the video to be annotated; wherein the standard keyframes include position marker information of the target object; combining the standard keyframes, the keyframes, and a first image sequence to obtain a first sub-video; and combining the standard keyframes, the keyframes, and a second image sequence to obtain a second sub-video; wherein the first image sequence represents an image sequence obtained by reversing image frames in the video to be annotated that precede the keyframes; and the second image sequence represents an image sequence composed of image frames in the video to be annotated that follow the keyframes; performing target tracking processing on the first sub-video and the second sub-video respectively, using the target object as the tracking target, to obtain target tracking processing results corresponding to the first sub-video and the second sub-video respectively; and recombining the target tracking processing results corresponding to the first sub-video and the second sub-video respectively according to the order of the image frames in the video to be annotated to obtain target tracking results of the target object in the video to be annotated.
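The splitting and recombination steps above can be sketched in a few lines. This sketch is not from the patent: the function and variable names (`build_sub_videos`, `recombine`, `key_idx`) are hypothetical, frames are stand-in values rather than images, and the tracker itself is omitted. The point it illustrates is why prepending the standard keyframe works: the target object is guaranteed to appear in the first frame of each sub-video, which is exactly the precondition the tracking algorithm requires.

```python
def build_sub_videos(frames: list, key_idx: int, standard_keyframe):
    """Split a video at the matched keyframe into two trackable sub-videos.

    Each sub-video begins with the standard keyframe (which carries the
    target's position markers) followed by the matched keyframe. The first
    sub-video then continues with the pre-keyframe frames in reversed
    order; the second continues with the post-keyframe frames in order.
    """
    first_seq = list(reversed(frames[:key_idx]))   # first image sequence
    second_seq = frames[key_idx + 1:]              # second image sequence
    first_sub = [standard_keyframe, frames[key_idx]] + first_seq
    second_sub = [standard_keyframe, frames[key_idx]] + second_seq
    return first_sub, second_sub

def recombine(first_results: list, second_results: list) -> list:
    """Restore original frame order from the two per-sub-video result lists.

    Index 0 of each list corresponds to the standard keyframe and is
    dropped; index 1 is the keyframe, taken once from the first sub-video.
    """
    before = list(reversed(first_results[2:]))  # un-reverse pre-keyframe results
    return before + [first_results[1]] + second_results[2:]
```

Running tracking over both sub-videos and then applying `recombine` yields one result per original frame, in the original order, without the tracker ever seeing a first frame that lacks the target.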
According to one aspect of this disclosure, a video annotation apparatus is provided, comprising: a matching module, configured to determine, based on standard keyframes of a target object, image frames matching the standard keyframes from a video to be annotated as keyframes of the target object in the video to be annotated; wherein the standard keyframes include position marker information of the target object; a grouping module, configured to combine the standard keyframes, keyframes, and a first image sequence to obtain a first sub-video, and combine the standard keyframes, keyframes, and a second image sequence to obtain a second sub-video; wherein the first image sequence represents an image sequence obtained by reversing image frames in the video to be annotated that precede the keyframes; and the second image sequence represents an image sequence composed of image frames in the video to be annotated that follow the keyframes; a target tracking module, configured to perform target tracking processing on the first sub-video and the second sub-video respectively, using the target object as the tracking target, to obtain target tracking processing results corresponding to the first sub-video and the second sub-video respectively; and a recombination module, configured to recombine the target tracking processing results corresponding to the first sub-video and the second sub-video respectively, based on the order of image frames in the video to be annotated, to obtain target tracking results of the target object in the video to be annotated.

According to one aspect of this disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the video annotation method described above.
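The matching module's keyframe search can also be sketched. This is an illustration only: the patent does not specify a matching criterion, so the normalized pixel correlation below is a stand-in (a real system would more likely use learned image features), and the name `match_keyframe`, the flat-list frame format, and the `threshold` parameter are all assumptions.

```python
def match_keyframe(frames, standard, threshold=0.9):
    """Return the index of the first frame whose normalized correlation
    with the standard keyframe reaches `threshold`, or -1 if none matches.

    Frames are flat lists of grayscale pixel values of equal length.
    """
    def normalize(pixels):
        # Zero-mean, unit-variance normalization (guarding constant frames).
        n = len(pixels)
        mean = sum(pixels) / n
        var = sum((p - mean) ** 2 for p in pixels) / n
        std = var ** 0.5 or 1e-9
        return [(p - mean) / std for p in pixels]

    std_vec = normalize(standard)
    for i, frame in enumerate(frames):
        f = normalize(frame)
        # Mean of elementwise products: 1.0 for a perfect match.
        score = sum(a * b for a, b in zip(f, std_vec)) / len(f)
        if score >= threshold:
            return i
    return -1
```

The returned index is the split point passed to the grouping module; a return of -1 would indicate that the target object does not appear in the video at all, in which case no annotation is produced.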