
CN-116152710-B - Video instance segmentation method based on cross-frame instance association

CN 116152710 B

Abstract

The invention discloses a video instance segmentation method based on cross-frame instance association. A video frame sequence to be segmented is input into a multi-scale feature extractor to obtain feature maps at different scales; spatio-temporal features are extracted by a transformer encoder; fused spatio-temporal features are then obtained through a pixel decoder; finally, embedded vectors are produced by the transformer decoder, and a dot product operation between the embedded vectors and the high-resolution spatio-temporal features yields the instance segmentation result. By learning the spatio-temporal correlation of dynamic instances with a multi-scale, spatio-temporally oriented approach, the method establishes reliable cross-frame instance associations, improves the accuracy of the video instance segmentation task, and achieves state-of-the-art performance on two popular datasets compared with recent methods.

Inventors

  • Liu Sheng
  • Chen Junhao
  • Chen Ruixiang
  • Guo Bingnan
  • Zhang Feng

Assignees

  • Zhejiang University of Technology (浙江工业大学)

Dates

Publication Date
2026-05-05
Application Date
2023-02-08

Claims (5)

  1. A video instance segmentation method based on cross-frame instance association, comprising: constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder; inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps at different scales, the feature maps being, in order of scale, C2, C3, C4, and C5; inputting the extracted feature maps C3, C4, and C5 into the transformer encoder to extract spatio-temporal features; inputting the feature map C2 and the spatio-temporal features into the pixel decoder, separating the output spatio-temporal features into features corresponding to the scales of the feature maps C3, C4, and C5, and then progressively upsampling and cross-fusing them to obtain fused spatio-temporal features at the three scales; inputting the fused spatio-temporal features into the transformer decoder to obtain final embedded vectors; and performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal features to obtain the instance segmentation result; wherein inputting the extracted feature maps C3, C4, and C5 into the transformer encoder to extract the spatio-temporal features comprises: position-encoding the feature maps C3, C4, and C5, performing a tensor flattening operation on each, and inputting the result into a deformable attention module to generate basic features F; position-encoding the feature maps C3, C4, and C5 and inputting them into an S2S attention module to generate basic spatio-temporal features, the S2S attention module comprising an intra-scale temporal attention module and a cross-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the cross-scale spatio-temporal attention module adopts a deformable attention mechanism (see the first sketch after the claims); and fusing the two same-dimension features to obtain the final spatio-temporal features.
  2. The video instance segmentation method based on cross-frame instance association of claim 1, wherein training the video instance segmentation network comprises preprocessing an acquired video frame sequence to generate training sample data, comprising: taking two frame sequences of equal length from the collected video frame sequence dataset, one as a target set and the other as a source set; establishing a one-to-one correspondence between the image frames of the target set and of the source set in temporal order; and copying and pasting the instances in the source set images onto the corresponding image frames of the target set to generate a new frame sequence, which is added to the video frame sequence dataset (see the copy-paste sketch after the claims).
  3. The video instance segmentation method based on cross-frame instance association of claim 1, wherein inputting the feature map C2 and the spatio-temporal features into the pixel decoder, separating the output spatio-temporal features into features corresponding to the scales of the feature maps C3, C4, and C5, and then progressively upsampling and cross-fusing them to obtain the fused spatio-temporal features comprises: separating the spatio-temporal features into features corresponding to the scales of the feature maps C3, C4, and C5; upsampling the C5-scale feature to the same scale as the C4-scale feature by bilinear interpolation and cross-fusing it with that feature to generate a fused spatio-temporal feature at the C4 scale; upsampling this fused spatio-temporal feature to the same scale as the C3-scale feature by bilinear interpolation and cross-fusing it with that feature to generate a fused spatio-temporal feature at the C3 scale; and upsampling this fused spatio-temporal feature to the same scale as the feature map C2 by bilinear interpolation and cross-fusing it with the feature map C2 to generate the high-resolution fused spatio-temporal feature (see the fusion sketch after the claims).
  4. The video instance segmentation method based on cross-frame instance association of claim 1, wherein the transformer decoder comprises three decoder units corresponding to the different scales and one MLP module connected in series, and wherein inputting the fused spatio-temporal features into the transformer decoder to obtain the final embedded vectors comprises: inputting each fused spatio-temporal feature into the decoder unit of the corresponding scale, the input feature serving as the attention mask, key, and value of that decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit; in each decoder unit, first performing a cross-attention operation and then a self-attention operation; and passing the query features output by the last decoder unit through the MLP module to generate the final embedded vectors (see the decoder sketch after the claims).
  5. The video instance segmentation method based on cross-frame instance association of claim 4, wherein the three decoder units of different scales are iterated a preset number of times.
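
The structures recited above are easier to grasp with small illustrative sketches. The following blocks use PyTorch, which is an assumption, since the patent names no framework; all class names, variable names, and dimensions are hypothetical choices, not the patent's implementation. First, the intra-scale temporal attention of claim 1's S2S module: each spatial location of one scale attends across the T frames of that scale, while the module's cross-scale deformable attention stage is not sketched.

```python
import torch
import torch.nn as nn

class IntraScaleTemporalAttention(nn.Module):
    """Temporal self-attention within one scale, per claim 1 (sketch)."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) position-encoded features of a single scale
        B, T, H, W, C = x.shape
        # Fold the spatial grid into the batch so attention runs over time only.
        seq = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        out, _ = self.attn(seq, seq, seq)      # attend across frames
        seq = self.norm(seq + out)             # residual + layer norm
        return seq.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

# One such block per scale (C3, C4, C5) is the natural reading of the claim;
# the cross-scale deformable attention stage then mixes the three scales.
x = torch.randn(2, 5, 16, 16, 256)   # 2 clips, 5 frames, 16x16 grid
print(IntraScaleTemporalAttention()(x).shape)  # (2, 5, 16, 16, 256)
```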
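Next, the copy-paste preprocessing of claim 2, sketched with NumPy arrays; the helper name and the boolean-mask format are illustrative. Updating the target clip's ground-truth masks for the newly occluded pixels is omitted.

```python
import numpy as np

def copy_paste_clip(target_frames, source_frames, source_masks):
    """Paste the source clip's instance pixels onto the target clip.

    target_frames, source_frames: equal-length lists of (H, W, 3) uint8
    frames; source_masks: per-frame (H, W) bool arrays marking the
    instance pixels to transplant."""
    new_frames = []
    # zip() realizes the time-ordered one-to-one frame pairing of claim 2.
    for tgt, src, mask in zip(target_frames, source_frames, source_masks):
        out = tgt.copy()
        out[mask] = src[mask]   # copy-paste the instance onto the target frame
        new_frames.append(out)
    return new_frames           # the new sequence joins the training dataset
```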
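The coarse-to-fine fusion of claim 3: each coarser feature is bilinearly upsampled to the next scale and cross-fused with the feature already at that scale. The claim does not fix the fusion operator, so addition followed by a 3x3 convolution stands in for it here, and the time dimension is folded into the batch for brevity.

```python
import torch
import torch.nn.functional as F
from torch import nn

def fuse(coarse, fine, conv):
    # coarse: (B, C, h, w); fine: (B, C, 2h, 2w) -- adjacent scales
    up = F.interpolate(coarse, size=fine.shape[-2:],
                       mode="bilinear", align_corners=False)
    return conv(up + fine)      # stand-in for the claim's "cross fusion"

conv = nn.Conv2d(256, 256, 3, padding=1)
f5 = torch.randn(1, 256, 8, 8)     # C5-scale slice of the encoder output
f4 = torch.randn(1, 256, 16, 16)   # C4-scale slice
f3 = torch.randn(1, 256, 32, 32)   # C3-scale slice
c2 = torch.randn(1, 256, 64, 64)   # raw C2 feature map from the extractor
e4 = fuse(f5, f4, conv)            # fused spatio-temporal feature, C4 scale
e3 = fuse(e4, f3, conv)            # fused spatio-temporal feature, C3 scale
e2 = fuse(e3, c2, conv)            # high-resolution fused feature, C2 scale
```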
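Finally, the transformer decoder of claims 4 and 5: each unit runs cross-attention from the instance queries into one scale's features and then self-attention among the queries, and the three scale-specific units are cycled a preset number of times before an MLP emits the embeddings. Feed-forward sublayers and the construction of the attention mask from the features are omitted from this sketch.

```python
import torch
import torch.nn as nn

class DecoderUnit(nn.Module):
    """One decoder unit of claim 4: cross-attention, then self-attention."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_ = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, queries, feats, attn_mask=None):
        # queries: (B, Q, C); feats: (B, N, C) flattened scale features;
        # attn_mask would restrict cross-attention per claim 4 (omitted).
        x, _ = self.cross(queries, feats, feats, attn_mask=attn_mask)
        queries = self.norm1(queries + x)        # cross-attention first
        x, _ = self.self_(queries, queries, queries)
        return self.norm2(queries + x)           # then self-attention

units = nn.ModuleList(DecoderUnit() for _ in range(3))  # one per scale
mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

queries = torch.zeros(1, 100, 256)               # initialized query features
feats = [torch.randn(1, n * n, 256) for n in (8, 16, 32)]  # flattened scales
for _ in range(3):                               # claim 5: preset iterations
    for unit, feat in zip(units, feats):
        queries = unit(queries, feat)            # queries chain between units
embeddings = mlp(queries)                        # final embedded vectors
```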

Description

Video instance segmentation method based on cross-frame instance association

Technical Field

The application belongs to the technical field of video instance segmentation, and particularly relates to a video instance segmentation method based on cross-frame instance association.

Background

Video instance segmentation aims to detect, segment, and track target instances in video simultaneously, which benefits many downstream tasks, including autonomous driving, video surveillance, and video understanding. Compared with image instance segmentation, video instance segmentation is more challenging because object instances in video must be accurately segmented and tracked under factors such as appearance deformation, fast motion, and occlusion. With the introduction of the DETR and deformable DETR frameworks, Transformer-based end-to-end video instance segmentation has become the recent mainstream. Following the video-in, video-out paradigm, VisTR applied a Transformer for the first time to the video instance segmentation problem and used instance queries to obtain instance sequences from the video; however, this approach learns one embedding for each instance of each frame, which makes it difficult to process variable-length or long video sequences. To reduce the explosive computation of VisTR and to build cross-frame instance associations, subsequent research has utilized target queries and proposed novel variants, namely memory tokens that model temporal context and query-separation mechanisms that build cross-frame instance associations. These methods essentially detect instances from single-frame features and then perform cross-frame instance matching; this deliberately separates images from video and irreversibly discards the rich spatio-temporal context present in video. Furthermore, existing approaches focus mainly on network improvements but pay little attention to the datasets required for training and testing. Research shows that current datasets easily cause overfitting during training because the amount of training data is insufficient.

Disclosure of the Invention

It is an object of the present application to provide a video instance segmentation method based on cross-frame instance association, referred to in the present application as IAST, which overcomes the problems raised in the background above.
In order to achieve the above purpose, the technical scheme of the application is as follows. A video instance segmentation method based on cross-frame instance association comprises: constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder; inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps at different scales, the feature maps being, in order of scale, C2, C3, C4, and C5; inputting the extracted feature maps C3, C4, and C5 into the transformer encoder to extract spatio-temporal features; inputting the feature map C2 and the spatio-temporal features into the pixel decoder, separating the output spatio-temporal features into features corresponding to the scales of the feature maps C3, C4, and C5, and then progressively upsampling and cross-fusing them to obtain fused spatio-temporal features at the three scales; inputting the fused spatio-temporal features into the transformer decoder to obtain final embedded vectors; and performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal features to obtain the instance segmentation result.

Further, training the video instance segmentation network includes preprocessing an acquired video frame sequence to generate training sample data, comprising: taking two frame sequences of equal length from the collected video frame sequence dataset, one as a target set and the other as a source set; establishing a one-to-one correspondence between the image frames of the target set and of the source set in temporal order; and copying and pasting the instances in the source set images onto the corresponding image frames of the target set to generate a new frame sequence, which is added to the video frame sequence dataset.

Further, inputting the extracted feature maps C3, C4, and C5 into the transformer encoder to extract the spatio-temporal features comprises: position-encoding the feature maps C3, C4, and C5, performing a tensor flattening operation on each, and inputting the result into a deformable attention module to generate basic features F; and position-encoding the feature maps C3, C4, and C5 and inputting them into an S2S attention module to generate basic spatio-temporal features.
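
The dot-product readout in the scheme above reduces to a single contraction between the embedded vectors and the high-resolution fused features. A minimal sketch (PyTorch, assumed; batch, query count, and spatial shapes are illustrative):

```python
import torch

emb = torch.randn(1, 100, 256)          # (B, Q, C) final embedded vectors
feat = torch.randn(1, 5, 256, 64, 64)   # (B, T, C, H, W) high-res fused features
logits = torch.einsum("bqc,btchw->bqthw", emb, feat)  # per-query mask logits
masks = logits.sigmoid() > 0.5          # binary instance masks per frame
```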