CN-122023804-A - Interactive segmentation method for remote sensing video moving target

CN 122023804 A

Abstract

The application belongs to the technical field of remote sensing image processing and discloses an interactive segmentation method for moving targets in remote sensing video. The method comprises: training a segmentation model on a pre-acquired data set to obtain a trained segmentation model, the segmentation model comprising a space-time enhanced image encoder, a prompt encoder, a mask decoder and a decoupled high-frequency compensation decoder; and inputting the remote sensing image or video to be segmented into the trained segmentation model to obtain a segmentation result. The space-time enhanced image encoder comprises a low-rank adaptation matrix and a temporal memory link and is used for extracting features of the input video frames; the prompt encoder is used for converting interaction prompts input by the user into high-dimensional prompt vectors; the mask decoder is used for obtaining a preliminary segmentation result based on the extracted features and the prompt vectors; and the decoupled high-frequency compensation decoder is used for extracting shallow high-frequency features from the features extracted by the space-time enhanced image encoder and refining the preliminary segmentation result based on those shallow high-frequency features.
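The abstract above describes a four-module pipeline (encoder, prompt encoder, mask decoder, high-frequency compensation decoder). Below is a minimal structural sketch of that data flow in plain Python/NumPy. All shapes, the toy "encoder", and the gradient-based stand-in for shallow high-frequency features are illustrative assumptions; the patent does not specify these implementation details.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatiotemporal_encode(frame, prev_feat=None):
    """Stand-in for the space-time enhanced image encoder: a fixed
    per-pixel projection plus a simple additive bias from the previous
    frame's features (the 'temporal memory link')."""
    feat = frame.mean(axis=-1, keepdims=True) * np.ones((1, 1, 8))  # H x W x 8
    if prev_feat is not None:
        feat = feat + 0.1 * prev_feat                               # temporal bias
    return feat

def prompt_encode(point, dim=8):
    """Map a user click (x, y) to a high-dimensional prompt vector."""
    x, y = point
    return np.sin(np.arange(dim) * (x + 2 * y + 1))

def mask_decode(feat, prompt_vec):
    """Preliminary mask: similarity between pixel features and the prompt."""
    logits = feat @ prompt_vec
    return (logits > logits.mean()).astype(float)

def hf_compensate(mask, frame):
    """Refine the mask with shallow high-frequency detail
    (here approximated by a gradient-magnitude edge map)."""
    gy, gx = np.gradient(frame.mean(axis=-1))
    edges = np.hypot(gx, gy)
    return np.clip(mask + (edges > edges.mean()) * 0.5, 0.0, 1.0)

frame = rng.random((16, 16, 3))
feat = spatiotemporal_encode(frame)
mask = mask_decode(feat, prompt_encode((4, 7)))
refined = hf_compensate(mask, frame)
print(mask.shape, refined.shape)   # (16, 16) (16, 16)
```

The sketch only shows how the modules compose: the decoder consumes encoder features plus the prompt vector, and the compensation stage refines the preliminary mask rather than recomputing it.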

Inventors

  • SHAN ZHE
  • ZHOU LEI
  • LU CONGYUAN
  • XIE XIA

Assignees

  • Hainan University (海南大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (8)

  1. An interactive segmentation method for a remote sensing video moving target, characterized by comprising the following steps: training a segmentation model using a pre-acquired data set to obtain a trained segmentation model, wherein the segmentation model comprises a space-time enhanced image encoder, a prompt encoder, a mask decoder and a decoupled high-frequency compensation decoder; and inputting the remote sensing image or video to be segmented into the trained segmentation model to obtain a segmentation result; wherein the space-time enhanced image encoder comprises a low-rank adaptation matrix and a temporal memory link and is used for extracting features of the input video frames; the prompt encoder is used for converting interaction prompts input by the user into high-dimensional prompt vectors; the mask decoder is used for obtaining a preliminary segmentation result based on the extracted features and the prompt vectors; and the decoupled high-frequency compensation decoder is used for extracting shallow high-frequency features from the features extracted by the space-time enhanced image encoder and refining the preliminary segmentation result based on the shallow high-frequency features.
  2. The interactive segmentation method for a remote sensing video moving target according to claim 1, wherein the training process of the space-time enhanced image encoder comprises: taking a pre-trained ViT as the image encoder, loading the pre-trained weights, and freezing the backbone parameters; injecting a low-rank adaptation matrix into the self-attention projection layers of the ViT, the low-rank adaptation matrix being used for capturing texture structure and scale changes of the image; constructing a temporal memory link for processing the encoded features of frame t-1 of the video, generating a temporal bias term, and injecting the temporal bias term into the self-attention computation of frame t of the video, wherein t is a positive integer greater than 1 and not exceeding the total number of frames of the video; training the low-rank adaptation matrix and the temporal memory link on the pre-acquired data set; and unfreezing and fine-tuning the last Transformer block of the ViT to inject remote-sensing-specific high-level semantic knowledge.
  3. The interactive segmentation method for a remote sensing video moving target according to claim 1, wherein the training process of the mask decoder and the decoupled high-frequency compensation decoder comprises: extracting an intermediate feature map from a shallow layer of the space-time enhanced image encoder; performing channel dimension reduction on the intermediate feature map so that the reduced feature map is aligned, in the channel dimension, with the upsampled features of the mask decoder; fusing the reduced intermediate feature map with the semantic features of the mask decoder to form a decoupled high-frequency compensation path; and, during training, alternately updating the parameters of the mask decoder and the parameters of the decoupled high-frequency compensation decoder.
  4. The interactive segmentation method for a remote sensing video moving target according to claim 1, wherein the inference process of the trained segmentation model comprises: determining a target-area center point in the video frame based on the target-area prompt input by the user and the video frame; cropping the video frame around the target-area center point to obtain a local image block; upsampling the local image block; inputting the upsampled local image block into the trained segmentation model to obtain a high-resolution mask; and mapping the high-resolution mask back to the coordinate system of the original video frame to generate a pixel-level segmentation result.
  5. An interactive segmentation device for a remote sensing video moving target, comprising: a training module for training a segmentation model using a pre-acquired data set to obtain a trained segmentation model, the segmentation model comprising a space-time enhanced image encoder, a prompt encoder, a mask decoder and a decoupled high-frequency compensation decoder; and a segmentation module for inputting the remote sensing image or video to be segmented into the trained segmentation model to obtain a segmentation result; wherein the space-time enhanced image encoder comprises a low-rank adaptation matrix and a temporal memory link and is used for extracting features of the input video frames; the prompt encoder is used for converting interaction prompts input by the user into high-dimensional prompt vectors; the mask decoder is used for obtaining a preliminary segmentation result based on the extracted features and the prompt vectors; and the decoupled high-frequency compensation decoder is used for extracting shallow high-frequency features from the features extracted by the space-time enhanced image encoder and refining the preliminary segmentation result based on the shallow high-frequency features.
  6. An electronic device, comprising: at least one memory for storing a computer program; and at least one processor for executing the program stored in the memory, the program, when executed, being adapted to perform the interactive segmentation method for a remote sensing video moving target according to any one of claims 1-4.
  7. A computer-readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the interactive segmentation method for a remote sensing video moving target according to any one of claims 1-4.
  8. A computer program product which, when run on a processor, causes the processor to perform the interactive segmentation method for a remote sensing video moving target according to any one of claims 1-4.
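Claim 2's two additions to the frozen ViT, the low-rank adaptation (LoRA) matrix in the attention projections and the temporal bias term from frame t-1, can be sketched as follows. This is an illustrative NumPy reduction, not the patent's implementation: the dimensions, the zero-initialization of B, and the mean-pooled temporal bias are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
d, r = 64, 4          # hidden dim and LoRA rank (illustrative values)

# Frozen pretrained attention projection weight (claim 2: backbone frozen)
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Trainable low-rank adaptation: only A (r x d) and B (d x r) are updated,
# adding 2*d*r parameters instead of d*d.  B is zero-initialized so the
# adapted model starts out identical to the pretrained one.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def lora_project(x):
    """Effective projection (W + B @ A) applied to token features x."""
    return x @ W.T + x @ (B @ A).T

def temporal_bias(prev_frame_feat):
    """Temporal memory link (sketch): pool frame t-1's encoded features
    into a per-channel bias injected into frame t's attention."""
    return prev_frame_feat.mean(axis=0)

tokens_t = rng.standard_normal((10, d))    # 10 tokens of frame t
feat_prev = rng.standard_normal((10, d))   # encoded features of frame t-1
out = lora_project(tokens_t) + temporal_bias(feat_prev)
print(out.shape)   # (10, 64)
```

The design point carried over from the claim is parameter efficiency: W stays frozen while only A, B and the temporal link are trained, which matches the claim's motivation of avoiding full fine-tuning.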
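The inference procedure of claim 4 (crop around the user-indicated center, upsample, segment, map back) can be sketched in a few lines. The nearest-neighbour upsampling and the threshold "segmentation" are placeholders for the real model, and all sizes are illustrative.

```python
import numpy as np

def crop_upsample_map_back(image, center, patch=8, scale=2):
    """Claim-4-style inference sketch: crop a local patch around the
    target center, upsample it, segment the patch, and map the mask
    back to the original image's coordinate system."""
    h, w = image.shape[:2]
    cy, cx = center
    half = patch // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    y1, x1 = min(y0 + patch, h), min(x0 + patch, w)
    patch_img = image[y0:y1, x0:x1]

    # Upsample the local block (nearest-neighbour stands in for a learned upsampler)
    up = patch_img.repeat(scale, axis=0).repeat(scale, axis=1)

    # Placeholder "segmentation model": threshold the upsampled patch
    up_mask = (up > up.mean()).astype(np.uint8)

    # Downsample the high-resolution mask and paste it back at the crop location
    small_mask = up_mask[::scale, ::scale]
    full_mask = np.zeros((h, w), dtype=np.uint8)
    full_mask[y0:y1, x0:x1] = small_mask
    return full_mask

img = np.linspace(0, 1, 32 * 32).reshape(32, 32)
mask = crop_upsample_map_back(img, center=(16, 16))
print(mask.shape, mask.max())   # (32, 32) 1
```

The point of this structure, per the claim, is that the small target is segmented at an effectively higher resolution inside the crop, while the output mask still lives in the original frame's coordinates.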

Description

Interactive segmentation method for remote sensing video moving target

Technical Field

The application belongs to the technical field of remote sensing image processing, and in particular relates to an interactive segmentation method for a remote sensing video moving target.

Background

With the wide application of remote sensing video and high-resolution aerial imagery in fields such as defense monitoring, disaster assessment and traffic management, accurate, fast and interactive segmentation of moving targets has become a core technical requirement. Although general-purpose large models, represented by the Segment Anything Model (SAM), have achieved breakthroughs on natural images, directly migrating SAM to remote sensing video scenes still exposes the following core limitations:

1. Severe loss of small-target spatial features. SAM uses a fixed input resolution, while in remote sensing images targets such as vehicles and ships often occupy only a few pixels; after multi-stage downsampling in the encoder, the semantic information of these small targets easily dissipates, causing missed detections or holes in the segmentation.

2. Loss of temporal continuity. Existing improvements are mostly based on static single-frame processing, whereas cloud occlusion, transient shadows and fast target motion occur frequently in remote sensing video, causing severe flicker or target loss between the segmentation masks of consecutive frames.

3. Domain feature mismatch and boundary blurring. Remote sensing targets have arbitrary orientation and highly similar textures (for example camouflage blending into the background); a general-purpose model struggles to focus its attention on such high-altitude viewpoint images, and under complex background interference the generated mask boundaries often show jagged edges or broken topology.

4. The trade-off between adaptation cost and generalization. Although full-parameter fine-tuning can improve performance, remote sensing data are expensive to annotate and the model is very large, so full training consumes massive compute; the model also easily loses its original zero-shot generalization advantage and struggles to cope with unseen sensor types.

In summary, existing remote sensing video segmentation methods suffer from low segmentation accuracy.

Disclosure of Invention

Aiming at the defects of the prior art, the application provides an interactive segmentation method for a remote sensing video moving target, which aims to solve the low segmentation accuracy caused, in existing remote sensing video segmentation methods, by severe loss of small-target spatial features, loss of temporal continuity, domain feature mismatch and boundary blurring.
To achieve the above object, in a first aspect, the present application provides an interactive segmentation method for a remote sensing video moving target, comprising: training a segmentation model using a pre-acquired data set to obtain a trained segmentation model, wherein the segmentation model comprises a space-time enhanced image encoder, a prompt encoder, a mask decoder and a decoupled high-frequency compensation decoder; and inputting the remote sensing image or video to be segmented into the trained segmentation model to obtain a segmentation result. The space-time enhanced image encoder comprises a low-rank adaptation matrix and a temporal memory link and is used for extracting features of the input video frames; the prompt encoder is used for converting interaction prompts input by the user into high-dimensional prompt vectors; the mask decoder is used for obtaining a preliminary segmentation result based on the extracted features and the prompt vectors; and the decoupled high-frequency compensation decoder is used for extracting shallow high-frequency features from the features extracted by the space-time enhanced image encoder and refining the preliminary segmentation result based on them. According to the application, a segmentation model is constructed and trained, and a Low-Rank Adaptation (LoRA) matrix and a temporal memory link are introduced into the image encoder, so that the static encoder gains the ability to process video-stream features and completes cross-domain migration from general natural images to the high-altitude remote sensing domain, effectively solving the problem of segmentation breaks caused by occlusion, cloud, fog or shadow on targets in remote sensing video; early-stage texture features (high-fr