CN-121982323-A - Medical image segmentation system, method and equipment thereof
Abstract
The invention belongs to the technical field of medical image processing and artificial intelligence, and discloses a medical image segmentation system, method and device. The system comprises an encoder and a decoder: the encoder extracts and fuses multi-scale features of an input image stage by stage, and the decoder decodes the multi-scale features to reconstruct a medical segmentation image with high-resolution anatomical features and detailed structural information. The method, based on the system, reconstructs the medical segmentation image through multi-scale feature fusion and a global-local feature cooperation mechanism, and the device is used for implementing the method.
Inventors
- SUN JIANCHENG
- WEN RUNLIN
- DAI LIYUN
Assignees
- Jiangxi University of Finance and Economics (江西财经大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-08
Claims (10)
- 1. A medical image segmentation system comprising an encoder and a decoder, wherein the encoder extracts and fuses multi-scale features of an input image stage by stage, and the decoder decodes the multi-scale features to reconstruct a medical segmentation image with high-resolution anatomical features and detailed structural information; the encoder comprises N cascaded deep vision blocks, each provided with a cooperative attention structure comprising: an improved multi-head self-attention module for enhancing information interaction between local neighborhood features through a token rolling operation; and a lightweight global perception attention module for modeling global context information through spatial attention weights; the improved multi-head self-attention module and the lightweight global perception attention module perform feature modeling on the same input features and generate enhanced features through feature fusion, thereby improving the feature expression capability for medical image segmentation (an illustrative sketch of this cooperative attention structure follows the claims).
- 2. The system of claim 1, wherein the encoder comprises a shallow vision block and the N cascaded deep vision blocks, each deep vision block further comprising a downsampling module and an identity module; the shallow vision block extracts and retains shallow features through a shallow feature extraction module; when i = 1, the 1st deep vision block takes the shallow features as input; when 2 ≤ i ≤ N, the i-th deep vision block takes the output of the (i-1)-th deep vision block as input; the downsampling module and the identity module gradually reduce the spatial resolution while extracting deep features.
- 3. The system of claim 1, wherein the decoder comprises N cascaded fusion upsampling blocks, the j-th fusion upsampling block comprising a multi-scale upsampling module and a residual convolution module; the multi-scale upsampling module performs multi-scale feature fusion on the multi-scale features and captures upsampling features; the residual convolution module refines the upsampling features and preserves structural details, capturing fused upsampling features.
- 4. A medical image segmentation method based on the system according to any one of claims 1-3, comprising the following steps (a simplified end-to-end sketch follows the claims). Encoding: Step 1, perform standard convolution on an input image X to obtain standard convolution features F0, then input F0 into the shallow vision block to extract shallow features F1. Step 2, when i = 1, input the shallow features F1 into the 1st deep vision block for downsampling to obtain downsampling features, and perform identity mapping on the downsampling features to obtain deep features D1; when 2 ≤ i ≤ N, input the deep features D_{i-1} output by the (i-1)-th deep vision block into the i-th deep vision block, until all N deep vision blocks have output their corresponding deep features D1, ..., DN. Decoding: Step 3, when j = N, input the deep features DN and D_{N-1} output by the N-th and (N-1)-th deep vision blocks into the N-th fusion upsampling block for multi-scale upsampling to obtain upsampling features, and perform residual convolution on the upsampling features to obtain fused upsampling features UN. Step 4, when 3 ≤ j ≤ N-1, input the deep features D_{j-1} output by the (j-1)-th deep vision block and the fused upsampling features U_{j+1} output by the (j+1)-th fusion upsampling block into the j-th fusion upsampling block for multi-scale upsampling and residual convolution in sequence; when j = 2, input the shallow features F1 obtained in Step 1 and the fused upsampling features U3 output by the 3rd fusion upsampling block into the 2nd fusion upsampling block for multi-scale upsampling and residual convolution in sequence; when j = 1, input the standard convolution features F0 obtained in Step 1 and the fused upsampling features U2 output by the 2nd fusion upsampling block into the 1st fusion upsampling block for multi-scale upsampling and residual convolution in sequence, until all N fusion upsampling blocks have output their corresponding fused upsampling features U1, ..., UN. Step 5, perform bilinear interpolation upsampling and point-by-point convolution on the fused upsampling features U2, ..., UN output by the 2nd to N-th fusion upsampling blocks respectively to unify their spatial resolution, so that the segmentation result is optimized with a multi-stage supervision training strategy; the fused upsampling features U1 output by the 1st fusion upsampling block finally serve as the final segmentation result.
- 5. The method of claim 4, wherein in Step 1 the shallow feature extraction is performed as follows (see the shallow-block sketch after the claims): first, the standard convolution features F0 are subjected to multi-scale downsampling and point-by-point convolution respectively; GP convolution is performed on the output features of the multi-scale downsampling; the output features of the GP convolution and of the point-by-point convolution are added element by element; two consecutive DP convolutions are performed on the result of the element-wise addition; finally, the output features of the DP convolutions and the output of the previous element-wise addition are added element by element again to obtain the shallow features F1. The multi-scale downsampling proceeds as follows: the standard convolution features F0 are subjected to GP convolution and average pooling respectively; point-by-point convolution is performed on the average-pooled output features, retaining key feature information while reducing the spatial dimension; the output features of the GP convolution and of the point-by-point convolution are concatenated and channel-shuffled, enhancing feature interaction among different channel groups. The GP convolution is group convolution followed by point-by-point convolution, realizing efficient computation and cross-channel feature fusion; the DP convolution is depthwise convolution followed by point-by-point convolution. The shallow feature extraction is expressed as:
  $F_0 = \mathrm{Conv}(X)$
  $F_{msd} = \mathrm{CS}(\mathrm{Cat}(\mathrm{GP}(F_0),\ \mathrm{PW}(\mathrm{AP}(F_0))))$
  $F_{pre} = \mathrm{GP}(F_{msd}) + \mathrm{PW}(F_0)$
  $F_1 = \mathrm{DP}(\mathrm{DP}(F_{pre})) + F_{pre}$
  where X is the input image, Conv denotes standard convolution, F0 the standard convolution features, PW point-by-point convolution, GP the GP convolution, AP average pooling, Cat concatenation, CS channel shuffle, F_pre the preliminary shallow features, DP the DP convolution, and F1 the shallow features.
- 6. The method of claim 4, wherein in Step 2 the downsampling is performed as follows (see the LGA sketch after the claims): first, the shallow features F1 are subjected to point-by-point convolution and multi-scale downsampling respectively; the output features of the multi-scale downsampling are fed to the improved multi-head self-attention and the lightweight global perception attention respectively; GP convolution is performed on the output of the lightweight global perception attention to realize interaction between local-range and global-range features; the output of the GP convolution and of the improved multi-head self-attention are added element by element and then subjected to point-by-point convolution, completing the feature fusion and yielding enhanced features; finally, the enhanced features and the output of the earlier point-by-point convolution are added element by element to stabilize training, yielding the downsampling features F_d. The downsampling process is expressed as:
  $F_m = \mathrm{MSD}(F_1)$
  $F_{en} = \mathrm{PW}(\mathrm{GP}(\mathrm{LGA}(F_m)) + \mathrm{IMSA}(F_m))$
  $F_d = F_{en} + \mathrm{PW}(F_1)$
  where F1 denotes the shallow features, PW point-by-point convolution, F_m the output features of the multi-scale downsampling MSD, LGA the lightweight global perception attention, GP the GP convolution, IMSA the improved multi-head self-attention, and F_d the downsampling features. The lightweight global perception attention first compresses the multi-channel features into a spatial attention map by point-by-point convolution; standard convolutions with different kernel sizes then extract multi-scale features from the map, with point-by-point convolution applied to each extraction result to capture multi-scale information; finally, global perception attention weights are generated through a Sigmoid function, and the weights are combined with the multi-channel features by a Hadamard product followed by element-wise addition, capturing global context information. The identity mapping is performed as follows: the downsampling features F_d are subjected to DP convolution; the DP output is fed to the improved multi-head self-attention and the lightweight global perception attention respectively; DP convolution is performed on the lightweight global perception attention output; the DP output and the improved multi-head self-attention output are added element by element and then subjected to point-by-point convolution; finally, the point-by-point convolution output and the downsampling features F_d are added element by element, fusing local and global features through a residual connection to improve training stability and obtain the deep features D_i. The identity mapping is expressed as:
  $F_{0d} = \mathrm{DP}(F_d)$
  $D_i = \mathrm{PW}(\mathrm{DP}(\mathrm{LGA}(F_{0d})) + \mathrm{IMSA}(F_{0d})) + F_d$
  where DP denotes the DP convolution, F_d the downsampling features, F_{0d} the preliminary downsampled features, LGA the lightweight global perception attention, IMSA the improved multi-head self-attention, PW point-by-point convolution, and D_i the deep features.
- 7. The method of claim 6, wherein the improved multi-head self-attention first applies three fully connected layers to the input feature map F to perform linear projection, obtaining an initial-state query vector Q0, key vector K0 and value vector V0; Q0, K0 and V0 are each reshaped into intermediate-state query, key and value vectors, each comprising n tokens, where n is the number of pixels in the feature map; forward and backward rolling operations are then applied to the tokens and the results are summed, yielding a final-state query vector Q, key vector K and value vector V, each comprising n summed tokens, so that each token interacts with its neighboring tokens, effectively integrating local and global context information; the final-state query vector Q and the transpose of the final-state key vector K are then multiplied and scaled to obtain attention scores, with scaling factor sqrt(d_k), d_k being the dimension of the final-state key vector; the attention scores are normalized into a probability distribution by an exponential function, and the distribution is multiplied with the final-state value vector V to obtain an attention feature map; finally, the input F is linearly transformed by a fully connected layer, added to the attention feature map, and passed through a point-by-point convolution to obtain the global context information O (see the IMSA sketch after the claims). The process is expressed as:
  $Q_0 = \mathrm{FC}(F),\quad K_0 = \mathrm{FC}(F),\quad V_0 = \mathrm{FC}(F)$
  $Q = \mathrm{Roll}_f(\mathrm{Reshape}(Q_0)) + \mathrm{Roll}_b(\mathrm{Reshape}(Q_0))$, and likewise for K and V
  $O = \mathrm{PW}\!\left(\mathrm{Softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V + \mathrm{FC}(F)\right)$
  where FC is a fully connected layer, Reshape the reshaping operation, Roll_f the forward rolling operation, Roll_b the backward rolling operation, Softmax the normalized exponential function, and ⊤ the transpose.
- 8. The method of claim 4, wherein in Step 3 and Step 4 the multi-scale upsampling is performed as follows (see the fusion-upsampling sketch after the claims): the input deep features D are subjected to transposed convolution and bilinear interpolation upsampling respectively, realizing enlargement while extracting detail information; simultaneously, the skip-connection features S input to the block are subjected to point-by-point convolution for channel alignment and feature extraction; the output features of the transposed convolution, of the bilinear interpolation upsampling and of the point-by-point convolution are then concatenated, channel-shuffled and subjected to GP convolution to achieve further feature aggregation. The multi-scale upsampling is expressed as:
  $U'_j = \mathrm{GP}(\mathrm{CS}(\mathrm{Cat}(\mathrm{TC}(D),\ \mathrm{BI}(D),\ \mathrm{PW}(S))))$
  where TC denotes transposed convolution, BI bilinear interpolation upsampling, PW point-by-point convolution, Cat concatenation, CS channel shuffle, GP the GP convolution, and U'_j the upsampling features. The residual convolution is performed as follows: the upsampling features U'_j are subjected to two consecutive DP convolutions and one point-by-point convolution respectively, and the results are combined to obtain the further refined, fused upsampling features U_j:
  $U_j = \mathrm{DP}(\mathrm{DP}(U'_j)) + \mathrm{PW}(U'_j)$
  where PW denotes point-by-point convolution and DP the DP convolution.
- 9. The method of claim 4, wherein in Step 5 the multi-stage supervised training strategy employs a composite loss function L combining a Dice loss L_Dice and a cross-entropy loss L_CE, where the Dice loss optimizes organ segmentation integrity by computing region overlap and the cross-entropy loss balances organ size differences through category weights (see the loss sketch after the claims); the multi-stage supervision outputs are obtained as Y_j = PW(BI(U_j)) for j = 2, ..., N, and the two losses are expressed as:
  $L_{\mathrm{Dice}} = 1 - \frac{2}{C}\sum_{c=1}^{C}\frac{\sum_{k=1}^{K} p_{k,c}\, g_{k,c}}{\sum_{k=1}^{K} p_{k,c} + \sum_{k=1}^{K} g_{k,c}}$
  $L_{\mathrm{CE}} = -\frac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C} g_{k,c}\,\log p_{k,c}$
  where Y is the medical image segmentation result, U_j the fused upsampling features, j indexes the current fusion upsampling block, N is the number of deep vision blocks, PW point-by-point convolution, BI bilinear interpolation upsampling, G the label image corresponding to the input image X from the adopted public dataset, g_{k,c} the true probability that the k-th pixel of the label image belongs to the c-th organ class, p_{k,c} the predicted probability that the k-th pixel of the segmentation result belongs to the c-th organ class, K the total number of pixels contained in the input image X, and C the number of organ classes to be segmented in the adopted public dataset.
- 10. A medical image segmentation apparatus, comprising: a memory for storing a computer program implementing the medical image segmentation method of claim 4; and a processor for implementing the medical image segmentation method of claim 4 when executing the computer program.
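The following PyTorch sketches illustrate the mechanisms recited in claims 1 and 4-9; all class names, channel sizes, kernel sizes and group counts are illustrative assumptions by the editor, not the patented implementation. First, a minimal sketch of the cooperative attention fusion of claim 1, following the fusion pattern of claim 6 (GP convolution on the global branch, element-wise addition, pointwise projection, residual connection); the `imsa` and `lga` attributes here are 1x1-convolution stand-ins so the module runs standalone, with fuller sketches of both branches given below.

```python
# Illustrative sketch only: cooperative attention fusion (claim 1 / claim 6).
import torch
import torch.nn as nn

class CooperativeAttention(nn.Module):
    """Runs two attention branches on the same input and fuses them."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.imsa = nn.Conv2d(channels, channels, 1)   # stand-in local branch
        self.lga = nn.Conv2d(channels, channels, 1)    # stand-in global branch
        # GP convolution = grouped conv followed by pointwise conv (claim 5).
        self.gp = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.Conv2d(channels, channels, 1),
        )
        self.proj = nn.Conv2d(channels, channels, 1)   # pointwise fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.imsa(x)            # local-neighborhood modeling
        global_feat = self.gp(self.lga(x))   # global context + GP mixing
        fused = self.proj(local_feat + global_feat)
        return fused + x                     # residual add for training stability

if __name__ == "__main__":
    y = CooperativeAttention(32)(torch.randn(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```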
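Next, a deliberately simplified, runnable sketch of the claim-4 encode/decode flow with N = 4, using plain strided and transposed convolutions as stand-ins for the shallow vision, deep vision and fusion upsampling blocks. The skip wiring follows the usual encoder-decoder pattern and only approximates the exact index bookkeeping of the claim (shallow features into the 2nd block, standard convolution features into the 1st); `TinySegNet` and all sizes are hypothetical.

```python
# Illustrative sketch only: simplified claim-4 encoder-decoder flow.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_ch=1, base=16, n_classes=9, N=4):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, base, 3, padding=1)    # Step 1: standard conv
        self.shallow = nn.Conv2d(base, base, 3, padding=1)  # shallow block stand-in
        chs = [base * 2 ** i for i in range(N + 1)]         # 16, 32, 64, 128, 256
        # Step 2: each "deep vision block" halves resolution, doubles channels.
        self.down = nn.ModuleList(
            nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1) for i in range(N)
        )
        # Steps 3-4: each "fusion upsampling block" consumes the previous decoder
        # output and a skip feature (concatenated here for simplicity).
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2) for i in range(N)
        )
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * chs[i], chs[i], 3, padding=1) for i in range(N)
        )
        self.head = nn.Conv2d(base, n_classes, 1)           # Step 5: final result

    def forward(self, x):
        f = self.shallow(self.stem(x))
        skips, d = [], f
        for blk in self.down:                               # encoding path
            skips.append(d)
            d = blk(d)
        for i in reversed(range(len(self.up))):             # decoding path
            d = self.up[i](d)
            d = self.fuse[i](torch.cat([d, skips[i]], dim=1))
        return self.head(d)

if __name__ == "__main__":
    print(TinySegNet()(torch.randn(1, 1, 64, 64)).shape)    # [1, 9, 64, 64]
```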
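A sketch of the claim-5 shallow feature extraction. The GP convolution (grouped + pointwise) and DP convolution (depthwise + pointwise) follow the claim's definitions; the kernel sizes, group counts and the `msd_out` channel-folding convolution are assumptions, and the spatial bookkeeping is simplified (stride-1 average pooling) so every branch keeps the input resolution.

```python
# Illustrative sketch only: claim-5 shallow feature extraction.
import torch
import torch.nn as nn

def gp_conv(ch, groups=4):
    """GP convolution: grouped conv for efficiency + 1x1 for cross-channel mixing."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=groups),
                         nn.Conv2d(ch, ch, 1))

def dp_conv(ch):
    """DP convolution: depthwise conv + pointwise conv."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                         nn.Conv2d(ch, ch, 1))

def channel_shuffle(x, groups=4):
    """Interleave channel groups to enhance cross-group feature interaction."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShallowBlock(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.msd_gp = gp_conv(ch)
        self.msd_pool = nn.AvgPool2d(3, stride=1, padding=1)  # stride-1 simplification
        self.msd_pw = nn.Conv2d(ch, ch, 1)
        self.msd_out = nn.Conv2d(2 * ch, ch, 1)  # fold concat back to ch (assumption)
        self.gp = gp_conv(ch)
        self.pw = nn.Conv2d(ch, ch, 1)
        self.dp1, self.dp2 = dp_conv(ch), dp_conv(ch)

    def forward(self, f0):
        # Multi-scale downsampling branch: GP conv || avg-pool -> PW, concat, shuffle.
        msd = channel_shuffle(torch.cat([self.msd_gp(f0),
                                         self.msd_pw(self.msd_pool(f0))], dim=1))
        msd = self.msd_out(msd)
        pre = self.gp(msd) + self.pw(f0)       # preliminary shallow features F_pre
        return self.dp2(self.dp1(pre)) + pre   # two DP convs + residual addition

if __name__ == "__main__":
    print(ShallowBlock()(torch.randn(1, 32, 64, 64)).shape)
```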
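A sketch of the lightweight global perception attention branch of claim 6. Compressing to a single-channel spatial map, the 3/5/7 kernel sizes, and summing the multi-scale branches before the Sigmoid are all illustrative readings of the claim text.

```python
# Illustrative sketch only: lightweight global perception attention (claim 6).
import torch
import torch.nn as nn

class LGA(nn.Module):
    def __init__(self, ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.squeeze = nn.Conv2d(ch, 1, 1)  # compress channels -> spatial attention map
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 1, k, padding=k // 2),  # standard conv, kernel size k
                nn.Conv2d(1, 1, 1),                  # pointwise refinement
            )
            for k in kernel_sizes
        )

    def forward(self, x):
        s = self.squeeze(x)                                  # spatial attention map
        w = torch.sigmoid(sum(b(s) for b in self.branches))  # global attention weight
        return x * w + x   # Hadamard product with the input, then element-wise add

if __name__ == "__main__":
    print(LGA(32)(torch.randn(1, 32, 64, 64)).shape)
```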
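A sketch of the improved multi-head self-attention with token rolling (claim 7), single-head for brevity. Rolling the token sequence by one position in each direction and summing the two rolled copies is one plausible reading of the "front-back rolling then summing" step.

```python
# Illustrative sketch only: IMSA with token rolling (claim 7), single head.
import math
import torch
import torch.nn as nn

class IMSA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)      # final fully connected projection of F
        self.pw = nn.Conv2d(dim, dim, 1)    # closing pointwise convolution
        self.scale = math.sqrt(dim)         # sqrt(d_k)

    @staticmethod
    def roll_sum(t):
        # Each token aggregates its forward- and backward-rolled neighbours.
        return torch.roll(t, 1, dims=1) + torch.roll(t, -1, dims=1)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (b, n, c), n = h*w pixels
        q = self.roll_sum(self.q(tokens))           # final-state Q
        k = self.roll_sum(self.k(tokens))           # final-state K
        v = self.roll_sum(self.v(tokens))           # final-state V
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        y = attn @ v + self.out(tokens)             # attention map + projected input
        return self.pw(y.transpose(1, 2).reshape(b, c, h, w))

if __name__ == "__main__":
    print(IMSA(32)(torch.randn(1, 32, 16, 16)).shape)
```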
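A sketch of the claim-8 fusion upsampling block: transposed-convolution and bilinear branches on the decoder feature, a pointwise branch on the skip feature, then concatenation, channel shuffle and GP convolution, followed by the residual convolution (two DP convolutions summed with a pointwise path). Channel sizes and which input feeds which branch are assumptions; the helpers are repeated so the block stands alone.

```python
# Illustrative sketch only: fusion upsampling block (claim 8).
import torch
import torch.nn as nn
import torch.nn.functional as F

def gp_conv(cin, cout, groups=4):
    return nn.Sequential(nn.Conv2d(cin, cin, 3, padding=1, groups=groups),
                         nn.Conv2d(cin, cout, 1))

def dp_conv(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                         nn.Conv2d(ch, ch, 1))

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class FusionUpBlock(nn.Module):
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.tc = nn.ConvTranspose2d(deep_ch, out_ch, 2, stride=2)  # detail branch
        self.pw_deep = nn.Conv2d(deep_ch, out_ch, 1)  # pairs with bilinear branch
        self.pw_skip = nn.Conv2d(skip_ch, out_ch, 1)  # channel-align the skip path
        self.gp = gp_conv(3 * out_ch, out_ch)
        self.dp = nn.Sequential(dp_conv(out_ch), dp_conv(out_ch))
        self.pw_res = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, deep, skip):
        bi = F.interpolate(self.pw_deep(deep), scale_factor=2,
                           mode="bilinear", align_corners=False)
        up = torch.cat([self.tc(deep), bi, self.pw_skip(skip)], dim=1)
        up = self.gp(channel_shuffle(up, groups=3))   # aggregate the three branches
        return self.dp(up) + self.pw_res(up)          # residual convolution

if __name__ == "__main__":
    blk = FusionUpBlock(deep_ch=64, skip_ch=32, out_ch=32)
    print(blk(torch.randn(1, 64, 16, 16), torch.randn(1, 32, 32, 32)).shape)
```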
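Finally, a standard Dice plus weighted cross-entropy composite matching the structure of the claim-9 loss; the equal 1:1 weighting and the optional class weights are assumptions. Under the multi-stage supervision strategy, the same loss would be applied to each resolution-unified auxiliary output as well as to the final one.

```python
# Illustrative sketch only: composite Dice + cross-entropy loss (claim 9).
import torch
import torch.nn as nn
import torch.nn.functional as F

def composite_loss(logits, target, class_weights=None, eps=1e-6):
    """logits: (B, C, H, W) raw scores; target: (B, H, W) integer class labels."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    # Dice term: region overlap per class, averaged over classes.
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1 - (2 * inter / (denom + eps)).mean()
    # Cross-entropy term: per-class weights balance organ size differences.
    ce = F.cross_entropy(logits, target, weight=class_weights)
    return dice + ce

if __name__ == "__main__":
    logits = torch.randn(2, 9, 64, 64)
    labels = torch.randint(0, 9, (2, 64, 64))
    print(composite_loss(logits, labels).item())
```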
Description
Medical image segmentation system, method and equipment thereof

Technical Field
The invention belongs to the technical field of medical image processing and artificial intelligence, and particularly relates to a medical image segmentation system, method and device suitable for fast and accurate automatic segmentation of medical images such as CT and MRI.

Background
Medical image segmentation is a key step in precision medicine and plays an important role in lesion delineation, quantitative organ analysis, surgical planning and related fields. Imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) provide rich anatomical information, but manual segmentation is time-consuming and labor-intensive, poorly reproducible and highly subjective. In recent years, segmentation methods based on deep learning, particularly convolutional neural networks (CNNs) and derived architectures such as U-Net, have become the mainstream approach to medical image segmentation. With its encoder-decoder structure and skip connections, U-Net effectively fuses shallow detail information with deep semantic information and has achieved remarkable results in many segmentation tasks. However, when facing complex medical images, such models still suffer from significant limitations. (1) Inadequate multi-scale target perception: target structures in medical images (e.g., tumors, organs) vary drastically in size, morphology and location, and the encoder of standard U-Net gradually loses the detail information and spatial context of small targets during repeated downsampling, reducing segmentation accuracy for tiny lesions and complex boundaries. (2) Feature representation bottleneck: the receptive field of the model is limited, making it difficult to capture global context information and local fine features simultaneously; single-scale convolution cannot adaptively handle targets of different sizes, and discontinuous or false segmentation easily occurs in complex scenes. There is therefore a need for a new medical image segmentation method that effectively integrates multi-scale feature information and strengthens the model's collaborative modeling of global context and local detail, so as to improve segmentation accuracy and robustness for targets of different sizes and overcome the defects of the prior art. Chen et al. disclose TransUNet (Chen J, Lu Y, Yu Q, et al. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. 2021. DOI: 10.48550/arXiv.2102.04306), which constructs a hybrid architecture in which a CNN encoder captures local features and a Transformer establishes global dependencies; however, because the CNN and Transformer modules are simply stacked in series and lack a deep, effective bidirectional feature interaction mechanism, local details and global context information fuse poorly, leading to degraded segmentation boundaries and loss of fine structures.
Disclosure of Invention
In order to overcome the above drawbacks of the prior art, an object of the present invention is to provide a medical image segmentation system, method and device that achieve efficient interaction and fusion of features across different levels and dimensions through an encoder-decoder architecture, thereby accurately segmenting biomedical anatomical structures of different shapes and sizes. To achieve this purpose, the technical scheme adopted by the invention is as follows. A medical image segmentation system comprises an encoder and a decoder; the encoder extracts and fuses the multi-scale features of the input image stage by stage, and the decoder decodes the multi-scale features to reconstruct a medical segmentation image with high-resolution anatomical features and detailed structural information. The encoder comprises N cascaded deep vision (DV) blocks, each provided with a cooperative attention structure comprising: an improved multi-head self-attention (IMSA) module for enhancing information interaction between local neighborhood features through a token rolling operation; and a lightweight global perception attention (LGA) module for modeling global context information with spatial attention weights. The IMSA module and the LGA module perform feature modeling on the same input features and generate enhanced features through feature fusion, thereby improving the feature expression capability for medical image segmentation.