
CN-122024230-A - Histopathological image-oriented cell nucleus segmentation method and system

CN122024230A

Abstract

The invention belongs to the technical field of medical image processing and artificial intelligence, and discloses a cell nucleus segmentation method and system for histopathological images. The system comprises an encoder that extracts multi-scale features from an input pathological image; the multi-scale features are decoded by at least two parallel decoder branches, including an NP decoder branch and an HV decoder branch, generating at least a Nuclear Pixel (NP) segmentation map and a horizontal/vertical (HV) distance map. The method and system combine the strong hierarchical feature extraction capability of the Swin Transformer with the instance-aware multi-task framework of Hover-Net; experiments on the CoNSeP and MoNuSeg datasets show that the Dice coefficient and IoU of the method are superior to those of the baseline models Hover-Net and Swin-Unet, achieving more accurate cell nucleus segmentation.
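The fusion of the NP and HV outputs into instance masks is the Hover-Net-style step the abstract alludes to: sharp gradients in the HV distance maps mark boundaries between touching nuclei. The following is a minimal NumPy sketch of that idea only; the function names, thresholds, and the simple marker extraction are illustrative assumptions, not the patent's actual post-processing, which would normally feed such markers to a watershed over the foreground mask.

```python
import numpy as np

def hv_gradient_energy(hv):
    # hv: (2, H, W) horizontal/vertical distance maps, roughly in [-1, 1].
    # Large gradients in the HV maps occur at boundaries between touching
    # nuclei, which is what lets instance segmentation split adhered cells.
    gx = np.gradient(hv[0], axis=1)   # horizontal map varies along x
    gy = np.gradient(hv[1], axis=0)   # vertical map varies along y
    return np.maximum(np.abs(gx), np.abs(gy))

def separate_instances(np_prob, hv, fg_thresh=0.5, edge_thresh=0.4):
    # Foreground from the NP probability map, minus high-gradient HV edges,
    # leaves per-nucleus marker regions (illustrative thresholds).
    fg = np_prob > fg_thresh
    edges = hv_gradient_energy(hv) > edge_thresh
    return fg & ~edges
```

With flat HV maps no edges are found and the whole foreground survives; a sharp sign change in the horizontal map carves a boundary through it.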

Inventors

  • Shi Jiashuo
  • Hu Junjiang
  • Li Zhenhui
  • Tao Haibo

Assignees

  • Kunming University of Science and Technology (昆明理工大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-31

Claims (10)

  1. A cell nucleus segmentation system for histopathological images, based on an encoder-decoder architecture and comprising an encoder, parallel decoder branches and an atrous spatial pyramid pooling (ASPP) module, wherein the encoder extracts multi-scale features from an input pathological image; the parallel decoder branches, comprising an NP decoder branch and an HV decoder branch, decode the multi-scale features to generate a Nuclear Pixel (NP) segmentation map, a horizontal/vertical (HV) distance map and a nucleus classification (NC) map, which are fused to obtain a cell nucleus instance segmentation and classification prediction map; and the ASPP module is arranged between the encoder and the parallel decoder branches and performs multi-scale feature fusion on the deepest features output by the encoder before decoding.
  2. The histopathological image-oriented nuclear segmentation system according to claim 1, wherein the system decodes the multi-scale features with at least three parallel decoder branches, comprising an NP decoder branch, an HV decoder branch and a Nuclear Classification (NC) decoder branch, the encoder and all three decoder branches being constructed from Swin Transformer modules.
  3. The histopathological image-oriented nuclear segmentation system according to claim 1, wherein the encoder builds its stages from Swin Transformer modules that alternate windowed multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), together with patch merging layers, to extract a hierarchical multi-scale feature map.
  4. The histopathological image-oriented nuclear segmentation system according to claim 1, wherein the parallel decoder branches perform upsampling and feature refinement through Swin Transformer modules and patch expanding layers.
  5. The histopathological image-oriented nuclear segmentation system according to claim 1, further comprising U-shaped skip connections for fusing shallow feature maps output by the encoder with feature maps of corresponding scale in the parallel decoder branches, so as to preserve high-frequency spatial detail.
  6. The histopathological image-oriented nuclear segmentation system according to claim 1, wherein the ASPP module comprises a plurality of parallel convolution branches, including at least one 1×1 convolution branch, a plurality of atrous convolution branches with different dilation rates, and a global average pooling branch, for capturing diverse context information.
  7. A cell nucleus segmentation method based on the system of any one of claims 1-6, characterized in that the method operates as follows: S1, preparing training data comprising original pathological images together with their segmentation labels and classification labels, the labels being aligned pixel by pixel with the original images; S2, performing data preprocessing and data augmentation on the pathological images of S1, including geometric transformation, color transformation and non-rigid deformation; S3, dividing the data from S1 and S2 into a training set, a validation set and a test set, inputting the training set into the histopathological-image-oriented cell nucleus segmentation model to obtain the NP segmentation map, HV distance map and NC classification map predicted by the system, and deriving the final cell nucleus instance segmentation and classification map from the three predictions; S4, computing a composite loss function by comparing the three predicted maps with the corresponding ground-truth labels; S5, updating the model parameters of the system through back-propagation according to the composite loss function, finally yielding a trained prediction model.
  8. The method according to claim 7, wherein the cell nucleus segmentation model comprises four stages, each stage being composed of a series of Swin Transformer modules and a patch merging layer. An image I with resolution H×W×3 is input and divided by an embedding layer into non-overlapping 4×4 patches, yielding H/4×W/4 patches; each patch is flattened into a vector of length 4×4×3 = 48 and mapped by a linear layer to a preset dimension C, so that the output is a feature map of dimension H/4×W/4×C serving as the input to the Swin Transformer stages. The Swin Transformer module alternates windowed multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA). W-MSA divides the input feature map into non-overlapping local windows and performs multi-head self-attention independently within each window, reducing the computation from the global O(N²) to O(M²) per window, where N is the total sequence length and M is the window size. For an input feature map X ∈ R^{H×W×C}, where H and W are the height and width of the feature map and C the number of channels, the map is divided into (H/M)×(W/M) non-overlapping windows of size M×M; linear projections of X generate the query, key and value: Q = X·W_Q, K = X·W_K, V = X·W_V, where W_Q, W_K, W_V ∈ R^{C×C} are learnable weight matrices. Q, K and V are then split into h heads, and self-attention is computed inside each M×M window; for each window w and attention head i, the attention weights and output are Attention(Q, K, V) = Softmax(Q·Kᵀ/√d + B)·V, where d is the dimension of each head and B is an optional relative position bias (Relative Position Bias) matrix that adds an offset to pixel pairs at different relative positions within a window to encode positional information. The outputs of all h heads are concatenated along the channel dimension and mapped back to the original dimension C by a linear projection W_O, so that the final output has the same shape R^{H×W×C} as the input X. SW-MSA forces connections across window boundaries by alternating with W-MSA in adjacent Swin Transformer blocks: before computing W-MSA, the input feature map is cyclically shifted by ⌊M/2⌋ pixels toward the upper left, with pixels shifted off one edge re-entering from the opposite side; standard W-MSA is then computed on the shifted map, and because the map has been shifted, the new window partition merges pixels from originally adjacent windows, naturally establishing cross-window connections within the new windows. Since the W-MSA output is based on the shifted map, a reverse cyclic shift is applied to restore the original spatial order. When computing the attention weights, a large negative mask is added to the attention scores of the wrapped-around boundary pixel pairs, so that after Softmax the attention weights at these positions approach 0, effectively preventing interaction between pixels that are not truly adjacent; each stage stacks W-MSA and SW-MSA blocks alternately. The patch merging layer receives the feature maps and downsamples them, halving the spatial resolution and doubling the number of feature channels; after four stages the encoder outputs four hierarchical feature maps at different scales, namely F1 (H/4, W/4), F2 (H/8, W/8), F3 (H/16, W/16) and F4 (H/32, W/32), which are passed to the decoder through skip connections. The deepest feature map F4 (H/32, W/32) output by the encoder is sent to the atrous spatial pyramid pooling (ASPP) module, which aims to capture context information through different receptive fields. The ASPP module receives the deep feature map F4 ∈ R^{C×H×W} output by the encoder and processes it through three core components: a multi-branch parallel processing unit, a feature fusion layer and a dimension-reduction projection layer. The multi-branch parallel processing unit detects features at different scales and consists of the following parallel branches: a local detail branch, a standard 1×1 convolution branch that retains high-frequency local detail without changing the receptive field; a multi-scale context branch consisting of 3 parallel 3×3 atrous convolution branches with different dilation rates D ∈ {d1, d2, …, dk}, capturing diverse context by convolving the same feature map at different dilation rates; and a global context branch, an image-level feature branch that first compresses the spatial information of the whole feature map into a single feature vector by global average pooling (Global Average Pooling) to capture global context, transforms the vector with a 1×1 convolution, and upsamples it back to the original spatial resolution H×W by bilinear interpolation (Bilinear Interpolation). The output feature maps of all parallel branches are concatenated (Concatenate) along the channel dimension by the feature fusion layer to form a multi-scale aggregated feature map F4′ ∈ R^{C′×H×W} that is rich from local detail to global context; this map is finally processed by a 1×1 convolution layer, which not only reduces the channel number from C′ back to C, consistent with the input, but also realizes effective interaction and integration of cross-channel information. The processed feature map F_ASPP ∈ R^{C×H×W} is then sent to the decoder for the subsequent segmentation and classification tasks.
  9. The method according to claim 1, wherein the parallel decoder branches are responsible for gradually restoring the deep semantic features output by the ASPP module to the original resolution and completing three specific prediction tasks. The decoder structure is symmetric to the encoder and consists of three parallel branches that share the same architecture but have independent parameters: the NP decoder branch, the HV decoder branch and the NC nucleus classification decoder branch. Each decoder branch is likewise composed of Swin Transformer modules and patch expanding layers; the patch expanding layer is the inverse of the patch merging layer, doubling the spatial resolution (2H, 2W) by reshaping the feature map while halving the number of channels (C/2) to recover spatial detail, and between the patch expanding layers a series of Swin Transformer modules process and refine the upsampled features. During decoding the branches receive the F1, F2 and F3 feature maps from the encoder through U-shaped skip connections, fusing shallow high-frequency spatial detail with the deep semantic features. The ends of the three decoder branches are connected to three independent task heads, outputting a Nuclear Pixel (NP) map, a horizontal/vertical (HV) map and a nucleus classification (NC) map, respectively. The nuclear pixel prediction NP module is responsible for binary semantic segmentation: it outputs a two-channel feature map which, after a softmax activation function, gives the probability that each pixel belongs to a nucleus or to the background. The horizontal/vertical distance map HV module is responsible for a regression task: it outputs a two-channel feature map corresponding to the horizontal and vertical distances from each nucleus pixel to the centroid of its nucleus, predicting a two-dimensional vector (dx, dy) for each nucleus pixel, where dx is the normalized horizontal distance from the pixel to the centroid of the nucleus it belongs to and dy the vertical distance; the output of this branch is not passed through a nonlinear activation function, so the distance values are predicted directly. The nucleus classification NC module is responsible for instance segmentation and classification: the branch outputs a pixel-level classification probability map with K channels, where K is the number of nucleus types; the type of each nucleus instance is determined by a majority vote over the predicted categories of all its pixels, and the output is processed by a Softmax activation function.
  10. The cell nucleus segmentation method according to claim 8, wherein the method is trained with a composite loss function, the composite loss being a weighted sum of the NP branch loss L_NP, the HV branch loss L_HV and the NC branch loss L_NC: L_total = λ_NP·L_NP + λ_HV·L_HV + λ_NC·L_NC, where λ_NP, λ_HV and λ_NC are the weights of the three branch losses. L_NP is the nuclear pixel loss, combining the Dice loss and the binary cross-entropy loss to attend to both regions and boundaries: L_NP = L_BCE + L_Dice, where L_BCE is the binary cross-entropy loss and L_Dice the Dice loss, both computed between the true segmentation mask and the predicted segmentation mask over the pixels p. L_HV is the horizontal/vertical loss, a mean squared error (MSE) loss for the regression task: L_HV = (1/|F|)·Σ_{p∈F} ‖d(p) − d̂(p)‖², where F is the set of true nuclear pixels, p a pixel in that set, and d(p) and d̂(p) the true and predicted horizontal/vertical distance vectors at pixel p. L_NC is the nucleus type loss, a cross-entropy loss L_CE computed only on true nucleus pixels so as to avoid interference from background pixels, comparing the true and predicted class labels of each such pixel p.
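The patch merging and patch expanding operations referred to in the claims are, at their core, reshapes that trade spatial resolution for channel depth. A minimal NumPy sketch of the bare reshapes follows (the linear projections a real Swin block applies afterwards are omitted, and the function names are illustrative):

```python
import numpy as np

def patch_merging(x):
    # (H, W, C) -> (H/2, W/2, 4C): groups each 2x2 neighbourhood of tokens
    # into one token; a real Swin stage would follow this with a linear
    # projection from 4C down to 2C, giving the claimed channel doubling.
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H // 2, W // 2, 4 * C)

def patch_expanding(x):
    # Inverse reshape: (H, W, 4C) -> (2H, 2W, C), doubling spatial
    # resolution; the claims describe channel halving because a linear
    # layer first expands the channels, which this bare sketch skips.
    H, W, C4 = x.shape
    C = C4 // 4
    x = x.reshape(H, W, 2, 2, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(2 * H, 2 * W, C)
```

As written, the two reshapes are exact inverses, so expanding a merged map recovers the original tensor.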
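The composite loss of claim 10 can be sketched in NumPy as follows. This is a simplified single-image version under assumed equal weights; the per-pixel NC cross-entropy term is left as a plug-in argument `nc_loss` rather than implemented, and all names are illustrative:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    # Soft Dice loss over a probability map and a binary mask.
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    # Binary cross-entropy averaged over all pixels.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def hv_mse_loss(hv_pred, hv_true, nucleus_mask):
    # MSE on the (2, H, W) horizontal/vertical maps, restricted to the
    # true nucleus pixels, matching the set F in claim 10.
    m = nucleus_mask.astype(bool)
    if not m.any():
        return 0.0
    diff = hv_pred[..., m] - hv_true[..., m]   # shape (2, n_pixels)
    return float(np.mean(diff ** 2))

def total_loss(np_pred, np_true, hv_pred, hv_true,
               w_np=1.0, w_hv=1.0, w_nc=1.0, nc_loss=0.0):
    # Weighted sum of the three branch losses (illustrative weights).
    l_np = bce_loss(np_pred, np_true) + dice_loss(np_pred, np_true)
    l_hv = hv_mse_loss(hv_pred, hv_true, np_true)
    return w_np * l_np + w_hv * l_hv + w_nc * nc_loss
```

A perfect NP prediction drives the Dice term to zero, and flat HV maps over a correct mask contribute no regression loss.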

Description

Histopathological image-oriented cell nucleus segmentation method and system

Technical Field

The invention relates to the technical field of medical image processing and artificial intelligence, and in particular to a cell nucleus segmentation method and system for histopathological images.

Background

In histopathological diagnosis, precise analysis of cell nucleus morphology, size, texture and spatial distribution is the core basis of cancer diagnosis and grading. Traditional diagnosis relies on manual evaluation by pathologists, which consumes a great deal of time and labor, suffers from marked inter-observer variability, and makes unified diagnostic standards difficult to achieve. Medical image segmentation methods based on convolutional neural networks (CNNs) have been widely applied. To address the difficulty that conventional semantic segmentation cannot distinguish dense, adhered cell nuclei, instance segmentation techniques have emerged: Hover-Net, a representative multi-task learning framework, successfully separates the boundaries of adhered cells by introducing a regression branch that predicts horizontal and vertical distance maps to the nucleus centroid. Meanwhile, the Vision Transformer and its improved variant, the Swin Transformer, have shown excellent performance in medical image segmentation thanks to their strong global context modeling capability.
However, the convolution operation of a CNN is local and has a limited receptive field, which is a natural disadvantage for capturing long-distance dependencies and global context, and hinders the understanding of complex tissue structure and the discrimination of nucleus types that depend on the microenvironment. Hover-Net, with a CNN backbone, is therefore limited by the feature extraction capability of the CNN, while Swin-Unet, despite its powerful feature extractor, can only output pixel-level semantic segmentation in its standard architecture and cannot accurately separate adhered and overlapped nucleus instances. Neither fully meets the requirements for precise segmentation and analysis of cell nucleus instances in histopathological diagnosis.

Disclosure of Invention

The invention aims to provide a cell nucleus segmentation method and system for histopathological images that solve the problems of limited feature extraction in CNN-backbone methods, insufficient segmentation and classification precision, and the inability of Transformer-backbone methods to effectively separate dense, adhered and overlapped cell nucleus instances.
The invention provides a cell nucleus segmentation system for histopathological images, based on an encoder-decoder architecture and comprising an encoder, parallel decoder branches and an atrous spatial pyramid pooling (ASPP) module. The encoder extracts multi-scale features from the input pathological image; the parallel decoder branches, comprising an NP decoder branch and an HV decoder branch, decode the multi-scale features to generate a Nuclear Pixel (NP) segmentation map, a horizontal/vertical (HV) distance map and a nucleus classification (NC) map, which are fused to obtain a cell nucleus instance segmentation and classification prediction map; the ASPP module sits between the encoder and the parallel decoder branches and performs multi-scale feature fusion on the deepest features output by the encoder before decoding. Preferably, the system decodes the multi-scale features with at least three parallel decoder branches, comprising an NP decoder branch, an HV decoder branch and a Nuclear Classification (NC) decoder branch, the encoder and all three decoder branches being constructed from Swin Transformer modules. Preferably, the encoder builds its stages from Swin Transformer modules that alternate windowed multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), together with patch merging layers, to extract a hierarchical multi-scale feature map. Preferably, the parallel decoder branches perform upsampling and feature refinement through Swin Transformer modules and patch expanding layers.
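The W-MSA computation described above can be illustrated with a single-head NumPy sketch: window partitioning, scaled dot-product attention inside each window, and the cyclic shift used by SW-MSA. The relative position bias and the attention mask for wrapped-around pixels are omitted, and all names are illustrative, not the patent's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, M):
    # (H, W, C) -> (num_windows, M*M, C), assuming M divides H and W.
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_attention(x, M, Wq, Wk, Wv):
    # Single-head W-MSA: attention is computed independently inside each
    # MxM window, so cost depends on the window size rather than on the
    # full H*W sequence length.
    wins = window_partition(x, M)              # (nW, M*M, C)
    q, k, v = wins @ Wq, wins @ Wk, wins @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    return attn @ v                            # (nW, M*M, C)

def shift_features(x, M):
    # Cyclic shift by floor(M/2) toward the upper left, as used by SW-MSA
    # to connect neighbouring windows; undone by the opposite roll.
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
```

Rolling the shifted map back by the same amount restores the original spatial order, which is exactly the reverse cyclic shift applied after SW-MSA.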
Preferably, the system further comprises U-shaped skip connections for fusing the shallow feature maps output by the encoder with the feature maps of corresponding scale in the parallel decoder branches, so as to preserve high-frequency spatial detail. Preferably, the ASPP module comprises a plurality of parallel convolution branches, including at least one 1×1 convolution branch, a plurality of atrous convolution branches with different dilation rates, and a global average pooling branch for capturing diverse context information.
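The parallel ASPP branches described above can be sketched for a single channel in NumPy: a naive dilated (atrous) convolution plus an identity branch and a global-average branch, with outputs stacked to mimic channel concatenation. The 1×1 fusion convolution that follows concatenation is omitted, and the mean-filter kernel and dilation rates are illustrative assumptions:

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    # Naive single-channel 3x3 atrous convolution with zero padding:
    # the sampling stride inside the kernel window equals the dilation
    # rate, enlarging the receptive field without extra parameters.
    H, W = x.shape
    k = kernel.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = xp[i : i + 2 * pad + 1 : rate, j : j + 2 * pad + 1 : rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp(x, rates=(1, 2, 3)):
    # Parallel branches: identity, several atrous mean filters at
    # different dilation rates, and a global-average branch broadcast
    # back to full resolution; stacked along a new channel axis.
    mean_k = np.ones((3, 3)) / 9.0
    branches = [x] + [atrous_conv2d(x, mean_k, r) for r in rates]
    branches.append(np.full_like(x, x.mean(), dtype=float))
    return np.stack(branches, axis=0)
```

On a constant input, interior pixels of each atrous branch reproduce the input value, while border pixels are attenuated by the zero padding.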