CN-122023143-A - CBCT image cross-modal synthesis method based on mixed state space model
Abstract
The invention discloses a CBCT image cross-modal synthesis method based on a hybrid state space model, comprising the steps of: data preprocessing and 2.5D input sequence construction; constructing a focus slice correction Branch A to remove intra-slice and inter-slice artifacts; constructing a volume scene context modeling Branch B to extract three-dimensional spatial consistency features; constructing a hybrid global-local Mamba decoder Branch C to generate a high-fidelity sCT image; and model training and optimization. The invention markedly improves the physical fidelity of the synthesized CT image and the continuity of anatomical structures, overcomes the computational-efficiency bottleneck of high-resolution medical image processing, achieves clinical-grade real-time inference speed while retaining 3D-level modeling capability, and constructs a disambiguation mechanism based on multi-scale context contrast, markedly improving the clinical safety and reliability of the generated images.
Inventors
- HAO XUELI
- YAN ERJUN
- SONG XIANGJUN
- FAN TING
- CAI SHIYUAN
- HE YANGYANG
- ZHANG ZHEYU
Assignees
- 青海职业技术大学 (Qinghai Vocational and Technical University)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-19
Claims (7)
- 1. A CBCT image cross-modal synthesis method based on a hybrid state space model, characterized by comprising the following steps:
  Step 1, data preprocessing and 2.5D input sequence construction: standardize the raw medical image data to obtain the CBCT scans to be processed and the planning pCT images paired with them; truncate and normalize the image pixel values, clipping Hounsfield unit (HU) values to the range [-1024, 3071] and mapping them linearly to the interval [0, 1]; on this basis, construct a 2.5D slice stack;
  Step 2, construct a focus slice correction Branch A to remove intra-slice and inter-slice artifacts;
  Step 3, construct a volume scene context modeling Branch B to extract three-dimensional spatial consistency features;
  Step 4, construct a hybrid global-local Mamba decoder Branch C to generate a high-fidelity sCT image, adopting the decoding part of a U-shaped architecture and recovering image resolution step by step through a hybrid global-local state space model (HGL-SSM);
  Step 5, model training and optimization: train the model end to end in a supervised manner on paired CBCT and pCT data, construct a composite loss function, and minimize it with the Adam optimizer until the model converges.
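A minimal sketch of the HU truncation and linear normalization described in step 1 of claim 1 (NumPy-based; the function name and array layout are illustrative, not taken from the patent):

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 3071.0  # truncation range from claim 1

def normalize_hu(volume_hu: np.ndarray) -> np.ndarray:
    """Clip HU values to [-1024, 3071], then map linearly to [0, 1]."""
    clipped = np.clip(volume_hu, HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)
```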
- 2. The CBCT image cross-modal synthesis method based on the mixed state space model as claimed in claim 1, wherein in step 1, all images are resampled to an isotropic resolution of 1.0 × 1.0 × 1.0 mm³.
- 3. The method of claim 1, wherein in step 1, for each target center slice S(i) to be reconstructed, its N spatially preceding and N following adjacent slices are selected and combined into an input tensor of 2N+1 slices, X ∈ R^(B×(2N+1)×H×W), where B denotes the batch size and H and W denote the image height and width, respectively.
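A sketch of the 2.5D stack construction in claim 3, assuming edge slices are handled by clamping indices at the volume boundary (the patent does not specify boundary handling); with N = 2 this yields the 5-slice stack referenced in claim 5:

```python
import numpy as np

def build_25d_stack(volume: np.ndarray, i: int, n: int = 2) -> np.ndarray:
    """Gather slices i-n .. i+n around center slice i into a (2n+1, H, W)
    stack, clamping indices at the volume boundaries."""
    depth = volume.shape[0]
    idx = np.clip(np.arange(i - n, i + n + 1), 0, depth - 1)
    return volume[idx]
```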
- 4. The CBCT image cross-modal synthesis method based on the mixed state space model according to claim 1, wherein in step 2, the branch aims to restore the high-frequency anatomical detail of the central slice through a cascade of orthogonal decoupling and volume compensation mechanisms, comprising the following sub-steps:
  (2a) Orthogonal decoupling attention feature extraction: to handle the scattering streak artifacts distributed along specific angles in CBCT, long-range dependencies are captured by an orthogonal decoupling attention module (ODAM); the input feature map F is adaptively average-pooled along the horizontal (W) and vertical (H) directions, compressing the two-dimensional features into two orthogonal one-dimensional feature vectors, a horizontal feature vector z^w and a vertical feature vector z^h; the two vectors learn spatial importance through matrix interaction and a shared convolution layer, and are restored to a two-dimensional attention weight map A through a broadcasting mechanism, computed as in formula (2-1):
  A = σ(Conv(z^h) × Conv(z^w))  (2-1)
  where σ is the Sigmoid activation function and × denotes matrix multiplication or broadcast concatenation; the generated weights are multiplied element-wise with the original features F;
  (2b) Adjacent-slice difference feature computation: pixel-level subtraction is performed on adjacent slice pairs in the stack to obtain 4 difference maps, computed as in formula (2-2):
  D_k = S_(k+1) - S_k, k = 1, ..., 4  (2-2)
  the difference maps are concatenated along the channel dimension to form a difference feature stack D, which explicitly encodes the abrupt Z-axis changes caused by respiratory motion or scan artifacts;
  (2c) Volume difference compensation fusion: the difference feature stack D is processed at multiple scales by a dual-path convolutional network, the first path extracting low-frequency differences with large-receptive-field convolutions and the second path extracting high-frequency differences with small-receptive-field convolutions; the extracted compensation feature F_comp, after channel adjustment by convolution, is weighted and fused into the central-slice features in residual form, as in formula (2-3):
  F' = F_c + α · Conv(F_comp)  (2-3)
  where α is a weighting coefficient learned automatically by the network.
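A PyTorch sketch of the ODAM attention in sub-step (2a). The patent's exact "matrix interaction" is not fully specified, so this sketch combines the two pooled vectors by broadcast multiplication (coordinate-attention style); the module and layer names are illustrative:

```python
import torch
import torch.nn as nn

class ODAM(nn.Module):
    """Orthogonal decoupling attention: pool along W and H into two 1-D
    vectors, pass both through a shared 1x1 conv, broadcast back to 2-D,
    and reweight the input features (formula (2-1))."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # -> (B, C, 1, W)
        self.shared_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_h = self.shared_conv(self.pool_h(x))          # vertical descriptor
        z_w = self.shared_conv(self.pool_w(x))          # horizontal descriptor
        attn = self.sigmoid(z_h * z_w)                  # broadcasts to (B, C, H, W)
        return x * attn                                 # element-wise reweighting
```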
- 5. The CBCT image cross-modal synthesis method based on the mixed state space model of claim 1, wherein in step 3, the branch uses deep semantic information from surrounding slices to help judge the structural authenticity of the central slice, and specifically comprises the following sub-steps:
  (3a) Query-Key based volumetric scene reconstruction: three-dimensional neighborhood information is aggregated with an attention mechanism; the deep features of the central slice are projected by convolution into a Query vector Q, the features of all slices in the input stack into Key vectors K, and a spatial correlation map M between Q and K is computed as in formula (2-4):
  M = Softmax(Q · K^T)  (2-4)
  the correlation map reflects which regions of the adjacent slices are anatomically continuous with the central slice; the adjacent-slice features are weighted and aggregated with M to filter out uncorrelated inter-layer noise and produce clean volume context features;
  (3b) Multi-scale context disambiguation: since artifacts can resemble real anatomical structures and are hard to distinguish, a multi-scale comparison strategy is adopted; a wide-range feature stream takes all 5 input slices and extracts global consistency features F_wide, while a narrow-range feature stream takes only the middle 3 slices and extracts local continuity features F_narrow; the difference feature of the two is computed as in formula (2-5):
  F_diff = |F_wide - F_narrow|  (2-5)
  the difference feature generates a reverse gating weight that suppresses artifact features inconsistent across scales, and the disambiguated deep context features are output.
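A sketch of the reverse gating in sub-step (3b), assuming the wide- and narrow-range streams produce feature maps of the same shape; the 1x1 conv that produces the gate is an assumed design choice, not specified in the patent:

```python
import torch
import torch.nn as nn

class MultiScaleDisambiguation(nn.Module):
    """Suppress features that disagree between the wide (5-slice) and
    narrow (3-slice) streams via a reverse gate built from (2-5)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_wide: torch.Tensor, f_narrow: torch.Tensor) -> torch.Tensor:
        f_diff = torch.abs(f_wide - f_narrow)                # cross-scale inconsistency
        gate = 1.0 - torch.sigmoid(self.gate_conv(f_diff))   # reverse gating weight
        return f_wide * gate                                 # keep scale-consistent features
```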
- 6. The CBCT image cross-modal synthesis method based on the mixed state space model of claim 1, wherein step 4 specifically comprises the following sub-steps:
  (4a) State space parameter discretization: the parameters A and B of the continuous state space model are discretized with the zero-order hold (ZOH) method so that recursive computation can be performed on a digital image sequence, as in formula (2-6):
  Ā = exp(ΔA),  B̄ = (ΔA)^(-1) (exp(ΔA) - I) · ΔB  (2-6)
  where Δ is the time-step parameter and I is the identity matrix; based on the discretized parameters, a linear recursive scan is performed to capture global information, as in formula (2-7):
  h_t = Ā h_(t-1) + B̄ x_t,  y_t = C h_t  (2-7)
  (4b) Dual-stream grouped scanning mechanism: the feature map X is unfolded into four sequences along four directions and divided into two groups for independent processing; a forward group comprising the row-forward and column-forward sequences, and a backward group comprising the row-backward and column-backward sequences; each group is fed into its own SSM module for scanning, preserving causal consistency;
  (4c) Causality-aware local refinement: to counter the tendency of SSMs to over-smooth texture, a locally-enhanced feed-forward network (LeFF) is introduced before global fusion; the scanned output sequence is reshaped back into a 2D feature map and depth-wise convolution (Depth-wise Conv) is applied to extract local high-frequency texture, the refined features being computed as in formula (2-8):
  F_refined = F + DWConv(F)  (2-8)
  the refined features of the forward and backward groups are added to obtain decoding features with a global receptive field and local texture detail;
  (4d) Deep-shallow feature fusion and output: the deep decoding features obtained in step 4 are fused with the corrected shallow features obtained in step 2 through a skip connection (Skip Connection), and the fused features are mapped by a final convolution layer to output a single-channel synthetic CT image.
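A minimal sketch of (2-6) and (2-7) for a diagonal state matrix, written as an explicit loop for clarity (production Mamba implementations use a hardware-aware parallel selective scan); shapes and names are illustrative:

```python
import torch

def zoh_discretize(A: torch.Tensor, B: torch.Tensor, delta: float):
    """ZOH discretization (2-6), elementwise because A is diagonal:
    A_bar = exp(delta*A); B_bar = (delta*A)^-1 (exp(delta*A) - 1) * delta*B."""
    dA = delta * A                          # (d_state,)
    A_bar = torch.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar

def linear_scan(x: torch.Tensor, A_bar: torch.Tensor,
                B_bar: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Recursion (2-7): h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = C·h_t,
    for a scalar input sequence x of length L and state size d_state."""
    h = torch.zeros_like(A_bar)             # hidden state, (d_state,)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(torch.dot(C, h))          # project state to scalar output
    return torch.stack(ys)
```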
- 7. The CBCT image cross-modal synthesis method based on the mixed state space model of claim 1, wherein in step 5, the composite loss function L_total comprises a pixel-level reconstruction loss L_rec, a perceptual loss L_perc, and a gradient consistency loss L_grad, wherein the pixel-level reconstruction loss ensures HU-value accuracy, the perceptual loss ensures realistic visual texture, and the gradient consistency loss ensures sharp anatomical boundaries.
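A sketch of a composite loss of the kind claim 7 describes; the weighting coefficients, the use of L1 for each term, and the choice of feature extractor (feat_net, e.g. a pretrained VGG) are assumptions, not values specified in the patent:

```python
import torch
import torch.nn.functional as F

def composite_loss(pred, target, feat_net,
                   w_rec: float = 1.0, w_perc: float = 0.1, w_grad: float = 1.0):
    """L_total = w_rec*L_rec + w_perc*L_perc + w_grad*L_grad."""
    l_rec = F.l1_loss(pred, target)                        # HU-value accuracy
    l_perc = F.l1_loss(feat_net(pred), feat_net(target))   # realistic texture

    def image_grads(img: torch.Tensor):
        # Finite-difference gradients along H and W (boundary sharpness).
        return img[..., 1:, :] - img[..., :-1, :], img[..., :, 1:] - img[..., :, :-1]

    gy_p, gx_p = image_grads(pred)
    gy_t, gx_t = image_grads(target)
    l_grad = F.l1_loss(gy_p, gy_t) + F.l1_loss(gx_p, gx_t)
    return w_rec * l_rec + w_perc * l_perc + w_grad * l_grad
```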
Description
CBCT image cross-modal synthesis method based on mixed state space model

Technical Field

The invention relates to the technical field of medical image processing and deep learning, in particular to a cross-modal synthesis method for CBCT images based on a hybrid state space model.

Background

With the rapid development of precision medicine and adaptive radiotherapy (Adaptive Radiotherapy, ART), acquiring accurate anatomical information about the patient in real time during treatment has become key to improving therapeutic gain and reducing side effects. Cone Beam CT (CBCT) has become the standard imaging modality for daily setup verification and online adaptive plan adjustment in image-guided radiation therapy (IGRT), owing to its low radiation dose, fast imaging, and direct integration on treatment devices such as linear accelerators (Linac). However, because of physical imaging limitations such as scattering, motion artifacts, and beam hardening, raw CBCT images commonly suffer from severe streak artifacts, heavy noise interference, and low soft-tissue contrast. More importantly, CBCT pixel values (HU) tend to be inaccurate and unstable, and cannot be used directly for high-precision radiotherapy dose calculation.

To address these problems, converting low-quality CBCT images into high-quality synthetic CT (Synthetic CT, sCT) images has become a current research hotspot. Accurate, high-fidelity sCT generation not only markedly improves image clarity and assists physicians in more accurate target delineation, but also restores accurate electron density information, making online dose verification and adaptive plan re-optimization based on daily images possible. It therefore has important clinical value and application prospects for advancing personalized precision radiotherapy.

Existing CBCT-to-sCT synthesis methods fall mainly into traditional image processing methods and deep learning based methods. Traditional methods such as histogram matching and atlas-based deformable registration rely on complex prior knowledge and hand-crafted features, and generalize poorly. With the development of deep learning, methods based on Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN) have become mainstream: U-Net and its variants are widely used to learn CBCT-to-CT mappings, and CycleGAN is used to address cross-modal conversion of unpaired data. More recently, Vision Transformers (ViT) and Diffusion Models have also been introduced to this field for their powerful modeling capabilities.

U-Net and its improved models: the U-Net architecture proposed by Ronneberger et al. in "U-Net: Convolutional Networks for Biomedical Image Segmentation" has achieved great success in medical image segmentation and synthesis by fusing deep and shallow features through skip connections. However, in CBCT synthesis tasks, a standard U-Net is limited by its local receptive field and struggles to remove wide-range scattering artifacts effectively.

GAN models: Isola et al. in "Image-to-Image Translation with Conditional Adversarial Networks" proposed the Pix2Pix framework, which significantly improves the texture clarity of the synthesized image through adversarial training. However, GAN training is unstable and prone to mode collapse, and may introduce anatomically non-compliant "hallucinations" (Hallucinations) during generation.

Transformer models: Chen et al. in "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation" combine CNNs with Transformers, using the self-attention mechanism to capture long-range dependencies. Although global modeling capability is improved, the computational complexity of the Transformer grows quadratically with image resolution (O(N²)), so memory consumption is extremely high and inference is slow when processing high-resolution three-dimensional medical images.

Diffusion models: the DDPM proposed by Ho et al. in "Denoising Diffusion Probabilistic Models" generates high-quality images through an iterative denoising process and excels at texture recovery. However, in a medical scenario the inference process requires hundreds of iterations and is very time-consuming (a single slice may take several seconds), which severely hinders its application in online adaptive radiotherapy, where minute-level turnaround is required.

Nevertheless, existing CBCT image synthesis techniques still exhibit the following drawbacks in practical clinical application: (1) the dimensionality dilemma (Dimensionality Dilemma): 2D models process volumes slice by slice, ignoring the anatomical continuity between layers (Z axis), resulting in discontinuities between adjacent slices in the synthesized volume.