CN-121981902-A - Anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method
Abstract
The invention discloses an anti-noise spatio-temporal self-supervised pre-training method for an ultrasound image processing model. The method comprises: dividing an ultrasound video sequence into a plurality of three-dimensional spatio-temporal cubes; generating a feature-vector sequence with 3D position information from the three-dimensional spatio-temporal cubes; extracting features from the unmasked region of the feature-vector sequence with 3D position information by a feature extraction module with a spatio-temporal tube masking mechanism, to obtain encoded features; reconstructing the masked region from the encoded features by a reconstruction module; computing a structural consistency loss and a motion consistency loss; and optimizing the parameters of the ultrasound image processing model according to the structural consistency loss and the motion consistency loss. The invention addresses the difficulty of self-supervised pre-training caused by the high noise level of ultrasound images.
Inventors
- NI DONG
- TAO XING
Assignees
- Shenzhen University (深圳大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-26
Claims (10)
- 1. An anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method, characterized in that the method comprises the following steps: dividing an ultrasound video sequence into a plurality of three-dimensional spatio-temporal cubes, and generating a feature-vector sequence with 3D position information from the three-dimensional spatio-temporal cubes; extracting features from the unmasked region of the feature-vector sequence with 3D position information by a feature extraction module with a spatio-temporal tube masking mechanism, to obtain encoded features; reconstructing the masked region from the encoded features by a reconstruction module, and computing a structural consistency loss and a motion consistency loss; and optimizing the parameters of the ultrasound image processing model according to the structural consistency loss and the motion consistency loss, wherein the structural consistency loss is computed on the basis of spatial-domain or frequency-domain filtering, and the ultrasound image processing model comprises the feature extraction module and the reconstruction module.
- 2. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, further comprising, before the step of dividing the ultrasound video sequence into a plurality of three-dimensional spatio-temporal cubes: acquiring an original ultrasound video sequence; and preprocessing the ultrasound images in the original ultrasound video sequence to remove the background and text artifacts outside the sector and/or cone scanning region and extract the clean anatomical region, thereby obtaining an ultrasound video sequence composed of clean ultrasound images.
- 3. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, wherein the three-dimensional spatio-temporal cubes are mutually non-overlapping.
- 4. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, wherein generating a feature-vector sequence with 3D position information from the three-dimensional spatio-temporal cubes comprises: flattening the three-dimensional spatio-temporal cubes, mapping them into a feature-vector sequence through a linear projection layer, and superimposing learnable 3D position encodings to obtain the feature-vector sequence with 3D position information (see the first illustrative sketch following the claims).
- 5. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, wherein the spatio-temporal tube masking mechanism comprises randomly sampling spatial coordinates at a preset masking rate in the first-frame plane and masking the selected spatial coordinates continuously along the time axis to form a spatio-temporal tube mask (sketched after the claims).
- 6. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 5, wherein the spatio-temporal tube mask takes the form of a continuous mask whose position varies dynamically over time.
- 7. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, wherein the feature extraction module is an encoder, the reconstruction module is a dual-stream decoder comprising a first decoder and a second decoder, and the step of reconstructing the masked region from the encoded features by the reconstruction module and computing the structural consistency loss and the motion consistency loss comprises: generating predicted image blocks from the encoded features by the first decoder, applying a transform-domain conversion to the predicted image blocks and the corresponding three-dimensional spatio-temporal cubes respectively, and, in combination with a preset filter, computing the difference of the two transform-domain results after high-frequency component suppression to obtain the structural consistency loss; and generating predicted dynamic-change information from the encoded features by the second decoder, and computing the motion consistency loss from the predicted dynamic-change information and the corresponding reference dynamic-change information, wherein when the predicted dynamic-change information is a predicted optical flow field, the reference dynamic-change information is a dense optical flow field; when the predicted dynamic-change information is a predicted frame-difference map, the reference dynamic-change information is a real frame-difference map; and when the predicted dynamic-change information is a predicted motion vector, the reference dynamic-change information is a real motion vector (both losses are sketched after the claims).
- 8. The anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to claim 1, wherein the step of optimizing the parameters of the ultrasound image processing model according to the structural consistency loss and the motion consistency loss comprises: weighting and summing the structural consistency loss and the motion consistency loss to obtain a total loss; and optimizing the parameters of the ultrasound image processing model according to the total loss (see the last sketch after the claims).
- 9. An anti-noise spatio-temporal self-supervised ultrasound image feature extraction method, characterized by comprising the following steps: acquiring an ultrasound video sequence; and inputting the ultrasound video sequence into a trained ultrasound image processing model to obtain the corresponding encoded feature information, wherein the trained ultrasound image processing model comprises the feature extraction module and is trained by the anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to any one of claims 1 to 8.
- 10. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to implement the steps of the anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method according to any one of claims 1 to 8, or the steps of the anti-noise spatio-temporal self-supervised ultrasound image feature extraction method according to claim 9.
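For illustration only (not part of the claims): a minimal PyTorch sketch of the cube embedding described in claim 4, assuming a single-channel video tensor of shape (B, C, T, H, W), a cube size of 2×16×16, and one learnable position vector per cube. All names, sizes, and the choice of a strided 3D convolution are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class CubeEmbedding(nn.Module):
    """Flatten non-overlapping 3D spatio-temporal cubes into a feature-vector
    sequence and add learnable 3D position encodings (claim 4, sketch)."""
    def __init__(self, in_ch=1, dim=768, cube=(2, 16, 16), video=(16, 224, 224)):
        super().__init__()
        t, h, w = (v // c for v, c in zip(video, cube))
        # A 3D convolution with stride == kernel size is equivalent to
        # flattening each cube and applying a shared linear projection.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=cube, stride=cube)
        # One learnable position vector per cube, jointly encoding the
        # temporal and the two spatial grid coordinates ("3D position").
        self.pos = nn.Parameter(torch.zeros(1, t * h * w, dim))

    def forward(self, x):                  # x: (B, C, T, H, W)
        x = self.proj(x)                   # (B, dim, t, h, w)
        x = x.flatten(2).transpose(1, 2)   # (B, N, dim), N = t*h*w
        return x + self.pos                # feature vectors with 3D position info
```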
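The tube masking of claim 5 can be sketched as follows: spatial positions are sampled once at a preset masking rate in the first-frame plane, and the same positions are then masked in every frame, so each masked region forms a "tube" along the time axis; this blocks the inter-frame copying shortcut described in the Background. The masking rate and grid sizes below are assumptions.

```python
import torch

def tube_mask(t, h, w, mask_ratio=0.75, device="cpu"):
    """Sample spatial positions once in the first-frame plane and repeat the
    same mask along the time axis (claim 5, sketch). Returns a bool mask of
    shape (t, h*w) where True marks a masked cube."""
    n = h * w
    n_masked = int(n * mask_ratio)
    # Random permutation of spatial positions; mask the first n_masked.
    perm = torch.randperm(n, device=device)
    frame_mask = torch.zeros(n, dtype=torch.bool, device=device)
    frame_mask[perm[:n_masked]] = True
    # Extend the same spatial mask along the whole time axis ("tube").
    return frame_mask.unsqueeze(0).expand(t, n)
```

In use, the encoder would see only the unmasked tokens, e.g. `tokens[:, ~mask.flatten()]` after flattening the time and space axes of the embedding above.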
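One possible reading of the structural consistency loss of claim 7 (first decoder): take the predicted block and the original cube into a transform domain, suppress the high-frequency components where speckle noise concentrates using a preset filter, and penalize the difference of the filtered results. The sketch below assumes a 2D FFT with an ideal low-pass filter; the actual transform and filter may differ, since the claim allows either spatial- or frequency-domain filtering.

```python
import torch

def structural_consistency_loss(pred, target, keep_ratio=0.25):
    """Compare prediction and target in the frequency domain after
    high-frequency suppression (one reading of claim 7).
    pred, target: (B, C, H, W); keep_ratio is an assumed filter cutoff."""
    fp = torch.fft.fftshift(torch.fft.fft2(pred), dim=(-2, -1))
    ft = torch.fft.fftshift(torch.fft.fft2(target), dim=(-2, -1))
    b, c, h, w = pred.shape
    # Ideal low-pass filter: keep only a centred block of low frequencies,
    # so pixel-level speckle noise contributes little to the loss.
    mask = torch.zeros(h, w, device=pred.device)
    ch, cw = h // 2, w // 2
    rh, rw = int(h * keep_ratio / 2), int(w * keep_ratio / 2)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1.0
    return ((fp - ft).abs() * mask).pow(2).mean()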
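Finally, the motion consistency loss of claim 7 (second decoder) and the weighted total loss of claim 8 reduce, in the simplest frame-difference variant, to the sketch below. The choice of frame differences as the dynamic-change information and the unit weights are assumptions; the patent equally allows optical flow fields or motion vectors.

```python
import torch
import torch.nn.functional as F

def motion_consistency_loss(pred_diff, frames):
    """Frame-difference variant of claim 7's motion consistency loss.
    pred_diff: predicted frame-difference maps, (B, T-1, H, W).
    frames:    original video frames,           (B, T,   H, W)."""
    real_diff = frames[:, 1:] - frames[:, :-1]   # real frame-difference maps
    return F.mse_loss(pred_diff, real_diff)

def total_loss(l_struct, l_motion, w_struct=1.0, w_motion=1.0):
    """Weighted sum of the two losses (claim 8); the weights are assumed."""
    return w_struct * l_struct + w_motion * l_motion
```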
Description
Anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method

Technical Field

The invention relates to the technical field of image processing, and in particular to an anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method.

Background

Ultrasound imaging is one of the most widely used imaging modalities in clinical practice, playing an irreplaceable role in key diagnosis and treatment scenarios such as cardiac function assessment, vascular dynamics monitoring, and fetal development evaluation. Ultrasound diagnosis is essentially a dynamic analysis process that depends on spatio-temporal continuity: physicians' diagnostic decisions rely heavily on observing tissue motion patterns, deformation regularities, and continuity along the time dimension in the ultrasound video stream, rather than on extracting information from a single static section. Feature learning on dynamic ultrasound images has therefore become a core direction of related research and development.

In recent years, deep learning has made remarkable progress in the field of ultrasound image analysis, but existing successes are largely confined to the fully supervised learning paradigm, which requires expert physicians to perform fine lesion contour delineation or classification labeling on every frame of ultrasound images. Ultrasound video, however, typically has a high frame rate, and the cost of manual annotation grows steeply with the amount of data. To break through this annotation bottleneck, self-supervised learning has been developed; its core idea is to design an auxiliary task so that a model can complete pre-training on unlabeled data. Among such methods, masked video modeling (e.g., VideoMAE) has become the mainstream technical route for natural-scene video processing.
However, directly transferring video self-supervision techniques from general computer vision to ultrasound image analysis faces three fundamental obstacles, which call for a targeted technical solution.

First, noise overfitting: most existing masked reconstruction algorithms use a pixel-level mean squared error as the loss function, so a large share of the model's capacity is spent fitting the high-frequency noise distribution instead of learning the underlying clean anatomical features. The resulting pre-trained model is extremely sensitive to noise and struggles to extract pathological features of diagnostic value.

Second, shortcut learning: because of the high frame rate of ultrasound scanning and the smooth motion of the probe, adjacent frames are highly visually redundant. With the temporally discrete random masking strategies of the prior art, a model can easily complete the reconstruction task by inter-frame copying or pixel interpolation, without genuinely understanding long-horizon anatomical motion patterns such as cardiac beating and vasoconstriction, and thus fails to learn the core motion features required for dynamic ultrasound diagnosis.

Third, motion-source confusion: the visual changes in an ultrasound video actually mix two kinds of motion, the operator-induced motion of the probe and the physiological motion of human tissue. A general-purpose video model lacks anatomical prior constraints and cannot effectively distinguish the two; during pre-training it tends to learn the larger-amplitude background motion of the probe while ignoring the subtle but diagnostically critical physiological motion of lesions, so the model is insufficiently robust in downstream diagnostic tasks and easily disturbed by the physician's operating technique.

Accordingly, there is a need for improvement and development in the art.

Disclosure of Invention

The invention aims to address the above defects of the prior art by providing an anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method, so as to solve the problem that existing video models cannot effectively extract anatomical features in high-noise, high-redundancy ultrasound scenarios.

The technical solution adopted by the invention is as follows. In a first aspect, an embodiment of the invention provides an anti-noise spatio-temporal self-supervised ultrasound image processing model pre-training method, the method comprising: dividing an ultrasound video sequence into a plurality of three-dimensional spatio-temporal cubes, and generating a feature-vector sequence with 3D position information from the three-dimensional spatio-temporal cubes; extracting features from the unmasked