CN-122023872-A - Sentence-region alignment-based ultrasound multi-modal base model construction method, system, terminal and medium
Abstract
The invention discloses a sentence-region alignment-based ultrasound multi-modal base model construction method, system, terminal and medium. The method comprises: preprocessing the acquired multi-section ultrasound images and diagnostic reports respectively, and establishing training sample pairs; performing image feature encoding on the training sample pairs to obtain image global fusion features and image local integration features, and performing text feature encoding to obtain text global features and text local features; establishing a global alignment loss function based on the image global fusion features and the text global features, and a local alignment loss function based on the image local integration features and the text local features; and performing model training and optimization with a two-stage training strategy to construct an ultrasound multi-modal base model. The invention resolves key technical challenges in constructing ultrasound multi-modal base models and achieves high-quality semantic alignment and efficient clinical deployment.
Inventors
- NI DONG
- WANG JIAN
Assignees
- Shenzhen University (深圳大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-24
Claims (10)
- 1. A sentence-region alignment-based ultrasound multi-modal base model construction method, characterized by comprising the following steps: acquiring multi-section ultrasound images and a corresponding diagnostic report, preprocessing the multi-section ultrasound images and the diagnostic report respectively, and establishing a training sample pair binding the multi-section ultrasound images with a single diagnostic report; performing image feature encoding on the training sample pair to obtain image global fusion features and image local integration features corresponding to the ultrasound images, and performing text feature encoding on the training sample pair to obtain text global features and text local features corresponding to the diagnostic report; performing semantic alignment based on the image global fusion features and the text global features to establish a global alignment loss function, and performing semantic alignment based on the image local integration features and the text local features to establish a local alignment loss function; and, based on the global alignment loss function and the local alignment loss function, performing model training and optimization with a two-stage training strategy to construct an ultrasound multi-modal base model.
- 2. The sentence-region alignment-based ultrasound multi-modal base model construction method according to claim 1, wherein preprocessing the multi-section ultrasound images and the diagnostic report respectively comprises: for the multi-section ultrasound images, performing random cropping, scaling to a fixed resolution, random horizontal flipping, random elastic deformation, random rotation, random contrast adjustment, random brightness adjustment and intensity normalization; and, for the diagnostic report, performing synonym substitution, random sentence rearrangement, punctuation recognition, special handling of medical terms and filtering of invalid sentences, and outputting a structured sentence sequence.
- 3. The sentence-region alignment-based ultrasound multi-modal base model construction method of claim 2, wherein establishing the training sample pair binding the multi-section ultrasound images with a single diagnostic report comprises: arranging M ultrasound images of different sections from the same case into an image set based on the preprocessed multi-section ultrasound images; and binding the image set with the corresponding preprocessed single diagnostic report to form the training sample pair.
- 4. The sentence-region alignment-based ultrasound multi-modal base model construction method according to claim 1, wherein performing image feature encoding on the training sample pair to obtain the image global fusion feature and image local integration feature corresponding to the ultrasound images comprises: for each ultrasound image in the training sample pair, dividing the ultrasound image into a plurality of image patches, concatenating the patch features with a learnable token, adding positional encodings, feeding the result into a Transformer encoder, and outputting an image global feature and image local features; concatenating the plurality of image global features with a learnable token, feeding them into a multi-layer Transformer encoder, and outputting the image global fusion feature; and concatenating the image local features to obtain the image local integration feature.
- 5. The sentence-region alignment-based ultrasound multi-modal base model construction method according to claim 4, wherein performing text feature encoding on the training sample pair to obtain the text global feature and text local features corresponding to the diagnostic report comprises: adopting a pre-trained BERT model as the text encoder to process the structured sentence sequence of the diagnostic report in the training sample pair, taking the output of a learnable token as the text global feature and obtaining a feature for each word; and average-pooling the features of all words within a single sentence to obtain sentence-level features, which serve as the text local features.
- 6. The sentence-region alignment-based ultrasound multi-modal base model construction method of claim 5, wherein performing semantic alignment based on the image global fusion feature and the text global feature to establish a global alignment loss function, and performing semantic alignment based on the image local integration feature and the text local features to establish a local alignment loss function, comprises: calculating the global cosine similarity between the image global fusion features of the ultrasound images and the text global features of the diagnostic reports over all training sample pairs in a batch, taking ultrasound image-diagnostic report pairs from the same case as positive samples and pairs from different cases as negative samples, and computing the global alignment loss function from the global cosine similarities; and, within the same ultrasound image-diagnostic report pair, calculating the local cosine similarity between each text local feature and the image local integration feature, reshaping the local cosine similarities into the spatial dimensions to form a similarity heat map, binarizing the similarity heat map, retaining the largest connected region through connected-region analysis, extracting the average of the features of the image patches within that region as the positive region feature, taking the features of the remaining image patches as the negative sample set, and computing the local alignment loss function, wherein the local alignment loss function comprises a first local alignment loss function and a second local alignment loss function.
- 7. The sentence-region alignment-based ultrasound multi-modal base model construction method of claim 6, wherein performing model training and optimization with a two-stage training strategy based on the global alignment loss function and the local alignment loss function comprises: establishing a total loss function from the global alignment loss function, the first local alignment loss function and the second local alignment loss function; in the first training stage, pre-training the model with the global alignment loss function to establish basic semantic associations; and, in the second training stage, jointly optimizing the pre-trained model with the total loss function, gradually introducing the local alignment constraints, completing model training, and constructing the ultrasound multi-modal base model.
- 8. A sentence-region alignment-based ultrasound multi-modal base model construction system for implementing the steps of the sentence-region alignment-based ultrasound multi-modal base model construction method of any of claims 1-7, the system comprising: a data preprocessing and organization module for acquiring the multi-section ultrasound images and the corresponding diagnostic report, preprocessing them respectively, and establishing a training sample pair binding the multi-section ultrasound images with a single diagnostic report; a multi-modal feature extraction module for performing image feature encoding on the training sample pair to obtain the image global fusion feature and image local integration feature corresponding to the ultrasound images, and performing text feature encoding on the training sample pair to obtain the text global feature and text local features corresponding to the diagnostic report; a multi-scale semantic alignment module for performing semantic alignment based on the image global fusion feature and the text global feature to establish a global alignment loss function, and performing semantic alignment based on the image local integration feature and the text local features to establish a local alignment loss function; and a model training and optimization module for performing model training and optimization with a two-stage training strategy based on the global alignment loss function and the local alignment loss function, to construct an ultrasound multi-modal base model.
- 9. A terminal comprising a memory, a processor, and a sentence-region alignment-based ultrasound multi-modal base model construction program stored in the memory and executable on the processor, the processor implementing the steps of the sentence-region alignment-based ultrasound multi-modal base model construction method of any of claims 1-7 when executing the program.
- 10. A computer-readable storage medium having stored thereon a sentence-region alignment-based ultrasound multi-modal base model construction program which, when executed by a processor, implements the steps of the sentence-region alignment-based ultrasound multi-modal base model construction method of any of claims 1-7.
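The multi-scale alignment mechanism described in claims 4-7 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the InfoNCE form of the global loss, the temperature value, the mean-threshold binarization and all function names are assumptions for illustration.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize feature vectors to unit length (for cosine similarity)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def global_alignment_loss(img_global, txt_global, temperature=0.07):
    """Symmetric InfoNCE over a batch: same-case image/report pairs are
    positives (the diagonal); all cross-case pairs serve as negatives."""
    sim = l2norm(img_global) @ l2norm(txt_global).T / temperature  # (B, B) logits
    idx = np.arange(sim.shape[0])
    def ce(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return 0.5 * (ce(sim) + ce(sim.T))  # image->text and text->image directions

def largest_connected_region(mask):
    """Return the largest 4-connected True region of a boolean grid."""
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    best = np.zeros((H, W), dtype=bool)
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                stack, region = [(i, j)], []
                seen[i, j] = True
                while stack:  # iterative flood fill
                    a, b = stack.pop()
                    region.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if 0 <= na < H and 0 <= nb < W and mask[na, nb] and not seen[na, nb]:
                            seen[na, nb] = True
                            stack.append((na, nb))
                if len(region) > best.sum():
                    best[:] = False
                    for a, b in region:
                        best[a, b] = True
    return best

def sentence_region_split(sent_feat, patch_feats, grid_hw):
    """Reshape sentence-patch cosine similarities into a spatial heat map,
    binarize it, keep the largest connected region, and return the averaged
    positive-region feature plus the remaining patch features as negatives."""
    H, W = grid_hw
    sim = (l2norm(patch_feats) @ l2norm(sent_feat)).reshape(H, W)
    region = largest_connected_region(sim > sim.mean())  # assumed threshold
    grid = patch_feats.reshape(H, W, -1)
    return grid[region].mean(axis=0), grid[~region]
```

In this sketch the positive-region feature and negative set returned by `sentence_region_split` would feed a sentence-level contrastive term analogous to the claimed first and second local alignment loss functions.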
Description
Sentence-region alignment-based ultrasound multi-modal base model construction method, system, terminal and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a sentence-region alignment-based ultrasound multi-modal base model construction method, system, terminal and medium.
Background
During ultrasound examination, a physician must manually operate the probe and read images in real time; the process depends heavily on the operator's clinical experience and is markedly subjective. For example, in critical diagnostic steps such as delineating lesion boundaries and measuring organ parameters, different physicians may produce divergent results owing to differences in accumulated experience and judgment criteria, increasing the risk of misdiagnosis or missed diagnosis. The problem is especially prominent in primary medical institutions (which lack senior sonographers) and in complex cases, and severely constrains the standardization of ultrasound diagnosis and the reliability of its results. In recent years, deep learning has significantly improved the consistency and efficiency of ultrasound analysis through automated image segmentation, lesion detection and parameter measurement. Segmentation networks exemplified by U-Net can automatically compute the right atrial area, effectively reducing human error. However, the performance of deep learning models depends heavily on high-quality annotated data (e.g., lesion segmentation masks, organ boundary annotations), and medical annotation is performed manually by experts, which is costly and time-consuming. Moreover, poor generalization across hospitals and devices is another bottleneck: a model trained on a particular device or at a particular hospital suffers performance drops in other institutions or scenarios, limiting the breadth of its clinical application.
To alleviate the models' dependence on labeled data, researchers began exploring the pre-training of ultrasound base models with self-supervised learning (SSL), using large quantities of unlabeled ultrasound data to learn generic image features. The existing mainstream self-supervised learning methods fall into two classes: (1) contrastive learning frameworks, which generate positive sample pairs by applying data augmentations such as random cropping, elastic deformation and gray-level transformation to the original ultrasound images, then train the model to maximize the feature similarity of positive pairs and minimize that of negative pairs, thereby learning discriminative image features; and (2) masked autoencoder (MAE) frameworks, which randomly mask part of the ultrasound image (e.g., random image patches) and train the model to reconstruct the masked content, thereby learning the global structure and local detail features of the image.
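The MAE-style masking in class (2) can be made concrete with a minimal NumPy sketch; the mask ratio, patch-embedding layout and function name here are illustrative assumptions, not tied to any specific ultrasound model.

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=None):
    """MAE-style random masking: keep a random subset of patch embeddings
    and return them with the bookkeeping needed for reconstruction.
    Returns (kept_patches, kept_indices, mask), where mask[i] is True for
    patches that were hidden and must be reconstructed by the decoder."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # visible patches are not masked
    return patches[keep_idx], keep_idx, mask
```

An encoder would see only the kept patches, and the reconstruction loss would be computed on the positions where `mask` is True.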
These self-supervised methods improve downstream task performance (e.g., lesion detection, image segmentation) under limited-annotation scenarios to some extent and strengthen the model's basic generalization capability. However, existing self-supervised ultrasound models are trained on pure ultrasound image data and lack the capacity to model clinical text semantics, leaving them with obvious limitations: (1) insufficient semantic understanding, in that no correlation is established between clinical text descriptions such as "right atrial enlargement", "tricuspid regurgitation" or "hypoechoic nodule" and the corresponding ultrasound image features, so the pathological semantics in the text are difficult to fuse into image feature learning; and (2) lack of pathological-mechanism cognition, in that relying on image features alone makes it difficult to deeply understand the pathological nature of lesions; for lesions with similar imaging features, physiological cysts are hard to accurately distinguish from early malignant tumors, benign nodules and borderline nodules, which limits the model's practical clinical value. Although image-text joint modeling in the medical field has seen initial exploration, the following core problems remain: (1) insufficient alignment granularity, in that prior work focuses on the global alignment of images and reports (e.g., CLIP-style methods) while fine-grained alignment between local regions and sentences is weak; for example, the inability to precisely correlate the text description "right atrial enlargement" with the morphological features of the right atrial region in the image results in incomplete semantic transfer. (2) Few studies have attempted to refine the alignment granularity through patch-level local alignment (image