
US-12626340-B2 - Systems, methods, and apparatuses for implementing self-supervised visual representation learning using order and appearance recovery on a vision transformer


Abstract

Described herein are means for performing self-supervised visual representation learning using order and appearance recovery on a vision transformer. An exemplary system having a processor and memory is specially configured to execute instructions including: receiving medical image training data; selecting a medical image; generating a first perturbed image by applying local pixel shuffling and other image perturbations and outputting a first patchified perturbed image; generating a second randomized patchified image by patchifying the original image and applying a random permutation to its patches; inputting the first patchified perturbed image and the second randomized patchified image into first and second transformer encoders, which each generate, and then share, first and second sets of weights through the recovery of both patch order and appearance from each image; and outputting a pre-trained AI model to perform medical image diagnosis on a new medical image absent from the training data input received by the system.
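For orientation only, the sketch below illustrates one way the two perturbed inputs summarized above could be prepared: an appearance-perturbed branch (local pixel shuffling followed by patchifying) and an order-perturbed branch (patchifying followed by a random patch permutation). The window size, shuffle probability, patch size, and helper names are illustrative assumptions and are not taken from the disclosure.

    import numpy as np

    def local_pixel_shuffle(img, window=8, prob=0.5):
        """Shuffle pixels inside small local windows of a 2-D image (illustrative parameters)."""
        out = img.copy()
        h, w = img.shape
        for y in range(0, h - window + 1, window):
            for x in range(0, w - window + 1, window):
                if np.random.rand() < prob:
                    block = out[y:y + window, x:x + window].reshape(-1)
                    np.random.shuffle(block)
                    out[y:y + window, x:x + window] = block.reshape(window, window)
        return out

    def patchify(img, patch=16):
        """Split an H x W image into a row-major sequence of non-overlapping patch x patch tiles."""
        h, w = img.shape
        tiles = img.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
        return tiles.reshape(-1, patch, patch)

    def make_branches(img):
        # Branch 1: perturb appearance first (patch order preserved), then patchify.
        perturbed = local_pixel_shuffle(img)   # further perturbations (non-linear, in/out-painting) would follow here
        branch1 = patchify(perturbed)
        # Branch 2: patchify the original image, then randomly permute the patch order.
        tiles = patchify(img)
        order = np.random.permutation(len(tiles))
        branch2 = tiles[order]
        return branch1, branch2, order         # 'order' is the ground truth for order recovery

In such a sketch, the returned permutation would serve as the target when the encoder is later asked to recover the original patch order.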

Inventors

  • Jiaxuan Pang
  • DongAo Ma
  • Jianming Liang

Assignees

  • ARIZONA BOARD OF REGENTS ON BEHALF OF ARIZONA STATE UNIVERSITY

Dates

Publication Date
2026-05-12
Application Date
2023-03-24

Claims (20)

  1. A system comprising: a memory to store instructions; a processor to execute the instructions stored in the memory, including: receiving one or more medical images as training data input at the system; selecting a medical image from among the training data; generating a first perturbed image from the medical image selected by applying local pixel shuffling to the medical image to generate a shuffled image, applying one or more additional image perturbations to the shuffled image, to generate an interim perturbed image, transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image, and outputting the first group of patches as the first perturbed image; generating a second perturbed image from the medical image selected by transforming the medical image selected into a second group of patches collectively corresponding to the medical image selected, applying a random permutation to the second group of patches to generate a patch randomized interim image, and outputting the patch randomized interim image as the second perturbed image; inputting the first perturbed image into a first transformer encoder to perform self-supervised visual representation learning by recovering order and appearance information from the first perturbed image, resulting in a first set of weights associated with the first perturbed image; inputting the second perturbed image into a second transformer encoder, different than the first transformer encoder, to perform self-supervised visual representation learning by recovering order and appearance information from the second perturbed image, resulting in a second set of weights associated with the second perturbed image; sharing the first and second sets of weights among the first and second transformer encoders; and outputting a pre-trained AI model to perform medical image diagnosis on a new medical image which forms no part of the training data input received by the system.
  2. The system of claim 1, further comprising: applying fine-tuning to the pre-trained AI model using a publicly available standardized dataset; and outputting the pre-trained and fine-tuned AI model to perform the medical image diagnosis.
  3. The system of claim 2, wherein: applying the fine-tuning to the pre-trained AI model using the publicly available standardized dataset comprises using one of a publicly available standardized NIH ChestX-ray14 dataset or a publicly available standardized CheXpert dataset.
  4. The system of claim 1: wherein applying the one or more additional image perturbations to the shuffled image comprises: applying non-linear processing to the shuffled image to generate a processed shuffled image; applying in-painting or out-painting or both in-painting and out-painting to the processed shuffled image to generate an expanded image; and wherein transforming the interim perturbed image into the first group of patches collectively corresponding to the interim perturbed image comprises transforming the expanded image into the first group of patches collectively corresponding to the expanded image generated from the application of the in-painting or out-painting or both.
  5. The system of claim 1: wherein a Vision Transformer (ViT) base model is utilized as a default backbone for applying the first and second image perturbations of the medical image.
  6. The system of claim 1: wherein a user-specified Vision Transformer (ViT) model is specified as a second input; supplementing a Vision Transformer (ViT) base model utilized as a default backbone with the user-specified Vision Transformer (ViT) model specified as the second input; and applying the first and second image perturbations of the medical image utilizing the user-specified Vision Transformer (ViT) model as specified via the second input.
  7. The system of claim 1: wherein applying the local pixel shuffling to the medical image to generate the shuffled image comprises applying the local pixel shuffling at a 50% application threshold to generate the shuffled image.
  8. The system of claim 1: wherein transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image, and outputting the first group of patches as the first perturbed image comprises executing instructions via the processor for applying a patchify algorithm to the interim perturbed image to generate the first perturbed image.
  9. The system of claim 1: wherein transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image comprises dividing the interim perturbed image into a 4×4 block of 16 total patches or into an 8×8 block of 64 total patches; and outputting the 4×4 block of 16 total patches or the 8×8 block of 64 total patches as the first perturbed image.
  10. The system of claim 1: wherein generating the first perturbed image from the medical image and generating the second perturbed image from the medical image comprises executing a Vision Transformer (ViT) at the system against the medical image selected to generate the first and second perturbed images.
  11. A computer-implemented method performed by a system having at least a processor and a memory therein to execute instructions comprising: receiving one or more medical images as training data input at the system; selecting a medical image from among the training data; generating a first perturbed image from the medical image selected by applying local pixel shuffling to the medical image to generate a shuffled image, applying one or more additional image perturbations to the shuffled image, to generate an interim perturbed image, transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image, and outputting the first group of patches as the first perturbed image; generating a second perturbed image from the medical image selected by transforming the medical image selected into a second group of patches collectively corresponding to the medical image selected, applying a random permutation to the second group of patches to generate a patch randomized interim image, and outputting the patch randomized interim image as the second perturbed image; inputting the first perturbed image into a first transformer encoder to perform self-supervised visual representation learning by recovering order and appearance information from the first perturbed image, resulting in a first set of weights associated with the first perturbed image; inputting the second perturbed image into a second transformer encoder, different than the first transformer encoder, to perform self-supervised visual representation learning by recovering order and appearance information from the second perturbed image, resulting in a second set of weights associated with the second perturbed image; sharing the first and second sets of weights among the first and second transformer encoders; and outputting a pre-trained AI model to perform medical image diagnosis on a new medical image which forms no part of the training data input received by the system.
  12. The computer-implemented method of claim 11, further comprising: applying fine-tuning to the pre-trained AI model using a publicly available standardized dataset; outputting the pre-trained and fine-tuned AI model to perform the medical image diagnosis; and wherein the publicly available standardized dataset comprises one of a publicly available standardized NIH ChestX-ray14 dataset or a publicly available standardized CheXpert dataset.
  13. The computer-implemented method of claim 11: wherein applying the one or more additional image perturbations to the shuffled image comprises: applying non-linear processing to the shuffled image to generate a processed shuffled image; applying in-painting or out-painting or both in-painting and out-painting to the processed shuffled image to generate an expanded image; and wherein transforming the interim perturbed image into the first group of patches collectively corresponding to the interim perturbed image comprises transforming the expanded image into the first group of patches collectively corresponding to the expanded image generated from the application of the in-painting or out-painting or both.
  14. The computer-implemented method of claim 11: wherein a Vision Transformer (ViT) base model is utilized as a default backbone for applying the first and second image perturbations of the medical image.
  15. The computer-implemented method of claim 11: wherein a user-specified Vision Transformer (ViT) model is specified as a second input; supplementing a Vision Transformer (ViT) base model utilized as a default backbone with the user-specified Vision Transformer (ViT) model specified as the second input; and applying the first and second image perturbations of the medical image utilizing the user-specified Vision Transformer (ViT) model as specified via the second input.
  16. The computer-implemented method of claim 11: wherein applying the local pixel shuffling to the medical image to generate the shuffled image comprises applying the local pixel shuffling at a 50% application threshold to generate the shuffled image.
  17. The computer-implemented method of claim 11: wherein transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image, and outputting the first group of patches as the first perturbed image comprises executing instructions via the processor for applying a patchify algorithm to the interim perturbed image to generate the first perturbed image.
  18. The computer-implemented method of claim 11: wherein transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image comprises dividing the interim perturbed image into a 4×4 block of 16 total patches or into an 8×8 block of 64 total patches; and outputting the 4×4 block of 16 total patches or the 8×8 block of 64 total patches as the first perturbed image.
  19. The computer-implemented method of claim 11: wherein generating the first perturbed image from the medical image and generating the second perturbed image from the medical image comprises executing a Vision Transformer (ViT) at the system against the medical image selected to generate the first and second perturbed images.
  20. Non-transitory computer readable storage media having instructions stored thereupon that, when executed by a system having at least a processor and a memory therein, the instructions cause the processor to execute instructions, including: receiving one or more medical images as training data input at the system; selecting a medical image from among the training data; generating a first perturbed image from the medical image selected by applying local pixel shuffling to the medical image to generate a shuffled image, applying one or more additional image perturbations to the shuffled image, to generate an interim perturbed image, transforming the interim perturbed image into a first group of patches collectively corresponding to the interim perturbed image, and outputting the first group of patches as the first perturbed image; generating a second perturbed image from the medical image selected by transforming the medical image selected into a second group of patches collectively corresponding to the medical image selected, applying a random permutation to the second group of patches to generate a patch randomized interim image, and outputting the patch randomized interim image as the second perturbed image; inputting the first perturbed image into a first transformer encoder to perform self-supervised visual representation learning by recovering order and appearance information from the first perturbed image, resulting in a first set of weights associated with the first perturbed image; inputting the second perturbed image into a second transformer encoder, different than the first transformer encoder, to perform self-supervised visual representation learning by recovering order and appearance information from the second perturbed image, resulting in a second set of weights associated with the second perturbed image; sharing the first and second sets of weights among the first and second transformer encoders; and outputting a pre-trained AI model to perform medical image diagnosis on a new medical image which forms no part of the training data input received by the system.
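As a non-authoritative sketch of the arrangement recited in claims 1, 11, and 20, the PyTorch-style code below shows two perturbed views passing through a transformer encoder whose weights are shared (weight sharing is expressed here by reusing a single encoder module for both branches), with one head recovering patch appearance and another recovering patch order. The encoder depth, dimensions, head designs, and loss choices are illustrative assumptions rather than the claimed implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEncoderPretrainer(nn.Module):
        """One weight-shared transformer encoder serving both perturbed branches (illustrative)."""
        def __init__(self, num_patches=196, patch_dim=256, model_dim=768, depth=12, heads=12):
            super().__init__()
            self.embed = nn.Linear(patch_dim, model_dim)                   # patch embedding
            self.pos = nn.Parameter(torch.zeros(1, num_patches, model_dim))
            layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared by both branches
            self.appearance_head = nn.Linear(model_dim, patch_dim)         # reconstruct patch pixels
            self.order_head = nn.Linear(model_dim, num_patches)            # predict original patch index

        def forward(self, branch1, branch2):
            # branch1: appearance-perturbed patches; branch2: order-permuted patches (B, N, patch_dim)
            z1 = self.encoder(self.embed(branch1) + self.pos)
            z2 = self.encoder(self.embed(branch2) + self.pos)
            return self.appearance_head(z1), self.order_head(z2)

    def pretraining_loss(model, branch1, branch2, target_patches, target_order):
        recon, order_logits = model(branch1, branch2)
        appearance_loss = F.mse_loss(recon, target_patches)                       # appearance recovery
        order_loss = F.cross_entropy(order_logits.transpose(1, 2), target_order)  # order recovery
        return appearance_loss + order_loss

In this sketch, the encoder weights that result from minimizing the combined loss would form the pre-trained backbone that is subsequently fine-tuned for medical image diagnosis.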

Description

CLAIM OF PRIORITY

This non-provisional U.S. Utility Patent Application is related to, and claims priority to, U.S. Provisional Patent Application No. 63/323,986, entitled "SYSTEMS, METHODS, AND APPARATUSES FOR IMPLEMENTING SELF-SUPERVISED VISUAL REPRESENTATION LEARNING BY RECOVERING ORDER AND APPEARANCE ON VISION TRANSFORMER," filed Mar. 25, 2022, the entire contents of which are incorporated herein by reference as though set forth in full.

GOVERNMENT RIGHTS AND GOVERNMENT AGENCY SUPPORT NOTICE

This invention was made with government support under R01 HL128785 awarded by the National Institutes of Health. The government has certain rights in the invention.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

TECHNICAL FIELD

Embodiments of the invention relate generally to the field of medical imaging and analysis using the self-supervised learning (SSL) capabilities of a Vision Transformer (ViT) for the classification and annotation of medical images, and more particularly, to systems, methods, and apparatuses for implementing self-supervised visual representation learning using order and appearance recovery on a vision transformer, specifically in which trained models derived from such techniques are utilized for processing medical images.

BACKGROUND

The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to embodiments of the claimed inventions.

Machine learning models have various applications in automatically processing inputs and producing outputs, drawing on situational factors and learned information to improve output quality. One area where machine learning models, and neural networks in particular, provide high utility is the processing of medical images.

Within the context of machine learning, and with regard to deep learning specifically, a Convolutional Neural Network (CNN, or ConvNet) is a class of deep neural networks very often applied to analyzing visual imagery. Convolutional Neural Networks are regularized versions of multilayer perceptrons. Multilayer perceptrons are fully connected networks, such that each neuron in one layer is connected to all neurons in the next layer, a characteristic which often leads to overfitting of the data and the need for model regularization. Convolutional Neural Networks also seek to apply model regularization, but with a distinct approach: CNNs take advantage of the hierarchical pattern in data and assemble more complex patterns from smaller and simpler patterns. Consequently, on the scale of connectedness and complexity, CNNs are on the lower extreme.

Also used within the context of machine learning are Vision Transformers (ViTs). A Vision Transformer is a transformer targeted at vision processing tasks such as image recognition.
Transformers found their initial applications in natural language processing (NLP) tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN); well-known examples include Xception, ResNet, EfficientNet, DenseNet, and Inception. Unlike CNNs, Transformers measure the relationships between pairs of input tokens (words in the case of text strings), an operation termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation.

Instead, the ViT computes relationships among pixels in various small sections of the image (e.g., 16×16 pixels), at a drastically reduced cost. The sections (with positional embeddings) are placed in a sequence. The embeddings are learnable vectors. Each section is arranged into a linear sequence and multiplied by the embedding matrix. The result, together with the position embedding, is fed to the transformer. The architecture for image classification is the most common and uses only the Transformer Encoder to transform the various input tokens. However, there are also other applications in which the decoder part of the traditional Transformer architecture is also used. Heretofore, self-supervised learning has been sparsely applied in the field of medical imaging. Nevertheless
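To make the patch-embedding mechanism described above concrete, the following is a minimal, hedged sketch of the conventional ViT input pipeline: the image is split into fixed-size sections, each section is flattened and multiplied by a learnable embedding matrix, and learnable position embeddings are added before the resulting token sequence is fed to the transformer encoder. The sizes and names are illustrative, not specific to the disclosed embodiments.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into fixed-size sections, flatten, project, and add position embeddings."""
        def __init__(self, image_size=224, patch_size=16, channels=1, dim=768):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            self.patch_size = patch_size
            self.proj = nn.Linear(channels * patch_size * patch_size, dim)   # the embedding matrix
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # learnable position embeddings

        def forward(self, images):                                 # images: (B, C, H, W)
            p = self.patch_size
            b, c, h, w = images.shape
            sections = images.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
            sections = sections.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
            return self.proj(sections) + self.pos_embed            # token sequence for the encoder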