US-12620139-B1 - Neural network-based image segmentation

US12620139B1

Abstract

Apparatuses, systems, and techniques are presented to perform segmentation on images. In at least one embodiment, one or more neural networks are used to segment an image based, at least in part, on one or more visual modifications of the image.

Inventors

  • Ali Hatamizadeh
  • Vishwesh Nath
  • Yucheng Tang
  • Dong Yang
  • Wenqi Li
  • Holger Roth
  • Daguang Xu

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2022-03-09

Claims (16)

  1. A processor, comprising: one or more circuits to use one or more neural networks to segment an image, wherein the one or more neural networks comprise: a transformer neural network that includes one or more encoders, wherein the one or more encoders are pre-trained to extract features from input data using self-attention between a sequence of different portions of the input data based, at least in part, on one or more proxy tasks and unlabeled data; and a convolutional neural network that includes one or more decoders trained together with the one or more encoders as part of additional training of the one or more encoders to extract features from the image to segment and input to the one or more decoders based, at least in part, on labeled data.
  2. The processor of claim 1, wherein the one or more proxy tasks include removing, from one or more sub-volumes of an input image volume of the input data, one or more mask regions and training the one or more encoders to predict image data that was removed from the one or more mask regions.
  3. The processor of claim 1, wherein the one or more proxy tasks include predicting one or more sub-volumes of an input image volume of the input data, given one or more rotated versions of the one or more sub-volumes.
  4. The processor of claim 1, wherein the one or more proxy tasks include a contrastive learning task to train the one or more encoders to differentiate between different regions of interest (ROIs) including different types of features in different views or portions of the input data.
  5. A system comprising: one or more processors to use one or more neural networks to segment an image, wherein the one or more neural networks comprise: a transformer network that includes one or more encoders pre-trained to extract features from input data using self-attention between a sequence of different portions of the input data based, at least in part, on one or more proxy tasks and unlabeled data; and a convolutional neural network that includes one or more decoders trained together with the one or more encoders as part of additional training of the one or more encoders to extract features from the image to segment and input to the one or more decoders based, at least in part, on labeled data.
  6. The system of claim 5, wherein the one or more proxy tasks include removing, from one or more sub-volumes of an input image volume of the input data, one or more mask regions and training the one or more encoders to predict image data that was removed from the one or more mask regions.
  7. The system of claim 5, wherein the one or more proxy tasks include predicting one or more sub-volumes of an input image volume of the input data, given one or more rotated versions of the one or more sub-volumes.
  8. The system of claim 5, wherein the one or more proxy tasks include a contrastive learning task to train the one or more encoders to differentiate between different regions of interest (ROIs) including different types of features in different views or portions of the input data.
  9. A method comprising: using one or more neural networks to segment an image, wherein the one or more neural networks comprise: a transformer neural network that includes one or more encoders, wherein the one or more encoders are pre-trained to extract features from input data using self-attention between a sequence of different portions of the input data based, at least in part, on one or more proxy tasks and unlabeled data; and a convolutional neural network that includes one or more decoders trained together with the one or more encoders as part of additional training of the one or more encoders to extract features from the image to segment and input to the one or more decoders based, at least in part, on labeled data.
  10. The method of claim 9, wherein the one or more proxy tasks include removing, from one or more sub-volumes of an input image volume of the input data, one or more mask regions and training the one or more encoders to predict image data that was removed from the one or more mask regions.
  11. The method of claim 9, wherein the one or more proxy tasks include predicting one or more sub-volumes of an input image volume of the input data, given one or more rotated versions of the one or more sub-volumes.
  12. The method of claim 9, wherein the one or more proxy tasks include a contrastive learning task to train the one or more encoders to differentiate between different regions of interest (ROIs) including different types of features in different views or portions of the input data.
  13. An image segmentation system, comprising: one or more processors to use one or more neural networks to segment an image; memory for storing network parameters for the one or more neural networks; and wherein the one or more neural networks comprise: a transformer neural network that includes one or more encoders, wherein the one or more encoders are pre-trained to extract features from input data using self-attention between a sequence of different portions of the input data based, at least in part, on one or more proxy tasks and unlabeled data; and a convolutional neural network that includes one or more decoders trained together with the one or more encoders as part of additional training of the one or more encoders to extract features from the image to segment and input to the one or more decoders based, at least in part, on labeled data.
  14. The image segmentation system of claim 13, wherein the one or more proxy tasks include removing, from one or more sub-volumes of an input image volume of the input data, one or more mask regions and training the one or more encoders to predict image data that was removed from the one or more mask regions.
  15. The image segmentation system of claim 13, wherein the one or more proxy tasks include predicting one or more sub-volumes of an input image volume of the input data, given one or more rotated versions of the one or more sub-volumes.
  16. The image segmentation system of claim 13, wherein the one or more proxy tasks include a contrastive learning task to train the one or more encoders to differentiate between different regions of interest (ROIs) including different types of features in different views or portions of the input data.
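The claims above recite a hybrid segmentation pipeline: an image is split into a sequence of portions, a transformer encoder applies self-attention between those portions, and a convolutional-style decoder maps the resulting features back to a per-pixel segmentation. The toy NumPy sketch below illustrates only that data flow; it is not the patented implementation. All weights are random, the dimensions and names (`patch`, `embed`, `head`) are invented for illustration, and the pre-training on proxy tasks and labeled fine-tuning described in the claims are omitted entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Scaled dot-product self-attention between all tokens in the sequence,
    # i.e. "between a sequence of different portions of the input data".
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def segment(image, params, patch=4, n_classes=2):
    # 1) Patchify: split the image into non-overlapping patches -- these are
    #    the "portions" the encoder attends between.
    H, W = image.shape
    gh, gw = H // patch, W // patch
    patches = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    tokens = patches.reshape(gh * gw, patch * patch) @ params["embed"]  # (N, d)
    # 2) Transformer-encoder stand-in: one self-attention layer over tokens.
    feats = self_attention(tokens, params["Wq"], params["Wk"], params["Wv"])
    # 3) Decoder stand-in: per-patch class logits, then nearest-neighbor
    #    upsampling back to pixel resolution (a real embodiment would use a
    #    learned convolutional decoder here).
    logits = (feats @ params["head"]).reshape(gh, gw, n_classes)
    mask = logits.argmax(-1)
    return np.kron(mask, np.ones((patch, patch), dtype=int))  # (H, W) labels

rng = np.random.default_rng(0)
d = 8
params = {
    "embed": rng.normal(size=(16, d)),
    "Wq": rng.normal(size=(d, d)),
    "Wk": rng.normal(size=(d, d)),
    "Wv": rng.normal(size=(d, d)),
    "head": rng.normal(size=(d, 2)),
}
image = rng.normal(size=(16, 16))
mask = segment(image, params)
print(mask.shape)  # (16, 16)
```

In the embodiments described below, the encoder is a Swin transformer (windowed self-attention over shifted windows) pre-trained with proxy tasks such as masked-region inpainting, rotation prediction, and contrastive learning on unlabeled data, and the decoder is a convolutional network trained jointly with the encoder on labeled data.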

Description

TECHNICAL FIELD

At least one embodiment pertains to processing resources used to perform and facilitate artificial intelligence. For example, at least one embodiment pertains to processors or computing systems used to train neural networks, and at least one embodiment pertains to processors or computing systems for performing inferencing using neural networks, according to various novel techniques described herein.

BACKGROUND

Machine learning is increasingly being utilized to perform various tasks. In order to train machine learning models to perform various types of inferencing tasks, such as image segmentation for medical image analysis, these models typically need to be trained using a large amount of training data. This training data often needs to be annotated or labeled, which for tasks such as medical image segmentation can require manual annotation of numerous medical images by experts in medical image annotation. Accordingly, producing a sufficient amount of annotated training data can be time consuming and expensive, and not producing enough annotated training data can result in a machine learning model that does not produce sufficiently accurate results. Attempts to train models without annotations have proven to be challenging and have not produced sufficiently accurate results in most prior attempts.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 illustrates an example training framework, according to at least one embodiment; FIG. 2 illustrates a Swin transformer architecture, according to at least one embodiment; FIGS. 3A and 3B illustrate a Swin transformer block and a shifted windowing mechanism that can be used to train a neural network, according to at least one embodiment; FIG. 4 illustrates a process for training a network using unannotated and annotated training data, according to at least one embodiment; FIG. 5 illustrates a process for performing image segmentation, according to at least one embodiment; FIG.
6 illustrates an example system for training a model and performing inferencing, according to at least one embodiment; FIG. 7A illustrates inference and/or training logic, according to at least one embodiment; FIG. 7B illustrates inference and/or training logic, according to at least one embodiment; FIG. 8 illustrates training and deployment of a neural network, according to at least one embodiment; FIG. 9 illustrates an example data center system, according to at least one embodiment; FIG. 10A illustrates an example of an autonomous vehicle, according to at least one embodiment; FIG. 10B illustrates an example of camera locations and fields of view for the autonomous vehicle of FIG. 10A, according to at least one embodiment; FIG. 10C is a block diagram illustrating an example system architecture for the autonomous vehicle of FIG. 10A, according to at least one embodiment; FIG. 10D is a diagram illustrating a system for communication between cloud-based server(s) and the autonomous vehicle of FIG. 10A, according to at least one embodiment; FIG. 11 is a block diagram illustrating a computer system, according to at least one embodiment; FIG. 12 is a block diagram illustrating a computer system, according to at least one embodiment; FIG. 13 illustrates a computer system, according to at least one embodiment; FIG. 14 illustrates a computer system, according to at least one embodiment; FIG. 15A illustrates a computer system, according to at least one embodiment; FIG. 15B illustrates a computer system, according to at least one embodiment; FIG. 15C illustrates a computer system, according to at least one embodiment; FIG. 15D illustrates a computer system, according to at least one embodiment; FIGS. 15E and 15F illustrate a shared programming model, according to at least one embodiment; FIG. 16 illustrates exemplary integrated circuits and associated graphics processors, according to at least one embodiment; FIGS. 
17A-17B illustrate exemplary integrated circuits and associated graphics processors, according to at least one embodiment; FIGS. 18A-18B illustrate additional exemplary graphics processor logic according to at least one embodiment; FIG. 19 illustrates a computer system, according to at least one embodiment; FIG. 20A illustrates a parallel processor, according to at least one embodiment; FIG. 20B illustrates a partition unit, according to at least one embodiment; FIG. 20C illustrates a processing cluster, according to at least one embodiment; FIG. 20D illustrates a graphics multiprocessor, according to at least one embodiment; FIG. 21 illustrates a multi-graphics processing unit (GPU) system, according to at least one embodiment; FIG. 22 illustrates a graphics processor, according to at least one embodiment; FIG. 23 is a block diagram illustrating a processor micro-architecture for a processor, according to at least one embodiment; FIG. 24 illustrates a deep learning application processor, according to at least one embodiment; FIG. 25 is a block diagram illustrati