US-20260127865-A1 - TRAINING VISION MODELS WITH UNIFIED CONTRASTIVE LEARNING


Abstract

Examples are provided for pre-training a computer vision foundation model. A representative method comprises curating a pre-training database of image-text pairs from weakly labeled data. Language of text descriptions from the image-text pairs is encoded. The images of the image-text pairs are encoded using a hierarchical vision transformer with shifted windows and convolutional embedding. Based on the encoded images and the encoded language, the computer vision foundation model is pre-trained via unified image-text contrastive learning.
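As a rough illustration of the unified image-text contrastive objective described in the abstract, the following is a minimal sketch in PyTorch. It assumes pre-computed, batch-aligned image and text embeddings; the function name, the label-derived positive mask, and the temperature value are illustrative assumptions of this sketch, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss over a batch of image-text pairs.

    Unlike a strictly pairwise (diagonal-target) loss, entries sharing the
    same curated label are treated as positives, which is one way to train
    image-label and image-text data under a single objective (an assumption
    of this sketch, not a quote of the patent).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarities

    # Positive mask: pairs (i, j) whose labels match; rows are normalized so
    # each image's target is a distribution over its matching texts.
    positives = (labels[:, None] == labels[None, :]).float()  # (B, B)
    targets = positives / positives.sum(dim=1, keepdim=True)

    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits, dim=0)).sum(dim=0).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random embeddings:
B, D = 8, 256
loss = unified_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                labels=torch.randint(0, 4, (B,)))
```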

Inventors

  • Lu Yuan
  • Chunyuan Li
  • Jianwei Yang
  • Bin Xiao

Assignees

  • MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date
2026-05-07
Application Date
2026-01-05

Claims (20)

  1. A computer vision development system, comprising: a data curation engine configured to curate a pre-training database of image-text pairs from weakly labeled data; and a computer vision foundation model, comprising: a pre-training model comprising: a language encoder configured to encode language of text descriptions from the image-text pairs to obtain encoded language; an image encoder configured to encode images of the image-text pairs using a hierarchical vision transformer by generating projection layers using convolutional operations and utilizing shifted windows in determining local attention from the projection layers that are generated to obtain encoded images; and a unified image-text contrastive learning module configured to pre-train the computer vision foundation model based on the encoded images and the encoded language; and two or more extensibility adapters configured to receive a plurality of feature pyramids from different scale levels of the hierarchical vision transformer, and to extend learned feature representations of the feature pyramids in one or more dimensions of a computer vision task problem space.
  2. The computer vision development system of claim 1, wherein the two or more extensibility adapters include a space-based adapter configured to extend the learned feature representation in a space-based dimension of a computer vision task problem space.
  3. The computer vision development system of claim 2, wherein the space-based adapter deploys one or more of level-wise, spatial-wise, and channel-wise attention mechanisms.
  4. The computer vision development system of claim 1, wherein the two or more extensibility adapters include a time-based adapter configured to extend the learned feature representation in a time-based dimension of the computer vision task problem space.
  5. The computer vision development system of claim 4, wherein the time-based adapter is a fine-grained V+L representation adapter.
  6. The computer vision development system of claim 1, wherein the two or more extensibility adapters include a modality-based adapter configured to extend the learned feature representation in a modality-based dimension of the computer vision task problem space.
  7. The computer vision development system of claim 6, wherein the modality-based adapter is a video representation adapter configured to implement a video adaptation of the hierarchical vision transformer to encode images in three-dimensions, and to train the computer vision foundation model on self-attention layers with three-dimensionally shifted local windows.
  8. The computer vision development system of claim 1, wherein the two or more extensibility adapters include a classification/retrieval adapter.
  9. The computer vision development system of claim 1, further comprising one or more transferability adapters configured to transfer the pre-trained computer vision foundation model to one or more machine learning scenarios.
  10. A method for developing a computer vision foundation model, comprising: curating a pre-training database of image-text pairs from weakly labeled data; encoding language of text descriptions from the image-text pairs to obtain encoded language; encoding images of the image-text pairs using a hierarchical vision transformer by generating projection layers using convolutional operations and utilizing shifted windows in determining local attention from the projection layers that are generated to obtain encoded images; and pre-training the computer vision foundation model with a unified image-text contrastive learning module based on the encoded images and the encoded language; receiving a plurality of feature pyramids from different scale levels of the hierarchical vision transformer at two or more extensibility adapters; extending learned feature representations of the feature pyramids in one or more dimensions of a computer vision task problem space using the two or more extensibility adapters.
  11. The method of claim 10, wherein the two or more extensibility adapters include a space-based adapter, and wherein the method further comprises extending the learned feature representation in a space-based dimension of a computer vision task problem space using the space-based adapter.
  12. The method of claim 11, further comprising: deploying one or more of level-wise, spatial-wise, and channel-wise attention mechanisms using the space-based adapter.
  13. The method of claim 10, wherein the two or more extensibility adapters include a time-based adapter and wherein the method further comprises extending the learned feature representation in a time-based dimension of the computer vision task problem space using the time-based adapter.
  14. The method of claim 13, wherein the time-based adapter is a fine-grained V+L representation adapter.
  15. The method of claim 10, wherein the two or more extensibility adapters include a modality-based adapter, and wherein the method further comprises extending the learned feature representation in a modality-based dimension of the computer vision task problem space using the modality-based adapter.
  16. The method of claim 15, wherein the modality-based adapter is a video representation adapter, and wherein the method further comprises implementing a video adaptation of the hierarchical vision transformer to encode images in three-dimensions, and training the computer vision foundation model on self-attention layers with three-dimensionally shifted local windows.
  17. The method of claim 10, wherein the two or more extensibility adapters include a classification/retrieval adapter.
  18. The method of claim 10, further comprising: transferring the pre-trained computer vision foundation model to one or more machine learning scenarios using one or more transferability adapters.
  19. A method for pre-training a computer vision foundation model, the method comprising: curating a pre-training database of image-text pairs from weakly labeled data; encoding language of text descriptions from the image-text pairs to obtain encoded language; encoding images of the image-text pairs using a hierarchical vision transformer that generates projection layers using convolutional operations and utilizes shifted windows to determine local attention from the projection layers that are generated to obtain encoded images; outputting feature pyramids from different scale levels of the hierarchical vision transformer; pre-training the computer vision foundation model based on the encoded images and the encoded language via a unified image-text contrastive learning module; providing the feature pyramids to two or more extensibility adapters.
  20. The method of claim 19, further comprising: concatenating the feature pyramid scale levels into a 3-dimensional tensor; sequentially applying a different attention mechanism for each dimension of the 3-dimensional tensor; and decoupling the encoded images using the different attention mechanisms.
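The decoupled, per-dimension attention of claim 20 can be pictured with a short sketch. The following is a hypothetical PyTorch rendering under stated assumptions: pyramid levels are resized to a common resolution and stacked along a level axis, and simple stand-in mechanisms (a softmax over levels, a depthwise convolution over space, and a squeeze-and-excitation gate over channels) play the roles of the level-wise, spatial-wise, and channel-wise attention. None of these module choices is specified by the claims.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DecoupledPyramidAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.channel_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, pyramid):  # list of (B, C, H_l, W_l) feature maps
        # Resize every level to the median resolution and stack into
        # (B, level, channel, H, W); conceptually the claimed 3-D tensor is
        # (level, space, channel), with H x W flattened as the space axis.
        h, w = pyramid[len(pyramid) // 2].shape[-2:]
        x = torch.stack([F.interpolate(p, size=(h, w), mode="bilinear",
                                       align_corners=False) for p in pyramid], dim=1)
        B, L, C, H, W = x.shape

        # 1) Level-wise attention: softmax weights across the L pyramid levels.
        level_w = torch.softmax(x.mean(dim=(2, 3, 4)), dim=1)        # (B, L)
        x = x * level_w.view(B, L, 1, 1, 1)

        # 2) Spatial-wise attention: depthwise conv shared across levels.
        x = self.spatial(x.view(B * L, C, H, W)).view(B, L, C, H, W)

        # 3) Channel-wise attention: squeeze-and-excitation style gate.
        gate = self.channel_gate(x.mean(dim=(1, 3, 4)))              # (B, C)
        return x * gate.view(B, 1, C, 1, 1)

# Example usage on a three-level pyramid:
feats = [torch.randn(2, 64, s, s) for s in (32, 16, 8)]
out = DecoupledPyramidAttention(64)(feats)   # (2, 3, 64, 16, 16)
```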

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. Non-Provisional patent application Ser. No. 17/821,596, filed Aug. 23, 2022, which claims priority to U.S. Provisional Patent Application Ser. No. 63/264,369, filed Nov. 21, 2021, the entirety of each of which is hereby incorporated herein by reference for all purposes.

BACKGROUND

Automated visual understanding of our diverse and open world requires computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on large-scale, diverse data sets and can be adapted to a wide range of downstream tasks, are critical for solving real-world computer vision applications. Computer vision applications are generally trained using exhaustive sets of training data, often including pairs of text and images generated with supervision. Such training may be mediated using neural networks. The trained vision model may then be deployed to recognize images based on their similarity to the training data. A challenge in computer vision lies in generating a pre-training system that is both scalable and transferable. For example, many existing platforms use text-image pre-training methods with large-scale data training. As such, the models are essentially trained to perform zero-shot learning tasks and can only be transferred or adapted to related computer vision schemes. Such models do not have broad, general transferability.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

Examples are provided for pre-training a computer vision foundation model. A representative method comprises curating a pre-training database of image-text pairs from weakly labeled data. Language of text descriptions from the image-text pairs is encoded. The images of the image-text pairs are encoded using a hierarchical vision transformer with shifted windows and convolutional embedding. Based on the encoded images and the encoded language, the computer vision foundation model is pre-trained via unified image-text contrastive learning.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an illustration of a problem space for computer vision tasks.
FIG. 2 schematically shows an example computer vision development system.
FIG. 3 schematically shows a computer vision foundation model 300.
FIG. 4 is a flow diagram for an example method for pre-training a computer vision foundation model.
FIG. 5 schematically shows an example computing system.

DETAILED DESCRIPTION

Progress in Artificial Intelligence (AI) is often limited when specific models have to be developed to solve specific problems. Such models often rely on supervised training that is further limited by human input capabilities. More rapid progress can be made using cross-modal, holistic models that are capable of solving diverse real-world problems without significant human involvement. Thus, approaches that build cross-modal representations that can be efficiently adapted to various downstream tasks with minimal additional information or intervention are highly desirable.
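To make the image encoder recited in the Summary more concrete, here is a minimal, assumption-laden sketch of its two named ingredients: a convolutional projection in place of a linear patch embedding, and cyclically shifted local windows for self-attention. The kernel sizes, window size, and module name are illustrative, and the attention mask that a full shifted-window scheme applies to wrapped regions is omitted for brevity.

```python
import torch
from torch import nn

class ConvEmbedShiftedWindow(nn.Module):
    def __init__(self, in_ch=3, dim=96, window=4):
        super().__init__()
        self.window = window
        # Convolutional embedding: an overlapping strided convolution rather
        # than a non-overlapping linear patch projection.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, img):                       # (B, 3, H, W)
        x = self.embed(img)                       # (B, dim, H/4, W/4)
        B, C, H, W = x.shape
        # Cyclic shift so this block's windows straddle the boundaries of the
        # previous block's windows (masking of wrapped regions omitted).
        x = torch.roll(x, shifts=(-self.window // 2, -self.window // 2),
                       dims=(2, 3))
        # Partition into non-overlapping local windows, attend within each.
        w = self.window
        x = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(B * (H // w) * (W // w), w * w, C)
        out, _ = self.attn(x, x, x)               # local self-attention per window
        return out                                # (num_windows*B, w*w, dim)

tokens = ConvEmbedShiftedWindow()(torch.randn(1, 3, 64, 64))
```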
One approach along these lines is XYZ-code, where monolingual text (X), audio and visual sensory signals (Y), and multilingual data (Z) are organically integrated to create AI models that can speak, hear, see, and understand. Other approaches attempt to build a single model that can be generalized across millions of tasks. One fundamental tool within these approaches is the foundation model. The term may be applied to any model trained on broad data sets at a scale that allows it to be adapted (e.g., fine-tuned) to a wide range of downstream tasks. Foundation models are important due to their impressive performance and generalization capabilities. Adaptable foundation models may be quickly integrated and deployed into real-world AI systems by many researchers and developers. Although foundation models have already demonstrated huge impact in natural language processing (NLP) and in computer vision, standard practice still involves pre-training models on large, annotated data sets. More recently, large-scale pre-training methods that learn directly from web-scale image-text pairs have shown encouraging progress toward efficient transfer learning and zero-shot capability. However, such models have been limited to tasks such as classification, retrieval, and tagging of images. Broader adaptability and transferability have proven more challenging. While existing vision foundation models focus mainly on mapping images and textual represe