CN-116685989-B - Learning unpaired multimodal feature matching for semi-supervised learning
Abstract
A computer-implemented method for learning multimodal feature matching is provided. The method includes training an image encoder to obtain encoded images. The method further includes training a common classifier on the encoded images by using labeled images. The method further includes training a text encoder by using learned text embeddings and corresponding labels of the learned text embeddings while maintaining the common classifier in a fixed configuration. The text encoder is further trained to match the distribution of predicted text embeddings encoded by the text encoder to a fitted Gaussian distribution over the encoded images.
Inventors
- S. Chowdhury
- KIMURA DAIKI
- KURATA GAKUTO
- NAGANO TOHRU
Assignees
- International Business Machines Corporation
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2021-11-02
- Priority Date
- 2020-12-02
Claims (20)
- 1. A computer-implemented method for learning multimodal feature matching, comprising: training an image encoder with a triplet loss that pushes dissimilar images away from a set of similar images to obtain encoded images; learning text embeddings with corresponding labels by training a common classifier on the encoded images using labeled images; and training a text encoder, while maintaining the common classifier in a fixed configuration, by using the learned text embeddings and corresponding labels for the learned text embeddings, wherein the text encoder is further trained to match the distribution of predicted text embeddings encoded by the text encoder with a fitted Gaussian distribution over the encoded images.
- 2. The computer-implemented method of claim 1, further comprising training the common classifier by using labeled images with the image encoder and labeled text with the text encoder.
- 3. The computer-implemented method of claim 1, wherein the text encoder is trained to simultaneously optimize cross entropy with the common classifier and KL divergence between the fitted Gaussian distribution and the predicted text embeddings.
- 4. The computer-implemented method of claim 1, wherein the common classifier is trained without paired data.
- 5. The computer-implemented method of claim 1, wherein the common classifier is trained using cross entropy loss.
- 6. The computer-implemented method of claim 1, wherein a total loss is calculated as a sum of a loss corresponding to the common classifier and a hyper-parameter multiplied by a loss corresponding to the image encoder.
- 7. The computer-implemented method of claim 1, further comprising minimizing Kullback-Leibler divergence between the fitted Gaussian distribution and the learned text embeddings with the corresponding labels.
- 8. The computer-implemented method of claim 7, further comprising performing semi-supervised learning on a common embedding space.
- 9. The computer-implemented method of claim 1, wherein the text encoder maps pre-trained text embeddings, together with the image embeddings, to a common latent representation to enable cross-modal tasks.
- 10. The computer-implemented method of claim 1, further comprising extracting the text embeddings by applying a pre-trained text embedding model to training text.
- 11. The computer-implemented method of claim 1, wherein the method is performed by a text captioning system that captions an input image with an output text description.
- 12. The computer-implemented method of claim 11, further comprising controlling an automobile to avoid a collision in response to at least one of the output text descriptions indicating an impending collision.
- 13. The computer-implemented method of claim 1, wherein the triplet loss pushes similar ones of the encoded images together and separates dissimilar ones of the encoded images.
- 14. The computer-implemented method of claim 1, wherein training the text encoder further comprises mapping the learned text embeddings to sample clusters using the common classifier to classify the learned text embeddings into respective ones of a plurality of categories.
- 15. The computer-implemented method of claim 1, wherein the text encoder is trained such that cross entropy loss and multivariate Gaussian loss with the classifier in the fixed configuration are simultaneously optimized.
- 16. The computer-implemented method of claim 1, further comprising adding a random adjective to each of the corresponding labels to vary the text distribution.
- 17. A computer program product for learning multimodal feature matching, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method, the method comprising: training an image encoder with a triplet loss that pushes dissimilar images away from a set of similar images to obtain encoded images; learning text embeddings with corresponding labels by training a common classifier on the encoded images using labeled images; and training a text encoder, while maintaining the common classifier in a fixed configuration, by using the learned text embeddings and corresponding labels for the learned text embeddings, wherein the text encoder is further trained to match the distribution of predicted text embeddings encoded by the text encoder with a fitted Gaussian distribution over the encoded images.
- 18. The computer program product of claim 17, wherein the method further comprises training the common classifier by using labeled images with the image encoder and labeled text with the text encoder.
- 19. The computer program product of claim 17, wherein the text encoder is trained to simultaneously optimize cross entropy with the common classifier and KL divergence between the fitted Gaussian distribution and the predicted text embeddings.
- 20. The computer program product of claim 17, wherein the common classifier is trained without paired data.
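
The loss terms recited in claims 1, 5, 6, and 13 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `margin`, and the weighting hyper-parameter `lam` are assumptions introduced here for concreteness.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Squared-distance triplet loss: pulls the anchor and a similar (positive)
    # encoding together and pushes a dissimilar (negative) encoding away by at
    # least `margin` (claims 1 and 13).
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_pos - d_neg + margin, 0.0)

def cross_entropy(logits, label):
    # Cross-entropy loss of the common classifier on one labeled sample
    # (claim 5), computed with a max-shift for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def total_loss(classifier_loss, encoder_loss, lam=0.1):
    # Total loss: classifier loss plus a hyper-parameter times the
    # image-encoder (triplet) loss, as in claim 6.
    return classifier_loss + lam * encoder_loss
```

In this sketch, a labeled image would contribute `cross_entropy` through the common classifier, a sampled triplet of encoded images would contribute `triplet_loss`, and `lam` would be tuned as an ordinary hyper-parameter.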
Description
Learning unpaired multimodal feature matching for semi-supervised learning
Background
The present invention relates generally to machine learning, and more particularly to learning unpaired multimodal feature matching for semi-supervised learning. Generating one data modality from another is an important function in many machine learning applications. Typically, applications involve two or more data modalities, where each modality has few labeled samples and many unlabeled samples. The goal is to learn a common mapping between modalities using the labeled samples. In "Text to Image Generative Model using Constrained Embedding Space Mapping", IEEE International Workshop on Machine Learning for Signal Processing, 2017, by Subhajit Chaudhury et al., and "Conditional generation of multi-modal data using constrained embedding space mapping", International Conference on Machine Learning (ICML) Workshop on Implicit Generative Models, 2017, by Subhajit Chaudhury et al., deterministic mapping schemes are used in which the latent spaces are constrained to be equal during training. Because the underlying latent space is deterministic, such schemes cannot model variations in the multimodal distribution. Furthermore, deterministic mapping schemes are prone to overfitting, because deterministic mapping does not provide a measure of uncertainty between the embeddings and the common latent space. In addition, they use paired training data. In "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696, by Jiquan Ngiam et al., a deep learning framework is presented that uses a restricted Boltzmann machine and a deep belief network to learn effective features of audio and video modalities. However, it requires both modalities to infer the latent space, which limits conditional generation of data from one modality given the other.
In "Generating images from captions with attention," Computing Research Repository (CoRR), Vol. abs/1511.02793, 2015, by Elman Mansimov et al., it is shown that using an attention-based model to generate images from text captions results in higher-quality samples. However, the approach does not produce bi-directional multi-modal data distributions. In "Generative adversarial text to image synthesis," in Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, 2016, ICML'16, pp. 1060–1069, JMLR.org, by Scott Reed et al., a deep convolutional generative adversarial network is proposed that combines natural language and image embeddings to produce synthetically generated images. However, it can only generate images from text, and cannot generate in the opposite direction. In "Joint Multimodal Learning with Deep Generative Models", International Conference on Learning Representations (ICLR) 2017 Workshop, April 24–26, 2017, Toulon, France, by Masahiro Suzuki et al., joint distribution learning is proposed that applies variational inference directly to the data modalities, which share a common latent space. However, their methods cannot be directly used for conditionally independent inference. Furthermore, their methods require more network parameters, use more data for training, and must rely on an adversarial model for training on natural images.
Disclosure of Invention
According to aspects of the present invention, a computer-implemented method for learning multimodal feature matching is provided. The method includes training an image encoder to obtain encoded images. The method further includes training a common classifier on the encoded images by using labeled images. The method further includes training a text encoder by using learned text embeddings and corresponding labels of the learned text embeddings while maintaining the common classifier in a fixed configuration.
The text encoder is further trained to match the distribution of predicted text embeddings encoded by the text encoder to a fitted Gaussian distribution over the encoded images. Matching the distribution of the predicted text embeddings to the fitted Gaussian distribution over the encoded images forces the unlabeled images to have a soft cluster score for each category, thus leveraging a small number of labeled images, which results in improved multimodal matching performance using large amounts of data. In an embodiment, the text encoder is trained to simultaneously optimize the cross entropy of the common classifier and the KL divergence between the fitted Gaussian distribution and the predicted text embeddings in the image domain. In this way, the distributions of the latent representations of both image and text embeddings may be matched to the same distribution in a category-wise manner, which enables cross-modality generation and classification. In an embodiment, the common classifier is trained without paired data. In this way, a classifier trained on image samples can be used to distinguish text embeddings
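
The Gaussian-matching step described above can be sketched as follows for diagonal Gaussians. This is an illustrative reading of the disclosure, not the patented implementation: the diagonal-covariance assumption, the function names, and the weighting `beta` are assumptions introduced here.

```python
import math

def fit_gaussian(encoded_images):
    # Fit a diagonal Gaussian (per-dimension mean and variance) to the encoded
    # images of one class: the "fitted Gaussian distribution over the encoded
    # images". A small floor keeps the variance strictly positive.
    n = len(encoded_images)
    dims = len(encoded_images[0])
    mu = [sum(x[d] for x in encoded_images) / n for d in range(dims)]
    var = [sum((x[d] - mu[d]) ** 2 for x in encoded_images) / n + 1e-6
           for d in range(dims)]
    return mu, var

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    # KL(p || q) between diagonal Gaussians: p is the predicted text-embedding
    # distribution, q is the fitted image-side Gaussian for the same category.
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q))

def text_encoder_objective(ce_loss, mu_p, var_p, mu_q, var_q, beta=1.0):
    # Jointly optimized text-encoder objective: cross entropy with the frozen
    # common classifier plus the KL term; `beta` is an illustrative weighting.
    return ce_loss + beta * kl_diag_gaussians(mu_p, var_p, mu_q, var_q)
```

When the predicted text-embedding distribution coincides with the fitted image-side Gaussian, the KL term vanishes and only the cross-entropy term with the frozen classifier remains.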