KR-20260064400-A - VOICE-GUIDED VISUAL MODELING SYSTEM AND METHOD USING MASKING IMAGE MODELING BASED ON VOICE GUIDANCE
Abstract
A voice-guided visual modeling system and method using masking image modeling based on voice guidance are disclosed. A voice-guided visual modeling system according to one embodiment may include an image restoration unit that restores masked image patches using voice data describing a visual scene; and a relationship learning unit that learns relationship information between the voice data and the image data through the restored image patches.
Inventors
- 정준선
- 우종빈
- 류형곤
- 아르다 세노착
Assignees
- 한국과학기술원 (Korea Advanced Institute of Science and Technology)
Dates
- Publication Date
- 20260507
- Application Date
- 20241114
- Priority Date
- 20241031
Claims (15)
- A voice-guided visual modeling system comprising: an image restoration unit that restores a masked image patch using voice data describing a visual scene; and a relationship learning unit that learns relationship information between the voice data and image data through the restored image patch.
- The system of claim 1, wherein the image restoration unit brings the feature information of image-voice pairs, each comprising voice data and image data, closer together through contrastive learning.
- The system of claim 2, wherein the image restoration unit extracts voice feature information from the voice data through a speech encoder.
- The system of claim 2, wherein the image restoration unit extracts image feature information from the image data through an image encoder.
- The system of claim 2, wherein the image restoration unit generates a masked image by masking the image data at a certain ratio.
- The system of claim 5, wherein the image restoration unit inputs the generated masked image into an image encoder and extracts image feature information from the input masked image through the image encoder.
- The system of claim 2, wherein the image restoration unit restores masked regions of the masked image using the voice feature information through a cross-modal decoder.
- The system of claim 1, wherein the image restoration unit enhances cross-modal interaction between the voice data and the image data through voice-guided Masked Image Modeling (MIM) so that the image restoration process is guided by the voice data.
- The system of claim 8, wherein the image restoration unit inputs a voice feature vector, output from the voice data through a speech encoder, into a cross-modal decoder, and restores the masked regions of the image by finding information in the voice data that matches the visual scene through a cross-attention mechanism of the cross-modal decoder.
- A voice-guided visual modeling method performed by a voice-guided visual modeling system, the method comprising: restoring a masked image patch using voice data describing a visual scene; and learning relationship information between the voice data and image data through the restored image patch.
- The method of claim 10, wherein the restoring comprises bringing the feature information of image-voice pairs, each comprising voice data and image data, closer together through contrastive learning.
- The method of claim 11, wherein the restoring comprises: extracting voice feature information from the voice data through a speech encoder; extracting image feature information from the image data through an image encoder; generating a masked image by masking the image data at a certain ratio, inputting the generated masked image into the image encoder, and extracting image feature information from the input masked image through the image encoder; and restoring masked regions of the masked image using the voice feature information through a cross-modal decoder.
- The method of claim 10, wherein the restoring comprises guiding the image restoration process by the voice data by enhancing cross-modal interaction between the voice data and the image data through voice-guided Masked Image Modeling (MIM).
- The method of claim 13, wherein the restoring comprises inputting a voice feature vector, output from the voice data through a speech encoder, into a cross-modal decoder, and restoring a masked region of the image by finding information in the voice data that matches the visual scene through a cross-attention mechanism of the cross-modal decoder.
- A computer program stored on a computer-readable storage medium for executing a voice-guided visual modeling method performed by a voice-guided visual modeling system, the method comprising: restoring a masked image patch using voice data describing a visual scene; and learning relationship information between the voice data and image data through the restored image patch.
Description
Voice-Guided Visual Modeling System and Method Using Masking Image Modeling Based on Voice Guidance

The following description concerns a technology that learns the association between images and voice data. Existing Visually Grounded Speech (VGS) models learn the semantic connection between images and speech, focusing on learning semantic correspondences between visual and speech representations even without text information. However, previous studies have primarily focused on aligning images and speech using contrastive learning techniques, and have sought to improve performance by combining this with additional visual or linguistic knowledge. In modern AI-based multimodal learning systems, technologies that can simultaneously process visual and auditory information to grasp meaning, and apply it in various fields, are becoming increasingly important. However, existing VGS models have faced technical limitations in clearly understanding the visual scenes represented by auditory data, and new learning methods are required to overcome these limitations.

FIG. 1 is a diagram illustrating a learning method for aligning feature information between voice data and image data in one embodiment. FIG. 2 is a diagram illustrating contrastive learning in one embodiment. FIG. 3 is a diagram illustrating a method combining masked image modeling and contrastive learning in one embodiment. FIG. 4 is a diagram illustrating the attention weights assigned to each frame by the [CLS] token derived from the last layer of the speech encoder in one embodiment. FIG. 5 is a block diagram illustrating a voice-guided visual modeling system in one embodiment. FIG. 6 is a flowchart illustrating a voice-guided visual modeling method in one embodiment.

Hereinafter, embodiments will be described in detail with reference to the attached drawings. FIG. 1 is a diagram illustrating a learning method for aligning feature information between voice data and image data in one embodiment.
A voice-guided visual modeling system can provide a Visually Grounded Speech (VGS) model that learns the association between image data and voice data. The system can restore masked image patches using voice data that describes visual scenes, and further strengthen the alignment between voice and visual information through the restored image patches. Extending existing contrastive-learning-based visual-speech models, the voice-guided visual modeling system aims to improve the interaction between the two modalities and achieve better learning performance by using voice data to restore the masked parts of images. The learning process can consist of contrastive learning and Masked Image Modeling (MIM). For contrastive learning, the system can extract voice feature information from voice data through a speech encoder and extract image feature information from image data through an image encoder. At the same time, the system can generate a masked image by masking a certain portion of the image data and extract image feature information from the masked image. The system can then reconstruct the masked image patches through a cross-modal decoder, using the voice feature information to guide the restoration of the masked regions.

FIG. 2 is a diagram illustrating contrastive learning in one embodiment. Contrastive learning can learn the semantic correspondence between voice data and image data. By learning voice-image pairs consisting of voice feature information extracted from voice data via a speech encoder and image feature information extracted from image data via an image encoder, contrastive learning can match features between the two modalities and maximize their similarity.

FIG. 3 is a diagram illustrating a method combining masked image modeling and contrastive learning in one embodiment.
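The two-part training objective described above, contrastive alignment of voice-image pairs plus reconstruction of masked patches, can be sketched as follows. This is a minimal illustration under assumed choices (an InfoNCE-style contrastive loss, a 0.75 masking ratio, MSE reconstruction over masked patches, and toy feature dimensions), not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, ratio=0.75, rng=rng):
    """Zero out a fixed ratio of image patches at random, returning the
    masked patches and the boolean mask (True = masked)."""
    n = len(patches)
    idx = rng.permutation(n)[: int(n * ratio)]
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = patches.copy()
    masked[mask] = 0.0
    return masked, mask

def contrastive_loss(img_feats, spk_feats, temperature=0.07):
    """InfoNCE-style loss pulling matched image-voice pairs together:
    the i-th image and i-th utterance form the positive pair."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    spk = spk_feats / np.linalg.norm(spk_feats, axis=1, keepdims=True)
    logits = img @ spk.T / temperature        # (B, B) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(logits))             # diagonal = positive pairs
    return -log_probs[diag, diag].mean()

def reconstruction_loss(pred_patches, true_patches, mask):
    """MSE over the masked patches only, as in masked image modeling."""
    return np.mean((pred_patches[mask] - true_patches[mask]) ** 2)

# Toy shapes: a batch of 4 paired examples; 16 patches of dimension 8.
patches = rng.standard_normal((16, 8))
masked, mask = mask_patches(patches, ratio=0.75)
img_feats = rng.standard_normal((4, 32))   # stand-in for encoder outputs
spk_feats = rng.standard_normal((4, 32))

total = contrastive_loss(img_feats, spk_feats) + \
        reconstruction_loss(masked, patches, mask)
```

In the actual system the predicted patches would come from the cross-modal decoder conditioned on the voice features, rather than being the zeroed patches used here as a placeholder.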
A voice-guided visual modeling system can provide a visual speech model that combines Masked Image Modeling (MIM) and contrastive learning to improve alignment within the audio-visual feature space. The learning process of the visual speech model will be described in more detail. The voice-guided visual modeling system can be composed of four main components: an image encoder, a speech encoder, a cross-modal decoder, and a momentum image encoder. The momentum encoder parameters θ_m are updated as θ_m ← m·θ_m + (1 − m)·θ, where θ denotes the image encoder parameters and m denotes the momentum coefficient. All of these components are based on a transformer architecture. The speech encoder receives voice data as input and can extract voice feature information. The speech encoder can be designed based on the HuBERT model and can divide voice data into segments to convert them into feature vectors. In this process, the speech encoder extracts only the important semantic elements from the voice data so that they can be used in the image restoration process. An image encoder is a module
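The momentum image encoder update described above is the standard exponential moving average of the online encoder's parameters; a few-line sketch follows, with the coefficient value (0.9 here, typically close to 1 in practice) chosen purely for illustration:

```python
def momentum_update(momentum_params, online_params, m=0.995):
    """EMA update for the momentum image encoder: each momentum
    parameter becomes m * old_value + (1 - m) * online_value."""
    return [m * p_m + (1.0 - m) * p
            for p_m, p in zip(momentum_params, online_params)]

# Toy example with scalar "parameters".
momentum = [1.0, 0.0]
online = [0.0, 1.0]
momentum = momentum_update(momentum, online, m=0.9)
# momentum ≈ [0.9, 0.1]
```

Because m is close to 1, the momentum encoder changes slowly, providing stable image targets while the online encoder is trained.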