EP-4432165-B1 - APPARATUS AND METHOD FOR VIDEO REPRESENTATION LEARNING

Inventors

  • CHOI, JONG WON
  • PARK, SOO HYUN
  • YOUN, JONG SU

Dates

Publication Date
2026-05-13
Application Date
2023-12-27

Claims (8)

  1. An apparatus (100) for video representation learning, the apparatus comprising: a feature extractor (110) including a student network that extracts video features from video data and generates a video embedding, a first teacher network that extracts image features from image data extracted from the video data and generates an image embedding, and a second teacher network that extracts audio features from audio data extracted from the video data and generates an audio embedding; a compositional embedding network unit (120) including a first compositional neural network that generates a first compositional embedding based on the video embedding and the image embedding, and a second compositional neural network that generates a second compositional embedding based on the video embedding and the audio embedding; a sample generator (130) configured to generate positive samples and negative samples based on the image embedding and the audio embedding using a Siamese neural network trained to estimate a correlation between the image embedding and the audio embedding; wherein the Siamese neural network generates a positive sample by concatenating an image embedding and an audio embedding with a distance according to the correlation that is equal to or shorter than a certain distance, and generates a negative sample by concatenating an image embedding and an audio embedding with the distance according to the correlation that exceeds the certain distance; and a contrastive learning unit (140) configured to generate one or more loss functions for training the student network using the video embedding, the first compositional embedding, the second compositional embedding, the positive samples, and the negative samples, and train the student network for video search.
  2. The apparatus (100) of claim 1, wherein the student network is constructed as a three-dimensional convolutional neural network, 3D-CNN, by combining a two-dimensional convolutional neural network, 2D-CNN, for extracting spatial information with a one-dimensional convolutional neural network, 1D-CNN, for extracting temporal information.
  3. The apparatus (100) of claim 1, wherein the first teacher network is constructed as a two-dimensional convolutional neural network, 2D-CNN, model, and generates an image embedding by extracting spatial visual information from the image data, and the second teacher network is constructed as a one-dimensional convolutional neural network, 1D-CNN, model, and generates an audio embedding by extracting temporal acoustic information from the audio data.
  4. The apparatus (100) of claim 1, wherein the first compositional embedding is calculated by adding the image embedding to an image residual embedding obtained by normalizing each of the image embedding and the video embedding and then concatenating the normalized embeddings, and the second compositional embedding is calculated by adding the audio embedding to an audio residual embedding obtained by normalizing each of the audio embedding and the video embedding and then concatenating the normalized embeddings.
  5. The apparatus (100) of claim 1, wherein the Siamese neural network (MLP) is first trained with positive training samples constructed by concatenating an image embedding and an audio embedding with an embedding distance therebetween that is shorter than or equal to a first distance among the image embeddings and audio embeddings and negative training samples constructed by concatenating an image embedding and an audio embedding with the embedding distance therebetween that is equal to or longer than a second distance, and decreases the first distance and increases the second distance as a training order increases.
  6. The apparatus (100) of claim 1, wherein the contrastive learning unit (140) is configured to generate a loss function based on cosine similarity of the video embedding and the positive sample and cosine similarity of the video embedding and the negative sample.
  7. A method that is performed in a computing device including one or more processors and a memory storing one or more programs executed by the one or more processors, the method comprising: generating (310) a video embedding by extracting a video feature from video data using a student network; generating (320) an image embedding by extracting an image feature from image data extracted from the video data using a first teacher network; generating (330) an audio embedding by extracting an audio feature from audio data extracted from the video data using a second teacher network; generating (340) a first compositional neural network that generates a first compositional embedding based on the video embedding and the image embedding and a second compositional neural network that generates a second compositional embedding based on the video embedding and the audio embedding; generating (350) positive samples and negative samples based on the image embedding and the audio embedding using a Siamese neural network trained to estimate a correlation between the image embedding and the audio embedding; wherein the Siamese neural network generates a positive sample by concatenating an image embedding and an audio embedding with a distance according to the correlation that is equal to or shorter than a certain distance, and generates a negative sample by concatenating an image embedding and an audio embedding with the distance according to the correlation that exceeds the certain distance; and generating (360) one or more loss functions for training the student network using the video embedding, the first compositional embedding, the second compositional embedding, the positive samples, and the negative samples, and training the student network for video search.
  8. A computer program (20) stored in a non-transitory computer readable storage medium (16), the computer program (20) comprising one or more instructions that, when executed by a computing device (12) having one or more processors (14), cause the computing device (12) to perform operations of: generating (310) a video embedding by extracting a video feature from video data using a student network; generating (320) an image embedding by extracting an image feature from image data extracted from the video data using a first teacher network; generating (330) an audio embedding by extracting an audio feature from audio data extracted from the video data using a second teacher network; generating (340) a first compositional neural network that generates a first compositional embedding based on the video embedding and the image embedding and a second compositional neural network that generates a second compositional embedding based on the video embedding and the audio embedding; generating (350) positive samples and negative samples based on the image embedding and the audio embedding using a Siamese neural network trained to estimate a correlation between the image embedding and the audio embedding; wherein the Siamese neural network generates a positive sample by concatenating an image embedding and an audio embedding with a distance according to the correlation that is equal to or shorter than a certain distance, and generates a negative sample by concatenating an image embedding and an audio embedding with the distance according to the correlation that exceeds the certain distance; and generating (360) one or more loss functions for training the student network using the video embedding, the first compositional embedding, the second compositional embedding, the positive samples, and the negative samples, and training the student network for video search.
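
For reference, claim 6 defines the loss only in terms of the cosine similarity between the video embedding and the positive sample and between the video embedding and the negative sample. The following is a minimal Python (PyTorch) sketch of one loss with that shape; the InfoNCE-style aggregation, the temperature tau, and the assumption that the concatenated (image, audio) samples have already been projected to the video embedding's dimension are all illustrative assumptions, not specified by the patent:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     pos_sample: torch.Tensor,
                     neg_samples: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss built from cosine similarities: pulls the video
    embedding toward its positive sample and pushes it away from the
    negative samples.

    video_emb:   (B, D)    student video embeddings
    pos_sample:  (B, D)    one positive sample per video (assumed already
                           projected from the concatenated image/audio pair)
    neg_samples: (B, K, D) K negative samples per video
    """
    # After L2 normalization, a dot product equals cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    p = F.normalize(pos_sample, dim=-1)
    n = F.normalize(neg_samples, dim=-1)

    sim_pos = (v * p).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    sim_neg = torch.einsum("bd,bkd->bk", v, n) / tau    # (B, K)

    logits = torch.cat([sim_pos, sim_neg], dim=1)       # (B, 1 + K)
    # The positive sample always sits at column 0 of each row.
    targets = torch.zeros(v.size(0), dtype=torch.long, device=v.device)
    return F.cross_entropy(logits, targets)
```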

Description

BACKGROUND

1. Field

The following description relates to an apparatus and a method for video representation learning that perform multi-modal distillation and contrastive learning based on interdependence information of video and audio.

2. Description of Related Art

Existing video search technologies often depend on selectively provided text data or audio data, and have the problems of requiring a large amount of video data for network training and incurring high costs. Korean Patent Publication No. 10-2015-0091053 (published on August 7, 2015) discloses a feature in which a user enters a text query related to a video to be searched and text-based image search is performed based on the entered text query. However, these text-based methods have the problems that the quality of input annotations is usually poor and that most annotations provide only a brief description of a portion of the video. The non-patent literature "Distilling Audio-Visual Knowledge by Compositional Contrastive Learning", published by Cornell University on April 22, 2021, provides a model for performing a variety of existing knowledge distillation methods in transferring audio-visual knowledge.

SUMMARY

The disclosed embodiments are intended to provide an apparatus and a method for video representation learning that perform multi-modal distillation and contrastive learning based on interdependence information of video and audio. The present invention is set forth in the independent claims, and further embodiments are defined in the dependent claims.

According to the present invention, there is provided an apparatus for video representation learning, as defined by independent claim 1. The apparatus includes a feature extractor including a student network that extracts video features from video data and generates a video embedding, a first teacher network that extracts image features from image data extracted from the video data and generates an image embedding, and a second teacher network that extracts audio features from audio data extracted from the video data and generates an audio embedding; a compositional embedding network unit including a first compositional neural network that generates a first compositional embedding based on the video embedding and the image embedding, and a second compositional neural network that generates a second compositional embedding based on the video embedding and the audio embedding; a sample generator configured to generate positive samples and negative samples based on the image embedding and the audio embedding using a Siamese neural network trained to estimate a correlation between the image embedding and the audio embedding; and a contrastive learning unit configured to generate one or more loss functions for training the student network using the video embedding, the first compositional embedding, the second compositional embedding, the positive samples, and the negative samples.

The student network may be constructed as a three-dimensional convolutional neural network (3D-CNN) by combining a two-dimensional convolutional neural network (2D-CNN) for extracting spatial information with a one-dimensional convolutional neural network (1D-CNN) for extracting temporal information.
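As a concrete illustration of this factorized design, the following is a minimal PyTorch sketch of a (2+1)D-style student network; the channel widths, depth, pooling, and embedding dimension are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One factorized 3D convolution: a 2D spatial convolution applied
    per frame, followed by a 1D temporal convolution across frames."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2D-CNN part: kernel (1, k, k) touches only the spatial axes.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        # 1D-CNN part: kernel (k, 1, 1) touches only the temporal axis.
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (B, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))

class StudentNetwork(nn.Module):
    """Toy student: stacked (2+1)D blocks followed by global pooling,
    producing one video embedding per clip."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            SpatioTemporalBlock(3, 64),
            SpatioTemporalBlock(64, 128),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, clip):         # clip: (B, 3, T, H, W)
        h = self.pool(self.backbone(clip)).flatten(1)
        return self.proj(h)          # (B, embed_dim) video embedding
```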
The first teacher network may be constructed as a two-dimensional convolutional neural network (2D-CNN) model and generate an image embedding by extracting spatial visual information from the image data, and the second teacher network may be constructed as a one-dimensional convolutional neural network (1D-CNN) model and generate an audio embedding by extracting temporal acoustic information from the audio data.

The first compositional embedding may be calculated by adding the image embedding to an image residual embedding obtained by normalizing each of the image embedding and the video embedding and then concatenating the normalized embeddings, and the second compositional embedding may be calculated by adding the audio embedding to an audio residual embedding obtained by normalizing each of the audio embedding and the video embedding and then concatenating the normalized embeddings.

The Siamese neural network generates a positive sample by concatenating an image embedding and an audio embedding with a distance according to the correlation that is equal to or shorter than a certain distance, and generates a negative sample by concatenating an image embedding and an audio embedding with the distance according to the correlation that exceeds the certain distance.

The Siamese neural network may be first trained with positive training samples, constructed by concatenating an image embedding and an audio embedding with an embedding distance therebetween that is shorter than or equal to a first distance, and negative training samples, constructed by concatenating an image embedding and an audio embedding with the embedding distance therebetween that is equal to or longer than a second distance, and may decrease the first distance and increase the second distance as the training order increases.
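
Read together, the compositional embedding and the sample generator admit a compact sketch. The following PyTorch illustration is not the patent's implementation: the residual mapping is assumed to be a small MLP (the patent fixes only the normalize-concatenate-add structure), `siamese` is assumed to return a scalar correlation distance per pair, and `threshold` stands in for the "certain distance" above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalEmbedding(nn.Module):
    """One compositional neural network (used once for image, once for
    audio): normalize the teacher and video embeddings, concatenate them,
    map the result to a residual embedding, and add that residual back to
    the original teacher embedding."""

    def __init__(self, dim: int):
        super().__init__()
        # The residual mapping is an assumption; any small network over
        # the concatenated, normalized embeddings fits the description.
        self.residual = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, teacher_emb, video_emb):
        t = F.normalize(teacher_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        return teacher_emb + self.residual(torch.cat([t, v], dim=-1))


def generate_samples(siamese: nn.Module,
                     image_embs: torch.Tensor,
                     audio_embs: torch.Tensor,
                     threshold: float):
    """Score each (image, audio) pair with the trained Siamese network and
    split the concatenated pairs into positives (distance within the
    threshold) and negatives (distance beyond it)."""
    positives, negatives = [], []
    for img, aud in zip(image_embs, audio_embs):
        distance = siamese(img, aud)           # learned correlation distance
        pair = torch.cat([img, aud], dim=-1)   # concatenated sample
        (positives if distance <= threshold else negatives).append(pair)
    return positives, negatives
```

For the Siamese network's own training described above, the positive cutoff (the first distance) would shrink and the negative cutoff (the second distance) would grow over successive training rounds, so that the network is trained on increasingly unambiguous pairs.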