US-12620227-B2 - Common action localization
Abstract
Aspects of the disclosure are directed to an apparatus configured to perform common-action localization. In certain aspects, the apparatus may receive a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames. In certain aspects, the apparatus may determine a first attendance for a first support video of a plurality of support videos. In certain aspects, the apparatus may determine a second attendance for a second support video of the plurality of support videos after computing the first attendance.
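The flow in the abstract — score one query proposal against each support video in turn, producing one "attendance" per support video — can be sketched as follows. This is an illustrative stand-in only: the patent does not publish code, and the cosine-similarity scoring and pooled 16-dimensional features below are assumptions, not the claimed neural-network feature maps.

```python
import numpy as np

def attendance(query_feat, support_feats):
    """Toy 'attendance': a probability-like score that the query
    proposal's action appears somewhere in one support video,
    computed from max cosine similarity over its frame features."""
    q = query_feat / np.linalg.norm(query_feat)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    best = float((s @ q).max())          # best-matching support frame
    return 1.0 / (1.0 + np.exp(-best))   # squash to (0, 1)

rng = np.random.default_rng(0)
query_proposal = rng.normal(size=16)                    # pooled proposal feature (assumed shape)
support_videos = [rng.normal(size=(8, 16)) for _ in range(3)]

# Per the abstract, attendances are computed one support video at a
# time: the second only after the first, and so on.
scores = [attendance(query_proposal, sv) for sv in support_videos]
```

The sequential list comprehension mirrors the claimed "individually one at a time" computation; each score depends only on the query proposal and its own support video.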
Inventors
- Juntae LEE
- Mihir JAIN
- Sungrack Yun
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-07-27
Claims (20)
- 1 . An apparatus for performing common-action localization, comprising: one or more memories, individually or in combination, having instructions; and one or more processors, individually or in combination, configured to execute the instructions and cause the apparatus to: receive a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames; compute relevance of the first query proposal to each support video of a plurality of support videos, individually one at a time, wherein the computing comprises: determining a first attendance indicative of a first probability that the action is found in a first support video of the plurality of support videos by generating and up-sampling, via at least one neural network, at least one feature map associated with the first support video; and determining a second attendance indicative of a second probability that the action is found in a second support video of the plurality of support videos by generating and up-sampling, via the at least one neural network, at least one feature map associated with the second support video, wherein the second attendance is determined after the first attendance is determined; and output a classification of the subset of frames based at least in part on the first attendance and the second attendance.
- 2 . The apparatus of claim 1 , wherein the second attendance is determined independent of the first attendance.
- 3 . The apparatus of claim 1 , wherein the first attendance is further indicative of whether the first support video comprises one or more frames matching the first query proposal, and wherein the second attendance is further indicative of whether the second support video comprises one or more frames matching the first query proposal.
- 4 . The apparatus of claim 3 , wherein the one or more processors are further configured to cause the apparatus to: apply a one-dimensional temporal convolution to the one or more frames from each of the first support video and the second support video matching the first query proposal.
- 5 . The apparatus of claim 3 , wherein the one or more processors are further configured to cause the apparatus to: generate a third support video consisting of the one or more frames from each of the first support video and the second support video matching the first query proposal.
- 6 . The apparatus of claim 1 , wherein the first query proposal is one of multiple query proposals determined based on the plurality of frames, and wherein the one or more processors are further configured to cause the apparatus to: determine the first query proposal of the multiple query proposals has a highest probability that the action the first query proposal is indicative of is found among the plurality of support videos, relative to one or more other remaining query proposals of the multiple query proposals that are indicative of one or more other actions.
- 7 . The apparatus of claim 1 , wherein the one or more processors are further configured to cause the apparatus to: classify each of the first support video and the second support video based on pseudo-action classes.
- 8 . The apparatus of claim 7 , wherein the pseudo-action classes are mapped to a k-means cluster.
- 9 . A method for performing common-action localization, comprising: receiving, by one or more processors, a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames; computing, by one or more processors, relevance of the first query proposal to each support video of a plurality of support videos, individually one at a time, wherein the computing comprises: determining a first attendance indicative of a first probability that the action is found in a first support video of the plurality of support videos by generating and up-sampling, via at least one neural network, at least one feature map associated with the first support video; and determining a second attendance indicative of a second probability that the action is found in a second support video of the plurality of support videos by generating and up-sampling, via the at least one neural network, at least one feature map associated with the second support video, wherein the second attendance is determined after the first attendance is determined; and outputting, by the one or more processors, a classification of the subset of frames based at least in part on the first attendance and the second attendance.
- 10 . The method of claim 9 , wherein the second attendance is determined independent of the first attendance.
- 11 . The method of claim 9 , wherein the first attendance is further indicative of whether the first support video comprises one or more frames matching the first query proposal, and wherein the second attendance is further indicative of whether the second support video comprises one or more frames matching the first query proposal.
- 12 . The method of claim 11 , wherein the method further comprises: applying a one-dimensional temporal convolution to the one or more frames from each of the first support video and the second support video matching the first query proposal.
- 13 . The method of claim 11 , wherein the method further comprises: generating a third support video consisting of the one or more frames from each of the first support video and the second support video matching the first query proposal.
- 14 . The method of claim 9 , wherein the first query proposal is one of multiple query proposals determined based on the plurality of frames, and wherein the method further comprises: determining the first query proposal of the multiple query proposals has a highest probability that the action the first query proposal is indicative of is found among the plurality of support videos, relative to one or more other remaining query proposals of the multiple query proposals that are indicative of one or more other actions.
- 15 . The method of claim 9 , wherein the method further comprises: classifying each of the first support video and the second support video based on pseudo-action classes.
- 16 . The method of claim 15 , wherein the pseudo-action classes are mapped to a k-means cluster.
- 17 . An apparatus for performing common-action localization, comprising: means for receiving a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames; means for computing relevance of the first query proposal to each support video of a plurality of support videos, individually one at a time, wherein the computing comprises: determining a first attendance indicative of a first probability that the action is found in a first support video of the plurality of support videos by generating and up-sampling, via at least one neural network, at least one feature map associated with the first support video; and determining a second attendance indicative of a second probability that the action is found in a second support video of the plurality of support videos by generating and up-sampling, via the at least one neural network, at least one feature map associated with the second support video, wherein the second attendance is determined after the first attendance is determined; and means for outputting a classification of the subset of frames based at least in part on the first attendance and the second attendance.
- 18 . The apparatus of claim 17 , wherein the second attendance is determined independent of the first attendance.
- 19 . The apparatus of claim 17 , wherein the first attendance is further indicative of whether the first support video comprises one or more frames matching the first query proposal, and wherein the second attendance is further indicative of whether the second support video comprises one or more frames matching the first query proposal.
- 20 . The apparatus of claim 19 , wherein the apparatus further comprises: means for applying a one-dimensional temporal convolution to the one or more frames from each of the first support video and the second support video matching the first query proposal.
Description
CROSS REFERENCE TO RELATED APPLICATION(S)
This application claims the benefit of U.S. Provisional Application Ser. No. 63/450,924, entitled “COMMON ACTION LOCALIZATION” and filed on Mar. 8, 2023, the disclosure of which is expressly incorporated by reference herein in its entirety.
BACKGROUND
Technical Field
The present disclosure generally relates to machine learning and, more particularly, to improving systems and methods of action recognition and localization.
INTRODUCTION
An artificial neural network, which may include an interconnected group of artificial neurons (e.g., neuron models), is a computational device or represents a method to be performed by a computational device. Convolutional neural networks are a type of feed-forward artificial neural network. Convolutional neural networks may include collections of neurons that each have a receptive field and that collectively tile an input space. Convolutional neural networks (CNNs) have numerous applications. In particular, CNNs have been used broadly in the area of pattern recognition and classification.
Deep learning architectures, such as deep belief networks and deep convolutional networks, are layered neural network architectures in which the output of a first layer of neurons becomes an input to a second layer of neurons, the output of the second layer of neurons becomes an input to a third layer of neurons, and so on. Deep neural networks may be trained to recognize a hierarchy of features, and so they have increasingly been used in object recognition applications. Like convolutional neural networks, computation in these deep learning architectures may be distributed over a population of processing nodes, which may be configured in one or more computational chains. These multi-layered architectures may be trained one layer at a time and may be fine-tuned using back propagation. Other models are also available for object recognition.
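The layered feed-forward structure described above — each layer's output becoming the next layer's input — can be sketched minimally. The layer sizes and the ReLU nonlinearity below are illustrative choices, not details from the disclosure.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Layered feed-forward pass: the output of each layer becomes
    the input to the next layer, as in the architectures above."""
    for w, b in layers:
        x = relu(x @ w + b)
    return x

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),   # layer 1
          (rng.normal(size=(8, 8)), np.zeros(8)),   # layer 2
          (rng.normal(size=(8, 3)), np.zeros(3))]   # layer 3
y = forward(rng.normal(size=4), layers)
```

Training such a stack layer by layer, then fine-tuning end to end with back propagation, is the regime the background paragraph describes.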
For example, support vector machines (SVMs) are learning tools that can be applied to classification. Support vector machines include a separating hyperplane (e.g., a decision boundary) that categorizes data. The hyperplane is defined by supervised learning. A desired hyperplane increases the margin of the training data; in other words, the hyperplane should have the greatest minimum distance to the training examples. Computational networks such as recurrent neural networks may also be useful for recognizing sequences and other temporal data. However, such computational networks are computationally complex and consume significant compute resources.
SUMMARY
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
Certain aspects are directed to an apparatus for performing common-action localization. The apparatus may include one or more memories, individually or in combination, having instructions, and one or more processors, individually or in combination, configured to execute the instructions. In some examples, the one or more processors may be configured to cause the apparatus to receive a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames. In some examples, the one or more processors may be configured to cause the apparatus to determine a first attendance for a first support video of a plurality of support videos.
In some examples, the one or more processors may be configured to cause the apparatus to determine a second attendance for a second support video of the plurality of support videos after computing the first attendance. In some examples, the one or more processors may be configured to cause the apparatus to output a classification of the subset of frames based at least in part on the first attendance and the second attendance.
Certain aspects are directed to a method for performing common-action localization. In some examples, the method includes receiving a query video comprising a plurality of frames, wherein a first query proposal is determined based on a subset of frames of the plurality of frames, the first query proposal indicative of an action depicted on the subset of frames. In some examples, the method includes determining a first attendance for a first support video of a plurality of support videos. In some examples, the method includes determining a second attendance for a second support video of the plurality of support videos after computing the first attendance. In some examples, the method includes outputting a classification of the subset of frames based at least in part on the first attendance and the second attendance.
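Claims 7, 8, 15, and 16 above classify support videos by pseudo-action classes mapped to k-means clusters. A minimal sketch of that idea (the pooled 8-dimensional video features, cluster count, and plain k-means routine are illustrative assumptions, not the patent's method): cluster pooled support-video features and use each video's cluster index as its pseudo-action class.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns (centroids, labels). Each label serves
    as a pseudo-action class for the corresponding video feature."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest centroid...
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ...then move each centroid to the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
# Two well-separated synthetic groups of pooled support-video features
videos = np.vstack([rng.normal(0.0, 0.1, size=(5, 8)),
                    rng.normal(3.0, 0.1, size=(5, 8))])
_, pseudo_classes = kmeans(videos, k=2)
```

Because the labels come from clustering rather than human annotation, they are "pseudo" classes: consistent groupings of similar actions without named categories.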