
US-12626507-B2 - Method and apparatus for video action classification

US 12626507 B2

Abstract

A method of controlling an apparatus for performing video action classification using a trained machine learning (ML) model includes: receiving a plurality of frames of a video; inputting the plurality of frames into the trained ML model; identifying an actor in the plurality of frames, wherein the actor performs an action in the plurality of frames; and, based on the actor being identified, classifying the action performed by the actor.

Inventors

  • Enrique SANCHEZ LOZANO
  • Georgios TZIMIROPOULOS
  • Yassine OUALI

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date
2026-05-12
Application Date
2023-06-09
Priority Date
2023-02-21

Claims (14)

  1. A method of controlling an electronic apparatus for performing video action classification using a trained machine learning (ML) model, the method comprising: receiving a plurality of frames of a video; inputting, into the trained ML model, the plurality of frames; extracting spatial features and temporal features from the plurality of frames by using a backbone network in the trained ML model; identifying an actor in the plurality of frames, wherein the actor performs an action in the plurality of frames, wherein the identifying the actor within the plurality of frames comprises applying an actor transformer module in the trained ML model to the extracted spatial features and temporal features from key frames of the plurality of frames; and based on the actor being identified, classifying the action performed by the actor, wherein the applying the actor transformer module to the extracted spatial features and temporal features comprises inputting, into an encoder in the actor transformer module, the extracted spatial features and temporal features from the key frames.
  2. The method as claimed in claim 1, wherein the identifying the actor within the plurality of frames comprises: predicting a bounding box around the actor performing the action.
  3. The method as claimed in claim 2, wherein the applying the actor transformer module to the extracted spatial features and temporal features comprises: outputting, from the encoder in the actor transformer module, position features indicating potential positions of the actor in the key frames; inputting, into a decoder in the actor transformer module, the position features output from the encoder in the actor transformer module, and a set of actor queries; and outputting, from the decoder in the actor transformer module, final actor queries; and wherein the predicting the bounding box around the actor performing the action comprises: inputting, into an actor classifier in the actor transformer module, the final actor queries; and outputting, from the actor classifier, coordinates of the bounding box for the actor and a classification score indicating a likelihood of the bounding box containing the actor.
  4. The method as claimed in claim 3, wherein the classifying the action performed by the actor comprises: applying an action transformer module in the trained ML model to the extracted spatial features and temporal features, and predicting a class for the actor performing the action.
  5. The method as claimed in claim 4, wherein the applying the action transformer module to the extracted spatial features and temporal features comprises: inputting, into an encoder in the action transformer module, the extracted spatial features and temporal features; outputting, from the encoder in the action transformer module, action features indicating potential actions of the actor; inputting, into a decoder in the action transformer module, the action features output from the encoder in the action transformer module, the final actor queries output by the decoder in the actor transformer module, and a set of action queries; and outputting, from the decoder in the action transformer module, final action queries; and wherein the predicting the class for the actor performing the action comprises: inputting, into an action classifier in the action transformer module, the final action queries output from the decoder in the action transformer module; and outputting, from the action classifier, the class for the actor performing the action and a confidence value corresponding to the class.
  6. The method as claimed in claim 5, further comprising: matching the predicted bounding box with the predicted class for the actor; and obtaining a matching score indicating a likelihood of the predicted bounding box being associated with the predicted class.
  7. The method as claimed in claim 6, wherein the matching comprises: matching the predicted bounding box with the predicted class for the actor having the confidence value greater than a predetermined threshold value.
  8. The method as claimed in claim 6, wherein the matching comprises: matching the predicted bounding box with two or more predicted classes.
  9. The method as claimed in claim 6, further comprising: compressing, using the matching score, the plurality of frames of the video.
  10. The method as claimed in claim 1, wherein the plurality of frames are a first set of frames in the video, and wherein the method further comprises: classifying an action performed by the actor in a second set of frames in the video.
  11. The method as claimed in claim 10, wherein the first set of frames is subsequent to the second set of frames.
  12. The method as claimed in claim 1, wherein the video action classification is performed in real-time or near real-time.
  13. The method as claimed in claim 1, wherein the identified actor is a human object or animal object.
  14. An electronic apparatus for performing video action classification using a trained machine learning (ML) model, the electronic apparatus comprising: a communication interface; and at least one processor configured to: receive, through the communication interface, a plurality of frames of a video, input, into the trained ML model, the plurality of frames, extract spatial features and temporal features from the plurality of frames by using a backbone network in the trained ML model, identify an actor in the plurality of frames, wherein the actor performs an action in the plurality of frames, wherein the at least one processor is configured to identify the actor within the plurality of frames by applying an actor transformer module in the trained ML model to the extracted spatial features and temporal features from key frames of the plurality of frames, and, based on the actor being identified, classify the action performed by the actor, wherein the at least one processor is configured to apply the actor transformer module to the extracted spatial features and temporal features by inputting, into an encoder in the actor transformer module, the extracted spatial features and temporal features from the key frames.
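
Claims 6 to 8 describe matching each predicted bounding box with one or more predicted action classes, subject to a confidence threshold, to obtain a matching score. The claims do not fix the form of the scoring function, so the sketch below is only one plausible reading: it assumes the matching score is a cosine similarity between the final actor queries and the final action queries, with all tensor names and shapes being illustrative.

```python
import torch
import torch.nn.functional as F

def match_boxes_to_actions(final_actor_queries, final_action_queries,
                           action_confidences, conf_threshold=0.5):
    """Associate each predicted box with predicted action classes (claims 6-8).

    final_actor_queries:  (Na, D) outputs of the actor transformer decoder
    final_action_queries: (Nc, D) outputs of the action transformer decoder
    action_confidences:   (Nc,)   confidence values from the action classifier

    The cosine-similarity scoring used here is an assumption; the patent only
    states that a matching score indicating the likelihood of the box being
    associated with the class is obtained.
    """
    # Pairwise matching scores between every actor query and action query.
    scores = F.cosine_similarity(final_actor_queries.unsqueeze(1),
                                 final_action_queries.unsqueeze(0), dim=-1)  # (Na, Nc)

    # Claim 7: only match classes whose confidence exceeds the threshold.
    keep = action_confidences > conf_threshold  # (Nc,) boolean mask

    matches = []
    for i in range(final_actor_queries.size(0)):
        # Claim 8: a single bounding box may be matched with two or more classes.
        matched = torch.nonzero(keep & (scores[i] > 0)).flatten()
        matches.append({"box_index": i,
                        "class_indices": matched.tolist(),
                        "matching_scores": scores[i, matched].tolist()})
    return matches
```

Per claim 9, such matching scores could then inform video compression, for example by allocating more bits to regions whose boxes carry high scores; the claim itself does not specify the compression mechanism.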

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of International Application No. PCT/KR2023/002916 filed on Mar. 3, 2023, which is based on and claims priority to Greek Patent Application No. 20220100210 filed on Mar. 4, 2022 and European Patent Application No. 23157744.6 filed on Feb. 21, 2023, the disclosures of which are incorporated by reference herein in their entireties.

TECHNICAL FIELD

The present application generally relates to a method and apparatus for action recognition or classification in videos. In particular, the present application relates to a computer-implemented method for performing video action classification using a trained machine learning (ML) model.

BACKGROUND ART

Object detection models predict a set of bounding boxes around objects of interest in an image, and category or class labels for those objects. For example, a model may identify a dog in an image, predict a bounding box around the dog, and classify the object in the bounding box as a "dog". Videos may depict actors who are undertaking or performing actions. The term "actor" is used generally herein to mean a human, animal or object that may be performing an action. It is desirable in many contexts to recognise actions within videos. Thus, object detection models may be used to identify actors within videos, as well as the actions being performed by those actors.

Spatio-temporal action localisation is the problem of localising actors in space and time and recognising their actions. Compared to action recognition, the task of spatio-temporal action localisation is more challenging, as it requires spatio-temporal reasoning that takes into account multiple factors, including the motion of multiple actors, their interactions with other actors, and their interactions with the surroundings. State-of-the-art methods for solving this problem mainly rely on a complicated two-stage pipeline in which a first network (a person detector) is used to detect actors (e.g. people) in key frames, and a second network is then used for spatio-temporal action classification. This pipeline has at least two disadvantages: (a) the two stages are disjoint and so are not able to benefit from each other, and (b) it introduces significant computational overheads, as the two networks must be employed one after the other. The present applicant has therefore recognised the need for an improved technique for performing video action classification.

DISCLOSURE

Technical Solution

According to an embodiment, a method of controlling an electronic apparatus for performing video action classification using a trained machine learning (ML) model includes receiving a plurality of frames of a video, inputting, into the trained ML model, the plurality of frames, identifying an actor in the plurality of frames, wherein the actor performs an action in the plurality of frames, and, based on the actor being identified, classifying the action performed by the actor.

The method may further include extracting spatial features and temporal features from the plurality of frames by using a backbone network in the trained ML model. The identifying the actor within the plurality of frames may include applying an actor transformer module in the trained ML model to the extracted spatial features and temporal features from key frames of the plurality of frames, and predicting a bounding box around the actor performing the action.
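
As a minimal sketch of the backbone stage described above, assuming a small 3D CNN stands in for the unspecified backbone network and that all shapes and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Extracts spatial and temporal features from a clip of frames.

    The patent does not name a backbone architecture; two spatio-temporal
    (3D) convolutions are used here purely for illustration.
    """

    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            # 3D convolutions mix information across frames (temporal) and
            # across the image plane (spatial) in a single pass.
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(64, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
        )

    def forward(self, frames):
        # frames: (B, 3, T, H, W) - the plurality of frames of the video.
        feats = self.net(frames)  # (B, dim, T, H/4, W/4)
        B, C, T, H, W = feats.shape
        # Flatten each frame's spatial grid so the transformer modules can
        # attend over positions: (B, T, H*W, dim).
        return feats.permute(0, 2, 3, 4, 1).reshape(B, T, H * W, C)

def keyframe_features(feats, key_indices):
    """Select the features of the key frames for the actor transformer."""
    return feats[:, key_indices]  # (B, K, H*W, dim)
```
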
The applying the actor transformer module to the extracted spatial features and temporal features may include inputting, into an encoder in the actor transformer module, the extracted spatial features and temporal features from the key frames, outputting, from the encoder in the actor transformer module, position features indicating potential positions of the actor in the key frames, inputting, into a decoder in the actor transformer module, the position features output from the encoder in the actor transformer module, and a set of actor queries, and outputting, from the decoder in the actor transformer module, final actor queries. The predicting the bounding box around the actor performing the action may include inputting, into an actor classifier in the actor transformer module, the final actor queries, and outputting, from the actor classifier, coordinates of the bounding box for the actor and a classification score indicating a likelihood of the bounding box containing the actor.

The classifying the action performed by the actor may include applying an action transformer module in the trained ML model to the extracted spatial features and temporal features, and predicting a class for the actor performing the action. The applying the action transformer module to the extracted spatial features and temporal features may include inputting, into an encoder in the action transformer module, the extracted spatial features and temporal features, outputting, from the encoder in the action transformer module, action features indicating potential actions of the actor, inputting, into a decoder in the action transformer module, the action features output from the encoder in the action transformer module, the final actor queries output by the decoder in the actor transformer module, and a set of action queries, and outputting, from the decoder in the action transformer module, final action queries.
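
The actor transformer module described above maps key-frame features to bounding boxes via an encoder-decoder pair and a classifier head. A minimal PyTorch sketch follows; the layer counts, feature dimension, number of actor queries, and the use of standard nn.Transformer building blocks are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    """DETR-style actor branch: encoder -> decoder -> box/score heads."""

    def __init__(self, dim=256, num_queries=100, nhead=8, num_layers=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers)
        self.decoder = nn.TransformerDecoder(dec, num_layers)
        self.actor_queries = nn.Embedding(num_queries, dim)  # learned actor queries
        self.box_head = nn.Linear(dim, 4)    # bounding-box coordinates
        self.score_head = nn.Linear(dim, 1)  # likelihood the box contains an actor

    def forward(self, keyframe_feats):
        # keyframe_feats: (B, S, dim) key-frame features from the backbone,
        # flattened over key frames and spatial positions.
        position_feats = self.encoder(keyframe_feats)  # potential actor positions
        queries = self.actor_queries.weight.unsqueeze(0).expand(
            keyframe_feats.size(0), -1, -1)
        final_actor_queries = self.decoder(queries, position_feats)
        boxes = self.box_head(final_actor_queries).sigmoid()      # normalised coords
        scores = self.score_head(final_actor_queries).sigmoid()   # classification scores
        return final_actor_queries, boxes, scores
```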
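
The action transformer module can be sketched the same way. Its decoder consumes the encoder's action features together with the final actor queries and a set of learned action queries; since the text does not specify how the two sets of queries are combined, concatenating them along the sequence axis is an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """Action branch: encodes the full spatio-temporal features, decodes action
    queries conditioned on the final actor queries, and classifies actions."""

    def __init__(self, dim=256, num_queries=100, num_classes=80,
                 nhead=8, num_layers=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers)
        self.decoder = nn.TransformerDecoder(dec, num_layers)
        self.action_queries = nn.Embedding(num_queries, dim)
        self.action_classifier = nn.Linear(dim, num_classes)

    def forward(self, spatiotemporal_feats, final_actor_queries):
        # spatiotemporal_feats: (B, S, dim) features over all input frames.
        action_feats = self.encoder(spatiotemporal_feats)  # potential actions
        B = spatiotemporal_feats.size(0)
        queries = self.action_queries.weight.unsqueeze(0).expand(B, -1, -1)
        # Condition the action queries on the actor queries by concatenation
        # (an assumed mechanism), then keep only the action-query outputs.
        tgt = torch.cat([queries, final_actor_queries], dim=1)
        final_action_queries = self.decoder(tgt, action_feats)[:, :queries.size(1)]
        logits = self.action_classifier(final_action_queries)
        confidences, classes = logits.softmax(dim=-1).max(dim=-1)
        return classes, confidences  # predicted class and confidence per query
```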