US-20260127914-A1 - APPARATUS AND METHOD FOR RECOGNIZING STEREOTYPED ACTIONS BASED ON ARTIFICIAL INTELLIGENCE

Abstract

Provided is an apparatus for recognizing a stereotyped action, which includes: a memory storing a learning video dataset including a stereotyped action of a designated disabled child; and a processor functionally connected to the memory, wherein the processor includes: a text encoder configured to extract first features from a composite description phrase related to a facial expression and an action of the child included in the learning video dataset; a video encoder configured to output second features related to a facial expression and an action of the child from the learning video dataset; and a contrastive learning unit configured to learn a similarity between the first and second features that are paired with each other among the first features and the second features to model the video encoder.

Inventors

  • Cheolhwan Yoo
  • Jang-Hee Yoo
  • Jaeyoon Jang

Assignees

  • ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE

Dates

Publication Date
2026-05-07
Application Date
2025-10-10
Priority Date
2024-11-07

Claims (20)

  1. An apparatus for recognizing a stereotyped action, comprising: a memory storing a learning video dataset including a stereotyped action of a designated disabled child; and a processor functionally connected to the memory, wherein the processor comprises: a text encoder configured to extract first features from a composite description phrase related to a facial expression and an action of the child included in the learning video dataset; a video encoder configured to output second features related to a facial expression and an action of the child from the learning video dataset; and a contrastive learning unit configured to learn a similarity between the first and second features forming a pair among the first features and the second features to model the video encoder.
  2. The apparatus of claim 1, wherein the video encoder has a structure based on a 3D convolutional neural network (CNN) and a video transformer to utilize temporal information of time-series data.
  3. The apparatus of claim 1, wherein the processor includes: an emotion action recognition unit configured to output an emotion word corresponding to the facial expression; an action description generation unit configured to generate a plurality of action description phrases for each stereotyped action; and a linkage unit configured to combine at least one of the plurality of action description phrases with the emotion word to generate the composite description phrase.
  4. The apparatus of claim 3, wherein the emotion action recognition unit is configured to: extract an expression feature from a face region included in the learning video dataset; classify an emotion of the child based on the extracted expression feature using a designated categorical model; and output the emotion word corresponding to the classified emotion.
  5. The apparatus of claim 4, wherein the processor further includes a preprocessing unit configured to detect the face region from the learning video dataset and provide the detected face region to the video encoder and the emotion action recognition unit.
  6. The apparatus of claim 3, wherein the action description generation unit generates the plurality of action description phrases corresponding to action label information of the stereotyped action using a large language model.
  7. The apparatus of claim 3, wherein the linkage unit randomly selects a single action description phrase from among the plurality of action description phrases and combines the selected action description phrase with the emotion word to generate the composite description phrase.
  8. The apparatus of claim 1, wherein the learning video dataset includes a plurality of pieces of video data each including a different stereotyped action, and the contrastive learning unit adjusts a weight of the video encoder such that a similarity between first and second features forming a pair among the first features and the second features extracted from each of the plurality of pieces of video data is maximized and a similarity between first and second features forming different pairs is minimized.
  9. The apparatus of claim 1, wherein the processor further includes an intermediate concept generation unit, and the intermediate concept generation unit is configured to: obtain a plurality of first features related to a list of composite description phrases regarding a plurality of stereotyped actions of the designated disabled child; obtain a second feature related to a stereotyped action and a facial expression included in one piece of video data from the modeled video encoder; and generate similarity-related information between the list of composite description phrases and the action and the facial expression in the one piece of video data.
  10. The apparatus of claim 9, wherein the processor further includes a stereotyped action recognition unit, and the stereotyped action recognition unit infers a type of the stereotyped action based on the similarity-related information and outputs an inference result.
  11. An apparatus for recognizing a stereotyped action, comprising: a memory storing first features of a list of composite description phrases describing a stereotyped action of a designated disabled child in relation to a facial expression; and a processor functionally connected to the memory, wherein the processor comprises: a video encoder configured to extract second features related to an action and a facial expression of a subject to be diagnosed from one piece of video data; and an action recognition unit configured to infer a type of the action included in the one piece of video data based on a similarity between the first features and the second features.
  12. The apparatus of claim 11, wherein the processor further includes an intermediate concept generation unit configured to calculate a similarity between the first features and the second features and generate similarity-related information between the list of composite description phrases and the action and the facial expression in the one piece of video data.
  13. The apparatus of claim 12, further comprising an output device, wherein the processor organizes the similarity-related information in at least one visual format of a graph and a chart and outputs the organized similarity-related information in the at least one visual format through the output device.
  14. The apparatus of claim 11, wherein the processor further includes a text encoder configured to extract the first features from the list of composite description phrases and store the extracted first features in the memory.
  15. The apparatus of claim 11, wherein the processor further includes a text encoder and a contrastive learning unit, the text encoder encodes a composite description phrase describing each facial expression related to each stereotyped action in learning video data captured for the designated disabled child to extract first features, the video encoder extracts second features related to each stereotyped action and the facial expression from the learning video data, and the contrastive learning unit learns a similarity between the first and second features that are paired with each other among the first features and the second features extracted from the learning video data to model the video encoder.
  16. A method of recognizing a stereotyped action, which is performed by at least one processor, the method comprising: encoding at least one composite description phrase related to a learning video dataset to extract first features; encoding the learning video dataset using a video encoder to output second features related to a facial expression and an action of a subject to be diagnosed; and learning a similarity between the first and second features that are paired with each other among the first features and the second features to model the video encoder such that the similarity between the first and second features paired with each other increases.
  17. The method of claim 16, further comprising, before the extracting of the first features: outputting an emotion word corresponding to the facial expression; generating a plurality of action description phrases for each of the stereotyped actions using a large language model; and combining at least one of the plurality of action description phrases with the emotion word to generate the composite description phrase.
  18. The method of claim 17, wherein the generating of the composite description phrase includes: randomly selecting a single action description phrase from among the plurality of action description phrases; and combining the selected action description phrase with the emotion word to generate the composite description phrase.
  19. The method of claim 17, further comprising: obtaining a plurality of first features related to a list of composite description phrases regarding a plurality of stereotyped actions of the designated disabled child, and obtaining a second feature related to a stereotyped action and a facial expression included in one piece of video data from the modeled video encoder; generating similarity-related information between the list of composite description phrases and the action and the facial expression in the one piece of video data; and outputting the similarity-related information.
  20. The method of claim 19, wherein the outputting of the similarity-related information includes at least one of: inferring a type of the stereotyped action based on the similarity-related information of the action and the facial expression in the one piece of video data, and outputting an inference result; and visualizing and outputting the similarity-related information.
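
The contrastive learning unit recited in claims 1 and 8 follows the general pattern of contrastive text-video alignment: the similarity of each matched pair of first (text) and second (video) features is maximized while the similarity of mismatched pairs is minimized. Below is a minimal, hypothetical PyTorch sketch of such a symmetric contrastive objective; the function name, temperature value, and batch layout are illustrative assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_feats, video_feats, temperature=0.07):
    """text_feats, video_feats: (batch, dim) first/second features,
    row i of each tensor forming a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are true pairs.
    logits = text_feats @ video_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls matched pairs together and pushes
    # mismatched pairs apart, in both text-to-video and video-to-text
    # directions, as recited in claim 8.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2
```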

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0157102, filed on Nov. 7, 2024, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

1. Field of the Invention

Various embodiments disclosed in this document relate to a technology for recognizing a user action.

2. Description of Related Art

According to the U.S. Centers for Disease Control and Prevention (CDC), the prevalence of autism spectrum disorder (ASD) in children has been steadily increasing every year, from 1 in 54 in 2016 to 1 in 36 in 2020. The prevalence is also high in Korea, where 1 in 38 children (2.64%) is affected and the rate is rising significantly (an average annual increase of 6.6%). Early diagnosis of children with ASD is very important: it enables treatment within the critical window and can, to some extent, prevent secondary neurological damage and the accumulation of behavioral problems. However, conventional diagnostic systems have relied mainly on labor-intensive, repetitive tests performed by medical professionals. This approach is time-consuming and often misses the critical window for early diagnosis, which strongly affects the prognosis of children with ASD. To resolve these issues, technologies that support ASD diagnosis by using artificial intelligence (AI)-based automated analysis devices to analyze stereotyped actions, the main behavioral indicators in children with ASD, are being widely studied and have attracted significant interest from researchers and clinicians.

SUMMARY OF THE INVENTION

Conventional AI-based stereotyped action recognition and detection methods may have several limitations. For example, action recognition technologies may recognize a final action class from a specific pattern analyzed from video data using a black-box AI model. Such black-box models not only have low interpretability but also have difficulty providing the intermediate reasoning that leads to an inference result. Without a basis for how a stereotyped action was recognized, it is difficult for medical professionals to trust, accept, and clinically utilize AI diagnoses. As another example, conventional AI-based stereotyped action recognition technologies analyze only children's physical movements and behavioral patterns, making it difficult to understand and interpret the composite behavioral characteristics of children with ASD. The actions of children with ASD are closely related to their emotional states, and the same action may carry different meanings and interpretations depending on the child's emotional state.

Various embodiments disclosed in this document may provide an AI-based apparatus and method for recognizing stereotyped actions that can assist in the diagnosis of children with autism spectrum disorder by analyzing video data.
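
The composite description phrase is how the disclosure couples a stereotyped action with the child's emotional state: a large language model generates a plurality of action description phrases for each action label (claim 6), one phrase is randomly selected, and the selection is combined with the emotion word recognized from the child's face (claim 7). A minimal sketch of that assembly step is shown below; the label names, phrase wording, and combination template are hypothetical illustrations, not examples taken from the disclosure.

```python
import random

# Hypothetical action description phrases, as would be generated offline
# by a large language model for each stereotyped-action label.
ACTION_DESCRIPTIONS = {
    "hand_flapping": [
        "a child repeatedly flapping both hands at shoulder height",
        "a child shaking both hands rapidly up and down",
    ],
    "body_rocking": [
        "a child rocking the upper body back and forth while seated",
        "a child swaying rhythmically forward and backward",
    ],
}

def compose_phrase(action_label: str, emotion_word: str) -> str:
    """Randomly select one action description phrase and combine it with
    the emotion word output by the emotion action recognition unit."""
    description = random.choice(ACTION_DESCRIPTIONS[action_label])
    return f"{description} while appearing {emotion_word}"

# Example: compose_phrase("hand_flapping", "anxious") might return
# "a child shaking both hands rapidly up and down while appearing anxious"
```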
According to an aspect of the present invention, there is provided an apparatus for recognizing a stereotyped action, which includes: a memory storing a learning video dataset including a stereotyped action of a designated disabled child; and a processor functionally connected to the memory, wherein the processor includes: a text encoder configured to extract first features from a composite description phrase related to a facial expression and an action of the child included in the learning video dataset; a video encoder configured to output second features related to a facial expression and an action of the child from the learning video dataset; and a contrastive learning unit configured to learn a similarity between the first and second features that are paired with each other among the first features and the second features to model the video encoder.

According to another aspect of the present invention, there is provided an apparatus for recognizing a stereotyped action, which includes: a memory storing first features of a list of composite description phrases describing a stereotyped action of a designated disabled child in relation to a facial expression; and a processor functionally connected to the memory, wherein the processor includes: a video encoder configured to extract second features related to an action and a facial expression of a subject to be diagnosed from one piece of video data; and an action recognition unit configured to infer a type of the action included in the one piece of video data based on a similarity between the first features and the second features.

According to still another aspect of the present invention, there is provided a method of recognizing a stereotyped action, which is performed by at least one processor and includes: encoding at least one composite description phrase related to a learning video dataset to extract first features; encoding the learning video dataset using a video encoder to output second features related to a facial expression and an action of a subject to be diagnosed; and learning a similarity between the first and second features that are paired with each other among the first features and the second features to model the video encoder.
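
At inference time, the stored first features of the composite-description-phrase list act as class prototypes: the modeled video encoder produces a second feature for a clip of the subject to be diagnosed, the similarity against every phrase is computed as interpretable intermediate information, and the best-matching phrase determines the inferred type of stereotyped action. The following is a minimal, hypothetical sketch of that similarity-based inference; all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def recognize(video_feat, text_feats, phrase_labels):
    """video_feat: (dim,) second feature from the modeled video encoder.
    text_feats: (num_phrases, dim) stored first features of the phrase list.
    phrase_labels: the stereotyped-action type each phrase describes."""
    video_feat = F.normalize(video_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Similarity-related information between the phrase list and the clip;
    # this vector is the interpretable intermediate concept and can also
    # be visualized as a graph or chart.
    similarities = text_feats @ video_feat

    # Infer the action type from the best-matching composite phrase.
    best = int(torch.argmax(similarities))
    return phrase_labels[best], similarities
```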