KR-20260064300-A - DETECTION OF USER ACTION IN VIDEO

Abstract

A method for detecting user actions in a video in which a graphical user interface (GUI) is recorded is disclosed, comprising the steps of: identifying a set of keyframes of the video among a plurality of frames of the video based on a pixel-wise difference between each pair of consecutive frames among the plurality of frames; parsing a plurality of visual elements of the GUI, including a cursor, from each keyframe in the set to generate GUI information representing the plurality of visual elements for the corresponding keyframe; classifying an action performed by the user on another of the plurality of visual elements during the recording of the GUI in response to an overlap between a first region and a second region within the corresponding keyframe, corresponding respectively to the cursor and the other visual element; and generating a linguistic description of the action by performing machine learning (ML)-based natural language processing (NLP) on the result of the classification together with the GUI information.
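
To make the four steps concrete, the following is a minimal sketch of the pipeline in Python with OpenCV. Here parse_gui(), classify_action(), and describe() are hypothetical placeholders for the parsing, classification, and ML/NLP stages, which the abstract leaves implementation-agnostic, and the difference threshold is an illustrative assumption.

```python
# Minimal sketch of the disclosed pipeline; parse_gui(), classify_action(),
# and describe() are hypothetical placeholders, and diff_threshold is an
# illustrative value, not one specified by the disclosure.
import cv2
import numpy as np

def detect_user_actions(video_path, diff_threshold=12.0):
    cap = cv2.VideoCapture(video_path)
    keyframes, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Step 1: keep a frame as a keyframe when its mean pixel-wise
        # intensity difference from the previous frame exceeds the threshold.
        if prev is None or np.abs(gray - prev).mean() > diff_threshold:
            keyframes.append(frame)
        prev = gray
    cap.release()

    descriptions = []
    for cur, nxt in zip(keyframes, keyframes[1:]):
        elements = parse_gui(cur)                     # Step 2: GUI parsing (hypothetical)
        action = classify_action(cur, nxt, elements)  # Step 3: action classification (hypothetical)
        if action is not None:
            descriptions.append(describe(action, elements))  # Step 4: ML/NLP description (hypothetical)
    return descriptions
```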

Inventors

  • 정철
  • 호사인 엠디 비둣

Assignees

  • 세종대학교산학협력단

Dates

Publication Date
2026-05-07
Application Date
2024-10-31

Claims (11)

  1. A method for detecting a user action in a video in which a graphical user interface (GUI) is recorded, the method comprising: identifying a set of keyframes of the video among a plurality of frames of the video based on a pixel-wise difference between each pair of consecutive frames among the plurality of frames; parsing a plurality of visual elements of the GUI, including a cursor, from each keyframe in the set and generating GUI information representing the plurality of visual elements for each keyframe; classifying an action performed by a user on another of the plurality of visual elements during the recording in response to an overlap between a first region and a second region within each keyframe, corresponding respectively to the cursor and the other visual element; and generating a linguistic description of the action by performing machine learning (ML)-based natural language processing (NLP) on the result of the classification together with the GUI information.
  2. The method of claim 1, wherein each of the plurality of frames has an intensity value at each of a plurality of pixel locations, the method further comprising calculating a plurality of pixel intensity differences between each pair of consecutive frames, wherein each of the plurality of pixel intensity differences is an intensity difference between the pair of consecutive frames at a corresponding one of the plurality of pixel locations, and the pixel-wise difference is one of the plurality of pixel intensity differences.
  3. The method of claim 1, wherein the other visual element is a text input element of the GUI, and classifying the action comprises: performing character recognition on the second region within a subsequent keyframe in the set in response to the overlap; and classifying the action as the user entering text in the other visual element when a typed character is recognized by the character recognition.
  4. The method of claim 1, wherein the other visual element is a non-text input element of the GUI, and classifying the action comprises: calculating an intensity difference at at least one given pixel location between the first region within each keyframe and the first region within a subsequent keyframe in the set in response to the overlap; determining whether there is a scene change between each keyframe and the subsequent keyframe when the difference is smaller than a predetermined threshold; and classifying the action as the user performing a regular click on the other visual element when it is determined that there is a scene change.
  5. The method of claim 1, wherein the other visual element is a non-text input element of the GUI, and classifying the action comprises: calculating an intensity difference at at least one given pixel location between the first region within each keyframe and the first region within a subsequent keyframe in the set in response to the overlap; determining whether the first region has moved between each keyframe and the subsequent keyframe when the difference is greater than a predetermined threshold; classifying the action as the user performing a control click on the other visual element when it is determined that the first region has not moved; and classifying the action as the user performing a drag on the other visual element when it is determined that the first region has moved.
  6. A device for detecting a user action in a video in which a graphical user interface (GUI) is recorded, the device comprising: a keyframe selector that identifies a set of keyframes of the video among a plurality of frames of the video based on a pixel-wise difference between each pair of consecutive frames among the plurality of frames; a GUI parser that parses a plurality of visual elements of the GUI, including a cursor, from each keyframe in the set and generates GUI information describing the plurality of visual elements for each keyframe; and a GUI action classifier that classifies an action performed by a user on another of the plurality of visual elements during the recording in response to an overlap between a first region and a second region within each keyframe, corresponding respectively to the cursor and the other visual element, and generates a linguistic description of the action by performing machine learning (ML)-based natural language processing (NLP) on the result of the classification together with the GUI information.
  7. The device of claim 6, wherein each of the plurality of frames has an intensity value at each of a plurality of pixel locations, the keyframe selector further calculates a plurality of pixel intensity differences between each pair of consecutive frames, each of the plurality of pixel intensity differences is an intensity difference between the pair of consecutive frames at a corresponding one of the plurality of pixel locations, and the pixel-wise difference is one of the plurality of pixel intensity differences.
  8. The device of claim 6, wherein the other visual element is a text input element of the GUI, and classifying the action comprises: performing character recognition on the second region within a subsequent keyframe in the set in response to the overlap; and classifying the action as the user entering text in the other visual element when a typed character is recognized by the character recognition.
  9. The device of claim 6, wherein the other visual element is a non-text input element of the GUI, and classifying the action comprises: calculating an intensity difference at at least one given pixel location between the first region within each keyframe and the first region within a subsequent keyframe in the set in response to the overlap; determining whether there is a scene change between each keyframe and the subsequent keyframe when the difference is smaller than a predetermined threshold; and classifying the action as the user performing a regular click on the other visual element when it is determined that there is a scene change.
  10. The device of claim 6, wherein the other visual element is a non-text input element of the GUI, and classifying the action comprises: calculating an intensity difference at at least one given pixel location between the first region within each keyframe and the first region within a subsequent keyframe in the set in response to the overlap; determining whether the first region has moved between each keyframe and the subsequent keyframe when the difference is greater than a predetermined threshold; classifying the action as the user performing a control click on the other visual element when it is determined that the first region has not moved; and classifying the action as the user performing a drag on the other visual element when it is determined that the first region has moved.
  11. A computer-readable storage medium storing computer-executable instructions that, when executed by a computer processor, cause the computer processor to perform the method of any one of claims 1 to 5.
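
As a concrete reading of the classification logic in claims 3-5 (and their device counterparts, claims 8-10), the following is a minimal sketch. pytesseract is an assumed OCR backend (the claims require only "character recognition"), scene_changed() and cursor_moved() are hypothetical helpers for the determinations the claims leave open, and the intensity threshold is illustrative.

```python
# Minimal sketch of the decision logic in claims 3-5 / 8-10.
# pytesseract is an assumed OCR backend; scene_changed() and cursor_moved()
# are hypothetical helpers; the threshold value is illustrative.
import cv2
import numpy as np
import pytesseract

def _crop_gray(frame, region):
    """Crop a (x, y, w, h) region and convert it to float grayscale."""
    x, y, w, h = region
    return cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY).astype(np.float32)

def classify_action(keyframe, next_keyframe, cursor_region, element_region,
                    element_is_text_input, threshold=15.0):
    if element_is_text_input:
        # Claims 3/8: run character recognition on the element region of the
        # subsequent keyframe; a recognized typed character implies text entry.
        x, y, w, h = element_region
        text = pytesseract.image_to_string(next_keyframe[y:y + h, x:x + w])
        return "text entry" if text.strip() else None
    # Claims 4-5 / 9-10: compare cursor-region intensity across keyframes.
    diff = np.abs(_crop_gray(keyframe, cursor_region)
                  - _crop_gray(next_keyframe, cursor_region)).mean()
    if diff < threshold:
        # Claims 4/9: a small cursor-region difference plus a scene change
        # indicates a regular click.
        return "regular click" if scene_changed(keyframe, next_keyframe) else None
    # Claims 5/10: a large difference with a moved cursor region indicates a
    # drag; with a stationary cursor region, a control click.
    return "drag" if cursor_moved(keyframe, next_keyframe, cursor_region) else "control click"
```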

Description

Detection of User Action in Video

The present disclosure relates to the detection of user actions in video and, more specifically, to generating a linguistic description of such actions by performing machine learning (ML)-based natural language processing (NLP) on the results of classifying actions performed on a graphical user interface (GUI) recorded in the video, together with information about the GUI.

Detecting the actions a user performs on a graphical user interface (hereinafter, GUI actions) is critical for automating and analyzing user interaction with a software system. As GUIs become more common across applications, it becomes important to detect and understand actions such as clicks, drags, and menu selections, through which a user interacts with an application, from recorded GUI video (hereinafter, GUI video), such as a series of captured screens displaying the GUI. Detecting GUI actions from GUI video is a complex task applicable to various fields, including automated testing, user behavior analysis, and human-computer interaction (HCI) research.

Several techniques have been developed for GUI action detection, but identifying GUI actions in captured video poses challenges distinct from recognizing natural scenes. Early work based on template matching (Cheng, Y. P., Li, C. W., & Chen, Y. C. (2019). Apply computer vision in GUI automation for industrial applications. Mathematical Biosciences and Engineering, 16(6), 7526-7545) compares regions of GUI video frames with predefined templates to detect user actions on GUI components, such as clicking an icon. GUI components here denote visual elements of the GUI, such as buttons, text fields, icons (e.g., program icons, directory icons, cursors, or other icons used to operate the computer on which the GUI runs), and menus. Template matching works well when the layout of the GUI is consistent, but it struggles with dynamic or responsive GUIs whose elements can move or change appearance. It also requires manually creating and maintaining a large library of templates for every GUI component of interest, which limits scalability across a wide variety of interfaces.
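
As an illustration of the template-matching baseline above, the following is a minimal sketch using OpenCV's normalized cross-correlation; the 0.8 acceptance threshold is an illustrative assumption, not a value from the cited work.

```python
# Minimal sketch of template matching for locating a GUI component;
# the acceptance threshold is an illustrative assumption.
import cv2

def find_component(frame, template, threshold=0.8):
    """Return the top-left corner of the best match for a GUI-component
    template within a frame, or None if no match clears the threshold."""
    scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    return max_loc if max_val >= threshold else None
```

As the surrounding text notes, each GUI component of interest needs its own template, and a match fails whenever the component is rendered differently from the stored template.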
Optical flow techniques, such as those reviewed in Li, D., Wang, R., Chen, P., Xie, C., Zhou, Q., & Jia, X. (2021). Visual feature learning on video object and human action detection: a systematic review. Micromachines, 13(1), 72, detect motion patterns by tracking pixel changes between consecutive video frames. These techniques are commonly used to detect mouse movements, drags, and clicks from cursor displacement and changes in GUI elements, but they struggle to provide semantic information about the interactions.

Gao, D., Ji, L., Bai, Z., Ouyang, M., Li, P., Mao, D., ... & Shou, M. Z. (2024). AssistGUI: Task-Oriented PC Graphical User Interface Automation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13289-13298) and Wu, Q., Gao, D., Lin, K. Q., Wu, Z., Guo, X., Li, P., ... & Shou, M. Z. (2024). GUI Action Narrator: Where and When Did That Action Take Place?. arXiv preprint arXiv:2406.13719 present artificial intelligence (AI)-based models trained on annotated datasets of GUI components. Such models can recognize and classify interactions from the spatial arrangement of detected elements, but training them effectively requires a massive labeled dataset of user interactions and GUI elements. Furthermore, if the training dataset lacks diversity in interface types, generalizing the approach to unseen GUIs remains a challenge.

Separately, approaches that analyze every frame of a GUI video to detect GUI actions conventionally require substantial storage and time. Referring to Figure 1, which illustrates an example of a video hierarchy: a video consists of at least one scene, each scene consists of at least one shot, and each shot contains multiple frames, each frame consisting of pixels that form an image. Videos often contain repetitive or static scenes, and a keyframe represents the main content and information of the corresponding shot. Calic, J., & Thomas, B. (2004, April). Spatial analysis in key-frame extraction using video segmentation. In Workshop on Image Analysis for Multimedia Interactive Services proposed a simple method for generating keyframes by uniform sampling. Although sampling-based methods are computationally simple and efficient, they may not generate keyframes for a shot effectively; for example, they may generate several keyframes with the same basic content to represent a long static segment (in other words, they may not effectively represent the actual video content). A sketch contrasting the two selection strategies follows the brief description of the drawings below.

Figure 1 shows an example of a typical video hierarchy. Figure 2 shows an example of a GUI action detection device that detects user actions in a video.
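
To make the sampling discussion above concrete, here is a minimal sketch contrasting uniform sampling with pixel-difference-based keyframe selection; the step size, grayscale conversion, and threshold are illustrative assumptions rather than values taken from the cited works.

```python
# Minimal sketch contrasting uniform sampling with difference-based
# keyframe selection; step and threshold values are illustrative.
import cv2
import numpy as np

def keyframes_uniform(frames, step=30):
    """Uniform sampling: one keyframe every `step` frames. Cheap, but a long
    static segment yields many near-duplicate keyframes."""
    return frames[::step]

def keyframes_by_difference(frames, threshold=10.0):
    """Content-based selection: keep a frame only when its mean absolute
    intensity difference from the last kept frame exceeds the threshold, so
    a static segment collapses to a single representative keyframe."""
    kept = [frames[0]]
    last = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY).astype(np.float32)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if np.abs(gray - last).mean() > threshold:
            kept.append(frame)
            last = gray
    return kept
```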