KR-20260064522-A - SYSTEM AND METHOD FOR OPTIMIZING OBJECT DETECTION
Abstract
A system and method are disclosed for receiving an input image; dividing it into two or more regions; generating an augmented image by applying a mixup to mix pixel values in an overlapping portion of the two or more regions, or by performing channel sampling to integrate red, green, and blue (RGB) information; detecting an object within a bounding box based on the mixed pixel values or the integrated RGB information; and performing a function based on the detected object.
Inventors
- Yanlin Zhu
- Mostafa El-Khamy
Assignees
- Samsung Electronics Co., Ltd.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-10-13
- Priority Date: 2025-10-07
Claims (20)
- An object detection method, comprising: receiving, by a processor, an input image; dividing, by the processor, the input image into two or more regions; generating, by the processor, an augmented image based on the two or more regions, wherein generating the augmented image comprises at least one of: applying a mixup to mix pixel values in an overlapping portion of the two or more regions, or performing channel sampling to integrate red, green, and blue (RGB) information in the overlapping portion of the two or more regions; detecting, by the processor, an object within a bounding box based on the mixed pixel values or the integrated RGB information; and performing, by the processor, a function based on the detected object.
- The method of claim 1, wherein dividing the input image into the two or more regions comprises dividing the input image into a grid having varying cell shapes.
- The method of claim 1, wherein applying the mixup comprises mixing the pixel values by linearly combining corresponding RGB values of the overlapping portion of the two or more regions using a weighting factor sampled from a beta distribution.
- The method of claim 1, wherein generating the augmented image comprises both applying the mixup to mix the pixel values and performing the channel sampling to integrate the RGB information of the overlapping portion of the two or more regions.
- The method of claim 1, further comprising: preprocessing, by the processor, the input image using a pose estimation model to identify a region of interest; and cropping the input image to exclude other regions.
- The method of claim 5, wherein the pose estimation model identifies points on a body and guides the cropping of the input image.
- The method of claim 1, wherein detecting the object within the bounding box further comprises training a machine learning model using a weighted loss function, wherein the weighted loss function combines a box localization loss with at least one intersection over union (IOU) based loss.
- The method of claim 7, wherein the IOU-based loss comprises at least one of a complete IOU (CIOU) loss, a distance IOU (DIOU) loss, and a generalized IOU (GIOU) loss, and wherein the weighted loss function dynamically adjusts a weight of an aspect ratio consistency term during training.
- A device, comprising: a processor; and a memory coupled to the processor, wherein the processor is configured to: receive an input image; divide the input image into two or more regions; generate an augmented image based on the two or more regions, wherein generating the augmented image comprises at least one of: applying a mixup to mix pixel values in an overlapping portion of the two or more regions, or performing channel sampling to integrate red, green, and blue (RGB) information in the overlapping portion of the two or more regions; detect an object within a bounding box based on the mixed pixel values or the integrated RGB information; and perform a function based on the detected object.
- The device of claim 9, wherein dividing the input image into the two or more regions comprises dividing the input image into a grid having varying cell shapes.
- The device of claim 9, wherein applying the mixup comprises mixing the pixel values by linearly combining corresponding RGB values of the overlapping portion of the two or more regions using a weighting factor sampled from a beta distribution.
- The device of claim 9, wherein generating the augmented image comprises both applying the mixup to mix the pixel values and performing the channel sampling to integrate the RGB information of the overlapping portion of the two or more regions.
- The device of claim 9, wherein the processor is further configured to: preprocess the input image using a pose estimation model to identify a region of interest; and crop the input image to exclude other regions.
- The device of claim 13, wherein the pose estimation model identifies points on a body and guides the cropping of the input image.
- The device of claim 9, wherein the processor is further configured to train a machine learning model using a weighted loss function, wherein the weighted loss function combines a box localization loss with at least one intersection over union (IOU) based loss.
- The device of claim 15, wherein the IOU-based loss comprises at least one of a complete IOU (CIOU) loss, a distance IOU (DIOU) loss, and a generalized IOU (GIOU) loss, and wherein the weighted loss function dynamically adjusts a weight of an aspect ratio consistency term during training.
- An object detection method, comprising: receiving, by a processor, an input image; processing, by the processor, the input image through a classification branch and a regression branch, wherein the classification branch and the regression branch each include an initial stage and a later stage; generating, by the processor, a first initial feature map in the initial stage of the classification branch; generating, by the processor, a second initial feature map in the initial stage of the regression branch; providing, by the processor, the first initial feature map to the later stage of the regression branch; providing, by the processor, the second initial feature map to the later stage of the classification branch; processing, by the processor, the first and second initial feature maps to generate an output; detecting, by the processor, an object based on the output; and performing, by the processor, a function based on the detected object.
- The method of claim 17, wherein the classification branch generates an initial feature map having a first size, the regression branch generates a later feature map having a second size, and a concatenated feature map has a third size, wherein the first size, the second size, and the third size are different sizes.
- The method of claim 17, further comprising performing a convolution operation that applies a 1x1 convolution to reduce a size of the concatenated feature map.
- The method of claim 17, wherein performing the function comprises detecting the object within the bounding box on an edge device.
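For illustration only (not part of the patented disclosure), the following is a minimal Python sketch of the augmentation recited in claims 1, 3, and 4: the overlapping portion of two mosaic regions is blended either by a mixup whose weighting factor is sampled from a beta distribution, or by channel sampling that assembles the overlap's R, G, and B channels from the two regions. The overlap width, the beta parameter, and the per-channel sampling rule are illustrative assumptions.

```python
import numpy as np

def blend_overlap(region_a, region_b, alpha=1.0, mode="mixup", rng=None):
    """Blend the overlapping portion of two image regions (H x W x 3, uint8).

    mode="mixup":   linearly combine corresponding RGB values with a single
                    weighting factor lam ~ Beta(alpha, alpha).
    mode="channel": channel sampling -- take each of the R, G, B channels of
                    the overlap from one region or the other at random.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = region_a.astype(np.float32)
    b = region_b.astype(np.float32)
    if mode == "mixup":
        lam = rng.beta(alpha, alpha)        # weighting factor in (0, 1)
        mixed = lam * a + (1.0 - lam) * b   # linear combination of RGB values
    else:
        picks = rng.integers(0, 2, size=3)  # choose a source region per channel
        mixed = np.stack(
            [a[..., c] if picks[c] == 0 else b[..., c] for c in range(3)],
            axis=-1,
        )
    return mixed.clip(0, 255).astype(np.uint8)

# Example: two mosaic cells that overlap by 32 pixels along one edge.
cell_a = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
cell_b = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
overlap = blend_overlap(cell_a[:, -32:], cell_b[:, :32], alpha=1.0)
```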
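Similarly illustrative is the pose-guided preprocessing of claims 5 and 6: a pose estimation model identifies points on the body, and the image is cropped around the region of interest they define, excluding other regions. The keypoint layout and the margin used here are hypothetical; the claims do not specify them.

```python
import numpy as np

def crop_to_region(image, keypoints, margin=0.25):
    """Crop an image to the region of interest spanned by pose keypoints.

    image:     H x W x 3 array.
    keypoints: (K, 2) array of (x, y) pixel coordinates for the points the
               pose estimation model assigns to the region (e.g., a hand).
    margin:    fractional padding added around the keypoints' bounding box.
    """
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
    pad_x, pad_y = margin * (x2 - x1), margin * (y2 - y1)
    h, w = image.shape[:2]
    x1 = int(max(0.0, x1 - pad_x)); y1 = int(max(0.0, y1 - pad_y))
    x2 = int(min(float(w), x2 + pad_x)); y2 = int(min(float(h), y2 + pad_y))
    # Cropping away all other regions focuses feature extraction on the target.
    return image[y1:y2, x1:x2]
```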
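The weighted loss function of claims 7 and 8 can be sketched as follows, assuming CIoU as the recited IOU-based loss, an L1 box-localization loss, and a linear ramp as the dynamic schedule for the aspect-ratio consistency weight; none of these specific choices is fixed by the claims.

```python
import math
import torch

def weighted_box_loss(pred, target, step, total_steps, w_l1=1.0, w_iou=2.0):
    """Combine an L1 localization loss with a CIoU loss whose aspect-ratio
    consistency term is dynamically reweighted during training.
    pred, target: (N, 4) tensors of boxes in (x1, y1, x2, y2) format."""
    eps = 1e-7
    l1 = torch.abs(pred - target).sum(dim=1)  # box localization loss

    # IoU of predicted and target boxes.
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # DIoU penalty: squared center distance over the enclosing-box diagonal.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    center = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # CIoU aspect-ratio consistency term.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2

    # Dynamically adjust the aspect-ratio weight (assumed linear ramp), so
    # early training emphasizes overlap and center alignment over aspect ratio.
    ar_weight = step / max(total_steps, 1)
    ciou = 1 - iou + center / diag + ar_weight * v
    return (w_l1 * l1 + w_iou * ciou).mean()
```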
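Claims 17 through 19 describe a head in which the classification and regression branches exchange their initial feature maps, and a 1x1 convolution reduces each concatenated map before the later stages. A hypothetical PyTorch sketch follows; the channel widths, layer counts, and class count are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class CrossSharedHead(nn.Module):
    """Detection head whose classification and regression branches share
    their initial feature maps with each other's later stages."""

    def __init__(self, in_ch=256, mid_ch=256, num_classes=1):
        super().__init__()
        self.cls_initial = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU())
        self.reg_initial = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU())
        # 1x1 convolutions reduce the concatenated (2 * mid_ch) maps back to mid_ch.
        self.cls_reduce = nn.Conv2d(2 * mid_ch, mid_ch, 1)
        self.reg_reduce = nn.Conv2d(2 * mid_ch, mid_ch, 1)
        self.cls_later = nn.Conv2d(mid_ch, num_classes, 3, padding=1)  # class scores
        self.reg_later = nn.Conv2d(mid_ch, 4, 3, padding=1)            # box offsets

    def forward(self, x):
        f_cls = self.cls_initial(x)  # first initial feature map (classification branch)
        f_reg = self.reg_initial(x)  # second initial feature map (regression branch)
        # Each later stage receives both branches' initial feature maps.
        cls_in = self.cls_reduce(torch.cat([f_cls, f_reg], dim=1))
        reg_in = self.reg_reduce(torch.cat([f_reg, f_cls], dim=1))
        return self.cls_later(cls_in), self.reg_later(reg_in)

# Example: a 40x40 feature map with 256 channels from a backbone/neck.
cls_out, reg_out = CrossSharedHead()(torch.randn(1, 256, 40, 40))
```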
Description
System and Method for Optimizing Object Detection

The present disclosure generally relates to object detection technology in the fields of computer vision and human-computer interaction. More specifically, the subject matter disclosed herein relates to improvements in edge computing-based systems for detecting and locating objects having identifiable features, such as hands, faces, limbs, or similar regions of interest, through data augmentation methods, optimized training strategies, and enhanced neural network architectures.

Hand and object detection systems are widely used in computer vision applications, including gesture recognition interfaces, augmented reality (AR), and virtual reality (VR) environments. These systems use machine learning models to generate bounding boxes or masks that identify specific features, such as hands, body parts, physical objects, or other objects of interest, and determine their locations. When deployed on edge computing devices such as smartphones or head-mounted displays (HMDs), these systems must balance detection accuracy, computational efficiency, and real-time performance, which can be a complex and computationally intensive task on resource-constrained devices.

To address this problem, some solutions use standard methods such as basic ("vanilla") MOSAIC data augmentation techniques, fixed cropping strategies, and neural network architectures designed for generalized object detection. While these methods are effective in some cases, they may struggle to handle dynamic conditions such as varying lighting, complex backgrounds, and partial occlusion (e.g., when an object of interest is partially obscured by another object). Furthermore, edge devices may be constrained by limited processing power and memory, which can reduce the feasibility of using computationally intensive models.

One problem with the above approaches is that their training techniques may fail to generate sufficient variability in the training data, which can limit the model's ability to generalize across conditions. For example, typical MOSAIC data augmentation may rely on rigid, non-overlapping layouts, which fail to fully capture the variability required for robust detection. Similarly, cropping methods used to preprocess training images may degrade the model's ability to extract meaningful features by failing to focus on the most relevant regions. In addition, neural network models often fail to properly exploit intermediate features shared between branches, which can result in suboptimal performance in detection tasks.

The following sections describe aspects of the subject matter disclosed herein with reference to exemplary embodiments illustrated in the drawings:

- FIG. 1 illustrates an augmented image generated using an enhanced mosaic method according to one embodiment.
- FIG. 2 illustrates an augmented image generated using an enhanced mosaic method according to one embodiment.
- FIG. 3 is a representation of an input image illustrating raw visual input prior to preprocessing, according to one embodiment.
- FIG. 4 is a representation of an input image processed using a pose estimation model according to one embodiment.
- FIG. 5 is a representation of a pose model index that identifies body points detected by a pose estimation model according to one embodiment.
- FIG. 6 is a representation of a cropped image in which a hand region is isolated based on points identified by a pose estimation model according to one embodiment.
- FIG. 7 is a block diagram of a neural network architecture illustrating intermediate feature sharing between classification and regression branches according to one embodiment.
- FIG. 8A is a block diagram illustrating the training of an object detection model according to one embodiment.
- FIG. 8B is a flowchart illustrating an object detection optimization method according to one embodiment.
- FIG. 8C is a flowchart illustrating an object detection optimization method according to one embodiment.
- FIG. 9 is a block diagram of an electronic device in a network according to one embodiment.
- FIG. 10 is a block diagram illustrating a system including user equipment (UE) and a network node according to one embodiment.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the present disclosure. However, those skilled in the art will understand that the disclosed aspects may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the subject matter disclosed herein. Throughout this specification, references to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment disclosed herein. Accordingly, references to "in one embodiment," "in an embodiment," or "according to one embodiment" in various places throughout this specification may not necessarily all refer to the same embodiment.