KR-20260067957-A - METHOD FOR LEARNING OBJECT DETECTION MODEL, AND APPARATUS IMPLEMENTING THE SAME METHOD


Abstract

A method for training an object detection model, performed by a computing system according to one embodiment of the present disclosure, comprises: acquiring training data including an image, the location of an object present in the image, the class of the object, and metadata representing the structure of the image; performing supervised training of an AI model including an object detection network and a sequence decoder network using the training data, wherein the sequence decoder network is trained using the metadata; and acquiring the object detection network as the object detection model after the supervised training is completed.

Inventors

김태훈
김남욱
전대헌
김광진
박두원

Assignees

삼성에스디에스 주식회사

Dates

Publication Date: 20260513
Application Date: 20250425
Priority Date: 20241106

Claims (20)

A method for training an object detection model, performed by a computing system, the method comprising: acquiring training data including an image, a location of an object present in the image, a class of the object, and metadata representing a structure of the image; performing supervised training of an AI model including an object detection network and a sequence decoder network using the training data, wherein the sequence decoder network is trained using the metadata; and acquiring the object detection network as the object detection model after the supervised training is completed.
The method of claim 1, wherein performing the supervised training of the AI model comprises: updating parameters of the AI model through backpropagation using both a first loss value calculated from data output by the object detection network and a second loss value calculated from data output by the sequence decoder network.
The method of claim 1, wherein the object detection network includes a backbone, an encoder, a decoder, and a one-to-one matching module, and the sequence decoder network includes a projection layer and a sequence decoder.
The method of claim 3, wherein performing the supervised training of the AI model comprises: obtaining a feature vector by inputting the image included in the training data into the backbone and the encoder; inputting the feature vector into the decoder to obtain object queries composed of bounding box coordinates and classes; inputting the object queries into the one-to-one matching module to obtain a matching result between the ground truth and the object queries; and calculating an object detection loss using the matching result.
The method of claim 4, wherein performing the supervised training of the AI model further comprises: mapping the feature vector to an HTML feature space through the projection layer; inputting the mapped result into the sequence decoder to obtain HTML tags in token units; and calculating a sequence prediction loss by comparing the obtained HTML tags with the metadata.
The method of claim 5, wherein performing the supervised training of the AI model further comprises: training the AI model based on the object detection loss and the sequence prediction loss.
The method of claim 6, wherein training the AI model based on the object detection loss and the sequence prediction loss comprises: calculating a total loss value as a weighted sum of the object detection loss and the sequence prediction loss; when the total loss value has not reached a preset threshold, updating the parameters of the AI model to continue training; and terminating the training of the AI model when the total loss value reaches the preset threshold.
The method of claim 7, wherein calculating the total loss value as the weighted sum of the object detection loss and the sequence prediction loss comprises: setting a first weight to be applied to the object detection loss and a second weight to be applied to the sequence prediction loss differently depending on a result of analyzing characteristics of the image; and calculating the total loss value using the first weight and the second weight thus set.
The method of claim 7, wherein calculating the total loss value as the weighted sum of the object detection loss and the sequence prediction loss comprises: measuring a model performance indicator for each of the object detection network and the sequence decoder network; setting a first weight to be applied to the object detection loss and a second weight to be applied to the sequence prediction loss differently depending on the measured model performance indicators; and calculating the total loss value using the first weight and the second weight thus set.
The method of claim 7, wherein updating the parameters of the AI model when the total loss value has not reached the preset threshold comprises: when the sequence prediction loss reaches a predetermined threshold, stopping the updating of parameters of the sequence decoder network and continuing training by updating only parameters of the object detection network.
An AI-based object detection method performed by a computing system, the method comprising: inputting an image into an object detection model, wherein the object detection model is generated as a result of supervised training of an AI model including an object detection network and a sequence decoder network, and the sequence decoder network is trained using metadata representing a structure of the image; and obtaining, from the object detection model, a detection result for an object present in the image.
The method of claim 11, wherein the AI model is trained using training data comprising image data, position data of an object expressed in pixel coordinates of the image data, class data of the object, and HTML tag data representing a structure of the image data.
The method of claim 12, wherein the object detection network comprises: a backbone and an encoder that receive the image data included in the training data as input and output a feature vector; a decoder that receives the feature vector as input and generates object queries composed of bounding box coordinates and classes; and a one-to-one matching module that matches the ground truth with the object queries generated by the decoder and outputs a matching result.
The method of claim 13, wherein the sequence decoder network comprises: a projection layer that maps the feature vector output from the backbone and the encoder to an HTML feature space; and a sequence decoder that outputs HTML tags in token units using the mapping result of the projection layer.
The method of claim 13, further comprising, before inputting the image into the object detection model: obtaining the object detection model by performing supervised training of the AI model including the object detection network and the sequence decoder network using the training data.
The method of claim 15, wherein obtaining the object detection model comprises: calculating an object detection loss using the matching result output from the one-to-one matching module during training of the object detection network; calculating a sequence prediction loss using the output of the sequence decoder during training of the sequence decoder network; and training the AI model based on the object detection loss and the sequence prediction loss.
The method of claim 16, wherein training the AI model based on the object detection loss and the sequence prediction loss comprises: calculating a total loss value as a weighted sum of the object detection loss and the sequence prediction loss; when the total loss value has not reached a preset threshold, updating the parameters of the AI model to continue training; and terminating the training of the AI model when the total loss value reaches the preset threshold.
The method of claim 17, wherein updating the parameters of the AI model when the total loss value has not reached the preset threshold comprises: when the sequence prediction loss reaches a predetermined threshold, stopping the updating of parameters of the sequence decoder network and continuing training by updating only parameters of the object detection network.
The method of claim 17, wherein calculating the sequence prediction loss comprises: calculating the sequence prediction loss using an HTML tag sequence output from the sequence decoder, wherein the HTML tag sequence includes a plurality of HTML tokens that are output sequentially, and each of the HTML tokens is output based on information regarding the HTML token output at a previous time point.
The method of claim 19, wherein calculating the sequence prediction loss comprises: calculating a loss value for each of the plurality of HTML tokens sequentially output from the sequence decoder; and calculating the sequence prediction loss using the sum of the loss values of the plurality of HTML tokens.
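
The loss handling recited in the claims (per-token loss summation, weighted total loss, threshold-based termination, and freezing of the sequence decoder network) can be illustrated with a minimal sketch. It is not the applicant's implementation. All function names, the default weight values, and the use of negative log-likelihood as the per-token loss are assumptions made for illustration, and "reaches a threshold" is interpreted here as "falls to or below the threshold".

```python
import math


def sequence_prediction_loss(step_probs, target_tokens):
    """Sum of per-token losses over an HTML token sequence.

    step_probs: one dict per decoding step, mapping token -> predicted probability.
    target_tokens: the ground-truth HTML token for each step (from the metadata).
    Assumption: the per-token loss is the negative log-likelihood of the target.
    """
    return sum(-math.log(probs[tok]) for probs, tok in zip(step_probs, target_tokens))


def total_loss(det_loss, seq_loss, w_det=1.0, w_seq=0.5):
    """Weighted sum of the object detection and sequence prediction losses.

    The default weights are placeholders, not values from the source.
    """
    return w_det * det_loss + w_seq * seq_loss


def training_action(det_loss, seq_loss, total_threshold, seq_threshold,
                    w_det=1.0, w_seq=0.5):
    """Decide what the training loop does next (hypothetical helper).

    Returns "stop" when the total loss has reached its threshold,
    "update_detector_only" when only the sequence loss has reached its
    threshold (the sequence decoder network is frozen), else "update_all".
    """
    if total_loss(det_loss, seq_loss, w_det, w_seq) <= total_threshold:
        return "stop"
    if seq_loss <= seq_threshold:
        return "update_detector_only"
    return "update_all"


if __name__ == "__main__":
    # Two decoding steps with toy probability distributions over HTML tokens.
    step_probs = [{"<table>": 0.5, "<tr>": 0.5}, {"<tr>": 0.25, "<td>": 0.75}]
    targets = ["<table>", "<tr>"]
    seq = sequence_prediction_loss(step_probs, targets)
    print(seq, total_loss(2.0, seq), training_action(2.0, seq, 0.1, 0.2))
```

In a real framework, the "update_detector_only" branch would typically correspond to excluding the sequence decoder network's parameters from gradient updates, while the backbone and encoder shared with the object detection network continue to be trained.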

Description

Method for Learning Object Detection Model, and Apparatus for Implementing the Same Method
The present disclosure relates to a method for training an object detection model and an apparatus for implementing the method, and more specifically to a method for training an object detection model used to identify objects in an image or video, and an apparatus for implementing the method.
Object detection is a computer vision technology that locates and classifies specific objects in images or videos, and is used in various fields such as autonomous vehicles, security systems, and medical image analysis. Existing object detection models learn only the location and type of objects within an image, and therefore understand the image's overall structure or content poorly. For example, when processing document images, identifying the content and structure of the document requires additional post-processing or a separate model, which consumes additional time and cost and reduces the model's efficiency. Furthermore, special cases such as document images are difficult to handle with the output of a single existing object detection model alone.
Therefore, there is a need for an object detection model that improves the accuracy of object detection by learning not only the location and type of objects present in the image, but also the structure and content of the image. In addition, there is a need for a technology that improves object detection performance using only a single model, without post-processing or additional separate models, when detecting objects such as tables, figures, and paragraphs in special images like document images.
FIG. 1 illustrates the configuration of an overall system including a learning device for an object detection model according to an embodiment of the present disclosure.
FIG. 2 is an example of the input and output of a learning device for an object detection model according to some embodiments of the present disclosure.
FIG. 3 is a flowchart for explaining a method of training an object detection model according to one embodiment of the present disclosure.
FIG. 4 is an example illustrating the structure of an AI model including an object detection network and a sequence decoder network according to some embodiments of the present disclosure.
FIG. 5 is an example illustrating the structure of a sequence decoder layer according to some embodiments of the present disclosure.
FIG. 6 is an example of input and output during the learning of an object detection task according to some embodiments of the present disclosure.
FIG. 7 is an example of input and output during the learning of a sequence prediction task according to some embodiments of the present disclosure.
FIG. 8 is an example of calculating a sequence prediction loss during the learning of a sequence prediction task according to some embodiments of the present disclosure.
FIG. 9 is a flowchart illustrating the learning pipeline of an object detection task and a sequence prediction task according to some embodiments of the present disclosure.
FIG. 10 is an example of a formula representing object detection loss, sequence prediction loss, and total loss values according to some embodiments of the present disclosure.
FIG. 11 is an example showing the performance evaluation results of a model according to some embodiments of the present disclosure.
FIG. 12 is a flowchart for explaining an object detection method according to another embodiment of the present disclosure.
FIG. 13 is a hardware configuration diagram of an exemplary computing system capable of implementing methods according to one embodiment of the present disclosure.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the attached drawings.
The advantages and features of the present disclosure and the methods for achieving them will become clear by referring to the embodiments described below in detail together with the attached drawings. However, the technical concept of the present disclosure is not limited to the following embodiments but can be implemented in various different forms. The following embodiments are provided merely to complete the technical concept of the present disclosure and to fully inform those skilled in the art of the scope of the present disclosure, and the technical concept of the present disclosure is defined only by the scope of the claims. It should be noted that when assigning reference numerals to the components of each drawing, the same components are given the same reference numeral whenever possible, even if they are shown in different drawings. Furthermore, in describing the present disclosure, if it is determined that a detailed description of related known components or functions could obscure the essence of the present disclosure, such detailed description is omitted. Unless otherwi