KR-20260062734-A - APPARATUS AND METHOD FOR CODE COMPLEXITY PREDICTION USING SEMI-SUPERVISED LEARNING
Abstract
The present invention relates to a code complexity prediction device utilizing semi-supervised learning, comprising: a data augmentation unit (120) that receives original label data (U > θ) whose reliability meets or exceeds a threshold from a training dataset (110) and generates augmented label data; a model training unit (130) that generates pseudo-labels by co-training a first model using the original label data with a second model using the augmented label data; a non-pseudo-label dataset unit (140) that receives, from the training dataset, non-pseudo-label data to which a pseudo-label cannot be assigned; and a pseudo-label processing unit (150) that generates the pseudo-label for the non-pseudo-label data through the first and second models.
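The confidence-thresholded co-training loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `predict_with_confidence` interface, the threshold value, and the tie-breaking rule are all illustrative assumptions.

```python
THETA = 0.9  # reliability threshold (theta); value chosen for illustration

def co_training_round(model_a, model_b, unlabeled, theta=THETA):
    """One pseudo-labeling round: the first model (trained on original
    label data) and the second model (trained on augmented label data)
    each score every unlabeled sample, and a pseudo-label is accepted
    only when the more confident prediction clears the threshold."""
    pseudo, rejected = [], []
    for sample in unlabeled:
        label_a, conf_a = model_a.predict_with_confidence(sample)
        label_b, conf_b = model_b.predict_with_confidence(sample)
        # keep the more confident of the two co-trained models
        label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
        if conf >= theta:
            pseudo.append((sample, label))       # becomes pseudo-label data
        else:
            rejected.append(sample)              # routed to the non-pseudo-label path
    return pseudo, rejected
```

Samples in `rejected` correspond to the non-pseudo-label data that the device hands to the pseudo-label processing unit (150).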
Inventors
- 한요섭
- 한중혁
- 안혜선
- 김정인
- 임수한
Assignees
- 연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-10-29
Claims (9)
- A code complexity prediction device utilizing semi-supervised learning, comprising: a data augmentation unit (120) that receives original label data (U > θ) whose reliability meets or exceeds a threshold from a training dataset (110) and generates augmented label data; a model training unit (130) that generates pseudo-labels by co-training a first model using the original label data with a second model using the augmented label data; a non-pseudo-label dataset unit (140) that receives, from the training dataset, non-pseudo-label data to which a pseudo-label cannot be assigned; and a pseudo-label processing unit (150) that generates the pseudo-label for the non-pseudo-label data through the first and second models.
- The code complexity prediction device utilizing semi-supervised learning of claim 1, characterized in that the data augmentation unit generates the augmented label data through back-translation, which translates the original label data into another programming language and then translates it back.
- The code complexity prediction device utilizing semi-supervised learning of claim 1, characterized in that the data augmentation unit generates the augmented label data through a loop transformation that transforms the loop structure of the original label data.
- The code complexity prediction device utilizing semi-supervised learning of claim 1, characterized in that the model training unit performs a task of predicting the time complexity of code in the original label data using the first model and provides the result of the task to the second model.
- The code complexity prediction device utilizing semi-supervised learning of claim 4, characterized in that the model training unit performs a task of predicting the time complexity of transformed code in the augmented label data using the second model and provides the result of the task to the first model.
- The code complexity prediction device utilizing semi-supervised learning of claim 1, characterized in that the pseudo-label processing unit generates the pseudo-label for the non-pseudo-label data through the first or second model, or through a symbolic module based on a code analysis technique.
- The code complexity prediction device utilizing semi-supervised learning of claim 6, characterized in that the pseudo-label processing unit generates the pseudo-label through the first or second model for prediction data whose reliability meets or exceeds the threshold.
- The code complexity prediction device utilizing semi-supervised learning of claim 7, characterized in that the pseudo-label processing unit generates the pseudo-label through a symbolic module, implemented as a regular expression or an abstract syntax tree, for prediction data whose reliability does not meet the threshold.
- A method for predicting code complexity using semi-supervised learning, performed in a code complexity prediction device using semi-supervised learning, the method comprising: a data augmentation step of receiving original label data whose reliability meets or exceeds a threshold from a training dataset (110) and generating augmented label data; a model training step of generating pseudo-labels by co-training a first model using the original label data with a second model using the augmented label data; a non-pseudo-label dataset step of receiving, from the training dataset, non-pseudo-label data to which a pseudo-label cannot be assigned; and a pseudo-label processing step of generating the pseudo-label for the non-pseudo-label data through the first and second models.
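The loop transformation of claim 3 can be illustrated with a small AST rewrite that turns a `for i in range(n)` loop into an equivalent `while` loop; because the transformation preserves behavior, the transformed code can inherit the original time complexity label as augmented training data. This is a minimal sketch of one possible rewrite rule, not the patent's actual transformation set, and the `augment` helper name is hypothetical (requires Python 3.9+ for `ast.unparse`).

```python
import ast

class ForToWhile(ast.NodeTransformer):
    """Rewrite `for i in range(n): body` into an equivalent while loop.
    Only the single-argument `range(n)` form with a simple loop variable
    is handled in this sketch; other loops are left unchanged."""

    def visit_For(self, node):
        self.generic_visit(node)
        it = node.iter
        if not (isinstance(it, ast.Call) and isinstance(it.func, ast.Name)
                and it.func.id == "range" and len(it.args) == 1
                and isinstance(node.target, ast.Name) and not node.orelse):
            return node
        var = node.target.id
        init = ast.Assign(targets=[node.target], value=ast.Constant(0))   # i = 0
        cond = ast.Compare(left=ast.Name(var, ast.Load()),
                           ops=[ast.Lt()], comparators=[it.args[0]])      # i < n
        step = ast.AugAssign(target=ast.Name(var, ast.Store()),
                             op=ast.Add(), value=ast.Constant(1))         # i += 1
        loop = ast.While(test=cond, body=node.body + [step], orelse=[])
        return [init, loop]

def augment(source):
    """Return a behavior-preserving loop-transformed variant of `source`."""
    tree = ForToWhile().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

For example, `augment("for i in range(n):\n    total += i")` yields an `i = 0` initializer followed by a `while i < n:` loop whose body ends with `i += 1`, which computes the same result in the same asymptotic time.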
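Claims 6 to 8 describe a symbolic fallback module, implemented with regular expressions or an abstract syntax tree, that labels prediction data the models are not confident about. A minimal sketch, assuming a simple loop-nesting-depth heuristic over the AST (the patent text does not specify the actual analysis rules, and `symbolic_complexity` is a hypothetical name):

```python
import ast

def symbolic_complexity(source):
    """Estimate time complexity from the maximum loop-nesting depth of
    the parsed code: depth 0 -> O(1), depth 1 -> O(n), depth d -> O(n^d).
    This heuristic ignores logarithmic loops and recursion; it only
    illustrates the AST-based symbolic module of claims 6-8."""
    def depth(node):
        d = 0
        for child in ast.iter_child_nodes(node):
            inc = 1 if isinstance(child, (ast.For, ast.While)) else 0
            d = max(d, inc + depth(child))
        return d
    d = depth(ast.parse(source))
    if d == 0:
        return "O(1)"
    return "O(n)" if d == 1 else f"O(n^{d})"
```

In the device, such a module would be invoked only for prediction data whose model confidence falls below the threshold, per claim 8.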
Description
Apparatus and Method for Code Complexity Prediction Using Semi-Supervised Learning

Predicting code time complexity is a critical element in algorithm performance evaluation and efficiency measurement, and it plays a significant role in a variety of practical and academic applications. However, existing prediction methods require large volumes of labeled data and suffer severely degraded performance in environments with minimal data. Achieving accurate time complexity prediction where sufficient labeled data is scarce or nearly non-existent has been extremely difficult, making accurate code performance evaluation challenging, particularly in educational settings, automated code review systems, and programming competitions.

Furthermore, because existing methods rely primarily on supervised learning and presuppose large-scale labeled datasets, collecting and managing label data requires significant cost and time. In particular, covering diverse code patterns and linguistic characteristics demands a vast amount of label data, and failure to secure such data degrades both model generalization and prediction reliability. Existing pseudo-label-based semi-supervised learning methods have likewise struggled to guarantee performance in low-data environments because pseudo-label quality deteriorates, and in such approaches noisy data readily harms model performance. Because of these limitations, there is a growing need for technologies capable of reliable time complexity prediction at low cost using only small amounts of data, and hence a demand for new approaches that enable accurate and efficient time complexity prediction even in environments with extremely little label data.

Korean Registered Patent No. 10-2033136 (October 10, 2019) provides a semi-supervised learning-based machine learning method and apparatus that can improve the performance of a target model by accurately selecting data samples with correct predicted labels from a dataset where label information is not provided and further training on the selected samples. That method progressively improves model performance through a first model, a second model trained with less data, and a third model that has learned the output probability distribution of the first model, which together select data for further training from the unlabeled dataset and learn the predicted label information.

FIG. 1 is a diagram illustrating the configuration of a code complexity prediction device utilizing semi-supervised learning according to an embodiment of the present invention. FIG. 2 is a diagram illustrating the data processing and learning process of the device of FIG. 1. FIG. 3 is a flowchart explaining the operation of the device of FIG. 1.

The description of the present invention is merely an example for structural or functional explanation, and the scope of the present invention should therefore not be interpreted as limited by the embodiments described in the text. That is, since the embodiments are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific embodiment must include all of them or only those effects, so the scope of the present invention should not be understood as limited by them.

Meanwhile, the meaning of the terms described in this application should be understood as follows.
Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, a first component may be named a second component, and similarly, a second component may be named a first component. When one component is described as "connected" to another component, it may be directly connected to that other component, or intervening components may be present. Conversely, when one component is described as "directly connected" to another component, it should be understood that no intervening components are present. Other expressions describing relationships between components, such as "between" and "directly between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way. A singular expression should be understood to include the plural unless the context clearly indicates otherwise, and terms such as "include" or "have" are intended to specify the existence of implemented features, numbers, steps, operations, components, parts, or combinations thereof, and should not be understood to exclude in advance the possibility of the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.