KR-20260063655-A - APPARATUS AND METHOD FOR GENERATING TRAINING DATA FOR CODE DISCOVERY VIA UNION-FIND CLUSTERING

Abstract

The present invention relates to an apparatus and method for generating training data for code search through Union-Find clustering, wherein the apparatus comprises: an initial cluster generation unit that receives code and query data and generates clusters; an evidence calculation unit that calculates evidence indicating the validity of the clusters based on the semantic association between the query and the code in the clusters; a cluster determination unit that determines a surviving cluster and a remaining cluster based on a Union-Find data structure by performing dynamic transport to move evidence values between the clusters; and a final cluster generation unit that generates the surviving clusters as final clusters used as training data for code search.

Inventors

한요섭
한중혁
최석웅

Assignees

연세대학교 산학협력단

Dates

Publication Date: 20260507
Application Date: 20241030

Claims (9)

A device for generating training data for code search through Union-Find clustering, comprising: an initial cluster generation unit that receives code and query data as input and generates clusters; an evidence calculation unit that calculates evidence indicating the validity of each cluster based on the semantic association between the query and the code in the cluster; a cluster determination unit that determines surviving clusters and remaining clusters based on a Union-Find data structure by performing dynamic transport that moves evidence values between the clusters; and a final cluster generation unit that generates the surviving clusters as final clusters used as training data for code search.
The device of claim 1, wherein the initial cluster generation unit receives input data in which the query and the code are each separated into disentangled representations.
The device of claim 2, wherein the initial cluster generation unit manages the clusters using the Union-Find data structure so that they are set as mutually disjoint sets.
The device of claim 1, wherein the evidence calculation unit determines the semantic association by calculating a log-probability (logit)-based similarity between the query and the code.
The device of claim 4, wherein the evidence calculation unit determines that the greater the log probability, the greater the semantic association and the more positively the validity of the cluster is evaluated.
The device of claim 1, wherein the cluster determination unit determines the surviving clusters by recursively performing the dynamic transport so as to strengthen a first specific cluster and delete or merge a second specific cluster.
The device of claim 1, wherein the cluster determination unit performs the dynamic transport up to a maximum number of iterations or until the evidence values no longer move, and determines the surviving clusters by reflecting the similarity between the clusters.
The device of claim 1, wherein the final cluster generation unit generates the surviving clusters as a non-overlapping set, uses them as generalized patterns for code search, and manages them with the Union-Find data structure.
A method for generating training data for code search through Union-Find clustering, performed in a device for generating training data for code search through Union-Find clustering, the method comprising: an initial cluster generation step of receiving code and query data as input and generating clusters; an evidence calculation step of calculating evidence indicating the validity of each cluster based on the semantic association between the query and the code in the cluster; a cluster determination step of determining surviving clusters and remaining clusters based on a Union-Find data structure by performing dynamic transport that moves evidence values between the clusters; and a final cluster generation step of generating the surviving clusters as final clusters used as training data for code search.
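
The claimed steps (disjoint initial clusters, evidence values, dynamic transport, surviving clusters) can be illustrated with a minimal Python sketch. It is not the patented implementation: the claims do not specify the transport rule, so the rule used here (a weaker cluster hands all of its evidence to the most similar stronger cluster when their similarity reaches a threshold) is an assumption, as are the names `UnionFind`, `transport_evidence`, `threshold`, and `max_iters`. Evidence values and the cluster similarity matrix are assumed to be precomputed, for example from the logit-based query-code similarity.

```python
from typing import Dict, List, Tuple


class UnionFind:
    """Disjoint-set forest over cluster indices 0..n-1."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))  # every cluster starts as its own set

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def attach(self, weak: int, strong: int) -> None:
        """Merge the set rooted at `weak` into the set rooted at `strong`."""
        self.parent[weak] = strong


def transport_evidence(
    evidence: List[float],
    similarity: List[List[float]],
    threshold: float = 0.5,  # hypothetical merge threshold
    max_iters: int = 10,     # hypothetical iteration cap
) -> Tuple[Dict[int, List[int]], Dict[int, float]]:
    """Assumed transport rule: a weaker cluster moves its evidence to the
    most similar stronger cluster and is merged into it. Stops when no
    evidence moves in a round or after `max_iters` rounds."""
    n = len(evidence)
    uf = UnionFind(n)
    ev = list(evidence)
    for _ in range(max_iters):
        moved = False
        # Visit current roots from weakest to strongest evidence.
        roots = sorted({uf.find(i) for i in range(n)}, key=lambda r: (ev[r], r))
        for weak in roots:
            candidates = [
                s for s in roots
                if s != weak and uf.find(s) == s  # still a root this round
                and ev[s] > ev[weak]
                and similarity[weak][s] >= threshold
            ]
            if not candidates:
                continue
            strong = max(candidates, key=lambda s: similarity[weak][s])
            ev[strong] += ev[weak]  # move the evidence value
            ev[weak] = 0.0
            uf.attach(weak, strong)  # the stronger cluster stays the root
            moved = True
        if not moved:
            break
    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    return clusters, {r: ev[r] for r in clusters}


# Illustrative data only: four initial clusters with made-up values.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]
clusters, final_evidence = transport_evidence([3.0, 1.0, 0.5, 2.0], similarity)
# clusters == {0: [0, 1, 2], 3: [3]}; final_evidence == {0: 4.5, 3: 2.0}
```

In this sketch the stronger cluster always becomes the root, rather than using union by size, so the root index identifies the cluster that received the evidence. The surviving clusters are the roots that remain after transport, and they are non-overlapping by construction.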

Description

Apparatus and Method for Generating Training Data for Code Discovery via Union-Find Clustering

The present invention relates to a technology for generating training data for code search, and more specifically, to an apparatus and method for generating training data for code search through Union-Find clustering, which calculate the validity of a cluster based on the semantic association between a query and code and determine surviving clusters based on a Union-Find data structure by performing dynamic transport.

Code search is the task of retrieving code snippets relevant to a query that seeks the implementation of a specific function. Because this task is crucial both for increasing the productivity of human developers and for reducing the hallucinations that LLMs may produce when acting as developers, programming language models have been extended to improve performance in moderately dynamic environments. In practice, programmers continuously write new code for debugging or performance improvement, which leads to various types of shifts. Adaptation to these shifts relies on generalization, a core mechanism of human intelligence and the driving force behind the application of neural networks in various fields. However, programming language models are vulnerable to these shifts, and this vulnerability has been reported to degrade performance.

To compensate for this vulnerability, various types of supervision signals can be used; however, sufficient supervision signals are generally not available because of the rapid rate of change. It is therefore necessary to analyze the adaptation itself in order to use these signals effectively. Adaptation to such shifts is formulated as a minimum entropy problem for information retrieval, and it has been proven that the minimum entropy problem is a dual form of the minimum set cover problem for a specific cost function.

FIG. 1 is a diagram illustrating a Union-Find clustering algorithm of a training data generation device for code search according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the functional configuration of a training data generation device for code search according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the system configuration of a training data generation device for code search.
FIG. 4 is a flowchart illustrating a method for generating training data for code search according to the present invention.
FIG. 5 is a diagram illustrating the experimental details of the Union-Find clustering algorithm of FIG. 1 according to an embodiment of the present invention.

The description of the present invention is merely an example for structural or functional explanation, and therefore the scope of the present invention should not be interpreted as being limited by the examples described in the text. That is, since the examples are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific example must include all of them or only such effects; therefore, the scope of the present invention should not be understood as being limited by them.

Meanwhile, the meaning of the terms described in this application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, the first component may be named the second component, and similarly, the second component may be named the first component.
When it is stated that one component is "connected" to another component, it should be understood that it may be directly connected to that other component, or that there may be other components in between. Conversely, when it is stated that one component is "directly connected" to another component, it should be understood that there are no other components in between. Other expressions describing the relationships between components, such as "between" and "directly between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way.

A singular expression should be understood to include a plural expression unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the existence of the stated features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

In each step, identifiers (e.g., a, b, c, etc.) are used for convenience of explanation and do not describe the order of the steps; the steps may occur di