KR-20260063653-A - APPARATUS AND METHOD FOR GENERATIVE CODE SEARCH USING FUNCTION NAME IDENTIFIERS
Abstract
The present invention relates to a generative code search device using function-name-based identifiers, comprising: a database unit that stores code documents containing function names; a data preprocessing unit that receives a code document and generates a document identifier based on its function name; a learning unit that performs first encoder-decoder-based learning, taking a variation of the function name as input and outputting the function name and the document identifier; and an inference unit that receives a user query and provides the corresponding document identifier through a second encoder-decoder-based constrained beam search.
Inventors
- 한요섭
- 남궁영수
- 한중혁
Assignees
- 연세대학교 산학협력단
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-30
Claims (8)
- A generative code search device using a function-name-based identifier, comprising: a database unit that stores code documents containing function names; a learning unit that performs data preprocessing, which receives a code document and generates a document identifier based on the function name of the code document, and first encoder-decoder-based learning, which takes a variation of the function name as input and outputs the function name and the document identifier; and an inference unit that receives a user query and provides a corresponding document identifier through a second encoder-decoder-based constrained beam search.
- The generative code search device of claim 1, wherein the database unit additionally stores code content and a code summary in the code document and uses the code content and the code summary for association with the user query.
- The generative code search device of claim 1, wherein the learning unit extracts the function name, code content, and code summary from the code document through the data preprocessing to generate tokens, and combines the tokens to generate a document identifier that is unique within the database unit.
- The generative code search device of claim 3, wherein the learning unit masks a part of the function name to generate the variation of the function name, provides the variation to the first encoder-decoder, and performs learning to infer the masked region of the function name.
- The generative code search device of claim 4, wherein the learning unit performs the data preprocessing and the learning simultaneously through multi-task learning.
- The generative code search device of claim 1, wherein the inference unit generates an embedding vector by inputting natural language or code into the second encoder as the user query.
- The generative code search device of claim 6, wherein the inference unit inputs the embedding vector into the second decoder and provides the document identifier by searching for a prefix-tree-based constrained path through the constrained beam search.
- A generative code search method using a function-name-based identifier, performed in a generative code search device using a function-name-based identifier, the method comprising: a database step of storing code documents containing function names; a learning step of performing data preprocessing, which receives a code document as input and generates a document identifier based on the function name of the code document, and first encoder-decoder-based learning, which takes a variation of the function name as input and outputs the function name and the document identifier; and an inference step of receiving a user query and providing a corresponding document identifier through a second encoder-decoder-based constrained beam search.
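Two steps recited in the claims above, generating masked variations of a function name and searching a prefix tree with constrained beam search, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `<mask>` and `<eos>` tokens, the underscore-based tokenization, the toy identifiers, and the `toy_score` function, which merely stands in for a decoder's log-probabilities.

```python
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical mask token (not specified in the patent)
END = "<eos>"    # hypothetical end-of-identifier token

ScoreFn = Callable[[Tuple[str, ...], str], float]


def masked_variations(function_name: str) -> List[str]:
    """Mask one underscore-separated part at a time (assumed tokenization)."""
    parts = function_name.split("_")
    return ["_".join(parts[:i] + [MASK] + parts[i + 1:]) for i in range(len(parts))]


def build_trie(identifiers: Sequence[Sequence[str]]) -> Dict:
    """Build a prefix tree (nested dicts) over the valid identifier token sequences."""
    trie: Dict = {}
    for tokens in identifiers:
        node = trie
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return trie


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that may follow `prefix` without leaving the trie."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node)


def constrained_beam_search(
    trie: Dict, score_fn: ScoreFn, beam_width: int = 2, max_len: int = 16
) -> List[Tuple[Tuple[str, ...], float]]:
    """Beam search that only ever expands tokens permitted by the trie."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next(trie, prefix):
                new_score = score + score_fn(prefix, tok)
                if tok == END:
                    finished.append((prefix, new_score))  # complete, valid identifier
                else:
                    candidates.append((prefix + (tok,), new_score))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


# Toy data: three identifiers and a query that mentions "read" and "json".
identifiers = [("read", "file"), ("read", "json"), ("write", "file")]
trie = build_trie(identifiers)
query_tokens = {"read", "json"}


def toy_score(prefix: Tuple[str, ...], tok: str) -> float:
    # Stand-in for a decoder log-probability: no penalty for query tokens or END.
    return 0.0 if tok == END or tok in query_tokens else -1.0


results = constrained_beam_search(trie, toy_score, beam_width=2)
variations = masked_variations("read_json")
```

Because every expansion is drawn from `allowed_next`, each sequence in `results` is one of the stored identifiers, so the search cannot return a document identifier that does not exist in the database.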
Description
Apparatus and Method for Generative Code Search Using Function Name Identifiers
The present invention relates to generative code search technology using function-name-based identifiers, and more specifically, to an apparatus and method that process a function name and its variations with an encoder-decoder-based learning model and provide an accurate document identifier for a user query through constrained beam search.
Embedding-based search technology has advanced significantly alongside the development of encoder-only pre-trained language models. This technology extracts embeddings from queries and documents and determines search rankings by measuring the distance between them; consequently, dense document search methods using AI models are being widely adopted in the field of information retrieval.
Recently, generative search methods have emerged alongside the advancement of encoder-decoder pre-trained language models. Generative search takes a query as input, directly generates document IDs, and provides final search results through constrained beam search. This approach is highly memory-efficient because it does not require extracting or storing separate embedding vectors, and it has the advantage of deriving more sophisticated search results during the model's generation process.
The existing dense document search method learns by adjusting the embedding vector distance between positive and negative pairs through contrastive learning. In contrast, the generative search method learns through next-token prediction, predicting the ID of the correct document for an input query. Generative models learn indexing and search simultaneously and perform only the search task during inference.
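The distance-based ranking used by dense document search, as described above, can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors, the document names, and the choice of cosine similarity as the distance measure are assumptions, since the text does not specify a particular embedding model or metric.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank documents from most to least similar to the query embedding."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-computed embeddings; a real system would obtain them from an encoder.
doc_vecs = {
    "read_json": [0.9, 0.1, 0.0],
    "write_file": [0.1, 0.9, 0.0],
    "parse_args": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
ranking = rank_documents(query_vec, doc_vecs)
```

Note that this approach must store one vector per document, which is the storage cost that generative search avoids by producing document IDs directly.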
In generative search, indexing refers to the process of generating document IDs based on document content, while search refers to the process of finding a suitable document ID for a given query. However, generative search models risk generating non-existent document IDs because output tokens are generated probabilistically. To prevent this, constrained beam search is used to control the model so that it generates only allowed tokens.
Korean Registered Patent No. 10-2703247 (September 2, 2024) provides an inference method and system capable of constructing a prompt for input into a generative language model based on a ranking of phrases determined by reflecting not only the relationship between a query and the phrases but also the length of the phrases. The inference method may include the steps of extracting phrases by length from documents retrieved in response to a user's query, determining the ranking of the phrases by length in consideration of the complexity of the query, and constructing a prompt for input into the generative language model from the phrases by length based on the determined ranking. In this case, the ranking may be determined such that relatively longer phrases rank higher as the complexity of the query increases.
FIG. 1 is a diagram illustrating a model learning method framework of a generative code search device according to one embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of the generative code search device of FIG. 1.
FIG. 3 is a diagram illustrating the system configuration of the generative code search device of FIG. 1.
FIG. 4 is a flowchart illustrating the operation of the generative code search device of FIG. 1.
FIG. 5 is a diagram comparing a model learning method of a generative code search device according to one embodiment of the present invention with a conventional generative search method.
FIG. 6 is a diagram illustrating a dense document search method using a conventional encoder-only model.
FIG. 7 is a diagram illustrating a generative search method using a conventional encoder-decoder model.
FIG. 8 is a diagram comparing a generative code search method of a generative code search device according to one embodiment of the present invention with a conventional generative search method.
The description of the present invention is merely an example for structural or functional explanation, and therefore the scope of the present invention should not be interpreted as being limited by the examples described in the text. That is, since the examples are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific example must include all of them or only such effects; therefore, the scope of the present invention should not be understood as b