KR-20260064248-A - Method and Apparatus for Generating Features Using Generative AI
Abstract
A method and apparatus for generating features using generative AI are disclosed. According to one aspect of the present disclosure, a computer-implemented method comprises: a process of obtaining knowledge information regarding features required for machine learning modeling from a knowledge database; a process of fine-tuning a Large Language Model (LLM), using the knowledge information, to generate feature-related information in response to a context prompt; a process of obtaining a plurality of responses to a context prompt using the LLM and generating a dataset composed of domain-specific preferred and non-preferred response pairs; and a process of optimizing the LLM by domain through Direct Preference Optimization (DPO) learning using the dataset.
Inventors
- 이규남
- 진기훈
- 고정현
- 박경윤
- 이동훈
- 강예진
- 김태형
- 송지영
- 오세현
- 정유철
Assignees
- SK Telecom Co., Ltd. (에스케이텔레콤 주식회사)
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-31
Claims (10)
- A computer-implemented method comprising: a process of obtaining knowledge information about features required for machine learning modeling from a knowledge database; a process of fine-tuning a Large Language Model (LLM), using the knowledge information, to generate feature-related information in response to a context prompt; a process of obtaining a plurality of responses to a context prompt using the LLM and generating a dataset composed of domain-specific preferred and non-preferred response pairs; and a process of optimizing the LLM for each domain through Direct Preference Optimization (DPO) learning using the dataset, wherein the knowledge information includes information regarding a domain, a business problem, a machine learning type, data attributes, derived variables, reasons for deriving the derived variables, and derived variable code; wherein the context prompt includes information regarding the domain, the business problem, the machine learning type, and the data attributes; and wherein the feature-related information includes at least one of a derived variable, a reason for deriving the derived variable, and derived variable code.
- The method of claim 1, wherein the fine-tuning of the LLM is performed using a Parameter-Efficient Fine-Tuning (PEFT) technique.
- The method of claim 1, wherein the fine-tuning of the LLM is performed in a multi-turn conversational manner.
- The method of claim 1, further comprising a process of performing multi-domain agent learning so that the LLM is utilized according to the situation.
- The method of claim 1, wherein the process of optimizing the LLM includes a process of generating a plurality of domain-specific LLMs through domain-specific DPO learning, the method further comprising a process of training a multi-domain agent to use any one of the plurality of domain-specific LLMs according to an input context prompt.
- The method of claim 5, further comprising: a process of receiving, from a user, input context information regarding a domain, a business problem, a machine learning type, data attributes, and a performance evaluation metric; a process of generating one or more candidate features through the multi-domain agent based on the input context information; and a process of training a machine learning model according to the input context information using the one or more candidate features and evaluating its performance to determine the features that contributed to performance improvement.
- The method of claim 6, wherein the process of determining the features that contributed to performance improvement includes: a process of constructing a plurality of candidate feature combinations through random sampling from the one or more candidate features; a process of training a machine learning model suitable for the machine learning type included in the input context information using each of the candidate feature combinations, and evaluating the performance of the machine learning model using the performance evaluation metric included in the input context information; and a process of determining the features that contributed to performance improvement based on the evaluation results.
- The method of claim 6, further comprising a process of storing the features that contributed to performance improvement in a feature store.
- A computing device comprising at least one processor and a memory operably coupled to the at least one processor, wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: a process of obtaining knowledge information about features required for machine learning modeling from a knowledge database; a process of fine-tuning a Large Language Model (LLM), using the knowledge information, to generate feature-related information in response to a context prompt; a process of obtaining a plurality of responses to a context prompt using the LLM and generating a dataset composed of domain-specific preferred and non-preferred response pairs; and a process of optimizing the LLM through domain-specific Direct Preference Optimization (DPO) learning using the dataset, wherein the knowledge information includes information regarding a domain, a business problem, a machine learning type, data attributes, derived variables, reasons for deriving the derived variables, and derived variable code; wherein the context prompt includes information regarding the domain, the business problem, the machine learning type, and the data attributes; and wherein the feature-related information includes at least one of a derived variable, a reason for deriving the derived variable, and derived variable code.
- A non-transitory computer-readable recording medium storing instructions that, when executed by a computer, cause the computer to perform: a process of obtaining knowledge information about features required for machine learning modeling from a knowledge database; a process of fine-tuning a Large Language Model (LLM), using the knowledge information, to generate feature-related information in response to a context prompt; a process of obtaining a plurality of responses to a context prompt using the LLM and generating a dataset composed of domain-specific preferred and non-preferred response pairs; and a process of optimizing the LLM through domain-specific Direct Preference Optimization (DPO) learning using the dataset.
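The preference-pair construction recited in claim 1 (multiple responses to one context prompt, sorted into preferred and non-preferred pairs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the quality-scoring function `score_fn` is a hypothetical stand-in for whatever domain-specific preference signal (human annotation, reward model, etc.) is actually used.

```python
import itertools


def build_preference_pairs(context_prompt, responses, score_fn):
    """Pair up sampled LLM responses so that, within each pair, the
    higher-scoring response is 'chosen' and the lower one 'rejected'.

    score_fn is a hypothetical domain-specific quality score; ties carry
    no preference signal and are skipped.
    """
    pairs = []
    for a, b in itertools.combinations(responses, 2):
        score_a, score_b = score_fn(a), score_fn(b)
        if score_a == score_b:
            continue  # no preference between equally scored responses
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({
            "prompt": context_prompt,
            "chosen": chosen,
            "rejected": rejected,
        })
    return pairs
```

Libraries such as Hugging Face TRL accept preference datasets in exactly this prompt/chosen/rejected form for DPO training; under the claims, one such dataset would be built per domain and DPO run separately on each to produce the domain-specific LLMs of claim 5.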
Description
Method and Apparatus for Generating Features Using Generative AI

The present disclosure relates to a method and apparatus for generating features using generative AI. The following description merely provides background information related to the present embodiment and does not constitute prior art.

Feature engineering is a critical process in data analysis and machine learning modeling, referring to the task of selecting and processing data features to optimize model performance. Traditionally, data analysts have manually analyzed data and extracted meaningful features using domain knowledge. However, this approach is time-consuming, labor-intensive, and heavily dependent on the analyst's expertise. AutoML (Automated Machine Learning) tools can automatically generate and select features, but because they primarily operate on predefined rules or algorithms, the quality of the generated features may fall short of an analyst's manual work. With the recent advancement of generative AI (Artificial Intelligence) technology, it is being applied in various fields, including the generation of new content such as text, images, music, and code, as well as translation and chatbots. Applying this technology to data analysis and feature engineering offers the potential to increase automation and efficiency.

FIG. 1 is a conceptual diagram illustrating the entire process of automatically generating meaningful features to be used in machine learning modeling by utilizing generative artificial intelligence in embodiments of the present disclosure. FIG. 2 is a flowchart of a large language model and multi-domain agent training method according to one embodiment of the present disclosure. FIG. 3 is a diagram showing an example of a prompt for knowledge extraction and an example of knowledge information obtained as a response.
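The structure of a knowledge-information record and of the context prompt built from it (of the kind shown in FIG. 3) can be illustrated with a small sketch. The field names, example values, and prompt template below are illustrative assumptions for this sketch only, not the actual prompt or schema of the disclosure; they follow the fields recited in the claims (domain, business problem, machine learning type, data attributes, derived variable, derivation reason, derivation code).

```python
# Hypothetical knowledge-information record with the seven fields
# recited in claim 1 (example values are invented for illustration).
knowledge_record = {
    "domain": "telecom",
    "business_problem": "predict subscriber churn",
    "ml_type": "binary classification",
    "data_attributes": ["plan_type", "monthly_fee", "support_calls"],
    "derived_variable": "fee_per_support_call",
    "derivation_reason": "normalizes monthly spend by support burden",
    "derivation_code": (
        "df['fee_per_support_call'] = "
        "df['monthly_fee'] / (df['support_calls'] + 1)"
    ),
}


def build_context_prompt(record):
    """Build a context prompt from only the four input fields; the LLM is
    expected to respond with the remaining three (derived variable,
    derivation reason, derivation code)."""
    return (
        f"Domain: {record['domain']}\n"
        f"Business problem: {record['business_problem']}\n"
        f"ML type: {record['ml_type']}\n"
        f"Data attributes: {', '.join(record['data_attributes'])}\n"
        "Suggest a derived variable, the reason for deriving it, "
        "and the code to derive it."
    )
```

During fine-tuning, the full record supplies the target response; at inference time, only the four context fields are available and the model generates the rest.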
FIG. 4 is a diagram showing an example of obtaining multiple responses to context information (a context prompt) and classifying them into preferred and non-preferred responses. FIG. 5 is a flowchart of a feature generation and evaluation method according to one embodiment of the present disclosure. FIG. 6 is a block diagram illustrating an example of a computing device according to one embodiment of the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same reference numerals whenever possible, even if they appear in different drawings. Furthermore, in describing the present disclosure, detailed descriptions of related known components or functions are omitted where such descriptions could obscure the essence of the present disclosure.

In describing the components of the embodiments according to the present disclosure, symbols such as first, second, i), ii), a), and b) may be used. These symbols are intended only to distinguish one component from another; the essence, order, or sequence of the components is not limited by them. When a part of the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, the part does not exclude other components but may include additional components. Furthermore, terms such as 'part' or 'module' described in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.
The description of the invention below, together with the accompanying drawings, is intended to describe exemplary embodiments of the invention and is not intended to represent the only embodiments in which the invention may be practiced. The present disclosure presents a method for automatically generating meaningful features that can improve the performance of machine learning models by utilizing generative artificial intelligence.

Referring to FIG. 1, the method largely comprises a learning phase and an inference phase. The learning phase includes a Large Language Model (LLM) training phase and a multi-domain agent training phase. In the LLM training phase, information for generating derived variables is extracted from various documents (e.g., papers, work history information, code or query history information, etc.), and knowledge regarding the derived variables is imparted to the LLM. In this specification, the terms derived variable and feature are used interchangeably. The LLM training phase is characterized by utilizing the LLM to extract knowledge information, such