CN-122019762-A - LLM-driven diversified text automatic generation method, device and system
Abstract
The invention discloses an LLM-driven automatic generation method, device and system for diversified texts, belonging to the technical field of artificial intelligence natural language processing. The method comprises the steps of: 1, setting text prompt words; 2, obtaining the key-value pairs of a text attribute dictionary by means of the OpenAI o4-mini language model; 3, inputting the constructed sample prompt words into the OpenAI o4-mini model to obtain a text sample set with labeled attributes, formed from key-value pairs randomly hit, in list form, from the text attribute dictionary; 4, clustering the semantic vectors extracted by an embedding model using a K-means clustering method to obtain K groups of text samples; 5, training K fine-tuning LoRA modules inserted into a frozen base model on the K groups of text samples to obtain K customized models; and 6, inputting the text attribute dictionary into the K customized models in a two-stage parallel mode to obtain diversified text data. Compared with the prior art, the invention realizes diversified batch automatic generation of LLM-driven text data.
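The clustering described in step 4 of the abstract (K-means over semantic vectors, with the diversity coefficient K as the number of groups) can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the toy two-dimensional vectors stand in for real embedding-model output, and the function name `kmeans` is assumed for illustration.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain-Python K-means: returns a cluster index for each vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to the nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Toy "semantic vectors" standing in for embedding-model output;
# k=2 plays the role of the diversity coefficient K.
vectors = [[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.88, 0.85]]
groups = kmeans(vectors, k=2)
```

In the patented method, each of the resulting K groups would then be used to train one LoRA module; here `groups` simply maps each sample to its cluster.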
Inventors
- LI KAN
- YUAN PEIWEN
Assignees
- Beijing Institute of Technology (北京理工大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20251201
Claims (3)
- 1. An LLM-driven diversified text automatic generation method, characterized by comprising the following steps:
Step 1, setting a text prompt word consisting of a text description, an attribute preference dictionary, example texts, a target generated text quantity N and a diversity coefficient K;
Step 2, obtaining the key-value pairs of a text attribute dictionary by means of the OpenAI o4-mini language model, wherein each key is a text attribute category and each value is an attribute candidate value;
Step 2.1, inputting the text prompt word into the OpenAI o4-mini language model to obtain text attribute categories and attribute candidate values;
Step 2.2, constructing the text attribute dictionary with the text attribute categories as keys and the attribute candidate values as values;
Step 3, inputting constructed sample prompt words into the OpenAI o4-mini model to obtain a text sample set with labeled attributes, formed from key-value pairs randomly hit, in list form, from the text attribute dictionary;
Step 3.1, constructing the sample prompt words from the text description, the text attribute dictionary and the example texts;
Step 3.2, inputting the sample prompt words into the OpenAI o4-mini model, and randomly hitting key-value pairs in list form from the text attribute dictionary;
Step 3.3, generating, with the OpenAI o4-mini model, a sample corresponding to each hit list of key-value pairs;
Step 3.4, forming the text sample set with labeled attributes from the corresponding samples;
Step 4, clustering the semantic vectors extracted by an embedding model using a K-means clustering method to obtain K groups of text samples;
Step 4.1, extracting the semantic vectors of the text sample set with labeled attributes using the embedding model;
Step 4.2, taking the diversity coefficient K as input, clustering the semantic vectors of the text sample set with labeled attributes using the K-means clustering method to obtain K groups of text samples;
Step 5, training K fine-tuning LoRA modules inserted into a frozen base model on the K groups of text samples to obtain K customized models;
Step 5.1, constructing and freezing a base model, and training the base model with the fine-tuning LoRA modules on the K groups of text samples to obtain K trained fine-tuning LoRA modules;
Step 5.2, respectively inserting the K trained fine-tuning LoRA modules into the base model to obtain K customized models;
Step 6, inputting the text attribute dictionary into the K customized models in a two-stage parallel mode to obtain diversified text data;
Step 6.1, inputting the text attribute dictionary into the K customized models in parallel, and outputting randomly hit key-value pairs in list form from the text attribute dictionary in parallel;
Step 6.2, inputting the randomly hit key-value pairs in list form into the K customized models in parallel, and obtaining diversified text data in parallel.
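The "random hit" of steps 3.2-3.4 — drawing one candidate value per attribute category from the text attribute dictionary to label each generated sample — can be sketched as follows. The dictionary contents, the function names and the stubbed sample text are illustrative assumptions; in the patented method the dictionary and the sample bodies come from the language model.

```python
import random

# Hypothetical text attribute dictionary (step 2): keys are attribute
# categories, values are lists of attribute candidate values.
attribute_dict = {
    "style": ["formal", "casual", "humorous"],
    "topic": ["finance", "sports", "travel"],
    "length": ["short", "medium", "long"],
}

def random_hit(attr_dict, rng=random):
    """Step 3.2: randomly hit one candidate value per attribute category,
    yielding the key-value pairs that label one sample."""
    return {key: rng.choice(values) for key, values in attr_dict.items()}

def build_sample_set(attr_dict, n, seed=0):
    """Steps 3.2-3.4: draw n labeled sample specifications; the text body
    would come from the generation model, stubbed out here."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        labels = random_hit(attr_dict, rng)
        samples.append({"labels": labels, "text": "<model-generated text>"})
    return samples

samples = build_sample_set(attribute_dict, n=5)
```

Each entry in `samples` pairs a randomly hit attribute combination with a (stubbed) generated text, which is the "text sample set with labeled attributes" that step 4 then embeds and clusters.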
- 2. An LLM-driven diversified text automatic generation system for realizing the method of claim 1, characterized by comprising a prompt word construction module, an attribute dictionary generation module, a sample generation and labeling module, a semantic vector acquisition and clustering module, a fine-tuning and LoRA parameter training module, and a multi-model parallel generation module; the prompt word construction module is used for analyzing the text data requirements of a user and constructing the required input prompt words; the attribute dictionary generation module is used for calling the OpenAI o4-mini model to generate a text attribute dictionary according to the prompt words, which serves as input to the multi-model parallel generation module and the sample generation and labeling module; the sample generation and labeling module is used for calling the OpenAI o4-mini model with the attribute dictionary to generate a small number of representative text samples and to label each sample with specific attribute values; the semantic vector acquisition and clustering module is used for acquiring semantic vectors of the text samples and clustering the samples into groups, which serve as input to the fine-tuning and LoRA parameter training module; the fine-tuning and LoRA parameter training module is used for performing LoRA fine-tuning training on the base language model with the data of each cluster group to obtain K customized models; the multi-model parallel generation module is used for generating diversified text data with the K customized models in a two-stage parallel mode.
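The multi-model parallel generation module's two-stage parallel mode (step 6) can be sketched with stub models: in the first stage each of the K customized models randomly hits key-value pairs from the attribute dictionary in parallel, and in the second stage those pairs drive parallel text generation. The model stub, the per-model seeding and the dictionary contents are illustrative assumptions, not the patent's implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

K = 3  # diversity coefficient: number of LoRA-customized models

attribute_dict = {"style": ["formal", "casual"], "topic": ["finance", "sports"]}

def customized_model(model_id, labels):
    """Stub for one LoRA-customized model; a real system would run inference."""
    return f"model-{model_id}: " + ", ".join(f"{k}={v}" for k, v in labels.items())

def stage_one(model_id, attr_dict):
    """Stage 1: a model randomly hits one key-value pair per attribute."""
    rng = random.Random(model_id)  # per-model seed, only for reproducibility here
    return {k: rng.choice(v) for k, v in attr_dict.items()}

def two_stage_parallel(attr_dict, k):
    with ThreadPoolExecutor(max_workers=k) as pool:
        # Stage 1 in parallel: attribute dictionary -> hit key-value pairs.
        hits = list(pool.map(lambda m: stage_one(m, attr_dict), range(k)))
        # Stage 2 in parallel: hit key-value pairs -> diversified texts.
        texts = list(pool.map(lambda m: customized_model(m, hits[m]), range(k)))
    return texts

texts = two_stage_parallel(attribute_dict, K)
```

Because each stub model draws its own attribute combination and generates independently, the K outputs differ in their attribute labels, which is the diversification mechanism the claim describes.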
- 3. An LLM-driven diversified text automatic generation device, characterized by comprising a user terminal and a server; the user terminal is used for providing a graphical interface or an API interface for a user to input text data requirement parameters and to display the generated results; the server is provided with the system of claim 2 and is used for executing the method of claim 1 to generate diversified text data.
Description
LLM-driven diversified text automatic generation method, device and system
Technical Field
The invention relates to an LLM-driven diversified text automatic generation method, device and system, belongs to the technical field of artificial intelligence natural language processing, and is applied to the automatic generation of text data driven by a large language model.
Background
In current natural language processing and artificial intelligence model training, acquiring high-quality, diversified text data is costly and inefficient. Traditional data generation methods mainly depend on pre-designed templates or manual creation, which is time-consuming and labor-intensive; owing to their fixed patterns, the generated texts lack diversity and scalability and can hardly meet the growing demand for training data. In particular, the training and fine-tuning of large-scale pre-trained models requires a large amount of corpus with varied styles and rich content, and manual collection or writing of such data cannot balance efficiency with diversity. Therefore, how to realize diversified batch automatic generation of large language model (LLM, Large Language Model) driven text data has become a problem to be solved.
Disclosure of Invention
The invention aims to solve the technical problem of diversified batch automatic generation of LLM-driven text data, and provides an LLM-driven diversified text automatic generation method, device and system.
According to the text data generation method, device and system, text data meeting diversity requirements are automatically generated in batches from the required text description, the optional attribute preference dictionary, the optional example texts, the expected number N of generated samples and the diversity coefficient K given by a user, so that the cost of acquiring large-scale diversified text data is reduced and the generation efficiency is improved. The aim of the invention is realized by the following technical scheme.
On the one hand, the LLM-driven diversified text automatic generation method disclosed by the invention comprises the following steps:
Step 1, setting a text prompt word consisting of a text description, an attribute preference dictionary, example texts, a target generated text quantity N and a diversity coefficient K;
Step 2, obtaining the key-value pairs of a text attribute dictionary by means of the OpenAI o4-mini language model, wherein each key is a text attribute category and each value is an attribute candidate value;
Step 2.1, inputting the text prompt word into the OpenAI o4-mini language model to obtain text attribute categories and attribute candidate values;
Step 2.2, constructing the text attribute dictionary with the text attribute categories as keys and the attribute candidate values as values;
Step 3, inputting constructed sample prompt words into the OpenAI o4-mini model to obtain a text sample set with labeled attributes, formed from key-value pairs randomly hit, in list form, from the text attribute dictionary;
Step 3.1, constructing the sample prompt words from the text description, the text attribute dictionary and the example texts;
Step 3.2, inputting the sample prompt words into the OpenAI o4-mini model, and randomly hitting key-value pairs in list form from the text attribute dictionary;
Step 3.3, generating, with the OpenAI o4-mini model, a sample corresponding to each hit list of key-value pairs;
Step 3.4, forming the text sample set with labeled attributes from the corresponding samples;
Step 4, clustering the semantic vectors extracted by an embedding model using a K-means clustering method to obtain K groups of text samples;
Step 4.1, extracting the semantic vectors of the text sample set with labeled attributes using the embedding model;
Step 4.2, taking the diversity coefficient K as input, clustering the semantic vectors of the text sample set with labeled attributes using the K-means clustering method to obtain K groups of text samples;
Step 5, training K fine-tuning LoRA modules inserted into a frozen base model on the K groups of text samples to obtain K customized models;
Step 5.1, constructing and freezing a base model, and training the base model with the fine-tuning LoRA modules on the K groups of text samples to obtain K trained fine-tuning LoRA modules;
Step 5.2, respectively inserting the K trained fine-tuning LoRA modules into the base model to obtain K customized models;
Step 6, inputting the text attribute dictionary into the K customized models in a two-stage parallel mode to obtain diversified text data;
Step 6.1, inputting the text attribute dictionary into the K customized models in parallel, and outputting randomly hit key-value pairs in list form from the text attribute dictionary in parallel;
Step 6.2, inputting the randomly hit key-value pairs in list form into the K customized models in parallel, and obtaining diversified text data in parallel.
On the other hand, in order to achieve the purpose of the in