
CN-122020165-A - Data processing method and device, electronic device, storage medium and program product

CN122020165A

Abstract

The present disclosure provides a data processing method and apparatus, an electronic device, a storage medium, and a program product. The data processing method includes: determining a plurality of mapping sub-data sets from an original data set for training and/or fine-tuning of a large language model (LLM); determining weights corresponding to each quality indicator for the original data set based on a mapping network and the plurality of mapping sub-data sets, wherein the mapping network is used for predicting the model accuracy corresponding to one mapping sub-data set based on the quality indicators of the data in that mapping sub-data set; and determining a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights.

Inventors

  • REN JUNCHEN
  • MIN YING
  • LV ZHIJIE
  • ZHANG YUNHAO
  • ZHU FENG

Assignees

  • Samsung (China) Semiconductor Co., Ltd. (三星(中国)半导体有限公司)
  • Samsung Electronics Co., Ltd. (三星电子株式会社)

Dates

Publication Date
2026-05-12
Application Date
2026-01-21

Claims (13)

  1. A data processing method, comprising: determining a plurality of mapping sub-data sets from an original data set for training and/or fine-tuning of a large language model (LLM); determining weights corresponding to each quality indicator for the original data set based on a mapping network and the plurality of mapping sub-data sets, wherein the mapping network is used for predicting the model accuracy corresponding to one mapping sub-data set based on the quality indicators of the data in that mapping sub-data set; and determining a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights.
  2. The data processing method of claim 1, wherein the mapping network is determined based on a plurality of training sub-data sets of the original data set, and wherein the plurality of training sub-data sets includes at least one sub-data set whose data are randomly selected from the original data set and at least one sub-data set whose data are selected based on at least one randomly determined quality indicator.
  3. The data processing method of claim 2, wherein the mapping network is determined based on the plurality of training sub-data sets by: determining an indicator score for each quality indicator for each piece of data of each training sub-data set by using a reward model; training and/or fine-tuning the LLM based on each training sub-data set; testing, by using a test set, the model accuracy of the trained and/or fine-tuned LLM corresponding to each training sub-data set; and determining the mapping network that maps indicator scores to model accuracy based on the determined indicator scores of the plurality of training sub-data sets and the tested model accuracies.
  4. The data processing method of claim 1, wherein determining weights corresponding to each quality indicator for the original data set based on the mapping network and the plurality of mapping sub-data sets comprises: randomly setting an indicator score for each quality indicator for each of the plurality of mapping sub-data sets; determining, by using the mapping network, the model accuracy of an LLM trained and/or fine-tuned based on each mapping sub-data set from the set indicator scores of that mapping sub-data set; selecting a subset of the mapping sub-data sets from the plurality of mapping sub-data sets based on the plurality of model accuracies respectively corresponding to the plurality of mapping sub-data sets; and determining the weight for each quality indicator of the original data set based on statistics of the quality indicators of the selected mapping sub-data sets.
  5. The data processing method of claim 1, wherein determining a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights comprises: determining an indicator score for each quality indicator of the data in the original data set by using a reward model; determining a quality score for each piece of data in the original data set by weighting its indicator scores with the weight corresponding to each quality indicator; and selecting, from the original data set, a predetermined number of first data constituting the first data set based on the quality scores of the data in the original data set.
  6. The data processing method of claim 1, further comprising: selecting, from the first data set, second data forming a second data set for training and/or fine-tuning of the LLM by performing data screening based on retrieval-augmented generation (RAG).
  7. The data processing method of claim 6, wherein selecting second data from the first data set by performing RAG-based data screening comprises: repeatedly performing the RAG-based data screening until the number of data in the knowledge base or the number of second data in the second data set reaches a predetermined value and/or all first data in the first data set have been processed, wherein performing the RAG-based data screening comprises: selecting a plurality of third data from the first data set to form a third data set; for each third data in the third data set, retrieving a plurality of fourth data related to the third data from the knowledge base by using the RAG; determining, by using the LLM, whether the knowledge base includes the knowledge of the third data based on the plurality of fourth data and the third data; and, when the knowledge base does not include the knowledge of the third data, adding the third data to the knowledge base and determining it as second data.
  8. The data processing method of claim 7, wherein selecting a plurality of third data from the first data set to form a third data set comprises: selecting, from the first data set by using a K-center greedy algorithm, a predetermined number of third data having the largest Euclidean distances.
  9. The data processing method of claim 7, wherein determining, by using the LLM, whether the knowledge base includes the knowledge of the third data based on the plurality of fourth data and the third data comprises: adding the plurality of fourth data related to the third data to a prompt for input to the LLM; obtaining, by using the LLM, the perplexity of the third data with the prompt and the perplexity without the prompt; determining the ratio of the perplexity with the prompt to the perplexity without the prompt as a knowledge retrieval score of the third data; and determining whether the knowledge base includes the knowledge of the third data based on a comparison of the knowledge retrieval score of the third data with a threshold.
  10. A data processing apparatus, comprising: a sub-data set determination unit configured to determine a plurality of mapping sub-data sets from an original data set for training and/or fine-tuning of a large language model (LLM); a weight determination unit configured to determine a weight corresponding to each quality indicator for the original data set based on a mapping network and the plurality of mapping sub-data sets, wherein the mapping network is configured to predict the model accuracy corresponding to one mapping sub-data set based on the quality indicators of the data in that mapping sub-data set; and a first data set determination unit configured to determine a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights.
  11. An electronic device, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the data processing method of any one of claims 1 to 9.
  12. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the data processing method of any one of claims 1 to 9.
  13. A computer program product comprising computer-executable instructions which, when executed by at least one processor, implement the data processing method of any one of claims 1 to 9.
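As a concrete illustration of claims 1 and 5, the weighted quality scoring can be sketched as follows. The indicator names ("fluency", "relevance") and the weight values are hypothetical placeholders, not values given in the claims; in the claimed method the indicator scores would come from a reward model and the weights from the mapping network.

```python
# Sketch of claims 1/5: combine per-indicator scores into one weighted
# quality score per sample, then keep the top-N samples as the first data set.

def select_first_dataset(index_scores, weights, n):
    """index_scores: one dict per sample mapping quality indicator -> score.
    weights: dict mapping quality indicator -> weight.
    Returns the indices of the n samples with the highest weighted score."""
    quality = [
        sum(weights[k] * scores[k] for k in weights)
        for scores in index_scores
    ]
    # Rank sample indices by descending quality score and keep the first n.
    ranked = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
    return ranked[:n]

# Toy example: three samples scored on two hypothetical indicators.
scores = [
    {"fluency": 0.9, "relevance": 0.2},
    {"fluency": 0.5, "relevance": 0.9},
    {"fluency": 0.1, "relevance": 0.1},
]
weights = {"fluency": 0.4, "relevance": 0.6}
print(select_first_dataset(scores, weights, 2))  # -> [1, 0]
```

Sample 1 wins because the (assumed) "relevance" indicator carries the larger weight.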
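Claim 8's K-center greedy selection can be sketched as below: starting from a seed point, repeatedly pick the sample whose minimum Euclidean distance to the already-selected set is largest. The 2-D points here stand in for data embeddings; the claims do not specify how the data are embedded.

```python
import math

# Sketch of claim 8: K-center greedy (max-min Euclidean distance) selection.

def k_center_greedy(points, k, seed=0):
    """Select k point indices, greedily maximizing the minimum distance
    from each new pick to the points already selected."""
    selected = [seed]
    # dist[i] = distance from point i to its nearest selected point so far.
    dist = [math.dist(p, points[seed]) for p in points]
    while len(selected) < k:
        far = max(range(len(points)), key=lambda i: dist[i])
        selected.append(far)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[far]))
    return selected

# Toy embeddings: two near-duplicates and two outlying points.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 5.0)]
print(k_center_greedy(pts, 3))  # -> [0, 2, 3]; the near-duplicate is skipped
```

This is why the claim uses it to form a diverse third data set: near-duplicate samples are passed over in favor of samples far from everything already chosen.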
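Claim 9's knowledge retrieval score can be illustrated as follows: it is the ratio of the LLM's perplexity on a candidate sample with the retrieved context in the prompt to its perplexity without it, and a low ratio suggests the knowledge base already covers the sample. The log-probabilities and the threshold value below are invented inputs; a real implementation would obtain token log-probabilities from the LLM, and the claims leave the threshold unspecified.

```python
import math

# Sketch of claim 9: perplexity-ratio test for whether the knowledge base
# already contains the knowledge of a candidate ("third") sample.

def perplexity(log_probs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))

def knowledge_retrieval_score(logp_with_context, logp_without_context):
    """Ratio of perplexity with retrieved context to perplexity without it.
    A low ratio means the retrieved context made the sample much easier
    to predict, i.e. the knowledge base already covers it."""
    return perplexity(logp_with_context) / perplexity(logp_without_context)

def is_known(logp_with, logp_without, threshold=0.8):
    # threshold=0.8 is an assumed value for illustration only.
    return knowledge_retrieval_score(logp_with, logp_without) < threshold

# Toy log-probs: with context, the continuation is far more likely.
with_ctx = [-0.2, -0.3, -0.1]
without_ctx = [-1.5, -2.0, -1.0]
print(is_known(with_ctx, without_ctx))  # -> True (already covered)
```

Per claim 7, a sample for which this returns False (knowledge not yet covered) would be added to the knowledge base and kept as second data.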

Description

Data processing method and device, electronic device, storage medium and program product

Technical Field

The present disclosure relates to the field of computer vision, and in particular to data screening, and more particularly to a data processing method and apparatus, an electronic device, a storage medium, and a program product.

Background

With the rapid development of artificial intelligence (AI) technology, large language models (LLMs) have achieved remarkable results in natural language processing (NLP), text-to-image generation, multimodal applications, and other fields. The success of an LLM depends not only on its complex model structure but, more importantly, on the high-quality, large-scale data sets it relies on. Data play a fundamental role in both the pre-training and the fine-tuning phases. While LLMs have been successful in general-purpose domains, they perform poorly in specific domains (e.g., semiconductors, physics, biomedicine) because domain-specific data differ from general-purpose data. For an LLM to perform domain-specific tasks better, it is often necessary to apply data curation, instruction fine-tuning, reinforcement learning, and the like. However, in practical applications, the sources of the data used for training or fine-tuning are diverse and the quality of the data varies, so noise or low-quality data are present in these data and directly degrade the performance of the LLM on domain-specific tasks. Efficient data curation, and in particular constructing a suitable data set for training or fine-tuning, is therefore of great importance both for enhancing the performance of an LLM in a specific domain and for the training efficiency of the LLM.
Disclosure of Invention

To address at least the problems and/or disadvantages described above, embodiments of the present disclosure provide a data processing method and apparatus, an electronic device, a storage medium, and a program product.

According to a first aspect of embodiments of the present disclosure, there is provided a data processing method comprising: determining a plurality of mapping sub-data sets from an original data set for training and/or fine-tuning of a large language model (LLM); determining weights corresponding to each quality indicator for the original data set based on a mapping network and the plurality of mapping sub-data sets, wherein the mapping network is configured to predict the model accuracy corresponding to one mapping sub-data set based on the quality indicators of the data in that mapping sub-data set; and determining a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights.

Optionally, the mapping network is determined based on a plurality of training sub-data sets of the original data set, wherein the plurality of training sub-data sets comprises at least one sub-data set whose data are randomly selected from the original data set and at least one sub-data set whose data are selected based on at least one randomly determined quality indicator.

Optionally, the mapping network is determined based on the plurality of training sub-data sets by: determining an indicator score for each quality indicator for each piece of data of each training sub-data set by using a reward model; training and/or fine-tuning the LLM based on each training sub-data set; testing, by using a test set, the model accuracy of the trained and/or fine-tuned LLM corresponding to each training sub-data set; and determining the mapping network that maps indicator scores to model accuracy based on the determined indicator scores of the plurality of training sub-data sets and the tested model accuracies.
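The mapping-network construction described above (reward-model indicator scores in, measured model accuracy out) can be sketched with a linear least-squares fit standing in for the mapping network; the disclosure does not fix the network's functional form, and the indicator scores and accuracies below are invented toy data, not values from the patent.

```python
import numpy as np

# Sketch: fit a mapping from per-subset mean indicator scores to the model
# accuracy measured after training/fine-tuning on that subset. A linear
# least-squares model is an assumed stand-in for the mapping network.

def fit_mapping_network(indicator_scores, accuracies):
    """indicator_scores: (n_subsets, n_indicators) mean index scores.
    accuracies: (n_subsets,) measured model accuracies on the test set.
    Returns a predict(scores) function approximating the mapping network."""
    X = np.asarray(indicator_scores, dtype=float)
    X = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, np.asarray(accuracies, dtype=float), rcond=None)

    def predict(scores):
        s = np.atleast_2d(scores)
        return np.hstack([s, np.ones((len(s), 1))]) @ coef

    return predict

# Toy data: four training sub-data sets, two hypothetical indicators;
# accuracy loosely increases with both indicator scores.
X = np.array([[0.1, 0.2], [0.5, 0.4], [0.9, 0.8], [0.3, 0.7]])
y = np.array([0.52, 0.61, 0.74, 0.63])
predict = fit_mapping_network(X, y)
print(float(predict([0.6, 0.6])[0]))  # predicted accuracy, roughly mid-range
```

Once fitted, the mapping network lets the method evaluate candidate indicator-score profiles without retraining the LLM for each one.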
Optionally, the step of determining weights corresponding to each quality indicator of the original data set based on the mapping network and the plurality of mapping sub-data sets comprises: randomly setting an indicator score for each quality indicator for each of the plurality of mapping sub-data sets; determining, by using the mapping network, the model accuracy of an LLM trained and/or fine-tuned based on each mapping sub-data set from the set indicator scores of that mapping sub-data set; selecting a subset of the mapping sub-data sets from the plurality of mapping sub-data sets based on the plurality of model accuracies respectively corresponding to the plurality of mapping sub-data sets; and determining the weight for each quality indicator of the original data set based on statistics of the quality indicators of the selected mapping sub-data sets.

Optionally, the step of determining a first data set for training and/or fine-tuning of the LLM from the original data set based on the determined weights comprises: determining an indicator score for each quality indicator of the data in the original data set by using a reward model; determining a quality score of the data in the original data set by weighting the index
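The weight-determination step described above can be sketched as follows: sample random indicator-score vectors for candidate mapping sub-data sets, rank them by the accuracy the mapping network predicts, keep the best fraction, and normalize the mean indicator scores of the kept candidates into per-indicator weights. The stand-in mapping network and the `keep` fraction below are assumptions for illustration, not parameters stated in the disclosure.

```python
import random

# Sketch: derive per-indicator weights from the statistics of the
# best-predicted randomly scored mapping sub-data sets.

def determine_weights(predict, n_subsets, n_indicators, keep=0.2, rng=None):
    """predict: mapping network, indicator-score vector -> predicted accuracy.
    Returns one weight per quality indicator, normalized to sum to 1."""
    rng = rng or random.Random(0)
    # Randomly set an indicator score per quality indicator per subset.
    candidates = [[rng.random() for _ in range(n_indicators)]
                  for _ in range(n_subsets)]
    # Keep the fraction of subsets with the highest predicted accuracy.
    ranked = sorted(candidates, key=predict, reverse=True)
    best = ranked[:max(1, int(keep * n_subsets))]
    # Weights from statistics (here: normalized means) of the kept subsets.
    means = [sum(c[i] for c in best) / len(best) for i in range(n_indicators)]
    total = sum(means)
    return [m / total for m in means]

# Toy mapping network: the first indicator matters twice as much as the second.
toy_predict = lambda scores: 2 * scores[0] + scores[1]
weights = determine_weights(toy_predict, n_subsets=500, n_indicators=2)
print(weights)  # the first weight comes out larger, as expected
```

With the toy network above, the recovered weights favor the first indicator, mirroring how the claimed method surfaces which quality indicators actually drive model accuracy.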