CN-122020188-A - Cold start scene discipline classification model iterative optimization method and system


Abstract

The invention discloses an iterative optimization method and system for a discipline classification model in a cold-start scenario. The method uses a two-level iterative optimization. It starts model training from a constructed initial sample pool, applies a three-level filtering and dynamic priority evaluation mechanism to screen high-value samples out of massive unlabeled data and thereby expand the training data in a targeted way, uses a large language model to perform root cause analysis on the model's misjudgments, automatically identifies the misjudgment types and supplements the weak data in a directed manner, and finally supports continuous updating and backtracking of the data through versioned management of the sample pool. The invention addresses the lack of labeled data for a discipline classification model in a cold-start scenario. Through a systematic, efficient closed loop of automatic data construction and optimization, it significantly reduces the dependence of model training and iteration on intensive manual labeling and domain-expert data analysis, and lets model classification performance and training data co-evolve within the automatic closed loop.

Inventors

XIA JING
CHI SHENGQIANG
XIN RAN
LI XUEYAO
SHU GE
ZHANG YING

Assignees

Zhejiang Lab (之江实验室)

Dates

Publication Date: 20260512
Application Date: 20260414

Claims (10)

1. An iterative optimization method for a discipline classification model in a cold-start scenario, characterized by comprising the following steps: S1, for a target discipline, extracting initial positive samples from an acquired original data set based on preset rules, extracting initial negative samples from data of other disciplines, and combining the initial positive and negative samples to form an initial training sample pool; S2, training a text classification algorithm on the initial training sample pool to obtain an initial discipline classification model; S3, using the initial discipline classification model, obtaining high-value samples from the original data set through a three-stage funnel filtering and dynamic priority evaluation mechanism, finely labeling them with large-language-model assistance or manually, merging the newly labeled samples with the historical sample pool, deduplicating, and updating the training sample pool; S4, retraining the discipline classification model on the updated sample pool, analyzing the misjudged samples of the retrained model with a large language model through structured prompts, and supplementing directionally enhanced positive and negative sample data to generate a directional enhancement sample pool; S5, retraining the discipline classification model on the directional enhancement sample pool, and, when the performance of the model reaches a preset target, ending the iteration and outputting the final discipline classification model.
2. The iterative optimization method of cold-start scene discipline classification model according to claim 1, wherein in step S1, the preset rule includes one or more of URL domain name screening, keyword matching, metadata screening, classification system mapping, and large model auxiliary screening.
3. The iterative optimization method of claim 1, wherein in step S2, the text classification algorithm comprises at least one of FastText, TextCNN, BiLSTM, and BERT and its lightweight variants.
4. The iterative optimization method of cold-start scene discipline classification model according to claim 1, wherein in step S3, the three-stage funnel filtering and dynamic priority evaluation mechanism comprises: a first stage of rapid triage based on the confidence of the initial discipline classification model, in which the original data set is divided into high-confidence positive candidate samples, high-confidence negative candidate samples and fuzzy samples according to preset thresholds; a second stage of quality filtering and grading, in which unqualified samples are removed based on preset quality filtering rules to obtain a candidate sample pool, a text quality grade classifier then grades the samples in the candidate sample pool as high, medium or low quality, and the proportions of high-, medium- and low-quality samples in the candidate sample pool are calculated; and a third stage of multi-dimensional intelligent priority evaluation, in which a comprehensive priority score is calculated for each sample from a diversity factor, an uncertainty factor and a quality distribution adjustment factor, and the samples are sorted in descending order of the comprehensive priority score.
5. The iterative optimization method of a cold-start scene discipline classification model of claim 4, wherein the third-stage multi-dimensional intelligent priority evaluation comprises the sub-steps of: (1) assigning course labels to each sample in the candidate sample pool, obtaining the coarse-grained course label set of the candidate sample pool, and calculating the coverage rate and the relative coverage rate of each course label in the candidate sample pool; (2) computing a weighted sum of the relative coverage rates of the course labels of each sample to obtain the diversity factor of each sample in the candidate sample pool; (3) calculating the uncertainty factor of each sample based on the value predicted for that sample by the discipline classification model; (4) calculating the quality distribution adjustment factor of each sample based on the proportions of high-, medium- and low-quality samples in the candidate sample pool and the quality grade of that sample; (5) computing a weighted sum of the diversity factor, the uncertainty factor and the quality distribution adjustment factor of each sample to obtain the comprehensive priority score; (6) sorting the samples in descending order of the comprehensive priority score.
6. The iterative optimization method of a cold-start scene discipline classification model according to claim 1, characterized in that said step S4 comprises the sub-steps of: S4.1, collecting the samples that the discipline classification model predicts as positive and as negative in an evaluation data set, selecting a false positive sample set and a false negative sample set according to the true discipline labels of the samples, inputting each set into a large language model, performing root cause analysis with structured prompts, and outputting the misjudgment type and the confused discipline of each sample, wherein the evaluation data set is constructed from data sources other than the source of the original data set and is labeled manually or by a large model so that it covers text samples of a plurality of disciplines; S4.2, counting the number of samples in the false positive sample set and in the false negative sample set respectively, and using Laplace smoothing to calculate the proportion of each misjudgment type among the predicted positive samples and among the predicted negative samples; S4.3, judging, based on a preset proportion threshold, whether the proportion of each misjudgment type meets the condition for directional enhancement; S4.4, screening samples of the types that need enhancement out of the original data set, further selecting directional enhancement positive samples and directional enhancement negative samples, and combining them to generate the directional enhancement sample pool.
7. The iterative optimization method of cold-start scene discipline classification model according to claim 6, characterized in that in step S4.1, the structured prompt templates are as follows: the first template is used for analyzing a text that the model misjudged as a positive sample, identifying whether the misjudgment type is CONF (confusion with another discipline) or NOISE (interference from low-quality text), and, if the type is CONF, identifying the name of the confused discipline; the second template is used for analyzing a text that the model misjudged as a negative sample, identifying whether the misjudgment type is CONF (confusion with another discipline) or NOISE (interference from low-quality text), and, if the type is CONF, identifying the name of the confused discipline.
8. An iterative optimization system for a cold start scene discipline classification model, comprising: an initial sample pool construction module, used for quickly obtaining an initial data set with which training can be started; a sample pool management and updating module, used for managing, updating and capacity-controlling the texts in the sample pool; a model training module, used for training the discipline classification model with an efficient text classification algorithm based on the current version of the training sample pool; a training sample iteration multidimensional evaluation and diversification screening module, used for intelligently screening out the newly added training samples that most improve the model's discrimination capability and knowledge coverage; and a misjudgment automatic analysis and directional data enhancement module, used for automatically diagnosing the misjudgment types and causes of the classification model based on the root cause analysis capability of a large language model, outputting directionally enhanced positive and negative samples for misjudgment labels whose proportion exceeds a preset threshold, updating the training sample pool and retraining the model, and repeating this cycle until the proportion of every misjudgment label is lower than the preset threshold.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the cold start scene discipline classification model iterative optimization method of any of claims 1-7.
10. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor, wherein the memory is configured to store program data, and wherein the processor is configured to execute the program data to implement the cold start scene discipline classification model iterative optimization method of any of claims 1-7.
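
The priority ranking of claims 4 and 5 and the Laplace-smoothed misjudgment proportions of claim 6 (S4.2) can be illustrated with a minimal Python sketch. The claims name the factors but give no formulas, so everything numeric here is an assumption made for illustration: the thresholds in `triage`, the form of `uncertainty`, the use of one minus coverage as the relative coverage in `diversity`, the target quality shares in `quality_adjust`, the weights in `rank_candidates`, and the smoothing constant `alpha`. All function and field names are hypothetical.

```python
from collections import Counter

# All thresholds, weights and formulas are illustrative assumptions;
# the claims name these factors but do not define them numerically.

def triage(p_pos, low=0.2, high=0.8):
    """Stage 1: split samples by model confidence (assumed thresholds)."""
    if p_pos >= high:
        return "positive_candidate"
    if p_pos <= low:
        return "negative_candidate"
    return "fuzzy"

def uncertainty(p_pos):
    """Assumed form: 1.0 at p = 0.5, falling to 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_pos - 1.0)

def diversity(labels, label_counts, pool_size):
    """Assumed form: mean of (1 - coverage) over a sample's course labels,
    so samples carrying rarely covered labels score higher."""
    if not labels:
        return 0.0
    return sum(1.0 - label_counts[l] / pool_size for l in labels) / len(labels)

def quality_adjust(grade, grade_share, target_share):
    """Assumed form: favor grades under-represented versus a target share."""
    return min(1.0, target_share[grade] / grade_share[grade])

def rank_candidates(pool, weights=(0.4, 0.4, 0.2), target_share=None):
    """Stage 3: weighted sum of the three factors, sorted in descending order."""
    target_share = target_share or {"high": 0.5, "medium": 0.3, "low": 0.2}
    n = len(pool)
    label_counts = Counter(l for s in pool for l in s["labels"])
    grade_counts = Counter(s["grade"] for s in pool)
    grade_share = {g: c / n for g, c in grade_counts.items()}
    w_div, w_unc, w_q = weights

    def score(s):
        return (w_div * diversity(s["labels"], label_counts, n)
                + w_unc * uncertainty(s["p_pos"])
                + w_q * quality_adjust(s["grade"], grade_share, target_share))

    return sorted(pool, key=score, reverse=True)

def smoothed_ratios(type_counts, n_predicted, types=("CONF", "NOISE"), alpha=1.0):
    """S4.2: Laplace-smoothed share of each misjudgment type among the
    samples predicted as one class (additive smoothing with constant alpha)."""
    denom = n_predicted + alpha * len(types)
    return {t: (type_counts.get(t, 0) + alpha) / denom for t in types}

# Toy candidate pool: p_pos is the model's positive-class probability.
pool = [
    {"id": "a", "p_pos": 0.50, "labels": ["algebra"], "grade": "high"},
    {"id": "b", "p_pos": 0.95, "labels": ["algebra"], "grade": "high"},
    {"id": "c", "p_pos": 0.50, "labels": ["optics"], "grade": "low"},
]
ranked_ids = [s["id"] for s in rank_candidates(pool)]
ratios = smoothed_ratios({"CONF": 3}, n_predicted=8)
```

In this toy pool, sample `c` ranks first because it is both uncertain and carries the less covered course label, while the confidently predicted sample `b` ranks last. A misjudgment type whose smoothed ratio exceeds the preset proportion threshold of S4.3 would then trigger directional enhancement.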

Description

Cold start scene discipline classification model iterative optimization method and system

Technical Field

The invention belongs to the technical field of artificial intelligence and natural language processing, and particularly relates to an iterative optimization method and system for a discipline classification model in a cold-start scenario.

Background

With the rapid digitization of academic resources, automatic, high-precision discipline classification has become a core requirement for screening and organizing expert knowledge from massive texts. The technology provides basic support for applications such as educational content recommendation, academic resource navigation and discipline corpus construction. However, actually building a discipline classification model on massive text (e.g., massive web page data) faces the following prominent technical challenges:

1. Cold start problem: at the initial stage, the discipline classification task usually lacks large-scale, high-quality labeled data. Existing methods typically rely on experts manually labeling a small amount of seed data, or on preliminary data collection with rough keyword-based rules. Manual labeling is costly and inefficient, while rule-based approaches have low recall and high noise. Neither approach readily yields sufficient, balanced initial training data, so the initial model performs poorly and cannot support the subsequent accurate screening of a discipline data set. The result is a vicious circle of "no high-quality data -> poor model -> inability to screen high-quality data".

2. Data bias and insufficient coverage: even when initial training data is obtained, its distribution is often significantly skewed. For example, the data may be overly concentrated on certain popular disciplines or popular sub-areas within a discipline (such as "machine learning" in computer science), while lacking coverage of less popular sub-areas. The data quality distribution is also highly unbalanced: most of the data is clearly structured, formally written text, with little coverage of colloquial text or mixed content of uneven quality. Because of this bias, the trained model performs well only in a "greenhouse" environment. Once it is applied to real, diverse text such as massive web pages, its ability to discriminate less popular disciplines and texts of ordinary quality drops sharply, which seriously affects its robustness in practical applications and makes truly accurate screening difficult.

3. Low iterative optimization efficiency: to improve model performance, the prior art generally adopts an iterative optimization strategy in which the current model predicts unlabeled data and the prediction results are then manually examined to select new samples for the training set. This process depends heavily on manual analysis by domain experts, is time-consuming and labor-intensive, is highly subjective, and makes it difficult to systematically quantify the distribution of the model's defects. Experts can hardly generalize model weaknesses quickly and accurately from thousands of predicted samples, so the direction of data enhancement is unclear, optimization is slow and its effect is limited. This inefficient iteration severely restricts the feasibility of quickly constructing high-performance discipline classification models in a massive data environment.

To address the above challenges, some studies have tried to introduce methods such as active learning and semi-supervised learning, but these typically focus on a single optimization objective (e.g., model uncertainty) and fail to incorporate discipline knowledge structures (e.g., curriculum systems) and data quality dimensions into a unified screening framework. These methods also lack a mechanism for automatic, structured diagnosis of model defects, and still cannot achieve the fundamental transition from being driven by manual experience to being driven by algorithms and data. Therefore, for the application scenario of constructing a discipline classification model on massive text data for accurate data screening, an innovative method is needed that can automatically resolve cold start, systematically eliminate data bias, and intelligently guide iterative optimization.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a cold start scene discipline classification model iterative optimization method and system, which are used for solving the key technical problems of difficult model training caused by lack of labeled data, weak model generalization capability caused by uneven data distribution and high optimization cost caused by low manual ite