
CN-122019687-A - Text processing method oriented to specific context

CN122019687A

Abstract

The invention discloses a text processing method oriented to a specific context, belonging to the technical field of text processing. The method comprises: constructing a sample set containing texts, labels and confidence values; training a model with dynamic resampling and a confidence-weighted loss; calibrating an online threshold by scanning on a high-confidence subset; dividing samples into a consistent region and a conflict region according to the inference results; performing small-batch manual labeling combined with active learning; generating weak labels with a language model; and writing the results back to update the sample set, forming a multi-round iterative closed loop. The invention runs a unified confidence scale through the data management, model training and evaluation flow, markedly reduces the dependence on large-scale labeled data, improves recognition accuracy and robustness against non-standard expressions such as slang, code words and cross-language mixed writing, and achieves continuous self-evolution at low cost and with high stability.
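The consistent/conflict division and the write-back step of the closed loop described above can be sketched in a few lines of Python. This is an illustrative sketch only: the sample fields `id`, `text`, `label`, `confidence` and the `predict` callable are assumptions for the example, not part of the patent.

```python
def split_regions(samples, predict):
    # Consistent region: predicted label equals the current-round label;
    # conflict region: the two disagree (the division performed before
    # active-learning sampling in the method).
    consistent, conflict = [], []
    for s in samples:
        (consistent if predict(s["text"]) == s["label"] else conflict).append(s)
    return consistent, conflict

def write_back(samples, weak_labels):
    # Write weak-label results back into the round-n sample set: overwrite
    # the label and confidence of every sample that received a weak label,
    # yielding the sample set for round n+1.
    by_id = {w["id"]: w for w in weak_labels}
    return [
        {**s, "label": by_id[s["id"]]["label"],
              "confidence": by_id[s["id"]]["confidence"]}
        if s["id"] in by_id else s
        for s in samples
    ]
```

In each iteration the conflict region is oversampled into the candidate pool, a small batch is manually labeled, the language model produces weak labels for the rest, and `write_back` closes the loop.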

Inventors

  • HU XIAO

Assignees

  • 成都任平生网络科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-10

Claims (7)

  1. A text processing method oriented to a specific context, characterized by comprising the following steps: step 1, constructing a sample set for an nth round of iteration, the sample set comprising a plurality of samples, each sample comprising at least a text to be processed, a current-round label and a current-round sample confidence; step 2, dividing the sample set into a training set and a validation set, and storing the division index; step 3, training a text classification model on the training set and the validation set; step 4, computing, with the trained text classification model, the predicted probability that each validation-set sample belongs to the target risk class, constructing calibration weights, and performing online threshold scanning calibration on a high-confidence subset to reduce the influence of weak-label noise on threshold selection: comparing the predicted probabilities with each candidate online threshold to obtain predicted labels, computing the weighted F1 of the target risk class using the calibration weights, and selecting the online threshold with the largest weighted F1 as the current round's online threshold; step 5, performing full inference on the training set and the validation set with the best model and the current online threshold, and dividing the samples into a consistent region and a conflict region according to whether the predicted label agrees with the current label, the consistent region comprising samples whose predicted label is identical to the current label and the conflict region comprising samples whose predicted label differs from the current label; step 6, sampling from the validation consistent region and the validation conflict region to construct a candidate pool, the sampling weight of the validation conflict region being greater than that of the validation consistent region; step 7, inputting the strong-label subset into a language model for induction to generate the current round's prompt, and weakly labeling the samples of the candidate pool or of an extended sample set with the current round's prompt to obtain a weak-label set; and step 8, writing the weak-label results back into the sample set of the nth iteration, updating labels and confidences to generate the sample set of the (n+1)th iteration, and returning to step 1 to begin the next iteration, forming a closed-loop self-evolution.
  2. The text processing method oriented to a specific context according to claim 1, wherein in step 1, when n=0, the initial labels of the initial sample set before iteration begins are set according to historical weak labels, rule-based weak labels, a statistical policy or a default-label policy, and the initial sample confidences are set according to historical empirical values or a preset initial value.
  3. The text processing method oriented to a specific context according to claim 1, wherein in step 3 the training process of the text classification model comprises the following steps: dynamically resampling the training set according to the labels, controlling the sampling distribution according to a target negative-to-positive ratio, and weighting the sampling of negative samples according to the current-round sample confidence, so that high-confidence negative samples participate in training with higher probability; and training the model with a confidence-weighted loss function, wherein the per-sample losses are aggregated with weights based on the current-round sample confidence, and the validation loss is computed on the validation set.
  4. The text processing method oriented to a specific context according to claim 3, characterized in that the dynamic resampling comprises the following steps: splitting the training set into a positive sample set and a negative sample set according to the labels; computing a target number of negative samples from the target negative-to-positive ratio; stratifying the negative sample set by the current-round sample confidence into high-confidence negatives and low-confidence negatives; and assigning a weight multiple to the high-confidence negatives, assigning a weight of 1.0 to the low-confidence negatives, and performing weighted random sampling to generate the training subset for each training round.
  5. The text processing method oriented to a specific context according to claim 3, wherein the confidence-weighted loss function is computed as follows: for each batch of samples, computing the per-sample cross-entropy loss; using the current-round sample confidence as the sample weight, aggregating the per-sample cross-entropy losses into a batch loss; and back-propagating the batch loss during training and computing the weighted validation loss during validation.
  6. The text processing method oriented to a specific context according to claim 1, wherein the samples in the strong-label subset record manual labels, manual confidences and manual rationales.
  7. The text processing method oriented to a specific context according to claim 1, wherein step 7 further comprises the following steps: S71, fine-tuning a language model with the manually annotated chains of thought in the strong-label subset; S72, generating the current round's prompt with the fine-tuned language model; S73, weakly labeling the samples of the candidate pool or of the extended sample set with the current round's prompt to obtain a weak-label set, and comparing the weak labels of the samples in the weak-label set with their original labels to obtain difference items; S74, supplementing manual annotation for the difference items; S75, fine-tuning the language model again with the chains of thought of the supplementary manual annotation; and S76, repeating S72-S75 until the current round's prompt reaches a preset accuracy.
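The threshold calibration of claim 1, step 4 — scanning candidate online thresholds on a high-confidence subset and scoring each by a confidence-weighted F1 of the target risk class — can be sketched as follows. This is a minimal illustration under assumed data shapes, not the patented implementation: the scan grid, the cutoff `hi`, and the use of the sample confidence itself as the calibration weight are assumptions of the example.

```python
def weighted_f1(y_true, y_pred, w):
    # Weighted F1 for the positive (target risk) class: each sample
    # contributes its calibration weight to the TP/FP/FN tallies.
    tp = sum(wi for t, p, wi in zip(y_true, y_pred, w) if t == 1 and p == 1)
    fp = sum(wi for t, p, wi in zip(y_true, y_pred, w) if t == 0 and p == 1)
    fn = sum(wi for t, p, wi in zip(y_true, y_pred, w) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def scan_threshold(probs, labels, conf, hi=0.9, grid=None):
    # Restrict the scan to the high-confidence subset (conf >= hi) so that
    # weak-label noise does not drive threshold selection, then return the
    # threshold with the largest confidence-weighted F1.
    grid = grid or [i / 100 for i in range(5, 96)]
    idx = [i for i, c in enumerate(conf) if c >= hi]
    y = [labels[i] for i in idx]
    p = [probs[i] for i in idx]
    w = [conf[i] for i in idx]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = [1 if pi >= t else 0 for pi in p]
        f1 = weighted_f1(y, pred, w)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The selected threshold is then reused in step 5 for the full inference pass over the training and validation sets.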
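Claims 4 and 5 (dynamic resampling of negatives and the confidence-weighted cross-entropy) admit an equally small sketch. Parameter names such as `neg_pos_ratio`, `conf_split` and `hi_weight` are illustrative assumptions, and `random.choices` draws with replacement, which is one possible reading of "weighted random sampling".

```python
import math
import random

def resample_negatives(train, neg_pos_ratio=3.0, conf_split=0.8,
                       hi_weight=2.0, seed=0):
    # Keep all positives; draw the target number of negatives by weighted
    # random sampling, where negatives with confidence >= conf_split get
    # hi_weight and the rest get 1.0 (claim 4's stratified weighting).
    rng = random.Random(seed)
    pos = [s for s in train if s["label"] == 1]
    neg = [s for s in train if s["label"] == 0]
    k = min(len(neg), int(neg_pos_ratio * len(pos)))
    weights = [hi_weight if s["confidence"] >= conf_split else 1.0 for s in neg]
    return pos + rng.choices(neg, weights=weights, k=k)

def confidence_weighted_ce(probs, labels, conf, eps=1e-12):
    # Per-sample binary cross entropy, aggregated with the current-round
    # sample confidence as the weight (claim 5's batch loss).
    losses = [-math.log(max(p if y == 1 else 1.0 - p, eps))
              for p, y in zip(probs, labels)]
    return sum(c * l for c, l in zip(conf, losses)) / max(sum(conf), eps)
```

During training the batch loss from `confidence_weighted_ce` is back-propagated; during validation the same weighted aggregation yields the validation loss.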

Description

Text processing method oriented to a specific context

Technical Field

The invention relates to the technical field of text processing, in particular to a text processing method oriented to a specific context.

Background

At present, natural language processing and text classification techniques are widely applied in business scenarios such as content-security auditing and risk control, where they perform intent recognition, violation-risk recognition, compliance-label classification and automated banning of user-generated text. With the development of deep learning and pre-trained language models, text classification models achieve high accuracy on general corpora and public datasets, so model-based automated auditing and risk-control schemes are gradually becoming mainstream. However, in real business environments, especially in difficult short-text scenarios such as instant social or dating dialogue, schemes that rely purely on a general pre-trained model and a small amount of manual supervision still face the following difficulties:

1. Context data are hard to collect and cover. Texts in these scenarios are typically highly private, contextually fragmented and fast-changing; constrained by privacy compliance and platform closedness, it is difficult to assemble a large-scale, high-quality, publicly obtainable training corpus. Meanwhile, the texts contain large amounts of slang, argot, code words, abbreviations, misspellings, homophone substitutions, cross-language mixed writing, and mixtures of emoji and special characters, so public datasets and general corpora cannot cover their real distribution, and the model's generalization on key boundary samples is insufficient.
2. Long tails are significant and adversarial evolution is fast. High-risk expressions tend to be long-tailed: they occur at low frequency but have large impact, and they evolve rapidly in an adversarial environment, e.g. evading detection by substituting words, splitting expressions, symbolizing expressions, or switching between languages. Traditional approaches that rely on rule maintenance or passive collection of real samples struggle to track novel expressions in time, so the model's iteration cycle is long and its response lags.

3. Sentence-level manual labeling is costly and its noise is hard to control. In difficult short-text scenarios the text itself carries limited information, and intent and risk often depend on implicit language or context, so manual labeling requires high expertise and consistency control, and its unit cost is markedly higher than in conventional classification tasks. A model trained only on a small amount of manual labeling is easily affected by sample bias; introducing weak labels or pseudo-labels can expand the data scale, but without a unified mechanism to constrain and calibrate weak-label noise, the training data are easily polluted.

4. A sustainable self-evolving closed loop of data and model is lacking. In the data-collection, training and deployment chain, existing schemes often lack automatic mining of hard cases, quantitative management of weak-label credibility, and continuous calibration of thresholds and models, so models are hard to improve continuously under a low labeling budget and cannot stably cope with context drift and expression evasion.
Therefore, a technical scheme is needed for difficult, context-dependent short texts that can obtain usable supervision signals at low cost under limited sentence-level labeling and slow data growth, form a closed-loop self-evolution through unified confidence management, hard-case mining and iterative write-back mechanisms, and at the same time introduce context-constrained virtual texts to generate supplementary long-tail expressions and contrastive samples, thereby improving the classifier's generalization, anti-evasion capability and iteration speed. This is particularly important in the context of friend-making/dating apps.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a text processing method oriented to a specific context. The aim of the invention is realized by the following technical scheme: the text processing method oriented to a specific context comprises the following steps: step 1, constructing a sample set for an nth round of iteration, the sample set comprising a plurality of samples, each sample comprising at least a text to be processed, a current-round label and a current-round sample confidence; step 2, dividing the sample set into a training set and a validation set, and storing a division in