CN-121999418-A - Reinforcement learning training method and device for content auditing model and electronic equipment
Abstract
The application relates to a reinforcement learning training method and device for a content auditing model, and electronic equipment. The method comprises the steps of: obtaining a multimodal auditing large model to be trained and constructing an expert model group; generating, by the multimodal auditing large model, an audit conclusion and a corresponding visual attention distribution for a sample video, and outputting, by the expert model group, risk probabilities and saliency heatmaps; calculating a label consistency reward and an attention alignment reward; calculating the prediction entropy of the expert model group according to the risk probabilities; in the case that the prediction entropy is not lower than a preset threshold, constructing dynamic prompt information by retrieving an external auditing rule base and calculating a logic self-consistency reward; constructing a composite reward function based on the label consistency reward, the attention alignment reward and the logic self-consistency reward; and performing a reinforcement learning training update on the multimodal auditing large model. The application solves the technical problems that the content auditing model generalizes poorly and cannot handle conflicting expert outputs or fine-grained localization.
Inventors
- WANG DAN
Assignees
- Beijing QIYI Century Science &amp; Technology Co., Ltd. (北京奇艺世纪科技有限公司)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-02-28
Claims (10)
- 1. A reinforcement learning training method for a content auditing model, characterized by comprising the following steps: acquiring a multimodal auditing large model to be trained and at least two expert small models each directed at a different risk category, and performing saliency enhancement reconstruction on the expert small models to construct an expert model group; generating, by the multimodal auditing large model, an audit conclusion and a corresponding visual attention distribution for a sample video, and outputting, by the expert model group, risk probabilities and saliency heatmaps; calculating a label consistency reward according to the difference between the risk probabilities and the audit conclusion, and calculating an attention alignment reward according to the similarity between the visual attention distribution and the saliency heatmaps; calculating the prediction entropy of the expert model group according to the risk probabilities; in the case that the prediction entropy is lower than a preset threshold, taking the label consistency reward and the attention alignment reward as the main optimization targets; in the case that the prediction entropy is not lower than the preset threshold, constructing dynamic prompt information by retrieving an external auditing rule base, guiding the multimodal auditing large model to generate an interpretation text containing a reasoning process, and calculating a logic self-consistency reward; and constructing a composite reward function based on the label consistency reward, the attention alignment reward and the logic self-consistency reward, and performing a reinforcement learning training update on the multimodal auditing large model according to the composite reward function.
- 2. The method of claim 1, wherein performing saliency enhancement reconstruction on the expert small models to construct an expert model group comprises: modifying expert small models of a convolutional neural network architecture using gradient-based class activation mapping, and aggregating the per-layer attention weights of expert small models of a Vision Transformer architecture using layer-wise propagation based on the attention weights, so that the expert model group outputs, alongside the risk probabilities for the respective risk categories, saliency heatmaps corresponding to the sample video frames.
- 3. The method of claim 1, wherein calculating a label consistency reward according to the difference between the risk probabilities and the audit conclusion comprises: acquiring the risk probability output by each expert small model in the expert model group for its corresponding risk category, and the audit conclusion confidence generated by the multimodal auditing large model for that risk category; calculating the difference between the risk probability and the audit conclusion confidence under each risk category, and determining a consistency score for the risk category according to the absolute value of the difference, wherein the consistency score is inversely related to the difference; and performing weighted summation or averaging over the consistency scores of the risk categories to obtain the label consistency reward.
- 4. The method of claim 1, wherein calculating an attention alignment reward according to the similarity between the visual attention distribution and the saliency heatmaps comprises: extracting the cross-modal attention weights of the multimodal auditing large model when generating violating-entity words, and mapping the cross-modal attention weights into a visual attention distribution over the sample video frame space; obtaining the saliency heatmap output by each expert small model in the expert model group, and normalizing and flattening the visual attention distribution and the saliency heatmaps; and respectively calculating the similarity between the visual attention distribution and each saliency heatmap, determining confidence weights according to the risk probability output by each expert small model in the expert model group, and performing a weighted fusion of the similarities to obtain the attention alignment reward.
- 5. The method of claim 1, wherein calculating the prediction entropy of the expert model group according to the risk probabilities comprises: for each expert small model in the expert model group, acquiring the risk probability distribution output by the expert small model for the sample video; and calculating a prediction entropy based on the risk probability distribution to quantify the certainty of each expert small model in the expert model group about its current judgment.
- 6. The method of claim 1, wherein, in the case that the prediction entropy is not lower than the preset threshold, constructing dynamic prompt information by retrieving an external auditing rule base, guiding the multimodal auditing large model to generate an interpretation text containing a reasoning process, and calculating a logic self-consistency reward comprises: in the case that the prediction entropy is not lower than the preset threshold, retrieving related rule texts from the external auditing rule base or a laws-and-regulations base, using the visual tags or violating-entity words of the sample video as search keywords; combining the retrieved rule texts with dynamically constructed prompt-word information to form dynamic prompt information for guiding the multimodal auditing large model to generate a step-by-step reasoning explanation; and calculating the logic self-consistency reward according to the accuracy of rule citation and the soundness of the reasoning logic while the multimodal auditing large model generates the interpretation text, the interpretation text comprising an analysis of picture details, cited rule clauses and a decision conclusion.
- 7. The method of any one of claims 1-6, wherein constructing a composite reward function based on the label consistency reward, the attention alignment reward and the logic self-consistency reward comprises: in the case that the prediction entropy is lower than the preset threshold, performing a weighted summation of the label consistency reward and the attention alignment reward to obtain the composite reward function; and in the case that the prediction entropy is not lower than the preset threshold, performing a weighted calculation on the logic self-consistency reward to obtain the composite reward function.
- 8. A reinforcement learning training device for a content auditing model, comprising: a construction module, configured to acquire a multimodal auditing large model to be trained and at least two expert small models each directed at a different risk category, and to perform saliency enhancement reconstruction on the expert small models to construct an expert model group; a first calculation module, configured to have the multimodal auditing large model generate an audit conclusion and a corresponding visual attention distribution for a sample video and the expert model group output risk probabilities and saliency heatmaps, to calculate a label consistency reward according to the difference between the risk probabilities and the audit conclusion, and to calculate an attention alignment reward according to the similarity between the visual attention distribution and the saliency heatmaps; a second calculation module, configured to calculate the prediction entropy of the expert model group according to the risk probabilities, to take the label consistency reward and the attention alignment reward as the main optimization targets in the case that the prediction entropy is lower than a preset threshold, and, in the case that the prediction entropy is not lower than the preset threshold, to construct dynamic prompt information by retrieving an external auditing rule base, guide the multimodal auditing large model to generate an interpretation text containing a reasoning process, and calculate a logic self-consistency reward; and a training module, configured to construct a composite reward function based on the label consistency reward, the attention alignment reward and the logic self-consistency reward, and to perform a reinforcement learning training update on the multimodal auditing large model according to the composite reward function.
- 9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, performs the method of any one of claims 1 to 7.
- 10. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is arranged to execute the method of any one of claims 1 to 7 by means of the computer program.
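The "layer-wise propagation based on the attention weights" named in claim 2 for Vision Transformer experts is commonly realised as attention rollout. Below is a minimal NumPy sketch under that assumption; the function name, the 0.5 residual mixing coefficient, and the use of the first (CLS) row as the saliency map are illustrative choices, not specifics from the patent.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Aggregate per-layer ViT attention into a single token-to-token map.

    Each element of attn_layers is a (tokens x tokens) attention matrix,
    assumed already averaged over heads. Layers are combined by matrix
    multiplication after mixing in an identity term for the residual
    connection, as in standard attention rollout.
    """
    n = attn_layers[0].shape[0]
    rollout = np.eye(n)
    for a in attn_layers:
        a = np.asarray(a, dtype=float)
        a = 0.5 * a + 0.5 * np.eye(n)          # account for residual path
        a = a / a.sum(axis=-1, keepdims=True)  # renormalise rows
        rollout = a @ rollout
    return rollout  # row 0 (CLS token) can serve as a patch saliency map
```

Reshaping the CLS row back to the patch grid yields the saliency heatmap that the expert model group outputs alongside its risk probability.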
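The label consistency reward of claim 3 can be sketched directly from its wording: per-category absolute differences between expert risk probabilities and the large model's audit confidences, converted to scores that shrink as the difference grows, then averaged. The specific scoring function `1 - |diff|` and the function name are assumptions for illustration.

```python
import numpy as np

def label_consistency_reward(expert_probs, model_confidences, weights=None):
    """Score each risk category by agreement between the expert model's
    risk probability and the large model's audit conclusion confidence,
    then combine by (weighted) averaging, per claim 3."""
    expert_probs = np.asarray(expert_probs, dtype=float)
    model_confidences = np.asarray(model_confidences, dtype=float)
    # Consistency score is inversely related to the absolute difference;
    # 1 - |p_expert - p_model| is one simple realisation of that.
    scores = 1.0 - np.abs(expert_probs - model_confidences)
    if weights is None:
        return float(scores.mean())
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(scores, weights) / weights.sum())
```

Perfect agreement across all categories yields a reward of 1.0, and maximal disagreement yields 0.0.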
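Claim 4's normalise-flatten-compare pipeline can likewise be sketched in a few lines. Cosine similarity is one common choice for comparing flattened maps (the claim itself does not fix the similarity measure), and weighting each similarity by the expert's risk probability implements the confidence-weighted fusion.

```python
import numpy as np

def attention_alignment_reward(visual_attn, expert_heatmaps, risk_probs):
    """Flatten and L2-normalise the large model's visual attention map and
    each expert saliency heatmap, compute one cosine similarity per expert,
    and fuse the similarities with confidence weights derived from the
    experts' risk probabilities, per claim 4."""
    def normalise(m):
        v = np.asarray(m, dtype=float).ravel()
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    a = normalise(visual_attn)
    sims = np.array([float(normalise(h) @ a) for h in expert_heatmaps])
    w = np.asarray(risk_probs, dtype=float)
    w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))
    return float(np.dot(sims, w))
```

An expert whose heatmap matches the large model's attention exactly contributes a similarity of 1.0, scaled by its share of the total confidence.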
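The prediction entropy of claim 5, which gates the two training branches, reduces to standard Shannon entropy over each expert's risk probability distribution. A minimal sketch for binary per-category experts, averaging the per-expert binary entropies into a group value (the averaging step is an assumption; the patent only requires quantifying each expert's certainty):

```python
import numpy as np

def prediction_entropy(risk_probs, eps=1e-12):
    """Binary Shannon entropy of each expert's risk probability, averaged
    over the expert model group. Low values mean confident experts (claim
    1's reward-alignment branch); high values trigger the rule-retrieval
    branch of claim 6."""
    p = np.clip(np.asarray(risk_probs, dtype=float), eps, 1.0 - eps)
    per_expert = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(per_expert.mean())
```

Entropy peaks at ln 2 ≈ 0.693 when an expert outputs 0.5 (maximal uncertainty) and approaches 0 as the probability nears 0 or 1.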
Description
Reinforcement learning training method and device for content auditing model and electronic equipment

Technical Field

The application relates to the technical field of multimedia intelligent processing, and in particular to a reinforcement learning training method and device for a content auditing model, and electronic equipment.

Background

With the explosive growth of internet multimedia content, content auditing has become a core link for every large platform in ensuring user safety and compliant operation. To detect different types of violating content, platforms typically deploy multiple expert small models for specific risk categories, such as dedicated models for identifying pornography, violence, and the like. These proprietary small models offer higher accuracy and stability in their specific fields, but as content types and auditing standards grow more complex, traditional auditing methods that rely solely on one or several small models expose significant limitations. Firstly, existing small models can only capture local features, lack the capability to understand complex semantic relationships, and cannot effectively judge concealed violations or long-tail risk scenes. Secondly, when the outputs of small models in different fields conflict, existing methods usually adopt simple weighting or averaging, so a large model easily learns ambiguous or erroneous judgment logic. Furthermore, existing methods lack fine-grained visual localization capability: they only provide label-level violation judgments and can hardly guide manual review or subsequent processing in an intuitive way. To solve these problems, multimodal large models have in recent years begun to be introduced into the field of content auditing.
Compared with traditional small models, a multimodal large model can process images, videos, texts and audios simultaneously in a unified semantic space to achieve deep understanding of complex violations; it has strong common-sense reasoning capability and can reason logically by combining background knowledge and context information to judge the nature of borderline cases such as artistic nudity, athletic scenes, or concealed and obscure violations; and it can generate an interpretable natural-language audit report that clearly describes the violation and provides a visual reference for manual review. However, how to effectively combine the precise domain knowledge of existing expert models with the broad reasoning capability of a large model remains a key problem to be solved in the current technology. Traditional knowledge distillation methods only align the labels or probability distributions output by the small models; they cannot ensure the consistency of the models in visual attention regions and reasoning logic, easily give rise to shortcut learning, and leave the large model with insufficient generalization capability in unknown scenes.

Disclosure of Invention

The application provides a reinforcement learning training method and device for a content auditing model, and electronic equipment, aiming to solve the technical problems that the content auditing model generalizes poorly and cannot handle conflicting expert outputs or fine-grained localization.
According to a first aspect, the application provides a reinforcement learning training method for a content auditing model, which comprises the steps of: obtaining a multimodal auditing large model to be trained and at least two expert small models each directed at a different risk category, and performing saliency enhancement reconstruction on the expert small models to construct an expert model group; generating, by the multimodal auditing large model, an audit conclusion and a corresponding visual attention distribution for a sample video, and outputting, by the expert model group, risk probabilities and saliency heatmaps; calculating a label consistency reward according to the difference between the risk probabilities and the audit conclusion, and calculating an attention alignment reward according to the similarity between the visual attention distribution and the saliency heatmaps; calculating the prediction entropy of the expert model group according to the risk probabilities; taking the label consistency reward and the attention alignment reward as the main optimization targets in the case that the prediction entropy is lower than a preset threshold; in the case that the prediction entropy is not lower than the preset threshold, constructing dynamic prompt information by retrieving an external auditing rule base, guiding the multimodal auditing large model to generate an interpretation text containing a reasoning process, and calculating a logic self-consistency reward; and constructing a composite reward function based on the label consistency reward, the attention alignment reward and the logic self-consistency reward, and performing a reinforcement learning training update on the multimodal auditing large model according to the composite reward function. According to a second aspect, the application provides a reinforcement learning training device for a content auditing model, which comprises a construction module, a first calculation module, a second calculation module and a training module.
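The entropy-gated composite reward described above (and detailed in claim 7) can be sketched as a simple branch: below the entropy threshold, the reward is a weighted sum of the label consistency and attention alignment rewards; at or above it, the logic self-consistency reward is used. The weights alpha, beta and gamma here are illustrative placeholders; the patent does not fix their values.

```python
def composite_reward(entropy, threshold, r_label, r_attn, r_logic,
                     alpha=0.5, beta=0.5, gamma=1.0):
    """Entropy-gated composite reward per claim 7.

    entropy < threshold  -> experts are confident: align the large model
                            with their labels and saliency maps.
    entropy >= threshold -> experts disagree: rely on the rule-grounded
                            logic self-consistency reward instead.
    """
    if entropy < threshold:
        return alpha * r_label + beta * r_attn
    return gamma * r_logic
```

This scalar reward would then drive a standard policy-gradient update of the multimodal auditing large model; the patent leaves the specific reinforcement learning algorithm unspecified.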