CN-122027851-A - Method, device and equipment for auditing video content

Abstract

Embodiments of this specification provide a method, an apparatus, and a device for auditing video content. The method comprises: acquiring first multimodal data of a video to be audited, wherein the first multimodal data comprises key frame images and speech recognition text of the video to be audited; extracting multimodal features of the first multimodal data; retrieving, based on retrieval-augmented generation and according to the multimodal features, a plurality of target audit labels from a rule knowledge base, the target audit labels being used to determine whether the video to be audited is in violation; calculating a matching score between the multimodal features and each target audit label by using a first multimodal large language model; and obtaining a first audit result by using a video audit model based on the first multimodal data, the target audit labels, and the matching scores corresponding to the target audit labels.

Inventors

ZHAO WENLONG
SUN XU
LI XIAOBO

Assignees

Alipay (Hangzhou) Digital Service Technology Co., Ltd. (支付宝(杭州)数字服务技术有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-11

Claims (16)

1. A method of auditing video content, comprising: acquiring first multimodal data of a video to be audited, wherein the first multimodal data comprises speech recognition text and key frame images screened from the video to be audited by a preset model whose number of model parameters is smaller than a preset number of parameters; extracting multimodal features of the first multimodal data; retrieving, based on retrieval-augmented generation and according to the multimodal features, a plurality of target audit labels from a rule knowledge base, the target audit labels being used to determine whether the video to be audited is in violation; calculating a matching score between the multimodal features and each target audit label by using a first multimodal large language model; and obtaining a first audit result by using a video audit model based on the first multimodal data, the target audit labels, and the matching scores.
2. The auditing method according to claim 1, wherein extracting the multimodal features of the first multimodal data specifically comprises: performing visual content analysis on the key frame images to obtain image description information of the key frame images; identifying text entities in the speech recognition text to obtain text entity information; performing sentiment analysis on the speech recognition text to obtain text emotion information; and fusing the image description information, the text entity information, and the text emotion information to obtain the multimodal features.
3. The auditing method according to claim 1, wherein retrieving, based on retrieval-augmented generation and according to the multimodal features, the plurality of target audit labels from the rule knowledge base specifically comprises: calculating a similarity between a feature vector of the multimodal features and each pre-stored rule vector in the rule knowledge base; determining, based on the similarity, the audit labels corresponding to the pre-stored rule vectors that meet a preset requirement as candidate audit labels; generating an enhanced prompt based on the multimodal features and the candidate audit labels; and inputting the enhanced prompt into a second multimodal large language model to obtain the plurality of target audit labels output by the second multimodal large language model.
4. The auditing method according to claim 1, wherein the first multimodal data further comprises at least one of a video title, author information, and a topic tag of the video to be audited.
5. The auditing method according to claim 2, wherein calculating the matching score between the multimodal features and each target audit label by using the first multimodal large language model specifically comprises: generating a matching-score calculation prompt based on the multimodal features and the target audit labels, wherein the prompt comprises the multimodal features, the target audit labels, and task requirement information, and the task requirement information instructs the first multimodal large language model to calculate the matching score between the multimodal features and each target audit label based on the image description information, the text entity information, and the text emotion information; and inputting the matching-score calculation prompt into the first multimodal large language model to obtain the matching score corresponding to each target audit label output by the first multimodal large language model.
6. The auditing method according to claim 1, wherein obtaining the first audit result by using the video audit model based on the first multimodal data, the target audit labels, and the matching scores corresponding to the target audit labels specifically comprises: generating, based on the first multimodal data, the target audit labels, and the matching scores, audit-result generation prompt information used to generate an audit result; and inputting the audit-result generation prompt information into the video audit model to obtain the first audit result output by the video audit model and first chain-of-thought information describing the analysis process by which the first audit result is obtained from the first multimodal data, wherein the first chain-of-thought information specifically comprises annotation information of target key frame images related to the first audit result and annotation information of text segments related to the first audit result.
7. The method of auditing video content according to claim 1, wherein the first audit result specifically comprises a first violation determination label and a confidence score, the first violation determination label indicates whether the video to be audited passes the audit, fails the audit, or requires manual review, and the confidence score indicates the confidence of the video audit model in the first audit result.
8. The method of auditing video content according to claim 7, further comprising, after obtaining the first audit result by using the video audit model: if the confidence score is lower than a preset score, triggering a manual review process; and if the confidence score is higher than the preset score and the first violation determination label indicates that the audit is failed, performing an interception operation on the video to be audited.
9. The auditing method according to claim 1, wherein the training process of the video audit model comprises a knowledge distillation training phase, and the knowledge distillation training phase specifically comprises: acquiring first training data, wherein the first training data comprises second multimodal data of a first historical video to be audited and a training audit label corresponding to the first historical video to be audited; obtaining a sample audit result of the first historical video to be audited and sample chain-of-thought information describing the analysis process by which the sample audit result is obtained from the second multimodal data; inputting the first training data into the video audit model to obtain a first output of the video audit model, wherein the first output comprises a second audit result and second chain-of-thought information; calculating a first loss of the first output by taking the sample audit result and the sample chain-of-thought information as a first supervision signal; inputting the sample audit result and the sample chain-of-thought information into the video audit model to obtain a second output of the video audit model, wherein the second output comprises a predicted audit label; calculating a second loss of the second output by taking the training audit label as a second supervision signal; and adjusting model parameters of the video audit model based on the first loss and the second loss.
10. The auditing method according to claim 9, wherein acquiring the first training data specifically comprises: acquiring original training data; performing data cleaning on the original training data to obtain cleaned original data; inputting the cleaned original data into a third multimodal large language model to obtain a third audit result output by the third multimodal large language model; determining, as target training data, the data in the cleaned original data for which the third audit result is consistent with a manual audit result; and generating the first training data based on the target training data.
11. The auditing method according to claim 9, wherein the training process of the video audit model further comprises a reinforcement learning training phase, and the reinforcement learning training phase specifically comprises: acquiring second training data, wherein the second training data comprises third multimodal data of a second historical video to be audited, and the generation time of the second training data is later than that of the first training data; inputting the second training data into the video audit model to obtain a third output of the video audit model, wherein the third output comprises a fourth audit result and third chain-of-thought information; calculating a reward value based on the third output and a reference output, wherein the reference output comprises a fifth audit result; and updating model parameters of the video audit model by using a policy gradient algorithm based on the reward value.
12. The auditing method according to claim 11, wherein calculating the reward value based on the third output and the reference output specifically comprises: calculating a first reward value based on the semantic similarity between the fourth audit result and the fifth audit result; calculating a second reward value based on the third chain-of-thought information, wherein the second reward value quantifies the quality of the third chain-of-thought information; calculating a third reward value based on the format conformity of the third output; and performing a weighted summation of the first reward value, the second reward value, and the third reward value to obtain the reward value.
13. The auditing method according to claim 11, further comprising: acquiring a newly added video audit rule, wherein the newly added video audit rule comprises an audit rule determined according to compliance requirements; encoding the newly added video audit rule to obtain a newly added rule vector; and adding the newly added rule vector to the rule knowledge base.
14. The auditing method according to claim 13, further comprising, after adding the newly added rule vector to the rule knowledge base: reconstructing a reward function of the reinforcement learning training phase based on the newly added video audit rule; and training the video audit model, based on a policy optimization algorithm, by using the reconstructed reward function and training data related to the newly added video audit rule.
15. An auditing apparatus for video content, comprising: an acquisition module, configured to acquire first multimodal data of a video to be audited, wherein the first multimodal data comprises speech recognition text and key frame images screened from the video to be audited by a preset model whose number of model parameters is smaller than a preset number of parameters; a feature extraction module, configured to extract multimodal features of the first multimodal data; a retrieval module, configured to retrieve, according to the multimodal features, a plurality of target audit labels from a rule knowledge base, the target audit labels being used to determine whether the video to be audited is in violation; a matching score calculation module, configured to calculate a matching score between the multimodal features and each target audit label by using a first multimodal large language model; and an auditing module, configured to obtain a first audit result by using a video audit model based on the first multimodal data, the target audit labels, and the matching scores.
16. A computing device, comprising: a memory and a processor; wherein the memory is configured to store a computer program or instructions, and the processor is configured to execute the computer program or instructions, which, when executed by the processor, implement the steps of the method of any one of claims 1 to 14.
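
Illustrative note (not part of the claims): claims 8 and 12 each reduce to a small numeric rule, and the following Python sketch shows one possible reading of them. The label encodings, the threshold of 0.8, the weights (0.5, 0.3, 0.2), and all function and class names are hypothetical; the claims only speak of a "preset score" and a "weighted summation" without giving values. The claims also leave open what happens when the confidence equals the preset score, so the sketch arbitrarily treats that case as high confidence.

```python
from dataclasses import dataclass

# Hypothetical encodings; claim 7 names three outcomes but not how they are stored.
PASS, FAIL, MANUAL_REVIEW = "pass", "fail", "manual_review"


@dataclass
class AuditResult:
    label: str         # violation determination label
    confidence: float  # model confidence in the label, assumed to lie in [0, 1]


def route(result: AuditResult, threshold: float = 0.8) -> str:
    """Post-audit routing in the spirit of claim 8 (threshold value is assumed)."""
    if result.confidence < threshold:
        return "manual_review"
    if result.label == FAIL:
        return "intercept"
    # Claim 8 does not cover the remaining cases; this sketch defers to the label.
    return "manual_review" if result.label == MANUAL_REVIEW else "release"


def total_reward(r_result: float, r_cot: float, r_format: float,
                 weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three reward terms of claim 12 (weights are assumed)."""
    w1, w2, w3 = weights
    return w1 * r_result + w2 * r_cot + w3 * r_format
```

For example, `route(AuditResult(FAIL, 0.95))` selects interception, while the same label at confidence 0.5 is sent to manual review.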

Description

Method, device and equipment for auditing video content

Technical Field

The present disclosure relates to the field of video auditing technologies, and in particular to a method, an apparatus, and a device for auditing video content.

Background

With the explosive growth of multimedia content such as short videos and live streaming, the volume of video in cyberspace is rising exponentially, and video content auditing has become a key link in safeguarding the network environment, maintaining good order, and ensuring compliance with laws and regulations. Current video content auditing technology has a number of limitations and struggles to meet the real-time and accuracy requirements of auditing massive volumes of content. Based on this, how to provide a video auditing method that overcomes these limitations is a technical problem to be solved.

Disclosure of Invention

In view of this, one or more embodiments of the present disclosure provide a method, an apparatus, and a device for auditing video content, so as to improve the efficiency, accuracy, and real-time performance of video auditing, enhance the ability of the audit model to adapt to diverse violating content, and reduce the risks of missed audits and misjudgments.
According to a first aspect of one or more embodiments of the present specification, there is provided a method of auditing video content, comprising: acquiring first multimodal data of a video to be audited, wherein the first multimodal data comprises speech recognition text and key frame images screened from the video to be audited by a preset model whose number of model parameters is smaller than a preset number of parameters; extracting multimodal features of the first multimodal data; retrieving, based on retrieval-augmented generation and according to the multimodal features, a plurality of target audit labels from a rule knowledge base, the target audit labels being used to determine whether the video to be audited is in violation; calculating a matching score between the multimodal features and each target audit label by using a first multimodal large language model; and obtaining a first audit result by using a video audit model based on the first multimodal data, the target audit labels, and the matching scores.
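As a rough illustration of the retrieval step, the following Python sketch compares a feature vector with pre-stored rule vectors and keeps the best-matching audit labels as candidates. It is a minimal sketch under stated assumptions: cosine similarity, the top-k cutoff, the minimum similarity of 0.5, the toy three-dimensional vectors, and all names are hypothetical, since the specification only requires a similarity and a "preset requirement". The subsequent prompt construction and large language model call are omitted.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_labels(feature: list[float],
                     rule_base: dict[str, list[float]],
                     top_k: int = 2,
                     min_sim: float = 0.5) -> list[tuple[str, float]]:
    """Return up to top_k (label, similarity) pairs whose similarity reaches min_sim."""
    scored = [(label, cosine(feature, vec)) for label, vec in rule_base.items()]
    kept = [pair for pair in scored if pair[1] >= min_sim]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy rule vectors for illustration only; a real system would use encoder embeddings.
rules = {
    "violence": [1.0, 0.0, 0.0],
    "gambling": [0.0, 1.0, 0.0],
    "fraud": [0.0, 0.7, 0.7],
}
hits = candidate_labels([0.1, 0.9, 0.4], rules)
```

With these toy vectors, "violence" falls below the similarity floor and is dropped, so only the two remaining labels would be passed on as candidates for the enhanced prompt.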
According to a second aspect of one or more embodiments of the present specification, there is provided an auditing apparatus for video content, comprising: an acquisition module, configured to acquire first multimodal data of a video to be audited, wherein the first multimodal data comprises speech recognition text and key frame images screened from the video to be audited by a preset model whose number of model parameters is smaller than a preset number of parameters; a feature extraction module, configured to extract multimodal features of the multimodal data of the video to be audited; a retrieval module, configured to retrieve, according to the multimodal features, a plurality of target audit labels from a rule knowledge base, the target audit labels being used to determine whether the video to be audited is in violation, wherein the rule knowledge base is constructed based on a retrieval-augmented generation method; a matching score calculation module, configured to calculate a matching score between the multimodal features and each target audit label by using a first multimodal large language model; and an auditing module, configured to obtain a first audit result by using a video audit model based on the multimodal data, the target audit labels, and the matching scores corresponding to the target audit labels.
According to a third aspect of one or more embodiments of the present specification, there is provided a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of auditing video content when executing the computer instructions.
One or more embodiments of this specification have the following beneficial effects. First multimodal data comprising key frame images and speech recognition text of the video to be audited is acquired, and multimodal features of the first multimodal data are extracted. Based on retrieval-augmented generation, a plurality of target audit labels used to determine whether the video to be audited is in violation are then retrieved from a rule knowledge base according to the multimodal features, and a matching score between the multimodal features and each target audit label is calculated by using a first multimodal large language model. A first audit result is then obtained by using a video audit model based on the first multimodal data, the target audit labels, and the matching scores corresponding to the target audit labels. In the embodiments of this specification, visual and speech bimodal information is fused to audit the video to be audited,