CN-115527140-B - Sensitive content detection method, device, electronic equipment and storage medium
Abstract
The invention provides a sensitive content detection method, an apparatus, an electronic device and a storage medium, and relates to the technical field of content security. The method comprises: acquiring a video to be detected; inputting the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model; inputting the video to be detected into an auditory element analysis model to obtain an auditory element result and an auditory feature output by the auditory element analysis model; inputting the visual feature and the auditory feature into an event detection model, which outputs an event detection result representing whether sensitive content is contained; matching the visual element result and the auditory element result against a sensitive content rule base to output a sensitive event type; and determining the sensitive content detection result by combining the event detection result and the sensitive event type. The invention enables comprehensive detection of sensitive video content and improves detection flexibility and accuracy.
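The two-branch pipeline described above (learned event detection plus rule-base matching, combined into one verdict) can be sketched as follows. This is a minimal illustration, not the patented implementation: all model callables and the `rule_base.match` interface are hypothetical stand-ins, since the patent does not specify concrete architectures.

```python
# Sketch of the detection pipeline from the abstract. Every model object
# here is a hypothetical placeholder supplied by the caller.

def detect_sensitive_content(video, visual_model, auditory_model,
                             event_model, rule_base):
    """Combine model-based event detection with rule-based matching."""
    # Step 1: element analysis yields symbolic results plus features.
    visual_result, visual_feat = visual_model(video)
    auditory_result, auditory_feat = auditory_model(video)
    # Step 2: fused features drive a binary sensitive-event detector.
    event_detected = event_model(visual_feat, auditory_feat)
    # Step 3: symbolic results are matched against the rule base to
    # name the sensitive event type (None if no rule hits).
    event_type = rule_base.match(visual_result, auditory_result)
    # Step 4: the final verdict combines both signals.
    is_sensitive = event_detected or event_type is not None
    return {"sensitive": is_sensitive, "event_type": event_type}
```

Either branch alone can flag a video: the learned detector covers events it was trained on, while the rule base covers newly defined element combinations.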
Inventors
- LI YANGXI
- PENG CHENGWEI
- LIU KEDONG
- MIAO YANAN
- WANG PEI
- HU WEIMING
- LI BING
- LIU YUFAN
- WANG JIAN
Assignees
- 国家计算机网络与信息安全管理中心
- 中国科学院自动化研究所
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-07-04
Claims (10)
- 1. A sensitive content detection method, comprising: acquiring a video to be detected; inputting the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model, wherein the visual element result comprises a first visual element result, a second visual element result and a third visual element result, the first visual element result representing person visual elements and object visual elements in the video to be detected, the second visual element result representing the positional relationship between persons and objects in the video to be detected, and the third visual element result representing the interaction relationships between persons and objects and between persons in the video to be detected; inputting the video to be detected into an auditory element analysis model to obtain an auditory element result and an auditory feature output by the auditory element analysis model, wherein the auditory element result comprises a first auditory element result and a second auditory element result, the first auditory element result representing person auditory elements, object auditory elements and environment elements in the video to be detected, and the second auditory element result representing the sound source positions of persons and objects in the video to be detected; inputting the visual feature and the auditory feature into an event detection model, which outputs an event detection result representing whether sensitive content is contained; and matching the visual element result and the auditory element result against a sensitive content rule base, outputting a sensitive event type, and determining a sensitive content detection result by combining the event detection result and the sensitive event type.
- 2. The sensitive content detection method according to claim 1, wherein inputting the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model comprises: inputting the video to be detected into a visual classification model of the visual element analysis model, and outputting a first visual element result and a corresponding first visual feature; inputting the video to be detected into a detection model of the visual element analysis model, and outputting a second visual element result and a corresponding second visual feature; inputting the video to be detected into an interaction model of the visual element analysis model, and outputting a third visual element result and a corresponding third visual feature; determining the visual element result based on the first visual element result, the second visual element result and the third visual element result; and determining the visual feature based on the first visual feature, the second visual feature and the third visual feature.
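Claim 2's three-branch visual analysis can be sketched as below. The three sub-models are hypothetical callables, and concatenation is only one illustrative way to merge the per-branch features; the patent does not fix a merge operation.

```python
def analyze_visual_elements(video, classifier, detector, interaction_model):
    """Run the three visual sub-models of claim 2 and merge their outputs.

    All three callables are placeholders returning (result, feature) pairs.
    """
    r1, f1 = classifier(video)         # person/object visual elements
    r2, f2 = detector(video)           # person-object positional relations
    r3, f3 = interaction_model(video)  # person-object / person-person interactions
    visual_result = {"elements": r1, "positions": r2, "interactions": r3}
    visual_feature = f1 + f2 + f3      # feature merge by concatenation (assumed)
    return visual_result, visual_feature
```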
- 3. The sensitive content detection method according to claim 1, wherein inputting the video to be detected into an auditory element analysis model to obtain an auditory element result and an auditory feature output by the auditory element analysis model comprises: inputting the video to be detected into an auditory classification model of the auditory element analysis model, and outputting a first auditory element result and a corresponding first auditory feature; inputting the video to be detected into a positioning model of the auditory element analysis model, and outputting a second auditory element result and a corresponding second auditory feature; determining the auditory element result based on the first auditory element result and the second auditory element result; and determining the auditory feature based on the first auditory feature and the second auditory feature.
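Claim 3's two-branch auditory analysis mirrors the visual case. Again, both sub-models are hypothetical callables and concatenation is an assumed merge choice.

```python
def analyze_auditory_elements(video, audio_classifier, localizer):
    """Run the two auditory sub-models of claim 3 and merge their outputs.

    Both callables are placeholders returning (result, feature) pairs.
    """
    r1, f1 = audio_classifier(video)  # person/object/environment sounds
    r2, f2 = localizer(video)         # sound-source positions
    auditory_result = {"elements": r1, "positions": r2}
    auditory_feature = f1 + f2        # feature merge by concatenation (assumed)
    return auditory_result, auditory_feature
```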
- 4. The sensitive content detection method according to claim 1, wherein inputting the visual feature and the auditory feature into an event detection model and outputting an event detection result representing whether sensitive content is contained comprises: inputting the visual feature into a visual feature processing model of the event detection model, and outputting a visual fusion feature; inputting the auditory feature into an auditory feature processing model of the event detection model, and outputting an auditory fusion feature; and inputting the visual fusion feature and the auditory fusion feature into a multi-modal fusion model of the event detection model, and outputting the event detection result representing whether sensitive content is contained.
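A toy version of claim 4's three stages is sketched below: per-modality fusion, then multi-modal fusion, then a binary decision. Mean pooling for per-modality fusion and a linear scoring layer are illustrative assumptions; the patent leaves these models unspecified.

```python
def detect_event(visual_feats, auditory_feats, weights, bias=0.0,
                 threshold=0.5):
    """Claim-4 style sketch: pool each modality's feature vectors, then
    score the concatenated fusion features with one linear layer.

    visual_feats / auditory_feats: lists of equal-length feature vectors.
    weights: linear weights over the concatenated fused vector (assumed).
    """
    def mean_pool(feature_list):
        # Fuse a list of vectors by element-wise mean (assumed fusion op).
        n = len(feature_list)
        return [sum(col) / n for col in zip(*feature_list)]

    v_fused = mean_pool(visual_feats)    # visual feature processing model
    a_fused = mean_pool(auditory_feats)  # auditory feature processing model
    fused = v_fused + a_fused            # multi-modal fusion by concatenation
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return score > threshold             # event detection result (boolean)
```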
- 5. The sensitive content detection method according to claim 1, wherein matching the visual element result and the auditory element result against a sensitive content rule base, outputting a sensitive event type, and determining a sensitive content detection result by combining the event detection result and the sensitive event type comprises: arbitrarily combining the visual element results and matching them against visual element rules in the sensitive content rule base, and outputting a first sensitive event type corresponding to each hit visual element rule; arbitrarily combining the auditory element results and matching them against auditory element rules in the sensitive content rule base, and outputting a second sensitive event type corresponding to each hit auditory element rule; arbitrarily combining the visual element results and the auditory element results and matching them against cross element rules in the sensitive content rule base, and outputting a third sensitive event type corresponding to each hit cross element rule; aggregating the first sensitive event type, the second sensitive event type and the third sensitive event type to obtain the sensitive event type corresponding to the video to be detected; and combining the event detection result and the sensitive event type to determine the sensitive content detection result corresponding to the video to be detected.
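The "arbitrarily combining and matching" step of claim 5 can be read as testing every non-empty subset of detected elements against rules keyed by element sets. The sketch below assumes that reading and a rule base shaped as a dict from `frozenset` keys to event types; both are illustrative choices, not specified in the patent.

```python
from itertools import combinations

def match_rules(elements, rules):
    """Match every non-empty combination of detected elements against a
    rule base mapping element sets (frozensets) to sensitive event types.

    Returns the set of all hit event types.
    """
    hits = set()
    for k in range(1, len(elements) + 1):
        for combo in combinations(elements, k):
            event = rules.get(frozenset(combo))
            if event is not None:
                hits.add(event)
    return hits
```

The same function serves visual-only, auditory-only, and cross-modal matching: pass it the visual elements, the auditory elements, or their union, against the corresponding rule set.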
- 6. The sensitive content detection method according to any one of claims 1 to 5, further comprising: acquiring a new sensitive event, and inputting videos corresponding to the new sensitive event into the visual element analysis model and the auditory element analysis model respectively, to obtain a new visual element result and a new visual feature output by the visual element analysis model and a new auditory element result and a new auditory feature output by the auditory element analysis model; determining a new visual element rule, a new auditory element rule and a new cross element rule corresponding to the new sensitive event based on the new visual element result and the new auditory element result; and determining the sensitive event type as the sensitive content detection result in the case that the video to be detected hits at least one of the new visual element rule, the new auditory element rule and the new cross element rule.
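The extension mechanism of claim 6 — registering a new sensitive event as rules derived from its element results, with no model retraining — can be sketched with a minimal rule base. The set-keyed rule layout and the subset-based match are assumptions for illustration.

```python
class SensitiveRuleBase:
    """Minimal rule base for claim 6: a new sensitive event contributes a
    visual rule, an auditory rule, and a cross rule, without retraining."""

    def __init__(self):
        self.rules = {}  # frozenset of elements -> sensitive event type

    def add_event(self, event_type, visual_elements, auditory_elements):
        # Register visual-only, auditory-only, and cross-modal rules
        # extracted from the new event's element results.
        v = frozenset(visual_elements)
        a = frozenset(auditory_elements)
        self.rules[v] = event_type       # new visual element rule
        self.rules[a] = event_type       # new auditory element rule
        self.rules[v | a] = event_type   # new cross element rule

    def match(self, elements):
        # A rule hits when all of its elements appear in the video.
        key = frozenset(elements)
        return [event for rule, event in self.rules.items() if rule <= key]
```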
- 7. A sensitive content detection apparatus, comprising: an acquisition module, configured to acquire a video to be detected; a visual element analysis module, configured to input the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model, wherein the visual element result comprises a first visual element result, a second visual element result and a third visual element result, the first visual element result representing person visual elements and object visual elements in the video to be detected, the second visual element result representing the positional relationship between persons and objects in the video to be detected, and the third visual element result representing the interaction relationships between persons and objects and between persons in the video to be detected; an auditory element analysis module, configured to input the video to be detected into an auditory element analysis model to obtain an auditory element result and an auditory feature output by the auditory element analysis model, wherein the auditory element result comprises a first auditory element result and a second auditory element result; an event detection module, configured to input the visual feature and the auditory feature into an event detection model and output an event detection result representing whether sensitive content is contained; and a multi-line reasoning module, configured to match the visual element result and the auditory element result against a sensitive content rule base, output a sensitive event type, and determine a sensitive content detection result by combining the event detection result and the sensitive event type.
- 8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the sensitive content detection method according to any one of claims 1 to 6.
- 9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the sensitive content detection method according to any one of claims 1 to 6.
- 10. A computer program product comprising a computer program which, when executed by a processor, implements the sensitive content detection method according to any one of claims 1 to 6.
Description
Sensitive content detection method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of content security technologies, and in particular to a sensitive content detection method and apparatus, an electronic device, and a storage medium.

Background

Watching video over a network or a television signal has become an indispensable part of people's daily lives, and people can watch any video available on the network or on television. Some of these videos involve sensitive content that is unsuitable for viewers, especially children and teenagers, so detecting whether a video contains sensitive content is very important. In the prior art, a number of security events are usually defined in advance, and a neural network is trained with sample videos containing those security events to obtain a plurality of detection models, which are then used to detect video content. However, each such detection model can detect only one type of security event, so the detection results for the video content are incomplete and the accuracy is low.

Disclosure of Invention

The invention provides a sensitive content detection method, a sensitive content detection apparatus, an electronic device and a storage medium, which overcome the incompleteness of detection results in the prior art, realize comprehensive detection of sensitive video content, and improve detection flexibility and accuracy.
The invention provides a sensitive content detection method, which comprises the following steps: acquiring a video to be detected; inputting the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model; inputting the video to be detected into an auditory element analysis model to obtain an auditory element result and an auditory feature output by the auditory element analysis model; inputting the visual feature and the auditory feature into an event detection model, and outputting an event detection result representing whether sensitive content is contained; and matching the visual element result and the auditory element result against a sensitive content rule base, outputting a sensitive event type, and determining a sensitive content detection result by combining the event detection result and the sensitive event type. According to the sensitive content detection method provided by the invention, inputting the video to be detected into a visual element analysis model to obtain a visual element result and a visual feature output by the visual element analysis model comprises the following steps: inputting the video to be detected into a visual classification model of the visual element analysis model, and outputting a first visual element result and a corresponding first visual feature, wherein the first visual element result represents person visual elements and object visual elements in the video to be detected; inputting the video to be detected into a detection model of the visual element analysis model, and outputting a second visual element result and a corresponding second visual feature, wherein the second visual element result represents the positional relationship between persons and objects in the video to be detected; inputting the video to be detected into an interaction model of the visual element analysis model, and outputting a third visual element result and a corresponding third visual feature, wherein the third visual element result represents the interaction relationships between persons and objects and between persons in the video to be detected; determining the visual element result based on the first visual element result, the second visual element result and the third visual element result; and determining the visual feature based on the first visual feature, the second visual feature and the third visual feature. According to the sensitive content detection method provided by the invention, inputting the video to be detected into the auditory element analysis model to obtain the auditory element result and the auditory feature output by the auditory element analysis model comprises the following steps: inputting the video to be detected into an auditory classification model of the auditory element analysis model, and outputting a first auditory element result and a corresponding first auditory feature, wherein the first auditory element result represents person auditory elements, object auditory elements and environment elements in the video to be detected; inputting the video to be detected into a positioning