CN-122020296-A - Detection method for implicitly toxic multimodal data and terminal device
Abstract
The invention discloses a detection method and a terminal device for implicitly toxic multimodal data, belonging to the technical field of toxicity data detection. It addresses the problems that existing models cannot effectively detect implicitly toxic content, cannot provide a clear explanation of it, and cannot quantify its degree of concealment. The method comprises the steps of: S1, obtaining a plurality of visual nodes from the input image of an input image-text pair and constructing a visual association tree of the input image; S2, obtaining a plurality of text nodes from the input text of the input image-text pair and constructing a text association tree of the input text; S3, constructing a cross-modal bipartite graph of the input image-text pair according to the visual association tree and the text association tree; and S4, matching each node pair in the cross-modal bipartite graph against all concept pairs in a set of toxic concept pairs, and determining a toxicity label of the input image-text pair according to the matching result. The method is used for detecting implicitly toxic multimodal data.
Inventors
- Wu Baoyuan
- Wu Guanzong
- Zhu Zihao
Assignees
- The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
Dates
- Publication Date: 20260512
- Application Date: 20260127
Claims (10)
- 1. A method for detecting implicitly toxic multimodal data, the method comprising: S1, acquiring a plurality of visual nodes from an input image of an input image-text pair, and constructing a visual association tree of the input image according to the plurality of visual nodes; S2, acquiring a plurality of text nodes from an input text of the input image-text pair, and constructing a text association tree of the input text according to the plurality of text nodes; S3, constructing a cross-modal bipartite graph of the input image-text pair according to the visual association tree and the text association tree, wherein the cross-modal bipartite graph comprises all node pairs consisting of a visual node and a text node; and S4, respectively matching each node pair in the cross-modal bipartite graph with all concept pairs in a set of toxic concept pairs, and determining a toxicity label of the input image-text pair according to the matching result.
- 2. The method according to claim 1, wherein S1 specifically comprises: S11, extracting an entity concept from the input image of the input image-text pair, taking the entity concept as the visual root node, and taking the visual root node as a visual parent node; S12, expanding a plurality of visual child nodes from the visual parent node, and creating edges pointing from the visual parent node to each visual child node; S13, calculating the edge transition probability from the visual parent node to each visual child node, and repeatedly executing S12 and S13 with the visual child nodes as visual parent nodes to obtain the visual association tree of the input image, wherein the visual nodes comprise the visual root node, all visual parent nodes, and all visual child nodes.
- 3. The method according to claim 2, wherein S2 specifically comprises: S21, extracting an entity concept from the input text of the input image-text pair, taking the entity concept as the text root node, and taking the text root node as a text parent node; S22, expanding a plurality of text child nodes from the text parent node, and creating edges pointing from the text parent node to each text child node; S23, calculating the edge transition probability from the text parent node to each text child node, and repeatedly executing S22 and S23 with the text child nodes as text parent nodes to obtain the text association tree of the input text, wherein the text nodes comprise the text root node, all text parent nodes, and all text child nodes.
- 4. The method according to claim 1, wherein determining the toxicity label of the input image-text pair according to the matching result in S4 specifically comprises: determining that the toxicity label of the input image-text pair is toxic when the matching result is that a toxic node pair exists, wherein a toxic node pair is a node pair in the cross-modal bipartite graph that matches any concept pair in the set of toxic concept pairs; and when the matching result is that no toxic node pair exists, determining that the toxicity label of the input image-text pair is nontoxic.
- 5. The method of claim 4, wherein after S4, the method further comprises: S5, when the toxicity label of the input image-text pair is toxic, calculating the toxicity concealment score of the input image-text pair according to the visual association tree and the text association tree.
- 6. The method according to claim 5, wherein calculating, in S5, the toxicity concealment score of the input image-text pair according to the visual association tree and the text association tree specifically comprises: calculating the joint transition probability of the toxic node pair according to the visual association tree and the text association tree, and taking the difference between 1 and the joint transition probability as the toxicity concealment score of the input image-text pair.
- 7. The method according to claim 6, wherein calculating the joint transition probability of the toxic node pair according to the visual association tree and the text association tree comprises: calculating, according to the visual association tree, the visual path cumulative probability from the visual root node to the toxic visual node of the toxic node pair, and calculating, according to the text association tree, the text path cumulative probability from the text root node to the toxic text node of the toxic node pair; and calculating the product of the visual path cumulative probability and the text path cumulative probability, and taking the product as the joint transition probability of the toxic node pair.
- 8. The method according to claim 7, wherein calculating, according to the visual association tree, the visual path cumulative probability from the visual root node to the toxic visual node of the toxic node pair specifically comprises: calculating a first continuous product of all edge transition probabilities on the path from the visual root node to the toxic visual node of the toxic node pair, and taking the first continuous product as the visual path cumulative probability; and wherein calculating, according to the text association tree, the text path cumulative probability from the text root node to the toxic text node of the toxic node pair specifically comprises: calculating a second continuous product of all edge transition probabilities on the path from the text root node to the toxic text node of the toxic node pair, and taking the second continuous product as the text path cumulative probability.
- 9. The method of claim 5, wherein after S5 the method further comprises: generating detection-process interpretation information using a large language model, according to the toxic node pair and the reasoning paths of the toxic node pair in the visual association tree and the text association tree.
- 10. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method for detecting implicitly toxic multimodal data according to any one of claims 1 to 9.
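The association-tree construction of claims 2 and 3 can be illustrated with a minimal Python sketch. This is not part of the claimed invention: the expansion function and the edge-probability function are assumed to be supplied externally (the description suggests model-based concept association), and all names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in a visual or text association tree (claims 2 and 3)."""
    concept: str
    # Each entry is (child Node, edge transition probability from this node).
    children: list = field(default_factory=list)

def build_association_tree(root_concept, expand_fn, prob_fn, max_depth=2):
    """Build an association tree from a root entity concept.

    expand_fn(concept) -> list of associated child concepts (hypothetical;
    in practice this could come from a large multimodal model).
    prob_fn(parent_concept, child_concept) -> edge transition probability.
    Expansion repeats with each child as the new parent, up to max_depth.
    """
    root = Node(root_concept)
    frontier = [(root, 0)]
    while frontier:
        parent, depth = frontier.pop()
        if depth >= max_depth:
            continue
        for child_concept in expand_fn(parent.concept):
            child = Node(child_concept)
            parent.children.append((child, prob_fn(parent.concept, child_concept)))
            frontier.append((child, depth + 1))
    return root
```

The same routine serves for both the visual tree (root extracted from the image) and the text tree (root extracted from the text); only the root concept and the externally supplied functions differ.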
Description
Detection Method for Implicitly Toxic Multimodal Data and Terminal Device
Technical Field
The invention relates to a detection method and a terminal device for implicitly toxic multimodal data, and belongs to the technical field of toxicity data detection.
Background
With the explosive growth of internet multimedia content, the use of multimodal data (especially combinations of images and text) to communicate harmful information has become an increasingly serious challenge. Multimodal content can express more complex and more ambiguous semantics than a single modality. Attackers often exploit this to create "implicitly toxic" content: the image and the text each appear completely benign or neutral on their own, and the malicious intent hidden behind them can only be understood when the recipient combines the two and makes an association with a particular cultural background, social knowledge, or line of logical reasoning. For example, a picture of a common industrial "flange" paired with the text "function undefined items, private chat" may, in a specific context, suggest illegal trading of weapon parts. Existing content-security detection models are deficient in identifying such highly covert toxic content because they rely primarily on explicit features extracted from a single modality and lack the ability to structurally infer deep cross-modal semantic associations. As a result, a large amount of hidden harmful content evades review, creating a potential social hazard. In addition, the "black box" nature of existing models makes their decision process opaque, so user trust is difficult to obtain and effective decision assistance cannot be provided to content auditors.
Disclosure of Invention
The invention provides a detection method and a terminal device for implicitly toxic multimodal data, which can solve the problems that existing models cannot effectively detect implicitly toxic content and have difficulty providing a clear explanation and quantifying the degree of concealment. In one aspect, the invention provides a method for detecting implicitly toxic multimodal data, the method comprising: S1, acquiring a plurality of visual nodes from the input image of an input image-text pair, and constructing a visual association tree of the input image according to the plurality of visual nodes; S2, acquiring a plurality of text nodes from the input text of the input image-text pair, and constructing a text association tree of the input text according to the plurality of text nodes; S3, constructing a cross-modal bipartite graph of the input image-text pair according to the visual association tree and the text association tree, wherein the cross-modal bipartite graph comprises all node pairs consisting of a visual node and a text node; and S4, respectively matching each node pair in the cross-modal bipartite graph with all concept pairs in a set of toxic concept pairs, and determining a toxicity label of the input image-text pair according to the matching result.
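Steps S3 and S4 above can be sketched in a few lines of Python. This is an illustrative assumption, not the patented implementation: node concepts are modeled as plain strings, and "matching" is modeled as exact equality against the toxic concept-pair set, which the document does not specify.

```python
def detect_toxicity(visual_nodes, text_nodes, toxic_concept_pairs):
    """S3: the cross-modal bipartite graph contains every (visual, text) node pair.
    S4: match each node pair against the set of toxic concept pairs; the
    image-text pair is labeled toxic if any toxic node pair exists.

    visual_nodes / text_nodes: concept strings collected from the two
    association trees; toxic_concept_pairs: a set of (visual, text) tuples.
    """
    node_pairs = [(v, t) for v in visual_nodes for t in text_nodes]
    toxic_node_pairs = [p for p in node_pairs if p in toxic_concept_pairs]
    label = "toxic" if toxic_node_pairs else "nontoxic"
    return label, toxic_node_pairs
```

The returned toxic node pairs are exactly the evidence that later steps (concealment scoring, explanation generation) operate on.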
Optionally, S1 specifically comprises: S11, extracting an entity concept from the input image of the input image-text pair, taking the entity concept as the visual root node, and taking the visual root node as a visual parent node; S12, expanding a plurality of visual child nodes from the visual parent node, and creating edges pointing from the visual parent node to each visual child node; S13, calculating the edge transition probability from the visual parent node to each visual child node, and repeatedly executing S12 and S13 with the visual child nodes as visual parent nodes to obtain the visual association tree of the input image, wherein the visual nodes comprise the visual root node, all visual parent nodes, and all visual child nodes.
Optionally, S2 specifically comprises: S21, extracting an entity concept from the input text of the input image-text pair, taking the entity concept as the text root node, and taking the text root node as a text parent node; S22, expanding a plurality of text child nodes from the text parent node, and creating edges pointing from the text parent node to each text child node; S23, calculating the edge transition probability from the text parent node to each text child node, and repeatedly executing S22 and S23 with the text child nodes as text parent nodes to obtain the text association tree of the input text, wherein the text nodes comprise the text root node, all text parent nodes, and all text child nodes.
Optionally, determining the toxicity label of the input image-text pair according to the matching result in S4 specifically comprises: determining that the toxicity label of the input image-text pair is toxic when the matching result is that a toxic node pair exists, wherein a toxic node pair is a node pair in the cross-modal bipartite graph that matches any concept pair in the set of toxic concept pairs; and when the matching result is that no toxic node pair exists, determining that the toxicity label of the input image-text pair is nontoxic.
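The concealment scoring of claims 6 to 8 reduces to simple arithmetic once the root-to-toxic-node paths are known, and can be sketched as follows. The inputs are assumed to be the lists of edge transition probabilities along the visual and text paths; extracting those paths from the trees is omitted here.

```python
import math

def path_cumulative_probability(edge_probs):
    """Continuous product of all edge transition probabilities on a
    root-to-toxic-node path (the 'first/second continuous product' of claim 8)."""
    return math.prod(edge_probs)

def concealment_score(visual_edge_probs, text_edge_probs):
    """Claims 6-7: the joint transition probability of a toxic node pair is the
    product of the visual and text path cumulative probabilities, and the
    toxicity concealment score is 1 minus that joint probability."""
    joint = (path_cumulative_probability(visual_edge_probs)
             * path_cumulative_probability(text_edge_probs))
    return 1.0 - joint
```

The intuition matches the document: the more low-probability association hops are needed to reach the toxic concepts, the smaller the joint transition probability and the higher the concealment score.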