CN-122022836-A - Method for detecting undisclosed sponsored posts (covert soft advertising) based on Truth-Default Theory and active learning

Abstract

The invention provides a method for detecting undisclosed sponsored posts (covert soft advertising) based on Truth-Default Theory and active learning. Built on an active learning framework and using Truth-Default Theory as psychological guidance for feature construction, the method combines the feature extraction capability of a large language model with the decision-making capability of a dual-branch lightweight classifier to detect undisclosed sponsored posts, and improves model performance through iterative human-machine collaboration within the active learning framework. Through the combined effect of theoretical modeling, feature representation, model structure and data acquisition strategy, the invention systematically addresses technical problems of existing undisclosed sponsored post detection techniques, such as insufficient interpretability, unstable few-sample performance, limited effectiveness of multi-dimensional feature fusion and high labeling cost.

Inventors

CHEN XUELONG
PAN JINCHAO
WANG ZIRUI
SU XIAOYAN
ZHANG LEI
LIU YANG
YE XIN

Assignees

Dalian University of Technology (大连理工大学)

Dates

Publication Date: 20260512
Application Date: 20260122

Claims (10)

1. A method for detecting undisclosed sponsored posts (USPs) based on Truth-Default Theory (TDT) and active learning, characterized by comprising the following steps:
Step S1: preprocessing input user-generated content (UGC) to obtain a plurality of UGC posts to be detected, constructing a UGC content data set, and dividing it into a labeled training data set and an unlabeled candidate data set, wherein each UGC post comprises three types of raw observable indicators, namely text indicators, numerical indicators and behavioral interaction indicators;
Step S2: mapping the raw observable indicators of each UGC post onto the five groups of trigger cues of Truth-Default Theory;
Step S3: constructing a global strength index for each group of trigger cues to serve as a knowledge anchor;
Step S4: using a pre-trained large model as a frozen semantic embedding extractor to encode the text content of the mapped trigger cues and generate a high-dimensional text embedding matrix;
Step S5: constructing a dual-branch multi-layer perceptron classifier whose branches respectively process the high-dimensional text embedding matrix and the structured numerical features, wherein the structured numerical features comprise the global strength indices, the numerical indicators and the numerical part of the behavioral interaction indicators;
Step S6: under an information-entropy-based active learning framework, calculating the entropy of each unlabeled sample in the unlabeled candidate data set, and selecting the samples with the highest entropy for manual labeling;
Step S7: adding the newly labeled data to the labeled training data set, and repeating steps S1-S6 to retrain the dual-branch multi-layer perceptron classifier until the model converges, thereby obtaining a trained model; and
Step S8: inputting the content to be detected into the trained model, and judging the content to be an undisclosed sponsored post if the prediction probability output by the model is greater than a preset threshold, and otherwise judging it to be genuine content.
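The iteration of steps S6 to S8 can be illustrated with a minimal sketch. It rests on several assumptions that are not in the claims: a logistic regression stands in for the dual-branch classifier, a fixed number of rounds replaces the convergence test, and the names `train_classifier`, `active_learning`, `oracle` and `THRESHOLD` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_classifier(X, y, lr=0.5, epochs=300):
    """Stand-in model (assumption): logistic regression fitted by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def predict_proba(model, X):
    w, b = model
    return sigmoid(X @ w + b)

def active_learning(X_lab, y_lab, X_pool, oracle, rounds=3, batch_size=2):
    """Retrain, pick the highest-entropy pool samples, label them, repeat."""
    for _ in range(rounds):  # assumption: fixed rounds instead of a convergence check
        if len(X_pool) == 0:
            break
        model = train_classifier(X_lab, y_lab)
        p = np.clip(predict_proba(model, X_pool), 1e-12, 1 - 1e-12)
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
        picked = np.argsort(-entropy)[:batch_size]
        # The oracle plays the role of the human annotators.
        X_lab = np.vstack([X_lab, X_pool[picked]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[picked])])
        X_pool = np.delete(X_pool, picked, axis=0)
    return train_classifier(X_lab, y_lab), X_lab, y_lab

# Synthetic data: 6 labeled posts, 34 unlabeled posts, 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
true_label = lambda rows: (rows[:, 0] > 0).astype(float)
model, X_lab, y_lab = active_learning(X[:6], true_label(X[:6]), X[6:], oracle=true_label)

THRESHOLD = 0.5  # hypothetical preset threshold for step S8
is_usp = bool(predict_proba(model, X[:1])[0] > THRESHOLD)
print(len(y_lab), is_usp)
```

In practice the stopping rule, batch size and threshold would be tuned on validation data; the values here are only placeholders.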
2. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 1, wherein in step S1 the raw observable indicators of each UGC post in the UGC content data set are obtained as follows:
collecting the title, body text, topic content, comments and tags of the post as text indicators;
collecting the numbers of likes, favorites, comments and topics of the post and the follower count and following count of the posting author, and additionally computing sentiment-difference indicators and similarity indicators between different parts of the post, as numerical indicators, wherein whether the post contains a link pointing to an external platform, whether it has been recommended by other platforms, and whether a specific brand or product name is explicitly mentioned in the comments are each encoded as a binary {0, 1} numerical indicator; and
collecting the interactions between commenters and the author, the author's homepage information, the subjects and topics of the author's past posts, and the total numbers of favorites and likes the author has received, as behavioral interaction indicators.
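The three binary {0, 1} indicators of this claim could be derived roughly as follows. This is a sketch under assumptions: any plain `http(s)` URL in the body is treated as an external link, brand matching is a case-insensitive substring test, and the cross-platform recommendation flag is assumed to arrive as platform metadata. The function and field names are hypothetical.

```python
import re

# Assumption: external links appear as plain http(s) URLs in the post body.
URL_PATTERN = re.compile(r"https?://\S+")

def binary_indicators(body, comments, brand_names, recommended_elsewhere):
    """Encode the three yes/no observations as {0, 1} numerical indicators."""
    has_external_link = int(bool(URL_PATTERN.search(body)))
    brand_in_comments = int(any(
        brand.lower() in comment.lower()
        for comment in comments
        for brand in brand_names
    ))
    return {
        "has_external_link": has_external_link,
        "recommended_elsewhere": int(bool(recommended_elsewhere)),
        "brand_in_comments": brand_in_comments,
    }

flags = binary_indicators(
    body="Full routine here: https://example.com/routine",
    comments=["Which serum is this?", "It's the AcmeGlow one"],
    brand_names=["AcmeGlow"],  # hypothetical brand list
    recommended_elsewhere=False,
)
print(flags)
```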
3. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 2, wherein in step S2 mapping the raw observable indicators onto the five groups of trigger cues comprises:
Step S21: dividing the TDT trigger cues into five groups, namely communication context and motivation, sender demeanor, third-party information, communication coherence, and correspondence information; and
Step S22: mapping the raw observable indicators onto the five groups of trigger cues according to a mapping relationship, covering both unstructured text indicators and structured numerical indicators.
4. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 3, wherein the mapping relationship between the raw observable indicators and the five groups of trigger cues is specifically as follows:
mapping the title, body text, topic content and comments of the post, together with the number of associated topics, to communication context and motivation;
mapping the author's homepage information, the numbers of likes, comments and favorites received by the post, the author's follower count and following count, and the numbers of favorites and likes the author has received historically, to sender demeanor;
mapping whether the post contains a link pointing to an external platform, whether it has been recommended by other platforms, and whether a specific brand or product name is explicitly mentioned in the comments, to third-party information;
extracting, from the text indicators of the post, similarity indicators between different parts of the post and mapping them to communication coherence, wherein the similarity indicators comprise the Jaccard similarity between different parts of the post and the content similarity among latent semantic topics, body text, titles and tags; and
extracting, from the text indicators of the post, sentiment-difference indicators between different parts of the post and mapping them to correspondence information, wherein the extraction comprises using SnowNLP to extract sentiment features from the text of the post and generate normalized sentiment polarity scores, and computing the sentiment difference between the text of the post and the topic to which it belongs.
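As an illustration of the Jaccard similarity named in this claim, the following sketch compares two text fields of a post. It assumes whitespace tokenization, which is a simplification: Chinese text would first need a word segmenter. The function name is hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two texts."""
    # Assumption: lowercase whitespace tokens stand in for a real tokenizer.
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty fields
    return len(set_a & set_b) / len(set_a | set_b)

title = "best budget moisturizer review"
body = "my honest review of a budget moisturizer"
print(jaccard_similarity(title, body))  # 3 shared tokens out of 8 distinct -> 0.375
```

A low similarity between title, body and tags would then feed the coherence group as one of its numerical indicators.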
5. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 4, wherein in step S3 constructing a global strength index for each group of trigger cues specifically comprises:
Step S31: performing weighted linear aggregation and normalization over all indicators in each trigger cue group to obtain the global strength index under each trigger dimension; and
Step S32: combining the global strength indices under the five trigger dimensions into a global strength vector, which complements the five groups of trigger cues obtained in step S2 so that together they represent the UGC post.
6. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning according to claim 5, wherein the global strength index under each trigger dimension is calculated as follows:
$$G_k = \mathrm{Norm}\left(\sum_{j=1}^{n_k} w_{k,j}\, x_{k,j}\right), \quad k \in \{1, \dots, 5\}$$
wherein $G_k$ represents the global strength index under the $k$-th trigger dimension; $n_k$ is the total number of indicators under the $k$-th trigger dimension; $x_{k,j}$ represents the $j$-th indicator of the $k$-th trigger dimension; $w_{k,j}$ is the corresponding weight; $\mathrm{Norm}(\cdot)$ denotes the normalization of step S31; and the global strength indices calculated under the five trigger dimensions form the global strength vector $G = [G_1, G_2, G_3, G_4, G_5]$.
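A minimal sketch of the weighted aggregation follows. The claim does not specify the normalization, so this example assumes that each indicator is already scaled to [0, 1] and that normalization means dividing the weights by their sum, which keeps each index in [0, 1]. The dictionary keys and all numeric values are hypothetical.

```python
import numpy as np

# Hypothetical short names for the five trigger dimensions.
TRIGGERS = ["context_motivation", "sender", "third_party", "consistency", "correspondence"]

def global_strength(indicators, weights):
    """Weighted linear aggregation of one cue group (assumed normalization: weights sum to 1)."""
    x = np.asarray(indicators, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w @ x)

def global_strength_vector(groups, weights):
    """Stack the five per-dimension indices into one vector."""
    return np.array([global_strength(groups[k], weights[k]) for k in TRIGGERS])

# Illustrative indicator values, assumed pre-scaled to [0, 1].
groups = {
    "context_motivation": [0.2, 0.8],
    "sender": [0.5, 0.5, 0.5],
    "third_party": [1.0, 0.0, 1.0],
    "consistency": [0.4, 0.6],
    "correspondence": [0.9],
}
weights = {k: [1.0] * len(v) for k, v in groups.items()}
weights["context_motivation"] = [1.0, 3.0]  # the second indicator counts three times as much

G = global_strength_vector(groups, weights)
print(G)
```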
7. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 6, wherein step S4 specifically comprises:
Step S41: using a pre-trained large model as a frozen semantic embedding extractor to produce high-dimensional semantic representations of the text content in the five groups of trigger cues, obtaining semantic embeddings; and
Step S42: stacking the semantic embeddings extracted from the five groups of trigger cues to form a text embedding matrix.
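The stacking in step S42 can be sketched as below. Because the claim does not name a specific model, a deterministic hash-seeded random vector stands in for the frozen encoder. It carries no semantic meaning and only illustrates the shapes involved; `stub_embed`, `EMBED_DIM` and the sample texts are hypothetical.

```python
import hashlib
import numpy as np

EMBED_DIM = 8  # tiny for illustration; real LLM embeddings are far larger

def stub_embed(text, dim=EMBED_DIM):
    """Placeholder for a frozen LLM encoder: deterministic, but NOT semantic."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def text_embedding_matrix(cue_texts):
    """Encode the text of each trigger cue group and stack the rows."""
    return np.stack([stub_embed(t) for t in cue_texts])

# One text snippet per trigger cue group (illustrative content).
cue_texts = [
    "title and body of the post",
    "author homepage description",
    "comments mentioning a brand",
    "title versus tags",
    "post text versus topic",
]
E = text_embedding_matrix(cue_texts)
print(E.shape)  # (5, 8)
```

Keeping the encoder frozen means only the lightweight classifier is trained, so the embedding for a given text can be computed once and cached.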
8. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 7, wherein in step S5 the dual-branch multi-layer perceptron classifier comprises a text MLP classifier branch and a numerical MLP classifier branch, and the processing flow comprises:
Step S51: inputting the text embedding matrix into the text MLP classifier branch and reducing the dimensionality of the high-dimensional text embedding to obtain a text hidden representation;
Step S52: inputting the global strength indices, the numerical indicators and the numerical part of the behavioral interaction indicators into the numerical MLP classifier branch and expanding their dimensionality to obtain a numerical hidden representation;
Step S53: performing dimension matching on the heterogeneous features of unbalanced dimensionality, mapping the text hidden representation and the numerical hidden representation into the same latent space for fusion; and
Step S54: passing the fused features through a fully connected layer to obtain the prediction probability that the post is an undisclosed sponsored post.
9. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 8, wherein in step S5 the specific flow of dual-branch feature processing and fusion is as follows:
$$h_t = \sigma\left(W_t\,\mathrm{vec}(E_i) + b_t\right)$$
$$h_n = \sigma\left(W_n\,(G_i \oplus N_i \oplus B_i) + b_n\right)$$
$$\hat{y}_i = \sigma\left(W_f\,(h_t \oplus h_n) + b_f\right)$$
wherein $h_t$ and $h_n$ respectively represent the text hidden representation and the numerical hidden representation; $\hat{y}_i$ is the prediction probability; $\mathrm{vec}(\cdot)$ represents a vectorization operation; $\sigma$ is an activation function; $W_t$ and $b_t$ are respectively the weight and bias of the text MLP classifier branch; $W_n$ and $b_n$ are respectively the weight and bias of the numerical MLP classifier branch; $W_f$ and $b_f$ are respectively the weight and bias of the fully connected layer; $\oplus$ represents vector concatenation; $E_i$ is the text embedding matrix of the $i$-th UGC post; $G_i$ is the global strength vector of the $i$-th UGC post; $N_i$ is the numerical indicators of the $i$-th UGC post; and $B_i$ is the numerical part of the behavioral interaction indicators of the $i$-th UGC post.
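A forward pass through such a dual-branch classifier might look like the NumPy sketch below. Several choices are assumptions: one hidden layer per branch, ReLU in the branches and a sigmoid on the output (the claim names only a generic activation function), random untrained weights, and arbitrary sizes. The parameter names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(text_dim, num_dim, hidden, rng):
    """Random, untrained weights; both branches map into the same hidden size."""
    return {
        "W_t": rng.normal(0, 0.1, (hidden, text_dim)), "b_t": np.zeros(hidden),
        "W_n": rng.normal(0, 0.1, (hidden, num_dim)), "b_n": np.zeros(hidden),
        "W_f": rng.normal(0, 0.1, (1, 2 * hidden)), "b_f": np.zeros(1),
    }

def forward(params, E, g, n, b):
    """Text branch + numerical branch -> concatenation -> fully connected output."""
    h_t = relu(params["W_t"] @ E.reshape(-1) + params["b_t"])  # flatten, then reduce
    numeric = np.concatenate([g, n, b])                        # structured features
    h_n = relu(params["W_n"] @ numeric + params["b_n"])        # expand to hidden size
    fused = np.concatenate([h_t, h_n])                         # shared latent space
    return float(sigmoid(params["W_f"] @ fused + params["b_f"])[0])

rng = np.random.default_rng(42)
E = rng.standard_normal((5, 8))  # text embedding matrix (5 cue groups x 8 dims)
g = rng.uniform(size=5)          # global strength vector
n = rng.uniform(size=4)          # numerical indicators
b = rng.uniform(size=2)          # numerical part of behavioral indicators

params = init_params(text_dim=E.size, num_dim=11, hidden=16, rng=rng)
prob = forward(params, E, g, n, b)
print(prob)
```

Mapping both branches to the same hidden width is one simple way to keep the 40-dimensional text input from swamping the 11 numerical features at the fusion step.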
10. The method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning of claim 9, wherein step S6 specifically comprises:
Step S61: under the information-entropy-based active learning framework, calculating the entropy $H(x)$ of each unlabeled sample $x \in U$ as a measure of prediction uncertainty, where $U$ represents the unlabeled candidate data set and $p(c \mid x)$ is the predicted probability of class $c$, according to the formula:
$$H(x) = -\sum_{c} p(c \mid x)\,\log p(c \mid x)$$
Step S62: setting the number $b$ of samples required per round, and selecting from $U$ the batch $Q$ of $b$ samples with the highest entropy for submission to manual labeling; and
Step S63: having human experts label the selected highest-entropy samples, and determining the final label of each sample by voting.
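Entropy-based selection and label voting can be sketched as follows for the two-class case. The function names are hypothetical, and resolving a tied vote as "genuine" is an assumption of this example, not something the claim states.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy of a two-class prediction with positive-class probability p."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_most_uncertain(probs, batch_size):
    """Indices of the batch_size pool samples with the highest entropy."""
    order = np.argsort(-binary_entropy(probs), kind="stable")
    return order[:batch_size]

def majority_vote(expert_labels):
    """Final label from several annotators; a tie falls back to 0 (assumption)."""
    return int(sum(expert_labels) * 2 > len(expert_labels))

pool_probs = [0.95, 0.52, 0.10, 0.45, 0.70]  # illustrative model outputs
batch = select_most_uncertain(pool_probs, batch_size=2)
print(batch.tolist())            # [1, 3]: the predictions closest to 0.5
print(majority_vote([1, 0, 1]))  # 1
```

Entropy peaks at a probability of 0.5, so the selection concentrates the annotation budget on posts the current model cannot separate.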

Description

Method for detecting undisclosed sponsored posts based on Truth-Default Theory and active learning

Technical Field

The invention relates to the field of deceptive marketing detection and online content governance, and in particular to a method for detecting undisclosed sponsored posts that is guided by a psychological theory, Truth-Default Theory (TDT), and combines a large language model with a lightweight model.

Background

With the rapid growth of internet and mobile users, user-generated content (UGC) platforms (e.g., Xiaohongshu, Facebook, TikTok) have flourished. Consumers voluntarily share opinions and reviews on these platforms; because of its spontaneous and self-expressive nature, UGC is regarded as a source of genuine consumer insight and has a significant influence on other consumers' purchase decisions. However, to evade regulation and consumers' resistance to advertising content, more and more brands cooperate with influencers to publish undisclosed sponsored posts (USPs), which hide commercial intent, pose as genuine user experience and embed commercial promotion in everyday narratives. Such posts are highly deceptive and seriously harm consumer interests and platform credibility.

Fake reviews, i.e., false, fabricated or misleading reviews about a particular product, are found mostly on e-commerce platforms. By contrast, USPs tend to embed hidden promotional intent in content that appears genuine, and they often create an illusion of "fairness" by comparing multiple brands. And whereas fake reviews are confined to review text, the deception in USPs may appear in posts, comments and even author information, for example notes that divert users to other platforms.
In addition, fake reviews of a given product are concentrated in one place and can be checked against the overall evaluations of other users, whereas information on UGC platforms is more dispersed. A user who wants to verify whether a post is a USP often has to search further, which increases cognitive load and makes erroneous decisions more likely. Thus USPs, which look like minor infractions, can gradually escalate into widespread suspicion and, over time, erode users' loyalty to platforms and brands.

With the development of technology, large language models (LLMs) have shown marked advantages in understanding content and capturing semantic cues, and offer strong potential for USP recognition. For example, JSDRV, a knowledge-driven prompting strategy, enhances the semantic reasoning and detection capabilities of LLMs by injecting domain knowledge, thereby improving performance in resource-constrained settings. The MiLk-FD model integrates knowledge graphs, LLM embeddings and a Transformer network to aggregate high-dimensional features across entities, topics and text, and demonstrates the advantages of multi-knowledge fusion in misinformation detection. These studies highlight the great potential of LLMs for identifying deceptive content, but they also expose long-standing challenges such as hallucination and insufficient interpretability, which limit their reliability in USP detection.

Specifically, existing USP detection methods for UGC platforms face the following key bottlenecks.

The first bottleneck is data scarcity and the dynamic evolution of products. Existing supervised learning models driven by static data typically require large-scale, high-quality labeled data, yet collecting and labeling USP data is expensive and time-consuming.
In addition, because USPs cover a wide and frequently updated range of product types, models trained on static data sets generalize poorly to newly emerging or previously unseen products and domains, so their performance degrades in a dynamically changing UGC environment.

The second bottleneck is users' truth-default bias. Studies have shown that when users encounter information they assume by default that it is truthful until deceptive cues trigger suspicion; this truth-default bias makes users more likely to believe USPs. Meanwhile, because information on UGC platforms is dispersed, verifying a suspected USP often requires extra effort, which increases cognitive load and makes erroneous decisions more likely.

The third bottleneck is the ambiguity of deception cues. The deception in USPs is often hidden across multi-dimensional cues (e.g., text content, numerical statistics, publisher behavior), which may be overlooked and are difficult to detect effectively