CN-121997338-A - Generative engine optimization cheating judgment system and method based on multidimensional feature analysis
Abstract
The invention discloses a generative engine optimization (GEO) cheating judgment system and method based on multidimensional feature analysis, relating to the technical fields of computer network security and artificial intelligence. The system comprises a data preprocessing module, a detection judgment module and a traceability alarm module. The data preprocessing module receives a user query and the original answer text generated by a large language model, and performs named entity recognition and linguistic feature extraction on the original answer text to obtain a standardized analysis object. The detection judgment module performs heuristic linguistic feature analysis on the standardized analysis object to identify language anomalies, and compares the AI recommendation ranking with real-world market data based on a statistical imbalance principle to output a quantifiable risk judgment result. The traceability alarm module integrates heuristic linguistic feature analysis, statistical imbalance difference calculation, adversarial tracing and automatic RAG evaluation, and further introduces an active adversarial probe reproduction mechanism, to form an auditable evidence chain.
Inventors
- YIN LEI
- SUN JUNFENG
Assignees
- 元聚变(上海)科技股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260123
Claims (8)
- 1. A generative engine optimization cheating judgment system based on multidimensional feature analysis, characterized by comprising a data preprocessing module, a detection judgment module and a traceability alarm module, wherein the data preprocessing module is used for receiving a user query and the original answer text generated by a large language model, and performing named entity recognition and linguistic feature extraction on the original answer text to obtain a standardized analysis object; the detection judgment module is used for performing heuristic linguistic feature analysis on the standardized analysis object to identify language anomalies, and comparing the AI recommendation ranking with real-world market data based on a statistical imbalance principle to output a quantifiable risk judgment result; and the traceability alarm module is used for performing automatic RAG evaluation on the generated content to obtain a faithfulness index and a citation density index, and for tracing the content source based on adversarial watermarks.
- 2. The generative engine optimization cheating judgment system based on multidimensional feature analysis according to claim 1, wherein:
  - the data preprocessing module comprises an input interface unit, an NLP preprocessing unit, an entity standardization unit, a classification unit and a comparison unit, wherein the input interface unit is used for receiving the user query and the original answer text generated by the large language model;
  - the detection judgment module comprises a semantic emphasis deviation detection unit, a parasitic citation detection unit, a fact island verification unit, a structured induction identification unit, an AI sound volume calculation unit, a benchmark data acquisition unit, an imbalance analysis engine and an active adversarial probe unit, wherein the semantic emphasis deviation detection unit is used for calculating the density ratio of evaluative adjectives to descriptive nouns in the description of a target entity and marking an anomaly according to the density ratio; the parasitic citation detection unit is used for parsing the domain name and path of each citation link, identifying citation fraud in which a link points to a user-generated-content sub-path under an authoritative domain name, and reducing the authority weight of that citation; the fact island verification unit is used for extracting key data assertions and cross-verifying them in third-party authoritative databases to mark hallucinated data and fabricated data; and the structured induction identification unit is used for identifying sandwich ranking in a list structure, detecting an abnormal conversion structure in which only the target brand is accompanied by a discount code and a purchase link, and detecting closed-loop fitting of the text's logical flow to the PAS model;
  - the traceability alarm module comprises an automatic RAG evaluation unit, an adversarial tracing unit and a visual output terminal, wherein the automatic RAG evaluation unit is used for constructing a golden dataset and calculating a faithfulness index and a citation density index; the adversarial tracing unit is used for embedding a highly transferable implicit watermark at the content generation stage or the original content publication stage, and for decoding the watermark after the content has been copied across platforms and rewritten, so as to trace the original content source; and the visual output terminal is used for displaying a bar chart comparing AI sound volume with real-world popularity, indicating the data-consistent state with a green shield and the data-imbalanced state with a red warning light, and providing hover prompts showing the ranking gap.
- 3. A generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 2, comprising the following steps:
  - S1, data access and preprocessing: receiving a user query and the original answer text generated by a large language model, performing named entity recognition on the original answer text to form an entity set, performing normalized mapping on the entity set, and extracting the linguistic features required by subsequent analysis;
  - S2, linguistic anomaly detection and probe reproduction: performing heuristic linguistic feature analysis on the original answer text to identify semantic emphasis deviation, parasitic citation and structured induction; constructing several groups of equivalent questions and constrained questions based on the user query and repeating the heuristic analysis on them; and aggregating the results of the repeated analyses into an anomaly evidence chain to improve the reproducibility and interpretability of the judgment;
  - S3, calculating the AI sound volume of each entity in the entity set and forming a context ranking; acquiring real-world overall market data through an external data interface and generating a market ranking; comparing the context ranking with the market ranking to output an imbalance suspicion index; and outputting a risk level according to a threshold;
  - S4, tracing and faithfulness verification: constructing a golden dataset as the reference ground truth, evaluating the faithfulness and citation density of the generated content, performing adversarial watermark decoding on the generated content to trace the content source and identify anonymous poisoning links, and storing the evaluation results and the tracing results in association for auditing;
  - S5, visual output and alarm: generating a chart comparing AI sound volume with real-world popularity, outputting a green safety mark or a red alarm mark according to the risk level, and outputting the ranking gap and risk interpretation information to prompt the user to adopt a corresponding handling strategy.
- 4. The generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 3, wherein S1 specifically comprises:
  - S1-1, the input interface unit receives the user query text and the original answer text generated by the large language model, and binds them into the same session record, wherein the session record includes at least the query text, the answer text, a receiving timestamp and an answer source identifier;
  - S1-2, an entity extraction step: named entity recognition is performed on the original answer text to obtain an entity set, wherein the named entity recognition takes the answer text as input and outputs entity candidates together with their position intervals in the text, and different written forms of the same entity are merged into the same normalized name through entity standardization rules;
  - S1-3, for each entity in the entity set, the fragments in which the entity occurs in the original answer text are extracted and a feature record is generated for subsequent semantic emphasis deviation detection, parasitic citation detection and structured induction identification, wherein the feature record includes at least the number of occurrences of the entity, the position of its first occurrence, the modifier fragments surrounding the entity, the set of citation links corresponding to the entity, and the paragraph and list structure information of the entity.
- 5. The generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 3, wherein S2 comprises:
  - S2-1, for the descriptive fragments of each entity, counting the number of evaluative adjectives and the number of descriptive nouns, and calculating the density ratio of the former to the latter, wherein evaluative adjectives are the category of words that apply subjective endorsement, strong recommendation or exaggerated modification to an entity, and descriptive nouns are the category of nouns that objectively describe a product category, function, technical point or service form; a larger density ratio indicates that the description leans more towards marketing embellishment; when the deviation of the target entity's density ratio relative to that of a competing entity exceeds a preset deviation threshold, the target entity is marked as abnormal in semantic emphasis deviation;
  - S2-2, parsing each citation link in the original answer text and extracting its domain name field and path field; when the domain name field belongs to a preset authoritative domain name set and the path field belongs to a user-generated-content sub-path set, marking the citation as a parasitic citation and reducing its authority weight for the subsequent fact island verification, wherein the authoritative domain name set identifies site domain names commonly regarded as high-credibility sources, and the user-generated-content sub-path set identifies the path types under which individual users publish content within those sites;
  - S2-3, extracting the key data assertions in the original answer text and cross-verifying them in third-party authoritative databases including Statista and Gartner; when a key data assertion exists only in a single citation source and cannot be supported in the cross-verification, marking it as hallucinated data or fabricated data;
  - S2-4, performing order analysis on the list structure; when the target brand is detected at a preset sandwich ranking position and only the target brand is accompanied by a discount code and a purchase link, marking the target brand as abnormal in structured induction, and further detecting closed-loop fitting of the text's logical flow to the PAS model to form an anomaly explanation; the sandwich ranking position refers to a specific positional structure in which the target brand is fixed between two kinds of comparison items among several adjacent items of a recommendation list, that is, the items immediately before and after the target brand belong respectively to a preset authoritative comparison item and a preset score comparison item, forming a seemingly objective comparison wrapper;
  - S2-5, generating a probe query set based on the user query, the probe query set comprising a semantically equivalent question subset and a constrained question subset, wherein the semantically equivalent question subset is a set of queries that rewrite only the form of expression while keeping the question intent, entity category and task goal of the original user query unchanged, so that every rewritten query is equivalent to the original query in its information need; the constrained question subset is a set of queries that, while keeping the question intent of the original user query unchanged, impose explicit constraints on the output of the generative model, the constraints including at least a neutral expression constraint, a verifiable source constraint, a constraint excluding conversion-guiding content, and a limited output range constraint; for each query in the probe query set, S2-1 to S2-4 are repeated and the anomaly marking result of each round is recorded; and the proportion of probe rounds in which the target entity is marked abnormal is calculated to characterize the reproducibility of the anomaly.
- 6. The generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 3, wherein S3 specifically comprises:
  - S3-1, for each entity in the entity set, calculating an AI sound volume from the frequency with which the entity is mentioned in the original answer text, a Boolean weight indicating whether the entity is the first recommendation, and a sentiment polarity score, using three preset weight coefficients, wherein the Boolean weight takes a first preset value when the entity is the first recommended entity in the answer and a second preset value otherwise, and the sentiment polarity score quantifies the strength of positive modification in the modifier fragments surrounding the entity;
  - S3-2, obtaining the sentiment polarity score by applying a sentiment analysis model to the text segments related to the entity: for each occurrence of the entity in the answer text, a context segment containing the entity is extracted and input into a pre-trained sentiment analysis model, and the difference between the segment's positive tendency and negative tendency is output as the sentiment polarity score of that occurrence; when the entity occurs several times in the answer text, the scores of the occurrences are averaged; and the sentiment polarity score is normalized and limited to the interval [-1, 1];
  - S3-3, ranking the entities by AI sound volume to obtain the context ranking, wherein a smaller value represents a higher rank.
- 7. The generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 6, wherein S3 further comprises:
  - S3-4, acquiring multi-source market signals related to the target entity through external data interfaces, the multi-source market signals including at least a search popularity signal, a website traffic signal and a social mention signal, wherein the search popularity signal is provided by a Google Trends interface and a SEMrush interface, the website traffic signal is provided by a SimilarWeb interface, and the social mention signal is provided by a Reddit mention rate scanning interface and a Twitter mention rate scanning interface, thereby forming a multi-source observation set in which each element is the raw observed value or observed sequence output by one data source for the target entity; to ensure comparability among different data sources, unified time window alignment is performed on each observation according to a preset market time window to obtain windowed observations, which eliminates ranking drift caused by inconsistent treatment of short-term spikes and long-term popularity; each windowed observation is mapped to a dimensionless contribution score, so that scale differences between sources do not affect the subsequent fusion ranking and the outputs of all sources are unified into the same comparable measurement domain; based on the dimensionless contribution scores of each data source, the candidate entity set is ranked per source to obtain a per-source ranking for each data source, which expresses market position under that data source's measurement, and the per-source rankings form a per-source ranking set; consistency fusion is performed on the per-source ranking set to obtain the market ranking, wherein the fusion rule adopts weighted robust aggregation, each data source has a weight coefficient, and the weight coefficients of all data sources sum to 1;
  - S3-5, constructing a prompt perturbation set based on the user's original query, wherein each perturbation prompt remains semantically consistent with the original query while differing in a controlled way in expression structure, constraint terms and expression style; the perturbation prompts are used to trigger multiple outputs of the generative model under unchanged semantics and changed expression, so as to evaluate the stability of the recommendation ranking; each perturbation prompt in the set is input into the generative model to obtain a corresponding answer text, and the context ranking of that round is generated according to the existing AI sound volume calculation and ranking rule, thereby forming a context ranking sequence, wherein each element is the context rank of the target entity under the perturbation prompt of the corresponding round, and a smaller value indicates a position nearer the front of that round's answer; robust aggregation is performed on the context ranking sequence with the median operator to obtain a stable context ranking, wherein the median is robust to extreme rounds and weakens the accidental bias triggered by a few prompts, so that the stable context ranking reflects the typical recommended position under consistent semantics and provides an estimate that is stable against prompt perturbation; and a GEO suspicion index is calculated from the stable context ranking and the market ranking to characterize the degree of divergence between the real-world position and the AI-recommended position, wherein a larger index indicates an imbalanced pattern in which the AI recommends the entity strongly but the market does not support it.
- 8. The generative engine optimization cheating judgment method based on multidimensional feature analysis according to claim 3, wherein S4 and S5 comprise:
  - S4-1, constructing a verified golden dataset as the reference ground truth, wherein the golden dataset consists of verifiable authoritative fact fragments and their corresponding sources;
  - S4-2, calculating a faithfulness index based on the golden dataset: the generated answer is split into several factual assertions, which are checked one by one against the retrieved supporting documents and the golden dataset; an assertion counts as faithful if consistent support is found in the supporting documents or the golden dataset, and as unfaithful otherwise; the faithfulness index is the ratio of the number of faithful assertions to the total number of factual assertions, and is used to measure whether the generated answer contains external knowledge not present in the retrieved documents, and thus to identify the hallucination risk caused by knowledge outside the evidence; and calculating a citation density index: the generated answer is divided into several output units, and each output unit is checked for an accompanying accessible citation link or locatable source identifier; a unit that has one counts as cited, and as uncited otherwise; the citation density index is the ratio of the number of cited output units to the total number of output units;
  - S4-3, embedding an implicit watermark at the content generation stage or the original content publication stage: when content is generated, several optional expression choices, such as synonym rewriting, word-order fine-tuning, punctuation and format variation, and low-level character pattern selection, are controlled according to a preset key, so that a group of statistically recognizable feature combinations is formed without changing the semantics of the text, the feature combination being the watermark; watermark decoding is performed on content that has been copied across platforms and rewritten, so as to trace the original content source and deter anonymous poisoning: the features are extracted from the text under test and matched against the feature template corresponding to the key, a matching statistic is calculated, the watermark is judged to be present when the statistic exceeds a preset threshold, and the content source is identified or its propagation path is tracked according to the recovered decodable implicit identification information;
  - S5-1, outputting a difference bar chart that compares AI sound volume with real-world popularity, outputting a green shield mark to indicate the data-consistent state, and outputting a red warning light mark to indicate the data-imbalanced state, wherein the difference bar chart visually compares the AI-side exposure intensity reflected by the AI sound volume with the market-side popularity intensity reflected by the real-world popularity;
  - S5-2, outputting hover prompt information on the ranking gap in the visual interface to present the gap between the context ranking and the market ranking, wherein the hover prompt includes at least the context ranking value, the market ranking value, and a risk interpretation statement corresponding to the imbalance index.
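The ratio-style indicators recited in steps S2-1, S2-2 and S4-2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the word lists, the `authority.example` domain, the sub-path prefixes and all function names are hypothetical placeholders, and flooring the noun count at 1 and returning 0.0 for empty inputs are assumptions made only so the functions stay defined.

```python
from urllib.parse import urlparse

# Hypothetical lexicons and site tables; the claims do not enumerate them.
EVALUATIVE_ADJECTIVES = {"best", "amazing", "unbeatable", "leading"}
DESCRIPTIVE_NOUNS = {"router", "antenna", "firmware", "warranty"}
AUTHORITY_DOMAINS = {"authority.example"}
UGC_PATH_PREFIXES = ("/users/", "/community/", "/blogs/")


def density_ratio(tokens):
    """Evaluative adjectives divided by descriptive nouns (step S2-1)."""
    adjectives = sum(1 for t in tokens if t.lower() in EVALUATIVE_ADJECTIVES)
    nouns = sum(1 for t in tokens if t.lower() in DESCRIPTIVE_NOUNS)
    # Assumption: floor the denominator at 1 so a noun-free fragment stays finite.
    return adjectives / max(nouns, 1)


def is_parasitic_citation(url):
    """True when an authoritative domain is cited via a user-generated sub-path (S2-2)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host in AUTHORITY_DOMAINS and parsed.path.startswith(UGC_PATH_PREFIXES)


def faithfulness(assertion_supported):
    """Faithful assertions divided by all factual assertions (S4-2).

    `assertion_supported` holds one boolean per assertion; how support is
    checked against the documents and the golden dataset is out of scope here.
    """
    if not assertion_supported:
        return 0.0  # Assumption: an answer without assertions scores 0.
    return sum(assertion_supported) / len(assertion_supported)


def citation_density(unit_has_citation):
    """Cited output units divided by all output units (S4-2)."""
    if not unit_has_citation:
        return 0.0  # Assumption: an empty answer scores 0.
    return sum(unit_has_citation) / len(unit_has_citation)
```

With the toy word lists above, `density_ratio("the best amazing router with unbeatable firmware".split())` counts three evaluative adjectives against two descriptive nouns and evaluates to 1.5. A real system would presumably rely on part-of-speech tagging and curated domain tables rather than fixed sets.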
Description
Generative engine optimization cheating judgment system and method based on multidimensional feature analysis
Technical Field
The invention relates to the technical fields of computer network security and artificial intelligence, and in particular to a generative engine optimization cheating judgment system and method based on multidimensional feature analysis.
Background
With the popularization of generative artificial intelligence, a new threat has emerged in the form of generative engine optimization (GEO): by combining high-dimensional soft-advertising implantation and data poisoning with the black-box nature of large language models, marketing copy is blended seamlessly with objective facts, so that users can hardly detect anything abnormal by reading alone. Against this background, a person skilled in the art faces at least the following technical problems: how to automatically identify hidden language anomalies and structured induction behaviors in generated text and form interpretable anomaly evidence, without relying on the internal parameters of the model and based only on the generated text and available external data interfaces; how to establish a quantifiable statistical imbalance index that measures the degree of divergence between the AI recommendation ranking and real-world market data, and to output a risk judgment result with a configurable threshold accordingly; how to verify the faithfulness and citation density of generated content and combine transferable adversarial watermarks to trace content and thereby curb anonymous poisoning; and how to output the judgment result visually and intuitively, so that users receive actionable early-warning feedback and the integrity of the digital information ecosystem is restored.
Disclosure of Invention
The invention aims to provide a generative engine optimization cheating judgment system and method based on multidimensional feature analysis, so as to solve the problems set out in the background above.
To solve the above technical problems, the invention provides a generative engine optimization cheating judgment system and method based on multidimensional feature analysis, comprising a data preprocessing module, a detection judgment module and a traceability alarm module. The data preprocessing module is used for receiving a user query and the original answer text generated by a large language model, and performing named entity recognition and linguistic feature extraction on the original answer text to obtain a standardized analysis object. The detection judgment module is used for performing heuristic linguistic feature analysis on the standardized analysis object to identify language anomalies, and comparing the AI recommendation ranking with real-world market data based on a statistical imbalance principle to output a quantifiable risk judgment result. The traceability alarm module is used for performing automatic RAG evaluation on the generated content to obtain a faithfulness index and a citation density index, tracing the content source based on adversarial watermarks, and outputting alarm marks and difference charts in visual form.
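The statistical imbalance comparison performed by the detection judgment module can be sketched as follows, using the quantities named in claims 6 and 7. The formulas themselves did not survive in this text, so everything concrete here is an assumption made for illustration: the linear weighted sum for AI sound volume, the default weights, the rank-gap form of the suspicion index, the threshold of 2, and all function names.

```python
from statistics import median


def ai_sound_volume(frequency, is_first, sentiment,
                    alpha=0.5, beta=0.3, gamma=0.2,
                    first_value=1.0, other_value=0.0):
    """Combine mention frequency, first-recommendation weight and sentiment.

    Assumption: a linear weighted sum; the source only states that three
    preset weight coefficients and two preset Boolean-weight values are used.
    """
    boolean_weight = first_value if is_first else other_value
    return alpha * frequency + beta * boolean_weight + gamma * sentiment


def rank_entities(scores):
    """Map each entity to its rank; rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def stable_context_rank(ranks_per_round):
    """Median of the target entity's rank over all perturbation rounds."""
    return median(ranks_per_round)


def suspicion_index(context_rank, market_rank):
    """Assumption: rank gap, positive when the AI ranks the entity above the market."""
    return market_rank - context_rank


def risk_level(index, threshold=2):
    """Assumption: one configurable threshold separates green from red."""
    return "red" if index >= threshold else "green"


if __name__ == "__main__":
    # Hypothetical target entity: ranked 1st, 1st and 3rd across three
    # perturbed prompts, but only 4th in the fused market ranking.
    stable = stable_context_rank([1, 1, 3])
    index = suspicion_index(stable, market_rank=4)
    print(stable, index, risk_level(index))
```

Taking the median rather than the mean follows the robustness argument in claim 7: a single outlying round (the rank of 3 above) does not move the stable rank. The "green" and "red" labels mirror the green shield and red warning light of the visual output.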
According to the above technical scheme, the data preprocessing module comprises an input interface unit, an NLP preprocessing unit and an entity standardization unit, wherein the input interface unit is used for receiving the user query and the original answer text generated by the large language model. The detection judgment module comprises a semantic emphasis deviation detection unit, a parasitic citation detection unit, a fact island verification unit, a structured induction identification unit, an AI sound volume calculation unit, a benchmark data acquisition unit, an imbalance analysis engine and an active adversarial probe unit, wherein the semantic emphasis deviation detection unit is used for calculating the density ratio of evaluative adjectives to descriptive nouns in the description of a target entity and marking an anomaly according to the density ratio; the parasitic citation detection unit is used for parsing the domain name and path of each citation link, identifying citation fraud in which a link points to a user-generated-content sub-path under an authoritative domain name, and reducing the authority weight of that citation; the fact island verification unit is used for extracting key data assertions and cross-verifying them in third-party authoritative databases to mark hallucinated data and fabricated data; and the structured induction identification unit is used for identifying sandwich ranking in a list structure, detecting an abnormal conversion structure in which only the target brand is accompanied by a discount code and a purchase link, and detecting closed-loop fitting of the text's logical flow to the PAS model. The traceability alarm