CN-121997914-A - AI text detection method based on synonym substitution
Abstract
The application provides an AI text detection method based on synonym substitution, relates to the technical field of AI text detection, and addresses the low detection accuracy of existing AI text detection technology. The method first obtains an original text and replaces a target word in it with a synonym to obtain a replacement text. It then calculates the context fitness, the context window semantics, and the probability distribution offset of the synonym in the replacement text. Finally, it inputs these three features into a preset neural network model to obtain a detection result. The scheme combines semantic perturbation generation with a multidimensional feature detection mechanism, can be applied generally to original texts, and offers high detection accuracy.
Inventors
- CHEN YONGJUN
- GONG WEI
Assignees
- 中国电子科技集团公司第七研究所
Dates
- Publication Date
- 20260508
- Application Date
- 20251121
Claims (10)
- 1. An AI text detection method based on synonym substitution, characterized by comprising the following steps: S1, acquiring an original text, and performing synonym replacement on a target word in the original text to obtain a replacement text; S2, calculating the context fitness, the context window semantics, and the probability distribution offset of the synonym in the replacement text; S3, generating a detection result for the original text based on the context fitness, the context window semantics, and the probability distribution offset, wherein the detection result is either AI text or human-written text.
- 2. The synonym substitution-based AI text detection method of claim 1, wherein performing synonym replacement on the target word in the original text to obtain a replacement text comprises: S11, identifying keywords in the original text through word segmentation and part-of-speech tagging; S12, obtaining candidate synonyms of the keywords; S13, calculating the fill probability confidence of each candidate synonym in the original text, and selecting the candidate synonym with the highest fill probability confidence as the replacement synonym to replace the keyword, thereby obtaining the replacement text.
- 3. The synonym substitution-based AI text detection method of claim 2, wherein calculating the context fitness comprises: S201, constructing a fixed-size context window for the replacement synonym, wherein the context window comprises a plurality of words; S202, calculating the degree of correlation between the replacement synonym and each word in the context window; S203, calculating the context fitness of the context window using the degree of correlation.
- 4. The synonym substitution-based AI text detection method of claim 3, wherein the context fitness F of the context window is calculated from the degree of correlation by the following formula: [formula not preserved in the source text], where s is the replacement synonym, i is any of the remaining words in the context window, and the two terms of the formula are the relevance of the replacement synonym s to word i and the word frequency of word i in the context window.
- 5. The synonym substitution-based AI text detection method of claim 4, wherein calculating the context window semantics comprises: S211, constructing a vector representation of the context window; S212, calculating the context window semantics using knowledge graph information and the vector representation.
- 6. The synonym substitution-based AI text detection method of claim 2, wherein calculating the probability distribution offset comprises: S221, counting the occurrence frequencies of the replacement synonyms in a preset corpus, and constructing an ordered vector set of the replacement synonyms; S222, counting the occurrence frequency, in the original text, of the attribute pairs corresponding to the vectors in the ordered vector set; S223, performing a divergence calculation on the occurrence frequencies to obtain the probability distribution offset of the replacement synonym.
- 7. The synonym substitution-based AI text detection method of claim 6, wherein the probability distribution offset of the replacement synonym is obtained from the occurrence frequencies by a divergence calculation using the following formula: [formula not preserved in the source text], where the terms of the formula are the occurrence frequency of each attribute pair in the original text and the reference frequency of that attribute pair in a standard human corpus.
- 8. The synonym substitution-based AI text detection method of claim 1, wherein generating the detection result based on the context fitness, the context window semantics, and the probability distribution offset comprises: S31, inputting the context fitness, the context window semantics, and the probability distribution offset into a preset back-propagation neural network model, which outputs a preliminary detection result; S32, judging whether the preliminary detection result is AI text, and if so, inputting the replacement text into a preset parser to generate a semantic graph and executing S33; S33, generating a reconstructed text based on the semantic graph, and calculating the semantic similarity between the original text and the reconstructed text; S34, judging whether the semantic similarity is greater than a preset threshold; if so, the detection result of the original text is AI text, and if not, the detection result of the original text is human-written text.
- 9. The synonym substitution-based AI text detection method of claim 8, wherein the semantic graph comprises a plurality of nodes and the connection relationships between them, and wherein generating the reconstructed text based on the semantic graph comprises: S331, encoding the nodes and their connection relationships, and performing weighted aggregation of the semantic information of each node's neighbor nodes based on the connection relationships, to obtain a context representation of the overall semantics of the semantic graph: [formula not preserved in the source text], where v is a node of the semantic graph, V is the set of nodes in the semantic graph, and the remaining term is the hidden state of node v; S332, decoding the set of hidden state vectors based on the context representation by generating output words step by step in time order, wherein at each time step candidate words and their probabilities are generated from the decoding state of the previous time step, the output words already generated, and the context vector extracted from the context representation: [formula not preserved in the source text], where ARM is the semantic graph, softmax(·) is the softmax function, the weight term is a linear transformation matrix, b is a bias term, t is the time step, y is the output sequence, and the remaining term is a candidate word; S333, selecting the candidate word with the highest probability as the current output word, until the reconstructed text corresponding to the semantic graph has been generated.
- 10. The synonym substitution-based AI text detection method of claim 8, wherein calculating the semantic similarity between the original text and the reconstructed text comprises: S333, encoding the original text and the reconstructed text into vectors with a pre-trained sentence encoder to obtain an original text sentence vector and a reconstructed text sentence vector; S334, calculating the cosine similarity of the original text sentence vector and the reconstructed text sentence vector to obtain the semantic similarity: $\mathrm{sim}(T_o, T_r) = \dfrac{\mathbf{e}_o \cdot \mathbf{e}_r}{\lVert \mathbf{e}_o \rVert \, \lVert \mathbf{e}_r \rVert}$, where $T_o$ represents the original text, $T_r$ represents the reconstructed text, $\mathbf{e}_o$ is the sentence vector of the original text, and $\mathbf{e}_r$ is the sentence vector of the reconstructed text.
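As a rough illustration of the features in claims 3 to 7, the sketch below computes a context fitness and a probability distribution offset. The formulas themselves are not legible in this text, so both forms are assumptions: context fitness is taken here to be a word-frequency-weighted mean of the relevance scores, and the divergence is taken to be the Kullback-Leibler divergence. The function names and the smoothing constant `eps` are hypothetical.

```python
import math
from typing import Hashable, Mapping


def context_fitness(relevance: Mapping[str, float],
                    frequency: Mapping[str, float]) -> float:
    """Frequency-weighted mean relevance of the replacement synonym
    to the other words in its context window.

    ASSUMPTION: the weighted-mean form is a guess; the patent's actual
    formula for F is not preserved in the text.
    """
    total = sum(frequency[word] for word in relevance)
    if total == 0:
        return 0.0
    weighted = sum(relevance[word] * frequency[word] for word in relevance)
    return weighted / total


def distribution_offset(observed: Mapping[Hashable, float],
                        reference: Mapping[Hashable, float],
                        eps: float = 1e-12) -> float:
    """KL divergence between attribute-pair frequencies in the original
    text (observed) and in a human reference corpus (reference).

    ASSUMPTION: the claims only say "divergence calculation"; KL is one
    common choice. eps avoids division by zero for unseen pairs.
    """
    p_total = sum(observed.values())
    q_total = sum(reference.values())
    if p_total <= 0 or q_total <= 0:
        raise ValueError("both frequency tables must have positive mass")
    offset = 0.0
    for pair, count in observed.items():
        p = count / p_total
        if p == 0:
            continue  # zero-probability terms contribute nothing
        q = max(reference.get(pair, 0.0) / q_total, eps)
        offset += p * math.log(p / q)
    return offset
```

Identical normalized distributions give an offset of zero, and the offset grows as the attribute-pair frequencies in the text drift away from the reference corpus.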
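The second-stage check in claims 8 and 10 (steps S32 to S34 with the cosine similarity) might look like the following minimal sketch. It assumes the two sentence vectors have already been produced by some pre-trained sentence encoder. The function names, the default threshold of 0.8, and the choice to return human-written when the preliminary result is not AI are assumptions, not taken from the claims.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def final_label(preliminary_is_ai: bool,
                original_vec: Sequence[float],
                reconstructed_vec: Sequence[float],
                threshold: float = 0.8) -> str:
    """Apply the S32-S34 decision to a preliminary classifier result.

    ASSUMPTIONS: threshold=0.8 is an arbitrary placeholder, and a
    non-AI preliminary result is returned as human-written directly.
    """
    if not preliminary_is_ai:
        return "human-written"
    similarity = cosine_similarity(original_vec, reconstructed_vec)
    # S34: high similarity between original and reconstruction => AI text
    return "AI" if similarity > threshold else "human-written"
```

For example, `final_label(True, [1.0, 0.0], [1.0, 0.0])` confirms the preliminary AI verdict, while orthogonal sentence vectors overturn it.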
Description
AI text detection method based on synonym substitution
Technical Field
The application relates to the technical field of AI text detection, and in particular to an AI text detection method based on synonym substitution.
Background
With the rapid development of artificial intelligence and natural language processing technologies, large language models (LLMs) are widely used in text generation, automatic writing, academic writing assistance, and similar fields. AI-generated text has clear advantages in content production because its language is fluent, its semantics are coherent, and its logic is consistent. However, the widespread use of AI text also carries potential risks, especially in academic publishing, educational evaluation, and content security, where AI-generated content may be used for inappropriate purposes and undermine the originality and authenticity of text. How to accurately distinguish AI-generated text from human-written text has therefore become a research focus in the field of AI content detection.
Currently, AI text detection relies mainly on two technical approaches. One is a learning method based on a discriminative model, which collects a large number of human-written and AI text samples, extracts language features, and classifies them with a deep learning model. This method is accurate when the training set is sufficient, but the model depends heavily on the data distribution: when a new generative model or adversarially rewritten text appears, detection performance drops markedly and frequent retraining is required. The other is a zero-shot method based on statistical features, which evaluates the naturalness of a text by calculating indicators such as its perplexity or burstiness.
This statistical method is suitable for rapid detection without additional training, but because it reflects only surface features of the language, it has limited ability to capture deep semantics and contextual consistency, and its accuracy drops significantly on heavily disguised AI text.
Disclosure of Invention
The invention provides an AI text detection method based on synonym substitution, and aims to solve the problem of low detection accuracy of existing AI text detection technology.
To achieve this technical effect, the technical scheme of the invention is as follows: an AI text detection method based on synonym substitution comprises the following steps: S1, acquiring an original text, and performing synonym replacement on a target word in the original text to obtain a replacement text; S2, calculating the context fitness, the context window semantics, and the probability distribution offset of the synonym in the replacement text; S3, generating a detection result for the original text based on the context fitness, the context window semantics, and the probability distribution offset, wherein the detection result is either AI text or human-written text.
Preferably, performing synonym replacement on the target word in the original text to obtain a replacement text includes: S11, identifying keywords in the original text through word segmentation and part-of-speech tagging; S12, obtaining candidate synonyms of the keywords; S13, calculating the fill probability confidence of each candidate synonym in the original text, and selecting the candidate synonym with the highest fill probability confidence as the replacement synonym to replace the keyword, thereby obtaining the replacement text.
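Step S13 amounts to an argmax over the candidate synonyms. A minimal sketch, assuming a scoring function that returns the fill probability confidence of a candidate at a given position. In practice this score could come from a masked language model; `toy_fill_score` below is a hypothetical stand-in with made-up scores, and `pick_replacement` is an illustrative name rather than anything defined in the application.

```python
from typing import Callable, List, Sequence, Tuple

# Signature of a scorer: (tokens, position, candidate) -> confidence
FillScorer = Callable[[Sequence[str], int, str], float]


def pick_replacement(tokens: Sequence[str],
                     index: int,
                     candidates: Sequence[str],
                     fill_score: FillScorer) -> Tuple[str, List[str]]:
    """Replace tokens[index] with the highest-scoring candidate synonym.

    Returns the chosen synonym and a new token list (the replacement text).
    """
    if not candidates:
        raise ValueError("at least one candidate synonym is required")
    best = max(candidates, key=lambda c: fill_score(tokens, index, c))
    replaced = list(tokens)
    replaced[index] = best
    return best, replaced


def toy_fill_score(tokens: Sequence[str], index: int, candidate: str) -> float:
    """HYPOTHETICAL scorer with fixed, made-up confidences.

    A real system would query a language model for the probability of
    `candidate` filling position `index` in `tokens`.
    """
    made_up_scores = {"quick": 0.6, "rapid": 0.3, "speedy": 0.1}
    return made_up_scores.get(candidate, 0.0)


best, replaced = pick_replacement(
    ["a", "fast", "car"], 1, ["rapid", "quick", "speedy"], toy_fill_score
)
```

Passing the scorer as a parameter keeps the selection logic independent of whichever model supplies the confidences.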
Preferably, calculating the context fitness includes: S201, constructing a fixed-size context window for the replacement synonym, wherein the context window comprises a plurality of words; S202, calculating the degree of correlation between the replacement synonym and each word in the context window; S203, calculating the context fitness of the context window using the degree of correlation.
Preferably, the context fitness F of the context window is calculated from the degree of correlation by the following formula: [formula not preserved in the source text], where s is the replacement synonym, i is any of the remaining words in the context window, and the two terms of the formula are the relevance of the replacement synonym s to word i and the word frequency of word i in the context window.
Preferably, calculating the context window semantics includes: S211, constructing a vector representation of the context window; S212, calculating the context window semantics using knowledge graph information and the vector representation.
Preferably, calculating the probability distribution offset includes: S221, counting the occurrence frequencies of the replacement synonyms in a preset corpus, and constructing an ordered vector set of the replacement synonyms; S222, counting the occurrence frequency of attribute pa