KR-20260063656-A - APPARATUS AND METHODS FOR DETECTING TEXT GENERATED BY ARTIFICIAL INTELLIGENCE LANGUAGE MODELS
Abstract
The present invention relates to a device and method for detecting text generated through an artificial intelligence language model, wherein the device comprises: a text analysis unit that receives a source text and analyzes the sentence structure of the source text through a morphological analyzer and a syntactic analyzer; a variant text generation unit that generates a variant text by inserting a comma into the source text according to a defined rule; a log-likelihood calculation unit that calculates log-likelihood values of the source text and the variant text; and a determination unit that analyzes the log-likelihood value of a given text to determine whether the given text is a text generated by an artificial intelligence language model.
Inventors
- 한요섭
- 박신우
- 김도경
Assignees
- 연세대학교 산학협력단
Dates
- Publication Date: 20260507
- Application Date: 20241030
Claims (9)
- A device for detecting text generated by an artificial intelligence language model, comprising: a text analysis unit that receives a source text and analyzes the sentence structure of the source text through a morphological analyzer and a syntactic analyzer; a variant text generation unit that generates a variant text by inserting commas into the source text according to defined rules; a log-likelihood calculation unit that calculates log-likelihood values of the source text and the variant text; and a determination unit that analyzes the log-likelihood value of a given text to determine whether the given text is text generated by the artificial intelligence language model.
- In paragraph 1, the variant text generation unit A text detection device generated through an artificial intelligence language model characterized by identifying possible comma insertion locations in the source text according to the comma usage rules of the basic language.
- The device of claim 2, wherein the variant text generation unit generates the variant text by inserting a comma at a different location in the source text while maintaining the meaning of the source text.
- The device of claim 3, wherein the variant text generation unit determines the insertion position of the comma by treating the variant text as a simulated output of the artificial intelligence language model.
- The device of claim 1, wherein the log-likelihood calculation unit determines the log-likelihood value by calculating a value that measures how naturally a specific word in the source text or the variant text follows from the preceding words.
- The device of claim 5, wherein the log-likelihood calculation unit calculates, for each word of the source text or the variant text, the probability that the word appears given the preceding words, and then sums the logarithms of all the probabilities to obtain the log-likelihood value.
- The device of claim 5, wherein the log-likelihood calculation unit calculates the log-likelihood difference between the source text and the variant text and determines, based on the magnitude of the difference, a judgment criterion for the possibility of generation by the artificial intelligence language model.
- The device of claim 1, wherein the determination unit determines that the given text was written by the artificial intelligence language model when the log-likelihood value of the given text is greater than or equal to the judgment criterion for the possibility of generation by the artificial intelligence language model.
- A method for detecting text generated by an artificial intelligence language model, performed in a device for detecting text generated by an artificial intelligence language model, the method comprising: a text analysis step of receiving a source text and analyzing the sentence structure of the source text through a morphological analyzer and a syntactic analyzer; a variant text generation step of generating a variant text by inserting commas into the source text according to defined rules; a log-likelihood calculation step of calculating log-likelihood values of the source text and the variant text; and a determination step of analyzing the log-likelihood value of a given text to determine whether the given text is text generated by the artificial intelligence language model.
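The calculation recited in the claims (summing the log-probability of each word given the preceding words, then comparing the source text with comma-inserted variants) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the source: `BigramModel` is a toy add-one-smoothed bigram model standing in for the artificial intelligence language model, the comma positions are supplied by hand rather than by a morphological and syntactic analyzer, and `mean_ll_difference` and `judge` are hypothetical helpers. The source does not specify how the judgment criterion is derived from the difference, so the criterion is simply passed in.

```python
import math
from collections import Counter


def tokenize(text):
    # Treat each comma as its own token so an inserted comma changes the sequence.
    return text.replace(",", " , ").split()


class BigramModel:
    """Toy add-one-smoothed bigram model (hypothetical stand-in for an LLM)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.vocab = {"<s>"}
        for sentence in corpus:
            tokens = ["<s>"] + tokenize(sentence)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def prob(self, prev, cur):
        # One extra slot reserves probability mass for unseen tokens.
        size = len(self.vocab) + 1
        return (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + size)


def log_likelihood(model, text):
    # Sum of log P(word | previous word) over the whole text.
    tokens = ["<s>"] + tokenize(text)
    return sum(math.log(model.prob(p, c)) for p, c in zip(tokens, tokens[1:]))


def comma_variants(text, positions):
    # One variant per position: a comma is appended to the word at that index.
    # The last word and words that already end in a comma are skipped.
    words = text.split()
    variants = []
    for i in positions:
        if 0 <= i < len(words) - 1 and not words[i].endswith(","):
            variants.append(" ".join(words[:i] + [words[i] + ","] + words[i + 1:]))
    return variants


def mean_ll_difference(model, text, positions):
    # Average of (source log-likelihood - variant log-likelihood).
    variants = comma_variants(text, positions)
    if not variants:
        return 0.0
    base = log_likelihood(model, text)
    return sum(base - log_likelihood(model, v) for v in variants) / len(variants)


def judge(value, criterion):
    # Flag the text as model-generated when the value reaches the criterion.
    return value >= criterion


if __name__ == "__main__":
    # Hypothetical training corpus and input sentence.
    model = BigramModel(["나는 밥을 먹고 학교에 갔다", "나는 집에 갔다"])
    source = "나는 밥을 먹고 학교에 갔다"
    print(comma_variants(source, [2]))
    print(log_likelihood(model, source))
    print(mean_ll_difference(model, source, [2]))
```

In a real system the toy model would be replaced by the language model under test, and the candidate comma positions would come from the text analysis unit. The sketch only shows how the log-likelihood sum and the source-versus-variant difference fit together.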
Description
Apparatus and methods for detecting text generated by artificial intelligence language models

The present invention relates to a technology for effectively distinguishing text written by a human from text generated by a large language model in the Korean language. More specifically, it relates to a device and method for detecting text generated by an artificial intelligence language model. The device generates variant texts by inserting commas, based on the difference in comma usage patterns between humans and large language models, and detects generated text by comparing log-likelihoods.

Over the past few years, the rapid advancement of large language models (LLMs), such as ChatGPT, has dramatically improved the quality and naturalness of AI-generated text. While this development has enabled innovative applications across various fields, it has also raised several social and ethical issues. For instance, in education there are concerns that students use LLMs to write assignments in their place, while in the media sector there are worries about the spread of fake news generated by these models. In the legal field as well, the possibility has been raised that false evidence or documents generated by LLMs could lead to legal disputes.

Against this backdrop, there is a growing need for technologies capable of accurately identifying text generated by large language models. However, existing detection technologies have primarily been developed for English or Chinese, and detection technologies that account for the specific characteristics of the Korean language have not yet been sufficiently researched.

FIG. 1 is a diagram illustrating the functional configuration of a text detection device according to one embodiment of the present invention.
FIG. 2 is a diagram illustrating the system configuration of the text detection device.
FIG. 3 is a flowchart illustrating a text detection method according to the present invention.

The description of the present invention is merely an example for structural or functional explanation, and the scope of the present invention should therefore not be interpreted as being limited by the examples described in the text. That is, since the examples are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific example must include all of them or only such effects; the scope of the present invention should therefore not be understood as being limited by them.

Meanwhile, the meaning of the terms described in this application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, the first component may be named the second component, and similarly, the second component may be named the first component.

When it is stated that one component is "connected" to another component, it should be understood that it may be directly connected to that other component, or that there may be other components in between. Conversely, when it is stated that one component is "directly connected" to another component, it should be understood that there are no other components in between. Other expressions describing the relationships between components, such as "between" and "directly between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way.
A singular expression should be understood to include a plural expression unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the existence of the stated features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

In each step, identifiers (e.g., a, b, c) are used for convenience of explanation and do not describe the order of the steps; the steps may occur in an order different from the specified order unless a specific order is clearly indicated by the context. That is, the steps may occur in the specified order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention may be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices in which data that can be read by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device