CN-122021617-A - ACG-oriented Japanese-source word specialized term recognition and corpus construction method and system
Abstract
The invention discloses an ACG-oriented (animation, comics, and games) method and system for recognizing Japanese-source words and constructing a corpus. The method collects full-scenario ACG text, performs denoising, word segmentation, and related preprocessing, and stores the result in time layers, building a high-quality, traceable data foundation that supports subsequent analysis. Through N-gram statistics and the calculation of solidification degree and context entropy, combinations that are high-frequency, internally compact, and free at their boundaries are screened out, which refines the set of potential Japanese-source words and reduces the verification workload. The Japanese-source attribute is confirmed through stepwise verification; after multi-dimensional labeling, the words are classified and stored according to their pragmatic irreplaceability to build a structured corpus. Propagation patterns and semantic features are mined through time-series and network analysis, and the corpus and model are updated dynamically. The steps progress layer by layer, with the output of each step serving as the input of the next, forming a closed loop of data preparation, screening, verification, analysis, and optimization. The invention provides technical support for ACG Japanese-source word recognition, corpus construction, and pattern mining.
Inventors
- ZHOU XUAN
Assignees
- Wuhan University (武汉大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20251230
Claims (10)
- 1. An ACG-oriented Japanese-source word specialized term recognition and corpus construction method, characterized by comprising the following steps: S1, acquiring ACG-domain text corpora from at least one network platform within a set time interval, performing standardized preprocessing (denoising, word segmentation, and part-of-speech tagging) on the original corpus to form a structured initial corpus, and storing the structured initial corpus in time-series layers according to a preset time unit; S2, extracting high-frequency character combinations from the structured initial corpus as candidate words through N-gram frequency statistics, calculating a solidification degree index of each candidate word to measure the tightness of its internal word formation, calculating the left and right context information entropy to evaluate its boundary freedom, and screening the candidate words that simultaneously meet preset minimum frequency, minimum solidification degree, and minimum left and right entropy thresholds to form an ACG-domain Japanese-source word candidate set; S3, performing stepwise verification by searching Japanese-related corpora, literature, and dictionaries, combined with semantic usage relevance comparison, translation type and cultural attribute verification, and network propagation feature analysis, to confirm the Japanese-source attribute of the candidate words; labeling each verified Japanese-source word in at least one of the semantic, grammatical, pragmatic, and social dimensions, and then classifying it into a core, diffusion, or edge corpus according to the irreplaceability of its pragmatic function; S4, analyzing the labeled Japanese-source words through time-series analysis for word frequency change prediction and propagation inflection point identification, and through network analysis for vocabulary association network construction and propagation path tracking; and meanwhile establishing a dynamic update mechanism that periodically collects newly added corpora and cyclically executes the corpus collection, candidate word screening, and Japanese-source word verification and labeling processes, thereby continuously expanding the corpus and iteratively optimizing the recognition model.
- 2. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the ACG-domain text corpora in S1 comprise: core texts associated with ACG works, including animation and game reviews, plot discussions, character analyses, setting interpretations, and comments on theme songs and soundtracks; ACG practical application texts, including game strategy guides, gameplay sharing, character interaction content, merchandise reviews, and exhibition reports and feedback; ACG community interaction texts, including ACG community topic discussions, live bullet-screen comments, question-and-answer consultations, and member exchanges; and ACG derivative creation texts, including derivative (secondary creation) texts, doujin (fan) works, and derivative content based on ACG elements.
- 3. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the time-series layered storage in S1 is performed as follows: a time unit $T$ is preset, where $T$ represents a month, a quarter, or a year, and its value is determined according to the propagation-cycle characteristics of Japanese-source words in the ACG domain; the structured initial corpus obtained after standardized preprocessing is denoted as the corpus set $C = \{c_1, c_2, \dots, c_n\}$, where the collection timestamp of each corpus item $c_i$ is $t_i$, and $i$ is the corpus sequence number, $i = 1, 2, \dots, n$; a time-interval mapping function $f(t_i) = k$ is defined, where $k$ is a non-negative integer representing the time-layer sequence number, and the function maps the collection timestamp $t_i$ of each corpus item to the corresponding time interval $I_k$; based on the mapping result, time-layered corpus subsets $C_k = \{c_i \in C \mid f(t_i) = k\}$ are constructed, where $k = 0, 1, 2, \dots$; each corpus subset $C_k$ is allocated an independent storage identifier, the identifier information comprising the time interval $I_k$ and the number of corpus items in the subset $n_k$, where $n_k = |C_k|$ and $|\cdot|$ denotes the cardinality of a set; the corpus subsets $C_k$ are stored sequentially in increasing order of the time-layer sequence number $k$ to form a time-layered corpus.
- 4. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the solidification degree index of a candidate word is calculated in S2 to measure the tightness of its internal word formation as follows: let the candidate word be $w = c_1 c_2 \cdots c_n$ of length $n$, where $c_1, c_2, \dots, c_n$ are the consecutive characters of the candidate word, and count the occurrence frequency $f(w)$ of the candidate word in the structured initial corpus; binary segmentation is performed on the candidate word: for any cut position $i$, with $1 \le i \le n-1$, the candidate word is divided into a front sub-fragment and a rear sub-fragment, where the front sub-fragment, denoted $a_i$, is spliced from the first $i$ consecutive characters of the candidate word, i.e. $a_i = c_1 \oplus c_2 \oplus \cdots \oplus c_i$, and the rear sub-fragment, denoted $b_i$, is spliced from the $(i+1)$-th to the $n$-th consecutive characters of the candidate word, i.e. $b_i = c_{i+1} \oplus c_{i+2} \oplus \cdots \oplus c_n$, where $\oplus$ denotes concatenation of character sequences; the occurrence frequency $f(a_i)$ of $a_i$ and the occurrence frequency $f(b_i)$ of $b_i$ are counted respectively; the solidification degree $S_i(w)$ at this cut position is calculated from $f(w)$, $f(a_i)$, and $f(b_i)$; the minimum solidification degree over all cut positions is taken as the final solidification degree index of the candidate word, $S(w) = \min_{1 \le i \le n-1} S_i(w)$; the higher the index value, the more compact the internal word formation of the candidate word.
- 5. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the left and right context information entropy is calculated in S2 to evaluate boundary freedom as follows: let the candidate word be $w$; all characters that appear immediately to the left of $w$ in the structured initial corpus are collected to form the left context set $L(w) = \{l_1, l_2, \dots, l_p\}$, where $l_j$ is a distinct left character in the left context set and $p$ is the total number of distinct characters in the left context set; the frequency $n(l_j, w)$ with which each $l_j$ occurs at the left-adjacent position of $w$ is counted; the left-context probability distribution is computed as $P(l_j \mid w) = n(l_j, w) / \sum_{j'=1}^{p} n(l_{j'}, w)$, representing the probability that character $l_j$ occurs at the left-adjacent position of $w$; the left context information entropy is $H_L(w) = -\sum_{j=1}^{p} P(l_j \mid w) \log P(l_j \mid w)$, where $H_L(w)$ denotes the left context information entropy of the candidate word $w$; similarly, all characters that appear immediately to the right of $w$ are collected to form the right context set $R(w) = \{r_1, r_2, \dots, r_q\}$, where $r_j$ is a distinct right character in the right context set and $q$ is the total number of distinct characters in the right context set; the frequency $n(w, r_j)$ with which each $r_j$ occurs at the right-adjacent position of $w$ is counted; the right-context probability distribution is computed as $P(r_j \mid w) = n(w, r_j) / \sum_{j'=1}^{q} n(w, r_{j'})$, representing the probability that character $r_j$ occurs at the right-adjacent position of $w$; the right context information entropy is $H_R(w) = -\sum_{j=1}^{q} P(r_j \mid w) \log P(r_j \mid w)$, where $H_R(w)$ denotes the right context information entropy of the candidate word $w$; the higher the context information entropy, the more dispersed the character distribution at the corresponding boundary of the candidate word, and the greater its boundary freedom.
- 6. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein step S2 further comprises performing multi-dimensional formal analysis on the candidate set, specifically: in grapheme-layer diagnosis, verifying whether the candidate word contains a Japanese kanji or a reverse-word; in word-layer diagnosis, identifying whether the candidate word adopts a Japanese-style word-formation structure or has productive affixes common in the ACG domain; in phonological-layer diagnosis, judging whether the candidate word has Japanese reading or transliteration characteristics, or the characteristics of Japanese mimetic words; in syntactic-layer diagnosis, analyzing whether the candidate word shows Japanese-style part-of-speech conversion or Japanese collocation patterns; candidate words with positive features on at least 2 layers enter the next verification step, and candidate words that have positive features on only 1 layer, or that are highly sinicized, are temporarily stored in an observation corpus.
- 7. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the stepwise verification in S3 confirms the Japanese-source attribute of a candidate word through the following process: verifying the Japanese first-coinage attribute, by checking whether there is corroborating evidence that the candidate word was first coined in Japanese, the evidence covering records in authoritative Japanese literature, entries in professional dictionaries, and first-attestation time records in Japanese corpora, and if the candidate word is confirmed to have been first created and used in Japanese, directly judging it to be a Japanese-source word; tracing the origin of the semantic usage, by comparing, for a candidate word that has a homographic Chinese word in ancient Chinese, its modern semantic structure with the ancient Chinese semantic structure, examining the source of its modern senses, and judging it to be a new-sense Japanese-source word if its modern meaning or collocation derives from Japanese input and forms a new sense; judging the type of language borrowing, by checking whether the candidate word has Japanese phonetic-system borrowing features, judging it to be a transliterated Japanese-source word if it contains Japanese phonetic-system borrowing features, a borrowed Japanese-source word if it contains Japanese kanji phonetic borrowing features, and a mixed Japanese-source word if it contains both free-translation components and transliteration components, and entering the next judging step if it has no phonetic-system borrowing features; checking Japanese cultural attributes, by analyzing whether the candidate word carries a Japanese socio-cultural core or is bound to Japan-specific cultural scenes, values, or customs, judging it to be a cultural or subcultural borrowed Japanese-source word if it has clear Japanese cultural markers, and entering the next judging step if it has no association with Japanese culture; and evaluating the propagation-scenario attribute, by checking the network popularity of the candidate word and its propagation characteristics in specific communities, checking whether it forms high-frequency propagation in ACG subcultural communities, judging it to be a network or subcultural Japanese-source word if it has propagation popularity in cyberspace or in ACG subcultural communities, and judging it to be a non-Japanese-source word or placing it in a pending-review list if it does not have these propagation characteristics.
- 8. The ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to claim 1, wherein the labeled Japanese-source words are analyzed in S4 as follows: let the set of labeled Japanese-source words be $W = \{w_1, w_2, \dots, w_m\}$, where $m$ is the total number of labeled Japanese-source words; a time window sequence $\{T_1, T_2, \dots, T_K\}$ is selected, where $K$ is the number of divided time windows; the occurrence frequency $f_{ij}$ of the Japanese-source word $w_i$ in the time window $T_j$ is counted to construct the word frequency time-series matrix $F = [f_{ij}]_{m \times K}$; the word frequency change rate $r_{ij}$ is calculated based on the matrix, and a change-rate threshold $\theta$ is set; when $r_{ij}$ satisfies the threshold condition with respect to $\theta$, the corresponding time window $T_j$ is determined to be a propagation inflection point of the Japanese-source word, completing the time-series analysis; a vocabulary co-occurrence network $G = (V, E)$ is constructed, where the node set $V$ corresponds to the labeled Japanese-source words and the high-frequency words of the Chinese ACG domain, and the edge set $E$ corresponds to the co-occurrence relations between words; the edge weight $\omega(u, v)$ is calculated from $c(u, v)$, $f(u)$, and $f(v)$, where $c(u, v)$ is the number of co-occurrences of word $u$ and word $v$, and $f(u)$ and $f(v)$ are the total occurrence frequencies of word $u$ and word $v$ respectively; the propagation path of each Japanese-source word is tracked with a shortest-path algorithm based on the edge weights to complete the network analysis, where $d(u, v)$ is the shortest path length between word $u$ and word $v$ in the co-occurrence network; and the word frequency time-series matrix, the propagation inflection point data, and the vocabulary co-occurrence network parameters are integrated to refine the propagation patterns and semantic evolution characteristics of the Japanese-source words, forming a standardized analysis result.
- 9. An ACG-oriented Japanese-source word specialized term recognition and corpus construction system, comprising: a corpus preprocessing and time-series storage unit, used for acquiring ACG-domain text corpora from at least one network platform within a set time interval, performing standardized preprocessing (denoising, word segmentation, and part-of-speech tagging) on the original corpus to form a structured initial corpus, and storing the structured initial corpus in time-series layers according to a preset time unit; a Japanese-source word candidate set screening unit, used for extracting high-frequency character combinations from the structured initial corpus as candidate words through N-gram frequency statistics; a Japanese-source word verification and classification unit, used for verifying the Japanese-source attribute of the candidate words by searching Japanese-related corpora, literature, and dictionaries, combined with semantic usage relevance comparison, translation type and cultural attribute verification, and network propagation feature analysis; and an analysis and model iteration optimization unit, used for analyzing the labeled Japanese-source words through time-series analysis for word frequency change prediction and propagation inflection point identification, and through network analysis for vocabulary association network construction and propagation path tracking, while establishing a dynamic update mechanism that periodically collects newly added corpora and cyclically executes the corpus collection, candidate word screening, and Japanese-source word verification and labeling processes, thereby continuously expanding the corpus and iteratively optimizing the recognition model.
- 10. A computer-readable storage medium storing a computer program which, when executed by a processor, performs the ACG-oriented Japanese-source word specialized term recognition and corpus construction method according to any one of claims 1-8.
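The candidate screening of step S2 (N-gram frequency, solidification degree as in claim 4, and left and right context entropy as in claim 5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claim text does not preserve the per-split solidification formula, so the common ratio P(w) / (P(a) * P(b)) is assumed here; the function names `ngram_counts`, `cohesion`, `boundary_entropy`, and `screen`, the natural-log base, and the threshold values in the usage note are all hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total_chars):
    """Minimum, over all binary splits, of P(word) / (P(front) * P(rear)).

    ASSUMPTION: this PMI-style ratio stands in for the per-split formula,
    which is not recoverable from the claim text.
    """
    p_word = counts[word] / total_chars
    ratios = []
    for i in range(1, len(word)):
        p_front = counts[word[:i]] / total_chars
        p_rear = counts[word[i:]] / total_chars
        ratios.append(p_word / (p_front * p_rear))
    return min(ratios)


def boundary_entropy(word, text, side):
    """Shannon entropy (natural log) of characters adjacent to word.

    side is "left" or "right"; occurrences at the text edge have no
    neighbour on that side and are skipped.
    """
    neighbours = Counter()
    start = text.find(word)
    while start != -1:
        pos = start - 1 if side == "left" else start + len(word)
        if 0 <= pos < len(text):
            neighbours[text[pos]] += 1
        start = text.find(word, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in neighbours.values())


def screen(text, max_n, min_freq, min_cohesion, min_entropy):
    """Keep n-grams (length >= 2) that pass all three thresholds."""
    counts = ngram_counts(text, max_n)
    kept = []
    for cand, freq in counts.items():
        if len(cand) < 2 or freq < min_freq:
            continue
        if cohesion(cand, counts, len(text)) < min_cohesion:
            continue
        left = boundary_entropy(cand, text, "left")
        right = boundary_entropy(cand, text, "right")
        if min(left, right) < min_entropy:
            continue
        kept.append(cand)
    return kept
```

For example, `screen("xabyzabw", 2, 2, 2.0, 0.5)` keeps only the bigram `ab`, which occurs twice, is compact under the assumed ratio, and has two distinct neighbours on each side. Taking the minimum over split positions, as claim 4 specifies, means a candidate is only as compact as its weakest internal boundary.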
Description
ACG-oriented Japanese-source word specialized term recognition and corpus construction method and system
Technical Field
The invention belongs to the technical field of corpus construction, and particularly relates to an ACG-oriented method and system for recognizing Japanese-source word specialized terms and constructing a corpus.
Background
ACG is a core domain that carries subculture propagation and communication among young people. As globalized network communication deepens, Japanese-source vocabulary keeps pouring in and spreading rapidly, and has become an important component of the language system of this domain. However, the current language processing field has a significant shortcoming in specialized technical support for ACG Japanese-source words. Existing Japanese-source word recognition methods mostly target general language scenarios and lack adaptation to the text characteristics of the ACG domain; they have difficulty coping with special forms of expression that occur frequently in the corpus of this domain, such as homophonic transformation, abbreviation and deformation, and derivative creation, so recognition accuracy is low. Meanwhile, the propagation of ACG Japanese-source words is strongly time-sensitive and community-clustered, whereas traditional corpus construction uses static collection, which cannot capture the propagation status and semantic evolution of words in different time periods and can hardly meet the needs of dynamic analysis. In corpus screening, the prior art often relies on a single frequency index and ignores the internal compactness and contextual relevance of words, so it easily misjudges meaningless character combinations as candidate words or misses Japanese-source words that are low-frequency but functionally important, leaving the candidate set insufficiently accurate.
In Japanese-source attribute verification, systematic judgment logic is lacking: simple dictionary matching or literal comparison can hardly distinguish the semantic origins of homographic Chinese-character words, and cannot effectively identify Japanese-source words of complex borrowing types such as transliteration and mixed translation, so the boundary between Japanese-source and non-Japanese-source words is blurred. In addition, existing related corpora generally lack hierarchical classification and multi-dimensional attribute labeling, cannot reflect the differences among Japanese-source words in language function, and can hardly support deeper applications in the ACG domain such as semantic analysis and cultural propagation research. These problems not only limit the practical effect of language processing technology in the ACG domain, but also affect the depth of research on issues such as subculture propagation patterns and cross-cultural language fusion. Therefore, constructing a Japanese-source word recognition and corpus construction scheme that adapts to the characteristics of the ACG domain, balances accuracy and dynamism, and overcomes the deficiencies of the prior art in recognition accuracy, timeliness, and practicability is of great significance for improving the domain language processing system and promoting subculture research and cross-cultural communication.
Disclosure of Invention
The invention aims to solve problems such as the low accuracy of Japanese-source word recognition in the ACG domain and the lack of dynamism and hierarchy in corpora, and constructs a technical scheme adapted to the characteristics of the domain.
Through standardized corpus processing, multi-dimensional candidate word screening, stepwise Japanese-source attribute verification, and dynamic iterative optimization, Japanese-source words are accurately recognized, structurally labeled, and stored by class, providing reliable support for domain language processing and subculture research. In view of the defects of, or needs for improvement in, the prior art, the invention provides an ACG-oriented Japanese-source word specialized term recognition and corpus construction method, comprising the following steps: S1, acquiring ACG-domain text corpora from at least one network platform within a set time interval, performing standardized preprocessing (denoising, word segmentation, and part-of-speech tagging) on the original corpus to form a structured initial corpus, and storing the structured initial corpus in time-series layers according to a preset time unit; S2, extracting high-frequency character combinations from the structured initial corpus as candidate words through N-gram frequency statistics, calculating a solidification degree index of each candidate word to measure the tightness of its internal word formation, calculating the left and right context information entropy to evaluate its boundary freedom, screening