CN-121543587-B - English text stance labeling method and system

Abstract

The invention provides an English text stance labeling method and system, belonging to the intersection of information technology and natural language processing. The method comprises: part-of-speech preprocessing; rule construction, in which stance vocabulary labeling rules are formulated on the basis of the part-of-speech tags and implemented in code to obtain a rule-based stance vocabulary labeler; data set labeling, in which English text carrying part-of-speech tags is labeled by the rule-based stance vocabulary labeler to obtain a stance-labeled data set; and machine learning training, in which the stance-labeled data set is used for training with RoBERTa-Large as the base model, yielding a training-based stance vocabulary labeler and improving labeling accuracy. By constructing an English text stance labeler, the invention achieves accurate labeling of stance in English text.

Inventors

Liu Tingting
Diao Heng

Assignees

Tianjin University of Finance and Economics (天津财经大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-16

Claims (9)

1. An English text stance labeling method, characterized by comprising the following steps: S1, part-of-speech preprocessing: acquiring English text through data collection, and generating a part-of-speech tag for each word of the English text; S2, rule construction: on the basis of the part-of-speech tags, formulating stance vocabulary labeling rules according to English syntactic structure, contextual semantics and lexical features, and writing code to obtain a rule-based stance vocabulary labeler, wherein formulating the stance vocabulary labeling rules comprises: S201, simple stance vocabulary labeling: taking modal verbs, semi-modal verbs and stance adverbs as simple stance vocabulary, and expanding the vocabulary definitions; S202, complex stance vocabulary labeling: taking stance verbs, stance adjectives and stance nouns that introduce complement clauses as complex stance vocabulary, summarizing the recognition types of the complex stance vocabulary, setting general rules and special rules for each recognition type, and identifying and labeling the complex stance vocabulary according to the recognition types and the corresponding rules, in combination with the word form and the part-of-speech tag of the target word; S3, data set labeling: labeling the stance of the English text carrying part-of-speech tags with the rule-based stance vocabulary labeler to obtain a stance-labeled data set; S4, machine learning training: training on the stance-labeled data set with the RoBERTa-Large model as the base model to obtain a training-based stance vocabulary labeler, improving labeling accuracy; S5, stance labeling: completing the stance labeling of the English text to be analyzed by means of the training-based stance vocabulary labeler.
2. The English text stance labeling method of claim 1, wherein step S1 comprises: S101, collecting English text corpora of domestic and foreign media news from a global news database; S102, performing duplicate screening on the collected text corpora, the duplicate screening comprising extracting title prefixes by regular expressions, grouping and deduplicating, and performing dual-rule parallel similarity matching to delete similar texts; S103, cleaning the data of the texts that have undergone duplicate screening; S104, segmenting the cleaned text content by regular expressions to separate the title, date, source, watermark information and body text; S105, constructing sentence-splitting logic with regular expressions, splitting the text content into individual sentences based on English punctuation and quotation rules, and tagging the split sentences with the Stanford part-of-speech tagger (Stanford POS Tagger) to obtain a part-of-speech tag for each word; S106, constructing a PostgreSQL database with the fields ID, title, source, date, body text, sentence-splitting result and tags, and writing the separated content into the corresponding fields other than ID, each record being identified by a unique ID, to form an English text database.
3. The English text stance labeling method of claim 2, wherein the duplicate screening in step S102 specifically comprises: S102-1, title prefix extraction and grouping: extracting a prefix from the first-line title of each text by a regular expression, grouping the texts by the extracted prefix, counting the number of texts and the number of unique titles in each group, retaining only the groups containing more than one text as deduplication candidates, further selecting the groups in which all titles are identical and the number of texts is less than or equal to a set number threshold, and performing the subsequent content comparison on the selected groups; S102-2, dual-rule parallel similarity matching: calculating the full-content similarity of the texts in each selected group; if the full-content similarity is greater than or equal to a set similarity threshold, judging the texts to be similar; if it is smaller than the set similarity threshold, comparing the first N characters and calculating the subset similarity, and judging the texts to be similar if the subset similarity is greater than or equal to the set similarity threshold; and deleting the matched similar texts so that only one copy is retained.
4. The English text stance labeling method of claim 1, wherein step S202 comprises: S202-1, determining the recognition types of complex stance vocabulary according to the stance verbs, stance adjectives and stance nouns of English grammar and the complement clauses they introduce, the recognition types comprising: stance verb + to clause, stance verb + that clause, stance adjective + to clause, stance adjective + that clause, stance noun + to clause, and stance noun + that clause; S202-2, setting general rules for all recognition types, the general rules comprising: (1) vocabulary definition expansion: adding the inflected forms of the stance verbs/stance adjectives/stance nouns defined in the existing stance vocabulary to the corresponding levels; (2) defining the valid range of to/that: since to and that can each carry several different part-of-speech tags, defining the range of part-of-speech tags that is used for identifying complex stance vocabulary; (3) defining qualification rules: triggering different decision flows and results according to the grammatical structures, made up of different words, part-of-speech tags and punctuation, that occur between the stance verb/stance adjective/stance noun and to/that; S202-3, setting special rules for each recognition type: for special cases of stance verb/stance adjective/stance noun + to/that clauses that do not conform to the general rules but still belong to complex stance vocabulary, setting the corresponding grammatical-structure decision rules.
5. The English text stance labeling method of claim 4, wherein the general rules and the special rules further comprise a part-of-speech tag correction rule: determining, from the tags in the context of the target word, whether the target word has been mis-tagged, and if so, replacing its part-of-speech tag with the correct one.
6. The English text stance labeling method of claim 1, wherein step S3 comprises: S301, setting stance codes according to the definition of stance vocabulary and the sub-level divisions that express stance tendency; S302, programmatically labeling the English text carrying part-of-speech tags according to the stance vocabulary labeling rules and the stance codes to obtain a stance-labeled data set.
7. The English text stance labeling method of claim 1, wherein step S4 comprises: dividing the stance-labeled data set into a training set, a validation set and a test set, and using the training set as the data source for training the RoBERTa-Large model so that the model learns the labeling results; using the tokenizer RobertaTokenizerFast to split sentences into WordPiece sub-words, taking the first sub-word of each word as the token mapped to the label of that word, and then assigning the labeling task to the pre-trained model class RobertaForTokenClassification, so that the model learns, through cross-entropy or multi-label loss, to predict the rule-derived labels on a large-scale corpus; adopting gradient accumulation, mixed precision and learning-rate warm-up during training to accommodate the GPU memory requirements of RoBERTa-Large; tuning the model parameters with the validation set and testing model performance with the test set; and finally taking the trained model as the stance vocabulary labeler.
8. An English text stance labeling system, comprising: a part-of-speech preprocessing module, configured to acquire English text through data collection and generate a part-of-speech tag for each word of the English text; a rule construction module, configured to formulate stance vocabulary labeling rules based on English syntactic structure, contextual semantics and lexical features, and to obtain a rule-based stance vocabulary labeler by writing code, wherein the stance vocabulary labeling rules comprise: simple stance vocabulary labeling, in which modal verbs, semi-modal verbs and stance adverbs are taken as simple stance vocabulary, the vocabulary definitions are expanded, and the simple stance vocabulary is identified and labeled in combination with the word form and the part-of-speech tag of the target word; and complex stance vocabulary labeling, in which stance verbs, stance adjectives and stance nouns that introduce complement clauses are taken as complex stance vocabulary, the recognition types of the complex stance vocabulary are summarized, general rules and special rules are set for each recognition type, and the complex stance vocabulary is identified and labeled according to the recognition types and the corresponding rules, in combination with the word form and the part-of-speech tag of the target word; a data set labeling module, configured to label the stance of the English text carrying part-of-speech tags with the rule-based stance vocabulary labeler to obtain a stance-labeled data set; a machine learning training module, configured to train on the stance-labeled data set with RoBERTa-Large as the base model to obtain a training-based stance vocabulary labeler, improving labeling accuracy; and a stance labeling module, configured to complete the stance labeling of the English text to be analyzed by means of the training-based stance vocabulary labeler.
9. A visual retrieval and labeling system, in which a web page with database visual retrieval and labeling functions is developed in a programming language, characterized in that the English text stance labeling method of any one of claims 1-7 is applied.
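
By way of non-limiting illustration, the dual-rule similarity matching of step S102-2 (claim 3) might be sketched as follows. The claims do not name a similarity measure, a threshold or a value of N, so the use of `difflib.SequenceMatcher`, the threshold 0.9, the prefix length 500 and the function names `similarity`, `is_similar` and `deduplicate` are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not from the claims)."""
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters in long strings, which would distort ratios on news text.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_similar(a: str, b: str, threshold: float = 0.9, prefix_len: int = 500) -> bool:
    """Rule 1: full-content similarity; rule 2: similarity of the first N characters."""
    if similarity(a, b) >= threshold:
        return True
    # Fall back to the subset comparison only when the full comparison fails.
    return similarity(a[:prefix_len], b[:prefix_len]) >= threshold


def deduplicate(texts: list[str], threshold: float = 0.9, prefix_len: int = 500) -> list[str]:
    """Keep one copy of each cluster of similar texts within a candidate group."""
    kept: list[str] = []
    for text in texts:
        if not any(is_similar(text, k, threshold, prefix_len) for k in kept):
            kept.append(text)
    return kept


if __name__ == "__main__":
    group = [
        "Markets rose on Monday as investors cheered.",
        "Markets rose on Monday as investors cheered!",
        "A completely different story about the weather.",
    ]
    print(len(deduplicate(group)))
```

In this sketch the function would be called once per group selected in step S102-1, so the pairwise comparison stays limited to small groups of texts that share a title.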

Description

English text stance labeling method and system

Technical Field

The invention belongs to the intersection of information technology and natural language processing, and in particular relates to an English text stance labeling method and system.

Background

Analysis and optimization of English text can be used to correct grammatical errors, and is important for expressing content accurately, avoiding inappropriate wording and rhetorical loopholes, increasing the penetrating power of discourse and improving communication. One important aspect of such analysis and optimization is the expression of stance. Appropriate use of English stance expressions conveys deeper meaning more accurately, so that views can be accurately understood and disseminated worldwide.

In research on the analysis and optimization of stance in English text, labeling stance vocabulary is a key step. Because stance vocabulary often has to be identified in combination with a specific grammatical structure, traditional research mostly uses a part-of-speech tagger to assign a part of speech to each word in the text, and then identifies the grammatical relations and stance words manually. Although this ensures labeling accuracy, the large amount of manual work makes it costly, so stance research on large-scale corpora is not feasible. There is therefore a strong need for an automated stance labeling method and analysis system that can accurately recognize stance vocabulary in English text at scale.

Disclosure of Invention

The invention aims to provide an English text stance labeling method and system that achieve accurate labeling of stance in English text by constructing an English text stance labeler.
To achieve the above object, the technical scheme of the invention is as follows: an English text stance labeling method comprises the following steps: S1, part-of-speech preprocessing: acquiring English text through data collection, and generating a part-of-speech tag for each word of the English text; S2, rule construction: on the basis of the part-of-speech tags, formulating stance vocabulary labeling rules according to English syntactic structure, contextual semantics and lexical features, and writing code to obtain a rule-based stance vocabulary labeler; S3, data set labeling: labeling the stance of the English text carrying part-of-speech tags with the rule-based stance vocabulary labeler to obtain a stance-labeled data set; S4, machine learning training: training on the stance-labeled data set with RoBERTa-Large as the base model to obtain a training-based stance vocabulary labeler, improving labeling accuracy; S5, stance labeling: completing the stance labeling of the English text to be analyzed by means of the training-based stance vocabulary labeler.
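
As a minimal sketch of what the rule-based labeler of steps S2 and S3 could look like, the following example labels a part-of-speech-tagged sentence. It assumes Penn Treebank tags (as produced by the Stanford POS Tagger), a tiny hypothetical lexicon, and the hypothetical label names `SIMPLE` and `COMPLEX`; the actual stance vocabulary, stance codes and the full general and special rules of the invention are not reproduced here.

```python
# Hypothetical mini-lexicons for illustration only; the real stance
# vocabulary and its inflected forms are defined by the invention.
STANCE_ADVERBS = {"certainly", "probably", "perhaps"}
STANCE_VERBS = {"believe", "believes", "believed", "hope", "hopes", "hoped"}


def label_stance(tagged: list[tuple[str, str]]) -> list[str]:
    """Return one label per (word, Penn Treebank tag) pair.

    SIMPLE  : modal verb (tag MD) or stance adverb
    COMPLEX : stance verb directly followed by a that- or to-clause
    O       : no stance label
    """
    labels = ["O"] * len(tagged)
    for i, (word, tag) in enumerate(tagged):
        w = word.lower()
        if tag == "MD":
            labels[i] = "SIMPLE"
        elif tag.startswith("RB") and w in STANCE_ADVERBS:
            labels[i] = "SIMPLE"
        elif tag.startswith("VB") and w in STANCE_VERBS and i + 1 < len(tagged):
            next_word, next_tag = tagged[i + 1]
            # Valid tag range: "that" counts only as a complementizer (IN),
            # not as a determiner (DT); "to" counts when tagged TO.
            is_that_clause = next_word.lower() == "that" and next_tag == "IN"
            is_to_clause = next_tag == "TO"
            if is_that_clause or is_to_clause:
                labels[i] = "COMPLEX"
    return labels


if __name__ == "__main__":
    sentence = [("Officials", "NNS"), ("believe", "VBP"), ("that", "IN"),
                ("prices", "NNS"), ("may", "MD"), ("certainly", "RB"),
                ("rise", "VB"), (".", ".")]
    for (word, tag), label in zip(sentence, label_stance(sentence)):
        print(f"{word}\t{tag}\t{label}")
```

The sketch checks only the token immediately after the stance verb. The qualification rules described for complex stance vocabulary also cover material that intervenes between the stance word and to/that, which a fuller implementation would need to handle.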
Further, step S1 includes: S101, collecting, from a global news database, English text corpora of foreign mainstream media news and of the news of major domestic foreign-facing media; S102, performing duplicate screening on the collected text corpora, the duplicate screening comprising extracting title prefixes by regular expressions, grouping and deduplicating, and performing dual-rule parallel similarity matching to delete similar texts; S103, cleaning the data of the texts that have undergone duplicate screening; S104, segmenting the cleaned text content by regular expressions to separate the title, date, source, watermark information and body text; S105, constructing sentence-splitting logic with regular expressions, splitting the text content into individual sentences based on English punctuation and quotation rules, and tagging the split sentences with the Stanford part-of-speech tagger (Stanford POS Tagger) to obtain a part-of-speech tag for each word; S106, constructing a PostgreSQL database with the fields ID, title, source, date, body text, sentence-splitting result and tags, and writing the separated content into the corresponding fields other than ID, each record being identified by a unique ID, to form an English text database.
Further, the duplicate screening in step S102 specifically includes: S102-1, title prefix extraction and grouping: extracting a prefix from the first-line title of each text by a regular expression, grouping the texts by the extracted prefix, counting the number of texts and the number of unique titles in each group, retaining only the groups containing more than one text as deduplication candidates, and further selecting the groups in which all titles are identical and the number of texts is less than