CN-121543600-B - Intelligent analysis and text optimization method and system for external financial utterances

Abstract

The invention provides an intelligent analysis and text optimization method and system for external financial utterances, belonging to the interdisciplinary field of information technology and natural language processing. The method comprises: collecting English financial news corpora, preprocessing them and warehousing them to form an English financial news database; training a machine-learning-based language feature labeling model to label the language features of English texts; building a financial utterance analysis framework that takes dimension scores and the per-thousand-word frequencies of language features as its analysis basis; and training a vertical-domain large model for external financial utterances, which analyzes English financial news texts on the basis of the financial utterance analysis framework and provides optimization suggestions. The invention achieves accurate labeling, language feature analysis and high-quality polishing of English financial texts, and effectively improves the readability, accuracy and influence of financial texts among native English readers.

Inventors

LIU TINGTING
Diao Heng

Assignees

Tianjin University of Finance and Economics (天津财经大学)

Dates

Publication Date: 20260508
Application Date: 20260120

Claims (9)

1. An intelligent analysis and text optimization method for external financial utterances, characterized by comprising the following steps:
S1, core data acquisition and cleaning: acquiring foreign media English financial news corpus and domestic and foreign media English financial news corpus, then preprocessing and warehousing it to form an English financial news database;
S2, training a language feature labeling model: performing three rounds of model training based on a machine learning method, wherein the first round identifies language features comprising part of speech, simple grammar and vocabulary, the second round identifies language features comprising complex grammar, and the third round performs iterative improvement, finally yielding an English text language feature labeling model;
S3, constructing a financial utterance analysis framework: calculating the per-thousand-word frequency of the language features, obtaining a plurality of dimensions through factor analysis, and taking the score of each dimension and the standardized frequency of the language features as the analysis basis of the financial utterance analysis framework, which is used to calculate the Euclidean distance between the framework and a text to be analyzed, comprising the following steps:
S301, re-labeling the English texts of the English financial news database with the language feature labeling model to obtain labeling results containing all language features;
S302, calculating the per-thousand-word frequency of every language feature in each piece of English text data as: per-thousand-word frequency of a language feature = (actual frequency of the language feature / total number of tokens in the text) × 1000;
S303, importing the per-thousand-word frequencies of all language features into the social science statistical software SPSS for factor analysis, and deleting language features whose loading is lower than 0.35, thereby forming a plurality of feature dimensions, each comprising a plurality of language features with a clear co-occurrence tendency;
S304, counting the co-occurrence distribution of the language features of the same dimension in key sentences, and determining the specific category name of each dimension by observing that distribution;
S305, sampling to form a dimension naming verification data set, assigning the language features to their corresponding dimensions, and carrying out quantitative verification and qualitative assessment, wherein the quantitative verification comprises calculating, on the dimension naming verification data set, the explained variance ratio and the Cronbach's alpha coefficient of the loaded features of each dimension, checking the correlation coefficients among the dimensions, then clustering the dimension naming verification data set according to the projected dimension scores, and assessing how effectively the dimension scores distinguish differences between texts;
S306, taking the per-thousand-word frequency of the language features as the standardized frequency and calculating each dimension score as: dimension score = (sum of the per-thousand-word frequencies of all positive features in the dimension) - (sum of the per-thousand-word frequencies of all negative features in the dimension);
S4, training a vertical-domain large model for external financial utterances: fine-tuning an open-source large model as the base so that it masters the knowledge of steps S1-S3, thereby obtaining the vertical-domain large model for external financial utterances, which provides optimization suggestions for improving the quality of English financial texts.
2. The intelligent analysis and text optimization method for external financial utterances according to claim 1, wherein the preprocessing in step S1 comprises:
S101, de-duplication: screening out, by regular expression, English financial news corpus items that may be duplicates, and removing duplicates based on a similarity calculation over the text content;
S102, data screening: locating, by regular expression, texts containing runs of consecutive capital letters, and judging whether they have non-standard formatting or low content quality and whether they need to be rejected.
3. The intelligent analysis and text optimization method for external financial utterances according to claim 1, characterized in that in step S1, data standardization and cleaning are performed after preprocessing and before warehousing, comprising: adjusting the title format; extracting proper nouns to form a proper noun dictionary; standardizing the case, punctuation marks, common abbreviations and sentence capitalization of the text content; cleaning the text by removing irrelevant content including illegal characters or website links; and lemmatizing the cleaned titles and texts.
4. The intelligent analysis and text optimization method for external financial utterances, characterized in that the warehousing in step S1 comprises: segmenting the text content to separate the title, date, source, watermark information and body text; constructing a database divided into 8 fields, namely ID, title, cleaned title, watermark, source, date, text and cleaned text; writing the segmented content into the corresponding fields, with each record having a unique ID, to form the English financial news database; and clustering the cleaned texts and cleaned titles in the English financial news database multiple times with a fine-tuned LDA topic clustering model to form a plurality of primary, secondary and tertiary topics.
5. The intelligent analysis and text optimization method for external financial utterances according to claim 1, wherein the first round of training of the language feature labeling model in step S2 comprises:
S201, performing a first labeling pass on the English text content with a part-of-speech tagging tool, assigning a basic part-of-speech label to each word;
S202, programmatically performing a second labeling pass on the relevant words, based on the basic part-of-speech labels combined with English grammar rules, and assigning grammatical feature labels;
S203, programmatically judging, according to English stance vocabulary and the grammar rules of the corresponding clause types, whether a word expresses stance, and if so, performing a third labeling pass on the stance word and assigning a stance label;
S204, dividing the data set formed after the three labeling passes into a training set A, a validation set of A, and a test set B;
S205, performing labeling training on the training set A with RoBERTa-Large as the base model, checking the learning effect of each training epoch on the validation set of A, and testing model performance on the test set B, thereby obtaining the DDU Tagger 1.0.0 language feature labeling model.
6. The intelligent analysis and text optimization method for external financial utterances according to claim 5, wherein step S205 specifically comprises: using RobertaTokenizerFast to split sentences into WordPiece subwords; mapping the first subword of each word, as a token, to the label of that word; then assigning the labeling training task to RobertaForTokenClassification so that the model learns, through cross-entropy loss, to predict the rule-based labels on a large-scale corpus; adopting gradient accumulation, mixed precision and learning rate warm-up during training to accommodate the GPU memory requirements of RoBERTa-Large; and finally obtaining the DDU Tagger 1.0.0 language feature labeling model.
7. The intelligent analysis and text optimization method for external financial utterances according to claim 5, wherein the second round of training of the language feature labeling model in step S2 comprises:
S211, independently coding the language features involving complex grammar that cannot currently be labeled accurately in the English text, for use as complex grammar feature labels;
S212, converting electronic versions of different English grammar dictionaries into searchable text and importing it into a database; forming different batches of original example sentences through indexing and classification; pre-labeling each batch of original example sentences with DDU Tagger 1.0.0 while retaining the items to be labeled so that the complex language features can be annotated; taking all batches of labeled items for the same language feature as one sub-data set; repeating the process to generate sub-data sets C1, C2, C3, ..., Cn, where n is the number of language feature categories; combining all sub-data sets into a complete data set C; and reserving 10% as a test set D;
S213, training the model on the data set C with the DDU Tagger 1.0.0 language feature labeling model as the initial weights, to obtain the DDU Tagger 2.0.0 language feature labeling model.
8. The intelligent analysis and text optimization method for external financial utterances according to claim 5, wherein the third round of training of the language feature labeling model in step S2 comprises: re-labeling the test set B used in the first round of training with the DDU Tagger 2.0.0 language feature labeling model; counting the correct and incorrect labels to check the labeling accuracy of the model; iteratively improving the labels with high error rates; combining all error examples and newly added annotations into a final training set E; and retraining with the DDU Tagger 2.0.0 language feature labeling model as the initial weights until the macro-averaged F1 is greater than or equal to 0.92, the recall of each new label is greater than or equal to 0.80, and the overall accuracy is greater than or equal to 98%, thereby obtaining the final language feature labeling model DDU Tagger 3.0.0.
9. An intelligent analysis and text optimization system for external financial utterances, comprising a core data acquisition and cleaning layer, a language feature labeling model training layer, a financial utterance analysis framework module and a vertical-domain large model for external financial utterances, connected in sequence;
the core data acquisition and cleaning layer acquires foreign media English financial news corpus and domestic and foreign media English financial news corpus, and performs preprocessing and warehousing to form an English financial news database;
the language feature labeling model training layer performs three rounds of model training based on a machine learning method, wherein the first round identifies language features comprising part of speech, simple grammar and stance vocabulary, the second round identifies language features comprising complex grammar, and the third round performs iterative improvement, finally yielding an English text language feature labeling model;
the financial utterance analysis framework module calculates the per-thousand-word frequency of the language features, obtains a plurality of dimensions through factor analysis, and takes the score of each dimension and the standardized frequency of the language features as the analysis basis of the financial utterance analysis framework, which is used to calculate the Euclidean distance between the framework and a text to be analyzed; this comprises: re-labeling the English texts of the English financial news database with the language feature labeling model to obtain labeling results containing all language features; calculating the per-thousand-word frequency of every language feature in each piece of English text data as: per-thousand-word frequency of a language feature = (actual frequency of the language feature / total number of tokens in the text) × 1000; importing the per-thousand-word frequencies of all language features into the social science statistical software SPSS for factor analysis, and deleting language features whose loading is lower than 0.35, thereby forming a plurality of feature dimensions, each comprising a plurality of language features with a clear co-occurrence tendency; counting the co-occurrence distribution of the language features of the same dimension in key sentences, and determining the specific category name of each dimension by observing that distribution; sampling to form a dimension naming verification data set, assigning the language features to their corresponding dimensions, and carrying out quantitative verification and qualitative assessment, wherein the quantitative verification comprises calculating, on the dimension naming verification data set, the explained variance ratio and the Cronbach's alpha coefficient of the loaded features of each dimension, checking the correlation coefficients among the dimensions, clustering the dimension naming verification data set according to the projected dimension scores, and assessing how effectively the dimension scores distinguish differences between texts, and wherein the qualitative assessment comprises selecting, from the dimension naming verification data set, sentences containing the listed features and judging whether those sentences are consistent with the semantic category implied by the dimension name, the naming being confirmed as reasonable if they are consistent and the feature attribution or the naming being readjusted if a mismatch occurs; and taking the per-thousand-word frequency of the language features as the standardized frequency and calculating each dimension score as: dimension score = (sum of the per-thousand-word frequencies of all positive features in the dimension) - (sum of the per-thousand-word frequencies of all negative features in the dimension);
the vertical-domain large model for external financial utterances is obtained by fine-tuning an open-source large model as the base so that it masters the knowledge of the core data acquisition and cleaning layer, the language feature labeling model training layer and the financial utterance analysis framework module, yielding an intelligent, interactive vertical-domain large model for external financial utterances that provides optimization suggestions for improving the quality of English financial texts.
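
The two formulas recited in the claims (per-thousand-word frequency and dimension score) can be illustrated with a minimal Python sketch. This is not the patented implementation: the feature label names (`PAST_TENSE`, `HEDGE`, `NOMINALIZATION`), the function names, and the assignment of features to the positive or negative side of a dimension are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_thousand_frequencies(feature_tags, total_tokens):
    """Per-thousand-word frequency = (feature count / total tokens) * 1000."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    counts = Counter(feature_tags)
    # Multiply before dividing to keep the arithmetic exact for small integers.
    return {feat: count * 1000 / total_tokens for feat, count in counts.items()}


def dimension_score(freqs, positive_features, negative_features):
    """Sum of positive-feature frequencies minus sum of negative-feature frequencies."""
    positive = sum(freqs.get(f, 0.0) for f in positive_features)
    negative = sum(freqs.get(f, 0.0) for f in negative_features)
    return positive - negative


# Hypothetical labels found in one 200-token text.
tags = ["PAST_TENSE", "PAST_TENSE", "NOMINALIZATION", "HEDGE"]
freqs = per_thousand_frequencies(tags, total_tokens=200)

# Hypothetical dimension: two positive features, one negative feature.
score = dimension_score(freqs, ["PAST_TENSE", "HEDGE"], ["NOMINALIZATION"])
print(freqs["PAST_TENSE"], score)  # 10.0 10.0
```

Features absent from a text contribute a frequency of zero in this sketch, which is one possible convention; the patent text does not state how missing features are handled.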

Description

Intelligent analysis and text optimization method and system for external financial utterances

Technical Field

The invention belongs to the interdisciplinary field of information technology and natural language processing, and particularly relates to an intelligent analysis and text optimization method and system for external financial utterances.

Background

Existing English text analysis and optimization methods or systems focus only on simple English part-of-speech or grammar analysis. When applied to external financial utterances, the existing technology has several shortcomings, for example:
(1) English financial utterance expression cannot be accurately evaluated through language feature co-occurrence analysis in the financial register.
(2) Stance expressions in English financial utterances cannot be accurately located and systematically analyzed.
(3) Professional optimization suggestions for English financial utterances, and related example sentences from the real register, cannot be obtained through intelligent interaction with a self-trained large model.

In summary, existing English text analysis and optimization methods or systems struggle to achieve accurate labeling, language feature analysis and high-quality polishing of English financial texts, and struggle to improve the readability, accuracy and influence of financial texts among native English readers.

Disclosure of Invention

The invention aims to provide an intelligent analysis and text optimization method and system for external financial utterances, which combine machine learning technology with statistical analysis methods to construct an English financial utterance evaluation system, so as to analyze, evaluate and improve financial texts.
To achieve the above object, the technical scheme of the invention is as follows. An intelligent analysis and text optimization method for external financial utterances comprises the following steps:
S1, core data acquisition and cleaning: acquiring foreign media English financial news corpus and domestic and foreign media English financial news corpus, then preprocessing and warehousing it to form an English financial news database;
S2, training a language feature labeling model: performing three rounds of model training based on a machine learning method, wherein the first round identifies language features comprising part of speech, simple grammar and vocabulary, the second round identifies language features comprising complex grammar, and the third round performs iterative improvement, finally yielding an English text language feature labeling model;
S3, constructing a financial utterance analysis framework: calculating the per-thousand-word frequency of the language features, obtaining a plurality of dimensions through factor analysis, and taking the standard score of each dimension and the standardized frequency of the language features as the analysis basis of the financial utterance analysis framework, which is used to calculate the Euclidean distance to a text to be analyzed;
S4, training a vertical-domain large model for external financial utterances: fine-tuning an open-source large model as the base so that it masters the knowledge of steps S1-S3, thereby obtaining the vertical-domain large model for external financial utterances, which provides optimization suggestions for improving the quality of English financial texts.
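
Step S3 compares a text to be analyzed with the analysis framework by Euclidean distance over dimension scores and standardized feature frequencies. The sketch below shows one way such a comparison could look. It assumes, hypothetically, that the framework is summarized as a single reference profile (for example, corpus-level averages); the source does not say how the reference vector is built, and the keys `dim_1`, `dim_2` and `HEDGE` are invented for illustration.

```python
import math


def euclidean_distance(reference, candidate):
    """Euclidean distance between two profiles keyed by dimension or feature name.

    Keys missing from either profile are treated as 0.0 (an assumption).
    """
    keys = set(reference) | set(candidate)
    return math.sqrt(
        sum((reference.get(k, 0.0) - candidate.get(k, 0.0)) ** 2 for k in keys)
    )


# Hypothetical reference profile standing in for the analysis framework.
reference_profile = {"dim_1": 12.0, "dim_2": -3.0, "HEDGE": 5.0}
# Hypothetical profile of the text to be analyzed.
text_profile = {"dim_1": 9.0, "dim_2": 1.0, "HEDGE": 5.0}

distance = euclidean_distance(reference_profile, text_profile)
print(distance)  # 5.0
```

A smaller distance would indicate that the text's language-feature profile lies closer to the reference; how the distance is turned into optimization suggestions is left to the fine-tuned model in step S4.
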
Further, the preprocessing in step S1 comprises:
S101, de-duplication: screening out, by regular expression, English financial news corpus items that may be duplicates, and removing duplicates based on a similarity calculation over the text content;
S102, data screening: locating, by regular expression, texts containing runs of consecutive capital letters, and judging whether they have non-standard formatting or low content quality and whether they need to be rejected.
Further, in step S1, data standardization and cleaning are performed after preprocessing and before warehousing, comprising: adjusting the title format; extracting proper nouns to form a proper noun dictionary; standardizing the case, punctuation marks, common abbreviations and sentence capitalization of the text content; cleaning the text by removing irrelevant content including illegal characters or website links; and lemmatizing the cleaned titles and texts.
Further, the warehousing in step S1 comprises: segmenting the text content to separate the title, date, source, watermark information and body text; constructing a database divided into 8 fields, namely ID, title, cleaned title, watermark, source, date, text and cleaned text,