CN-121981112-A - Portuguese large-language-model pre-training method and system based on multi-source data
Abstract
The invention discloses a Portuguese large-language-model pre-training method and system based on multi-source data. Multi-source Portuguese texts such as news, social media, government, and academic writing are collected, and vocabulary-coverage and language-style features are extracted to realize corpus quality layering and variant recognition. The inflectional structure of Portuguese is analyzed using morphological normalization and affix-separation techniques to generate training sample units carrying morphological labels. High-value mask positions are identified from morphological complexity, and three-level mask tasks at the word, phrase, and sentence levels are configured; a morphological-consistency constraint strengthens the model's learning of grammatical relations such as subject-verb agreement and noun-adjective agreement. A curriculum-learning strategy executes the training tasks progressively by difficulty, and finally multi-dimensional evaluation screens the optimal parameters to output a Portuguese pre-trained model, which can provide a high-quality language-representation basis for downstream tasks such as Portuguese text classification, named-entity recognition, and machine translation.
Inventors
- LIN YUCHU
- WANG YIMING
- CHEN XINYUAN
- ZHANG YANCHUAN
- CHEN YIWEI
Assignees
- 深绎未来科技(广东横琴)有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A Portuguese large-language-model pre-training method based on multi-source data, comprising: collecting Portuguese multi-source heterogeneous text data, and performing a cross-source consistency check on the Portuguese multi-source heterogeneous text data to extract Portuguese vocabulary-coverage features and language-style features; performing quality-layering screening and recognizing a core corpus using the Portuguese vocabulary-coverage features and the language-style features, performing Brazilian/European Portuguese variant recognition on the core corpus to generate a variant tag set, performing corpus quality evaluation according to the variant tag set, calculating corpus quality scores, and defining a graded cleaning threshold based on the corpus quality scores; acquiring a cleaned effective corpus from the Portuguese multi-source heterogeneous text data, performing root-affix separation and semantic-unit segmentation on the effective corpus to generate training sample units, and performing hierarchical labeling on the training sample units according to the graded cleaning threshold to establish a corpus index table; executing morphological-complexity labeling based on the corpus quality scores and the training sample units to generate morphology-sensitive mask positions, configuring word-phrase-sentence three-level mask tasks based on the morphology-sensitive mask positions, and combining the word-phrase-sentence three-level mask tasks with the corpus index table to formulate a pre-training task sequence; and carrying out model training iterations using the pre-training task sequence to generate an iteration parameter set, and screening the optimal parameters from the iteration parameter set to output a Portuguese pre-trained model.
- 2. The method of claim 1, wherein the performing quality-layering screening and recognizing a core corpus using the Portuguese vocabulary-coverage features and the language-style features comprises: performing coverage grading on the Portuguese vocabulary-coverage features to form a coverage grading table; carrying out style-consistency analysis using the language-style features to generate a style-consistency grading; carrying out cross screening according to the coverage grading table and the style-consistency grading to generate quality grades; and recognizing the core corpus according to the quality grades.
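The cross-screening step of claim 2 can be sketched as follows. This is an illustrative toy implementation, not the patent's actual method: the grade thresholds, the spread-based style measure, and the cross table are all assumptions.

```python
# Toy sketch of claim 2: cross-screen documents by vocabulary-coverage
# grade and style-consistency grade. Thresholds are illustrative.

def coverage_grade(covered_vocab: int, reference_vocab: int) -> str:
    """Grade a document by the share of a reference vocabulary it covers."""
    ratio = covered_vocab / reference_vocab
    if ratio >= 0.8:
        return "high"
    if ratio >= 0.5:
        return "mid"
    return "low"

def style_grade(style_scores: list) -> str:
    """Grade style consistency by the spread of per-paragraph style scores."""
    spread = max(style_scores) - min(style_scores)
    return "consistent" if spread <= 0.2 else "mixed"

def quality_grade(cov: str, sty: str) -> str:
    """Cross screening: only high-coverage, consistent text is 'core'."""
    if cov == "high" and sty == "consistent":
        return "core"
    if cov == "low" or sty == "mixed":
        return "discard"
    return "auxiliary"

doc = {"covered": 8200, "reference": 10000, "styles": [0.71, 0.66, 0.75]}
cov = coverage_grade(doc["covered"], doc["reference"])
sty = style_grade(doc["styles"])
grade = quality_grade(cov, sty)
```

Documents graded "core" would feed variant recognition; "auxiliary" text could still serve lower-priority training tiers.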
- 3. The method of claim 1, wherein the performing Brazilian/European Portuguese variant recognition on the core corpus to generate a variant tag set comprises: carrying out mapping analysis on variant feature words in the core corpus to establish a feature-word positioning library; performing Brazilian/European Portuguese spelling-difference matching using the feature-word positioning library to form preliminary variant marks; performing variant-confidence assessment on the preliminary variant marks to generate high-confidence variant marks; and summarizing the high-confidence variant marks to form the variant tag set.
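The variant matching of claim 3 can be illustrated with a small lexicon of known Brazilian Portuguese (BP) versus European Portuguese (EP) word pairs. The pair table and the minimum-hit confidence rule below are assumptions for illustration, not the patent's feature-word library.

```python
# Hypothetical sketch of claim 3: tag a text as pt-BR or pt-PT using a
# tiny table of known BP/EP lexical and spelling differences.

BP_EP_PAIRS = {
    "ônibus": "autocarro",   # bus
    "trem": "comboio",       # train
    "gênero": "género",      # circumflex (BP) vs acute (EP) accent
    "fato": "facto",         # dropped vs kept consonant cluster
}
BP_FORMS = set(BP_EP_PAIRS)
EP_FORMS = set(BP_EP_PAIRS.values())

def tag_variant(tokens, min_hits: int = 2) -> str:
    """Return a high-confidence variant tag, or 'unknown' below threshold."""
    bp = sum(t in BP_FORMS for t in tokens)
    ep = sum(t in EP_FORMS for t in tokens)
    if bp >= min_hits and bp > ep:
        return "pt-BR"
    if ep >= min_hits and ep > bp:
        return "pt-PT"
    return "unknown"
```

A single marker word stays "unknown" under this rule, which models the confidence-assessment step: only texts with enough corroborating markers receive a variant tag.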
- 4. The method of claim 1, wherein the performing root-affix separation and semantic-unit segmentation on the effective corpus to generate training sample units comprises: performing contraction splitting and morphological normalization on the effective corpus to establish a root positioning table; identifying prefixes and suffixes using the root positioning table to form an affix separation sequence; carrying out semantic-boundary detection according to the affix separation sequence to generate semantic-unit boundaries; and cutting the effective corpus according to the semantic-unit boundaries to determine the training sample units.
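The contraction splitting and suffix stripping of claim 4 can be sketched as below. Portuguese fuses prepositions with articles (do = de + o, na = em + a, pelo = por + o); the contraction table and suffix list here are small assumed samples, not an exhaustive morphological resource.

```python
# Toy sketch of claim 4: split Portuguese contractions and strip the
# longest matching known suffix. Both tables are illustrative samples.

CONTRACTIONS = {
    "do": ["de", "o"], "da": ["de", "a"],
    "no": ["em", "o"], "na": ["em", "a"],
    "pelo": ["por", "o"], "pela": ["por", "a"],
}
SUFFIXES = ["mente", "ção", "ar", "er", "ir", "s"]

def split_contractions(tokens):
    """Expand fused preposition+article forms into their components."""
    out = []
    for t in tokens:
        out.extend(CONTRACTIONS.get(t, [t]))
    return out

def strip_suffix(word):
    """Return (root, suffix) using the longest matching known suffix."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)], suf
    return word, ""
```

Matching longest-first matters: "rapidamente" should yield the adverbial suffix "mente", not the bare final "e" of a shorter pattern.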
- 5. The method of claim 1, wherein the performing morphological-complexity labeling based on the corpus quality scores and the training sample units to generate morphology-sensitive mask positions comprises: identifying verb inflections from the training sample units to establish an inflection-complexity profile; carrying out cross evaluation using the inflection-complexity profile and the corpus quality scores to generate morphological-complexity scores; screening high-complexity positions according to the morphological-complexity scores to generate candidate mask positions; and determining the morphology-sensitive mask positions according to the candidate mask positions.
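One way to read claim 5 is that each token position gets a score combining how richly inflected its word is with the quality of the source document, and the top-scoring positions become mask candidates. The scoring rule and the form-count table below are assumptions for illustration.

```python
# Assumed sketch of claim 5: score positions by inflectional complexity
# (distinct inflected forms observed for the token) times corpus quality,
# then keep the top-k positions as morphology-sensitive mask candidates.

def mask_positions(tokens, form_counts, quality, top_k=2):
    """form_counts: token -> number of distinct inflected forms observed."""
    scored = [
        (i, form_counts.get(tok, 1) * quality)
        for i, tok in enumerate(tokens)
    ]
    scored.sort(key=lambda p: p[1], reverse=True)
    return sorted(i for i, _ in scored[:top_k])

counts = {"falamos": 6, "casa": 2, "bonitas": 4}
pos = mask_positions(["nós", "falamos", "de", "casas", "bonitas"], counts, 0.9)
```

Here the conjugated verb "falamos" and the gender/number-marked adjective "bonitas" outrank function words, so masking concentrates where morphology carries the most signal.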
- 6. The method of claim 1, wherein the configuring word-phrase-sentence three-level mask tasks based on the morphology-sensitive mask positions comprises: performing morphological-dependency analysis on the morphology-sensitive mask positions to identify morphologically associated position pairs; dynamically allocating the morphology-sensitive mask positions according to grammatical granularity, combined with text-domain characteristics, to generate a word-phrase-sentence three-level mask proportion configuration; performing joint mask sampling on the morphology-sensitive mask positions and the morphologically associated position pairs according to the word-phrase-sentence three-level mask proportion configuration to generate a mask sample set; and constructing the word-phrase-sentence three-level mask tasks according to the mask sample set.
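The proportional allocation of claim 6 can be sketched as a partition of candidate positions across the three granularities. The 6:3:1 split is an assumed configuration, not a ratio stated in the patent.

```python
# Illustrative sketch of claim 6: partition mask candidates into
# word / phrase / sentence sets by a fixed (assumed) proportion.
import random

def sample_mask_plan(candidate_positions, ratios=(0.6, 0.3, 0.1), seed=0):
    """Partition candidate positions into word/phrase/sentence mask sets."""
    rng = random.Random(seed)          # seeded for reproducible sampling
    pool = list(candidate_positions)
    rng.shuffle(pool)
    n = len(pool)
    n_word = round(n * ratios[0])
    n_phrase = round(n * ratios[1])
    return {
        "word": sorted(pool[:n_word]),
        "phrase": sorted(pool[n_word : n_word + n_phrase]),
        "sentence": sorted(pool[n_word + n_phrase :]),
    }

plan = sample_mask_plan(range(10))
```

A real implementation would additionally expand each phrase-level position to a contiguous span and mask whole sentences at the sentence level; this sketch only shows the proportional bookkeeping.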
- 7. The method of claim 1, wherein the carrying out model training iterations using the pre-training task sequence to generate an iteration parameter set comprises: ordering the pre-training task sequence by task difficulty to form a task scheduling queue; sequentially executing forward-propagation calculation based on the task scheduling queue to generate a loss-value sequence; carrying out morphological loss weighting on the loss-value sequence to generate weighted loss values; and performing back-propagation parameter updates and accumulation according to the weighted loss values to form the iteration parameter set.
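The scheduling and loss-weighting bookkeeping of claim 7 can be sketched as follows. The difficulty values and the 1.5x morphology weight are assumptions; actual training would run inside a deep-learning framework rather than plain Python.

```python
# Hypothetical sketch of claim 7: curriculum ordering of tasks plus
# upweighting of losses at morphology-sensitive positions.

def schedule(tasks):
    """Order pre-training tasks easiest-first (curriculum learning)."""
    return sorted(tasks, key=lambda t: t["difficulty"])

def weighted_loss(losses, morph_flags, morph_weight=1.5):
    """Mean loss with morphology-sensitive positions upweighted."""
    total = 0.0
    for loss, is_morph in zip(losses, morph_flags):
        total += loss * (morph_weight if is_morph else 1.0)
    return total / len(losses)

queue = schedule([
    {"name": "sentence_mask", "difficulty": 3},
    {"name": "word_mask", "difficulty": 1},
    {"name": "phrase_mask", "difficulty": 2},
])
loss = weighted_loss([2.0, 1.0], [True, False])
```

The weighted loss then drives ordinary back-propagation; the parameter snapshots saved per iteration form the "iteration parameter set" from which the best checkpoint is screened.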
- 8. The method of claim 4, wherein the identifying prefixes and suffixes using the root positioning table to form an affix separation sequence comprises: scanning forward along the root positioning table to identify prefix components and form prefix marks; scanning backward along the root positioning table to identify suffix components and form suffix marks; performing inflection-association labeling on the suffix marks to generate inflection-association marks; and summarizing the prefix marks and the inflection-association marks to determine the affix separation sequence.
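The bidirectional scan of claim 8 can be illustrated as matching prefixes from the word start and suffixes from the word end. The affix lists are tiny assumed samples.

```python
# Toy sketch of claim 8: forward prefix scan and backward suffix scan
# against small (assumed) affix inventories.

PREFIXES = ["des", "re", "in"]
SUFFIXES = ["mente", "ção", "dor"]   # longest entries listed first

def separate_affixes(word):
    """Return (prefix, root, suffix); empty strings when nothing matches."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
    )
    root = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, root, suffix

parts = separate_affixes("desvalorização")
```

For "desvalorização" this yields the prefix "des", root "valoriza", and nominalizing suffix "ção", the kind of decomposition the root positioning table would record.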
- 9. The method of claim 5, wherein the identifying verb inflections from the training sample units to establish an inflection-complexity profile comprises: executing part-of-speech tagging on the training sample units and screening verb components to form a verb candidate set; performing word-ending feature analysis on the verb candidate set to generate inflectional-form marks; classifying the inflectional-form marks by person and tense to count the number of inflectional forms; and establishing the inflection-complexity profile according to the number of inflectional forms.
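The counting step of claim 9 can be sketched by grouping verb tokens by stem and counting their distinct endings. The ending inventory and the stem rule below are toy assumptions, far simpler than real Portuguese conjugation.

```python
# Illustrative sketch of claim 9: build an inflection-complexity profile
# as stem -> number of distinct inflectional endings observed.

ENDINGS = ["amos", "aram", "ava", "ou", "am", "o", "a"]

def ending_of(verb):
    """Longest known ending of the verb form, or '' if none matches."""
    for e in sorted(ENDINGS, key=len, reverse=True):
        if verb.endswith(e) and len(verb) > len(e):
            return e
    return ""

def complexity_profile(verbs):
    profile = {}
    for v in verbs:
        e = ending_of(v)
        stem = v[: len(v) - len(e)] if e else v
        profile.setdefault(stem, set()).add(e)
    return {stem: len(endings) for stem, endings in profile.items()}

prof = complexity_profile(["falo", "fala", "falamos", "falou"])
```

Four forms of "falar" map to one stem with four distinct endings, so that stem gets a high complexity count and its occurrences become strong masking candidates under claim 5.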
- 10. A Portuguese large-language-model pre-training system based on multi-source data, comprising: a data acquisition module for acquiring Portuguese multi-source heterogeneous text data, and performing a cross-source consistency check on the Portuguese multi-source heterogeneous text data to extract Portuguese vocabulary-coverage features and language-style features; a corpus cleaning module for carrying out quality-layering screening and recognizing a core corpus using the Portuguese vocabulary-coverage features and the language-style features, carrying out Brazilian/European Portuguese variant recognition on the core corpus to generate a variant tag set, carrying out corpus quality evaluation according to the variant tag set, calculating corpus quality scores, and defining a graded cleaning threshold based on the corpus quality scores; a corpus processing module for acquiring a cleaned effective corpus from the Portuguese multi-source heterogeneous text data, performing root-affix separation and semantic-unit segmentation on the effective corpus to generate training sample units, and performing hierarchical labeling on the training sample units according to the graded cleaning threshold to establish a corpus index table; a task configuration module for executing morphological-complexity labeling based on the corpus quality scores and the training sample units to generate morphology-sensitive mask positions, configuring word-phrase-sentence three-level mask tasks based on the morphology-sensitive mask positions, and combining the word-phrase-sentence three-level mask tasks with the corpus index table to formulate a pre-training task sequence; and a model training module for carrying out model training iterations using the pre-training task sequence to generate an iteration parameter set, and screening the optimal parameters from the iteration parameter set to output a Portuguese pre-trained model.
Description
Portuguese large-language-model pre-training method and system based on multi-source data
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Portuguese large-language-model pre-training method and system based on multi-source data.
Background
Portuguese, the sixth most spoken language globally, covers multiple countries and regions such as Brazil, Portugal, Angola, and Mozambique, and has more than two hundred million native speakers. With the deepening application of artificial-intelligence technology in language understanding, building a high-quality Portuguese large language model has become a key requirement for supporting intelligent services in Portuguese-speaking regions. However, Portuguese is a typical inflected language whose morphological system is far more complex than that of an analytic language such as English: verbs change according to grammatical categories such as person, tense, and mood, while nouns and adjectives change according to gender and number, and this rich morphological variation challenges a language model's learning of morphological rules. Existing large-language-model pre-training methods are mainly designed for languages such as English with relatively simple morphology, and when migrated directly to Portuguese they struggle to capture its distinctive inflectional rules. Meanwhile, Portuguese shows marked regional differences: Brazilian Portuguese differs from European Portuguese in spelling habits, vocabulary choice, and syntactic structure, and a general-purpose model can rarely accommodate the characteristics of both variants.
In addition, Portuguese corpus sources on the internet are scattered and of uneven quality, and without an effective screening and processing mechanism, low-quality text is mixed into the training data and degrades model performance.
Disclosure of Invention
The invention discloses a Portuguese large-language-model pre-training method and system based on multi-source data, which aim to construct a pre-training process dedicated to the inflectional characteristics of Portuguese. Through multi-source corpus collection, quality layering, morphological-structure analysis, sample-unit generation, multi-granularity mask-task configuration, and curriculum-learning training, the method realizes deep learning of Portuguese morphological rules and finally outputs a high-quality pre-trained model covering the characteristics of both Brazilian Portuguese and European Portuguese. The first aspect of the invention provides a Portuguese large-language-model pre-training method based on multi-source data, comprising the following steps: collecting Portuguese multi-source heterogeneous text data, and performing a cross-source consistency check on the Portuguese multi-source heterogeneous text data to extract Portuguese vocabulary-coverage features and language-style features; performing quality-layering screening and recognizing a core corpus using the Portuguese vocabulary-coverage features and the language-style features, performing Brazilian/European Portuguese variant recognition on the core corpus to generate a variant tag set, performing corpus quality evaluation according to the variant tag set, calculating corpus quality scores, and defining a graded cleaning threshold based on the corpus quality scores; acquiring a cleaned effective corpus from the Portuguese multi-source heterogeneous text data, performing root-affix separation and semantic-unit segmentation on the effective corpus to generate training sample units, and performing hierarchical labeling on the training sample units according to the graded cleaning threshold to establish a corpus index table; executing morphological-complexity labeling based on the corpus quality scores and the training sample units to generate morphology-sensitive mask positions, configuring word-phrase-sentence three-level mask tasks based on the morphology-sensitive mask positions, and combining the word-phrase-sentence three-level mask tasks with the corpus index table to formulate a pre-training task sequence; and carrying out model training iterations using the pre-training task sequence to generate an iteration parameter set, and screening the optimal parameters from the iteration parameter set to output a Portuguese pre-trained model. The second aspect of the invention provides a Portuguese large-language-model pre-training system based on multi-source data, comprising: a data acquisition module for acquiring Portuguese multi-source heterogeneous text data, and performing a cross-source consistency check on the Portuguese