CN-122021666-A - Efficient automatic construction method for Tibetan bilingual parallel corpus
Abstract
The invention discloses an efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus, in the technical field of Tibetan-Chinese bilingual parallel corpus construction. The method comprises a base model unit, a localization processing unit and a Tibetan stability enhancement module. The base model unit uses Qwen2.5-series pre-trained models as its technical base; these comprise a 70B-parameter main generation model and a 7B-parameter lightweight intent judgment model, and the 70B model receives input text and generates Tibetan-Chinese parallel sentence pairs. The base model unit, the localization processing unit and the Tibetan stability enhancement module communicate with one another through shared memory. The Tibetan stability enhancement module further comprises a dual-model collaboration mechanism whose workflow is as follows: an input Chinese sentence is preprocessed by a word segmentation module, and the 7B model performs intent triple analysis to generate a semantic control code. The dual-model collaborative architecture thereby improves the stability of the Tibetan-Chinese parallel corpus output.
Inventors
- Danzen Rob
- Tong Hongjiang
- NIMA DUNZHU
- WANG PENG
- Gesangquni
Assignees
- 西藏觉罗数字产业管理有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260121
Claims (10)
- 1. An efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus, characterized by comprising a base model unit, a localization processing unit and a Tibetan stability enhancement module, wherein: the base model unit uses Qwen2.5-series pre-trained models as its technical base, the models comprising a 70B-parameter main generation model and a 7B-parameter lightweight intent judgment model; the 70B model receives input text and generates Tibetan-Chinese parallel sentence pairs; the 7B model monitors the generation process in real time, identifies semantic deviation through an intent classification module, and activates the Tibetan stability enhancement module; the Tibetan stability enhancement module integrates a Tibetan grammar rule base and a dynamic noise suppression algorithm, and when the 7B model detects that the Tibetan generation confidence is below a threshold, a rule constraint vector is injected to correct the output distribution, ensuring Tibetan stability and term consistency in the generation stage; new Tibetan characters and high-frequency compound words are added to the original vocabulary and adapted through continual pre-training; in the inference stage, generation is linked with a term base, and translated names are injected into the context to further constrain the output; and the base model unit, the localization processing unit and the Tibetan stability enhancement module communicate with one another through shared memory.
- 2. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein the Tibetan stability enhancement module further comprises a dual-model collaboration mechanism whose workflow is as follows: S1, an input Chinese sentence is preprocessed by a word segmentation module; S2, the 7B model performs intent triple analysis and generates a semantic control code, the intent triple consisting of an action subject, an action and an object; S3, the 70B model outputs the Tibetan-Chinese bilingual result in parallel, conditioned on the semantic control code; S4, in a real-time feedback loop, the 7B model parses the syntax tree of the Tibetan output, and if a missing case particle or a verb displacement error is detected, a regeneration instruction is triggered.
- 3. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein the Tibetan stability enhancement module further comprises a dynamic prefix compensator, an acoustic feature constraint component and a religious and cultural term verification library; the dynamic prefix compensator dynamically inserts missing case markers in the decoding stage based on a Tibetan grammar rule base; the acoustic feature constraint component filters illegal syllable combinations through a finite state machine of Tibetan syllable structure; and the religious and cultural term verification library enables forced alignment against authoritative dictionaries for Buddhist and medical terminology.
- 4. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 3, wherein the forced alignment is carried out by a four-dimensional reward model whose operation flow is as follows: S1, computing a weighted score from bidirectional Chinese-Tibetan BLEU-4 and BERTScore; S2, calling a Tibetan dependency parser to check subject-predicate-object consistency; S3, checking the accuracy of religious ritual and traditional custom terms based on a domain ontology graph; S4, using syllable entropy to evaluate the plausibility of the probability distribution of Tibetan initial-final combinations.
- 5. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein the Tibetan stability enhancement module further comprises a training unit whose training data covers 65 million aligned Tibetan-Chinese sentence pairs, a Tibetan monolingual corpus, and 120,000 expert-reviewed Tibetan-Chinese term pairs; the aligned sentence pairs are used to stabilize cross-language correspondence and the syntactic patterns of question answering; the monolingual corpus is used to enhance fluency and cultural diversity in pure Tibetan text; and the term pairs are used to unify the translation of proper nouns and specialized expressions.
- 6. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 5, wherein training by the training unit is performed in three steps: S1, continual pre-training (Continual Pretraining) and domain-adaptive pre-training (Domain-Adaptive Pretraining), in which causal language modeling is performed on the Tibetan monolingual corpus and the high-quality aligned corpus so that the base model forms a robust language prior for Tibetan; S2, instruction alignment, in which supervised fine-tuning is performed on Tibetan question-answer and task-instruction samples, emphasizing a factual, structured and neutral style; S3, refusal and compliance preference tuning (safety-tuned SFT and ORPO), in which, on policy-labeled data, the model learns to refuse with a unified template and to offer lawful alternative suggestions when a policy is triggered, forming stable and auditable output behavior.
- 7. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein: the 70B model in the base model unit is trained with the parameter-efficient methods QLoRA and LoRA, combined with gradient checkpointing and tensor parallelism, which significantly reduces GPU memory usage and improves throughput; the 7B model undergoes full-parameter fine-tuning for the lightweight tasks of intent judgment, query rewriting and bridging answers; structured rewriting and short instructions are preferentially routed to the 7B model, while open-ended question answering and long-context synthesis are routed to the 70B model, safeguarding both overall user experience and resource utilization; the base model unit further comprises typical hyperparameter settings that follow the principles of long context, stable convergence and reproducibility; the 70B QLoRA configuration is bf16 computation, 8-bit weight loading, LoRA rank, learning rate 1e-4, context length 8192 and effective batch size 256; the 7B full-parameter fine-tuning configuration is learning rate 2e-5, context length 8192 and effective batch size 512; and training parallelism and memory optimization combine data parallelism, tensor parallelism and pipeline parallelism, splitting single-layer weights column-wise across multiple GPUs and distributing the stack of layers across multiple GPUs, together with gradient checkpointing and gradient accumulation, to keep throughput stable.
- 8. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein data governance in the localization processing unit follows the process of cleaning, deduplication, language identification, alignment, segmentation and acceptance, all completed in a locally isolated environment so that raw data and derived products never leave the domain; the alignment stage combines heuristics with an aligner and preferentially retains high-confidence alignments; the segmentation stage splits long texts into sentences and paragraphs to facilitate subsequent batch processing and training scheduling; the localization processing unit further comprises an inference service that is deployed locally and privately, with an interface style compatible with the mainstream ecosystem so that existing SDKs and toolchains can be connected easily; the service supports both non-streaming and streaming return modes, using server-sent events for incremental pushing to reduce first-packet latency; and for each request the service returns debugging and audit metadata, including retrieval candidates, latency, retrieval and processing time, and the query and bridging text parameters, to help locate quality problems and tune parameters.
- 9. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein the base model unit further comprises containerized deployment and offline delivery; the containerized deployment is based on Docker and Compose, supports running on a local multi-GPU server, and strikes a balance between acceptable latency and throughput by combining distributed tensor parallelism with weight quantization; the offline delivery packages images, indexes and data through the offlinebundle and manualbundle directories, so that startup and verification can be completed without access to the public network; and on first startup the service warms up its retrieval resources to avoid blocking the first request, the retrieval resources being the glossary, the FAISS index and the automaton.
- 10. The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus of claim 1, wherein the base model unit further comprises a compliance management module; in the input stage, the compliance management module performs sensitive word detection, link and file filtering, and identification and desensitization of sensitive personal information; in the retrieval stage, it applies whitelisting and sanitization to external knowledge access channels; in the generation stage, it introduces a decoding-time safety classifier and refusal templates; in the output stage, it performs secondary interception and audit archiving; the module adopts a rule base, an automaton and a policy engine; the rule base covers both Chinese and Tibetan word lists and merges synonyms, variants and common misspellings; the automaton builds a multi-pattern matching structure using Aho-Corasick, ensuring very low latency; and the policy engine provides a hot-updatable DSL (domain-specific language) for enabling or disabling policies and thresholds in different business scenarios.
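Claim 10 names Aho-Corasick as the multi-pattern matching structure behind sensitive word detection. The following is a minimal Python sketch of that algorithm, not the patented implementation: the class name `AhoCorasick`, its method names and any sample terms are hypothetical, and the sketch matches on raw Unicode code points, assuming that synonyms, variants and common misspellings have already been merged into the pattern list by the rule base.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: finds every listed term in one pass over the text."""

    def __init__(self, patterns):
        # Node i is described by goto[i] (child edges), fail[i] (failure link)
        # and out[i] (patterns that end at this node). Node 0 is the root.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        # Standard trie insertion, one node per character.
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                # Walk failure links until a node with an outgoing `ch` edge
                # is found, or the root is reached.
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target, so that
                # shorter terms embedded in longer ones are also reported.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Return (start_index, pattern) for every occurrence in `text`."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits
```

For instance, `AhoCorasick(["he", "she", "his", "hers"]).find("ushers")` returns `[(1, 'she'), (2, 'he'), (2, 'hers')]`. The text is scanned once regardless of how many terms are loaded, which is the property that keeps matching latency low as the Chinese and Tibetan word lists grow.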
Description
Efficient automatic construction method for Tibetan bilingual parallel corpus
Technical Field
The invention relates to the technical field of Tibetan-Chinese bilingual parallel corpus construction, and in particular to an efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus.
Background
The Tibetan large language model follows the basic principle of local, private deployment and builds a complete training, inference and compliance system around three goals: controllable, auditable and evolvable. The model system takes 70B and 7B dual-model collaboration and hybrid retrieval as its core, builds a deep neural network combined with term standardization and multi-level safety policies, and adds safety mechanisms for auditing and filtering. Once deployed, the algorithm model provides a stable, reliable and compliant technical base for public Tibetan information services: following human instructions or prompts, it carries out Tibetan semantic analysis, computational reasoning, question-answer dialogue and document generation tasks, provides Tibetan users with AI services equivalent to those available in Chinese and English, and makes it easier for Chinese and English users to access Tibetan knowledge and culture.
Existing automatic construction methods for Tibetan-Chinese bilingual parallel corpora have the following defects: 1. Patent document CN108763223B discloses a method for constructing a multilingual (Chinese-English, Mongolian, Tibetan) parallel corpus, in which "Chinese commodity information to be translated is obtained from Chinese e-commerce platforms; part of the commodity information is translated using bilingual dictionaries; the similarity of web page tag sequences and the similarity of number sequences computed by maximum matching are used as feature information; candidate parallel web pages are extracted using a support vector machine; the web pages are then split into sentences, aligned and collated to obtain Chinese-English, Chinese-Mongolian, Chinese-Tibetan and Chinese-Uyghur parallel corpora of commodity information, completing the construction of the multilingual parallel corpus". Its effect is that a multilingual parallel corpus is constructed. However, conventional automatic construction methods for Tibetan-Chinese parallel corpora produce unstable Tibetan content, are prone to generating sensitive words, and offer low security because they require access through an external network.
Disclosure of Invention
The invention aims to provide an efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus, so as to solve the technical problems raised in the background: conventional automatic construction methods produce unstable Tibetan content, are prone to generating sensitive words, and offer low security because they require access through an external network.
The efficient automatic construction method for a Tibetan-Chinese bilingual parallel corpus comprises a base model unit, a localization processing unit and a Tibetan stability enhancement module. The base model unit uses Qwen2.5-series pre-trained models as its technical base; these comprise a 70B-parameter main generation model and a 7B-parameter lightweight intent judgment model. The 70B model receives input text and generates Tibetan-Chinese parallel sentence pairs. The 7B model monitors the generation process in real time, identifies semantic deviation through an intent classification module, and activates the Tibetan stability enhancement module. The Tibetan stability enhancement module integrates a Tibetan grammar rule base and a dynamic noise suppression algorithm; when the 7B model detects that the Tibetan generation confidence is below a threshold, a rule constraint vector is injected to correct the output distribution, ensuring Tibetan stability and term consistency in the generation stage. New Tibetan characters and high-frequency compound subwords are added to the original vocabulary, and the expanded vocabulary is fully adapted through continual pre-training. In the inference stage, generation is linked with a term base, and translated names are injected into the context to further constrain the output. The base model unit, the localization processing unit and the Tibetan stability enhancement module communicate with one another through shared memory.
Preferably, the Tibetan stability enhancement module further comprises a dual-model collaboration mechanism whose workflow is as follows: S1, an input Chinese sentence is preprocessed by a word segmentation module; S2, 7B model e