CN-122021800-A - Model training method based on high-dimensional efficiency self-adaptive sparse training paradigm

CN122021800A

Abstract

The invention discloses a model training method based on a high-dimensional efficiency self-adaptive sparse training paradigm, belonging to the technical field of computer science, and comprising the following steps: preparing data and infrastructure, hybrid sparse pre-training, and alignment and post-training. The invention provides a complete, end-to-end system for efficient LLM training and inference control. Through foundational innovations in architecture, training objectives, and post-training alignment mechanisms, the system resolves the three bottlenecks facing traditional LLM training: computational efficiency, data efficiency, and alignment accuracy.

Inventors

  • LIU HONGYI
  • LI YUE
  • NING BIGUO
  • LI YONG
  • HUANG XU
  • LI YUHANG

Assignees

  • Tongwei Co., Ltd. (通威股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-04

Claims (11)

  1. A model training method based on a high-dimensional efficiency self-adaptive sparse training paradigm, characterized by comprising the following steps: step one, preparing data and infrastructure; step two, hybrid sparse pre-training; and step three, alignment and post-training.
  2. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 1, further comprising, prior to step one, data cleansing, tokenization and vocabulary construction, and training framework packaging.
  3. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 2, wherein the data cleansing includes deduplication, language identification, sensitive-content filtering, and document structure restoration.
  4. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 3, wherein the tokenization and vocabulary construction includes employing byte-level BPE or an equivalent subword algorithm, preserving the integrity of code, numbers, and symbols.
  5. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 4, wherein the training framework packaging comprises a distributed parameter server, gradient checkpointing, asynchronous data loading, and resumable training from breakpoints.
  6. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 5, wherein the data and infrastructure preparation comprises the following steps: step S11, corpus collection, namely web pages, books, academic papers, code repositories, dialogue records, and multi-modal alt-text; step S12, dual quality-density scoring, namely training a small reference model to compute perplexity and filter out low-quality paragraphs, then performing heuristic secondary screening of those paragraphs by information density (see the first sketch following the claims); step S13, long-text slicing using a structure-aware sliding window, aligned at chapter and paragraph boundaries to keep contextual semantics continuous; and step S14, storage and loading, namely tokenizing the preprocessed corpus, generating the corresponding token IDs according to the vocabulary, and reducing GPU idle time with RAM-disk caching and vectorized reads.
  7. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 6, wherein the hybrid sparse pre-training comprises the following steps: step S21, overall design of the hybrid sparse architecture; step S22, mixing Gated DeltaNet and Gated Attention layers at a 3:1 ratio; and step S23, adopting a multi-token prediction (MTP) training objective in the pre-training stage.
  8. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 7, wherein in step S21 the Mixture-of-Experts architecture is deeply optimized through a hybrid attention mechanism to obtain the hybrid sparse architecture; the hybrid sparse architecture comprises a high-speed indexer and a fine-grained token selection mechanism, and is instantiated in a Multi-Head Latent Attention (MLA) architecture.
  9. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 8, wherein in step S22 a 3:1 mixing ratio of Gated DeltaNet to Gated Attention is used for the hybrid layout within the Transformer block (see the second sketch following the claims).
  10. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 9, wherein the alignment and post-training comprises the following steps: step S31, pre-training stability assurance; step S32, an adaptive mode reinforcement optimization framework; and step S33, knowledge distillation and reasoning capability enhancement.
  11. The model training method based on the high-dimensional efficiency self-adaptive sparse training paradigm of claim 10, further comprising the steps of inference deployment and subsequent iteration: step S41, dynamic thinking budget, namely adjusting the routing proportion of the thinking experts in real time according to a user-set maximum number of thinking tokens or a latency SLA; step S42, speculative decoding deployment, namely using the MTP head as a small draft model while the main model performs parallel verification, where an average acceptance length of approximately 2-3 tokens yields an overall decoding speedup (see the third sketch following the claims); and step S43, an online feedback loop, namely collecting user likes/dislikes and edit corrections as preference data for a new round of AMRO, with continuous rolling optimization.
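The following sketches illustrate three of the claimed mechanisms; all are minimal, hedged illustrations rather than the patented implementations. First, the dual quality-density scoring of claim 6, step S12: a small reference model scores each paragraph's perplexity, and paragraphs failing the perplexity filter receive a heuristic secondary screen by information density. The choice of GPT-2 as the reference model, both thresholds, and the unique-token density heuristic are assumptions made for illustration.

    # Sketch of claim 6, step S12 (dual quality-density scoring).
    # GPT-2, the thresholds, and the density heuristic are assumptions.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in small reference model
    ref_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    PPL_MAX = 200.0      # assumed perplexity cutoff for "low-quality"
    DENSITY_MIN = 0.35   # assumed unique-token ratio for the secondary screen

    @torch.no_grad()
    def perplexity(text: str) -> float:
        """Reference-model perplexity of a paragraph (exp of mean token NLL)."""
        ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
        loss = ref_model(ids, labels=ids).loss          # mean cross-entropy
        return math.exp(loss.item())

    def info_density(text: str) -> float:
        """Heuristic information density: share of distinct tokens."""
        toks = tokenizer.tokenize(text)
        return len(set(toks)) / max(len(toks), 1)

    def keep_paragraph(text: str) -> bool:
        """Pass 1: perplexity filter; pass 2: density re-screen of rejects."""
        if perplexity(text) <= PPL_MAX:
            return True
        return info_density(text) >= DENSITY_MIN       # heuristic secondary screening

    corpus = ["The gradient of the loss is computed by backpropagation.",
              "buy now buy now buy now buy now buy now"]
    cleaned = [p for p in corpus if keep_paragraph(p)]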
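Second, the 3:1 hybrid layout of claims 7 and 9: within every group of four layers in the Transformer block stack, three Gated DeltaNet layers are interleaved with one Gated Attention layer. The layer classes below are simplified stand-ins; the actual operators, the high-speed indexer, and the MLA instantiation of claim 8 are not reproduced.

    # Sketch of the 3:1 Gated DeltaNet / Gated Attention layout (claims 7, 9).
    import torch
    from torch import nn

    class GatedDeltaNetLayer(nn.Module):
        """Simplified stand-in for a linear-time Gated DeltaNet mixing layer."""
        def __init__(self, d_model: int):
            super().__init__()
            self.proj = nn.Linear(d_model, d_model)
            self.gate = nn.Linear(d_model, d_model)

        def forward(self, x):
            return x + torch.sigmoid(self.gate(x)) * self.proj(x)

    class GatedAttentionLayer(nn.Module):
        """Simplified stand-in for a gated softmax-attention layer."""
        def __init__(self, d_model: int, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(d_model, d_model)

        def forward(self, x):
            y, _ = self.attn(x, x, x, need_weights=False)
            return x + torch.sigmoid(self.gate(x)) * y

    def build_hybrid_stack(d_model: int, n_layers: int) -> nn.ModuleList:
        """Interleave so each group of four layers is 3x DeltaNet + 1x attention."""
        layers = nn.ModuleList()
        for i in range(n_layers):
            if i % 4 == 3:                   # every fourth layer: gated attention
                layers.append(GatedAttentionLayer(d_model))
            else:                            # the other three: gated DeltaNet
                layers.append(GatedDeltaNetLayer(d_model))
        return layers

    stack = build_hybrid_stack(d_model=512, n_layers=12)
    x = torch.randn(2, 16, 512)              # (batch, seq, d_model)
    for layer in stack:
        x = layer(x)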
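Third, the speculative decoding of claim 11, step S42: the MTP head drafts a few tokens cheaply, then the main model verifies them in a single parallel pass. Greedy accept/verify is shown for clarity (production systems typically use rejection sampling), and both models are assumed to map a token sequence of shape (len,) to logits of shape (len, vocab).

    # Sketch of claim 11, step S42 (speculative decoding with the MTP draft head).
    import torch

    @torch.no_grad()
    def speculative_step(main_model, draft_model, prefix: torch.Tensor, k: int = 3):
        """Draft k tokens cheaply, then verify them in one parallel main-model pass."""
        # 1) Draft: autoregressively propose k tokens with the cheap MTP head.
        seq = prefix.clone()
        for _ in range(k):
            next_tok = draft_model(seq)[-1].argmax()    # greedy draft token
            seq = torch.cat([seq, next_tok.view(1)])

        # 2) Verify: one main-model forward scores every drafted position.
        main_pred = main_model(seq).argmax(dim=-1)      # greedy token per position

        # 3) Accept the longest prefix on which the main model agrees; on the
        #    first disagreement, keep the main model's token and stop. Average
        #    acceptance of ~2-3 tokens gives the overall speedup in the claim.
        out = prefix.clone()
        for i in range(k):
            pos = prefix.numel() + i                    # index of drafted token i
            tok = main_pred[pos - 1]                    # main model's choice at pos
            out = torch.cat([out, tok.view(1)])
            if tok != seq[pos]:                         # mismatch ends acceptance
                break
        return out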

Description

Model training method based on high-dimensional efficiency self-adaptive sparse training paradigm

Technical Field

The invention belongs to the technical field of computer science, and particularly relates to a model training method based on a high-dimensional efficiency self-adaptive sparse training paradigm (HED-ASTP), covering the architecture of large language models (LLMs), training objective design, and post-training alignment methods. Specifically, HED-ASTP is a comprehensive training paradigm designed for trillion-parameter models. It aims to overcome the computational bottleneck of the traditional Transformer architecture when processing very long contexts, the data inefficiency caused by the Next-Token Prediction (NTP) objective, and the credit-allocation difficulties of traditional reinforcement learning from human feedback (RLHF). By fusing a hybrid sparse architecture, a dense prediction objective, and token-level preference optimization, the invention achieves leapfrog improvements in training efficiency, inference throughput, and alignment accuracy.

HED-ASTP has extremely broad application prospects and is particularly suited to next-generation AI infrastructure with strict requirements on high throughput, low latency, and ultra-long-context processing. This includes, but is not limited to, enterprise-level knowledge-base question-answering systems, high-precision code generation platforms, long-document summarization and analysis tools in the financial and legal fields, and cloud service providers that require fine-grained control of computing resources (i.e., thinking budgets).

Background

Conventional LLM training procedures are generally divided into three stages: pre-training, supervised fine-tuning (SFT), and post-training alignment (e.g., RLHF). While this paradigm has driven the current AI wave, as model scales expand to hundreds of billions or even trillions of parameters, its inherent shortcomings become the main bottleneck restricting further performance gains.

1. Quadratic complexity bottleneck for long contexts (Pre-training/Architecture). The conventional Transformer architecture relies on the standard self-attention mechanism, whose computational complexity and memory consumption grow quadratically in the sequence length L, i.e., O(L²). In the pre-training stage, the model's context window must be continuously expanded in order to capture long-distance dependencies from massive data. However, the O(L²) overhead causes inference cost to rise dramatically and inference throughput (tokens per second) to fall sharply on contexts beyond 32K tokens. This computational constraint limits how efficiently traditional pre-training can exploit massive long-text data and directly affects the model's ability to handle complex long-context tasks in practice.

2. Data efficiency and learning-signal sparsity (Pre-training/Objective). The core objective of traditional pre-training is Next-Token Prediction (NTP): at each time step t, the model predicts only the next token, t+1. While this autoregressive objective is simple and effective, the supervisory signal NTP provides is extremely sparse in the time dimension when facing pre-training corpora on the order of trillions of tokens.
To converge sufficiently, the model must consume astronomical amounts of computing resources and time. This single-point supervision results in low sample efficiency, lengthens the training cycle, and makes training costly. Traditional methods struggle to raise the learning-signal density of each step while keeping the computational complexity of a single forward pass unchanged.

3. Stability and granularity deficiencies of the alignment process (Alignment/Post-training). Conventional post-training alignment methods, particularly RLHF pipelines based on reinforcement learning, exhibit inherent complexity and instability. RLHF first requires training a Reward Model (RM), after which policy optimization is typically performed with Proximal Policy Optimization (PPO) or a similar algorithm. However, PPO is complex to implement and sensitive to hyperparameters, easily leading to unstable training. More critically, both PPO and its simplified alternative, Direct Preference Optimization (DPO), typically treat the entire response sequence y as a single action and assign rewards at the sequence level rather than the token level. This contextual-bandit modeling causes a serious credit-allocation problem: when a particular inference step or key phrase in a long sequence leads to a poor final result, a sequence-level reward can hardly penalize or reinforce the model's decision at the specific token, severely limiting the alignment accuracy of the model.
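To make bottleneck 1 concrete, a back-of-the-envelope calculation of how fully materialized L x L attention matrices grow with context length; the head count, layer count, and fp16 storage are assumptions for a generic large model, not figures from the patent.

    # Illustrative arithmetic for the O(L^2) bottleneck; parameters are assumed.
    def attn_matrix_gib(seq_len: int, n_heads: int = 64, n_layers: int = 80,
                        bytes_per_elem: int = 2) -> float:
        """Memory of fully materialized L x L attention matrices, in GiB (fp16)."""
        return seq_len ** 2 * n_heads * n_layers * bytes_per_elem / 2 ** 30

    for L in (4_096, 32_768, 131_072):
        print(f"L={L:>7}: {attn_matrix_gib(L):,.0f} GiB")
    # Going from 4K to 32K multiplies the cost by 64; from 4K to 128K, by 1024.
    # This quadratic growth is why contexts beyond 32K tokens cripple
    # throughput without sparse or linear attention.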
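For bottleneck 2, a minimal sketch of a multi-token prediction objective in the spirit of claim 7, step S23: each trunk hidden state feeds one extra head per prediction horizon, so every position supervises several future tokens instead of one. The head count (n_future=3) and the shared-trunk parameterization are assumptions; the patent does not specify the MTP design.

    # Sketch of a multi-token prediction (MTP) objective; design details assumed.
    import torch
    from torch import nn
    import torch.nn.functional as F

    class MTPHeads(nn.Module):
        """Predict tokens t+1 .. t+n_future from each trunk hidden state."""
        def __init__(self, d_model: int, vocab: int, n_future: int = 3):
            super().__init__()
            self.heads = nn.ModuleList(
                [nn.Linear(d_model, vocab) for _ in range(n_future)])

        def forward(self, h: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
            """h: (B, T, d) trunk states; targets: (B, T) token ids."""
            losses = []
            for k, head in enumerate(self.heads, start=1):
                # Head k predicts k steps ahead: positions 0..T-k-1 supervise
                # targets k..T-1, densifying the per-step learning signal.
                logits = head(h[:, :-k, :])                    # (B, T-k, V)
                losses.append(F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),
                    targets[:, k:].reshape(-1)))
            return torch.stack(losses).mean()

    # Usage (hypothetical trunk): loss = MTPHeads(1024, 50_000)(h, input_ids).
    # Head k=1 alone recovers plain NTP; the extra heads add supervision at
    # every step without a second forward pass through the trunk.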
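For bottleneck 3, a toy contrast between sequence-level and token-level credit assignment: a single scalar reward scales every token's log-probability equally, whereas per-token credits can isolate a single faulty step. The credit weights here are purely illustrative and are not the patent's token-level preference formula.

    # Toy contrast: sequence-level reward vs. token-level credit (illustrative).
    import torch

    def seq_level_loss(logp_tokens: torch.Tensor, reward: float) -> torch.Tensor:
        """REINFORCE-style: one scalar reward scales the whole sequence."""
        return -(reward * logp_tokens.sum())

    def token_level_loss(logp_tokens: torch.Tensor, credits: torch.Tensor) -> torch.Tensor:
        """Per-token credit: each position carries its own advantage weight."""
        return -(credits * logp_tokens).sum()

    logp = torch.randn(6)             # policy log-probs of six sampled tokens
    credits = torch.zeros(6)
    credits[3] = -1.0                 # only token 3 (say, a wrong reasoning
                                      # step) is penalized; the rest are neutral

    print(seq_level_loss(logp, reward=-1.0))   # punishes every token equally
    print(token_level_loss(logp, credits))     # punishes only the faulty token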