CN-122019446-A - Stochastic-computing-based large model acceleration system, method, medium, terminal and program product
Abstract
The application provides a stochastic-computing-based large model acceleration system, method, medium, terminal and program product. The system comprises a random bit stream encoding module, a stochastic computing acceleration module and a random bit stream decoding module. The random bit stream encoding module is configured to encode the weight parameters and input features of a large model from floating-point numbers into random bit streams with a preset number of bits; the stochastic computing acceleration module is configured to perform multiply-add operations and activation-function operations on the weight parameters and input features converted into random bit streams so as to obtain an operation result; and the random bit stream decoding module is configured to decode the operation result from the random bit stream back into floating-point format. On the premise of preserving large-model inference and training accuracy, the application can reduce hardware cost, simplify the computing architecture, improve the energy-efficiency ratio, support mainstream large models, and combine versatility with deployment flexibility.
Inventors
- Request for anonymity
Assignees
- 上海光羽芯辰科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-21
Claims (10)
- 1. A stochastic-computing-based large model acceleration system, comprising: a random bit stream encoding module, configured to encode the weight parameters and input features of a large model from floating-point numbers into random bit streams with a preset number of bits; a stochastic computing acceleration module, configured to perform multiply-add operations and activation-function operations on the weight parameters and input features converted into random bit streams so as to obtain an operation result; and a random bit stream decoding module, configured to decode the operation result from the random bit stream into floating-point format.
- 2. The stochastic-computing-based large model acceleration system of claim 1, wherein the random bit stream encoding module comprises: a normalization unit, configured to normalize the weight parameters and input features of the large model into a normalization interval and to store the normalization coefficients; a pseudo-random number generation unit, configured to generate uniform random numbers based on a configured feedback function; and an encoding unit, configured to, when the normalization interval is [0, 1], compare the normalized value with the uniform random numbers generated by the pseudo-random number generation unit to output a random bit stream with the preset number of bits, and, when the normalization interval is [-1, 1], encode according to the uniform random numbers generated by the pseudo-random number generation unit using a bipolar encoding mechanism to obtain the final random bit stream.
- 3. The stochastic-computing-based large model acceleration system of claim 2, wherein the bipolar encoding mechanism encodes according to the uniform random numbers generated by the pseudo-random number generation unit to obtain the final random bit stream, the specific process comprising: for a positive input value, encoding it against the uniform random numbers to output a random bit stream with the preset number of bits; for a negative input value, calculating a positive-part probability and a negative-part probability, and comparing each with the uniform random numbers to generate a positive random bit stream and a negative random bit stream, each with the preset number of bits; and subtracting the negative random bit stream from the generated positive random bit stream to obtain the final random bit stream.
- 4. The stochastic-computing-based large model acceleration system of claim 2, wherein the stochastic computing acceleration module comprises: a multiplication unit, configured to feed the weight parameters and input features encoded as random bit streams into a two-input AND gate to perform multiplication and obtain a random-bit-stream multiplication result; an addition unit, configured to feed the multiplication result output by the multiplication unit into an OR-gate array for operation and to output an accumulation result; a ReLU function calculation unit, configured to compare the input original random bit stream with an all-zero random bit stream using a comparator, and to output the all-zero random bit stream if the input value is less than or equal to zero; and a GELU function calculation unit, configured to feed the original random bit stream into a group of AND gates and OR gates to calculate the GELU function.
- 5. The stochastic-computing-based large model acceleration system of claim 4, wherein the random bit stream decoding module comprises: a statistics unit, configured to count, using a counter, the number of ones in the output random bit stream with the preset number of bits; a probability calculation unit, configured to calculate a probability value from the counted number of ones and the preset number of bits; and an inverse normalization unit, configured to multiply the probability value output by the probability calculation unit by the normalization coefficient to obtain the final value in floating-point format.
- 6. The stochastic-computing-based large model acceleration system of claim 5, wherein the random bit stream decoding module further comprises a calibration unit, configured to calibrate the accumulation result output by the OR-gate array by looking up a calibration table according to the calculated probability value, so as to obtain the true accumulation result.
- 7. A stochastic-computing-based large model acceleration method, comprising: encoding the weight parameters and input features of a large model from floating-point numbers into random bit streams with a preset number of bits; performing multiply-add operations and activation-function operations on the weight parameters and input features converted into random bit streams to obtain an operation result; and decoding the operation result from the random bit stream into floating-point format.
- 8. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method as claimed in claim 7.
- 9. A computer program product comprising computer program code which, when run on a computer, causes the computer to carry out the method as claimed in claim 7.
- 10. An electronic terminal comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method as claimed in claim 7.
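The encoding scheme of claims 2 and 3 (comparison against uniform pseudo-random numbers for the [0, 1] interval; splitting into positive and negative streams and subtracting for the [-1, 1] interval) can be illustrated in software. This is a minimal sketch, not the patented hardware: the function names are hypothetical, and NumPy's generator stands in for the feedback-function-based pseudo-random number generation unit.

```python
import numpy as np

def encode_unipolar(p, n_bits, rng):
    # Unipolar encoding (claim 2): each output bit is 1 with probability p,
    # obtained by comparing p against a uniform random number in [0, 1).
    return (rng.random(n_bits) < p).astype(np.int8)

def encode_bipolar(x, n_bits, rng):
    # Bipolar encoding (claim 3): split x in [-1, 1] into a positive-part
    # probability and a negative-part probability, encode each as a unipolar
    # stream, then subtract the negative stream from the positive one.
    pos = encode_unipolar(max(x, 0.0), n_bits, rng)
    neg = encode_unipolar(max(-x, 0.0), n_bits, rng)
    return pos - neg  # elements in {-1, 0, 1}; the mean approximates x

rng = np.random.default_rng(0)
stream = encode_unipolar(0.75, 1 << 14, rng)
print(stream[:16], stream.mean())  # mean is close to 0.75
```

The fraction of ones in the stream converges to the encoded value as the preset number of bits grows, which is why stream length trades accuracy for latency in stochastic computing.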
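The logic-gate arithmetic of claim 4 has a direct software analogue: for independent unipolar streams, a two-input AND gate yields an output whose 1-probability is the product of the inputs' 1-probabilities, and an OR-gate array approximates accumulation with probability 1 - prod(1 - p_i), the bias that claim 6's calibration table corrects. A hedged sketch with hypothetical function names, assuming unipolar int8 streams as produced above:

```python
import numpy as np

def sc_multiply(a, b):
    # Two-input AND gate (claim 4): for independent unipolar streams,
    # P(out = 1) = P(a = 1) * P(b = 1), i.e. the product of encoded values.
    return a & b

def sc_or_accumulate(streams):
    # OR-gate array (claim 4): P(out = 1) = 1 - prod(1 - p_i), which
    # underestimates the true sum; the patent's calibration unit
    # (claim 6) corrects this via a lookup table.
    out = np.zeros_like(streams[0])
    for s in streams:
        out = out | s
    return out

def sc_relu(stream):
    # Comparator-based ReLU (claim 4): if the stream decodes to a value
    # less than or equal to zero, output the all-zero stream;
    # otherwise pass the stream through unchanged.
    return stream if stream.sum() > 0 else np.zeros_like(stream)
```

Note how multiplication, the dominant cost in binary arithmetic, reduces here to a single gate per bit, which is the source of the hardware-cost and energy-efficiency claims.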
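Decoding (claim 5) is simply a counter, a division by the preset number of bits, and a multiplication by the stored normalization coefficient. A minimal sketch under the same assumptions (hypothetical names, software stand-in for the hardware counter):

```python
import numpy as np

def decode_stream(stream, norm_coeff=1.0):
    # Statistics unit (claim 5): count the ones with a counter.
    ones = int(np.asarray(stream).sum())
    # Probability calculation unit: divide by the preset number of bits.
    prob = ones / len(stream)
    # Inverse normalization unit: multiply by the stored coefficient
    # to recover a floating-point value.
    return prob * norm_coeff

# 96 ones out of 128 bits, with a normalization coefficient of 2.0
stream = np.array([1] * 96 + [0] * 32, dtype=np.int8)
print(decode_stream(stream, norm_coeff=2.0))  # 1.5
```

Because bipolar streams take values in {-1, 0, 1}, the same sum-then-divide decode also recovers signed values, consistent with claim 3's subtraction of the negative stream.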
Description
Stochastic-computing-based large model acceleration system, method, medium, terminal and program product
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a stochastic-computing-based large model acceleration system, method, medium, terminal and program product.
Background
Currently, pre-trained large models (such as the GPT series, the LLaMA series, BERT, etc.) built around the Transformer architecture have been widely applied in natural language processing, generative AI, computer vision and other fields, but their high computing-power requirements make accelerated deployment costly. To address this challenge, existing acceleration schemes are deployed mainly in two directions: hardware and software. At the hardware level, they rely mainly on high-performance dedicated computing chips, such as the NVIDIA H100 GPU, the Google TPU v5e, and the Cambricon Siyuan 370 ASIC, which raise computational density through customized computing units (such as the Tensor Cores in GPUs); meanwhile, some scenarios adopt FPGAs (such as the Xilinx Zynq UltraScale+ series) for flexible acceleration. However, traditional FPGA designs are based on a binary computing architecture, the utilization of hardware resources (such as LUT lookup tables and registers) is only 30%-50%, and the cost is difficult to reduce further.
At the software level, computation cost is reduced through algorithm optimization, specifically including: quantization techniques, which compress model parameters from FP32/FP16 to INT8/INT4 (typical algorithms such as GPTQ and AWQ reduce storage and computation, but require an additional calibration flow); pruning and distillation techniques, which remove redundant weights (e.g., pruning 50%-70% of low-contribution weights) or train small models to imitate large-model behavior (e.g., MobileBERT as a distilled version of BERT); and parallel computing techniques, which split tasks via data parallelism, model parallelism and pipeline parallelism (e.g., the Megatron-LM framework) and use multi-card cooperation to increase speed, but which depend on high-speed interconnects (such as NVLink), further increasing hardware investment. Although existing large model acceleration schemes improve computing power to a certain extent, they still suffer from the core defect that cost and performance are hard to balance, mainly in the following respects: (1) hardware cost is too high: dedicated chips such as GPUs/TPUs are expensive, the hardware investment for a 100-card large-model inference cluster is 4-6 million dollars, the ASIC development cycle is as long as 1-2 years with development costs exceeding ten million dollars, and customized designs suit only a single model, so they are poorly general and cannot meet multi-model deployment requirements; (2) the energy-efficiency ratio is low, so long-term cost is high: traditional binary computing chips consume a great deal of power (e.g., the 150W full-load power consumption of an NVIDIA A10 GPU), and the annual electricity bill of a 100-card cluster exceeds 130,000 dollars, so long-term usage cost is significant; (3) precision conflicts with the acceleration effect: although low-bit quantization (such as INT2) reduces computation, the precision loss can reach 5%-10%, which cannot satisfy high-precision scenarios such as healthcare and finance, and a high pruning rate (e.g., pruning 70%) degrades model convergence and makes inference results unstable; (4) the deployment threshold is high: parallel computing requires deep optimization of hardware interconnects and software frameworks, and small and medium-sized enterprises, lacking professional technical teams, can hardly afford the technology and cost investment of accelerating large models, which limits the wide application of large models. Accordingly, there is a need for a stochastic-computing-based large model acceleration system, method, medium, terminal and program product that solve the above problems of the prior art.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present application is to provide a stochastic-computing-based large model acceleration system, method, medium, terminal and program product, which are used to solve the technical problems of high hardware cost, low energy-efficiency ratio, large precision loss and poor versatility in the prior art. To achieve the above and other related objects, a first aspect of the present application provides a stochastic-computing-based large model acceleration system, which includes a random bit stream encoding module configured to encode a weight parameter and an input feature of the large model from floating-point numbers into a random bit stream with a preset number of bits.