CN-122001564-A - SM3 hash algorithm parallel optimization and verification method and system

CN 122001564 A

Abstract

The invention discloses a parallel optimization and verification method and system for the SM3 hash algorithm, aimed at the performance bottleneck caused by the serial dependencies of message expansion and compression iteration when the SM3 algorithm is implemented on an NPU parallel processor. The invention first performs 3-way parallel message expansion through the NPU systolic array, then writes the resulting message expansion words into a production buffer, and uses the NPU's double buffer to run message-expansion data supply and compression-iteration computation asynchronously in parallel. It further deconstructs the compression-iteration logic into two parallel computation paths, each responsible for updating a different group of state variables, and lets the two paths execute asynchronously most of the time through a minimum synchronization mechanism, thereby breaking the internal cyclic dependency of the compression function. The invention significantly improves the throughput of the SM3 algorithm on parallel hardware and is suitable for high-performance encryption scenarios such as intelligent driving and edge computing.

Inventors

  • Cheng Wendong
  • Sun Tengye

Assignees

  • 上海伊世智能科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-02-26

Claims (7)

  1. An SM3 hash algorithm parallel optimization method, applied to encryption scenarios that take an NPU as the core computing unit, such as intelligent driving assistance systems, industrial visual inspection platforms and edge computing terminals, characterized by comprising the following steps: S1, splitting an input 512-bit message block into 16 32-bit message expansion words W_0–W_15 and storing them in an input register set; S2, generating message expansion words W_16–W_67 in parallel through the systolic array of the NPU; S3, synchronously generating message expansion words W'_0–W'_63 through the exclusive-OR unit array of the NPU; S4, storing W_0–W_15 from the input register set, W_16–W_67 generated by the systolic array and W'_0–W'_63 generated by the exclusive-OR unit array into a double buffer of the NPU, the double buffer comprising a production buffer for writing and a consumption buffer for reading, and realizing asynchronous parallelism of message expansion and compression iteration by switching the read and write roles after the production buffer is full; S5, reading W_j and W'_j from the consumption buffer of the double buffer and, during compression iteration, deconstructing the state registers A–H and their update logic into two parallel computation paths, with asynchronous execution between the paths realized through a minimum synchronization mechanism.
  2. The SM3 hash algorithm parallel optimization method of claim 1, wherein in step S2 the systolic array comprises 52 processing units PE_16–PE_67, each processing unit PE_j correspondingly generating one message expansion word W_j; the systolic array performs message expansion round by round, activating 3 consecutive processing units per round to generate 3 message expansion words in parallel; for j ∈ [16, 67], each processing unit PE_j, immediately after generating W_j, writes W_j to the corresponding addresses of the production buffer and the output register set; the computation logic of processing unit PE_j is: W_j = P1(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15)) ⊕ (W_{j-13} <<< 7) ⊕ W_{j-6}; wherein, when j ∈ [16, 18], W_{j-16}, W_{j-13}, W_{j-9}, W_{j-6} and W_{j-3} are read from the input register set; when j ∈ [19, 67], W_{j-16}, W_{j-13}, W_{j-9} and W_{j-6} are read from the input register set or the output register set, and W_{j-3} is the output of the previous round of processing units.
  3. The SM3 hash algorithm parallel optimization method as claimed in claim 2, wherein in step S3 the exclusive-OR unit array comprises 64 combinational-logic exclusive-OR units XOR_0–XOR_63, each exclusive-OR unit XOR_j correspondingly generating one message expansion word W'_j, with the generation logic: W'_j = W_j ⊕ W_{j+4}; wherein, when j ∈ [0, 11], W_j and W_{j+4} are read from the input register set; when j ∈ [12, 15], W_j is read from the input register set and W_{j+4} from the output register set; when j ∈ [16, 63], W_j and W_{j+4} are read from the output register set; each exclusive-OR unit XOR_j, as soon as W_j and W_{j+4} exist in the input register set or have been generated and written into the output register set, immediately reads W_j and W_{j+4}, performs the exclusive-OR operation to generate W'_j, and immediately writes the generated W'_j to the corresponding address of the production buffer.
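As an illustrative software sketch (function names are ours, not the patent's), the expansion logic of claims 2 and 3 can be modeled as follows. The 3-words-per-round grouping mirrors the claimed 3-wide systolic schedule: W_j reaches back no closer than W_{j-3}, so any three consecutive words are mutually independent.

```python
def rotl(x, n):
    """32-bit left rotation."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def p1(x):
    """Permutation P1 from the SM3 standard."""
    return x ^ rotl(x, 15) ^ rotl(x, 23)

def expand(block_words):
    """Expand 16 message words W0..W15 into W0..W67 and W'0..W'63.

    Words are produced three per round, mirroring the claimed 3-wide
    systolic schedule: within a triple (W[j], W[j+1], W[j+2]), each word
    depends only on words from earlier rounds.
    """
    w = list(block_words)                      # W0..W15 (input register set)
    for j in range(16, 68, 3):                 # one "round" = up to 3 PEs
        for k in range(j, min(j + 3, 68)):
            w.append(p1(w[k-16] ^ w[k-9] ^ rotl(w[k-3], 15))
                     ^ rotl(w[k-13], 7) ^ w[k-6])
    wp = [w[j] ^ w[j+4] for j in range(64)]    # XOR unit array: W'j = Wj xor Wj+4
    return w, wp
```

Note that the 52 generated words (W_16–W_67) match the claimed 52 processing units, and the 64 W'_j words match the claimed 64 exclusive-OR units.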
  4. The SM3 hash algorithm parallel optimization method of claim 3, wherein in step S4 the double buffer comprises two independent buffers, Buffer A and Buffer B, each with a storage capacity of 6 groups of message expansion words, each group corresponding to 1 round of the compression iteration and comprising 1 W_j and 1 W'_j, i.e. each buffer stores 6 32-bit message expansion words W_j and 6 32-bit message expansion words W'_j; W_0–W_15 from the input register set, W_16–W_67 generated by the systolic array and W'_0–W'_63 generated by the exclusive-OR unit array are stored into the double buffer in groups by round, as follows: for round j of the compression iteration, the corresponding W_j and W'_j make up 1 data group G_j; as soon as exclusive-OR unit XOR_j generates W'_j, the W_j from the input register set or the output register set and the W'_j from the exclusive-OR unit array are written to the corresponding address of the current production buffer, and the write pointer of the production buffer increases sequentially; the double buffer realizes asynchronous parallelism of message expansion and compression iteration through a producer-consumer role dynamic switching mechanism: in the initial state, Buffer A serves as the production buffer and receives message expansion writes, and Buffer B serves as the consumption buffer for compression-iteration reads; when the write pointer of the production buffer reaches the end of the production buffer and the read pointer of the consumption buffer reaches the end of the consumption buffer, a synchronization signal generated by the NPU interrupt controller is triggered, the roles of the production buffer and the consumption buffer are exchanged, the write pointer of the production buffer and the read pointer of the consumption buffer are reset, the compression iteration reads from the new consumption buffer, and message expansion writes to the new production buffer; the double-buffer access-permission control mechanism is that the production buffer only allows message expansion to write data and the consumption buffer only allows compression iteration to read data, and during switching the address mapping and access permissions of the double buffer are atomically switched under the control of the memory management unit of the NPU, thereby avoiding data collisions.
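The producer-consumer switching described in claim 4 can be sketched in software as below. Class and method names are illustrative, and the prefill of the very first consumption buffer is omitted for brevity; in hardware the swap would be driven by the interrupt controller's synchronization signal rather than polled.

```python
class DoubleBuffer:
    """Ping-pong buffer sketch: the producer writes (Wj, W'j) groups into
    one 6-entry buffer while the consumer reads the other; roles swap only
    when the write pointer and read pointer have both reached the end."""
    CAPACITY = 6  # groups per buffer; one group = (Wj, W'j) for one round

    def __init__(self):
        self.bufs = [[None] * self.CAPACITY, [None] * self.CAPACITY]
        self.prod, self.cons = 0, 1   # initially: Buffer A produces, B consumes
        self.wr = self.rd = 0

    def produce(self, group):
        """Message expansion writes one (Wj, W'j) group."""
        self.bufs[self.prod][self.wr] = group
        self.wr += 1

    def consume(self):
        """Compression iteration reads the next (Wj, W'j) group."""
        group = self.bufs[self.cons][self.rd]
        self.rd += 1
        return group

    def maybe_swap(self):
        """Swap roles only when the producer is full AND the consumer is
        drained, mirroring the claimed synchronization condition."""
        if self.wr == self.CAPACITY and self.rd == self.CAPACITY:
            self.prod, self.cons = self.cons, self.prod
            self.wr = self.rd = 0
            return True
        return False
```

The single-writer/single-reader split means no group is ever read and written concurrently, which is what the claimed atomic permission switch enforces in hardware.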
  5. The SM3 hash algorithm parallel optimization method according to claim 1, wherein step S5 specifically comprises: S51, state variable grouping and computation path allocation: the state registers A, B, C, D are divided into a first state variable group whose update logic is executed by a first computation path, and the state registers E, F, G, H are divided into a second state variable group whose update logic is executed by a second computation path; S52, synchronous computation of the common intermediate value: at the start of the j-th round of iteration, the first computation path and the second computation path synchronize and compute the common intermediate value SS1 from the values of register A and register E updated in the previous round of iteration: SS1 = ((A <<< 12) + E + (T_j <<< j)) <<< 7; wherein, when j ∈ [0, 15], T_j = 0x79CC4519, and when j ∈ [16, 63], T_j = 0x7A879D8A; S53, asynchronous parallel computation of the update values: immediately after the common intermediate value SS1 is computed, it is distributed to the two computation paths, and the first computation path and the second computation path enter the asynchronous parallel execution stage: (a) the first computation path reads the message expansion word W'_j from the consumption buffer and independently performs the following computation to generate a first update value TT1: SS2 = SS1 ⊕ (A <<< 12); TT1 = FF_j(A, B, C) + D + SS2 + W'_j; (b) at the same time, the second computation path reads the message expansion word W_j from the consumption buffer and independently performs the following computation to generate a second update value TT2: TT2 = GG_j(E, F, G) + H + SS1 + W_j; wherein FF_j and GG_j are Boolean functions: when j ∈ [0, 15], FF_j(X, Y, Z) = X ⊕ Y ⊕ Z and GG_j(X, Y, Z) = X ⊕ Y ⊕ Z; when j ∈ [16, 63], FF_j(X, Y, Z) = (X ∧ Y) ∨ (X ∧ Z) ∨ (Y ∧ Z) and GG_j(X, Y, Z) = (X ∧ Y) ∨ (¬X ∧ Z); S54, asynchronous parallel update of the state variables: after the first computation path and the second computation path generate their respective update values, each independently updates its corresponding state variable group until both paths finish updating: (a) the first computation path updates the first state variable group with the first update value TT1, setting D ← C, C ← B <<< 9, B ← A, A ← TT1, preparing new A, B, C, D values for the next round of iteration; (b) the second computation path uses the second update value TT2 and, after applying the permutation function P0, updates the second state variable group, setting H ← G, G ← F <<< 19, F ← E, E ← P0(TT2), preparing new E, F, G, H values for the next round of iteration, wherein P0(X) = X ⊕ (X <<< 9) ⊕ (X <<< 17).
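A minimal software model of one round organized as the two-path split of claim 5 follows (illustrative names; in software the two "paths" run sequentially, whereas the claim has them execute concurrently on hardware):

```python
MASK = 0xFFFFFFFF

def rotl(x, n):
    """32-bit left rotation."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def p0(x):
    """Permutation P0 from the SM3 standard."""
    return x ^ rotl(x, 9) ^ rotl(x, 17)

def t_const(j):
    """Round constant T_j."""
    return 0x79CC4519 if j <= 15 else 0x7A879D8A

def ff(j, x, y, z):
    """Boolean function FF_j."""
    return x ^ y ^ z if j <= 15 else (x & y) | (x & z) | (y & z)

def gg(j, x, y, z):
    """Boolean function GG_j."""
    return x ^ y ^ z if j <= 15 else (x & y) | (~x & z & MASK)

def round_two_paths(j, state, wj, wpj):
    """One compression round as the claimed two-path split: SS1 is the
    single synchronization point (it needs A from path 1 and E from
    path 2); everything after it touches disjoint state halves."""
    a, b, c, d, e, f, g, h = state
    ss1 = rotl((rotl(a, 12) + e + rotl(t_const(j), j % 32)) & MASK, 7)
    # path 1: consumes W'_j, updates A, B, C, D
    ss2 = ss1 ^ rotl(a, 12)
    tt1 = (ff(j, a, b, c) + d + ss2 + wpj) & MASK
    # path 2: consumes W_j, updates E, F, G, H
    tt2 = (gg(j, e, f, g) + h + ss1 + wj) & MASK
    return (tt1, a, rotl(b, 9), c, p0(tt2), e, rotl(f, 19), g)
```

After SS1 is broadcast, path 1 touches only (A, B, C, D, W'_j) and path 2 only (E, F, G, H, W_j), which is why the claim can let the two bodies run asynchronously until the next round's SS1.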
  6. A parallel optimization verification method for the SM3 hash algorithm, used to verify a hardware circuit implemented by the method according to any one of claims 1–5, comprising the following steps: S1, in a verification environment, constructing a software reference model of the standard serial SM3 hash algorithm, the reference model receiving message expansion words as input and generating standard intermediate state variable reference values round by round; in the verified SM3 hash algorithm hardware implementation adopting the parallel optimization method, hardware monitoring probes are arranged for the first computation path and the second computation path respectively, and are used to capture and output the actual intermediate state variable values updated by the two paths after each round of compression iteration is completed; S2, for each round j of the compression iteration, the following operations are performed: (a) synchronous stimulus: before the j-th round of iteration starts, the same message expansion words W_j and W'_j are input simultaneously into the reference model and the hardware implementation; (b) the software reference model and the hardware implementation are triggered to execute the j-th round of computation, and after the minimum synchronization point of the hardware implementation is completed, the hardware monitoring probes respectively capture the actual values of the first state variable group updated by the first computation path and the actual values of the second state variable group updated by the second computation path; (c) cross-comparison: the actual values captured from the hardware implementation are compared bit by bit with the reference values generated by the reference model, the first computation path's actual values against the reference values of A–D, and the second computation path's actual values against the reference values of E–H; S3, error localization and judgment: if, in the comparison of step S2, the actual value and the reference value of any state variable in any round are inconsistent, it is judged that the parallel optimization logic of the hardware implementation contains an error, and the error source is precisely located from the first inconsistent comparison path and round j; S4, constructing a performance test environment on the target NPU hardware platform, the performance test environment comprising an input module for generating batched messages, a parallel optimization algorithm module for deploying the parallel-optimized SM3 algorithm, and a performance statistics module for counting throughput, latency and NPU resource utilization, and comparing the performance indicators of the parallel-optimized SM3 algorithm with those of the original SM3 algorithm.
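The per-round cross-comparison and error localization of steps S2–S3 can be sketched as a simple comparator (all names are illustrative; in a real testbench the two "probe" lists would come from the hardware monitors):

```python
def cross_check(ref_rounds, hw_path1, hw_path2):
    """Per-round cross-comparison sketch for the two-path verification.

    ref_rounds : list of reference states (A..H) per round j, from the
                 serial software model
    hw_path1   : list of (A, B, C, D) tuples captured by the path-1 probe
    hw_path2   : list of (E, F, G, H) tuples captured by the path-2 probe
    Returns (True, None) on full agreement, otherwise
    (False, (j, path)) identifying the first mismatching round and path.
    """
    for j, ref in enumerate(ref_rounds):
        if tuple(ref[:4]) != tuple(hw_path1[j]):
            return False, (j, 1)   # first mismatch: round j, path 1
        if tuple(ref[4:]) != tuple(hw_path2[j]):
            return False, (j, 2)   # first mismatch: round j, path 2
    return True, None
```

Returning the first mismatching (round, path) pair is exactly the localization signal step S3 uses to attribute the error to one computation path and one iteration round.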
  7. An SM3 hash algorithm parallel optimization system, used to execute the SM3 hash algorithm parallel optimization method as set forth in any one of claims 1–5, comprising: an input preprocessing module for splitting an input 512-bit message block into 16 32-bit message expansion words W_0–W_15 and storing them in an input register set; a systolic array module comprising 52 processing units PE_16–PE_67, each processing unit PE_j correspondingly generating one message expansion word W_j, configured to perform message expansion round by round, each round activating 3 consecutive processing units to generate 3 message expansion words in parallel, each processing unit PE_j, immediately after generating W_j, writing W_j to the corresponding addresses of the production buffer and the output register set; an exclusive-OR unit array module comprising 64 combinational-logic exclusive-OR units XOR_0–XOR_63, each exclusive-OR unit XOR_j correspondingly generating one message expansion word W'_j and, as soon as W_j and W_{j+4} exist in the input register set or have been generated and written into the output register set, immediately reading W_j and W_{j+4}, performing the exclusive-OR operation to generate W'_j, and writing W'_j to the corresponding address of the production buffer; a double buffer module comprising a production buffer and a consumption buffer, each buffer having a storage capacity of 6 groups of data, each data group G_j comprising 1 W_j and 1 W'_j, the production buffer being used for writing W_j and W'_j, the consumption buffer being used for the compression iteration module to read W_j and W'_j for iterative compression, a synchronization signal being triggered when the write pointer of the production buffer reaches the end and the read pointer of the consumption buffer reaches the end, whereupon the roles of the production buffer and the consumption buffer are exchanged; and a compression iteration module comprising a first computation path and a second computation path, realizing asynchronous parallel computation and update of the compression iteration through dual-path asynchronous iteration.

Description

SM3 hash algorithm parallel optimization and verification method and system

Technical Field

The invention relates to the technical field of information security and cryptography, and in particular to an SM3 hash algorithm parallel optimization and verification method and system.

Background

The SM3 hash algorithm is a commercial cryptographic hash algorithm standard issued by the State Cryptography Administration of China. As a cornerstone of information security systems, it is widely applied in many fields of information security, such as digital signatures, message authentication codes and data integrity verification. With the rapid development of emerging application scenarios such as intelligent driving, industrial visual inspection and edge computing, extremely high demands are placed on the real-time performance and throughput of data encryption and verification, and in these scenarios an NPU is often used as the core computing unit. However, the serial structure of the two stages of the conventional SM3 algorithm, message expansion and compression iteration, leads to insufficient utilization of hardware resources on parallel processors such as NPUs.
In the implementation of the conventional SM3 algorithm, the calculation of each message expansion word W_j in the message expansion stage depends on words among the previous 16 message expansion words, message expansion is usually carried out serially and sequentially, and W_j and W'_j are read only after message expansion is fully completed. The compression iteration process comprises 64 rounds of loops; each round of updating needs to sequentially calculate several intermediate variable values and, only after all of them are calculated, sequentially update the 8 32-bit state registers A to H with them, and according to the SM3 algorithm standard, the state register update of the j-th round of iteration strictly depends on the state register values produced by the (j-1)-th round of iteration. These tight data dependencies constitute a lengthy computation path and greatly limit the potential for improving algorithm efficiency through parallel computation. Therefore, an SM3 hash algorithm parallel optimization and verification method and system capable of deep parallel computation are needed to meet the requirements of these encryption scenarios.

Disclosure of Invention

In order to solve the problems in the background art, the invention provides a parallel optimization and verification method and system for the SM3 hash algorithm, which address the low hardware resource utilization, high computation latency and limited overall throughput caused by the serial generation mechanism of message expansion, the serial execution of message expansion and compression iteration, and the tight data dependencies inside the compression iteration when the conventional SM3 algorithm is implemented on an NPU processor, and which realize asynchronous parallel processing of parallel message expansion and compression iteration as well as parallel computation within the compression iteration process.
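For reference, the serial baseline whose dependency chain the background describes can be written as a compact software model. This is an illustrative sketch of the standard serial SM3 algorithm (the kind of reference model claim 6 verifies against), not the patented parallel implementation:

```python
import struct

MASK = 0xFFFFFFFF
IV = [0x7380166F, 0x4914B2B9, 0x172442D7, 0xDA8A0600,
      0xA96F30BC, 0x163138AA, 0xE38DEE4D, 0xB0FB0E4E]

def _rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def _p0(x): return x ^ _rotl(x, 9) ^ _rotl(x, 17)
def _p1(x): return x ^ _rotl(x, 15) ^ _rotl(x, 23)

def _compress(v, block):
    """One serial compression of a 64-byte block into chaining value v."""
    w = list(struct.unpack(">16I", block))
    for j in range(16, 68):                      # serial message expansion
        w.append(_p1(w[j-16] ^ w[j-9] ^ _rotl(w[j-3], 15))
                 ^ _rotl(w[j-13], 7) ^ w[j-6])
    wp = [w[j] ^ w[j+4] for j in range(64)]
    a, b, c, d, e, f, g, h = v
    for j in range(64):                          # 64 strictly serial rounds
        t = 0x79CC4519 if j <= 15 else 0x7A879D8A
        ss1 = _rotl((_rotl(a, 12) + e + _rotl(t, j % 32)) & MASK, 7)
        ss2 = ss1 ^ _rotl(a, 12)
        if j <= 15:
            ffo, ggo = a ^ b ^ c, e ^ f ^ g
        else:
            ffo = (a & b) | (a & c) | (b & c)
            ggo = (e & f) | (~e & g & MASK)
        tt1 = (ffo + d + ss2 + wp[j]) & MASK
        tt2 = (ggo + h + ss1 + w[j]) & MASK
        d, c, b, a = c, _rotl(b, 9), a, tt1
        h, g, f, e = g, _rotl(f, 19), e, _p0(tt2)
    return [x ^ y for x, y in zip(v, [a, b, c, d, e, f, g, h])]

def sm3(msg: bytes) -> str:
    """Serial SM3 digest with standard padding, as a hex string."""
    bitlen = len(msg) * 8
    msg += b"\x80" + b"\x00" * ((55 - len(msg)) % 64) + struct.pack(">Q", bitlen)
    v = IV
    for i in range(0, len(msg), 64):
        v = _compress(v, msg[i:i+64])
    return "".join("%08x" % x for x in v)
```

The model can be checked against the standard "abc" test vector; its round loop makes the dependency chain explicit: every quantity in round j reads state written by round j-1.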
In order to achieve the above purpose, the present invention adopts the following technical scheme. The SM3 hash algorithm parallel optimization method is applied to encryption scenarios such as intelligent driving assistance systems, industrial visual inspection platforms and edge computing terminals that take an NPU as the core computing unit, and comprises the following steps: S1, splitting an input 512-bit message block into 16 32-bit message expansion words W_0–W_15 and storing them in an input register set; S2, generating message expansion words W_16–W_67 in parallel through the systolic array of the NPU; S3, synchronously generating message expansion words W'_0–W'_63 through the exclusive-OR unit array of the NPU; S4, storing W_0–W_15 from the input register set, W_16–W_67 generated by the systolic array and W'_0–W'_63 generated by the exclusive-OR unit array into a double buffer of the NPU, the double buffer comprising a production buffer for writing and a consumption buffer for reading, and realizing asynchronous parallelism of message expansion and compression iteration by switching the read and write roles after the production buffer is full; S5, reading W_j and W'_j from the consumption buffer of the double buffer and, during compression iteration, deconstructing the state registers A–H and their update logic into two parallel computation paths, with asynchronous execution between the paths realized through a minimum synchronization mechanism. Specifically, the systolic array in step S2 includes 52 processing units PE_16–PE_67, each processing unit PE_j correspondingly generating one message expansion word W_j; the systolic array performs message expansion round by round, 3 consecutive processing units are activated per round, and 3 message expansion words are generated in parallel