CN-121841638-B - High-order mask comparison implementation method and system based on GPU acceleration

CN121841638BCN 121841638 BCN121841638 BCN 121841638BCN-121841638-B

Abstract

The invention provides a high-order mask comparison implementation method and system based on GPU acceleration, and relates to the technical field of post quantum cryptography. The method comprises the steps of generating random numbers for each thread independently at a GPU end, converting input data from an arithmetic mask form to a Boolean mask form by utilizing the random numbers, selecting a direct path or a bit slice path according to a data scale to carry out zero value detection, if the direct path is adopted, maintaining the original data layout, multiplexing the same random numbers for a plurality of mask shares of the same data unit, judging whether the unit is zero or not through an accumulation result, if the bit slice path is adopted, reorganizing the data into bit slices, aggregating the same bit of the plurality of data units by each slice, carrying out differentiation processing and accumulation on each unit in the slice according to the positions of the bit slices, and realizing multi-unit parallel detection, thereby improving the performance and the safety of higher-order mask comparison under the GPU environment.

Inventors

LI YANBIN
LI CHENGSHAN
GE CHUNPENG

Assignees

山东大学

Dates

Publication Date: 20260508
Application Date: 20260316

Claims (10)

1. The method for realizing the high-order mask comparison based on GPU acceleration is characterized by comprising the following steps of: independently generating random numbers for each thread at the GPU end; Converting the input data from an arithmetic mask to a Boolean mask by utilizing the random number, wherein the conversion is completed through a lookup table, the lookup table simultaneously codes a conversion result and a carry state, and each thread sequentially looks up the table to complete the conversion of the responsible shares; And the bit slice path reorganizes the data into bit slices, each slice aggregates the same bit bits of a plurality of data units, performs differentiation processing and accumulation on each unit in the slice according to the positions of the same bit bits, and realizes multi-unit parallel detection.
2. The method for implementing high-order mask comparison based on GPU acceleration according to claim 1, wherein the random number generation process specifically comprises: maintaining an independent pseudo-random number generator state for each thread at a GPU (graphics processing unit) end, generating a high-entropy seed for each thread by using a hardware random number at a host end, and initializing a key and a random number of each thread; each thread independently executes a generating function at the equipment end, and the generated random number directly resides in the equipment memory for conversion and detection and calling; all random number generation operations adopt a non-branch design, and state data is managed through memory pooling.
3. The method for implementing high-order mask comparison based on GPU acceleration according to claim 1, wherein the conversion is accomplished by a lookup table, the lookup table encodes the conversion result and the carry state simultaneously, and each thread sequentially looks up the conversion of the responsible shares is accomplished, specifically comprising: Dividing an input arithmetic mask value into a plurality of sequential blocks according to a fixed bit width; For each block, reading a corresponding item from a lookup table preset in a GPU constant memory, wherein the corresponding item comprises a Boolean conversion result of the current block and a carry state of the next block; The thread sequentially looks up the table according to the current carry state, and combines the block results into a complete Boolean mask for output; the different threads process respective mask shares in parallel, the table look-up process is free of branches, the access mode is fixed, and the whole conversion process is completed in constant time.
4. The method for implementing high-order mask comparison based on GPU acceleration according to claim 1, wherein the direct path maintains an original data layout, multiplexes a plurality of mask shares of a same data unit with a same random number, and determines whether the unit is zero by accumulating the result, specifically comprising: each thread independently processes a data unit and reads all mask shares in the unit in turn; Carrying out carry-free multiplication operation by using the same random number and each share, and carrying out bit exclusive or accumulation on operation results of each time; after traversing all the shares of the unit, reflecting the projection of the true value of the unit through the accumulated value; And executing the same operation on all the data units to be compared, judging that the comparison is equal if the accumulated value of all the units is zero, and otherwise, judging that the comparison is unequal.
5. The method for implementing high-order mask comparison based on GPU acceleration according to claim 1, wherein the bit slice path reorganizes data into bit slices, each slice aggregates identical bits of a plurality of data units, performs differential processing and accumulation on units in the slice according to positions thereof, and implements multi-unit parallel detection, specifically comprising: reorganizing the original data into a bit slice format by batches, bits and groups, wherein each bit slice comprises mask shares of the same bit of a plurality of data units; when in processing, carrying out carry-free multiplication on each bit slice by using the same random number to obtain an intermediate value; according to the position of each data unit in the bit slice in the original group, respectively shifting the intermediate value left by a corresponding bit number and accumulating the intermediate value into a corresponding high-low accumulator; After all bit slice processing is completed, cross-share reconstruction and bitwise OR operation are carried out on all accumulators, and whether all data units are zero at the same time is judged according to whether the result is zero.
6. A high-order mask comparison implementation system based on GPU acceleration, comprising: The random number generation module is used for independently generating random numbers for each thread at the GPU end; the Boolean conversion module is used for converting the input data from the arithmetic mask to the Boolean mask by utilizing the random number, wherein the conversion is completed through a lookup table, the lookup table simultaneously codes a conversion result and a carry state, and each thread sequentially looks up the table to complete the conversion of the responsible shares; the path selection module is used for selecting a direct path or a bit slice path according to the data scale to carry out zero value detection, wherein the direct path keeps the original data layout, multiplexes the same random number on a plurality of mask shares of the same data unit, judges whether the unit is zero or not according to the accumulation result, reorganizes the data into bit slices, each slice aggregates the same bit of a plurality of data units, carries out differentiation processing and accumulation on each unit in the slice according to the positions of the units, and realizes multi-unit parallel detection.
7. The GPU-accelerated high order mask comparison implementation system of claim 6, wherein said random number generation process is specifically: maintaining an independent pseudo-random number generator state for each thread at a GPU (graphics processing unit) end, generating a high-entropy seed for each thread by using a hardware random number at a host end, and initializing a key and a random number of each thread; each thread independently executes a generating function at the equipment end, and the generated random number directly resides in the equipment memory for conversion and detection and calling; all random number generation operations adopt a non-branch design, and state data is managed through memory pooling.
8. The GPU-accelerated higher order mask comparison implementation system of claim 6, wherein the conversion is accomplished by a lookup table that encodes the conversion result and the carry state simultaneously, each thread sequentially looking up the conversion of the responsible shares, specifically comprising: Dividing an input arithmetic mask value into a plurality of sequential blocks according to a fixed bit width; For each block, reading a corresponding item from a lookup table preset in a GPU constant memory, wherein the corresponding item comprises a Boolean conversion result of the current block and a carry state of the next block; The thread sequentially looks up the table according to the current carry state, and combines the block results into a complete Boolean mask for output; the different threads process respective mask shares in parallel, the table look-up process is free of branches, the access mode is fixed, and the whole conversion process is completed in constant time.
9. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of a GPU acceleration based high order mask comparison implementation method as claimed in any of claims 1-5.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps in a GPU acceleration based high order mask comparison implementation method as claimed in any of claims 1-5.

Description

High-order mask comparison implementation method and system based on GPU acceleration Technical Field The invention relates to the technical field of post quantum cryptography, in particular to a high-order mask comparison implementation method and system based on GPU acceleration. Background The high-order mask comparison is a core technology in the field of post quantum cryptography and side channel protection, and can effectively resist side channel attack by splitting sensitive data into multiple mask shares and completing equivalent test of non-secret value reconstruction, so that the method is a key link of safety data interaction in the scenes of industrial Internet of things gateway, cloud computing and the like, and GPU becomes a preferable hardware platform for large-scale engineering deployment of the technology by virtue of high parallelism. With the floor application of post quantum algorithms such as lattice passwords, the requirements of high-order mask comparison technology facing GPU environments are increasingly urgent, and the requirements of security against side channel attacks and performance requirements of large-data-volume batch processing are met at the same time, so that the requirements are important in the research of the current password engineering field. In the prior art, masking is realized based on a CPU or an embedded chip, the SIMD parallel architecture of the GPU is difficult to adapt, the parallel advantages of the SIMD parallel architecture cannot be exerted after direct transplantation, side channel leakage is easy to occur due to carry propagation and branch operation of data dependence, random number support of cryptology safety is lacking, and high-order protection requirements are difficult to meet. Meanwhile, the existing scheme does not design a special domain conversion and comparison path aiming at the GPU, the problem of high delay and high overhead exists in the conversion from the arithmetic mask to the Boolean mask, the redundancy is high in medium-small-scale data processing, the throughput is insufficient in large-scale batch processing, and the processing requirements of different workloads cannot be met. Disclosure of Invention In order to solve the problems, the invention provides a high-order mask comparison implementation method and a high-order mask comparison implementation system based on GPU acceleration, which realize the safety comparison function of high-performance and side-channel resistance through innovative bit slice data packing technology and parallelization design. In order to achieve the above purpose, the present invention adopts the following technical scheme: in a first aspect, the present invention provides a method for implementing high-order mask comparison based on GPU acceleration, including: independently generating random numbers for each thread at the GPU end; Converting the input data from an arithmetic mask to a Boolean mask by utilizing the random number, wherein the conversion is completed through a lookup table, the lookup table simultaneously codes a conversion result and a carry state, and each thread sequentially looks up the table to complete the conversion of the responsible shares; And the bit slice path reorganizes the data into bit slices, each slice aggregates the same bit bits of a plurality of data units, performs differentiation processing and accumulation on each unit in the slice according to the positions of the same bit bits, and realizes multi-unit parallel detection. In a second aspect, the present invention provides a high-order mask comparison implementation system based on GPU acceleration, including: The random number generation module is used for independently generating random numbers for each thread at the GPU end; the Boolean conversion module is used for converting the input data from the arithmetic mask to the Boolean mask by utilizing the random number, wherein the conversion is completed through a lookup table, the lookup table simultaneously codes a conversion result and a carry state, and each thread sequentially looks up the table to complete the conversion of the responsible shares; the path selection module is used for selecting a direct path or a bit slice path according to the data scale to carry out zero value detection, wherein the direct path keeps the original data layout, multiplexes the same random number on a plurality of mask shares of the same data unit, judges whether the unit is zero or not according to the accumulation result, reorganizes the data into bit slices, each slice aggregates the same bit of a plurality of data units, carries out differentiation processing and accumulation on each unit in the slice according to the positions of the units, and realizes multi-unit parallel detection. In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the ste