CN-121981187-A - Internet of things large model pruning compression method and system based on Attention matrix optimization

CN 121981187 A

Abstract

The invention provides an Internet of Things large-model pruning and compression method and system based on Attention matrix optimization, in the technical field of data processing. The method comprises: obtaining a pre-trained Internet of Things large model, extracting the parameters in its Attention matrices, and obtaining a parameter set to be optimized; performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data; and, taking the obtained parameter importance distribution data as the processing basis, establishing high-importance, medium-importance and low-importance calibrations to construct a ternary calibration matrix. The invention effectively balances compression ratio and inference accuracy.

Inventors

  • WU ZHIHUI
  • ZHANG YINFENG
  • SUN KE

Assignees

  • 厦门天堉物联网科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. An Internet of Things large-model pruning and compression method based on Attention matrix optimization, characterized by comprising the following steps: acquiring a pre-trained Internet of Things large model, extracting the parameters in its Attention matrices, and obtaining a parameter set to be optimized; performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data; taking the obtained parameter importance distribution data as the processing basis, establishing high-importance, medium-importance and low-importance calibrations to construct a ternary calibration matrix; performing regularized discretization of the parameter space defined by the ternary calibration matrix, performing high-precision trajectory fitting of the parameter importance aggregation degree and gradient variation trend within each discrete region, characterizing the parameter importance aggregation features of the discrete units through the fitting results, and constructing a pruning granularity mapping corresponding to the discrete structure based on the continuous variation trajectories generated by the fitting; mapping the grid division result back to the spatial positions of the original Attention matrix through the pruning granularity mapping, performing regional structured pruning of the Attention matrix according to the pruning granularity of each grid cell, and removing the parameters ranked lowest in importance within each region to obtain a pruned Attention matrix; and reconstructing the network structure of the Internet of Things large model according to the pruned Attention matrix, performing targeted fine-tuning, and outputting a compressed lightweight model, thereby reducing the model parameter count and improving inference efficiency.
  2. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 1, characterized in that acquiring a pre-trained Internet of Things large model, extracting the parameters in the Attention matrix and obtaining a parameter set to be optimized comprises: reading a pre-trained Internet of Things large-model file from a preset model repository and loading it into working memory; parsing the network structure of the loaded model, and identifying and locating all attention layers in the model; for each identified attention layer, accessing its Attention matrix and extracting all weight parameters therein; and aggregating the weight parameters extracted from all attention layers to generate a structured parameter set to be optimized, wherein each record in the parameter set is associated with the corresponding attention-layer identifier and the parameter's specific position in the Attention matrix.
  3. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 2, characterized in that performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data comprises: retrieving the attention-layer identifier and matrix position information associated with each record in the parameter set, constructing a parameter index mapping table, and grouping the parameter set in parallel according to the index mapping table to form intra-layer parameter subsets corresponding one-to-one to the attention layers; for each intra-layer parameter subset, synchronously executing a two-channel importance measurement, wherein the first channel computes the absolute magnitude of each parameter as a static importance index, and the second channel computes the gradient magnitude of each parameter, based on the loss-function gradients obtained by forward propagation and backpropagation over a preset validation data set, as a dynamic sensitivity index; weighting and fusing the static importance index and dynamic sensitivity index of each parameter to generate a fused importance score; rearranging the fused importance scores of all parameters in each intra-layer subset according to their spatial positions in the Attention matrix to form an importance heatmap matrix with the same dimensions as the original Attention matrix; and globally normalizing the importance heatmap matrix to generate structured parameter importance distribution data.
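The two-channel measurement in claim 3 can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not code from the patent: the equal-weight fusion coefficient `alpha = 0.5` and the min-max normalization are assumptions, since the claim only specifies a weighted fusion followed by global normalization.

```python
import numpy as np

def fused_importance(weights: np.ndarray, grads: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Fuse a static channel (|w|) and a dynamic channel (|dL/dw|) into
    per-parameter importance scores, then globally normalize to [0, 1]."""
    static_score = np.abs(weights)           # channel 1: absolute magnitude
    dynamic_score = np.abs(grads)            # channel 2: gradient sensitivity
    fused = alpha * static_score + (1.0 - alpha) * dynamic_score
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-12)  # importance heatmap, same shape as the matrix
```

The result has the same dimensions as the original Attention matrix, so each score keeps its spatial position, as the claim requires.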
  4. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 3, characterized in that taking the obtained parameter importance distribution data as the processing basis, establishing high-importance, medium-importance and low-importance calibrations and constructing a ternary calibration matrix comprises: receiving the parameter importance distribution data, performing global histogram statistics on the distribution data to generate a frequency histogram of importance scores, and identifying a first peak interval and a second peak interval based on the bimodal characteristics of the histogram; taking the right boundary of the first peak interval as the lower threshold of the high-importance region and the left boundary of the second peak interval as the upper threshold of the low-importance region, and defining the continuous interval between the two thresholds as the value range of the medium-importance region; partitioning the parameter importance distribution data into three intervals according to the two thresholds, calibrating parameters whose importance scores are not lower than the lower threshold of the high-importance region as the high-importance category, parameters whose importance scores are not higher than the upper threshold of the low-importance region as the low-importance category, and parameters whose importance scores lie between the two thresholds as the medium-importance category, to obtain three classes of calibration results; and binding the three classes of calibration results to the parameters' original spatial coordinates in the Attention matrix to construct a ternary calibration matrix containing category labels and position indexes.
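Given the two thresholds, the triple partition of claim 4 reduces to elementwise labeling. A minimal sketch, assuming the thresholds have already been derived from the bimodal histogram; the integer labels 0/1/2 are an illustrative encoding of the three calibration categories, and position indexes are carried implicitly by the array layout.

```python
import numpy as np

def ternary_calibration(importance: np.ndarray,
                        low_thr: float, high_thr: float) -> np.ndarray:
    """Label each parameter 2 (high), 1 (medium) or 0 (low importance).
    low_thr / high_thr would come from the histogram's two peak boundaries."""
    labels = np.full(importance.shape, 1, dtype=np.int8)  # medium by default
    labels[importance >= high_thr] = 2                    # high-importance calibration
    labels[importance <= low_thr] = 0                     # low-importance calibration
    return labels
```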
  5. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 4, characterized in that performing regularized discretization of the parameter space defined by the ternary calibration matrix, performing high-precision trajectory fitting of the parameter importance aggregation degree and gradient variation trend within each discrete region, characterizing the parameter importance aggregation features of the discrete units through the fitting results, and constructing a pruning granularity mapping corresponding to the discrete structure based on the continuous variation trajectories generated by the fitting comprises: adaptively determining the grid-division granularity in the row and column directions based on the spatial category distribution density of the ternary calibration matrix, and partitioning the parameter space into a set of regularized rectangular cells; for each rectangular cell, counting the proportion of high-importance parameters within the cell and generating an aggregation-degree quantization index sequence bound to the cell position; computing first-order differences between adjacent cells based on the aggregation-degree quantization index sequence, constructing a two-dimensional gradient vector field representing the spatial variation of importance, and extracting the gradient magnitude and direction change rate of each cell as gradient-variation-trend quantization indexes; taking the cell center coordinates as input, fusing the aggregation-degree indexes and gradient-variation-trend indexes, and performing smooth fitting to obtain a continuous importance-variation trajectory surface; and dynamically allocating pruning proportions according to the local curvature and gradient characteristics of the trajectory surface, allocating a large pruning proportion to regions of gentle curvature and a small pruning proportion to regions of steep curvature, thereby constructing a pruning granularity mapping table in strict correspondence with the rectangular cell positions.
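A simplified stand-in for claim 5's grid statistics and granularity mapping. This sketch replaces the patent's smooth trajectory-surface fitting with a direct gradient of the per-cell high-importance density (flat cells get a large pruning ratio, steep cells a small one); the fixed cell size and the `max_ratio`/`min_ratio` bounds are illustrative assumptions.

```python
import numpy as np

def pruning_granularity_map(labels: np.ndarray, cell: int,
                            max_ratio: float = 0.8,
                            min_ratio: float = 0.2) -> np.ndarray:
    """Per-cell pruning ratios from the ternary calibration matrix:
    cells where the high-importance density varies little (gentle landscape)
    receive a large ratio; cells on steep density gradients a small one."""
    rows, cols = labels.shape[0] // cell, labels.shape[1] // cell
    density = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = labels[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            density[i, j] = np.mean(block == 2)     # high-importance proportion per cell
    gy, gx = np.gradient(density)                   # first-order differences between cells
    steepness = np.hypot(gx, gy)                    # gradient magnitude of the density field
    norm = steepness / (steepness.max() + 1e-12)
    return max_ratio - (max_ratio - min_ratio) * norm  # flat -> max_ratio, steep -> min_ratio
```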
  6. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 5, characterized in that mapping the grid division result back to the spatial positions of the original Attention matrix through the pruning granularity mapping, performing regional structured pruning of the Attention matrix according to the pruning granularity of each grid cell, and removing the parameters ranked lowest in importance within each region to obtain a pruned Attention matrix comprises: mapping each pruning proportion value, via the pruning granularity mapping table and the spatial position index of each rectangular cell in the table, to the corresponding row-and-column coordinate region of the original Attention matrix; for each mapped rectangular cell region, retrieving the parameter importance distribution data and sorting all parameters in the region in descending order of importance score; truncating the sorted parameter sequence according to the pruning proportion value, removing the lowest-ranked parameters whose cumulative proportion reaches the cell's pruning proportion, and retaining the high-importance parameter subset; imposing a structural constraint on the retained parameters, and within each rectangular cell recombining the retained parameters into contiguous memory blocks in the original row-and-column order of the matrix, to obtain the processed parameter block of each rectangular cell; and splicing the processed parameter blocks of all rectangular cells according to their spatial positions and reconstructing them into a sparse Attention matrix of the full original dimensions, to obtain the pruned Attention matrix.
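The per-region removal step of claim 6 can be sketched as follows. This is an assumed minimal implementation: pruned parameters are zeroed in place (yielding the sparse block the claim splices back), rather than physically recombined into contiguous memory.

```python
import numpy as np

def prune_region(weights: np.ndarray, importance: np.ndarray,
                 prune_ratio: float) -> np.ndarray:
    """Zero out the lowest-importance fraction of one rectangular cell."""
    flat = importance.ravel()
    k = int(round(prune_ratio * flat.size))  # number of parameters to remove
    if k == 0:
        return weights.copy()
    cut = np.argsort(flat)[:k]               # indices of the k least important
    mask = np.ones(flat.size, dtype=bool)
    mask[cut] = False                        # keep everything else
    return (weights.ravel() * mask).reshape(weights.shape)
```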
  7. The Internet of Things large-model pruning and compression method based on Attention matrix optimization according to claim 6, characterized in that reconstructing the network structure of the Internet of Things large model according to the pruned Attention matrix, performing targeted fine-tuning, outputting a compressed lightweight model, and reducing the model parameter count and improving inference efficiency comprises: backfilling the pruned Attention matrix precisely into the corresponding attention layer according to the attention-layer identifier and position index, replacing the original weight parameters, to obtain a structurally sparsified intermediate model; performing sparse-pattern adaptation on the intermediate model, reorganizing unstructured sparse parameters into regular sparse blocks according to a block-sparsity rule, to obtain an adapted model; performing two-stage fine-tuning on the adapted model, wherein the first stage fine-tunes only the retained parameters at a small learning rate to restore representational capacity, and the second stage introduces a knowledge-distillation loss function, using the original pre-trained model as a teacher model to guide the optimization and compensate for the performance loss introduced by pruning, to obtain a fine-tuned model; performing joint verification of inference latency and accuracy on the fine-tuned model: if the accuracy loss exceeds a preset threshold, triggering a local parameter recovery mechanism, otherwise converting the model into a deployment format supported by the target Internet of Things device, to obtain a verified lightweight model; and generating, from the verified lightweight model, a compression-efficiency report containing the parameter compression ratio, theoretical inference speedup ratio and memory-footprint reduction ratio, completing the end-to-end model compression flow for Internet of Things scenarios.
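The second fine-tuning stage of claim 7 adds a knowledge-distillation loss against the original pre-trained teacher. A common form of such a loss (temperature-softened KL divergence) is sketched below; the patent does not specify the exact objective, so the temperature `T` and the KL formulation are assumptions.

```python
import numpy as np

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray, T: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened logits, scaled by T^2,
    as commonly used to let the teacher guide the pruned student."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / T)            # teacher's softened distribution
    q = softmax(student_logits / T)            # student's softened distribution
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))) * T * T)
```

When the student matches the teacher the loss is zero, so the term only penalizes the behavioral drift introduced by pruning.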
  8. An Internet of Things large-model pruning and compression system based on Attention matrix optimization, which implements the method according to any one of claims 1 to 7, comprising: an acquisition module for acquiring a pre-trained Internet of Things large model, extracting the parameters in the Attention matrix, and obtaining a parameter set to be optimized; an evaluation module for performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data; a fitting module for taking the obtained parameter importance distribution data as the processing basis, establishing high-importance, medium-importance and low-importance calibrations to construct a ternary calibration matrix, performing regularized discretization of the parameter space defined by the ternary calibration matrix, performing high-precision trajectory fitting of the parameter importance aggregation degree and gradient variation trend within each discrete region, characterizing the parameter importance aggregation features of the discrete units through the fitting results, and constructing a pruning granularity mapping corresponding to the discrete structure based on the continuous variation trajectories generated by the fitting; a mapping module for mapping the grid division result back to the spatial positions of the original Attention matrix through the pruning granularity mapping, performing regional structured pruning of the Attention matrix according to the pruning granularity of each grid cell, and removing the parameters ranked lowest in importance within each region to obtain a pruned Attention matrix; and a compression module for reconstructing the network structure of the Internet of Things large model according to the pruned Attention matrix, performing targeted fine-tuning, and outputting a compressed lightweight model, thereby reducing the model parameter count and improving inference efficiency.
  9. A computing device, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program which, when executed by a processor, implements the method according to any one of claims 1 to 7.

Description

Internet of things large model pruning compression method and system based on Attention matrix optimization

Technical Field

The invention relates to the technical field of data processing, and in particular to an Internet of Things large-model pruning and compression method and system based on Attention matrix optimization.

Background

With the wide deployment of Internet of Things terminals, Internet of Things large models used for real-time data processing and device-state monitoring generally suffer from excessive storage occupation and long inference latency because they contain massive numbers of Attention parameters; efficient pruning and compression technology has therefore become an urgent industry need. In one intelligent-agriculture Internet of Things system, a large model for crop pest and disease image recognition adopted a traditional global uniform pruning method: after the parameters of the Attention matrix were cut at a fixed proportion, a certain reduction in parameter count was achieved, but because parameter importance was insufficiently differentiated, key Attention weights representing specific pest and disease features were wrongly deleted. Model recognition accuracy therefore dropped, the practical requirements of field monitoring could not be met, and an effective balance between model compression ratio and inference accuracy was difficult to achieve.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an Internet of Things large-model pruning and compression method and system based on Attention matrix optimization, so that compression ratio and inference accuracy are effectively balanced.
In order to solve the above technical problem, the technical scheme of the invention is as follows. In a first aspect, an Internet of Things large-model pruning and compression method based on Attention matrix optimization comprises the following steps: acquiring a pre-trained Internet of Things large model, extracting the parameters in its Attention matrices, and obtaining a parameter set to be optimized; performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data; taking the obtained parameter importance distribution data as the processing basis, establishing high-importance, medium-importance and low-importance calibrations to construct a ternary calibration matrix; performing regularized discretization of the parameter space defined by the ternary calibration matrix, performing high-precision trajectory fitting of the parameter importance aggregation degree and gradient variation trend within each discrete region, characterizing the parameter importance aggregation features of the discrete units through the fitting results, and constructing a pruning granularity mapping corresponding to the discrete structure based on the continuous variation trajectories generated by the fitting; mapping the grid division result back to the spatial positions of the original Attention matrix through the pruning granularity mapping, performing regional structured pruning of the Attention matrix according to the pruning granularity of each grid cell, and removing the parameters ranked lowest in importance within each region to obtain a pruned Attention matrix; and reconstructing the network structure of the Internet of Things large model according to the pruned Attention matrix, performing targeted fine-tuning, and outputting a compressed lightweight model, thereby reducing the model parameter count and improving inference efficiency.
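The overall flow above can be tied together in a single compact sketch. This is an illustrative reconstruction, not the patent's implementation: quantile thresholds stand in for the bimodal-histogram thresholds, a fixed cell size stands in for adaptive grid division, and the per-cell pruning ratio is driven directly by the high-importance density rather than a fitted trajectory surface.

```python
import numpy as np

def compress_attention(weights, grads, cell=2, max_ratio=0.8, min_ratio=0.2):
    """End-to-end sketch: importance scoring, ternary calibration,
    grid discretization, per-cell pruning ratios, regional pruning."""
    # Two-channel importance, equal weights, globally normalized.
    score = 0.5 * np.abs(weights) + 0.5 * np.abs(grads)
    score = (score - score.min()) / (score.max() - score.min() + 1e-12)
    # Ternary calibration via quantile thresholds (assumed simplification).
    low_thr, high_thr = np.quantile(score, [0.3, 0.7])
    labels = np.where(score >= high_thr, 2, np.where(score <= low_thr, 0, 1))
    pruned = weights.copy()
    for i in range(0, weights.shape[0], cell):
        for j in range(0, weights.shape[1], cell):
            blk = score[i:i+cell, j:j+cell]
            # Cells dense in high-importance parameters are pruned less.
            density = np.mean(labels[i:i+cell, j:j+cell] == 2)
            ratio = max_ratio - (max_ratio - min_ratio) * density
            k = int(round(ratio * blk.size))
            if k:
                cut = np.unravel_index(np.argsort(blk, axis=None)[:k], blk.shape)
                sub = pruned[i:i+cell, j:j+cell]
                sub[cut] = 0.0  # zero the k least-important parameters in this cell
    return pruned
```

The backfilling, sparse-pattern adaptation and two-stage fine-tuning of the later claims would then operate on the returned sparse matrix.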
Further, acquiring a pre-trained Internet of Things large model, extracting the parameters in the Attention matrix and obtaining a parameter set to be optimized comprises: reading a pre-trained Internet of Things large-model file from a preset model repository and loading it into working memory; parsing the network structure of the loaded model, and identifying and locating all attention layers in the model; for each identified attention layer, accessing its Attention matrix and extracting all weight parameters therein; and aggregating the weight parameters extracted from all attention layers to generate a structured parameter set to be optimized, wherein each record in the parameter set is associated with the corresponding attention-layer identifier and the parameter's specific position in the Attention matrix. Further, performing importance evaluation on the parameters in the Attention matrix based on the parameter set to generate parameter importance distribution data comprises: retrieving the attention-layer identifier and matrix position information associated with each record in the parameter set, constructing a parameter index mapping table, and grouping the pa