CN-121981270-A - End cloud collaborative reasoning method, equipment and medium based on image blocking and pruning


Abstract

The invention discloses an end-cloud collaborative reasoning method, equipment and medium based on image blocking and pruning. The method comprises: dividing an input image into a plurality of sub-image blocks; calculating a global pruning rate with the objective that the absolute value of the difference between the encoding computation time of each sub-image block and the upload time of the previous sub-image block is smaller than a preset threshold; linearly combining the color richness and texture entropy of each sub-image block to obtain an importance weight for that block; allocating the global pruning rate to the sub-image blocks in inverse relation to their importance weights, to obtain a local pruning rate for each sub-image block; performing pruning coding on each sub-image block according to its local pruning rate; and uploading the pruning-coded data to a server side, with the pruning coding and the uploading performed in a parallel pipeline mode. The invention improves the efficiency and usability of end-cloud collaborative reasoning.
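The reason the abstract asks for the encoding time of each block to stay close to the upload time of the previous block can be illustrated with a generic two-stage pipeline timing model. The sketch below is not taken from the patent: the function names `pipeline_makespan` and `max_stage_gap` and the example timings are hypothetical, and the model assumes one encoder and one uplink, each handling one block at a time.

```python
from typing import Sequence


def pipeline_makespan(encode_s: Sequence[float], upload_s: Sequence[float]) -> float:
    """Total time of a two-stage pipeline (encode, then upload) over all blocks.

    Assumed model: block k starts encoding when block k-1 has finished
    encoding, and starts uploading once it is encoded and the uplink is free.
    """
    if len(encode_s) != len(upload_s):
        raise ValueError("need one encode time and one upload time per block")
    encode_done = 0.0
    upload_done = 0.0
    for enc, up in zip(encode_s, upload_s):
        encode_done += enc
        # Upload waits for both the encoder output and the previous upload.
        upload_done = max(upload_done, encode_done) + up
    return upload_done


def max_stage_gap(encode_s: Sequence[float], upload_s: Sequence[float]) -> float:
    """Largest |encode time of block k - upload time of block k-1|."""
    return max(
        (abs(encode_s[k] - upload_s[k - 1]) for k in range(1, len(encode_s))),
        default=0.0,
    )


if __name__ == "__main__":
    encode = [1.0] * 4           # hypothetical per-block encode times (seconds)
    slow_upload = [3.0] * 4      # hypothetical upload times before pruning
    matched_upload = [1.0] * 4   # hypothetical upload times after pruning
    for name, upload in (("unbalanced", slow_upload), ("balanced", matched_upload)):
        print(name, pipeline_makespan(encode, upload), max_stage_gap(encode, upload))
```

When the two stages are balanced, neither the encoder nor the uplink sits idle for long, so the total time approaches the sum of one stage plus a single block of the other.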

Inventors

XU HONGLI
QIU HAO
XU YANG
LIAO YUNMING
HUANG LIUSHENG

Assignees

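Suzhou Institute for Advanced Research, University of Science and Technology of China (中国科学技术大学苏州高等研究院)
University of Science and Technology of China (中国科学技术大学)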

Dates

Publication Date: 20260505
Application Date: 20260126

Claims (10)

1. An end cloud collaborative reasoning method based on image blocking and pruning, applied at a device end, characterized by comprising the following steps: dividing an input image into a plurality of sub-image blocks, with the objective of minimizing the total time required to encode and upload all sub-image blocks in a parallel pipeline mode; calculating a global pruning rate, with the objective that the absolute value of the difference between the encoding computation time of each sub-image block and the upload time of the previous sub-image block is smaller than a preset threshold; for each sub-image block, linearly combining the color richness and the texture entropy of the sub-image block to obtain an importance weight of the sub-image block; allocating the global pruning rate to the sub-image blocks in inverse relation to their importance weights, to obtain a local pruning rate for each sub-image block; and performing pruning coding on each sub-image block according to its local pruning rate and uploading the pruning-coded data to a server side, wherein the pruning coding and the uploading are performed in a parallel pipeline mode.
2. The end cloud collaborative reasoning method based on image blocking and pruning according to claim 1, wherein the encoding computation time of the sub-image block is given by two formulas [formulas missing from this text], in which the following quantities appear, in order: the encoding computation time of the sub-image block; the computational overhead of all layers of the ViT model used for pruning; the peak computing power of the device end; the utilization rate; the hardware efficiency coefficient; the depth of the ViT model; the number of tokens actually processed at the i-th layer of the ViT model; the hidden-layer dimension of the ViT model; the number of tokens pruned at the i-th layer of the ViT model; and the intermediate-layer dimension of the feed-forward network.
3. The end cloud collaborative reasoning method based on image blocking and pruning according to claim 1, wherein the upload time of the sub-image block is given by two formulas [formulas missing from this text], in which the following quantities appear, in order: the upload time of the sub-image block; the amount of visual token data to be uploaded for the sub-image block; the available uplink bandwidth; the bandwidth fluctuation penalty factor; the bandwidth fluctuation variance; the actual number of tokens after processing by all L layers of the ViT model; the hidden-layer dimension of the ViT model; and the data-type bit width.
4. The image blocking and pruning-based end cloud collaborative reasoning method according to claim 1, wherein the importance weight is given by a formula [formula missing from this text], in which the following quantities appear, in order: the importance weight of the k-th sub-image block; the weight coefficient, subject to a constraint [missing from this text]; the normalized value of the color richness of the k-th sub-image block; and the normalized value of the texture entropy of the k-th sub-image block.
5. The end cloud collaborative reasoning method based on image blocking and pruning according to claim 1 or 4, wherein the local pruning rate corresponding to the sub-image block is given by a formula [formula missing from this text], in which the following quantities appear, in order: the local pruning rate of the k-th sub-image block; the global pruning rate; the importance weight of the k-th sub-image block; and the total number of sub-image blocks.
6. The end cloud collaborative reasoning method based on image blocking and pruning according to claim 1, wherein performing the pruning coding and the uploading in a parallel pipeline mode specifically comprises: assigning the sub-image block with the j-th highest importance weight and the sub-image block with the j-th lowest importance weight to the same batch, where the range of j is given by an expression [missing from this text] that involves the total number of sub-image blocks and the largest integer not greater than a quantity [missing from this text]; and performing pruning coding and uploading on each batch in a parallel pipeline mode.
7. The end cloud collaborative reasoning method based on image blocking and pruning according to claim 1, wherein performing pruning coding on the corresponding sub-image blocks according to the local pruning rate specifically comprises: pruning by means of the ViT model, wherein, for each layer of the ViT model, the cosine similarity between the Key vectors of each pair of tokens of the sub-image block is calculated, and the token pairs with the highest similarity are iteratively merged until the number of merged token pairs reaches that given by the local pruning rate.
8. The image blocking and pruning-based end cloud collaborative reasoning method according to claim 1, wherein a token spatial index table of the input image is generated at the device end, the token spatial index table storing the position of each visual token in the input image; and the device end uploads the pruning-coded visual tokens and the token spatial index table to the server side, so that the server side reconstructs the pruning-coded visual tokens into a visual feature map according to the token spatial index table, fuses the visual feature map with a text token sequence representing a user instruction to generate a multimodal representation, and performs inference on the multimodal representation to generate a response that conforms to the user instruction.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the image blocking and pruning-based end cloud collaborative reasoning method as claimed in any one of claims 1-8.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the image blocking and pruning-based end cloud collaborative reasoning method of any of claims 1-8.
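Claims 4 to 6 describe importance weighting, inverse allocation of the global pruning rate, and high/low batch pairing, but their formulas do not survive in this text. The sketch below is therefore only one plausible reading, not the claimed formulas: the mixing coefficient `alpha`, the reciprocal-weight allocation rule, the cap `max_rate`, and all function names are assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple


def importance_weights(color: Sequence[float], entropy: Sequence[float],
                       alpha: float = 0.5) -> List[float]:
    """Linear combination of normalized color richness and texture entropy.

    alpha is a hypothetical mixing coefficient assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * c + (1.0 - alpha) * e for c, e in zip(color, entropy)]


def local_pruning_rates(global_rate: float, weights: Sequence[float],
                        max_rate: float = 0.95) -> List[float]:
    """ASSUMED rule: rate_k = global_rate * K * (1/w_k) / sum_j (1/w_j).

    More important blocks get lower rates, and the mean of the rates equals
    global_rate as long as no rate is clipped at max_rate.
    """
    eps = 1e-6  # avoids division by zero for a weight of exactly 0
    inverse = [1.0 / max(w, eps) for w in weights]
    total = sum(inverse)
    count = len(weights)
    return [min(max_rate, global_rate * count * v / total) for v in inverse]


def pair_batches(weights: Sequence[float]) -> List[Tuple[int, ...]]:
    """Put the j-th most and j-th least important blocks in the same batch.

    With an odd number of blocks, the middle block forms its own batch
    (an assumption; the source text does not cover this case).
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    batches: List[Tuple[int, ...]] = [
        (order[j], order[-1 - j]) for j in range(len(order) // 2)
    ]
    if len(order) % 2:
        batches.append((order[len(order) // 2],))
    return batches


if __name__ == "__main__":
    # Hypothetical normalized scores for four sub-image blocks.
    w = importance_weights([0.9, 0.2, 0.5, 0.4], [0.7, 0.4, 0.5, 0.0])
    print("weights:", w)
    print("local rates:", local_pruning_rates(0.5, w))
    print("batches:", pair_batches(w))
```

Pairing a heavily pruned block with a lightly pruned one keeps the per-batch workload roughly even, which is one way to read the intent of claim 6.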

Description

End cloud collaborative reasoning method, equipment and medium based on image blocking and pruning

Technical Field

The invention belongs to the technical field of large visual language models, and particularly relates to an end-cloud collaborative reasoning method, equipment and medium based on image blocking and pruning.

Background

Large visual language models (LVLMs) unify visual understanding and language reasoning and perform strongly on multimodal tasks such as visual question answering and image captioning. As LVLMs are increasingly deployed in real-world scenarios, their visual input often comes from sensitive data on the user's device, such as medical images, vehicle camera footage, or images of home environments. The traditional cloud-only inference mode increases the risk of user privacy leakage and struggles to meet the requirements of scenarios that demand strong privacy protection. To balance privacy protection and real-time performance, researchers have proposed the end-cloud collaborative reasoning architecture. By distributing model computation between a resource-constrained terminal device and a high-performance cloud server, this architecture processes privacy-sensitive data locally and executes complex reasoning tasks in the cloud, balancing performance and security. Although the architecture has been widely used with traditional deep neural networks, it has not been studied systematically for LVLMs, where it still faces serious challenges in inference delay.
These challenges show up in two ways. First, the volume of data sent from the end side to the cloud grows sharply once a high-resolution image is split into patches: for example, a 1920 × 1080 image of about 1 MB becomes 10549 visual tokens when divided into 14 × 14 patches, and these occupy about 20 MB when stored in bf16, which is a heavy communication burden. Second, in real network environments the uplink bandwidth between the terminal device and the cloud is usually limited and fluctuates markedly, so the communication delay of uploading all visual tokens is not negligible and the total inference delay struggles to meet practical requirements.

To reduce the delay of end-cloud collaborative reasoning, existing research mainly pursues model partitioning, pipeline collaboration, and visual feature compression. However, these methods are designed around the structural characteristics of conventional deep neural networks and are difficult to apply directly to LVLMs that use a ViT encoder, so they have clear limitations for LVLMs. First, model partitioning methods based on network structure generally rely on the feature-size reduction produced by layer-by-layer downsampling in convolutional networks to find low-overhead partition points. The intermediate feature dimension of ViT, however, stays constant throughout encoding, so no effective reduction is available and the partitioning strategy contributes little for LVLMs. Second, pipelined end-cloud collaboration methods overlap end-side computation with cloud communication by partitioning the computation or the input data.
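The figures quoted in the background (10549 tokens, about 20 MB in bf16) can be checked with a short calculation. The sketch below assumes that partial patches at the image border are dropped (floor division), which reproduces the token count in the text; the hidden dimension of 1024 and the 10 Mbit/s uplink are hypothetical values chosen for illustration and are not stated in the source.

```python
def visual_token_count(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens, dropping partial patches at the borders."""
    return (width // patch) * (height // patch)


def payload_bytes(tokens: int, hidden_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the token features; bf16 stores each value in 2 bytes."""
    return tokens * hidden_dim * bytes_per_value


def upload_seconds(num_bytes: int, uplink_mbps: float) -> float:
    """Transfer time at a constant uplink rate given in megabits per second."""
    return num_bytes * 8 / (uplink_mbps * 1e6)


if __name__ == "__main__":
    tokens = visual_token_count(1920, 1080)        # 137 * 77 patches
    size = payload_bytes(tokens, hidden_dim=1024)  # hidden_dim is an assumption
    print(f"tokens: {tokens}")
    print(f"payload: {size / 1e6:.1f} MB")
    print(f"upload at 10 Mbit/s: {upload_seconds(size, 10):.1f} s")
```

Under these assumptions the payload comes to roughly 21.6 MB, consistent with the "about 20 MB" in the text, and a single unpruned image would occupy the assumed uplink for well over ten seconds.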
However, such methods assume that the processing cost of a single sample is small enough to form efficient batches or a stable pipeline rhythm, whereas the number of visual tokens generated in LVLMs is large and fixed, so transmitting a single sample becomes the bottleneck and fundamentally undermines the pipeline's delay-hiding effect. Finally, feature compression methods such as visual token pruning can reduce the transmitted data volume to some extent, but under the LVLM architecture pruning can only be performed after all visual tokens have been fully encoded on the end side, so the end-side computation remains heavy; under real bandwidth conditions the remaining features are still large even after pruning, and the communication bottleneck is not effectively removed. In summary, solving the communication delay and excessive end-side computing load of existing end-cloud collaborative reasoning with high-resolution images and a fixed number of visual tokens is of significant research interest.

Disclosure of Invention

The main purpose of the invention is to provide an end-cloud collaborative reasoning method, equipment and medium based on image blocking and pruning, so as to overcome the shortcomings of the prior art. To achieve this purpose, the invention adopts the following technical scheme: The invention provides an end cloud collaborative reasoning method based on image