
CN-122021766-A - Vision transformer hierarchical pruning method, device and equipment based on spatio-temporal similarity

CN122021766A

Abstract

The application relates to a vision transformer hierarchical pruning method, device and equipment based on spatio-temporal similarity. When each current frame image in a video frame sequence is processed by an object detection model based on a transformer architecture, each feature unit in the feature map of the current frame image is divided into a foreground region set and a background region set based on the target detection box set predicted for the previous frame image, and a predictive importance score is calculated for each feature unit in the feature map. The first feature units retained after pruning are determined according to the predictive importance scores: they comprise the feature units in the foreground region set whose predictive importance score is greater than or equal to a first pruning threshold and the feature units in the background region set whose predictive importance score is greater than or equal to a second pruning threshold, the second pruning threshold being higher than the first. The target detection box set in the current frame image is then predicted based on the input features corresponding to the first feature units. The method can reduce computational cost.

Inventors

  • Chen Cen
  • Huang Jingkai
  • Wang Qinyu
  • Cai Jiacheng
  • Zeng Ziqian

Assignees

  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-06

Claims (10)

  1. A vision transformer hierarchical pruning method based on spatio-temporal similarity, the method comprising: when each current frame image in a video frame sequence is processed by an object detection model based on a transformer architecture, dividing each feature unit in a feature map of the current frame image into a foreground region set and a background region set based on a target detection box set predicted by the object detection model for the previous frame image; calculating a predictive importance score of each feature unit in the feature map; determining each first feature unit retained after pruning according to the predictive importance score of each feature unit, wherein the first feature units comprise feature units in the foreground region set whose predictive importance score is greater than or equal to a first pruning threshold and feature units in the background region set whose predictive importance score is greater than or equal to a second pruning threshold, the second pruning threshold being higher than the first pruning threshold; and predicting a target detection box set in the current frame image based on the input features corresponding to the first feature units.
  2. The method of claim 1, wherein the pruned feature units are denoted as second feature units, and predicting the target detection box set in the current frame image based on the input features corresponding to the first feature units comprises: for each first feature unit, projecting the input feature corresponding to the first feature unit into a current query vector, a current key vector and a current value vector; reading a history key vector and a history value vector corresponding to each second feature unit from a preset reference tensor, concatenating the current key vectors with the history key vectors to obtain mixed key vectors, and concatenating the current value vectors with the history value vectors to obtain mixed value vectors; calculating a target feature corresponding to the first feature unit according to the current query vector, the mixed key vectors and the mixed value vectors; and predicting the target detection box set in the current frame image based on the target features corresponding to the respective first feature units, and updating the current key vector and current value vector corresponding to each first feature unit into the reference tensor.
  3. The method of claim 2, wherein determining each first feature unit retained after pruning according to the predictive importance score of each feature unit comprises: determining a first index set and a second index set according to the predictive importance score of each feature unit, wherein the first index set comprises the indexes of the first feature units retained after pruning and the second index set comprises the indexes of the pruned second feature units; wherein reading the history key vector and history value vector corresponding to each second feature unit from the preset reference tensor comprises: reading, according to the second index set, the history key vector and history value vector corresponding to each second feature unit from the preset reference tensor; and wherein updating the current key vector and current value vector corresponding to each first feature unit into the reference tensor comprises: updating, according to the first index set, the current key vector and current value vector corresponding to each first feature unit to the corresponding positions in the reference tensor.
  4. The method of claim 3, wherein the reference tensor comprises a key storage area and a value storage area, and updating, according to the first index set, the current key vector and current value vector corresponding to each first feature unit to the corresponding positions in the reference tensor comprises: for each first feature unit, writing the current key vector and current value vector corresponding to the first feature unit into the target positions indicated by the index of the first feature unit in the key storage area and the value storage area respectively, so as to overwrite the original historical data at the target positions.
  5. The method of claim 2, wherein calculating the predictive importance score of each feature unit in the feature map comprises: for each feature unit, calculating the predictive importance score of the feature unit based on the global attention map corresponding to its associated feature unit, wherein the associated feature unit comprises the preceding adjacent feature unit of the feature unit in the current frame image and/or the feature unit of the previous frame image located at the same layer position as the feature unit.
  6. The method of claim 5, wherein calculating, for each feature unit, the predictive importance score of the feature unit based on the global attention map corresponding to the associated feature unit comprises: for each feature unit, acquiring a first global attention map corresponding to a first associated feature unit, wherein the first associated feature unit is the preceding adjacent feature unit of the feature unit in the current frame image; acquiring a second global attention map corresponding to a second associated feature unit, wherein the second associated feature unit is the feature unit of the previous frame image located at the same layer position as the feature unit; and fusing the first global attention map and the second global attention map to obtain a target reference attention map, and calculating the predictive importance score of the feature unit based on the target reference attention map.
  7. The method of claim 5, wherein calculating the target feature corresponding to the first feature unit according to the current query vector, the mixed key vector and the mixed value vector comprises: performing attention weight calculation based on the current query vector and the mixed key vector to obtain a local attention map corresponding to the first feature unit; and weighting and summing the value vectors in the mixed value vector according to the attention weights in the local attention map to obtain the target feature corresponding to the first feature unit; and wherein the method further comprises: updating the global attention map corresponding to the preceding adjacent feature unit based on the local attention map corresponding to the first feature unit, to obtain and record the global attention map corresponding to the first feature unit.
  8. The method according to any one of claims 1 to 7, wherein dividing each feature unit in the feature map of the current frame image into the foreground region set and the background region set based on the target detection box set predicted by the object detection model for the previous frame image comprises: mapping the target detection box set of the previous frame image onto the feature map of the current frame image; traversing the coordinates of each feature unit in the feature map of the current frame image; dividing feature units whose coordinates fall within any detection box of the target detection box set of the previous frame image into the foreground region set; and dividing feature units whose coordinates fall outside every detection box into the background region set.
  9. A vision transformer hierarchical pruning device based on spatio-temporal similarity, the device comprising: a hierarchical division module, configured to divide each feature unit in the feature map of the current frame image into a foreground region set and a background region set based on a target detection box set predicted by an object detection model for the previous frame image when each current frame image in a video frame sequence is processed by the object detection model based on a transformer architecture; an importance calculation module, configured to calculate the predictive importance score of each feature unit in the feature map; a pruning module, configured to determine each first feature unit retained after pruning according to the predictive importance score of each feature unit, wherein the first feature units comprise feature units in the foreground region set whose predictive importance score is greater than or equal to a first pruning threshold and feature units in the background region set whose predictive importance score is greater than or equal to a second pruning threshold, the second pruning threshold being higher than the first pruning threshold; and an object detection module, configured to predict a target detection box set in the current frame image based on the input features corresponding to the first feature units.
  10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
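As a rough illustration of the foreground/background division recited in claim 8, the previous-frame detection boxes can be mapped onto the current feature map and each feature unit assigned to one set or the other by a point-in-box test on its coordinates. The sketch below assumes a fixed stride between image pixels and feature-map cells; all function and variable names are hypothetical, not taken from the patent.

```python
def split_regions(boxes, feat_h, feat_w, stride):
    """Divide feature-map cells into foreground/background coordinate sets.

    boxes: previous-frame detection boxes as (x1, y1, x2, y2) in image
    coordinates; stride maps feature-map cells back to image pixels.
    """
    foreground, background = [], []
    for i in range(feat_h):
        for j in range(feat_w):
            # centre of cell (i, j) in image coordinates
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            # a cell is foreground if its centre falls inside any box
            inside = any(x1 <= cx <= x2 and y1 <= cy <= y2
                         for x1, y1, x2, y2 in boxes)
            (foreground if inside else background).append((i, j))
    return foreground, background
```

For example, with a single 16x16-pixel box over a 2x2 feature map at stride 16, only the top-left cell is foreground.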

Description

Vision transformer hierarchical pruning method, device and equipment based on spatio-temporal similarity

Technical Field

The application relates to the technical field of computer vision, and in particular to a vision transformer hierarchical pruning method, device and equipment based on spatio-temporal similarity.

Background

With the rapid development of artificial intelligence technology, particularly computer vision technology, vision transformer models have achieved great success in the field of image processing. However, inference with a vision transformer model has high computational complexity, making deployment on resource-limited devices difficult. To reduce the computational complexity, pruning techniques have been proposed in the industry. In a traditional pruning method, all feature units (tokens) in the current image participate in the complex matrix multiplication operations of the attention mechanism to obtain attention scores, which are then normalized with Softmax to obtain the attention weights (the post-Softmax attention weights). Redundant tokens in the image, i.e. the objects to be pruned, are determined based on these post-Softmax attention weights. This means that the redundant tokens have already participated in a large number of matrix multiplications and computation-intensive operations such as Softmax before the objects to be pruned are determined. This delayed conventional pruning method therefore still performs a large number of ineffective computations, causing considerable additional computational overhead.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, apparatus, computer device, computer-readable storage medium, and computer program product for vision transformer hierarchical pruning based on spatio-temporal similarity.
The application provides a vision transformer hierarchical pruning method based on spatio-temporal similarity. When each current frame image in a video frame sequence is processed by an object detection model based on a transformer architecture, each feature unit in the feature map of the current frame image is divided into a foreground region set and a background region set based on the target detection box set predicted by the object detection model for the previous frame image, and a predictive importance score is calculated for each feature unit in the feature map. Each first feature unit retained after pruning is determined according to the predictive importance scores, wherein the first feature units comprise the feature units in the foreground region set whose predictive importance score is greater than or equal to a first pruning threshold and the feature units in the background region set whose predictive importance score is greater than or equal to a second pruning threshold, the second pruning threshold being higher than the first. The target detection box set in the current frame image is then predicted based on the input features corresponding to the first feature units.
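The dual-threshold selection described above can be sketched as follows, assuming the predictive importance scores and the foreground/background index sets are already available; the stricter background threshold prunes background tokens more aggressively than foreground ones. All names here are illustrative, not from the patent.

```python
def prune_tokens(scores, foreground, background, t_fg, t_bg):
    """Return (kept, pruned) token indices under region-specific thresholds.

    scores: predictive importance score per token index;
    foreground/background: index sets partitioning the feature map;
    t_bg > t_fg, so background tokens must clear a higher bar.
    """
    assert t_bg > t_fg, "background threshold must exceed foreground threshold"
    kept = [i for i in foreground if scores[i] >= t_fg]
    kept += [i for i in background if scores[i] >= t_bg]
    kept = sorted(kept)
    pruned = [i for i in range(len(scores)) if i not in set(kept)]
    return kept, pruned
```

For instance, with scores [0.9, 0.3, 0.6, 0.1], foreground tokens {0, 1}, background tokens {2, 3}, and thresholds 0.2 / 0.5, token 1 survives only because it lies in the foreground region, while background token 3 is pruned.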
In one embodiment, the pruned feature units are denoted as second feature units, and predicting the target detection box set in the current frame image based on the input features corresponding to the first feature units comprises: for each first feature unit, projecting the input feature corresponding to the first feature unit into a current query vector, a current key vector and a current value vector; reading the history key vector and history value vector corresponding to each second feature unit from a preset reference tensor, concatenating the current key vectors with the history key vectors to obtain mixed key vectors, and concatenating the current value vectors with the history value vectors to obtain mixed value vectors; calculating the target feature corresponding to the first feature unit according to the current query vector, the mixed key vectors and the mixed value vectors; predicting the target detection box set in the current frame image based on the target features corresponding to the first feature units; and updating the current key vector and current value vector corresponding to each first feature unit into the reference tensor.

In one embodiment, determining each first feature unit retained after pruning according to the predictive importance score of each feature unit comprises determining a first index set and a second index set according to the predictive importance scores, wherein the first index set comprises the indexes of the first feature units retained after pruning, and the second index set comprises the indexes of the pruned second feature units. In this embodiment, reading the history key vector and history value vector corresponding to each second feature unit from the preset reference tensor comprises reading, according to the second index set, the history key vector and history value vector corresponding to each second feature unit from the preset reference tensor.
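A minimal single-head sketch of the embodiment above: queries are formed only for the kept tokens, cached keys/values stand in for the pruned tokens, and the fresh keys/values of the kept tokens are written back into the reference tensor at their indexed positions. This is an illustrative reconstruction under those assumptions, not the patented implementation; all names are hypothetical.

```python
import numpy as np

def mixed_attention(x_kept, ref_k, ref_v, kept_idx, pruned_idx, wq, wk, wv):
    """Attention over kept tokens, reusing cached K/V for pruned tokens.

    x_kept: (n_kept, d) inputs of kept tokens; ref_k/ref_v: (n_total, d)
    reference tensors holding the last-computed key/value per position.
    """
    q = x_kept @ wq                              # current query vectors
    k_cur, v_cur = x_kept @ wk, x_kept @ wv      # current key/value vectors
    # splice current K/V with history K/V read from the reference tensor
    k_mix = np.concatenate([k_cur, ref_k[pruned_idx]], axis=0)
    v_mix = np.concatenate([v_cur, ref_v[pruned_idx]], axis=0)
    # scaled dot-product attention -> local attention map
    att = q @ k_mix.T / np.sqrt(q.shape[1])
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    out = att @ v_mix                            # weighted sum of mixed values
    # overwrite the reference tensor at the kept tokens' positions
    ref_k[kept_idx], ref_v[kept_idx] = k_cur, v_cur
    return out, att
```

Note the design point this makes concrete: pruned tokens never touch the query projection or the Softmax in the current layer, yet they still contribute context through their cached keys and values.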