CN-121999235-A - Image processing method based on dynamic graph attention network and cross-modal distillation associated data
Abstract
The invention discloses an image processing method based on a dynamic graph attention network and cross-modal distillation of associated data, suitable for image processing scenarios that must fuse multi-modal data, such as autonomous driving, remote-sensing monitoring, and intelligent security. The method comprises: acquiring associated data and preprocessing it; constructing a feature transformation based on an improved Gramian angular field, generating polar coordinates, and building an image feature matrix through an inner-product operation; determining the spatio-temporal association between image data and cross-modal data with a SIFT recognition method; updating the adjacency matrix of a graph structure; constructing a cross-modal distillation module; and processing the target image data with the student model after knowledge transfer, wherein the student model extracts the spatio-temporal association features of the target image through a dynamic graph attention network and outputs the image processing result through an output layer.
Inventors
- HOU JIN
- ZHU JIANG
- ZHU XI
- ZENG TIJIAN
- XIE ZHIQI
- TANG XIAOBO
- LUO YU
- SU QIAN
- DU ZEXIN
Assignees
- Guizhou Wujiang Hydropower Development Co., Ltd. (贵州乌江水电开发有限责任公司)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-15
Claims (10)
- 1. An image processing method based on a dynamic graph attention network and cross-modal distillation of associated data, comprising: acquiring associated data and preprocessing it, wherein the associated data comprises image data and corresponding cross-modal auxiliary data, and the cross-modal auxiliary data is at least one of text description data, lidar point cloud data, sensor time-series data, or meteorological observation data; constructing a feature transformation based on an improved Gramian angular field: mapping the time-series values of the preprocessed image data to the radius of a polar coordinate system, taking the sum of the normalized arccosine value of the image key-parameter sequence and the parameter change-direction value as the angle, generating polar coordinates, and constructing an image feature matrix through an inner-product operation, wherein the radius range of the polar coordinate system is [0,1] and the mapping formula is r_t = \frac{t}{N}, where t is the sequence-point index and N is the total window length; converting the image feature matrix into a scale space with a SIFT recognition method and determining the spatio-temporal association between the image data and the cross-modal data; constructing a graph structure from the dynamic graph attention network and the spatio-temporal association, and updating the adjacency matrix of the graph structure, wherein the dynamic graph attention network uses K = 4–8 attention heads and dynamically computes the interaction coefficients between nodes through a LeakyReLU activation function; constructing a cross-modal distillation module: taking a pre-trained multi-modal model as the teacher model, extracting the teacher model's image features and cross-modal features respectively, taking the features corresponding to the graph structure as the input features of a student model, and realizing knowledge transfer from the teacher model to the student model through a distillation loss function; and processing the target image data with the student model after knowledge transfer, wherein the student model adopts a lightweight network architecture, extracts the spatio-temporal association features of the target image through the dynamic graph attention network, and outputs the image processing result through an output layer.
- 2. The method of claim 1, wherein generating the image feature matrix with the improved Gramian angular field feature transformation comprises: applying Min-Max normalization, x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}, where x_{min} and x_{max} are the sequence minimum and maximum, and computing the arccosine of the normalized sequence, \phi_s = \arccos(x_{norm}); converting the parameter change-direction value in the cross-modal auxiliary data to radians, \phi_d = x_d \times \frac{\pi}{180}, where x_d is the change-direction value in degrees; adding the arccosine value and the converted radians to obtain the angle \phi = \phi_s + \phi_d, and linearly mapping the time-series index to the radius r_t = \frac{t}{N} in the polar coordinate system; and performing a point-wise inner-product operation over the polar coordinates within a scheduling period (set to 0.5–5 seconds), the inner product being r_i r_j \cos(\phi_i + \phi_j), where i and j are time-series point indices, to generate an image feature matrix that characterizes the spatio-temporal features of the image.
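The improved Gramian angular field construction described in claim 2 can be sketched as follows. This is an illustrative reading, not the patent's implementation: the sign inside the cosine is garbled in the source, so the sketch assumes the summation form cos(φ_i + φ_j) used by the standard Gramian angular summation field, and all names are hypothetical.

```python
import numpy as np

def improved_gaf(x, direction_deg):
    """Improved GAF sketch: Min-Max normalize, arccosine plus change-direction
    angle, radius r_t = t/N, then a pairwise inner product matrix.

    x: 1-D sequence of image key parameters.
    direction_deg: per-point parameter change-direction values, in degrees.
    Returns an N x N matrix G[i, j] = r_i * r_j * cos(phi_i + phi_j).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Min-Max normalization to [0, 1] so the arccosine is defined
    x_norm = (x - x.min()) / (x.max() - x.min())
    phi_s = np.arccos(np.clip(x_norm, 0.0, 1.0))
    # Change-direction values converted from degrees to radians
    phi_d = np.asarray(direction_deg, dtype=float) * np.pi / 180.0
    phi = phi_s + phi_d                      # angle = arccosine + direction
    r = np.arange(1, n + 1) / n              # radius r_t = t/N in [0, 1]
    # Pairwise inner product (GASF-style summation form, an assumption)
    return np.outer(r, r) * np.cos(phi[:, None] + phi[None, :])
```

The resulting matrix is symmetric and bounded by the radius products, which is what makes it usable as an image-like feature map for the later SIFT step.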
- 3. The method of claim 1, wherein determining the spatio-temporal association between the image data and the cross-modal data comprises: applying Gaussian blur to the image feature matrix with the kernel G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x-u)^2 + (y-v)^2}{2\sigma^2}}, where (u, v) is the kernel center, and detecting key points through the difference of Gaussians between scales, D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y), computed by convolution; generating a 128-dimensional feature descriptor from the gradient orientation \theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) and magnitude m(x, y) = \sqrt{[L(x+1, y) - L(x-1, y)]^2 + [L(x, y+1) - L(x, y-1)]^2} over a 16×16 neighborhood around each key point; computing the Euclidean distance between each feature descriptor and the cross-modal feature vector, d = \sqrt{\sum_{i=1}^{128} (f_{1i} - f_{2i})^2}, where f_{1i} and f_{2i} are the i-th elements of the two vectors; and, following David Lowe's distance-ratio test, finding the nearest and second-nearest matches for each feature descriptor, computing the distance ratio \frac{d_{near}}{d_{next}}, and judging that the corresponding image data and cross-modal data are spatio-temporally associated if the ratio is below a set threshold (0.75–0.8).
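The matching stage of claim 3 (Euclidean distances plus Lowe's distance-ratio test) can be sketched directly. The function name and array shapes are illustrative; the descriptor-extraction steps before this point are assumed to have produced 128-dimensional vectors as the claim states.

```python
import numpy as np

def ratio_test_matches(desc_img, desc_cross, ratio=0.75):
    """Lowe distance-ratio test sketch for claim 3.

    desc_img: (M, 128) SIFT-style descriptors from the image feature matrix.
    desc_cross: (K, 128) cross-modal feature vectors, K >= 2.
    Returns (i, j) index pairs judged spatio-temporally associated.
    """
    matches = []
    for i, f in enumerate(desc_img):
        # Euclidean distance d = sqrt(sum_k (f1k - f2k)^2) to every candidate
        d = np.sqrt(((desc_cross - f) ** 2).sum(axis=1))
        order = np.argsort(d)
        nearest, second = d[order[0]], d[order[1]]
        # Associated only if d_near / d_next falls below the 0.75-0.8 threshold
        if nearest / second < ratio:
            matches.append((i, int(order[0])))
    return matches
```

The ratio test rejects ambiguous matches: if the nearest and second-nearest candidates are almost equally close, the pair is discarded rather than declared associated.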
- 4. The method of claim 1, wherein constructing the graph structure with the dynamic graph attention network comprises: constructing an initial graph G = (V, E) from the spatio-temporal association, where the node set V = V_{img} \cup V_{cross}, V_{img} are the image feature nodes (their number matches the image feature matrix dimension), V_{cross} are the cross-modal feature nodes (their number is 1/4–1/2 of the cross-modal feature dimension), and the edge set E encodes the associations between nodes (e_{ij} = 1 if an association exists, otherwise e_{ij} = 0); constructing an initial adjacency matrix adj \in \mathbb{R}^{N \times N} (N is the total number of nodes) from the initial graph, and computing attention coefficients between nodes with the multi-head attention mechanism of the dynamic graph attention network, e_{ij} = a^T [W x_i \| W x_j], where a is the attention vector, W is a linear transformation matrix, and \| is the feature concatenation operation; and normalizing the attention coefficients with a Softmax function, \alpha_{ij} = \frac{\exp(LeakyReLU(e_{ij}))}{\sum_{k \in N_i} \exp(LeakyReLU(e_{ik}))}, where N_i is the neighbor set of node i, and updating the adjacency matrix as adj_{ij} = \alpha_{ij} according to the normalized attention weights, yielding a dynamically updated graph structure.
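A single attention head of the adjacency update in claim 4 can be sketched as follows (the claim uses K = 4–8 heads; one head is shown for brevity, and all names and shapes are illustrative).

```python
import numpy as np

def gat_attention(X, A, W, a, slope=0.2):
    """One-head graph-attention adjacency update, a sketch of claim 4.

    X: (N, F) node features; A: (N, N) binary adjacency (e_ij = 1 if associated);
    W: (F, Fp) linear transform; a: (2*Fp,) attention vector.
    Returns adj_ij = alpha_ij, softmax-normalized over each node's neighbors.
    """
    H = X @ W                                    # W x_i for every node
    N = H.shape[0]
    # e_ij = a^T [W x_i || W x_j] for every ordered pair (i, j)
    pairs = np.concatenate(
        [np.repeat(H, N, axis=0), np.tile(H, (N, 1))], axis=1)
    e = (pairs @ a).reshape(N, N)
    e = np.where(e > 0, e, slope * e)            # LeakyReLU activation
    e = np.where(A > 0, e, -1e9)                 # restrict to the neighbor set
    # Softmax normalization over each row (numerically stabilized)
    e = e - e.max(axis=1, keepdims=True)
    exp_e = np.exp(e)
    return exp_e / exp_e.sum(axis=1, keepdims=True)
```

Masking non-neighbors with a large negative value before the softmax ensures that nodes without an association in E receive effectively zero attention weight.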
- 5. The method of claim 1, wherein training the cross-modal distillation module comprises: initializing the student model parameters (with Xavier initialization) and freezing the teacher model parameters; using a mini-batch stochastic gradient descent (SGD) optimizer with a batch size of 16–64 and a learning rate of 1e-4 to 5e-4 (with a cosine annealing learning-rate schedule); in each training round, feeding the preprocessed data to both the teacher and student models, extracting features from each, computing the distillation loss, and updating the student model's convolution, attention, and fully connected layers through backpropagation; and stopping training when the iteration count reaches a preset threshold (100–200 rounds) or the distillation loss converges (loss fluctuation below 1e-5 for 10 consecutive rounds), yielding the student model after knowledge transfer.
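Two pieces of claim 5 lend themselves to a short sketch: a distillation loss and the cosine annealing schedule within the claimed learning-rate bounds. The claim does not specify the exact loss, so a KL divergence between temperature-softened outputs, a common choice, is assumed here; all names are illustrative.

```python
import numpy as np

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Soft-target distillation loss sketch: KL(teacher || student) with
    temperature T. The patent's exact distillation loss is unspecified."""
    def softmax_t(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z / T)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax_t(teacher_logits), softmax_t(student_logits)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean())

def cosine_annealing_lr(epoch, total_epochs, lr_max=5e-4, lr_min=1e-4):
    """Cosine annealing between the claimed 1e-4 and 5e-4 learning rates."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))
```

The loss is zero when student and teacher outputs agree and grows as they diverge, which is what drives the backpropagation update of the student's convolution, attention, and fully connected layers.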
- 6. The method of claim 1, wherein processing the target image data with the knowledge-transferred student model comprises: preprocessing the target image data as in claim 1 and feeding it to the student model's feature extraction layers (3–5 convolution blocks, each comprising a convolution layer, a batch normalization layer, and a ReLU activation layer); modeling the spatial associations of the extracted features through the dynamic graph attention network and outputting spatio-temporal association features fused with cross-modal information (feature dimension 256–512); and, for a target detection task, feeding the spatio-temporal association features to an anchor-free detection head (a 3×3 shared convolution layer and 6 decoupled heads) that outputs the target's center coordinates (the heatmap response peak plus an offset), center height, length, width, height, rotation angle, and velocity; or, for an image classification task, feeding them to a fully connected layer and a Softmax layer that output the class label and its probability (the class with probability above 0.5 is the final classification result).
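The classification branch of claim 6 (fully connected layer, Softmax, 0.5 decision threshold) can be sketched as below; the weight shapes and function name are illustrative assumptions.

```python
import numpy as np

def classify(features, W_fc, b_fc, threshold=0.5):
    """Classification-head sketch for claim 6: fully connected layer, Softmax,
    then accept the top class only if its probability exceeds the threshold."""
    logits = features @ W_fc + b_fc
    logits = logits - logits.max()               # stabilized Softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    top = int(np.argmax(probs))
    if probs[top] > threshold:
        return top, float(probs[top])            # confident prediction
    return None, float(probs[top])               # below threshold: no label
```

Returning no label when every class stays at or below 0.5 matches the claim's rule that only a class with probability above 0.5 counts as the final classification result.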
- 7. The method of claim 1, further comprising a model performance optimization step, wherein a knowledge distillation regularization term, L_{reg} = \lambda \sum_{l=1}^{L} \|W_l^{teacher} - W_l^{student}\|, is added in the cross-modal distillation module, where \lambda = 0.1–0.3 is the regularization coefficient, L is the number of network layers, and W_l^{teacher} and W_l^{student} are the l-th layer weights of the teacher and student models respectively, suppressing overfitting of the student model; and a knowledge distillation pruning strategy is applied (pruning convolution kernels in the student model whose parameter absolute values are below 1e-4), reducing the parameter count by 30%–50% after pruning and improving inference speed.
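The magnitude-based pruning rule of claim 7 is simple enough to sketch in a few lines; the function name and the returned sparsity statistic are illustrative additions.

```python
import numpy as np

def prune_small_weights(kernels, threshold=1e-4):
    """Pruning sketch for claim 7: zero out convolution-kernel parameters
    whose absolute value falls below the 1e-4 threshold."""
    pruned = np.where(np.abs(kernels) < threshold, 0.0, kernels)
    sparsity = float((pruned == 0).mean())   # fraction of zeroed parameters
    return pruned, sparsity
```

In practice, zeroed weights only speed up inference when paired with sparse kernels or structured removal of whole filters; the claim's 30%–50% parameter reduction suggests structured pruning of entire convolution kernels.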
- 8. An image processing system based on a dynamic graph attention network and cross-modal distillation of associated data, comprising: a data preprocessing unit that acquires the associated data, performs normalization, resolution adjustment, and format conversion, and outputs preprocessed image data and preprocessed cross-modal data, comprising an image preprocessing subunit (with a Gaussian filtering module and a resolution scaling module) and a cross-modal preprocessing subunit (with a text encoding module and a point cloud downsampling module); a feature matrix generation unit that applies the improved Gramian angular field feature transformation and outputs the image feature matrix through polar coordinate generation and inner-product operations; an association determination unit that invokes a SIFT recognition algorithm to detect feature change points, computes feature-vector distances, and outputs the spatio-temporal association between the image and the cross-modal data through a distance-ratio test; a graph structure construction unit that builds the initial graph from the dynamic graph attention network, computes node attention weights, updates the adjacency matrix, and outputs the dynamic graph structure; a cross-modal distillation unit that loads the pre-trained teacher model, constructs the distillation loss function, optimizes the student model parameters through backpropagation, and outputs the knowledge-transferred student model; an image processing unit that feeds target image data to the student model and outputs target detection, feature extraction, or image classification results through feature extraction, association modeling, and output-layer processing; and a performance optimization unit that adds the distillation regularization term to suppress overfitting, optimizes the model structure with a pruning strategy, and outputs a lightweight, high-performance student model.
- 9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the image processing method based on a dynamic graph attention network and cross-modal distillation of associated data of any one of claims 1-7.
- 10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the image processing method based on a dynamic graph attention network and cross-modal distillation of associated data of any one of claims 1-7.
Description
Image processing method based on dynamic graph attention network and cross-modal distillation associated data

Technical Field

The invention relates to the technical field of computer vision and image processing, and in particular to an image processing method based on a dynamic graph attention network and cross-modal distillation of associated data, suitable for image processing scenarios that must fuse multi-modal data, such as autonomous driving, remote-sensing monitoring, and intelligent security.

Background

With the widespread application of artificial intelligence across many fields, image processing tasks face increasing demands on accuracy, generalization, and deployment efficiency. The cross-modal complementary information contained in associated data (e.g., combinations of images with text, point clouds, and sensor data) is key to improving image processing performance, but the prior art still has the following core problems: 1. Traditional methods mostly fuse cross-modal features with static weights (e.g., simple concatenation or weighted summation), ignoring the dynamic changes of the spatio-temporal associations between data (e.g., in autonomous driving the association between images and point clouds changes in real time with vehicle motion, and in remote sensing changing weather conditions affect the association between images and observation data), so feature fusion accuracy is low and hard to adapt to complex scenes.
2. Knowledge transfer efficiency is low: existing cross-modal distillation techniques focus on single-level feature transfer (e.g., transferring only image features or only text features) and do not construct a multi-modal collaborative distillation mechanism, so cross-modal association information is easily lost when teacher-model knowledge is transferred to the student model. The student model therefore generalizes poorly, and when deployed on compute-limited devices (e.g., vehicle-mounted embedded platforms or edge terminals) it is difficult to balance accuracy and speed. 3. Feature extraction robustness is poor: traditional feature extraction methods (e.g., HOG and SIFT) are sensitive to noise and do not incorporate prior information from cross-modal data, so their feature representation is weak under low illumination, occlusion, missing data, and similar conditions, directly degrading subsequent image processing tasks (e.g., target detection and classification). Among the prior art, CN116758391B discloses a multi-domain remote-sensing target recognition method with noise-suppressed distillation that improves generalization through multi-teacher distillation but does not model cross-modal data associations; CN116524329A proposes a cross-modal distillation scheme for low-compute platforms but uses a static BEV encoding structure that cannot dynamically adapt to changing associations; and CN118587562A focuses on image-text multi-modal distillation but does not model spatio-temporal associations with a graph neural network and struggles with non-text cross-modal data such as point clouds and sensors. Therefore, an image processing method that can dynamically model cross-modal associations and efficiently transfer multi-modal knowledge is needed.
Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an image processing method based on a dynamic graph attention network and cross-modal distillation of associated data, improving image processing accuracy, generalization, and deployment efficiency through dynamic association modeling, multi-modal collaborative distillation, and robust feature extraction. To solve the technical problem, embodiments of the invention are realized as follows: In a first aspect, an embodiment of the present application provides an image processing method based on associated data of a dynamic graph attention network and cross-modal distillation, the method comprising: acquiring associated data and preprocessing it, wherein the associated data comprises image data and corresponding cross-modal auxiliary data, and the cross-modal auxiliary data is at least one of text description data, lidar point cloud data, sensor time-series data, or meteorological observation data; sequentially applying pixel-value normalization, fixed-resolution scaling (resolution range 256×256 to 1024×1024), and noise filtering (Gaussian or bilateral filtering) to the image data; performing format conversion on the cross-modal auxiliary data (text data is converted into 768-dimensional word vectors by a BERT model; point cloud data retains key feature points through voxel downsampling) and feature