
CN-122002039-A - Man-machine cooperation continuous image compression method and system based on hybrid expert adapter

CN122002039A

Abstract

The invention discloses a human-machine collaborative continual image compression method and system based on a hybrid expert adapter. The technical scheme is as follows: a general-feature data set and a downstream-task data set are generated and preprocessed, and a base codec network is built; a hybrid expert adapter comprising a public expert module and a task expert module in parallel is constructed; a human-machine collaboration decoupling framework is built, trained in two stages, and subjected to a parameter-isolation strategy in the second stage; and the working mode is switched dynamically at inference time, where the human-vision mode removes the hybrid expert adapter and the machine-vision mode dynamically activates the corresponding expert according to the task identifier. The invention balances human visual quality against machine-vision task performance, supports multi-task zero-forgetting continual learning, and improves the analysis accuracy of downstream machine-vision tasks.
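The abstract's dual-mode inference (remove the adapter for human vision; route by task identifier for machine vision) can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in for the patent's neural modules: `base_codec`, `make_expert`, `HybridExpertAdapter`, and the task id `"seg"` are illustrative names, and the experts are toy scalar maps rather than trained networks.

```python
def base_codec(image):
    """Toy stand-in for the frozen human-vision encoder/decoder."""
    return [p * 0.5 for p in image]

def make_expert(scale):
    """Toy expert; the patent's experts are bottleneck neural modules."""
    return lambda feat: [p * scale for p in feat]

class HybridExpertAdapter:
    """Public expert plus a dynamic task-expert list with per-task fusion coefficients."""
    def __init__(self):
        self.common = make_expert(0.1)   # shared (public) expert
        self.task_experts = {}           # task id -> expert
        self.fusion = {}                 # task id -> (alpha, beta)

    def add_task(self, task_id, expert, alpha, beta):
        self.task_experts[task_id] = expert
        self.fusion[task_id] = (alpha, beta)

    def __call__(self, feat, task_id):
        # Residual fusion: input + alpha * public expert + beta * task expert.
        a, b = self.fusion[task_id]
        common = self.common(feat)
        task = self.task_experts[task_id](feat)
        return [f + a * c + b * t for f, c, t in zip(feat, common, task)]

def infer(image, mode, adapter=None, task_id=None):
    feat = base_codec(image)
    if mode == "human":
        return feat                      # adapter removed: plain base codec
    return adapter(feat, task_id)        # machine vision: route by task id

adapter = HybridExpertAdapter()
adapter.add_task("seg", make_expert(0.2), alpha=1.0, beta=1.0)
human_out = infer([1.0, 2.0], "human")   # [0.5, 1.0]
machine_out = infer([1.0, 2.0], "machine", adapter, "seg")
```

Note the design consequence: with both fusion coefficients at 0 (their initialization in the patent's second training stage), the adapter reduces to the identity on its input, which is what allows the human-vision path to stay untouched.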

Inventors

  • LIU Kai
  • LIU Shaobo
  • XIONG Haobo
  • DING Zhongyang
  • WANG Feiyang

Assignees

  • Xidian University

Dates

Publication Date
2026-05-08
Application Date
2026-01-16

Claims (10)

  1. A human-machine collaborative continual image compression method based on a hybrid expert adapter, characterized by comprising the following implementation steps: Step 1, generating a general-feature data set and a downstream-task data set, and sequentially performing image size alignment, image cropping, normalization and vectorization preprocessing on both data sets; Step 2, constructing and initializing a base codec network comprising an encoder, an entropy model and a decoder; Step 3, constructing a hybrid expert adapter comprising a public expert module and a task expert module, wherein each expert is formed by cascading a bottleneck lightweight structure of channel dimension-reduction and dimension-increase layers with a parallel structure of a spatial branch and a frequency-domain branch; Step 4, constructing a human-machine collaboration decoupling framework comprising a feature encoding network, the hybrid expert adapter, an entropy model and a feature decoding network; Step 5, training the decoupling framework in two stages and applying a parameter-isolation strategy during the second stage; Step 6, dynamically switching the working mode according to the image-compression scenario: executing Step 7 for the human-vision mode and Step 8 for the machine-vision mode; Step 7, removing the hybrid expert adapter from the decoupling framework, reverting to the base codec network constructed in Step 2, and running its inference to complete image compression and evaluation for human vision; Step 8, in the inference stage of the machine-vision mode, using the trained decoupling framework to receive images and task identifiers, performing dynamic routing inference, and dynamically activating the public expert module, the corresponding task-specific expert module and the routing fusion coefficients according to the task identifier to complete output and evaluation.
  2. The human-machine collaborative continual image compression method according to claim 1, wherein the base codec network in Step 2 is used to extract general human-eye perceptual features; any Transformer-based or convolutional-neural-network-based backbone may be selected, the weights of a trained human-vision-oriented base codec network are loaded directly for initialization, and base codec weights at four bit-rate levels are typically selected.
  3. The human-machine collaborative continual image compression method according to claim 1, wherein the hybrid expert adapter in Step 3 consists of a public expert module and a task expert module in parallel; the public expert module comprises a single shared expert, denoted E_pub; the task expert module is a dynamic list {E_1, E_2, ..., E_K}, each element corresponding to one specific downstream task, with the task identifier k controlling the selection of the task expert at execution time; for a given input feature F and an assigned task identifier k, the output feature F_out of the hybrid expert adapter is computed as: F_out = F + α_k · E_pub(F) + β_k · E_k(F); wherein the input feature F has dimensions B × N × H × W, B denotes the batch size, N denotes the number of channels, H denotes the height of the feature map and W denotes the width of the feature map; the output feature F_out keeps the same dimensions as the input feature F; E_pub(F) denotes the output of the public expert module; E_k(F) denotes the output of the task expert module corresponding to the current task k; and α_k and β_k are routing fusion coefficients trained independently for the k-th task.
  4. The human-machine collaborative continual image compression method according to claim 1, wherein the bottleneck lightweight structure of each expert in Step 3 consists of a channel dimension-reduction layer and a channel dimension-increase layer; the dimension-reduction layer comprises a learnable scaling and a down-projection, the learnable scaling consisting of a layer-normalization structure and a residual structure, which for an input feature F produce a normalized feature F_norm computed as: F_norm = γ ⊙ LN(F) + s ⊙ F; wherein LN denotes layer normalization; γ denotes a learnable scaling vector for the layer-normalization structure, used to fine-tune the contribution of the normalized feature; s denotes a learnable scaling vector for the residual structure, used to preserve the manifold structure of the original feature; and ⊙ denotes element-wise multiplication; the normalized feature F_norm is then passed into the down-projection: F_down = Conv_1×1(F_norm); wherein Conv_1×1 denotes a convolution with kernel size 1×1 that compresses the number of feature channels from the input dimension N to the intermediate dimension M; after the spatial branch and the frequency-domain branch each complete feature extraction, their output features are weighted and summed, and the final output F_out is obtained through the up-projection: F_out = Conv_1×1(ReLU(λ_s · F_s + λ_f · F_f)); wherein λ_s and λ_f denote two learnable scalar balance parameters introduced to weight and sum the spatial-branch output feature F_s and the frequency-domain-branch output feature F_f; the summed feature undergoes a nonlinear transformation through the ReLU activation function and is then passed into the up-projection, consisting of a convolution with kernel size 1×1 that restores the number of feature channels from the intermediate dimension M to the input dimension N.
  5. The human-machine collaborative continual image compression method according to claim 1, wherein the spatial branch in Step 3 captures spatial context information using multi-scale receptive fields; first, multi-scale features are extracted: the normalized input feature F_down is processed by three parallel grouped dilated convolutions with different dilation rates, keeping the number of input and output channels M unchanged and adopting a grouping strategy with group number g, to generate three feature maps F_1, F_2 and F_3; second, element-wise accumulation is performed on the three feature maps to obtain the fused feature U = F_1 + F_2 + F_3; third, attention vectors are generated by global average pooling (GAP) and fully connected projection (FCP), the fully connected projection comprising a 1×1 convolution layer, a ReLU activation function and a second 1×1 convolution layer, the first convolution layer reducing the number of feature channels after global average pooling to 16 and the second reducing it to 3, corresponding to the three parallel grouped dilated convolutions; the attention vectors a_1, a_2 and a_3 for the three dilated-convolution outputs F_1, F_2 and F_3 are computed with a Softmax function and satisfy a_1 + a_2 + a_3 = 1; fourth, feature multiplication fusion and convolution projection are performed: the generated attention vectors a_1, a_2 and a_3 are multiplied with the original three dilated-convolution outputs F_1, F_2 and F_3 and fused to obtain the feature V = a_1 ⊙ F_1 + a_2 ⊙ F_2 + a_3 ⊙ F_3, and the convolution projection structure, comprising a GELU activation function and a 1×1 convolution, yields F_p = Conv_1×1(GELU(V)); fifth, F_p is combined with F_down through a residual connection to obtain the final spatial-branch output feature F_s = F_p + F_down.
  6. The human-machine collaborative continual image compression method according to claim 1, wherein the frequency-domain branch in Step 3 captures long-range dependencies using the global characteristics of the Fourier transform; first, a two-dimensional real fast Fourier transform rFFT2d is applied to the input feature F_down, converting it from the spatial domain to the frequency domain: F_freq = rFFT2d(F_down); wherein F_freq comprises a real part R and an imaginary part I and belongs to the complex set, with H denoting the height and W the width of the feature map; second, frequency-domain filtering is performed: the real and imaginary parts are concatenated along the channel dimension, frequency-domain feature interaction is carried out through a complex weighting module, and the result is restored to complex form: [R', I'] = CW([R, I]); wherein CW denotes the complex weighting module, GELU denotes the GELU activation function and [·, ·] denotes concatenation along the channel dimension; since the real and imaginary parts are concatenated, the number of feature channels is doubled to 2·M; the complex weighting module comprises a down-projection, a GELU activation function and an up-projection, the down-projection being a 1×1 convolution that compresses the number of channels to a fraction of the original (rounded down), and the up-projection being a 1×1 convolution that restores the original channel number 2·M; third, the filtered feature is split back into real and imaginary parts along the channel dimension and combined into a new complex spectrum, which is restored to the spatial domain by the inverse real Fourier transform irFFT2d to obtain the frequency-domain-branch output feature F_f = irFFT2d(R' + j · I').
  7. The human-machine collaborative continual image compression method according to claim 1, wherein in Step 4 the feature encoding network and the feature decoding network each comprise a cascaded block structure and hybrid expert adapters, with the following specific structure: the feature encoding network comprises four encoding blocks and three hybrid expert adapters; it is divided into four cascaded encoding blocks according to the downsampling size of the feature map, defined as the first to fourth encoding blocks, and further comprises three hybrid expert adapters, namely the first, second and third hybrid expert adapters; the encoding blocks and hybrid expert adapters are cascaded in sequence, and each hybrid expert adapter receives the feature output by the preceding encoding block and passes the generated feature to the next encoding block; the feature decoding network comprises four decoding blocks and three hybrid expert adapters; it is divided into four cascaded decoding blocks according to the upsampling size of the feature map, defined as the first to fourth decoding blocks, each decoding block comprising a convolutional-neural-network block or Transformer block of arbitrary structure with upsampling; the three hybrid expert adapters are defined as the fourth, fifth and sixth hybrid expert adapters; the decoding blocks and the decoder-side hybrid expert adapters are cascaded in sequence, and each decoder-side hybrid expert adapter receives the feature output by the preceding decoding block and passes the generated feature to the next decoding block, finally reconstructing the image.
  8. The human-machine collaborative continual image compression method according to claim 1, wherein the two-stage training in Step 5 specifically comprises: in the first stage, the public expert module is pre-trained: the general-feature data set generated in Step 1 is selected as training data; the parameters of all structures in the human-machine collaboration decoupling framework constructed in Step 4, except the hybrid expert adapter, are kept frozen; the list of task expert modules is kept empty at this time, and gradient-update training is applied only to the parameters of the public expert module within the hybrid expert adapter, yielding a general pre-trained model; in this stage a joint loss function L_1 built on a classification task guides the pre-training, computed as: L_1 = L_rate + λ · L_feat; L_feat = Σ_{l=1}^{4} MSE(φ_l(x), φ_l(x̂)); wherein L_rate denotes the code-rate loss, namely the estimated code rate of the quantized feature ŷ; λ is used to control different code-rate levels; L_feat denotes the classification perceptual feature loss, which computes, using a pre-trained classifier such as ResNet-50 as feature extractor, the mean squared error (MSE) between intermediate-layer features of the original image x and the reconstructed image x̂; φ_l denotes the feature map output by the l-th stage of the feature extractor, with l ranging over [1, 4]; in the second stage, incremental training is performed on the task expert modules: the general pre-trained model obtained in the first stage is loaded, a newly added task expert E_k is constructed, and the routing fusion coefficients α_k and β_k corresponding to the public expert module and the task expert module are initialized to 0; the base codec network, the public expert module and all old task experts {E_1, ..., E_{k-1}} are forcibly frozen, and only the parameters of the currently added task expert E_k and its corresponding routing fusion coefficients α_k and β_k are gradient-updated, using only the downstream-task data set generated in Step 1; a task-aware loss function L_2 guides the training, computed as: L_2 = L_rate + λ_m · L_task; wherein λ_m denotes a machine-vision compression-level control coefficient and L_task denotes the task-aware loss, which extracts semantic features using a pre-trained task network.
  9. The human-machine collaborative continual image compression method according to claim 1, wherein the parameter-isolation strategy in Step 5 means that, during the second-stage incremental training for the N-th newly added machine-vision task, the following operations are performed through the gradient-control mechanism of the deep-learning computing framework: all weight parameters of the general pre-trained model, comprising the base codec network, the public expert module and the previous N−1 historical old task experts, are set to a non-trainable state, cutting off the gradient back-propagation path into the general pre-trained model; the current domain is thawed, namely the internal weights of the current N-th task expert and the corresponding routing fusion coefficients are set to a trainable state; this state is maintained so that, while the loss function of the N-th task is optimized, the feature-extraction paths and inference results of the public expert module and the first N−1 old tasks remain strictly numerically constant.
  10. A human-machine collaborative continual image compression system based on a hybrid expert adapter, characterized by being implemented on the basis of the human-machine collaborative continual image compression method according to any one of claims 1-9, and comprising the following modules: a data generation and preprocessing module, used to generate and organize the general-feature data set and the downstream-task data set and to perform the preprocessing operations of image size alignment, image cropping, normalization and vectorization; a base codec network construction and initialization module, which constructs a base codec network comprising an encoder, an entropy model and a decoder, the base codec network serving as the basis for extracting general human-eye perceptual features, with human-vision-oriented weights loaded directly for initialization and its parameters kept frozen during machine-vision task training; a hybrid expert adapter construction module, used to construct a hybrid expert adapter comprising a public expert module and a task expert module in parallel, with the bottleneck lightweight structure of channel dimension-reduction and dimension-increase layers, the spatial branch and the frequency-domain branch integrated within each expert; a human-machine collaboration decoupling architecture construction module, used to construct a decoupling architecture consisting of a feature encoding network, the hybrid expert adapter, an entropy model and a feature decoding network, with a hybrid expert adapter inserted between each level of cascaded blocks of the feature encoding network and the feature decoding network, used to extract and fine-tune the semantic features required by machine-vision tasks; a two-stage training module, whose second stage, aimed at a newly added machine-vision task, instantiates the newly added task expert and routing fusion coefficients in the hybrid expert adapter, forcibly freezes the model parameters comprising the base codec network, the public expert module and all old task experts, and trains only the currently added task expert and the corresponding routing fusion coefficients using the downstream-task data set; a working-mode dynamic switching module, which dynamically switches to the human-vision mode or the machine-vision mode by removing or loading the hybrid expert adapter according to the image-compression scenario; and a dynamic routing inference module, used in the machine-vision mode to receive the image and the task identifier, dynamically activate the public expert module, the corresponding task-specific expert and the routing fusion coefficients according to the task identifier, compute the features, and complete the output and evaluation for the specific machine-vision task.
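Claims 8-9 describe a parameter-isolation scheme: each newly added task instantiates a fresh expert plus zero-initialized fusion coefficients, while the base codec, public expert and all earlier experts are frozen. The following minimal sketch shows only that bookkeeping, with an illustrative `Param` flag standing in for a framework's trainable/frozen state; the class and method names are assumptions for illustration, not the patent's API.

```python
class Param:
    """Toy parameter with an explicit trainable flag (stand-in for requires_grad)."""
    def __init__(self, value, trainable=True):
        self.value = value
        self.trainable = trainable

class ContinualAdapter:
    """Tracks which parameters the second training stage may update."""
    def __init__(self):
        self.common = Param(0.1, trainable=False)   # public expert: frozen after stage 1
        self.experts = {}                           # task id -> expert Param
        self.fusion = {}                            # task id -> (alpha, beta) Params

    def add_task(self, task_id):
        # Parameter isolation: freeze every previously learned expert
        # and its fusion coefficients before the new task starts.
        for p in self.experts.values():
            p.trainable = False
        for a, b in self.fusion.values():
            a.trainable = b.trainable = False
        # The new expert is trainable; fusion coefficients start at 0,
        # so the adapter is initially an identity on its input.
        self.experts[task_id] = Param(0.0, trainable=True)
        self.fusion[task_id] = (Param(0.0), Param(0.0))

    def trainable_params(self):
        ps = [p for p in self.experts.values() if p.trainable]
        for a, b in self.fusion.values():
            ps += [p for p in (a, b) if p.trainable]
        return ps

cadapter = ContinualAdapter()
cadapter.add_task("detect")    # task 1: its expert + two coefficients trainable
cadapter.add_task("segment")   # task 2: everything from task 1 is now frozen
```

Because frozen parameters are never touched, the old tasks' outputs stay numerically identical after new-task training, which is the "zero forgetting" property the claims assert.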

Description

Man-machine cooperation continuous image compression method and system based on hybrid expert adapter

Technical Field

The invention belongs to the field of image processing and computer vision, and further relates to a human-machine collaborative continual image compression method and system based on a hybrid expert adapter in the technical field of electric digital data classification. The invention can achieve high-quality reconstruction for human vision and high-accuracy analysis for machine vision, supports zero-forgetting multi-task continual learning, and is applicable to image compression in the Internet of Things, autonomous driving and intelligent security.

Background

With the rapid development of Internet of Things, autonomous-driving and intelligent-security technologies, the application scenarios of image data have undergone a fundamental paradigm shift. The service objects of image compression techniques are no longer limited to the human visual system alone; in many edge-computing and real-time analysis scenarios, image data largely serves back-end machine-vision algorithms such as object detection and semantic segmentation. This shift of focus places a new dual demand on image compression technology, which existing techniques struggle to satisfy. Traditional image compression standards and mainstream learnable image compression methods mainly optimize the peak signal-to-noise ratio or structural similarity perceived by the human eye; this optimization strategy tends to preserve low-frequency smooth information, so the key high-frequency semantic information required by machine vision is lost and the accuracy of back-end algorithms is reduced.
Conversely, existing compression methods oriented specifically to machine vision, although they retain semantic features, often neglect the pixel-level reconstruction quality of the image, so the decompressed image cannot satisfy human viewing requirements. Shanghai University, in its patent application "Mamba-based image compression method and system for human eye and machine vision at the same time" (application number: 202510213376.9, application publication number: CN 120151540 A), discloses a method for image compression using a state-space model (e.g., Mamba). The method inputs an RGB image into a preset analysis transformation network to determine latent features; sequentially performs entropy-parameter estimation, quantization and entropy coding on the latent features with a preset hyper-prior entropy model to determine the binary code stream; entropy-decodes the binary code stream with the preset hyper-prior entropy model to recover the latent features; inputs the latent features into a preset synthesis transformation network to determine a reconstructed image for human viewing; upsamples the latent features based on Mamba to determine aligned features; and inputs the aligned features into a preset machine-vision back-end network for knowledge distillation to complete the preset machine-vision task. That patent realizes image reconstruction and machine-vision tasks simultaneously by transmitting a single code stream, and improves performance by exploiting the advantage of the Mamba architecture in long-sequence modeling. However, this solution still has the disadvantage that it relies mainly on knowledge distillation and joint training: when faced with a newly added machine-vision task, for example extending from detection only to detection plus segmentation, it often requires retraining or fine-tuning the whole network, which is not only time-consuming and laborious but also easily leads to "catastrophic forgetting", i.e. learning a new task may degrade the quality of old tasks such as human-eye reconstruction or earlier machine tasks. Furthermore, the Mamba architecture, while efficient, is still limited in flexibility when handling multi-task switching by interference from shared parameters. An image compression method that adaptively adjusts the quantization step size is disclosed in the patent document "Image encoding and decoding methods and compression methods for machine and human vision" (application number: 202411211529.8, application publication number: CN 119180874 A) filed by Shanghai University. The method mainly comprises obtaining a first feature map and a second feature of the image to be encoded, obtaining hyper-prior information of the first feature map, obtaining quantization step sizes of the image to be encoded under different machine-vision tasks, quantizing with those step sizes to obtain the corresponding feature maps, modeling the feature maps as Gaussian distributions, pred