CN-122021909-A - Lightweight multimodal audio-visual conversion method, device and system
Abstract
The invention relates to a lightweight multimodal audio-visual conversion method, device and system. The method comprises: A) obtaining a student model, the student model being derived from a teacher model through structured pruning, quantization and knowledge-distillation training, where the structured pruning acts on convolution channels, attention heads or equivalent structures, and operator folding and computation-graph optimization are performed before deployment; B) performing region processing and normalization on the visual stream while extracting time-frequency features from the audio stream in parallel, the two streams entering independent queues and being organized into batch input through a shared buffer. The lightweight multimodal audio-visual conversion method, device and system (1) achieve low latency and low power consumption, i.e., end-side P95 inference latency is significantly reduced and peak memory and energy consumption are lowered; (2) improve throughput, since preprocessing and inference run in parallel and pipeline stalls are reduced; and (3) preserve accuracy, with distillation and end-cloud consistency checking ensuring long-term accuracy and threshold stability.
Inventors
- CHENG YUE
Assignees
- 北京跃创三品文化科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-03
Claims (10)
- 1. An end-side inference processing method for lightweight multimodal audio-visual conversion, characterized by comprising the following steps: A) obtaining a student model, wherein the student model is obtained from a teacher model through structured pruning, quantization and knowledge-distillation training, the structured pruning is performed on convolution channels, attention heads or equivalent structures, and operator folding and computation-graph optimization are performed before deployment; B) performing region processing and normalization on the visual stream and extracting time-frequency features from the audio stream in parallel, the two entering independent queues respectively and being organized into batch input through a shared buffer; C) feeding the batch input to a lightweight multimodal network comprising a visual subnet and an audio subnet, performing feature alignment and fusion in a lightweight fusion layer, and arranging early-exit branches in intermediate layers so that output is produced early when confidence reaches a threshold, the lightweight fusion layer being one of cross-attention, gated weighting or another deployable lightweight fusion structure; D) performing post-processing in parallel with inference, including result decoding, trajectory or keypoint smoothing, and formatting, and reducing peak memory through memory reuse; E) under network-available conditions, submitting samples whose confidence falls in the interval [L, H], or periodically sampled samples, to a consistency-check service for shadow inference, obtaining a consistency deviation from the comparison result, and returning a threshold or scale-calibration quantity to update end-side decisions, wherein the consistency deviation is computed using one of a probability-distribution difference measure or a vector-distance measure, the threshold or scale-calibration quantity is updated according to the consistency deviation, and the consistency check submits only limited-dimension embeddings or digests and the minimum necessary context, without uploading the original audio-visual data.
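Step E's consistency deviation and threshold update can be sketched as follows. This is a minimal illustration assuming KL divergence as the probability-distribution difference measure and a simple proportional update rule; the function names, target deviation, and update rate are hypothetical, as the claim does not fix them.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Probability-distribution difference between end-side output p and shadow-inference output q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def update_threshold(threshold, deviation, target=0.05, rate=0.1, lo=0.5, hi=0.99):
    """Raise the decision threshold when shadow inference disagrees more than
    the target deviation; lower it when it agrees. Clamped to [lo, hi]."""
    adjusted = threshold + rate * (deviation - target)
    return min(hi, max(lo, adjusted))

# A borderline sample (confidence inside [L, H]) is compared against the shadow
# result; only the deviation and the calibrated threshold are exchanged, not raw data.
student = [0.70, 0.20, 0.10]
shadow = [0.60, 0.30, 0.10]
dev = kl_divergence(student, shadow)
new_t = update_threshold(0.85, dev)
```

In practice only a limited-dimension embedding or digest would be submitted; the full distributions above stand in for whatever comparison the consistency-check service performs.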
- 2. The method as set forth in claim 1, wherein the knowledge distillation includes distillation of one or more of the teacher's soft-target outputs, selected intermediate features, and attention distributions.
- 3. The method of claim 1, wherein the visual stream comprises one or more of on-screen rendered frames, sampled frames, composited frames, or captured frames, and is not limited to frames from an external acquisition device.
- 4. A training method for obtaining the student model of claim 1, comprising: 1) determining a teacher model and constructing a student-model skeleton; 2) applying structured pruning to the student model to obtain a sparse structure; 3) performing quantization-aware training or post-training quantization on the sparse structure and performing dynamic-range calibration; 4) taking the teacher's soft-target outputs and/or intermediate features and/or attention distributions as supervision and minimizing a combined distillation loss to obtain the student model; 5) arranging a distillation head on the early-exit branch to ensure the accuracy of the early-output path.
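The combined distillation loss of step 4) can be sketched in pure Python: a KL term between temperature-softened teacher and student outputs plus a mean-squared-error term on selected intermediate features. The temperature and the weights alpha and beta are illustrative assumptions; the claims do not specify how the loss terms are combined.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      temperature=4.0, alpha=0.7, beta=0.3, eps=1e-9):
    """Combined loss: KL(teacher soft targets || student) at temperature T,
    scaled by T^2 as is conventional, plus MSE on intermediate features."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    mse = sum((a - b) ** 2 for a, b in zip(student_feat, teacher_feat)) / len(student_feat)
    return alpha * (temperature ** 2) * kl + beta * mse

loss = distillation_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9],
                         [0.1, 0.4], [0.2, 0.3])
```

The same loss would also be attached to the distillation head on the early-exit branch (step 5), so the early-output path is supervised by the teacher as well.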
- 5. The training method of claim 4, wherein the pruning targets are selected by importance ranking of convolution channels, attention heads, or equivalent structures, and the pruning proportion does not exceed a preset upper limit; the quantization adopts an integer bit width b ∈ [2, 8] and is compatible with an end-side integer matrix multiply-accumulate instruction set.
- 6. The training method of claim 4, wherein the quantization adopts an integer bit width b ∈ [2, 8] and is compatible with an end-side integer matrix multiply-accumulate instruction set; the dynamic-range calibration obtains each layer's scaling factor and zero point from calibration-set statistics.
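The dynamic-range calibration of claims 5 and 6 can be sketched as asymmetric b-bit integer quantization: from calibration-set statistics (here simply the observed min/max of a layer's activations) derive a per-layer scale factor and zero point. This is a minimal illustration; the actual statistics used (e.g. percentile clipping or moving averages) are not specified in the claims.

```python
def calibrate(values, bits=8):
    """Derive a per-layer scale and zero point from calibration statistics
    (observed min/max) for asymmetric b-bit integer quantization."""
    qmin, qmax = 0, (1 << bits) - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must contain zero exactly
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, bits=8):
    q = round(x / scale) + zero_point
    return max(0, min((1 << bits) - 1, q))    # clamp to the integer range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Calibration over a toy activation sample, then a round trip for one value.
acts = [-1.0, -0.2, 0.0, 0.7, 2.5]
scale, zp = calibrate(acts, bits=8)
err = abs(dequantize(quantize(0.7, scale, zp), scale, zp) - 0.7)
```

The round-trip error of any in-range value is at most half a quantization step (scale / 2), which is what the calibration trades off against the bit width b.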
- 7. A lightweight multimodal audio-visual conversion system for implementing the method of any of claims 1-6, characterized by comprising: a parallel preprocessing module; a lightweight multimodal inference module (comprising the early-exit branches and the lightweight fusion layer); a parallel post-processing module; a consistency-check module; and a resource and memory scheduling module (for ring buffering, token arbitration, and memory-pool reuse).
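The ring buffer in the resource and memory scheduling module of claim 7 might be sketched as a fixed-capacity shared buffer that evicts the oldest frame when full, so producers never block and peak memory stays bounded. The eviction policy and batch-draining interface below are illustrative assumptions; the claim does not fix them.

```python
from collections import deque
import threading

class RingBuffer:
    """Fixed-capacity shared buffer: preprocessing threads append items, the
    inference stage drains a batch; the oldest item is dropped on overflow,
    bounding peak memory."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)   # deque evicts the oldest on overflow
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            self._buf.append(item)

    def take_batch(self, n):
        with self._lock:
            return [self._buf.popleft() for _ in range(min(n, len(self._buf)))]

rb = RingBuffer(capacity=4)
for frame in range(6):          # 6 frames into a 4-slot buffer: frames 0 and 1 evicted
    rb.put(frame)
batch = rb.take_batch(3)        # drains the oldest surviving frames first
```

Dropping the oldest frame rather than blocking the producer matches the low-latency goal: a stale frame is worth less than keeping the pipeline full.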
- 8. An electronic device comprising a processor and a memory, wherein the memory stores a computer program executable on the processor, and wherein the processor implements the method of any of claims 1-6 when the program is executed.
- 9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-6.
- 10. A computer program product, characterized in that, when run on a computer, it performs the method of any of claims 1-6.
Description
Lightweight multimodal audio-visual conversion method, device and system
Technical Field
The invention relates to the technical field of multimodal perception and intelligent edge computing, and in particular to a lightweight multimodal audio-visual conversion method, device and system.
Background
Multimodal audio-visual tasks often process visual and audio information simultaneously. Traditional schemes have a large parameter scale, high inference latency, and a high peak memory, and preprocessing/post-processing is typically executed serially with inference, causing pipeline stalls and jitter. Moreover, as the input distribution changes over long-term end-side operation, the end-side model drifts away from the high-precision cloud-side teacher, producing threshold instability and output drift. No existing scheme integrates lightweight modeling, a three-stage parallel pipeline, and consistency checking on the end side.
Disclosure of Invention
The invention aims to provide a lightweight multimodal audio-visual conversion method, device and system to solve the problems described in the background.
In order to achieve this purpose, the invention provides the following technical scheme. An end-side inference processing method for lightweight multimodal audio-visual conversion comprises the following steps: A) obtaining a student model, wherein the student model is obtained from a teacher model through structured pruning, quantization and knowledge-distillation training, the structured pruning is performed on convolution channels, attention heads or equivalent structures, and operator folding and computation-graph optimization are performed before deployment; B) performing region processing and normalization on the visual stream and extracting time-frequency features from the audio stream in parallel, the two entering independent queues respectively and being organized into batch input through a shared buffer; C) feeding the batch input to a lightweight multimodal network comprising a visual subnet and an audio subnet, performing feature alignment and fusion in a lightweight fusion layer, and arranging early-exit branches in intermediate layers so that output is produced early when confidence reaches a threshold, the lightweight fusion layer being one of cross-attention, gated weighting or another deployable lightweight fusion structure; D) performing post-processing in parallel with inference, including result decoding, trajectory or keypoint smoothing, and formatting, and reducing peak memory through memory reuse; E) under network-available conditions, submitting samples whose confidence falls in the interval [L, H], or periodically sampled samples, to a consistency-check service for shadow inference, obtaining a consistency deviation from the comparison result, and returning a threshold or scale-calibration quantity to update end-side decisions, wherein the consistency deviation is computed using one of a probability-distribution difference measure or a vector-distance measure, the threshold or scale-calibration quantity is updated according to the consistency deviation, and the consistency check submits only limited-dimension embeddings or digests and the minimum necessary context, without uploading the original audio-visual data. Further, the knowledge distillation includes distillation of one or more of the teacher's soft-target outputs, selected intermediate features, and attention distributions. Further, the visual stream includes one or more of on-screen rendered frames, sampled frames, composited frames, or captured frames, and is not limited to frames from an external acquisition device. A method of training the student model comprises: 1) determining a teacher model and constructing a student-model skeleton; 2) applying structured pruning to the student model to obtain a sparse structure; 3) performing quantization-aware training or post-training quantization on the sparse structure and performing dynamic-range calibration; 4) taking the teacher's soft-target outputs and/or intermediate features and/or attention distributions as supervision and minimizing a combined distillation loss to obtain the student model; 5) arranging a distillation head on the early-exit branch to ensure the accuracy of the early-output path. Further, the pruning targets are selected by importance ranking of convolution channels, attention heads or equivalent structures, and the pruning proportion does not exceed a preset upper limit; the quantization adopts an integer bit width b ∈ [2, 8] and is compatible with an end-side integer matrix multiply-accumulate instruction set. Furthermore, the quantization adopts an integer bit width b ∈ [2, 8] and is compatible with an end-side integer matrix multiply-accumulate instruction set; the dynamic-range calibration obtains each layer's scaling factor and zero point
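The three-stage parallel pipeline described above (preprocessing, inference, and post-processing connected by queues, with a sentinel to shut the stages down) can be sketched with worker threads. The stage bodies here are placeholder lambdas standing in for the real preprocessing, lightweight-network inference, and post-processing steps.

```python
import queue
import threading

def stage(fn, src, dst):
    """Generic pipeline stage: drain src, apply fn, feed dst until a None sentinel."""
    while True:
        item = src.get()
        if item is None:           # sentinel: propagate shutdown downstream
            dst.put(None)
            return
        dst.put(fn(item))

pre_q, infer_q, post_q, out_q = (queue.Queue() for _ in range(4))
stages = [
    threading.Thread(target=stage, args=(lambda f: f * 2, pre_q, infer_q)),   # preprocessing
    threading.Thread(target=stage, args=(lambda f: f + 1, infer_q, post_q)),  # inference
    threading.Thread(target=stage, args=(lambda f: f - 1, post_q, out_q)),    # post-processing
]
for t in stages:
    t.start()
for frame in [1, 2, 3]:           # feed frames; all three stages overlap in time
    pre_q.put(frame)
pre_q.put(None)
results = []
while (r := out_q.get()) is not None:
    results.append(r)
for t in stages:
    t.join()
```

Because each stage runs in its own thread, frame n can be post-processed while frame n+1 is in inference and frame n+2 is being preprocessed, which is the overlap that removes the pipeline stalls described in the background.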