
CN-122021947-A - Large model self-adaptive teaching method and system based on causal guided decoupling and multi-modal perception learning

CN-122021947-A

Abstract

The invention discloses a large model self-adaptive teaching method and system based on causal guided decoupling and multi-modal perception learning. The method decouples two control vectors from a non-parallel plain-text corpus through contrastive learning, causal differential extraction, and a robust aggregation algorithm. A cognitive coefficient is determined from the learner's knowledge level, and a style coefficient is determined from the emotion detected in the learner's speech input. In the shared semantic space of a multi-modal large language model, the two coefficients are applied by linear superposition: the two control vectors are injected into the model through an activation steering technique, and the steered activations are fed forward synchronously to multiple downstream decoders to generate multi-modal output. By injecting two orthogonal, continuously adjustable control vectors, the invention achieves fine-grained, two-dimensional (content and style) control of the multi-modal teaching output of an AI teacher, thereby addressing the pain points of cognitive mismatch and emotional disconnect in AI teaching and completely eliminating the dependence on parallel corpora.

Inventors

  • WANG QINGSONG
  • CHEN JINGYUAN
  • YAO CHANG

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-15

Claims (10)

  1. A large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning, characterized by comprising the following steps: S1, constructing a data set: constructing a non-parallel plain-text corpus comprising a cognitive library for extracting cognitive modes and a style library for extracting teaching styles; S2, contrastive-learning LoRA training: using the corpus of S1, training a cognitive mode LoRA adapter and a style mode LoRA adapter on a multi-modal large language model with contrastive learning objectives, respectively; S3, preliminarily extracting the differential vectors: computing the activation differences of the multi-modal large language model, under the same inputs, with the cognitive mode LoRA adapter loaded and with the style mode LoRA adapter loaded, respectively, so as to obtain the original differential vector set of the cognitive library and the original differential vector set of the style library; S4, aggregating the control vectors: applying robust aggregation to the original differential vector set of the cognitive library and to the original differential vector set of the style library, respectively, to obtain a cognitive mode control vector and a style mode control vector; S5, determining the steering strength required by the user at inference time: during real-time interaction with the user, dynamically determining the application strength coefficients of the two control vectors of step S4, including determining a cognitive mode coefficient and dynamically determining a teaching style coefficient through a speech emotion recognition (SER) module; S6, performing style-injection inference based on the strength required by the user: in the shared semantic space of the multi-modal large language model, injecting the determined application strength coefficients and the two control vectors of step S4 into the original activations of the model by linear superposition so as to modify the original activations, and feeding the modified activations forward to all downstream decoders to generate synchronous multi-modal output.
  2. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, wherein step S1 specifically comprises: S11, constructing the cognitive library, wherein the cognitive library consists of two groups of non-parallel sub-corpora, namely a corpus A containing intuitive, popular-science texts and a corpus B containing formalized, academic texts; S12, constructing the style library, wherein the style library consists of two groups of non-parallel sub-corpora, namely a corpus C containing enthusiastic, lively, encouraging texts and a corpus D containing calm, rigorous, objective, neutral texts.
  3. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 2, wherein step S2 specifically comprises: S21, setting the contrastive learning objective: for the cognitive library, treating texts in corpus B as positive examples and texts in corpus A as negative examples; S22, training the cognitive mode LoRA adapter on the multi-modal large language model with a contrastive learning loss function over the positive and negative examples of the cognitive library, to obtain the trained cognitive mode LoRA adapter; S23, processing the style library in the same manner as steps S21-S22 to obtain the trained style mode LoRA adapter.
  4. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, wherein step S3 specifically comprises: S31, defining the intervened model: denoting the multi-modal large language model as the base model, and defining the model with the cognitive mode LoRA adapter applied as the intervened model; S32, selecting the extraction layer and function: selecting one or more intermediate layers in the shared semantic space of the multi-modal large language model as extraction layers, and defining an activation extraction function, wherein the activation extraction function returns, for a model receiving the i-th text input of the non-parallel plain-text corpus, the average of the MLP output vectors at the extraction layer over all generated tokens; S33, computing the activation differences: feeding the same batch of texts selected from the non-parallel plain-text corpus into the base model and into the intervened model, respectively, and computing their activation difference at the extraction layer according to the activation extraction function; S34, collecting all the activation differences to construct the original differential vector set of the cognitive library; S35, processing the obtained style mode LoRA adapter in the same manner as steps S31-S34, thereby constructing the original differential vector set of the style library.
  5. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, wherein step S4 specifically comprises: S41, principal component analysis (PCA) noise reduction: applying PCA to the original differential vector set of the cognitive library obtained in step S3, retaining the first k principal components, and projecting all vectors into the principal component space; S42, geometric median aggregation: aggregating all the vectors projected into the principal component space in step S41 with a geometric median algorithm to obtain a center-point estimate that is robust to outlier vectors; S43, generating the final control vector: projecting the robust center point obtained in S42 back into the original space preceding the principal component projection and normalizing it to obtain the final cognitive mode control vector; S44, processing the original differential vector set of the style library in the same manner as steps S41-S43 to obtain the style mode control vector.
  6. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, wherein step S5 specifically comprises: S51, determining the cognitive mode coefficient: setting the specific value of the cognitive mode coefficient according to the student's static knowledge profile or the student's explicit request via a slider on the user interface (UI); S52, capturing speech emotion: capturing and analyzing the prosodic features of the student's current speech input in real time through an externally integrated speech emotion recognition (SER) module so as to infer the student's emotion; S53, dynamically deciding the teaching style coefficient, which steers the output toward a 'calm' or an 'enthusiastic' style, based on the student emotion inferred by the speech emotion recognition (SER) module in step S52.
  7. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, wherein step S6 specifically comprises: S61, selecting activation injection points: selecting one or more intermediate layers in the shared semantic space of the multi-modal large language model as activation injection points; S62, applying dual-vector activation steering: during the forward propagation of the multi-modal large language model when generating each token, modifying the original activation at the activation injection points of S61 by linear superposition with the cognitive mode control vector and the style mode control vector, the modified activation being computed according to the following formula: h' = h + α·v_cog + β·v_style, wherein h' is the modified activation, h is the original activation, α and β are the cognitive mode coefficient and the teaching style coefficient respectively, v_cog is the cognitive mode control vector, and v_style is the style mode control vector; S63, generating synchronous multi-modal output: feeding the modified activation of S62 forward and simultaneously transmitting it to all downstream decoders to generate synchronous multi-modal responses that are fully consistent in cognition and style.
  8. The large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning according to claim 1, further comprising: cyclically executing steps S5-S6 during continuous interaction with students, so as to construct a 'listen-decide-speak' closed loop of teaching interaction.
  9. A large model self-adaptive teaching system based on causal guided decoupling and multi-modal perception learning, using the method of any one of claims 1-8, comprising: an offline training module, configured to train the LoRA adapters via contrastive learning on the non-parallel cognitive library and style library, and to perform, in sequence, extraction of the original differential vector sets and robust aggregation, so as to obtain the decoupled cognitive mode control vector and style mode control vector; a speech emotion recognition (SER) module, configured to analyze the user's speech emotion during real-time interaction with the user; a dynamic decision module, configured to determine the cognitive mode coefficient based on the user profile or an explicit request, and to dynamically determine the teaching style coefficient according to the analysis result of the speech emotion recognition (SER) module; an activation steering module, configured to modify the original activations of the multi-modal large language model, in its shared semantic space, by linear superposition according to the cognitive mode coefficient, the teaching style coefficient and the two control vectors; and a multi-modal generation module, configured to receive the activations modified by the activation steering module and generate synchronous multi-modal output.
  10. The large model self-adaptive teaching system based on causal guided decoupling and multi-modal perception learning according to claim 9, wherein: the robust aggregation in the offline training module is implemented by a two-stage aggregation unit, which first applies principal component analysis (PCA) to the original differential vector sets for noise reduction and then aggregates the noise-reduced vectors with a geometric median algorithm; and the multi-modal generation module comprises a text decoder, a speech decoder and a teaching gesture decoder in parallel, each of which receives the same modified activation from the activation steering module to ensure output synchronicity.
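The claims recite the method at the level of steps; the sketches below illustrate, in plain Python, one way some of those steps might be realized. They are minimal, non-limiting illustrations by the editor: all function names, hyperparameters and library choices are assumptions, not part of the disclosure. The first sketch concerns the contrastive objective of claim 3 (step S2): texts from corpus B are positive examples and texts from corpus A are negative examples, and a LoRA adapter would be trained so that the adapted model prefers positives over negatives. A margin-based contrastive loss over per-sequence log-likelihoods is assumed here; the patent does not fix a specific loss function.

```python
import torch
import torch.nn.functional as F

def contrastive_lora_loss(logp_pos: torch.Tensor,
                          logp_neg: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss over per-sequence log-likelihoods.

    logp_pos: log-likelihoods (averaged over tokens) the LoRA-adapted model
              assigns to positive-example texts (e.g. corpus B for the
              cognitive library, claim 3 S21).
    logp_neg: log-likelihoods for negative-example texts (corpus A).
    In a real setup only the LoRA parameters would receive gradients.
    """
    return F.relu(margin - (logp_pos - logp_neg)).mean()

# Toy usage with fake log-likelihoods standing in for model outputs.
logp_pos = torch.tensor([-2.1, -1.8, -2.5], requires_grad=True)
logp_neg = torch.tensor([-1.9, -2.6, -2.0])
loss = contrastive_lora_loss(logp_pos, logp_neg)
loss.backward()
print(float(loss))
```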
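Next, a sketch of the causal differential extraction of claim 4 (step S3): the same batch of texts is fed to the base model and to the LoRA-intervened model, mean MLP activations are read out at a chosen intermediate layer, and their differences form the original differential vector set. A real system would read these activations from a transformer layer (e.g. via forward hooks); here the two models are stood in for by callables returning per-token activations, which is an editorial simplification.

```python
import torch

def mean_layer_activation(run_model, text: str) -> torch.Tensor:
    """Activation extraction function: average the layer-L MLP outputs
    over all generated tokens for one text input (claim 4, S32)."""
    token_acts = run_model(text)          # (num_tokens, hidden_dim)
    return token_acts.mean(dim=0)         # (hidden_dim,)

def differential_vectors(base_model, intervened_model, texts) -> torch.Tensor:
    """Original differential vector set: intervened-model activation minus
    base-model activation, one vector per input text (claim 4, S33-S34)."""
    diffs = [mean_layer_activation(intervened_model, t)
             - mean_layer_activation(base_model, t)
             for t in texts]
    return torch.stack(diffs)             # (num_texts, hidden_dim)

# Toy stand-ins: random activations instead of real transformer hooks.
hidden = 16
base = lambda text: torch.randn(len(text.split()), hidden)
adapted = lambda text: torch.randn(len(text.split()), hidden) + 0.5
corpus = ["derivatives measure an instantaneous rate of change",
          "a derivative is like a speedometer reading"]
D_cog = differential_vectors(base, adapted, corpus)
print(D_cog.shape)
```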
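A sketch of the two-stage robust aggregation of claim 5 (step S4, and the aggregation unit of claim 10): PCA keeps the first k components to denoise the differential vectors, a geometric median gives an outlier-robust center, and the center is projected back and normalized into the final control vector. The Weiszfeld iteration used below is an editorial choice; the claim only requires "a geometric median algorithm", and k is an illustrative value.

```python
import numpy as np

def geometric_median(points: np.ndarray, iters: int = 100, eps: float = 1e-8) -> np.ndarray:
    """Weiszfeld iteration for the geometric median (robust to outliers)."""
    center = points.mean(axis=0)
    for _ in range(iters):
        dists = np.clip(np.linalg.norm(points - center, axis=1), eps, None)
        weights = 1.0 / dists
        center = (points * weights[:, None]).sum(axis=0) / weights.sum()
    return center

def aggregate_control_vector(diff_vectors: np.ndarray, k: int = 8) -> np.ndarray:
    """PCA denoising + geometric median + back-projection + normalization."""
    mean = diff_vectors.mean(axis=0)
    centered = diff_vectors - mean
    # PCA via SVD; keep the first k principal components (S41).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                          # (k, hidden_dim)
    projected = centered @ components.T          # (n, k)
    # Robust center point in the principal-component space (S42).
    center_k = geometric_median(projected)
    # Back to the original space and L2-normalize (S43).
    center = center_k @ components + mean
    return center / np.linalg.norm(center)

# Toy differential vector set with a few outlier vectors mixed in.
rng = np.random.default_rng(0)
D = rng.normal(loc=0.3, scale=0.05, size=(200, 64))
D[:5] += rng.normal(scale=5.0, size=(5, 64))
v_cog = aggregate_control_vector(D)
print(v_cog.shape, round(float(np.linalg.norm(v_cog)), 3))
```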
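A sketch of the coefficient decision of claim 6 (step S5): the cognitive mode coefficient comes from a static knowledge profile or an explicit UI slider, and the teaching style coefficient is mapped from the emotion label produced by the SER module. The emotion labels, the numeric ranges and the mapping values below are illustrative assumptions; the claims only require that detected emotion steers the style toward 'calm' or 'enthusiastic'.

```python
from typing import Optional

def cognitive_coefficient(knowledge_level: float, slider: Optional[float] = None) -> float:
    """alpha in [-1, 1]: negative -> intuitive/popular, positive -> formal.
    An explicit UI slider value, if present, overrides the static profile (S51)."""
    alpha = slider if slider is not None else 2.0 * knowledge_level - 1.0
    return max(-1.0, min(1.0, alpha))

def style_coefficient(ser_emotion: str) -> float:
    """beta in [-1, 1]: negative -> calm/steady, positive -> enthusiastic (S52-S53).
    Illustrative mapping from SER labels; not specified in the claims."""
    mapping = {"frustrated": -0.8, "confused": -0.5, "neutral": 0.0,
               "curious": 0.5, "excited": 0.8}
    return mapping.get(ser_emotion, 0.0)

print(cognitive_coefficient(knowledge_level=0.2), style_coefficient("confused"))
```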
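A sketch of the dual-vector activation steering of claim 7 (step S6): at each injection layer, h' = h + α·v_cog + β·v_style is applied to the hidden states on every forward pass. With a typical transformer one could register a forward hook on the chosen layer; the hook below assumes a module whose output is a hidden-state tensor of shape (batch, seq, hidden), which is an assumption about the host model rather than a detail from the patent.

```python
import torch

def make_steering_hook(alpha: float, beta: float,
                       v_cog: torch.Tensor, v_style: torch.Tensor):
    """Return a forward hook implementing h' = h + alpha*v_cog + beta*v_style."""
    steer = alpha * v_cog + beta * v_style            # precomputed steering offset

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + steer                      # broadcast over batch and tokens
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Toy demonstration: a single linear layer stands in for an injection point.
hidden_dim = 64
layer = torch.nn.Linear(hidden_dim, hidden_dim)
v_cog, v_style = torch.randn(hidden_dim), torch.randn(hidden_dim)
handle = layer.register_forward_hook(make_steering_hook(0.7, -0.5, v_cog, v_style))
out = layer(torch.randn(2, 10, hidden_dim))           # (batch, seq, hidden)
handle.remove()
print(out.shape)
```

The same steered hidden states would then be fed forward unchanged to every downstream decoder (text, speech, gesture), which is what claim 10 relies on for output synchronicity.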
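Finally, a sketch of the 'listen-decide-speak' closed loop of claim 8: steps S5 and S6 repeat across the interaction, re-estimating the style coefficient from each new utterance before the next synchronized response is generated. The SER, decision and generation calls below are placeholders standing in for the components sketched above.

```python
def listen(turn: int) -> str:
    """Placeholder SER result for the student's current utterance (S52)."""
    return ["confused", "neutral", "excited"][turn % 3]

def decide(emotion: str) -> float:
    """Teaching style coefficient beta decided from the detected emotion (S53)."""
    return {"confused": -0.5, "neutral": 0.0, "excited": 0.8}.get(emotion, 0.0)

def speak(alpha: float, beta: float) -> str:
    """Placeholder for steered, synchronized multi-modal generation (S6)."""
    return f"response generated with alpha={alpha:+.1f}, beta={beta:+.1f}"

alpha = 0.4                        # cognitive mode coefficient, fixed per learner profile
for turn in range(3):              # claim 8: cycle S5-S6 throughout the interaction
    emotion = listen(turn)
    beta = decide(emotion)
    print(emotion, "->", speak(alpha, beta))
```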

Description

Large model self-adaptive teaching method and system based on causal guided decoupling and multi-modal perception learning

Technical Field

The invention belongs to the intersection of artificial intelligence, multi-modal learning, natural language processing and intelligent education, and particularly relates to a large model self-adaptive teaching method and system based on causal guided decoupling and multi-modal perception learning.

Background

With the rapid development of multi-modal Large Language Models (LLMs), they exhibit great transformative potential in the field of intelligent education. By virtue of their rich knowledge reserves and their ability to synchronously generate text, speech and even avatars, multi-modal LLMs can build more immersive and personalized tutoring scenarios and provide a more effective learning experience than traditional text. However, despite the tremendous potential of multi-modal AI teachers, current models still face significant challenges in achieving truly "adaptive" teaching. One major problem is that the model cannot accurately adjust the "cognitive depth" of the content to the learner's knowledge level, so beginners may receive overly formalized and complex answers while advanced learners may find the analogies and explanations too simplistic. A second challenge is that the model lacks the ability to dynamically adjust its "teaching style": for example, a frustrated and confused student needs a calm, steady tone, whereas an excited and curious student responds better to enthusiastic, lively feedback. More seriously, the output of existing multi-modal systems is often disjointed: the system may generate rigorous text content but deliver it with excessively enthusiastic speech intonation, or the avatar's expression may be completely mismatched with the speech emotion, which severely undermines teaching credibility and immersion. Finally, current systems are mostly static; they cannot sense the student's speech emotion (e.g., confusion or excitement) in real time during the interaction, and therefore cannot construct a dynamic "sense-decide-feedback" teaching closed loop.

To solve these problems, researchers have attempted to improve the adaptability of models through prompt engineering (prompting) and model fine-tuning. However, prompt engineering struggles to achieve continuous, fine-grained control (e.g., "70% formalization" is not expressible), and synchronized prompting across multiple modalities (e.g., "speak this sentence in this intonation") is extremely difficult and unreliable. The biggest bottleneck of fine-tuning methods (including parameter interpolation) is their heavy reliance on "parallel corpora", which are expensive and almost impossible to obtain at scale: for example, the same teaching content would need paired text, speech and expression data in multiple variants such as intuitive, formalized and calm. Furthermore, the above approaches have difficulty decoupling "what to teach" (cognitive content) from "how to teach" (delivery style). In training data, "formalized" content often co-occurs with a "calm" style, so the model incorrectly binds the two together and cannot achieve independent, flexible control.
Therefore, a new scheme is urgently needed in this field, one that can (1) avoid parallel corpora, (2) achieve independent and continuous control over the two dimensions of "cognitive depth" and "teaching style", (3) sense the emotional state of students in real time and dynamically adjust the output, and (4) guarantee the synchronicity and consistency of multi-modal outputs such as text and speech.

Disclosure of Invention

In order to solve the problems in the background art, the invention provides a large model self-adaptive teaching method and system based on causal guided decoupling and multi-modal perception learning, which address the problems of dependence on expensive parallel corpora, entanglement of the cognitive-depth and teaching-style control dimensions, lack of real-time emotion perception, and asynchronous multi-modal output described in the background art. The invention extracts the decoupled cognitive mode and teaching style control vectors offline from non-parallel plain text by means of causal intervention and robust aggregation, performs dynamic coefficient decision with a speech emotion recognition (SER) module during online interaction, and applies dual-vector activation steering in the shared semantic space of the multi-modal model, thereby solving the technical problem that dynamic, independent and synchronous multi-modal self-adaptive teaching cannot be achieved in the prior art. The technical scheme adopted by the invention is as follows: 1. A large model self-adaptive teaching method based on causal guided decoupling and multi-modal perception learning, comprising the following steps: S1, constructing a data set, and constructing a non-parallel plain-text corpus