CN-121981253-A - Method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction

CN121981253A

Abstract

The invention discloses a method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction, belonging to the technical field of brain-computer interfaces. The method first maps the electroencephalogram (EEG) signal of a target subject into a discrete set of semantic keywords, using discrete symbols as universal semantic anchors to strip away subject-specific noise. The keywords are then fed into a multimodal large language model, which infers associations among scene elements through a chain-of-thought mechanism, fills in logical gaps, and generates a coarse-grained image. The method is characterized by a dual-criterion visual feedback checking mechanism: universal commonsense priors are used to detect intent inconsistencies and physical-logic errors, a natural-language editing instruction is generated, and an image editing model is guided to perform closed-loop correction of the coarse-grained image. The invention effectively addresses cross-subject signal domain shift, suppresses logical hallucinations during generation, and markedly improves the generalization ability and reconstruction accuracy of EEG decoding in complex scenes.

Inventors

  • LIN CHENGDE
  • YANG MINGZHE
  • WANG WENBO
  • JIANG HAIYAN
  • GONG XUEZHU
  • ZHANG SHAOXU
  • YU HUANYI
  • XU YI

Assignees

  • Guilin University of Electronic Technology
  • Nanxishan Hospital of Guangxi Zhuang Autonomous Region (The Second People's Hospital of Guangxi Zhuang Autonomous Region)

Dates

Publication Date
2026-05-05
Application Date
2025-12-19

Claims (7)

  1. A method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction, characterized by comprising the following steps: step S1, acquiring the electroencephalogram (EEG) signal of a target subject to be reconstructed, performing data preprocessing on the EEG signal, and unifying the signal distribution reference to reduce the influence of individual physiological differences, so as to obtain a preprocessed EEG signal with a high signal-to-noise ratio; step S2, mapping the preprocessed EEG signal into a discrete set of semantic keywords, using discrete symbols as universal cross-subject semantic anchors to strip away subject-specific noise; step S3, constructing a reasoning prompt containing the semantic keyword set, inputting it into a pre-trained multimodal large language model, and using a chain-of-thought reasoning mechanism to generate a dense scene description that compensates for individual decoding losses and missing scene details; step S4, inputting the dense scene description, as a subject-agnostic universal condition, into an image generation model to generate an initial coarse-grained image; step S5, inputting the coarse-grained image and the semantic keyword set together into the multimodal large language model, and performing a dual-criterion visual feedback check, independent of subjective subject bias, by drawing on the cross-population commonsense priors embedded in the multimodal large language model, so as to generate a natural-language editing instruction; and step S6, inputting the natural-language editing instruction and the coarse-grained image into an instruction-based image editing model, and locally correcting the coarse-grained image to obtain the final reconstructed image.
  2. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein the data preprocessing in step S1 specifically comprises: removing environmental noise and power-line interference from the raw EEG signal using a band-pass filter and a notch filter; removing common-mode noise using whole-brain average referencing; and segmenting and standardizing the signal so that its data distribution meets the input requirements of subsequent models.
  3. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein in step S2 the preprocessed EEG signal is mapped into a discrete set of semantic keywords as follows: pre-constructing a text embedding library containing a preset vocabulary; inputting the preprocessed EEG signal into a pre-trained EEG encoder to obtain a target EEG feature vector, and computing the similarity between the target EEG feature vector and the text feature vectors corresponding to the words in the text embedding library; and selecting the several words with the highest similarity to form the semantic keyword set, thereby achieving a normalized translation from individually heterogeneous signals to universal discrete semantics.
  4. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein in step S3 the dense scene description containing scene details is generated using a chain-of-thought reasoning mechanism, specifically comprising: embedding a guiding instruction in the reasoning prompt that requires the multimodal large language model, given the input semantic keyword set, to infer the spatial layout relationships and object-action logic among the keywords using universal human cognitive logic; and the multimodal large language model, following the guiding instruction, first outputs the logical chain of its reasoning process and then semantically completes the discrete semantic keywords on the basis of that logical chain to generate the coherent dense scene description.
  5. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein the dual-criterion visual feedback check in step S5 specifically comprises two parallel checking tasks: the first task is an intent consistency check, which detects whether the core objects and scene elements described by the semantic keyword set are accurately presented in the coarse-grained image; the second task is a commonsense plausibility check, which uses the universal world knowledge of the multimodal large language model to detect whether the coarse-grained image contains generation errors that violate physical laws, anatomical structure, or spatial logic, so as to filter out hallucinations produced by subject-specific decoding; and the multimodal large language model integrates the detection results of the two tasks and outputs the natural-language editing instruction.
  6. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein the logic for generating the natural-language editing instruction in step S5 is as follows: if the dual-criterion visual feedback check detects missing intent or commonsense errors in the coarse-grained image, an instruction text containing specific modification actions and modification targets is generated; if no obvious error is detected, a keep-as-is instruction is generated; and the natural-language editing instruction links the semantic-understanding output of the multimodal large language model to the conditional input of the image editing model.
  7. The method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction according to claim 1, wherein the instruction-based image editing model in step S6 adopts a conditional diffusion model architecture configured to: receive the coarse-grained image as the initial latent variable and the natural-language editing instruction as conditional guidance; and inject the natural-language editing instruction into the generation process via an attention mechanism, denoising and repainting only the regions relevant to the instruction while keeping the structural features of instruction-irrelevant regions of the coarse-grained image unchanged, and output the reconstructed image.
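The preprocessing chain of claim 2 (band-pass and notch filtering, whole-brain average referencing, standardization) can be sketched as follows. This is a minimal illustration only; the channel count, sampling rate, filter orders, and cutoff frequencies are assumptions not specified in the patent.

```python
# Sketch of the claim-2 preprocessing chain: band-pass + notch filtering,
# whole-brain average re-referencing, and per-channel standardization.
# All numeric parameters below (fs, band, notch frequency) are illustrative
# assumptions, not values taken from the patent.
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

def preprocess_eeg(eeg, fs=250.0, band=(0.5, 45.0), notch_hz=50.0):
    """eeg: (channels, samples) raw EEG array; returns the cleaned array."""
    # Band-pass filter to remove drift and high-frequency environmental noise.
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    x = filtfilt(b, a, eeg, axis=1)
    # Notch filter to suppress power-line interference.
    bn, an = iirnotch(notch_hz, Q=30.0, fs=fs)
    x = filtfilt(bn, an, x, axis=1)
    # Whole-brain average reference removes common-mode noise.
    x = x - x.mean(axis=0, keepdims=True)
    # Per-channel z-scoring unifies the distribution reference across subjects.
    x = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-8)
    return x

rng = np.random.default_rng(0)
clean = preprocess_eeg(rng.standard_normal((8, 1000)))
print(clean.shape)  # (8, 1000)
```

In a real pipeline the standardized segments would then be fed to the pre-trained EEG encoder of claim 3.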
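The keyword-mapping step of claim 3 amounts to a nearest-neighbor lookup: the EEG feature vector is compared against every text embedding by cosine similarity and the top-k words are kept. A minimal sketch, with random vectors standing in for the pre-trained EEG encoder and text encoder outputs (the vocabulary and dimensionality are hypothetical):

```python
# Sketch of the claim-3 mapping: cosine similarity between an EEG feature
# vector and a text embedding library, keeping the top-k words.
# The vocabulary and embeddings are random stand-ins for the outputs of
# the pre-trained EEG encoder and text encoder described in the patent.
import numpy as np

def decode_keywords(eeg_feat, text_embeds, vocab, k=3):
    """Return the k vocabulary words whose embeddings best match eeg_feat."""
    e = eeg_feat / np.linalg.norm(eeg_feat)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = t @ e                       # cosine similarity for each word
    top = np.argsort(sims)[::-1][:k]   # indices of the k most similar words
    return [vocab[i] for i in top]

vocab = ["dog", "beach", "car", "tree", "person"]
rng = np.random.default_rng(1)
text_embeds = rng.standard_normal((len(vocab), 64))
# Simulate an EEG feature lying close to the "beach" embedding.
eeg_feat = text_embeds[1] + 0.1 * rng.standard_normal(64)
keywords = decode_keywords(eeg_feat, text_embeds, vocab, k=2)
print(keywords[0])  # beach
```

The resulting keyword set is what the reasoning prompt of claim 4 would then pass to the multimodal large language model.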

Description

Method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction

Technical Field

The invention belongs to the technical field at the intersection of artificial intelligence and brain-computer interfaces, and particularly relates to a method for cross-subject EEG visual scene reconstruction based on MLLM reasoning correction.

Background

Visual scene reconstruction from electroencephalography (EEG) signals is a frontier topic in neuroscience and computer vision. The technology aims to establish a mapping between brain neural activity and external visual stimuli: by decoding the brain waves generated while a subject views an image and applying artificial intelligence algorithms, it reconstructs the scene the subject saw. This has important significance for analyzing human visual cognition, assisting communication for people with disabilities, and neurorehabilitation. Compared with functional magnetic resonance imaging (fMRI), EEG has the advantages of non-invasiveness, portability, and high temporal resolution. With the development of deep learning, the field has evolved from early variational autoencoders (VAE) and generative adversarial networks (GAN) to today's diffusion-model-based approaches. Early research mainly attempted to establish a direct mapping from EEG signals to pixel space, but this was limited by the high noise, non-stationarity, and low spatial resolution of EEG signals, and the generated images were often blurry, recovering only rough outlines.
To address this problem, recent mainstream research has introduced the pre-trained Contrastive Language-Image Pre-training (CLIP) model, attempting to align EEG features to the semantically rich CLIP feature space and then generate high-quality images with a diffusion model. Meanwhile, the rapidly developing multimodal large language model (MLLM) has, through massive pre-training data, learned universal world knowledge and cross-modal semantic alignment abilities that are independent of any specific individual's physiology. These macroscopic cognitive priors are highly robust and offer a potential semantic anchor and logic-compensation path for counteracting the distribution drift of low-level physiological signals across subjects. Although existing mainstream techniques can generate fairly clear images on a single subject's dataset, their practical deployment still faces a core bottleneck: severe generalization deficits caused by inter-subject heterogeneity. The prior art suffers mainly from the following hard-to-overcome defects. First, the individual specificity of EEG signals makes existing end-to-end models difficult to transfer. Different subjects differ significantly in brain anatomy and intrinsic cognitive processing strategies, so the EEG signals produced by different subjects viewing the exact same visual stimulus exhibit significant domain shift in their time-frequency characteristics and spatial topology.
Most existing methods adopt end-to-end training strategies in which the model tends to overfit the physiological characteristics of a particular subject; when the model is applied to a new subject not seen during training, the generated image content deviates from the true intent due to signal distribution drift. Second, there is a lack of effective semantic normalization and logic-compensation mechanisms. An EEG signal is inherently a high-noise, non-stationary signal, and this poor signal-to-noise ratio is further amplified in cross-subject settings, causing the loss of part of the valid semantic information. Existing methods focus mainly on feature alignment at the signal level, trying to pull different subjects closer in mathematical distribution. However, this purely data-driven approach ignores the logical commonality of human visual perception. When some details of a subject's EEG signal are lost due to individual differences, existing generative models lack an intermediate layer with "commonsense reasoning" capability to automatically fill in the logical gaps, so the generated images often exhibit structural collapse or violate physical common sense. Third, there is a lack of closed-loop correction capability base