CN-121996066-A - Efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment
Abstract
The invention provides an efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment, belonging to the intersection of brain-computer interfaces, computer vision and multi-modal learning. The method comprises: acquiring, from a public dataset, paired EEG signals and images in which natural object images are presented to a subject under a rapid serial visual presentation paradigm; preprocessing the signals and extracting the segment corresponding to the image presentation time window as the model input; constructing an EEG visual decoding model; and, based on the input, performing inference with the EEG visual decoding model to output an image matching the current brain activity. The invention addresses the shortcomings of existing EEG-vision alignment methods in modeling the correspondence between brain regions and local visual sub-features, obtaining hierarchical fine-grained visual representations, improving training stability and cross-subject generalization, and providing neuroscientific interpretability.
Inventors
- GUO JINYANG
- LIU XIANGLONG
- ZHU YANAN
- LI BO
Assignees
- 北京航空航天大学
Dates
- Publication Date
- 20260508
- Application Date
- 20260108
Claims (10)
- 1. An efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment, characterized by comprising the following steps: S1, acquiring, from a public dataset, paired EEG signals and images in which natural object images are presented to a subject under a rapid serial visual presentation paradigm, preprocessing the signals, and extracting the segment corresponding to the image presentation time window as the model input; S2, constructing an EEG visual decoding model; S3, based on the input, performing inference with the EEG visual decoding model and outputting an image matching the current brain activity, thereby completing EEG visual decoding with region-aware hierarchical sub-feature alignment.
- 2. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 1, wherein the EEG visual decoding model comprises: a visual sub-feature extraction module, which encodes the input image with a pre-trained visual encoder and a visual resampler to obtain visual sub-features; an EEG encoder, which realizes region-aware hierarchical sub-feature alignment through sub-feature-level brain-vision alignment, the EEG encoder comprising a channel embedder, a cross-channel fusion projector and a domain adapter; a sub-feature selection and local alignment module, which performs sub-feature selection and adaptive local alignment based on the sub-feature-level brain-vision alignment result; a diffusion prior reconstruction module, which, under reconstruction alignment guidance, takes the brain-decoded sub-features as the input condition and obtains a reconstructed visual embedding; and a global contrast module, which performs global contrastive alignment between the visual sub-features V and the brain-decoded sub-features to obtain global embeddings.
- 3. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, wherein obtaining the visual sub-features comprises the following steps: partitioning the input image with the pre-trained visual encoder and outputting a region token sequence; and performing, in the visual resampler, cross-attention computation between K learnable query vectors and the region token patches, aggregating the partitioned patch features into K semantic sub-features to obtain the visual sub-features V ∈ ℝ^{K×n}, where K denotes the number of sub-features and n denotes the feature dimension.
- 4. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, wherein realizing the region-aware hierarchical sub-feature alignment comprises the following steps: in the channel embedder, constructing a time branch and a frequency branch for each channel; computing a frequency-embedded feature in the frequency branch; extracting a time feature in the time branch by linearly mapping the raw time signal; fusing the time feature and the frequency-embedded feature to obtain the channel-embedded feature; in the cross-channel fusion projector, concatenating the embedded features of all channels along the channel dimension to obtain a concatenated feature; mapping the concatenated feature into a K×n-dimensional space to obtain a mapped matrix, where K denotes the number of sub-features and n denotes the feature dimension, the matrix being a set of channel combinations learned for each visual sub-feature; and introducing a lightweight domain adapter in the direction of the sub-features derived from neural-signal encoding, adjusting the matrix sub-feature by sub-feature with the domain adapter, such that the adjusted brain-decoded sub-features and the visual sub-features V lie in the same feature space, realizing sub-feature-level brain-vision alignment.
- 5. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 4, wherein the channel-embedded feature is obtained by fusing the feature produced by linearly mapping the raw time signal into an m-dimensional real space with a non-linear mapping of the EEG signal of the j-th channel after Fourier transform.
- 6. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, wherein the sub-feature selection and adaptive local alignment comprise the following steps: based on the sub-feature-level brain-vision alignment result, at the visual end, taking the cross-attention matrix from the last layer of the visual resampler and computing the average attention weight of each visual sub-feature, the perceived weight of the i-th visual sub-feature being w_i = (1/(H·N)) Σ_h Σ_j A^{(h)}_{i,j}, where H denotes the number of attention heads, h indexes the heads of the multi-head attention, A^{(h)}_{i,j} denotes the attention weight of the h-th head between query i and token j, and N denotes the number of tokens; sorting the perceived weights w_i and selecting the top-k sub-features with the highest weights to form the top-k sub-feature index set; mapping each brain-decoded sub-feature into the locally aligned space through a local projection module M; and, based on the mapping result, computing a weighted local alignment loss only on the visual sub-features whose indices belong to the selected index set.
- 7. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, wherein obtaining the reconstructed visual embedding comprises: under reconstruction alignment guidance, feeding the brain-decoded sub-features into the U-Net denoising network of the diffusion model as the input condition to obtain the reconstructed visual embedding.
- 8. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, wherein obtaining the global embeddings comprises: for the visual sub-features and the brain-decoded sub-features respectively, obtaining global embeddings through a flattening or aggregation operation, the former serving as the target representation and the latter as the predicted representation of the brain signal.
- 9. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment of claim 2, characterized in that the loss function of the EEG visual decoding model has the form L = L_global + λ_local·L_local + λ_rec·L_rec, wherein L denotes the total loss of the EEG visual decoding model; L_global denotes the global contrastive loss, a symmetric InfoNCE-style term computed over the batch of size B with temperature coefficient τ between the target representation (from the visual sub-features) and the predicted representation of the brain signal, in which the similarity of the i-th target sub-feature and the i-th predicted sub-feature constitutes the positive sample and the similarities of the i-th target with the j-th predicted sub-features (and of the j-th targets with the i-th predicted sub-feature) constitute the negatives; L_local denotes the local alignment loss with weight coefficient λ_local, computed from the perceived-weight-weighted cosine similarity between the brain-decoded sub-features mapped into the local alignment space and the corresponding target sub-features, only over the selected sub-feature index set; and L_rec denotes the mean-square-error reconstruction loss of the denoising network with weight coefficient λ_rec.
- 10. The efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment according to any one of claims 1-9, wherein S3 comprises the following steps: S301, given an input EEG segment, passing it sequentially through the channel embedder, the cross-channel fusion projector, the domain adapter and the local projection module to obtain brain-decoded sub-features and a global embedding; S302, computing the similarity between the brain-decoded sub-features and global embedding obtained in S301 and the visual sub-features and global embeddings pre-computed for the candidate image library; S303, based on the similarity computation result, outputting the Top-N retrieval results ranked by similarity; S304, feeding the visual embedding obtained by diffusion prior reconstruction into the diffusion model as the input condition, outputting an image matching the current brain activity, and visualizing the subject's subjective visual content, thereby completing EEG visual decoding with region-aware hierarchical sub-feature alignment.
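The visual resampler of claim 3 (K learnable query vectors cross-attending to the region token sequence of a pre-trained visual encoder) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation; all module names, dimensions and hyperparameters are assumptions.

```python
# Hypothetical sketch of the visual resampler (claim 3): K learnable queries
# cross-attend to the patch/region tokens of a pre-trained visual encoder,
# aggregating them into K semantic sub-features V. Dimensions are illustrative.
import torch
import torch.nn as nn

class VisualResampler(nn.Module):
    def __init__(self, num_subfeatures: int = 8, dim: int = 256, heads: int = 4):
        super().__init__()
        # K learnable query vectors, one per semantic sub-feature
        self.queries = nn.Parameter(torch.randn(num_subfeatures, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, dim) region token sequence from the visual encoder
        b = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, K, dim)
        # cross-attention: each query aggregates over the N patch tokens
        v, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        return v, attn_weights  # V: (B, K, dim), weights: (B, K, N)

resampler = VisualResampler()
tokens = torch.randn(2, 49, 256)   # e.g. a 7x7 patch grid
V, A = resampler(tokens)
print(V.shape, A.shape)            # (2, 8, 256) and (2, 8, 49)
```

The returned attention matrix is exactly what claim 6 later reuses to derive perceived weights per sub-feature.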
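The sub-feature selection and weighted local alignment of claim 6 can be sketched as follows: average the last-layer cross-attention per visual sub-feature to get perceived weights, keep the top-k sub-features, and penalize dissimilarity only on that index set. Tensor shapes and the exact weighting are assumptions.

```python
# Minimal sketch of claim 6: perceived weights w_i from the resampler's
# last-layer cross-attention, top-k selection, and a weighted cosine local
# alignment loss restricted to the selected index set. Shapes are assumed.
import torch
import torch.nn.functional as F

def local_alignment_loss(attn: torch.Tensor, V: torch.Tensor,
                         U: torch.Tensor, top_k: int) -> torch.Tensor:
    # attn: (H, K, N) last-layer cross-attention; V, U: (K, n) visual / brain sub-features
    H, K, N = attn.shape
    w = attn.sum(dim=(0, 2)) / (H * N)      # w_i = (1/(H*N)) sum_h sum_j A^(h)_{i,j}
    idx = torch.topk(w, top_k).indices      # top-k sub-feature index set
    sim = F.cosine_similarity(U[idx], V[idx], dim=-1)
    # weighted local alignment: penalize dissimilarity on selected sub-features only
    return (w[idx] * (1.0 - sim)).sum() / w[idx].sum()

attn = torch.rand(4, 8, 49)   # H=4 heads, K=8 sub-features, N=49 tokens
V = torch.randn(8, 256)
loss = local_alignment_loss(attn, V, V.clone(), top_k=3)
print(float(loss))            # ~0 when brain sub-features equal the visual ones
```

In the full model of claim 9, this term would enter the total loss alongside the global contrastive and MSE reconstruction terms, each scaled by its weight coefficient.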
Description
Efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment

Technical Field

The invention belongs to the intersection of brain-computer interfaces, computer vision and multi-modal learning, and particularly relates to an efficient EEG visual decoding method with region-aware hierarchical sub-feature alignment.

Background

In recent years, visual neural decoding has become an important direction in brain-computer interfaces and cognitive neuroscience, with the goal of reconstructing or retrieving the content of a viewed image from non-invasive neural signals. In the direction of EEG visual decoding, the prior art generally follows a typical paradigm: a global visual feature vector, typically the global embedding of a vision Transformer, is extracted from the image using a pre-trained visual encoder; the multi-channel EEG signal is encoded into a vector by a convolutional neural network, a Transformer, or another spatio-temporal encoding structure; and the EEG vector is brought as close as possible to the global visual vector of the corresponding image in a joint embedding space by contrastive learning or regression, to accomplish image retrieval or reconstruction tasks. Representative works include multi-modal alignment methods that map brain signals, visual features and linguistic features into a shared semantic space for joint image-text-brain-signal modeling; a series of methods on large-scale public datasets such as THINGS-EEG that align EEG with global image embeddings by contrastive learning; and ATM-S, CogCap and UBP, which introduce more elaborate designs such as spatio-temporal convolution, attention structures and fuzzy prior constraints on the EEG-encoder side to improve the modeling of EEG signals. The alignment target, however, remains a single global visual embedding.
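The contrastive paradigm described above (pulling each EEG embedding toward the global visual embedding of its paired image in a joint space) is, in essence, a symmetric InfoNCE objective. A minimal illustrative form, with all names and sizes assumed:

```python
# Illustrative symmetric InfoNCE loss for the prior global-alignment paradigm:
# matched EEG/image pairs lie on the diagonal of the similarity matrix and are
# contrasted against all in-batch negatives. Dimensions are assumptions.
import torch
import torch.nn.functional as F

def clip_style_loss(eeg: torch.Tensor, vis: torch.Tensor, tau: float = 0.07):
    # eeg, vis: (B, d) embeddings of paired EEG segments and images
    eeg = F.normalize(eeg, dim=-1)
    vis = F.normalize(vis, dim=-1)
    logits = eeg @ vis.t() / tau              # (B, B) pairwise cosine similarities
    labels = torch.arange(eeg.shape[0])       # positives on the diagonal
    # symmetric cross-entropy over EEG->image and image->EEG directions
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
print(float(loss) > 0)   # True for unaligned random embeddings
```

The invention keeps a global term of this kind but, unlike the prior art, supplements it with sub-feature-level local alignment.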
Another group of methods proposes using multiple artificially constructed auxiliary modalities to enrich the visual alignment supervision. For example, multiple views such as text descriptions, gray-level maps, blur maps, depth maps and semantic segmentation maps are derived from the original image, and multiple "EEG→corresponding modality embedding" mapping models are trained separately, attempting to constrain the brain signals toward representations of the visual content at different levels. The main problem of the prior art is that global alignment neglects brain-region functional differentiation and local visual structure: most existing methods pool or linearly mix the multi-channel EEG signal across channels into a single global vector, which is then aligned with the global embedding of the image. However, cognitive neuroscience research shows that the visual cortex has pronounced regional specificity: different brain regions such as the occipital and parietal lobes have different sensitivities to high-frequency textures, shape contours and low-frequency background information, and the ventral and dorsal pathways process object category and spatial information differently. If all channels are simply compressed into one vector, the brain-region-to-local-visual-pattern correspondence is lost, limiting the final decoding accuracy and interpretability.
Methods relying on multiple manual auxiliary modalities incur high engineering cost and destroy natural consistency: existing partial hierarchical alignment schemes must convert the original image into several auxiliary modalities such as text, depth, blur and segmentation, which requires additional preprocessing tool chains and computing resources; each modality corresponds to an independent feature space and alignment branch, making the whole system complex and parameter-heavy; and the manual modalities lack a uniform sub-structure division, weakening the semantic sub-structures naturally present in the original image and hindering the construction of a stable brain-region-to-visual-sub-feature mapping. Under the traditional global alignment framework, it is difficult to answer which EEG channel corresponds mainly to which local area or texture pattern in the image. The model output is usually only the similarity of two global vectors, providing no channel-level or sub-feature-level weight information; its interpretability in the neuroscientific sense is weak, which is unfavorable for subsequent brain-function analysis and clinical application. On large-scale multi-subject datasets such as THINGS-EEG, existing methods are significantly less accurate in the cross-subject setting than in the within-subject setting: the models are sensitive to individual differences and generalize insufficiently, limiting direct deployment in practical BCI applications.

Disclosure of Invention

Aiming at the defects in the prior art, the inve