
CN-122021727-A - Method for constructing visual time sequence decision model based on selective memory mechanism


Abstract

The invention discloses a method for constructing a visual time sequence decision model based on a selective memory mechanism, and belongs to the technical field of computer vision and deep learning. The method comprises: extracting features from an acquired multi-frame observation image sequence, converting the visual information sequence into high-dimensional feature vectors through a feature extraction network; inputting the feature vectors into a selective memory management module, which fuses the current observation with key historical information using a cross-attention mechanism and generates a visual context vector containing global semantic information; inputting this vector as a guiding condition into a self-attention-enhanced generative diffusion strategy model, which predicts a target-oriented decision sequence trajectory through an iterative denoising process; and finally converting the trajectory into an action control vector for the executing agent through dynamic mapping. By accurately screening high-value information through the selective memory mechanism, the invention improves the model's perception of long time sequence context while significantly reducing computational cost.
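The iterative denoising step named in the abstract can be illustrated with a toy sketch. It is not the patented model: `toy_denoiser` is a hypothetical stand-in for the self-attention U-Net, it assumes the clean trajectory is simply the context vector repeated at every time step, and the fixed `step_size` update is a simplification rather than a real diffusion sampler with a noise schedule. All function and parameter names are assumptions made for illustration.

```python
import random


def toy_denoiser(trajectory, context, step):
    """Hypothetical stand-in for the self-attention U-Net.

    Returns a predicted noise term for every action in the trajectory.
    Assumption: the clean trajectory equals the context vector at each step.
    The `step` argument is unused here; a learned denoiser would condition on it.
    """
    return [[x - c for x, c in zip(action, context)] for action in trajectory]


def sample_trajectory(context, horizon=4, num_steps=20, step_size=0.3, seed=0):
    """Start from Gaussian noise and repeatedly subtract the predicted noise."""
    rng = random.Random(seed)
    # Initial trajectory: pure noise, one action vector per future time step.
    trajectory = [[rng.gauss(0.0, 1.0) for _ in context] for _ in range(horizon)]
    for step in range(num_steps, 0, -1):
        predicted_noise = toy_denoiser(trajectory, context, step)
        trajectory = [
            [x - step_size * e for x, e in zip(action, noise)]
            for action, noise in zip(trajectory, predicted_noise)
        ]
    return trajectory


if __name__ == "__main__":
    # The context vector plays the role of the guiding condition.
    for action in sample_trajectory([0.5, -1.0], horizon=3):
        print([round(v, 3) for v in action])
```

In this sketch every denoising step moves each action a fixed fraction of the way toward the context-dependent target, so the trajectory converges as the number of steps grows; a trained denoiser would instead predict the noise from data.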

Inventors

  • LI XIUZHI
  • DONG HONGYOU
  • ZHAO CHUNBO

Assignees

  • Beijing University of Technology (北京工业大学)

Dates

Publication Date
20260512
Application Date
20260128

Claims (6)

  1. A method for constructing a visual time sequence decision model based on a selective memory mechanism, characterized by comprising the following steps: Step 1), taking time-sequence-continuous image frames acquired by a vision system as system input data, wherein the image frames comprise current frames and historical frames, extracting visual features from the input images using an EfficientNet network, and transmitting the feature vectors to a Transformer encoder; Step 2), constructing a fixed-length memory pool in a sliding window manner, wherein the memory pool is initially empty, constructing a key feature recognition network, recognizing the observation image of each time step, and storing key features in the memory pool; Step 3), inputting the global visual context feature vector as a guiding condition into an improved diffusion strategy model, wherein the model introduces a self-attention mechanism into a U-Net structure, and predicting task decision trajectories for a plurality of future time steps through an iterative denoising process using the diffusion strategy model; Step 4), mapping the predicted task decision trajectory into an action control vector of the actuator so as to drive the decision-making agent to execute the target-oriented task sequence.
  2. The method for constructing a visual time sequence decision model based on a selective memory mechanism according to claim 1, wherein the memory pool explicitly stores historical information and thereby enhances the perception of historical information.
  3. The method for constructing a visual time sequence decision model based on a selective memory mechanism according to claim 1, wherein the key feature recognition network is implemented in a simple manner.
  4. The method for constructing a visual time sequence decision model based on a selective memory mechanism according to claim 1, wherein the memory mechanism uses the cross-sequence interaction of cross-attention to calculate the correlation between the current observation and the historical observation sequence, and makes a decision for the current scene by explicitly combining historical key features, so that in large-scale complex visual scenes the memory mechanism better avoids the forgetting caused by overly long sequences.
  5. The method for constructing a visual time sequence decision model based on a selective memory mechanism according to claim 1, wherein the sequence processing capability of self-attention is used to enhance accuracy when generating complex data.
  6. The method for constructing a visual time sequence decision model based on a selective memory mechanism according to claim 1, wherein the four steps are combined in the described manner to achieve the optimal effect from visual input to output decision action.
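The fixed-length, initially empty memory pool of Step 2) can be sketched as follows. This is only an illustration under stated assumptions: the claims do not specify how the key feature recognition network scores an observation, so `key_score` substitutes a simple novelty measure (one minus the highest cosine similarity to any stored feature). The class name `SelectiveMemoryPool`, the `threshold` parameter, and the novelty rule are all hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SelectiveMemoryPool:
    """Fixed-length pool that starts empty and keeps only 'key' features."""

    def __init__(self, capacity, threshold=0.5):
        # deque(maxlen=...) drops the oldest entry once full (sliding window).
        self.pool = deque(maxlen=capacity)
        self.threshold = threshold

    def key_score(self, feature):
        # Assumption: novelty stands in for the key feature recognition network.
        if not self.pool:
            return 1.0
        return 1.0 - max(cosine(feature, stored) for stored in self.pool)

    def observe(self, feature):
        """Store the feature if it is judged key; return whether it was stored."""
        if self.key_score(feature) >= self.threshold:
            self.pool.append(list(feature))
            return True
        return False


if __name__ == "__main__":
    pool = SelectiveMemoryPool(capacity=2)
    for feature in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]):
        print(feature, "stored" if pool.observe(feature) else "skipped")
```

Because near-duplicate observations score low and are skipped, the pool fills with dissimilar features only, which is one simple way to keep memory bounded; a learned recognition network could replace `key_score` without changing the pool logic.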

Description

Method for constructing visual time sequence decision model based on selective memory mechanism

Technical Field

The invention relates to the technical field of computer vision and deep learning, for example making optimal decisions by processing a series of visual observations, and in particular to a visual time sequence decision model based on a selective memory mechanism.

Background

Efficient visual time sequence decision-making is a core capability of advanced artificial intelligence systems; the task aims to generate an optimal policy sequence from continuous visual observations. Applications of visual time sequence decision models now range from autonomous driving and unmanned system interaction to complex industrial process control. In these applications the model must not only understand the current visual input but also make target-oriented decisions by processing time sequence information in diverse environments.

Conventional time sequence decision schemes often rely on explicit state space modeling and heuristic search algorithms. Common strategies include traversal based on frontier regions and path planning based on random sampling. Although some rule-based methods are stable under complex rule constraints, they still have significant limitations on large-scale unstructured decision tasks: the computational complexity grows exponentially with the search space, preset rules can hardly cover all states of a dynamic environment, and the system lacks adaptive learning capability, so the decision strategy is not sufficiently robust.

Among data-driven approaches, deep reinforcement learning is a common way to solve time sequence decision problems.
Despite some success on partially structured tasks, this approach still suffers from low efficiency: because reward signals in complex tasks are extremely sparse, models are difficult to converge, and the cost of transferring and adapting between different task scenarios is high.

In recent years, the Transformer has shown a strong ability to use historical information as an advanced time sequence decision model. Its "memory", however, is a dynamic, content-retrieval-based mechanism implemented through attention rather than static storage. Specifically, while processing the current input, the model can dynamically attend to any historical position in the sequence, forming a "soft memory" by computing the degree of association between the current feature and all historical features. This mechanism has significant limitations in practice: (1) Context window limitation: only a sequence of limited length can be memorized, and information beyond that length is truncated. (2) No explicit memory storage: unlike an LSTM, which has a cell state, the "memory" is only implied by the attention weights. In the field of embodied intelligence, this capability enables an agent to make consistent and efficient decisions from continuous visual observations, but it still cannot cope with very long time sequence tasks.

In decision-making fields such as large-scale time sequence prediction and long-range logical reasoning, there is broad demand for long-term memory capacity. Although researchers at a university in eastern China have proposed a memory Transformer method, that method does not screen key information and is highly prone to memory explosion, which seriously affects decision-making efficiency.
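The "soft memory" and the context window limitation described above can be made concrete with a small sketch. It assumes plain scaled dot-product attention with no learned projections, where the historical features serve as both keys and values; the function name `soft_memory_readout` and the `window` parameter are hypothetical and used only for illustration.

```python
import math


def soft_memory_readout(current, history, window):
    """Attend from the current feature over the last `window` historical features.

    Features older than the window are never seen, which mirrors the
    context window limitation: their information is simply cut off.
    """
    if window < 1 or not history:
        raise ValueError("need a non-empty history and window >= 1")
    visible = history[-window:]
    scale = math.sqrt(len(current))
    # Degree of association between the current feature and each visible feature.
    scores = [sum(q * k for q, k in zip(current, past)) / scale for past in visible]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # The "memory" is only this weighted mix; nothing is stored explicitly.
    readout = [
        sum(w * past[i] for w, past in zip(weights, visible))
        for i in range(len(current))
    ]
    return readout, weights


if __name__ == "__main__":
    history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    print(soft_memory_readout([1.0, 0.0], history, window=4))
    print(soft_memory_readout([1.0, 0.0], history, window=3))
```

With the full window the oldest feature, which matches the current one, receives the largest weight; with the shorter window it falls outside the visible history and contributes nothing to the readout.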
The existing Transformer architecture has difficulty accurately screening key information from highly similar or repeated visual features under limited computing resources, which easily leads to inefficient or even erroneous decisions.

Disclosure of Invention

The invention provides a visual time sequence decision model based on a selective memory mechanism. The technical scheme of the invention is as follows:

Step 1, taking time-sequence-continuous image frames (comprising the current observation and historical observations) acquired by a vision system as input, extracting high-dimensional visual feature vectors through a feature extraction network (such as EfficientNet), and transmitting them to a time sequence encoder.

Step 2, constructing a selective memory pool with dynamic updating capability. A key feature recognition network evaluates the criticality of the observation features at each time step, and only key features with high information gain are stored in the memory pool. A cross-attention-based memory processing mechanism deeply associates the current observation with the historical information in the memory pool, extracts a time sequence feature sequence containing global semantic information, and generates a global visu