CN-121999195-A - Infrared point target identification method and system based on cognitive fusion

CN121999195A

Abstract

The invention provides a recognition method and system for infrared point targets based on cognitive fusion. The recognition method comprises: obtaining, for the continuous-frame image samples in a dataset, the language description, high-level semantic features, and word vectors of target semantic categories, together with the space-frequency characteristic data of the samples after time-dimension frequency-domain analysis; obtaining language high-level semantic feature codes; obtaining visual high-level semantic feature codes; obtaining the similarity of the language and visual high-level semantic feature codes, and obtaining from that similarity the distance loss between the two; performing cognitive-fusion interactive training on the language semantic coding model and the visual semantic coding model to obtain an updated visual-language model; and determining the category of the infrared point target to be detected according to the recognition requirement and inputting it into the updated visual-language model, thereby realizing cognitive-fusion recognition of infrared point targets.

Inventors

  • Song Zizhuang
  • Zhou Jie
  • Zhang Qiang
  • Lou Yaxin
  • Hou Qiwen

Assignees

  • 航天科工集团智能科技研究院有限公司 (CASIC Intelligent Technology Research Institute Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2024-11-05

Claims (12)

  1. An infrared point target identification method based on cognitive fusion, characterized by comprising the following steps: acquiring, based on an infrared point target continuous-frame image dataset, prompt engineering, and prior information, the language description, high-level semantic features, and word vectors of target semantic categories of the samples, together with the space-frequency characteristic data of the samples after time-dimension frequency-domain analysis, wherein the space-frequency characteristic data serves as the cognitive-domain features; inputting the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples into a language semantic coding model to obtain language high-level semantic feature codes; taking the continuous-frame image samples as the features to be extracted in the perception domain, and inputting the cognitive-domain features together with the perception-domain features to be extracted into a visual semantic coding model for preliminary fusion of the perceived features, obtaining visual high-level semantic feature codes; obtaining the similarity of the language high-level semantic feature codes and the visual high-level semantic feature codes, and obtaining from it the distance loss between the two; performing cognitive-fusion interactive training on the language semantic coding model and the visual semantic coding model based on this distance loss to obtain an updated visual-language model; and determining the category of the infrared point target to be detected according to the identification requirement, and inputting the infrared point target to be detected and its category into the updated visual-language model to realize cognitive-fusion identification of the infrared point target.
  2. The method of claim 1, wherein acquiring, based on the infrared point target continuous-frame image dataset, the prompt engineering, and the prior information, the language description, high-level semantic features, and word vectors of target semantic categories of the samples, together with the space-frequency characteristic data after time-dimension frequency-domain analysis, comprises: performing a preliminary judgment on the continuous-frame image samples based on the prior information to obtain candidate keywords for each frame image sample; scoring each candidate keyword for confidence, screening the candidate keywords based on the scores, and obtaining a unified expression of the credible language description keywords of each frame image sample; using prompt engineering to provide specific language prompts or clues for this unified expression so as to enhance the language expression, obtaining a sentence containing the language description, high-level semantic features, and target semantic category of each frame image sample; converting these sentences, by word-vector coding, into the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples in the dataset; and performing time-dimension frequency-domain analysis on the continuous-frame image samples to obtain the space-frequency characteristic data.
  3. The method of claim 1, wherein the candidate keyword is a target gray scale, a target category, or a target size.
  4. The method of claim 2, wherein the unified expression of the credible language description keywords of each frame image sample is obtained by: W_key = {W_1, W_2, …, W_N}; wherein W_key is the unified expression of the credible language description keywords of each frame image sample, W_i is a candidate keyword with confidence higher than the screening threshold, and N is the number of screened keywords.
  5. The method of claim 2, wherein the sentence containing the language description, high-level semantic features, and target semantic category of each frame image sample is obtained by: P(W_key) = L_description(W_key) + L_semantics(W_key) + L_category(W_key); wherein P(W_key) is the sentence containing the language description, high-level semantic features, and target semantic category of each frame image sample, L_description(W_key) is the language description prompt sentence, L_semantics(W_key) is the high-level semantic feature prompt sentence, and L_category(W_key) is the target semantic category prompt sentence.
  6. The method of claim 2, wherein the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples in the dataset are obtained by: W_embedding(P(W_key)) = D(P(W_key)); wherein W_embedding(P(W_key)) is the word vector containing the language description, high-level semantic features, and target semantic category of the continuous-frame image samples, P(W_key) is the sentence containing the language description, high-level semantic features, and target semantic category of each frame image sample, and D(·) is the word-vector dictionary coding operation.
  7. The method of claim 1, wherein inputting the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples into the language semantic coding model to obtain language high-level semantic feature codes comprises: the input coding layer of the language semantic coding model processes the input word-vector data through dimension conversion and position and category coding to obtain the pre-coding features of the language semantic coding model; the feature extraction layer of the language semantic coding model extracts the key features in these pre-coding features through a Transformer to obtain the key semantic features of the language semantic coding model; and the feature mapping layer of the language semantic coding model refines and fuses these key semantic features through a mapping transformation to obtain the language high-level semantic feature codes.
  8. The method of claim 1, wherein taking the continuous-frame image samples as the features to be extracted in the perception domain, inputting the cognitive-domain features together with the perception-domain features to be extracted into the visual semantic coding model for preliminary fusion of the perceived features, and obtaining the visual high-level semantic feature codes comprises: the input coding layer of the visual semantic coding model processes the input cognitive-domain features and perception-domain features to be extracted through dimension conversion and position and category coding to obtain the space-time and space-frequency pre-coding features of the visual semantic coding model; the feature extraction layer of the visual semantic coding model extracts the key features in these pre-coding features through a Transformer and performs space-time and space-frequency feature fusion learning based on a cross-attention mechanism to obtain the key semantic features of the visual semantic coding model; and the feature mapping layer of the visual semantic coding model refines and fuses these key semantic features through a mapping transformation to obtain the visual high-level semantic feature codes.
  9. The method of claim 1, wherein the distance loss between the language high-level semantic feature codes and the visual high-level semantic feature codes is obtained by: L_dis(I, T) = −log( exp(S(I, T)/τ) / ( exp(S(I, T)/τ) + Σ_{j=1}^{M} exp(S(I, T_j)/τ) ) ); wherein L_dis(I, T) is the distance loss between the language and visual high-level semantic feature codes, S(I, T) is the similarity between same-class target language and visual high-level semantic feature codes, I is the visual high-level semantic feature code, T is the language high-level semantic feature code matched with the visual code, S(I, T_j) is the similarity between different-class target language and visual high-level semantic feature codes, T_j is a language high-level semantic feature code not matched with the visual code, M is the number of unmatched language high-level semantic feature codes, τ is a temperature parameter, exp(·) is the exponential function, and log(·) is the natural logarithm.
  10. The method of claim 1, wherein determining the category of the infrared point target to be detected according to the recognition requirement, and inputting the infrared point target to be detected and its category into the updated visual-language model to realize cognitive-fusion recognition of the infrared point target, comprises: determining the category of the infrared point target to be detected according to the recognition requirement; inputting this category into the updated language semantic coding model to update the word vectors; and recognizing the infrared point target to be detected based on the updated visual semantic coding model, wherein the updated visual-language model comprises the updated language semantic coding model and the updated visual semantic coding model.
  11. An infrared point target recognition system based on cognitive fusion, characterized by comprising: a cognitive-domain feature representation module, used for acquiring, based on the infrared point target continuous-frame image dataset, prompt engineering, and prior information, the language description, high-level semantic features, and word vectors of target semantic categories of the samples, together with the space-frequency feature data of the samples after time-dimension frequency-domain analysis, wherein the space-frequency feature data serves as the cognitive-domain features; a cognitive feature fusion training module, used for inputting the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples into the language semantic coding model to obtain the language high-level semantic feature codes, obtaining the visual high-level semantic feature codes, obtaining the similarity and the distance loss between the language and visual high-level semantic feature codes, and performing cognitive-fusion interactive training to obtain the updated visual-language model; and a reasoning module, used for determining the category of the infrared point target to be detected according to the identification requirement, and inputting the infrared point target to be detected and its category into the updated visual-language model to realize cognitive-fusion identification of the infrared point target.
  12. A computer device comprising a memory, a processor, and a cognitive-fusion-based infrared point target recognition program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 10 when executing the program.
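The keyword screening step recited in claims 2 and 4 (confidence scoring, threshold screening, unified expression W_key) can be sketched as below. The simple threshold rule and the list form of W_key are an assumed reading of the claim; the patent does not publish the exact screening procedure.

```python
def unified_keywords(candidates, threshold=0.5):
    """Screen candidate keywords by confidence and form the unified
    expression W_key of claims 2 and 4 (a sketch, not the patented code).

    candidates: list of (keyword, confidence_score) pairs for one frame.
    Returns the screened keywords W_i (confidence above the threshold)
    and their count N.
    """
    screened = [w for w, s in candidates if s > threshold]  # W_i above threshold
    return {"W_key": screened, "N": len(screened)}


if __name__ == "__main__":
    out = unified_keywords(
        [("dim gray point", 0.9), ("sensor noise", 0.2), ("small size", 0.7)]
    )
    print(out)
```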
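The prompt-engineering composition of claim 5, P(W_key) = L_description(W_key) + L_semantics(W_key) + L_category(W_key), amounts to concatenating three prompt sentences. The template wordings and the dictionary keys below are illustrative assumptions, not taken from the patent.

```python
def prompt_sentence(w_key):
    """Compose P(W_key) from three prompt sentences, as in claim 5.
    w_key: dict of screened keywords; keys 'gray', 'motion', 'category'
    are hypothetical names chosen for this sketch.
    """
    l_description = f"The target appears as {w_key['gray']} point in the image. "
    l_semantics = f"It shows {w_key['motion']} across consecutive frames. "
    l_category = f"Its semantic category is {w_key['category']}."
    # P(W_key) is the sum (concatenation) of the three prompt sentences
    return l_description + l_semantics + l_category


if __name__ == "__main__":
    print(prompt_sentence({"gray": "a dim", "motion": "a slow continuous drift",
                           "category": "aircraft"}))
```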
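The time-dimension frequency-domain analysis of claim 2 turns a stack of continuous frames into space-frequency (cognitive-domain) features. A minimal sketch is a per-pixel FFT along the time axis; the choice of a plain real FFT and of amplitude spectra as the feature is an assumption, since the patent does not fix the transform.

```python
import numpy as np

def space_frequency_features(frames):
    """Time-dimension frequency-domain analysis of a continuous frame stack
    (a sketch of claim 2). frames: array of shape (T, H, W).
    Returns the per-pixel amplitude spectrum over the time axis,
    shape (T//2 + 1, H, W), used as cognitive-domain features.
    """
    spectrum = np.fft.rfft(frames, axis=0)  # frequency analysis along time
    return np.abs(spectrum)                  # amplitude as the feature


if __name__ == "__main__":
    # a pixel flickering at 2 cycles over 8 frames should peak at bin 2
    frames = np.zeros((8, 4, 4))
    frames[:, 1, 1] = np.sin(2 * np.pi * 2 * np.arange(8) / 8)
    feats = space_frequency_features(frames)
    print(feats.shape)
```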
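The distance loss of claim 9, with its matched pair (I, T), M unmatched codes T_j, temperature τ, exp, and log, has the form of an InfoNCE-style contrastive loss. The sketch below reconstructs that form under the assumption that S(·,·) is cosine similarity; the exact similarity measure is not stated in the text.

```python
import numpy as np

def distance_loss(I, T, T_neg, tau=0.07):
    """InfoNCE-style distance loss L_dis(I, T) as described in claim 9
    (a reconstruction, not the patented implementation).

    I     : visual high-level semantic feature code (1-D vector)
    T     : matched language high-level semantic feature code
    T_neg : list of M unmatched language codes T_j
    tau   : temperature parameter
    """
    def sim(a, b):
        # cosine similarity assumed as the measure S(., .)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(sim(I, T) / tau)
    neg = sum(np.exp(sim(I, t_j) / tau) for t_j in T_neg)
    return -np.log(pos / (pos + neg))


if __name__ == "__main__":
    I = np.array([1.0, 0.0])
    print(distance_loss(I, np.array([1.0, 0.0]), [np.array([0.0, 1.0])]))
```

A matched pair with high similarity yields a loss near zero, while a mismatched pair yields a large loss, which is the gradient signal driving the cognitive-fusion interactive training of claim 1.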

Description

Infrared point target identification method and system based on cognitive fusion

Technical Field

The invention relates to the technical field of computer vision and image recognition, in particular to a method and system for recognizing infrared point targets based on cognitive fusion.

Background

Image recognition is a basic and core task in the field of computer vision and has developed rapidly in recent years; the performance of traditional deep learning methods can be further improved by fusing perception features, such as those of the time and space dimensions, with cognitive features such as language and target characteristics.

At the level of target recognition methods, traditional approaches compute statistics of the target's apparent characteristics, extract key features such as mean, variance, contour, and area, and use manually designed thresholds or simple machine learning methods such as AdaBoost, SVM, or BP neural networks as the recognition means. However, they require certain prior knowledge of the target as a design reference, are limited by extensive threshold parameter tuning, and cannot adapt to large dynamic scene changes. Existing deep learning methods adjust a large number of model parameters through multi-level convolution operations, ground-truth labels, and feedback optimization, extracting the key characteristics of the target in a learned manner to complete the recognition task. However, they require a large number of labeled samples for training; acquiring a fully labeled dataset in practice demands substantial manual annotation cost, and a trained model performs well only in scenes that differ little from the test set, with insufficient adaptability to complex and diverse scenes.
In particular, at the level of the infrared point target recognition task: detecting an infrared point target at long range and identifying it effectively is the goal pursued by infrared detection systems. Under normal conditions, however, an infrared point target suffers from strong background noise, few target pixels, and weak gray scale, so the extractable feature dimensions are limited, the key characteristics of the target cannot be extracted, and infrared point target identification is therefore difficult.

Disclosure of Invention

The invention provides a cognitive-fusion-based infrared point target identification method and system, which can solve the technical problem of low accuracy of infrared point target identification in the prior art. According to one aspect of the invention, there is provided a cognitive-fusion-based infrared point target recognition method, the method comprising: acquiring, based on an infrared point target continuous-frame image dataset, prompt engineering, and prior information, the language description, high-level semantic features, and word vectors of target semantic categories of the samples, together with the space-frequency characteristic data of the samples after time-dimension frequency-domain analysis, wherein the space-frequency characteristic data serves as the cognitive-domain features; inputting the word vectors containing the language description, high-level semantic features, and target semantic categories of the continuous-frame image samples into a language semantic coding model to obtain language high-level semantic feature codes; taking the continuous-frame image samples as the features to be extracted in the perception domain, and inputting the cognitive-domain features together with the perception-domain features to be extracted into a visual semantic coding model for preliminary fusion of the perceived features, obtaining visual high-level semantic feature codes; obtaining the similarity of the language high-level semantic feature codes and the visual high-level semantic feature codes, and obtaining from it the distance loss between the two; performing cognitive-fusion interactive training on the language semantic coding model and the visual semantic coding model based on this distance loss to obtain an updated visual-language model; and determining the category of the infrared point target to be detected according to the identification requirement, and inputting the infrared point target to be detected and its category into the updated visual-language model to realize cognitive-fusion identification of the infrared point target. Preferably, the acquiring, based on the infrared point target continu