CN-122024127-A - Automatic evaluation attribution method for emergency exercise capacity based on visual language model
Abstract
The invention discloses an automatic evaluation and attribution method for emergency exercise capability based on a visual language model, and relates to the technical field of emergency management. The method collects exercise video data, separates visual, text and environmental data, and aligns them in space and time; extracts visual action features and text semantic features with CLIP and BERT models and combines them with one-hot environmental risk vectors to construct a four-dimensional feature set; inputs the four-dimensional feature set into an LLM-VLM collaborative reasoning framework fine-tuned with PEFT-LoRA, which outputs a preliminary evaluation result by relying on a proprietary template knowledge base and a collaborative evaluation mechanism; calculates a target evaluation result using dynamic weights; and constructs a causal relationship graph from the four-dimensional feature set, tracing the root causes of unqualified dimensions backward through a causal GNN and quantifying the influence proportion of each. Through LLM-VLM collaboration, the invention achieves automated, accurate, stable and traceable evaluation of emergency exercise capability, and has wide application value.
Inventors
- AN WENJUAN
- CHEN HANLIN
- CHEN HUAWEI
- LI YUANZHE
- ZHANG XUGUANG
- ZHANG RUIJIE
- LI ZHIFENG
- YU LE
- YANG JIANG
- YANG MAO
- ZHOU QIAN
Assignees
- 重庆建筑工程职业学院
- 招商局重庆交通科研设计院有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-29
Claims (10)
- 1. An automatic evaluation and attribution method for emergency exercise capability based on a visual language model, characterized by comprising the following steps: collecting video data during the emergency exercise, separating visual data, text data and environmental data from the video data, and performing space-time alignment on the data to construct a four-dimensional feature set; inputting the four-dimensional feature set into an LLM-VLM collaborative reasoning framework fine-tuned for the emergency exercise field, and outputting a preliminary evaluation result through LLM-VLM collaborative reasoning; quantifying the preliminary evaluation result into scores through a multi-dimensional evaluation system, and calculating a target evaluation result by combining the dynamic weights output by reinforcement learning; and constructing an emergency exercise causal relationship graph based on the four-dimensional feature set and, for each dimension whose comprehensive capability level is unqualified, tracing the influence paths backward through a causal GNN and quantifying the influence proportion of each root cause, thereby completing deep attribution.
- 2. The method of claim 1, wherein separating visual data, text data and environmental data from the video data comprises: separating the visual data, which includes video stream data and image data, by frame extraction and format recognition; extracting on-screen text from the video data by OCR and converting speech data into text by ASR, and aggregating these with the exercise scheme documents and preset evaluation indexes into the text data; and screening data carrying environmental parameter labels and aggregating the environmental factors in the video data into the environmental data.
- 3. The method according to claim 1, wherein constructing the four-dimensional feature set specifically comprises: based on the visual data, extracting action feature vectors of the key frames of the visual data through a CLIP model, and constructing a two-dimensional feature set of space-time and action; based on the text data, encoding the text through a BERT model, outputting semantic feature vectors representing the core meaning of text instructions and dialogues, and combining them with the two-dimensional feature set to construct a three-dimensional feature set of space-time, action and instruction; converting the discretized environmental risk level into one-hot vectors as environmental risk features, and combining them with the three-dimensional feature set to construct the four-dimensional feature set of space-time, action, instruction and environment; inputting the preprocessed visual features, text semantic features and environmental risk features, and calculating the association weights between the features of different modalities through a Transformer-based space-time attention network; presetting an attention weight threshold $\theta$, screening, based on the attention weights, the features whose association degree is not less than $\theta$, and combining the space-time information to generate the four-dimensional feature set of space-time, action, instruction and environment, with the formula: $$F_t = \{A_t, I_t, E_t, P_t\}$$ wherein $A_t$ represents the action feature vector at time $t$, $I_t$ represents the instruction semantic vector at time $t$, $E_t$ represents the environmental state vector at time $t$, and $P_t$ represents the spatial position coordinates at time $t$.
- 4. The method of claim 1, wherein the LLM-VLM collaborative reasoning framework fine-tuned for the emergency exercise field is composed of a large language model (LLM) module and a visual language model (VLM) module, the LLM module being used for understanding text instructions, generating an evaluation reasoning chain and outputting text evaluation conclusions; the LLM and the VLM are fine-tuned for the emergency exercise field through the LoRA technique of PEFT; a proprietary template knowledge base is constructed, which enables the model to reason adaptively across different exercise scenarios and specifies the evaluation dimensions and the data input format; and a collaboration mechanism of VLM semantic conversion, LLM logical reasoning and VLM verification feedback is constructed, on the basis of which the preliminary evaluation result is output and verified.
- 5. The method of claim 4, wherein the fine-tuning for the emergency exercise field is performed on emergency exercise samples that contain the four-dimensional feature set and expert-labeled preliminary evaluation results; through the LoRA technique, a low-rank adapter is inserted into the attention weight matrices of the model's Transformer layers so that the weight update is decomposed into the product of two low-rank matrices, with the formula: $$W_Q X_Q = (W_Q^{0} + B_Q A_Q)\,X_Q,\quad W_K X_K = (W_K^{0} + B_K A_K)\,X_K,\quad W_V X_V = (W_V^{0} + B_V A_V)\,X_V$$ wherein $X_Q$, $X_K$, $X_V$ are the input feature matrices of the Transformer-layer attention; $W_Q$, $W_K$, $W_V$ are the respective attention weight matrices; $W_Q^{0}$, $W_K^{0}$, $W_V^{0}$ are the respective original attention weight matrices; $A_Q$, $A_K$, $A_V$ are the respective input low-rank matrices; and $B_Q$, $B_K$, $B_V$ are the respective output low-rank matrices; by training the $A$ and $B$ matrices, domain adaptation and parameter-efficient fine-tuning are achieved.
- 6. The method of claim 4, wherein constructing the proprietary template knowledge base specifically comprises: dividing templates by exercise type, each template comprising a scene background description, evaluation dimension definitions and a data input format, wherein the data input format section requires the model to output evaluation results for the corresponding dimensions based on the input four-dimensional feature set; automatically matching the corresponding template in the knowledge base via the exercise type field in the exercise scheme document; and filling the preprocessed four-dimensional feature set into the format specified by the template to generate an input prompt adapted to the current exercise scenario, so that the model can directly parse the multi-modal associated information.
- 7. The method of claim 4, wherein the collaboration mechanism comprises: deploying a VLM semantic conversion sub-module that converts the four-dimensional feature set into a natural language description the LLM can understand; the LLM receives the visual semantic description together with the scene background and evaluation dimension definitions in the prompt, performs logical reasoning to judge whether each dimension meets the qualification standard, supplements implicit risk information, generates an optimized prompt containing the preliminary single-dimension evaluation results and the reasoning basis, and feeds the optimized prompt back to the VLM; the VLM receives the optimized prompt and calls the original images or video frames to visually verify the LLM's reasoning result; if the verification passes, the final preliminary evaluation result is output; if it fails, the verification opinion is fed back to the LLM, which reasons again until the results are consistent.
- 8. The method of claim 1, wherein the multi-dimensional evaluation system targets the rapid response, standardized operation, efficient coordination and risk control of emergency exercises, evaluates the response speed, operation compliance, coordination efficiency and risk prevention and control effectiveness of the exercise, sets a plurality of quantifiable refined indexes under each evaluation dimension, and formulates general qualification standards; and through reinforcement learning, a state space is constructed from the risk level of the exercise scenario, the personnel roles and the exercise stage to determine the influence factors, dynamic weight allocation is realized, and the target evaluation result is calculated by combining the dynamic weights with the quantized scores of the preliminary evaluation result.
- 9. The method of claim 1, wherein the deep attribution specifically comprises: constructing the causal relationship graph based on the four-dimensional feature set and the inherent associations among its fields; mapping the unqualified dimension to a target node in the causal relationship graph, starting causal GNN backward reasoning from the target node, traversing all upstream nodes pointing to that node, and outputting the optimized influence intensity of each upstream node on the target node; setting an influence intensity threshold and screening out the upstream nodes whose influence intensity is greater than the threshold to form effective influence paths; sorting the effective influence paths hierarchically and extracting the most upstream node of each path as a core root cause node; and calculating the sum of the optimized influence intensities of the core root causes on the unqualified dimension and quantifying the influence proportion of each root cause with a normalization formula: $$P_j = \frac{S_j}{\sum_{k=1}^{m} S_k}$$ wherein $P_j$ is the influence proportion of the $j$-th core root cause, $S_j$ is the optimized influence intensity of the $j$-th core root cause on the unqualified dimension, and $m$ is the total number of core root causes; and generating a standardized attribution report based on the unqualified dimension, the influence paths, the core root causes and the influence proportions.
- 10. The method according to claim 9, wherein constructing the causal relationship graph specifically comprises: defining four types of core nodes based on the four-dimensional feature set, namely action nodes, instruction nodes, environment nodes and space-time nodes, and supplementing attribute information for each node, the attribute data being taken directly from the four-dimensional feature set; based on the inherent associations among the fields of the four-dimensional feature set, connecting node pairs that have a causal influence to form the edges of the causal relationship graph; the association rules for the edges being: an action node and an instruction node are linked where an instruction triggers an action; an action node and an environment node are linked where the environmental state affects the execution of an action or an action causes an environmental change; an action node and a space-time node are linked where an action occurs in a specific space-time scene; and an instruction node and a space-time node are linked where an instruction is issued in a specific space-time scene; assigning an initial influence intensity to each edge based on an emergency exercise knowledge base and historical attribution data, and constructing the causal relationship graph in a Neo4j graph database by importing the node attributes, edge relations and initial influence intensities; and establishing a joint index on node type and timestamp through the index optimization function of the graph database, to improve the query efficiency of subsequent GNN reasoning.
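The low-rank weight update described in claim 5 can be illustrated with a minimal pure-Python sketch. It assumes the common LoRA convention in which the frozen weight is augmented as W = W0 + B·A and the output matrix B starts at zero; the helper names `matmul` and `lora_weight`, the 3×3 size, the rank of 1 and all numeric values are hypothetical and are not taken from the patent.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product a @ b."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for row in a]


def lora_weight(w0: Matrix, b: Matrix, a: Matrix) -> Matrix:
    """Effective weight W = W0 + B @ A; W0 is frozen, only A and B are trained."""
    delta = matmul(b, a)
    return [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]


# Hypothetical frozen 3x3 attention weight with a rank-1 adapter.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.0]]             # r x d_in: the "input" low-rank matrix
B_init = [[0.0], [0.0], [0.0]]     # d_out x r: the "output" matrix, zero at start
B_trained = [[2.0], [0.0], [1.0]]  # illustrative values after training

W_start = lora_weight(W0, B_init, A)     # identical to W0 before any training
W_tuned = lora_weight(W0, B_trained, A)  # W0 plus the low-rank correction

# Trainable parameter counts: full matrix versus the two low-rank factors.
full_params = len(W0) * len(W0[0])
lora_params = len(A) * len(A[0]) + len(B_trained) * len(B_trained[0])
```

With a zero-initialised B the adapted model starts out identical to the pretrained one, and only the entries of A and B need to be stored and trained. At realistic layer widths the saving is far larger than in this toy 3×3 case.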
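The backward tracing and normalization steps of claims 9 and 10 can be sketched as follows. This is a sketch under stated assumptions, not the claimed implementation: no causal GNN is run here, the edge strengths simply stand in for the "optimized influence intensity", and the influence of a root cause along a path is assumed to be the product of the edge strengths on that path. The function name `attribute`, the node names and all numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (upstream cause, downstream effect)


def attribute(edges: Dict[Edge, float], target: str,
              threshold: float) -> Dict[str, float]:
    """Trace backward from `target` and return each root cause's share."""
    # Reverse adjacency: effect -> [(cause, strength)]; weak edges are dropped.
    upstream: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (cause, effect), strength in edges.items():
        if strength > threshold:
            upstream[effect].append((cause, strength))

    root_strength: Dict[str, float] = defaultdict(float)

    def walk(node: str, influence: float, seen: frozenset) -> None:
        # Skip nodes already on the current path to guard against cycles.
        parents = [(c, s) for c, s in upstream.get(node, []) if c not in seen]
        if not parents and node != target:
            root_strength[node] += influence  # most-upstream node of this path
            return
        for cause, strength in parents:
            # ASSUMPTION: path influence is the product of edge strengths.
            walk(cause, influence * strength, seen | {cause})

    walk(target, 1.0, frozenset({target}))
    total = sum(root_strength.values())
    # Normalization P_j = S_j / sum_k S_k over the core root causes.
    return {n: s / total for n, s in root_strength.items()} if total else {}


# Hypothetical graph for an unqualified "response_speed" dimension.
EDGES: Dict[Edge, float] = {
    ("late_instruction", "slow_evacuation"): 0.8,
    ("blocked_exit", "slow_evacuation"): 0.1,   # below threshold, ignored
    ("slow_evacuation", "response_speed"): 0.5,
    ("smoke_density_high", "response_speed"): 0.6,
}

shares = attribute(EDGES, "response_speed", threshold=0.2)
```

In this toy graph the two surviving root causes are `late_instruction` (path strength 0.5 × 0.8) and `smoke_density_high` (0.6), so their shares come out at roughly 40% and 60%. In the claimed method the edge strengths would come from the causal GNN, and the graph itself from the Neo4j database.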
Description
Automatic evaluation attribution method for emergency exercise capacity based on visual language model
Technical Field
The invention relates to the technical field of emergency management, and in particular to an automatic evaluation and attribution method for emergency exercise capability based on a visual language model, which integrates a visual language model, a large language model, a causal graph neural network and reinforcement learning in order to optimize and improve the emergency management system.
Background
As the modernization of China's emergency management system and capabilities advances, the complexity, coupling and uncertainty of emergencies are becoming increasingly prominent. Emergency exercises, as a core means of verifying the validity of plans, improving team coordination and strengthening the joint response of government bodies, enterprises and public institutions, are therefore increasingly important. Existing emergency exercise evaluation relies mostly on on-site expert scoring and post-exercise review, which suffers from long evaluation cycles, strong subjectivity, vague quantitative indexes and fragmented data collection, and can hardly meet the evaluation needs of large-scale, routine emergency exercises. Meanwhile, the rapid iteration of technologies such as artificial intelligence, computer vision and natural language processing provides technical support for the intelligent transformation of emergency exercise evaluation.
The Chinese patent with publication No. CN 119694006A discloses an emergency exercise evaluation and analysis system and method which, based on deep-learning video data analysis, samples key frames from the emergency exercise action video of a target person, performs action semantic analysis on each sampled key frame, and builds a contextual semantic association representation of the feature jump degree from the action semantic analysis features, thereby intelligently generating an exercise evaluation and analysis report. However, that scheme analyzes only a single feature, the video action; it does not achieve deep alignment and collaborative analysis of visual, text, environmental and space-time data, and it lacks deep attribution capability: it can only output an evaluation result or a simple score, can hardly trace the core root causes of unqualified dimensions and their influence proportions, and cannot provide targeted guidance for optimizing exercises.
In summary, although existing emergency exercise evaluation schemes have begun to explore artificial intelligence, they still suffer from a single evaluation dimension, a lack of attribution capability and insufficient domain adaptability. These shortcomings make it difficult for existing methods to meet the practical need for emergency exercises to shift from formality to real effectiveness, so a technical scheme providing both automatic evaluation and deep attribution is needed.
Disclosure of Invention
To address the above technical problems, the application discloses an automatic evaluation and attribution method for emergency exercise capability based on a visual language model, which specifically comprises the following steps: collecting video data during the emergency exercise, separating visual data, text data and environmental data from the video data, and performing space-time alignment on the data to construct a four-dimensional feature set; inputting the four-dimensional feature set into an LLM-VLM collaborative reasoning framework fine-tuned for the emergency exercise field, and outputting a preliminary evaluation result through LLM-VLM collaborative reasoning; quantifying the preliminary evaluation result into scores through a multi-dimensional evaluation system, and calculating a target evaluation result by combining the dynamic weights output by reinforcement learning; and constructing an emergency exercise causal relationship graph based on the four-dimensional feature set and, for each dimension whose comprehensive capability level is unqualified, tracing the influence paths backward through a causal GNN and quantifying the influence proportion of each root cause, thereby completing deep attribution.
Preferably, separating visual data, text data and environmental data from the video data comprises: separating the visual data, which includes video stream data and image data, through frame extraction and format recognition; extracting the text in the video data through OCR (optical character recognition); converting voice data into text through ASR (automatic speech recognition); summarizing the text data in combination with the exercise scheme document and preset evaluation indexes; and screening data with environment parameter labe