CN-120823533-B - Reservoir safety intelligent inspection method and system based on YOLO and VLM fusion


Abstract

The invention discloses a reservoir safety intelligent inspection method and system based on YOLO and VLM fusion. The method comprises the following steps: S1, multi-source data acquisition and preprocessing, in which image/video data are acquired by an unmanned aerial vehicle and ground equipment and subjected to noise reduction, enhancement and spatio-temporal alignment; S2, improved YOLO target detection, in which the network structure and training strategy are optimized; S3, VLM post-link analysis, in which target/scene association judgment is performed by a VLM; and S4, report generation. The cross-modal semantic understanding, zero-shot reasoning and global video analysis capabilities of the visual language large model (VLM) allow it to serve as a post-processing tool for YOLO and to work in parallel with YOLO in an integrated manner, so that the whole 'target detection-semantic analysis-report generation' workflow is made intelligent, the problems in the prior art are addressed, and the comprehensiveness of the hydraulic engineering safety monitoring system is enhanced.

Inventors

LI ZHIZHEN
HUANG ZHIJIAN
ZHANG WEN
YANG JUN
Peng Xuekang

Assignees

江西省水投江河信息技术有限公司

Dates

Publication Date: 20260508
Application Date: 20250917

Claims (7)

1. A reservoir safety intelligent inspection method based on YOLO and VLM fusion, characterized by comprising the following steps: S1, multi-source data acquisition and preprocessing: collecting image/video data with an unmanned aerial vehicle and ground equipment, and performing noise reduction, enhancement and spatio-temporal alignment on the data; S2, improved YOLO target detection: optimizing the network structure and training strategy to achieve high-precision detection of dam body crack and floater targets in unmanned aerial vehicle scenes; S3, VLM post-link analysis: extracting the visual feature vector of each target from step S2, extracting the global visual feature vector of the whole image, constructing a text dictionary containing the core semantic concepts of reservoir inspection, and using the VLM to perform target/scene association judgment, zero-shot target recognition and video temporal understanding; S4, report generation: based on the detection results and the semantic analysis, generating a structured report containing hidden danger information, risk assessment and treatment suggestions; wherein step S2 specifically comprises: on the basis of the YOLOv architecture, replacing convolutions in the backbone part with depthwise separable convolutions to reduce the number of parameters, and adding an attention module in the back part to enhance feature extraction for small-size dam body crack targets; constructing a dedicated reservoir inspection data set containing dam body cracks and floaters, and training the model with a cosine annealing learning rate strategy combined with a Focal Loss function until the accuracy on the validation set is stable; inputting the preprocessed data set into the improved YOLO model, setting reasonable confidence and IoU thresholds, removing duplicate detection boxes by non-maximum suppression, thereby detecting dam body cracks, floaters and offenders, and outputting the target category, bounding box coordinates and confidence; and wherein step S3 specifically comprises: scene association judgment, namely calculating the cosine similarity between the visual feature vector of the target ROI and the text embedding vector of a key region of the scene, and judging that the target is associated with the scene when the similarity reaches a set threshold; zero-shot target recognition, namely extracting the ROI visual feature vector of an undetected unknown target, calculating the distance between it and the text embedding vectors of the candidate categories in the text dictionary, and selecting the category with the minimum distance as the recognition result, without additional training; and video temporal understanding, namely analyzing the target motion trajectory and state changes across the detection results of consecutive video frames, and outputting temporal information about the target.
2. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to claim 1, wherein step S1 specifically comprises: data acquisition, namely using a multi-rotor unmanned aerial vehicle carrying an industrial camera to acquire images/videos of the reservoir dam body, water surface, drainage port and surrounding dangerous areas, supplementing these with data on the dam bottom and gate areas acquired by ground high-definition fixed cameras, and recording the time stamp and GPS coordinates of each frame during acquisition; noise reduction, namely removing high-frequency noise from the original images/videos with a Gaussian filtering algorithm, and removing dynamic noise from the video data by an inter-frame difference method; enhancement, namely, for complex illumination scenes, adjusting image brightness and contrast with a Retinex algorithm to highlight the key target features of dam cracks and floaters; and spatio-temporal alignment, namely spatially registering the unmanned aerial vehicle data and the ground equipment data with a RANSAC algorithm based on the time stamps and GPS coordinates, and synchronizing video frames in time by linear interpolation, to form a standardized multi-source fusion data set.
3. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to claim 1, wherein in step S3 the ROI region corresponding to each target detection bounding box is input into a pre-trained ResNet model to extract its visual feature vector, and the global visual feature vector of the whole image is also extracted; and wherein in step S3 a text dictionary containing the core semantic concepts of reservoir inspection is constructed, the text concepts are input into a pre-trained BERT model to generate text embedding vectors, and the text embedding vectors are brought to the same dimensionality as the visual feature vectors by a linear transformation so that the two can be matched.
4. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to claim 1, wherein step S4 specifically comprises: data integration, namely associating and integrating the target detection results and the VLM analysis results with the original spatio-temporal data to form structured data containing target information, scene association information and spatio-temporal information; risk assessment, namely calculating the hidden danger level from three dimensions (hidden danger severity, influence range and development trend) by a fuzzy comprehensive evaluation method; treatment suggestion generation, namely calling a preset treatment suggestion knowledge base according to the risk level and generating targeted treatment suggestions in combination with the spatio-temporal information of the hidden danger; and structured report output, namely generating a standardized report containing basic inspection information, a hidden danger list, the risk assessment and the treatment suggestions, and synchronously storing it in a database to support subsequent queries.
5. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to claim 1, wherein in the VLM post-link analysis of step S3, the cosine similarity calculation between visual features and text embeddings specifically comprises: extracting a visual feature vector from the target ROI region, and generating a text embedding vector for the text concept of a key region of the scene; unifying the dimensions of the text embedding vector and the visual feature vector through a linear transformation matrix; and calculating the cosine similarity of the two vectors, and judging that the target is associated with the key region of the scene when the similarity reaches a set threshold, thereby realizing scene association judgment for floater/drainage port and personnel/dangerous area pairs and meeting the accuracy requirements of actual inspection.
6. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to claim 1, wherein the zero-shot target recognition in step S3 specifically comprises: extracting a complete ROI region for an undetected unknown target by means of the candidate box generation module of the improved YOLO; extracting the visual feature vector of the unknown target ROI region; selecting candidate categories from the text dictionary, generating text embedding vectors for these categories and unifying their dimensionality; and calculating the distance between the unknown target feature vector and each candidate category vector, and selecting the category with the smallest distance as the recognition result; the process requires no additional training, is suitable for novel hidden danger scenes for which a data set is difficult to construct, and meets actual recognition requirements.
7. The reservoir safety intelligent inspection method based on YOLO and VLM fusion according to any one of claims 1-6, characterized by comprising the following functional layers, which cooperate in sequence: a data acquisition layer, consisting of a six-rotor unmanned aerial vehicle with a camera, GPS and time stamp recording function and ground high-definition fixed cameras, which acquires images/videos of key areas of the reservoir and transmits them to the preprocessing module; a preprocessing sub-layer, which integrates Gaussian filtering, Retinex enhancement and spatio-temporal alignment modules, performs the preprocessing operations of step S1 and outputs a standardized multi-source fusion data set; a YOLO detection layer, comprising a model training module and a real-time detection module, which receives the preprocessed data, performs the detection operations of step S2 and outputs target detection results; a VLM analysis layer, comprising a visual feature extraction module, a text embedding generation module, an association judgment module, a zero-shot recognition module and a temporal understanding module, which receives the detection results, performs the analysis operations of step S3 and outputs association, recognition and temporal information; a report generation layer, consisting of data integration, risk assessment, treatment suggestion and report output modules, which receives the analysis results and the original data, performs the operations of step S4 and outputs a standardized report; and a storage layer, which adopts an architecture combining a relational database with object storage to store the structured data, original data, model files and reports separately and to support data storage and expansion; wherein the layers cooperate in the sequence data acquisition layer, preprocessing sub-layer, YOLO detection layer, VLM analysis layer, report generation layer, and finally all data are stored in the storage layer to form a closed data-flow loop.
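
Illustrative note on claims 1, 5 and 6: the scene association judgment and the zero-shot recognition both reduce to comparing an ROI visual feature vector with text embedding vectors. The following is a minimal sketch under stated assumptions, not the patented implementation. The function names, the 0.8 threshold, the 3-dimensional toy vectors and the use of 1 minus cosine similarity as the "distance" are all assumptions; the claims only require a set threshold and a minimum distance, and real vectors would come from the ResNet and BERT models after dimension unification.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

def is_associated(roi_vec, region_vec, threshold=0.8):
    """Scene association: True when similarity reaches the threshold.
    The default of 0.8 is an assumed value, not taken from the claims."""
    return cosine_similarity(roi_vec, region_vec) >= threshold

def zero_shot_label(roi_vec, candidates):
    """Return the candidate category whose text embedding is nearest.
    Distance is 1 - cosine similarity (an assumption; the claims say
    only 'distance'). No training step is involved."""
    return min(
        candidates,
        key=lambda name: 1.0 - cosine_similarity(roi_vec, candidates[name]),
    )

# Toy vectors, made up for illustration only.
floater_roi = [0.9, 0.1, 0.0]          # visual feature of a detected floater
drainage_port_text = [1.0, 0.0, 0.0]   # text embedding of "drainage port"
text_dictionary = {
    "special aquaculture equipment": [0.0, 1.0, 0.0],
    "geological disaster precursor": [0.0, 0.0, 1.0],
}
unknown_roi = [0.1, 0.9, 0.2]          # visual feature of an unknown target

associated = is_associated(floater_roi, drainage_port_text)
label = zero_shot_label(unknown_roi, text_dictionary)
```

With these toy inputs the floater is judged to be associated with the drainage port, and the unknown target is assigned to the nearest dictionary category without any retraining.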
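
Illustrative note on step S2 of claim 1: duplicate detection boxes are removed by confidence filtering followed by non-maximum suppression. The sketch below shows one common class-wise form of that procedure in plain Python. The thresholds (0.25 and 0.5), the tuple layout and the sample boxes are assumptions for illustration; the claim only calls for "reasonable" confidence and IoU thresholds.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_threshold=0.25, iou_threshold=0.5):
    """Class-wise non-maximum suppression.
    detections: list of (label, box, confidence) tuples.
    Both threshold defaults are assumed values."""
    # Drop low-confidence boxes, then visit the rest from most to least confident.
    candidates = sorted(
        (d for d in detections if d[2] >= conf_threshold),
        key=lambda d: d[2],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap too much with an
        # already kept box of the same class.
        if all(det[0] != k[0] or iou(det[1], k[1]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Made-up detections: two overlapping crack boxes, one floater, one weak floater.
raw = [
    ("crack", (10, 10, 50, 50), 0.90),
    ("crack", (12, 12, 52, 52), 0.75),
    ("floater", (100, 100, 140, 140), 0.60),
    ("floater", (300, 300, 320, 320), 0.10),
]
kept = nms(raw)
```

Here the second crack box overlaps the first heavily and is suppressed, and the weak floater box falls below the confidence threshold, so two detections remain.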
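
Illustrative note on claim 4: a fuzzy comprehensive evaluation typically combines a weight vector over the evaluation factors with a membership matrix over the risk grades, then picks the grade with the highest combined membership. The sketch below uses the weighted-average operator, which is one common choice and an assumption here. The three factors come from the claim; the grade names, weights and membership values are invented for illustration and are not given in the source.

```python
GRADES = ["low", "medium", "high"]  # assumed grade set

# Assumed factor weights (they sum to 1).
weights = {
    "severity": 0.5,
    "influence_range": 0.3,
    "development_trend": 0.2,
}

# Assumed membership of one hidden danger in each grade, per factor.
membership = {
    "severity": [0.1, 0.3, 0.6],
    "influence_range": [0.2, 0.5, 0.3],
    "development_trend": [0.6, 0.3, 0.1],
}

def fuzzy_evaluate(weights, membership, grades):
    """Weighted-average fuzzy evaluation.
    Returns the grade with the maximum combined membership and the
    combined membership vector itself."""
    scores = [
        sum(weights[f] * membership[f][j] for f in weights)
        for j in range(len(grades))
    ]
    best = max(range(len(grades)), key=lambda j: scores[j])
    return grades[best], scores

level, scores = fuzzy_evaluate(weights, membership, GRADES)
```

With these invented numbers the combined memberships are roughly 0.23, 0.36 and 0.41, so the maximum-membership rule selects the highest grade.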

Description

Reservoir safety intelligent inspection method and system based on YOLO and VLM fusion

Technical Field

The invention relates to the technical field of hydraulic engineering safety monitoring, in particular to a reservoir safety intelligent inspection method and system based on YOLO and VLM fusion.

Background

A reservoir is a core hub for water resource regulation, and its safe operation bears directly on basin flood control safety, water supply and ecological stability. Traditional reservoir safety inspection relies on manual foot patrols, fixed-point observation and periodic aerial photography, and suffers from inherent defects such as insufficient spatio-temporal coverage, delayed identification of hidden dangers and high labor cost. With the deep integration of unmanned aerial vehicle technology and computer vision, unmanned aerial vehicle inspection has become an important means of reservoir safety monitoring thanks to its wide coverage and flexibility. In the prior art, however, the processing of unmanned aerial vehicle inspection data still has obvious limitations. First, relying on the YOLO algorithm alone achieves rapid target detection but cannot complete scene judgments such as whether a floater is close to a drainage port. Second, reservoir scenes contain many types of hidden danger, and for some novel targets (such as special aquaculture equipment and novel geological disaster precursors) samples are so scarce that a complete data set is difficult to construct, so the generalization ability of traditional detection models is insufficient. Third, unmanned aerial vehicle inspection data and ground monitoring data lack semantic-level association, which makes it difficult to generate a structured report containing in-depth analysis.
Disclosure of Invention

To overcome the above defects of the prior art, embodiments of the invention provide a reservoir safety intelligent inspection method and system based on YOLO and VLM fusion. The cross-modal semantic understanding, zero-shot reasoning and global video analysis capabilities of a visual language large model (VLM) are used so that the VLM serves as a post-processing tool for YOLO, with the two fused and working in parallel. This makes the whole 'target detection-semantic analysis-report generation' workflow intelligent, addresses the problems described in the Background, and enhances the comprehensiveness of the hydraulic engineering safety monitoring system.
To achieve the above aim, the invention provides the following technical scheme: a reservoir safety intelligent inspection method and system based on YOLO and VLM fusion, comprising the following steps: S1, multi-source data acquisition and preprocessing: collecting image/video data with an unmanned aerial vehicle and ground equipment, and performing noise reduction, enhancement and spatio-temporal alignment on the data; S2, improved YOLO target detection: optimizing the network structure and training strategy to achieve high-precision detection of dam body crack and floater targets in unmanned aerial vehicle scenes; S3, VLM post-link analysis: extracting the visual feature vector of each target from step S2, extracting the global visual feature vector of the whole image, constructing a text dictionary containing the core semantic concepts of reservoir inspection, and using the VLM to perform target/scene association judgment, zero-shot target recognition and video temporal understanding; S4, report generation: based on the detection results and the semantic analysis, generating a structured report containing hidden danger information, risk assessment and treatment suggestions. In a preferred embodiment, step S1 specifically comprises: (1) data acquisition, namely using a multi-rotor unmanned aerial vehicle carrying an industrial camera to acquire images/videos of the reservoir dam body, water surface, drainage port and surrounding dangerous areas, supplementing these with data on the dam bottom and gate areas acquired by ground high-definition fixed cameras, and recording the time stamp and GPS coordinates of each frame during acquisition; (2) noise reduction, namely removing high-frequency noise from the original images/videos with a Gaussian filtering algorithm, and removing dynamic noise o