CN-121999352-A - Water area monitoring method and system based on semantic driving and time sequence behavior cognition
Abstract
The invention discloses a water area monitoring method and a water area monitoring system based on semantic driving and time sequence behavior cognition. By combining a multimodal segmentation large model (SAM 3) with natural-language prompt words, the method gains an open perception capability that can dynamically identify unknown foreign objects without retraining. Cross-frame tracking extracts motion features, and time sequence behavior analysis accurately identifies biological behavior intent, effectively distinguishing real anomalies from environmental interference and greatly reducing the false alarm rate. A global dynamic batching mechanism merges multiple video streams into one high-dimensional tensor for parallel inference, markedly improving GPU utilization and the system's concurrent processing capacity. Finally, an asynchronous architecture that decouples video rendering from inference keeps video output smooth even when the inference frame rate is low, eliminating the picture stutter otherwise introduced by the large model.
Inventors
- LI ZHONGXIN
- CAI JIA
- QI JIAHUI
- YU DAPENG
Assignees
- 深圳市天衍智擎科技有限公司
- 广东海洋大学深圳研究院
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-05
Claims (10)
- 1. A water area monitoring method based on semantic driving and time sequence behavior cognition, characterized by comprising the following steps: acquiring, in parallel by a plurality of independent acquisition threads, video frames shot by each camera in the water area, and storing the video frames into a memory buffer pool corresponding to each camera; reading, according to a preset period, the latest frame at the current moment from each memory buffer pool, and constructing high-dimensional tensor batch data from the read latest frames; determining a target detection prompt word corresponding to the current monitoring task, and inputting the high-dimensional tensor batch data and the target detection prompt word into a SAM3 model to obtain a target recognition result for each latest frame; performing cross-frame tracking on the targets identified in each latest frame to generate motion feature information corresponding to each target; analyzing the behavior intention of each target using its motion feature information to obtain an intention state label for each target, and generating an entity state snapshot of each latest frame from the intention state labels and the target recognition results; for any latest frame, after that frame has been stored for a preset duration, judging whether the entity state snapshot corresponding to that frame has been generated, the preset duration being smaller than the acquisition interval of the video frames; if not, acquiring a nearest-neighbor snapshot, the nearest-neighbor snapshot being the entity state snapshot corresponding to the history frame, in the memory buffer pool corresponding to that latest frame, whose time interval to that latest frame is smallest; and generating a local monitoring image based on the nearest-neighbor snapshot and that latest frame, and, after all latest frames have been processed concurrently, composing the water area monitoring result at the current moment from the obtained local monitoring images.
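The buffering and snapshot-fallback flow of claim 1 can be sketched as follows. This is a minimal illustration under assumed data shapes, not the patented implementation; the names `CameraBuffer`, `build_batch`, and the frame/snapshot representations are invented for the sketch.

```python
import threading
from collections import deque

class CameraBuffer:
    """Per-camera memory buffer pool holding recent frames and the
    entity-state snapshots produced for them (claim 1)."""
    def __init__(self, maxlen=64):
        self.frames = deque(maxlen=maxlen)   # (timestamp, frame) pairs
        self.snapshots = {}                  # timestamp -> entity-state snapshot
        self.lock = threading.Lock()

    def put_frame(self, ts, frame):
        with self.lock:
            self.frames.append((ts, frame))

    def latest_frame(self):
        with self.lock:
            return self.frames[-1] if self.frames else None

    def nearest_snapshot(self, ts):
        """Entity-state snapshot of the history frame closest in time to ts,
        used when the snapshot for the current frame is not ready yet."""
        with self.lock:
            if not self.snapshots:
                return None
            best = min(self.snapshots, key=lambda t: abs(t - ts))
            return self.snapshots[best]

def build_batch(buffers):
    """Read the latest frame from every buffer pool; in the patent these are
    stacked into one high-dimensional tensor for a single SAM3 forward pass."""
    return [b.latest_frame() for b in buffers if b.latest_frame()]
```

A render loop would call `nearest_snapshot` only when the current frame's own snapshot has not appeared within the preset duration, so display never blocks on inference.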
- 2. The method of claim 1, wherein inputting the high-dimensional tensor batch data and the target detection prompt word into the SAM3 model to obtain the target recognition result for each latest frame comprises: inputting the high-dimensional tensor batch data and the target detection prompt word into the SAM3 model to output, for each latest frame, a corresponding low-resolution feature map, the segmentation mask frames of the targets identified in that frame on the low-resolution feature map, and the confidence of each identified target; mapping the segmentation mask frames on each low-resolution feature map back to the latest frame corresponding to that feature map, to obtain the real segmentation mask frames of the targets identified in each latest frame; and composing the target recognition result of each latest frame from the real segmentation mask frames and the confidences of the targets identified in that frame.
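The mapping of a low-resolution mask back to full frame resolution in claim 2 amounts to an upscaling step. A minimal sketch using nearest-neighbor scaling (the claim does not fix the interpolation scheme, so this choice is an assumption; `upsample_mask` is an invented name):

```python
def upsample_mask(mask, out_h, out_w):
    """Nearest-neighbor upscaling of a low-resolution binary segmentation
    mask (list of 0/1 rows) to the original frame size (claim 2)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]
```

In practice a production system would likely use bilinear resizing on the GPU; the index arithmetic above only shows where each output pixel samples from.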
- 3. The method of claim 1, wherein performing cross-frame tracking on the targets identified in each latest frame to generate motion feature information corresponding to each target comprises: for any latest frame, acquiring a plurality of temporally continuous historical video frames from a target buffer pool, the target buffer pool being the memory buffer pool of the camera corresponding to that latest frame; performing cross-frame trajectory tracking analysis on the targets identified in that latest frame using the target recognition results of the historical video frames and of that latest frame, so as to allocate a globally unique ID to each target identified in that latest frame and to construct a corresponding initial motion feature sequence; normalizing the initial motion feature sequence to obtain a motion feature sequence; and composing the motion feature information of the targets identified in that latest frame from the globally unique IDs and the motion feature sequences.
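Claim 3 does not specify the normalization scheme; a common choice, shown here purely as an assumption, is per-dimension z-score normalization of the feature sequence (`normalize_sequence` is an invented name, and the four-tuple layout follows claim 4's features):

```python
import math

def normalize_sequence(seq):
    """Per-dimension z-score normalization of an initial motion feature
    sequence (claim 3); each element is a tuple such as
    (speed, steering_angle, angular_velocity, aspect_ratio)."""
    dims = list(zip(*seq))                       # transpose to per-dimension lists
    out_dims = []
    for d in dims:
        mean = sum(d) / len(d)
        std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d)) or 1.0
        out_dims.append([(x - mean) / std for x in d])
    return [tuple(vals) for vals in zip(*out_dims)]
```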
- 4. The method of claim 3, wherein performing cross-frame trajectory tracking analysis on the targets identified in any latest frame using the target recognition results of the historical video frames and of that latest frame comprises: performing, by the ByteTrack algorithm and based on the target recognition result of that latest frame and the plurality of historical video frames, cross-frame trajectory tracking on a specified target in that latest frame, to determine the contour mask of the specified target in each historical video frame and to allocate a globally unique ID to the specified target across all historical video frames, the specified target being a target identified in that latest frame; performing multi-dimensional discrete motion feature solving on the specified target according to its contour mask in each historical video frame, so as to obtain the instantaneous speed, steering angle, angular velocity and aspect ratio of the specified target in each historical video frame and in that latest frame; and composing the initial motion feature sequence corresponding to the specified target from the instantaneous speed, steering angle, angular velocity and aspect ratio of the specified target in each historical video frame and in that latest frame.
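The core of any ByteTrack-style tracker is associating current detections with existing tracks and issuing globally unique IDs to unmatched detections. The sketch below shows only a greedy IoU association step, not the full ByteTrack algorithm (which additionally splits detections by confidence and uses Kalman prediction); all names are invented.

```python
from itertools import count

_next_id = count(1)  # global ID generator, shared across all cameras

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter or 1)

def associate(tracks, detections, thresh=0.3):
    """Greedy IoU association: match each detection to the best existing
    track; unmatched detections receive a new globally unique ID."""
    assigned, used = {}, set()
    for det in detections:
        best_id, best_iou = None, thresh
        for tid, box in tracks.items():
            if tid in used:
                continue
            v = iou(det, box)
            if v > best_iou:
                best_id, best_iou = tid, v
        if best_id is None:
            best_id = next(_next_id)
        used.add(best_id)
        assigned[best_id] = det
    return assigned
```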
- 5. The method of claim 4, wherein performing multi-dimensional discrete motion feature solving on the specified target according to its contour mask in each historical video frame, to obtain the instantaneous speed, steering angle, angular velocity and aspect ratio of the specified target in each historical video frame, comprises: for any historical video frame, acquiring the historical video frame immediately preceding it as a designated video frame; determining a first centroid coordinate of the specified target in that historical video frame and a second centroid coordinate in the designated video frame, based on the contour masks of the specified target in the two frames; determining the frame interval duration between that historical video frame and the designated video frame, and calculating the instantaneous speed of the specified target in that historical video frame from the first centroid coordinate, the second centroid coordinate and the frame interval duration; calculating the steering angle of the specified target in that historical video frame from the first centroid coordinate and the second centroid coordinate; acquiring the steering angle of the specified target in the designated video frame; calculating the angular velocity of the specified target in that historical video frame from its steering angles in that historical video frame and in the designated video frame; determining the minimum circumscribed rectangle of the contour mask of the specified target in that historical video frame; and taking the aspect ratio of the minimum circumscribed rectangle as the aspect ratio of the specified target in that historical video frame.
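The geometric quantities of claim 5 can be sketched from a binary contour mask. For brevity this sketch uses the axis-aligned bounding box; the claim's "minimum circumscribed rectangle" may in practice be the rotated minimum-area rectangle (e.g. OpenCV's `cv2.minAreaRect`), so the aspect ratio here is an approximation, and `centroid_and_aspect` is an invented name.

```python
def centroid_and_aspect(mask):
    """Centroid (cx, cy) of a binary contour mask (list of 0/1 rows) and the
    width/height aspect ratio of its axis-aligned bounding box (claim 5)."""
    pts = [(r, c) for r, row in enumerate(mask)
                  for c, v in enumerate(row) if v]
    rs = [p[0] for p in pts]
    cs = [p[1] for p in pts]
    cy = sum(rs) / len(rs)
    cx = sum(cs) / len(cs)
    h = max(rs) - min(rs) + 1
    w = max(cs) - min(cs) + 1
    return (cx, cy), w / h
```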
- 6. The method of claim 5, wherein calculating the instantaneous speed of the specified target in any historical video frame from the first centroid coordinate, the second centroid coordinate and the frame interval duration comprises: calculating the instantaneous speed according to the formula v = √((x₁ − x₂)² + (y₁ − y₂)²) / Δt, where v represents the instantaneous speed of the specified target in that historical video frame, (x₁, y₁) represents the abscissa and ordinate of the first centroid coordinate, (x₂, y₂) represents the abscissa and ordinate of the second centroid coordinate, and Δt represents the frame interval duration; wherein calculating the steering angle of the specified target in that historical video frame from the first centroid coordinate and the second centroid coordinate comprises: calculating the steering angle according to the formula θ = arctan2(y₁ − y₂, x₁ − x₂), where θ represents the steering angle of the specified target in that historical video frame; and correspondingly, calculating the angular velocity of the specified target in that historical video frame from its steering angles in that historical video frame and in the designated video frame comprises: calculating the angular velocity according to the formula ω = (θ − θ′) / Δt, where ω represents the angular velocity of the specified target in that historical video frame, θ′ represents the steering angle of the specified target in the designated video frame, and Δt represents the frame interval duration.
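The discrete motion features described in claim 6 (speed from centroid displacement, steering angle, angular velocity over the frame interval) can be computed directly. The original formula images are not reproduced in this text, so the expressions below are reconstructed from the claim's symbol definitions, and `motion_features` is an invented name.

```python
import math

def motion_features(c1, c2, theta_prev, dt):
    """Instantaneous speed, steering angle and angular velocity (claim 6).
    c1: centroid in the current frame, c2: centroid in the previous
    (designated) frame, theta_prev: previous steering angle, dt: frame
    interval duration."""
    (x1, y1), (x2, y2) = c1, c2
    v = math.hypot(x1 - x2, y1 - y2) / dt       # Euclidean displacement / dt
    theta = math.atan2(y1 - y2, x1 - x2)        # heading of the displacement
    omega = (theta - theta_prev) / dt           # change of heading / dt
    return v, theta, omega
```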
- 7. The method of claim 1, wherein each camera corresponds to one acquisition thread, and the memory buffer pool corresponding to any camera comprises a to-be-displayed buffer and a to-be-inferred buffer; wherein acquiring in parallel the video frames shot by each camera in the water area and storing them into the memory buffer pool corresponding to each camera comprises: storing the video frames shot by each camera into the to-be-displayed buffer and the to-be-inferred buffer corresponding to that camera, and releasing the acquisition lock of the acquisition thread corresponding to that camera after storing, so that the video acquisition frame rate of each camera is kept at the maximum frame rate; wherein reading the latest frame at the current moment from each memory buffer pool according to the preset period comprises: reading the latest frame at the current moment from each to-be-inferred buffer according to the preset period, and constructing the high-dimensional tensor batch data from the read latest frames; and correspondingly, acquiring the nearest-neighbor snapshot comprises: reading, from the to-be-displayed buffer of the camera corresponding to any latest frame, the history frame closest in time to that latest frame as a target frame, and taking the entity state snapshot corresponding to the target frame as the nearest-neighbor snapshot.
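The dual-buffer discipline of claim 7, where the capture thread holds its lock only while copying the frame in so that capture never waits on inference, can be sketched as follows (class and method names are invented for the sketch):

```python
import threading
from collections import deque

class DualBuffer:
    """Claim 7: each camera keeps a to-be-displayed buffer and a
    to-be-inferred buffer behind one short-lived acquisition lock."""
    def __init__(self, maxlen=32):
        self.display = deque(maxlen=maxlen)   # (timestamp, frame)
        self.infer = deque(maxlen=maxlen)
        self.lock = threading.Lock()

    def capture(self, ts, frame):
        with self.lock:                       # lock held only for the append
            self.display.append((ts, frame))
            self.infer.append((ts, frame))
        # lock released immediately: capture stays at the camera's max rate

    def latest_for_inference(self):
        with self.lock:
            return self.infer[-1] if self.infer else None

    def nearest_display_frame(self, ts):
        """History frame in the display buffer closest in time to ts."""
        with self.lock:
            return min(self.display, key=lambda f: abs(f[0] - ts),
                       default=None)
```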
- 8. The method of claim 1, wherein the motion feature information corresponding to any target comprises the globally unique ID and the motion feature sequence corresponding to that target; wherein analyzing the behavior intention of each target using its motion feature information to obtain an intention state label for each target comprises: inputting the motion feature sequence in the motion feature information corresponding to each target into an intention cognition and state classification model to obtain the intention state label of each target, the intention cognition and state classification model being a trained LSTM model; and correspondingly, generating the entity state snapshot of each latest frame from the intention state labels and the target recognition results comprises: generating the entity state snapshot of each latest frame from the intention state label and the globally unique ID corresponding to each target, together with the target recognition results.
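Claim 8's classifier is a trained LSTM over the motion feature sequence. For reference, one LSTM cell step is sketched below in plain Python; in practice a framework such as PyTorch would be used, and the weight layout here (one matrix mapping the concatenated input and hidden state to the four gate pre-activations) is an assumption of this sketch.

```python
import math

def lstm_step(x, h, c, W, b):
    """One LSTM cell step. x: input vector, h/c: hidden and cell state of
    size n, W: 4n rows mapping [x; h] to gate pre-activations, b: 4n biases.
    Order of gate blocks assumed: input, forget, output, candidate."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    xh = x + h                                   # concatenated [x; h]
    n = len(h)
    pre = [sum(wij * v for wij, v in zip(row, xh)) + bi
           for row, bi in zip(W, b)]
    i = [sig(z) for z in pre[0:n]]               # input gate
    f = [sig(z) for z in pre[n:2 * n]]           # forget gate
    o = [sig(z) for z in pre[2 * n:3 * n]]       # output gate
    g = [math.tanh(z) for z in pre[3 * n:4 * n]] # candidate state
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new
```

The final hidden state after consuming the whole normalized sequence would feed a small classification head that emits the intention state label.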
- 9. A water area monitoring system based on semantic driving and time sequence behavior cognition, comprising: a distributed image acquisition and flow control unit, configured to acquire, in parallel by a plurality of independent acquisition threads, video frames shot by each camera in the water area, and to store the video frames into a memory buffer pool corresponding to each camera; a global semantic perception engine unit, configured to read, according to a preset period, the latest frame at the current moment from each memory buffer pool and to construct high-dimensional tensor batch data from the read latest frames; the global semantic perception engine unit being further configured to determine a target detection prompt word corresponding to the current monitoring task and to input the high-dimensional tensor batch data and the target detection prompt word into a SAM3 model to obtain a target recognition result for each latest frame; a time sequence behavior cognition unit, configured to perform cross-frame tracking on the targets identified in each latest frame to generate motion feature information corresponding to each target; the time sequence behavior cognition unit being further configured to analyze the behavior intention of each target using its motion feature information to obtain an intention state label for each target, and to generate an entity state snapshot of each latest frame from the intention state labels and the target recognition results; and an asynchronous visual interaction unit, configured to judge, for any latest frame after it has been stored for a preset duration, whether the entity state snapshot corresponding to that frame has been generated, the preset duration being smaller than the acquisition interval of the video frames; if not, the asynchronous visual interaction unit being configured to acquire a nearest-neighbor snapshot, the nearest-neighbor snapshot being the entity state snapshot corresponding to the history frame, in the memory buffer pool corresponding to that latest frame, whose time interval to that latest frame is smallest; the asynchronous visual interaction unit being further configured to generate a local monitoring image based on the nearest-neighbor snapshot and that latest frame, and, after all latest frames have been processed concurrently, to compose the water area monitoring result at the current moment from the obtained local monitoring images.
- 10. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the water area monitoring method based on semantic driving and time sequence behavior cognition as claimed in any one of claims 1 to 8.
Description
Water area monitoring method and system based on semantic driving and time sequence behavior cognition

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a water area monitoring method and system based on semantic driving and time sequence behavior cognition.

Background

At present, aquaculture and water area ecological monitoring rely mainly on manual inspection and periodic sampling: aquaculture personnel must observe the water surface for long periods to judge biological feeding, activity state and the safety of the water environment. This mode is labor-intensive and subjective, makes all-weather, standardized, refined management difficult, and often delays the discovery of abnormal conditions. In recent years, computer vision technology based on deep learning has gradually been applied in this field; the mainstream scheme generally adopts an object detection network (such as the YOLO series) or an instance segmentation network (such as Mask R-CNN) to automatically identify the monitored objects. In practical high-concurrency, complex-scene applications, however, the prior art still has the following significant limitations.

First, existing models generally suffer from a "predefined category limitation" and lack open perception capability. A traditional visual model can only identify the specific targets preset in the training stage (such as "fish"), so when an unforeseen abnormal object appears in the culture environment (such as a detached equipment part, a patrol worker's mobile phone accidentally dropped into the water, or an invasive foreign species), the existing system, lacking semantic understanding, often ignores it as background noise. Adding recognition of a new target requires a time-consuming process of data acquisition, labeling and retraining, so the instant monitoring requirement for sudden foreign objects or custom targets in an actual scene cannot be met.

Second, the prior art focuses on "static phenotype analysis" and lacks deep cognition of "time sequence behavior intention". Most existing monitoring technologies (including segmentation-based methods) mainly focus on body-surface features in a single frame image (such as identification of white spots and ulcer areas). However, many early biological anomalies (such as panicked swimming caused by hypoxic surfacing or water quality discomfort, or solitary outlier behavior) show no obvious body-surface pathological change and are instead reflected in the temporal changes of movement trajectory, speed and posture. Static image analysis alone therefore struggles to capture dynamic behavior anomalies, and easily misses them or misreports environmental interference (such as swaying water weeds) as biological anomalies.

Finally, a serious architecture bottleneck exists between high-precision models and real-time video stream processing. As large models rise in the vision field, recognition precision improves markedly but the computation cost is huge. Most existing systems adopt a serial, coupled acquisition-inference-rendering-display architecture, so the video stream's frame rate is directly bound to the model's inference speed: a video frame must wait for AI inference to finish before it can enter the rendering stage, forcing the fluency (FPS) of the video stream down to the model's inference rate. Once a high-precision large model is introduced, the video picture stutters and lags severely. In addition, the existing single-stream, single-inference mode cannot effectively exploit the parallel computing capability of the GPU, so high-concurrency real-time monitoring of multiple video streams is difficult to achieve with limited hardware resources.

Given the above shortcomings, how to provide a water area monitoring method based on semantic driving and time sequence behavior cognition that offers open sensing capability, high monitoring accuracy, smooth video rendering and high-concurrency real-time monitoring has become a problem to be solved.

Disclosure of Invention

The invention aims to provide a water area monitoring method and system based on semantic driving and time sequence behavior cognition, so as to solve the problems in the prior art that the real-time monitoring requirement for sudden foreign objects or custom targets in actual scenes cannot be met, monitoring accuracy is poor, video rendering stutters, and high-concurrency real-time monitoring of multiple video streams cannot be realized. In order to achieve the above purpose, the present invention adopts the following t