CN-121789116-B - Bird flight video key point extraction method integrating YOLO network and LSTM network

Abstract

The application discloses a bird flight video key point extraction method integrating a YOLO network and an LSTM network, belonging to the field of image data processing. It addresses the problem that existing methods using YOLO network technology to extract bird skeleton key points rely on single-frame images and lack temporal association, so they cannot fully capture the dynamic changes of the key points across a flight video, resulting in insufficient pose-analysis precision. The method comprises: constructing a multi-view bird data set; obtaining an enhanced bird data set; obtaining a pre-trained model; obtaining a second, post-trained model; predicting the continuous flight video of the birds to be analyzed with the second post-trained model; and performing temporal-modeling correction. Compared with the prior art, the two-stage training strategy and network-structure optimization effectively improve the generalization capability of the model, resolve the problem of key point drift, and raise the mean average precision (mAP50) of the model on both the classification task and the key point detection task.

Inventors

  • PU JUNFENG
  • LU GUANGRUI
  • YU JIAJIA
  • BIAN QINGYONG
  • LIU RAN
  • CHEN YANRU
  • LIU DAWEI
  • LI GUN
  • WANG YUANJING
  • CAI JINYAN
  • ZHANG YU
  • Hou Baobin

Assignees

  • High Speed Aerodynamics Institute, China Aerodynamics Research and Development Center (中国空气动力研究与发展中心高速空气动力研究所)

Dates

Publication Date
2026-05-12
Application Date
2026-03-04

Claims (5)

  1. A bird flight video key point extraction method integrating a YOLO network and an LSTM network, characterized by comprising the following steps: Step 1, constructing a multi-view bird data set to acquire diversified data; Step 2, performing data set enhancement on the multi-view bird data set obtained in step 1 to obtain an enhanced bird data set; Step 3, feeding the multi-view bird data set of step 1 and the enhanced bird data set of step 2 into the YOLOv network for training, and selecting the optimal batch size and optimization algorithm by experimentally testing the influence of different batch sizes and optimization algorithms on model results, to obtain a pre-trained model; Step 4, retraining on the skeleton key points of the birds to be analyzed on the basis of the pre-trained model obtained in step 3 to obtain a second, post-trained model, specifically: step 4.1, taking the birds to be analyzed as objects, collecting pictures of the birds, selecting some of the pictures, and annotating target detection boxes and skeleton key points to obtain a post-training data set; step 4.2, converting the post-training data set annotated in step 4.1 into a skeleton key point data set in the YOLO network training format; step 4.3, merging a lightweight partial convolution module and the EMA attention mechanism into the C2f module of the YOLOv network to obtain a post-training network model; step 4.4, inputting the YOLO-format skeleton key point data set of step 4.2 into the post-training network model of step 4.3, with training parameters consistent with the pre-trained model, and training to obtain the second post-trained model; Step 5, predicting the continuous flight video of the birds to be analyzed with the second post-trained model obtained in step 4 to obtain skeleton key point coordinates for each frame; and Step 6, performing temporal-modeling correction to solve the problem of target loss caused by occlusion during flight and complete the identification of the skeleton key points of the birds to be analyzed, specifically: step 6.1, connecting the corresponding skeleton key points of every two consecutive frames in the coordinate data obtained in step 5 to form velocity vectors, yielding a time-series data set reflecting the dynamic evolution of the skeleton key points; step 6.2, inputting the time-series data set obtained in step 6.1 into an LSTM model for training to obtain a third, time-series prediction model; and step 6.3, predicting the flight video of the birds to be analyzed with the second post-trained model, and when a target in the flight video is occluded, as detected by a sudden drop in confidence or an obvious deviation of the identified key points, replacing the detection result of the second post-trained model with the prediction of the third time-series prediction model from step 6.2.
  2. The method according to claim 1, wherein in step 1, a large-scale general-purpose multi-view bird data set is collected from publicly available deep learning data sets, and parameterized mesh images of the birds to be analyzed are introduced to obtain diversified data and avoid the risk of overfitting to the detected birds.
  3. The method of claim 1, wherein in step 2, the pictures in the multi-view bird data set obtained in step 1 are subjected to random illumination perturbation and background replacement via Stable Diffusion, and conventional data enhancement is performed using the OpenCV library to obtain the enhanced bird data set, the conventional data enhancement comprising random rotation, random cropping, color jitter, and flipping.
  4. The method according to claim 1, wherein in step 6.1, the corresponding skeleton key points in the coordinate data obtained in step 5 are connected across every two consecutive frames to form velocity vectors, which are stored as a CSV file for LSTM model training.
  5. The method according to claim 1, wherein in step 6.3, when a target in the flight video is occluded, and a sudden drop in the confidence of the identified bird to be analyzed is detected, or when the Euclidean distance between the detected value of the current frame and the predicted value of the third time-series prediction model exceeds a set percentage of the span length, the prediction of the third time-series prediction model from step 6.2 is used to replace the detection result of the second post-trained model.
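The temporal correction in claims 4 and 5 can be sketched briefly. The following is an illustrative NumPy sketch, not the patent's implementation: `velocity_vectors` connects corresponding key points of consecutive frames (step 6.1), and `correct_keypoints` substitutes the time-series prediction whenever confidence dips or the Euclidean deviation exceeds a fraction of the span length (step 6.3). The threshold values `conf_dip` and `dev_frac` are hypothetical, since the patent only specifies "a set percentage".

```python
import numpy as np

def velocity_vectors(frames):
    """Step 6.1 sketch: difference corresponding key points of every two
    consecutive frames, giving the velocity sequence the LSTM trains on.
    frames: (T, K, 2) array of K key point coordinates over T frames."""
    frames = np.asarray(frames, dtype=float)
    return frames[1:] - frames[:-1]

def correct_keypoints(frames, conf, predicted, conf_dip=0.3, dev_frac=0.2, span=100.0):
    """Step 6.3 sketch: replace the detection for a frame with the
    time-series prediction when the detection confidence drops suddenly or
    the largest per-keypoint Euclidean deviation between detection and
    prediction exceeds a set fraction of the span length.
    frames, predicted: (T, K, 2); conf: (T,) detection confidences."""
    frames = np.asarray(frames, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    out = frames.copy()
    for t in range(len(frames)):
        deviation = np.linalg.norm(frames[t] - predicted[t], axis=-1).max()
        if conf[t] < conf_dip or deviation > dev_frac * span:
            out[t] = predicted[t]  # substitute the LSTM prediction
    return out
```

In this scheme an occluded frame is never dropped: the corrected sequence stays dense, which is what preserves track continuity downstream.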
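The conventional augmentations listed in claim 3 (random cropping, color jitter, flipping; random rotation would use `cv2.getRotationMatrix2D`/`cv2.warpAffine` in OpenCV) can be illustrated with a minimal NumPy stand-in; the crop ratio, jitter range, and flip probability below are chosen for illustration and are not taken from the patent.

```python
import numpy as np

def augment(img, rng):
    """Illustrative stand-in for the claim-3 OpenCV augmentations:
    random crop, per-channel color jitter, and horizontal flip."""
    h, w = img.shape[:2]
    # Random crop to 90% of each side, then nearest-neighbour resize back.
    ch, cw = int(h * 0.9), int(w * 0.9)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    crop = img[y:y + ch, x:x + cw]
    ys = np.arange(h) * ch // h      # nearest-neighbour row indices
    xs = np.arange(w) * cw // w      # nearest-neighbour column indices
    img = crop[ys][:, xs]
    # Color jitter: independent random gain per channel.
    gain = rng.uniform(0.8, 1.2, size=img.shape[-1])
    img = np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return img
```

When augmenting key point data sets, note that geometric transforms (crop, flip, rotation) must be applied to the annotated key point coordinates as well, or the labels will no longer match the images.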

Description

Bird flight video key point extraction method integrating YOLO network and LSTM network

Technical Field

The application relates to the field of image data processing, and in particular to a bird flight video key point extraction method integrating a YOLO network and an LSTM network. The method combines a deep learning neural network with temporal modeling: it accurately identifies target key points and corrects them using temporal information. It can be applied to scenarios such as bird behavior analysis, ecological monitoring, and biokinematics research.

Background

In research on the functional morphology of birds, the extraction of skeletal key points has irreplaceable scientific value. Three-dimensional reconstruction based on skeleton key points can reveal the mechanical transmission paths of key parts such as the wing bones and phalanges, and explain the wing-morphing strategies of birds in different flight states. With the iterative upgrading of deep-learning-based computer vision algorithms, a new mode of skeleton point analysis has emerged: only a conventional camera is required to collect spatio-temporal images, and key motion indicators of the target, such as displacement vectors, velocity vectors, and attitude angles, can be output by a feature point spatio-temporal correlation modeling algorithm.
However, the prior art has significant limitations: (1) traditional methods rely on the static features of single-frame images and lack temporal association, so they cannot fully capture dynamic changes and key point analysis lacks precision; (2) when occlusion occurs, key points are easily missed or falsely detected, seriously affecting track continuity; (3) when a conventional deep learning model handles a bird-specific task with random initialization or training on a single data set, there is a risk of overfitting, and it is difficult to meet the demands of complex scenes in actual ecological monitoring. Improving the extraction of bird flight video key points has therefore become an urgent technical problem.

Disclosure of Invention

The application aims to provide a bird flight video key point extraction method fusing a YOLO network and an LSTM network, addressing the problem that existing YOLO-based extraction of bird skeleton key points relies on the static features of single-frame images, lacks temporal association, cannot fully capture the dynamic changes of the skeleton key points, and therefore yields insufficient analysis precision. To achieve the above purpose, the application adopts the following technical scheme.
A bird flight video key point extraction method integrating a YOLO network and an LSTM network comprises the following steps.

Step 1, collecting and establishing a multi-view bird data set based on bird data from open-source data.

Step 2, performing data set enhancement on the multi-view bird data set obtained in step 1 to obtain an enhanced bird data set.

Step 3, feeding the multi-view bird data set of step 1 and the enhanced bird data set of step 2 into the YOLOv network for training, and selecting the optimal batch size and optimization algorithm by experimentally testing the influence of different batch sizes and optimization algorithms on model results, to obtain a pre-trained model.

Step 4, connecting in series a partial convolution module (Partial_conv), an MLP layer formed by a one-dimensional convolution and a two-dimensional convolution in series, a dropout layer, and an EMA attention mechanism layer to form a CME module; using the CME module to replace the Bottleneck sub-modules in the C2f module of YOLOv, keeping the shortcut selection of the CME module consistent with that of the Bottleneck sub-modules in the original YOLOv network, to construct a post-training network; and sending the pre-trained model and the skeleton key point data of the birds to be analyzed into the post-training network for retraining to obtain a second, post-trained model. Specifically: step 4.1, taking the birds to be analyzed as objects, collecting pictures and mesh images of the birds, and annotating target detection boxes and skeleton key points to obtain a post-training data set; step 4.2, converting the post-training data set annotated in step 4.1 into a skeleton key point data set in the YOLO network training format; step 4.3, a CME module is formed by connecting a partial convolution calculat