
CN-115953428-B - Badminton detecting and tracking method based on time sequence encoding and decoding network

CN115953428B

Abstract

The invention provides a real-time badminton detection and tracking method based on a temporal encoder-decoder network, relating to the field of intelligent image processing and machine vision. The method combines a channel attention mechanism with a temporal network structure in a detection and tracking model built on an encoder-decoder architecture: the channel attention mechanism recalibrates the input features, while the temporal structure fully exploits inter-frame information in the image sequence. As a result, a motion-blurred shuttlecock in an image sequence can be accurately detected and stably tracked in real time against a complex background, effectively improving detection accuracy and tracking stability. This is also the first application of an encoder-decoder network to badminton detection and tracking tasks. In addition, to improve the positioning accuracy of detection, a binary heat-map contour detection step is used in the network, avoiding the insufficient accuracy of positioning coordinates output directly by an end-to-end network and further improving detection accuracy.

Inventors

  • OU QIAOFENG
  • ZHONG LIANG
  • XIONG BANGSHU
  • FANG TING
  • LIU CHANG
  • ZHANG LIPING
  • XU DI
  • NIE XIAQING

Assignees

  • 南昌航空大学 (Nanchang Hangkong University)
  • 江西远大保险设备实业集团有限公司 (Jiangxi Yuanda Insurance Equipment Industrial Group Co., Ltd.)
  • 江西方德科技有限公司 (Jiangxi Fangde Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2022-12-15

Claims (3)

  1. A badminton detection and tracking method based on a temporal encoder-decoder network, characterized by comprising the following steps:

     Step (1): preprocess the image data: obtain k consecutive frames, resize and normalize them, and stack the normalized frames along the channel dimension into a feature map with 3k channels, height h and width w.

     Step (2): first collect video images of badminton matches, divide them into a training set and a validation set by proportion, decompose the videos into picture sequence data, and annotate a point at the centroid of the shuttlecock's cork head (cap).

     Step (3): construct a temporal encoder-decoder network model consisting, in order, of an input layer, an encoding layer, a decoding layer and an output layer.

     The input layer adopts a temporal network structure: after the preprocessed feature map of k temporally consecutive, cropped and normalized frames is obtained, it is stacked along the channel dimension with the network model's temporal outputs F hot (n-3k+1) to F hot (n-k) to form a normalized feature map with 5k channels, height h and width w.

     The encoding layer consists of 4 feature extraction modules and 3 two-dimensional max-pooling downsampling operations, where the kernel size of the max-pooling operations is 2×2 and the (input, output) channel counts of the 4 feature extraction modules are, in order, (5k, 32), (32, 64), (64, 128) and (128, 256), with one max-pooling downsampling applied between consecutive feature extraction modules. The 1st and 2nd feature extraction modules contain 2 basic convolution modules each, and the 3rd and 4th contain 3 each. After the three downsampling stages, the height and width of the output feature map are compressed to 1/8 of the input, realizing the compression (encoding) function.

     The decoding layer consists of 3 feature extraction modules, 3 channel concatenation operations and 3 upsampling operations, where the spatial scale factor of upsampling is 2 with nearest-neighbor mode, the (input, output) channel counts of the 3 feature extraction modules are, in order, (384, 128), (192, 64) and (96, 32), and the 1st feature extraction module contains 3 basic convolution modules while the remaining two contain 2 each. The encoder output is first upsampled and then concatenated on the channel dimension with the output of the 3rd encoder feature extraction module to obtain feature map 1; feature map 1 is upsampled and concatenated with the output of the 2nd encoder feature extraction module to obtain feature map 2; feature map 2 is upsampled and concatenated with the output of the 1st encoder feature extraction module to obtain the decoder output.

     The output layer first applies a two-dimensional convolution with 32 input channels and k output channels to the decoder output, followed by a Sigmoid activation function, finally outputting a normalized heat map with k channels, height h and width w.

     Step (4): train the temporal encoder-decoder network model. Set the data input path and hyperparameter information of the network model, set data loading to sequential sampling, and preprocess the data with the method of step (1). Compute the output heat-map loss with a binary cross-entropy loss function, optimize iteratively with the Adam optimizer, initialize convolution-layer parameters with the Kaiming normal distribution, and decay the learning rate once per training epoch, where the learning rate lr(n) of the n-th round is as follows: ; here a is the initial learning rate, b is a tiny constant less than 1e-8, epochs is the total number of training epochs, and epoch is the current epoch. During validation, run contour detection on the model's predicted and ground-truth label heat maps to obtain the relative coordinates of the shuttlecock, and save the set of model weights with the highest validation-set accuracy obtained during training.

     Step (5): detect and track the shuttlecock in real time, applying the temporal encoder-decoder network model obtained in steps (3) and (4) to real-time images of the badminton match to obtain the pixel coordinates of the shuttlecock.
  2. The method according to claim 1, wherein the feature extraction module in step (3) operates as follows: the input feature map passes through one basic convolution module to obtain an output y1, and through 1 channel attention module followed by N basic convolution modules to obtain y2; y1 and y2 are finally added pixel by pixel to realize feature extraction. The basic convolution module consists of a two-dimensional convolution operation, a ReLU activation function and group normalization, in sequence, where the convolution kernel size is 3×3. The channel attention module comprises 1 adaptive average pooling operation over the feature map's width and height, 2 fully connected layers, and ReLU and Sigmoid activation functions: after the feature map is input, its spatial dimensions are reduced to 1×1 by the adaptive average pooling, the result is passed through a fully connected layer, the ReLU activation, a second fully connected layer and the Sigmoid activation in sequence, and the resulting weights are applied channel by channel to each channel of the module's input feature map, completing the channel-attention recalibration of the input.
  3. The method of claim 1, wherein the specific steps of shuttlecock detection and tracking in step (5) are as follows: (5.1) a camera deployed at one side of the badminton court captures video frames, which are preprocessed as in step (1) into a normalized feature map with 5k channels, height h and width w; (5.2) load the temporal encoder-decoder network model and run inference: load the network model and the trained weights saved in step (4), and infer to obtain k single-channel normalized heat maps of height h and width w; (5.3) binarize the heat maps and perform target contour detection to obtain the relative coordinates of the shuttlecock in the pixel coordinate system: first binarize the k normalized heat maps with a threshold t, then detect the bounding rectangle of each non-zero region in the binarized heat maps, computing the minimal axis-aligned rectangle of each contour; (5.4) output the detection result: map the shuttlecock coordinates obtained in real time in step (5.3), which are at the resized resolution, back to shuttlecock coordinates in the original k frames, mark the coordinate information in the original image sequence frames with a solid white circle, and simultaneously mark the shuttlecock coordinates of the preceding m frames to achieve a visual tracking effect.
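The preprocessing of step (1) can be sketched as follows. This is a minimal NumPy illustration: the target size, the nearest-neighbor resize and the function name are assumptions, since the claim only specifies resizing, normalization and stacking k RGB frames into a 3k-channel map.

```python
import numpy as np

def preprocess(frames, h=288, w=512):
    """Stack k consecutive RGB frames into one (3k, h, w) input.

    `frames` is a list of k uint8 arrays of shape (H, W, 3).  The target
    size (h, w) and the simple nearest-neighbor resize are illustrative
    assumptions -- the claim only states that frames are resized,
    normalized, and concatenated along the channel dimension.
    """
    stacked = []
    for f in frames:
        H, W, _ = f.shape
        # nearest-neighbor resize via index sampling (stand-in for a
        # proper interpolation routine)
        ys = np.arange(h) * H // h
        xs = np.arange(w) * W // w
        r = f[ys][:, xs]
        stacked.append(r.astype(np.float32) / 255.0)  # normalize to [0, 1]
    # each (h, w, 3) frame becomes (3, h, w); concatenating k of them
    # along axis 0 yields the (3k, h, w) normalized feature map
    return np.concatenate([s.transpose(2, 0, 1) for s in stacked], axis=0)
```

With k = 4 frames the result has 12 channels, matching the 3k-channel map the claim describes before the temporal outputs are appended.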
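The channel attention module of claim 2 (adaptive average pooling, two fully connected layers with ReLU and Sigmoid, then channel-wise rescaling) can be sketched in NumPy as below. The weight shapes, and hence the hidden-layer reduction ratio, are assumptions; the claim does not state one.

```python
import numpy as np

def channel_attention(x, w1, b1, w2, b2):
    """Channel-attention recalibration as described in claim 2 (a sketch).

    x: (C, H, W) feature map.  Adaptive average pooling reduces each
    channel to a 1x1 scalar; two fully connected layers (ReLU between,
    Sigmoid after) produce one weight per channel; the input feature map
    is then rescaled channel by channel.  Weight shapes are illustrative
    assumptions.
    """
    s = x.mean(axis=(1, 2))                    # adaptive avg pool -> (C,)
    z = np.maximum(w1 @ s + b1, 0.0)           # FC + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ z + b2)))   # FC + Sigmoid -> (C,)
    return x * a[:, None, None]                # recalibrate each channel
```

In the feature extraction module, this output would pass through N basic convolution modules to form y2, which is added pixel by pixel to the branch output y1.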
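Step (5.3), binarizing a heat map and taking the axis-aligned bounding rectangle of the non-zero region, can be sketched as follows. A real implementation would detect each contour separately (e.g. with OpenCV's findContours/boundingRect); this NumPy stand-in assumes a single response region, and the function name and center-as-coordinate choice are illustrative assumptions.

```python
import numpy as np

def detect_from_heatmap(hm, t=0.5):
    """Binarize one normalized heat map with threshold t and return the
    axis-aligned bounding rectangle of the non-zero region plus its
    center, as a stand-in for per-contour minimal vertical-boundary
    rectangles.  Returns None if nothing exceeds the threshold.
    """
    mask = hm >= t                       # binarization with threshold t
    if not mask.any():
        return None                      # no shuttlecock response
    ys, xs = np.nonzero(mask)
    x0, y0 = xs.min(), ys.min()
    w = xs.max() - x0 + 1                # rectangle width in pixels
    h = ys.max() - y0 + 1                # rectangle height in pixels
    center = (x0 + w / 2.0, y0 + h / 2.0)
    return (int(x0), int(y0), int(w), int(h)), center
```

The rectangle coordinates are relative to the resized heat map; step (5.4) would scale them back to the original frame resolution before drawing the tracking marks.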

Description

Badminton detection and tracking method based on a temporal encoder-decoder network

Technical Field

The invention relates to the field of intelligent image processing and machine vision, in particular to a badminton detection and tracking method based on a temporal encoder-decoder network.

Background

Badminton is played worldwide and has become one of the most popular sports. Meanwhile, tactical analysis of badminton matches and assisted judging of results are becoming increasingly important. Detection and tracking of the shuttlecock is one of the core tasks, and shuttlecock detection information can support many applications, such as serve, stroke and landing recognition. Sophisticated detection and tracking systems exist, such as Hawk-Eye, which use multiple high-end cameras to obtain shuttlecock detection information and can assist judges and help athletes with professional analysis and training. However, such systems are proprietary and expensive to deploy. A method for detecting and tracking the shuttlecock with a low-frame-rate camera therefore has practical application value. Shuttlecock detection and tracking algorithms can be roughly classified into conventional vision processing algorithms and deep learning algorithms built around convolutional neural networks. The traditional approach is mainly the optical-flow method: it exploits the temporal change of pixels in a match image sequence and the correlation between adjacent frames, computes the motion of objects between adjacent frames from the correspondence between the previous and current frame, and then detects the shuttlecock using its characteristics, such as shape and color.
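The traditional motion-based pipeline described above can be approximated, at its very simplest, by temporal differencing of adjacent frames. This is a minimal stand-in, not the optical-flow method itself: a real system would use a dense flow estimator and then filter candidates by shuttle-specific cues such as shape and color, and the function name and threshold are assumptions.

```python
import numpy as np

def motion_candidates(prev_gray, cur_gray, thresh=25):
    """Flag pixels that change strongly between two adjacent grayscale
    frames as candidate moving objects.  A minimal sketch of the
    inter-frame-change idea behind the optical-flow approach; the
    threshold value is an illustrative assumption.
    """
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return diff > thresh  # boolean mask of candidate moving pixels
```

As the next paragraph notes, such motion cues alone mis-fire on racquets, shoes and clothing, which motivates the learned encoder-decoder model.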
However, many objects in badminton match videos have characteristics similar to the shuttlecock, such as the player's racquet, shoes, socks and clothing, and these objects may be mis-detected as shuttlecocks. In addition, the motion blur of the shuttlecock under a low-frame-rate camera can be very severe, seriously interfering with detection and tracking. In the field of deep learning, a common approach is to realize target tracking by detecting the target's position in each frame. However, because this approach obtains position information by frame-by-frame detection, the correlation between preceding and following video frames is not fully used, and the target is easily lost during tracking. It is therefore of practical value to research an algorithm that can accurately detect and stably track the shuttlecock in real time under severe motion blur and complex backgrounds.

Disclosure of the Invention

The invention provides a badminton detection and tracking method based on a temporal encoder-decoder network, which solves the problems described in the background above. The specific technical scheme is as follows. A badminton detection and tracking method based on a temporal encoder-decoder network is characterized by comprising the following steps. Step (1): preprocess the image data to obtain k consecutive RGB frames, resize and normalize them, and stack the normalized frames along the channel dimension into a feature map with 3k channels, height h and width w. Step (2): first collect video images of badminton matches, divide them into a training set and a validation set by proportion, decompose the videos into picture sequence data, and annotate a point at the centroid of the shuttlecock's cork head (cap). Step (3): construct a temporal encoder-decoder network model, which sequentially comprises an input layer, an encoding layer, a decoding layer and an output layer. The input layer adopts a temporal network structure: after the preprocessed feature map of k temporally consecutive, cropped and normalized frames is obtained, it is stacked along the channel dimension with the network model's temporal outputs F hot (n-3k+1) to F hot (n-k) to form a normalized feature map with 5k channels, height h and width w. The encoding layer consists of 4 feature extraction modules and 3 two-dimensional max-pooling downsampling operations, where the kernel size of the max-pooling operations is 2×2, the (input, output) channel counts of the 4 feature extraction modules are, in order, (5k, 32), (32, 64), (64, 128) and (128, 256), and the 3 two-dimensional max-pooling downsa