CN-121982723-A - Tennis video data labeling method and system

Abstract

The invention discloses a tennis video data labeling method and system in the technical field of artificial intelligence. The method comprises: obtaining tennis video image frames; determining normalized court key point coordinates and court key point visibility with a convolution-improved YOLOv court key point identification model; determining upper team area dividing line coordinates from the normalized court key point coordinates; determining normalized personnel bounding box coordinates with a personnel target detection model; determining player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates; extracting a region-of-interest image for each player bounding box and inputting it into a human body pose estimation model to determine player skeleton point coordinates and skeleton point visibility; determining normalized tennis ball bounding box coordinates for the tennis video image frames with an improved YOLOv tennis ball target detection model; and hierarchically integrating these data to determine a tennis labeling data set. The scheme improves the reliability of tennis video data annotation.
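
As an illustration of the hierarchical integration step, the following is a minimal sketch of what one per-frame record might look like. The class names, field names, the "upper"/"lower" team labels and the (x_center, y_center, width, height) box layout are all assumptions made for this example; the source does not specify a storage format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple

# Normalized (x_center, y_center, width, height); this layout is an assumption.
Box = Tuple[float, float, float, float]

@dataclass
class Keypoint:
    x: float        # normalized to [0, 1]
    y: float        # normalized to [0, 1]
    visible: bool   # visibility flag for court or skeleton points

@dataclass
class PlayerAnnotation:
    team: str                                  # "upper" or "lower" (hypothetical labels)
    box: Box                                   # normalized player bounding box
    skeleton: List[Keypoint] = field(default_factory=list)

@dataclass
class FrameAnnotation:
    frame_index: int
    court_keypoints: List[Keypoint]
    players: List[PlayerAnnotation]
    ball_box: Optional[Box] = None             # None when no ball is detected

def to_json(frame: FrameAnnotation) -> str:
    """Serialize one frame: frame -> court / players / ball -> points."""
    return json.dumps(asdict(frame), indent=2)

frame = FrameAnnotation(
    frame_index=0,
    court_keypoints=[Keypoint(0.25, 0.30, True), Keypoint(0.75, 0.30, False)],
    players=[PlayerAnnotation("lower", (0.5, 0.8, 0.1, 0.25),
                              [Keypoint(0.5, 0.72, True)])],
    ball_box=(0.52, 0.45, 0.01, 0.015),
)
record = json.loads(to_json(frame))
```

Nesting the skeleton points under each player and the players under each frame keeps every annotation traceable to the frame it came from, which is one plausible reading of "hierarchical" here.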

Inventors

Xiao Huiyang
Liu Tianxiang
Zhou Guoxu

Assignees

Guangdong University of Technology (广东工业大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-02

Claims (10)

1. A tennis video data annotation method, comprising: acquiring a tennis video image frame; determining normalized court key point coordinates and court key point visibility of the tennis video image frame with a convolution-improved YOLOv court key point identification model based on a codec convolution module and/or an enhanced codec convolution module, wherein the codec convolution module comprises a cascaded grouped convolution and 1×1 pointwise convolution, the enhanced codec convolution module comprises a cascaded grouped convolution, multi-dilation-rate dilated convolution branch group, 1×1 convolution and 1×1 pointwise convolution, and the grouped convolution is connected to the 1×1 convolution by a residual connection; determining upper team area dividing line coordinates from the normalized court key point coordinates, determining normalized personnel bounding box coordinates of the tennis video image frame with a personnel target detection model, and determining player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates; extracting a region-of-interest image for each player bounding box from the tennis video image frame, inputting each region-of-interest image into a human body pose estimation model, and determining player skeleton point coordinates and skeleton point visibility; detecting the tennis video image frame with an improved YOLOv tennis ball target detection model based on variable kernel convolution and a global attention module, and outputting normalized tennis ball bounding box coordinates; and hierarchically integrating the tennis video image frame, the normalized court key point coordinates, the court key point visibility, the normalized player bounding box coordinates, the player skeleton point coordinates, the skeleton point visibility and the normalized tennis ball bounding box coordinates to determine a tennis annotation data set.
2. The tennis video data annotation method according to claim 1, wherein obtaining the convolution-improved YOLOv court key point identification model based on the codec convolution module and/or the enhanced codec convolution module comprises: performing convolution replacement on the YOLOv baseline model with the codec convolution module to obtain the convolution-improved YOLOv court key point identification model; or replacing the Conv layers other than the first Conv layer in the backbone network of the YOLOv baseline model with the enhanced codec convolution module to obtain the convolution-improved YOLOv court key point identification model; or replacing the first Conv layer of the backbone network and the Conv layers of the neck network of the YOLOv baseline model with the codec convolution module, and replacing the Conv layers other than the first Conv layer in the backbone network of the YOLOv baseline model with the enhanced codec convolution module.
3. The tennis video data annotation method according to claim 1, wherein the tennis video image frame is a singles tennis video image frame, and wherein determining the player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates comprises: taking the normalized personnel bounding box coordinate that lies in a preset lower team retained area of the tennis video image frame and has the largest normalized ordinate as the lower team player bounding box coordinate; determining an upper team retained area in the tennis video image frame using the upper team area dividing line coordinates; clustering the normalized personnel bounding box coordinates in the upper team retained area by ordinate to determine personnel target clusters; if the personnel target clusters include a single-target cluster, taking the single-target cluster with the largest normalized ordinate as the upper team player bounding box coordinate; if no single-target cluster exists, traversing the personnel target clusters and, in a personnel target cluster whose intra-cluster distance from the cluster center is smaller than a distance threshold, selecting the normalized personnel bounding box coordinate with the smallest bounding box area as the upper team player bounding box coordinate; and if no single-target cluster exists and no personnel target cluster has an intra-cluster distance smaller than the distance threshold, taking the normalized personnel bounding box coordinate with the largest bounding box area in the upper team retained area as the upper team player bounding box coordinate.
4. The tennis video data annotation method according to claim 1, wherein obtaining the improved YOLOv tennis ball target detection model comprises: replacing the Conv layers other than the first Conv layer in the backbone network of the YOLOv baseline model with variable kernel convolution, and adding a global attention module before the SPPF module, to obtain the improved YOLOv tennis ball target detection model.
5. The tennis video data annotation method according to claim 1, wherein the multi-dilation-rate dilated convolution branch group comprises dilated convolution branches with dilation rates of 2, 4 and 6, respectively.
6. The tennis video data annotation method according to claim 1, further comprising: assigning a player ID to each player bounding box coordinate in the first tennis video image frame; tracking the players continuously across the frames of the singles tennis video according to the player IDs with a multi-target tracking algorithm, and determining the corresponding player motion trajectories; constructing a tennis ball motion trajectory from the normalized tennis ball bounding box coordinates of each tennis video image frame; and integrating the player motion trajectories and the tennis ball motion trajectory into the tennis annotation data set.
7. A tennis video data annotation system, comprising: an image acquisition module for acquiring tennis video image frames; a court identification module for determining normalized court key point coordinates and court key point visibility of the tennis video image frames with a convolution-improved YOLOv court key point identification model based on a codec convolution module and/or an enhanced codec convolution module, wherein the codec convolution module comprises a cascaded grouped convolution and 1×1 pointwise convolution, the enhanced codec convolution module comprises a cascaded grouped convolution, multi-dilation-rate dilated convolution branch group, 1×1 convolution and 1×1 pointwise convolution, and the grouped convolution is connected to the 1×1 convolution by a residual connection; a player identification module for determining upper team area dividing line coordinates from the normalized court key point coordinates, determining normalized personnel bounding box coordinates of the tennis video image frames with a personnel target detection model, and determining player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates; a skeleton recognition module for extracting a region-of-interest image for each player bounding box from the tennis video image frames, inputting each region-of-interest image into a human body pose estimation model, and determining player skeleton point coordinates and skeleton point visibility; a tennis ball identification module for detecting the tennis video image frames with an improved YOLOv tennis ball target detection model based on variable kernel convolution and a global attention module, and outputting normalized tennis ball bounding box coordinates; and a data integration module for hierarchically integrating the tennis video image frames, the normalized court key point coordinates, the court key point visibility, the normalized player bounding box coordinates, the player skeleton point coordinates, the skeleton point visibility and the normalized tennis ball bounding box coordinates to determine a tennis annotation data set.
8. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the tennis video data annotation method according to any of claims 1-6.
9. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, perform the steps of the tennis video data annotation method according to any of claims 1-6.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the tennis video data annotation method according to any of claims 1-6.
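
The player selection cascade of claim 3 can be sketched as follows. This is only an illustration under stated assumptions: boxes are normalized (cx, cy, w, h) tuples with y growing downward, the upper team retained area is taken to be everything above the dividing line ordinate, and the ordinate clustering is a simple gap-based grouping swapped in because the source does not name a clustering algorithm. The function names and the `gap`, `dist_threshold` and `lower_area_top` parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized (assumed layout)

def area(b: Box) -> float:
    return b[2] * b[3]

def cluster_by_ordinate(boxes: List[Box], gap: float) -> List[List[Box]]:
    """Group boxes whose sorted center ordinates differ by less than `gap`."""
    clusters: List[List[Box]] = []
    for b in sorted(boxes, key=lambda b: b[1]):
        if clusters and b[1] - clusters[-1][-1][1] < gap:
            clusters[-1].append(b)
        else:
            clusters.append([b])
    return clusters

def pick_lower_player(boxes: List[Box], lower_area_top: float) -> Optional[Box]:
    """Lower player: box in the lower retained area with the largest ordinate."""
    candidates = [b for b in boxes if b[1] >= lower_area_top]
    return max(candidates, key=lambda b: b[1]) if candidates else None

def pick_upper_player(boxes: List[Box], divider_y: float,
                      gap: float = 0.03,
                      dist_threshold: float = 0.02) -> Optional[Box]:
    upper = [b for b in boxes if b[1] < divider_y]   # assumed upper retained area
    if not upper:
        return None
    clusters = cluster_by_ordinate(upper, gap)
    # Case 1: a single-target cluster exists; take the one with the largest ordinate.
    singles = [c[0] for c in clusters if len(c) == 1]
    if singles:
        return max(singles, key=lambda b: b[1])
    # Case 2: first compact cluster found; take its smallest-area box.
    for c in clusters:
        center = sum(b[1] for b in c) / len(c)
        if max(abs(b[1] - center) for b in c) < dist_threshold:
            return min(c, key=area)
    # Case 3: fall back to the largest-area box in the upper retained area.
    return max(upper, key=area)

people = [
    (0.50, 0.85, 0.10, 0.30),   # near-side player
    (0.48, 0.30, 0.05, 0.12),   # far-side player, alone at its ordinate
    (0.20, 0.10, 0.04, 0.08),   # two bystanders at almost the same ordinate
    (0.25, 0.11, 0.03, 0.06),
]
lower = pick_lower_player(people, lower_area_top=0.5)
upper = pick_upper_player(people, divider_y=0.5)
```

In this example the two bystanders fall into one cluster, so the isolated far-side box is the only single-target cluster and is selected. The order in which clusters are traversed in case 2 is not stated in the claim; ascending ordinate is used here for simplicity.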

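To make the dilation rates of 2, 4 and 6 in claim 5 more concrete, the short sketch below computes the span each dilated branch covers and the padding that keeps the feature map size unchanged at stride 1. The 3×3 kernel size is an assumption; the claims give the dilation rates but not the kernel size.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return dilation * (kernel_size - 1) // 2

KERNEL_SIZE = 3  # assumed; not stated in the source
branches = {
    rate: (effective_kernel_size(KERNEL_SIZE, rate), same_padding(KERNEL_SIZE, rate))
    for rate in (2, 4, 6)
}
for rate, (span, pad) in branches.items():
    print(f"dilation={rate}: span={span}x{span}, padding={pad}")
```

Under the 3×3 assumption the three branches cover 5×5, 9×9 and 13×13 neighborhoods with the same number of weights per branch, which is the usual motivation for combining several dilation rates in parallel.
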
Description

Tennis video data labeling method and system

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a tennis video data labeling method and system.

Background

In fields such as sports training, match analysis and sports science research, accurately labeling tennis video data with information such as the tennis ball, player positions, player skeleton points and court key points provides important data support for subsequent technical action analysis, tactical research and the like. Traditional tennis video data annotation is mainly done manually, which suffers from extremely low annotation efficiency and unstable annotation accuracy. Existing schemes therefore consider introducing a traditional target detection model for recognition and positioning to realize data annotation, but such models suffer from insufficient feature extraction and low positioning accuracy, so the reliability of the data annotation is low.

Disclosure of Invention

The invention provides a tennis video data labeling method and system, which solve the technical problem that data labeling reliability is low when a traditional target detection model is used to recognize, position and label tennis video data.

The tennis video data labeling method provided by the first aspect of the invention comprises: acquiring a tennis video image frame; determining normalized court key point coordinates and court key point visibility of the tennis video image frame with a convolution-improved YOLOv court key point identification model based on a codec convolution module and/or an enhanced codec convolution module, wherein the codec convolution module comprises a cascaded grouped convolution and 1×1 pointwise convolution, the enhanced codec convolution module comprises a cascaded grouped convolution, multi-dilation-rate dilated convolution branch group, 1×1 convolution and 1×1 pointwise convolution, and the grouped convolution is connected to the 1×1 convolution by a residual connection; determining upper team area dividing line coordinates from the normalized court key point coordinates, determining normalized personnel bounding box coordinates of the tennis video image frame with a personnel target detection model, and determining player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates; extracting a region-of-interest image for each player bounding box from the tennis video image frame, inputting each region-of-interest image into a human body pose estimation model, and determining player skeleton point coordinates and skeleton point visibility; detecting the tennis video image frame with an improved YOLOv tennis ball target detection model based on variable kernel convolution and a global attention module, and outputting normalized tennis ball bounding box coordinates; and hierarchically integrating the tennis video image frame, the normalized court key point coordinates, the court key point visibility, the normalized player bounding box coordinates, the player skeleton point coordinates, the skeleton point visibility and the normalized tennis ball bounding box coordinates to determine a tennis annotation data set.

Optionally, obtaining the convolution-improved YOLOv court key point identification model based on the codec convolution module and/or the enhanced codec convolution module comprises: performing convolution replacement on the YOLOv baseline model with the codec convolution module to obtain the convolution-improved YOLOv court key point identification model; or replacing the Conv layers other than the first Conv layer in the backbone network of the YOLOv baseline model with the enhanced codec convolution module to obtain the convolution-improved YOLOv court key point identification model; or replacing the first Conv layer of the backbone network and the Conv layers of the neck network of the YOLOv baseline model with the codec convolution module, and replacing the Conv layers other than the first Conv layer in the backbone network of the YOLOv baseline model with the enhanced codec convolution module.

Optionally, the tennis video image frame is a singles tennis video image frame, and determining the player bounding box coordinates from the upper team area dividing line coordinates and the normalized personnel bounding box coordinates comprises: taking the normalized personnel bounding box coordinate which is in a preset lo