CN-115210770-B - End-to-end camera calibration for broadcast video
Abstract
Systems and methods of calibrating broadcast video sources are disclosed herein. A computing system retrieves a plurality of broadcast video sources, each comprising a plurality of video frames. The computing system generates a trained neural network by generating a plurality of training data sets based on the broadcast video sources and by learning, through the neural network, to generate a homography matrix for each of the plurality of frames. The computing system receives a target broadcast video source of a target sporting event. The computing system divides the target broadcast video source into a plurality of target frames. The computing system generates, via the neural network, a target homography matrix for each of the plurality of target frames. The computing system calibrates the target broadcast video source by warping each target frame by its corresponding target homography matrix.
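The warping step can be illustrated on individual pixel coordinates. The following is a minimal sketch, not the disclosed implementation: the function names `apply_homography` and `warp_points` and the example matrix `H_example` are hypothetical, and a production system would more likely warp whole frames with an image-processing library.

```python
def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography H given as a list of rows."""
    x, y = point
    # Lift the point to homogeneous coordinates (x, y, 1) and multiply by H.
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to the image plane.
    return (xh / w, yh / w)


def warp_points(H, points):
    """Apply the same homography to every point in a list."""
    return [apply_homography(H, p) for p in points]


# Hypothetical homography (an assumption for illustration):
# scale by 2, then shift by (10, 5).
H_example = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.0, 0.0, 1.0],
]
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
warped = warp_points(H_example, corners)
print(warped)  # [(10.0, 5.0), (12.0, 5.0), (12.0, 7.0)]
```

Warping a full frame amounts to applying this mapping (or its inverse) to every pixel location and resampling the image.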
Inventors
- SHA LONG
- S. GANGULI
- Patrick Lusi
Assignees
- STATS LLC (斯塔特斯公司)
Dates
- Publication Date
- 20260421
- Application Date
- 20210409
- Priority Date
- 20200410
Claims (17)
- 1. A method of calibrating a broadcast video source, comprising: receiving, by a computing system, a target broadcast video source of a target sporting event; dividing, by the computing system, the target broadcast video source into a plurality of target frames; generating, by a neural network, a target homography matrix for each of the plurality of target frames, wherein the generating comprises: generating a venue-agnostic view of the playing field in each target frame; identifying a template that best matches the venue-agnostic view of the playing field; retrieving a template homography matrix associated with the template; predicting a relative homography matrix based on the identified template and the venue-agnostic view; and generating the target homography matrix based on the template homography matrix and the relative homography matrix; and calibrating, by the computing system, the target broadcast video source by warping each target frame by a corresponding target homography matrix.
- 2. The method of claim 1, wherein the neural network comprises: a semantic segmentation module; a camera pose initialization module; and a homography refinement module.
- 3. The method of claim 2, further comprising: retrieving, by the computing system, a plurality of broadcast video sources of a plurality of sporting events, each broadcast video source including a plurality of video frames; and generating, by the computing system, the neural network trained to generate a homography matrix by: generating a plurality of training data sets based on the plurality of broadcast video sources by dividing the plurality of broadcast video sources into a plurality of frames; and learning, by the neural network, to generate a homography matrix for each of the plurality of frames, the learning comprising: learning, by the semantic segmentation module, to generate a venue-agnostic appearance for each of the plurality of frames.
- 4. The method of claim 3, further comprising: learning, by the camera pose initialization module, to calculate a distance between each input received from the semantic segmentation module and a set of template images; and learning, by the camera pose initialization module, to identify a template homography matrix associated with the semantic segmentation module and the set of template images.
- 5. The method of claim 4, further comprising: learning, by the homography refinement module, to generate a relative homography matrix based on a concatenated input including the venue-agnostic appearance and a template image for each frame; and learning, by the homography refinement module, to generate the homography matrix based on the relative homography matrix and the template homography matrix.
- 6. The method of claim 5, wherein the semantic segmentation module, the camera pose initialization module, and the homography refinement module are trained simultaneously.
- 7. A system for calibrating a broadcast video source, comprising: a processor; and a memory having stored thereon programming instructions that, when executed by the processor, perform one or more operations comprising: receiving a target broadcast video source of a target sporting event; dividing the target broadcast video source into a plurality of target frames; generating, by a neural network, a target homography matrix for each of the plurality of target frames, wherein the generating comprises: generating a venue-agnostic view of the playing field in each target frame; identifying a template that best matches the venue-agnostic view of the playing field; retrieving a template homography matrix associated with the template; predicting a relative homography matrix based on the identified template and the venue-agnostic view; and generating the target homography matrix based on the template homography matrix and the relative homography matrix; and calibrating the target broadcast video source by warping each target frame by a corresponding target homography matrix.
- 8. The system of claim 7, wherein the neural network comprises: a semantic segmentation module; a camera pose initialization module; and a homography refinement module.
- 9. The system of claim 8, the operations further comprising: retrieving a plurality of broadcast video sources of a plurality of sporting events, each broadcast video source comprising a plurality of video frames; and generating the neural network trained to generate a homography matrix by: generating a plurality of training data sets based on the plurality of broadcast video sources by dividing the plurality of broadcast video sources into a plurality of frames; and learning, by the neural network, to generate a homography matrix for each of the plurality of frames, the learning comprising: learning, by the semantic segmentation module, to generate a venue-agnostic appearance for each of the plurality of frames.
- 10. The system of claim 9, further comprising: learning, by the camera pose initialization module, to calculate a distance between each input received from the semantic segmentation module and a set of template images; and learning, by the camera pose initialization module, to identify a template homography matrix associated with the semantic segmentation module and the set of template images.
- 11. The system of claim 10, further comprising: learning, by the homography refinement module, to generate a relative homography matrix based on a concatenated input including the venue-agnostic appearance and a template image for each frame; and learning, by the homography refinement module, to generate the homography matrix based on the relative homography matrix and the template homography matrix.
- 12. The system of claim 11, wherein the semantic segmentation module, the camera pose initialization module, and the homography refinement module are trained simultaneously.
- 13. A non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by one or more processors, cause: receiving, by a computing system, a target broadcast video source of a target sporting event; dividing, by the computing system, the target broadcast video source into a plurality of target frames; generating, by a neural network, a target homography matrix for each of the plurality of target frames, wherein the generating comprises: generating a venue-agnostic view of the playing field in each target frame; identifying a template that best matches the venue-agnostic view of the playing field; retrieving a template homography matrix associated with the template; predicting a relative homography matrix based on the identified template and the venue-agnostic view; and generating the target homography matrix based on the template homography matrix and the relative homography matrix; and calibrating, by the computing system, the target broadcast video source by warping each target frame by a corresponding target homography matrix.
- 14. The non-transitory computer-readable medium of claim 13, wherein the neural network comprises: a semantic segmentation module; a camera pose initialization module; and a homography refinement module.
- 15. The non-transitory computer-readable medium of claim 14, further causing: retrieving, by the computing system, a plurality of broadcast video sources of a plurality of sporting events, each broadcast video source comprising a plurality of video frames; and generating, by the computing system, the neural network trained to generate a homography matrix by: generating a plurality of training data sets based on the plurality of broadcast video sources by dividing the plurality of broadcast video sources into a plurality of frames; and learning, by the neural network, to generate a homography matrix for each of the plurality of frames, the learning comprising: learning, by the semantic segmentation module, to generate a venue-agnostic appearance for each of the plurality of frames.
- 16. The non-transitory computer-readable medium of claim 15, further comprising: learning, by the camera pose initialization module, to calculate a distance between each input received from the semantic segmentation module and a set of template images; and learning, by the camera pose initialization module, to identify a template homography matrix associated with the semantic segmentation module and the set of template images.
- 17. The non-transitory computer-readable medium of claim 16, further comprising: learning, by the homography refinement module, to generate a relative homography matrix based on a concatenated input including the venue-agnostic appearance and a template image for each frame; and learning, by the homography refinement module, to generate the homography matrix based on the relative homography matrix and the template homography matrix.
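The inference steps recited in the independent claims (template matching, template homography lookup, relative homography prediction, and composition) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the feature-vector representation of the agnostic view, the Euclidean distance, the dictionary layout of `templates`, the `predict_relative` callback standing in for the homography refinement module, and the multiplication order `H_template @ H_relative` are all assumptions introduced here.

```python
import math


def matmul3(A, B):
    """Multiply two 3x3 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_template(view_features, templates):
    """Return the template whose features are closest to the view (assumed metric)."""
    return min(templates, key=lambda t: l2_distance(view_features, t["features"]))


def target_homography(view_features, templates, predict_relative):
    """Compose the matched template homography with a predicted relative homography."""
    template = best_template(view_features, templates)
    H_rel = predict_relative(view_features, template)
    # Composition order is an assumption of this sketch.
    return matmul3(template["homography"], H_rel)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical template set: each entry pairs a feature vector with a homography.
templates = [
    {"name": "left", "features": [0.0, 0.0],
     "homography": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]},
    {"name": "right", "features": [1.0, 1.0],
     "homography": [[0.5, 0.0, 3.0], [0.0, 0.5, 4.0], [0.0, 0.0, 1.0]]},
]


def identity_relative(view_features, template):
    """Stand-in for the refinement module: predicts no correction."""
    return IDENTITY


H_target = target_homography([0.9, 1.2], templates, identity_relative)
```

With an identity relative homography, `H_target` equals the homography of the matched template; a trained refinement module would instead return a small correction that aligns the template with the actual frame.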
Description
End-to-end camera calibration for broadcast video
Cross Reference to Related Applications
The present application claims priority from U.S. Provisional Application Serial No. 63/008,184, filed on April 10, 2020, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to systems and methods for end-to-end camera calibration for broadcast video.
Background
An increasing number of vision-based tracking systems deployed in production require fast, robust camera calibration. In the field of sports, for example, most current work focuses on sports in which lines and intersections are easy to extract and whose appearance is relatively consistent across venues.
Disclosure of Invention
In some embodiments, a method of calibrating a broadcast video source is disclosed herein. A computing system retrieves a plurality of broadcast video sources for a plurality of sporting events. Each broadcast video source includes a plurality of video frames. The computing system generates a trained neural network by generating a plurality of training data sets based on the broadcast video sources, by dividing the broadcast video sources into a plurality of frames, and by learning, through the neural network, to generate a homography matrix for each of the plurality of frames. The computing system receives a target broadcast video source of a target sporting event. The computing system divides the target broadcast video source into a plurality of target frames. The computing system generates, via the neural network, a target homography matrix for each of the plurality of target frames. The computing system calibrates the target broadcast video source by warping each target frame by a corresponding target homography matrix.
In some embodiments, a system is disclosed herein. The system includes a processor and a memory.
The memory has stored thereon programming instructions that, when executed by the processor, perform one or more operations. The one or more operations include retrieving a plurality of broadcast video sources for a plurality of sporting events. Each broadcast video source includes a plurality of video frames. The one or more operations further include generating a trained neural network by generating a plurality of training data sets based on the broadcast video sources, by dividing the broadcast video sources into a plurality of frames, and by learning, through the neural network, to generate a homography matrix for each of the plurality of frames. The one or more operations further include receiving a target broadcast video source of a target sporting event. The one or more operations further include dividing the target broadcast video source into a plurality of target frames. The one or more operations further include generating, by the neural network, a target homography matrix for each of the plurality of target frames. The one or more operations further include calibrating the target broadcast video source by warping each target frame by a respective target homography matrix.
In some embodiments, a non-transitory computer-readable medium is disclosed herein. The non-transitory computer-readable medium includes one or more sequences of instructions which, when executed by one or more processors, cause a computing system to perform one or more operations. The computing system retrieves a plurality of broadcast video sources for a plurality of sporting events. Each broadcast video source includes a plurality of video frames. The computing system generates a trained neural network by generating a plurality of training data sets based on the broadcast video sources, by dividing the broadcast video sources into a plurality of frames, and by learning, through the neural network, to generate a homography matrix for each of the plurality of frames.
The computing system receives a target broadcast video source of a target sporting event. The computing system divides the target broadcast video source into a plurality of target frames. The computing system generates, via the neural network, a target homography matrix for each of the plurality of target frames. The computing system calibrates the target broadcast video source by warping each target frame by a corresponding target homography matrix.
Drawings
So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
FIG. 1 is a block diagram illustrating a computing environment according to an example embodiment.
FIGS. 2A-2B are block diagrams illustrating a neural network ar