EP-4362465-B1 - METHOD AND SYSTEM FOR VIDEO FRAME INTERPOLATION
Inventors
- Barnich, Olivier
- Castin, Martin
- Massoz, Quentin
Dates
- Publication Date: 2026-05-06
- Application Date: 2022-10-27
Claims (13)
- Computer-implemented method for iteratively interpolating intermediate images between two consecutive anchor images (I1; I2) in a video input stream (101), the method comprising:
  - providing (S1) the images of the video input stream (101) in 1 to n different image resolutions, wherein the first image resolution (LR) is the lowest image resolution and the n-th image resolution (HR) is the highest image resolution, wherein k is an integer between 2 and n and n > 2;
  - receiving (S2) the anchor images in the (k-1)-th image resolution in a (k-1)-th neural network (513);
  - processing (S3) the anchor images in the (k-1)-th neural network (513), which is trained for interpolating one or several intermediate image(s) in the (k-1)-th image resolution;
  - upscaling (S4) the one or several intermediate image(s) from the (k-1)-th image resolution to the k-th image resolution;
  - receiving (S5) the upscaled intermediate image(s) and the anchor images in the k-th image resolution as input for a k-th neural network (523);
  - processing (S6) the anchor images in the k-th image resolution and the previous level's upscaled intermediate image(s) in the k-th neural network, which is trained for interpolating one or several intermediate image(s) in the k-th image resolution;
  - using the upscaled one or several intermediate image(s) in a next iteration.
- Method according to claim 1, wherein the anchor images (I1, I2) are received (S1) in a format with H vertical pixels, W horizontal pixels and C color channels, and wherein each anchor image can be represented as a three-dimensional tensor having the format H × W × C.
- Method according to claim 1 or 2, further comprising transforming the anchor images into a different format having fewer vertical (H) and fewer horizontal (W) pixels and more channels (C), such that the transformed images are represented as another three-dimensional tensor having the format H/s × W/s × s²C.
- Method according to one of the preceding claims, wherein processing (S6) the anchor images (I1, I2) and/or one or several intermediate images includes generating (S7) a single composed feature map.
- Method according to claim 4, wherein generating the single composed feature map comprises concatenating (S8) the two anchor images and the previous level's upscaled intermediate image(s) in a channel-wise manner.
- Method according to one of the preceding claims, wherein the neural network (513, 523, 533) is trained to produce (S9) an output feature map that contains all channels of at least one intermediate image.
- Method according to claim 6, further comprising, if necessary, decomposing (S10) the output feature map into at least two or more intermediate images (Î1/3, Î2/3).
- Computer program product including program code, which implements the method according to one of the preceding claims when the program code is executed on a computer.
- Broadcast production system comprising a plurality of video cameras, a vision router (1102), a replay server (1104), and a vision mixer (1103), wherein each video camera generates a video stream with a first frame rate, wherein the camera streams are supplied to the vision router, which transfers them to the vision mixer and the replay server, wherein the replay server stores all camera streams and enables selection of one of the camera streams for replay, and wherein the vision mixer generates a program output stream, wherein the broadcast production system further comprises an interpolation device (1105) that interpolates one or several intermediate images between two consecutive images in the video stream selected for replay to create a slow-motion video stream with a second frame rate that is higher than the first frame rate, characterized in that the interpolation device is configured to implement the method according to one of claims 1 to 7.
- Broadcast production system according to claim 9, wherein the slow-motion video stream is provided to the vision mixer (1103).
- Broadcast production system according to claim 9 or 10, wherein the interpolation device (1105) is integrated into the replay server (1104').
- Broadcast production system according to one of claims 9 or 10, wherein the interpolation device (1105) is implemented as a cloud service.
- Broadcast production system according to one of claims 9 to 12, wherein the interpolation device (1105) comprises a convolutional neural network.
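The coarse-to-fine iteration of claims 1 to 7 can be sketched as follows. This is a toy illustration only: nearest-neighbour upscaling and a simple blend stand in for the trained neural networks (513, 523, 533), and none of the function names come from the patent.

```python
import numpy as np

def upscale(img, factor=2):
    # Nearest-neighbour upscaling as a stand-in for a learned upsampler (step S4).
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def level_net(a1, a2, prev_mid=None):
    # Placeholder for the k-th trained network (steps S3/S6): a simple blend
    # of the two anchors, optionally refined by the previous level's
    # upscaled intermediate estimate.
    mid = 0.5 * (a1 + a2)
    if prev_mid is not None:
        mid = 0.5 * (mid + prev_mid)
    return mid

def coarse_to_fine_interpolate(pyr1, pyr2):
    # pyr1 / pyr2: anchor-image pyramids from the lowest (index 0, LR)
    # to the highest (index n-1, HR) resolution, as provided in step S1.
    mid = level_net(pyr1[0], pyr2[0])      # first level: lowest resolution
    for k in range(1, len(pyr1)):          # levels 2..n
        mid = level_net(pyr1[k], pyr2[k], upscale(mid))
    return mid                             # intermediate image at full resolution
```

Each level processes the anchor images at its own resolution together with the previous level's upscaled estimate, mirroring how steps S2 to S6 iterate from the lowest to the highest resolution.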
Description
Field

The present disclosure relates to a method and a system for interpolating intermediate images in a video stream between two consecutive images in the stream.

Background

In broadcast productions covering a sports event, multiple cameras capture the event from different perspectives. To this end, the cameras are installed at different locations in a sports venue. All camera streams are recorded on a video production or replay server, allowing an operator to go back in time and make a live playback of the clip showing an action of interest that has just happened. The coverage of the sports event is delivered to the viewers as a broadcast program output stream. The playback is usually done in slow motion, meaning that the images are played out at a lower frame rate than their acquisition frame rate. The best slow-motion quality is achieved with high-speed cameras, so-called super slow-motion (SSM) cameras. SSM cameras exist for multiple speeds, e.g. 2x, 3x, and 4x, where single speed, denoted by the frame-rate factor 1x, corresponds to the frame rate of the program output stream, i.e. 50 frames per second (FPS) or 59.94 FPS. The frame-rate factors 2x, 3x, and 4x are also referred to as frame-rate multipliers. Hence, SSM cameras output a camera stream whose frame rate is a multiple of the frame rate of the program output stream. For instance, a clip recorded by an SSM camera producing a video stream with a 3x50 FPS frame rate can be played back at 1/3 of the original capturing speed (that is, at 50 FPS) and still provides smooth transitions between consecutive image frames. Since it is unpredictable where an interesting event worth replaying in slow motion occurs, a straightforward approach would be to install a multitude of SSM cameras at the sports venue.
However, SSM cameras are a scarce resource: they are expensive, require high bandwidth for data transmission to the replay server and, therefore, occupy multiple server channels. For instance, one server channel of the video production server can ingest an uncompressed camera stream at the normal reproduction frame rate, e.g. 50 FPS. Hence, an SSM camera operating at a three times higher frame rate (3x) outputs a video stream with 3x50 FPS and requires three server channels to transmit the video stream and enable its recording at the replay server. As a result, SSM cameras are too large, too fragile, and too expensive to be installed at every location in the sports venue where slow-motion replay might potentially bring value to a broadcast production. Given this situation, broadcast production companies seek alternative solutions for providing slow-motion replays with variable frame-rate multipliers for every type of camera in every broadcast production. Notably, there is a desire to provide slow-motion replays of camera streams that have been captured with non-SSM cameras. A known approach to this problem is to calculate intermediate image frames between two consecutive video frames of a normal camera outputting its video stream at a frame rate corresponding to the production frame rate, e.g. 50 FPS. The process of calculating intermediate image frames is also referred to as "video frame interpolation". In most cases, video frame interpolation depends on an accurate optical flow estimation, which describes for every pixel in an image how the pixel moves between a first and a second anchor image. The first and second anchor images are consecutive images in the camera stream. Knowledge of the optical flow enables video frame interpolation between the first and the second anchor images.
Video frame interpolation constructs initial estimates of intermediate frames by image warping with the estimated optical flow and subsequently refines the initial interpolation result through high-level processing with a deep neural network, which helps improve the initial intermediate-image estimate. An accurate optical flow estimation leads to good quantitative and qualitative video-interpolation performance. In this sense, an optical flow estimation is an explicit estimation of the movement of each pixel between two images. A disadvantage of this approach is that the optical flow estimation incurs substantial computational cost in terms of time and memory. In addition, it is important to note that no optical flow estimator is perfect and, thus, the optical flow estimator limits the performance of the video frame interpolation. In a paper by Choi et al. [1], an alternative approach is proposed. This approach is also known as "channel attention is all you need" (CAIN). CAIN is a deep-learning method based on convolutional neural networks. The CAIN approach replaces the use of optical flows with simple feature-map transformations by gradually distributing the information about motion into multiple color channels.
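The channel-wise redistribution that CAIN relies on, and that claim 3 formalizes as the H × W × C to H/s × W/s × s²C transformation, is commonly known as space-to-depth (the inverse of PixelShuffle). A minimal NumPy sketch follows; the function names are illustrative and not taken from the patent.

```python
import numpy as np

def space_to_depth(img, s=2):
    # Rearrange an H x W x C image into (H/s) x (W/s) x (s*s*C):
    # each s x s spatial block becomes a stack of channels, trading
    # spatial resolution for channel depth (claim 3).
    h, w, c = img.shape
    return (img.reshape(h // s, s, w // s, s, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(h // s, w // s, s * s * c))

def depth_to_space(t, s=2):
    # Exact inverse: fold the s*s channel groups back into spatial blocks.
    h, w, c = t.shape
    c0 = c // (s * s)
    return (t.reshape(h, w, s, s, c0)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h * s, w * s, c0))
```

Because the transformation is lossless and invertible, a network can operate on the channel-rich low-resolution tensor and recover a full-resolution image afterwards, which is what lets CAIN avoid explicit optical flow estimation.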