EP-4738844-A1 - IMAGE PROCESSING SYSTEM AND METHOD


Abstract

A method of image streaming comprises: receiving at least a first stream comprising video packet data and separate supplementary packet data; upon failure to receive video packet data, providing as inputs to a generative model one or more last video images and data based on currently received supplementary packet data; receiving from the generative model a predicted image, being an estimate of a current missing video image based on these inputs; and outputting the predicted image for display.

Inventors

  • BARMAN, Nabajeet
  • BARAHONA RIOS, Adrian
  • ZADTOOTAGHAJ, Saman
  • BIGOS, Andrew

Assignees

  • Sony Interactive Entertainment Inc.

Dates

Publication Date
2026-05-06
Application Date
2025-10-23

Claims (15)

  1. A method of image streaming, comprising: receiving at least a first stream comprising video packet data and separate supplementary packet data; upon failure to receive video packet data, providing as inputs to a generative model one or more last video images, and data based on currently received supplementary packet data; receiving from the generative model a predicted image, being an estimate of a current missing video image based on these inputs; and outputting the predicted image for display.
  2. The method of claim 1, in which the supplementary packet data comprises audio data corresponding to the video stream.
  3. The method of claim 2, in which the audio data is pre-processed to comprise one or more selected from the list consisting of: i. a frequency domain transform of the audio; ii. a frequency domain transform of the audio formatted as at least one colour or greyscale channel of an image of the same kind as the last video image; and iii. a text prompt based on speech to text transcription.
  4. The method of any preceding claim, in which the supplementary packet data comprises caption data corresponding to the video stream.
  5. The method of claim 4, in which descriptive caption data and dialogue caption data are provided as separate inputs to the generative model.
  6. The method of any preceding claim, in which the supplementary packet data comprises duplicate image data corresponding to the video stream, the duplicate image data being at least an order of magnitude smaller than the image data in the video stream.
  7. The method of any preceding claim, in which the supplementary packet data comprises motion vector data corresponding to the video stream, the motion vector data corresponding to a reduced image resolution at least an order of magnitude smaller than that of the image data in the video stream.
  8. The method of any preceding claim, in which the one or more last video images comprise one or more selected from the list consisting of: i. those last decoded from the received video stream; and ii. at least one predicted as the previous current image by the generative model.
  9. The method of any preceding claim, comprising the steps of: upon receiving subsequent video packet data, decoding a current image from the received video packet data; comparing the decoded current image with the current predicted image from the generative model; and if a difference between the two images exceeds a threshold, outputting an image based on both images as a replacement intermediate image.
  10. The method of any preceding claim, in which: the image inputs and outputs of the generative model are at a resolution smaller than the image resolution output for display; and the method comprises the step of: upscaling the output of the generative model for output for display.
  11. The method of any preceding claim, in which: the generative model was trained on inputs based on video and supplementary data, and on successive ones of a series of target images representing successive lost images following after the input video data representing the last video image.
  12. A computer program comprising computer executable instructions adapted to cause a computer system to perform the method of any one of the preceding claims.
  13. A client device, comprising a data interface configured to receive at least a first stream comprising video packet data and separate supplementary packet data; and an input processor configured, upon a failure to receive video packet data, to provide as inputs to a generative model one or more last video images, and data based on currently received supplementary packet data; an output processor configured to receive from the generative model a predicted image, being an estimate of a current missing video image based on these inputs; and a display processor configured to output the predicted image for display.
  14. A client device according to claim 13, in which the supplementary packet data comprises one or more selected from the list consisting of: i. audio data corresponding to the video stream; ii. caption data corresponding to the video stream; iii. duplicate image data corresponding to the video stream, the duplicate image data being at least an order of magnitude smaller than the image data in the video stream; and iv. motion vector data corresponding to the video stream, the motion vector data corresponding to a reduced image resolution at least an order of magnitude smaller than that of the image data in the video stream.
  15. A system, comprising: a client according to claim 13 or claim 14, configured to request a retransmission of supplementary data if supplementary data packets are lost; and a server, configured to prioritise retransmission of such supplementary data packets.
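The receive-and-conceal loop defined by claims 1 and 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: the packet source, the `model.predict` interface, the difference threshold, and the 50/50 blend used for the replacement intermediate image are all assumptions for the sketch.

```python
import numpy as np

DIFF_THRESHOLD = 12.0  # mean absolute pixel difference triggering claim 9 (hypothetical value)

def conceal_stream(packets, model, blend=0.5):
    """Illustrative receive loop. `packets` yields (video_frame_or_None, supp_data);
    None signals a lost video packet. On loss, a generative model predicts the
    missing frame from the last images and supplementary data (claim 1)."""
    last_frames = []       # recently decoded or predicted images (kept short)
    output = []
    predicted_last = False
    for video_frame, supp in packets:
        if video_frame is None:
            # Video packet lost: predict from last images + supplementary data.
            frame = model.predict(last_frames, supp)
            predicted_last = True
        else:
            frame = video_frame
            if predicted_last:
                # Claim 9: compare the newly decoded image with the last prediction;
                # if they differ too much, output a blend of both as a replacement
                # intermediate image to mask the visual jump.
                diff = np.abs(frame.astype(float) - last_frames[-1].astype(float)).mean()
                if diff > DIFF_THRESHOLD:
                    inter = (blend * frame + (1 - blend) * last_frames[-1]).astype(np.uint8)
                    output.append(inter)
            predicted_last = False
        last_frames = (last_frames + [frame])[-2:]  # keep the last two images
        output.append(frame)
    return output
```

In practice the model input would also include the pre-processed supplementary data of claims 2 to 7 (audio spectrograms, captions, low-resolution duplicates, or motion vectors); here `supp` is passed through opaquely.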

Description

The present invention relates to an image processing system and method.

Streaming systems typically stream content from a source (e.g. at a server) to a client device, over a network such as the internet. A large proportion of the transmitted data is video data, making the stream vulnerable to interruptions that result in packet corruption or loss. When video packet loss or corruption results in the loss of a video image, this can be compensated for by known techniques for frame prediction. However, the prediction is typically fairly approximate and only covers the next image frame. If the packet loss or corruption is such that successive video images are lost or unusable, then typically this error cannot be masked by the client.

One solution is to buffer the received data and request retransmission of lost packets to patch the buffered data as needed; however, for typical video frame rates and network latencies, this requires buffering a significant number of video frames, and hence a significant provision and use of memory. Meanwhile, for certain content, such as streamed videogames and other interactive content, the delay created by buffering the video is unacceptable. The present invention seeks to mitigate or alleviate this problem.

SUMMARY OF THE INVENTION

Various aspects and features of the present invention are defined in the appended claims and within the text of the accompanying description. In a first aspect, a method of image streaming is provided in accordance with claim 1. In another aspect, a client device is provided in accordance with claim 13.
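The memory cost of the buffering-and-retransmission approach described above can be estimated with simple arithmetic. The figures below (60 fps, 1080p RGB frames, a 250 ms retransmission round trip) are illustrative assumptions, not values from the application:

```python
# Rough estimate of client memory needed to buffer decoded frames while
# waiting for a retransmission (all figures are illustrative assumptions).
fps = 60                             # frames per second
rtt_s = 0.25                         # assumed retransmission round-trip time
width, height, bpp = 1920, 1080, 3   # 1080p, 3 bytes per pixel (RGB)

frames_to_buffer = int(fps * rtt_s) + 1      # frames in flight during the RTT
bytes_per_frame = width * height * bpp
buffer_mb = frames_to_buffer * bytes_per_frame / 2**20

print(frames_to_buffer, round(buffer_mb))    # 16 frames, roughly 95 MiB
```

Even these modest assumptions imply tens of megabytes of dedicated frame memory, and the quarter-second delay alone already rules the approach out for interactive content such as streamed videogames.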
BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

Figure 1 is a schematic diagram of a client device in accordance with embodiments of the present description.
Figure 2 is a schematic diagram of a client system for generating output images in accordance with embodiments of the present description.
Figure 3 is a flow diagram of a method of generating output images in accordance with embodiments of the present description.

DESCRIPTION OF THE EMBODIMENTS

An image processing system and method are disclosed. In the following description, a number of specific details are presented in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to a person skilled in the art that these specific details need not be employed to practice the present invention. Conversely, specific details known to the person skilled in the art are omitted for the purposes of clarity where appropriate.

Client device

Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, Figure 1 illustrates an entertainment system 10 such as a computer or console. The entertainment system may operate as a client device for a video streaming service.

The entertainment system 10 comprises a central processor or CPU 20. The entertainment system also comprises a graphics processing unit or GPU 30, and RAM 40. Two or more of the CPU, GPU, and RAM may be integrated as a system on a chip (SoC). Further storage may be provided by a disk 50. The entertainment device may transmit or receive data via one or more data ports 60, such as a USB port, Ethernet® port, Wi-Fi® port, Bluetooth® port or similar, as appropriate.
It may also optionally receive data via an optical drive 70. Audio/visual outputs from the entertainment device are typically provided through one or more A/V ports 90 or one or more of the data ports 60. Where components are not integrated, they may be connected as appropriate either by a dedicated data link or via a bus 100.

Examples of a device for displaying images output by the entertainment system include a head mounted display 'HMD' 120 worn by a user 1, a TV (not shown), and a portable screen 140. Interaction with the system is typically provided using one or more handheld controllers 130, 140, and/or one or more VR controllers (130A-L,R) in the case of the HMD.

Whilst a console-like system is illustrated, it will be appreciated that any suitable client streaming device may be considered, such as a phone or tablet, or a smart TV. Hence such aspects as the display and input controls may vary with the device and may be separate or integral as appropriate.

Compensating for video packet loss

Known techniques for frame prediction may include extrapolation from the preceding last image or images received at the client. Alternatively, it is possible to predict the next image using a generative model. Recent generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), large language models (LLMs), and more recently diffusion models, have found great success in t
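The first known technique mentioned above, extrapolation from the preceding received images, can be sketched in a deliberately minimal form: per-pixel linear extrapolation from the last two frames. Practical decoders instead extrapolate per block using motion vectors, so this is an assumed toy version, not a description of any particular codec.

```python
import numpy as np

def extrapolate_next(prev, last):
    """Naive per-pixel linear extrapolation of the next frame from the last
    two received frames: last + (last - prev), clipped to the valid pixel
    range. Computed in int16 to avoid uint8 wrap-around before clipping."""
    p = prev.astype(np.int16)
    l = last.astype(np.int16)
    return np.clip(2 * l - p, 0, 255).astype(np.uint8)
```

Such extrapolation illustrates why the prediction is "fairly approximate and only covers the next image frame": errors compound rapidly when the predicted frame is fed back in as input for the frame after that, which motivates the generative-model approach of the claims.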