CN-116391353-B - Wrapped-reshaping image processing method and device with neighborhood consistency, and storage medium
Abstract
An input image of a first bit depth in an input domain is received. A forward reshaping operation is performed on the input image to generate a forward reshaped image of a second bit depth in a reshaping domain. An image container containing image data derived from the forward reshaped image is encoded into an output video signal of the second bit depth.
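As a rough illustration of the bit-depth flow the abstract describes (receive at a first, higher bit depth; forward-reshape to a second, lower bit depth; encode), here is a minimal sketch in which the helper `forward_reshape` is a hypothetical stand-in, a plain requantization rather than the claimed wrapping operation:

```python
import numpy as np

def forward_reshape(img16: np.ndarray, out_bits: int = 10) -> np.ndarray:
    """Toy stand-in for the forward reshaping operation: here a plain
    requantization; the claimed method instead wraps codewords onto a
    non-straight axis (see claim 1)."""
    scale = (2**out_bits - 1) / (2**16 - 1)
    return np.round(img16.astype(np.float64) * scale).astype(np.uint16)

# 16-bit input image (first bit depth) in the input domain.
rng = np.random.default_rng(0)
input_image = rng.integers(0, 2**16, size=(4, 4), dtype=np.uint16)

# Forward-reshaped image at the second (lower) bit depth.
reshaped = forward_reshape(input_image, out_bits=10)

# The reshaped image would then be placed in an image container and
# encoded into a 10-bit output video signal for a downstream decoder.
assert reshaped.max() < 2**10
```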
Inventors
- Joshua Horvath
- H. Kadu
- Guan-Ming Su
Assignees
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-11-10
- Priority Date: 2020-11-11
Claims (20)
- 1. A method for image processing, comprising: receiving an input image of a first bit depth in an input domain from an input video signal of the first bit depth, the first bit depth being higher than a second bit depth in a reshaping domain; performing a forward reshaping operation on the input image to generate a forward reshaped image of the second bit depth in the reshaping domain, the forward reshaping operation comprising wrapping an input codeword in the input image along a non-wrapped axis of the input domain into a reshaped codeword in the forward reshaped image on a wrapped axis of the reshaping domain, wherein the non-wrapped axis is represented in the input domain as a straight axis, and wherein the wrapped axis is represented in the reshaping domain in a non-straight spatial shape; and encoding an image container containing image data derived from the forward reshaped image into an output video signal of the second bit depth, the image data in the image container causing a recipient device of the output video signal to construct a backward reshaped image of a third bit depth for rendering on a display device, the third bit depth being higher than the second bit depth; wherein, to generate the image data in the image container, the method further comprises: applying a forward truncated-domain transform to the forward reshaped image in the reshaping domain to transform the forward reshaped image into an intermediate image in a truncated domain, wherein the forward truncated-domain transform ensures that codewords remain within a specified range of codeword values; performing one or more image processing operations on the intermediate image to generate a processed intermediate image; and applying an inverse truncated-domain transform to the processed intermediate image to generate the image data in the image container (illustrative sketches of the wrapping and truncated-domain steps follow the claims).
- 2. The method of claim 1, wherein the reshaping domain is a wrapped reshaping domain, and wherein the wrapped axis refers to an axis in the wrapped reshaping domain mapped from the non-wrapped axis by one or more wrapping transforms, wherein the one or more wrapping transforms are one or more geometric transforms.
- 3. The method of claim 1 or 2, wherein the non-wrapped axis refers to a single channel or dimension, or a combination of multiple channels or dimensions, of the input image in the input domain.
- 4. The method of claim 1 or 2, wherein the wrapped axis of the reshaping domain has a different geometry than the non-wrapped axis, wherein a total number of codewords available on the wrapped axis is greater than a total number of codewords available on the non-wrapped axis.
- 5. The method of claim 1, wherein the processed intermediate image includes one or more clipped codeword values generated by a clipping operation to ensure that all codeword values in the processed intermediate image lie within a target spatial shape representing the reshaping domain.
- 6. The method of claim 1 or 2, wherein the second bit depth represents one of 8 bits, 10 bits, 12 bits, or another number of bits lower than the first bit depth.
- 7. The method of claim 1 or 2, wherein the forward reshaping operation is based on a set of forward reshaping maps, wherein a set of operating parameter values used in the set of forward reshaping maps is selected from a plurality of sets of operating parameter values based at least in part on an input codeword in the input image, wherein each set of operating parameter values is optimized to minimize a prediction error for a respective training image cluster of a plurality of training image clusters.
- 8. The method of claim 7, wherein the set of forward reshaping maps comprises one or more of a multivariate multiple regression (MMR) map, a tensor-product B-spline (TPB) map, a forward reshaping look-up table (FLUT), or other types of forward reshaping maps (an MMR fitting sketch follows the claims).
- 9. The method of claim 7, wherein each set of operating parameter values of the plurality of sets of operating parameter values is generated based on a backward error subtraction signal adjustment (BESA) algorithm.
- 10. The method of claim 9, wherein the prediction error propagated in the BESA algorithm is calculated based at least in part on a) a difference between an input value and a reconstructed value and b) a spatial gradient derived as a partial derivative of a cross-channel forward reshaping function used to generate the reconstructed value (a sketch of this error term follows the claims).
- 11. The method of claim 1 or 2, wherein the input codewords in the input image are scaled with one or more scaling factors, wherein the one or more scaling factors are optimized at run time using a search algorithm based on one or more of the golden-section search, the Nelder-Mead algorithm, or another search algorithm (a search sketch follows the claims).
- 12. The method of claim 1 or 2, wherein an input shape representing the input domain is mapped to a target reference shape representing the reshaping domain using a set of mapping functions, wherein the target reference shape is selected from a plurality of different reference shapes corresponding to a plurality of different candidate reshaping domains based at least in part on a distribution of input codewords in the input image.
- 13. The method of claim 12, wherein the plurality of different reference shapes comprises at least one of a complete geometric shape, a complete geometric shape with a single gap portion inserted, a complete geometric shape with a plurality of gap portions inserted, a torus shape, a cylinder shape, or a ring shape.
- 14. The method of claim 1 or 2, wherein the input domain represents one of an RGB color space, a YCbCr color space, a perceptual quantization color space, a linear color space, or another color space.
- 15. The method of claim 1 or 2, wherein the first bit depth represents one of 12 bits, 16 bits or more, or another number of bits higher than the second bit depth.
- 16. The method of claim 1 or 2, wherein the forward reshaping operation represents a primary reshaping operation, the method further comprising performing a secondary reshaping operation on the input image to linearly scale a limited input data range to a full data range for each color channel, wherein the secondary reshaping operation comprises one or more of per-channel pre-reshaping or scaling (a per-channel scaling sketch follows the claims).
- 17. The method of claim 16, wherein image metadata is generated based on operating parameters used in the primary reshaping operation and the secondary reshaping operation, wherein the image metadata is provided to the recipient device in the output video signal of the second bit depth.
- 18. The method of claim 16, wherein the secondary reshaping operation is performed in a first stage, prior to performing the primary reshaping operation in a second stage.
- 19. The method of claim 16, wherein the secondary reshaping operation and the primary reshaping operation are combined and performed in a single combined stage at run time.
- 20. The method of claim 1 or 2, wherein the reshaping domain is represented by a torus shape.
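Claims 1, 2, and 4 describe wrapping codewords from a straight axis in the input domain onto a non-straight, higher-capacity axis in the reshaping domain, and claim 20 allows that domain to be a torus. The sketch below is only one illustrative wrapping (a serpentine scan over a 256×256 grid), chosen because it makes the neighborhood-consistency property easy to verify; it is not the patent's specific transform.

```python
import numpy as np

def wrap_codeword(v: np.ndarray, side: int = 256) -> tuple[np.ndarray, np.ndarray]:
    """Wrap a straight 16-bit axis (0..side*side-1) onto a serpentine
    path over a side x side grid. Consecutive codewords map to adjacent
    grid cells, preserving neighborhood consistency. This is one simple
    non-straight wrapping, not the patent's specific transform."""
    row = v // side
    col = v % side
    # Reverse every other row so the path snakes instead of jumping.
    col = np.where(row % 2 == 1, side - 1 - col, col)
    return row.astype(np.uint8), col.astype(np.uint8)

v = np.arange(0, 2**16, dtype=np.int64)
y, x = wrap_codeword(v)

# Adjacent input codewords stay adjacent in the reshaping domain:
dist = np.abs(np.diff(y.astype(int))) + np.abs(np.diff(x.astype(int)))
assert dist.max() == 1  # every step moves to a neighboring cell
```

On a torus as in claim 20, the row and column coordinates would additionally wrap modulo the grid size, removing the discontinuity at the grid border.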
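Claim 1's inner steps transform the forward-reshaped image into a truncated domain whose codewords are guaranteed to stay within a specified range, process it there, and invert the transform; claim 5 adds a clipping operation that forces stray values back inside the target shape. In the hedged sketch below, a plain affine squeeze into a margin-protected sub-range stands in for the (unspecified) truncated-domain transform:

```python
import numpy as np

LO, HI = 16, 1008          # assumed legal codeword range in a 10-bit signal
FULL = 1023                # maximum 10-bit codeword

def forward_truncated(img: np.ndarray) -> np.ndarray:
    """Affine map of [0, FULL] into [LO, HI]; every output codeword is
    guaranteed to sit inside the specified range."""
    return LO + img.astype(np.float64) * (HI - LO) / FULL

def inverse_truncated(img: np.ndarray) -> np.ndarray:
    """Inverse of forward_truncated, back to the full 10-bit range."""
    return (img - LO) * FULL / (HI - LO)

def process(img: np.ndarray) -> np.ndarray:
    # Stand-in image processing step (e.g., filtering) that may push
    # values slightly out of range...
    out = img + np.random.default_rng(1).normal(0, 2.0, img.shape)
    # ...so clip back inside the target range, as in claim 5.
    return np.clip(out, LO, HI)

reshaped = np.random.default_rng(0).integers(0, FULL + 1, size=(4, 4))
container = inverse_truncated(process(forward_truncated(reshaped)))
assert container.min() >= 0 and container.max() <= FULL
```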
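Claim 8 names multivariate multiple regression (MMR) as one candidate forward reshaping map. MMR predicts a target channel from cross-channel polynomial terms of the source channels; the second-order basis below is a common choice from earlier reshaping literature and is assumed here for illustration, with parameters fit by least squares in the spirit of claim 7's per-cluster error minimization:

```python
import numpy as np

def mmr_basis(y: np.ndarray, cb: np.ndarray, cr: np.ndarray) -> np.ndarray:
    """Second-order MMR design matrix: constant, linear, pairwise
    cross-products, and squared terms for each pixel."""
    cols = [np.ones_like(y), y, cb, cr,
            y * cb, y * cr, cb * cr, y * cb * cr,
            y**2, cb**2, cr**2]
    return np.stack(cols, axis=-1)

rng = np.random.default_rng(0)
y, cb, cr = rng.random((3, 1000))            # normalized source channels
target = 0.6 * y + 0.3 * y * cb + 0.1 * cr   # toy reference channel

A = mmr_basis(y, cb, cr)
coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)  # least-squares fit
predicted = A @ coeffs                               # forward-reshaped channel
print(float(np.max(np.abs(predicted - target))))     # near-zero residual
```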
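Claims 9 and 10 train each parameter set with a BESA algorithm whose propagated prediction error combines (a) the difference between input and reconstructed values with (b) a spatial gradient taken as a partial derivative of the cross-channel forward reshaping function. The algorithm itself is not disclosed in this excerpt; the fragment below only shows, under that reading, how such an error term could be assembled for a toy scalar reshaping function F:

```python
import numpy as np

def F(v: np.ndarray, a: float = 0.9, b: float = 0.05) -> np.ndarray:
    """Toy differentiable forward reshaping function (illustrative)."""
    return a * v + b * v**2

def dF_dv(v: np.ndarray, a: float = 0.9, b: float = 0.05) -> np.ndarray:
    """Partial derivative of F with respect to the input codeword."""
    return a + 2 * b * v

v = np.linspace(0, 1, 8)   # input codewords
v_hat = F(v)               # stand-in for the reconstructed values
# Claim-10 reading: error from the input/reconstruction difference
# scaled by the local gradient of the reshaping function.
error = (v - v_hat) * dF_dv(v)
```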
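Claim 11 tunes the codeword scaling factors at run time with a golden-section or Nelder-Mead search. A sketch using SciPy's stock implementations, with a made-up round-trip-quantization cost standing in for whatever objective the encoder actually minimizes:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)
codewords = rng.random(1000)   # normalized input codewords

def distortion(scale: float) -> float:
    """Hypothetical cost: quantization error of scaled codewords at 10
    bits, with clipping loss when scaling pushes values past 1.0."""
    scaled = np.clip(codewords * scale, 0.0, 1.0)
    recon = np.round(scaled * 1023) / 1023 / scale
    return float(np.mean((codewords - recon) ** 2))

# One-dimensional golden-section search for a single scaling factor.
res1d = minimize_scalar(distortion, bracket=(0.5, 1.5), method='golden')

# Nelder-Mead generalizes to several scaling factors at once.
resnd = minimize(lambda s: distortion(s[0]), x0=[1.0], method='Nelder-Mead')
print(res1d.x, resnd.x)
```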
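Claim 16's secondary reshaping linearly stretches a limited per-channel data range to the full range, either before the primary operation (claim 18) or fused with it (claim 19). A minimal sketch of the per-channel pre-scaling:

```python
import numpy as np

def prescale(channel: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Linearly map the limited range [lo, hi] onto the full [0, 1]
    range for one color channel (claim 16's secondary reshaping)."""
    return (channel - lo) / (hi - lo)

rng = np.random.default_rng(0)
frame = rng.uniform(0.1, 0.7, size=(2, 2, 3))   # limited-range RGB-ish data

# Per-channel limits are taken from the data here; an encoder could also
# use signaled or standardized limits.
full = np.stack([prescale(frame[..., c], frame[..., c].min(), frame[..., c].max())
                 for c in range(3)], axis=-1)
assert full.min() == 0.0 and full.max() == 1.0
```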
Description
Wrapped-reshaping image processing method and device with neighborhood consistency, and storage medium

Cross Reference to Related Applications

The present application claims priority from U.S. provisional application No. 63/112,336 and European patent application No. 20206922.5, both filed on 11 November 2020, and both of which are incorporated herein by reference in their entirety.

Technical Field

The present disclosure relates generally to image processing operations. More particularly, embodiments of the present disclosure relate to video codecs.

Background

As used herein, the term "dynamic range" (DR) may relate to the ability of the human visual system (HVS) to perceive a range of intensities (e.g., luminance, brightness) in an image, e.g., from darkest blacks (darks) to brightest whites (highlights). In this sense, DR relates to a "scene-referred" intensity of a reference scene. DR may also relate to the ability of a display device to adequately or approximately render an intensity range of a particular breadth. In this sense, DR relates to a "display-referred" intensity. Unless a particular sense is explicitly specified to have particular significance at any point in the description herein, it should be inferred that the term may be used in either sense, e.g., interchangeably.

As used herein, the term "high dynamic range" (HDR) relates to a DR breadth that spans some 14 to 15 or more orders of magnitude of the human visual system. In practice, the DR over which a human may simultaneously perceive an extensive breadth in intensity range may be somewhat truncated in relation to HDR. As used herein, the terms "enhanced dynamic range" (EDR) or "visual dynamic range" (VDR) may, individually or interchangeably, relate to the DR that is perceivable within a scene or image by the HVS, including eye movements, allowing for some light-adaptation changes across the scene or image. As used herein, EDR may relate to a DR that spans 5 to 6 orders of magnitude. Thus, while perhaps somewhat narrower in relation to true scene-referred HDR, EDR nonetheless represents a wide DR breadth and may also be referred to as HDR.

In practice, an image comprises one or more color components (e.g., luma Y and chroma Cb and Cr) of a color space, where each color component is represented by a precision of n bits per pixel (e.g., n = 8). Using non-linear luminance coding (e.g., gamma encoding), an image with n ≤ 8 (e.g., a color 24-bit JPEG image) is considered a standard dynamic range image, while an image with n > 8 may be considered an enhanced dynamic range image.

A reference electro-optical transfer function (EOTF) for a given display characterizes the relationship between color values (e.g., luminance) of an input video signal and output screen color values (e.g., screen luminance) produced by the display. For example, ITU Rec. ITU-R BT.1886, "Reference electro-optical transfer function for flat panel displays used in HDTV studio production" (March 2011), which is incorporated herein by reference in its entirety, defines the reference EOTF for flat panel displays. Given a video stream, information about its EOTF may be embedded as (image) metadata in the bitstream. The term "metadata" herein relates to any auxiliary information that is transmitted as part of the encoded bitstream and that assists a decoder in rendering a decoded image.
Such metadata may include, but is not limited to, color space or gamut information, reference display parameters, and auxiliary signal parameters, as described herein.

The term "PQ" as used herein refers to perceptual luminance amplitude quantization. The human visual system responds to increasing light levels in a very nonlinear way. A human's ability to see a stimulus is affected by the luminance of that stimulus, the size of the stimulus, the spatial frequencies making up the stimulus, and the luminance level that the eyes have adapted to at the particular moment one is viewing the stimulus. In some embodiments, a perceptual quantizer function maps linear input gray levels to output gray levels that better match the contrast sensitivity thresholds in the human visual system. An example PQ mapping function is described in SMPTE ST 2084:2014, "High Dynamic Range EOTF of Mastering Reference Displays" (hereinafter "SMPTE"), which is incorporated herein by reference in its entirety, wherein, given a fixed stimulus size, for every luminance level (e.g., a stimulus level, etc.), a minimum visible contrast step at that luminance level is selected according to the most sensitive adaptation level and the most sensitive spatial frequency (according to HVS models). Displays that support luminance of 200 to 1,000 cd/m² or nits represent a lower dynamic range (LDR), also referred to as a standard dynamic range (SDR), in relation to EDR (or HDR).
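The PQ mapping cited above has a published closed form. The sketch below implements the ST 2084 inverse EOTF (absolute luminance in cd/m² to a normalized PQ value) using the constants from that standard; it is a reference illustration, not part of the claimed method.

```python
import numpy as np

# SMPTE ST 2084 (PQ) constants.
M1 = 2610 / 16384        # = 0.1593017578125
M2 = 2523 / 4096 * 128   # = 78.84375
C1 = 3424 / 4096         # = 0.8359375
C2 = 2413 / 4096 * 32    # = 18.8515625
C3 = 2392 / 4096 * 32    # = 18.6875

def pq_inverse_eotf(luminance: np.ndarray) -> np.ndarray:
    """Map absolute luminance in cd/m^2 (0..10,000) to a normalized
    PQ codeword in [0, 1], per SMPTE ST 2084."""
    y = np.clip(np.asarray(luminance, dtype=np.float64) / 10000.0, 0.0, 1.0)
    return ((C1 + C2 * y**M1) / (1.0 + C3 * y**M1)) ** M2

# 100 cd/m^2 (a common SDR reference white) lands near PQ code 0.508.
print(pq_inverse_eotf(100.0))
```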