EP-4315244-B1 - PICTURE QUALITY-SENSITIVE SEMANTIC SEGMENTATION FOR USE IN TRAINING IMAGE GENERATION ADVERSARIAL NETWORKS
Inventors
- BAU, TIEN C.
- GARUD, HRISHIKESH DEEPAK
Dates
- Publication Date: 2026-05-06
- Application Date: 2022-09-07
Claims (13)
- A method (900) comprising: training a semantic segmentation network (440, 640) to generate semantic segmentation maps (230, 311, 312, 441) comprising class-wise probability values and to be sensitive to picture quality of an output image (408) generated by an image generation network (402) during the training of the image generation network (402); generating a semantic segmentation map (230, 311, 312, 441) using the trained semantic segmentation network (228); and utilizing the semantic segmentation map (230, 311, 312, 441) during training of the image generation network (402) as part of a loss function that comprises multiple losses, characterized in that training the semantic segmentation network (440, 640) to be sensitive to picture quality comprises: training the semantic segmentation network (440, 640) to vary the class-wise probability values based on the picture quality.
- The method (900) of Claim 1, wherein the training of the semantic segmentation network (440, 640) to be sensitive to picture quality of the output image (408) generated by the image generation network (402) during the training of the image generation network (402) occurs such that increased degradation of the picture quality of the output image (408) results in decreased prediction confidence by the semantic segmentation network (440, 640).
- The method (900) of Claim 1, wherein training the semantic segmentation network (440, 640) to vary the class-wise probability values based on the picture quality comprises: scaling each class-wise probability value using a confusion factor having an inverse relationship with the picture quality such that each class-wise probability value indicates higher confusion when the picture quality is lower.
- The method (900) of Claim 1, wherein the multiple losses of the loss function comprise a semantic segmentation loss (445) provided by the semantic segmentation network (440, 640), a pixel loss, and a generative adversarial network (162, 204), GAN, loss provided by a discriminator network (410).
- The method (900) of Claim 4, wherein the loss function further comprises a perceptual loss (425) provided by a pre-trained perceptual neural network (162, 204).
- The method (900) of Claim 1, wherein the image generation network (402) comprises a super-resolution neural network (162, 204) or an image simulation network (162, 204).
- An electronic device (101, 102, 104) comprising: at least one memory (130) configured to store instructions; and at least one processing device configured when executing the instructions to: train a semantic segmentation network (440, 640) to generate semantic segmentation maps (230, 311, 312, 441) comprising class-wise probability values and to be sensitive to picture quality of an output image (408) generated by an image generation network (402) during the training of the image generation network (402); generate a semantic segmentation map (230, 311, 312, 441) using the trained semantic segmentation network (228); and utilize the semantic segmentation map (230, 311, 312, 441) during training of the image generation network (402) as part of a loss function that comprises multiple losses, characterized in that to train the semantic segmentation network (440, 640) to be sensitive to picture quality, the at least one processing device is configured to: train the semantic segmentation network (440, 640) to vary the class-wise probability values based on the picture quality.
- The electronic device (101, 102, 104) of Claim 7, wherein the at least one processing device is further configured to: train the semantic segmentation network (440, 640) to be sensitive to picture quality of the output image (408) generated by the image generation network (402) during the training of the image generation network (402) such that increased degradation of the picture quality of the output image (408) results in decreased prediction confidence by the semantic segmentation network (440, 640).
- The electronic device (101, 102, 104) of Claim 7, wherein, to train the semantic segmentation network (440, 640) to vary the class-wise probability values based on the picture quality, the at least one processing device is configured to: scale each class-wise probability value using a confusion factor having an inverse relationship with the picture quality such that each class-wise probability value indicates higher confusion when the picture quality is lower.
- The electronic device (101, 102, 104) of Claim 7, wherein the multiple losses of the loss function comprise a semantic segmentation loss (445) provided by the semantic segmentation network (440, 640), a pixel loss, and a generative adversarial network (162, 204), GAN, loss provided by a discriminator network (410).
- The electronic device (101, 102, 104) of Claim 10, wherein the loss function further comprises a perceptual loss (425) provided by a pre-trained perceptual neural network (162, 204).
- The electronic device (101, 102, 104) of Claim 7, wherein the image generation network (402) comprises a super-resolution neural network (162, 204) or an image simulation network (162, 204).
- A non-transitory machine-readable medium containing instructions that when executed cause at least one processor (120) of an electronic device (101, 102, 104) to: train a semantic segmentation network (440, 640) to generate semantic segmentation maps (230, 311, 312, 441) comprising class-wise probability values and to be sensitive to picture quality of an output image (408) generated by an image generation network (402) during the training of the image generation network (402); generate a semantic segmentation map (230, 311, 312, 441) using the trained semantic segmentation network (228); and utilize the semantic segmentation map (230, 311, 312, 441) during training of the image generation network (402) as part of a loss function that comprises multiple losses, characterized in that the instructions that when executed cause the at least one processor to train the semantic segmentation network to be sensitive to picture quality comprise instructions that when executed cause the at least one processor to: train the semantic segmentation network to vary the class-wise probability values based on the picture quality.
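The mechanism recited in Claims 1 and 3 (softening class-wise probability values as picture quality degrades) and the composite objective of Claims 4 and 5 can be illustrated with a minimal PyTorch sketch. This is not the patent's reference implementation: the per-image quality score, the linear blend toward a uniform distribution, and the loss weights below are illustrative assumptions; the claims only require that the confusion factor have an inverse relationship with picture quality.

```python
# Illustrative sketch only -- not the patented implementation.
# Assumed: a picture-quality score q in [0, 1] (1 = pristine) is
# available per image; how it is measured is not specified here.

import torch
import torch.nn.functional as F


def quality_sensitive_probs(logits: torch.Tensor,
                            quality: torch.Tensor) -> torch.Tensor:
    """Soften class-wise probabilities via a confusion factor (Claims 1, 3).

    logits:  (N, C, H, W) raw scores from the segmentation network.
    quality: (N, 1, 1, 1) picture-quality score in [0, 1].
    """
    probs = F.softmax(logits, dim=1)
    uniform = torch.full_like(probs, 1.0 / probs.shape[1])
    # Inverse relationship: lower quality -> larger confusion factor,
    # pushing every class-wise probability toward maximal confusion.
    confusion = 1.0 - quality
    return (1.0 - confusion) * probs + confusion * uniform


def generator_loss(output, target, seg_pred, seg_target, disc_fake_logits,
                   vgg_feats=None, w_pix=1.0, w_seg=0.05, w_gan=0.01,
                   w_perc=0.1):
    """Composite objective in the spirit of Claims 4-5 (weights assumed).

    seg_pred / seg_target are quality-softened probability maps for the
    generated output and the clean reference image, respectively.
    """
    pixel = F.l1_loss(output, target)                        # pixel loss
    seg = F.kl_div(seg_pred.clamp_min(1e-8).log(),           # segmentation loss
                   seg_target, reduction="batchmean")
    gan = F.binary_cross_entropy_with_logits(                # GAN loss
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    total = w_pix * pixel + w_seg * seg + w_gan * gan
    if vgg_feats is not None:                                # optional perceptual loss
        total = total + w_perc * F.l1_loss(*vgg_feats)
    return total
```

Blending toward the uniform distribution is one simple way to make every class-wise probability value "indicate higher confusion" at low picture quality; temperature-scaling the logits by the same factor would be an equally valid reading of Claim 3.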
Description
[Technical Field]

This disclosure relates generally to imaging systems. More specifically, this disclosure relates to a system and method for picture quality-sensitive semantic segmentation for use in training image generation adversarial networks.

[Background Art]

Image generation algorithms typically create new images from scratch by learning abstract contextual information of real-life objects, such as cars, trees, mountains, and clouds. Image generation algorithms are useful for multiple applications, such as training data generation, super-resolution, and simulation. Typically, machine learning models are trained using special methods and loss functions to achieve desired results. For example, generative adversarial network (GAN)-based super-resolution algorithms often try to generate the most realistic high-resolution images with the aid of perceptual loss and discriminator loss. For instance, "Multi-Resolution Generative Adversarial Networks for Tiny-Scale Pedestrian Detection" by Yin Ruihao introduces an approach to enhancing pedestrian detection in low-resolution scenarios by leveraging a multi-resolution GAN architecture that improves the clarity and distinctiveness of tiny-scale objects. Most of these algorithms generate details that are plausible but not realistic, meaning the details can easily be identified as artificial on close inspection.

[Disclosure]
[Technical Solution]

This disclosure provides a system and method for picture quality-sensitive semantic segmentation for use in training image generation adversarial networks. The present invention is directed to subject matter as defined in the claims. In a first embodiment, a method is defined in claim 1. In a second embodiment, an electronic device is defined in claim 7. In a third embodiment, a non-transitory machine-readable medium is defined in claim 13. Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.

Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The terms "transmit," "receive," and "communicate," as well as derivatives thereof, encompass both direct and indirect communication. The terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation. The term "or" is inclusive, meaning and/or. The phrase "associated with," as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like.

Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable medium. The terms "application" and "program" refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase "computer readable program code" includes any type of computer code, including source code, object code, and executable code.
The phrase "computer readable medium" includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory. A "non-transitory" computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device. As used here, terms and phrases such as "have," "may have," "include," or "may include" a feature (like a number, function, operation, or component such as a part) indicate the existence of the feature and do not exclude the existence of other features. Also, as used here, the phrases "A or B," "at least one of A and/or B," or "one or more of A and/or B" may include all possible combinations of A and B. For example, "A or B," "at least one of A and B," and "at least one of A or B" may indicate all of (1) including at least one A, (2) including at least one B, or (3) including at least one A and at least one B. Further, as used here, the terms "first" and "second" may modify various components regardless of importance and do not limit the components. These terms are only used to distinguish one component from another. For example, a first user device and a second user device may indicate diffe