EP-4540779-B1 - GENERATING IMAGES WITH SMALL OBJECTS FOR TRAINING A PRUNED SUPER-RESOLUTION NETWORK
Inventors
- JNAWALI, Kamal
- BAU, Tien Cheng
- KIM, Joonsoo
Dates
- Publication Date
- 20260506
- Application Date
- 20230926
Claims (15)
- A method comprising: detecting at least one object displayed within at least one input frame (305) of an input video; cropping, from the at least one input frame (305), at least one cropped image (335, 345) including the at least one object; generating at least one training image (355) by overlaying simulated text on the at least one cropped image (335, 345); and providing the at least one training image (355) to a pruned convolutional neural network (CNN), wherein the pruned CNN learns, from the at least one training image (355), to reconstruct objects and textual regions during image super-resolution.
- The method of claim 1, wherein the detecting comprises: applying Sobel edge detection to a pre-determined number of input frames (305) of the input video to generate a probabilistic static map, wherein the probabilistic static map is noise-free and contains only one or more static objects detected within the pre-determined number of input frames (305).
- The method of claim 2, wherein the cropping comprises: determining a center of each of the one or more static objects detected based on the probabilistic static map; generating a list comprising each pixel-location of each center determined; randomly sampling a pixel-location from the list; and generating a cropped image (335) comprising at least one of the one or more static objects detected, wherein the randomly sampled pixel-location is a center of the cropped image (335).
- The method of any one of claims 1 to 3, wherein the detecting comprises: detecting and localizing one or more objects within the at least one input frame (305) of the input video using a deep learning model for You Only Look Once (YOLO)-based object detection; and providing an output frame comprising the one or more objects and one or more bounding boxes corresponding to the one or more objects.
- The method of any one of claims 1 to 4, wherein the cropping comprises: generating a list comprising each of the one or more bounding boxes; randomly selecting a bounding box from the list; and generating a cropped image (345) comprising at least one of the one or more objects, wherein the randomly selected bounding box corresponds to an object included in the cropped image (345).
- The method of any one of claims 1 to 5, wherein each object occupies less than five percent of an entire area of the at least one input frame (305).
- The method of any one of claims 1 to 6, wherein each object is one of an icon, a map, a logo, a number, or text.
- A processor-readable medium that includes a program that, when executed by a processor, causes the processor to perform the method of any one of claims 1 to 7.
- A system comprising: at least one processor (910); and a processor-readable memory device (930) storing instructions that when executed by the at least one processor (910) cause the at least one processor (910) to perform operations including: detecting at least one object displayed within at least one input frame (305) of an input video; cropping, from the at least one input frame (305), at least one cropped image (335, 345) including the at least one object; generating at least one training image (355) by overlaying simulated text on the at least one cropped image (335, 345); and providing the at least one training image (355) to a pruned convolutional neural network (CNN), wherein the pruned CNN learns, from the at least one training image (355), to reconstruct objects and textual regions during image super-resolution.
- The system of claim 9, wherein the detecting comprises: applying Sobel edge detection to a pre-determined number of input frames (305) of the input video to generate a probabilistic static map, wherein the probabilistic static map is noise-free and contains only one or more static objects detected within the pre-determined number of input frames (305).
- The system of claim 10, wherein the cropping comprises: determining a center of each of the one or more static objects detected based on the probabilistic static map; generating a list comprising each pixel-location of each center determined; randomly sampling a pixel-location from the list; and generating a cropped image (335) comprising at least one of the one or more static objects detected, wherein the randomly sampled pixel-location is a center of the cropped image (335).
- The system of any one of claims 9 to 11, wherein the detecting comprises: detecting and localizing one or more objects within the at least one input frame (305) of the input video using a deep learning model for You Only Look Once (YOLO)-based object detection; and providing an output frame comprising the one or more objects and one or more bounding boxes corresponding to the one or more objects.
- The system of any one of claims 9 to 12, wherein the cropping comprises: generating a list comprising each of the one or more bounding boxes; randomly selecting a bounding box from the list; and generating a cropped image (345) comprising at least one of the one or more objects, wherein the randomly selected bounding box corresponds to an object included in the cropped image (345).
- The system of any one of claims 9 to 13, wherein each object occupies less than five percent of an entire area of the at least one input frame (305).
- The system of any one of claims 9 to 14, wherein each object is one of an icon, a map, a logo, a number, or text.
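Claims 2 and 3 describe building a probabilistic static map from Sobel edge responses accumulated over several input frames, then cropping around a randomly sampled object center. The following is a minimal, self-contained sketch in plain Python of one way such a pipeline could work; the function names, edge threshold, persistence fraction, and the clamping of crop windows near frame borders are all illustrative assumptions, not taken from the patent:

```python
import random

def sobel_magnitude(frame):
    """Approximate gradient magnitude |Gx| + |Gy| with 3x3 Sobel kernels.
    `frame` is a 2-D list of grayscale values."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (frame[y-1][x+1] + 2*frame[y][x+1] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y][x-1] - frame[y+1][x-1])
            gy = (frame[y+1][x-1] + 2*frame[y+1][x] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y-1][x] - frame[y-1][x+1])
            out[y][x] = abs(gx) + abs(gy)
    return out

def probabilistic_static_map(frames, edge_thresh=100.0, persistence=0.9):
    """Keep pixels whose Sobel edges persist across frames: moving
    content and transient noise fall below `persistence` and drop out,
    leaving only static objects (logos, icons, score boxes)."""
    h, w = len(frames[0]), len(frames[0][0])
    counts = [[0] * w for _ in range(h)]
    for frame in frames:
        mag = sobel_magnitude(frame)
        for y in range(h):
            for x in range(w):
                if mag[y][x] > edge_thresh:
                    counts[y][x] += 1
    n = len(frames)
    return [[1.0 if counts[y][x] / n >= persistence else 0.0
             for x in range(w)] for y in range(h)]

def crop_around_center(frame, centers, size=64, rng=None):
    """Sample one detected object center and cut a size x size window.
    The window is clamped to the frame bounds, so a center near the
    border is not exactly centered (an implementation choice here,
    not part of the claim)."""
    rng = rng or random.Random()
    cy, cx = rng.choice(centers)
    h, w = len(frame), len(frame[0])
    half = size // 2
    top = min(max(cy - half, 0), max(h - size, 0))
    left = min(max(cx - half, 0), max(w - size, 0))
    return [row[left:left + size] for row in frame[top:top + size]]
```

In this sketch an edge that appears in at least 90% of the sampled frames is treated as static, which is one plausible reading of "noise-free" in claim 2: moving objects and single-frame noise cannot sustain a persistent edge at a fixed pixel location.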
Description
Technical Field

This disclosure generally relates to image super-resolution (SR), and in particular to a method and system for generating images with small objects for training a pruned SR network.

Background Art

Image super-resolution (SR) is the process of recovering high-resolution (HR) images from low-resolution (LR) images. "Single image super-resolution based on neural networks for text and face recognition", Peyrard Clément, discloses super-resolution (SR) methods utilizing convolutional neural networks to enhance automatic recognition systems, such as optical character recognition and face recognition, by generating high-resolution images from low-resolution inputs while restoring spatial high frequencies and mitigating artifacts.

Disclosure of Invention

Solution to Problem

In an embodiment, a method may comprise detecting at least one object displayed within at least one input frame of an input video. The method may further comprise cropping, from the at least one input frame, at least one cropped image including the at least one object. The method may further comprise generating at least one training image by overlaying simulated text on the at least one cropped image. The method may further comprise providing the at least one training image to a pruned convolutional neural network (CNN). The pruned CNN may learn, from the at least one training image, to reconstruct objects and textual regions during image super-resolution.

In an embodiment, a system may comprise at least one processor and a processor-readable memory device storing instructions that when executed by the at least one processor cause the at least one processor to perform operations. The operations may include detecting at least one object displayed within at least one input frame of an input video. The operations may further include cropping, from the at least one input frame, at least one cropped image including the at least one object.
The operations may further include generating at least one training image by overlaying simulated text on the at least one cropped image. The operations may further include providing the at least one training image to a pruned CNN. The pruned CNN may learn, from the at least one training image, to reconstruct objects and textual regions during image super-resolution.

In an embodiment, a processor-readable medium includes a program that, when executed by a processor, may cause the processor to perform a method. The method may comprise detecting at least one object displayed within at least one input frame of an input video. The method may further comprise cropping, from the at least one input frame, at least one cropped image including the at least one object. The method may further comprise generating at least one training image by overlaying simulated text on the at least one cropped image. The method may further comprise providing the at least one training image to a pruned CNN. The pruned CNN may learn, from the at least one training image, to reconstruct objects and textual regions during image super-resolution.

Brief Description of Drawings

For a fuller understanding of the nature and advantages of the disclosure, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an example of a high-resolution (HR) image reconstructed from a low-resolution (LR) image using a conventional pruned super-resolution (SR) network;
FIG. 2 illustrates an example computing architecture for generating training data for training a pruned SR network to learn features corresponding to small objects, according to an embodiment of the disclosure;
FIG. 3 illustrates an example training system for generating training data for training a pruned SR network to learn features corresponding to small objects, according to an embodiment of the disclosure;
FIG. 4 illustrates an example workflow of a static map generator of the training system, according to an embodiment of the disclosure;
FIG. 5 illustrates an example workflow of training a deep learning model utilized by a deep learning You Only Look Once (YOLO)-based object detector of the training system, according to an embodiment of the disclosure;
FIG. 6 illustrates an example workflow of a text overlayer of the training system, according to an embodiment of the disclosure;
FIG. 7 illustrates example input frames, an example probabilistic static map, and an example static detection-based cropped image, according to an embodiment of the disclosure;
FIG. 8 illustrates an example input frame, an example output frame with one or more bounding boxes, and an example YOLO-based cropped image, according to an embodiment of the disclosure;
FIG. 9 illustrates an example cropped image and an example training image with added or overlaid text, according to an embodiment of the disclosure;
FIG. 10 illustrates an example of visual differences between an LR image and an H
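The text overlayer (FIG. 6, and the "generating at least one training image" step of claim 1) can be sketched as stamping simulated text glyphs onto a cropped image so that the pruned SR network sees textual regions during training. The tiny 3x3 glyph bitmaps, the glyph spacing, and the fixed intensity below are illustrative stand-ins for a real font rasterizer with randomized strings, fonts, and sizes:

```python
# Illustrative glyph bitmaps; a real system would render arbitrary text.
GLYPHS = {
    "T": ["###", ".#.", ".#."],
    "V": ["#.#", "#.#", ".#."],
}

def overlay_text(image, text, top, left, intensity=255):
    """Stamp each glyph of `text` onto a copy of `image` (a 2-D list of
    grayscale values), starting at (top, left), one column of padding
    between glyphs. The input image is left unmodified."""
    out = [row[:] for row in image]
    for i, ch in enumerate(text):
        glyph = GLYPHS[ch]
        for dy, glyph_row in enumerate(glyph):
            for dx, cell in enumerate(glyph_row):
                if cell == "#":
                    out[top + dy][left + i * 4 + dx] = intensity
    return out
```

Pairing such synthetic text with the cropped small-object images gives the pruned CNN training examples that contain both object structure and textual regions, which is the combination claim 1 says the network learns to reconstruct.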