US-20260127857-A1 - SYNTHETIC LICENSE PLATE DATA GENERATION


Abstract

A method or system for enhancing vehicle identification accuracy. The system identifies a gap in a training dataset for training a license plate identification model. The gap represents underrepresented visual characteristics in misidentified license plates. A guidance prompt is generated based on the visual characteristics of the misidentified license plate. Condition embeddings are then generated from the guidance prompt, which are used to condition a diffusion model to create synthetic license plate images. The diffusion model is trained to receive a real license plate image, encode it into a vector, apply forward diffusion to add noise, and then apply reverse diffusion to remove the noise, resulting in a denoised vector that represents a synthetic license plate image conditioned by the condition embeddings. The denoised vector is then decoded to produce the synthetic license plate image. The synthetic images are then used to retrain the license plate identification model.
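The forward-diffusion step mentioned above (incrementally adding noise to the encoded vector) can be illustrated with a minimal sketch. The publication does not specify a noise schedule, so the linear variance schedule, the function names `linear_betas` and `forward_diffuse`, and the toy four-element latent vector are all assumptions made for illustration. The sketch follows the commonly used denoising-diffusion formulation and is not presented as the applicant's implementation.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise-variance schedule (an assumption, not from the source)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_diffuse(vector, betas, rng):
    """Incrementally add Gaussian noise to a latent vector, one time step per beta.

    Returns the full trajectory [x_0, x_1, ..., x_T] as a list of lists.
    """
    trajectory = [list(vector)]
    current = list(vector)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)   # scale applied to the previous vector
        spread = math.sqrt(beta)       # standard deviation of the added noise
        current = [keep * x + spread * rng.gauss(0.0, 1.0) for x in current]
        trajectory.append(current)
    return trajectory

rng = random.Random(0)
latent = [0.5, -1.2, 0.3, 0.9]  # stand-in for an encoded license plate image
steps = forward_diffuse(latent, linear_betas(1000), rng)
```

Reverse diffusion would walk the same time steps in the opposite order using a trained denoising network conditioned on the condition embeddings; that network is outside the scope of this sketch.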

Inventors

Ji Sung HWANG
Anil Kumar Nayak
Long Ngo Hoang Truong

Assignees

METROPOLIS IP HOLDINGS, LLC

Dates

Publication Date: 20260507
Application Date: 20241107

Claims (20)

1. A method for improving vehicle identification accuracy in a vehicle management system, comprising: identifying a gap in a training dataset of license plate images used by a license plate identification model, wherein the gap corresponds to underrepresented visual characteristics in a misidentified license plate; generating a guidance prompt based on the underrepresented visual characteristics of the misidentified license plate; generating condition embeddings based on the guidance prompt; generating a synthetic license plate image by applying a diffusion model conditioned on the condition embeddings, wherein the diffusion model is trained to: receive a real license plate image as input; encode the real license plate image into a vector; apply forward diffusion to the vector to incrementally add noise, generating a noisy vector associated with the real license plate image with added noise; apply reverse diffusion to the noisy vector to produce a denoised vector representing a synthetic license plate image, conditioned on the condition embeddings; and decode the denoised vector to create the synthetic license plate image; and retraining the license plate identification model using the synthetic license plate image.
2. The method of claim 1, wherein the underrepresented visual characteristics of the misidentified license plate include one or more of a character font, character size, character spacing, character positioning, a symbol, a background color, a background graphic, and a slogan, specific to a jurisdiction's license plate.
3. The method of claim 1, wherein identifying the gap in the training dataset comprises analyzing data distribution of the training dataset to identify a gap in the data distribution of the training dataset.
4. The method of claim 1, wherein identifying the gap in the training dataset comprises identifying recurring misidentifications by the vehicle identification model.
5. The method of claim 1, wherein applying the forward diffusion to the vector includes iteratively adding noise to a noisy vector generated over a plurality of time steps, and applying the reverse diffusion includes iteratively removing noise from the noisy vector over a plurality of time steps.
6. The method of claim 1, further comprising: generating a plurality of synthetic license plate images; applying a scoring metric to each of the plurality of synthetic license plate images to determine a compliance score between the synthetic image and the guidance prompt, the compliance score indicating whether the synthetic image is in compliance with the guidance prompt; selecting a subset of the generated synthetic license plate images based on the compliance scores; and retraining or fine-tuning the vehicle identification model using the selected subset of synthetic license plate images.
7. The method of claim 6, wherein applying the scoring metric comprises: applying optical character recognition (OCR) to extract alphanumeric characters from the synthetic images; and comparing the extracted characters with expected characters as indicated in the guidance prompt to determine a similarity.
8. The method of claim 6, wherein the scoring metric further comprises at least one of the following performance metrics: structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), mean squared error (MSE), learned perceptual image patch similarity (LPIPS), Fréchet inception distance (FID), and Fréchet contrastive language-image pretraining (CLIP) distance (FCD).
9. The method of claim 1, wherein the condition embeddings include a layout embedding, a mask embedding, a text embedding, and a character embedding generated based on the guidance prompt.
10. The method of claim 9, wherein the condition embeddings further include an image embedding generated based on existing real license plate images.
11. The method of claim 9, wherein the diffusion model is trained by: accessing a training dataset comprising real license plate images annotated with license plate element locations, sizes, and associated metadata, the metadata including a jurisdiction; for each of the real license plate images, encoding the real license plate image into an initial vector representation; applying a forward diffusion process to the initial vector to add noise to the initial vector to generate a noisy vector corresponding to the real license plate image with noise added; training a reverse diffusion block to denoise the noisy vector to generate a denoised vector that approximates the initial vector, conditioned on condition embeddings associated with positions and sizes of a plurality of license plate elements; determining a reconstruction error based on a difference between the denoised vector and the initial vector; and adjusting weights of the reverse diffusion block to reduce the reconstruction error.
12. A non-transitory computer-readable medium comprising memory with instructions encoded thereon, the instructions comprising instructions to cause one or more processors to perform steps comprising: identifying a gap in a training dataset of license plate images used by a license plate identification model, wherein the gap corresponds to underrepresented visual characteristics in a misidentified license plate; generating a guidance prompt based on the underrepresented visual characteristics of the misidentified license plate; generating condition embeddings based on the guidance prompt; generating a synthetic license plate image by applying a diffusion model conditioned on the condition embeddings, wherein the diffusion model is trained to: receive a real license plate image as input; encode the real license plate image into a vector; apply forward diffusion to the vector to incrementally add noise, generating a noisy vector associated with the real license plate image with added noise; apply reverse diffusion to the noisy vector to produce a denoised vector representing a synthetic license plate image, conditioned on the condition embeddings; and decode the denoised vector to create the synthetic license plate image; and retraining the license plate identification model using the synthetic license plate image.
13. The non-transitory computer-readable medium of claim 12, wherein the underrepresented visual characteristics of the misidentified license plate include one or more of a character font, character size, character spacing, character positioning, a symbol, a background color, a background graphic, and a slogan, specific to a jurisdiction's license plate.
14. The non-transitory computer-readable medium of claim 12, wherein identifying the gap in the training dataset comprises analyzing data distribution of the training dataset to identify a gap in the data distribution of the training dataset.
15. The non-transitory computer-readable medium of claim 12, wherein identifying the gap in the training dataset comprises identifying recurring misidentifications by the license plate identification model.
16. The non-transitory computer-readable medium of claim 12, wherein applying the forward diffusion to the vector includes iteratively adding noise to a noisy vector generated over a plurality of time steps, and applying the reverse diffusion includes iteratively removing noise from the noisy vector over a plurality of time steps.
17. The non-transitory computer-readable medium of claim 12, the steps further comprising: generating a plurality of synthetic license plate images; applying a scoring metric to each of the plurality of synthetic license plate images to determine a compliance score between the synthetic image and the guidance prompt, the compliance score indicating whether the synthetic image is in compliance with the guidance prompt; selecting a subset of the generated synthetic license plate images based on the compliance scores; and retraining or fine-tuning the license plate identification model using the selected subset of synthetic license plate images.
18. The non-transitory computer-readable medium of claim 17, wherein applying the scoring metric comprises: applying optical character recognition (OCR) to extract alphanumeric characters from the synthetic images; and comparing the extracted characters with expected characters as indicated in the guidance prompt to determine a similarity.
19. The non-transitory computer-readable medium of claim 18, wherein the diffusion model is trained by: accessing a training dataset comprising real license plate images annotated with license plate element locations, sizes, and associated metadata, the metadata including a jurisdiction; for each of the real license plate images, encoding the real license plate image into an initial vector representation; applying a forward diffusion process to the initial vector to add noise to the initial vector to generate a noisy vector corresponding to the real license plate image with noise added; training a reverse diffusion block to denoise the noisy vector to generate a denoised vector that approximates the initial vector, conditioned on condition embeddings associated with positions and sizes of a plurality of license plate elements; determining a reconstruction error based on a difference between the denoised vector and the initial vector; and adjusting weights of the reverse diffusion block to reduce the reconstruction error.
20. A system comprising: memory with instructions encoded thereon; and one or more processors that, when executing the instructions, are caused to perform operations comprising: identifying a gap in a training dataset of license plate images used by a license plate identification model, wherein the gap corresponds to underrepresented visual characteristics in a misidentified license plate; generating a guidance prompt based on the underrepresented visual characteristics of the misidentified license plate; generating condition embeddings based on the guidance prompt; generating a synthetic license plate image by applying a diffusion model conditioned on the condition embeddings, wherein the diffusion model is trained to: receive a real license plate image as input; encode the real license plate image into a vector; apply forward diffusion to the vector to incrementally add noise, generating a noisy vector associated with the real license plate image with added noise; apply reverse diffusion to the noisy vector to produce a denoised vector representing a synthetic license plate image, conditioned on the condition embeddings; and decode the denoised vector to create the synthetic license plate image; and retraining the license plate identification model using the synthetic license plate image.
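
The OCR-based compliance scoring and subset selection recited in claims 6 and 7 (and their counterparts, claims 17 and 18) can be sketched as follows. The claims do not name a similarity measure, so normalized Levenshtein distance is swapped in here as one plausible choice. The sketch assumes the OCR output is already available as a string; the names `edit_distance`, `compliance_score`, and `select_compliant`, the 0.8 threshold, and the sample plate strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def compliance_score(ocr_text, expected_text):
    """Similarity in [0, 1]; 1.0 means the OCR output equals the expected characters."""
    longest = max(len(ocr_text), len(expected_text))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ocr_text, expected_text) / longest

def select_compliant(candidates, expected_text, threshold=0.8):
    """Keep identifiers of images whose OCR text scores at or above the threshold."""
    return [image_id for image_id, ocr_text in candidates
            if compliance_score(ocr_text, expected_text) >= threshold]

# (image identifier, OCR text extracted from the synthetic image); hypothetical data
candidates = [("img_001", "7ABC123"), ("img_002", "7A8C123"), ("img_003", "TABO128")]
selected = select_compliant(candidates, "7ABC123")
```

With these sample strings, the first two candidates pass the 0.8 threshold and the third, which differs from the expected text in three of seven characters, is dropped. Image-level metrics such as those listed in claim 8 could be combined with this character-level score, though the claims do not say how they would be weighted.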

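Claims 3 and 14 describe identifying the gap by analyzing the data distribution of the training dataset. One simple reading, sketched below, counts training images per attribute value and flags values whose share of the dataset falls below a floor. The function name `find_underrepresented`, the 10% floor, and the jurisdiction labels are hypothetical; the claims do not prescribe a particular statistic.

```python
from collections import Counter

def find_underrepresented(labels, min_share=0.10):
    """Return (value, share) pairs whose share of the dataset is below min_share,
    sorted from least to most represented."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    gaps = [(value, count / total) for value, count in counts.items()
            if count / total < min_share]
    return sorted(gaps, key=lambda item: (item[1], item[0]))

# Hypothetical jurisdiction label for each image in a 100-image training set
jurisdictions = ["CA"] * 60 + ["NY"] * 30 + ["TX"] * 7 + ["HI"] * 3
gaps = find_underrepresented(jurisdictions)
```

The flagged values could then seed the guidance prompt, for example by requesting synthetic plates for the least represented jurisdiction first. The same counting approach would apply to other annotated attributes such as font or background color, assuming such labels exist in the dataset.
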
Description

TECHNICAL FIELD

The disclosure generally relates to the field of computer vision, and more particularly relates to using a diffusion model to generate synthetic images.

BACKGROUND

Computer vision is a field of artificial intelligence (AI) focused on enabling computers to interpret and make decisions based on visual data. Computer vision includes the use of algorithms and machine learning models to analyze images and videos, extract meaningful information, and perform tasks that typically require human visual understanding. Such machine learning models may perform image recognition, image classification, object detection, image segmentation, pose estimation, and scene understanding, among others. For example, a license plate recognition model may be trained to automatically detect and recognize license plates within images or video frames. Such a model may be implemented for access control in restricted areas, such as parking facilities, residential or commercial areas, toll roads, etc.

Notably, a sufficient amount of training data is generally required for training a computer vision model due to the complexity and diversity of visual information in real-world applications. Real-world images exhibit large variability, such as differences in lighting, angle, backgrounds, object appearances, sizes, and poses. A model trained on limited data may only perform well on a narrow range of conditions, failing to generalize to unseen scenarios. Further, with a small dataset, a model may “memorize” specific features of the training images instead of learning generalized patterns. This can lead to overfitting, where the model performs well on training data but poorly on new data.

Furthermore, real-world datasets may have insufficient or uneven coverage. For example, for a license plate recognition model, the real-world dataset may have gaps in data for certain license plate layouts, states, characters, lighting conditions, angles, or environmental factors.
This can lead to poor model performance in recognizing certain types of plates or underrepresented variations.

SUMMARY

Embodiments described herein address the above-described limitations by training and applying a machine learning model (e.g., a diffusion model) to generate synthetic license plate images conditioned on layout conditions of a license plate. The generated synthetic license plate images can then be used to train or retrain a license plate identification model.

Embodiments described herein include a system or a method for enhancing vehicle identification accuracy. The system identifies a gap in a training dataset for training a license plate identification model. The gap represents underrepresented visual characteristics in misidentified license plates. A guidance prompt is generated based on the visual characteristics of the misidentified license plate. Condition embeddings are then generated from the guidance prompt, which are used to condition a diffusion model to create synthetic license plate images. The diffusion model is trained to receive a real license plate image, encode it into a vector, apply forward diffusion to add noise, and then apply reverse diffusion to remove the noise, resulting in a denoised vector that represents a synthetic license plate image conditioned by the condition embeddings. The denoised vector is then decoded to produce the synthetic license plate image. The synthetic images are then used to retrain the license plate identification model.

In some embodiments, the system receives a guidance prompt specifying elements of a license plate, including a jurisdiction of the license plate. The system uses a first language model to parse the guidance prompt to extract these elements, which are then applied to a pre-trained layout generation model to produce a license plate layout. The layout generation model is trained on real license plate images with labeled bounding boxes identifying each element.
The system then uses a second language model to generate a set of condition embeddings based on the license plate layout. The system then applies a machine learning model (e.g., a diffusion model conditioned on the set of condition embeddings) to generate synthetic license plate images, and uses the synthetic license plate images to train or retrain a license plate identification model.

BRIEF DESCRIPTION OF DRAWINGS

The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.

FIG. 1 illustrates one embodiment of a system environment for managing vehicle parking or transit through a managed facility, using an edge device and a vehicle management server.
FIG. 2 illustrates one embodiment of exemplary modules operated by an edge device in accordance with one or more embodiments.
FIG. 3 illustrates one embodiment of exemplary modules operated by a vehicle management server in accord