US-20260127862-A1 - MULTIMODAL LLM CONTROLLER FOR AUTONOMOUS DRIVING CORNER CASES


Abstract

Systems and methods are provided for identifying an issue in an input image displayed on a user interface and generating a natural language description of the issue. The issue is a visual depiction of an aspect of the input image on which a model has not been sufficiently trained by its training data, where sufficient training means the model reaches a performance threshold when tested on the issue. The systems and methods further include generating, from the natural language description, a set of simulated images that reflect one or more variations of the issue; selecting one or more training images from the set of simulated images, the selected training images increasing the variations of the issue represented in the training data; and training the model using the selected training images.
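
The notions of insufficient training and of selecting images that add variations can be made concrete with a minimal Python sketch. Everything named here is hypothetical and does not come from the application: the `SimulatedImage` record, the `variation` tag, the use of accuracy as the performance metric, and the 0.9 threshold are assumptions chosen only for illustration. The sketch treats an issue as insufficiently trained when accuracy on test samples of that issue falls below the threshold, and it selects simulated images whose variation is not yet present in the training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedImage:
    """Hypothetical record for one generated image (not from the application)."""
    image_id: str
    issue: str       # natural language description the image was generated from
    variation: str   # assumed tag naming the variation, e.g. "night"


def is_insufficiently_trained(correct: list[bool], threshold: float = 0.9) -> bool:
    """Return True when accuracy on the issue's test samples is below the threshold.

    Accuracy and the 0.9 default are illustrative assumptions; the application
    only refers to a "performance threshold".
    """
    if not correct:
        return True  # no test evidence at all: treat as insufficient
    accuracy = sum(correct) / len(correct)
    return accuracy < threshold


def select_training_images(simulated, existing_variations):
    """Keep one image per variation that the training data does not yet cover."""
    seen = set(existing_variations)
    selected = []
    for image in simulated:
        if image.variation not in seen:
            selected.append(image)
            seen.add(image.variation)
    return selected


if __name__ == "__main__":
    issue = "deer standing in the lane"
    results = [True, False, False, True]  # 50% accuracy on this issue
    if is_insufficiently_trained(results):
        candidates = [
            SimulatedImage("img-1", issue, "daylight"),
            SimulatedImage("img-2", issue, "night"),
            SimulatedImage("img-3", issue, "night"),
            SimulatedImage("img-4", issue, "fog"),
        ]
        chosen = select_training_images(candidates, existing_variations={"daylight"})
        print([image.image_id for image in chosen])  # prints ['img-2', 'img-4']
```

Selecting by unseen variation is only one possible reading of "increasing the one or more variations of the issue in the training data"; a real selection step could also weigh image quality or realism.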

Inventors

Sparsh Garg
Manmohan Chandraker
Xu Cao

Assignees

NEC LABORATORIES AMERICA, INC.

Dates

Publication Date: 20260507
Application Date: 20251106

Claims (20)

1 . A method comprising: identifying an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data used for training a model does not have sufficient training on, having sufficient training includes the model reaching a performance threshold in response to testing the model on the issue; generating a natural language description of the issue; generating a set of simulated images from the natural language description that reflect one or more variations of the issue; selecting one or more training images to provide selected one or more training images from the set of simulated images, the selected one or more training images increasing the one or more variations of the issue in the training data; and training the model using the selected one or more training images.
2 . The method of claim 1 , wherein generating the set of simulated images further comprises: iteratively correcting the set of simulated images by applying Application Programming Interfaces (APIs) until the set of simulated images matches preset requirements.
3 . The method of claim 1 , wherein generating the set of simulated images from the natural language description further comprises: extracting bounding boxes in the input image by applying an open vocabulary detector (OVD) to localize objects.
4 . The method of claim 1 , further comprising: storing at least one of the set of simulated images in a database.
5 . The method of claim 4 , further comprising: identifying issues in at least one stored image from the set of simulated images.
6 . The method of claim 1 , wherein generating the set of simulated images further comprises: editing a bounding box in the input image to replace an object in the bounding box with a different object.
7 . The method of claim 1 , wherein generating the set of simulated images further comprises: merging multiple bounding boxes in the input image.
8 . The method of claim 1 , wherein generating the set of simulated images further comprises: splitting a bounding box in the input image into multiple bounding boxes.
9 . The method of claim 1 , wherein generating the set of simulated images further comprises: changing a background and lighting of the set of simulated images.
10 . A system comprising: a processor; and a memory storing computer-readable instructions that, when executed by the processor, cause the system to: identify an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data used for training a model does not have sufficient training on, having sufficient training includes the model reaching a performance threshold in response to testing the model on the issue; generate a natural language description of the issue; generate a set of simulated images from the natural language description that reflect one or more variations of the issue; select one or more training images to provide selected one or more training images from the set of simulated images, the selected one or more training images increasing the one or more variations of the issue in the training data; and train the model using the selected one or more training images.
11 . The system of claim 10 , wherein causing the system to generate the set of simulated images further includes causing the system to: iteratively correct the set of simulated images by applying Application Programming Interfaces (APIs) until the set of simulated images matches preset requirements.
12 . The system of claim 10 , wherein causing the system to generate the set of simulated images from the natural language description further includes causing the system to: extract bounding boxes in the input image by applying an open vocabulary detector (OVD) to localize objects.
13 . The system of claim 10 , further causing the system to: store at least one of the set of simulated images in a database.
14 . The system of claim 13 , further causing the system to: identify issues in at least one stored image from the set of simulated images.
15 . The system of claim 10 , wherein causing the system to generate the set of simulated images further includes causing the system to: edit a bounding box in the input image to replace an object in the bounding box with a different object.
16 . The system of claim 10 , wherein causing the system to generate the set of simulated images further includes causing the system to: merge multiple bounding boxes in the input image.
17 . The system of claim 10 , wherein causing the system to generate the set of simulated images further includes causing the system to: split a bounding box in the input image into multiple bounding boxes.
18 . A computer program product comprising a non-transitory computer-readable storage medium containing computer program code, the computer program code when executed by one or more processors causes the one or more processors to perform operations, the computer program code comprising instructions to: identify an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data used for training a model does not have sufficient training on, having sufficient training includes the model reaching a performance threshold in response to testing the model on the issue; generate a natural language description of the issue; generate a set of simulated images from the natural language description that reflect one or more variations of the issue; select one or more training images to provide selected one or more training images from the set of simulated images, the selected one or more training images increasing the one or more variations of the issue in the training data; and train the model using the selected one or more training images.
19 . The computer program product of claim 18 , wherein causing the processor to generate the set of simulated images further includes causing the processor to: iteratively correct the set of simulated images by applying Application Programming Interfaces (APIs) until the set of simulated images matches preset requirements.
20 . The computer program product of claim 18 , wherein causing the processor to generate the set of simulated images from the natural language description further includes causing the processor to: extract bounding boxes in the input image by applying an open vocabulary detector (OVD) to localize objects; store at least one of the set of simulated images in a database; and identify issues in at least one stored image from the set of simulated images.
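
Claims 7, 8, 16, and 17 recite merging and splitting bounding boxes without specifying the geometry. The following Python sketch shows one plausible interpretation, offered only as an illustration: merging is taken to mean the smallest axis-aligned box enclosing all inputs, and splitting is taken to mean cutting a box into equal parts along one axis. The `Box` type and the functions `merge_boxes` and `split_box` are hypothetical names, not part of the application.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates, (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def merge_boxes(boxes: list[Box]) -> Box:
    """Merge boxes into the smallest box that encloses all of them (assumed meaning)."""
    if not boxes:
        raise ValueError("need at least one box to merge")
    return Box(
        min(b.x1 for b in boxes),
        min(b.y1 for b in boxes),
        max(b.x2 for b in boxes),
        max(b.y2 for b in boxes),
    )


def split_box(box: Box, parts: int, vertical: bool = True) -> list[Box]:
    """Split a box into `parts` equal boxes.

    vertical=True makes vertical cuts (pieces sit side by side);
    vertical=False makes horizontal cuts (pieces are stacked).
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    pieces = []
    if vertical:
        step = (box.x2 - box.x1) / parts
        for i in range(parts):
            pieces.append(Box(box.x1 + i * step, box.y1, box.x1 + (i + 1) * step, box.y2))
    else:
        step = (box.y2 - box.y1) / parts
        for i in range(parts):
            pieces.append(Box(box.x1, box.y1 + i * step, box.x2, box.y1 + (i + 1) * step))
    return pieces


if __name__ == "__main__":
    merged = merge_boxes([Box(10, 20, 50, 60), Box(40, 10, 90, 55)])
    print(merged)                      # enclosing box spans (10, 10) to (90, 60)
    print(split_box(Box(0, 0, 100, 40), parts=2))
```

In an image-editing pipeline, the merged or split boxes would then serve as the regions passed to whatever editing tool replaces or rearranges objects; that step is outside this sketch.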

Description

RELATED APPLICATION INFORMATION

This application claims priority to U.S. Provisional Patent Application No. 63/717,476, filed on Nov. 7, 2024, and U.S. Provisional Patent Application No. 63/719,691, filed on Nov. 13, 2024, both incorporated herein by reference in their entirety.

BACKGROUND

Technical Field

The present invention relates to synthetic training data generation for artificial intelligence models and, more particularly, to applying a multimodal large language model to generate training data of corner cases for autonomous vehicle driving scenario training.

Description of the Related Art

The majority of current autonomous systems, such as autonomous vehicles (AV), rely on modular architectures that combine components for perception, prediction, and planning to navigate driving scenarios. These systems face considerable challenges when dealing with rare and unpredictable “corner cases” that emerge in real-world driving scenarios. These corner cases include encountering unusual objects, e.g., animals on the road, adverse weather conditions, unexpected events like accidents and downed power lines, vehicle malfunctions such as brake failure, unpredictable traffic such as emergency vehicles, or external events such as falling objects. In other words, corner cases can include situations that are difficult to anticipate and react to, whether because of their rarity and corresponding absence from training data, or because of bias from events or situations not contemplated when the training data was developed.

Traditional self-driving systems struggle to generalize to open domains, especially when encountering real-world corner cases. Collecting data on such scenarios, e.g., accidents and extreme weather conditions, can help autonomous vehicle training and enhance system performance, but these scenarios can be difficult or impossible to document in some situations. Some works have proposed developing on-road accident detection and anticipation datasets.
However, these datasets lack object-level risk annotations, making it difficult to recognize risky traffic agents. Simulation tools have also been adopted to alleviate this problem by augmenting the datasets. Unfortunately, synthetic data may not always accurately capture the distribution of real driving scenes, and the tools can be difficult to control.

SUMMARY

According to an aspect of the present invention, a method is provided for augmenting training data. The method includes identifying an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data used for training a model does not have sufficient training on, where having sufficient training includes the model reaching a performance threshold in response to testing the model on the issue, and generating a natural language description of the issue. The method further includes generating a set of simulated images from the natural language description that reflect one or more variations of the issue, selecting one or more training images from the set of simulated images, the selected one or more training images increasing the one or more variations of the issue in the training data, and training the model using the selected one or more training images.

According to another aspect of the present invention, a system is provided for augmenting training data. The system includes a processor and a memory storing computer-readable instructions.
When the computer-readable instructions are executed by the processor, the instructions cause the system to identify an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data used for training a model does not have sufficient training on, where having sufficient training includes the model reaching a performance threshold in response to testing the model on the issue, and generate a natural language description of the issue. The instructions also cause the system to generate a set of simulated images from the natural language description that reflect one or more variations of the issue, select one or more training images from the set of simulated images, the selected one or more training images increasing the one or more variations of the issue in the training data, and train the model using the selected one or more training images.

According to yet another aspect of the present invention, a computer program product is provided that includes a non-transitory computer-readable storage medium containing computer program code that, when executed by one or more processors, causes the one or more processors to perform operations. The computer program code includes instructions to identify an issue in an input image displayed on a user interface, the issue being a visual depiction of an aspect of the input image that training data us