US-20260127782-A1 - METHOD AND SYSTEM FOR INTELLIGENT IMAGE OR VIDEO GENERATION THROUGH LARGE LANGUAGE MODEL INTERACTION

Abstract

A method for intelligent image or video generation through large language model (LLM) interaction is provided. An original prompt request is received from a user, descriptive information is extracted from the original prompt request and populated into an internal storyboard template of a system, or the original prompt request and an external storyboard template are directly imported, and the external storyboard template is mapped to the internal storyboard template. The content within the storyboard template is serialized and textualized to obtain at least one prompt request segment each corresponding to one image or storyboard video. The at least one prompt request segment is submitted to the LLM one by one or in batch. An image or video content is generated and returned by the LLM. A system for intelligent image or video generation through LLM interaction is also provided.
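
As an illustration only, the mapping, serialization, and submission steps summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicants' implementation: the external key names, the snake_case internal field names, the "name: value" text form, and the `generate` callable standing in for the LLM are all hypothetical.

```python
import json

# Hypothetical mapping from external template keys to internal field names.
# The internal names loosely follow fields listed in the claims (shot number,
# subject, shot size, ...); the external keys are invented for illustration.
EXTERNAL_TO_INTERNAL = {
    "shot": "shot_number",
    "who": "subject",
    "size": "shot_size",
    "angle": "shooting_angle",
    "move": "camera_movement_mode",
    "light": "lighting",
    "desc": "scene_description",
}

# Assumed order in which internal fields are written into a prompt segment.
FIELD_ORDER = [
    "shot_number", "subject", "shot_size", "shooting_angle",
    "camera_movement_mode", "lighting", "scene_description",
]


def map_external_template(external_json):
    """Map rows of an external JSON template onto internal field names."""
    rows = json.loads(external_json)
    return [
        {EXTERNAL_TO_INTERNAL[k]: v for k, v in row.items()
         if k in EXTERNAL_TO_INTERNAL}  # unknown external keys are dropped
        for row in rows
    ]


def serialize_row(row):
    """Textualize one storyboard row into one prompt request segment."""
    parts = [
        f"{name.replace('_', ' ')}: {row[name]}"
        for name in FIELD_ORDER
        if row.get(name) not in (None, "")  # skip empty cells
    ]
    return "; ".join(parts)


def submit(segments, generate, batch=False):
    """Submit segments one by one or in batch.

    `generate` is a hypothetical stand-in for the LLM call: it takes a list
    of prompt segments and returns one result per segment.
    """
    if batch:
        return generate(segments)
    return [generate([segment])[0] for segment in segments]


external = ('[{"shot": 1, "who": "a red fox", "size": "close-up"},'
            ' {"shot": 2, "who": "a red fox", "move": "pan left", "note": "x"}]')
rows = map_external_template(external)
segments = [serialize_row(row) for row in rows]
```

Each storyboard row yields exactly one segment, which matches the one-segment-per-image-or-storyboard-video correspondence described above. Passing `batch=True` hands all segments to the model in a single call.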

Inventors

Yin Yu
Ziyan YANG

Dates

Publication Date: 20260507
Application Date: 20251218
Priority Date: 20240301

Claims (10)

1. A method for intelligent image or video generation through large language model (LLM) interaction, comprising: an image or video generation process; wherein the image or video generation process comprises: (S11) receiving an original prompt request from a user, extracting descriptive information from the original prompt request, and populating the descriptive information into an internal storyboard template of a system; or directly importing the original prompt request and an external storyboard template, and mapping the external storyboard template to the internal storyboard template; (S12) serializing and textualizing a content within the storyboard template to obtain at least one prompt request segment, wherein each of the at least one prompt request segment corresponds to an image or a storyboard video; and (S13) submitting the at least one prompt request segment one by one or in batch to an LLM to obtain a generated image or a generated video content, and returning, by the LLM, the generated image or the generated video content.
2. The method of claim 1, wherein the storyboard template is configured to be displayed in a table form on a generation interaction page, and is in a re-editable state; the content within the storyboard template comprises setup number, shot number, scene reference, subject, shot size, camera position number, shooting angle, camera movement mode, equipment, lens length, sound, scene description, editing and transition mode, lighting, generation count number and shooting time; and the external storyboard template is a text file, a table file, or a Markdown configuration file in a JSON format or a YAML format.
3. The method of claim 1, further comprising: correspondingly populating the at least one prompt request segment within the storyboard template; generating, by each of the at least one prompt request segment, a suggested prompt in response to a user trigger; and replacing the at least one prompt request segment with the suggested prompt, and submitting the suggested prompt to the LLM.
4. The method of claim 1, further comprising: a dynamic playback interaction process; wherein the dynamic playback interaction process comprises: (S21) during playback of the generated video content, receiving a playback interaction input by the user in real time; (S22) based on a preset playback interaction rule or a preset playback interaction use case, the playback interaction and a timestamp, locating a corresponding video reference frame and a corresponding descriptive text, modifying or adding corresponding parameters on the storyboard template in real time to obtain modified or added parameters, and submitting the modified or added parameters to the LLM; and (S23) regenerating and returning, by the LLM, a regenerated image or a regenerated video content in real time.
5. The method of claim 4, wherein the preset playback interaction rule or the preset playback interaction use case is provided by the system or is user-defined.
6. A system for intelligent image or video generation through LLM interaction, comprising: a template management module; wherein the template management module is configured to perform steps of: (S11) receiving an original prompt request from a user, extracting descriptive information from the original prompt request, and populating the descriptive information into an internal storyboard template of a system; or directly importing the original prompt request and an external storyboard template, and mapping the external storyboard template to the internal storyboard template; (S12) serializing and textualizing a content within the storyboard template to obtain at least one prompt request segment, wherein each of the at least one prompt request segment corresponds to an image or a storyboard video; and (S13) submitting the at least one prompt request segment one by one or in batch to an LLM to obtain a generated image or a generated video content.
7. The system of claim 6, wherein the storyboard template is configured to be displayed in a table form on a generation interaction page, and is in a re-editable state; the content within the storyboard template comprises setup number, shot number, scene reference, subject, shot size, camera position number, shooting angle, camera movement mode, equipment, lens length, sound, scene description, editing and transition mode, lighting, generation count number and shooting time; and the external storyboard template is a text file, a table file, or a Markdown configuration file in a JSON format or a YAML format.
8. The system of claim 6, wherein the template management module is further configured to perform steps of: correspondingly populating the at least one prompt request segment within the storyboard template; generating, by each of the at least one prompt request segment, a suggested prompt in response to a user trigger; and replacing the at least one prompt request segment with the suggested prompt, and submitting the suggested prompt to the LLM.
9. The system of claim 6, further comprising: a dynamic playback interaction module; wherein the dynamic playback interaction module is configured to perform steps of: (S21) during playback of the generated video content, receiving a playback interaction input by the user in real time; (S22) based on a preset playback interaction rule or a preset playback interaction use case, the playback interaction and a timestamp, locating a corresponding video reference frame and a corresponding descriptive text, modifying or adding corresponding parameters on the storyboard template in real time to obtain modified or added parameters, and submitting the modified or added parameters to the LLM; and (S23) regenerating and returning, by the LLM, a regenerated image or a regenerated video content in real time.
10. The system of claim 9, wherein the preset playback interaction rule or the preset playback interaction use case is provided by the system or is user-defined.
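
For illustration, the locate-and-modify part of step (S22) could be sketched as below. This is a sketch under assumptions, not the claimed implementation: the rule table, the interaction names, the per-shot start times, and the snake_case field names are all hypothetical, and the claims do not specify how a timestamp is resolved to a shot.

```python
from bisect import bisect_right

# Hypothetical preset playback interaction rules:
# interaction name -> (storyboard field to change, new value).
RULES = {
    "pinch_out": ("shot_size", "close-up"),
    "swipe_left": ("camera_movement_mode", "pan left"),
}


def locate_shot(start_times, timestamp):
    """Return the index of the shot playing at `timestamp`.

    `start_times` is an ascending list of shot start times in seconds
    (an assumed representation of the storyboard timeline).
    """
    index = bisect_right(start_times, timestamp) - 1
    if index < 0:
        raise ValueError("timestamp precedes the first shot")
    return index


def apply_interaction(storyboard, start_times, interaction, timestamp):
    """Locate the affected storyboard row and return it with the rule applied.

    The original storyboard is left untouched; the caller decides whether
    to commit the modified row and resubmit it for regeneration.
    """
    index = locate_shot(start_times, timestamp)
    field, value = RULES[interaction]
    updated = dict(storyboard[index])
    updated[field] = value
    return index, updated


storyboard = [
    {"subject": "a red fox", "shot_size": "wide"},
    {"subject": "a red fox", "shot_size": "medium"},
]
start_times = [0.0, 4.0]
index, updated_row = apply_interaction(storyboard, start_times, "pinch_out", 5.2)
```

Here a pinch gesture at 5.2 seconds resolves to the second shot, and only that row's shot size changes. The modified row would then be serialized and resubmitted to the LLM, as in step (S23).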

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Patent Application No. PCT/CN2024/106325, filed on Jul. 19, 2024, which claims the benefit of priority from Chinese Patent Application No. 202410233999.8, filed on Mar. 1, 2024. The content of the aforementioned application, including any intervening amendments made thereto, is incorporated herein by reference in its entirety.

TECHNICAL FIELD

This application relates to large language model (LLM) interaction, and more particularly to a method and system for intelligent image or video generation through LLM interaction.

BACKGROUND

At present, multimodal large language models (LLMs) of artificial intelligence are developing rapidly. For example, models such as Stable Video and OpenAI Sora have already achieved text-to-image or text-to-video generation capabilities.

In the existing multimodal LLMs of artificial intelligence, the text-to-image or text-to-video generation process includes the following steps. A user submits a prompt request, provides a linguistic description of an image or scene to the LLM, and the LLM then returns the generated image or video. The function of a prompt request primarily relies on a text/string template, which includes the scene theme, subject description, scene background, environment description, style definition and model parameters of the image or video.

As shown in FIG. 1, when an LLM performs text-to-image generation based on a prompt request, the user provides the prompt request shown on the left side of FIG. 1, and the LLM then generates the image shown on the right side of FIG. 1. As shown in FIG. 2, when an LLM performs text-to-video generation based on a prompt request, the user provides the corresponding prompt request, and the LLM then generates the video shown in FIG. 3.

However, the existing multimodal LLMs of artificial intelligence exhibit the following deficiencies when performing text-to-image or text-to-video generation.

1. The interaction mode of the generation page is limited, typically supporting only single-item, item-by-item input interactions for text-to-image and text-to-video generation. All dimensional information of the scene is contained within the prompt. Newly released models such as Stable Video place camera state configurations in a secondary draft generation page for selection, which results in the lack of systematic management for prompt requests of each scene.

2. When a prompt request involves long-text descriptions, the user may lack descriptive dimensions, thereby affecting the quality of the image or video generated by the LLM.

3. When the LLM returns the generated content, the playback interaction modes are limited. Typically, only image and video browsers are provided, which merely support playback interactions such as playback, zooming, variable-speed playback, and skip playback.

4. Batch generation is not supported, leading to low efficiency.

SUMMARY

An object of the disclosure is to provide a method and system for intelligent image or video generation through large language model (LLM) interaction, which can construct and provide users with a systematic storyboard template to achieve systematic and structured editing and management, while supporting batch generation and offering diverse interaction modes to enhance user interaction experience.

In order to achieve the above object, the following technical solutions are adopted.

In a first aspect, this application provides a method for intelligent image or video generation through LLM interaction, comprising: an image or video generation process; wherein the image or video generation process comprises:

(S11) receiving an original prompt request from a user, extracting descriptive information from the original prompt request, and populating the descriptive information into an internal storyboard template of a system; or directly importing the original prompt request and an external storyboard template, and mapping the external storyboard template to the internal storyboard template;

(S12) serializing and textualizing a content within the storyboard template to obtain at least one prompt request segment, wherein each of the at least one prompt request segment corresponds to an image or a storyboard video; and

(S13) submitting the at least one prompt request segment one by one or in batch to an LLM to obtain a generated image or a generated video content, and returning, by the LLM, the generated image or the generated video content.

In a second aspect, this application provides a system for intelligent image or video generation through LLM interaction, comprising: a template management module; wherein the template management module is configured to perform steps of:

(S11) receiving an original prompt request from a user, extracting descriptive information from the original prompt request, and populating the descriptive information into an internal storyboard template of a system; or directly importing the ori