KR-102962412-B1 - Method for generating a three-dimensional model based on multimodal input, and computer program recorded on a recording medium for executing the same

KR 102962412 B1

Abstract

The present invention proposes a method for generating a reliable 3D model from multimodal inputs, such as text and images, by fusing those inputs to automatically generate a 3D model that reflects industrial specification conditions. The method may include the steps of: a modeling server receiving, from a user terminal, multimodal input including text and an image related to a device to be generated; the modeling server inputting the received text and image into a pre-trained artificial intelligence (AI) model to generate a 2D model corresponding to the device; the modeling server receiving inspection information for correcting the generated 2D model and correcting the 2D model based on the received inspection information; and the modeling server generating a 3D model corresponding to the corrected 2D model.

Inventors

  • 박명호

Assignees

  • 주식회사 디엔지니어

Dates

Publication Date
2026-05-08
Application Date
2025-09-25

Claims (10)

  1. A method for generating a 3D model based on multimodal input, comprising the steps of: a modeling server receiving, from a user terminal, multimodal input including text and an image related to a device to be generated; the modeling server inputting the received text and image into a pre-trained artificial intelligence (AI) model to generate a two-dimensional (2D) model corresponding to the device; the modeling server receiving inspection information for correcting the generated 2D model and correcting the 2D model based on the received inspection information; and the modeling server generating a three-dimensional (3D) model corresponding to the corrected 2D model, wherein the step of generating the 3D model comprises: extending the corrected 2D model to generate multi-view images, wherein a pose consistency loss is calculated based on the Structural Similarity Index (SSIM) to maintain shape consistency among the multi-view images, and the multi-view images are generated such that the calculated pose consistency loss is minimized; reconstructing a 3D scene from the multi-view images using a Neural Radiance Field (NeRF); converting the reconstruction result into a Truncated Signed Distance Function (TSDF) to extract a mesh; optimizing the extracted mesh by removing unnecessary polygons to generate the 3D model; and evaluating structural stability based on the stress distribution of the generated 3D model.
  2. The method of claim 1, wherein the receiving step comprises: identifying, from the text, text feature information including at least one of the dimensions, material, and structural conditions of the device, and converting the identified text feature information into a text embedding vector; and identifying, from the image, image feature information including at least one of the shape, viewpoint, and structural conditions of the device, and converting the identified image feature information into an image embedding vector.
  3. The method of claim 2, wherein the step of generating the 2D model comprises: mapping the text embedding vector and the image embedding vector into a latent space; performing a cross-attention fusion operation in the latent space to generate a fusion vector in a single semantic space; and reflecting a specification vector, which represents industrial specification conditions for the device, in the cross-attention fusion operation so that the generated 2D model satisfies the industrial specification conditions.
  4. The method of claim 3, wherein the step of generating the 2D model improves the compliance rate with the industrial specification conditions by injecting the specification vector into the noise prediction at each time step of an image generation model based on a latent diffusion model.
  5. The method of claim 4, wherein the step of correcting the 2D model comprises: converting the generated 2D model to high resolution using a correction network based on an Enhanced Super-Resolution Generative Adversarial Network (ESRGAN); and evaluating the relative realism of the generated image against a real image using a relativistic discriminator.
  6. The method of claim 5, wherein the step of correcting the 2D model comprises reinforcing the edges of the generated 2D model and performing anti-aliasing filtering on curves and dimension lines.
  7. The method of claim 1, wherein the step of correcting the 2D model comprises: providing the user terminal with an interactive user interface for correcting the generated 2D model; rendering information entered through the interactive user interface in real time and converting it into a delta mask representing a correction area; partially regenerating the generated 2D model through the delta mask by separating the correction area into image patches and regenerating latent vectors corresponding to the separated patch areas; and merging the regenerated latent vectors with the original latent vectors using a weighted average to obtain the corrected 2D model.
  8. (deleted)
  9. (deleted)
  10. A computer program, recorded on a recording medium and coupled to a computing device comprising a memory and a processor that processes instructions resident in the memory, for executing a multimodal-input-based 3D model generation method in which: the processor receives, from a user terminal, multimodal input including text and an image related to a device to be generated; the processor inputs the received text and image into a pre-trained artificial intelligence (AI) model to generate a 2D model corresponding to the device; the processor receives inspection information for correcting the generated 2D model and corrects the 2D model based on the received inspection information; and the processor generates a 3D model corresponding to the corrected 2D model, wherein generating the 3D model comprises: extending the corrected 2D model to generate multi-view images, wherein a pose consistency loss is calculated based on the Structural Similarity Index (SSIM) to maintain shape consistency among the multi-view images, and the multi-view images are generated such that the calculated pose consistency loss is minimized; reconstructing a 3D scene from the multi-view images using a Neural Radiance Field (NeRF); converting the reconstruction result into a Truncated Signed Distance Function (TSDF) to extract a mesh; optimizing the extracted mesh by removing unnecessary polygons to generate the 3D model; and evaluating structural stability based on the stress distribution of the generated 3D model.
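
The cross-attention fusion of claim 3 can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the projection matrices here are randomly initialized stand-ins for learned weights, and the function name and the additive injection of the specification vector are illustrative assumptions. It shows only the shape of the operation — text tokens act as queries over image tokens, and a specification vector conditions the fused result.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_emb, img_emb, spec_vec, d_k=64, seed=0):
    """Fuse text tokens (queries) with image tokens (keys/values) in a
    shared latent space, then add the specification vector as a
    conditioning bias. Projections are random here; in a real system
    they would be learned parameters."""
    rng = np.random.default_rng(seed)
    d_t, d_i = text_emb.shape[-1], img_emb.shape[-1]
    Wq = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    Wk = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Wv = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Q = text_emb @ Wq                       # (n_text, d_k)
    K = img_emb @ Wk                        # (n_img,  d_k)
    V = img_emb @ Wv                        # (n_img,  d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # text attends over image tokens
    fused = attn @ V                        # (n_text, d_k) fusion vectors
    return fused + spec_vec                 # inject industrial-spec conditioning

text_emb = np.random.default_rng(1).standard_normal((4, 32))  # 4 text tokens
img_emb = np.random.default_rng(2).standard_normal((9, 48))   # 9 image patches
spec_vec = np.zeros(64)                                       # neutral spec vector
fused = cross_attention_fuse(text_emb, img_emb, spec_vec)
print(fused.shape)  # (4, 64)
```

In claim 4, the same specification vector is instead injected into the noise prediction at each diffusion time step; the additive bias above is only the simplest possible conditioning scheme.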
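
The SSIM-based pose consistency loss of claims 1 and 10 can be sketched as follows. This is an assumption-laden simplification: it uses a single global SSIM window rather than the usual sliding-window SSIM, and it pairs adjacent views only; the patent does not specify either choice. The point is the structure of the loss — one minus a mean SSIM over view pairs, so minimizing it pushes the generated multi-view images toward mutual shape consistency.

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM between two images scaled to [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def pose_consistency_loss(views):
    """1 - mean SSIM over adjacent view pairs; lower means the
    multi-view images agree better on shape."""
    sims = [ssim_global(a, b) for a, b in zip(views, views[1:])]
    return 1.0 - float(np.mean(sims))

img = np.linspace(0, 1, 64 * 64).reshape(64, 64)
identical = [img, img.copy(), img.copy()]
noise = 0.3 * np.random.default_rng(0).standard_normal(img.shape)
noisy = [img, np.clip(img + noise, 0, 1)]

print(pose_consistency_loss(identical))  # ~0.0 for identical views
print(pose_consistency_loss(noisy) > pose_consistency_loss(identical))  # True
```

In the claimed pipeline this loss would be evaluated during multi-view generation and minimized, before the views are handed to the NeRF reconstruction stage.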
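
The latent-merge step of claim 7 — blending regenerated latent vectors back into the original via a weighted average, restricted to the delta mask — can be sketched directly. The function name, the single scalar weight, and the 2D latent layout are illustrative assumptions; the claim leaves the weighting scheme unspecified.

```python
import numpy as np

def merge_latents(original, regenerated, delta_mask, weight=0.8):
    """Blend a regenerated latent back into the original latent.
    delta_mask is 1 inside the user-corrected region, 0 elsewhere;
    `weight` sets how strongly the regeneration replaces the original
    within that region (a weighted average, per the claim)."""
    blend = weight * regenerated + (1.0 - weight) * original
    return np.where(delta_mask.astype(bool), blend, original)

rng = np.random.default_rng(0)
orig = rng.standard_normal((8, 8))    # original latent
regen = rng.standard_normal((8, 8))   # regenerated latent patch
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1                    # correction area from the interactive UI

merged = merge_latents(orig, regen, mask)
print(np.allclose(merged[0, 0], orig[0, 0]))  # untouched outside the mask: True
print(np.allclose(merged[3, 3], 0.8 * regen[3, 3] + 0.2 * orig[3, 3]))  # True
```

Keeping the blend inside the delta mask is what makes the regeneration "partial": latents outside the user's correction area pass through unchanged, so the rest of the 2D model is preserved exactly.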

Description

Method for generating a three-dimensional model based on multimodal input, and computer program recorded on a recording medium for executing the same

The present invention relates to three-dimensional model generation technology. More specifically, it relates to a multimodal-input-based 3D model generation method that automatically generates a high-quality 3D model reflecting industrial specification conditions by fusing multimodal inputs, and to a computer program recorded on a recording medium for executing the same.

A smart factory is an intelligent production plant that improves productivity, quality, and customer satisfaction by applying Information and Communications Technology (ICT), combined with digital automation solutions, to production processes such as design, development, manufacturing, and distribution. It is a factory of the future that installs Internet of Things (IoT) devices on equipment and machinery to collect process data in real time, analyze it, and enable autonomous control.

In particular, automated production facilities for implementing a smart factory automate production processes using machinery, equipment, robots, and the like. Such facilities aim to reduce human intervention and improve productivity and efficiency through automated systems; they are well suited to highly repetitive, precise tasks and help reduce errors and ensure consistency in the production process.

Recently, although most companies are adopting automated production facilities, they face difficulties in implementation due to a lack of in-house expertise regarding such equipment. In particular, because the design of automated production facilities is complex and requires advanced technical knowledge, barriers to entry remain high for small-scale enterprises such as SMEs and startups.
Furthermore, the lack of clear direction and concept setting during the initial design phase has resulted in cost burdens and wasted resources, and inefficient design has caused problems such as time lost to frequent design changes and delays in market response.

Conventional automated-equipment design support technologies were limited in that they relied primarily on single-modality data. For example, using only text-based input made it difficult to accurately reflect the shape or detailed structure of the device the designer intended, while using only image-based input restricted the ability to incorporate detailed attribute information such as material and dimensions. Conventional systems also could not automatically reflect industrial specifications, such as dimensional tolerances and material strength, making it difficult to produce designs that were actually manufacturable. Consequently, initial design results were often incompatible with the actual manufacturing process or required repeated correction, and the user inspection process was limited to simple visual verification, preventing genuine quality correction and feedback into learning.

FIG. 1 is a configuration diagram of a three-dimensional model generation system according to one embodiment of the present invention. FIG. 2 is a logical configuration diagram of a modeling server according to one embodiment of the present invention. FIG. 3 is an exemplary diagram illustrating a cross-attention fusion process based on multimodal input according to one embodiment of the present invention. FIG. 4 is an exemplary diagram illustrating a process for outputting a three-dimensional model according to an embodiment of the present invention. FIG. 5 is an exemplary diagram showing a simulation process according to one embodiment of the present invention. FIG. 6 is an illustrative diagram explaining a sequence prediction model according to one embodiment of the present invention. FIG. 7 is a hardware configuration diagram of a modeling server according to one embodiment of the present invention. FIG. 8 is a flowchart illustrating a method for generating a three-dimensional model according to an embodiment of the present invention. FIG. 9 is a flowchart illustrating a file creation method according to an embodiment of the present invention. FIG. 10 is a flowchart explaining a simulation method according to one embodiment of the present invention.

It should be noted that technical terms used in this specification are used merely to describe specific embodiments and are not intended to limit the invention. Furthermore, unless specifically defined otherwise in this specification, technical terms used herein should be interpreted in the sense generally understood by those skilled in the art to which the invention pertains, and should not be interpreted in an overly broad or overly narrow sense. Additionally, if a technical term used in t