CN-122019818-A - Multi-mode CAD model retrieval method and device based on text and sketch
Abstract
The invention discloses a multi-mode CAD model retrieval method and device based on text and sketches. Multi-view rendering is performed on the models of a CAD model dataset, automatic text labeling is realized using a large vision-language model, and the rendered views are converted into sketch style through a sketch generator. The method comprises: constructing a B-Rep model encoder that extracts geometric and topological information from the B-Rep data structure and completes feature encoding with a graph neural network; constructing a sketch-text encoder that converts sketches into text; adopting a multi-stage training strategy to realize tri-modal alignment training among the model network, the CLIP text network and the CLIP visual network; carrying out alignment training of text and sketch features with the CAD model; and finally realizing retrieval inference supporting text and sketch input. According to the invention, a model is not required as retrieval input; retrieval can be completed directly with the more convenient and readily available text and sketches, so retrieval usability and engineering practicability are remarkably improved.
Inventors
- DING YINA
- JIN YAO
Assignees
- 浙江理工大学 (Zhejiang Sci-Tech University)
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (10)
- 1. A multi-mode CAD model retrieval method based on text and sketches, characterized by comprising the following steps: step one, acquiring CAD models, generating text labels through a large vision-language model and rendering sketches through a sketch generation network, thereby constructing a CAD model dataset containing text labels and sketches; step two, based on a geometric attribute adjacency graph, extracting geometric and topological information from the boundary representation (B-Rep) data structure of the CAD model dataset; step three, constructing a B-Rep model encoder based on a graph neural network, encoding the geometric and topological information, and extracting CAD model features; step four, constructing a sketch-text encoder comprising a sketch network and a sketch-to-text conversion network, converting the sketch into a text representation and fusing it with the text, thereby realizing mixed "text + sketch" feature extraction; step five, performing multi-modal alignment training on the CAD model: in the first stage, aligning the B-Rep model encoder with a CLIP visual encoder; in the second stage, performing tri-modal alignment on the basis of the pre-trained B-Rep model encoder together with the CLIP text encoder and the CLIP visual encoder, and fine-tuning each encoder network; step six, sketch-text mixed input training: in the first stage, constructing triplets of sketch, positive-sample view and negative-sample view and aligning the sketch network; in the second stage, simplifying the labeling text and, based on the pre-trained sketch network, CLIP visual encoder and sketch-to-text conversion network, realizing alignment between the "text + sketch" modality and the CAD model modality; and step seven, based on the alignment-trained CLIP visual encoder, sketch encoder and sketch-to-text conversion network, inputting text and sketches and carrying out CAD model retrieval.
- 2. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, characterized in that in step one, CAD model dataset construction comprises: performing multi-view projection rendering on three-dimensional CAD models from a public CAD model dataset; scaling each model to the unit sphere and uniformly sampling n camera poses to obtain n projection views; setting customized text-annotation prompts for different datasets and inputting the prompts together with the n projection views into a large vision-language model to generate text annotations; inputting the set of rendered views into a viewpoint-selection network to screen views; and inputting the selected views into a sketch conversion network to generate the corresponding sketches.
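The view-rendering preparation described in claim 2 can be illustrated with a minimal sketch. The function names and the Fibonacci-spiral sampling scheme below are illustrative assumptions, not the patent's implementation; the claim only requires normalizing the model to the unit sphere and uniformly sampling n camera poses.

```python
import numpy as np

def scale_to_unit_sphere(verts: np.ndarray) -> np.ndarray:
    """Center a vertex array and scale it to fit inside the unit sphere."""
    centered = verts - verts.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def sample_camera_poses(n: int, radius: float = 2.0) -> np.ndarray:
    """Sample n approximately uniform camera positions on a sphere of
    the given radius around the origin (Fibonacci spiral)."""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - 2 * (i + 0.5) / n          # uniform in height
    theta = 2 * np.pi * i / golden     # golden-angle azimuth
    r = np.sqrt(1 - z ** 2)
    pts = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
    return radius * pts
```

Each sampled position, looking toward the origin, yields one of the n projection views that are later passed to the annotation and sketch networks.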
- 3. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, characterized in that step two comprises: discretizing faces and edges into UV grids and sampling them to obtain grid features, wherein the grid features comprise sampling-point coordinates and normal vectors; extracting face attribute features comprising face type, area and centroid coordinates; extracting edge attribute features comprising edge geometric type, length and convexity; and representing the B-Rep topological information with a face adjacency graph (FAG), wherein FAG nodes correspond to faces and the connecting edges between nodes correspond to model edges.
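The face adjacency graph of claim 3 can be sketched minimally as follows. The input representation (a table mapping each B-Rep edge to the two faces sharing it) is an assumption for illustration; real B-Rep kernels expose this incidence information through their own APIs.

```python
def build_face_adjacency_graph(edge_faces):
    """Build a face adjacency graph (FAG) from a B-Rep edge table.

    edge_faces: dict mapping edge id -> (face_a, face_b), the two faces
    sharing that edge. Returns (nodes, arcs): the set of face ids, and a
    list of (face_a, face_b, edge_id) arcs, so each graph node is a face
    and each connecting edge corresponds to a model edge.
    """
    nodes, arcs = set(), []
    for eid, (fa, fb) in edge_faces.items():
        nodes.update((fa, fb))
        arcs.append((fa, fb, eid))
    return nodes, arcs
```

In the full method, each node additionally carries the face's UV-grid and attribute features, and each arc carries the edge's grid and attribute features.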
- 4. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, wherein in step three the B-Rep model encoder is composed of a geometric information encoder, a topology information encoder and a multi-task feature encoder: the geometric information encoder encodes the geometric attribute data, embedding the grid data of faces and edges with a convolutional neural network and the geometric attribute data of faces and edges with an MLP, then splicing the grid embeddings of faces and edges with the corresponding attribute embeddings to serve as the node and edge feature vectors of the FAG, forming a graph data structure; the topology information encoder comprises five encoding modules and four gating networks, wherein four of the encoding modules are task-specific modules and the remaining one is a task-sharing module; each encoding module consists of a GNN block, which obtains face-level encoding features and aggregates them into a graph-level global feature based on graph attention convolution; the multi-task feature encoder receives the output of the topology information encoder for each task, establishes information routing based on the gating networks, distributes the geometric and topological features to the corresponding MLPs, transmits the sub-task features in sequence, splices the current task feature with the previous group's MLP features as input, outputs features containing the task target information, fuses the current task with the model-feature task information, and maps the result into a unified CAD model feature vector.
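The gated routing between task-shared and task-specific modules in claim 4 resembles mixture-of-experts routing. The following is a hedged numpy sketch under assumed shapes; the actual patent architecture (five GNN encoding modules, four gates, sequential feature splicing) is more elaborate, and all names here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_route(shared_feat, task_feats, gate_w):
    """Mix one task-shared feature with k task-specific features via a
    softmax gate (mixture-of-experts style routing).

    shared_feat: (d,) output of the task-sharing module
    task_feats:  (k, d) outputs of the task-specific modules
    gate_w:      (d, k+1) gating weights conditioned on the shared feature
    Returns a (d,) convex combination of all k+1 expert features.
    """
    experts = np.vstack([shared_feat[None, :], task_feats])  # (k+1, d)
    logits = shared_feat @ gate_w                            # (k+1,)
    weights = softmax(logits)
    return weights @ experts                                 # (d,)
```

A downstream MLP per task would then consume this routed feature, spliced with the previous task's output as the claim describes.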
- 5. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, wherein in step four the sketch network of the sketch-text encoder adopts a multi-scale ViT, in which self-attention is computed within multiple mutually independent, non-overlapping windows that interact via shifted windows, yielding the sketch feature vector; the sketch-to-text converter is an MLP with ReLU that takes the sketch feature vector as input and outputs sketch-converted text tokens, which are fused with the text tokens to obtain the mixed sketch-and-text retrieval input.
- 6. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, wherein in step five the multi-modal alignment training of the CAD model comprises: in the first stage, using the projection views rendered in step one, inputting all views into the CLIP visual encoder to generate visual feature vectors, taking the geometric and topological information as input and outputting CAD model feature vectors through the B-Rep model encoder, and training on these pairs, where the cosine similarity between the model feature vector and the image feature vector is subtracted from 1 to obtain a loss value reflecting the vector similarity difference; during this stage the CLIP visual encoder parameters are kept frozen and only the B-Rep model encoder parameters are trained; in the second stage, training with the CAD models, multi-view images and text labels in the dataset, where model and visual features are extracted as in the first stage and the text labels are input into the text encoder of the pre-trained CLIP model to generate text features; the view, text and model features are trained jointly: the similarity score of each positive feature pair is normalized against all samples in the same batch, a learnable temperature parameter adjusts the similarity distribution, and a contrastive loss measuring the matching degree of each bi-modal pair is obtained; the pairwise contrastive losses among text, view and model are weighted and fused into the total multi-modal joint training loss, and all parameters are fine-tuned during this stage.
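The batch-normalized, temperature-scaled contrastive objective of claim 6's second stage is the symmetric InfoNCE loss popularized by CLIP. A minimal numpy sketch for one modality pair, with assumed function names and a fixed (rather than learnable) temperature:

```python
import numpy as np

def clip_contrastive_loss(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired features:
    row i of feat_a is the positive match of row i of feat_b."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # (n, n) similarities

    def xent(l):
        # Row-wise cross-entropy with the diagonal as the positive class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```

In the method, this loss would be computed for each of the text-view, text-model and view-model pairs and the three terms weighted into the total joint loss.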
- 7. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, wherein in step six the sketch-text mixed input training comprises: in training stage one, constructing for each CAD model a triplet comprising a sketch, a positive-sample view and a negative-sample view; inputting the sketch into the sketch encoder to generate a sketch feature vector and inputting the positive and negative sample views into the CLIP visual encoder to generate their respective view feature vectors; comparing the similarity between the positive-pair view features and the sketch features against that of the negative pair, constrained by a margin threshold, to obtain a loss value that ensures effective separation of positive and negative sample features; in training stage two, constructing for each CAD model a pair comprising a sketch and the model; passing the sketch through the sketch encoder and the sketch-to-text conversion network to output sketch-to-text tokens; passing the simplified labeling text, as the text prompt, through a text tokenizer to output text tokens; combining the sketch-to-text tokens with the text tokens and passing them through the frozen CLIP text encoder to obtain the final combined query vector; inputting the corresponding CAD model into the B-Rep model encoder and training with a contrastive loss function, during which only the parameters of the sketch encoder and the sketch-to-text conversion network are trained and all other parameters remain frozen.
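The margin-constrained triplet objective of training stage one can be sketched as a standard triplet hinge loss on cosine similarities. The function name and the margin value are assumptions; the claim only specifies a boundary threshold separating positive from negative pairs.

```python
import numpy as np

def triplet_margin_loss(sketch, pos_view, neg_view, margin=0.2):
    """Hinge loss pushing the cosine similarity of the (sketch, positive
    view) pair above the (sketch, negative view) pair by at least
    `margin`; zero once the positive pair is sufficiently separated."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    return max(0.0, cos(sketch, neg_view) - cos(sketch, pos_view) + margin)
```

A perfectly separated triplet incurs zero loss, so gradients concentrate on hard triplets where the negative view is still too similar to the sketch.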
- 8. The multi-mode CAD model retrieval method based on text and sketches according to claim 1, characterized in that in step seven a CAD model feature vector library is generated in advance: geometric and topological information is extracted from each database CAD model and input into the trained B-Rep model encoder to obtain and store model feature vectors; a multi-modal retrieval inference framework integrating the alignment-trained CLIP text encoder and the sketch-text encoder is constructed, supporting text retrieval, sketch retrieval and joint text-sketch retrieval; the matching degree between the query feature and the feature vectors in the library is computed by cosine similarity, candidate CAD models are output in order of similarity, and an approximate nearest neighbor index is built to enable fast large-scale retrieval.
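The exact-search core of claim 8's retrieval step is a cosine-similarity ranking over the precomputed feature library; only at scale is it replaced by an approximate nearest neighbor index. A minimal brute-force sketch, with assumed names:

```python
import numpy as np

def retrieve_top_k(query, library, k=5):
    """Rank a precomputed CAD model feature library by cosine similarity
    to a query vector; returns (indices, scores) of the top-k candidates.

    query:   (d,)   joint query feature from the text/sketch encoders
    library: (n, d) stored B-Rep model feature vectors
    """
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    scores = lib @ q                      # cosine similarity per model
    idx = np.argsort(-scores)[:k]         # descending similarity
    return idx, scores[idx]
```

With unit-normalized vectors, cosine ranking is equivalent to inner-product ranking, which is what ANN libraries typically index for large-scale retrieval.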
- 9. A multi-mode CAD model retrieval device based on text and sketches, comprising a memory and one or more processors, characterized in that executable code is stored in the memory, and when executing the executable code the processor implements the multi-mode CAD model retrieval method based on text and sketches according to any one of claims 1-8.
- 10. A computer-readable storage medium on which a program is stored, which, when executed by a processor, implements the multi-modal CAD model retrieval method based on text and sketches according to any one of claims 1-8.
Description
Multi-mode CAD model retrieval method and device based on text and sketch

Technical Field

The invention relates to the technical fields of computer graphics, computer-aided design, deep learning and multi-modal retrieval, and in particular to a multi-modal CAD model retrieval method and device based on text and sketches.

Background

As information technology penetrates deeply into the manufacturing industry, digital-model-driven product development and manufacturing have become the industry mainstream. The CAD model, as the core information carrier of product research and development, is constructed on a boundary representation (B-Rep) structure, contains accurate geometric shapes and topological association attributes, and carries rich design ideas, process parameters and manufacturing knowledge. Quickly retrieving a target model from massive stores of CAD models enables efficient reuse of design knowledge; it is a key path for shortening product development cycles and improving development quality and production efficiency, and is of great significance for the intelligent upgrading of modern industrial enterprises. Existing content-based CAD model retrieval methods mainly take three-dimensional models, two-dimensional views, hand-drawn sketches or text descriptions as input. Text descriptions and sketch input have low acquisition cost and a low barrier to use: text can accurately express design intent and functional requirements, sketches can intuitively present shape characteristics, and combining the two fits user retrieval scenarios more comprehensively.
For CAD model retrieval based on text and sketches, the core goal is to establish semantic associations between the multi-modal input and the CAD models and, given a user's text description or sketch query, output the quantified similarity between each model and the query as a ranked list, so as to locate the target model quickly. However, the prior art still faces two core bottlenecks. The first is modal heterogeneity: text and sketches are two-dimensional modalities, while the geometric-topological information of a B-Rep-based CAD model is three-dimensional structured data; the feature spaces of the different modalities differ markedly, so cross-modal semantic alignment is difficult. The second is retrieval granularity: CAD models exhibit obvious intra-class differences and non-uniform distributions, and traditional coarse-grained retrieval can only return broadly similar models, unable to precisely match users' specific requirements on detailed shapes and functional attributes, so retrieval precision falls short of practical engineering scenarios. How to break through the multi-modal heterogeneity barrier, construct accurate semantic mappings among text, sketches and CAD models, and improve fine-grained retrieval performance has therefore become a key research focus and difficulty in the CAD model retrieval field. Solving these problems can drive iterative upgrading of CAD model retrieval technology, accelerate the reuse of design knowledge, and provide core technical support for intelligent manufacturing, innovative product design and related fields. On this basis, the invention provides a method and device for multi-mode CAD model retrieval based on text and sketches by fusing multi-modal learning with cross-modal feature alignment strategies.
Disclosure of Invention

The invention relates to a multi-mode CAD model retrieval method and device based on text and sketches. The invention aims to solve problems such as low retrieval efficiency, high computation cost and high database labeling cost, thereby realizing a CAD model retrieval system supporting the industrial design and manufacturing industry. The aim of the invention is achieved by the following technical scheme. In a first aspect, the invention provides a multi-mode CAD model retrieval method based on text and sketches, comprising the following steps: step one, acquiring CAD models, generating text labels through a large vision-language model and rendering sketches through a sketch generation network, thereby constructing a CAD model dataset containing text labels and sketches; step two, extracting geometric information and topological information from the boundary representation (B-Rep) data structure of the CAD models in the dataset based on a geometric attribute adjacency graph; step three, constructing a B-Rep model encoder based on a graph neural network, and encoding the geometric and topological information of step two to realize extraction