CN-121999245-A - Printing coordinate prediction and description generation method based on vision-language model, electronic equipment and storage medium
Abstract
The invention discloses a printing coordinate prediction and description generation method based on a vision-language model, which addresses the technical problems of multi-task, multi-product printing prediction in the design of clothing and daily necessities, achieves accurate localization, detailed description, and style identification of printing areas, and markedly improves suitability across product categories. The method first constructs a multi-granularity label system to systematically address the insufficient localization and description precision of printing; it then adopts a progressive training strategy together with an efficient LoRA fine-tuning framework to rapidly introduce text generation tasks while preserving coordinate precision; the final output is a multi-task model with both accurate localization capability and rich description capability.
Inventors
- LIN JIEXING
- LIU PENG
- LIN HANQUAN
Assignees
- Xiamen Lingtu Technology Co., Ltd. (厦门灵图科技有限公司)
Dates
- Publication Date
- 20260508
- Application Date
- 20251230
Claims (5)
- 1. A printing coordinate prediction and description generation method based on a vision-language model, comprising: step 1, constructing a multi-granularity label system for the field of clothing and daily-necessities printing design, the label system comprising: a coordinate positioning layer, containing normalized coordinate information [[x_min, y_min], [x_max, y_min], [x_max, y_max], [x_min, y_max]] and supporting multiple bounding-box labels in multi-printing scenes, where x_min, x_max, y_min, and y_max respectively denote the minimum x-axis coordinate, maximum x-axis coordinate, minimum y-axis coordinate, and maximum y-axis coordinate of the bounding box; a detailed description layer, containing a natural-language description of the printed content; and a style classification layer, containing printing style labels; step 2, adopting a progressive training strategy from easy to difficult and from coarse to fine, fine-tuning a pre-trained vision-language model in three stages so as to balance the two tasks of coordinate regression and text generation; step 3, inserting LoRA adapters into the cross-attention layers and feed-forward network layers of the vision-language model, and dynamically adjusting the LoRA rank according to task complexity, using rank r=8 for the coordinate task and rank r=16 for the description/style tasks; and step 4, after training is completed, constructing a complete inference pipeline and outputting structured results.
- 2. The printing coordinate prediction and description generation method based on a vision-language model as claimed in claim 1, wherein the three stages of step 2 are respectively: in the first stage, the training target is to establish accurate coordinate prediction capability; the training data uses all training images but only the coordinate positioning layer labels as supervision signals; the training configuration sets the learning rate to 0.00001, trains only the network layers related to coordinate output, and uses Smooth L1 Loss as the coordinate regression loss function; in the second stage, the training target is joint training of the description generation and style classification tasks; the training data comprises all training images; the training configuration unfreezes the vision layer, language layer, and alignment layer, and adopts the multi-task loss function Loss = α·Loss_coord + β·Loss_desc + γ·Loss_style, where Loss_coord, Loss_desc, and Loss_style respectively denote the localization loss, the printing content description loss, and the style description loss, and α, β, and γ respectively denote the weights of the three losses; the learning rate is raised to 0.0001 for deep fine-tuning; in the third stage, the training target is to prevent the model from overfitting and to improve generalization to unseen commodity products and printing patterns; the training data mixes the annotated data with commodity images to simulate real application scenarios; the training configuration sets the learning rate to 0.0001, lightly fine-tunes all layers, and applies data augmentation.
- 3. The printing coordinate prediction and description generation method based on a vision-language model as claimed in claim 1, wherein step 4 specifically comprises: 4.1 parallel output generation, namely synchronously outputting the normalized coordinates, printing description, style classification, and frame height-ratio information; 4.2 visual output, namely generating commodity images with labeled boxes and a structured prediction report.
- 4. An electronic device comprising a processor, a memory, and an application program stored in the memory and configured to be executed by the processor to perform the vision-language-model-based printing coordinate prediction and description generation method according to any one of claims 1 to 3.
- 5. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed on a computer, causes the computer to perform the vision-language-model-based printing coordinate prediction and description generation method as claimed in any one of claims 1 to 3.
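The stage-two objective of claim 2 combines a Smooth L1 coordinate regression loss with weighted description and style losses. A minimal plain-Python sketch of those two pieces follows; the weight defaults (alpha=1.0, beta=0.5, gamma=0.5), the example loss values, and the function names are illustrative placeholders, since the patent does not specify concrete values:

```python
def smooth_l1(pred, target, delta=1.0):
    """Smooth L1 loss over paired coordinate values (stage one of claim 2)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / delta if d < delta else d - 0.5 * delta
    return total / len(pred)

def multitask_loss(loss_coord, loss_desc, loss_style,
                   alpha=1.0, beta=0.5, gamma=0.5):
    """Stage-two objective: Loss = alpha*Loss_coord + beta*Loss_desc + gamma*Loss_style."""
    return alpha * loss_coord + beta * loss_desc + gamma * loss_style

# Example: flattened normalized box coordinates, predicted vs. ground truth,
# combined with stand-in description and style loss values.
pred = [0.12, 0.10, 0.31, 0.52]
gt = [0.10, 0.10, 0.30, 0.50]
coord = smooth_l1(pred, gt)
total = multitask_loss(coord, loss_desc=2.1, loss_style=0.7)
```

In practice the description and style terms would be token-level and classification cross-entropy losses produced by the vision-language model; only the weighted combination is shown here.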
Description
Printing coordinate prediction and description generation method based on vision-language model, electronic equipment and storage medium

Technical Field

The invention belongs to the technical field of computer vision, and in particular relates to a printing coordinate prediction and description generation method based on a vision-language model, electronic equipment, and a storage medium.

Background

In recent years, multi-modal understanding technology based on vision-language models has progressed significantly, enabling image content understanding and text generation. However, when these general-purpose models are applied to the specific vertical field of apparel and commodity printing design, their predictions deviate significantly. The prior art mainly faces the following core challenges: (1) Multi-product suitability and positioning accuracy: existing models struggle to simultaneously meet the printing prediction requirements of diverse commodities such as T-shirts, bedsheets, throw pillows, and cups, and differences in shape, size, and material across products lead to insufficient coordinate prediction accuracy. The lack of explicit modeling of the geometric characteristics of the printing area in purely end-to-end learning methods results in inaccurate coordinate boundaries and larger frame height-ratio calculation errors. (2) Multi-task coordination and balance: a general model cannot simultaneously achieve high-precision printing coordinate localization, detailed text description, and style classification, and performs particularly poorly in multi-printing and complex-background scenes (difficult multi-task coordination).
The model struggles to balance the two very different tasks of coordinate regression (a regression task) and text generation (a sequence task), and the lack of a systematic training strategy leads to slow convergence and weak generalization (low training efficiency). The prior art can solve some of these problems, but fails to systematically realize end-to-end joint optimization of coordinate localization, detailed description, and style recognition. A novel method is therefore urgently needed that can achieve high-precision coordinate prediction and rich text description generation in multi-category, multi-printing scenes.

Disclosure of Invention

The main aim of the invention is to provide a printing coordinate prediction and description generation method, electronic equipment, and a storage medium based on a vision-language model, which solve the technical problems of multi-task, multi-product printing prediction in the design of clothing and daily necessities, achieve accurate localization, detailed description, and style identification of the printing area, and markedly improve suitability across product categories.
In order to achieve the above object, one solution of the present invention is: a printing coordinate prediction and description generation method based on a vision-language model, comprising the following steps: step 1, constructing a multi-granularity label system for the field of clothing and daily-necessities printing design, the label system comprising: a coordinate positioning layer, containing normalized coordinate information [[x_min, y_min], [x_max, y_min], [x_max, y_max], [x_min, y_max]] and supporting multiple bounding-box labels in multi-printing scenes, where x_min, x_max, y_min, and y_max respectively denote the minimum x-axis coordinate, maximum x-axis coordinate, minimum y-axis coordinate, and maximum y-axis coordinate of the bounding box; a detailed description layer, containing a natural-language description of the printed content; and a style classification layer, containing printing style labels; step 2, adopting a progressive training strategy from easy to difficult and from coarse to fine, fine-tuning a pre-trained vision-language model in three stages so as to balance the two tasks of coordinate regression and text generation; step 3, inserting LoRA adapters into the cross-attention layers and feed-forward network layers of the vision-language model, and dynamically adjusting the LoRA rank according to task complexity, using rank r=8 for the coordinate task and rank r=16 for the description/style tasks; and step 4, after training is completed, constructing a complete inference pipeline and outputting structured results.
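The four-corner normalized coordinate format of the coordinate positioning layer can be sketched as a small conversion from a pixel-space bounding box; the function name, variable names, and example image dimensions below are illustrative, not taken from the patent:

```python
def box_to_corners(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space bounding box into the normalized four-corner label
    [[x_min, y_min], [x_max, y_min], [x_max, y_max], [x_min, y_max]]."""
    nx0, ny0 = x_min / img_w, y_min / img_h
    nx1, ny1 = x_max / img_w, y_max / img_h
    # Corners listed clockwise from the top-left, matching the label system.
    return [[nx0, ny0], [nx1, ny0], [nx1, ny1], [nx0, ny1]]

# A 200x200-pixel print region on a hypothetical 1000x500 commodity image:
corners = box_to_corners(100, 50, 300, 250, img_w=1000, img_h=500)
# corners == [[0.1, 0.1], [0.3, 0.1], [0.3, 0.5], [0.1, 0.5]]
```

Normalizing by image width and height keeps the labels comparable across commodity images of different resolutions, which is what lets the same label system cover multiple product categories.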
The three stages of step 2 are respectively: in the first stage, the training target is to establish accurate coordinate prediction capability; the training data uses all training images but only the coordinate positioning la