CN-121982314-A - Image material processing method and system
Abstract
The present invention relates to the field of data processing technologies, and in particular to a method and a system for processing image materials. The method comprises: obtaining an image material to be processed; performing semantic analysis on the image material with a preset visual language model to obtain a text prompt corresponding to each element; inputting the text prompts and the image material into a preset target detection model to obtain detection results; determining the interaction matching threshold of the hierarchy corresponding to each text prompt or bounding box; screening the bounding boxes according to the comparison between each bounding box's confidence and the corresponding interaction matching threshold to obtain target bounding boxes; and inputting the target bounding boxes and the corresponding text prompts into a preset segmentation model to generate a segmentation mask corresponding to each element. The method enables automatic positioning, separation, and fill-in restoration of elements at different hierarchies in the image material, and improves both the efficiency and the precision of image material processing.
Inventors
- ZHU JINGHUI
- HE YINGJIA
- DING JUNWEI
- CHEN DEPIN
Assignees
- 钛动科技股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A method of processing image material, comprising: acquiring an image material to be processed, and performing semantic analysis on the image material with a preset visual language model to obtain a text prompt corresponding to each element; inputting the text prompts and the image material into a preset target detection model to obtain a detection result, wherein the detection result comprises bounding boxes and the confidence of each bounding box; determining the interaction matching threshold of the hierarchy corresponding to each text prompt or bounding box; screening the bounding boxes according to the comparison between each bounding box's confidence and the corresponding interaction matching threshold to obtain target bounding boxes; and inputting the target bounding boxes and the corresponding text prompts into a preset segmentation model to generate a segmentation mask corresponding to each element.
- 2. The image material processing method according to claim 1, wherein screening the bounding boxes comprises: judging whether the confidence of a bounding box is greater than the corresponding interaction matching threshold; if so, retaining the bounding box, and if not, discarding the bounding box.
- 3. The image material processing method according to claim 1, wherein determining the interaction matching threshold of a hierarchy comprises: determining the semantic entropy and the visual distribution dispersion of the hierarchy, wherein the semantic entropy represents the semantic diversity and ambiguity of the hierarchy, and the visual distribution dispersion represents the visual complexity of the hierarchy; performing a weighted summation of the semantic entropy and the visual distribution dispersion to obtain a comprehensive complexity score; calculating the ratio of the hierarchy's comprehensive complexity score to the total comprehensive complexity score to obtain the semantic hierarchy weight; and determining the interaction matching threshold according to the semantic hierarchy weight, wherein the interaction matching threshold is positively correlated with the semantic hierarchy weight.
- 4. The image material processing method according to claim 3, wherein in the calculation expression of the interaction matching threshold, τ_l denotes the interaction matching threshold of hierarchy l, τ_0 denotes a preset basic threshold, α denotes a preset sensitivity coefficient, and w_l denotes the semantic hierarchy weight of hierarchy l.
- 5. The image material processing method according to claim 1, wherein the segmentation model is a SAM3 model, and inputting the target bounding boxes and the corresponding text prompts into the preset segmentation model to generate the segmentation mask corresponding to each element comprises: determining the global existence probability of a target bounding box in the image material; determining a concept matching score according to the global existence probability, wherein the concept matching score is positively correlated with the global existence probability; judging whether the concept matching score is greater than a preset matching threshold, and if so, performing cross-modal attention fusion in the visual-language coding space to obtain a fusion result; and decoding the fusion result to obtain the segmentation mask corresponding to each element.
- 6. The image material processing method according to claim 1 or 5, wherein the segmentation model includes a lightweight adaptation module, and the processing procedure of the lightweight adaptation module comprises: freezing the backbone weights of the segmentation model; inserting a learnable multi-layer perceptron at each stage of the multi-layer Transformer encoder of the segmentation model; and, for the visual features of each layer, generating a task prompt vector with the multi-layer perceptron and embedding the task prompt vector into the attention module of that layer's Transformer encoder.
- 7. The image material processing method according to claim 1, further comprising, after generating the segmentation masks corresponding to the respective elements: separating the image material into a plurality of hierarchical partial images according to the segmentation masks; and, using the segmentation masks as guidance, performing detail completion on each hierarchical partial image with a preset image restoration model to obtain complete hierarchical images.
- 8. The image material processing method according to claim 7, wherein in the expression for obtaining the complete hierarchical image, I_l denotes the complete image of the l-th hierarchy, F denotes the image restoration model, R denotes a residual prediction network in the image restoration model, ⊙ denotes the Hadamard product, P_l denotes the l-th hierarchical partial image, and M_l denotes the l-th segmentation mask.
- 9. The image material processing method according to claim 7, further comprising performing multi-scale feature fusion during detail completion, wherein in the fusion expression, F_{l-1}, F_l, and F_{l+1} denote the visual features of the (l-1)-th, l-th, and (l+1)-th layers, w denotes the adaptive fusion weight, [·;·] denotes channel-wise concatenation, Up(·) denotes the up-sampling operation, and Conv denotes a convolution layer.
- 10. An image material processing system comprising a processor and a memory, the memory storing computer program instructions that when executed by the processor implement the image material processing method of any one of claims 1-9.
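The hierarchy-aware screening of claims 2-4 can be sketched in Python. The linear threshold form `tau0 + alpha * w` is an assumption consistent only with the positive correlation stated in the claims; the entropy and dispersion values, the weighting factor `lam`, the defaults `tau0=0.3` and `alpha=0.4`, and the box data are all illustrative.

```python
def hierarchy_weights(semantic_entropy, visual_dispersion, lam=0.5):
    """Weighted sum of semantic entropy and visual distribution dispersion
    gives a comprehensive complexity score per hierarchy; each semantic
    hierarchy weight is that score's share of the total (claim 3)."""
    scores = [lam * h + (1 - lam) * d
              for h, d in zip(semantic_entropy, visual_dispersion)]
    total = sum(scores)
    return [s / total for s in scores]

def interaction_thresholds(weights, tau0=0.3, alpha=0.4):
    """Assumed linear form: the threshold rises with the semantic
    hierarchy weight, as claim 4's positive correlation requires."""
    return [tau0 + alpha * w for w in weights]

def screen_boxes(boxes, thresholds):
    """Keep a bounding box only if its confidence exceeds the
    interaction matching threshold of its hierarchy (claim 2)."""
    return [b for b in boxes if b["conf"] > thresholds[b["level"]]]

# Illustrative values for three hierarchies (e.g. background, subject, text).
w = hierarchy_weights([0.9, 0.5, 0.2], [0.7, 0.4, 0.1])
tau = interaction_thresholds(w)
boxes = [
    {"level": 0, "conf": 0.80},
    {"level": 1, "conf": 0.35},
    {"level": 2, "conf": 0.50},
]
kept = screen_boxes(boxes, tau)
```

Normalizing the complexity scores makes the weights sum to 1, so hierarchies that are semantically ambiguous or visually complex (e.g. semi-transparent or stacked elements) receive stricter thresholds than simple ones.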
Description
Image material processing method and system

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and a system for processing image materials.

Background

In scenes such as graphic design, advertisement production, and e-commerce material generation, existing image materials often need to be decoupled, that is, elements such as backgrounds, persons, product subjects, and brand text in an image are separated for secondary editing or reuse. Such separation is usually performed manually, element by element, through identification, matting, and restoration; however, manual operation is time-consuming and prone to inconsistency. To address this, the related art uses semantic segmentation, instance segmentation, or panoptic segmentation models to divide an image into regions of different semantic categories, such as persons, objects, backgrounds, and text. Typically, a pixel-level mask is generated using a Transformer or convolutional network architecture. This approach can distinguish between different kinds of elements, but for semi-transparent, shadowed, stacked, or intricately decorated elements, the pixel-level mask is prone to errors or omissions, reducing the accuracy of image decoupling. In this regard, how to improve the accuracy of image material decoupling is a technical problem that currently needs to be solved.

Disclosure of Invention

In order to solve the technical problem that elements such as semi-transparent elements, stacked elements, or complex decorative elements are easily mis-segmented or missed, the invention provides the following aspects.
In a first aspect, the invention provides an image material processing method, which comprises: obtaining an image material to be processed; performing semantic analysis on the image material with a preset visual language model to obtain a text prompt corresponding to each element; inputting the text prompts and the image material into a preset target detection model to obtain a detection result, the detection result comprising bounding boxes and the confidence of each bounding box; determining the interaction matching threshold of the hierarchy corresponding to each text prompt or bounding box; screening the bounding boxes according to the comparison between each bounding box's confidence and the corresponding interaction matching threshold to obtain target bounding boxes; and inputting the target bounding boxes and the corresponding text prompts into a preset segmentation model to generate a segmentation mask corresponding to each element.

Further, screening the bounding boxes comprises: judging whether the confidence of a bounding box is greater than the corresponding interaction matching threshold; if so, retaining the bounding box, and if not, discarding the bounding box.
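The overall flow of the first aspect can be sketched as control code. Here `vlm`, `detector`, and `segmenter` are hypothetical callables standing in for the preset visual language model, target detection model, and segmentation model; the toy stubs only illustrate the data flow, not any real model API.

```python
def process_image_material(image, vlm, detector, thresholds, segmenter):
    """Semantic analysis -> text-prompted detection -> hierarchy-aware
    screening -> prompted segmentation, per the first aspect."""
    prompts = vlm(image)                      # one text prompt per element
    masks = {}
    for prompt in prompts:
        detections = detector(image, prompt)  # [(box, confidence), ...]
        level = prompt["level"]
        # Keep only boxes whose confidence beats the hierarchy's threshold.
        targets = [box for box, conf in detections
                   if conf > thresholds[level]]
        if targets:
            masks[prompt["text"]] = segmenter(image, targets, prompt["text"])
    return masks

# Toy stand-ins to show the data flow end to end.
image = "poster.png"
vlm = lambda img: [{"text": "brand logo", "level": 0},
                   {"text": "background", "level": 1}]
detector = lambda img, p: [((0, 0, 10, 10), 0.9), ((5, 5, 8, 8), 0.2)]
segmenter = lambda img, boxes, text: f"mask:{text}:{len(boxes)}"
out = process_image_material(image, vlm, detector, [0.5, 0.5], segmenter)
```

Elements whose every detection falls below the hierarchy threshold are simply skipped, which is how the method avoids forwarding low-confidence boxes to the segmentation stage.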
Further, determining the interaction matching threshold of a hierarchy comprises: determining the semantic entropy and the visual distribution dispersion of the hierarchy, wherein the semantic entropy represents the semantic diversity and ambiguity of the hierarchy, and the visual distribution dispersion represents the visual complexity of the hierarchy; performing a weighted summation of the semantic entropy and the visual distribution dispersion to obtain a comprehensive complexity score; calculating the ratio of the hierarchy's comprehensive complexity score to the total comprehensive complexity score to obtain the semantic hierarchy weight; and determining the interaction matching threshold according to the semantic hierarchy weight, wherein the interaction matching threshold is positively correlated with the semantic hierarchy weight.

Further, in the calculation expression of the interaction matching threshold, τ_l denotes the interaction matching threshold of hierarchy l, τ_0 denotes a preset basic threshold, α denotes a preset sensitivity coefficient, and w_l denotes the semantic hierarchy weight of hierarchy l.

Further, the segmentation model is a SAM3 model, and inputting the target bounding boxes and the corresponding text prompts into the preset segmentation model to generate the segmentation mask corresponding to each element comprises: determining the global existence probability of a target bounding box in the image material; determining a concept matching score according to the global existence probability, the concept matching score being positively correlated with the global existence probability; judging whether the concept matching score is greater than a preset matching threshold, and if so, performing cross-modal attention fusion in the visual-language coding space to obtain a fusion result; and decoding the fusion result to obtain the segmentation masks.

Further, the segmentation model comprises a lightweight adaptation module
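The concept-matching gate described above can be sketched as follows. The sigmoid mapping and its parameters `k` and `midpoint` are assumptions, since the description only specifies that the concept matching score increases with the global existence probability.

```python
import math

def concept_match_score(global_prob, k=10.0, midpoint=0.5):
    """Monotonically increasing mapping from a target box's global
    existence probability to a concept matching score; the sigmoid
    form is illustrative, only the positive correlation is stated."""
    return 1.0 / (1.0 + math.exp(-k * (global_prob - midpoint)))

def gate_for_fusion(global_probs, match_threshold=0.5):
    """Only boxes whose concept matching score exceeds the preset
    matching threshold proceed to cross-modal attention fusion."""
    return [p for p in global_probs
            if concept_match_score(p) > match_threshold]

# Illustrative global existence probabilities for three target boxes.
passed = gate_for_fusion([0.9, 0.55, 0.2])
```

Gating before the attention fusion step keeps boxes that match no concept in the image out of the visual-language coding space, which is consistent with the stated goal of reducing erroneous masks.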