CN-120931961-B - Image processing method and device and electronic equipment

CN120931961B

Abstract

The disclosure provides an image processing method, an image processing apparatus, and an electronic device. The method comprises: obtaining image-text comparison data, wherein the data comprise a homologous first image and second image together with the description texts corresponding to each; obtaining first image features and second image features for the first and second images through a target image encoder; obtaining first text features of the description text corresponding to the first image and second text features of the description text corresponding to the second image through a target text encoder; constructing hard negative sample feature groups from the first and second image features and the first and second text features; determining, through a target loss function, a composite loss value corresponding to the hard negative sample feature groups of the image-text comparison data; and adjusting the parameters of the target image encoder and the target text encoder according to the composite loss value. Embodiments of the disclosure can improve the ability of an image encoder to distinguish fine-grained differences between images.

Inventors

  • YANG SIQI
  • WANG ZITENG
  • MA LIN

Assignees

  • Beijing Sankuai Cloud Computing Co., Ltd. (北京三快云计算有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-07-22

Claims (10)

  1. An image processing method, comprising: obtaining image-text comparison data, wherein the image-text comparison data comprises a first image and a second image together with a description text corresponding to each, the first image and the second image are homologous images, a fine-grained difference area exists between the first image and the second image, and the pixel area ratio of the fine-grained difference area in each of the first image and the second image is smaller than a preset value; acquiring first image features and second image features corresponding to the first image and the second image through a target image encoder, and acquiring first text features of the description text corresponding to the first image and second text features of the description text corresponding to the second image through a target text encoder; constructing hard negative sample feature groups from the first image features, the second image features, the first text features, and the second text features, wherein each hard negative sample feature group comprises a first element, a second element, and a third element, the first element and the second element correspond to the same image but have different feature types, and the second element and the third element have the same feature type but correspond to different images; and determining, through a target loss function, a composite loss value corresponding to the hard negative sample feature groups of the image-text comparison data, and adjusting parameters of the target image encoder and the target text encoder according to the composite loss value.
  2. The image processing method according to claim 1, wherein the second image is generated from the first image according to a local editing instruction for the first image, and the description text corresponding to the second image is generated from the description text corresponding to the first image according to the same local editing instruction.
  3. The image processing method according to claim 1, wherein the description text corresponding to the second image is generated by locally modifying the description text corresponding to the first image, and the second image is generated from the description text corresponding to the second image.
  4. The image processing method according to claim 1, wherein the hard negative sample feature groups include a first group and a second group for evaluating the distance from an image feature to a text feature; the first element of the first group is the first image feature, its second element is the first text feature, and its third element is the second text feature; the first element of the second group is the second image feature, its second element is the second text feature, and its third element is the first text feature.
  5. The image processing method according to claim 1 or 4, wherein the hard negative sample feature groups include a third group and a fourth group for evaluating the distance from a text feature to an image feature; the first element of the third group is the first text feature, its second element is the first image feature, and its third element is the second image feature; the first element of the fourth group is the second text feature, its second element is the second image feature, and its third element is the first image feature.
  6. The image processing method according to claim 1, wherein determining, through the target loss function, the composite loss value corresponding to the hard negative sample feature groups of a plurality of image-text comparison data comprises: obtaining the m-th class of hard negative sample feature groups corresponding to N groups of image-text comparison data, wherein the first, second, and third elements within the m-th class of hard negative sample feature groups are of the same respective types, N ≥ 1, 1 ≤ m ≤ M, and M is the number of hard negative sample feature groups corresponding to one group of image-text comparison data; determining, from the m-th class of hard negative sample feature groups corresponding to the N groups of image-text comparison data, an m-th loss value through the m-th loss term corresponding to that class, wherein the target loss function consists of M loss terms; and obtaining the composite loss value as the sum of the M loss values.
  7. The image processing method according to claim 6, wherein the loss term comprises:

     L_NegCL = -(1/N) Σ_{i=1..N} log [ exp(<f(x_i), g(y_i)>/τ) / ( Σ_{k=1..N} exp(<f(x_i), g(y_k)>/τ) + Σ_{j=1..N} exp(<f(x_i), g(y'_j)>/τ) ) ]

     wherein L_NegCL represents the m-th loss term corresponding to the m-th class of hard negative sample feature groups; X, Y, and Y' are respectively the sets of first, second, and third elements of the m-th class of hard negative sample feature groups of the N groups of image-text comparison data; f(x_i) is the first element in the m-th class group of the i-th batch of image-text comparison data; g(y_i) is the second element in the m-th class group of the i-th batch; g(y'_j) is the third element in the m-th class group of the j-th batch; g(y_k) is the second element in the m-th class group of the k-th batch; and τ is a temperature hyperparameter.
  8. An image processing apparatus, comprising an image encoder, wherein the image encoder is the target image encoder trained by the image processing method of any one of claims 1-7.
  9. An electronic device, comprising: a memory; and a processor coupled to the memory, the processor being configured to perform the method of any one of claims 1-7 based on instructions stored in the memory.
  10. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the method of any one of claims 1-7.
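The loss term of claim 7 can be sketched numerically. The following is a minimal NumPy sketch under the assumption that the term is an InfoNCE-style contrastive loss whose denominator pools the in-batch second elements g(y_k) with the hard negative third elements g(y'_j); the function and argument names are illustrative, not the patent's notation.

```python
import numpy as np

def negcl_loss(f_x, g_y, g_y_prime, tau=0.07):
    """Illustrative hard-negative contrastive loss (assumed InfoNCE form).

    f_x:       (N, D) first elements, e.g. image features f(x_i)
    g_y:       (N, D) second elements, e.g. matching text features g(y_i)
    g_y_prime: (N, D) third elements, the hard negatives g(y'_j)
    Rows are assumed L2-normalized; tau is the temperature hyperparameter.
    """
    sim_y = f_x @ g_y.T / tau           # f(x_i) vs every g(y_k), shape (N, N)
    sim_hard = f_x @ g_y_prime.T / tau  # f(x_i) vs every g(y'_j), shape (N, N)

    logits = np.concatenate([sim_y, sim_hard], axis=1)  # pooled denominator, (N, 2N)
    # Numerically stable log-sum-exp over the pooled logits.
    m = logits.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

    pos = np.diag(sim_y)                # positive-pair logit per batch item
    return float(np.mean(lse - pos))    # mean negative log-softmax of the positive
```

Because the positive logit also appears in the pooled denominator, the value is non-negative, and it decreases as each f(x_i) aligns with its g(y_i) while moving away from the hard negatives g(y'_j).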

Description

Image processing method and device and electronic equipment

Technical Field

The disclosure relates to the technical field of machine learning, and in particular to an image processing method, an image processing apparatus, and an electronic device.

Background

Vision-language models (VLMs) such as CLIP have achieved significant success in connecting vision to language, but they still fall short in fine-grained detail understanding, particularly in perceiving color, quantity, and spatial relationships. Existing fine-grained enhancement methods, such as region-level contrastive learning, self-distillation, and hard negative sample mining strategies, each suffer from limitations: reliance on additional manual region labeling, the limited capability of teacher models, and a lack of visual similarity or fine-grained control in the generated samples (e.g., random replacement of local text or image regions). Thus, there is a need for a better way to enhance a model's understanding of fine-grained image detail.

It should be noted that the information disclosed in the above Background section is only for enhancing understanding of the background of the present disclosure and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The disclosure aims to provide an image processing method, an image processing apparatus, and an electronic device that at least improve an image encoder's ability to distinguish fine-grained detail differences between images.
According to a first aspect of embodiments of the disclosure, an image processing method is provided. The method includes: obtaining image-text comparison data, wherein the image-text comparison data includes a first image and a second image together with a description text corresponding to each, the first image and the second image are homologous images, a fine-grained difference area exists between the first image and the second image, and the pixel area ratio of the fine-grained difference area in each image is smaller than a preset value; acquiring first image features and second image features corresponding to the first image and the second image through a target image encoder, and acquiring first text features of the description text corresponding to the first image and second text features of the description text corresponding to the second image through a target text encoder; constructing hard negative sample feature groups from the first image features, the second image features, the first text features, and the second text features, wherein each hard negative sample feature group includes a first element, a second element, and a third element, the first element and the second element correspond to the same image but have different feature types, and the second element and the third element have the same feature type but correspond to different images; and determining, through a target loss function, a composite loss value corresponding to the hard negative sample feature groups of the image-text comparison data, and adjusting parameters of the target image encoder and the target text encoder according to the composite loss value.
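The four feature groups described above (and enumerated in claims 4 and 5) can be written down directly. The sketch below is illustrative only; the tuple layout and names are assumptions, not the patent's notation. Each group is a (first, second, third) element triple: the first two groups are anchored on image features, the last two on text features.

```python
def build_hard_negative_groups(img_feat_1, img_feat_2, txt_feat_1, txt_feat_2):
    """Return the four (first, second, third) element triples of claims 4-5.

    In each triple the first and second elements share an image but differ
    in feature type, while the second and third elements share a feature
    type but come from different images.
    """
    return [
        # Image-to-text groups (claim 4).
        (img_feat_1, txt_feat_1, txt_feat_2),  # first group
        (img_feat_2, txt_feat_2, txt_feat_1),  # second group
        # Text-to-image groups (claim 5).
        (txt_feat_1, img_feat_1, img_feat_2),  # third group
        (txt_feat_2, img_feat_2, img_feat_1),  # fourth group
    ]
```

For one image-text pair this yields M = 4 groups, matching claim 6's composite loss as a sum over M loss terms, one per group type.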
According to a second aspect of embodiments of the present disclosure, there is provided an image processing apparatus comprising an image encoder, wherein the image encoder is the target image encoder trained by the image processing method of any one of the above.

According to a third aspect of the present disclosure, there is provided an electronic device comprising a memory and a processor coupled to the memory, the processor being configured to perform the method of any one of the above based on instructions stored in the memory.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the image processing method of any one of the above.

According to embodiments of the disclosure, the target image encoder and the target text encoder extract features from two homologous images containing a fine-grained difference area and from the texts corresponding to those images; a plurality of hard negative sample feature groups are built from the extracted features; the loss values corresponding to the hard negative sample feature groups are determined through a target loss function; and the target image encoder and the target text encoder are then optimized according to those loss values, so that the target image encoder improves its fine-grained visual-semantic understanding by learning from the hard negatives.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
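The "pixel area ratio" condition on the fine-grained difference area can be checked mechanically. A minimal NumPy sketch, assuming image arrays of identical shape with a trailing channel axis and a hypothetical threshold (the patent leaves the preset value unspecified):

```python
import numpy as np

def is_fine_grained_pair(img_a, img_b, preset_ratio=0.05):
    """Return (qualifies, ratio): whether the fraction of pixels that differ
    between two homologous images is below the preset value.

    preset_ratio=0.05 is an assumed placeholder threshold, not a value
    taken from the patent.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("homologous images must share a shape")
    # A pixel counts as part of the difference area if any channel differs.
    diff_mask = np.any(img_a != img_b, axis=-1)
    ratio = float(diff_mask.mean())
    return ratio < preset_ratio, ratio
```

In practice a tolerance-based comparison (or a segmentation of the edited region) would replace the exact inequality for lossy image formats; the exact-match mask is the simplest form of the condition.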