CN-121982717-A - Method, device, equipment and medium for generating text description of image

Abstract

The present application relates to the field of artificial intelligence technologies, and in particular to a method, an apparatus, a device, and a medium for generating a text description of an image. The method and the apparatus are applied to medical scenarios. When the i-th text token is generated, an activation score of each image token is calculated from the initial attention weight of that image token and the corresponding current attention weight. Image tokens whose activation scores are greater than a preset threshold are determined to be target image tokens, and the number of times each target image token has appeared when the i-th text token is generated is counted. The target attention weight of each image token is then determined according to the number of times each target image token has appeared. This attention score mechanism allows attention to shift flexibly among different regions of the image and avoids the attention rigidity caused by naive enhancement methods, so that the model can describe more distinct objects and details in the image, which improves recall, descriptive richness, and the accuracy of the text description.

Inventors

WANG JIANZONG
ZHANG XULONG
PAN PENG

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-20

Claims (10)

1. A text description generation method for an image, characterized in that the text description generation method comprises: acquiring N image tokens of an image to be described and i-1 generated text tokens, and determining, according to the N image tokens and the i-1 generated text tokens, an initial attention weight of each image token when the i-th text token is generated, wherein N is an integer greater than zero and i is an integer greater than 1; acquiring a previous attention weight of each image token when the (i-1)-th text token was generated, and, for any image token, calculating a current attention weight of the image token when the i-th text token is generated according to the previous attention weight of the image token and the initial attention weight of the corresponding image token; calculating an activation score of each image token when the i-th text token is generated according to the initial attention weight of each image token and the corresponding current attention weight; determining the image tokens whose activation scores are greater than a preset threshold as target image tokens, and counting the number of times each target image token has appeared when the i-th text token is generated; determining a target attention weight of each image token according to the number of times each target image token has appeared; and generating the i-th text token according to the target attention weight of each image token, the image tokens, and the i-1 text tokens, to obtain an image description text of the image to be described.
2. The text description generation method of claim 1, wherein determining the initial attention weight of each image token when the i-th text token is generated according to the N image tokens and the i-1 text tokens comprises: concatenating and fusing the N image tokens and the i-1 text tokens to obtain fused feature vectors; performing a linear transformation on the fused feature vectors to obtain query vectors and key vectors; and calculating the initial attention weight of each image token according to the query vectors and the key vectors.
3. The text description generation method according to claim 1, wherein, for any image token, calculating the current attention weight of the image token when the i-th text token is generated according to the previous attention weight of the image token and the initial attention weight of the corresponding image token comprises: performing a weighted summation of the initial attention weight of the image token when the i-th text token is generated and the previous attention weight of the image token when the (i-1)-th text token was generated, to obtain the current attention weight corresponding to the image token.
4. The text description generation method according to claim 1, wherein calculating the activation score of each image token when the i-th text token is generated according to the initial attention weight of each image token and the corresponding current attention weight comprises: calculating the activation score of the image token according to the current attention weight of the image token when the i-th text token is generated and the previous attention weight of the image token when the (i-1)-th text token was generated.
5. The text description generation method according to claim 1, wherein determining the target attention weight of each image token according to the number of times each target image token has appeared comprises: obtaining a scaling parameter, and calculating the target attention weight of each image token according to the scaling parameter and the number of times the corresponding target image token has appeared.
6. The text description generation method of claim 1, wherein generating the i-th text token according to the target attention weight of each image token, the image tokens, and the i-1 text tokens comprises: acquiring a value vector of each image token; calculating an attention feature of each image token according to the target attention weight and the value vector of the image token; and generating the i-th text token according to the attention feature of each image token and the i-1 text tokens.
7. The text description generation method according to claim 6, wherein acquiring the N image tokens of the image to be described comprises: partitioning the image to be described into N image blocks; performing feature encoding and position encoding on each image block to obtain an image encoding feature and a position encoding feature of each image block; and concatenating the image encoding feature and the position encoding feature of each image block to obtain the image token corresponding to that image block.
8. A text description generation apparatus for an image, characterized in that the text description generation apparatus comprises: an acquisition module, configured to acquire N image tokens of an image to be described and i-1 generated text tokens, and to determine, according to the N image tokens and the i-1 generated text tokens, an initial attention weight of each image token when the i-th text token is generated, wherein N is an integer greater than zero and i is an integer greater than 1; a first calculation module, configured to acquire a previous attention weight of each image token when the (i-1)-th text token was generated, and, for any image token, to calculate a current attention weight of the image token when the i-th text token is generated according to the previous attention weight of the image token and the initial attention weight of the corresponding image token; a second calculation module, configured to calculate an activation score of each image token when the i-th text token is generated according to the initial attention weight of each image token and the corresponding current attention weight; a statistics module, configured to determine the image tokens whose activation scores are greater than a preset threshold as target image tokens, and to count the number of times each target image token has appeared when the i-th text token is generated; a determining module, configured to determine a target attention weight of each image token according to the number of times each target image token has appeared; and a generation module, configured to generate the i-th text token according to the target attention weight of each image token, the image tokens, and the i-1 text tokens, to obtain an image description text of the image to be described.
9. A computer device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the text description generation method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the text description generation method of any one of claims 1 to 7.
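
As a non-limiting illustration, the per-step attention adjustment recited in claims 1 and 3 to 5 can be sketched in Python as follows. The claims do not fix the formulas, so the smoothing factor `beta`, the score `initial - current`, the boost `1 + alpha / count`, and the final renormalization are all assumptions made for this sketch, as are the function and parameter names.

```python
from typing import List, Tuple


def reweight_step(
    initial: List[float],   # initial attention weights at step i
    previous: List[float],  # attention weights kept from step i-1
    counts: List[int],      # times each token has been a target so far
    beta: float = 0.5,      # assumed smoothing factor for the weighted sum
    threshold: float = 0.05,  # the "preset threshold" (value assumed)
    alpha: float = 1.0,     # the "scaling parameter" (value assumed)
) -> Tuple[List[float], List[int], List[float]]:
    """Return (current weights, updated counts, target weights)."""
    # Claim 3: weighted sum of the previous and the initial weights.
    current = [beta * p + (1.0 - beta) * a for a, p in zip(initial, previous)]

    # Activation score: ASSUMED here to be initial minus current, i.e. how
    # far the raw attention exceeds its smoothed value.
    scores = [a - c for a, c in zip(initial, current)]

    # Tokens whose score exceeds the threshold become target tokens, and
    # their appearance counts are incremented.
    is_target = [s > threshold for s in scores]
    new_counts = [n + 1 if t else n for n, t in zip(counts, is_target)]

    # ASSUMED rule: boost each target token by a factor that shrinks as its
    # count grows, so repeatedly activated regions do not dominate.
    raw = [
        a * (1.0 + alpha / n) if t else a
        for a, n, t in zip(initial, new_counts, is_target)
    ]

    # Renormalize so the target weights sum to one (also an assumption).
    total = sum(raw)
    target = [r / total for r in raw]
    return current, new_counts, target
```

Carrying `new_counts` and `current` forward to the next call gives the count-dependent behaviour over a whole description; under the assumed rule, a token that keeps exceeding the threshold receives a progressively smaller boost.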

Description

Method, device, equipment and medium for generating text description of image

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a method, an apparatus, a device, and a medium for generating a text description of an image.

Background

In the medical field, accurate and comprehensive image description is important for disease diagnosis, treatment planning, medical research, and similar tasks. In recent years, multimodal large language models have made significant progress in fusing visual and linguistic information, enabling the generation of textual descriptions from image input. When generating longer descriptions, however, a multimodal large language model gradually pays less attention to the visual information. This attention decay makes the model prone to forgetting the image content while generating the latter half of the description, which produces hallucinations or missing details and lowers the accuracy of the final text description. How to improve the accuracy of the text description has therefore become a problem to be solved when generating text descriptions of images.

Disclosure of Invention

In view of this, the embodiments of the present application provide a method, an apparatus, a device, and a medium for generating a text description of an image, so as to solve the problem of low accuracy when generating a text description of an image.
In a first aspect, an embodiment of the present application provides a method for generating a text description of an image, where the method includes: acquiring N image tokens of an image to be described and i-1 generated text tokens, and determining, according to the N image tokens and the i-1 generated text tokens, an initial attention weight of each image token when the i-th text token is generated, wherein N is an integer greater than zero and i is an integer greater than 1; acquiring a previous attention weight of each image token when the (i-1)-th text token was generated, and, for any image token, calculating a current attention weight of the image token when the i-th text token is generated according to the previous attention weight of the image token and the initial attention weight of the corresponding image token; calculating an activation score of each image token when the i-th text token is generated according to the initial attention weight of each image token and the corresponding current attention weight; determining the image tokens whose activation scores are greater than a preset threshold as target image tokens, and counting the number of times each target image token has appeared when the i-th text token is generated; determining a target attention weight of each image token according to the number of times each target image token has appeared; and generating the i-th text token according to the target attention weight of each image token, the image tokens, and the i-1 text tokens, to obtain an image description text of the image to be described.
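
The first and last steps of this method (determining the initial attention weights, and using the reweighted image tokens for generation) are detailed in claims 2 and 6. A minimal sketch follows, assuming scaled dot-product attention with a softmax taken over the image positions only, the query taken from the most recent text token, and a summed attention feature. None of these choices is stated in the source, and all names (`project`, `initial_image_attention`, `attention_feature`, `w_q`, `w_k`) are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(weight: Matrix, x: Vector) -> Vector:
    """Linear transformation: multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def initial_image_attention(
    image_tokens: List[Vector],
    text_tokens: List[Vector],
    w_q: Matrix,
    w_k: Matrix,
) -> List[float]:
    """Initial attention weight of each image token for the next text token."""
    # Concatenate image and text tokens into one fused sequence.
    fused = image_tokens + text_tokens
    # Query from the last position (ASSUMED: the latest generated text token).
    query = project(w_q, fused[-1])
    # Keys for the image positions of the fused sequence.
    keys = [project(w_k, tok) for tok in fused[: len(image_tokens)]]
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the image positions (ASSUMED scope).
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_feature(weights: List[float], values: List[Vector]) -> Vector:
    """Weight each image token's value vector and sum the results."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

In a full decoder the weights passed to `attention_feature` would be the target attention weights rather than the initial ones, and the resulting feature would be combined with the i-1 text tokens to predict the i-th text token.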
In a second aspect, an embodiment of the present application provides a text description generation apparatus for an image, the text description generation apparatus including: an acquisition module, configured to acquire N image tokens of an image to be described and i-1 generated text tokens, and to determine, according to the N image tokens and the i-1 generated text tokens, an initial attention weight of each image token when the i-th text token is generated, wherein N is an integer greater than zero and i is an integer greater than 1; a first calculation module, configured to acquire a previous attention weight of each image token when the (i-1)-th text token was generated, and, for any image token, to calculate a current attention weight of the image token when the i-th text token is generated according to the previous attention weight of the image token and the initial attention weight of the corresponding image token; a second calculation module, configured to calculate an activation score of each image token when the i-th text token is generated according to the initial attention weight of each image token and the corresponding current attention weight; a statistics module, configured to determine the image tokens whose activation scores are greater than a preset threshold as target image tokens, and to count the number of times each target image token has appeared when the i-th text token is generated; a determining module, configured to determine a target attention weight of each image token according to the number of times each target image token has appeared; and a generation module, configured to generate the i-th text token according to the target attention weight of each image token, the image tokens, and the i-1 text tokens, to obtain an image description text of the image to be described.
In a third aspect, embodiments of the present application provide a computer device comprising a processor, a memory, and a co