CN-121981130-A - Visual language generation method, device, computer equipment and storage medium

CN121981130A

Abstract

The invention relates to the field of natural language processing and discloses a visual language generation method, device, computer equipment and storage medium. The method comprises: acquiring an original image and text information, and dividing the original image into a plurality of sub-images; processing the original image, the text information and the plurality of sub-images through a dynamic attention weight calculation module to obtain attention guidance weights; processing the original image, the plurality of sub-images, the text information and a generated token sequence according to a large vision-language model and the attention guidance weights to obtain token distribution data, wherein the token distribution data comprise candidate tokens and their distribution probabilities; and processing the token distribution data through a decoding adaptive constraint module to obtain the target token of the current time step. The invention can improve the accuracy and credibility of the generated visual language. It can be applied to financial technology and intelligent elderly-care scenarios, improving the intelligence and personalization of services.

Inventors

  • WANG JIANZONG
  • ZHANG XULONG
  • PAN PENG

Assignees

  • 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A visual language generation method, comprising: acquiring an original image and text information, and dividing the original image into a plurality of sub-images; processing the original image, the text information and the plurality of sub-images through a dynamic attention weight calculation module to obtain attention guidance weights; processing the original image, the plurality of sub-images, the text information and a generated token sequence according to a large vision-language model and the attention guidance weights to obtain token distribution data, wherein the token distribution data comprise candidate tokens and their distribution probabilities; and processing the token distribution data through a decoding adaptive constraint module to obtain a target token of the current time step.
  2. The visual language generation method according to claim 1, wherein dividing the original image into a plurality of sub-images comprises: acquiring a first size of the original image and a second size of the sub-images to be segmented; acquiring an overlap segmentation strategy matched with the first size and the second size; and dividing the original image according to the overlap segmentation strategy to obtain the plurality of sub-images.
  3. The visual language generation method according to claim 1, wherein processing the original image, the text information and the plurality of sub-images through the dynamic attention weight calculation module to obtain the attention guidance weights comprises: processing the original image, the text information and the generated token sequence through an attention mechanism to obtain a plurality of attention matrices for the current time step; determining a single-layer single-head attention matrix from the plurality of attention matrices; calculating an aggregate attention score for each sub-image according to the single-layer single-head attention matrix; and normalizing all the aggregate attention scores to obtain the attention guidance weights.
  4. The visual language generation method according to claim 3, wherein processing the original image, the text information and the generated token sequence through an attention mechanism to obtain a plurality of attention matrices for the current time step comprises: encoding and fusing the original image, the text information and the generated token sequence to obtain a multimodal context; determining a key matrix from the multimodal context, and acquiring a query vector of the currently generated token; and determining the attention matrices from the key matrix and the query vector.
  5. The visual language generation method according to claim 3, wherein determining a single-layer single-head attention matrix from the plurality of attention matrices comprises: calculating an average attention score for each of the attention matrices; selecting the top-K layer attention matrices and the top-H head attention matrices according to the average attention scores; and averaging the selected K layer attention matrices and H head attention matrices to obtain the single-layer single-head attention matrix.
  6. The visual language generation method according to claim 1, wherein processing the original image, the plurality of sub-images, the text information and the generated token sequence according to the large vision-language model and the attention guidance weights to obtain the token distribution data comprising candidate tokens and their distribution probabilities comprises: processing the original image, the text information and the generated token sequence through the large vision-language model to obtain global token distribution data; processing the plurality of sub-images, the text information and the generated token sequence through the large vision-language model to obtain local token distribution data; weighting and fusing the local token distribution data according to the attention guidance weights to obtain target local token distribution data; and performing weighted integration of the global token distribution data and the target local token distribution data to obtain the token distribution data.
  7. The visual language generation method according to claim 1, wherein processing the token distribution data through the decoding adaptive constraint module to obtain the target token of the current time step comprises: obtaining a weighted probability sum of each candidate token over all sub-images; setting a head candidate token set according to the weighted probability sums; truncating the token distribution data according to the head candidate token set to obtain a target token set; and sampling from the target token set to obtain the target token.
  8. A visual language generation apparatus, comprising: an image and text acquisition module, configured to acquire an original image and text information and divide the original image into a plurality of sub-images; a guidance weight calculation module, configured to process the original image, the text information and the plurality of sub-images through a dynamic attention weight calculation module to obtain attention guidance weights; a token distribution module, configured to process the original image, the plurality of sub-images, the text information and the generated token sequence according to a large vision-language model and the attention guidance weights to obtain token distribution data, wherein the token distribution data comprise candidate tokens and their distribution probabilities; and a decoding adaptive constraint module, configured to process the token distribution data to obtain the target token of the current time step.
  9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the visual language generation method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the visual language generation method according to any one of claims 1 to 7.
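Claims 6 and 7 together describe one decoding step: fuse the global and per-sub-image token distributions under the attention guidance weights, truncate to a head candidate set, and sample the target token. The following is a minimal NumPy sketch of that step; the function name, the mixing coefficient `alpha`, and the cumulative-mass threshold `mass_threshold` are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def fuse_and_sample(global_probs, local_probs, guide_weights,
                    alpha=0.5, mass_threshold=0.9, rng=None):
    """One decoding step: fuse global/local token distributions,
    truncate to a head candidate set, and sample the target token.

    global_probs : (V,) token distribution from the full image.
    local_probs  : (S, V) distributions from the S sub-images.
    guide_weights: (S,) attention guidance weights, summing to 1.
    alpha        : global/local mixing coefficient (assumed hyperparameter).
    """
    rng = rng or np.random.default_rng()
    # Weighted fusion of the local distributions by the guidance weights.
    target_local = guide_weights @ local_probs            # (V,)
    # Weighted integration of the global and target local distributions.
    fused = alpha * global_probs + (1 - alpha) * target_local
    fused /= fused.sum()
    # Head candidate set: smallest set whose cumulative mass reaches the threshold.
    order = np.argsort(fused)[::-1]
    cum = np.cumsum(fused[order])
    keep = order[: np.searchsorted(cum, mass_threshold) + 1]
    # Truncate the distribution, renormalize, and sample the target token.
    head = fused[keep] / fused[keep].sum()
    return int(rng.choice(keep, p=head))
```

Under this sketch, the guidance weights steer decoding toward tokens supported by the attended sub-images, and the truncation step discards low-probability candidates before sampling.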

Description

Visual language generation method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing, and in particular to a visual language generation method, apparatus, computer device, and storage medium.

Background

In recent years, large vision-language models (LVLMs) have shown broad application potential in financial technology and medical and health care. In financial scenarios, LVLMs can be used for ATM monitoring, identity verification and fraud recognition to analyze transaction risks in real time; in health and care monitoring, such models can analyze videos of elderly people's behavior, automatically recognize emergencies such as falls and abnormal activities, and improve the response efficiency of remote care. However, LVLMs suffer from a common problem of "object hallucination": generating text that describes objects not present in the image, or erroneously describing their attributes, number or location. In the financial field, such hallucinations may lead to misjudging transaction behavior or identity information, causing security risks; in medical scenarios, real events may be missed or false alarms generated, endangering users' safety. Existing solutions, such as data-augmented fine-tuning or decoding strategies like contrastive decoding, often depend on external modules, incur large computational overhead, and struggle to dynamically capture the contextual information of key image regions. In complex bank-counter environments or home-care scenarios in particular, a static attention mechanism is susceptible to background interference, leading to false positives for fine movements or sensitive operations.
Therefore, a new visual language generation method is needed to improve the accuracy and reliability of LVLMs under the high-reliability requirements of financial security and health care.

Disclosure of Invention

The embodiments of the invention provide a visual language generation method, apparatus, computer device and storage medium, so as to improve the accuracy and credibility of the generated visual language. A visual language generation method comprises: acquiring an original image and text information, and dividing the original image into a plurality of sub-images; processing the original image, the text information and the plurality of sub-images through a dynamic attention weight calculation module to obtain attention guidance weights; processing the original image, the plurality of sub-images, the text information and the generated token sequence according to a large vision-language model and the attention guidance weights to obtain token distribution data, wherein the token distribution data comprise candidate tokens and their distribution probabilities; and processing the token distribution data through a decoding adaptive constraint module to obtain the target token of the current time step. Optionally, dividing the original image into a plurality of sub-images includes: acquiring a first size of the original image and a second size of the sub-images to be segmented; acquiring an overlap segmentation strategy matched with the first size and the second size; and dividing the original image according to the overlap segmentation strategy to obtain the plurality of sub-images.
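The overlap segmentation strategy described above can be sketched as a sliding window whose stride leaves a fixed fractional overlap between adjacent sub-images. The `overlap` ratio and the flush final window are illustrative assumptions, not details given by the patent.

```python
def overlap_split(h, w, sub_h, sub_w, overlap=0.25):
    """Overlapping split: slide a sub_h x sub_w window over an h x w image,
    with stride chosen so adjacent windows share the given overlap fraction.
    Returns (top, left, bottom, right) boxes; crop with image[top:bottom, left:right].
    Assumes h >= sub_h and w >= sub_w."""
    def offsets(full, sub, step):
        offs = list(range(0, full - sub + 1, step))
        if offs[-1] != full - sub:   # keep the final window flush with the edge
            offs.append(full - sub)
        return offs
    step_h = max(1, int(sub_h * (1 - overlap)))
    step_w = max(1, int(sub_w * (1 - overlap)))
    return [(t, l, t + sub_h, l + sub_w)
            for t in offsets(h, sub_h, step_h)
            for l in offsets(w, sub_w, step_w)]
```

For a 224x224 image split into 112x112 sub-images with 25% overlap, this yields a 3x3 grid of nine partially overlapping windows, so objects straddling a grid boundary still appear whole in at least one sub-image.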
Optionally, processing the original image, the text information and the plurality of sub-images through the dynamic attention weight calculation module to obtain the attention guidance weights includes: processing the original image, the text information and the generated token sequence through an attention mechanism to obtain a plurality of attention matrices for the current time step; determining a single-layer single-head attention matrix from the plurality of attention matrices; calculating an aggregate attention score for each sub-image according to the single-layer single-head attention matrix; and normalizing all the aggregate attention scores to obtain the attention guidance weights. Optionally, processing the original image, the text information and the generated token sequence through an attention mechanism to obtain a plurality of attention matrices for the current time step includes: encoding and fusing the original image, the text information and the generated token sequence to obtain a multimodal context; determining a key matrix from the multimodal context, and acquiring a query vector of the currently generated token; and determining the attention matrices from the key matrix and the query vector. Optionally, determining a single-layer single-head attention matrix from the plurality of attention matrices includes: calculating an average attention score for each
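The attention aggregation described above (selecting high-scoring layers and heads, averaging them into a single-layer single-head matrix, scoring each sub-image, and normalizing) can be sketched as follows. The tensor layout, the top-K/top-H selection by mean score, and the softmax normalization are illustrative assumptions; the patent does not fix these details.

```python
import numpy as np

def guidance_weights(attn, sub_token_idx, k_layers=4, h_heads=4):
    """Collapse per-layer, per-head attention into a single-layer single-head
    row, then score each sub-image and normalize to guidance weights.

    attn          : (L, H, T) attention of the current query token over T keys.
    sub_token_idx : list of index arrays, the key positions of each sub-image.
    k_layers / h_heads : how many top layers/heads to keep (assumed values).
    """
    L, H, T = attn.shape
    # Average attention score of each (layer, head) slice.
    scores = attn.mean(axis=-1)                               # (L, H)
    # Keep the K layers and H heads with the highest average scores.
    top_layers = np.argsort(scores.mean(axis=1))[-k_layers:]
    top_heads = np.argsort(scores.mean(axis=0))[-h_heads:]
    # Average the selected slices into one single-layer single-head row.
    single = attn[np.ix_(top_layers, top_heads)].mean(axis=(0, 1))  # (T,)
    # Aggregate score per sub-image, then softmax-normalize into weights.
    agg = np.array([single[idx].sum() for idx in sub_token_idx])
    e = np.exp(agg - agg.max())
    return e / e.sum()
```

The resulting weights sum to one and can be fed directly into the weighted fusion of local token distributions described in the claims.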