CN-121982295-A - Interactive reasoning method, device, equipment and medium based on visual cue encoder
Abstract
The invention relates to the technical field of artificial intelligence, is applicable to the fields of financial technology and medical health, and discloses an interactive reasoning method, device, equipment and medium based on a visual cue encoder. The method comprises: receiving an image to be inferred, a question text and visual cues of various types input by a user, the visual cues being marks made by the user at target reasoning positions on the image to be inferred; converting the image to be inferred into an image token sequence through a pre-trained visual encoder; normalizing the coordinate information of the visual cues through the visual cue encoder, extracting spatial features through position encoding, generating type embedding features, fusing the two and mapping the fused features to a unified feature space to obtain a visual cue token sequence; converting the question text into a text token sequence through a pre-trained text encoder; and splicing the three token sequences into a multi-modal input sequence and inputting it into a multi-modal large language model for reasoning to obtain an answer text, thereby achieving both flexible and diverse interaction modes and high-precision fine-grained understanding.
Inventors
- ZHANG XULONG
- XIE JUNFEI
Assignees
- 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date
- 20260505
- Application Date
- 20260302
Claims (10)
- 1. An interactive reasoning method based on a visual cue encoder, the method comprising: receiving an image to be inferred, a question text and a visual cue input by a user, wherein the visual cue is a mark made by the user at a target reasoning position on the image to be inferred, and the visual cue comprises multiple types; converting the image to be inferred into an image token sequence through a pre-trained visual encoder; normalizing coordinate information of the visual cue through the visual cue encoder, performing position encoding on the normalized coordinate information to extract spatial features, generating corresponding type embedding features according to the type of the visual cue, fusing the spatial features with the type embedding features and mapping the fused features to a unified feature space to generate a unified visual cue token sequence; converting the question text into a text token sequence through a pre-trained text encoder; and splicing the image token sequence, the visual cue token sequence and the text token sequence to obtain a multi-modal input sequence, and inputting the multi-modal input sequence into a multi-modal large language model for reasoning to obtain an answer text.
- 2. The method of claim 1, wherein the visual cue comprises at least one of a point mark, a bounding box mark and a free shape mark, and normalizing the coordinate information of the visual cue through the visual cue encoder, performing position encoding on the normalized coordinate information to extract spatial features, generating corresponding type embedding features according to the type of the visual cue, fusing the spatial features with the type embedding features and mapping the fused features to a unified feature space, and generating a unified visual cue token sequence comprises: if the visual cue comprises a point mark, extracting two-dimensional coordinate information corresponding to the point mark and normalizing the two-dimensional coordinate information to a preset numerical interval; applying high-frequency Fourier feature position encoding to the normalized two-dimensional coordinate information to extract spatial features; adding a learnable type embedding feature corresponding to the point mark type to the spatial features to obtain fused features; and mapping the fused features to the unified feature space through a linear layer to generate a visual cue token sequence corresponding to the point mark.
- 3. The method of claim 1, wherein the visual cue comprises at least one of a point mark, a bounding box mark and a free shape mark, and normalizing the coordinate information of the visual cue through the visual cue encoder, performing position encoding on the normalized coordinate information to extract spatial features, generating corresponding type embedding features according to the type of the visual cue, fusing the spatial features with the type embedding features and mapping the fused features to a unified feature space, and generating a unified visual cue token sequence comprises: if the visual cue comprises a bounding box mark, extracting two sets of two-dimensional coordinate information of diagonal corners of the bounding box mark and normalizing the two sets of two-dimensional coordinate information respectively to a preset numerical interval; applying high-frequency position encoding to the two sets of normalized two-dimensional coordinate information respectively to extract two corresponding sets of spatial features; splicing a learnable type embedding feature corresponding to the bounding box mark type onto each set of spatial features to obtain two sets of fused features; and mapping the two sets of fused features to the unified feature space through a linear layer respectively, and splicing the two sets of mapped features to generate a visual cue token sequence corresponding to the bounding box mark.
- 4. The method of claim 1, wherein the visual cue comprises at least one of a point mark, a bounding box mark and a free shape mark, and normalizing the coordinate information of the visual cue through the visual cue encoder, performing position encoding on the normalized coordinate information to extract spatial features, generating corresponding type embedding features according to the type of the visual cue, fusing the spatial features with the type embedding features and mapping the fused features to a unified feature space, and generating a unified visual cue token sequence comprises: if the visual cue comprises a free shape mark, extracting contour coordinate information of the free shape mark, determining an enclosing bounding box based on the contour coordinate information, and extracting two sets of two-dimensional coordinate information of diagonal corners of the enclosing bounding box; normalizing the two sets of two-dimensional coordinate information respectively to a preset numerical interval; applying high-frequency position encoding to the two sets of normalized two-dimensional coordinate information respectively to extract two corresponding sets of spatial features; splicing a learnable type embedding feature corresponding to the free shape mark type onto each set of spatial features to obtain two sets of fused features; and mapping the two sets of fused features to the unified feature space through a linear layer respectively, and splicing the two sets of mapped features to generate a visual cue token sequence corresponding to the free shape mark.
- 5. The method of claim 1, wherein splicing the image token sequence, the visual cue token sequence and the text token sequence to obtain the multi-modal input sequence comprises: adding a visual cue start marker token at the starting position of the visual cue token sequence; adding a visual cue end marker token at the end position of the visual cue token sequence; and splicing the marked visual cue token sequence, the image token sequence and the text token sequence in a preset order to form the multi-modal input sequence.
- 6. The method of any one of claims 1 to 5, wherein the step of training the visual cue encoder comprises: constructing a first training data set, wherein the first training data set comprises a detection data set and a segmentation data set, and both comprise image samples, corresponding visual cue labels and annotation information; freezing the model parameters of the visual encoder and of the multi-modal large language model, and configuring a projection layer connected in series with the visual cue encoder, wherein the projection layer is used for mapping the fused features output by the visual cue encoder to the unified feature space; jointly training the visual cue encoder and the projection layer with the first training data set, so that the visual cue features output by the projection layer are aligned in the unified feature space with the image features output by the pre-trained visual encoder, and prediction results based on the visual cue features conform to the annotation information of the first training data set; constructing a second training data set by automatically marking target areas on image samples through a mark point set prompting technique, inputting the marked image samples into a pre-trained visual language model, and generating question-answer pairs related to visual cues to form the second training data set; and unfreezing the model parameters of the multi-modal large language model, and performing end-to-end training on the overall framework comprising the visual cue encoder, the projection layer and the multi-modal large language model with the second training data set, so that the overall framework generates, from the input image to be inferred, visual cue and question text, answer text conforming to the reference answers in the second training data set.
- 7. The method of claim 6, wherein the step of automatically marking target areas on image samples through the mark point set prompting technique, inputting the marked image samples into the pre-trained visual language model, generating question-answer pairs related to visual cues and forming the second training data set comprises: selecting a plurality of image samples, and, for the target area and the corresponding target object in each image sample, automatically marking a plurality of visual cue points on the contour, key parts and characteristic areas of the target object in the target area through the mark point set prompting technique to generate image samples with visual cue marks; and inputting the image samples with visual cue marks into the pre-trained visual language model to generate question-answer pairs related to the visual cues, and constructing the second training data set based on the question-answer pairs.
- 8. An interactive reasoning device based on a visual cue encoder, characterized in that it comprises units for performing the method of any one of claims 1 to 7.
- 9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
- 10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Description
Interactive reasoning method, device, equipment and medium based on visual cue encoder

Technical Field

The invention relates to the technical field of artificial intelligence, is applicable to the fields of financial technology and medical health, and in particular relates to an interactive reasoning method, device, equipment and medium based on a visual cue encoder.

Background

In recent years, multi-modal large language models have made remarkable progress in aligning images with text and can handle tasks such as image captioning and visual question answering. However, existing mainstream models typically encode and understand an image as a whole and lack the ability to perceive specific regions or pixel-level details in the image. In practical applications, users often want to interact with specific objects in an image. For example, in an insurance claims scenario a user may ask about the claim criteria corresponding to the damaged parts of an accident vehicle enclosed by a red box in the image, or in medical diagnosis a physician may wish to know in detail the pathological features of a specific lesion area in a CT image. Traditional solutions rely mainly on plain text descriptions: the user can only describe the target location in language, which is often not accurate enough when targets are dense or the background is complex, making it difficult to capture the user's real intent. One existing improvement is rigid region-of-interest extraction; some prior art relies on pre-trained segmentation models or fixed bounding box inputs, which generally require the user to provide an accurate mask or support only a single type of prompt (for example, only box drawing) and therefore lack flexibility. A multi-modal interactive reasoning method that supports multiple interaction modes and has fine-grained perception capability is therefore needed to overcome this technical bottleneck.

Disclosure of the Invention

The invention provides an interactive reasoning method, device, equipment and medium based on a visual cue encoder, which solve the problem that existing visual question answering requires the user to provide an accurate mask or supports only a single type of prompt and therefore lacks flexibility.
In a first aspect, an interactive reasoning method based on a visual cue encoder is provided, the method comprising: receiving an image to be inferred, a question text and a visual cue input by a user, wherein the visual cue is a mark made by the user at a target reasoning position on the image to be inferred, and the visual cue comprises multiple types; converting the image to be inferred into an image token sequence through a pre-trained visual encoder; normalizing coordinate information of the visual cue through the visual cue encoder, performing position encoding on the normalized coordinate information to extract spatial features, generating corresponding type embedding features according to the type of the visual cue, fusing the spatial features with the type embedding features and mapping the fused features to a unified feature space to generate a unified visual cue token sequence; converting the question text into a text token sequence through a pre-trained text encoder; and splicing the image token sequence, the visual cue token sequence and the text token sequence to obtain a multi-modal input sequence, and inputting the multi-modal input sequence into a multi-modal large language model for reasoning to obtain an answer text.

In a second aspect, an interactive reasoning apparatus based on a visual cue encoder is provided, comprising units for performing the above method.

In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the above method.

The interactive reasoning method, device, equipment and medium based on a visual cue encoder provided by the invention accept, when user input is received, marks of multiple types made by the user at target reasoning positions on the image to be inferred; the visual cue encoder normalizes the coordinate information of the visual cue, performs position encoding on the normalized coordinate information to extract spatial features, generates corresponding type embedding features according to the type of the visual cue, fuses the spatial features with the type embedding features and maps them to a unified feature space to generate a unified visual cue token sequence; combined with the pre-trained visual encoder, the pre-trained text encoder and the multi-modal large language model, flexible and diverse interaction modes and high-precision fine-grained understanding are achieved at the same time. Non-limiting example sketches of the above steps are given below.
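By way of non-limiting illustration, the following is a minimal sketch in Python/PyTorch of the point-mark path of claims 1 and 2: a user-clicked coordinate is normalized to [0, 1], expanded with high-frequency Fourier feature position encoding, fused by addition with a learnable type embedding, and projected by a linear layer into a unified feature space. The class name, feature dimensions, frequency bands and the [0, 1] interval are assumptions of this sketch, not part of the disclosure.

```python
import math
import torch
import torch.nn as nn

CUE_TYPES = {"point": 0, "box": 1, "free_shape": 2}   # assumed type ids

class CuePointEncoder(nn.Module):
    """Sketch of claim 2: point mark -> one visual cue token."""

    def __init__(self, num_freqs: int = 16, llm_dim: int = 4096):
        super().__init__()
        # Fixed high-frequency bands for Fourier feature position encoding.
        self.register_buffer("freq_bands", 2.0 ** torch.arange(num_freqs) * math.pi)
        feat_dim = 2 * 2 * num_freqs                    # (x, y) x (sin, cos) x num_freqs
        self.type_embed = nn.Embedding(len(CUE_TYPES), feat_dim)  # learnable type features
        self.proj = nn.Linear(feat_dim, llm_dim)        # map to the unified feature space

    def fourier_features(self, xy_norm: torch.Tensor) -> torch.Tensor:
        # xy_norm: (N, 2) coordinates normalized to the preset interval [0, 1].
        angles = xy_norm.unsqueeze(-1) * self.freq_bands                    # (N, 2, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)   # (N, 4F)

    def forward(self, points_px: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
        # 1) Normalize pixel coordinates to [0, 1].
        xy_norm = points_px / points_px.new_tensor([img_w, img_h])
        # 2) Position-encode the normalized coordinates to extract spatial features.
        spatial = self.fourier_features(xy_norm)
        # 3) Fuse with the learnable "point" type embedding (fusion by addition).
        type_id = torch.full((points_px.size(0),), CUE_TYPES["point"],
                             dtype=torch.long, device=points_px.device)
        fused = spatial + self.type_embed(type_id)
        # 4) Linear mapping into the unified feature space -> cue token sequence.
        return self.proj(fused)                         # (N, llm_dim)

# Usage: one click at pixel (320, 180) on a 640x360 image -> one cue token.
point_tokens = CuePointEncoder()(torch.tensor([[320.0, 180.0]]), img_w=640, img_h=360)
```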
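A corresponding sketch of the bounding-box path of claim 3, reusing the CuePointEncoder above: the two diagonal corners are normalized and position-encoded separately, the learnable box type embedding is spliced (concatenated) onto each set of spatial features, each fused feature is mapped by a linear layer, and the two mapped features together form a two-token cue sequence. Sharing one linear layer for both corners is an assumption of the sketch.

```python
class CueBoxEncoder(CuePointEncoder):
    """Sketch of claim 3: bounding box mark -> two visual cue tokens."""

    def __init__(self, num_freqs: int = 16, llm_dim: int = 4096):
        super().__init__(num_freqs, llm_dim)
        feat_dim = 2 * 2 * num_freqs
        # The type embedding is spliced (concatenated) onto the spatial features,
        # so the linear layer's input is twice as wide as in the point path.
        self.proj = nn.Linear(2 * feat_dim, llm_dim)

    def forward(self, box_px: torch.Tensor, img_w: int, img_h: int,
                cue_type: str = "box") -> torch.Tensor:
        # box_px: (x1, y1, x2, y2), the two diagonal corners in pixels.
        corners = box_px.view(2, 2) / box_px.new_tensor([img_w, img_h])  # normalize corners
        spatial = self.fourier_features(corners)        # (2, 4F), one row per corner
        type_id = torch.full((2,), CUE_TYPES[cue_type],
                             dtype=torch.long, device=box_px.device)
        fused = torch.cat([spatial, self.type_embed(type_id)], dim=-1)   # splice type feature
        return self.proj(fused)                         # (2, llm_dim) cue token sequence

# Usage: a user-drawn box from (100, 50) to (400, 300) on a 640x360 image.
box_tokens = CueBoxEncoder()(torch.tensor([100.0, 50.0, 400.0, 300.0]), 640, 360)
```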
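Claim 4 reduces a free shape mark (for example, a lasso or scribble contour) to its enclosing bounding box and then follows the same path as the bounding box mark, with the free-shape type embedding. A sketch under the same assumptions:

```python
def encode_free_shape(encoder: CueBoxEncoder, contour_px: torch.Tensor,
                      img_w: int, img_h: int) -> torch.Tensor:
    """Sketch of claim 4: contour points (K, 2) -> enclosing box -> two cue tokens."""
    # Enclosing bounding box of the free shape: diagonal corners
    # (x_min, y_min) and (x_max, y_max) of the contour.
    x_min, y_min = contour_px.min(dim=0).values
    x_max, y_max = contour_px.max(dim=0).values
    box = torch.stack([x_min, y_min, x_max, y_max])
    # Same path as the bounding box mark, but with the free-shape type embedding.
    return encoder(box, img_w, img_h, cue_type="free_shape")

# Usage: a rough scribble around an object, given as contour points.
contour = torch.tensor([[120.0, 80.0], [200.0, 60.0], [260.0, 150.0], [140.0, 170.0]])
shape_tokens = encode_free_shape(CueBoxEncoder(), contour, 640, 360)
```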
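For claim 5, the visual cue token sequence is bracketed by dedicated start and end marker tokens before being spliced with the image and text token sequences. The sketch below assumes learnable marker embeddings and an image-cue-text splicing order; the claim itself only requires some preset order, and the sequence lengths in the usage line are illustrative.

```python
class MultimodalSplicer(nn.Module):
    """Sketch of claim 5: bracket the cue tokens and build the multi-modal input sequence."""

    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # Learnable <cue_start> / <cue_end> marker tokens (assumed realization).
        self.cue_start = nn.Parameter(torch.randn(1, llm_dim) * 0.02)
        self.cue_end = nn.Parameter(torch.randn(1, llm_dim) * 0.02)

    def forward(self, image_tokens: torch.Tensor, cue_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # Add the start/end markers around the cue tokens, then splice all three
        # sequences in a preset order (image, cue, text assumed here).
        cue_seq = torch.cat([self.cue_start, cue_tokens, self.cue_end], dim=0)
        return torch.cat([image_tokens, cue_seq, text_tokens], dim=0)   # (L, llm_dim)

# Usage: the spliced sequence is what the multi-modal large language model consumes.
splicer = MultimodalSplicer()
mm_sequence = splicer(torch.zeros(256, 4096),   # image tokens from the visual encoder
                      box_tokens,               # visual cue tokens from the cue encoder
                      torch.zeros(32, 4096))    # text tokens from the text encoder
```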
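Claim 6 trains in two stages: first the visual encoder and the multi-modal large language model are frozen and only the visual cue encoder and a projection layer are trained on the first (detection and segmentation) data set; then the large language model is unfrozen and the whole framework is trained end to end on the second (visual cue question-answer) data set. A schematic sketch of the freeze/unfreeze logic follows; the optimizer, learning rates and the loss callables are assumptions, not part of the disclosure.

```python
import torch.optim as optim

def train_two_stages(vis_encoder, cue_encoder, projection, mllm,
                     stage1_loader, stage2_loader, stage1_loss, stage2_loss):
    # Stage 1: freeze the visual encoder and the multi-modal LLM; train only the
    # visual cue encoder and its projection layer on the first data set
    # (detection + segmentation samples with visual cue labels).
    vis_encoder.requires_grad_(False)
    mllm.requires_grad_(False)
    opt1 = optim.AdamW(list(cue_encoder.parameters()) + list(projection.parameters()),
                       lr=1e-4)   # assumed optimizer and learning rate
    for batch in stage1_loader:
        loss = stage1_loss(vis_encoder, cue_encoder, projection, batch)
        opt1.zero_grad()
        loss.backward()
        opt1.step()

    # Stage 2: unfreeze the LLM and train the whole framework (cue encoder,
    # projection layer and LLM) end to end on the second data set
    # (visual cue question-answer pairs).
    mllm.requires_grad_(True)
    params = (list(cue_encoder.parameters()) + list(projection.parameters())
              + list(mllm.parameters()))
    opt2 = optim.AdamW(params, lr=2e-5)
    for batch in stage2_loader:
        loss = stage2_loss(vis_encoder, cue_encoder, projection, mllm, batch)
        opt2.zero_grad()
        loss.backward()
        opt2.step()
```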
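Claim 7 builds the second training data set by automatically placing cue points on the contour, key parts and characteristic areas of each target object and letting a pre-trained visual language model generate question-answer pairs about them. The sketch below samples points evenly along a segmentation-mask contour with OpenCV and delegates generation to a placeholder generate_qa_with_vlm callable that stands in for whatever visual language model is actually used; both the sampling rule and that callable are assumptions.

```python
import cv2
import numpy as np

def mark_cue_points(mask: np.ndarray, num_points: int = 8) -> np.ndarray:
    """Sketch of claim 7: sample cue points evenly along the target object's contour."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    outline = max(contours, key=cv2.contourArea).reshape(-1, 2)   # largest object's outline
    idx = np.linspace(0, len(outline) - 1, num_points).astype(int)
    return outline[idx]                                           # (num_points, 2) pixel coords

def build_second_dataset(samples, generate_qa_with_vlm):
    """samples: iterable of (image, target_mask) pairs -> visual cue QA records."""
    records = []
    for image, mask in samples:
        points = mark_cue_points(mask)
        # Placeholder call: a pre-trained visual language model receives the image
        # together with the marked cue points and returns (question, answer) pairs.
        for question, answer in generate_qa_with_vlm(image, points):
            records.append({"image": image, "cue_points": points.tolist(),
                            "question": question, "answer": answer})
    return records
```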