CN-117235232-B - Training method and device for open question-answering and a multimodal large model, and related equipment
Abstract
The application discloses a training method, device, and related equipment for open question-answering and a multimodal large model. To prompt the multimodal large model to attend to spatial information, a matched image description text carrying spatial information is generated for each training image in the pre-training stage, where the spatial information represents the spatial positions of the objects contained in the training image. The multimodal large model is pre-trained with the training images and the image description texts augmented with explicit object spatial information, so that, beyond learning the semantic alignment between an image and its content description text, the model further attends to the spatial positions of objects in the image; that is, it acquires the capability to detect object spatial positions. On this basis, when the multimodal large model is applied to an open question-answering task, it can accurately give correct answers to questions related to spatial arrangement.
Inventors
- YIN BAOCAI
- WEI SI
- WANG SHIJIN
- LIU CONG
- HU GUOPING
- PAN JICAI
- LIU WENCHAO
- SHENG DIAN
- WU HAO
- BAI HANG
- HE SHAN
- YIN BING
- LIU QUAN
Assignees
- 科大讯飞股份有限公司 (iFLYTEK Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-10-23
Claims (15)
- 1. An open question-answering method, comprising: acquiring an input target image and a question; calling a configured multimodal large model, and inputting the target image and the question into the multimodal large model to obtain an output answer text corresponding to the question; wherein the multimodal large model is pre-trained by using training images and matched image description texts with spatial information, the spatial information being used for representing the spatial positions of objects in the training images; and wherein, before the configured multimodal large model is called, the method further comprises performing supervised fine-tuning on the pre-trained multimodal large model, the supervised fine-tuning process comprising: acquiring supervised training data, wherein the supervised training data comprises a training image, a question text related to spatial arrangement posed for the training image, and a matched answer label, the answer label comprising coordinate information of an object in the training image, so that the supervised training data contains both spatial orientation words and object coordinate information; and fine-tuning the pre-trained multimodal large model by using the supervised training data to obtain a fine-tuned multimodal large model, so that the fine-tuned multimodal large model learns the association between the coordinate information and the spatial orientation words.
- 2. The method of claim 1, further comprising pre-training the multimodal large model before the configured multimodal large model is called, the pre-training process comprising: acquiring a training image set; generating image description texts with spatial information matched with the training images in the training image set, wherein the spatial information is used for representing the spatial positions of objects in the training images; and pre-training the multimodal large model by using the training images and the matched image description texts until a set training end condition is met, to obtain the pre-trained multimodal large model.
- 3. The method of claim 2, wherein the spatial information in the image description text includes coordinate information of objects in the training image.
- 4. The method according to claim 3, wherein generating the image description text with spatial information matched with a training image in the training image set comprises: acquiring an initial image description text of the training image; acquiring coordinate information of the detection box where each object in the training image is located; and adding the coordinate information of the detection box where each object is located into the initial image description text, to obtain the image description text with spatial information matched with the training image.
- 5. The method of claim 4, wherein acquiring the coordinate information of the detection box where each object in the training image is located comprises: acquiring the coordinate information of the detection box of each object in the training image from an open-source data set of a target detection task; or, alternatively, carrying out target detection on the training image and determining the coordinate information of the resulting object detection boxes.
- 6. The method of claim 4, further comprising, before adding the coordinate information of the detection box where each object is located into the initial image description text: normalizing the coordinate information of the detection box where the object is located according to the size of the training image, and multiplying by 10^n to obtain the normalized coordinate information, wherein n is a positive integer greater than or equal to 1.
- 7. The method of claim 1, wherein acquiring the supervised training data comprises: acquiring the coordinate information of each detected object in a training image from an open-source data set of a target detection task; determining the basic spatial orientation of an object in the training image based on the object's coordinate information, and/or determining the spatial orientation relationship between different objects based on the coordinate information of each object; generating, with a pre-configured question-answer template, a question text related to spatial arrangement and a matched answer label according to each object's basic spatial orientation and/or the spatial orientation relationships between different objects; and determining the objects contained in the answer label and adding their coordinate information to the answer label.
- 8. The method of claim 7, wherein determining the basic spatial orientation of an object in the training image based on the object's coordinate information comprises: dividing the training image into a plurality of different orientation areas, each orientation area corresponding to one basic spatial orientation; and obtaining the basic spatial orientation of the object in the training image according to the target orientation area to which the object's coordinate information belongs (see the sketch following the claims).
- 9. The method of claim 1, wherein acquiring the supervised training data comprises: acquiring, from an open-source data set, an initial description text of a training image and the coordinate information of each object in the image; acquiring a template prompt instruction, the template prompt instruction comprising an image information slot and being used for instructing a large language model to design question-answer dialogue texts about the spatial position relationships of objects in the image based on the information in the image information slot; filling the initial description text of the training image and the coordinate information of the objects in it into the image information slot of the template prompt instruction to obtain an edited prompt instruction, and inputting the edited prompt instruction into a configured large language model to obtain a model-output question text and matched answer label; and determining the objects contained in the answer label and adding their coordinate information to the answer label.
- 10. The method of claim 1, further comprising, after obtaining the output answer text corresponding to the question: post-processing the answer text to remove the coordinate information of objects from the answer text.
- 11. A method for training a multimodal large model, comprising: acquiring a training image set; generating image description texts with spatial information matched with the training images in the training image set, wherein the spatial information is used for representing the spatial positions of objects in the training images; pre-training the multimodal large model by using the training images and the matched image description texts until a set training end condition is met, to obtain the pre-trained multimodal large model; and performing supervised fine-tuning on the pre-trained multimodal large model, the supervised fine-tuning process comprising: acquiring supervised training data, wherein the supervised training data comprises a training image, a question text related to spatial arrangement posed for the training image, and a matched answer label, the answer label comprising the coordinate information of an object in the training image, so that the supervised training data contains both spatial orientation words and object coordinate information; and fine-tuning the pre-trained multimodal large model by using the supervised training data to obtain a fine-tuned multimodal large model, so that the fine-tuned multimodal large model learns the association between the coordinate information and the spatial orientation words.
- 12. An open question-answering apparatus, comprising: an input acquisition unit, configured to acquire an input target image and a question; and a multimodal large model calling unit, configured to call a configured multimodal large model and input the target image and the question into the multimodal large model to obtain an output answer text corresponding to the question; wherein the multimodal large model is pre-trained by using training images and matched image description texts with spatial information, the spatial information being used for representing the spatial positions of objects in the training images; the apparatus is further configured to perform supervised fine-tuning on the pre-trained multimodal large model, the supervised fine-tuning process comprising: acquiring supervised training data, wherein the supervised training data comprises a training image, a question text related to spatial arrangement posed for the training image, and a matched answer label, the answer label comprising the coordinate information of an object in the training image, so that the supervised training data contains both spatial orientation words and object coordinate information; and fine-tuning the pre-trained multimodal large model by using the supervised training data to obtain a fine-tuned multimodal large model, so that the fine-tuned multimodal large model learns the association between the coordinate information and the spatial orientation words.
- 13. A multimodal large model training apparatus, comprising: a training image set acquisition unit, configured to acquire a training image set; an image description text generation unit, configured to generate image description texts with spatial information matched with the training images in the training image set, where the spatial information is used to represent the spatial positions of objects in the training images; and a pre-training unit, configured to pre-train the multimodal large model by using the training images and the matched image description texts until a set training end condition is met, to obtain the pre-trained multimodal large model; the apparatus is further configured to perform supervised fine-tuning on the pre-trained multimodal large model, the supervised fine-tuning process comprising: acquiring supervised training data, wherein the supervised training data comprises a training image, a question text related to spatial arrangement posed for the training image, and a matched answer label, the answer label comprising the coordinate information of an object in the training image, so that the supervised training data contains both spatial orientation words and object coordinate information; and fine-tuning the pre-trained multimodal large model by using the supervised training data to obtain a fine-tuned multimodal large model, so that the fine-tuned multimodal large model learns the association between the coordinate information and the spatial orientation words.
- 14. A data processing apparatus, comprising a memory and a processor; the memory is configured to store a program; the processor is configured to execute the program to implement the steps of the open question-answering method according to any one of claims 1 to 10, or the steps of the multimodal large model training method according to claim 11.
- 15. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the open question-answering method according to any one of claims 1 to 10, or the steps of the multimodal large model training method according to claim 11.
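For illustration, the question-answer construction of claims 7, 8, and 10 might look like the following minimal Python sketch, assuming axis-aligned detection boxes in (x1, y1, x2, y2) pixel format and a 3x3 grid of orientation areas; the function names, grid granularity, and templates are illustrative assumptions, not the patent's actual implementation.

```python
import re

# Hypothetical 3x3 grid of orientation areas, each mapped to a basic
# spatial orientation word (claim 8).
AZIMUTHS = [
    ["top left", "top", "top right"],
    ["left", "center", "right"],
    ["bottom left", "bottom", "bottom right"],
]

def basic_orientation(box, img_w, img_h):
    """Map a detection box (x1, y1, x2, y2) to the basic spatial
    orientation of the grid cell containing the box center."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(cx / img_w * 3), 2)
    row = min(int(cy / img_h * 3), 2)
    return AZIMUTHS[row][col]

def make_qa(label, box, img_w, img_h):
    """Fill a pre-configured question-answer template (claim 7); the
    answer label carries the object's coordinates (claim 1)."""
    orientation = basic_orientation(box, img_w, img_h)
    question = f"Where is the {label} in the image?"
    answer = f"The {label} {box} is at the {orientation} of the image."
    return question, answer

def strip_coordinates(answer):
    """Post-process an answer text to remove coordinate tuples (claim 10)."""
    return re.sub(r"\s*[(\[]\s*\d+(?:\s*,\s*\d+){3}\s*[)\]]", "", answer)

q, a = make_qa("vase", (120, 40, 180, 160), img_w=640, img_h=480)
print(q)                     # Where is the vase in the image?
print(a)                     # The vase (120, 40, 180, 160) is at the top left of the image.
print(strip_coordinates(a))  # The vase is at the top left of the image.
```

In this reading, strip_coordinates would run as the post-processing step of claim 10 at inference time, so the user sees only the orientation words while training still grounds those words in coordinates.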
Description
Training method and device for open question-answering and a multimodal large model, and related equipment

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a training method and device for open question-answering and multimodal large models, and to related equipment.

Background

With the development of artificial intelligence technology, and in particular the maturation of large model technology, more and more domain tasks are handled by large models. Taking image content understanding as an example, the prior art can train a multimodal large model to understand image content and conduct open question-answering over images input by a user. A multimodal large model typically represents the input image as an embedding vector and likewise converts the input question text into an embedding vector; both vectors are input into a large language model (Large Language Model, LLM), and the answer text corresponding to the question is decoded by the large language model. To achieve this, the multimodal large model generally needs to be pre-trained: during pre-training, an image sample is represented as an embedding and input into the LLM, the LLM decodes and outputs an image description text, and training aims at aligning the output of the LLM with the description text label corresponding to the image sample (this baseline pipeline is sketched at the end of this section).

This pre-training process ignores the spatial arrangement information of the objects in the image, so the trained multimodal large model is prone to spatial confusion, as shown in figs. 1a and 1b. For fig. 1a, when the user poses questions to the existing multimodal large model, the model outputs answers including: Q1: What is placed on the front table? Q2: What is placed on the back table? A2: A wooden coffee table is placed on the back table, with a vase and a potted plant. For fig. 1b: Q1: What is placed on top of the sofa? A1: A bookshelf is placed on top of the sofa, on which various books are placed. Q2: What is on the left side of the sofa? A2: A vase is placed on the left side of the sofa. As these examples show, the existing multimodal large model fails to learn the spatial arrangement information of the objects in the image, and therefore gives spatially confused, erroneous answers when answering questions related to spatial arrangement.
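For concreteness, the baseline pipeline described above can be sketched as follows; the stand-in encoder, the tiny transformer standing in for the LLM, and all dimensions are illustrative assumptions rather than any actual multimodal large model.

```python
import torch
import torch.nn as nn

class MultimodalQA(nn.Module):
    """Illustrative baseline: the input image and question text are each
    mapped to embedding vectors, concatenated, and handed to a language
    model that decodes the answer tokens."""
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # Stand-in image encoder (a real system would use e.g. a ViT).
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        # Tiny stand-in for the large language model (LLM).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, question_ids):
        img_emb = self.image_encoder(image).unsqueeze(1)   # (B, 1, d_model)
        txt_emb = self.text_embedding(question_ids)        # (B, T, d_model)
        hidden = self.llm(torch.cat([img_emb, txt_emb], dim=1))
        return self.lm_head(hidden)                        # next-token logits

model = MultimodalQA()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 13, 32000])
```

Pre-training in such a setup supervises the decoded tokens against the image's description text label, which is exactly where object spatial information is lost if the labels never mention it.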
Disclosure of Invention

In view of the above problems, the present application provides a training method and device for open question-answering and multimodal large models, and related equipment, which are intended to solve the problem that existing multimodal large models fail to learn the spatial arrangement information of objects in images and are prone to answer errors when answering questions related to spatial arrangement.

The specific scheme is as follows. In a first aspect, an open question-answering method is provided, including: acquiring an input target image and a question; calling a configured multimodal large model, and inputting the target image and the question into the multimodal large model to obtain an output answer text corresponding to the question; the multimodal large model is pre-trained by using training images and matched image description texts with spatial information, where the spatial information is used for representing the spatial positions of objects in the training images.

Preferably, before the configured multimodal large model is called, the method further comprises pre-training the multimodal large model, the pre-training process comprising: acquiring a training image set; generating image description texts with spatial information matched with the training images in the training image set, where the spatial information is used for representing the spatial positions of objects in the training images; and pre-training the multimodal large model by using the training images and the matched image description texts until a set training end condition is met, to obtain the pre-trained multimodal large model.

Preferably, the spatial information in the image description text includes coordinate information of objects in the training image.

Preferably, the process of generating the image description text with spatial information matched with a training image in the training image set includes: acquiring an initial image description text of the training image; acquiring coordinate information of the detection box where each object in the training image is located; and adding the coordinate information of the detection box where each object is located into the initial image description text, to obtain the image description text with spatial information matched with the training image, as illustrated in the sketch below.
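As a minimal sketch of the two steps just described, the following Python snippet normalizes detection-box coordinates by the image size and scales them by 10^n, then splices each object's coordinates into the initial description text after the object's first mention; the insertion-by-first-mention rule and all names are illustrative assumptions rather than the patent's actual implementation.

```python
def normalize_box(box, img_w, img_h, n=3):
    """Normalize (x1, y1, x2, y2) by the image size and scale by 10**n,
    giving resolution-independent integer coordinates."""
    scale = 10 ** n
    x1, y1, x2, y2 = box
    return (round(x1 / img_w * scale), round(y1 / img_h * scale),
            round(x2 / img_w * scale), round(y2 / img_h * scale))

def add_spatial_info(caption, detections, img_w, img_h):
    """Splice each object's normalized detection-box coordinates into the
    initial image description text after the object's first mention."""
    for label, box in detections:
        coords = normalize_box(box, img_w, img_h)
        caption = caption.replace(label, f"{label} {coords}", 1)
    return caption

caption = "A vase stands on a table."
detections = [("vase", (120, 40, 180, 160)), ("table", (60, 150, 580, 460))]
print(add_spatial_info(caption, detections, 640, 480))
# A vase (188, 83, 281, 333) stands on a table (94, 312, 906, 958).
```

With n = 3, the coordinates become integers in [0, 1000] regardless of the original image resolution, so description texts from images of different sizes share one coordinate vocabulary.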