
CN-122021775-A - Test-time learning method for spatial reasoning of large visual language models

CN122021775A

Abstract

The invention discloses a test-time learning method for spatial reasoning of large visual language models. Aiming at the problems that existing visual language models, lacking supervision signals, exhibit poor spatial reasoning robustness and produce predictions that violate geometric consistency, it provides an unsupervised online adaptation scheme. The method dynamically updates model parameters during the test stage without labeled data, and significantly improves the model's accuracy and physical plausibility on quantitative spatial reasoning tasks such as distance, object size, and direction.

Inventors

  • HUANG SHUANGPING
  • ZHANG GEGE

Assignees

  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-07

Claims (10)

  1. A test-time learning method for spatial reasoning of a large visual language model, characterized by comprising the following steps: Step 1, acquiring an input image and an original query, and expanding the original query into a group of auxiliary queries satisfying a geometric coupling relation through a query enhancement strategy; Step 2, inputting the auxiliary queries into a pre-trained model to obtain initial prediction results, performing consistency verification on the initial prediction results using a geometric constraint condition, and converting the verified numerical values into a structured pseudo-label distribution; Step 3, constructing an optimization function containing a geometric consistency loss, with the pseudo-labels as the optimization target; and Step 4, dynamically updating specific parameters of the visual language model in the inference stage by minimizing the geometric consistency loss.
  2. The test-time learning method for spatial reasoning of a large visual language model of claim 1, wherein said Step 1 comprises the following steps: Step 101, identifying the target attribute in the original query, the target attribute being one of diagonal distance, vertical distance, or horizontal distance; Step 102, generating, based on template transformation, auxiliary queries having a geometric coupling relation with the target attribute, and constructing an enhanced query set, wherein the geometric coupling relation is defined by the Pythagorean theorem: d_diag² = d_vert² + d_horiz².
  3. The test-time learning method for spatial reasoning of a large visual language model of claim 2, wherein said Step 101 comprises the following steps: Step 10101, performing entity recognition on the original query text, and locating the origin object entity and the end object entity to be measured in the input image; Step 10102, extracting the spatial measurement keyword of the original query, the keyword being one of diagonal distance, vertical distance, and horizontal distance; Step 10103, performing semantic mapping according to the extracted keyword, and classifying the original query into the target attribute.
  4. The test-time learning method for spatial reasoning of a large visual language model of claim 3, wherein said Step 102 comprises the following steps: Step 10201, according to the identified target attribute, retrieving the two complementary-dimension templates from a preset geometric relationship template library; if the original attribute is the diagonal distance, retrieving the vertical distance template and the horizontal distance template; Step 10202, filling the origin object entity and the end object entity into the two retrieved complementary-dimension templates, generating the corresponding auxiliary query texts, and forming the enhanced query set together with the original query.
  5. The test-time learning method for spatial reasoning of a large visual language model of claim 4, wherein said Step 2 comprises the following steps: Step 201, pairing the input image with each query in the enhanced query set to form multiple groups of input pairs, and inputting the input pairs into the pre-trained visual language model (VLM) to obtain the corresponding groups of original numerical prediction results; Step 202, establishing an adaptive geometric trigger mechanism, computing the reference value of the third dimension from any two dimension values in the prediction results, and computing the relative error between the reference value and the corresponding original predicted value; Step 203, determining whether the relative error is smaller than a preset tolerance threshold; if so, determining the geometric reference value to be a reliable signal, and converting it into a structured pseudo-label distribution using a tokenization serialization operation.
  6. The test-time learning method for spatial reasoning of a large visual language model of claim 5, wherein said Step 202 comprises the following steps: Step 20201, computing the geometric reference value of the target attribute using the Pythagorean theorem; when the target attribute is the diagonal distance, d̂_diag = √(d_vert² + d_horiz²), wherein d_vert and d_horiz are the vertical distance and the horizontal distance predicted by the model; Step 20202, computing the relative error between the model's original prediction d_diag and the reference value d̂_diag: ε = |d_diag − d̂_diag| / d̂_diag.
  7. The test-time learning method for spatial reasoning of a large visual language model of claim 5, wherein said Step 203 comprises the following steps: Step 20301, parsing the reference value determined to be a reliable signal into a string sequence consisting of integer digits, a decimal point, decimal digits, and a unit; Step 20302, performing a mapping operation with a preset tokenizer, thereby mapping the continuous numerical space into a sequence of discrete token indices, wherein the integer part, the decimal point, the decimal part, and the measurement unit are each tokenized, yielding the discrete token index sequence corresponding to the reference value.
  8. The test-time learning method for spatial reasoning of a large visual language model of claim 7, wherein said Step 20302 comprises the following steps: Step 2030201, for the fixed-format positions (the decimal-point position and the unit position), constructing a single-point pseudo-label distribution, i.e., q_t(v) = 1 if v equals the reference token index at position t, and q_t(v) = 0 otherwise, wherein t denotes the position index in the token sequence and v a token index in the vocabulary; Step 2030202, for the numeric content positions (the integer-digit and decimal-digit positions), screening a neighborhood token set N_t centered on the reference token from the model's preset vocabulary V, and constructing a uniform pseudo-label distribution: q_t(v) = 1/|N_t| for v ∈ N_t, and q_t(v) = 0 otherwise, wherein |N_t| denotes the total number of tokens in the candidate subset N_t.
  9. The test-time learning method for spatial reasoning of a large visual language model of claim 8, wherein said Step 3 comprises the following steps: extracting the model's predictive distribution p_t at each sequence position t, and computing the geometric consistency loss covering the full sequence: L_geo = Σ_t [ −Σ_v q_t(v) log p_t(v) + 1[t] · λ · P_t(V∖N_t) ]; wherein p_t(v) denotes the model's predicted probability of token v at position t; 1[t] is the indicator function, taking value 1 when position t is a numeric content position and 0 otherwise; λ is the preset negative-likelihood penalty coefficient; V∖N_t is the residual token set remaining after removing the neighborhood token set from the vocabulary, used to penalize the generation of physically illogical values; P_t(V∖N_t) denotes the probability mass the model assigns to that residual set at position t; the positions t range over the integer digits, the decimal point, the decimal digits, and the unit; V is the model's preset vocabulary; and q_t denotes the pseudo-label distribution.
  10. The test-time learning method for spatial reasoning of a large visual language model of claim 9, wherein said Step 4 comprises the following steps: Step 401, freezing the backbone network parameters of the visual language model, and setting only the weights θ of the low-rank adaptation (LoRA) module to the trainable state; Step 402, for each test sample, performing at least one gradient-descent iteration using the geometric consistency loss L_geo, with the update formula: θ ← θ − η · ∇_θ L_geo; wherein η is the preset learning rate and ∇_θ L_geo denotes the gradient of the loss function with respect to the weight parameters θ; by updating the parameters online, the model output achieves self-consistency in the geometric dimensions.
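The geometric trigger and serialization of claims 5-7 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tolerance value, the two-decimal formatting, and the unit string "m" are assumptions for the example.

```python
import math

def geometric_trigger(d_diag, d_vert, d_horiz, tolerance=0.1):
    """Adaptive geometric trigger (claims 5-6): derive a reference value
    for the diagonal from the other two predicted dimensions via the
    Pythagorean theorem, then check the relative error against it."""
    d_ref = math.sqrt(d_vert ** 2 + d_horiz ** 2)
    rel_error = abs(d_diag - d_ref) / d_ref
    return rel_error < tolerance, d_ref, rel_error

def serialize_reference(d_ref, unit="m"):
    """Claim-7-style serialization of a reliable reference value into a
    sequence of integer digits, decimal point, decimal digits, and unit."""
    int_part, _, dec_part = f"{d_ref:.2f}".partition(".")
    return list(int_part) + ["."] + list(dec_part) + [unit]

# 3-4-5 triangle: the diagonal prediction 5.1 deviates only 2% from the
# reference sqrt(3^2 + 4^2) = 5.0, so it is accepted as a reliable signal.
ok, d_ref, err = geometric_trigger(5.1, 3.0, 4.0)
print(ok, serialize_reference(d_ref))
```

When the relative error exceeds the tolerance, no pseudo-label is produced and the sample is left untouched, which is what makes the scheme safe to run without ground-truth labels.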

Description

Test-time learning method for spatial reasoning of large visual language models

Technical Field

The invention relates to the intersection of computer vision and natural language processing, in particular to techniques for enhancing the spatial reasoning capability of large visual language models, and more particularly to a test-time learning method for spatial reasoning of large visual language models.

Background

With the rapid development of deep learning, visual language models (VLMs) pre-trained on large-scale image-text data exhibit strong cross-modal understanding and generation capability, and are widely applied to visual question answering, image description, natural-language instruction following, and other fields. However, despite breakthroughs in qualitative description and semantic recognition, existing visual language models still face serious challenges in quantitative spatial reasoning tasks that require accurate numerical calculation (e.g., measuring the diagonal, vertical, and horizontal distances between objects).

First, existing models often produce logically inconsistent numerical results. In particular, when a model predicts multiple dimensions with geometric coupling relations in the same visual scene, the results tend to violate the geometric constraints of the physical world. This suggests that, while the model has learned an associative mapping between visual pixels and textual values, it does not truly understand the structural logic of physical space. Second, existing solutions rely primarily on full-parameter fine-tuning or instruction tuning.
However, such methods have serious limitations: 1) high data cost: acquiring accurate spatial measurement annotations requires expensive sensors or intensive manual labeling, so high-quality quantitative reasoning datasets remain limited in scale; 2) insufficient generalization: knowledge learned during pre-training is often confined to the training distribution, and when the deployed model faces long-tail scenes or never-seen viewpoints it easily "hallucinates", outputting values that completely deviate from visual reality; 3) static inference: the parameters of a conventional model are frozen at test time, so the model cannot learn in real time or self-adapt online to the specific current input scene, and lacks flexibility in dynamic environments.

Therefore, using prior geometric knowledge of the physical world as an unsupervised constraint, without additional manual labeling, to guide the model to detect reasoning conflicts in real time during the test stage and to drive the model parameters to correct themselves dynamically, thereby improving the accuracy and logical consistency of spatial reasoning, has become a core technical problem to be solved in the field of visual language understanding.

Disclosure of Invention

(1) Technical problem to be solved

The invention discloses a test-time learning method for spatial reasoning of large visual language models, which aims to solve the problems that existing visual language models exhibit poor robustness in quantitative spatial reasoning tasks, produce predictions that violate geometric logical consistency, and are difficult to adapt online to specific scenes.
(2) Technical solution

The invention discloses a test-time learning method for spatial reasoning of large visual language models, comprising the following steps: Step 1, acquiring an input image and an original query, and expanding the original query into a group of auxiliary queries satisfying a geometric coupling relation through a query enhancement strategy; Step 2, inputting the auxiliary queries into a pre-trained model to obtain initial prediction results, performing consistency verification on the initial prediction results using a geometric constraint condition, and converting the verified numerical values into a structured pseudo-label distribution; Step 3, constructing an optimization function containing a geometric consistency loss, with the pseudo-labels as the optimization target; and Step 4, dynamically updating specific parameters of the visual language model in the inference stage by minimizing the geometric consistency loss. Further, the specific steps of Step 1 are as follows: Step 101, identifying the target attribute in the original query, the target attribute being one of diagonal distance, vertical distance, or horizontal distance; Step 102, generating, based on template transformation, auxiliary queries having a geometric coupling relation with the target attribute, and constructing an enhanced query set, wherein
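Steps 2-4 above can be sketched end-to-end on a toy softmax head. This is an illustrative sketch, not the patented implementation: the vocabulary, the neighborhood sets, the penalty form (probability mass on tokens outside the neighborhood), and the finite-difference gradient on raw logits all stand in for the real VLM, tokenizer, and LoRA-weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 12                         # toy sequence length and vocabulary size
logits = rng.normal(size=(T, V))     # stand-in for the model's output head

# Pseudo-labels (step 2): one-hot at fixed-format positions (decimal point,
# unit), uniform over a small neighborhood set N at the numeric positions.
# The positions and token ids below are illustrative.
pseudo = np.zeros((T, V))
pseudo[0, [3, 4, 5]] = 1 / 3         # integer digit: uniform over N
pseudo[1, 7] = 1.0                   # decimal point: single-point label
pseudo[2, [4, 5, 6]] = 1 / 3         # decimal digit: uniform over N
pseudo[3, 9] = 1.0                   # unit token
pseudo[4, 1] = 1.0                   # end-of-answer token
numeric_positions = [0, 2]

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def geo_loss(z, lam=0.1):
    """Step-3 loss: cross-entropy to the pseudo-labels, plus lam times the
    probability mass on tokens outside N at numeric positions (this penalty
    form is an assumption; the patent's exact formula is not reproduced)."""
    p = softmax(z)
    ce = -(pseudo * np.log(p + 1e-12)).sum()
    pen = sum(p[t][pseudo[t] == 0].sum() for t in numeric_positions)
    return ce + lam * pen

# Step 4: one online gradient step, theta <- theta - eta * grad(L), taken
# here on the toy logits with a finite-difference gradient instead of LoRA.
eta, eps = 0.2, 1e-5
grad = np.zeros_like(logits)
for i in range(T):
    for j in range(V):
        d = np.zeros_like(logits)
        d[i, j] = eps
        grad[i, j] = (geo_loss(logits + d) - geo_loss(logits - d)) / (2 * eps)
before, after = geo_loss(logits), geo_loss(logits - eta * grad)
print(after < before)    # the step reduces the geometric consistency loss
```

In the actual method only the LoRA weights receive this update while the backbone stays frozen, which keeps per-sample adaptation cheap and avoids catastrophic drift of the pre-trained model.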