CN-122021904-A - Large language model open vocabulary prediction method based on context concept prompt
Abstract
The invention discloses a large language model open-vocabulary prediction method based on contextual concept prompts. First, a "word guessing" game mechanism based on a large language model is proposed: before the model performs reasoning, it is given the necessary basic constraints through a game background description, which comprises the role setting of the large language model as a word-guessing reasoner, the task setting, and prior information on the target's spatial position and size. Second, a contextual concept prompt method is proposed: a customized prompt containing elements such as the target subject and its context information is constructed, and the contextual logic clues are filled in automatically in "language only" form to generate a specific contextual concept prompt. Finally, the customized prompt is input into the large language model in the form of a word guessing game, so that the model's reasoning capability is fully exploited and open-vocabulary category prediction is achieved.
Inventors
- ZHUANG YIN
- ZHU GUIYING
- YANG BOWEN
- WANG GUANQUN
- CHE ZHIHAO
- CHEN HE
Assignees
- Beijing Institute of Technology (北京理工大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-30
Claims (4)
- 1. A large language model open-vocabulary prediction method based on contextual concept prompts, characterized by comprising the following steps: Step 1, designing a "word guessing" game mechanism based on a large language model, namely formulating a game background description, defining the role setting of the large language model as a word-guessing reasoner, giving the specific task setting, and providing prior information on the spatial position and size of the target, so that the large language model has the necessary basic constraints before reasoning; Step 2, constructing a contextual concept prompt, namely designing a structured contextual concept prompt template on the basis of step 1 to integrate scattered visual semantic clues, using prompt engineering to form a complete prompt framework containing the target subject and its related context information, and automatically filling in the various contextual logic clues in "language only" form to generate a specific contextual concept prompt, wherein the generated prompt gives the large language model a sufficient contextual basis to support its inference of unknown target classes; Step 3, large language model reasoning, namely driving the large language model to reason in the form of a word guessing game: inputting the contextual concept prompt generated in step 2 into the large language model so that it infers the category from the contextual clues in the prompt, then parsing the text output of the large language model and extracting the category noun it gives as the prediction result in the open scene.
- 2. The large language model open-vocabulary prediction method based on contextual concept prompts according to claim 1, wherein the "word guessing" game mechanism in step 1 is designed as follows: Step 1.1, constructing a background module: first, setting the role, making the large language model an expert in a specific field, with the specific prompt designed as "{as a remote sensing image recognition expert}"; second, defining the task, stating the reasoning task of the large language model, with the specific prompt designed as "{guess the most probable category of the unknown target based on the descriptive words provided below}"; third, defining the spatial attributes, inputting the normalized bounding-box coordinates of the target and calculating the target's aspect ratio $AR$ and its area ratio $R_{area}$ in the original image; and finally, defining the size of the target, classifying it as a small, medium, or large target to provide scale prior information for the reasoning of the large language model, wherein the aspect ratio and area ratio are computed as: $AR = w / h$; $R_{area} = S_{box} / S_{img}$; where $w$ and $h$ denote the width and height of the target bounding box, respectively, and $S_{box}$ and $S_{img}$ denote the areas of the target bounding box and the image, respectively.
- 3. The large language model open-vocabulary prediction method based on contextual concept prompts according to claim 1, wherein the contextual concept prompt in step 2 is constructed as follows: Step 2.1, constructing a detailed information module, in which visual semantic clues describing the target subject are embedded in the prompt, the prompt being designed as "{possible visual semantic information of the object: ...}", and visual semantic clues describing the local and surrounding context of the target are likewise embedded in the prompt, the prompt being designed as "{local visual semantic information focused inside the bounding box: ..., environmental visual semantic information focused around the bounding box: ...}"; Step 2.2, automatically filling the visual semantic clues describing the target subject and its local and surrounding context into the customized detailed information module in "language only" form, thereby generating a specific contextual concept prompt.
- 4. The large language model open-vocabulary prediction method based on contextual concept prompts according to claim 1, wherein the large language model reasoning in step 3 is performed as follows: Step 3.1, inputting the contextual concept prompt generated in step 2 into the large language model, the overall prompt template being composed of background information and detailed information, wherein the background information comprises (role setting: "{as a remote sensing image recognition expert}", task definition: "{guess the most probable category of the unknown target based on the descriptive words provided below}", spatial position: "{normalized bounding-box coordinates of the target, its aspect ratio, and its area ratio in the original image}"), and the detailed information comprises (describing the target subject: "{possible visual semantic information of the object: ...}", describing the local and surrounding context of the target: "{local visual semantic information focused inside the bounding box: ..., environmental visual semantic information focused around the bounding box: ...}"); Step 3.2, selecting Llama as the large language model for "word guessing" game reasoning and taking the first K visual semantic clues as its input, the reasoning process being divided into two stages: internal reasoning clues are generated first, and the unknown target class is then predicted; Step 3.3, Llama performs logical analysis and visual clue association on the complete prompt generated in step 3.1 and generates a series of explanatory reasoning steps; this process simulates a human chain of thought, deducing from the target's shape attributes, its environment, and its spatial relations, and the probability distribution of the reasoning process is: [formula not reproduced]; Step 3.4, based on the reasoning clues generated in step 3.3, Llama further predicts the final target class, a process that can be expressed as: [formula not reproduced]; through this process, the large language model can more accurately predict unknown target categories in remote sensing images on the basis of rich context information and its reasoning capability.
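The background module of claim 2 and the prompt template of claims 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `TargetBox` layout, the size thresholds `small` and `large`, the function names, and the final "Category: <noun>" answer format are all hypothetical, and the aspect ratio is taken as width over height.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TargetBox:
    """Normalized bounding box in [0, 1]; (x, y) as top-left corner is an assumed layout."""
    x: float
    y: float
    w: float
    h: float


def spatial_priors(box: TargetBox, img_w: int, img_h: int,
                   small: float = 0.01, large: float = 0.10) -> dict:
    """Return aspect ratio, image-area ratio, and a coarse size label.

    The small/large thresholds are illustrative assumptions, not patent values.
    """
    # Convert back to pixels so the aspect ratio is correct for non-square images.
    aspect_ratio = (box.w * img_w) / (box.h * img_h)
    # With normalized width and height, box area / image area reduces to w * h.
    area_ratio = box.w * box.h
    if area_ratio < small:
        size = "small"
    elif area_ratio < large:
        size = "medium"
    else:
        size = "large"
    return {"aspect_ratio": aspect_ratio, "area_ratio": area_ratio, "size": size}


def build_prompt(box: TargetBox, img_w: int, img_h: int,
                 subject_cues: Sequence[str], local_cues: Sequence[str],
                 surrounding_cues: Sequence[str], top_k: int = 5) -> str:
    """Fill the background and detailed-information modules with text only."""
    p = spatial_priors(box, img_w, img_h)
    lines = [
        # Background module: role setting and task definition.
        "As a remote sensing image recognition expert, guess the most probable "
        "category of the unknown target based on the descriptive words provided below.",
        # Background module: spatial position and scale priors.
        f"Spatial position: normalized box (x={box.x:.2f}, y={box.y:.2f}, "
        f"w={box.w:.2f}, h={box.h:.2f}); aspect ratio {p['aspect_ratio']:.2f}; "
        f"image area ratio {p['area_ratio']:.4f}; {p['size']} target.",
        # Detailed-information module: keep only the first K cues of each kind.
        "Possible visual semantic information of the object: "
        + ", ".join(subject_cues[:top_k]),
        "Local visual semantic information focused inside the bounding box: "
        + ", ".join(local_cues[:top_k]),
        "Environmental visual semantic information focused around the bounding box: "
        + ", ".join(surrounding_cues[:top_k]),
        # Assumed answer format so that the output can be parsed reliably.
        "Reason step by step, then finish with 'Category: <noun>'.",
    ]
    return "\n".join(lines)


def parse_category(llm_text: str) -> Optional[str]:
    """Extract the last 'Category: <noun>' answer from the model's text output."""
    matches = re.findall(r"category\s*:\s*([A-Za-z][A-Za-z \-]*)",
                         llm_text, flags=re.IGNORECASE)
    return matches[-1].strip().lower() if matches else None
```

In use, the string returned by `build_prompt` would be sent to whichever Llama interface is available, and the reply passed to `parse_category`. Asking the model for a fixed answer line is one simple way to make the "extract the category noun" step deterministic; the patent text does not say how the parsing is done.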
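Steps 3.2 to 3.4 of claim 4 describe a two-stage process whose formulas do not survive in this text. One plausible way to write such a factorization is given below. Every symbol is an assumption introduced here for illustration: $\mathcal{M}$ for Llama, $T$ for the complete prompt of step 3.1, $C_K$ for the first K visual semantic clues, $R$ for the generated reasoning steps, and $\hat{y}$ for the predicted class. The original formulas may differ in form and notation.

```latex
\begin{align}
  R &\sim p_{\mathcal{M}}\left(R \mid T, C_K\right)
    && \text{(stage 1: generate reasoning clues)} \\
  \hat{y} &= \arg\max_{y}\; p_{\mathcal{M}}\left(y \mid T, C_K, R\right)
    && \text{(stage 2: predict the target class)}
\end{align}
```

Read this way, the class prediction is conditioned on the reasoning text as well as on the prompt, which is what distinguishes the two-stage scheme from asking for the category directly.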
Description
Large language model open vocabulary prediction method based on context concept prompt
Technical Field
The invention belongs to the intersection of remote sensing and computer vision, and particularly relates to a large language model open-vocabulary prediction method based on contextual concept prompts.
Background
Open-vocabulary object detection is an important direction in remote sensing image interpretation. It builds on vision-language models pre-trained on large-scale image-text data to construct a general joint visual-semantic feature space. Through a cross-modal alignment mechanism, the model can transfer the rich semantic knowledge in natural language to visual detection tasks, improving its perception of complex targets. Using multimodal large models significantly enhances the cognitive ability of intelligent interpretation systems in remote sensing scenarios such as complex ground-object analysis, disaster emergency response, and dynamic target perception, and lays a solid foundation for efficient and flexible target perception in open remote sensing scenes.
Although existing open-vocabulary detection methods based on vision-language alignment have made some progress in general scenes, their generalization capability and recognition accuracy are still challenged by the complex spatial structure and multi-scale targets of remote sensing images. Current mainstream methods mostly rely on simple category names or fixed template texts as prompts and decide the target category by shallow feature matching. They ignore the rich contextual semantic information in remote sensing scenes and do not fully exploit the contextual semantic association between a target and its surroundings, which makes it difficult to use surrounding context to discriminate between targets that look similar but differ semantically. In addition, while current vision-language models have strong feature alignment capability, they generally lack logical reasoning and visual clue association capabilities. The absence of a semantic reasoning mechanism severely limits prediction accuracy for unknown, occluded, or blurred targets, so the accurate perception required in truly open remote sensing scenes is difficult to achieve.
Disclosure of Invention
In view of the above, the invention aims to provide a large language model open-vocabulary prediction method based on contextual concept prompts. It addresses the insufficient accuracy of unknown-target class prediction and the limited generalization and transfer capability that arise in the prior art when, in open remote sensing scenes, contextual semantic associations are ignored and deep logical reasoning is lacking.
The technical scheme of the invention is as follows:
A large language model open-vocabulary prediction method based on contextual concept prompts comprises the following steps:
Step 1, designing a "word guessing" game mechanism based on a large language model: a game background description is formulated, the role setting of the large language model as a word-guessing reasoner is defined, the specific task setting is given, and prior information on the spatial position and size of the target is provided, so that the large language model has the necessary basic constraints before reasoning.
Step 2, constructing a contextual concept prompt: a structured contextual concept prompt template is designed on the basis of step 1 to integrate the scattered visual semantic clues.
Prompt engineering is used to form a complete prompt framework containing the target subject and its related context information, and the various contextual logic clues are filled in automatically in "language only" form, generating a specific contextual concept prompt and achieving structured integration of the visual semantic information. The generated prompt gives the large language model a sufficient contextual basis to support its inference of unknown target classes.
Step 3, large language model reasoning: the large language model is driven to reason in the form of a word guessing game. The contextual concept prompt generated in step 2 is input into the large language model, which infers the category from the contextual clues in the prompt. The text output of the large language model is then parsed, and the category noun it gives is extracted as the prediction result in the open scene.
Further, the specific process of step 1 is as follows: first, the role is set, making the large language model an expert in a specific field, with the specific prompt designed as "{ as a remote sensing image recognition