CN-122024047-A - Geographic positioning analysis method and system based on multi-mode multi-agent

CN122024047A

Abstract

The invention discloses a geographic positioning analysis method and system based on multi-modal multi-agent technology, relating to the technical field of geographic information. The method comprises the following steps: S1, obtaining a picture to be detected and a user prompt text; S2, performing OCR recognition on the picture to be detected to obtain global OCR information; S3, inputting the picture to be detected and the user prompt text into a visual analysis agent, which outputs image features and a plurality of candidate geographic positions; S4, inputting the picture to be detected into a cropping agent, which outputs key region coordinates and operation instructions; S5, cropping the picture to be detected according to the key region coordinates and operation instructions and performing OCR recognition to obtain local OCR information; and S6, inputting the global OCR information, the image features, and the local OCR information into a reasoning agent, which outputs the most accurate geographic position. By combining visual features, OCR text information, and language description, and by performing local amplification and OCR detail extraction on key regions to supplement local details, the invention still achieves high positioning precision in scenes where information is sparse or blurred.
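The S1-S6 flow described above can be sketched as a plain pipeline with stub agents. This is a minimal illustration only: every function name and return value below is a hypothetical placeholder, not the patent's implementation.

```python
import base64

def run_ocr(picture: bytes) -> str:
    """Stub for the OCR step (S2/S5); a real system would call an OCR model."""
    return "global text"

def visual_agent(picture_b64: str, prompt: str) -> dict:
    """Stub for S3: image features and candidate positions (values are made up)."""
    return {"features": ["architectural style"],
            "candidates": ["Xiamen", "Quanzhou"]}

def cropping_agent(picture_b64: str) -> list:
    """Stub for S4: key region coordinates plus an operation instruction."""
    return [{"bbox": [0, 0, 10, 10], "purpose": "read signboard"}]

def reasoning_agent(global_ocr, features, local_ocr, candidates) -> str:
    """Stub for S6: pick one candidate; a real agent would reason over all inputs."""
    return candidates[0]

def locate(picture: bytes, prompt: str) -> str:
    picture_b64 = base64.b64encode(picture).decode()   # cf. claim 6: Base64 encoding
    global_ocr = run_ocr(picture)                      # S2: global OCR
    vision = visual_agent(picture_b64, prompt)         # S3: features + candidates
    regions = cropping_agent(picture_b64)              # S4: key regions
    local_ocr = [run_ocr(picture) for _ in regions]    # S5 (cropping omitted in stub)
    return reasoning_agent(global_ocr, vision["features"],
                           local_ocr, vision["candidates"])  # S6: final position

result = locate(b"\x89PNG...", "Where was this photo taken?")
```

The point of the sketch is the data flow: the reasoning agent sees all three modalities (global OCR, image features, local OCR) rather than any single one.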

Inventors

  • JI RONGRONG
  • FANG CHENXIN
  • ZHOU YIYI

Assignees

  • Xiamen University (厦门大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (8)

  1. A geographic positioning analysis method based on multi-modal multi-agent technology, characterized by comprising the following steps: S1, obtaining a picture to be detected and a user prompt text; S2, performing OCR (optical character recognition) on the picture to be detected to obtain global OCR information; S3, inputting the picture to be detected and the user prompt text into a visual analysis agent based on a large vision model, and outputting, under the constraint of a first prompt word, image features reflecting geographic scene information in the picture to be detected and a plurality of candidate geographic positions; S4, inputting the picture to be detected into a cropping agent based on a large vision model, and outputting key region coordinates and operation instructions under the constraint of a second prompt word, wherein the operation instructions comprise the cropping purpose corresponding to the key region coordinates and/or the information expected to be acquired; S5, the cropping agent calling a cropping tool to crop the picture to be detected according to the key region coordinates and the operation instructions to obtain a local picture, and performing OCR recognition on the local picture to obtain local OCR information; S6, inputting the global OCR information, the image features, and the local OCR information into a reasoning agent based on a large reasoning model, determining an accurate geographic position from the plurality of candidate geographic positions, and outputting the accurate geographic position.
  2. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 1, wherein the image features comprise features of five dimensions: architectural style, natural environment, topography, weather at shooting time, and shooting type.
  3. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 1, wherein outputting the key region coordinates and operation instructions under the constraint of the second prompt word comprises strictly controlling the format of the key region coordinates and operation instructions output by the model by configuring a json schema.
  4. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 1, wherein performing OCR recognition on the picture to be detected to obtain global OCR information specifically comprises: inputting the picture to be detected into an OCR model to obtain a json file; parsing the text fields and coordinate information fields in the json file to obtain text information and corresponding coordinate information; and formatting and splicing the text information and the coordinate information to obtain the global OCR information.
  5. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 4, wherein the OCR model is a trained got_ocr model.
  6. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 1, wherein the picture to be detected is encoded using Base64.
  7. The geographic positioning analysis method based on multi-modal multi-agent technology according to claim 1, wherein the reasoning agent invokes an internet search engine during the reasoning process, concretely: the reasoning agent invokes the internet search engine to retrieve and verify the accuracy of the global OCR information, the image features, and the local OCR information, which assists the reasoning agent in determining the accurate geographic position among the plurality of candidate geographic positions and outputting it.
  8. A geographic positioning analysis system based on multi-modal multi-agent technology, characterized by comprising: a picture and text acquisition module, used for obtaining a picture to be detected and a user prompt text; a global OCR information acquisition module, used for performing OCR recognition on the picture to be detected to obtain global OCR information; a visual analysis module, used for inputting the picture to be detected and the user prompt text into a visual analysis agent based on a large vision model, and outputting, under the constraint of a first prompt word, image features reflecting geographic scene information in the picture to be detected and a plurality of candidate geographic positions; a key region acquisition module, used for inputting the picture to be detected into a cropping agent based on a large vision model, and outputting key region coordinates and operation instructions under the constraint of a second prompt word, wherein the operation instructions comprise the cropping purpose corresponding to the key region coordinates and/or the information expected to be acquired; a local OCR information acquisition module, used for the cropping agent to call a cropping tool to crop the picture to be detected according to the key region coordinates and the operation instructions to obtain a local picture, and for performing OCR recognition on the local picture to obtain local OCR information; and a geographic position reasoning module, used for inputting the global OCR information, the image features, and the local OCR information into a reasoning agent based on a large reasoning model, determining an accurate geographic position from the plurality of candidate geographic positions, and outputting the accurate geographic position.
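Claim 3 constrains the cropping agent's output format with a json schema. A minimal stdlib-only sketch of enforcing such a structure follows; the field names `bbox` and `purpose` are assumptions for illustration, not the patent's actual schema.

```python
import json

# Hypothetical required fields for each key region returned by the cropping
# agent: pixel coordinates plus the cropping purpose (cf. claims 1 and 3).
REQUIRED_FIELDS = {"bbox": list, "purpose": str}

def validate_regions(raw: str) -> list:
    """Parse the model reply and enforce the expected structure,
    rejecting malformed output instead of passing it downstream."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of key regions")
    for region in data:
        for field, ftype in REQUIRED_FIELDS.items():
            if not isinstance(region.get(field), ftype):
                raise ValueError(f"field {field!r} missing or wrong type")
        if len(region["bbox"]) != 4:
            raise ValueError("bbox must be [x1, y1, x2, y2]")
    return data

reply = '[{"bbox": [120, 40, 380, 200], "purpose": "read shop signboard"}]'
regions = validate_regions(reply)
```

In practice such a schema is typically passed to the model's structured-output interface as well, so malformed replies are prevented at generation time rather than merely rejected afterwards.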

Description

Geographic positioning analysis method and system based on multi-modal multi-agent

Technical Field

The invention relates to the technical field of geographic information, and in particular to a geographic positioning analysis method and system based on multi-modal multi-agent technology.

Background

Image geolocation is an important class of computer vision and artificial intelligence tasks, aimed at inferring the geographic coordinates or structured location information of a shooting location by analyzing image content without explicit geographic labeling. The difficulty of image geolocation lies in high scene diversity, sparse information, and strong geographic similarity. For example, cities in different countries may share similar architectural styles, and some pictures may lack obvious landmarks or text information, making direct localization difficult. In image geolocation research, common methods include visual feature matching, deep learning classification, and text information recognition. Visual feature matching extracts global or local features of an image and compares them against an image database of known positions to infer the shooting place; it performs well in scenes with obvious landmarks, but its accuracy drops in feature-sparse environments such as sky or desert, and it depends on database coverage. Deep learning classification methods divide the earth into grid areas using a convolutional neural network or a vision transformer and predict the grid to which an image belongs; with large-scale data they can achieve city-level precision, but street-level accuracy is difficult to reach and local details are under-utilized.
Text information recognition methods extract characters such as signboards and shop signs through OCR and locate them against a place-name library; the effect is remarkable when the characters are clear, but performance degrades greatly under blurring, occlusion, or rare languages. However, existing image geolocation methods still have obvious shortcomings in practical application. First, most methods lack multi-stage fine-grained analysis: the whole picture is often processed only once, so potential key information, such as signboards and shop signs that cannot be recognized in the original picture but become legible once local details are amplified, is not obtained. Second, information utilization is still insufficient: social media or news pictures often contain several clues at the same time, such as natural landscapes and artificial buildings, but existing models often attend to only one kind of information and ignore the others, reducing accuracy. In addition, the prior art generally lacks a multi-agent cooperation mechanism, usually relying on a single model to complete all reasoning tasks; the division of labor among different types of analysis modules is difficult to realize, which limits reasoning depth and accuracy in complex scenes.

Disclosure of Invention

In view of the above problems, the invention provides a geographic positioning analysis method and system based on multi-modal multi-agent technology, which still achieves high positioning precision in information-sparse or blurred scenes by combining visual features, OCR text information, and language description.
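The local amplification discussed above amounts to cutting a key region out of the image by pixel coordinates and re-running OCR on the enlarged patch. A minimal sketch on a toy row-major grid follows; the `[x1, y1, x2, y2]` box convention is an assumption for illustration.

```python
def crop(image, bbox):
    """Cut the key region [x1, y1, x2, y2] out of a row-major pixel grid,
    so OCR can be re-run on the enlarged local picture (cf. step S5)."""
    x1, y1, x2, y2 = bbox
    return [row[x1:x2] for row in image[y1:y2]]

# Toy 4x6 "image": each pixel is labelled with its (row, col) position.
image = [[(r, c) for c in range(6)] for r in range(4)]
patch = crop(image, [2, 1, 5, 3])   # columns 2..4 of rows 1..2
```

With a real image library the same operation is a single crop call over the agent-supplied bounding box; the toy grid only makes the coordinate convention explicit.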
In one aspect, the geographic positioning analysis method based on multi-modal multi-agent technology comprises the following specific steps: S1, obtaining a picture to be detected and a user prompt text; S2, performing OCR (optical character recognition) on the picture to be detected to obtain global OCR information; S3, inputting the picture to be detected and the user prompt text into a visual analysis agent based on a large vision model, and outputting, under the constraint of a first prompt word, image features reflecting geographic scene information in the picture to be detected and a plurality of candidate geographic positions; S4, inputting the picture to be detected into a cropping agent based on a large vision model, and outputting key region coordinates and operation instructions under the constraint of a second prompt word, wherein the operation instructions comprise the cropping purpose corresponding to the key region coordinates and/or the information expected to be acquired; S5, the cropping agent calling a cropping tool to crop the picture to be detected according to the key region coordinates and the operation instructions to obtain a local picture, and performing OCR recognition on the local picture to obtain local OCR information; S6, inputting the global OCR information, the image features, and the local OCR information into a reasoning agent based on a large reasoning model, determining an accurate geographic position from the plurality of candidate geographic positions, and outputting the accurate geographic position. Preferably, the image features inc