CN-121997908-A - Multi-mode large model fine-tuning corpus production method for planning and natural resource field
Abstract
The invention provides a method for producing fine-tuning corpora for multimodal large models in the planning and natural resources field. It addresses three shortcomings of existing corpora: a lack of professional semantics, inefficient manual annotation, and the absence of a standardized workflow. The method comprises four core modules: construction of an industry-specific cognitive task system and VQA template library, image-text pair generation, template-driven VQA sample generation, and automatic quality control and optimization. It generates a domain-specific VQA corpus by combining industry planning standards with a four-level cognitive hierarchy (perception, reasoning, association, application). In the workflow, visual elements of planning maps and policy texts are first extracted automatically, professional templates are then matched to generate candidate question-answer pairs, and finally a standardized corpus is output after semantic checking and expert review. The invention achieves accurate alignment of professional semantics, greatly reduces labor cost, ensures reliable corpus quality, and can be extended to similar fields, providing efficient support for fine-tuning and capability evaluation of multimodal large models in this field.
Inventors
- ZHANG WENJIA
Assignees
- Tongji University (同济大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20251226
Claims (10)
- 1. A method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field, characterized by comprising the following steps: S1, constructing an industry-specific cognitive task system and VQA template library: defining, according to industry standards, the professional examination system, and planning text and image tasks in the planning and natural resources field, a four-level progressive cognitive task system covering perception, reasoning, association and application, and constructing a VQA question template library that corresponds to each cognitive level and comprises planning map classification, planning map element identification, spatial relationship, professional reasoning and policy association templates; S2, image-text pair generation: acquiring multi-format raw data in the planning and natural resources field, screening and parsing the raw data and extracting elements from it, and establishing a mapping between image data and text data comprising policy texts, planning indexes and scenario descriptions, to obtain standardized image-text pair data; S3, template-driven VQA sample generation: calling a multimodal large model to generate candidate question-answer pairs based on the VQA question template library and the image-text pair data, screening the candidate question-answer pairs according to the professional semantics of the planning industry, correcting the accuracy of question-answer wording against the industry technical requirements in the template library, and outputting initial VQA training samples each containing an image, a question and an answer; S4, automatic quality control and optimization: performing automatic semantic consistency detection and expert review of logical soundness on the initial VQA training samples, and then classifying and labeling them by type, difficulty and the four-level cognitive task system, to obtain a high-quality standardized corpus; S5, extensible corpus output: outputting the standardized corpus in a preset format, wherein the standardized corpus comprises image IDs, question texts, answer texts, task types and annotation source information, and is used for fine-tuning multimodal large models in the planning and natural resources field.
- 2. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein the method is adapted to multimodal large model fine-tuning in the architectural design and ecological restoration fields by replacing the industry standards and technical terms in the VQA question template library.
- 3. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein in step S1 the four cognitive levels specifically comprise: the perception level, comprising element identification and image description, wherein element identification is used for evaluating a model's ability to identify the layout structure, text annotations, basic geographic features and cartographic elements in a planning map, and for establishing semantic alignment between image content and natural language; the reasoning level, comprising planning map classification, spatial relationship reasoning and professional reasoning, wherein planning map classification follows the five-level, three-category planning system and divides planning maps into overall planning, detailed planning and special planning; the association level, used for evaluating the model's ability to associate a planning map with the policies, regulations and planning indexes behind it, and for establishing cross-domain associations between spatial elements and the corresponding policy frameworks; and the application level, comprising scheme evaluation and decision making, wherein scheme evaluation assesses the merits of planning schemes based on visual input and spatial context, and decision making makes choices based on values and principles in constrained design scenarios.
- 4. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein in step S1 the VQA question template library specifically comprises: planning map classification templates, comprising classification question templates for provincial basic analysis maps, provincial planning result maps, municipal survey maps, municipal management maps and municipal schematic maps; spatial relationship templates, comprising a topological spatial relationship judgment template, an order spatial relationship query template and a metric spatial relationship comparison template, corresponding respectively to standardized question wordings for judging adjacency, containment and intersection relationships, querying directional distribution, and comparing distances among geographic entities; professional reasoning templates, comprising a spatial layout analysis template, a functional organization evaluation template, a transport system adaptation template and an ecological environment adaptation template, corresponding respectively to question templates for analyzing urban layout morphology and its causes, evaluating the soundness of functional zone planning, analyzing how transport facility layout supports economic and social development, and analyzing how ecological protection planning supports sustainable development; and policy association templates, comprising a planning index association template and a regulatory requirement adaptation template, corresponding respectively to association queries on each planning index in the planning map and to question templates for judging conformity with land use standards and ecological protection regulations.
- 5. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein step S2 specifically comprises: extracting and screening valid planning maps with a custom script, extracting text information from the raw data using OCR (optical character recognition), and extracting professional visual elements from the planning maps by semantic segmentation or layer parsing, so as to establish a mapping between image data and the corresponding policy texts, planning indexes and scenario descriptions; the planning map elements comprise administrative boundaries, government seats, annotations, scales, legends, contour lines and water systems; the visual elements comprise roads, green spaces, land parcels and transport facilities; the multi-format raw data comprises planning PDF files, JPG image files and DOC text files; and the raw data originates from official planning documents issued by municipal governments, planning institutions and academic institutions.
- 6. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein in step S3 the screening specifically eliminates candidate question-answer pairs that do not conform to the professional terminology of the planning industry or that deviate from planning technical requirements, and the correction comprises supplementing wording specific to the planning industry and adjusting the question-answer logic to be consistent with professional planning logic.
- 7. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 1, wherein in step S4 the semantic consistency detection is implemented by a pre-trained planning-industry semantic matching model, and the expert review of logical soundness checks whether question-answer pairs comply with the technical specifications of the planning industry, accurately reflect the core information of the planning map, and conform to policy text requirements.
- 8. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 3, wherein: the spatial layout in the professional reasoning comprises centralized layouts and decentralized layouts, wherein the centralized layouts comprise a grid-shaped radial layout and a ring-shaped radial layout, and the decentralized layouts comprise group-shaped, strip-shaped, star-shaped, ring-shaped, satellite-shaped, multi-center and group-city layouts; the functional organization in the professional reasoning comprises industrial land, living areas, storage areas and public facility land, wherein the layout of industrial land must meet requirements for pollution isolation from, and optimized transport connections with, the living areas, and the layout of storage areas must meet requirements for isolation of dangerous goods and convenient logistics; the transport system in the professional reasoning comprises siting and layout evaluation of railways, highways, ports and airports, conforming to the planning industry's requirements on the fit between transport facilities and urban space; and the ecological environment reasoning in the professional reasoning comprises evaluation of the protection and use of the natural environment and ecosystems, conforming to the requirements of ecological protection red lines and natural protected area planning.
- 9. The method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field according to claim 3, wherein the planning indexes in the association level comprise a farmland retention index, a permanent basic farmland protection area index, an ecological protection red line area index and an urban development boundary expansion multiple index, and the policy frameworks comprise territorial spatial planning outline documents, industry technical specifications and related regulatory documents.
- 10. A system for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field, characterized by comprising: a data input module for receiving multi-format raw data in the planning field, defining the four-level progressive cognitive tasks, and constructing the industry-specific VQA question template library; an image-text pairing module for screening and optimizing the raw data, extracting text and professional visual elements, and establishing precise cross-modal mappings; a sample generation module for calling the multimodal large model to generate, screen and correct candidate question-answer pairs and to output initial VQA samples; a quality control module that optimizes the samples through automatic detection and professional review and completes classification labeling; and an extensible output module that outputs the standardized corpus in a structured format and supports cross-domain adaptation by template substitution.
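As an illustration only, the template-driven question generation of step S3 and the record fields of step S5 (image ID, question text, answer text, task type, annotation source) could be sketched as follows. The template wordings, the field names, the `cognitive_level` field, and the JSON Lines output are assumptions made for this sketch; the claims require only a "preset format" and do not specify any of them.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical template library keyed by cognitive level (wordings are illustrative).
TEMPLATES = {
    "perception": "Which map elements appear in this planning map?",
    "reasoning": "Is {entity_a} adjacent to {entity_b} in this planning map?",
    "association": "Which planning index applies to {entity_a} in this planning map?",
    "application": "Given the layout shown, is the siting of {entity_a} reasonable?",
}


@dataclass
class VQARecord:
    """One corpus entry; field names are assumed, modeled on the fields listed in step S5."""
    image_id: str
    question: str
    answer: str
    task_type: str
    cognitive_level: str
    annotation_source: str


def build_record(image_id, level, task_type, answer, source, **slots):
    """Fill the template for the given cognitive level and wrap it in a record."""
    question = TEMPLATES[level].format(**slots)
    return VQARecord(image_id, question, answer, task_type, level, source)


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line (an assumed format)."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)
```

Under this sketch, cross-domain adaptation as in claim 2 would amount to swapping the contents of `TEMPLATES` while leaving the record structure unchanged.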
Description
Multi-mode large model fine-tuning corpus production method for planning and natural resource field
Technical Field
The invention relates to the technical field of multimodal corpus construction, and in particular to a method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field.
Background
Current multimodal corpus construction focuses mainly on general vision-language tasks, with datasets such as COCO Captions, Visual Genome and ChartQA. These datasets share several characteristics: their images are mainly natural scenes, everyday objects and general charts; their text annotations mainly identify objects and give general descriptions, lacking planning semantics and spatial logic; and they are produced mainly by manual or crowdsourced annotation, which can hardly cover professional symbol systems. In the planning and natural resources field, some auxiliary annotation and mapping systems have also appeared, such as map-patch recognition tools and the automatic mapping and index extraction tools of geographic information systems (GIS). However, these methods are oriented mainly to vector data and spatial computation. They establish no image-text VQA mapping, lack multi-level understanding of policy semantics, spatial structure logic and planning intent, and cannot support end-to-end large model fine-tuning or question-answer generation tasks.
In summary, current multimodal large language models (MLLMs) perform well in general scenarios, but the following problems remain in the professional fields of urban planning and natural resource management: 1) Insufficient professional semantic understanding: general models cannot accurately identify the symbol systems, spatial elements and policy intent in planning maps; 2) No systematic method for constructing industry corpora: planning documents and image data are multi-source and heterogeneous, and standardized image-text pairing rules and a knowledge system for mapping them are lacking; 3) Low efficiency and high cost of manual annotation: existing data construction depends on manual annotation by experts and lacks automatic or semi-automatic assistance; 4) No data generation workflow for VQA (visual question answering) tasks: in particular, there is no reusable corpus production specification for complex visual-text content such as planning documents, spatial layout diagrams and design schemes. A technical system is therefore needed that can automatically convert domain knowledge such as planning images, policy texts and examination questions into VQA training samples, enabling standardized, multi-stage and reusable corpus production.
Disclosure of Invention
The invention aims to provide a multimodal corpus production and structured generation workflow for the planning and natural resources industry, which supports fine-tuning and evaluation of large models in the field and lays a foundation for subsequent automatic understanding, design assistance and policy generation in urban planning.
To achieve the above purpose, the invention provides a method for producing a fine-tuning corpus for multimodal large models in the planning and natural resources field, comprising the following steps: S1, constructing an industry-specific cognitive task system and VQA template library: defining, according to industry standards, the professional examination system, and planning text and image tasks in the planning and natural resources field, a four-level progressive cognitive task system covering perception, reasoning, association and application, and constructing a VQA question template library that corresponds to each cognitive level and comprises planning map classification, planning map element identification, spatial relationship, professional reasoning and policy association templates; S2, image-text pair generation: acquiring multi-format raw data in the planning and natural resources field, screening and parsing the raw data and extracting elements from it, and establishing a mapping between image data and text data comprising policy texts, planning indexes and scenario descriptions, to obtain standardized image-text pair data; S3, template-driven VQA sample generation: calling a multimodal large model to generate candidate question-answer pairs based on the VQA question template library and the image-text pair data, screening the candidate question-answer pairs according to the professional semantics of the planning industry, correcting the accuracy of question-answer wording against the industry technical requirements in the template library, and outputting initial VQA training samples each containing an image, a question and an answer; S4, automatic quality contr