CN-122020361-A - Urban land utilization identification method based on BERT model text classification algorithm

Abstract

The invention relates to an urban land use identification method based on a BERT model text classification algorithm, comprising the following steps: Step 1, determining the urban land use identification area range, acquiring POI data within that range, and then screening the data; Step 2, establishing urban land use identification units in an urban land use identification model based on the BERT model text classification algorithm, and associating the screened POI data with the identification units; Step 3, carrying out geographic text mining on the POI data; and Step 4, carrying out urban land use identification with the urban land use identification model based on the BERT model text classification algorithm to obtain a high-precision urban land use identification result. The invention can use available POI data to identify urban land use promptly, accurately, and with high precision.

Inventors

Mi Xiaoyan
Chen Yantianxiang
Wang Zhao

Assignees

Tianjin University (天津大学)

Dates

Publication Date: 20260512
Application Date: 20251231

Claims (6)

1. An urban land use identification method based on a BERT model text classification algorithm, characterized by comprising the following steps: Step 1, determining the urban land use identification categories, determining the urban land use identification area range, acquiring POI data within the area range, and then screening the data; Step 2, establishing urban land use identification units in an urban land use identification model based on the BERT model text classification algorithm, and associating the POI data screened in Step 1 with the urban land use identification units; Step 3, carrying out geographic text mining on the POI data based on the association result between the urban land use identification units and the POI data obtained in Step 2; Step 4, carrying out urban land use identification with the urban land use identification model based on the BERT model text classification algorithm, using the geographic text mining result of the POI data obtained in Step 3, to obtain a high-precision urban land use identification result.
2. The urban land use identification method based on the BERT model text classification algorithm according to claim 1, wherein the specific method of Step 1 is as follows: First, taking 6 types of urban construction land, namely residential land, public management and public service land, commercial service land, industrial land, logistics storage land, and public facility land, as the urban land use identification categories of the urban land use identification model based on the BERT model text classification algorithm; Then, determining the urban land use identification area range and acquiring POI data comprising serial number, administrative region, address, name, and longitude and latitude coordinate information, together with 23 inherent category attributes: dining service, road affiliated facilities, address place name information, scenic spots, public facilities, company enterprises, shopping service, transportation facility service, financial insurance service, scientific and educational culture service, motorcycle service, automobile maintenance, automobile sales, business housing, life service, event activity, indoor facilities, sports and leisure service, traffic facilities, medical care service, government institutions, social groups, and accommodation service; Next, removing POI data located in green land, water bodies, and roads within the identification area range, and, according to the correlation between the POI data types and the 6 types of urban construction land, removing the 6 major POI categories of road affiliated facilities, address place name information, transportation facility service, event activity, indoor facilities, and traffic facilities; Finally, the POI data retained after screening comprise 17 major categories, 229 intermediate categories, and 753 minor categories.
3. The urban land use identification method based on the BERT model text classification algorithm according to claim 1, wherein the specific steps of Step 2 comprise: First, establishing the urban land use identification units in the urban land use identification model based on the BERT model text classification algorithm by using a subset of traffic analysis areas, dividing the study area with a 50 × 50 meter grid, taking the center of each grid cell as a sampling position, and setting a buffer with a radius of 50 meters around it; Then, for the POI data within the buffer of each sampling position, calculating the distance between each POI and the corresponding sampling position, sorting the POI data from nearest to farthest, and generating a POI data list comprising fields such as sampling position number, POI address, POI name, POI longitude and latitude coordinates, POI category, and the distance between the POI and the sampling position, wherein each grid cell corresponds to one POI data list, which serves as the feature information of that grid cell and represents the association result between the urban land use identification units and the POI data.
4. The urban land use identification method based on the BERT model text classification algorithm according to claim 1, wherein the specific steps of Step 3 comprise: (1) Based on the association result between the urban land use identification units and the POI data obtained in Step 2, concatenating and de-duplicating the Chinese names in each POI data list, separated by spaces, as the representative attribute of each sampling position buffer; (2) Inputting the data obtained in step (1) into the BERT model encoder (Transformer Encoder) to obtain a high-dimensional embedding vector of the information, which is used as the geographic text mining result of the POI data; The specific steps of Step 3 (2) comprise: First, the input data are segmented by the tokenizer (Tokenizer) of the BERT model, and each word is converted into a corresponding numerical code [symbol missing in source], which serves as input to the BERT model encoder (Transformer Encoder); the BERT model encoder (Transformer Encoder) then maps the input into a high-dimensional embedding vector [symbol missing in source], which is taken as the geographic text mining result of the POI data, namely: [formula missing in source].
5. The urban land use identification method based on the BERT model text classification algorithm according to claim 1, wherein the specific steps of Step 4 comprise: (1) First, randomly selecting part of the POI data and manually labeling land use categories by comparison with the latest urban land use planning map, to form a labeled data set for training and validating the model, wherein the remaining unlabeled POI data contain only basic information, carry no urban land use category, and are used as input data for model prediction; (2) Using a data augmentation technique, duplicating the words in the sentence corresponding to each grid number multiple times and randomly shuffling their order, so as to improve the robustness of the model; (3) Inputting the training set and validation set data into a model training program, and iteratively optimizing the model through the backpropagation mechanism of the neural network, to generate a trained urban land use identification model based on the BERT model text classification algorithm.
6. The urban land use identification method based on the BERT model text classification algorithm according to claim 5, wherein the specific method of Step 4 (3) is as follows: The high-dimensional embedding vector obtained in Step 3 is input into a classifier (Classifier) based on a multi-layer perceptron to obtain a score for each urban land use category, and the scores are converted into category probability values [symbol missing in source] using the softmax function, namely: [formula missing in source]; finally, the category with the highest probability value is returned as the final prediction result: [formula missing in source]; in summary, the complete mathematical expression of the trained urban land use identification model based on the BERT model text classification algorithm is: [formula missing in source]; (4) Finally, the test data set is input into the trained final model to obtain a high-precision urban land use identification result on the 50 × 50 m grid division of the identification area range, and the identification performance of the model is evaluated and its effectiveness verified.
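The association step of claim 3, the text construction of claim 4 (1), and the augmentation of claim 5 (2) can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes POI coordinates have already been projected to planar meters, that de-duplication keeps the first (nearest) occurrence of each name, and that the class `Poi`, the function names, and the sample POI names are all hypothetical.

```python
import math
import random
from dataclasses import dataclass

CELL = 50.0    # grid cell side in meters (claim 3)
RADIUS = 50.0  # buffer radius in meters (claim 3)

@dataclass
class Poi:
    name: str
    x: float  # easting in meters (assumption: already projected)
    y: float  # northing in meters

def grid_centers(xmin, ymin, xmax, ymax, cell=CELL):
    """Yield (sample_id, cx, cy) for the center of every grid cell."""
    nx = math.ceil((xmax - xmin) / cell)
    ny = math.ceil((ymax - ymin) / cell)
    for j in range(ny):
        for i in range(nx):
            yield j * nx + i, xmin + (i + 0.5) * cell, ymin + (j + 0.5) * cell

def poi_list(cx, cy, pois, radius=RADIUS):
    """Return (distance, poi) pairs inside the buffer, nearest first."""
    hits = [(math.hypot(p.x - cx, p.y - cy), p) for p in pois]
    hits = [(d, p) for d, p in hits if d <= radius]
    return sorted(hits, key=lambda h: h[0])

def buffer_text(sorted_hits):
    """Join POI names with spaces; repeats are dropped, distance order is kept
    (order-preserving de-duplication is an assumption of this sketch)."""
    names = dict.fromkeys(p.name for _, p in sorted_hits)
    return " ".join(names)

def augment(text, copies=2, seed=None):
    """Repeat every word `copies` times, then shuffle the order (claim 5 (2))."""
    words = text.split() * copies
    random.Random(seed).shuffle(words)
    return " ".join(words)

# Illustrative data only: four made-up POIs in a 100 m x 100 m study area.
pois = [
    Poi("便利店", 30.0, 20.0),
    Poi("幼儿园", 60.0, 25.0),
    Poi("便利店", 10.0, 40.0),
    Poi("物流园", 400.0, 400.0),
]
texts = {
    sid: buffer_text(poi_list(cx, cy, pois))
    for sid, cx, cy in grid_centers(0, 0, 100, 100)
}
```

Because the buffer radius equals the cell side, each buffer extends into neighboring cells, so a POI can contribute to the text of several sampling positions. This is how surrounding POIs influence a cell's representation in the sketch.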

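The formulas referenced in claims 4 and 6 are absent from this text. One plausible reading of the surrounding prose is sketched below. The symbols $T$, $X$, $H$, $S$, $P$, $\hat{y}$ and $K$ are assumed notation, not taken from the source, and $K = 6$ follows from the six categories named in claim 2.

```latex
% Assumed notation: T is the space-separated POI name text of one
% sampling-position buffer; K = 6 is the number of land use categories.
\begin{align}
X &= \mathrm{Tokenizer}(T) \\
H &= \mathrm{TransformerEncoder}(X) \\
S &= \mathrm{MLP}(H), \qquad S \in \mathbb{R}^{K} \\
P_k &= \mathrm{softmax}(S)_k = \frac{e^{S_k}}{\sum_{j=1}^{K} e^{S_j}}, \qquad k = 1, \dots, K \\
\hat{y} &= \arg\max_{k} P_k \\
\hat{y} &= \arg\max_{k}\, \mathrm{softmax}\bigl(\mathrm{MLP}(\mathrm{TransformerEncoder}(\mathrm{Tokenizer}(T)))\bigr)_k
\end{align}
```

The last line composes the preceding ones and corresponds to what claim 6 calls the complete mathematical expression of the trained model.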
Description

Urban land utilization identification method based on BERT model text classification algorithm

Technical Field

The invention belongs to the technical field of urban land utilization and relates to a method for identifying urban land utilization, in particular to an urban land utilization identification method based on a BERT model text classification algorithm.

Background

Urban land utilization results from the combined action of various urban elements and human activities, and plays a vital role in urban planning, design, activity guidance, and management. In practice, however, urban land utilization planning is usually optimized from the results of previous versions and relies on top-down empirical adjustment; it lacks a bottom-up feedback mechanism from grassroots practice and social demand, which limits the scientific rigor and accuracy of planning. In addition, because urban land utilization planning must be adjusted dynamically, the related data are updated with a lag and often fail to reflect the current situation in time. Acquiring high-precision urban land utilization information accurately and promptly, and thereby supporting government decision-making and management, has therefore become one of the core tasks of current urban spatial planning.

With the rapid development of information technology, diversified social sensing data effectively compensate for the inability of traditional remote sensing and high-resolution satellite imagery, as used in land cover identification, to capture the socioeconomic information generated by human activities. POI (Point of Interest) data in particular are characterized by high precision, wide coverage, rapid updating, large volume, and easy acquisition.
POI data describe geographic entities through rich semantic information such as coordinates, addresses, names, and categories, provide strong support for generating maps of urban social activity, and are widely applied to urban land utilization identification. Existing techniques for urban land utilization identification based on POI data generally comprise several key steps. First, identification units are delineated using a grid with a side length of 200-1000 meters or traffic analysis areas based on the road network; the delineation directly affects the accuracy of urban land utilization identification. Second, the geographic text information in the POI data is extracted with natural language processing models such as Word2Vec, Place2Vec, and geoSemantic Vec. Third, land utilization categories are labeled manually by comparison with the latest urban land utilization planning map to generate training samples and establish the relationship between the geographic text information in the POI data and urban land utilization. Finally, the urban land utilization identification result is output by machine learning algorithms such as support vector machines, random forests, and XGBoost (Extreme Gradient Boosting).
The shortcomings of the prior art in using POI data to identify urban land utilization are as follows:

(1) When identifying urban land utilization from POI data, the prior art applies a single identification path to all 8 types of urban construction land: residential land, public management and public service land, business service facility land, industrial land, logistics storage land, road and transportation facility land, public facility land, and green land and square land. This ignores the fact that spatial entities such as road and transportation facility land and green land and square land are distributed as networks or areas, for which POI data are sparse while the land parcels are large. The accuracy of the relationship established between POI data and urban land utilization types suffers as a result, and large deviations arise when identifying the utilization of large land parcels.

(2) Identification unit division: identification units in the prior art have low precision. The urban land utilization category is generally determined only from the POI data inside an identification unit, ignoring the influence of the surrounding POI data; the identification units are treated as mutually independent, and the spatial relationships and interconnections between urban land parcels are ignored, which limits the accuracy and comprehensiveness of urban land utilization identification.

(3) Model algorithm: the natural language processing models used in the prior art generally lack context sensitivity and deep semantic understanding, and cannot accurately capture how a word changes across different contexts. At the s