CN-121979960-A - Multi-level address matching method and system based on dynamic programming
Abstract
The invention discloses a multilevel address matching method and a system based on dynamic programming, and relates to the technical field of text processing, wherein the method realizes multi-source address data convergence through an ETL tool, and address text is standardized through preprocessing operations such as symbol unification, invalid character removal, feature character conversion and the like; and designing a hierarchical matching mechanism based on a dynamic programming idea, defining four-level matching hierarchies of courtyard, building, unit and house and corresponding state transition equations, and realizing accurate matching by calculating the text similarity of the address elements. The invention obviously improves the matching robustness and accuracy of the noise address data, and can be widely applied to the fields of geographic information systems, logistics scheduling, public safety and the like.
Inventors
- YANG GUANG
- ZHANG BAIXIN
- ZHANG YU
- XIANG LIANG
Assignees
- 武汉众智数字技术有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251224
Claims (10)
- 1. The multi-level address matching method based on dynamic programming is characterized by comprising the following steps of: extracting original data containing service numbers and address texts from a plurality of service systems through an ETL tool, and converging and storing the original data into a database to form an address converging library; Preprocessing an address text in an address convergence library, wherein the preprocessing specifically comprises standardized processing of symbols and conversion of characteristic characters; extracting the address elements based on the preprocessed address text, and processing the extracted address elements according to a preset rule to obtain a standard address; and obtaining the business address to be matched, calculating the matching score of the business address and the standard address according to a predefined matching hierarchy, and outputting a matching result according to the matching score.
- 2. The address matching method according to claim 1, wherein the standardized processing is performed on the symbols, and the specific method comprises the steps of converting Chinese full-angle symbols into English half-angle forms, clearing special control symbols and redundant punctuation through regular expressions, merging repeated spaces, converting capital-case Chinese numbers into Arabic numbers, and uniformly converting English letters into capital-case.
- 3. The address matching method according to claim 1, wherein the feature characters are converted, and the method specifically comprises the step of replacing synonymous feature characters in the address text in batches based on a preset character mapping table, wherein the character mapping table at least comprises the correspondence of "# |building=number", "No. =number", "room=unit".
- 4. The address matching method according to claim 1, wherein the address elements are extracted based on the preprocessed address text, and the specific method comprises the steps of defining a hierarchical element system of provinces, cities, counties, streets and villages, village and villages, roads, road numbers, interest points, buildings, units and rooms, performing large-scale address element preliminary extraction by adopting MGeo models, and combining a jieba word segmentation device after modification with a custom element dictionary, and realizing accurate extraction of the address elements through bidirectional longest-like matching.
- 5. The address matching method according to claim 4, wherein the extracted address elements are processed according to a preset rule to obtain a standard address, and the preset rule includes: Constructing an address element entity library, namely performing de-duplication processing on the extracted address elements, and storing the extracted address elements in a classified manner according to a hierarchical element system; Establishing an address alias library, namely associating semantic abbreviations and historical aliases of address elements of the same hierarchy and establishing an alias mapping relation; constructing an address mapping library, namely mapping according to the mapping between the interest points and the detailed addresses, mapping between the building numbers and the unified codes and mapping between the building numbers and the house numbers; And constructing an address standard library, namely storing standard address data comprising unique address codes, address elements of each level, longitude and latitude coordinates, a coordinate system, address area codes, address responsibility area codes, address major classes, address minor classes and address states.
- 6. The method of claim 4, wherein the method of implementing bi-directional longest-likelihood matching comprises matching the address elements in forward and reverse directions simultaneously, and preferentially intercepting the longest-match text segment.
- 7. The method of claim 5, wherein the address states of the address standard library include an active state and a disabled state, and the address major class and the address minor class are divided according to application industry usage attributes.
- 8. The address matching method according to claim 1, wherein the step of obtaining the service address to be matched and calculating the matching score of the service address and the standard address according to a predefined matching hierarchy comprises the steps of: defining matching levels, wherein the matching levels comprise a courtyard matching level, a building matching level, a unit matching level and a house matching level; Calculating a matching score of a service address and a standard address based on a state transition formula, wherein the state transition formula is as follows: ; ; Wherein x, y respectively represent the service address to the standard address, The score of x and y in matching level n is represented, the threshold p is the set text similarity, and can be adjusted according to the business, wherein A score greater than 0 indicates that x, y match, otherwise, x, y do not match.
- 9. The multi-level address matching system based on dynamic programming adopts the address matching method of any one of the claims 1-8, and is characterized by comprising a multi-source address aggregation module, an address text preprocessing module, a standard address acquisition module and a dynamic programming matching module, wherein: the multi-source address convergence module is used for extracting original data containing service numbers and address texts from a plurality of service systems through the ETL tool, and converging and storing the original data into a database to form an address convergence library; The address text preprocessing module is used for preprocessing the address text in the address convergence library, and specifically comprises standardized processing of symbols and conversion of characteristic characters; The standard address acquisition module is used for extracting the address elements based on the preprocessed address text, and processing the extracted address elements according to a preset rule to obtain a standard address; and the dynamic programming matching module is used for acquiring the service address to be matched, calculating the matching score of the service address and the standard address according to the predefined matching hierarchy, and outputting a matching result according to the matching score.
- 10. An electronic device, comprising: One or more processors; A memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the address matching method of any of claims 1 to 8.
Description
Multi-level address matching method and system based on dynamic programming Technical Field The invention relates to the technical field of text processing, in particular to a multi-level address matching method and system based on dynamic programming. Background Address matching is a core basic technology in the fields of geographic information systems, census, logistics scheduling, public safety and the like, and the accuracy of the address matching directly influences the development quality of subsequent services. However, address data in a real scene has a plurality of problems that the formats are not uniform due to various data sources, the difference of address input standards of different service systems is large, a large number of wrongly written characters, redundant symbols and invisible characters exist, address expression is in the conditions of short name, alias, history name change and the like, and part of addresses are in hierarchical deletion and only comprise part of address elements. The existing address matching technology is mainly divided into two types, namely a matching method based on statistical word segmentation, which focuses on word overlap calculation and has low tolerance to semantic difference and format noise, and a matching algorithm based on deep learning, which realizes matching through word vector training and semantic similarity calculation, but has high model training cost, strict requirements on hardware resources and insufficient efficiency when processing small-scale data or real-time matching scenes. In addition, in the prior art, the whole character string comparison method is mostly adopted, and the hierarchical structure characteristic of the address is not fully utilized, so that the matching accuracy is greatly reduced when part of elements are missing or fuzzy. Dynamic programming is used as a classical optimization solution idea, and can effectively realize optimal comparison among sequences by decomposing sub-problems, recursion solution and result multiplexing, and is excellent in scenes such as character string editing distance calculation and the like. The dynamic programming is combined with the hierarchical structure of the address, so that the subtle difference can be corrected at the character level, the overall consistency is ensured on the hierarchical structure, and a new technical path is provided for solving the problem of complex address matching. Therefore, there is a need for a dynamic programming-based multi-level address matching method and system that solves the problems of the prior art. Disclosure of Invention The invention aims to solve at least one technical problem in the prior art, and provides a multi-level address matching method and system based on dynamic programming. In a first aspect, an embodiment of the present invention provides a method for matching a multi-level address based on dynamic programming, including: extracting original data containing service numbers and address texts from a plurality of service systems through an ETL tool, and converging and storing the original data into a database to form an address converging library; Preprocessing an address text in an address convergence library, wherein the preprocessing specifically comprises standardized processing of symbols and conversion of characteristic characters; extracting the address elements based on the preprocessed address text, and processing the extracted address elements according to a preset rule to obtain a standard address; and obtaining the business address to be matched, calculating the matching score of the business address and the standard address according to a predefined matching hierarchy, and outputting a matching result according to the matching score. Further, the symbol is standardized, and the specific method comprises the steps of converting a Chinese full-angle symbol into an English half-angle form, clearing special control symbols and redundant punctuation through a regular expression, merging repeated spaces, converting uppercase Chinese numbers into Arabic numbers, and uniformly converting English letters into uppercase. Further, the special characters are converted, and the method specifically comprises the step of replacing synonymous special characters in the address text in batches based on a preset character mapping table, wherein the character mapping table at least comprises corresponding relations of "# |building=number", "No. =number", "room=unit". Further, based on the preprocessed address text, the specific method comprises the steps of defining a hierarchical element system of provinces, cities, counties, streets, villages and towns, roads, road numbers, interest points, buildings, units and households, performing large-scale initial extraction of the address elements by adopting a MGeo model, and combining a jieba word segmentation device after modification with a custom element dictionary to achieve accurate extraction