CN-122019741-A - Chinese address matching method and system based on word segmentation and multiple similarity determination
Abstract
The embodiment of the invention discloses a Chinese address matching method and system based on word segmentation and multiple similarity judgment. The method comprises the following steps: S1, address preprocessing: collecting address data, removing irrelevant information from the address data, and unifying formats; S2, word segmentation and element extraction: extracting the elements in the address data based on a Chinese word segmentation technology; S3, feature vector generation: aggregating the elements and converting them into a high-dimensional feature representation; S4, multiple similarity judgment: performing a comprehensive judgment using heterogeneous similarity measures in a multi-dimensional feature space together with an adaptive fusion mechanism; and S5, intelligent judgment and output: outputting a final judgment of same address or different addresses according to the judgment result. The invention can analyze and compare address data from different sources and in different forms of expression in a unified manner, improving the accuracy and efficiency of address data processing.
Inventors
- YANG LIANGZHI
- BAI LIN
- WANG ZHIXIN
- LI HAITAO
- FANG YUEHAN
- ZHOU GUANGHUI
Assignees
- 彩讯科技股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251209
Claims (8)
- 1. A Chinese address matching method based on word segmentation and multiple similarity judgment, characterized by comprising the following steps: S1, address preprocessing: collecting address data, removing irrelevant information from the address data, and unifying formats; S2, word segmentation and element extraction: extracting the elements in the address data based on a Chinese word segmentation technology; S3, feature vector generation: aggregating the elements and converting them into a high-dimensional feature representation; S4, multiple similarity judgment: performing a comprehensive judgment using heterogeneous similarity measures in a multi-dimensional feature space together with an adaptive fusion mechanism; and S5, intelligent judgment and output: outputting a final judgment of same address or different addresses according to the judgment result.
- 2. The Chinese address matching method based on word segmentation and multiple similarity determination as recited in claim 1, wherein step S3 comprises the sub-steps of: S31, hierarchical element encoding: classifying the word segmentation and element extraction results hierarchically by semantic category, and encoding the element set of each category separately to form multiple groups of sub-feature sets; S32, set and sparse vectorization: constructing a corpus vocabulary for each element category, and mapping the element set of each address into a sparse binary vector over that vocabulary; S33, sequence feature encoding: in addition to the set features, retaining the order of the elements after word segmentation to obtain sequence features; S34, statistical and supplementary semantic features: calculating statistical features of each element category as auxiliary feature vectors; and S35, multimodal feature concatenation: concatenating the resulting sparse binary vector, sequence features and statistical features into a high-dimensional composite feature vector.
- 3. The Chinese address matching method based on word segmentation and multiple similarity determination as recited in claim 2, wherein in step S3 a semantic similarity, calculated as the cosine similarity of BERT vectors, is used as a semantic vector; when the address contains spatial coordinates, spatial relationship features are calculated; and the resulting sparse binary vector, sequence features, statistical features, semantic vector and spatial relationship features are concatenated into a high-dimensional composite feature vector.
- 4. The Chinese address matching method based on word segmentation and multiple similarity determination as recited in claim 1, wherein step S4 comprises the sub-steps of:
  - S41, weighted Jaccard similarity calculation: let the element categories of addresses A and B be indexed by $i$, and let the vocabulary item sets under each category be $G_i$ and $G_i'$; first calculate the basic Jaccard similarity of each category: $J_i = \frac{|G_i \cap G_i'|}{|G_i \cup G_i'|}$; then assign a weight $w_i$ to each category, the weighted Jaccard similarity being: [formula missing in the source];
  - S42, block edit distance measurement: segment the word segmentation sequence by element blocks, denote the blocks as $B_i$ and $B_i'$, calculate a Levenshtein distance $D_i$ for each pair of blocks, and normalize it into a similarity: [formula missing in the source]; then assign a weight $\alpha_i$ to each block, the weighted block edit distance similarity being: [formula missing in the source];
  - S43, longest common subsequence similarity: denote the element sequences of addresses A and B after word segmentation as $A = [a_1, a_2, \ldots, a_m]$ and $B = [b_1, b_2, \ldots, b_n]$; calculate the length $L$ of the longest common subsequence using a dynamic programming algorithm, and normalize it into a similarity score: [formula missing in the source];
  - S44, anomaly weighting mechanism: let the element set of an address be $E$ and the key element set be $E_{\text{key}} \subset E$; for each key element $e \in E_{\text{key}}$, if it is missing, a missing penalty factor $\delta_e$ is calculated, and the anomaly correction factor $\delta$ is calculated by an exponential decay function: [formula missing in the source]; where $\lambda$ is the attenuation coefficient, $\mathbb{I}$ is the indicator function, and $p_e$ is the weight of the element; the anomaly correction factor is multiplied by the fused similarity score: [formula missing in the source];
  - S45, dynamic threshold determination: adjust the dynamic threshold $\theta$ using a distribution-adaptive algorithm according to the historical data distribution;
  - S46, fusion decision: concatenate the weighted Jaccard similarity $J_{\text{weighted}}$, the block edit distance similarity $S_{\text{block}}$ and the LCS similarity $S_{\text{LCS}}$ into a feature vector $f$; fuse them with a multilayer perceptron model or a weighted linear model to judge whether the two addresses are the same; weighted linear model: $S_{\text{fusion}} = w^{T} f + b$, where $w$ is the weight vector and $b$ is the bias; multilayer perceptron model: [formula missing in the source], where $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are biases, and $\sigma$ is the sigmoid function; the judgment is made according to the dynamic threshold $\theta$: [formula missing in the source].
- 5. A Chinese address matching system based on word segmentation and multiple similarity determination, characterized by comprising: an address preprocessing module for collecting address data, removing irrelevant information from the address data, and unifying formats; a word segmentation and element extraction module for extracting the elements in the address data based on a Chinese word segmentation technology; a feature vector generation module for aggregating the elements and converting them into a high-dimensional feature representation; a multiple similarity judgment module for performing a comprehensive judgment using heterogeneous similarity measures in a multi-dimensional feature space together with an adaptive fusion mechanism; and an intelligent judgment and output module for outputting a final judgment of same address or different addresses according to the judgment result.
- 6. The Chinese address matching system based on word segmentation and multiple similarity determination as recited in claim 5, wherein the feature vector generation module obtains the high-dimensional feature representation according to the following steps: hierarchical element encoding: classifying the word segmentation and element extraction results hierarchically by semantic category, and encoding the element set of each category separately to form multiple groups of sub-feature sets; set and sparse vectorization: constructing a corpus vocabulary for each element category, and mapping the element set of each address into a sparse binary vector over that vocabulary; sequence feature encoding: in addition to the set features, retaining the order of the elements after word segmentation to obtain sequence features; statistical and supplementary semantic features: calculating statistical features of each element category as auxiliary feature vectors; and multimodal feature concatenation: concatenating the resulting sparse binary vector, sequence features and statistical features into a high-dimensional composite feature vector.
- 7. The Chinese address matching system based on word segmentation and multiple similarity determination as recited in claim 6, wherein the feature vector generation module further uses a semantic similarity, calculated as the cosine similarity of BERT vectors, as a semantic vector; calculates spatial relationship features when the address contains spatial coordinates; and concatenates the resulting sparse binary vector, sequence features, statistical features, semantic vector and spatial relationship features into a high-dimensional composite feature vector.
- 8. The Chinese address matching system based on word segmentation and multiple similarity determination as recited in claim 5, wherein the multiple similarity judgment module performs the comprehensive judgment according to the following:
  - weighted Jaccard similarity calculation: let the element categories of addresses A and B be indexed by $i$, and let the vocabulary item sets under each category be $G_i$ and $G_i'$; first calculate the basic Jaccard similarity of each category: $J_i = \frac{|G_i \cap G_i'|}{|G_i \cup G_i'|}$; then assign a weight $w_i$ to each category, the weighted Jaccard similarity being: [formula missing in the source];
  - block edit distance measurement: segment the word segmentation sequence by element blocks, denote the blocks as $B_i$ and $B_i'$, calculate a Levenshtein distance $D_i$ for each pair of blocks, and normalize it into a similarity: [formula missing in the source]; then assign a weight $\alpha_i$ to each block, the weighted block edit distance similarity being: [formula missing in the source];
  - longest common subsequence similarity: denote the element sequences of addresses A and B after word segmentation as $A = [a_1, a_2, \ldots, a_m]$ and $B = [b_1, b_2, \ldots, b_n]$; calculate the length $L$ of the longest common subsequence using a dynamic programming algorithm, and normalize it into a similarity score: [formula missing in the source];
  - anomaly weighting mechanism: let the element set of an address be $E$ and the key element set be $E_{\text{key}} \subset E$; for each key element $e \in E_{\text{key}}$, if it is missing, a missing penalty factor $\delta_e$ is calculated, and the anomaly correction factor $\delta$ is calculated by an exponential decay function: [formula missing in the source]; where $\lambda$ is the attenuation coefficient, $\mathbb{I}$ is the indicator function, and $p_e$ is the weight of the element; the anomaly correction factor is multiplied by the fused similarity score: [formula missing in the source];
  - dynamic threshold determination: adjust the dynamic threshold $\theta$ using a distribution-adaptive algorithm according to the historical data distribution;
  - fusion decision: concatenate the weighted Jaccard similarity $J_{\text{weighted}}$, the block edit distance similarity $S_{\text{block}}$ and the LCS similarity $S_{\text{LCS}}$ into a feature vector $f$; fuse them with a multilayer perceptron model or a weighted linear model to judge whether the two addresses are the same; weighted linear model: $S_{\text{fusion}} = w^{T} f + b$, where $w$ is the weight vector and $b$ is the bias; multilayer perceptron model: [formula missing in the source], where $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are biases, and $\sigma$ is the sigmoid function; the judgment is made according to the dynamic threshold $\theta$: [formula missing in the source].
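The similarity measures and the fusion decision recited in claims 4 and 8 can be illustrated with a minimal Python sketch. The formulas themselves are not reproduced in the claim text, so every normalization here is an assumption chosen from common practice rather than the patent's definition: weighted scores are divided by the sum of weights, edit distance and LCS length are divided by the longer length, the anomaly factor is taken as exp(-λ · Σ p_e) over missing key elements, and the decision compares the corrected linear fusion score against θ with `>=`. All function and parameter names are hypothetical, and the multilayer perceptron variant is not shown.

```python
import math
from typing import Dict, List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Basic Jaccard similarity; two empty sets count as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def weighted_jaccard(ga: Dict[str, Set[str]], gb: Dict[str, Set[str]],
                     w: Dict[str, float]) -> float:
    """Per-category Jaccard scores averaged with weights w (normalized by sum of w: assumption)."""
    total = sum(w.values())
    return sum(w[c] * jaccard(ga.get(c, set()), gb.get(c, set())) for c in w) / total


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def block_similarity(ba: Dict[str, str], bb: Dict[str, str],
                     alpha: Dict[str, float]) -> float:
    """Weighted block similarity using 1 - D_i / max(len) per block (assumed normalization)."""
    total = sum(alpha.values())
    score = 0.0
    for block, a in alpha.items():
        x, y = ba.get(block, ""), bb.get(block, "")
        longest = max(len(x), len(y))
        score += a * (1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest)
    return score / total


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length via dynamic programming, divided by max(m, n) (assumed normalization)."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)


def anomaly_factor(present: Set[str], key_weights: Dict[str, float], lam: float) -> float:
    """Exponential decay over the weights of missing key elements (assumed form)."""
    penalty = sum(p for e, p in key_weights.items() if e not in present)
    return math.exp(-lam * penalty)


def decide(f: List[float], w: List[float], b: float, delta: float, theta: float) -> bool:
    """Weighted linear fusion w^T f + b, scaled by the anomaly factor, compared with theta."""
    s_fusion = sum(wi * fi for wi, fi in zip(w, f)) + b
    return delta * s_fusion >= theta


if __name__ == "__main__":
    a = {"city": "深圳市", "road": "科技路", "number": "1号"}
    b = {"city": "深圳", "road": "科技路", "number": "1号"}
    weights = {"city": 1.0, "road": 2.0, "number": 1.0}
    f = [
        weighted_jaccard({k: {v} for k, v in a.items()}, {k: {v} for k, v in b.items()}, weights),
        block_similarity(a, b, weights),
        lcs_similarity(list(a.values()), list(b.values())),
    ]
    delta = anomaly_factor(set(b), {"city": 1.0, "road": 1.0}, lam=0.5)
    print(f, delta, decide(f, [0.3, 0.4, 0.3], 0.0, delta, theta=0.7))
```

In this sketch the block edit distance tolerates the abbreviated city name ("深圳" versus "深圳市") better than the set-based Jaccard score does, which is the kind of complementarity the fusion step is meant to exploit.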
Description
Chinese address matching method and system based on word segmentation and multiple similarity determination
Technical Field
The invention relates to the technical field of address data processing, and in particular to a Chinese address matching method and system based on word segmentation and multiple similarity judgment.
Background
With the development of the internet and informatization, address data is increasingly widely used in logistics, finance, e-commerce, government affairs and other fields. However, existing methods for acquiring and processing address data have the following disadvantages:
1. Diverse address expressions: the same address may be written in several different ways, and the lack of a unified standard makes data comparison and integration difficult.
2. Non-standard structure: some address data is not recorded in a unified format and suffers from missing elements, redundancy, disordered element order and similar problems, which affects subsequent processing.
3. Limitations of traditional comparison methods: existing methods mostly rely on simple string matching, cannot effectively identify addresses that have the same meaning but different expressions, and have low accuracy.
4. Lack of intelligent processing: there is no word segmentation and element extraction mechanism tailored to the characteristics of Chinese addresses, so efficient automatic processing is difficult to achieve.
5. Poor scalability: existing schemes struggle to adapt flexibly to the address standardization and similarity judgment requirements of different business scenarios.
Therefore, a method is needed that can automatically, accurately and efficiently standardize Chinese addresses and recognize their similarity, so as to improve data quality and business processing efficiency.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide a Chinese address matching method and system based on word segmentation and multiple similarity judgment, so as to improve data quality and business processing efficiency.
In order to solve the above technical problem, the embodiment of the invention provides a Chinese address matching method based on word segmentation and multiple similarity judgment, comprising the following steps:
S1, address preprocessing: collecting address data, removing irrelevant information from the address data, and unifying formats;
S2, word segmentation and element extraction: extracting the elements in the address data based on a Chinese word segmentation technology;
S3, feature vector generation: aggregating the elements and converting them into a high-dimensional feature representation;
S4, multiple similarity judgment: performing a comprehensive judgment using heterogeneous similarity measures in a multi-dimensional feature space together with an adaptive fusion mechanism;
S5, intelligent judgment and output: outputting a final judgment of same address or different addresses according to the judgment result.
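Steps S1 to S3 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the text does not name a word segmentation tool, so a hypothetical suffix-based regular expression (`ELEMENT_PATTERN`) stands in for the Chinese word segmentation step, the cleaning rules in `preprocess` are examples, and the feature vector keeps only a per-category binary encoding plus element-presence flags, leaving out sequence features.

```python
import re
import unicodedata
from typing import Dict, List

# Hypothetical suffix rules standing in for a real Chinese word segmenter.
ELEMENT_PATTERN = re.compile(
    r"(?P<province>[^省]+省)?"
    r"(?P<city>[^市]+市)?"
    r"(?P<district>[^区县]+[区县])?"
    r"(?P<road>[^路街]+[路街])?"
    r"(?P<number>\d+号)?"
)
CATEGORIES = ["province", "city", "district", "road", "number"]


def preprocess(raw: str) -> str:
    """S1: unify the format and drop irrelevant information (example rules only)."""
    text = unicodedata.normalize("NFKC", raw)          # full-width digits/brackets to half-width
    text = re.sub(r"\(.*?\)", "", text)                # drop parenthesized remarks
    return re.sub(r"[\s,;.!?，。；！？、]+", "", text)   # drop whitespace and punctuation


def extract_elements(address: str) -> Dict[str, str]:
    """S2: split a cleaned address into named elements; unmatched categories are omitted."""
    match = ELEMENT_PATTERN.match(address)  # every group is optional, so this always matches
    return {k: v for k, v in match.groupdict().items() if v}


def build_vocab(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect a sorted vocabulary per element category from the observed records."""
    vocab: Dict[str, set] = {}
    for rec in records:
        for cat, val in rec.items():
            vocab.setdefault(cat, set()).add(val)
    return {cat: sorted(vals) for cat, vals in vocab.items()}


def to_feature_vector(rec: Dict[str, str], vocab: Dict[str, List[str]]) -> List[int]:
    """S3: binary encoding per category, followed by one presence flag per category."""
    vec: List[int] = []
    for cat in CATEGORIES:
        vec.extend(1 if rec.get(cat) == term else 0 for term in vocab.get(cat, []))
    vec.extend(1 if cat in rec else 0 for cat in CATEGORIES)
    return vec


if __name__ == "__main__":
    a = extract_elements(preprocess("广东省深圳市南山区 科技路１号（近地铁站）"))
    b = extract_elements(preprocess("深圳市南山区科技路1号"))
    vocab = build_vocab([a, b])
    print(a)
    print(to_feature_vector(a, vocab))
    print(to_feature_vector(b, vocab))
```

A production system would replace the regular expression with a trained segmenter or an address dictionary; the point of the sketch is only that two differently written inputs end up as comparable element sets and vectors.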
Correspondingly, the embodiment of the invention also provides a Chinese address matching system based on word segmentation and multiple similarity determination, which comprises:
an address preprocessing module for collecting address data, removing irrelevant information from the address data, and unifying formats;
a word segmentation and element extraction module for extracting the elements in the address data based on a Chinese word segmentation technology;
a feature vector generation module for aggregating the elements and converting them into a high-dimensional feature representation;
a multiple similarity judgment module for performing a comprehensive judgment using heterogeneous similarity measures in a multi-dimensional feature space together with an adaptive fusion mechanism;
and an intelligent judgment and output module for outputting a final judgment of same address or different addresses according to the judgment result.
The beneficial effects of the invention are as follows:
1. The invention improves address data processing efficiency: through multistage preprocessing, automatic word segmentation and element extraction, it significantly raises the degree of automation of Chinese address standardization and comparison, and reduces manual intervention and processing cost.
2. The invention enhances the compatibility of address data: it supports parsing of Chinese addresses written in various forms of expression and judging whether they are the same address, avoiding the adaptation problems caused by changes in address format.
3. The invention adopts a multiple-similarity fusion judgment mechanism, which effectively improves recognition accuracy when the same address is expressed in different ways, and reduces data