
CN-121980240-A - Multi-source heterogeneous data entity alignment method and system


Abstract

The application discloses a multi-source heterogeneous data entity alignment method and system in the technical field of data processing. It addresses the problem that false and missed matches easily occur when aligning entity data that refer to the same object under different names. The method aligns entities using stable, quantified commodity index time-series data, which avoids the insufficient context and shallow semantic features of unstructured short text. It extracts multi-dimensional features of each commodity index time-series curve, calculates the similarity between curves, and judges from the curve features whether the entities to be aligned correspond to the same entity. This improves the accuracy of aligning same-object, different-name entity data and mitigates the low database query efficiency and data redundancy that such data cause.

Inventors

  • HU PAN
  • JIANG YUHUAN
  • XU RUIJIE
  • WANG HEQUAN
  • Xiong Xingwei
  • LIU WUYI

Assignees

  • 贵州数创控股(集团)有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A multi-source heterogeneous data entity alignment method, the method comprising: acquiring a plurality of entities to be aligned and commodity index time-series data of each entity to be aligned; constructing a corresponding commodity index time-series curve based on the commodity index time-series data of each entity to be aligned; extracting multi-dimensional features of each commodity index time-series curve; calculating, based on the multi-dimensional features, the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned; and aligning the plurality of entities to be aligned based on the similarity.
  2. The method of claim 1, wherein the commodity index time-series data comprises price time-series data and the commodity index time-series curve comprises a price fluctuation curve.
  3. The method according to claim 1 or 2, wherein the multi-dimensional features include morphological features and periodic features, and wherein the extracting the multi-dimensional features of each commodity index time-series curve comprises: normalizing the time-series data in the commodity index time-series curve to obtain a normalized fluctuation curve, and taking the normalized fluctuation curve as the morphological feature; and performing a fast Fourier transform on the normalized fluctuation curve to obtain the periodic feature.
  4. The method of claim 3, wherein the commodity index time-series curves corresponding to the plurality of entities to be aligned comprise a first fluctuation curve and a second fluctuation curve, and wherein the calculating the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned based on the multi-dimensional features comprises: calculating a Pearson correlation coefficient between the morphological feature of the first fluctuation curve and the morphological feature of the second fluctuation curve; calculating a dynamic time warping distance between the morphological feature of the first fluctuation curve and the morphological feature of the second fluctuation curve; calculating a cosine similarity between the periodic feature of the first fluctuation curve and the periodic feature of the second fluctuation curve; and calculating the similarity of the first fluctuation curve and the second fluctuation curve based on the Pearson correlation coefficient, the dynamic time warping distance, and the cosine similarity.
  5. The method of claim 4, wherein the multi-dimensional features further comprise trend features, wherein the extracting the multi-dimensional features of each commodity index time-series curve comprises: performing a first-order difference calculation on the normalized fluctuation curve to obtain the trend feature; and wherein the calculating the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned based on the multi-dimensional features comprises: calculating a synchronization rate between the trend feature of the first fluctuation curve and the trend feature of the second fluctuation curve; and calculating the similarity of the first fluctuation curve and the second fluctuation curve based on the synchronization rate, the Pearson correlation coefficient, the dynamic time warping distance, and the cosine similarity.
  6. The method of claim 4, wherein the aligning the plurality of entities to be aligned based on the similarity comprises: calculating a first mean value of the first fluctuation curve and a second mean value of the second fluctuation curve when the similarity is greater than or equal to a first threshold; and, if the difference between the first mean value and the second mean value is smaller than a second threshold, aligning the entity to be aligned corresponding to the first fluctuation curve with the entity to be aligned corresponding to the second fluctuation curve.
  7. The method of claim 6, wherein the aligning the plurality of entities to be aligned based on the similarity comprises: when the similarity is smaller than the first threshold and greater than a third threshold, marking the entity to be aligned corresponding to the first fluctuation curve and the entity to be aligned corresponding to the second fluctuation curve as highly suspected, and pushing a manual review request to an administrator role.
  8. A multi-source heterogeneous data entity alignment system applied to the method of any one of claims 1-7, the system comprising: an acquisition module for acquiring a plurality of entities to be aligned and commodity index time-series data of each entity to be aligned; a construction module for constructing a corresponding commodity index time-series curve based on the commodity index time-series data of each entity to be aligned; an extraction module for extracting the multi-dimensional features of each commodity index time-series curve; a calculation module for calculating, based on the multi-dimensional features, the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned; and an alignment module for aligning the plurality of entities to be aligned based on the similarity.
  9. A computing device, comprising: a memory for storing a program; and a processor for loading the program to perform the method of any one of claims 1-7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, controls a device in which the computer-readable storage medium is located to perform the method of any one of claims 1-7.
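
A minimal sketch of how the similarity and decision steps of claims 4 to 7 might fit together; it is an illustration, not the patented implementation. The claims name the four measures and the three thresholds but do not say how the measures are combined, so the weights, the `1 / (1 + distance)` conversion of the dynamic time warping distance, the sign-agreement definition of the synchronization rate, the threshold values, the `not_aligned` outcome, and all function names are assumptions. The means passed to `decide` are assumed to be taken from the curves before normalization.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a flat series
    return cov / math.sqrt(var_a * var_b)


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference cost."""
    prev = [math.inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        cur = [math.inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def sync_rate(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed definition: share of steps whose first differences agree in sign."""
    diff_a = [y - x for x, y in zip(a, a[1:])]
    diff_b = [y - x for x, y in zip(b, b[1:])]
    pairs = list(zip(diff_a, diff_b))
    if not pairs:
        return 0.0
    return sum(1 for p, q in pairs if _sign(p) == _sign(q)) / len(pairs)


def combined_similarity(morph_a, morph_b, periodic_a, periodic_b,
                        weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Weighted sum of the four measures (the weights are illustrative)."""
    w_pearson, w_dtw, w_cosine, w_sync = weights
    dtw_similarity = 1.0 / (1.0 + dtw_distance(morph_a, morph_b))
    return (w_pearson * max(0.0, pearson(morph_a, morph_b))
            + w_dtw * dtw_similarity
            + w_cosine * cosine(periodic_a, periodic_b)
            + w_sync * sync_rate(morph_a, morph_b))


def decide(similarity: float, mean_a: float, mean_b: float,
           t1: float = 0.85, t2: float = 1.0, t3: float = 0.6) -> str:
    """Threshold logic of claims 6 and 7 with placeholder threshold values."""
    if similarity >= t1:
        # High similarity: confirm that the curve means are also close.
        return "aligned" if abs(mean_a - mean_b) < t2 else "not_aligned"
    if similarity > t3:
        return "manual_review"  # highly suspected pair, sent to an administrator
    return "not_aligned"
```

Checking the means after the similarity test guards against two different commodities whose prices move in step but sit at different levels, which normalized shape measures alone cannot tell apart.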

Description

Multi-source heterogeneous data entity alignment method and system

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a multi-source heterogeneous data entity alignment method and system.

Background

In the big data era, data sources are increasingly diverse, forming a multi-source heterogeneous data environment. For example, in fields such as agricultural products, bulk commodities, and retail, systems gather vast amounts of data from different channels every day. In these data, the same real-world entity often has several different text names because of differing collection specifications, entry errors, naming habits, and the like. This one-object, many-names phenomenon causes severe data redundancy and inconsistency, which greatly affects database query efficiency, storage cost, and the accuracy of subsequent data analysis. Current mainstream entity alignment methods mostly rely on natural language processing and match entities by calculating the edit distance and semantic similarity of their text. However, such methods need adequate context, perform poorly on domain-specific fields, abbreviations, aliases, and unstructured short text, and are prone to false or missed alignments. In view of this, there is a need for a multi-source heterogeneous data entity alignment method and system.

Disclosure of Invention

To address the problem in the prior art that false and missed matches easily occur when aligning entity data that refer to the same object under different names, the invention provides a multi-source heterogeneous data entity alignment method and system. It improves the accuracy of aligning same-object, different-name entity data and mitigates the low database query efficiency and data redundancy that such data cause.
The specific technical solution is as follows.
In a first aspect, an embodiment of the present application provides a multi-source heterogeneous data entity alignment method, comprising: acquiring a plurality of entities to be aligned and commodity index time-series data of each entity to be aligned; constructing a corresponding commodity index time-series curve based on the commodity index time-series data of each entity to be aligned; extracting multi-dimensional features of each commodity index time-series curve; calculating, based on the multi-dimensional features, the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned; and aligning the plurality of entities to be aligned based on the similarity.
Preferably, the commodity index time-series data includes price time-series data, and the commodity index time-series curve includes a price fluctuation curve.
Preferably, the multi-dimensional features include morphological features and periodic features, and extracting the multi-dimensional features of each commodity index time-series curve comprises: normalizing the time-series data in the commodity index time-series curve to obtain a normalized fluctuation curve; taking the normalized fluctuation curve as the morphological feature; and performing a fast Fourier transform on the normalized fluctuation curve to obtain the periodic feature.
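
The feature extraction described above might look like the following sketch. The source does not say which normalization is used, so z-score normalization is an assumption, as are the function names and the choice to keep only the magnitude spectrum as the periodic feature. The radix-2 fast Fourier transform shown here requires a power-of-two length; a real system would likely pad or resample the series first.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance (assumed normalization)."""
    n = len(series)
    if n == 0:
        raise ValueError("series must not be empty")
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        return [0.0] * n  # flat curve: there is no fluctuation to describe
    return [(x - mean) / std for x in series]


def fft(values: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def extract_features(prices: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (morphological feature, periodic feature) for one price curve."""
    morph = z_normalize(prices)
    spectrum = fft(morph)
    # For real input the spectrum is symmetric, so keep bins 0..n/2 only.
    periodic = [abs(c) for c in spectrum[: len(morph) // 2 + 1]]
    return morph, periodic
```

Because the curve is normalized first, two sources that record the same commodity in different units or with a constant offset produce the same morphological feature, for example `extract_features(prices)` and `extract_features([3 * p + 5 for p in prices])`.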
Preferably, the commodity index time-series curves corresponding to the plurality of entities to be aligned include a first fluctuation curve and a second fluctuation curve, and calculating the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned based on the multi-dimensional features comprises: calculating a Pearson correlation coefficient between the morphological feature of the first fluctuation curve and the morphological feature of the second fluctuation curve; calculating a dynamic time warping distance between the morphological feature of the first fluctuation curve and the morphological feature of the second fluctuation curve; calculating a cosine similarity between the periodic feature of the first fluctuation curve and the periodic feature of the second fluctuation curve; and calculating the similarity of the first fluctuation curve and the second fluctuation curve based on the Pearson correlation coefficient, the dynamic time warping distance, and the cosine similarity.
Preferably, the multi-dimensional features further include a trend feature; extracting the multi-dimensional features of each commodity index time-series curve comprises performing a first-order difference calculation on the normalized fluctuation curve to obtain the trend feature; and the calculating of the similarity between every two commodity index time-series curves corresponding to the plurality of entities to be aligned based on the multi-di