CN-115982151-B - Data redundancy identification method and device, electronic equipment and storage medium
Abstract
The embodiments of the application provide a data redundancy identification method, a device, electronic equipment and a storage medium. The data redundancy identification method comprises: obtaining fields to be identified in at least two data tables in a preset database; obtaining the support degree of the fields to be identified; and identifying reasonable redundant fields among the fields to be identified based on that support degree, wherein the reasonable redundant fields indicate fields whose occurrence frequency is higher than a preset value. The embodiments of the application thereby realize the identification of reasonable redundant fields.
Inventors
- DENG JUAN
- LUO XIU
Assignees
- China Mobile Group Guizhou Co., Ltd. (中国移动通信集团贵州有限公司)
- China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20211013
Claims (6)
- 1. A data redundancy identification method, characterized in that the data redundancy identification method comprises: acquiring fields to be identified in at least two data tables in a preset database; acquiring the support degree of the fields to be identified; identifying reasonable redundant fields among the fields to be identified based on the support degree of the fields to be identified, wherein the reasonable redundant fields indicate fields with an occurrence frequency higher than a preset value; acquiring fields to be calculated for redundancy, being the fields to be identified other than the reasonable redundant fields; and calculating the data redundancy between a first data table and a second data table based on the fields to be calculated for redundancy in each data table and a preset reference index, wherein the first data table and the second data table are any two data tables in the at least two data tables; wherein the identifying of the reasonable redundant fields among the fields to be identified based on the support degree of the fields to be identified comprises: screening, based on the support degree of the fields to be identified, target fields whose support degree is higher than a preset support degree threshold; forming common field sets from any two of the target fields, calculating the support degree of each common field set, and screening target common field sets whose support degree is higher than the preset support degree threshold; and determining the fields in the target common field sets as the reasonable redundant fields.
- 2. The method of claim 1, wherein the reference index comprises at least one of a non-common field overlap ratio, a table-located domain overlap ratio, a storage period overlap ratio, a table-located layer overlap ratio, a source table overlap ratio, and a common field overlap ratio; and the non-common fields corresponding to the non-common field overlap ratio and the common fields corresponding to the common field overlap ratio are obtained by screening the fields to be calculated for redundancy.
- 3. The data redundancy identification method according to claim 1 or 2, wherein the calculating of the data redundancy between the first data table and the second data table based on the fields to be calculated for redundancy in each data table and a preset reference index comprises: calculating a weight value corresponding to each reference index, wherein the weight value corresponding to each reference index is the ratio of the weight score of that reference index to the sum of the weight scores of all the reference indexes, and the weight score of each reference index is calculated based on the importance of that reference index among all the reference indexes; and calculating the product of each reference index and the weight value corresponding to that reference index, and determining the sum of all the calculated products as the data redundancy.
- 4. A data redundancy identification apparatus, the data redundancy identification apparatus comprising: a first acquisition module, used for acquiring fields to be identified in at least two data tables in a preset database; a second acquisition module, used for acquiring the support degree of the fields to be identified; and a redundancy identification module, used for identifying reasonable redundant fields among the fields to be identified based on the support degree of the fields to be identified, wherein the reasonable redundant fields indicate fields with an occurrence frequency higher than a preset value; wherein the redundancy identification module is used for screening, based on the support degree of the fields to be identified, target fields whose support degree is higher than a preset support degree threshold, forming common field sets from any two of the target fields, calculating the support degree of each common field set, screening target common field sets whose support degree is higher than the preset support degree threshold, and determining the fields in the target common field sets as the reasonable redundant fields; the data redundancy identification apparatus further comprising: a redundancy calculation module, used for acquiring fields to be calculated for redundancy, being the fields to be identified other than the reasonable redundant fields, and calculating the data redundancy between a first data table and a second data table based on the fields to be calculated for redundancy in each data table and a preset reference index, wherein the first data table and the second data table are any two data tables in the at least two data tables.
- 5. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, the program or instruction when executed by the processor implementing the steps of the data redundancy identification method of any one of claims 1 to 3.
- 6. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the data redundancy identification method according to any one of claims 1 to 3.
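As a rough illustration of the weighting described in claim 3, the following Python sketch normalizes per-index weight scores into weight values and sums the weighted reference indexes. The function names, index names and numeric values are hypothetical. The claims do not state how a weight score is derived from an index's importance, so the scores are simply taken as inputs here.

```python
def index_weights(weight_scores):
    """Weight value of each index = its weight score / sum of all weight scores."""
    total = sum(weight_scores.values())
    if total <= 0:
        raise ValueError("weight scores must sum to a positive number")
    return {name: score / total for name, score in weight_scores.items()}


def data_redundancy(index_values, weight_scores):
    """Sum of (reference index value * weight value) over all reference indexes."""
    if set(index_values) != set(weight_scores):
        raise ValueError("every reference index needs exactly one weight score")
    weights = index_weights(weight_scores)
    return sum(index_values[name] * weights[name] for name in index_values)


# Hypothetical overlap ratios between two tables (each in [0, 1]).
index_values = {
    "common_field_overlap": 0.8,
    "source_table_overlap": 0.4,
    "storage_period_overlap": 1.0,
}
# Hypothetical importance-based weight scores for the same indexes.
weight_scores = {
    "common_field_overlap": 2,
    "source_table_overlap": 1,
    "storage_period_overlap": 1,
}

redundancy = data_redundancy(index_values, weight_scores)
print(f"data redundancy: {redundancy:.2f}")
```

Because the weight values sum to 1, the result stays within the range of the individual overlap ratios, which makes it straightforward to compare against a single redundancy threshold.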
Description
Data redundancy identification method and device, electronic equipment and storage medium
Technical Field
The present application belongs to the technical field of mobile communications, and in particular relates to a data redundancy identification method, a device, an electronic apparatus, and a storage medium.
Background
A data warehouse is a strategic collection that provides all types of data support for enterprise business decision making, and is a single data store created for analytical reporting and decision support. A data warehouse is characterized by large data volumes and varied data. In the 21st century in particular, data volumes have grown to the petabyte (P) level, and the cost of managing that data has risen rapidly with them. Managing the data in the warehouse sensibly, effectively identifying redundancy in the metadata, and releasing warehouse storage resources by selectively backing up or deleting the identified redundant tables are therefore the main means by which enterprises ensure that data warehouse resources are used reasonably. One way to identify redundant metadata in a data warehouse is for data management personnel to judge whether tables are redundant based on their familiarity with the metadata; the other is to screen, compare and check all the tables in the warehouse for redundancy.
The first method requires data management staff to be familiar with all the tables in the warehouse, which places high demands on the staff. The second method compares all the tables in the warehouse with one another in a loop and judges whether tables are redundant by comparing how similar their fields are. This process, however, can mistakenly judge fields that are highly similar but reasonably so as redundant fields, so the identified redundancy does not match the actual situation and the accuracy of redundancy identification is low.
Disclosure of Invention
The embodiments of the application aim to provide a data redundancy identification method, a device, electronic equipment and a storage medium, so as to solve the problem of low accuracy of data redundancy identification in the related art.
In a first aspect, an embodiment of the present application provides a data redundancy identification method, where the data redundancy identification method includes: acquiring fields to be identified in at least two data tables in a preset database; acquiring the support degree of the fields to be identified; and identifying reasonable redundant fields among the fields to be identified based on the support degree of the fields to be identified, wherein the reasonable redundant fields indicate fields with an occurrence frequency higher than a preset value.
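The first-aspect steps, together with the two-stage screening recited in claim 1, can be sketched in Python as follows. This is a minimal sketch under a stated assumption: the support degree of a field (or field pair) is taken to be the fraction of tables that contain it, in the usual association-rule sense, which this part of the text does not define. The table names, field names, threshold and function names are all hypothetical.

```python
from itertools import combinations


def support(tables, fields):
    """ASSUMED definition: fraction of tables containing every field in `fields`."""
    wanted = set(fields)
    hits = sum(1 for columns in tables.values() if wanted <= set(columns))
    return hits / len(tables)


def reasonable_redundant_fields(tables, min_support):
    """Two-stage screening: single fields first, then pairs of surviving fields."""
    all_fields = set().union(*tables.values())

    # Stage 1: keep target fields whose support is higher than the threshold.
    targets = sorted(f for f in all_fields if support(tables, [f]) > min_support)

    # Stage 2: form two-field common field sets from the targets and keep the
    # fields of every set whose support is also higher than the threshold.
    result = set()
    for pair in combinations(targets, 2):
        if support(tables, pair) > min_support:
            result.update(pair)
    return result


# Hypothetical warehouse metadata: table name -> list of field names.
tables = {
    "orders": ["user_id", "region_code", "amount"],
    "payments": ["user_id", "region_code", "channel"],
    "logins": ["user_id", "region_code", "device"],
    "products": ["product_id", "price"],
}

redundant = reasonable_redundant_fields(tables, min_support=0.5)
print(sorted(redundant))
```

Fields returned by this screening would then be excluded before any table-to-table redundancy is calculated, so that widely shared keys do not inflate the similarity between unrelated tables.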
In a second aspect, an embodiment of the present application provides a data redundancy identification apparatus, including: a first acquisition module, used for acquiring fields to be identified in at least two data tables in a preset database; a second acquisition module, used for acquiring the support degree of the fields to be identified; and a redundancy identification module, used for identifying reasonable redundant fields among the fields to be identified based on the support degree of the fields to be identified, wherein the reasonable redundant fields indicate fields with an occurrence frequency higher than a preset value.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a program or instruction stored on the memory and executable on the processor, the program or instruction implementing the steps of the method according to the first aspect when executed by the processor.
In a fourth aspect, an embodiment of the present application provides a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the present application, the fields to be identified in at least two data tables in a preset database are obtained, the support degree of the fields to be identified is obtained, and the reasonable redundant fields among the fields to be identified are then identified based on that support degree, wherein the reasonable redundant fields indicate fields whose occurrence frequency is higher than a preset value, so that the identification of the redundant fields with high similarity bu