CN-119106275-B - Artificial intelligence public data platform based on big data processing self-updating redundancy elimination

Abstract

The invention relates to a self-updating redundancy elimination method based on big data processing and an artificial intelligence public data platform implementing it. For each data item in the public data platform that has the same data type as the data to be stored, the method compares the sparse feature matrix of the data item with the sparse feature matrix of the data to be stored, thereby identifying the data item that serves as the data to be determined. Feature value completion is then performed on the sparse feature matrix of the data to be stored to obtain a complete feature completion matrix, and the feature completion matrix is compared with the feature matrix of the data to be determined to decide whether self-updating redundancy elimination is performed. The invention is suited to big data processing: it uses artificial intelligence technology to convert data content comparison into feature value comparison, which greatly improves the processing efficiency of self-updating redundancy elimination without reducing redundancy elimination accuracy.
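
The candidate search described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `sparse_similarity` and `find_candidate`, the agreement-ratio similarity measure, and the threshold of 0.9 are all assumptions made for illustration, since the text only requires that the sparse feature matrices be "similar". The sketch also assumes that matrices being compared have the same shape and that a stored value of 0 always marks a deleted element.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]


def sparse_similarity(a: Matrix, b: Matrix, tol: float = 1e-6) -> float:
    """Fraction of positions kept (non-zero) in both matrices whose values agree.

    Assumption: 0 marks a deleted element, so such positions are skipped.
    """
    kept = 0
    equal = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            if x != 0 and y != 0:
                kept += 1
                if abs(x - y) <= tol:
                    equal += 1
    return equal / kept if kept else 0.0


def find_candidate(
    stored: Dict[str, Dict[str, Matrix]],
    incoming: Dict[str, Matrix],
    threshold: float = 0.9,  # hypothetical similarity threshold
) -> Optional[str]:
    """Return the id of the first stored item whose sparse matrix is similar
    to the incoming one for every feature value type, or None."""
    for item_id, matrices in stored.items():
        if matrices and all(
            ftype in incoming
            and sparse_similarity(m, incoming[ftype]) >= threshold
            for ftype, m in matrices.items()
        ):
            return item_id
    return None


# Usage: item "b" matches the incoming data on every kept position.
stored_items = {
    "a": {"hist": [[1.0, 0.0], [0.0, 4.0]]},
    "b": {"hist": [[1.0, 0.0], [0.0, 5.0]]},
}
incoming_item = {"hist": [[1.0, 0.0], [0.0, 5.0]]}
candidate = find_candidate(stored_items, incoming_item)  # "b"
```

Only positions retained in both matrices are compared, which is where the efficiency gain over full content comparison would come from in this sketch.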

Inventors

FAN HONG
HU WEI

Assignees

南京费雪克劳德科技有限公司

Dates

Publication Date: 20260512
Application Date: 20240814

Claims (8)

1. A self-updating redundancy elimination method based on big data processing, the method comprising:
Step S1, dividing each data item stored in a public data platform to obtain one or more data areas, obtaining a feature value of each data area, and constructing a feature matrix corresponding to each data item by using the feature values; the data type to be stored comprises images, voice, text and/or video;
Step S2, calculating the entropy S_u of the feature matrix M_u corresponding to each feature value type u, selecting from all feature value types the feature matrices corresponding to the first U feature value types with the largest entropy, and performing sparse processing on the feature matrices corresponding to the first U feature value types to obtain U sample sparse feature matrices; wherein the entropy S_u of the feature matrix M_u corresponding to each feature value type u is calculated by adopting formulas (1) to (3) (the formulas are not reproduced in this text), wherein (i, j) is the element number in the feature matrix and the element at (i, j) is the feature value at that position; and wherein the sparse processing to obtain the U sample sparse feature matrices is specifically: for the feature matrix M_u corresponding to each feature value type u, deleting part of the elements in the feature matrix M_u, so that when a sparse window slides arbitrarily in the feature matrix, any deleted element in the sparse window, boundary elements excepted, has at least XS adjacent elements which are not deleted; and setting the element value of each deleted element to 0;
Step S3, receiving data to be stored and dividing the data to obtain one or more data areas, obtaining the feature value of each data area, and constructing, by using the feature values, feature matrices of the data item corresponding to the different feature value types;
Step S4, for each data item having the same data type as the data to be stored, comparing the sample sparse feature matrices of the U feature value types associated with the data item with the current sparse feature matrices of the corresponding feature value types of the data to be stored; when a data item exists such that its sample sparse feature matrix for each feature value type is similar to the current sparse feature matrix of the corresponding feature value type, taking that data item as the data item of the data to be determined;
Step S5, performing feature value completion on the current sparse feature matrix of the data to be stored to obtain a feature completion matrix; wherein the storage and management of the big data are performed by using a storage device to store the collected data, establishing a corresponding database, and performing management and calling;
Step S6, acquiring a first data area corresponding to the current sparse feature matrix of the data to be stored, acquiring a second data area which is not the first data area in the data to be stored, acquiring a third data area corresponding to the sample sparse feature matrix of the data to be determined, acquiring a fourth data area which is not the third data area in the data to be determined, comparing data contents of data items in the third data area and the fourth data area, and entering Step S7 when the similarity of the third data area and the fourth data area is larger than a high similarity threshold;
Step S7, performing self-updating redundancy elimination and deleting the data to be stored.
2. The self-updating redundancy elimination method based on big data processing according to claim 1, wherein the data type to be stored comprises structured, semi-structured and/or unstructured big data.
3. The self-updating redundancy elimination method based on big data processing according to claim 2, wherein the feature value of each data area is one or more types, and each feature value type corresponds to a feature matrix.
4. The self-updating redundancy elimination method based on big data processing according to claim 3, wherein the dividing is a semantically independent data division.
5. An artificial intelligence public data platform based on big data processing self-updating redundancy elimination, characterized in that the artificial intelligence public data platform is used for realizing the self-updating redundancy elimination method based on big data processing according to any one of claims 1 to 4.
6. An artificial intelligence public data server for self-updating redundancy elimination based on big data processing, characterized in that it comprises a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the self-updating redundancy elimination method based on big data processing according to any one of claims 1 to 4.
7. A self-updating redundancy elimination system based on big data processing, characterized in that the system is used for realizing the self-updating redundancy elimination method based on big data processing according to any one of claims 1 to 4.
8. A computer readable storage medium comprising a program which, when run on a computer, causes the computer to perform the big data processing based self-updating redundancy elimination method of any of claims 1 to 4.
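
Step S2 of claim 1 (entropy ranking followed by sparse processing) can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed computation: formulas (1) to (3) are not available in this text, so Shannon entropy over normalized absolute element values is substituted for them; the checkerboard deletion pattern, the 4-neighbour adjacency, and the function names (`shannon_entropy`, `top_u_types`, `checkerboard_mask`, `apply_mask`, `satisfies_neighbour_rule`) are likewise hypothetical. The neighbour rule is checked over the whole matrix rather than per sliding window.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]
Mask = List[List[bool]]


def shannon_entropy(m: Matrix) -> float:
    """Shannon entropy (bits) of the normalized absolute element values.

    Assumption: stands in for formulas (1) to (3), which are not in the text.
    """
    values = [abs(v) for row in m for v in row]
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


def top_u_types(matrices: Dict[str, Matrix], u: int) -> List[str]:
    """Names of the u feature value types whose matrices have the largest entropy."""
    ranked = sorted(matrices, key=lambda t: shannon_entropy(matrices[t]), reverse=True)
    return ranked[:u]


def checkerboard_mask(rows: int, cols: int) -> Mask:
    """True marks an element to delete; one simple pattern among many possible."""
    return [[(i + j) % 2 == 0 for j in range(cols)] for i in range(rows)]


def apply_mask(m: Matrix, deleted: Mask) -> Matrix:
    """Set every deleted element to 0, as the claim requires."""
    return [[0.0 if d else v for v, d in zip(row, drow)]
            for row, drow in zip(m, deleted)]


def satisfies_neighbour_rule(deleted: Mask, xs: int) -> bool:
    """Check that each deleted non-boundary element has at least xs
    non-deleted 4-neighbours (xs plays the role of XS in the claim)."""
    rows, cols = len(deleted), len(deleted[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if deleted[i][j]:
                kept = sum(
                    not deleted[i + di][j + dj]
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                )
                if kept < xs:
                    return False
    return True
```

With a checkerboard pattern every deleted interior element keeps all four of its direct neighbours, so the rule holds for any XS up to 4 under the 4-neighbour assumption; other deletion patterns would trade sparsity against how much context remains for the later feature value completion.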

Description

Artificial intelligence public data platform based on big data processing self-updating redundancy elimination
[ Field of technology ]
The invention belongs to the technical field of big data, and particularly relates to an artificial intelligence public data platform based on big data processing self-updating redundancy elimination.
[ Background Art ]
The rapid development and wide application of artificial intelligence technology is pushing society into a new era of intelligence and digitization. With the continuous progress of technology and strong policy support, artificial intelligence is expected to play a key role in more fields and to promote the high-quality development of the economy and society. 'Big data' generally refers to data sets that are very large and difficult to collect, process, and analyze; it also refers to data that is stored in a traditional infrastructure for a long period of time. Big-data-oriented storage persists these huge data sets, which are difficult to collect, process, and analyze, into a computer. Big data storage and management requires a storage device to store the collected data, a corresponding database to be established, and management and calling to be carried out. With the explosive growth of data volumes, the storage and management requirements for big data are rapidly increasing, which presents new challenges to the performance, capacity, and architecture of storage systems. The complex data types of big data, including structured, semi-structured, and unstructured big data, make redundancy elimination a greater challenge.
In the traditional technology, different redundancy elimination strategies can be formulated according to factors such as the access frequency and the modification frequency of the data: a stricter strategy can be adopted for data with a higher access frequency, to reduce the storage and transmission of repeated data, and a more flexible strategy can be adopted for data with a higher modification frequency, to ensure the integrity and consistency of the data. In summary, self-updating redundancy elimination for big data raises new problems and challenges. Big data presents two notable characteristics during self-updating redundancy elimination: the data volume is large, and the data types are complex. Traditional redundancy elimination approaches are usually developed around one-by-one comparison of data items and fused comparison of different data types, and technical development has usually focused on improving data regularity and comparison algorithms. These approaches do not directly confront the large data volume in big data storage, and artificial intelligence technology is not used in the big data processing. To address these problems, the invention is suited to big data processing: it uses artificial intelligence technology to convert data content comparison into feature value comparison, combines feature value comparison with sparse data content comparison, and greatly improves the processing efficiency of self-updating redundancy elimination without reducing redundancy elimination accuracy.
[ Invention ]
In order to solve the above problems in the prior art, the present invention provides a self-updating redundancy elimination method based on big data processing and an artificial intelligence public data platform thereof, wherein the method comprises:
Step S1, dividing each data item stored in a public data platform to obtain one or more data areas, obtaining a feature value of each data area, and constructing a feature matrix corresponding to each data item by using the feature values;
Step S2, calculating the entropy of the feature matrix M_u corresponding to each feature value type, selecting from all feature value types the feature matrices corresponding to the first U feature value types with the largest entropy, and performing sparse processing on the feature matrices corresponding to the first U feature value types to obtain U sample sparse feature matrices; the entropy S_u of the feature matrix M_u corresponding to each feature value type u is calculated by adopting formulas (1) to (3) (the formulas are not reproduced in this text), wherein u = 1 to U, (i, j) is the element number in the feature matrix, and the element at (i, j) is the feature value at that position; for the feature matrix M_u corresponding to each feature value type u, part of the elements in the feature matrix M_u are deleted, so that when a sparse window slides arbitrarily in the feature matrix, any deleted element in the sparse window, boundary elements excluded, has at least XS adjacent elements which are not deleted;
Step S3, receiving data to be stored and dividing the data to obtain one or more data areas, obtaining the characteristic value of each data area, and constructing charact