CN-117194352-B - Data compression method, device, electronic equipment and readable storage medium

Abstract

The application discloses a data compression method, a data compression device, electronic equipment and a readable storage medium, and belongs to the technical field of data processing. The method comprises: obtaining N data records, wherein each data record comprises a plurality of field values and each field value corresponds to one field; constructing a data dictionary according to the repetition frequency of the field values corresponding to the same field in the N data records, wherein the field values of each field are stored in the data dictionary without repetition; determining the position index of each field value of each data record in the data dictionary, and converting each field value of each data record into the corresponding position index to obtain a position index record corresponding to each data record; and storing the data dictionary and the position index record corresponding to each data record. The scheme provided by the application can solve the problems of higher computational complexity and lower compression efficiency in existing compression techniques.
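The steps summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names (`build_dictionary`, `compress`, `decompress`), the use of 0-based position indexes, and the first-seen tie-breaking between equally frequent values are all choices made here for illustration.

```python
from collections import Counter


def build_dictionary(records):
    """Build one dictionary row per field.

    Each row holds the distinct values of that field, ordered from the
    most frequently repeated value to the least (assumed ordering).
    """
    num_fields = len(records[0])
    dictionary = []
    for f in range(num_fields):
        counts = Counter(rec[f] for rec in records)
        # most_common() sorts by descending count; equal counts keep
        # first-seen order (an arbitrary tie-break chosen for this sketch).
        dictionary.append([value for value, _ in counts.most_common()])
    return dictionary


def compress(records):
    """Return (dictionary, position index records) for the given records."""
    dictionary = build_dictionary(records)
    # Per-field lookup table: field value -> column number in its row.
    lookup = [{v: i for i, v in enumerate(row)} for row in dictionary]
    index_records = [
        [lookup[f][value] for f, value in enumerate(rec)] for rec in records
    ]
    return dictionary, index_records


def decompress(dictionary, index_records):
    """Rebuild the original records by looking indexes up in the dictionary."""
    return [
        [dictionary[f][i] for f, i in enumerate(idx)] for idx in index_records
    ]


if __name__ == "__main__":
    sample = [
        ["cn", "video", "hd"],
        ["cn", "music", "hd"],
        ["us", "video", "hd"],
        ["cn", "video", "sd"],
    ]
    d, idx = compress(sample)
    print(d)    # one row of distinct values per field
    print(idx)  # one position index record per data record
    assert decompress(d, idx) == sample
```

Because frequent values sit at the front of each row, the most common field values map to the smallest indexes, which is what makes the later difference encoding against a reference record compact.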

Inventors

  • LI LIN
  • XU MINGWEI
  • LI XIAOHAI
  • ZHENG BINGE
  • CHEN HUI

Assignees

  • 咪咕文化科技有限公司
  • 中国移动通信集团有限公司

Dates

Publication Date
20260512
Application Date
20230907

Claims (10)

  1. A data compression method, comprising: acquiring N data records, wherein each data record comprises a plurality of field values, and each field value corresponds to one field; constructing a data dictionary according to the repetition frequency of field values corresponding to the same field in the N data records, wherein the field values of each field are stored in the data dictionary without repetition; determining the position index of each field value of each data record in the data dictionary, and converting each field value of each data record into the corresponding position index to obtain a position index record corresponding to each data record; and storing the data dictionary and the position index record corresponding to each data record; wherein, after the acquiring of the N data records and before the storing of the data dictionary and the position index record corresponding to each data record, the method further comprises: determining a reference data record according to the repetition frequency of each field value of each data record in the corresponding field, wherein the reference data record is the data record, among the N data records, with the highest sum of the repetition frequencies of its field values; wherein, after the determining of the position index of each field value of each data record in the data dictionary and before the storing of the data dictionary and the position index record corresponding to each data record, the method further comprises: constructing an index area according to difference information between the position index record corresponding to the reference data record and the position index records corresponding to the other data records, wherein the index area stores the position index record corresponding to the reference data record and the difference index records of the other data records relative to the reference data record, and the other data records are the data records other than the reference data record among the N data records; and wherein the storing of the data dictionary and the position index record corresponding to each data record comprises: storing the data dictionary and the index area.
  2. The method of claim 1, wherein the field values of the N data records that correspond to the same field are stored, without repetition, in the same row of the data dictionary, and the field values of the same field stored in each row of the data dictionary are ordered according to repetition frequency; and the determining of the position index of each field value of each data record in the data dictionary comprises: determining, according to the row of the data dictionary that corresponds to the field of each field value of each data record, the column number of each field value in that row as its position index.
  3. The method of claim 2, wherein the field values of the same field stored in each row of the data dictionary are ordered by repetition frequency from high to low.
  4. The method of claim 1, wherein the determining of the reference data record according to the repetition frequency of each field value of each data record in the corresponding field comprises: constructing a dictionary tree corresponding to the N data records by taking each field value of each data record as a node, and merging repeated field values of the same field into the same node in the process of constructing the dictionary tree, wherein the dictionary tree comprises N paths, and each path corresponds to one data record among the N data records; acquiring the repetition frequency of each node in each path, and determining the sum of the repetition frequencies of the nodes in each path; and determining the data record corresponding to the path with the highest sum of the repetition frequencies of its nodes among the N paths as the reference data record.
  5. The method according to claim 1, wherein the constructing of an index area according to difference information between the position index record corresponding to the reference data record and the position index records corresponding to the other data records comprises: determining a target field whose position index differs between the position index record corresponding to a first data record and the position index record corresponding to the reference data record, and determining the position index difference corresponding to the target field, wherein the first data record is any one of the other data records; and writing the sequence number of the target field and the position index difference into the index area in association with each other.
  6. The method of claim 5, wherein the writing of the sequence number of the target field and the position index difference into the index area in association with each other comprises: in the case that the position index differences corresponding to a plurality of target fields in the first data record are the same, writing the plurality of target fields and a single position index difference into the index area in association with each other.
  7. A data decompression method, comprising: obtaining compressed data, wherein the compressed data is obtained by compressing N data records by the data compression method according to any one of claims 1 to 6, and the compressed data comprises a data dictionary corresponding to the N data records and a position index record corresponding to each data record; and querying each field value of each data record from the data dictionary according to the position index record corresponding to each data record to obtain each data record; wherein, in the case that the compressed data comprises a data dictionary and an index area, the querying of each field value of each data record from the data dictionary according to the position index record corresponding to each data record to obtain each data record comprises: obtaining the position index records corresponding to the other data records according to the position index record corresponding to the reference data record stored in the index area and the difference index records of the other data records relative to the reference data record, wherein the other data records are the data records other than the reference data record among the N data records; and, according to the position index record corresponding to the reference data record and the position index records corresponding to the other data records, querying each field value of the reference data record from the data dictionary to obtain the reference data record, and querying each field value of each of the other data records from the data dictionary to obtain the other data records.
  8. A data compression apparatus, comprising: a data acquisition module, configured to acquire N data records, wherein each data record comprises a plurality of field values, and each field value corresponds to one field; a dictionary construction module, configured to construct a data dictionary according to the repetition frequency of the field values corresponding to the same field in the N data records, wherein the field values of each field are stored in the data dictionary without repetition; a position index module, configured to determine the position index of each field value of each data record in the data dictionary, and to convert each field value of each data record into the corresponding position index to obtain a position index record corresponding to each data record; and a record storage module, configured to store the data dictionary and the position index record corresponding to each data record; wherein the data acquisition module further comprises: a reference determination sub-module, configured to determine a reference data record according to the repetition frequency of each field value of each data record in the corresponding field, wherein the reference data record is the data record, among the N data records, with the highest sum of the repetition frequencies of its field values; the position index module further comprises: an index area sub-module, configured to construct an index area according to difference information between the position index records corresponding to the other data records and the position index record corresponding to the reference data record, wherein the index area stores the position index record corresponding to the reference data record and the difference index records of the other data records relative to the reference data record, and the other data records are the data records other than the reference data record among the N data records; and the record storage module comprises: a storage sub-module, configured to store the data dictionary and the index area.
  9. An electronic device, comprising a processor, a memory, and a program or instruction stored on the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the steps of the data compression method of any one of claims 1 to 6 or the data decompression method of claim 7.
  10. A readable storage medium, wherein a program or instruction is stored on the readable storage medium, and the program or instruction, when executed by a processor, implements the steps of the data compression method according to any one of claims 1 to 6 or the data decompression method according to claim 7.
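The reference record and index area described in claims 1, 5, 6 and 7 can be illustrated with the following Python sketch. It is a hedged illustration, not the claimed implementation: the names `choose_reference`, `build_index_area` and `restore_index_records` and the layout of the index area (a plain dict) are hypothetical. Note also that the reference record is chosen here with simple per-field counters instead of the dictionary tree of claim 4, and that ties are broken by taking the first record; both are simplifications made for this example.

```python
from collections import Counter


def choose_reference(records):
    """Return the index of the record whose field values have the highest
    summed repetition frequency (per-field counters stand in for the
    dictionary tree of claim 4; ties go to the earliest record)."""
    num_fields = len(records[0])
    counts = [Counter(rec[f] for rec in records) for f in range(num_fields)]
    scores = [sum(counts[f][v] for f, v in enumerate(rec)) for rec in records]
    return max(range(len(records)), key=scores.__getitem__)


def build_index_area(index_records, ref):
    """Keep the reference position index record in full; for every other
    record keep only {index difference: [field numbers]} for the fields
    that differ from the reference (fields sharing a difference are grouped)."""
    base = index_records[ref]
    diffs = {}
    for r, idx in enumerate(index_records):
        if r == ref:
            continue
        grouped = {}
        for f, (i, b) in enumerate(zip(idx, base)):
            if i != b:
                grouped.setdefault(i - b, []).append(f)
        diffs[r] = grouped
    return {"ref": ref, "base": list(base), "diffs": diffs}


def restore_index_records(area, n):
    """Rebuild all n position index records from the index area."""
    restored = []
    for r in range(n):
        rec = list(area["base"])
        for diff, fields in area["diffs"].get(r, {}).items():
            for f in fields:
                rec[f] += diff
        restored.append(rec)
    return restored


if __name__ == "__main__":
    records = [
        ["cn", "video", "hd"],
        ["cn", "music", "hd"],
        ["us", "video", "hd"],
        ["us", "music", "sd"],
    ]
    # Position index records as a dictionary step would produce them
    # (0-based column numbers; an assumption of this sketch).
    index_records = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
    ref = choose_reference(records)
    area = build_index_area(index_records, ref)
    assert restore_index_records(area, len(records)) == index_records
```

Decompression then proceeds as in claim 7: the restored position index records are looked up in the data dictionary to recover the field values.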

Description

Data compression method, device, electronic equipment and readable storage medium

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a data compression method, a data compression device, electronic equipment and a readable storage medium.

Background

To save data storage space, data is typically compressed before it is stored. In the prior art, a dictionary-based compression algorithm, such as the LZ77 compression algorithm, is generally adopted for data compression and storage, so that the disk space occupied by the stored data is reduced. A dictionary-based compression algorithm essentially works by continuously searching for common portions of the data and then replacing each common portion with a symbol. Because the common portions must be searched for continually and replaced with symbols, such compression has higher computational complexity and lower compression efficiency.

Disclosure of Invention

The embodiments of the application provide a data compression method, a device, electronic equipment and a readable storage medium, which can solve the problems of higher computational complexity and lower compression efficiency of existing compression algorithms.

In a first aspect, an embodiment of the present application provides a data compression method, including: acquiring N data records, wherein each data record comprises a plurality of field values, and each field value corresponds to one field; constructing a data dictionary according to the repetition frequency of field values corresponding to the same field in the N data records, wherein the field values of each field are stored in the data dictionary without repetition; determining the position index of each field value of each data record in the data dictionary, and converting each field value of each data record into the corresponding position index to obtain a position index record corresponding to each data record; and storing the data dictionary and the position index record corresponding to each data record.

Optionally, after the acquiring of the N data records and before the storing of the data dictionary and the position index record corresponding to each data record, the method further includes: determining a reference data record according to the repetition frequency of each field value of each data record in the corresponding field, wherein the reference data record is the data record, among the N data records, with the highest sum of the repetition frequencies of its field values. After the determining of the position index of each field value of each data record in the data dictionary and before the storing of the data dictionary and the position index record corresponding to each data record, the method further includes: constructing an index area according to difference information between the position index record corresponding to the reference data record and the position index records corresponding to the other data records, wherein the index area stores the position index record corresponding to the reference data record and the difference index records of the other data records relative to the reference data record, and the other data records are the data records other than the reference data record among the N data records. The storing of the data dictionary and the position index record corresponding to each data record includes: storing the data dictionary and the index area.

Optionally, the field values of the N data records that correspond to the same field are stored, without repetition, in the same row of the data dictionary, and the field values of the same field stored in each row of the data dictionary are ordered according to repetition frequency. The determining of the position index of each field value of each data record in the data dictionary includes: determining, according to the row of the data dictionary that corresponds to the field of each field value of each data record, the column number of each field value in that row as its position index.

Optionally, the field values of the same field stored in each row of the data dictionary are ordered by repetition frequency from high to low.

Optionally, the determining of the reference data record according to the repetition frequency of each field value of each data record in the corresponding field includes: constructing a dictionary tree corresponding to the N data records by taking each field value of each data record as a node, and merging repeated field values of the same field into the same node in the process of constructing the dictionary tree, wherein the dictionary tree comprises N paths, and each path corresponds to one data record among the N data records; acquiring the repetition frequency of each node in each path, and determining the sum of the repetition frequencies of the nodes in each path; and determining the data record corresponding to the path with the highest sum of