Search

CN-122026921-A - Materialized compression coding method for monotonically preserving order of short character strings

CN122026921ACN 122026921 ACN122026921 ACN 122026921ACN-122026921-A

Abstract

The invention discloses a materialized compression coding method of monotonous order-preserving of short character strings, which carries out byte inversion, length information embedding, logical right shift and high-order mask shielding on original variable-length character strings with limited lengths, and the method is used for nondestructively mapping the original variable-length character strings into an integer value with fixed length and strictly ensuring that dictionary orders of character strings before and after mapping are completely consistent with numerical orders of the integers; the fixed-length integer code is used for directly replacing the original character string to participate in the sequencing, aggregation and comparison operation of the database, so that the problems of branch prediction failure, discontinuous memory access, vectorization difficulty and the like caused by processing variable-length data in the traditional scheme are greatly avoided, and the CPU execution efficiency and the cache utilization rate are greatly improved while the memory occupation is greatly reduced.

Inventors

  • TENG JIANPING
  • YI GUOLEI
  • WANG MENG
  • LIAN LINJIANG
  • XIAO KANG
  • MA RUYUE
  • YANG YONGQIANG
  • NIU XIANHUI
  • CHEN MINGYU

Assignees

  • 北京飞轮数据科技有限公司

Dates

Publication Date
20260512
Application Date
20260203

Claims (10)

  1. 1. The materialized compression coding method for monotonically preserving the order of the short character strings is characterized by comprising the following steps of: s1, receiving original variable-length character string data, and judging whether the length of the original variable-length character string is smaller than a preset threshold value or not; S2, for short character strings with the length smaller than a preset threshold value, executing a fixed-length integer coding process, wherein the short character strings are subjected to lossless mapping to form an integer coding value with the fixed length, and dictionary sequence consistency between the short character strings and the integer coding value is strictly maintained in the mapping process, wherein the bit width of the integer coding value is determined according to a target integer type; s3, the fixed-length integer coding process specifically comprises the following steps: s31, sequentially filling all bytes forming the short character string into corresponding byte bits from high order to low order of the target integer according to the reverse sequence from the tail end to the head of the character string to obtain a first intermediate integer; s32, acquiring an original length value of the short string, performing preset arithmetic transformation on the original length value, and storing the original length value into at least one least significant byte bit of the first intermediate integer to obtain a second intermediate integer; S33, performing at least one-bit logical right shift operation on the second intermediate integer to obtain a third intermediate integer; S34, carrying out bitwise operation on the third intermediate integer and a preset mask value, wherein the mask value is used for shielding sign bits and irrelevant high bits of the target integer type to obtain a final integer coding value; s4, storing the integer code value in a memory to replace the original variable-length character string and the associated offset array thereof for subsequent sorting, aggregation or comparison operation.
  2. 2. The materialized compression coding method of monotonically preserving order of short strings according to claim 1, wherein the decoding step of the method comprises: a1, when the original character string needs to be restored, reading an integer code value; A2, extracting information in least significant byte bits of integer code values, and obtaining the length of an original short string through inverse transformation; A3, performing logical left shift operation opposite to the step S33 on the integer code value to obtain a recovered byte sequence; A4, according to the extracted length, starting from the high-order bytes of the recovered byte sequence, copying the corresponding number of bytes in sequence, and reconstructing the original character string data.
  3. 3. The method of claim 1, wherein in step S31, a byte order of the character string is reversed filled in a small-end byte order storage format, so that byte bits corresponding to the preceding characters in the character string are ensured to be at a higher comparison weight position in a numerical comparison order of the target integer, thereby maintaining a monotonic order-preserving relationship between the character string dictionary order and the integer numerical order.
  4. 4. The method according to claim 1, wherein the step S32 is characterized in that the predetermined arithmetic conversion includes shifting the original length value by one bit, and storing the length value shifted by one bit to the lowest at least one byte of the first intermediate integer, such that the length information occupies a part of the lower bits of the first intermediate integer and does not cover the higher character information obtained by the byte inversion stuffing.
  5. 5. The method of claim 1, wherein in step S34, the predetermined mask value is a bit pattern of a maximum positive integer that can be represented by the target integer type.
  6. 6. The method of claim 1, wherein in step S34, the bit-wise AND operation is used to force zero to the sign bit of the third intermediate integer and the high position beyond the information carrying range of the character string byte, ensure that the integer code value is a non-negative fixed-length integer, and ensure the consistency of the code results across different computing platforms.
  7. 7. The method for materialized compression encoding of a monotonically preserving order of short strings according to claim 1, wherein the predetermined threshold is less than or equal to 16 bytes.
  8. 8. The method of claim 5, wherein the target integer type is SMALLINT, INT, BIGINT or LARGEINT, the byte length of the target integer type corresponds to 2 bytes, 4 bytes, 8 bytes or 16 bytes, and the byte length of the selected integer type is not less than the original length of the character string to be encoded.
  9. 9. The method for materialized compression encoding of single-tone order preservation of short strings according to claim 1, wherein the method is applied to an execution engine of a massive parallel processing database system, and is particularly used for vectorizing, sorting, aggregating or grouping strings, and integer encoding values directly participate in comparison operation, hash operation or modular operation based on integers.
  10. 10. The method of claim 9, wherein the sorting operation directly compares the values of the integer codes corresponding to the plurality of strings, the comparison result is equivalent to the dictionary sequence comparison result of the original strings, and the aggregation operation directly uses the integer codes as keys to perform hash calculation and barreling.

Description

Materialized compression coding method for monotonically preserving order of short character strings Technical Field The invention relates to the technical field of database query processing and data storage, in particular to a materialized compression coding method for monotonically preserving order of short character strings. Background In current database systems, particularly in the MPP database based on columnar storage, the processing of variable-length string type data is usually represented by its original bytes, and common storage schemes use a double-array structure of "byte arrays" in combination with "offset arrays" to store string content continuously. In order, aggregation, etc., the execution engine needs to directly access and manipulate the original byte sequence of these variable-length strings to complete logical decisions and groupings by comparing or calculating full string hash values byte by byte. However, the above manner of directly operating variable-length character string original data requires dynamic boundary judgment in the process of comparing or calculating hash because of unfixed character string length, which leads to frequent conditional branching, so that a branch predictor of a modern CPU is very easy to fail, seriously interrupts an instruction pipeline, and reduces execution efficiency. Meanwhile, due to the discontinuous and unaligned storage characteristics of the variable-length character strings in the memory, the memory access mode is irregular, a prefetching mechanism of a CPU cache line (CACHE LINE) cannot be effectively utilized, a large number of cache misses are caused, and the memory access delay is increased. These factors together limit the processing potential of the CPU, and in particular the Single Instruction Multiple Data (SIMD) unit within the CPU, making string-related operations difficult to implement for efficient vectorized execution. Therefore, the invention provides a materialized compression coding method for monotonically preserving the order of short character strings aiming at the problems. Disclosure of Invention In order to solve the problems of high branch prediction failure rate and low memory access efficiency caused by directly processing variable-length character strings in the prior art, which limit the vectorization capability of a CPU, the invention provides a short-character-string-oriented efficient materialized compression coding method, which is used for mapping original variable-length character strings into a fixed-length integer without damage and strictly maintaining dictionary sequence consistency in the coding process, wherein the integer can be directly used for sorting, hashing, comparing and other operations, thereby greatly improving the execution efficiency. The invention adopts the technical scheme that the materialized compression coding method for monotonically preserving the order of the short character strings comprises the following steps: S1, receiving original variable-length character string data, and judging whether the length of the original variable-length character string is smaller than or equal to 16 bytes; S2, for short character strings with the length less than 16 bytes, performing a fixed-length integer coding process, wherein the short character strings are subjected to lossless mapping to form an integer coding value with the fixed length, and dictionary sequence consistency between the short character strings and the integer coding value is strictly maintained in the mapping process, wherein the bit width of the integer coding value is determined according to a target integer type; s3, the fixed-length integer coding process specifically comprises the following steps: s31, sequentially filling all bytes forming the short character string into corresponding byte bits from high order to low order of the target integer according to the reverse sequence from the tail end to the head of the character string to obtain a first intermediate integer; S32, acquiring an original length value of a short character string, shifting the original length value by one bit leftwards, and storing the length value shifted leftwards into at least one byte of the lowest of a first intermediate integer, so that the length information occupies part of low bits of the first intermediate integer and high-order character information obtained by byte inversion filling is not covered, and a second intermediate integer is obtained; S33, performing at least one-bit logical right shift operation on the second intermediate integer to obtain a third intermediate integer; S34, carrying out bitwise operation on the third intermediate integer and a preset mask value, wherein the mask value is used for shielding sign bits and irrelevant high bits of the target integer type to obtain a final integer coding value; s4, storing the integer code value in a memory to replace the original variable-length character string and the associate