CN-122019648-A - Big data export method and device based on script push-down and asynchronous compression transmission
Abstract
The invention relates to the technical field of big data processing, and in particular to a big data export method and device based on script push-down and asynchronous compression transmission. The query logic of an export task is packaged into an executable script that is executed on the database server, so that both the query and the file generation complete on the database server, which avoids the risk of memory overflow at its source. The server-local file is then compressed, so only the small initial script and the final compressed result file have to cross the network. The raw mass data is never streamed in real time, which reduces the size of the transferred file and saves network bandwidth. The client program does not need to block and wait while the task executes and can process other requests, and results are monitored and fetched asynchronously, which improves the overall throughput and resource utilization of the system.
Inventors
- PENG XIAOGANG
- LI JING
Assignees
- 武汉绿色网络股份有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-30
Claims (10)
- 1. A big data export method based on script push-down and asynchronous compression transmission, characterized by comprising the following steps: packaging query logic corresponding to an export task into an executable script, and transmitting the executable script to a database server; executing the executable script on the database server to drive a database client to execute the export task and write the export result into a server local file; compressing the server local file to obtain a compressed result file; and asynchronously monitoring the state of the export task on the database server, and pulling the compressed result file from the database server to the local machine after the export task is detected to be complete.
- 2. The big data export method based on script push-down and asynchronous compression transmission according to claim 1, wherein packaging the query logic corresponding to the export task into an executable script specifically comprises: the client program queries the system.parts system table of the database and obtains the partition key of the target table and the corresponding partition value range as partition information, wherein the partition key is a time field or a service-defined field, and partitions are divided by day or by month; and dynamically generating, based on the partition information, an executable script containing a loop structure, wherein the executable script comprises, for each partition, an independent query command, a temporary output path and a file format definition, the query command is clickhouse-client -q "SELECT * FROM target_table_{partition name} WHERE partition_key = '{partition name}' AND {query parameter}" --format CSV > /tmp/export_task_{task ID}_{partition name}.csv, and the file format is CSV.
- 3. The big data export method based on script push-down and asynchronous compression transmission according to claim 1, wherein executing the executable script on the database server specifically comprises: acquiring key performance indicators, including CPU load, memory utilization and the current number of concurrent queries, by querying the system.metrics table and the system.processes table of the database; presetting a safety threshold and an alert threshold for each of the key performance indicators; when all of the key performance indicators are below their corresponding safety thresholds, starting a plurality of clickhouse-client processes to execute the sliced queries at the maximum concurrency; when all of the key performance indicators exceed their corresponding safety thresholds but are below their alert thresholds, entering a throttling mode in which the number of concurrent queries is reduced to a preset value or a random delay is added between queries; and when any one of the key performance indicators exceeds its corresponding alert threshold, marking the corresponding export task as paused, recording a checkpoint, and resuming execution after re-evaluation in the next scheduling period.
- 4. The big data export method based on script push-down and asynchronous compression transmission according to claim 1, wherein compressing the server local file to obtain a compressed result file specifically comprises: analyzing file characteristics of the server local file, and dynamically selecting, based on the file characteristics, a target compression algorithm from a plurality of compression algorithms to compress the server local file, wherein the file characteristics comprise at least one of file count, text redundancy and file size.
- 5. The big data export method based on script push-down and asynchronous compression transmission according to claim 4, wherein dynamically selecting a target compression algorithm from a plurality of compression algorithms based on the file characteristics to compress the server local file specifically comprises: when the file count exceeds a count threshold or the text redundancy exceeds a redundancy threshold, compressing the server local file with the Zstandard algorithm or the gzip algorithm to obtain the compressed result file; and when the file count is less than or equal to a preset value and the size of each single file is below a size threshold, compressing the server local file with the LZ4 algorithm to obtain the compressed result file.
- 6. The big data export method based on script push-down and asynchronous compression transmission according to claim 1, further comprising, before compressing the server local file: identifying and selecting server local files whose number of data rows is below a row-count threshold; and merging the selected server local files until the number of data rows of the merged server local file reaches or exceeds the row-count threshold.
- 7. The big data export method based on script push-down and asynchronous compression transmission according to claim 1, wherein asynchronously monitoring the state of the export task on the database server, and pulling the compressed result file from the database server to the local machine after the export task is detected to be complete, specifically comprises: the client program starts a timed monitoring task and sets a monitoring period, wherein the monitored items comprise whether the compressed result file exists locally on the database server, whether the file size is stable, and whether the query process corresponding to the executable script has finished executing; and when all three conditions are met, the export task is judged to be complete, and the client program pulls the corresponding compressed result file to the local machine through a secure copy command.
- 8. The big data export method based on script push-down and asynchronous compression transmission according to claim 7, wherein the client program is a Java application program, the database server is a ClickHouse server, the executable script is a Shell script, and the database client is the clickhouse-client tool native to the ClickHouse server.
- 9. A big data export device based on script push-down and asynchronous compression transmission, characterized by comprising a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the big data export method based on script push-down and asynchronous compression transmission according to any one of claims 1-8.
- 10. A non-transitory computer storage medium storing computer-executable instructions which, when executed by one or more processors, perform the big data export method based on script push-down and asynchronous compression transmission according to any one of claims 1-8.
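The three-tier load rule recited in claim 3 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the metric names, the threshold values and the `choose_mode` helper are hypothetical, and the handling of the mixed case (only some indicators above their safety threshold), which the claim does not spell out, is an assumption marked in the comments.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full"          # run sliced queries at maximum concurrency
    THROTTLE = "throttle"  # reduce concurrency or add a random delay
    PAUSE = "pause"        # record a checkpoint, wait for the next period


def choose_mode(metrics, safety, alert):
    """Map current load readings to an execution mode.

    All three arguments are dicts keyed by metric name. The names and
    threshold values used below are hypothetical, not from the patent.
    """
    # Any indicator above its alert threshold pauses the export task.
    if any(metrics[name] > alert[name] for name in metrics):
        return Mode.PAUSE
    # Every indicator below its safety threshold allows full concurrency.
    if all(metrics[name] < safety[name] for name in metrics):
        return Mode.FULL
    # Assumption: everything in between is throttled, including the mixed
    # case where only some indicators exceed their safety threshold.
    return Mode.THROTTLE


# Hypothetical thresholds for illustration only.
safety = {"cpu_load": 0.60, "memory_usage": 0.70, "concurrent_queries": 20}
alert = {"cpu_load": 0.85, "memory_usage": 0.90, "concurrent_queries": 50}

reading = {"cpu_load": 0.30, "memory_usage": 0.40, "concurrent_queries": 5}
print(choose_mode(reading, safety, alert).name)  # FULL
```

Checking the alert condition first means a single overloaded indicator always wins over the other two rules, which matches the "any one of the key performance indicators" wording of the pause condition.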
Description
Big data export method and device based on script push-down and asynchronous compression transmission
Technical Field
The invention relates to the technical field of big data processing, and in particular to a big data export method and device based on script push-down and asynchronous compression transmission.
Background
In data processing, it is often necessary to export large query result sets from a database to files. Conventional export methods based on Java applications, such as those using JDBC (Java Database Connectivity), typically work as follows: the application issues a query to the database, the database returns a result set, and the application reads the data row by row through an I/O stream and writes it to a local file. When handling large data volumes (e.g., hundreds of millions of rows), this approach has the following serious drawbacks:
- The Java program must create and maintain database connections and I/O streams. When the data volume is extremely large, the result set continuously occupies a large amount of memory on the database side, during network transmission and on the Java application side, so the Java virtual machine is highly prone to memory overflow, which crashes the program.
- The whole result set must be transmitted from the database server to the application server over the network, which occupies a large amount of network bandwidth and makes the transfer slow.
- The exported files are ultimately written to the disk of the application server, and writing large amounts of data consumes valuable disk I/O resources, potentially degrading other core services on the application server.
- Traditional export is a synchronous or semi-synchronous process: the application must keep waiting for the database to return data and for the writes to complete, so resources are held for a long time and system throughput is low.
In view of this, overcoming these drawbacks of the prior art is a problem to be solved in the art.
Disclosure of Invention
The technical problems to be solved by the invention are those of traditional bulk export of big data: the high risk of client memory overflow, heavy occupation of network resources, high disk I/O pressure on the application server, low system throughput caused by synchronous blocking, and the inconvenience of managing small files. Solving them makes it possible to export mass data efficiently and with low resource consumption.
The invention adopts the following technical scheme:
The first aspect provides a big data export method based on script push-down and asynchronous compression transmission, comprising: packaging query logic corresponding to an export task into an executable script, and transmitting the executable script to a database server; executing the executable script on the database server to drive a database client to execute the export task and write the export result into a server local file; compressing the server local file to obtain a compressed result file; and asynchronously monitoring the state of the export task on the database server, and pulling the compressed result file from the database server to the local machine after the export task is detected to be complete.
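As an illustration of the final monitoring step, the completion test can be sketched in Python using the three conditions listed in claim 7 (the compressed file exists, its size is stable, and the script's query process has finished). The `Observation` type and `export_finished` helper are hypothetical, how each sample is collected from the database server is deliberately left open, and treating "stable" as "unchanged between two consecutive polls" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One polling sample taken by the client's timed monitoring task."""
    file_size: Optional[int]  # bytes, or None if the file does not exist yet
    process_running: bool     # True while the script's query process is alive


def export_finished(previous: Optional[Observation],
                    current: Observation) -> bool:
    """Return True only when all three completion conditions hold."""
    # Condition 1: the compressed result file exists on the server.
    exists = current.file_size is not None
    # Condition 2 (assumption): the size is stable if it did not change
    # between two consecutive polls; the first poll can never be stable.
    stable = (exists and previous is not None
              and previous.file_size == current.file_size)
    # Condition 3: the query process behind the script has exited.
    process_done = not current.process_running
    return exists and stable and process_done


# Two consecutive polls with an unchanged size and no running process.
first = Observation(file_size=1_048_576, process_running=False)
second = Observation(file_size=1_048_576, process_running=False)
print(export_finished(first, second))  # True
```

Requiring all three conditions together guards against pulling a file that is still being written: a file that exists but is still growing, or whose producing process is still alive, is not treated as complete.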
Preferably, packaging the query logic corresponding to the export task into an executable script specifically includes: the client program queries the system.parts system table of the database and obtains the partition key of the target table and the corresponding partition value range as partition information, wherein the partition key is a time field or a service-defined field, and partitions are divided by day or by month; and dynamically generating, based on the partition information, an executable script containing a loop structure, wherein the executable script comprises, for each partition, an independent query command, a temporary output path and a file format definition, the query command is clickhouse-client -q "SELECT * FROM target_table_{partition name} WHERE partition_key = '{partition name}' AND {query parameter}" --format CSV > /tmp/export_task_{task ID}_{partition name}.csv, and the file format is CSV.
Preferably, executing the executable script on the database server specifically includes: acquiring key performance indicators, including CPU load, memory utilization and the current number of concurrent queries, by querying the system.metrics table and the system.processes table of the database; presetting a safety threshold and an alert threshold for each of the key performance indicators; when all of the key performance indicators are below their corresponding safety thresholds, starting a plurality of clickhouse-client processes to execute the sliced queries at the maximum concurrency; when all of the key performance indicators exceed their corresponding safety thresholds but are lo