CN-121996712-A - Data processing method and data processing apparatus


Abstract

The present disclosure relates to the field of big data technologies, and in particular to a data processing method and a data processing apparatus. The method comprises: when a preset condition for executing a scanning operation is met, obtaining, through a metadata service, at least one table contained in a target metadata database and the partition file directory of each table; traversing the actual files in each partition file directory according to a preconfigured scanning mode, and obtaining file meta-information of each traversed actual file through the metadata service, wherein the file meta-information comprises one or more of an actual modification time and an actual data volume, and the scanning mode is either full scanning or incremental scanning; and generating a scan completion report corresponding to the scanning operation based on the file meta-information of each actual file.

Inventors

GONG TIANCHENG
ZHANG JIKUAN
SHI XIUTAO

Assignees

聚好看科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20251212

Claims (10)

1. A data processing apparatus, comprising: a communicator configured to communicate with a scheduler; and a controller configured to: query, based on a metadata service, at least one table contained in a target metadata database and a partition file directory of each table when a preset condition for executing a scanning operation is met; traverse the actual files in each partition file directory according to a preconfigured scanning mode, and acquire file meta-information of each traversed actual file through the metadata service, wherein the file meta-information comprises one or more of an actual modification time and an actual data volume, and the scanning mode comprises either full scanning or incremental scanning; and generate a scan completion report corresponding to the scanning operation based on the file meta-information of each actual file.
2. The data processing apparatus according to claim 1, wherein the controller, when querying the at least one table contained in the target metadata database and the partition file directory of each table based on the metadata service when the preset condition for executing the scanning operation is met, is further configured to: upon receiving a call request for calling a target application programming interface, determine that the preset condition for executing the scanning operation is met, and query, based on the metadata service, the at least one table contained in the target metadata database and the partition file directory of each table.
3. The data processing apparatus according to claim 1, wherein the controller, when querying the at least one table contained in the target metadata database and the partition file directory of each table based on the metadata service when the preset condition for executing the scanning operation is met, is further configured to: when a timed scanning task is triggered at the current moment, determine that the preset condition for executing the scanning operation is met, and query, based on the metadata service, the at least one table contained in the target metadata database and the partition file directory of each table.
4. The data processing apparatus according to claim 1, wherein the file meta-information includes the actual modification time; and the controller, when traversing the actual files in each partition file directory according to the preconfigured scanning mode and acquiring the file meta-information of each traversed actual file through the metadata service, is further configured to: traverse the actual files in each partition file directory according to the preconfigured scanning mode, and acquire the actual modification time of each traversed actual file through the metadata service.
5. The data processing apparatus according to claim 4, wherein the file meta-information further includes the actual data volume; and the controller is further configured to: determine the actual data volume of each actual file based on a historical scan time of the last scanning operation performed on the target metadata database and the actual modification time of each actual file.
6. The data processing apparatus according to claim 5, wherein the controller, when determining the actual data volume of each actual file based on the historical scan time of the last scanning operation performed on the target metadata database and the actual modification time of each actual file, is further configured to: for each actual file, when the actual modification time is later than the historical scan time of the last scanning operation performed on the target metadata database, acquire the file format of the actual file, calculate a theoretical data volume according to a data calculation method corresponding to the file format, and determine the theoretical data volume to be the actual data volume of the actual file.
7. The data processing apparatus according to claim 6, wherein the controller is further configured to: when the actual modification time is earlier than the historical scan time of the last scanning operation performed on the target metadata database, determine the actual data volume of the actual file to be the historical data volume recorded for that file when the last scanning operation was performed on the target metadata database.
8. The data processing apparatus according to claim 5, wherein the historical scan time is 0 when the last scanning mode is full scanning; and when the last scanning mode is incremental scanning, the historical scan time is the actual scan time at which the last scanning operation was performed.
9. The data processing apparatus according to any one of claims 1 to 8, wherein the scan completion report includes one or more of: a table name of the at least one table, a file name of each actual file in the partition file directory of each table, the actual data volume of each actual file, and a current scan time at which the scanning operation is currently performed.
10. A data processing method, comprising: when a preset condition for executing a scanning operation is met, acquiring, through a metadata service, at least one table contained in a target metadata database and a partition file directory of each table; traversing the actual files in each partition file directory according to a preconfigured scanning mode, and acquiring file meta-information of each traversed actual file through the metadata service, wherein the file meta-information comprises one or more of an actual modification time and an actual data volume, and the scanning mode comprises either full scanning or incremental scanning; and generating a scan completion report corresponding to the scanning operation based on the file meta-information of each actual file.
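
The decision logic recited in claims 5 to 8 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `FileMeta`, `history_scan_time`, `actual_data_volume`, and the `calculators` mapping are hypothetical, and treating a modification time equal to the historical scan time as "unchanged" is an assumption, since the claims only cover the "later than" and "earlier than" cases.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FileMeta:
    """Hypothetical container for the file meta-information of one actual file."""
    path: str
    modified_at: float  # actual modification time, in epoch seconds
    file_format: str    # e.g. "text", "orc", "parquet"


def history_scan_time(last_mode: str, last_scan_at: float) -> float:
    """Claim 8 as written: 0 after a full scan, otherwise the last actual scan time."""
    return 0.0 if last_mode == "full" else last_scan_at


def actual_data_volume(
    meta: FileMeta,
    history_time: float,
    historical_volumes: Dict[str, int],
    calculators: Dict[str, Callable[[str], int]],
) -> int:
    """Claims 6 and 7: recompute changed files, reuse the stored value otherwise."""
    if meta.modified_at > history_time:
        # Modified after the last scan: use the format-specific calculation
        # (hypothetical callables supplied by the caller).
        return calculators[meta.file_format](meta.path)
    # Not modified since the last scan (equality handled here by assumption):
    # reuse the data volume recorded by the previous scan.
    return historical_volumes[meta.path]
```

With a historical scan time of 0, every file whose modification time is positive is recomputed, which is how the same comparison also covers the full-scan case.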

Description

Data processing method and data processing apparatus

Technical Field

The present disclosure relates to the field of big data technologies, and in particular to a data processing method and a data processing apparatus.

Background

Hive is a data warehouse tool in the Hadoop ecosystem. Its metadata (table structures and the like) is stored in a relational database (such as MySQL), while the actual data files are stored in a distributed file system (such as the Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (S3)). Data volume statistics is a core task of data governance and management: table sizes, partition changes, and data growth trends need to be monitored. Currently, the mainstream approach in the industry is to execute Structured Query Language (SQL) statements (for example, select count(x)) to count data. This approach has a significant drawback: executing SQL requires starting a Hive or Spark job, which consumes a large amount of computing resources (central processing unit (CPU) and memory). In batch processing scenarios in particular, this easily causes resource contention, so that other computing tasks are blocked. Therefore, reducing the resource consumption of executing SQL in Hive data volume statistics scenarios is an urgent problem to be solved.

Disclosure of Invention

In order to solve the technical problems described above, the present disclosure provides a data processing method and a data processing apparatus.
In a first aspect, the present disclosure provides a data processing apparatus, comprising: a communicator configured to communicate with a scheduler; and a controller configured to: query, based on a metadata service, at least one table contained in a target metadata database and a partition file directory of each table when a preset condition for executing a scanning operation is met; traverse the actual files in each partition file directory according to a preconfigured scanning mode, and acquire file meta-information of each traversed actual file through the metadata service, wherein the file meta-information comprises one or more of an actual modification time and an actual data volume, and the scanning mode comprises either full scanning or incremental scanning; and generate a scan completion report corresponding to the scanning operation based on the file meta-information of each actual file.

In a second aspect, the present disclosure provides a data processing method, comprising: when a preset condition for executing a scanning operation is met, obtaining, through a metadata service, at least one table contained in a target metadata database and a partition file directory of each table; traversing the actual files in each partition file directory according to a preconfigured scanning mode, and obtaining file meta-information of each traversed actual file through the metadata service, wherein the file meta-information comprises one or more of an actual modification time and an actual data volume, and the scanning mode comprises either full scanning or incremental scanning; and generating a scan completion report corresponding to the scanning operation based on the file meta-information of each actual file.

In a third aspect, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a controller to perform the data processing method of the second aspect.
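
The traversal-and-report flow of the first and second aspects can be sketched in Python as follows. This is only an illustration under stated assumptions: a local directory tree stands in for HDFS or S3, the table-to-partition mapping is passed in directly instead of being queried from a metadata service, byte size stands in for the data volume, and the function name `scan_partitions` and the report layout are hypothetical.

```python
import os
import time
from typing import Dict, List


def scan_partitions(partition_dirs: Dict[str, List[str]]) -> dict:
    """Walk each table's partition directories and build a scan report.

    partition_dirs maps a table name to its partition directories; in the
    disclosure this mapping would come from the metadata service (assumption:
    here the caller provides it).
    """
    report = {"scan_time": time.time(), "tables": {}}
    for table, dirs in partition_dirs.items():
        files = []
        for directory in dirs:
            # Read file meta-information from the file system only;
            # no SQL job is started.
            with os.scandir(directory) as entries:
                for entry in sorted(entries, key=lambda e: e.name):
                    if entry.is_file():
                        stat = entry.stat()
                        files.append({
                            "file": entry.path,
                            "modified_at": stat.st_mtime,  # actual modification time
                            "bytes": stat.st_size,         # stand-in for data volume
                        })
        report["tables"][table] = files
    return report
```

The sketch reads only file metadata, which reflects the motivation stated in the Background: the statistics are gathered without launching a Hive or Spark job.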
In a fourth aspect, the present disclosure provides a computer program product which, when run on a computer, causes the computer to perform the data processing method provided in the second aspect. It should be noted that the above computer instructions may be stored in whole or in part on a first computer-readable storage medium. The first computer-readable storage medium may be packaged together with the controller of the data processing apparatus or packaged separately from it, which is not limited in the present disclosure. For the descriptions of the second, third, and fourth aspects, reference may be made to the detailed description of the first aspect; for their beneficial effects, reference may be made to the analysis of the beneficial effects of the first aspect, which is not repeated here. In the present disclosure, the names of the above data processing apparatuses do not limit the apparatuses or functional modules themselves; in actual implementations, these apparatuses or functional modules may appear under other names. Insofar as the function of each apparatus or functional module is similar to that of the present disclosure, it falls within the scope of the present disclosure and its equivalents. These and other aspects of the discl