CN-121996634-A - File merging method, device, equipment, storage medium and program product
Abstract
The application provides a file merging method, device, equipment, storage medium and program product, and relates to the technical field of data processing. The method comprises: identifying small files of a target data table stored in a distributed storage system, and recording format information of each small file; merging the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files; writing the aggregate files into a temporary storage directory; and, after the aggregate files pass an integrity check, migrating them to the formal storage directory of the target data table. The method achieves efficient small-file merging while supporting continuous ingestion of streaming data.
Inventors
- WU CHUANLONG
- ZHANG WEICHAO
- ZHOU JIAWEI
- LAI HAN
Assignees
- 北京四维图新科技股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260126
Claims (10)
- 1. A file merging method, comprising: identifying small files of a target data table stored in a distributed storage system, and recording format information of each small file; merging the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files; and writing the aggregate files into a temporary storage directory and, after the aggregate files pass an integrity check, migrating the aggregate files to a formal storage directory of the target data table.
- 2. The method according to claim 1, wherein merging the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files comprises: converting the data in the small files into a data structure supported by the distributed computing framework, so as to read the data of the small files into a memory of the distributed computing framework; and merging the data stored in the memory separately for each original format type, according to the recorded format information, to generate the plurality of aggregate files.
- 3. The method according to claim 2, wherein merging the data stored in the memory separately for each original format type, according to the recorded format information, to generate the plurality of aggregate files comprises: acquiring task execution parameters and output control parameters set by a user; and merging the data stored in the memory separately for each original format type, based on the task execution parameters and the output control parameters, to generate the plurality of aggregate files.
- 4. The method according to any one of claims 1-3, wherein identifying small files of a target data table stored in a distributed storage system comprises: if a file whose size is smaller than a preset file size threshold is detected among the data files of the target data table stored in the distributed storage system, identifying that file as a small file.
- 5. The method according to claim 4, wherein, before identifying the small files of the target data table stored in the distributed storage system, the method further comprises: adjusting the preset file size threshold based on the write frequency and the file growth rate of a real-time data stream.
- 6. The method according to any one of claims 1-3, wherein migrating the aggregate files to the formal storage directory of the target data table after the aggregate files pass the integrity check comprises: after the aggregate files pass the integrity check, migrating the aggregate files from the temporary storage directory to the formal storage directory of the target data table, and deleting the small files from the formal storage directory.
- 7. The method according to any one of claims 1-3, wherein, after migrating the aggregate files to the formal storage directory of the target data table, the method further comprises: updating metadata information of the target data table, so that a user can access the aggregate files according to the metadata information.
- 8. A file merging device, comprising: an identification module, configured to identify small files of a target data table stored in a distributed storage system and to record format information of each small file; a merging module, configured to merge the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files; and a migration module, configured to write the aggregate files into a temporary storage directory and, after the aggregate files pass an integrity check, to migrate the aggregate files to a formal storage directory of the target data table.
- 9. A file merging device, comprising a memory and a processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of any one of claims 1-7.
- 10. A computer-readable storage medium or computer program product, wherein: the computer-readable storage medium has stored therein computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-7; and/or the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
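The identification step of claims 1 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: a local directory stands in for the distributed storage system, and the names `SmallFile`, `identify_small_files` and `group_by_format`, as well as the idea of inferring the format from the file extension, are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SmallFile:
    """A data file below the size threshold, with its recorded format."""
    path: Path
    size: int
    fmt: str  # recorded format information, e.g. "orc" or "parquet"


def identify_small_files(table_dir: Path, threshold_bytes: int) -> list[SmallFile]:
    """Return every file under table_dir whose size is below threshold_bytes."""
    found = []
    for path in sorted(table_dir.rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size < threshold_bytes:
            # Assumption: the format is inferred from the file extension.
            fmt = path.suffix.lstrip(".").lower() or "unknown"
            found.append(SmallFile(path, size, fmt))
    return found


def group_by_format(files: list[SmallFile]) -> dict[str, list[SmallFile]]:
    """Group small files by recorded format so each format is merged separately."""
    groups = defaultdict(list)
    for small_file in files:
        groups[small_file.fmt].append(small_file)
    return dict(groups)
```

Grouping by recorded format mirrors claim 2, in which data is merged separately for each original format type; in this sketch the threshold is a plain argument, so a caller could adjust it before each scan in the spirit of claim 5.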
Description
File merging method, device, equipment, storage medium and program product

Technical Field

The present application relates to the field of data processing technologies, and in particular to a file merging method, device, equipment, storage medium and program product.

Background

In a big data platform, the Hive table serves as the core storage model of a data warehouse and is widely used in scenarios such as offline analysis and data lakes. As service requirements diversify, Hive tables face frequent operations such as small-batch writes (e.g. streaming data ingestion) and partition inserts, which produce a large number of small files (e.g. tens of KB to several MB) in the underlying HDFS storage.

These small files cause serious performance bottlenecks. First, the metadata pressure on the HDFS NameNode rises sharply, which may affect cluster stability. Second, a query must open a large number of files, so I/O and scheduling overhead become excessive and query efficiency drops significantly. Finally, storage resources are wasted and file management becomes complex.

In particular, in a stream-batch integrated architecture (such as the Lambda architecture) or a real-time data warehouse scenario, a conventional small-file merging scheme generally requires that writes of new data be prohibited during merging, which conflicts with the real-time requirement of continuous streaming data ingestion. For example, a real-time logistics monitoring system may generate thousands of trajectory records per second; if a conventional merging method is used, the merging task may block the writing of new data, so that data is lost or delayed.
Disclosure of Invention

The embodiments of the application provide a file merging method, device, equipment, storage medium and program product, which achieve efficient small-file merging while supporting continuous ingestion of streaming data.

In a first aspect, an embodiment of the present application provides a file merging method, including: identifying small files of a target data table stored in a distributed storage system, and recording format information of each small file; merging the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files; and writing the aggregate files into a temporary storage directory and, after the aggregate files pass an integrity check, migrating the aggregate files to a formal storage directory of the target data table.

In a second aspect, an embodiment of the present application provides a file merging device, including: an identification module, configured to identify small files of a target data table stored in a distributed storage system and to record format information of each small file; a merging module, configured to merge the small files through a distributed computing framework according to the recorded format information to generate a plurality of aggregate files; and a migration module, configured to write the aggregate files into a temporary storage directory and, after the aggregate files pass an integrity check, to migrate the aggregate files to a formal storage directory of the target data table.
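The write, check and migrate sequence of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as implemented by the applicant: the local filesystem stands in for the distributed storage system, plain concatenation of line-oriented files stands in for the distributed computing framework, and the integrity check is assumed to be a line-count comparison, which the text above does not specify. The function names are hypothetical.

```python
import os
from pathlib import Path


def merge_to_temp(sources: list[Path], temp_dir: Path, name: str) -> Path:
    """Concatenate line-oriented source files into one aggregate file in temp_dir."""
    temp_dir.mkdir(parents=True, exist_ok=True)
    target = temp_dir / name
    with target.open("wb") as out:
        for src in sources:
            data = src.read_bytes()
            out.write(data)
            # Keep records separate if a source lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return target


def count_lines(path: Path) -> int:
    with path.open("rb") as fh:
        return sum(1 for _ in fh)


def passes_integrity_check(sources: list[Path], aggregate: Path) -> bool:
    """Assumed check: the aggregate holds as many lines as all sources together."""
    return count_lines(aggregate) == sum(count_lines(s) for s in sources)


def migrate(sources: list[Path], aggregate: Path, formal_dir: Path) -> Path:
    """Move a verified aggregate into formal_dir, then delete the merged sources."""
    if not passes_integrity_check(sources, aggregate):
        raise ValueError("integrity check failed; formal directory left untouched")
    final = formal_dir / aggregate.name
    # os.replace is atomic only when both paths are on the same filesystem.
    os.replace(aggregate, final)
    for src in sources:
        src.unlink()
    return final
```

Because only the files passed in `sources` are deleted, files written to the formal directory while the merge is running are left in place in this sketch, which is one way to read the goal of not blocking new writes.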
In a third aspect, an embodiment of the present application provides a file merging device, including a memory and a processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory, so that the processor performs the method of the first aspect and/or any of the possible implementations of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method of the first aspect and/or any of the possible implementations of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect and/or any of the possible implementations of the first aspect.

According to the file merging method, device, equipment, storage medium and program product provided by the embodiments of the application, the small files of the target data table in the distributed storage system are identified and their format information is recorded; the small files are merged by format using the distributed computing framework to generate aggregate files; and the aggregate files are written into the temporary storage directory and migrated to the formal storage directory after passing the integrity check, so that the number of small files is effectively reduced, metadata management pressure and I/O (in