CN-117149430-B - Data standardization processing method and device
Abstract
The application discloses a data normalization processing method and device, applied to a server of a data normalization processing system. The method comprises: determining a starting identifier chain for the processing flow of each sub-data in at least one sub-data of original data; obtaining the waiting time of each starting identifier chain, the standard execution time of each starting identifier and the data processing amount of each starting identifier sub-chain; updating the execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and the data processing amount, so that the execution duration of the total processing flow is shorter than before the update; and processing the original data according to each updated execution component sequence until every execution component sequence has been executed, so as to obtain target data. By adjusting the execution order of the execution components according to the waiting time of the starting identifier chain, the standard execution time of the starting identifier and the data processing amount of the starting identifier sub-chain, the application reduces memory pressure and shortens processing time.
Inventors
- CHEN LEI
- Yang Yongbang
Assignees
- 深圳前海微众银行股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20230911
Claims (9)
- 1. A data normalization processing method, applied to a server of a data normalization processing system, the method comprising: determining a starting identifier chain of a processing flow of each sub-data in at least one sub-data of original data, wherein a plurality of starting identifiers in each starting identifier chain are in one-to-one correspondence with a plurality of execution components in an execution component sequence, a plurality of starting identifier sub-chains in each starting identifier chain are in one-to-one correspondence with a plurality of execution component subsets in the execution component sequence, and the total processing flow of the original data comprises the processing flow of each sub-data in the at least one sub-data; acquiring the waiting time of each starting identifier chain, the standard execution time of each starting identifier in the plurality of starting identifiers and the data processing amount of each starting identifier sub-chain in each starting identifier chain, wherein the starting identifiers in each starting identifier sub-chain are in one-to-one correspondence with the execution components in the corresponding execution component subset; updating each execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and the data processing amount so that the execution duration of the total processing flow is smaller than the execution duration before updating; and processing the original data according to each updated execution component sequence until the execution of each execution component sequence is completed, to obtain target data; wherein the updating each execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and the data processing amount so that the execution duration of the total processing flow is smaller than the execution duration before updating comprises: determining a first execution priority of each starting identifier chain and a second execution priority of each starting identifier sub-chain in each starting identifier chain, the starting identifier sub-chains being in one-to-one correspondence with the execution component subsets, the first execution priority indicating the probability that the starting identifier chain is executed first, the second execution priority indicating the probability that the starting identifier sub-chain is executed first, wherein determining the first and second execution priorities comprises calculating them using an MLCATBP algorithm, including obtaining a priority setting and a multi-level setting of the MLCATBP algorithm, the priority setting indicating how the multi-level setting affects the execution priority, and the priority setting specifying that the priority is proportional to the waiting time and inversely proportional to both the standard execution time and the data processing amount; and updating each execution component sequence corresponding to each starting identifier chain according to the first execution priority and the second execution priority so that the execution duration of the total processing flow is smaller than the execution duration before updating.
- 2. The method of claim 1, wherein the determining a first execution priority of each starting identifier chain and a second execution priority of each starting identifier sub-chain in each starting identifier chain, the starting identifier sub-chains being in one-to-one correspondence with the execution component subsets, the first execution priority indicating the probability that the starting identifier chain is executed first, the second execution priority indicating the probability that the starting identifier sub-chain is executed first, comprises: determining a first level priority of each starting identifier chain according to the standard execution time, wherein the first level priority indicates the probability that the starting identifier chain is executed first at the standard execution time level; determining a second level priority of each starting identifier chain according to the data processing amount, wherein the second level priority indicates the probability that the starting identifier chain is executed first at the data processing amount level; determining a third level priority of each starting identifier chain according to the waiting time, wherein the third level priority indicates the probability that the starting identifier chain is executed first at the waiting time level; acquiring a preset first performance influence coefficient; and determining the first execution priority of each starting identifier chain according to the first level priority, the second level priority, the third level priority and the first performance influence coefficient.
- 3. The method of claim 1, wherein the determining a first execution priority of each starting identifier chain and a second execution priority of each starting identifier sub-chain in each starting identifier chain, the starting identifier sub-chains being in one-to-one correspondence with the execution component subsets, the first execution priority indicating the probability that the starting identifier chain is executed first, the second execution priority indicating the probability that the starting identifier sub-chain is executed first, comprises: determining a fourth level priority of each starting identifier sub-chain according to the standard execution time, wherein the fourth level priority indicates the probability that the starting identifier sub-chain is executed first at the standard execution time level; determining a fifth level priority of each starting identifier sub-chain according to the data processing amount, wherein the fifth level priority indicates the probability that the starting identifier sub-chain is executed first at the data processing amount level; acquiring a preset second performance influence coefficient; and determining the second execution priority of each starting identifier sub-chain in each starting identifier chain according to the fourth level priority, the fifth level priority and the second performance influence coefficient.
- 4. The method according to any one of claims 1 to 3, wherein the processing the original data according to each updated execution component sequence until the execution of each execution component sequence is completed, to obtain target data, comprises: obtaining the maximum identical starting identifier sub-chain of the starting identifier chains, wherein the maximum identical starting identifier sub-chain consists of at least two starting identifier sub-chains, located in at least two starting identifier chains, that have the same arrangement order and the same starting identifier sub-chain parameters, the starting identifier sub-chain parameters being the values pointed to by the starting identifier sub-chains; and storing the intermediate data generated when the original data is processed according to the maximum identical starting identifier sub-chain in an intermediate data cache pool, wherein the intermediate data cache pool is used for storing the intermediate data so that it can be read and re-read.
- 5. The method of claim 4, wherein after storing the intermediate data generated when the original data is processed according to the maximum identical starting identifier sub-chain in the intermediate data cache pool, the method further comprises: when it is detected that a starting identifier chain comprises the maximum identical starting identifier sub-chain, retrieving the corresponding intermediate data from the intermediate data cache pool and continuing with the subsequent processing.
- 6. The method of claim 1, wherein the determining a starting identifier chain of a processing flow of each sub-data in at least one sub-data of original data comprises: acquiring a data source number of the original data; determining guide information of the starting identifier chain according to the data source number; and determining the starting identifier chain according to the guide information.
- 7. The method of claim 6, wherein the determining the starting identifier chain according to the guide information comprises: judging whether the starting identifier chain is correct; if not, stopping the data normalization processing and reporting an abnormality; if so, storing the starting identifier chain in a configuration cache, wherein the configuration cache is used for non-permanently storing the starting identifier chain so that it can be read and re-read.
- 8. The method of claim 7, wherein after the judging whether the starting identifier chain is correct, the method further comprises: judging whether the starting identifier chain exists in the configuration cache; if not, acquiring the starting identifier chain from a database and storing it in the configuration cache, wherein the database is used for permanently storing the starting identifier chain so that it can be read and re-read; if so, acquiring the starting identifier chain from the configuration cache.
- 9. A data normalization processing device, applied to a server of a data normalization processing system, for performing the method according to any one of claims 1 to 8, comprising: a first processing unit, configured to determine a starting identifier chain of a processing flow of each sub-data in at least one sub-data of original data, wherein a plurality of starting identifiers in each starting identifier chain are in one-to-one correspondence with a plurality of execution components in an execution component sequence, a plurality of starting identifier sub-chains in each starting identifier chain are in one-to-one correspondence with a plurality of execution component subsets in the execution component sequence, and the total processing flow of the original data comprises the processing flow of each sub-data in the at least one sub-data; a first receiving unit, configured to obtain the waiting time of each starting identifier chain, the standard execution time of each starting identifier in the plurality of starting identifiers, and the data processing amount of each starting identifier sub-chain in each starting identifier chain, wherein the starting identifiers in each starting identifier sub-chain are in one-to-one correspondence with the execution components in the corresponding execution component subset; and a second processing unit, configured to update each execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and the data processing amount so that the execution duration of the total processing flow is smaller than the execution duration before updating, and to process the original data according to each updated execution component sequence until the execution of each execution component sequence is completed, to obtain target data.
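The lookup flow described in claims 6 to 8 (data source number, guide information, validation, configuration cache, database) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the dictionaries standing in for the database and the configuration cache, the `guide_info` and `is_valid` helpers, the component names, and the choice to check the cache before validating.

```python
class ChainLookupError(Exception):
    """Raised when a starting identifier chain is missing or fails validation."""


# Hypothetical stand-ins: one dict plays the permanent database,
# another plays the non-permanent configuration cache.
DATABASE = {"tax_office_a": ["parse", "rename_fields", "validate"]}
CONFIG_CACHE = {}
KNOWN_COMPONENTS = {"parse", "rename_fields", "validate"}


def guide_info(source_number):
    # Assumption: the guide information is simply the lookup key.
    return source_number


def is_valid(chain):
    # Assumption: a chain is "correct" if it is non-empty and every
    # starting identifier maps to a known execution component.
    return bool(chain) and all(step in KNOWN_COMPONENTS for step in chain)


def load_chain(source_number):
    key = guide_info(source_number)
    # Cache hit: reuse the chain without touching the database.
    if key in CONFIG_CACHE:
        return CONFIG_CACHE[key]
    # Cache miss: read from the database and validate before caching.
    chain = DATABASE.get(key)
    if chain is None or not is_valid(chain):
        # Corresponds to stopping the processing and reporting an abnormality.
        raise ChainLookupError(f"no valid chain for source {source_number!r}")
    CONFIG_CACHE[key] = chain
    return chain
```

A second call to `load_chain("tax_office_a")` is served from `CONFIG_CACHE`, which is the behaviour the configuration cache is meant to provide.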
Description
Data standardization processing method and device
Technical Field
The application relates to the technical field of big data, and in particular to a data normalization processing method and device.
Background
When data is acquired from multiple data sources, the data provided by different sources often differs in format, in the naming of data items, and so on. For example, when a bank acquires tax data from several tax offices, the original data provided by each office differs in the number of data items and in how those items are named, so the bank finds it difficult to extract the required content from the original data, or the extracted content contains many errors. Data from multiple data sources therefore needs to be normalized. Current technical schemes for data normalization either create a large number of data tables in a database to store the data, or customize a different processing subsystem for each data source. As a result, the code repetition rate is high, memory resources are excessively occupied, and processing takes too long to meet usage requirements. How to reduce memory pressure and shorten the data normalization processing time has therefore become a technical problem to be solved.
Disclosure of Invention
The application provides a data normalization processing method and device, which are used to solve the problems of excessive occupation of memory resources and overlong processing time, reducing memory pressure and shortening the data normalization processing time.
In a first aspect, an embodiment of the present application provides a data normalization processing method, applied to a server of a data normalization processing system, the method comprising: determining a starting identifier chain of a processing flow of each sub-data in at least one sub-data of original data, wherein a plurality of starting identifiers in each starting identifier chain are in one-to-one correspondence with a plurality of execution components in an execution component sequence, a plurality of starting identifier sub-chains in each starting identifier chain are in one-to-one correspondence with a plurality of execution component subsets in the execution component sequence, and the total processing flow of the original data comprises the processing flow of each sub-data in the at least one sub-data; acquiring the waiting time of each starting identifier chain, the standard execution time of each starting identifier in the plurality of starting identifiers and the data processing amount of each starting identifier sub-chain in each starting identifier chain, wherein the starting identifiers in each starting identifier sub-chain are in one-to-one correspondence with the execution components in the corresponding execution component subset; updating each execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and the data processing amount so that the execution duration of the total processing flow is smaller than the execution duration before updating; and processing the original data according to each updated execution component sequence until the execution of each execution component sequence is completed, so as to obtain target data.
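As a rough illustration of the updating step, the following Python sketch orders starting identifier chains by a score that grows with waiting time and shrinks with standard execution time and data processing amount, which is the proportionality stated in claim 1. The exact MLCATBP formula, its levels and its performance influence coefficients are not given in this section, so the plain ratio used here, the `Chain` class and the field names are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    """Hypothetical record for one starting identifier chain."""
    name: str
    waiting_time: float             # time the chain has waited so far
    standard_execution_time: float  # total standard time of its identifiers
    data_volume: float              # data processing amount of the chain


def priority(chain):
    # Assumed score: proportional to waiting time, inversely proportional
    # to standard execution time and to data processing amount.
    return chain.waiting_time / (chain.standard_execution_time * chain.data_volume)


def reorder(chains):
    # Higher score means the chain is scheduled earlier.
    return sorted(chains, key=priority, reverse=True)


chains = [
    Chain("a", waiting_time=10.0, standard_execution_time=2.0, data_volume=5.0),
    Chain("b", waiting_time=4.0, standard_execution_time=1.0, data_volume=1.0),
    Chain("c", waiting_time=1.0, standard_execution_time=4.0, data_volume=2.0),
]
ordered_names = [c.name for c in reorder(chains)]
```

With these sample values the scores are 1.0 for `a`, 4.0 for `b` and 0.125 for `c`, so the short, small chain `b` runs first and the long-running chain `c` runs last. The same scoring idea could be applied a second time to the sub-chains inside each chain.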
In a second aspect, an embodiment of the present application provides a data normalization processing device, applied to a server of a data normalization processing system, the device comprising: a first processing unit, configured to determine a starting identifier chain of a processing flow of each sub-data in at least one sub-data of original data, wherein a plurality of starting identifiers in each starting identifier chain are in one-to-one correspondence with a plurality of execution components in an execution component sequence, a plurality of starting identifier sub-chains in each starting identifier chain are in one-to-one correspondence with a plurality of execution component subsets in the execution component sequence, and the total processing flow of the original data comprises the processing flow of each sub-data in the at least one sub-data; a first receiving unit, configured to obtain the waiting time of each starting identifier chain, the standard execution time of each starting identifier in the plurality of starting identifiers, and the data processing amount of each starting identifier sub-chain in each starting identifier chain, wherein the starting identifiers in each starting identifier sub-chain are in one-to-one correspondence with the execution components in the corresponding execution component subset; and a second processing unit, configured to update each execution component sequence corresponding to each starting identifier chain according to the waiting time, the standard execution time and