CN-122019520-A - Data quality evaluation system based on large model
Abstract
The invention relates to a data quality evaluation system based on a large model, and belongs to the technical field of data management and artificial intelligence. The system adopts a layered architecture comprising a hardware layer, a service layer, an application layer, a business layer and an interface layer. The hardware layer constructs a unified resource pool through virtualization technology; the service layer provides basic services such as intelligent computing and digital wallets; the application layer carries out the core quality evaluation and encodes and uploads the results on-chain; the business layer performs system configuration and evaluation management; and the interface layer realizes interconnection with external platforms. By introducing a large model and carrying out adaptive training with transfer learning or knowledge distillation, the invention constructs an intelligent evaluation system in which large and small models cooperate, realizing high-precision automatic quality evaluation of data from different fields and with different structures. Meanwhile, the evaluation process and results are encoded and recorded on the blockchain throughout, ensuring that the evaluation process is tamper-proof and the results are traceable, and remarkably improving the accuracy, transparency and credibility of the evaluation results.
Inventors
- Wang Peipei
- Zhang Jin
- Guo Fei
- Zhuang Yuntao
Assignees
- 江苏金屹城科技发展有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260107
Claims (11)
- 1. A large model-based data quality evaluation system, comprising: a hardware layer configured to virtualize network equipment, computing equipment, storage equipment and security equipment, and to construct unified communication, computing, storage and security resource pools; a service layer deployed above the hardware layer, comprising a network communication service, an intelligent computing service, a data storage service, an information security service, a digital wallet service and basic general services; an application layer deployed above the service layer, comprising a quality evaluation module and a coding uplink module; a business layer deployed above the application layer, comprising a system management module, a configuration management module and an evaluation management module; and an interface layer arranged above the business layer and configured to connect with external platforms based on the network communication service.
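The five-layer stack of claim 1 can be summarized as a minimal sketch; the layer and module names follow the claim, while the registry structure and `layer_below` helper are illustrative assumptions, not part of the patent:

```python
# Minimal sketch of the five-layer architecture in claim 1.
# Layer and service names come from the claim text; the dict registry
# and helper function are illustrative assumptions.

ARCHITECTURE = {
    "hardware":    ["communication pool", "computing pool", "storage pool", "security pool"],
    "service":     ["network communication", "intelligent computing", "data storage",
                    "information security", "digital wallet", "basic general"],
    "application": ["quality evaluation module", "coding uplink module"],
    "business":    ["system management", "configuration management", "evaluation management"],
    "interface":   ["external platform connectors"],
}

# Layers are stacked bottom-up: each layer is deployed above the previous one.
STACK = ["hardware", "service", "application", "business", "interface"]

def layer_below(layer):
    """Return the layer a given layer is deployed on top of, if any."""
    i = STACK.index(layer)
    return STACK[i - 1] if i > 0 else None
```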
- 2. The large model-based data quality evaluation system of claim 1, wherein in the service layer, the intelligent computing service is configured to adaptively train, using transfer learning or knowledge distillation, a data quality evaluation model pre-trained on a large model platform, and wherein the digital wallet service is configured to communicate with a smart contract.
- 3. The large model-based data quality evaluation system of claim 1, wherein in the application layer, the quality evaluation module is configured to invoke the intelligent computing service to evaluate a plurality of quality indicators of the data to be evaluated using the selected evaluation model, so as to generate a data quality coefficient; and the coding uplink module is configured to invoke the digital wallet service to encrypt the index codes related to the quality evaluation process and then upload the encrypted codes to a smart contract.
- 4. The large model-based data quality evaluation system of claim 3, wherein the plurality of quality indicators comprises the normalization, integrity, accuracy, consistency, timeliness and accessibility of the data.
- 5. The large model-based data quality evaluation system of claim 1, wherein in the business layer, the system management module is configured to manage information on external platforms communicating with the system and configuration information of the system itself; the configuration management module is configured to configure the selection of the evaluation model, the information of the evaluated data and the extraction rules for the evaluated data; and the evaluation management module is configured to select an evaluation model and a data sample to be evaluated according to the configuration of the configuration management module, and to call the application layer to perform quality evaluation and quality coefficient calculation.
- 6. The large model-based data quality evaluation system of claim 1, wherein the interface layer is configured to interface, via a preset communication protocol, with one or more of a high-quality data set construction platform, a blockchain platform, an evaluated data system, a data asset management platform, and a system operation and maintenance platform.
- 7. The large model-based data quality evaluation system of claim 6, wherein the preset communication protocol comprises one or more of HTTPS, API, gRPC and SDK.
- 8. The large model-based data quality evaluation system of claim 3, wherein the application layer is configured to operate according to the following logic: S1, after receiving a service request, performing parameter configuration; S2, judging whether a corresponding model exists; if so, determining the model, otherwise acquiring a data quality evaluation model from the high-quality data set construction platform, optimizing it, and determining the model; S3, extracting sample data from the evaluated data system; S4, calculating a data quality coefficient of the sample data based on the model determined in step S2; S5, performing quality evaluation encoding of the data quality coefficient on the associated blockchain platform; and S6, feeding back the service request according to the encoding result to complete the data quality evaluation.
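The S1–S6 flow of claim 8 can be sketched as a short orchestration function. All helper names and bodies below are illustrative stubs (assumptions), intended only to show the control flow, not the patented implementation:

```python
# Hedged sketch of the application-layer logic S1-S6 from claim 8.
# Every helper is a placeholder stub; names are assumptions.

def configure(request):                       # S1: parameter configuration
    return dict(request)

def find_model(params):                       # S2: look up a matching model
    return params.get("model")

def fetch_model_from_dataset_platform(params):
    return "base-model"                       # stub: pull from data set platform

def optimize(model):
    return f"optimized({model})"              # stub: small-model optimization

def extract_sample(params):                   # S3: extract sample data
    return params.get("data", [])[: params.get("n", 3)]

def compute_quality_coefficient(model, sample):
    return 0.9                                # S4: placeholder coefficient

def encode_and_uplink(alpha, params):         # S5: encode and record on-chain
    return {"alpha": alpha, "tx": "0x0"}      # stub transaction receipt

def respond(request, receipt):                # S6: feed back the result
    return receipt

def handle_request(request):
    params = configure(request)
    model = find_model(params)
    if model is None:
        model = optimize(fetch_model_from_dataset_platform(params))
    sample = extract_sample(params)
    alpha = compute_quality_coefficient(model, sample)
    receipt = encode_and_uplink(alpha, params)
    return respond(request, receipt)
```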
- 9. The large model-based data quality evaluation system of claim 8, wherein the data quality coefficient is calculated as follows: A. parameter configuration, namely setting system parameters via a CLI or Web module, including the industry attribute of the evaluated data, the total data amount C of the evaluated data and the data structure S; B. analyzing the parameter attributes of the evaluated data, the system selecting a suitable evaluation model in a labeling mode; if no corresponding model exists, acquiring an adapted data quality evaluation model from the high-quality data set platform and optimizing a small model to obtain the evaluation model; C. selecting an extraction model, namely selecting a sample extraction model R according to the total data amount, the data type and the data structure, where R is chosen among the random models Ri by comparing the total data amount C against the data resource total amount demarcation point CNT and the number of data structure types SUM(S) against the data structure type demarcation point N; D. extracting sample data, namely extracting from the evaluated data resources according to the selected random extraction model to obtain a sample of size N, and sending the sample data resources to the intelligent computing service; E. selecting the data quality indexes in the intelligent computing service, namely configuring the data quality coefficient calculation index system and the weight value Xi of each index, using the matched and optimized calculation model, according to the industry of the data resources, the sample data and other information; F. calculating the coefficient of each index in the intelligent computing service, namely counting, with the evaluation model, the total amount of data to be evaluated and the amount of data meeting each index requirement, and calculating the coefficient of each index as Yi = Ci / C, where Ci is the amount of data meeting the requirement of the i-th index and C is the total amount of data evaluated; G. calculating the data quality coefficient in the intelligent computing service, namely calculating the data quality coefficient alpha as alpha = Σ(i = 1..N) Xi · Yi, where alpha is the data quality coefficient of the data resource, N is the number of selected data quality indexes, i denotes the i-th data quality evaluation index, Xi is the weight value of the i-th index, and Yi is the index coefficient of the i-th index.
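Steps E–G of the coefficient calculation can be sketched directly. This assumes, from the variable definitions in claim 9, that each index coefficient Yi is the fraction of records satisfying that index and that alpha is the weighted sum Σ Xi·Yi; the record format and predicates below are illustrative:

```python
# Sketch of the data quality coefficient calculation (claim 9, steps E-G).
# Yi is taken as (records meeting the index requirement) / (total records),
# and alpha as the weighted sum of the Yi -- a reconstruction from the
# variable definitions in the claim.

def index_coefficient(records, meets_requirement):
    """Yi = Ci / C for one quality index."""
    return sum(1 for r in records if meets_requirement(r)) / len(records)

def quality_coefficient(weights, coefficients):
    """alpha = sum over i of Xi * Yi."""
    return sum(x * y for x, y in zip(weights, coefficients))

# Example: two indexes (completeness, accuracy) with assumed weights 0.6 / 0.4.
records = [{"complete": True,  "accurate": True},
           {"complete": True,  "accurate": False},
           {"complete": False, "accurate": True},
           {"complete": True,  "accurate": True}]
y_complete = index_coefficient(records, lambda r: r["complete"])  # 3/4
y_accurate = index_coefficient(records, lambda r: r["accurate"])  # 3/4
alpha = quality_coefficient([0.6, 0.4], [y_complete, y_accurate])
```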
- 10. The large model-based data quality evaluation system of claim 8, wherein the quality evaluation encoding steps are as follows: a. encoding, according to a specified rule, the industry to which the data belongs, the data sample parameters, the evaluation model parameters, the evaluation time and the evaluation result of the data quality calculation process; b. establishing a secure channel, namely connecting the system to the smart contract address and establishing a trusted channel to the smart contract through the information security service and the network communication service; c. data uplink, namely sending the encoded data quality evaluation information to the blockchain platform through the secure channel for recording on-chain, and receiving and recording the uplink result response.
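Step a of claim 10 only requires encoding "according to a specified rule"; one plausible concrete form is canonical JSON plus a SHA-256 digest, so identical evaluations always produce the same on-chain record. The JSON/SHA-256 choice and function name are assumptions, not the patented rule:

```python
import hashlib
import json

# Hedged sketch of the quality-evaluation encoding (claim 10, step a).
# The canonical-JSON + SHA-256 scheme is an illustrative assumption.

def encode_evaluation(industry, sample_params, model_params, eval_time, result):
    payload = {
        "industry": industry,
        "sample_params": sample_params,
        "model_params": model_params,
        "eval_time": eval_time,
        "result": result,
    }
    # Canonical JSON so the same evaluation always yields the same digest.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(encoded).hexdigest()
    return encoded, digest
```

The encoded bytes and digest would then be sent to the blockchain platform over the secure channel of steps b and c.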
- 11. The large model-based data quality evaluation system of claim 9, wherein the data quality coefficient calculated by the application layer is fed back, upon request, to the data asset assessment system, and the data asset assessment system obtains the data utility of the data asset according to the following calculation model so as to obtain the final data asset valuation: U = α · β · (1 + l) · (1 - r); where U is the data asset valuation, α is the data quality coefficient of the data, β is the data flow coefficient, l is the monopoly coefficient of the data, and r is the data value realization risk coefficient.
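The valuation model of claim 11 reduces to one line, taking U = α·β·(1+l)·(1−r) as implied by the claim's variable list (the source text garbles the formula, so this form is a reconstruction):

```python
# Sketch of the data asset valuation model of claim 11, assuming
# U = alpha * beta * (1 + l) * (1 - r) per the variable definitions.

def data_asset_valuation(alpha, beta, l, r):
    """U: valuation; alpha: quality coefficient; beta: flow coefficient;
    l: monopoly coefficient; r: value-realization risk coefficient."""
    return alpha * beta * (1 + l) * (1 - r)

# Example with assumed inputs: alpha=0.75, beta=1.0, l=0.2, r=0.1.
u = data_asset_valuation(0.75, 1.0, 0.2, 0.1)
```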
Description
Data quality evaluation system based on large model

Technical Field
The invention relates to the technical field of data management and artificial intelligence, and in particular to a data quality evaluation system based on a large model.

Background
With the deep development of the big data age, data has become a key production element. In the industrial field, huge amounts of accumulated data drive production optimization, service innovation and mode transformation. To standardize and promote the value realization of data resources, the state has issued important policies such as the 'Interim Provisions on Accounting Treatment Related to Enterprise Data Resources' and the 'Guiding Opinions on Strengthening Data Asset Management', providing clear policy guidance and implementation paths for industrial enterprises to explore data capitalization and advance data asset evaluation and capitalized operation. In this context, scientific and reliable assessment of industrial data assets is a prerequisite for releasing their value; regardless of the valuation model employed, data quality is the core basis and key input for measuring and determining the value of a data asset. At present, data quality evaluation mainly relies on the national standard 'Information technology — Evaluation indicators for data quality' (GB/T 36344-2018), which establishes an evaluation framework across multiple dimensions such as standardization, integrity, accuracy, consistency, timeliness and accessibility. However, in the practical asset evaluation of the complex, massive and heterogeneous data resources of the industrial field, the standard faces a series of outstanding challenges:

1. The evaluation standard is difficult to apply and depends strongly on subjective judgment. The industrial field is subdivided into many industries, and the quality connotations and importance weights of various data resources (such as equipment operation data, process parameters and supply chain information) differ greatly. Selecting an adapted subset of indexes from the standard and scientifically determining the weight of each index for a specific evaluation object depends heavily on expert experience. The resulting subjectivity makes the evaluation process hard to keep consistent, an objective and reusable evaluation model is difficult to form, and this has become a common problem troubling evaluation institutions and industrial enterprises.

2. Evaluation efficiency is low and cost is high. Industry is a data-rich field, and the data sets to be evaluated are often huge in scale and complex in structure. Traditional quality evaluation methods based on manual sampling or rule calculation are time-consuming and can hardly meet the timeliness requirements of asset evaluation, while evaluating the whole data set or large-scale samples incurs unaffordable labor and computing costs, restricting the large-scale development of data asset work.

3. The reliability and public credibility of the results are insufficient. The current data element market has many participating parties, and an authoritative, uniformly recognized data quality evaluation mechanism is lacking. Evaluation results generated by a centralized system or a single institution have an opaque process, conclusions that are difficult to verify and no fair audit tracing mechanism, and so struggle to obtain wide acceptance from transaction parties and regulatory departments, affecting the market acceptance and circulation of data asset value based on such results.

Thus, there is a need in the industry for an innovative solution to overcome the above drawbacks. Although artificial intelligence technology offers promise here, simply applying a large model faces high deployment costs and difficult domain knowledge migration, while relying only on a small model yields limited evaluation capability. At the same time, ensuring the transparency of the automated evaluation process and the tamper-resistance of the results remains a technical blind spot. In view of this, there is an urgent need to construct a novel industrial data quality evaluation system and method that integrates the powerful cognition of a large model with the efficiency and accuracy of a small model, and solidifies evaluation traces with blockchain technology, so as to realize efficient, intelligent and credible quality evaluation of industrial data resources and provide solid, reliable technical support for the accurate evaluation and compliant capitalization of data assets.

Disclosure of Invention
The invention aims to provide a data quality evaluation system based on a large model