CN-122021605-A - Sample data generation method and device, computer equipment and storage medium

Abstract

The application belongs to the technical field of artificial intelligence and relates to a method, a device, computer equipment and a storage medium for generating sample data. The method comprises: desensitizing acquired corpus data to obtain first corpus data; performing data generation on the first corpus data with a condition generation module to obtain second corpus data; screening third corpus data that meets requirements from the second corpus data; performing model evaluation on a preset business model based on the third corpus data, and identifying, based on the obtained evaluation index, a specified sample that the model has difficulty processing; obtaining the business scene of the specified sample and analyzing it to obtain an enhanced scene; obtaining related corpus data corresponding to the enhanced scene from a corpus; enhancing the related corpus data based on an enhancement strategy to generate target sample data; and outputting the target sample data. The method and the device can be applied to sample generation scenarios in the financial technology field, and can improve the efficiency, accuracy and compliance of target sample data generation.

Inventors

QU XIAOYANG

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-15

Claims (10)

1. A method for generating sample data, comprising the steps of: obtaining corpus data from a preset data source, and performing desensitization processing on the corpus data to obtain corresponding first corpus data; performing data generation processing on the first corpus data based on a preset condition generation module to obtain corresponding second corpus data; performing business verification on the second corpus data to screen out, from the second corpus data, third corpus data meeting requirements; performing model evaluation on a preset business model based on the third corpus data, and identifying, based on the obtained evaluation index, a specified sample that the model has difficulty processing; acquiring a business scene of the specified sample, and analyzing the business scene to obtain a corresponding enhanced scene; acquiring related corpus data corresponding to the enhanced scene from a preset corpus; and performing enhancement processing on the related corpus data based on a preset enhancement strategy to generate corresponding target sample data, and outputting the target sample data.
2. The method for generating sample data according to claim 1, wherein the step of performing data generation processing on the first corpus data based on the preset condition generation module to obtain corresponding second corpus data specifically comprises: invoking a preset condition generation model and a semantic equivalence converter based on the condition generation module; processing the first corpus data with the condition generation model, based on preset prompt engineering and a control token, to obtain corresponding first corpus generation data; performing variant generation processing on the first corpus data based on the semantic equivalence converter to obtain corresponding second corpus generation data; integrating the first corpus generation data and the second corpus generation data to obtain corresponding corpus integration data; and taking the corpus integration data as the second corpus data.
3. The method for generating sample data according to claim 1, wherein the step of performing business verification on the second corpus data to screen out third corpus data meeting requirements from the second corpus data specifically comprises: performing numerical logic verification on the second corpus data to obtain a corresponding verification result; performing event time sequence verification on the second corpus data to obtain a corresponding verification result; performing a compliance review on the second corpus data to obtain a corresponding review result; performing rule verification on the second corpus data to obtain a corresponding rule verification result; integrating the verification results, the review result and the rule verification result to obtain a corresponding target result; and filtering the second corpus data based on the target result to obtain corresponding qualified corpus data, and taking the qualified corpus data as the third corpus data.
4. The method for generating sample data according to claim 3, wherein the step of performing rule verification on the second corpus data to obtain a corresponding rule verification result specifically comprises: performing keyword search and sentence structure analysis on the second corpus data to identify specified content related to accounting processing in the second corpus data; calling preset accounting rules, and verifying the specified content based on the accounting rules; if the specified content conforms to the accounting rules, generating a first rule verification result indicating that the second corpus data passes the rule verification; and if the specified content does not conform to the accounting rules, generating a second rule verification result indicating that the second corpus data fails the rule verification.
5. The method for generating sample data according to claim 1, wherein the step of performing model evaluation on a preset business model based on the third corpus data and identifying, based on the obtained evaluation index, a specified sample that the model has difficulty processing specifically comprises: performing data division processing on the third corpus data based on a preset division strategy to obtain a corresponding validation set; performing model evaluation on the business model based on the validation set to obtain a corresponding evaluation index; acquiring an index threshold corresponding to the evaluation index; based on the evaluation index and the index threshold, performing sample screening processing on the validation set by using a preset hard-example screener to obtain a corresponding hard-example sample; and taking the hard-example sample as the specified sample.
6. The method for generating sample data according to claim 1, wherein the step of performing enhancement processing on the related corpus data based on a preset enhancement strategy to generate corresponding target sample data specifically comprises: preprocessing the related corpus data to obtain processed target corpus data; acquiring a plurality of preset enhancement strategies; screening a target enhancement strategy from all the enhancement strategies; performing sample generation processing on the target corpus data based on the target enhancement strategy to obtain corresponding first sample data; screening the first sample data based on a preset screening strategy to obtain corresponding second sample data; and taking the second sample data as the target sample data.
7. The method for generating sample data according to claim 6, wherein the step of screening the first sample data based on a preset screening strategy to obtain corresponding second sample data specifically comprises: acquiring a preset sample evaluation index; constructing a corresponding evaluation processing flow based on the sample evaluation index; based on the evaluation processing flow, invoking a preset automatic evaluation tool to perform sample screening on the first sample data, so as to remove samples in the first sample data that do not meet the requirements and obtain corresponding third sample data; rechecking the third sample data; and if the third sample data passes the rechecking, taking the third sample data as the second sample data.
8. A sample data generating apparatus, comprising: a first processing module, configured to acquire corpus data from a preset data source and perform desensitization processing on the corpus data to obtain corresponding first corpus data; a generation module, configured to perform data generation processing on the first corpus data based on a preset condition generation module to obtain corresponding second corpus data; a screening module, configured to perform business verification on the second corpus data so as to screen out, from the second corpus data, third corpus data meeting requirements; an identification module, configured to perform model evaluation on a preset business model based on the third corpus data and identify, based on the obtained evaluation index, a specified sample that the model has difficulty processing; an analysis module, configured to acquire a business scene of the specified sample and analyze the business scene to obtain a corresponding enhanced scene; an acquisition module, configured to acquire related corpus data corresponding to the enhanced scene from a preset corpus; and a second processing module, configured to perform enhancement processing on the related corpus data based on a preset enhancement strategy so as to generate corresponding target sample data, and to output the target sample data.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the method of generating sample data as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon computer-readable instructions which, when executed by a processor, implement the steps of the method of generating sample data according to any of claims 1 to 7.
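
As a reading aid for claims 3 and 4, the business verification step can be sketched as four independent checks whose results are integrated and then used as a filter. This is a minimal Python illustration under stated assumptions, not the applicant's implementation: the `Sample` fields, the banned-phrase list, the debit/credit balancing rule and all function names are hypothetical and do not appear in the application.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass
class Sample:
    """A generated corpus item with a few hypothetical structured fields."""
    text: str
    payout: float
    coverage_limit: float
    policy_start: date
    claim_date: date


def check_numeric(s: Sample) -> bool:
    # Numerical logic: a payout is non-negative and cannot exceed the limit.
    return 0 <= s.payout <= s.coverage_limit


def check_timeline(s: Sample) -> bool:
    # Event time sequence: a claim cannot precede the policy start date.
    return s.claim_date >= s.policy_start


BANNED_PHRASES = ("guaranteed return", "risk-free")  # assumed compliance list


def check_compliance(s: Sample) -> bool:
    # Compliance review: reject text containing any banned phrase.
    lowered = s.text.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


ENTRY = re.compile(r"\b(debit|credit)\s+(\d+(?:\.\d+)?)", re.IGNORECASE)


def check_accounting_rule(s: Sample) -> bool:
    # Keyword search finds accounting entries; the assumed rule is that
    # total debits equal total credits. Text without entries passes.
    totals = {"debit": 0.0, "credit": 0.0}
    for side, amount in ENTRY.findall(s.text):
        totals[side.lower()] += float(amount)
    return abs(totals["debit"] - totals["credit"]) < 1e-9


CHECKS: Dict[str, Callable[[Sample], bool]] = {
    "numeric": check_numeric,
    "timeline": check_timeline,
    "compliance": check_compliance,
    "accounting_rule": check_accounting_rule,
}


def verify(s: Sample) -> Dict[str, bool]:
    """Run every check and return the integrated per-check results."""
    return {name: check(s) for name, check in CHECKS.items()}


def screen(samples: List[Sample]) -> List[Sample]:
    """Keep only samples for which all checks pass."""
    return [s for s in samples if all(verify(s).values())]
```

Keeping the per-check results in a dictionary, rather than returning a single boolean, makes it possible to see which check rejected a sample. Sentence structure analysis, which claim 4 also recites, is not modelled here.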

Description

Sample data generation method and device, computer equipment and storage medium

Technical Field

The application relates to the technical field of artificial intelligence, can be applied to the field of financial technology, and in particular relates to a method and a device for generating sample data, computer equipment and a storage medium.

Background

With the rapid development of artificial intelligence technology in the financial field, high-quality, large-scale sample data plays a vital role in training high-performance models. However, current financial sample data faces a number of problems.

On the one hand, sample data is seriously lacking in diversity. The financial field covers a large number of professional terms and complex business logic. In the financial insurance field, for example, insurance clauses involve many professional terms, such as "deductible" (the no-claim amount), "payout ratio" and "waiting period", and business logic differs greatly between insurance types: health insurance and car insurance differ markedly in risk assessment, claim settlement process and so on. This makes collecting and organizing sample data extremely difficult, and financial staff often need to take part in labeling, so labor cost is high and efficiency is low.

On the other hand, existing data enhancement techniques have significant drawbacks. Although common methods such as synonym replacement and back translation can increase the amount of data to a certain extent, they struggle to preserve the professional accuracy of financial corpora and the correctness of business logic.
In the financial insurance field, if simple synonym replacement is applied to insurance clauses, for example replacing the term "no-claim amount" (the deductible) with a superficially similar phrase, the precision of the technical term is lost and ambiguity may result. In back translation, differences introduced by language conversion may distort the business logic and may even introduce factual errors or compliance risks, so the accuracy of the generated sample data cannot be reliably guaranteed. An effective sample data generation method is therefore needed to solve the above problems and provide reliable data support for training high-performance financial models.

Disclosure of Invention

The embodiments of the application aim to provide a method, a device, computer equipment and a storage medium for generating sample data, so as to solve the technical problems that existing sample data generation methods are inefficient and cannot guarantee the accuracy of the generated sample data.
In a first aspect, a method for generating sample data is provided, including: obtaining corpus data from a preset data source, and performing desensitization processing on the corpus data to obtain corresponding first corpus data; performing data generation processing on the first corpus data based on a preset condition generation module to obtain corresponding second corpus data; performing business verification on the second corpus data to screen out, from the second corpus data, third corpus data meeting requirements; performing model evaluation on a preset business model based on the third corpus data, and identifying, based on the obtained evaluation index, a specified sample that the model has difficulty processing; acquiring a business scene of the specified sample, and analyzing the business scene to obtain a corresponding enhanced scene; acquiring related corpus data corresponding to the enhanced scene from a preset corpus; and performing enhancement processing on the related corpus data based on a preset enhancement strategy to generate corresponding target sample data, and outputting the target sample data.
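
Several steps of the first aspect can be sketched in Python as follows: desensitization, hard-sample identification, enhanced-scene selection and enhancement of related corpus data. This is an illustrative sketch only. The regular expressions, the confidence threshold, the rule that an enhanced scene is any scene in which hard samples occur, and every function name are assumptions made for the example; the application does not specify them. The condition generation and business verification steps are omitted.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Assumed patterns for sensitive fields; real rules would be broader.
ID_NO = re.compile(r"\b\d{17}[\dXx]\b")   # 18-character ID number
PHONE = re.compile(r"\b1\d{10}\b")        # 11-digit mobile number

Labeled = Tuple[str, str, str]              # (text, label, business scene)
Model = Callable[[str], Tuple[str, float]]  # text -> (prediction, confidence)


def desensitize(text: str) -> str:
    """Mask sensitive values before any generation step sees them."""
    return PHONE.sub("[PHONE]", ID_NO.sub("[ID]", text))


def find_hard_samples(samples: List[Labeled], model: Model,
                      threshold: float = 0.6) -> List[Labeled]:
    """Flag samples the model gets wrong or answers with low confidence."""
    hard = []
    for text, label, scene in samples:
        prediction, confidence = model(text)
        if prediction != label or confidence < threshold:
            hard.append((text, label, scene))
    return hard


def enhanced_scenes(hard: List[Labeled], min_count: int = 1) -> List[str]:
    """Treat scenes where hard samples concentrate as enhancement targets."""
    counts = Counter(scene for _, _, scene in hard)
    return sorted(scene for scene, n in counts.items() if n >= min_count)


def build_target_samples(scenes: List[str], corpus: Dict[str, List[str]],
                         strategy: Callable[[str], List[str]]) -> List[str]:
    """Fetch related corpus text per scene and apply the strategy."""
    targets: List[str] = []
    for scene in scenes:
        for text in corpus.get(scene, []):
            targets.extend(strategy(text))
    return targets
```

Passing the model and the enhancement strategy in as callables keeps the sketch independent of any particular business model or augmentation technique.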
In a second aspect, a sample data generating apparatus is provided, including: a first processing module, configured to acquire corpus data from a preset data source and perform desensitization processing on the corpus data to obtain corresponding first corpus data; a generation module, configured to perform data generation processing on the first corpus data based on a preset condition generation module to obtain corresponding second corpus data; a screening module, configured to perform business verification on the second corpus data so as to screen out, from the second corpus data, third corpus data meeting requirements; an identification module, configured to perform model evaluation on a preset business model based on the third corpus data and identify, based on the obtained evaluation index, a specified sample that the model has difficulty processing; an analysis module, configured to acquire a business scene of the specified sample and analyze the business scene to obtain a corresponding enhanced scene; an acquisition module, configured to acquire related corpus data corresponding to the enhanced scen