
CN-116244719-B - Federated learning WOE encoding method, apparatus, device and storage medium


Abstract

The application discloses a WOE encoding method, apparatus, device and storage medium for federated learning. The method comprises: obtaining a first matrix whose j-th column comprises k first WOE values corresponding to the k bins of the j-th feature; generating, from the first matrix and a second matrix, n third matrices corresponding to n samples, wherein the j-th column of the second matrix comprises n first random numbers corresponding to the j-th feature of the n samples, and the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values corresponding to the k bins of the j-th feature, each second WOE value being obtained from the corresponding first WOE value and the first random number taken with a first sign; transmitting the n third matrices to a second electronic device; and generating a fourth matrix whose j-th column comprises n second random numbers corresponding to the j-th feature of the n samples, each second random number being the first random number taken with a second sign opposite to the first sign. The application can securely perform WOE encoding without revealing the labels of the samples or the distribution information of the features of the samples.
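
The relationship among the matrices named in the abstract can be written compactly. The following is a sketch under the additive form given in claim 2 (the second WOE value is the first WOE value plus the first random number, and the second random number is the negative of the first). The symbols W, R, T^{(i)}, F, S and b(i,j) are notation introduced here for illustration only; they do not appear in the application.

```latex
\begin{aligned}
T^{(i)}_{t,j} &= W_{t,j} + R_{i,j}, \quad t = 1,\dots,k
  && \text{(third matrix of sample } i\text{, built by the first device)}\\
F_{i,j} &= T^{(i)}_{b(i,j),\,j} = W_{b(i,j),\,j} + R_{i,j}
  && \text{(fifth matrix, selected by the second device)}\\
S_{i,j} &= -R_{i,j}
  && \text{(fourth matrix, kept by the first device)}\\
F_{i,j} + S_{i,j} &= W_{b(i,j),\,j}
\end{aligned}
```

Here W is the k×m first matrix, R the n×m second matrix, and b(i,j) the bin of feature j into which sample i falls. Under this reading, the fourth and fifth matrices behave like additive shares of the WOE-encoded feature matrix: their sum gives the WOE value of each sample's bin, while each party holds only one of the two.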

Inventors

  • CAI CHAOCHAO
  • ZHANG PENG
  • Li Dating
  • NIU ZIRU
  • SHAN JINYONG

Assignees

  • 北京数牍科技有限公司

Dates

Publication Date
20260512
Application Date
20221226

Claims (10)

  1. A WOE encoding method for federated learning, applied to a first electronic device, comprising: obtaining a first matrix, wherein the first matrix is a k×m matrix whose j-th column comprises k first WOE values in one-to-one correspondence with k bins of a j-th feature, the k bins of the j-th feature are obtained by a second electronic device binning n samples according to the j-th feature, k and n are integers greater than 1, j is a positive integer less than or equal to m, and m is the number of features of the samples held by the second electronic device; generating, by using the first matrix and a second matrix, n third matrices in one-to-one correspondence with the n samples, wherein the second matrix is an n×m matrix whose j-th column comprises n first random numbers in one-to-one correspondence with the j-th feature of the n samples, each third matrix is a k×m matrix, the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with the k bins of the j-th feature, and the second WOE value corresponding to the t-th bin of the j-th feature is obtained from the first WOE value corresponding to the t-th bin of the j-th feature and the first random number of the j-th feature of the i-th sample taken with a first sign; sending the n third matrices to the second electronic device, wherein the n third matrices are used by the second electronic device to generate a fifth matrix, the fifth matrix is an n×m matrix whose j-th column comprises n second WOE values in one-to-one correspondence with the j-th feature of the n samples, and the second WOE value corresponding to the j-th feature of the i-th sample is the second WOE value that, in the third matrix corresponding to the i-th sample, corresponds to the bin of the j-th feature to which the i-th sample belongs; and generating a fourth matrix, wherein the fourth matrix is an n×m matrix whose j-th column comprises n second random numbers in one-to-one correspondence with the j-th feature of the n samples, the second random number corresponding to the j-th feature of the i-th sample is the first random number corresponding to the j-th feature of the i-th sample taken with a second sign, and the first sign is opposite to the second sign.
  2. The method of claim 1, wherein the second WOE value corresponding to the t-th bin of the j-th feature is the sum of the first WOE value corresponding to the t-th bin of the j-th feature and the first random number of the j-th feature of the i-th sample; and the second random number corresponding to the j-th feature of the i-th sample is the negative of the first random number corresponding to the j-th feature of the i-th sample.
  3. The method of claim 1, wherein obtaining the first matrix comprises: generating a key pair comprising a public key and a private key; encrypting a first column vector by using the public key to obtain a second column vector, wherein the first column vector comprises n first labels in one-to-one correspondence with the n samples, and the second column vector comprises n second labels in one-to-one correspondence with the n samples; sending the public key and the second column vector to the second electronic device; receiving target information sent by the second electronic device, wherein the target information comprises the encrypted bin numbers of the k bins of the j-th feature and the encrypted positive-sample counts and negative-sample counts of the k bins of the j-th feature; decrypting the target information by using the private key; and determining the k first WOE values in one-to-one correspondence with the k bins of the j-th feature by using the decrypted positive-sample counts and negative-sample counts of the k bins of the j-th feature.
  4. A WOE encoding method for federated learning, applied to a second electronic device, comprising: receiving n third matrices that are sent by a first electronic device and are in one-to-one correspondence with n samples, wherein each third matrix is a k×m matrix, the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with k bins of the j-th feature, the third matrices are generated by the first electronic device by using a first matrix and a second matrix, the first matrix is a k×m matrix whose j-th column comprises k first WOE values in one-to-one correspondence with the k bins of the j-th feature, the second matrix is an n×m matrix whose j-th column comprises n first random numbers in one-to-one correspondence with the j-th feature of the n samples, and each second WOE value is obtained from the corresponding first WOE value and the first random number taken with a first sign; and determining the second WOE value of the bin of the j-th feature corresponding to the i-th sample as the second WOE value corresponding to the j-th feature of the i-th sample, to obtain a fifth matrix, wherein the fifth matrix is an n×m matrix whose j-th column comprises n second WOE values in one-to-one correspondence with the j-th feature of the n samples, the bin of the j-th feature corresponding to the i-th sample is the bin in which the i-th sample falls after the second electronic device bins the n samples according to the j-th feature, and i is a positive integer less than or equal to n; wherein the first electronic device further generates a fourth matrix, the fourth matrix is an n×m matrix whose j-th column comprises n second random numbers in one-to-one correspondence with the j-th feature of the n samples, the second random numbers are the first random numbers taken with a second sign, and the first sign is opposite to the second sign.
  5. The method of claim 4, wherein before receiving the n third matrices that are sent by the first electronic device and are in one-to-one correspondence with the n samples, the method further comprises: receiving a public key and a second column vector sent by the first electronic device, wherein the second column vector comprises n second labels in one-to-one correspondence with the n samples; determining the positive-sample count and the negative-sample count of the t-th bin of the j-th feature by using the second label corresponding to each sample included in the t-th bin of the j-th feature, so as to obtain the positive-sample counts and negative-sample counts of the k bins of the j-th feature, wherein t is a natural number smaller than k; encrypting, by using the public key, the bin numbers of the k bins of the j-th feature and the positive-sample counts and negative-sample counts of the k bins of the j-th feature; and sending target information to the first electronic device, wherein the target information comprises the encrypted bin numbers of the k bins of the j-th feature and the encrypted positive-sample counts and negative-sample counts of the k bins of the j-th feature.
  6. The method of claim 5, wherein determining the positive-sample count and the negative-sample count of the t-th bin of the j-th feature by using the second label corresponding to each sample included in the t-th bin of the j-th feature comprises: determining the sum of the second labels corresponding to all samples included in the t-th bin of the j-th feature as the negative-sample count of the t-th bin of the j-th feature; and subtracting the negative-sample count of the t-th bin of the j-th feature from a target value to obtain the positive-sample count of the t-th bin of the j-th feature, wherein the target value is the number of samples included in the t-th bin of the j-th feature.
  7. A WOE encoding apparatus for federated learning, the apparatus comprising: a first obtaining module, configured to obtain a first matrix, wherein the first matrix is a k×m matrix whose j-th column comprises k first WOE values in one-to-one correspondence with k bins of a j-th feature, the k bins of the j-th feature are obtained by a second electronic device binning n samples according to the j-th feature, k and n are integers greater than 1, j is a positive integer less than or equal to m, and m is the number of features of the samples held by the second electronic device; a first generating module, configured to generate, by using the first matrix and a second matrix, n third matrices in one-to-one correspondence with the n samples, wherein the second matrix is an n×m matrix whose j-th column comprises n first random numbers in one-to-one correspondence with the j-th feature of the n samples, each third matrix is a k×m matrix, the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with the k bins of the j-th feature, and the second WOE value corresponding to the t-th bin of the j-th feature is obtained from the first WOE value corresponding to the t-th bin of the j-th feature and the first random number of the j-th feature of the i-th sample taken with a first sign; a first sending module, configured to send the n third matrices to the second electronic device, wherein the n third matrices are used by the second electronic device to generate a fifth matrix, the fifth matrix is an n×m matrix whose j-th column comprises n second WOE values in one-to-one correspondence with the j-th feature of the n samples, and the second WOE value corresponding to the j-th feature of the i-th sample is the second WOE value that, in the third matrix corresponding to the i-th sample, corresponds to the bin of the j-th feature to which the i-th sample belongs as determined by the second electronic device; and a second generating module, configured to generate a fourth matrix, wherein the fourth matrix is an n×m matrix whose j-th column comprises n second random numbers in one-to-one correspondence with the j-th feature of the n samples, the second random number corresponding to the j-th feature of the i-th sample is the first random number corresponding to the j-th feature of the i-th sample taken with a second sign, and the first sign is opposite to the second sign.
  8. A WOE encoding apparatus for federated learning, the apparatus comprising: a second receiving module, configured to receive n third matrices that are sent by a first electronic device and are in one-to-one correspondence with n samples, wherein each third matrix is a k×m matrix, and the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with k bins of the j-th feature; the third matrices are generated by the first electronic device by using a first matrix and a second matrix, the first matrix is a k×m matrix whose j-th column comprises k first WOE values in one-to-one correspondence with the k bins of the j-th feature, the second matrix is an n×m matrix whose j-th column comprises n first random numbers in one-to-one correspondence with the j-th feature of the n samples, and each second WOE value is obtained from the corresponding first WOE value and the first random number taken with a first sign; and a second determining module, configured to determine the second WOE value of the bin of the j-th feature corresponding to the i-th sample as the second WOE value corresponding to the j-th feature of the i-th sample, to obtain a fifth matrix, wherein the fifth matrix is an n×m matrix whose j-th column comprises n second WOE values in one-to-one correspondence with the j-th feature of the n samples; wherein the first electronic device further generates a fourth matrix, the fourth matrix is an n×m matrix whose j-th column comprises n second random numbers in one-to-one correspondence with the j-th feature of the n samples, the second random numbers are the first random numbers taken with a second sign, and the first sign is opposite to the second sign.
  9. A WOE encoding device for federated learning, comprising a processor and a memory storing computer program instructions, wherein the processor, when executing the computer program instructions, implements the WOE encoding method for federated learning according to any one of claims 1 to 3 or the WOE encoding method for federated learning according to any one of claims 4 to 6.
  10. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the WOE encoding method for federated learning of any one of claims 1 to 3, or the WOE encoding method for federated learning of any one of claims 4 to 6.
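
The exchange in claims 3, 5 and 6 (encrypted labels sent to the second device, per-bin counts returned, WOE values computed after decryption) can be sketched as follows. This is an illustrative sketch only, under several assumptions: the claims do not name a cryptosystem, so a textbook Paillier scheme with toy-sized primes stands in for it; the function names are hypothetical; a label of 1 is taken to mark a negative sample, as claim 6 implies; and the WOE formula uses the negative-sample ratio as the numerator, which is an assumed orientation.

```python
import math
import random

# Toy additively homomorphic encryption (textbook Paillier, g = n + 1).
# ASSUMPTION: the claims do not specify the scheme. These primes are far
# too small to be secure; they only keep the example fast.

def keygen(p=1009, q=1013):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n, m, rng):
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts.
    return (c1 * c2) % (n * n)

def encrypted_bin_counts(n, enc_labels, bins, k, rng):
    """Second device, one feature: bins[i] is the bin index (0..k-1) of sample i.

    Following claim 6: negative count = sum of encrypted labels in the bin,
    positive count = bin size - negative count (computed on ciphertexts).
    """
    n2 = n * n
    result = []
    for t in range(k):
        members = [i for i, b in enumerate(bins) if b == t]
        enc_neg = encrypt(n, 0, rng)
        for i in members:
            enc_neg = add(n, enc_neg, enc_labels[i])
        enc_size = encrypt(n, len(members), rng)
        enc_pos = add(n, enc_size, pow(enc_neg, -1, n2))  # size - neg
        result.append((enc_pos, enc_neg))
    return result

def woe_from_counts(counts):
    """First device: counts is a list of (positive, negative) pairs per bin.

    ASSUMED orientation: ln((neg_t / neg_total) / (pos_t / pos_total)).
    Every bin must contain at least one positive and one negative sample.
    """
    pos_total = sum(pos for pos, _ in counts)
    neg_total = sum(neg for _, neg in counts)
    return [math.log((neg / neg_total) / (pos / pos_total))
            for pos, neg in counts]

# Usage: 8 samples, one feature with k = 2 bins.
rng = random.Random(7)
n_pub, priv = keygen()
labels = [1, 0, 0, 1, 1, 0, 0, 0]          # held by the first device
bins = [0, 0, 0, 0, 1, 1, 1, 1]            # held by the second device
enc_labels = [encrypt(n_pub, y, rng) for y in labels]
enc_counts = encrypted_bin_counts(n_pub, enc_labels, bins, 2, rng)
counts = [(decrypt(n_pub, priv, cp), decrypt(n_pub, priv, cn))
          for cp, cn in enc_counts]
woe = woe_from_counts(counts)
```

In this sketch the second device never sees a plaintext label, and the first device sees only per-bin counts rather than which sample fell into which bin. A real deployment would use a vetted library and key sizes, and would need a policy for bins with zero positive or negative samples.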

Description

Federated learning WOE encoding method, apparatus, device and storage medium

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a WOE encoding method, apparatus, device and storage medium for federated learning.

Background

With the development of big data, attention to data privacy and data security has become a worldwide trend. Federated learning (Federated Learning) was introduced so that multiple participants can build a model jointly while data privacy and data security are protected.

In federated learning, features need to be encoded by Weight of Evidence (WOE) values, which reflect the distribution of positive and negative samples. In conventional machine learning modeling, the WOE value can be calculated from the labels of the samples by formula (1), in which WOE_t represents the WOE value of the t-th bin, Bad_t_sum represents the number of negative samples in the t-th bin, Bad_Total represents the number of negative samples over all bins, Good_t_sum represents the number of positive samples in the t-th bin, and Good_Total represents the number of positive samples over all bins.

However, in federated learning with multiple participants, where only one participant holds the labels of the samples and the other participants hold the distribution information of the features of the samples, how to perform WOE encoding without revealing either the labels or the feature distribution information is a problem that remains to be solved.

Disclosure of Invention

The embodiments of the application provide a WOE encoding method, apparatus, device and storage medium for federated learning, which can perform WOE encoding in federated learning without leaking the labels of the samples or the distribution information of the features of the samples.

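
For reference, formula (1) from the Background can be written as below. This is a sketch under an assumption: the negative-sample ratio is placed in the numerator, following the order in which the symbols are introduced in the text. The opposite orientation is also a common WOE convention and would only flip the sign of every value.

```latex
\mathrm{WOE}_t
  = \ln\!\left(
      \frac{\mathrm{Bad}_{t\_sum} \,/\, \mathrm{Bad}_{Total}}
           {\mathrm{Good}_{t\_sum} \,/\, \mathrm{Good}_{Total}}
    \right)
\tag{1}
```

As a small worked case under that assumption: with 3 negative and 5 positive samples in total, a bin holding 2 negatives and 2 positives has WOE = ln((2/3)/(2/5)) = ln(5/3) ≈ 0.51.
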
In a first aspect, an embodiment of the present application provides a WOE encoding method for federated learning, applied to a first electronic device, the method comprising: obtaining a first matrix, wherein the first matrix is a k×m matrix whose j-th column comprises k first WOE values in one-to-one correspondence with k bins of a j-th feature, the k bins of the j-th feature are obtained by a second electronic device binning n samples according to the j-th feature, k and n are integers greater than 1, j is a positive integer less than or equal to m, and m is the number of features of the samples held by the second electronic device; generating, by using the first matrix and a second matrix, n third matrices in one-to-one correspondence with the n samples, wherein the second matrix is an n×m matrix whose j-th column comprises n first random numbers in one-to-one correspondence with the j-th feature of the n samples, each third matrix is a k×m matrix, the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with the k bins of the j-th feature, and the second WOE value corresponding to the t-th bin of the j-th feature is obtained from the first WOE value corresponding to the t-th bin of the j-th feature and the first random number of the j-th feature of the i-th sample taken with a first sign; sending the n third matrices to the second electronic device; and generating a fourth matrix, wherein the fourth matrix is an n×m matrix whose j-th column comprises n second random numbers in one-to-one correspondence with the j-th feature of the n samples, the second random number corresponding to the j-th feature of the i-th sample is the first random number corresponding to the j-th feature of the i-th sample taken with a second sign, and the first sign is opposite to the second sign.
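
The matrix construction in the first aspect can be sketched in a few lines of Python. This is a minimal illustration rather than the application's implementation: it assumes the additive form of claim 2 (masked value = WOE value + random number, fourth matrix = negated random numbers), draws the random numbers from a uniform range chosen arbitrarily here, and uses hypothetical function names. The second function shows the bin lookup that the second electronic device performs to build the fifth matrix.

```python
import random

def mask_woe(first, n, rng, scale=1000.0):
    """First device: build n third matrices and the fourth matrix.

    first: k x m list of lists of WOE values (first matrix).
    Returns (thirds, fourth), where thirds[i] is the k x m third matrix
    of sample i and fourth is the n x m matrix of negated random numbers.
    ASSUMPTION: masks are uniform floats in [-scale, scale].
    """
    k, m = len(first), len(first[0])
    # Second matrix: one random number per (sample, feature).
    second = [[rng.uniform(-scale, scale) for _ in range(m)] for _ in range(n)]
    # Every bin of feature j is shifted by the same mask second[i][j].
    thirds = [[[first[t][j] + second[i][j] for j in range(m)]
               for t in range(k)]
              for i in range(n)]
    fourth = [[-second[i][j] for j in range(m)] for i in range(n)]
    return thirds, fourth

def pick_by_bin(thirds, bins):
    """Second device: build the fifth matrix.

    bins[i][j] is the bin index of sample i for feature j, which only
    the second device knows.
    """
    return [[thirds[i][bins[i][j]][j] for j in range(len(bins[i]))]
            for i in range(len(bins))]

# Usage: k = 2 bins, m = 2 features, n = 3 samples.
first = [[0.5, -0.2],
         [-0.6, 0.3]]
bins = [[0, 1], [1, 1], [0, 0]]
thirds, fourth = mask_woe(first, n=3, rng=random.Random(0))
fifth = pick_by_bin(thirds, bins)
# Adding the two n x m matrices gives the WOE value of each sample's bin.
recovered = [[fifth[i][j] + fourth[i][j] for j in range(2)] for i in range(3)]
```

Because the first device sends a full column of k masked values per sample and feature, it does not learn which bin the second device selects; the second device in turn sees only masked values.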
In a second aspect, an embodiment of the present application provides a WOE encoding method for federated learning, applied to a second electronic device, the method comprising: receiving n third matrices that are sent by a first electronic device and are in one-to-one correspondence with n samples, wherein each third matrix is a k×m matrix, and the j-th column of the third matrix corresponding to the i-th sample comprises k second WOE values in one-to-one correspondence with k bins of the j-th feature; and determining the second WOE value of the bin of the j-th feature corresponding to the i-th sample as the second WOE value corresponding to the j-th feature of the i-th sample, to obtain a fifth matrix, wherein the fifth matrix is an n×m matrix whose j-th column comprises n second WOE values in one-to-one correspondence with the j-th feature of the n samples, and the bin of the j-th feature corresponding to the i-th sample is the bin in which the i-th sample is located after the second electroni