CN-115270941-B - Method, device, equipment and storage medium for training data classification model

Abstract

The application provides a method, a device, equipment and a storage medium for training a data classification model, which can be applied in fields such as artificial intelligence and big data and address the low efficiency of training data classification models. The method comprises: based on a received data selection instruction, controlling each of a plurality of storage nodes to select a plurality of candidate data from its stored candidate data as training data; controlling each storage node to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the training data, to obtain a plurality of trained data classification models; and controlling a computing node to obtain the plurality of trained data classification models from the storage nodes and perform model fusion on them to obtain a target data classification model.

Inventors

  • CHEN YUFEI
  • LI MINGHAO
  • WEN WENLIU
  • LI ZHENDA
  • JI YONGFEI
  • DENG QI
  • LV TU
  • YANG AO
  • XIE LIYING
  • YIN ZHIHUA

Assignees

  • Tianyi Cloud Technology Co., Ltd. (天翼云科技有限公司)

Dates

Publication Date
20260421
Application Date
20220715
Priority Date
20220715

Claims (9)

  1. A method of training a data classification model, characterized by being applied to a distributed data space comprising a computing node for processing data and a plurality of storage nodes for storing data, each of the storage nodes storing backup data in other storage nodes, the method comprising: based on an SQL statement that is received by the computing node and serves as a model creation instruction, controlling each of the storage nodes to build a data classification model to be trained based on the name of the model structure and the initial model parameters carried by the model creation instruction; based on an SQL statement that is received by the computing node and serves as a data selection instruction, controlling each of the storage nodes to execute it, wherein a storage node selects a plurality of candidate data from the candidate data stored in at least one of the storage nodes to serve as a plurality of training data; controlling each of the storage nodes to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the training data, to obtain a plurality of trained data classification models; and controlling the computing node to acquire the trained model parameters of the plurality of trained data classification models from the plurality of storage nodes, to fuse the trained model parameters by weighting to obtain comprehensive model parameters, and to obtain a target data classification model based on the name of the model structure and the comprehensive model parameters.
  2. The method of claim 1, further comprising, prior to controlling each of the plurality of storage nodes to select a plurality of candidate data from its stored candidate data as a plurality of training data: when receiving a model creation instruction from a client, establishing a communication connection between the client and the computing node, wherein the model creation instruction carries initial model parameters; building a data classification model to be trained on the computing node based on the initial model parameters; and controlling the computing node to issue the data classification model to be trained to the storage nodes.
  3. The method of claim 2, wherein controlling the computing node to issue the data classification model to be trained to the plurality of storage nodes comprises: controlling the computing node to store the data classification model to be trained in a cache; receiving a model training instruction from the client and controlling the computing node to forward the model training instruction to the plurality of storage nodes; and controlling each of the storage nodes to read the data classification model to be trained from the cache when it receives the model training instruction.
  4. The method of claim 1, wherein controlling each of the plurality of storage nodes to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the selected plurality of training data, to obtain a plurality of trained data classification models, comprises performing the following operations in each round of iterative training: controlling each of the storage nodes to perform one round of iterative training on its corresponding data classification model to be trained based on one item of training data, to obtain the corresponding training loss and to record the corresponding training time; controlling each of the storage nodes to enter the next round of iterative training when the obtained training loss is determined not to meet the corresponding training target or when the recorded training time is determined to exceed the corresponding preset duration; and controlling each of the storage nodes to obtain a trained data classification model when the obtained training loss meets the corresponding training target, thereby obtaining the plurality of trained data classification models.
  5. The method of claim 1, wherein controlling the computing node to build the target data classification model based on the comprehensive model parameters and a model structure of the trained data classification model comprises: controlling the computing node to construct a data classification model to be verified based on the comprehensive model parameters and the model structure of the trained data classification model; controlling each of the plurality of storage nodes to select a plurality of candidate data from its stored candidate data as a plurality of verification data, wherein the verification data have associated classification labels; controlling the computing node to receive the plurality of verification data from the plurality of storage nodes and to determine the predicted classification of each of the verification data using the data classification model to be verified; and controlling the computing node to take the data classification model to be verified as the target data classification model when the error between the obtained predicted classifications and the corresponding classification labels is determined to meet a preset error condition.
  6. An apparatus for training a data classification model, applied to a distributed data space, the distributed data space comprising a computing node and a plurality of storage nodes, the computing node configured to process data, the storage nodes configured to store data, each of the storage nodes having backup data stored in other storage nodes, the apparatus comprising: an acquisition module, configured to control each of the storage nodes, based on an SQL statement that is received by the computing node and serves as a model creation instruction, to build a data classification model to be trained based on the name of the model structure and the initial model parameters carried by the model creation instruction, and to control each of the storage nodes to execute an SQL statement that is received by the computing node and serves as a data selection instruction, wherein a storage node selects a plurality of candidate data from the candidate data stored in at least one of the storage nodes to serve as a plurality of training data; and a processing module, configured to control each of the storage nodes to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the training data, to obtain a plurality of trained data classification models; the processing module being further configured to control the computing node to acquire the trained model parameters of the plurality of trained data classification models from the plurality of storage nodes, to fuse the trained model parameters by weighting to obtain comprehensive model parameters, and to obtain a target data classification model based on the name of the model structure and the comprehensive model parameters.
  7. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 5.
  8. A computer device, comprising: a memory for storing program instructions; and a processor for invoking the program instructions stored in the memory and executing the method of any one of claims 1 to 5 according to the obtained program instructions.
  9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 5.
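
The weighted fusion step recited in claim 1 can be illustrated with a minimal Python sketch. The claims do not say how the weights are chosen or how parameters are laid out, so the names below (`Params`, `fuse_parameters`), the flat-list parameter layout, and the choice of weighting each storage node by its training-set size are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

# Assumed layout: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def fuse_parameters(node_params: Sequence[Params], weights: Sequence[float]) -> Params:
    """Weighted average of per-node parameters (one possible form of weighted fusion)."""
    if not node_params or len(node_params) != len(weights):
        raise ValueError("need one weight per storage node")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused: Params = {}
    for name, first in node_params[0].items():
        acc = [0.0] * len(first)
        for params, weight in zip(node_params, weights):
            values = params[name]
            if len(values) != len(first):
                raise ValueError(f"shape mismatch for {name!r}")
            for i, value in enumerate(values):
                # Normalise each weight so the contributions sum to one.
                acc[i] += (weight / total) * value
        fused[name] = acc
    return fused


# Hypothetical example: two storage nodes, weighted 1:3 (e.g. by training-set size).
node_a = {"w": [1.0, 2.0], "b": [0.0]}
node_b = {"w": [3.0, 4.0], "b": [1.0]}
fused = fuse_parameters([node_a, node_b], weights=[1, 3])
```

Because only parameters travel to the computing node, this kind of fusion requires every storage node to train the same model structure, which is consistent with the claim's use of a shared model structure name.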

Description

Method, device, equipment and storage medium for training data classification model

Technical Field

The present application relates to the field of computer technologies, and in particular to a method, an apparatus, a device, and a storage medium for training a data classification model.

Background

With the continuous development of technology, more and more devices can provide data classification services through a trained data classification model; such services determine the category to which data belongs.

In the related art, a trained data classification model is generally obtained as follows: a device builds a data classification model to be trained locally and then sends a data acquisition request to a storage unit, and the storage unit returns the training data to the device in response to the request. The device obtains a trained data classification model after performing multiple rounds of iterative training locally on the training data it has obtained. Once the trained model is available, the device can read stored data from the storage unit and use the model to classify the stored data in turn.

However, the device must send a data acquisition request to the storage unit and wait for the storage unit to respond before it obtains any training data. Acquiring the training data is therefore cumbersome, which makes training the data classification model inefficient in the related art.
Disclosure of Invention

The embodiments of the application provide a method, a device, computer equipment and a storage medium for training a data classification model, which are used to solve the problem of low efficiency in training data classification models.

In a first aspect, a method of training a data classification model is provided, applied to a distributed data space that includes a computing node for processing data and a plurality of storage nodes for storing data, the method comprising: based on a received data selection instruction, controlling each of the storage nodes to select a plurality of candidate data from its stored candidate data as a plurality of training data; controlling each of the storage nodes to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the training data, to obtain a plurality of trained data classification models; and controlling the computing node to acquire the plurality of trained data classification models from the plurality of storage nodes and to perform model fusion on them to obtain a target data classification model.

Optionally, before controlling each of the plurality of storage nodes to select a plurality of candidate data from its stored candidate data as a plurality of training data, the method further includes: when receiving a model creation instruction from a client, establishing a communication connection between the client and the computing node, wherein the model creation instruction carries initial model parameters; building a data classification model to be trained on the computing node based on the initial model parameters; and controlling the computing node to issue the data classification model to be trained to the storage nodes.
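
The three steps of the first aspect (local data selection, local training, fusion on the computing node) can be sketched end to end in Python. This is a toy illustration under stated assumptions, not the patented implementation: the `StorageNode` class, the predicate standing in for the data selection instruction, the one-feature logistic model, and the size-weighted averaging in `fuse` are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Sample = Tuple[float, int]  # (feature, label in {0, 1}); a toy one-feature dataset


@dataclass
class StorageNode:
    candidates: List[Sample]
    training: List[Sample] = field(default_factory=list)

    def select(self, predicate: Callable[[Sample], bool]) -> None:
        # Stand-in for the data selection instruction: filter locally,
        # so no candidate data has to leave the storage node.
        self.training = [s for s in self.candidates if predicate(s)]

    def train(self, w: float, b: float, rounds: int = 200, lr: float = 0.5) -> Tuple[float, float]:
        # Gradient descent on logistic loss over the node's own training data.
        for _ in range(rounds):
            grad_w = grad_b = 0.0
            for x, y in self.training:
                p = 1.0 / (1.0 + math.exp(-(w * x + b)))
                grad_w += (p - y) * x
                grad_b += p - y
            n = max(len(self.training), 1)
            w -= lr * grad_w / n
            b -= lr * grad_b / n
        return w, b


def fuse(results: List[Tuple[float, float]], sizes: List[int]) -> Tuple[float, float]:
    # Computing-node step: average parameters, weighted by training-set size (an assumption).
    total = sum(sizes)
    w = sum(r[0] * s for r, s in zip(results, sizes)) / total
    b = sum(r[1] * s for r, s in zip(results, sizes)) / total
    return w, b


nodes = [
    StorageNode([(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]),
    StorageNode([(-3.0, 0), (-0.5, 0), (0.5, 1), (3.0, 1), (100.0, 1)]),
]
for node in nodes:
    node.select(lambda s: abs(s[0]) <= 10.0)  # hypothetical filter that drops outliers

# Every node starts from the same initial parameters and trains independently.
results = [node.train(w=0.0, b=0.0) for node in nodes]
w, b = fuse(results, [len(node.training) for node in nodes])


def classify(x: float) -> int:
    # Target model built from the fused parameters.
    return 1 if w * x + b > 0 else 0
```

The point of the sketch is the data flow: only the selection predicate and the model parameters cross node boundaries, while the candidate data stays where it is stored.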
Optionally, controlling the computing node to issue the data classification model to be trained to the plurality of storage nodes includes: controlling the computing node to store the data classification model to be trained in a cache; receiving a model training instruction from the client and controlling the computing node to forward the model training instruction to the plurality of storage nodes; and controlling each of the storage nodes to read the data classification model to be trained from the cache when it receives the model training instruction.

Optionally, controlling each of the plurality of storage nodes to perform multiple rounds of iterative training on its corresponding data classification model to be trained based on the selected training data, to obtain a plurality of trained data classification models, includes performing the following operations in each round of iterative training: controlling each of the storage nodes to perform one round of iterative training on its corresponding data classification model to be trained based on one item of training data, to obtain the corresponding training loss and to record the corresponding training time; respectively controlling the storage nodes, and entering the next round of iterative training when the obtained training loss is determined to not meet the corresp