CN-121996251-A - Data collection system for network field
Abstract
The invention provides a data collection system oriented to the network field. The system performs data collection based on distributed heterogeneous computing resource nodes. A user only needs to define, through a declarative experiment specification, the task pipeline and the node types to be used for data collection; the data collection category is determined according to the task pipeline. The system receives the experiment specification submitted by the user, compiles it into an instruction set for environment configuration and atomic task execution, and issues the instruction set to target nodes in the data generation module. Scheduling of resource nodes, environment configuration, and atomic task execution are then completed automatically according to the instruction set, yielding sample data under the category to which the task pipeline belongs. Because the system automatically compiles experiment specifications into executable instructions for specific nodes and environments, the complexity of writing experiment scripts is reduced, and the efficiency, scale, and reliability of data production are improved.
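The abstract's workflow (declarative specification in, per-node instruction sets out) can be illustrated with a minimal sketch. All field names, task names, and the instruction format below are hypothetical illustrations, not terms defined by the patent.

```python
# Hypothetical sketch: a declarative experiment specification is "compiled"
# into per-node-type instruction sets for environment configuration and
# atomic task execution, mirroring the abstract's described workflow.

experiment_spec = {
    "collection_category": "traffic_classification",  # illustrative category
    "pipeline": [  # atomic tasks with dependency relationships
        {"task": "server_deploy", "node_type": "server", "depends_on": []},
        {"task": "traffic_generate", "node_type": "client", "depends_on": ["server_deploy"]},
        {"task": "packet_capture", "node_type": "server", "depends_on": ["server_deploy"]},
    ],
}

def compile_spec(spec):
    """Group atomic tasks by node type and emit env-setup and exec instructions."""
    instruction_sets = {}
    for step in spec["pipeline"]:
        inst = instruction_sets.setdefault(
            step["node_type"], {"env_config": [], "exec": []}
        )
        # Environment configuration instruction for this task's dependencies.
        inst["env_config"].append(f"prepare environment for {step['task']}")
        # Atomic task execution instruction, preserving declared ordering.
        inst["exec"].append({"run": step["task"], "after": step["depends_on"]})
    return instruction_sets

instructions = compile_spec(experiment_spec)
```

In the patent's terms, `instructions["server"]` and `instructions["client"]` would each be issued to target nodes of the corresponding node type.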
Inventors
- LIU ZIHAO
- ZHANG GUANGXING
- QIAO MINGYU
- JIANG HAIYANG
- GONG YI
Assignees
- Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Dates
- Publication Date
- 20260508
- Application Date
- 20260115
Claims (10)
- 1. A data collection system oriented to the network field, characterized in that it performs data collection based on a plurality of distributed heterogeneous computing resource nodes, wherein the system comprises a problem abstraction module, a platform core module and a data generation module, and the data generation module comprises the plurality of heterogeneous computing resource nodes; the problem abstraction module is configured to receive an experiment specification for data collection constructed by a user according to data collection requirements determined by a machine learning downstream task, the experiment specification comprising a task pipeline for data collection and a node type for executing tasks, wherein the task pipeline comprises a plurality of atomic tasks with dependency relationships, the data collection category is determined according to the task pipeline, and the node type comprises a server type and a client type; the platform core module is configured to compile and generate an executable instruction set for environment configuration and atomic task execution according to the experiment specification, and to issue the instruction set to a target node in the data generation module; the data generation module is configured to configure an environment on the target node according to the issued instruction set, execute the atomic tasks in the order given by the task pipeline to obtain the execution state and raw result data of each atomic task, and upload the execution state and raw result data to the platform core module; and the platform core module is further configured to collect the raw result data to obtain sample data under the category to which the corresponding task pipeline belongs.
- 2. The system of claim 1, wherein the platform core module comprises a coordination service unit and a deployment service unit, the deployment service unit comprising a compiler and a connection manager; the coordination service unit is configured to receive the experiment specification and request the compiler to parse the experiment specification to obtain a parsing result; the deployment service unit is configured to: compile and generate, by the compiler and according to the parsing result, an executable instruction set for environment configuration and atomic task execution, the executable instruction set comprising environment configuration instructions and atomic task execution instructions; and issue, through the connection manager, the instruction set to target nodes under the corresponding node type according to the node type in the experiment specification.
- 3. The system of claim 2, further comprising a database and a deployment module coupled to the connection manager, the database being used to persist the executable instruction set for environment configuration and atomic task execution, wherein configuring the environment on the target node comprises: the connection manager issues environment configuration instructions to the target node through the deployment module to complete environment preparation on the target node, and after the environment preparation is completed, verifies whether the environment meets the requirements; and after the environment meets the requirements, the connection manager instantiates, through the deployment module, an executor on the target node that waits to receive atomic task execution instructions to execute the atomic tasks of the data collection.
- 4. The system of claim 3, wherein the platform core module further comprises an execution service unit, and wherein the executor on the target node executes an atomic task according to a received atomic task execution instruction to obtain the execution state and raw result data of the atomic task; the platform core module is further configured such that: a gateway of the execution service unit receives the execution state of the atomic task and the raw result data obtained by the executor, and performs data processing on the raw result data to obtain processed raw result data; and the coordination service unit judges whether all atomic tasks in the task pipeline have been executed or a termination condition has been reached, and when all atomic tasks in the task pipeline have been executed or the termination condition has been reached, sends a resource cleaning instruction to the target node to control the target node to clean up resources, the cleaning comprising stopping temporary services, deleting temporary files and uninstalling temporary software.
- 5. The system of claim 4, wherein the execution service unit comprises a gateway and a processor, the gateway being configured to receive the execution state of the atomic task and the raw result data obtained by the executor and to transmit the received execution state and raw result data to the processor; the processor is configured to perform data processing on the raw result data to obtain processed raw result data and to write the execution state and the processed raw result data into the database for persistent storage, the data processing comprising format conversion and data verification; and the coordination service unit is further configured to judge, according to the latest execution states of all atomic tasks in the task pipeline periodically retrieved from the database, whether all the atomic tasks have been executed or a termination condition has been reached.
- 6. The system of claim 2, wherein the system comprises a task library composed of a plurality of atomic tasks, including a port scanning task, a traffic generation task, a server deployment task and a packet capture task; and the task pipeline is constructed by obtaining a plurality of atomic tasks from the task library according to the data collection requirements and combining the atomic tasks according to set dependency relationships to obtain the task pipeline.
- 7. The system of claim 6, wherein each network infrastructure device is considered a heterogeneous computing resource node, and the compiler is configured to: retrieve and parse all atomic tasks from the task library according to the task pipeline to obtain the parsing result; and generate the executable instruction set for environment configuration and atomic task execution according to the parsing result and the node type in the experiment specification, wherein the instruction set can run on network infrastructure devices in various network environments, the various network environments comprising a local machine room, an AWS cluster, a campus network, a hybrid cloud or a multi-cloud environment.
- 8. The system of claim 1, wherein the system records corresponding traceable context information during the collection of each sample data item, the context information including the experiment specification, the executable instruction set for environment configuration and atomic task execution, and the execution order of the atomic tasks in the task pipeline.
- 9. The system of claim 1, wherein the machine learning downstream task is a brute-force cracking detection task, a malware identification task, a video fingerprint identification task, a VPN classification task, or an APT detection task.
- 10. A data collection method implemented on the basis of the system of any one of claims 1 to 9, characterized in that the method comprises: receiving, through the problem abstraction module, an experiment specification for data collection constructed by a user according to data collection requirements determined by a machine learning downstream task, wherein the experiment specification comprises a task pipeline for data collection and a node type for executing tasks, the task pipeline comprises a plurality of atomic tasks with dependency relationships, the data collection category is determined according to the task pipeline, and the node type comprises a server type and a client type; compiling and generating, through the platform core module, an executable instruction set for environment configuration and atomic task execution according to the experiment specification, and issuing the instruction set to a target node in the data generation module; configuring, through the data generation module, an environment on the target node according to the issued instruction set, executing the atomic tasks in the order given by the task pipeline to obtain the execution state and raw result data of the atomic tasks, and uploading the execution state and raw result data to the platform core module; and collecting, through the platform core module, the raw result data to obtain sample data under the category to which the corresponding task pipeline belongs.
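The claims repeatedly require executing atomic tasks "in the order given by the task pipeline," i.e., an order that respects the declared dependency relationships. A minimal sketch of such an ordering step follows; the depth-first topological sort and the example task names are illustrative assumptions, not the patent's prescribed implementation.

```python
# Hedged sketch: order atomic tasks so every task runs after its declared
# prerequisites (a simple depth-first topological sort with cycle detection).

def execution_order(pipeline):
    """pipeline: dict mapping task name -> list of prerequisite task names."""
    order, done = [], set()

    def visit(task, path=()):
        if task in done:
            return
        if task in path:  # a dependency cycle makes the pipeline unrunnable
            raise ValueError(f"dependency cycle at {task}")
        for dep in pipeline[task]:
            visit(dep, path + (task,))
        done.add(task)
        order.append(task)

    for task in pipeline:
        visit(task)
    return order

# Example pipeline built from atomic tasks like those named in claim 6.
pipeline = {
    "server_deploy": [],
    "traffic_generate": ["server_deploy"],
    "packet_capture": ["server_deploy"],
    "port_scan": ["traffic_generate", "packet_capture"],
}
print(execution_order(pipeline))
```

In the system described above, each executor would receive its atomic task execution instructions in an order consistent with such a sort, reporting execution state and raw result data back after each task.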
Description
Data collection system for network field
Technical Field
The invention relates to the field of machine learning and artificial intelligence, in particular to data acquisition for machine learning and artificial intelligence, and more particularly to a data collection system oriented to the network field.
Background
Machine learning (ML) and artificial intelligence (AI) technologies have shown great potential in the field of network operation and security, and are widely expected to solve complex problems such as distributed denial-of-service (DDoS) attack detection, malware classification, and network intrusion recognition. However, although high-performance models appear frequently in academic research reports, deployment of these models in actual industrial production environments remains rare, and a core obstacle is the serious lack of high-quality training data for network-oriented models. The vitality of a model stems from its data, and the "data bottleneck" faced by AI models in the network domain directly results in a lack of generalization capability: a model that performs excellently in one specific environment (such as a laboratory simulation or a particular enterprise network) degrades significantly when deployed in real heterogeneous network environments with different user behaviors, network topologies, or attack approaches. Therefore, constructing a tool capable of efficiently and automatically generating high-quality, "AI-ready" network data has become a key premise for advancing network AI.
Currently, the main acquisition technology paths for supplying data to train network AI models have inherent drawbacks that make it difficult to meet large-scale, high-quality data requirements. Some of these acquisition technologies are as follows:
1) Simulation-based data collection. This method generates data by simulating network traffic and attack behavior in a simulation environment (e.g., NS-3, Mininet). Its advantages are accurate data labels, strong experimental reproducibility, and convenience for academic research, as in reference [1]. However, its limitations are significant: existing simulation data sets (such as DARPA 1998 and CIC-IDS) are mostly customized for specific research targets, so the data collection work is highly specialized and cannot be reused. For example, in a data acquisition process customized for video fingerprinting, because the final goal of data acquisition, the required data characteristics, and the data labeling logic are completely different, the collection code and process cannot be reused (they cannot, for instance, be used directly for quality-of-experience (QoE) inference tasks), causing huge amounts of repeated development. Meanwhile, a simulation environment is a simplified model of a real network and can hardly reproduce its full complexity. Models trained on such simulated data tend to learn specific shortcuts or biases of the simulation environment and fail to capture key features of the real network, resulting in serious out-of-distribution generalization problems at actual deployment: a model that performs excellently on the training data suffers a dramatic performance drop when it encounters new data whose distribution differs from that of the training data, as in reference [2].
For example, when an intrusion detection model deployed in a campus network environment is migrated to an enterprise cloud network environment, its performance may degrade.
2) Passive data acquisition. This method passively captures real traffic (e.g., PCAP data) directly from production networks through mirroring, probes, and similar techniques. Its greatest value is the authenticity of the data. On the one hand, however, labeling costs are high and quality is difficult to control: distinguishing normal from abnormal behavior in massive traffic data depends heavily on manual analysis by security experts, which is expensive, inefficient, and hard to keep consistent, and this has become a major bottleneck for applying supervised learning models. On the other hand, real traffic contains a large amount of sensitive information whose collection, storage, and use face serious legal and regulatory challenges, making high-quality data sets difficult to publish and share.
3) Active data acquisition. This method actively generates and collects traffic in a real or near-real target environment and is considered an ideal way to produce data that is both authentic and labelable. But its implementation is extremely complex and involves a tedious, error-prone process. First, heterogeneous hardware devices (such as x86 servers, ARM-architecture Raspberry Pis, and cloud virtual machines) and complex software dependencies need to be configured during th