EP-4740151-A1 - LLM-BASED AUTOMATIC DISCOVERY OF OT AND IT NETWORKS

Abstract

The presently disclosed subject matter includes a computer system and computer-implemented method for linguistic passive automatic discovery of operational technology (OT) and information technology (IT) networks, which utilize a novel large language model (LLM) trained to comprehensively "understand" configuration files, notwithstanding the fact that configuration data stored in configuration files is not a natural language. The discovery can be applied on networks of various sizes, including large networks comprising many assets (e.g., tens of thousands of endpoints and servers or more), as well as smaller networks such as subnets, multi sites, or even sites.

Inventors

  • FIALKOW, Tal

Assignees

  • Dream Security Ltd.

Dates

Publication Date
20260513
Application Date
20240707

Claims (20)

  1. A computer-implemented method of network discovery, comprising: collecting configuration files from at least one network, each configuration file comprising configuration data that characterizes different assets in the at least one network and their respective connections and properties; applying on the configuration files a large language model (LLM) trained to provide as output responsive to at least one received instruction, a respective response comprising specific configuration data extracted from the configuration files; wherein the LLM is capable of retrieving configuration data pertaining to multiple assets and their respective connections in the at least one network.
  2. The computer-implemented method of claim 1 comprising generating, based on the retrieved configuration data, an ontology of the at least one network including, at least part of: information on the assets, information on network topology, and information on rules and policies.
  3. The computer-implemented method of claim 1 or 2 comprising: using the LLM for generating, for each asset and/or connection in the at least one network, a respective data object, wherein all data-objects are characterized by a uniform data-object format.
  4. The computer-implemented method of any one of the preceding claims comprising: generating a graph representing the retrieved configuration data, wherein each data-object of a respective asset retrieved from the retrieved configuration data is designated as a node in the graph, and each connection retrieved from the retrieved configuration data is designated as a vertex in the graph connecting between two nodes, and where each node or vertex is stored with its respective properties.
  5. The computer-implemented method of any one of the preceding claims comprising: applying the LLM on the configuration files in the at least one network for determining a type of configuration file, and, responsive to determining a configuration file of a certain type, automatically generating a regular expression adapted for parsing the configuration file and retrieving from the configuration file the respective configuration data.
  6. The computer-implemented method of any one of the preceding claims comprising: applying the LLM on the configuration files in the at least one network for determining a type of configuration file, and, responsive to determining a configuration file of a certain type, automatically generating a computer program code adapted for parsing the configuration file and retrieving from the configuration file the respective configuration data.
  7. The computer-implemented method of any of the preceding claims, wherein the LLM includes a transformer model that provides a next token in a sequence based on previous tokens in the sequence.
  8. The computer-implemented method of any one of the preceding claims comprising: comparing configuration data in a respective response of one or more instructions to configuration data in the configuration files, and, in case of a discrepancy, discarding the respective response and re-processing the one or more instructions by the LLM.
  9. The computer-implemented method of any one of the preceding claims wherein a training dataset for training the LLM comprises synthetic configuration files.
  10. The computer-implemented method of claim 9, wherein the synthetic configuration files are generated by a process that utilizes network design software, comprising: manipulating parameters of the network design software to thereby obtain a diverse collection of network designs; and executing the network designs to thereby obtain a respective diverse collection of configuration files.
  11. The computer-implemented method of claim 10, wherein manipulation of parameters of network design software includes intentionally inserting errors into network designs in the diverse collection of network designs, thereby obtaining configuration files that contain errors.
  12. The computer-implemented method of any one of claims 9 to 11, wherein the training dataset includes an assembly of the synthetic configuration files, proprietary configuration files, and human annotated configuration files.
  13. The computer-implemented method of any one of claims 1 to 12, wherein the training dataset comprises providing, as input, multiple instruction-response pairs, and training the LLM to provide, in response to a given instruction, a respective response comprising specific configuration data extracted from the configuration files.
  14. The computer-implemented method of claim 13, wherein the multiple instruction-response pairs include instruction-Regex pairs used for training the LLM to automatically generate, responsive to determining that a configuration file is of a certain type, a regular expression adapted for parsing the configuration file and retrieving from the configuration file the respective configuration data.
  15. The computer-implemented method of claim 13, wherein the multiple instruction-response pairs include instruction-computer code pairs used for training the LLM to automatically generate, responsive to determining that a configuration file is of a certain type, a computer program code adapted for parsing the configuration file and retrieving from the configuration file the respective configuration data.
  16. The computer-implemented method of any one of the preceding claims, wherein the at least one network is an operation technology network and/or an information technology network.
  17. The computer-implemented method of any one of claims 1 to 16 wherein the LLM includes 'n' embedded tokens, each generated from a respective token extracted from the configuration files, the computer-implemented method comprising: applying in the LLM a selective attention mechanism, when providing a response to a received instruction, comprising: selecting a subset of 'm' from the 'n' embedded tokens and using only the embedded tokens in the subset when applying the attention mechanism.
  18. The computer-implemented method of claim 17, further comprising: for at least one embedded token generated from the received instruction: calculating a dot product using the subset of 'm' embedded tokens, giving rise to n/m dot products; selecting from the subset, a group of k embedded tokens with the highest respective dot products; expanding each of 'k' embedded tokens with a plurality of additional 'p' adjacent tokens, giving rise to a final collection of n/m + k * p embedded tokens; applying a normalization function on respective dot product of the final collection of embedded tokens, thereby obtaining a respective set of weights, and applying the weights for obtaining a contextualized vector for the at least one embedded token.
  19. A computer program product comprising a computer readable storage medium retaining a program of instructions, which, when read by a computer processor, causes the computer processor to perform a method according to any one of claims 1 to 18.
  20. A non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform a method according to any one of claims 1 to 18.
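
Illustrative note (not part of the claims): the selective attention steps recited in claims 17 and 18 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the applicant's implementation. The claims do not say how the subset is chosen or which neighbours count as "adjacent", so the sketch assumes the subset is every stride-th token and that adjacency means the nearest positions on either side. It also assumes softmax as the normalization function and uses the embeddings themselves as values, with no learned projections. The names selective_attention, stride, k and p are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_attention(
    query: Vector, tokens: Sequence[Vector], stride: int, k: int, p: int
) -> Tuple[List[float], List[int]]:
    """Attend over a coarse subset plus neighbours of its k best matches."""
    n = len(tokens)
    # Coarse subset of the n embedded tokens (assumption: every stride-th one).
    subset = list(range(0, n, stride))
    # One dot product per token in the subset.
    scores = {i: dot(query, tokens[i]) for i in subset}
    # The k subset tokens with the highest dot products.
    top = sorted(subset, key=lambda i: scores[i], reverse=True)[:k]
    # Expand each top token with up to p adjacent tokens (assumption: nearest
    # positions, right then left). Duplicates are skipped, so the final
    # collection can be smaller than len(subset) + k * p.
    selected = set(subset)
    for i in top:
        added, offset = 0, 1
        while added < p and offset < n:
            for j in (i + offset, i - offset):
                if added < p and 0 <= j < n and j not in selected:
                    selected.add(j)
                    scores[j] = dot(query, tokens[j])
                    added += 1
            offset += 1
    indices = sorted(selected)
    # Normalize the dot products of the final collection into weights.
    weights = softmax([scores[i] for i in indices])
    # Weighted sum of the selected embeddings gives the contextualized vector.
    context = [
        sum(w * tokens[i][d] for w, i in zip(weights, indices))
        for d in range(len(query))
    ]
    return context, indices
```

Calling selective_attention(query, tokens, stride=4, k=1, p=2) on eight embedded tokens scores only positions 0 and 4, then adds the two nearest neighbours of the better match before normalizing. The point of the design is that the number of dot products grows with the subset size plus k * p rather than with n.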

Description

LLM-BASED AUTOMATIC DISCOVERY OF OT AND IT NETWORKS

FIELD OF THE PRESENTLY DISCLOSED SUBJECT MATTER

The presently disclosed subject matter relates to computer systems and methods of network discovery management and cybersecurity.

BACKGROUND

Network discovery is a process dedicated to the identification and understanding of the various assets that constitute networks, and particularly operational technology (OT) networks and information technology (IT) networks. By applying network discovery in an organization, valuable information regarding the organization can be obtained, including the identification, profiling, and mapping of the different assets which constitute the OT and IT networks, and the identification of connectivity and communication pathways between these assets.

Through network discovery, organizations can improve their understanding of the network infrastructure, assess its security status, detect potential vulnerabilities, develop risk mitigation plans, and implement suitable security measures to protect their operational technology assets against malicious cyber attacks, safeguarding their reliability and availability, and increasing the overall security posture of the network.

GENERAL DESCRIPTION

Network discovery in operational technology (OT) and information technology (IT) networks is a challenging task due to technical complexities, manual labor, vendor diversity, and software version variations. Traditionally, network administrators undertake labor-intensive efforts to manually identify and document each asset within the network.

In OT environments, the presence of legacy systems, proprietary protocols, and heterogeneous architectures further complicates network discovery. The lack of standardized documentation and the dependence on vendor-specific tools often impede efficient and accurate identification of assets. Additionally, frequent software updates and varying versions further hinder the network discovery process.

In IT networks, the dynamic nature of the infrastructure, constant addition of new devices, and frequent changes in software configurations, present additional challenges. The sheer scale of IT networks, distributed across multiple sites or subnets, further exacerbates the difficulty of asset discovery. Manual inventory processes become labor-intensive and time-consuming, making it challenging to maintain an up-to-date and accurate view of the network assets.

The existing solutions for network discovery often rely on a combination of network scans, manual inventory audits, and vendor-specific tools. These methods are resource-intensive, error-prone, and time-consuming. They require significant human effort, specialized knowledge, installation of software or physical hardware within the network, and ongoing coordination with different vendors and software versions.

The presently disclosed subject matter is related to an innovative approach that addresses these challenges and streamlines the network discovery process in both OT and IT networks. The suggested approach considerably reduces reliance on manual labor, vendor-specific tools, and the complexities associated with different software versions, while ensuring accurate, efficient, passive, and rapid network visibility and asset identification.

The presently disclosed subject matter includes a computer system and method for linguistic passive automatic discovery of operational technology (OT) and information technology (IT) networks. The discovery can be applied on networks of various sizes, including large networks comprising many assets (e.g., tens of thousands of endpoints and servers or more), as well as smaller networks such as subnets, multi sites, or even sites. In some examples, boundaries of a network (defining which assets belong to a specific network) are defined by a router.

The computer systems and methods disclosed herein utilize a novel large language model (LLM) trained to comprehensively "understand" configuration files, notwithstanding the fact that configuration data stored in configuration files is not a natural language.

Configuration files are special types of files that store information defining various assets and their interactions in an OT or IT network. Each configuration file describes a respective network asset. Assets include, for example, various devices, and particularly network infrastructure components such as switches, routers, gateways, firewalls, sensors, etc. Each configuration file contains information (referred to herein as "configuration data") of a respective asset (device).

The information in configuration files includes for example: Information on the assets (or "entities") in the network, including for example, data such as asset class (e.g., hardware, software), asset type (e.g., computer, router, switch, hub, application, operating system, etc.), IP addresses, hostname, network settings, access control, protocols, applications, etc. Information on the network topolo