EP-4102771-B1 - INFORMATION LEAKAGE DETECTION METHOD AND DEVICE USING THE SAME


Inventors

HUANG, CHIUNG-YING
LI, HUEI-TANG
TSENG, YI-CHUNG
CHEN, WEI-AN

Dates

Publication Date: 20260506
Application Date: 20220608

Claims (8)

1. An information leakage detection method, comprising: obtaining network connection data of an electronic device (12); extracting log data related to a domain name system from the network connection data; analyzing a domain name system request in the log data to obtain a plurality of character distribution feature values according to an analysis result, wherein the character distribution feature values reflect a character distribution status of a domain name in the domain name system request under different classification rules; and determining whether the domain name system request is a malicious domain name system request (401) by a machine learning model (202) according to the character distribution feature values, wherein the malicious domain name system request (401) is used to carry leaked data to a remote host (13), wherein the information leakage detection method further comprises: determining a first occurrence frequency of the malicious domain name system request (401) according to a number of occurrences of the malicious domain name system request (401) within a first time range (T(D)); and verifying a determination result of the machine learning model (202) according to the first occurrence frequency of the malicious domain name system request (401) after the machine learning model (202) determines that the domain name system request is the malicious domain name system request (401), wherein a step of verifying the determination result of the machine learning model (202) according to the first occurrence frequency of the malicious domain name system request (401) comprises: marking the domain name system request as a misjudgment of the malicious domain name system request (401) if the first occurrence frequency is not higher than a critical value; and adjusting a decision logic of the machine learning model (202) according to the misjudgment.
2. The information leakage detection method according to claim 1, wherein the character distribution feature values comprise a first type feature value and a second type feature value, wherein the first type feature value reflects a first character distribution status of the domain name under a first classification rule, the second type feature value reflects a second character distribution status of the domain name under a second classification rule, and the first classification rule is different from the second classification rule.
3. The information leakage detection method according to claim 1, wherein a step of analyzing the domain name system request in the log data to obtain the character distribution feature values according to the analysis result comprises: analyzing the domain name system request to obtain a plurality of evaluation parameters; and obtaining the character distribution feature values according to the evaluation parameters, wherein the evaluation parameters reflect at least two among a total number of characters comprised in a meaningful string in the domain name, a total number of all characters in the domain name, a total number of numerals in the domain name, a total number of non-repeated characters in a third-level domain name in the domain name, a total number of all characters except a first-level domain name and a second-level domain name in the domain name, a number of appearances of a character appearing most in the third-level domain name in the domain name, a number of occurrences of numerals being adjacent to letters in the third-level domain name in the domain name, a total number of characters meeting a specific condition in the third-level domain name in the domain name, and an entropy value of the third-level domain name in the domain name.
4. The information leakage detection method according to claim 1, further comprising: obtaining a second occurrence frequency of the malicious domain name system request (401); and determining the critical value according to the second occurrence frequency.
5. An information leakage detection device, comprising: a storage circuit (22), configured to store network connection data and a machine learning model (202) of an electronic device (12); and a processor (21), coupled to the storage circuit (22) and configured to: extract log data related to a domain name system from the network connection data; analyze a domain name system request in the log data to obtain a plurality of character distribution feature values according to an analysis result, wherein the character distribution feature values reflect a character distribution status of a domain name in the domain name system request under different classification rules; and determine whether the domain name system request is a malicious domain name system request (401) by the machine learning model (202) according to the character distribution feature values, wherein the malicious domain name system request (401) is used to carry leaked data to a remote host (13), wherein the processor (21) is further configured to: determine a first occurrence frequency of the malicious domain name system request (401) according to a number of occurrences of the malicious domain name system request (401) within a first time range (T(D)); and verify a determination result of the machine learning model (202) according to the first occurrence frequency of the malicious domain name system request (401) after the machine learning model (202) determines that the domain name system request is the malicious domain name system request (401), wherein an operation of verifying the determination result of the machine learning model (202) according to the first occurrence frequency of the malicious domain name system request (401) comprises: marking the domain name system request as a misjudgment of the malicious domain name system request (401) if the first occurrence frequency is not higher than a critical value; and adjusting a decision logic of the machine learning model (202) according to the misjudgment.
6. The information leakage detection device according to claim 5, wherein the character distribution feature values comprise a first type feature value and a second type feature value, wherein the first type feature value reflects a first character distribution status of the domain name under a first classification rule, the second type feature value reflects a second character distribution status of the domain name under a second classification rule, and the first classification rule is different from the second classification rule.
7. The information leakage detection device according to claim 5, wherein an operation of analyzing the domain name system request in the log data to obtain the character distribution feature values according to the analysis result comprises: analyzing the domain name system request to obtain a plurality of evaluation parameters; and obtaining the character distribution feature values according to the evaluation parameters, wherein the evaluation parameters reflect at least two among a total number of characters comprised in a meaningful string in the domain name, a total number of all characters in the domain name, a total number of numerals in the domain name, a total number of non-repeated characters in a third-level domain name in the domain name, a total number of all characters except a first-level domain name and a second-level domain name in the domain name, a number of appearances of a character appearing most in the third-level domain name in the domain name, a number of occurrences of numerals being adjacent to letters in the third-level domain name in the domain name, a total number of characters meeting a specific condition in the third-level domain name in the domain name, and an entropy value of the third-level domain name in the domain name.
8. The information leakage detection device according to claim 5, wherein the processor (21) is further configured to: obtain a second occurrence frequency of the malicious domain name system request (401); and determine the critical value according to the second occurrence frequency.
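
For illustration only, several of the evaluation parameters listed in the claims above could be computed along the following lines. This is a minimal sketch and not the patented implementation. The function name `evaluation_parameters` and the dictionary keys are hypothetical. It assumes that the last label is the first-level domain name, the label before it is the second-level domain name, and everything further left is the "third-level" part; that "non-repeated characters" means distinct characters; that character totals exclude the separating dots; and that the entropy value is Shannon entropy in bits. The "meaningful string" and "specific condition" parameters are not defined in the claims and are left out.

```python
import math
from collections import Counter


def evaluation_parameters(domain: str) -> dict:
    """Compute a subset of the evaluation parameters for one domain name."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Assumption: labels left of the second-level domain form the third-level part.
    sub = "".join(labels[:-2])
    counts = Counter(sub)

    # Assumption: Shannon entropy (bits per character) of the third-level part.
    entropy = 0.0
    for n in counts.values():
        p = n / len(sub)
        entropy -= p * math.log2(p)

    # Count positions where a numeral sits directly next to a letter.
    digit_letter_adjacent = sum(
        1
        for a, b in zip(sub, sub[1:])
        if (a.isdigit() and b.isalpha()) or (a.isalpha() and b.isdigit())
    )

    return {
        "total_chars": len(name.replace(".", "")),        # all characters, dots excluded
        "numerals": sum(c.isdigit() for c in name),       # numerals in the whole name
        "sub_distinct_chars": len(counts),                # distinct characters, third level
        "sub_chars": len(sub),                            # characters outside first/second level
        "sub_max_char_count": max(counts.values(), default=0),
        "sub_digit_letter_adjacent": digit_letter_adjacent,
        "sub_entropy": entropy,
    }
```

A call such as `evaluation_parameters("a1b2c3.example.com")` returns a dictionary from which character distribution feature values could then be derived; how the parameters are combined into feature values is not specified in this section.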

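The frequency-based verification recited in the independent claims can be sketched as follows. This is an assumption-laden illustration rather than the method of the disclosure. The function name `verify_flagged`, the `(timestamp, domain)` event format, the grouping of requests by domain name, and the unit of the critical value (occurrences per second) are all hypothetical. The step of adjusting the decision logic of the machine learning model according to the misjudgments is not shown.

```python
from collections import defaultdict


def verify_flagged(flagged_events, window_start, window_end, critical_value):
    """Re-check requests that the model already flagged as malicious.

    flagged_events: iterable of (timestamp_seconds, domain) pairs.
    Returns a dict mapping each domain seen inside the time range to
    "confirmed" or "misjudgment".
    """
    # Count occurrences inside the first time range [window_start, window_end).
    counts = defaultdict(int)
    for timestamp, domain in flagged_events:
        if window_start <= timestamp < window_end:
            counts[domain] += 1

    duration = window_end - window_start
    verdicts = {}
    for domain, occurrences in counts.items():
        frequency = occurrences / duration  # first occurrence frequency
        # "Not higher than a critical value" is treated as a misjudgment.
        if frequency <= critical_value:
            verdicts[domain] = "misjudgment"
        else:
            verdicts[domain] = "confirmed"
    return verdicts
```

In this sketch, a domain flagged three times in a ten-second range with a critical value of 0.2 occurrences per second stays "confirmed", while a domain flagged once in the same range is marked "misjudgment" and could be fed back as a corrected training label.
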
Description

BACKGROUND

Technical Field

The disclosure relates to an information leakage detection technology, and more particularly to an information leakage detection method and a device using the same.

Description of Related Art

A domain name system (DNS) is an Internet service, which may be used as a distributed database mapping domain names and Internet protocol (IP) addresses to each other to provide people with easier access to the Internet. For example, when a terminal device needs to open a web page using a certain domain name, the terminal device may send a DNS request to a responsible DNS server. After receiving the DNS request, the DNS server may resolve the DNS request and send a DNS response to the terminal device, so as to inform the terminal device of an IP address corresponding to the domain name through the DNS response.

Generally speaking, most network security systems (such as firewalls) do not block DNS requests and DNS responses, to avoid affecting the normal network connection of terminal devices. As a result, however, once a hacker or a malicious program sends a DNS request for information leakage, such as by carrying sensitive data of a terminal device in the DNS request and sending it to a remote host, most network security systems may not be able to detect or prevent such information leakage.

CHOWDHARY AASTHA ET AL: "DNS Tunneling Detection using Machine Learning and Cache Miss Properties", 2021 5TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND CONTROL SYSTEMS (ICICCS), IEEE, 6 May 2021 (2021-05-06), pages 1225-1229, and AHMED JAWAD ET AL: "Monitoring Enterprise DNS Queries for Detecting Data Exfiltration From Internal Hosts", IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, IEEE, USA, vol. 17, no. 1, 10 September 2019 (2019-09-10), pages 265-279, disclose methods to detect DNS tunneling.

In US 2021/126901 A1, a service computes a plurality of features of a subdomain for which a Domain Name System (DNS) query was issued. The service aggregates the plurality of computed features into a feature vector. The service uses the feature vector as input to a machine learning classifier, to determine whether the subdomain is a DNS tunneling domain name. The service provides an indication that the subdomain is a DNS tunneling domain name, when the machine learning classifier determines that the subdomain is a DNS tunneling domain name. US 2016/337391 A1 and US 2018/109494 A1 are also relevant prior art documents.

SUMMARY

The disclosure provides an information leakage detection method and a device using the same, which may improve efficiency of detecting a domain name system (DNS) request and/or a domain name used by a hacker or a malicious program for information leakage.

An embodiment of the disclosure provides an information leakage detection method, including the following steps. Network connection data of an electronic device is obtained. Log data related to a DNS is extracted from the network connection data. A DNS request in the log data is analyzed to obtain multiple character distribution feature values according to an analysis result. The character distribution feature values reflect a character distribution status of a domain name in the DNS request under different classification rules. A machine learning model determines whether the DNS request is a malicious DNS request according to the character distribution feature values, and the malicious DNS request is used to carry leaked data to a remote host.

An embodiment of the disclosure further provides an information leakage detection device, including a storage circuit and a processor. The storage circuit is configured to store network connection data and a machine learning model of an electronic device. The processor is coupled to the storage circuit and is configured to perform the following operations. Log data related to a DNS is extracted from the network connection data. A DNS request in the log data is analyzed to obtain multiple character distribution feature values according to an analysis result. The character distribution feature values reflect a character distribution status of a domain name in the DNS request under different classification rules. The machine learning model determines whether the DNS request is a malicious DNS request according to the character distribution feature values, and the malicious DNS request is used to carry leaked data to a remote host.

Based on the above, after the network connection data of the electronic device is obtained, the log data related to the DNS may be extracted from the network connection data. Next, the DNS request in the log data may be analyzed to obtain the character distribution feature values according to the analysis result, and the character distribution feature values reflect the character distribution status of the domain name in the DNS request under different classification rules. Subsequently, the machine learning model determines whether the DNS request is the ma