
JP-7857251-B2 - Device for generating a list of dangerous websites, method for generating a list, and program for generating a list of dangerous websites


Inventors

  • 澤谷 雪子
  • 磯原 隆将

Assignees

  • KDDI株式会社

Dates

Publication Date
2026-05-12
Application Date
2023-05-09

Claims (11)

  1. A list generation device comprising: a learning data collection unit that collects a plurality of learning search results using a predetermined number of the most frequently occurring search queries in a web search service; a domain extraction unit that extracts, according to a predetermined rule, domains that frequently appear in the learning search results; a discovery data collection unit that collects search results for dangerous site discovery using search queries composed of a plurality of conditions; a dangerous site extraction unit that extracts, from the search results for dangerous site discovery, URLs containing domains different from the domains extracted by the domain extraction unit; and a list management unit that manages the URLs extracted by the dangerous site extraction unit as a list of dangerous sites.
  2. The list generation device according to claim 1, wherein the domain extraction unit extracts top-level domains as the frequently occurring domains.
  3. The list generation device according to claim 1, wherein the domain extraction unit extracts pre-registered attribute-type domains as the frequently occurring domains.
  4. The list generation device according to any one of claims 1 to 3, wherein the discovery data collection unit includes a site category in the plurality of conditions.
  5. The list generation device according to any one of claims 1 to 3, wherein the discovery data collection unit includes multiple search keywords as the multiple conditions.
  6. The list generation device according to any one of claims 1 to 3, wherein the learning data collection unit sets an upper limit on the number of sites to be collected as search results for learning.
  7. The list generation device according to any one of claims 1 to 3, wherein the dangerous site extraction unit applies an index for evaluating the importance of words in documents, treating each of the plurality of learning search results and the search results for dangerous site discovery as a document and each domain as a word, and extracts URLs containing domains whose scores exceed a predetermined threshold.
  8. The list generation device according to any one of claims 1 to 3, wherein the list management unit checks, at a predetermined timing, whether each URL included in the list is included in the search results using the same search query as when it was searched by the discovery data collection unit, and deletes it from the list if it is not included in the search results.
  9. The list generation device according to any one of claims 1 to 3, wherein the list management unit checks, at predetermined intervals, whether the language used on the web page based on each URL included in the list has changed, and deletes it from the list if it has changed.
  10. A list generation method in which a computer performs: a learning data collection step of collecting a plurality of learning search results using a predetermined number of the most frequently occurring search queries in a web search service; a domain extraction step of extracting, according to a predetermined rule, domains that frequently appear in the learning search results; a discovery data collection step of collecting search results for dangerous site discovery using search queries composed of a plurality of conditions; a dangerous site extraction step of extracting, from the search results for dangerous site discovery, URLs containing domains different from the domains extracted in the domain extraction step; and a list management step of managing the URLs extracted in the dangerous site extraction step as a list of dangerous sites.
  11. A list generation program for causing a computer to function as a list generation device according to any one of claims 1 to 3.
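
Claim 7 does not name the index used to score domains. A common index for the importance of words in documents is TF-IDF, so the following is only a minimal sketch under that assumption, not the patented implementation. The function names `tld_of` and `suspicious_urls`, the use of the TLD as the "domain", the threshold value, and the example URLs are all hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def tld_of(url: str) -> str:
    """Return the last label of the host name (assumed here to be the 'domain')."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def suspicious_urls(learning_results, discovery_result, threshold):
    """Score each domain in the discovery result with TF-IDF and keep high scorers.

    Each search result (a list of URLs) is treated as one document and each
    domain as one word, as in claim 7. TF-IDF itself is an assumption.
    """
    learning_docs = [[tld_of(u) for u in result] for result in learning_results]
    discovery_doc = [tld_of(u) for u in discovery_result]
    all_docs = learning_docs + [discovery_doc]

    scores = {}
    for domain, count in Counter(discovery_doc).items():
        # Document frequency: number of search results in which the domain appears.
        df = sum(1 for doc in all_docs if domain in doc)
        tf = count / len(discovery_doc)
        scores[domain] = tf * math.log(len(all_docs) / df)

    return [u for u in discovery_result if scores[tld_of(u)] > threshold]


# Hypothetical data: two learning search results and one discovery search result.
learning = [
    ["https://a.example.jp/", "https://b.example.com/"],
    ["https://c.example.jp/", "https://d.example.com/"],
]
discovery = ["https://e.example.jp/", "https://f.example.br/", "https://g.example.com/"]

print(suspicious_urls(learning, discovery, threshold=0.1))
# prints ['https://f.example.br/']
```

Domains that appear in every search result get an inverse document frequency of zero, so ".jp" and ".com" score 0 here, while ".br", which appears only in the discovery result, scores above the threshold.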

Description

This invention relates to a technology for collecting URLs of dangerous websites.

Dangerous websites that harm the users who visit them, such as phishing sites and malware distribution sites, have long existed. Because these websites are often not immediately recognizable as dangerous, there is a need for a mechanism that automatically identifies them and mitigates the harm. For this reason, services are provided that, for example, verify the safety of a website by comparing it against a list of unsafe sites (e.g., Non-Patent Document 1), or notify users of the safety of URLs displayed in search results (e.g., Non-Patent Document 2). Non-Patent Document 3 proposes a method for identifying domains involved in botnets by performing machine learning on Whois information (registrant name, registration date, contact information, etc.). Patent Document 1 proposes a method for detecting website tampering based on the host transitions that occur when a specific site is accessed.

Patent Document 1: Patent No. 6055726
Non-Patent Document 1: Google LLC, Google Safe Browsing, Internet, <https://developers.google.com/safe-browsing/v4/>, February 27, 2023.
Non-Patent Document 2: Trend Micro Incorporated, regarding the "Trend Toolbar" function of Virus Buster Cloud, Internet, <https://helpcenter.trendmicro.com/ja-jp/article/tmka-18502>, January 10, 2023.
Non-Patent Document 3: Masahiro Hisayama, Ryoichi Sasaki, "A Method for Identifying Malicious Domains Using the WHOIS Structure of Domains," Multimedia, Distributed, Cooperative and Mobile (DICOMO2016) Symposium, July 2016.

Figure 1 shows the functional configuration of the list generation device in the embodiment.
A flowchart illustrates the procedure for generating a list in the embodiment.

[First Embodiment]
The following describes a first embodiment of the present invention. The list generation device of this embodiment collects URLs of dangerous websites on the internet, and generates and manages a blacklist used to restrict user access.
Search results from web search services contain both frequently occurring top-level domains (TLDs) and rarely occurring ones, and many dangerous sites, including tampered websites, belong to the latter group. The likely reason is as follows. Because search engines reflect geographical and linguistic differences in keyword searches, search results within Japan often contain URLs ending in ".jp" or ".com", while search results within Canada often contain URLs ending in ".ca" or ".com". Dangerous sites, however, are often created by tampering with vulnerable legitimate sites overseas, so they frequently serve Japanese-language web pages under a ccTLD other than ".jp". The list generation device of this embodiment uses this characteristic: it lists, as dangerous sites, the URLs in the search results whose TLDs are rare, that is, different from the frequently occurring TLDs.

Figure 1 shows the functional configuration of the list generation device 1 in this embodiment. The list generation device 1 is an information processing device (computer) such as a server device or a personal computer, and includes a control unit 10 and a storage unit 20, as well as various data input/output devices and communication devices.

The control unit 10 controls the entire list generation device 1. By reading and executing the programs stored in the storage unit 20 as appropriate, it operates as the functional units described later and thereby realizes the functions of this embodiment. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that make the hardware function as the list generation device 1, and may be ROM, RAM, flash memory, or a hard disk drive (HDD).
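
The rare-TLD heuristic described at the start of this passage can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's actual code: the "predetermined rule" for deciding which TLDs are frequent is not specified in the text, so a minimum share of the learning results (`min_share`) is assumed here, and the function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_of(url: str) -> str:
    """Return the last label of the URL's host name, e.g. 'jp' for example.co.jp."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def frequent_tlds(learning_urls: list[str], min_share: float = 0.05) -> set[str]:
    """TLDs whose share of the learning search results is at least min_share.

    The share-based rule is an assumption standing in for the unspecified
    'predetermined rule'.
    """
    counts = Counter(tld_of(u) for u in learning_urls)
    total = sum(counts.values())
    return {tld for tld, n in counts.items() if n / total >= min_share}


def rare_tld_urls(discovery_urls: list[str], frequent: set[str]) -> list[str]:
    """URLs from the discovery search results whose TLD is not a frequent one."""
    return [u for u in discovery_urls if tld_of(u) not in frequent]


# Hypothetical search results for a Japanese-language region.
learning = [
    "https://news.example.jp/a",
    "https://shop.example.com/b",
    "https://blog.example.jp/c",
    "https://example.com/d",
]
discovery = [
    "https://example.jp/x",
    "https://hacked.example.br/ja/page",
    "https://example.com/y",
]

frequent = frequent_tlds(learning)
print(rare_tld_urls(discovery, frequent))
# prints ['https://hacked.example.br/ja/page']
```

Taking only the last host label is a simplification. Handling attribute-type domains such as "co.jp" would need a registered list of such suffixes, which this sketch omits.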
The storage unit 20 stores a program (the list generation program) that causes the control unit 10 to execute each function of this embodiment, as well as the list of URLs to be managed, various databases, and so on.

The control unit 10 comprises a learning data collection unit 11, a domain extraction unit 12, a discovery data collection unit 13, a dangerous site extraction unit 14, and a list management unit 15.

The learning data collection unit 11 collects multiple learning search results using a predetermined number of the most frequently occurring search queries in a web search service. Specifically, to collect the TLDs that occur frequently in a specific language and region, the learning data collection unit 11 collects the responses (search results) for each of the N search queries ranked highest in the statistical information of a web search service (for example, Google Trends), such as the overall search ranking across all categories over the past 30 days. The learning data collection unit 11 uses the search results for these N search queries as learning data. If the number of responses to a single query is large, the learning data collec