KR-20260067722-A - MULTI-THREAD-BASED DATA COLLECTION ENGINE DEVICE

Abstract

The present invention relates to a multi-threaded data collection engine device. A multi-threaded data collection engine device according to an embodiment of the present invention includes a memory storing a program for collecting data from a website based on multi-threading and a processor for executing the program, and the processor performs parallel processing of data collection tasks by simultaneously executing a plurality of threads.

Inventors

  • 오상묵
  • 박우용
  • 이승현

Assignees

  • 주식회사 인핸스

Dates

Publication Date
20260513
Application Date
20241106

Claims (6)

  1. A multi-threaded data collection engine device comprising: a memory storing a program that collects data from a website based on multi-threading; and a processor that executes the program, wherein the processor performs parallel processing of data collection tasks by executing a plurality of threads simultaneously.
  2. The multi-threaded data collection engine device of claim 1, wherein the processor performs the data collection task by distributing IP addresses.
  3. The multi-threaded data collection engine device of claim 1, wherein the processor collects cookies at a preset interval and performs the data collection task on a website that requires login or session maintenance.
  4. The multi-threaded data collection engine device of claim 1, wherein the processor provides a user interface with functions for setting the data collection target website, the period, and the storage path.
  5. The multi-threaded data collection engine device of claim 4, wherein the processor provides, through the user interface, a real-time job status monitoring function, a job completion confirmation function, and an error detection function.
  6. The multi-threaded data collection engine device of claim 1, wherein the processor schedules and manages the data collection task and performs retries when an error occurs.
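
As a non-authoritative illustration of claims 1, 2, and 6, the following Python sketch shows one way a processor might run collection tasks on several threads, spread requests across IP addresses, and retry on error. It is a sketch under stated assumptions, not the claimed implementation: the names `ProxyRotator`, `collect_all`, and the injected `fetch` callable are hypothetical and do not appear in the specification, and round-robin proxy selection with a fixed retry count is assumed because the claims do not state how IP addresses are distributed or how retries are scheduled.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional


class ProxyRotator:
    """Hypothetical helper: hands out proxies in round-robin order.

    The lock guards the shared iterator so that threads do not
    advance it at the same time.
    """

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self) -> str:
        with self._lock:
            return next(self._cycle)


def collect_all(
    urls: Iterable[str],
    fetch: Callable[[str, str], str],
    proxies: List[str],
    max_workers: int = 4,
    max_retries: int = 3,
) -> Dict[str, Optional[str]]:
    """Collect every URL in parallel; map each URL to its data or None.

    `fetch(url, proxy)` is an assumed callable supplied by the caller
    (for example, a wrapper around an HTTP client). Any exception it
    raises is treated as a collection error and triggers a retry
    through the next proxy, up to `max_retries` attempts.
    """
    url_list = list(urls)
    rotator = ProxyRotator(proxies)

    def worker(url: str) -> Optional[str]:
        for _ in range(max_retries):
            proxy = rotator.next_proxy()
            try:
                return fetch(url, proxy)
            except Exception:
                continue  # retry with the next proxy
        return None  # all attempts failed

    # Each URL becomes one task; the pool runs tasks on several threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, url_list))
    return dict(zip(url_list, results))
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library. A URL whose attempts all fail maps to `None`, which a caller could surface through an error detection function such as the one recited in claim 5.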

Description

Multi-threaded Data Collection Engine Device
The present invention relates to a multi-threaded data collection engine device.
As web data has grown in importance, technologies for collecting data in the web environment have come into wide use in various fields, and they are particularly essential in big data analysis, price comparison sites, and news and social media monitoring.
According to conventional technology, data is primarily collected using a single-threaded method, which results in slow speeds and low efficiency when collecting large-scale data. Because the single-threaded method can process only one task at a time, it cannot browse multiple web pages simultaneously or perform data collection in parallel. In other words, the amount that can be processed in a given time is limited, which increases the cost of collecting the same amount of data.
Furthermore, because conventional technology performs data collection tasks from the same IP address, the data collection engine device may be blocked by specific websites. Consequently, the continuity of data collection is degraded, and it is difficult to ensure the accuracy and reliability of the collected data. Although a method utilizing multiple proxy servers has been proposed to resolve these problems, the performance of data collection tasks remains limited because selecting and managing proxy servers is inefficient.
When collecting data from various websites, it is difficult to respond effectively to all of them with a single data collection method, because each website applies a different structure and different security mechanisms. Relying solely on a browser-based or an API-based data collection method cannot flexibly accommodate structural changes or security policies, and it also suffers from frequent errors during collection that compromise the integrity of the collected data.
Conventional technology also cannot manage user sessions with cookies, which makes data collection difficult on websites that require login or session maintenance. Consequently, data collection on specific websites is restricted, and the efficiency of data collection operations is reduced.
FIG. 1 illustrates a data collection system according to an embodiment of the present invention.
FIG. 2 illustrates a new job registration screen according to an embodiment of the present invention.
FIG. 3 illustrates a test screen according to an embodiment of the present invention.
FIG. 4 illustrates a status check screen of a data collection engine device according to an embodiment of the present invention.
FIG. 5 illustrates a screen for scheduling and automation of a data collection process according to an embodiment of the present invention.
FIG. 6 is a block diagram showing a computer system for implementing a method according to an embodiment of the present invention.
The aforementioned objectives of the present invention, as well as other objectives, advantages, and features, and the methods for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and can be implemented in various different forms. The following embodiments are provided merely to inform those skilled in the art of the purpose, structure, and effects of the invention, and the scope of the rights of the present invention is defined by the claims.
Meanwhile, the terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form includes the plural form unless specifically stated otherwise in the text.
As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, actions, and/or elements to the mentioned components, steps, actions, and/or elements.
According to an embodiment of the present invention, the speed and stability of data collection are maximized by operating in a multi-threaded environment and utilizing various proxy servers, and large-scale data collection tasks are performed quickly and efficiently.
According to an embodiment of the present invention, each thread independently navigates web pages and collects data, and multiple threads are executed simultaneously to perform parallel processing of data collection tasks, thereby significantly improving the performance of data collection.
According to an embodiment of the present invention, a synchronization mechanism is introduced to prevent the conflicts between threads and the resource contention problems that may occur in a multi-threaded environment.
According to an embodiment of the present invention, by performing data collection tasks using