JP-7854759-B1 - Data preprocessing device

Abstract

[Problem] To accelerate AI development and reduce labor costs by automating data preprocessing in AI development. [Solution] The data preprocessing device includes: means for generating a first machine learning model trained on a training dataset consisting of combinations of management information and context information; means for generating a second machine learning model trained on a training dataset consisting of combinations of management information, context information, and the content of the preprocessing to be applied to the management information; means for inputting one piece of management information to the first machine learning model and identifying its output as one piece of context information related to that piece of management information; means for inputting that piece of management information and that piece of context information to the second machine learning model and identifying its output as the content of the preprocessing to be applied to that piece of management information; and means for converting that piece of management information according to the identified content of the preprocessing. [Selected Figure] Figure 2

Inventors

  • 西郷 孝一
  • 池田 雅人
  • 諸冨 大樹

Assignees

  • 株式会社D4All

Dates

Publication Date: 2026-05-07
Application Date: 2025-11-11
Priority Date: 2025-02-12

Claims (11)

  1. A data preprocessing device comprising:
     a first model generation means for generating, by training on a training dataset consisting of combinations of management information and context information related to the management information, a first machine learning model that outputs context information related to the management information when the management information is input;
     a second model generation means for generating, by training on a training dataset consisting of combinations of the management information, context information related to the management information, and the content of the preprocessing to be applied to the management information, a second machine learning model that outputs the content of the preprocessing to be applied to the management information when the management information and the context information related to the management information are input;
     a model storage means for storing parameters that define the operation of the first machine learning model and the second machine learning model;
     a context identification means for inputting one piece of the management information to the first machine learning model and identifying the output of the first machine learning model as one piece of the context information related to the one piece of management information;
     a processing identification means for inputting the one piece of management information and the one piece of context information to the second machine learning model and identifying the output of the second machine learning model as the content of the preprocessing to be applied to the one piece of management information; and
     an information format conversion means for converting the one piece of management information according to the identified content of the preprocessing to be applied to it,
     wherein the preprocessing is a process of converting target data into data in a format usable for data science or AI (Artificial Intelligence) development.
  2. The data preprocessing device according to claim 1, characterized in that the preprocessing is a process of converting the management information into qualitative data, quantitative data, or annotation data, which is data with annotations attached.
  3. The data preprocessing device according to claim 2, characterized in that the qualitative data is text data and the quantitative data is discrete or continuous data.
  4. The data preprocessing device according to claim 1, characterized in that the preprocessing involves vectorizing, aggregating, textualizing, or annotating the management information.
  5. The data preprocessing device according to claim 4, characterized in that the aggregation is a process of converting to descriptive statistics or a process of converting to a grand total.
  6. A data preprocessing method performed by a computer, comprising:
     a step in which a first model generation means generates, by training on a training dataset consisting of combinations of management information and context information related to the management information, a first machine learning model that outputs context information related to the management information when the management information is input;
     a step in which a second model generation means generates, by training on a training dataset consisting of combinations of the management information, context information related to the management information, and the content of the preprocessing to be applied to the management information, a second machine learning model that outputs the content of the preprocessing to be applied to the management information when the management information and the context information related to the management information are input;
     a step in which a context identification means inputs one piece of the management information to the first machine learning model and identifies the output of the first machine learning model as one piece of the context information related to the one piece of management information;
     a step in which a processing identification means inputs the one piece of management information and the one piece of context information to the second machine learning model and identifies the output of the second machine learning model as the content of the preprocessing to be applied to the one piece of management information; and
     a step in which an information format conversion means converts the one piece of management information according to the identified content of the preprocessing to be applied to it,
     wherein the preprocessing is a process of converting target data into data in a format usable for data science or AI (Artificial Intelligence) development.
  7. The data preprocessing method according to claim 6, characterized in that the preprocessing is a process of converting the management information into qualitative data, quantitative data, or annotation data, which is data with annotations attached.
  8. The data preprocessing method according to claim 7, characterized in that the qualitative data is text data and the quantitative data is discrete or continuous data.
  9. The data preprocessing method according to claim 6, characterized in that the preprocessing involves vectorizing, aggregating, textualizing, or annotating the management information.
  10. The data preprocessing method according to claim 9, characterized in that the aggregation is a process of converting to descriptive statistics or a process of converting to a grand total.
  11. A data preprocessing program for causing a computer to perform the method according to any one of claims 6 to 10.
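
As an illustration only, the model generation and identification steps of claim 6 can be sketched in Python as follows. The claims do not fix a learning algorithm, so `LookupModel` is a deliberately trivial stand-in that memorizes the most frequent label per input; the class name, the function names, and the training rows are all hypothetical and do not come from the patent. The conversion step is omitted.

```python
from collections import Counter, defaultdict


class LookupModel:
    """Toy stand-in for a trained model (assumption, not the patent's method):
    memorizes the most frequent label seen for each input."""

    def __init__(self):
        self._by_key = defaultdict(Counter)
        self._overall = Counter()

    def fit(self, inputs, labels):
        for key, label in zip(inputs, labels):
            self._by_key[key][label] += 1
            self._overall[label] += 1
        return self

    def predict(self, key):
        # Unseen inputs fall back to the globally most frequent label.
        counts = self._by_key.get(key) or self._overall
        return counts.most_common(1)[0][0]


# Hypothetical training rows:
# (management information, context information, preprocessing content)
TRAINING = [
    ("purchase_amount", "sales", "aggregate"),
    ("purchase_amount", "sales", "aggregate"),
    ("customer_comment", "feedback", "textualize"),
    ("skin_moisture", "health", "vectorize"),
]


def build_models(rows):
    infos = [row[0] for row in rows]
    contexts = [row[1] for row in rows]
    operations = [row[2] for row in rows]
    # First model: management information -> context information.
    first = LookupModel().fit(infos, contexts)
    # Second model: (management information, context) -> preprocessing content.
    second = LookupModel().fit(list(zip(infos, contexts)), operations)
    return first, second


def identify_preprocessing(first, second, info):
    # Context identification, then processing identification.
    context = first.predict(info)
    operation = second.predict((info, context))
    return context, operation


first_model, second_model = build_models(TRAINING)
print(identify_preprocessing(first_model, second_model, "purchase_amount"))
```

The point of the sketch is the chaining: the output of the first model becomes part of the input to the second model, so the preprocessing choice depends on both the data and its inferred context.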

Description

The present invention relates to technology for data science or AI (Artificial Intelligence) development.

In recent years, the information processing capabilities of computers have improved dramatically. As a result, data science and AI technologies, which analyze and process large amounts of information to produce valuable insights, are attracting significant attention. Against this backdrop, numerous technical proposals are being made in the fields of data science and AI. For example, Patent Document 1 proposes an information processing system that reduces the data science expertise required of users when converting unstructured data into structured data.

Patent Document 1: Japanese Patent Publication No. 2024-039064

The drawings show, in order:

  • an overview of the data preprocessing device according to this embodiment;
  • a functional block diagram of the data preprocessing device according to this embodiment;
  • an example of the hardware configuration of the data preprocessing device according to this embodiment;
  • a flowchart of an example of the processing flow of the data preprocessing device according to this embodiment.

Embodiments for carrying out the present invention are described below with reference to the drawings.

(Operating principle of the data preprocessing device according to this embodiment)

The operating principle of the data preprocessing device 100 according to this embodiment (hereinafter simply "the device 100") is explained using Figures 1 and 2. Figure 1 is a diagram showing the connection relationship between the device 100 and other devices, and Figure 2 is a functional block diagram of the device 100. As shown in Figure 1, the device 100 is connected to one or more external devices 260 via a communication network 270. The communication network 270 may be a wired or a wireless communication network. An external device 260 may be, for example, a POS (point of sale) system.
As shown in Figure 2, the device 100 includes a pre-processing information storage means 110, a post-processing information storage means 120, a model storage means 130, a first model generation means 140, a second model generation means 150, a context identification means 160, a processing identification means 170, and an information format conversion means 180.

The pre-processing information storage means 110 stores the management information 210 before it is processed by the information format conversion means 180. The management information 210 may be, for example, customer purchase information, skin information, or health management information, but it can be any other type of information. The management information 210 is provided, for example, from the POS system 260.

The post-processing information storage means 120 stores the management information 215 that has been processed by the information format conversion means 180. The management information 215 is data in a format usable for data science or AI (Artificial Intelligence) development. It may be qualitative data, quantitative data, or annotation data (data with annotations attached). Qualitative data is text data, and quantitative data is data in the form of discrete or continuous values. Furthermore, the management information 215 is data obtained by vectorizing, aggregating, textualizing, or annotating the management information 210. Aggregation means converting the data into descriptive statistics or into a grand total.

The model storage means 130 stores parameters that define the operation of the trained machine learning models 240 and 250, which are trained by the first model generation means 140 and the second model generation means 150, respectively, as described later.
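
As an illustration of the two aggregation forms named above (descriptive statistics and a grand total), the following is a minimal Python sketch. The function name `aggregate`, the `mode` values, and the sample purchase amounts are hypothetical; the patent does not specify which descriptive statistics are computed, so the set chosen here is an assumption.

```python
import statistics


def aggregate(values, mode):
    """Aggregate numeric management information (illustrative only).

    mode="total" returns the grand total;
    mode="descriptive" returns a few common descriptive statistics.
    """
    if mode == "total":
        return {"total": sum(values)}
    if mode == "descriptive":
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    raise ValueError(f"unknown aggregation mode: {mode}")


# Hypothetical purchase amounts, e.g. as received from a POS system.
purchases = [1200, 800, 1500, 500]
print(aggregate(purchases, "total"))
print(aggregate(purchases, "descriptive"))
```

In a fuller sketch, the `mode` argument would be supplied by the output of the second machine learning model rather than hard-coded by the caller.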
The first model generation means 140 trains the machine learning model 240 on a training dataset consisting of combinations of management information 210 and context information 220 related to the management information 210. This training dataset is prepared in advance. In this way, the first model generation means 140 generates a first machine learning model 240 that, when management information 210 is input, outputs context information 220 related to the management information 210. The learning algorithm used by the first model generation means 140 is not particularly limited.

Here, context information 220 is background information that determines what meaning the target data has. Context information 220 includes, for example, time information (timestamp (creation date and time, update date and time), chronological position (past/present/future prediction), temporal urgency (immediacy, real-time), expiration date, best-before date, seasonality, periodicity, etc.), location/spatial information (geographical location (country, region, city), physical/virtual space, geopolitical risk area, regulatory juris