JP-2026074607-A - Dataset exploration system

Abstract

[Problem] To make it easy, when collecting datasets for a specific purpose, to gather multiple datasets, including not only those directly related to that purpose but also those that are not directly related yet may be useful. [Solution] A system for searching datasets calculates a vector value for each of the multiple datasets stored in a database, calculates the degree of similarity of the vector values between the datasets (12), accepts a keyword (20), searches the datasets stored in the database for datasets related to the keyword based on the degree of similarity of the vector values (30), and presents the retrieved datasets as search results (50). [Selected Drawing] Figure 2
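
The flow in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the abstract does not say how vector values or similarity are computed, so bag-of-words term counts and cosine similarity are assumed here, and every function name, dataset description, and the threshold value is hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Assumed vectorization: bag-of-words term counts (not specified in the source)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_similarity(vectors):
    """Precompute the similarity between every ordered pair of distinct datasets."""
    return {(a, b): cosine(vectors[a], vectors[b])
            for a in vectors for b in vectors if a != b}

def search(datasets, keyword, threshold=0.3):
    """Return datasets matching the keyword, then datasets similar to those matches."""
    vectors = {name: vectorize(text) for name, text in datasets.items()}
    similarity = build_similarity(vectors)
    query = vectorize(keyword)
    # Directly related: the dataset text shares at least one term with the keyword.
    direct = [n for n in vectors if cosine(vectors[n], query) > 0]
    # Indirectly related: similar enough to any directly related dataset.
    related = [n for n in vectors if n not in direct
               and any(similarity[(d, n)] > threshold for d in direct)]
    return direct + related

# Hypothetical dataset descriptions standing in for real dataset contents.
DATASETS = {
    "weather": "daily temperature rainfall tokyo",
    "traffic": "daily traffic volume tokyo",
    "stocks": "stock price nikkei",
}

print(search(DATASETS, "rainfall"))  # ['weather', 'traffic']
```

In this toy run, "traffic" never mentions rainfall but is still returned, because its vector is similar to that of "weather". That is the kind of indirectly related result the abstract describes.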

Inventors

  • 藤田 幸久

Assignees

  • トヨタ自動車株式会社

Dates

Publication Date
2026-05-07
Application Date
2024-10-21

Claims (5)

  1. A system for exploring datasets, comprising: a database means configured to store multiple datasets; a dataset vectorization means configured to calculate a vector value for each dataset; a vector value similarity calculation means configured to calculate the degree of similarity of the vector values between the multiple datasets; a keyword receiving means configured to receive a keyword; a dataset search means configured to search, based on the degree of similarity of the vector values, for datasets related to the keyword from among the datasets stored in the database means; and a dataset presentation means configured to present the datasets found by the dataset search means as search results.
  2. A system according to claim 1, wherein the dataset search means is configured to select a starting dataset from datasets stored in the database means based on its relevance to the keyword, and to search for datasets related to the keyword based on the degree of similarity of the vector values between each of the datasets stored in the database means and the starting dataset.
  3. A system according to claim 2, wherein the dataset search means is configured to select, as a dataset to be retrieved, a dataset whose degree of association with the starting dataset exceeds a predetermined value.
  4. A system according to claim 3, wherein the dataset search means is configured to select a plurality of datasets as starting datasets, to select, for each starting dataset, a group of datasets whose degree of association exceeds a predetermined value, and to calculate the diversity of each group of datasets, and wherein the dataset presentation means is configured to present the groups of datasets as search results in descending order of diversity.
  5. A system according to claim 2, wherein the dataset search means is configured to select, as a dataset to be retrieved, a dataset for which the degree of overlap between its data collection region or collection period and that of the starting dataset exceeds a predetermined value, and whose degree of association with the starting dataset also exceeds a predetermined value.
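
The selection logic of claims 3 to 5 might look like the sketch below. The claims do not define "diversity" or "degree of overlap", so two measures are assumed purely for illustration: diversity is taken as 1 minus the mean pairwise association within a group, and overlap is taken as the shared span of two collection periods divided by the shorter period. All names, association values, periods, and thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical, precomputed degrees of association between dataset pairs.
ASSOC = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.6, frozenset(("B", "C")): 0.9,
    frozenset(("A", "D")): 0.1, frozenset(("B", "D")): 0.2, frozenset(("C", "D")): 0.6,
}
# Hypothetical collection periods as (start_year, end_year).
PERIOD = {"A": (2020, 2024), "B": (2022, 2026), "C": (2010, 2012), "D": (2011, 2013)}

def assoc(a, b):
    """Degree of association between two datasets (0.0 if unknown)."""
    return ASSOC.get(frozenset((a, b)), 0.0)

def group_for(start, threshold=0.5):
    """Claim 3: the starting dataset plus datasets associated with it above a threshold."""
    return [start] + [n for n in PERIOD if n != start and assoc(start, n) > threshold]

def diversity(group):
    """Assumed measure: 1 - mean pairwise association within the group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(assoc(a, b) for a, b in pairs) / len(pairs)

def rank_groups(starts, threshold=0.5):
    """Claim 4: one group per starting dataset, most diverse group first."""
    groups = [group_for(s, threshold) for s in starts]
    return sorted(groups, key=diversity, reverse=True)

def period_overlap(p, q):
    """Assumed measure: shared span divided by the shorter of the two periods."""
    shared = min(p[1], q[1]) - max(p[0], q[0])
    shorter = min(p[1] - p[0], q[1] - q[0])
    return max(shared, 0) / shorter if shorter > 0 else 0.0

def group_with_overlap(start, assoc_min=0.5, overlap_min=0.4):
    """Claim 5: require both sufficient association and sufficient period overlap."""
    return [n for n in group_for(start, assoc_min)[1:]
            if period_overlap(PERIOD[start], PERIOD[n]) > overlap_min]

print(rank_groups(["A", "D"]))
print(group_with_overlap("A"))
```

With these made-up values, the group around D ranks ahead of the group around A because its members are less alike, and C is dropped from A's results under claim 5 because its collection period does not overlap A's.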

Description

This invention relates to a system for searching for datasets within a group of datasets that have been digitized and stored electromagnetically in a database device or storage device. More specifically, it relates to a system for searching for datasets related to an arbitrary keyword. In this specification, a dataset refers to a set of data recorded in a format such as CSV or JSON, in which numerical values or text are listed as data for each item.

Various system configurations have been proposed for searching for and extracting specific data from digitized data sets. For example, Patent Document 1 proposes a configuration for keyword searching using associative memory: to reduce the ambiguity of input information, ambiguous keywords such as phone numbers and product numbers are stored in associative memory, and the system detects the parts of an input query that match these keywords. Patent Document 2 proposes a configuration in which the user inputs search keywords (e.g., rental properties in Tokyo) and a concept (e.g., medical institutions), and the system retrieves keywords conceptually similar to those keywords and that concept (e.g., ~General Hospital) and outputs them to the network. Non-Patent Document 1 proposes a configuration that collects open data published on the World Wide Web (WWW), organizes its metadata (title, author, description, etc.), and performs keyword searches on that metadata to find datasets.

  • Patent Document 1: Japanese Patent Publication No. 2013-33473
  • Patent Document 2: Japanese Patent Publication No. 2007-12039
  • Non-Patent Document 1: “Google Dataset Search: Building a search engine for datasets in an open Web ecosystem”, Natasha Noy, Matthew Burgess, Dan Brickley, 28th Web Conference (WebConf 2019), ACM. https://datasetsearch.research.google.com/ https://research.google/pubs/google-dataset-search-building-a-search-engine-for-datasets-in-an-openweb-ecosystem/

Figure 1 is a schematic representation of a computer on which the dataset search system according to this embodiment is implemented.
Figure 2 is a block diagram showing the configuration of the dataset search system according to this embodiment.
Figure 3 is a flowchart illustrating the dataset search process in the system according to this embodiment.
Figure 4 is a schematic diagram illustrating the association, based on vector similarity, between datasets stored in the database according to this embodiment.
Figure 5 is a schematic diagram illustrating the links between datasets based on their degree of association with the starting dataset (starting node) according to this embodiment.

The present invention is described in detail below with reference to the attached figures and several preferred embodiments. In the figures, the same reference numerals indicate the same parts.

Configuration of the Computer Device

The dataset search system according to this embodiment may be implemented by running a program on a computer device 1 of a type commonly used in this field, as illustrated in Figure 1. In a typical configuration, the computer device 1 is equipped with a CPU, a storage device, and an input/output device (I/O) interconnected by a bidirectional common bus. The storage device includes a memory PM that stores each program used to perform the calculations in this embodiment, as well as a work memory WM and a data memory DM (5) used during calculations.
Instructions from the searcher to the computer device 1, as well as the display and output of search results and other information, pass through a computer terminal device 2 connected to the computer device 1. In a typical configuration, the computer terminal device 2 is equipped with a monitor 3 and input devices 4 such as a keyboard and mouse. Once the program is started, the searcher can give instructions and inputs to the computer device 1 with the input devices 4, following the program's procedure as displayed on the monitor 3, and can visually check the calculation status and results from the computer device 1 on the monitor 3. The datasets to be searched may be stored in the data memory DM (5) of the computer device 1, or in a cloud system accessible via any communication network.

Configuration of the Dataset Search System

Referring to Figure 2, the dataset search system according to this embodiment generally consists of a dataset storage unit 10, a feature extraction unit 11, a feature storage unit 15, a keyword input unit 20, a search network generation unit 30, a priority presentation component determination unit 40, and a search result display unit 50.

The dataset storage unit 10 is a database in which the datasets to be searched are stored. As already mentioned, a dataset is a set of data recorded in a format such as CSV or JSON, in which numerical or text data is listed for each item. The data content can be anything, such as stock prices, weather forecasts, factory operations, personnel information, or sales performance. The dataset sto