JP-2026076330-A - Document search system
Abstract
[Problem] To search for documents while taking the concepts of a document into account. [Solution] The document search system comprises an input unit, a first processing unit, a storage unit, a second processing unit, and an output unit. The input unit has a function of inputting a first document. The first processing unit has a function of creating a first graph structure from the first document. The storage unit has a function of storing a second graph structure. The second processing unit has a function of calculating the similarity between the first graph structure and the second graph structure. The output unit has a function of supplying information. The first processing unit also has a function of dividing the first document into multiple tokens. The nodes and edges of the first graph structure have labels, and the labels are composed of multiple tokens. [Selected Drawing] Figure 1
Inventors
- 桃 純平
- 郷戸 宏充
Assignees
- 株式会社半導体エネルギー研究所
Dates
- Publication Date
- 20260511
- Application Date
- 20260217
- Priority Date
- 20191025
Claims (1)
- A document search system comprising an input unit, a first processing unit, a storage unit, a second processing unit, and an output unit, wherein: the input unit has a function of inputting a first document to the first processing unit; the first processing unit has a function of creating a first graph structure from the first document; the first graph structure has a first node, a second node, and an edge; the edge has a first direction; the first direction is determined based on the semantic relationship between the first node and the second node; the storage unit has a function of storing a second graph structure; the second processing unit has a function of calculating the similarity between the first graph structure and the second graph structure; and the output unit has a function of supplying information regarding the calculated similarity.
Description
One aspect of the present invention relates to a document search system. Another aspect of the present invention relates to a method for searching for documents.

Various search technologies are available for searching documents. Traditional document searches primarily use word (string) searches. For example, PageRank is used for web pages, and thesauruses are used in the patent field. Searches using sets of words are also employed: document similarity can be expressed using the Jaccard coefficient, the Dice coefficient, the Simpson coefficient, and the like. There are also methods that vectorize documents using, for example, tf-idf, Bag of Words (BoW), or Doc2Vec, and then compare their cosine similarity. Furthermore, there are methods for finding desired documents by evaluating the similarity of text strings using measures such as the Hamming distance, the Levenshtein distance, and the Jaro-Winkler distance. In addition, Patent Document 1 discloses a language processing device that compares the similarity of sentences by converting the constituent units of a sentence into string structures and calculating the distance between these string structures.

Patent Document 1: Japanese Patent Publication No. 2005-258624

Figure 1 shows an example of a document search system.
Figure 2 is a flowchart showing an example of a method for searching for documents.
Figures 3A to 3C show the results obtained in each step.
Figures 4A to 4C show the results obtained in each step.
Figures 5A to 5D show the results obtained in each step.
Figures 6A to 6C show the results obtained in each step.
Figure 7 shows an example of the hardware for a document search system.
Figure 8 shows an example of the hardware for a document search system.

Embodiments will be described in detail with reference to the drawings.
However, it will be readily apparent to those skilled in the art that the present invention is not limited to the following description, and that its form and details can be modified in various ways without departing from the spirit and scope of the invention. Therefore, the present invention shall not be construed as being limited to the descriptions of the embodiments shown below.

In the configurations of the invention described below, the same reference numerals are used in common across different drawings for parts that are identical or have similar functions, and repeated explanations are omitted. Parts with similar functions may also be drawn with the same hatch pattern and left without reference numerals.

For ease of understanding, the position, size, and scope of each component shown in the drawings may not represent the actual position, size, and scope. Therefore, the disclosed invention is not necessarily limited to the position, size, and scope disclosed in the drawings.

It should also be noted that the ordinal numbers "first," "second," and "third" used in this specification are added to avoid confusion among constituent elements and do not imply any numerical limitation.

(Embodiment 1)
In this embodiment, a document search system and a method for searching for documents according to one aspect of the present invention will be described with reference to Figures 1 to 4C.

<Document Search System>
Figure 1 is a diagram showing the configuration of the document search system 100. In other words, Figure 1 shows an example of the configuration of a document search system that is one embodiment of the present invention. The document search system 100 may be installed on an information processing device such as a personal computer used by the user. Alternatively, the processing unit of the document search system 100 may be installed on a server and accessed from a client PC via a network.
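As a rough illustration of the kind of data such a system handles, the following Python sketch models a graph structure whose edges are directed and labeled, and compares two graph structures. The names `Edge`, `GraphStructure`, and `jaccard_similarity` are hypothetical and do not appear in the specification. The use of the Jaccard coefficient over edge sets is also an assumption made only for illustration; this passage does not state how the similarity between graph structures is actually calculated.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Edge:
    """A directed, labeled edge pointing from `source` to `target` (hypothetical)."""
    source: str
    label: str
    target: str


@dataclass
class GraphStructure:
    """Hypothetical container for a graph structure created from a document."""
    edges: Set[Edge] = field(default_factory=set)

    def add_edge(self, source: str, label: str, target: str) -> None:
        # Node labels and the edge label are plain strings here; in the
        # described system they would be built from tokens of the document.
        self.edges.add(Edge(source, label, target))

    @property
    def nodes(self) -> Set[str]:
        # Every node label that appears at either end of an edge.
        return {e.source for e in self.edges} | {e.target for e in self.edges}


def jaccard_similarity(a: GraphStructure, b: GraphStructure) -> float:
    """Jaccard coefficient of the two edge sets (an illustrative stand-in)."""
    union = a.edges | b.edges
    if not union:
        return 0.0  # convention chosen for this sketch: two empty graphs score 0
    return len(a.edges & b.edges) / len(union)


if __name__ == "__main__":
    first = GraphStructure()
    first.add_edge("transistor", "has", "gate")
    first.add_edge("gate", "is over", "insulator")

    second = GraphStructure()
    second.add_edge("transistor", "has", "gate")
    second.add_edge("insulator", "is over", "gate")  # same nodes, reversed direction

    print(jaccard_similarity(first, second))
```

Because each edge records its direction, the two example graphs share only one of three distinct edges even though they contain the same nodes, so a comparison of this kind distinguishes "gate is over insulator" from "insulator is over gate".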
As shown in Figure 1, the document search system 100 includes an input unit 101, a graph structure creation unit 102, a similarity calculation unit 103, an output unit 104, and a storage unit 105. The processing unit includes the graph structure creation unit 102 and the similarity calculation unit 103.

The input unit 101 receives the document 20. The document 20 is a document specified by the user for search purposes. The document 20 is text data, audio data, or image data. Examples of the input unit 101 include input devices such as keyboards, mice, touch sensors, microphones, scanners, and cameras.

The document search system 100 may have a function of converting audio data into text data. For example, the graph structure creation unit 102 may have this function. Alternatively, the document search system 100 may further have an audio-to-text conversion unit that has this function.

The document search system 100 may have an optical character recognition (OCR) function. This allows it to recognize characters contained in image data and create text data. For example, the graph structure creation unit 102 may have this function. Alternatively, the document search system 100 ma