CN-121996621-A - Patent document storage and retrieval method and device based on hierarchical hash mapping
Abstract
The disclosure belongs to the technical field of nuclear power, and particularly relates to a patent document storage and retrieval method and device based on hierarchical hash mapping. The method determines at least two levels of hash directories from consecutive characters of a hash value, and constructs a storage path that takes the country code as the first-level directory, the at least two levels of hash directories as subdirectories, and the patent identifier as the last-level directory. The target patent file and its associated files are stored under this storage path. As a result, no object storage cluster needs to be deployed and no database is needed as a path index; the path structure is fixed and can be derived automatically, which makes the method suitable for long-term archiving and offline retrieval scenarios.
Inventors
- Shen Qiangbin
- Ke Chunxiao
- Fu Liang
- Min Yanli
- Lin Ying
- Li Ling
- Yin Dongmin
- Hou Lei
- Han Wanxin
- Duan Feihu
Assignees
- 同方知网数字科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-25
Claims (10)
- 1. A patent document storage and retrieval method based on hierarchical hash mapping, characterized by comprising the following steps: acquiring a patent identifier of an input target patent file; extracting a country code from the patent identifier; calculating a hash value based on the patent identifier; determining at least two levels of hash directories according to consecutive characters of the hash value; constructing a storage path that takes the country code as a first-level directory, the at least two levels of hash directories as subdirectories, and the patent identifier as a last-level directory; and storing the target patent file and its associated files under the storage path.
- 2. The method of claim 1, wherein calculating a hash value based on the patent identifier comprises applying the MD5 algorithm to the patent identifier to generate the hash value.
- 3. The method of claim 2, wherein determining at least two levels of hash directories according to consecutive characters of the hash value comprises: selecting the first two characters of the hash value as a first-level hash directory; and selecting the third and fourth characters of the hash value as a second-level hash directory.
- 4. The method of claim 1, wherein the associated files comprise at least one of a PDF full-text file, an XML text file, an image file, and a metadata file.
- 5. The method of any one of claims 1 to 4, further comprising migrating historical data by: scanning historical patent files in a source storage system; parsing the identifier of each historical patent file to obtain the corresponding country code; calculating a corresponding hash value from the identifier of the historical patent file, and generating at least two levels of hash directories according to a preset rule; constructing a target storage path from the country code, the at least two levels of hash directories, and the identifier of the historical patent file; and moving the historical patent file and its associated files to the target storage path.
- 6. The method of claim 5, wherein the migration is performed by parallel processing.
- 7. The method of claim 5, further comprising generating a migration log during migration, the migration log comprising a source file path, a target file path, and verification information.
- 8. A patent document storage and retrieval device based on hierarchical hash mapping, comprising: an acquisition module configured to acquire a patent identifier of an input target patent file; an extraction module configured to extract a country code from the patent identifier; a calculation module configured to calculate a hash value based on the patent identifier; a determination module configured to determine at least two levels of hash directories according to consecutive characters of the hash value; a construction module configured to construct a storage path that takes the country code as a first-level directory, the at least two levels of hash directories as subdirectories, and the patent identifier as a last-level directory; and a storage module configured to store the target patent file and its associated files under the storage path.
- 9. A patent document storage and retrieval device based on hierarchical hash mapping, the device comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the method of any one of claims 1 to 7.
- 10. A non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
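The migration steps recited in claims 5 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names (`target_dir`, `migrate_one`, `migrate`), the thread pool, the JSON-lines log format, and the use of an MD5 file checksum as the "verification information" are all assumptions made for illustration. The sketch also assumes that each source file name (without extension) is the patent identifier and that the identifier starts with a two-letter country code.

```python
import hashlib
import json
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def target_dir(root: Path, patent_id: str) -> Path:
    """Derive <root>/<country>/<hash[0:2]>/<hash[2:4]>/<patent_id>."""
    digest = hashlib.md5(patent_id.encode("utf-8")).hexdigest()
    # Assumption: the first two characters of the identifier are the country code.
    return root / patent_id[:2].upper() / digest[:2] / digest[2:4] / patent_id


def file_md5(path: Path) -> str:
    """Checksum of the file content, read in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def migrate_one(src: Path, root: Path) -> dict:
    """Move one file and return its migration log record."""
    patent_id = src.stem  # assumption: "CN-1-A.pdf" belongs to patent "CN-1-A"
    dest_dir = target_dir(root, patent_id)
    dest_dir.mkdir(parents=True, exist_ok=True)  # safe if another worker created it
    checksum = file_md5(src)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))
    return {
        "source": str(src),
        "target": str(dest),
        "md5": checksum,
        "verified": file_md5(dest) == checksum,
    }


def migrate(source_root: Path, target_root: Path, log_path: Path, workers: int = 8) -> int:
    """Migrate every file under source_root in parallel and write a JSON-lines log."""
    files = [p for p in source_root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = list(pool.map(lambda p: migrate_one(p, target_root), files))
    with log_path.open("w", encoding="utf-8") as log:
        for record in records:
            log.write(json.dumps(record) + "\n")
    return len(records)
```

Because each target path depends only on the file's own identifier, the workers need no shared index and can process files independently, which is what makes the parallel migration of claim 6 straightforward in this sketch.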
Description
Patent document storage and retrieval method and device based on hierarchical hash mapping
Technical Field
The invention belongs to the technical field of nuclear power, and particularly relates to a patent document storage and retrieval method and device based on hierarchical hash mapping.
Background
Patent offices in various jurisdictions (CN, US, JP, EP, WO, etc.) publish millions of new patent documents every year, producing huge volumes of patent PDF, XML, and image data to store. However, the widely used storage schemes based on Linux file systems generally have the following problems:
1. When the number of files in a single directory exceeds tens of thousands, directory entry lookup performance in file systems such as EXT4 and XFS degrades significantly, and copying and viewing the directory become difficult.
2. Computing the file retrieval path is complex and inefficient. Systems often rely on a database (MySQL, PostgreSQL) to store file indexes that are then mapped to physical paths, so the retrieval chain is long.
3. File migration is difficult. Migrating historical data in batches requires frequent database index queries, which makes parallel migration of tens of millions of files hard.
4. A large file system is not suited to the complex file relationships of the patent scenario (such as peer files and image + XML + PDF sets).
Thus, traditional file system structures have difficulty supporting growing archiving needs of tens to hundreds of millions of patent files.
Disclosure of Invention
To overcome the problems in the related art, a patent document storage and retrieval method and device based on hierarchical hash mapping are provided.
According to an aspect of the embodiments of the present disclosure, there is provided a patent document storage and retrieval method based on hierarchical hash mapping, including: acquiring a patent identifier of an input target patent file; extracting a country code from the patent identifier; calculating a hash value based on the patent identifier; determining at least two levels of hash directories according to consecutive characters of the hash value; constructing a storage path that takes the country code as a first-level directory, the at least two levels of hash directories as subdirectories, and the patent identifier as a last-level directory; and storing the target patent file and its associated files under the storage path.
In one possible implementation, calculating the hash value based on the patent identifier includes applying the MD5 algorithm to the patent identifier to generate the hash value.
In one possible implementation, determining at least two levels of hash directories according to consecutive characters of the hash value includes: selecting the first two characters of the hash value as a first-level hash directory; and selecting the third and fourth characters of the hash value as a second-level hash directory.
In one possible implementation, the associated files include at least one of a PDF full-text file, an XML text file, an image file, and a metadata file.
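The path construction described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation disclosed by the applicant. The function names (`extract_country_code`, `storage_path`, `store`) are hypothetical. The sketch also assumes that the identifier begins with a two-letter country code, that it is hashed as UTF-8 text, and that the "characters of the hash value" are those of the lowercase hexadecimal MD5 digest.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the identifier starts with a two-letter country code (e.g. "CN", "US").
_COUNTRY_RE = re.compile(r"^([A-Z]{2})")


def extract_country_code(patent_id: str) -> str:
    """Return the leading two-letter country code, upper-cased."""
    match = _COUNTRY_RE.match(patent_id.strip().upper())
    if match is None:
        raise ValueError(f"no country code found in {patent_id!r}")
    return match.group(1)


def storage_path(root: Path, patent_id: str) -> Path:
    """Build <root>/<country>/<hash[0:2]>/<hash[2:4]>/<patent_id>."""
    patent_id = patent_id.strip()
    country = extract_country_code(patent_id)
    # The identifier is hashed exactly as given, so callers must normalize
    # its spelling consistently for storage and retrieval to agree.
    digest = hashlib.md5(patent_id.encode("utf-8")).hexdigest()
    level1 = digest[0:2]  # first two characters: first-level hash directory
    level2 = digest[2:4]  # third and fourth characters: second-level hash directory
    return root / country / level1 / level2 / patent_id


def store(root: Path, patent_id: str, files: dict[str, bytes]) -> Path:
    """Write the patent file and its associated files under the derived path."""
    target = storage_path(root, patent_id)
    target.mkdir(parents=True, exist_ok=True)
    for name, data in files.items():
        (target / name).write_bytes(data)
    return target
```

Retrieval under this scheme needs no index lookup: calling `storage_path(root, patent_id)` again with the same identifier yields the same directory. With two hexadecimal characters per level, each country directory fans out into at most 256 × 256 = 65,536 leaf buckets.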
In one possible implementation, the method further includes migrating historical data: scanning historical patent files in a source storage system; parsing the identifier of each historical patent file to obtain the corresponding country code; calculating a corresponding hash value from the identifier of the historical patent file, and generating at least two levels of hash directories according to a preset rule; constructing a target storage path from the country code, the at least two levels of hash directories, and the identifier of the historical patent file; and moving the historical patent file and its associated files to the target storage path.
In one possible implementation, the migration is performed by parallel processing.
In one possible implementation, the method further includes generating a migration log during migration, the migration log including a source file path, a target file path, and verification information.
According to another aspect of the embodiments of the present disclosure, there is provided a patent document storage and retrieval device based on hierarchical hash mapping, including: an acquisition module configured to acquire a patent identifier of an input target patent file; an extraction module configured to extract a country code from the patent identifier; a calculation module configured to calculate a hash value based on the patent identifier; a determination module configured to determine at least two levels of hash directories according to consecutive characters of the hash value; a construction module configured to construct a storage path taking the country code as a first-level directory, the at least two levels of hash directories as a subdirector