CN-121980017-A - Journal information retrieval method based on big data drive


Abstract

The invention relates to the technical field of data processing, and in particular to a journal information retrieval method based on big data driving. The method comprises: collecting multi-source journal data and preprocessing it to obtain journal features; constructing a knowledge graph based on the features and training an initial model to obtain a pre-trained language model; outputting a retrieval result based on the knowledge graph; acquiring the matching accuracy of the analysis result and the knowledge graph, and determining whether the accuracy of journal information retrieval meets the requirement; if it does not, determining whether the context semantic attention span threshold of the model needs to be increased; if the threshold does not need to be increased, determining from the core field noise ratio of the data whether the processing effectiveness of the data meets the requirement; if it does not, determining whether the knowledge graph matching contribution weight coefficient needs to be reduced; and if the coefficient does not need to be reduced, determining the negative sample missing simulation proportion of the model based on the core field acquisition missing rate of the data. The invention improves the accuracy of journal information retrieval.

Inventors

Li Foyuan

Assignees

Beijing University of Civil Engineering and Architecture (北京建筑大学)

Dates

Publication Date: 20260505
Application Date: 20260409

Claims (10)

1. A journal information retrieval method based on big data driving, characterized by comprising the following steps: collecting multi-source journal data, sequentially cleaning, standardizing, denoising and extracting features from the multi-source journal data to obtain journal features, constructing a knowledge graph based on the journal features, and training an initial model based on the journal features to obtain a pre-trained language model; performing semantic analysis on journal retrieval information based on the pre-trained language model to obtain an analysis result, matching the analysis result against the knowledge graph to obtain a matching degree, and outputting a retrieval result based on the matching degree; acquiring the matching accuracy of the analysis result and the knowledge graph, and determining, based on that matching accuracy, whether the accuracy of journal information retrieval meets the requirement; if the accuracy of journal information retrieval does not meet the requirement, determining whether a context semantic attention span threshold of the pre-trained language model needs to be increased; if the context semantic attention span threshold of the pre-trained language model does not need to be increased, acquiring the core field noise ratio of the multi-source journal data to determine whether the processing effectiveness of the multi-source journal data meets the requirement; if the processing effectiveness of the multi-source journal data does not meet the requirement, determining whether a knowledge graph matching contribution degree weight coefficient needs to be reduced; and if the knowledge graph matching contribution degree weight coefficient does not need to be reduced, determining the negative sample deletion simulation proportion of the pre-trained language model based on the core field acquisition deletion rate of the multi-source journal data.
2. The big data driven journal information retrieval method according to claim 1, wherein determining whether the accuracy of journal information retrieval meets the requirement based on the matching accuracy of the analysis result and the knowledge graph comprises: comparing the matching accuracy of the analysis result and the knowledge graph with a preset second accuracy; if the matching accuracy of the analysis result and the knowledge graph is greater than or equal to the preset second accuracy, determining that the accuracy of journal information retrieval meets the requirement; and if the matching accuracy of the analysis result and the knowledge graph is smaller than the preset second accuracy, determining that the accuracy of journal information retrieval does not meet the requirement.
3. The big data driven journal information retrieval method according to claim 2, wherein determining whether the context semantic attention span threshold of the pre-trained language model needs to be increased comprises: comparing the matching accuracy of the analysis result and the knowledge graph with a preset first accuracy and the preset second accuracy respectively; if the matching accuracy of the analysis result and the knowledge graph is smaller than or equal to the preset first accuracy, determining that the context semantic attention span threshold of the pre-trained language model needs to be increased; and if the matching accuracy of the analysis result and the knowledge graph is larger than the preset first accuracy and smaller than the preset second accuracy, determining that the context semantic attention span threshold of the pre-trained language model does not need to be increased.
4. The big data driven journal information retrieval method according to claim 3, wherein the magnitude of the increase in the context semantic attention span threshold of the pre-trained language model is determined by the difference between the preset first accuracy and the matching accuracy of the analysis result and the knowledge graph.
5. The big data driven journal information retrieval method according to claim 4, wherein determining whether the processing effectiveness of the multi-source journal data meets the requirement based on the core field noise ratio of the multi-source journal data comprises: comparing the core field noise ratio of the multi-source journal data with a preset first ratio; if the core field noise ratio of the multi-source journal data is smaller than or equal to the preset first ratio, determining that the processing effectiveness of the multi-source journal data meets the requirement, and determining whether the context semantic attention span threshold of the pre-trained language model meets the requirement; and if the core field noise ratio of the multi-source journal data is larger than the preset first ratio, determining that the processing effectiveness of the multi-source journal data does not meet the requirement.
6. The big data driven journal information retrieval method according to claim 5, wherein determining whether the knowledge graph matching contribution degree weight coefficient needs to be reduced comprises: comparing the core field noise ratio of the multi-source journal data with the preset first ratio and a preset second ratio respectively; if the core field noise ratio of the multi-source journal data is larger than the preset first ratio and smaller than the preset second ratio, determining that the knowledge graph matching contribution degree weight coefficient needs to be reduced; and if the core field noise ratio of the multi-source journal data is larger than or equal to the preset second ratio, determining that the knowledge graph matching contribution degree weight coefficient does not need to be reduced.
7. The big data driven journal information retrieval method according to claim 6, wherein the magnitude of the reduction in the knowledge graph matching contribution degree weight coefficient is determined by the difference between the core field noise ratio of the multi-source journal data and the preset first ratio.
8. The big data driven journal information retrieval method according to claim 7, wherein determining the negative sample deletion simulation proportion of the pre-trained language model based on the core field acquisition deletion rate of the multi-source journal data comprises: comparing the core field acquisition deletion rate of the multi-source journal data with a preset deletion rate; if the core field acquisition deletion rate of the multi-source journal data is smaller than or equal to the preset deletion rate, determining that the acquisition integrity of the multi-source journal data meets the requirement and that the negative sample deletion simulation proportion of the pre-trained language model does not need to be increased, and determining whether the knowledge graph matching contribution degree weight coefficient meets the requirement; and if the core field acquisition deletion rate of the multi-source journal data is larger than the preset deletion rate, determining that the acquisition integrity of the multi-source journal data does not meet the requirement and that the negative sample deletion simulation proportion of the pre-trained language model needs to be increased.
9. The big data driven journal information retrieval method according to claim 8, wherein the core field acquisition deletion rate of the multi-source journal data is the ratio of the number of missing core fields to the total number of core fields in the multi-source journal data.
10. The big data driven journal information retrieval method according to claim 9, wherein the magnitude of the increase in the negative sample deletion simulation proportion of the pre-trained language model is determined by the difference between the core field acquisition deletion rate of the multi-source journal data and the preset deletion rate.
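
The threshold comparisons in claims 2 to 10 form a single decision cascade, which can be sketched as follows. This is only an illustrative reading of the claims, not the applicant's implementation. The names `Thresholds`, `core_field_deletion_rate` and `decide_adjustment`, the action labels, and every numeric default are hypothetical, since the claims say only that the values are "preset". The claims also state only that each adjustment magnitude "is determined by" a difference, so the sketch returns that difference and does not guess how it maps to an adjustment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical placeholder values; the claims only call these "preset".
    first_accuracy: float = 0.70
    second_accuracy: float = 0.85
    first_noise_ratio: float = 0.10
    second_noise_ratio: float = 0.30
    deletion_rate: float = 0.05


def core_field_deletion_rate(missing_core_fields: int, total_core_fields: int) -> float:
    """Claim 9: number of missing core fields divided by total core fields."""
    if total_core_fields <= 0:
        raise ValueError("total_core_fields must be positive")
    return missing_core_fields / total_core_fields


def decide_adjustment(accuracy: float, noise_ratio: float, deletion_rate: float,
                      t: Thresholds = Thresholds()) -> tuple[str, float]:
    """Return (action label, difference that drives the adjustment magnitude)."""
    # Claim 2: retrieval accuracy meets the requirement, nothing to adjust.
    if accuracy >= t.second_accuracy:
        return ("none", 0.0)
    # Claims 3-4: very low accuracy, increase the attention span threshold.
    if accuracy <= t.first_accuracy:
        return ("increase_attention_span_threshold", t.first_accuracy - accuracy)
    # From here on: first_accuracy < accuracy < second_accuracy.
    # Claim 5: data processing is effective, so re-examine the span threshold.
    if noise_ratio <= t.first_noise_ratio:
        return ("recheck_attention_span_threshold", 0.0)
    # Claims 6-7: moderate noise, reduce the knowledge graph matching weight.
    if noise_ratio < t.second_noise_ratio:
        return ("reduce_kg_weight", noise_ratio - t.first_noise_ratio)
    # Claim 8: heavy noise, so look at the core field deletion rate instead.
    if deletion_rate <= t.deletion_rate:
        return ("recheck_kg_weight", 0.0)
    # Claims 8 and 10: too many missing core fields, simulate more of them.
    return ("increase_negative_sample_proportion", deletion_rate - t.deletion_rate)
```

For example, under these placeholder thresholds, a matching accuracy of 0.80 with a core field noise ratio of 0.40 and 8 missing core fields out of 100 falls through to the last branch, so the sketch reports that the negative sample deletion simulation proportion should be increased.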

Description

Journal information retrieval method based on big data drive

Technical Field

The invention relates to the technical field of data processing, and in particular to a journal information retrieval method based on big data driving.

Background

In the prior art, big data journal information retrieval methods rely on keyword matching or simple text similarity calculation, so retrieval precision is insufficient and resistance to data noise and missing data is weak. Existing methods also generally lack multi-source data fusion capability, have limited understanding of nested professional terms and cross-discipline semantic expressions, are easily disturbed by redundant information, and lack traceability. They therefore struggle to meet the high retrieval precision required by scenarios such as academic research and bibliometric analysis, and the accuracy of journal information retrieval is insufficient.

Chinese patent publication No. CN119357468A discloses a journal matching recommendation method and device based on big data. The method comprises: obtaining a manuscript to be detected and extracting its characteristic information to form a target feature vector; obtaining historical journal data and extracting the characteristic information of each journal to form a corresponding journal feature vector; training a machine learning model on the journal feature vectors to obtain a journal matching model; inputting the target feature vector into the journal matching model to obtain a plurality of matching journals and their matching degrees; calculating the similarity between the target feature vector and the journal feature vector of each matching journal; and sorting the matching journals by similarity and matching degree to obtain a recommended journal list.
The above journal matching recommendation method and device based on big data therefore have the following problem: the accuracy of journal information retrieval is insufficient because no complete retrieval link is built, journal acceptance preference is matched only from paper titles and abstracts, there is no field noise suppression or data compensation mechanism, semantic matching is shallow, and there is no quantifiable parameter optimization system.

Disclosure of Invention

The invention therefore provides a journal information retrieval method based on big data driving, which addresses the problem in the prior art that the accuracy of journal information retrieval is insufficient because no complete retrieval link is built, journal acceptance preference is matched only from paper titles and abstracts, there is no field noise suppression or data compensation mechanism, semantic matching is shallow, and there is no quantifiable parameter optimization system.
To achieve the above object, the present invention provides a journal information retrieval method based on big data driving, comprising: collecting multi-source journal data, sequentially cleaning, standardizing, denoising and extracting features from the multi-source journal data to obtain journal features, constructing a knowledge graph based on the journal features, and training an initial model based on the journal features to obtain a pre-trained language model; performing semantic analysis on journal retrieval information based on the pre-trained language model to obtain an analysis result, matching the analysis result against the knowledge graph to obtain a matching degree, and outputting a retrieval result based on the matching degree; acquiring the matching accuracy of the analysis result and the knowledge graph, and determining, based on that matching accuracy, whether the accuracy of journal information retrieval meets the requirement; if the accuracy of journal information retrieval does not meet the requirement, determining whether a context semantic attention span threshold of the pre-trained language model needs to be increased; if the context semantic attention span threshold of the pre-trained language model does not need to be increased, acquiring the core field noise ratio of the multi-source journal data to determine whether the processing effectiveness of the multi-source journal data meets the requirement; if the processing effectiveness of the multi-source journal data does not meet the requirement, determining whether a knowledge graph matching contribution degree weight coefficient needs to be reduced; and if the knowledge graph matching contribution degree weight coefficient does not need to be reduced, determining the negative sample deletion simulation proportion of the pre-trained language model based on the core