CN-122021842-A - Knowledge service platform thematic corpus construction method
Abstract
The application belongs to the technical field of knowledge product construction and relates in particular to a method for constructing a thematic corpus of a knowledge service platform. The method comprises four steps. Step one, document demand change sensing: the user retrieval logs of the knowledge service platform are analyzed to sense changes in users' document demand. Step two, document demand content analysis: after it is determined that users' document demand has changed, the documents that users interacted with on the knowledge service platform during the demand change interval are analyzed, the core content of users' document demand in that interval is mined, and the topics and the initial query words of each topic are provided for constructing the thematic corpus. Step three, document query word expansion: the initial query words of each topic are expanded to obtain expanded query word sequences. Step four, thematic corpus construction: semantic retrieval is performed on the knowledge service platform with the several groups of expanded query word sequences of each topic, and the thematic corpus is constructed from the retrieved documents.
Inventors
- ZHU YUHU
- CHU TAO
- FENG LI
Assignees
- 中国航空工业集团公司西安飞机设计研究所
Dates
- Publication Date
- 20260512
- Application Date
- 20260416
Claims (14)
- 1. A method for constructing a thematic corpus of a knowledge service platform, characterized by comprising the following steps: Step one, document demand change sensing: analyzing the user retrieval logs of the knowledge service platform and sensing changes in users' document demand; Step two, document demand content analysis: after determining that users' document demand has changed, analyzing the documents that users interacted with on the knowledge service platform during the demand change interval, mining the core content of users' document demand in that interval, and providing the topics and the initial query words of each topic for constructing the thematic corpus; Step three, document query word expansion: performing semantic expansion on the initial query words of each topic to obtain expanded query word sequences; Step four, thematic corpus construction: performing semantic retrieval on the knowledge service platform with the several groups of expanded query word sequences of each topic, and constructing the thematic corpus from the retrieved documents.
- 2. The method for constructing a thematic corpus of a knowledge service platform according to claim 1, wherein step one comprises: S11, collecting and analyzing the user retrieval logs of the knowledge service platform, and extracting the search hot word sets K' and K'' of an earlier time period and a later time period, each search hot word set being an ordered list of search terms; S12, measuring the difference between the search hot word sets K' and K'' of the two time periods with the sequence Jaccard distance (S-JD) index, and determining whether users' document demand has changed.
- 3. The method for constructing a thematic corpus of a knowledge service platform according to claim 2, wherein in S11, when a search hot word set is extracted, the user retrieval logs are preprocessed by word segmentation, stop word removal, and word frequency statistics, the search terms are sorted by word frequency from high to low, and the first 100 search terms are extracted to form the search hot word set.
- 4. The method for constructing a thematic corpus of a knowledge service platform according to claim 3, wherein in S12, if the sequence Jaccard distance S-JD(K', K'') between the search hot word sets of the two time periods is greater than a distance threshold δ, it is determined that users' document demand has changed.
- 5. The method for constructing a thematic corpus of a knowledge service platform according to claim 4, wherein in step two, the documents that users browsed, downloaded, bookmarked, and shared on the knowledge service platform during the demand change interval are analyzed.
- 6. The method for constructing a thematic corpus of a knowledge service platform according to claim 5, wherein step two comprises: S21, vectorizing the titles of the documents in the knowledge service platform to construct a document title vector matrix; S22, performing dimension reduction on the document title vector matrix; S23, clustering the vectorized titles of the documents that users interacted with on the knowledge service platform during the demand change interval to form a plurality of title clusters; S24, generating the topic of each title cluster and the initial query words of the topic.
- 7. The method for constructing a thematic corpus of a knowledge service platform according to claim 6, wherein in S21, the paraphrase-multilingual-MiniLM-L12-v2 pre-trained language model is used to convert the titles of the documents in the knowledge service platform into high-dimensional semantic vectors; and in S22, the UMAP algorithm is used to perform dimension reduction on the high-dimensional semantic vector matrix of the document titles.
- 8. The method for constructing a thematic corpus of a knowledge service platform according to claim 7, wherein in S23, the vectorized titles of the documents that users interacted with on the knowledge service platform during the demand change interval are clustered with the Ward algorithm of hierarchical clustering; and S24 specifically comprises: calculating the TF-IDF value of each word in a title cluster, wherein TF is the occurrence frequency of the word in the current title cluster and IDF is the reciprocal of the occurrence frequency of the word in all title clusters; calculating the cosine similarity cosim between each word in the title cluster and the cluster centroid; calculating the composite score α·(TF-IDF) + (1-α)·cosim of each word in the title cluster, wherein α is a linear combination parameter set to 0.5; and selecting the word with the highest composite score as the topic of the title cluster, and selecting the words with the top-ranked composite scores in the title cluster as the initial query words of the topic.
- 9. The method for constructing a thematic corpus of a knowledge service platform according to claim 8, wherein step three comprises: S31, constructing a semantically enhanced word co-occurrence network for each topic; S32, taking the initial query words of each topic as central words, and performing forward and backward semantic expansion according to the word co-occurrence network to form forward expansion query words and backward expansion query words; S33, calculating the semantic contribution score of each word among the forward expansion query words and the backward expansion query words, retaining the word with the highest semantic contribution score, and combining it with the initial query words to form the expanded query words; S34, taking the forward expansion query words and the backward expansion query words as new central words, performing one-way forward and backward semantic expansion according to the word co-occurrence network to form new forward expansion query words and backward expansion query words, calculating the semantic contribution scores, retaining the word with the highest score, and adding it to the expanded query words.
- 10. The method for constructing a thematic corpus of a knowledge service platform according to claim 9, wherein S31 specifically comprises: counting the co-occurrence rate of word pairs in the document titles under each topic; and marking the semantic relation direction of word pairs with a high co-occurrence rate to form the semantically enhanced word co-occurrence network.
- 11. The method for constructing a thematic corpus of a knowledge service platform according to claim 10, wherein in step three, S34 is repeated a plurality of times.
- 12. The method for constructing a thematic corpus of a knowledge service platform according to claim 11, wherein step four comprises: S41, vectorizing the several groups of expanded query word sequences of each topic to generate query vectors, and performing document retrieval on the knowledge service platform; S42, organizing the retrieved documents and constructing the thematic corpus of each topic.
- 13. The method for constructing a thematic corpus of a knowledge service platform according to claim 12, wherein in S41, the expanded query words of each topic are converted into semantic vectors with the paraphrase-multilingual-MiniLM-L12-v2 pre-trained language model and normalized to generate semantic query vectors, semantic search is performed on the knowledge service platform with a semantic search model, and the documents whose similarity is greater than a specified threshold are selected.
- 14. The method for constructing a thematic corpus of a knowledge service platform according to claim 13, wherein in S42, the thematic corpus of each topic is constructed by calculating a document value DV that jointly considers the topic relevance R, the quality Q, and the timeliness T of each document, and sorting the documents by DV in descending order.
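The word scoring in claim 8 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that TF is the raw count of a word in the current title cluster, that IDF is the reciprocal of the word's count over all title clusters, and that "cosim" is cosine similarity, so a higher value is better. The function names, the toy word vectors, and the caller-supplied centroid are hypothetical; in practice the vectors would come from the sentence embedding model named in claim 7.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_cluster_words(clusters, index, word_vectors, centroid, alpha=0.5):
    """Rank the words of clusters[index] by alpha * TF-IDF + (1 - alpha) * cosim.

    clusters: list of title clusters; each cluster is a list of tokenized titles.
    word_vectors: hypothetical mapping from word to embedding vector.
    centroid: embedding of the cluster centroid, supplied by the caller.
    """
    # TF: raw count of each word inside the chosen cluster (assumed reading).
    tf = Counter(w for title in clusters[index] for w in title)
    # Count of each word over all clusters; IDF is its reciprocal (assumed reading).
    totals = Counter(w for cluster in clusters for title in cluster for w in title)
    scores = {}
    for word, count in tf.items():
        if word not in word_vectors:
            continue  # words without an embedding are skipped in this sketch
        tf_idf = count * (1.0 / totals[word])
        scores[word] = alpha * tf_idf + (1 - alpha) * cosine(word_vectors[word], centroid)
    # Highest composite score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumptions, the first entry of the returned list would serve as the cluster topic and the next few entries as its initial query words.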
Description
Knowledge service platform thematic corpus construction method
Technical Field
The application belongs to the technical field of knowledge product construction, and particularly relates to a method for constructing a thematic corpus of a knowledge service platform.
Background
As the development model of aviation equipment keeps evolving and iterative upgrades accelerate, researchers place higher demands on the speed and precision of knowledge services. How to sense changes in researchers' document demand accurately and promptly from the user behavior logs of the knowledge service platform, and to produce targeted knowledge products efficiently and with high quality on that basis, has become a core challenge for scientific and technical information work in the new situation.
At present, scientific and technical information staff mainly rely on attending regular technical meetings, communicating offline with researchers, and similar means to capture researchers' document demand passively. They then construct search expressions, search the knowledge service platform, and manually screen the search results to organize them into knowledge products of specific forms. This manually dominated production model has significant drawbacks:
Lagging demand perception: because it relies on non-real-time, intermittent communication, demand dynamics cannot be monitored continuously and automatically, which results in a slow response to the rapidly changing, urgent document demand of researchers, especially during technical breakthrough efforts.
Long production cycle and difficult updating and maintenance: from demand collection and search expression construction through document acquisition to the final organization of the knowledge product, the whole process depends heavily on manual operation, so the construction cycle is long and the product is difficult to update and maintain dynamically as demand changes.
As knowledge service platforms record user behavior ever more completely, big data analysis can effectively capture changes in users' document demand, and technologies such as semantic clustering and semantic retrieval make it possible to build knowledge products such as thematic corpora quickly according to the content of users' document demand, so that this demand is met in time.
Disclosure of Invention
The application aims to provide a method for constructing a thematic corpus of a knowledge service platform, which supports automatic, timely, and accurate construction of thematic corpora and improves the response speed and quality of knowledge services.
The technical scheme of the application is as follows:
A method for constructing a thematic corpus of a knowledge service platform comprises the following steps:
Step one, document demand change sensing: analyzing the user retrieval logs of the knowledge service platform and sensing changes in users' document demand;
Step two, document demand content analysis: after determining that users' document demand has changed, analyzing the documents that users interacted with on the knowledge service platform during the demand change interval, mining the core content of users' document demand in that interval, and providing the topics and the initial query words of each topic for constructing the thematic corpus;
Step three, document query word expansion: performing semantic expansion on the initial query words of each topic to obtain expanded query word sequences;
Step four, thematic corpus construction: performing semantic retrieval on the knowledge service platform with the several groups of expanded query word sequences of each topic, and constructing the thematic corpus from the retrieved documents.
Optionally, in the method for constructing a thematic corpus of a knowledge service platform, step one comprises:
S11, collecting and analyzing the user retrieval logs of the knowledge service platform, and extracting the search hot word sets K' and K'' of an earlier time period and a later time period, each search hot word set being an ordered list of search terms;
S12, measuring the difference between the search hot word sets K' and K'' of the two time periods with the sequence Jaccard distance (S-JD) index, and determining whether users' document demand has changed.
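Steps S11 and S12 can be illustrated with a minimal Python sketch. The text here does not define the sequence Jaccard distance, so the order-aware distance below is an assumption: it averages the ordinary Jaccard distance over the top-d prefixes of the two ranked lists, which makes a change near the top of the ranking count more than a change near the bottom. The function names, the whitespace tokenizer (a stand-in for a real word segmenter), and the threshold argument are hypothetical.

```python
from collections import Counter


def hot_words(queries, stop_words, top_n=100):
    """Return search terms ordered by descending frequency (ties alphabetical)."""
    counts = Counter(
        token
        for query in queries
        for token in query.lower().split()  # stand-in for real word segmentation
        if token not in stop_words
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def prefix_jaccard_distance(first, second):
    """Average Jaccard distance over the top-d prefixes of two ranked lists.

    This is an assumed order-aware variant, not the S-JD formula of the application.
    """
    depth = max(len(first), len(second))
    if depth == 0:
        return 0.0  # two empty lists are treated as identical
    total = 0.0
    for d in range(1, depth + 1):
        a, b = set(first[:d]), set(second[:d])
        total += 1.0 - len(a & b) / len(a | b)
    return total / depth


def demand_changed(earlier, later, threshold):
    """Flag a demand change when the distance exceeds the threshold (delta)."""
    return prefix_jaccard_distance(earlier, later) > threshold
```

A caller would build one hot word list per time period with `hot_words` and pass both lists to `demand_changed` with a distance threshold chosen for the platform.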
Optionally, in the method for constructing a topic text set of a knowledge service platform, in S11, when the search hot word set is extracted, preprocessing is performed on a user search log, including word segmentation, stop word removal, word frequency statistics, sorting is performed accord