CN-122021613-A - Technology development analysis method and system based on patent large language model

Abstract

A technology development analysis method and system based on a patent large language model, relating to the field of big data analysis. By constructing a cross-time artificial intelligence technology network and analyzing the structural characteristics of the network and the importance of its technical nodes, the method can identify the evolution characteristics of a technical system at different stages, the key technical nodes, and the changing trend of the technical community structure, thereby providing technical support for assessing the development of artificial intelligence technology and for technology layout decisions. The invention is suitable for technical analysis scenarios involving millions of patent records.

Inventors

HONG TAO
ZHANG JIAHUI
WEN TIANYI

Assignees

Harbin Institute of Technology (哈尔滨工业大学)

Dates

Publication Date: 20260512
Application Date: 20260203

Claims (10)

1. A technology development analysis method based on a patent large language model, characterized by comprising the following steps:
S1, acquiring text data of related patents in the artificial intelligence field to be analyzed, and preprocessing the patent text data;
S2, inputting the patent text data preprocessed in S1 into a pre-trained domain-adaptive large language model, carrying out semantic analysis on the patent text data through the large language model, and automatically identifying and extracting the key technical units involved in the patent; the key technical units comprise at least one of: main integrated technical nodes representing the core purpose or application target of the patent, secondary technical nodes representing the core algorithms or functional modules required to realize the main integrated technical nodes, and technical component nodes representing the basic technical elements or bottom-layer components of the secondary technical nodes;
S3, constructing hierarchical connection relations among the technical nodes according to the dependency relations, inclusion relations or function implementation relations among the technical nodes extracted in S2, and generating, based on the hierarchical connection relations, a hierarchical technical structure graph representing the technical structure of a single patent;
S4, carrying out semantic vectorization of the technical nodes of the different patents in S3, and mapping each technical node to a unified semantic feature space;
S5, based on S4, aggregating the patent technical structures within each time period according to a preset time dimension, and constructing a technology network reflecting the combination relations among technologies in the artificial intelligence field;
S6, analyzing the artificial intelligence technology network constructed in S5 to acquire the structural feature changes of the technology network in different time periods, and identifying the evolution trend of the artificial intelligence technology structure, the key technical nodes and the technology paradigm changes by analyzing node importance, technical community structure and the flow of technical nodes in the technology network.
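Step S5 of claim 1 can be illustrated with a minimal sketch. It assumes each patent has already been reduced to a filing year and a list of (parent, child) technical-node edges with unified node names; the five-year window, the field names `year` and `edges`, and the helper names are hypothetical and not taken from the patent.

```python
from collections import defaultdict

def period_of(year, start=2000, width=5):
    """Map a year to its (first_year, last_year) window; the 5-year width is an assumption."""
    first = start + ((year - start) // width) * width
    return (first, first + width - 1)

def build_period_networks(patents, start=2000, width=5):
    """Aggregate per-patent (parent, child) edges into one weighted edge map per period."""
    networks = defaultdict(lambda: defaultdict(int))
    for patent in patents:
        window = period_of(patent["year"], start, width)
        for parent, child in patent["edges"]:
            # Edge weight = number of patents in the period that combine the two technologies.
            networks[window][(parent, child)] += 1
    return {window: dict(edge_map) for window, edge_map in networks.items()}

# Toy data, invented for illustration only.
patents = [
    {"id": "P1", "year": 2011, "edges": [("image recognition", "convolutional network")]},
    {"id": "P2", "year": 2013, "edges": [("image recognition", "convolutional network"),
                                          ("convolutional network", "gpu")]},
    {"id": "P3", "year": 2021, "edges": [("text generation", "transformer")]},
]
networks = build_period_networks(patents)
```

Each period's edge map can then be analyzed separately, so that changes in weights and node sets between periods become visible.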
2. The technology development analysis method based on the patent large language model according to claim 1, wherein the text data of the related patents in the artificial intelligence field to be analyzed obtained in S1 includes the title, abstract and claims, and the text data is subjected to preprocessing operations such as cleaning, word segmentation and denoising.
3. The technology development analysis method based on the patent large language model according to claim 1, wherein the method for pre-training the large language model in S2 is as follows:
S2.1, acquiring patent text data including titles, abstracts and claims, and labeling the patent and technical node data;
S2.2, processing the patent text data acquired in S2.1 to generate a question-answer data set, a prompt set, a patent information set and a response data set in a knowledge graph format;
S2.3, adopting Llama-7B as the pre-trained base model, combined with the LoRA low-rank adaptation method, performing fine-tuning training on hardware, and constructing a PTKG-LLM model suited to technical knowledge graphs.
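Claim 3 names LoRA but does not state the update rule. The sketch below assumes the standard LoRA formulation, in which a frozen weight W is adapted as W + (alpha / r) * B * A with trainable low-rank factors A and B. The toy dimensions and all variable names are hypothetical; this is not the PTKG-LLM training code.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the LoRA-adapted weight matrix."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes (assumption); a real Llama-7B projection matrix is far larger.
d_out, d_in, r, alpha = 4, 6, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init

w_eff = lora_effective_weight(w, a, b, alpha, r)

trainable = r * (d_in + d_out)   # parameters updated during fine-tuning
frozen = d_in * d_out            # parameters left untouched
```

Because B starts at zero, the adapted weight equals the base weight before training begins, and only `r * (d_in + d_out)` parameters per adapted matrix are updated instead of `d_in * d_out`.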
4. The technology development analysis method based on the patent large language model according to claim 1, wherein the hierarchical technical structure graph representing the technical structure of a single patent generated in S3 is a directed acyclic graph, and the technical nodes of each patent are stored in JSON format.
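The JSON storage and the acyclicity requirement of claim 4 can be sketched as follows. The schema (`patent_id`, `nodes`, `edges`, `level`) and the example content are assumptions made for illustration, since the patent does not specify the JSON layout. Acyclicity is checked with Kahn's topological-sort algorithm.

```python
import json

# Hypothetical JSON layout for one patent's three-level technical structure.
patent_json = """
{
  "patent_id": "EXAMPLE-001",
  "nodes": [
    {"id": "n0", "name": "pedestrian detection", "level": "main"},
    {"id": "n1", "name": "object detection network", "level": "secondary"},
    {"id": "n2", "name": "convolution layer", "level": "component"},
    {"id": "n3", "name": "non-maximum suppression", "level": "component"}
  ],
  "edges": [["n0", "n1"], ["n1", "n2"], ["n1", "n3"]]
}
"""

def is_dag(node_ids, edges):
    """Kahn's algorithm: a graph is acyclic iff every node can be removed in topological order."""
    indegree = {n: 0 for n in node_ids}
    children = {n: [] for n in node_ids}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed == len(node_ids)

structure = json.loads(patent_json)
node_ids = [node["id"] for node in structure["nodes"]]
edges = [tuple(edge) for edge in structure["edges"]]
acyclic = is_dag(node_ids, edges)
```

A check of this kind could be run on each model output before it is merged into the network, so that a malformed extraction containing a cycle is rejected early.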
5. The technology development analysis method based on the patent large language model according to claim 1, wherein the method for constructing the technology network reflecting the combination relations among technologies in the artificial intelligence field in S5 is as follows:
S5.1, performing systematic text preprocessing on the original technical names, including domain-dictionary-guided segmentation of professional terms, noise filtering based on a stop word list, and lemmatization-based normalization;
S5.2, mapping the processed technical names to a high-dimensional semantic space through a pre-trained BERT model, and optimizing the vector space distribution through contrastive learning so that semantically similar technical nodes are tightly clustered in the embedding space;
S5.3, analyzing the semantic vectors with the DBSCAN density clustering algorithm, wherein the algorithm handles technical clusters of different densities through an adaptive neighborhood radius adjustment strategy and automatically determines the optimal cluster division using a clustering quality assessment mechanism based on the silhouette coefficient, and the center vector of each cluster serves as the semantic prototype of the cluster to achieve a unified representation of diverse technical expressions.
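Step S5.3 can be sketched in plain Python under simplifying assumptions. Toy 2-D points stand in for the BERT embeddings. The adaptive neighborhood radius strategy, which the claim does not detail, is replaced here by a plain search over a few candidate `eps` values scored by the mean silhouette coefficient. All names and values are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:    # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

def silhouette(points, labels):
    """Mean silhouette over non-noise points; -1.0 when fewer than two clusters exist."""
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            a = sum(math.dist(points[i], points[j]) for j in members if j != i) / (len(members) - 1)
            b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                    for other_lab, other in clusters.items() if other_lab != lab)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy "embeddings": two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, 0.0)]
best_score, best_eps = max((silhouette(points, dbscan(points, eps, 2)), eps)
                           for eps in (0.05, 0.2, 10.0))
labels = dbscan(points, best_eps, 2)

# The mean vector of each cluster serves as its semantic prototype.
prototypes = {}
for lab in sorted(set(labels) - {-1}):
    members = [p for p, l in zip(points, labels) if l == lab]
    prototypes[lab] = tuple(sum(c) / len(members) for c in zip(*members))
```

With real embeddings a library implementation would normally be preferred; the pure-Python version is used here only to keep the example self-contained.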
6. The technology development analysis method based on the patent large language model according to claim 1, wherein the method in S6 for analyzing the artificial intelligence technology network constructed in S5 includes a whole-network analysis method, a network sub-community analysis method, and a node analysis method;
the whole-network analysis method is given by a formula (not reproduced in the source text) in which $w_{ij}$ represents the connection weight between node $i$ and node $j$;
the network sub-community analysis method is:
$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$
wherein $A_{ij}$ represents the actual edge weight between node $i$ and node $j$, $k_i$ and $k_j$ are the degrees of the two nodes, $m$ is the total weight of all edges in the network, and $\delta(c_i, c_j)$ is an indicator function whose value is 1 when nodes $i$ and $j$ belong to the same community and 0 otherwise;
the node analysis method comprises eigenvector centrality and PageRank centrality.
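The symbol definitions in claim 6 (actual edge weight, node degrees, total edge weight, same-community indicator) match the standard Newman modularity. The sketch below computes that quantity and a power-iteration PageRank on a small undirected weighted graph. Treating the network as undirected, the damping factor of 0.85, and the toy graph are all assumptions; eigenvector centrality is omitted for brevity.

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected weighted graph without self-loops.

    edges: {(u, v): weight}, each undirected edge listed once.
    community: {node: community_label}.
    """
    m = sum(edges.values())                      # total edge weight
    degree = {}
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
    # Sum of A_ij over ordered same-community pairs: each internal edge counts twice.
    inside = sum(2 * w for (u, v), w in edges.items() if community[u] == community[v])
    # Sum of k_i * k_j over ordered same-community pairs = sum of squared community degrees.
    totals = {}
    for node, k in degree.items():
        totals[community[node]] = totals.get(community[node], 0.0) + k
    expected = sum(t * t for t in totals.values()) / (2 * m)
    return (inside - expected) / (2 * m)

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        new = {node: (1 - damping) / n for node in adj}
        for u, nbrs in adj.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                new[v] += damping * rank[u] * w / total
        rank = new
    return rank

# Toy network: two triangles joined by one bridging edge (c, d).
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
         ("c", "d"): 1.0}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

q = modularity(edges, community)
rank = pagerank(edges)
```

In this toy graph the two bridge nodes `c` and `d` receive the highest PageRank, which is the kind of signal the claim uses to flag key technical nodes.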
7. The technology development analysis method based on the patent large language model according to claim 1, wherein in S6 the artificial intelligence technology network constructed in S5 is analyzed, and the structural feature change of the technology network in different time periods is obtained by a formula (not reproduced in the source text) in which G is a directed graph, N is the number of nodes, and L is the number of edges.
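The formula of claim 7 is not present in the text, so the following is only an assumption. One common structural measure defined from exactly these symbols is the density of a directed graph, L / (N(N - 1)). The sketch compares it across two periods; the function name and the numbers are hypothetical.

```python
def directed_density(num_nodes, num_edges):
    """Density of a directed graph without self-loops: L / (N * (N - 1)).

    Assumed measure; the claim's own formula is not reproduced in the source.
    """
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))

# Invented (N, L) counts for two time periods.
periods = {"2010-2014": (4, 6), "2015-2019": (10, 18)}
density_by_period = {name: directed_density(n, l) for name, (n, l) in periods.items()}
```

A falling density alongside a growing node count would suggest that the network is expanding faster than new combinations between technologies are forming.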
8. A technology development analysis system based on a patent large language model, said system comprising:
an input module, used for acquiring text data of related patents in the artificial intelligence field to be analyzed and preprocessing the patent text data;
a technical node extraction module, used for inputting the patent text data preprocessed by the input module into a pre-trained domain-adaptive large language model, carrying out semantic analysis on the patent text data through the large language model, and automatically identifying and extracting the key technical units involved in the patent; the key technical units comprise at least one of: main integrated technical nodes representing the core purpose or application target of the patent, secondary technical nodes representing the core algorithms or functional modules required to realize the main integrated technical nodes, and technical component nodes representing the basic technical elements or bottom-layer components of the secondary technical nodes;
a technical hierarchy relation construction and single patent structure generation module, used for constructing hierarchical connection relations among the technical nodes according to the dependency relations, inclusion relations or function implementation relations among the technical nodes extracted by the technical node extraction module, and generating, based on the hierarchical connection relations, a hierarchical technical structure graph representing the technical structure of a single patent;
a technical node semantic fusion module, used for carrying out semantic vectorization of the technical nodes of the different patents in the technical hierarchy relation construction and single patent structure generation module, and mapping each technical node to a unified semantic feature space;
a cross-time artificial intelligence technology network construction module, used for aggregating, based on the technical node semantic fusion module, the patent technical structures within each time period according to a preset time dimension, to construct a technology network reflecting the combination relations among technologies in the artificial intelligence field;
a technical structure evolution analysis module, used for analyzing the artificial intelligence technology network constructed by the cross-time artificial intelligence technology network construction module, acquiring the structural feature changes of the technology network in different time periods, and identifying the evolution trend of the artificial intelligence technology structure, the key technical nodes and the technology paradigm changes by analyzing node importance, technical community structure and the flow of technical nodes in the technology network.
9. A computer storage medium having stored thereon a computer program, which when executed by a processor performs the method of any of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any of claims 1-7.

Description

Technology development analysis method and system based on patent large language model

Technical Field

The invention relates to the technical field of big data analysis, and in particular to a large-language-model-driven method and system for analyzing the structure and evolution of artificial intelligence technology based on patent text.

Background

With the rapid development of artificial intelligence technology, related techniques have been widely used in fields such as computer vision, natural language processing, speech recognition, intelligent manufacturing, and smart cities. Artificial intelligence technology is markedly multidisciplinary, iterates rapidly, and is highly combinatorial: new techniques often arise from combining and modifying existing algorithms, models, hardware, and application modules. A large volume of artificial intelligence technical achievements continues to emerge in the form of patents, so how to systematically describe the structure of artificial intelligence technology and the rules of its evolution has become an important problem in technical intelligence analysis, science and technology strategy, and industrial layout.

1. Prior art based on keywords or classification numbers and its drawbacks

In the prior art, artificial intelligence patent analysis mainly relies on keyword retrieval or on the International Patent Classification (IPC)/Cooperative Patent Classification (CPC) system to screen and statistically analyze patent documents. For example, some studies search patent titles, abstracts or claims with preset sets of artificial-intelligence-related keywords to analyze technology development trends, while others directly perform statistical and visual analysis of artificial intelligence patents based on related classification numbers such as G06N and G06F. However, these methods have significant limitations:
1. Keyword retrieval is easily affected by semantic ambiguity and diversity of expression, and suffers from a high false detection rate and insufficient recall;
2. The IPC/CPC classification system is static and lags behind, making it difficult to reflect in time the new technologies, new paradigms and cross-domain fusion that continually appear in the artificial intelligence field;
3. Statistics can only be computed at the level of whole patents, so the combination relations and hierarchical structure among the different technical modules within a patent are difficult to reveal.
Therefore, prior art that relies only on keywords or classification numbers has difficulty accurately characterizing the real structural form of artificial intelligence technology.

2. Prior art based on citation networks or co-occurrence networks and its drawbacks

To overcome the shortcomings of classification-only methods, part of the prior art introduces patent citation network analysis or technical keyword co-occurrence network analysis, studying the associations between technologies by constructing nodes and edges. For example, core patents or key technology paths are identified by analyzing the citation relations between patents, or a technology network is constructed from keyword co-occurrence frequencies to reveal research hotspots and technology clusters. However, these methods still have the following disadvantages:
1. Citation relations mostly reflect legal or knowledge-level citation, and do not necessarily correspond to real technical dependencies or functional combination relations;
2. Co-occurrence networks are usually based on word frequency statistics and cannot distinguish the functional roles and hierarchical positions of different technologies within a patent;
3. Most such networks are flat structures, which makes it difficult to express the recursive combination of main integrated technology, secondary technology, and technical component that is common in artificial intelligence technology, and difficult to accurately depict the structural recombination, paradigm migration and key node evolution of a technical system over time.

3. Prior art based on machine learning or text mining and its drawbacks

In recent years, some prior art has begun to analyze patent text using machine learning or natural language processing methods, for example clustering patents or recognizing topics with topic models, word vectors, or traditional deep learning models. However, this type of approach is generally limited by the model's inadequate understanding of long text and by limited domain applicability:
1. It focuses on topic classification or similarity calculation rather than on parsing technical structure;
2. Technical units with engineering significance are difficult to accurately extract from complex texts such as the claims, the structural representation reflecting the technical combination relat