CN-121997928-A - Multi-feature fusion ancient-text named-entity recognition method based on large language model
Abstract
The invention discloses a multi-feature fusion ancient-text named-entity recognition method based on a large language model. The method extracts sentence semantic vectors with the pre-trained ancient-Chinese language model GuwenBERT; performs dependency syntactic analysis with Stanza to extract structural features such as the subject (nsubj) dependency count, the adjective-modifier (amod) dependency count and the syntax-tree depth; concatenates the semantic and structural features into a fused representation; builds a FAISS similarity index to retrieve a small number of labeled examples; and uses a structured reasoning prompt (Structured Reasoning Prompt) to guide a large language model through clue identification, reasoning judgment and entity labeling in sequence, so as to automatically recognize person names, place names, official titles and time expressions in ancient texts. Under few-sample conditions the invention markedly improves the accuracy and robustness of ancient-book named-entity recognition, and provides an efficient technical scheme for ancient-book digitization and knowledge-graph construction.
Inventors
- HU YUANCHENG
- SU LONGLONG
- WANG SHUAI
Assignees
- 中北大学 (North University of China)
Dates
- Publication Date
- 20260508
- Application Date
- 20251031
Claims (7)
- 1. A multi-feature fusion ancient-text named-entity recognition method based on a large language model, characterized by comprising the following steps: S1, corpus input and preprocessing: standardize the ancient text, including unifying traditional and simplified characters, sentence and word segmentation, and removal of irrelevant symbols, to form an input corpus suitable for model processing; S2, basic semantic feature extraction: encode the input sentence with the pre-trained ancient-Chinese model GuwenBERT to obtain a semantic vector representation of the sentence, the semantic vector being a deep representation of the sentence in a high-dimensional semantic space; S3, syntactic structure feature extraction: invoke the Stanza dependency-parsing tool to perform dependency syntactic analysis on the ancient sentence and extract a syntactic feature vector comprising the subject (nsubj) dependency count, the adjective-modifier (amod) dependency count, the root-verb count and the maximum syntax-tree depth; S4, construction of a multi-feature fusion representation and a similar-example retrieval library: concatenate the semantic vector obtained in step S2 with the syntactic structure feature vector extracted in step S3 to form a multi-feature representation fusing semantics and structure, used for subsequent similarity retrieval; S5, structured prompt construction: build a structured reasoning prompt (Structured Reasoning Prompt) from the retrieved examples, the prompt comprising three stages: a clue identification stage (Clue Identification) that guides the model to identify possible entity clues, a reasoning judgment stage (Diagnostic Reasoning) that asks the model to infer entity categories from context and syntactic features, and a result output stage (Structured Labeling) that outputs the recognition result in a specified labeling format; S6, large-language-model reasoning and output: feed the structured prompt constructed in step S5 into a Large Language Model (LLM), guide the model to complete clue extraction, reasoning decision and label generation in sequence, and output the ancient-text named-entity recognition result.
- 2. The method of claim 1, wherein in step S2 the sentence basic semantic feature vector is extracted as follows: first, the pre-trained model GuwenBERT converts the ancient text into token vectors; next, a multi-head self-attention mechanism is applied to capture richer context information in those vectors; finally, residual connection and layer normalization (Add & Norm) followed by a Feed-Forward Network (FFN) yield the basic semantic feature vector representation, capturing the contextual semantics and characteristic grammatical features of the ancient text.
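The attention-then-Add & Norm pipeline in claim 2 can be sketched as follows. This is a minimal, single-head illustration in NumPy with toy dimensions (the real GuwenBERT encoder is multi-head with hidden size 768); the weight matrices and inputs are random placeholders, not model parameters.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """One self-attention head over token vectors X (seq_len x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

def add_and_norm(X, sublayer_out, eps=1e-6):
    """Residual connection followed by layer normalization (Add & Norm)."""
    Y = X + sublayer_out
    mu = Y.mean(axis=-1, keepdims=True)
    sigma = Y.std(axis=-1, keepdims=True)
    return (Y - mu) / (sigma + eps)

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size (GuwenBERT-base uses 768)
X = rng.normal(size=(5, d))             # 5 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = add_and_norm(X, scaled_dot_product_attention(X, Wq, Wk, Wv))
print(H.shape)  # (5, 8)
```

In the full model this sublayer is followed by the FFN and another Add & Norm, stacked over multiple encoder layers.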
- 3. The method of claim 1, wherein in step S3 the following core features are extracted at the syntactic level:
(1) Subject dependency count (f1(x), nsubj count): the number of nsubj (nominal subject) dependencies in the sentence, measuring the number of core events or the subject complexity of the sentence: f1(x) = #{(i→j) ∈ E | rel(i→j) = nsubj};
(2) Adjective modifier count (f2(x), amod count): the number of amod (adjectival modifier) dependencies in the sentence, reflecting its adjectival modifying components: f2(x) = #{(i→j) ∈ E | rel(i→j) = amod};
(3) Root verb count (f3(x)): in the Universal Dependencies (UD) specification, the syntax tree attaches the core head node of the sentence to a special virtual ROOT node; the root verb set R(x) is defined as the set of nodes pointed to by a root relation whose part of speech is verb or auxiliary verb, and the root verb count f3(x), defined as the cardinality of that set, identifies parallel or multi-center sentence structures: f3(x) = |R(x)|;
(4) Maximum dependency-tree depth (f4(x), max tree depth): a common index of sentence structural complexity and nesting degree; defining the child set of node i as ch(i) = { j | (i→j) ∈ E }, the recursive depth of node i is depth(i) = 1 if ch(i) = ∅, otherwise depth(i) = 1 + max_{j ∈ ch(i)} depth(j); the maximum dependency-tree depth f4(x) is the recursive depth of the syntactic core head node r determined by the root relation: f4(x) = depth(r).
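The four features of claim 3 can be computed directly from a dependency parse. The sketch below assumes the parse has already been produced (e.g. by Stanza) and is given as a list of (head, child, relation) edges with the virtual ROOT as node 0; the toy sentence and POS tags are illustrative.

```python
from collections import defaultdict

def syntactic_features(edges, upos):
    """Compute f1..f4 from dependency edges (head, child, relation).

    edges: list of (head_index, child_index, relation); head 0 is the
    virtual ROOT node. upos maps token index -> Universal POS tag.
    """
    f1 = sum(1 for _, _, rel in edges if rel == "nsubj")   # subject count
    f2 = sum(1 for _, _, rel in edges if rel == "amod")    # adjective modifiers
    # Root verbs: nodes attached to ROOT whose POS is VERB or AUX.
    roots = [c for h, c, rel in edges if rel == "root"]
    f3 = sum(1 for c in roots if upos[c] in ("VERB", "AUX"))
    children = defaultdict(list)
    for h, c, _ in edges:
        children[h].append(c)
    def depth(i):   # depth(i) = 1 + max depth of children, 1 for leaves
        return 1 + max((depth(j) for j in children[i]), default=0)
    f4 = max((depth(r) for r in roots), default=0)         # max tree depth
    return [f1, f2, f3, f4]

# Toy 4-token parse: token 2 is the root verb, token 1 its subject,
# token 3 an adjective modifying the object token 4.
edges = [(0, 2, "root"), (2, 1, "nsubj"), (2, 4, "obj"), (4, 3, "amod")]
upos = {1: "PROPN", 2: "VERB", 3: "ADJ", 4: "NOUN"}
print(syntactic_features(edges, upos))  # [1, 1, 1, 3]
```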
- 4. The method according to claim 3, wherein the syntactic structure feature extraction step finally constructs and normalizes a feature vector, combining said four scalar features f1(x)–f4(x) into a syntactic structure feature vector.
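Claim 4's normalize-and-fuse step can be sketched as below. The min-max normalization scheme and the bound values are assumptions for illustration (the patent does not specify the normalization formula); in practice the bounds would be collected from the training corpus.

```python
import numpy as np

def fuse(semantic_vec, syntactic_feats, feat_min, feat_max):
    """Min-max normalize the four scalar syntactic features to [0, 1],
    then concatenate them onto the sentence semantic vector."""
    f = np.asarray(syntactic_feats, dtype=float)
    lo, hi = np.asarray(feat_min, float), np.asarray(feat_max, float)
    f_norm = (f - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard zero ranges
    return np.concatenate([semantic_vec, f_norm])

sem = np.zeros(768)          # placeholder for a GuwenBERT sentence vector
fused = fuse(sem, [1, 1, 1, 3], feat_min=[0, 0, 0, 1], feat_max=[4, 6, 3, 12])
print(fused.shape)  # (772,)
```

The fused vector then serves as the key for similarity retrieval in step S4.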
- 5. The method of claim 1, wherein in a few-shot learning (Few-shot Learning) scenario, a traditional deep-learning model is often difficult to train fully because of data scarcity; the method therefore introduces a retrieval enhancement module (Retriever) that uses a K-nearest-neighbor (KNN) retrieval mechanism to guide the model's reasoning process with similar examples retrieved from a small amount of training data, markedly improving the model's robustness and generalization under low-resource conditions.
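The KNN retrieval in claim 5 can be sketched with exact brute-force search over the fused vectors. The patent uses FAISS's IndexIVFFlat, which approximates this with an inverted-file index for speed; on a small example bank the brute-force result is the same. The bank and query here are random stand-ins.

```python
import numpy as np

def knn_retrieve(query, index_vecs, k=3):
    """Exact k-nearest-neighbour search by L2 distance over an example bank."""
    dists = np.linalg.norm(index_vecs - query, axis=1)  # distance to each example
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(1)
bank = rng.normal(size=(50, 16))   # 50 labeled examples, 16-dim fused vectors
query = bank[7] + 0.01 * rng.normal(size=16)  # slightly perturbed copy of #7
ids, dist = knn_retrieve(query, bank, k=3)
print(ids[0])  # 7 — the perturbed example retrieves itself first
```

The retrieved examples' texts and gold labels are then inserted into the prompt as few-shot demonstrations.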
- 6. The method of claim 1, wherein the prompt construction module (Prompt Construction Module) is the core link for few-sample named-entity recognition in the method; its design is based on Clue And Reasoning Prompting (CARP) theory, and the module guides a Large Language Model (LLM) to simulate the human cognitive process through explicit stepwise reasoning prompts, achieving named-entity recognition that is logically interpretable and consistent in reasoning.
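A CARP-style three-stage prompt as described in claims 5 and 6 might be assembled as follows. The wording of the template, the example sentence and the label tags are illustrative assumptions, not the patent's exact prompt text.

```python
def build_prompt(sentence, examples):
    """Assemble the three-stage structured reasoning prompt:
    Clue Identification -> Diagnostic Reasoning -> Structured Labeling."""
    demo = "\n".join(
        f"Sentence: {ex['text']}\nLabeled: {ex['labeled']}" for ex in examples
    )
    return (
        "You are an expert in Classical Chinese named-entity recognition.\n"
        f"Retrieved examples:\n{demo}\n\n"
        f"Target sentence: {sentence}\n"
        "Step 1 (Clue Identification): list surface clues that may signal "
        "person, place, official-title or time entities.\n"
        "Step 2 (Diagnostic Reasoning): for each clue, infer its entity "
        "category from context and syntactic features.\n"
        "Step 3 (Structured Labeling): rewrite the sentence marking every "
        "entity as @@entity|category##."
    )

# Hypothetical retrieved example with a time entity (label tags assumed).
examples = [{"text": "元狩二年春正月", "labeled": "@@元狩二年|TIME##春正月"}]
prompt = build_prompt("漢武帝元狩二年幸雍", examples)
print("Step 3 (Structured Labeling)" in prompt)  # True
```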
- 7. The method of claim 1, wherein the large-language-model reasoning module (LLM-Based Inference Module) is the core execution unit with which the system completes the named-entity recognition task; the module uses a generative chain-of-thought reasoning paradigm based on the DeepSeek large model and completes knowledge expression and entity discrimination by means of language generation.
Description
Multi-feature fusion ancient-text named-entity recognition method based on large language model

Technical Field

The invention belongs to the technical field of natural language processing and artificial intelligence, and particularly relates to a multi-feature fusion ancient-text named-entity recognition method based on a large language model, suitable for application scenarios such as ancient-book digitization, knowledge-graph construction and intelligent analysis of historical documents.

Background

With the development of Digital Humanities research, digital processing of large-scale ancient-text corpora has become a research hotspot. Ancient-text Named-Entity Recognition (NER) is one of the core tasks of ancient-text information extraction; it aims to automatically recognize semantic entities such as person names, place names, official titles and time expressions in Classical Chinese text. However, ancient texts feature complex grammatical structure, frequent omission and high lexical ambiguity, so traditional NER methods based on rules or statistical learning struggle to achieve satisfactory results. Although deep learning and Transformer models have excelled at modern Chinese NER tasks in recent years, their transfer effect on ancient text is limited because labeled ancient-text corpora are scarce and the semantic and syntactic characteristics differ markedly. The prior art mainly suffers from the following problems: it lacks a semantic modeling mechanism targeted at ancient-text characteristics and depends heavily on samples; supervised learning models require large amounts of annotated data, while ancient-Chinese NER annotation is costly and samples are limited; and existing prompting mechanisms are simplistic, so large language models lack structured reasoning ability in zero-shot or few-shot ancient-Chinese NER tasks, resulting in low recognition accuracy.
Therefore, it is necessary to design an ancient-text NER method that can fuse semantic features, syntactic features and contextual priors under low-resource conditions.

Disclosure of the Invention

To overcome the defects of the prior art, the invention provides a multi-feature fusion ancient-text named-entity recognition method (GuNER-SR) based on a large language model. The method combines a pre-trained semantic model with a syntactic feature extraction tool, and achieves accurate recognition of ancient-text entities by constructing multi-source feature representations and structured reasoning prompts. The method comprises the following steps: (1) basic semantic feature extraction: obtain the basic semantic vector representation of the input text with the pre-trained ancient-Chinese model GuwenBERT; (2) syntactic structure feature extraction: use the Stanza dependency parser to extract syntactic structural features of the ancient text such as subject-predicate relations, modifier-head relations and syntax-tree depth; (3) feature fusion and retrieval enhancement: concatenate the semantic vectors and syntactic features into a multi-feature representation, build a similarity index library with FAISS, and rapidly retrieve the most relevant examples with an inverted file index (IndexIVFFlat); (4) construct a three-stage structured reasoning prompt (clue identification → reasoning judgment → label generation) from the retrieval results, guiding the large language model to complete entity labeling; (5) large-language-model reasoning and output: following the prompt, the large language model outputs labeling results in the @@entity|category## format, producing the final named-entity recognition result.
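The @@entity|category## output format described above can be parsed back into structured entities with a short regular expression. The example sentence and the category tags (PER/TIME/LOC) are illustrative assumptions; the patent does not fix the tag vocabulary.

```python
import re

ENTITY_RE = re.compile(r"@@(.+?)\|(.+?)##")

def parse_llm_output(text):
    """Extract (entity, category) pairs from the @@entity|category## format
    that the large language model is prompted to emit."""
    return ENTITY_RE.findall(text)

out = "@@漢武帝|PER##於@@元狩二年|TIME##幸@@雍|LOC##"
print(parse_llm_output(out))
# [('漢武帝', 'PER'), ('元狩二年', 'TIME'), ('雍', 'LOC')]
```

Non-greedy matching keeps each marker pair from swallowing its neighbours, so multiple entities in one sentence are recovered independently.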
Compared with the prior art, the method achieves significant performance improvement on several public ancient-text entity recognition data sets, with the F1 score stable above 88%, improving the large model's understanding and reasoning ability in ancient-text entity scenarios.

Drawings

FIG. 1 is a flow chart of a detailed implementation of the invention. FIG. 2 is a diagram of the model framework of the invention.

Detailed Description

1. Basic semantic feature vector extraction. First, the original ancient text is split into characters, a special [CLS] mark is added at the beginning of the text and [SEP] at the end, yielding the sequence S to be processed. The sequence S is then input into the pre-trained GuwenBERT model, which, through its word-embedding layer and multi-layer Transformer encoders, produces an initial hidden vector H(0) for each position. The main purpose of this step is to map discrete text into a continuous vector representation, facilitating processing by subsequent deep models. To enhance modeling capabilities for long-distance dependencies of ancient text and complex syntactic