CN-116127954-B - Dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines
Abstract
The invention discloses a dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines, comprising the following steps: 1) obtaining the sub-disciplines related to New Engineering, and converting the textbooks and syllabi of all courses into text data; 2) segmenting the text data into words, and training a word2vec word-vector model and the original word vectors on that basis; 3) obtaining a large set of course-related professional vocabulary with a web crawler, selecting professional keywords as seeds, feeding the seeds into the trained word2vec model, retrieving the words whose similarity exceeds a threshold, and combining them with the segmented vocabulary to form a New Engineering knowledge concept dictionary; 4) constructing a NECE model, performing knowledge concept recognition on the original course textbooks, and storing the concept set. The method uses the word2vec model to build the professional vocabulary set and word vectors of each course, and uses the NECE model to extract the concepts of professional courses, laying a data foundation for knowledge graph construction in the education system.
Inventors
- LI BIN
- CHEN QIANG
- WU YUXIN
- LI KUN
Assignees
- Yangzhou University (扬州大学)
Dates
- Publication Date: 20260508
- Application Date: 20221229
Claims (4)
- 1. A dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines, characterized by comprising the following steps: (1) acquiring the specialized courses of the New Engineering disciplines, and converting the textbooks and syllabi of all courses into original text data; (2) obtaining words by performing word segmentation on the original text data, and training a word2vec word-vector model and original word vectors on that basis; (3) manually selecting important concepts of the relevant specialty as seed words, using the word2vec model trained in step (2), with the skip-gram model selected, to compute the word-vector similarity against each seed word, taking the Top K words as an external lexicon, converting them into word vectors of the specified dimension suitable for the NECE model, and combining them with the words in the word segmentation file to form a professional dictionary of New Engineering related courses; (4) constructing a New Engineering knowledge concept recognition (NECE) model, performing knowledge concept recognition on the original text data, and storing the concept set; wherein step (4) comprises the following steps: (41) constructing the NECE model and pre-training it on existing labeled text data, the NECE model comprising, from bottom to top, a character embedding layer, a dictionary matching layer, a sequence modeling layer and a CRF layer; (42) in the NECE model, the character embedding layer at the bottom produces a character-vector embedding of the original text data; each character is looked up in the dictionary to find the dictionary words that contain it; the corresponding word information from the dictionary is then concatenated with the character vectors of the original text and input to the sequence modeling layer, which models the dependencies among characters using a bidirectional long short-term memory network and a convolutional neural network; on top of the sequence modeling layer, a conditional random field layer performs label inference over the entire character sequence at once, thereby labeling each character, completing the recognition of knowledge concepts, and outputting and storing the concepts of the New Engineering related courses.
- 2. The dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines according to claim 1, wherein step (1) comprises the following steps: (11) the specialized courses of the New Engineering disciplines comprise Python-based computer vision processing, natural language processing and deep learning, and the corresponding textbooks and syllabi are collected as text data; (12) the BMES sequences corresponding to the text data are obtained through manual labeling.
- 3. The dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines according to claim 1, wherein step (2) comprises the following steps: (21) preprocessing the original text data obtained in step (1), the preprocessing comprising traditional-to-simplified Chinese conversion, removal of XML symbols, removal of stop words and interference words, and flattening of entry content into single-line data, and obtaining the word segmentation file corresponding to the original text data through jieba or LTP word segmentation; (22) training word2vec models using the gensim package for Python and the word segmentation file data, the vocabulary sets of courses from different disciplines being trained separately; (23) feature processing, namely converting the text data into data that a computer can process, by converting the vocabulary in the word segmentation file into word vectors of the specified dimension using the skip-gram model of word2vec.
- 4. The dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines according to claim 1, wherein step (42) comprises the following steps: (421) for the NECE model, the input original text is regarded as $s = \{c_1, c_2, \ldots, c_n\} \in V_c$, where $c_i$ denotes each input character; each input character is represented by a dense vector, and $e^c$ denotes the character lookup-table operation that converts a character into a vector: $x_i^c = e^c(c_i)$; (422) word information vectors are added to the character embedding vectors: for each Chinese character $c_i$ in any input original text data, all matching words are retrieved using the professional dictionary of New Engineering related courses and divided into four sets "B", "M", "E" and "S", constructed as follows: $B(c_i) = \{w_{i,k} \mid w_{i,k} \in L,\ i < k \le n\}$; $M(c_i) = \{w_{j,k} \mid w_{j,k} \in L,\ 1 \le j < i < k \le n\}$; $E(c_i) = \{w_{j,i} \mid w_{j,i} \in L,\ 1 \le j < i\}$; $S(c_i) = \{c_i \mid c_i \in L\}$; where $L$ is the dictionary used herein and $w_{j,k}$ denotes the subsequence $\{c_j, c_{j+1}, \ldots, c_k\}$ of $s$; if a set is empty, that is, the character matches no word for that set, a special word "empty" is added to the empty set; (423) the four sets are converted into vectors of the specified dimension; when a word set $S$ is compressed into a single vector, each word is weighted by its frequency; specifically, letting $z(w)$ denote the frequency with which the selected word $w$ occurs in the fixed corpus, the weighted representation is $v^s(S) = \frac{4}{Z} \sum_{w \in S} z(w)\, e^w(w)$, where $Z = \sum_{w \in B \cup M \cup E \cup S} z(w)$; here $w$ denotes a selected knowledge concept word and $e^w(w)$ its word vector; the weights are normalized over all words in the four sets "B", "M", "E" and "S"; and if, during counting, an occurrence of $w$ is found to be contained in another matched word subsequence, the frequency of $w$ is not increased; (424) the four word sets are combined into a fixed-dimension feature and added to the representation of each character by concatenating the representations of the four word sets, so that the final representation of each character is: $e^s(B, M, E, S) = [v^s(B); v^s(M); v^s(E); v^s(S)]$; $x^c \leftarrow [x^c; e^s(B, M, E, S)]$; where $v^s$ denotes the weighting function above; (425) after the professional dictionary information of the New Engineering related courses has been added, the resulting vector $x^c$ is input to the sequence modeling layer, which is implemented as a single-layer BiLSTM; the forward LSTM network is defined as follows: $\begin{bmatrix} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W \begin{bmatrix} x_t^c \\ h_{t-1} \end{bmatrix} + b \right)$; $c_t = \tilde{c}_t \odot i_t + c_{t-1} \odot f_t$; $h_t = o_t \odot \tanh(c_t)$; where $\sigma$ is the element-wise sigmoid function, $\odot$ denotes element-wise multiplication, and $W$ and $b$ denote trainable parameters of the model; the backward LSTM has the same definition as the forward LSTM but models the sequence in reverse order; the hidden states of the forward and backward LSTMs at step $i$ are concatenated as $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ to form the context-dependent representation of $c_i$; (426) on top of the sequence modeling layer, a sequential conditional random field layer performs label inference over the entire character sequence at once, specifically defined as: $p(y \mid s; \theta) = \frac{\prod_{t=1}^{n} \phi_t(y_{t-1}, y_t \mid s)}{\sum_{y' \in Y_s} \prod_{t=1}^{n} \phi_t(y'_{t-1}, y'_t \mid s)}$, where $Y_s$ denotes all possible tag sequences of $s$, $\theta$ is a hyperparameter fixed in the model, and $\phi_t(y', y \mid s) = \exp\left(w_{y',y}^{T} h_t + b_{y',y}\right)$, where $w_{y',y}$ and $b_{y',y}$ are trainable parameters of the tag pair $(y', y)$; for tag prediction, given an input sequence $s$, the tag sequence with the highest conditional probability, $y^* = \arg\max_{y} p(y \mid s; \theta)$, is found using the Viterbi algorithm.
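As an illustration of steps (422) to (424), the following minimal Python sketch matches dictionary words against a sentence, groups them into the "B", "M", "E" and "S" sets of each character, and pools each set into a frequency-weighted vector. It is a sketch under stated assumptions rather than the patented implementation: the function names `match_bmes` and `pool_sets`, the placeholder `NONE_WORD`, and the toy lexicon, frequencies and two-dimensional word vectors are all hypothetical, and the containment rule for counting word frequency is not modeled.

```python
NONE_WORD = "<empty>"  # hypothetical placeholder added to an empty set


def match_bmes(sentence, lexicon):
    """Return, for each character position, its B/M/E/S sets of lexicon words."""
    n = len(sentence)
    max_len = max((len(w) for w in lexicon), default=1)
    sets = [{tag: set() for tag in "BMES"} for _ in range(n)]
    for j in range(n):
        for k in range(j, min(n, j + max_len)):
            word = sentence[j:k + 1]
            if word not in lexicon:
                continue
            if j == k:
                sets[j]["S"].add(word)      # single-character word
                continue
            sets[j]["B"].add(word)          # word begins at position j
            sets[k]["E"].add(word)          # word ends at position k
            for i in range(j + 1, k):
                sets[i]["M"].add(word)      # positions strictly inside the word
    for per_char in sets:
        for tag in "BMES":
            if not per_char[tag]:
                per_char[tag].add(NONE_WORD)
    return sets


def pool_sets(per_char, freq, emb, dim):
    """Concatenate the frequency-weighted vectors of the four sets of one character."""
    # Z: total frequency over all words in the four sets of this character.
    total = sum(freq.get(w, 0) for tag in "BMES" for w in per_char[tag])
    features = []
    for tag in "BMES":
        vec = [0.0] * dim
        for w in per_char[tag]:
            weight = 4.0 * freq.get(w, 0) / total if total else 0.0
            word_vec = emb.get(w, [0.0] * dim)  # unknown words map to zeros
            for d in range(dim):
                vec[d] += weight * word_vec[d]
        features.extend(vec)
    return features


# Toy data (assumed for illustration only).
lexicon = {"深度", "深度学习", "学习", "学", "模型"}
sets = match_bmes("深度学习模型", lexicon)
freq = {"深度": 3, "深度学习": 1}
emb = {"深度": [1.0, 0.0], "深度学习": [0.0, 1.0]}
first_char_features = pool_sets(sets[0], freq, emb, dim=2)
```

In the full model, the pooled feature of each character would be concatenated with that character's embedding before being passed to the sequence modeling layer.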
Description
Dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines
Technical Field
The invention belongs to the field of knowledge concept extraction, and in particular relates to a dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines.
Background
In recent years, MOOCs (Massive Open Online Courses) and online education have been widely discussed in the field of intelligent education. As users join intelligent education platforms, massive amounts of teaching behavior data and knowledge resources accumulate. Mining and analyzing these two important classes of educational data has given new impetus to the development of intelligent education. Most current online education platforms are organized around courses, which are classified by subject area, university and year according to course information. Course concepts are implicit in a course, and learners have to organize and sort them out through their own study. How to automatically extract course concepts from a course using big data is one of the hot spots and difficulties of current intelligent education research. Although scholars at home and abroad have carried out research on knowledge concept extraction, most of it targets English knowledge concepts, owing to the natural advantages of the English language, and Chinese knowledge concept extraction still needs further study. Moreover, most extraction approaches work at the character level, so they can only use character information and cannot make full use of the word-level information in the original text; as a result, many terms cannot be extracted completely.
Therefore, the invention provides a method for extracting New Engineering knowledge concepts, which improves the accuracy of concept extraction by adding dictionary information related to New Engineering specialties, and greatly improves the speed of the extraction process by combining the added word information with the embeddings in a new way.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing concept extraction methods by providing a dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines, which uses a NECE (New Engineering Concept Extraction) model to extract concepts from the original text while exploiting the word information in sentences, and adds a new vector fusion scheme to improve accuracy and speed.
The technical solution is as follows: the invention provides a dictionary-based method for extracting Chinese knowledge concepts in New Engineering disciplines, comprising the following steps: (1) acquiring the specialized courses of the sub-disciplines related to New Engineering, and converting the textbooks and syllabi of all courses into original text data; (2) obtaining words by performing word segmentation on the text data, and training a word2vec word-vector model and original word vectors on that basis; (3) obtaining a course-related professional vocabulary set using Python, selecting professional keywords as seeds, feeding the seeds into the trained word2vec model, retrieving the Top K most similar words, and combining them with the segmented vocabulary to form a New Engineering knowledge concept dictionary; (4) constructing a New Engineering knowledge concept recognition (NECE) model, performing knowledge concept recognition on the original course textbooks, and storing the concept set.
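Step (3) above can be sketched as a cosine-similarity search over word vectors. The following Python example is a toy stand-in, not the patented implementation: it uses a small hand-written vector table instead of a trained word2vec model, and the names `cosine`, `expand_seeds` and `build_dictionary`, together with the sample words and vectors, are hypothetical. With a trained gensim model, the Top K lookup would typically be done with `model.wv.most_similar(seed, topn=K)` instead.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_seeds(seeds, vectors, top_k):
    """Collect the top_k words most similar to each seed word."""
    expanded = set()
    for seed in seeds:
        if seed not in vectors:
            continue  # seed absent from the vocabulary: nothing to compare
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        expanded.update(word for _, word in scored[:top_k])
    return expanded


def build_dictionary(seeds, vectors, segmented_words, top_k):
    """Union of the segmented vocabulary and the seed-expanded external lexicon."""
    return set(segmented_words) | expand_seeds(seeds, vectors, top_k)


# Toy word vectors (assumed for illustration; real ones come from word2vec).
vectors = {
    "神经网络": [1.0, 0.0],
    "卷积": [0.9, 0.1],
    "池化": [0.8, 0.3],
    "教材": [0.0, 1.0],
}
external_lexicon = expand_seeds(["神经网络"], vectors, top_k=2)
dictionary = build_dictionary(["神经网络"], vectors, ["神经网络", "教材"], top_k=2)
```

Here the seed "神经网络" pulls in its two nearest neighbours, and the final dictionary is their union with the segmented vocabulary, mirroring how the Top K words and the segmented words are merged in step (3).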
Further, step (1) comprises the following steps: (11) for the specialized courses of the New Engineering disciplines, including Python-based computer vision processing, natural language processing and deep learning, collecting the corresponding textbooks and syllabi as text data; (12) manually labeling the knowledge concepts to obtain the BMES sequences corresponding to the text data. Further, step (2) comprises the following steps: (21) preprocessing the original text data obtained in step (1), the preprocessing comprising traditional-to-simplified Chinese conversion, removal of XML symbols, removal of stop words and interference words, and flattening of entry content into single-line data, and obtaining the word segmentation file corresponding to the original text data through jieba or LTP word segmentation; (22) training word2vec models using the gensim package for Python and the word segmentation file data, the vocabulary sets of different professional courses being trained separately; (23) feature processing, namely converting the text data into data that can be recognized