CN-121981122-A - Automatic entity labeling method for vertical industry policy and regulation
Abstract
The invention relates to the technical field of artificial intelligence, and in particular to an automatic entity labeling method for vertical industry policies and regulations. The method comprises the following steps: document format parsing, determining whether the document is a text document, dynamic splitting into text blocks based on semantic integrity, text vectorization, entity identification and attribute assignment, entity vectorization, clustering based on vector similarity, and so on. Through full-process automation of semantics-driven text segmentation, LLM entity identification and disambiguation, knowledge network construction, and graph fusion, the invention replaces the laborious reading, understanding, labeling, and rechecking steps of traditional manual labeling, and the automated process reduces the demand for professional labeling and quality inspection personnel. In addition, the dynamic segmentation strategy (reducing the amount of text the LLM processes in a single pass), vector retrieval (quickly locating similar entities), and graph fusion (incrementally updating knowledge) work together to greatly shorten the labeling cycle.
Inventors
- LIU YUNFENG
- GUO WEIJIA
- WANG XIAOYAN
- WANG DENGYONG
- JIN YANYU
- WANG YANG
- LIU JIANAN
Assignees
- 哈尔滨思和信息技术股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-05
Claims (10)
- 1. An automated entity labeling method for vertical industry policies and regulations, the method comprising the following steps:
  (1) document format parsing: after the document to be processed is obtained, parsing the document;
  (2) determining whether it is a text document: determining whether the parsed document is a text document; if not, converting it into text by OCR or audio conversion technology and then proceeding to the next step; if so, proceeding directly to the next step;
  (3) dynamic splitting into text blocks based on semantic integrity: using the chapter titles, section titles, article numbers, clause start markers, and paragraph separators in the text document as cut points, dividing the text document into text blocks that do not overlap one another and that each retain independent semantic integrity;
  (4) text vectorization: vectorizing each segmented text block and generating a unique identifier, chunk-id, for each text block;
  (5) entity identification and attribute assignment: for the segmented and vectorized text blocks, analyzing and identifying entities one by one using a large language model (LLM), and assigning to each entity an entity type, an entity description, an entity name, and an associated chunk-id;
  (6) entity vectorization: combining the entity type, entity description, entity name, and associated chunk-id assigned in step (5) and vectorizing the combination to obtain an entity combination vector;
  (7) clustering based on vector similarity: calculating the similarity between entity combination vectors, and grouping entity combination vectors whose similarity exceeds a preset threshold into a preliminary similar entity set;
  (8) entity disambiguation: checking the preliminary similar entity sets one by one with the large language model (LLM), verifying semantic consistency and attribute matching degree, obtaining a final similar entity set after disambiguation, and merging the multiple entity combination vectors in the final similar entity set into a single entity combination;
  (9) knowledge network construction: retrieving the associated text fragments through the chunk-id in the entity combination vector, using vector retrieval to obtain text fragments and a relation network similar to the selected entity, integrating the retrieved text and relation network with the large language model (LLM), and constructing a knowledge network of the current document;
  (10) acquiring entities in the knowledge network: extracting entity information, by vector retrieval, from the knowledge network constructed in step (9);
  (11) determining whether the entity is unique in the knowledge graph: determining whether the entity information obtained in step (10) is unique in the initial knowledge graph; if not, proceeding to step (12); if so, proceeding directly to step (14);
  (12) graph entity disambiguation: comparing the vector similarity of the entity information obtained in step (10) with the existing entity set in the initial knowledge graph, and, when the similarity reaches a preset threshold, disambiguating the entity of the current document against the existing entity in the initial knowledge graph so as to eliminate cross-document entity ambiguity;
  (13) graph entity merging: merging the disambiguated entities into the initial knowledge graph and recording the mapping between the merged entities and the entities of the initial knowledge graph;
  (14) graph relation fusion: calculating a matching degree based on the entity attributes of the current document and the attribute features of the business information in the initial knowledge graph; when the matching degree reaches a threshold, marking the entity and associated text fragment of the current document as related to the business information and establishing a relation; when the matching degree does not reach the threshold, inserting the entity information of the current document directly into the initial knowledge graph and establishing a relation with the original graph; and
  (15) constructing a comprehensive knowledge graph: integrating the results of graph entity merging and graph relation fusion to form a comprehensive knowledge graph covering the policies and regulations and the business scenarios.
- 2. The automated entity labeling method for vertical industry policies and regulations of claim 1, wherein the dynamic splitting into text blocks based on semantic integrity in step (3) uses the sentence boundaries of the text as the segmentation basis.
- 3. The automated entity labeling method of claim 1, wherein the large language model (LLM) in step (5) includes, but is not limited to, the qwen-Max model; the entity type is used to distinguish the specific category of an entity, and the entity description is used to provide a detailed explanation of the entity so as to distinguish entities with similar names.
- 4. The automated entity labeling method of claim 1, wherein the preset threshold in step (7) is dynamically adjusted according to entity type.
- 5. The automated entity labeling method for vertical industry policies and regulations of claim 1, wherein in step (8) the entity disambiguation verifies the semantic consistency and attribute matching degree of the preliminary similar entity set through the large language model (LLM), so as to ensure that each disambiguated entity is unique and accurate.
- 6. The automated entity labeling method of claim 1, wherein the knowledge network construction in step (9) comprises text backtracking retrieval and graph structure retrieval: the original text fragments related to an entity are retrieved through the chunk-id, the relation network similar to the entity in the initial knowledge graph is obtained through vector retrieval, and the two are finally integrated to generate the knowledge network of the current document.
- 7. The automated entity labeling method of claim 1, wherein in step (12) the graph entity disambiguation disambiguates the entities of the current document against similar entities in the initial knowledge graph by vector similarity calculation and attribute matching, eliminating cross-document entity ambiguity.
- 8. The automated entity labeling method of claim 1, wherein the matching degree calculation in step (14) is based on the semantic similarity between the entity attribute features and the business information attribute features, and the matching degree threshold is dynamically adjusted according to the vertical industry field.
- 9. The automated entity labeling method of claim 1, further comprising a real-time quality monitoring mechanism which, during the entity disambiguation of step (8) and the graph entity disambiguation of step (12), identifies labeling inconsistencies or errors in real time by means of vector similarity detection and authenticity assessment by the large language model (LLM).
- 10. The automated entity labeling method of claim 1, wherein the method is applied to a vertical industry and further comprises dynamically generating sub-graphs based on the comprehensive knowledge graph to support real-time intelligent question answering, and, through the semantic understanding and intent recognition of the large language model (LLM), rapidly outputting the policy basis and relevant policy fragments corresponding to a business.
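The clustering based on vector similarity in claim 1, combined with the type-dependent threshold of claim 4, might look like the following Python sketch. It is an illustration under stated assumptions, not the claimed implementation: cosine similarity, single-link grouping via union-find, the restriction of comparisons to entities of the same type, the threshold values, and the names `cluster_entities`, `THRESHOLDS`, and `DEFAULT_THRESHOLD` are all hypothetical choices that the claims do not specify.

```python
import math

# Hypothetical per-type thresholds (claim 4 only says the threshold
# is adjusted according to entity type; the values are invented).
THRESHOLDS = {"service object": 0.90}
DEFAULT_THRESHOLD = 0.85


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_entities(entities):
    """Return preliminary similar-entity sets as lists of entity names."""
    parent = list(range(len(entities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            a, b = entities[i], entities[j]
            if a["type"] != b["type"]:
                continue  # assumption: only same-type entities are compared
            threshold = THRESHOLDS.get(a["type"], DEFAULT_THRESHOLD)
            if cosine(a["vector"], b["vector"]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, entity in enumerate(entities):
        groups.setdefault(find(i), []).append(entity["name"])
    # Only groups with more than one member need LLM disambiguation.
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group corresponds to one preliminary similar entity set, which the method then passes to the LLM for verification of semantic consistency and attribute matching before any merge takes place.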
Description
Automatic entity labeling method for vertical industry policy and regulation
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an automatic entity labeling method for vertical industry policies and regulations.
Background
With the rapid development of artificial intelligence technology, building and applying large models in vertical fields has become a core direction for promoting the intelligent transformation of industries. Training such models depends heavily on expertise in the specific field. This knowledge usually exists in unstructured forms such as massive business data, complex policies and regulations, and specific business processes. Converting this unstructured industry knowledge into training data for machine learning models requires a systematic data labeling process, and in general, the larger a high-quality labeled data set is, the more it improves the training effect and performance of large models in the vertical field.
At present, the mainstream technical scheme for data labeling in vertical industries follows this process: first, the system issues a labeling task comprising a number of units to be labeled; second, labeling personnel with field expertise complete the labeling manually; finally, quality inspection personnel ensure the accuracy of the labeling results through manual rechecking. However, this process has obvious defects:
1. Quality control relies excessively on manual work: the accuracy of the labeling results depends entirely on the final manual rechecking step, and an automatic checking mechanism is lacking.
2. The labeling cycle is long: there is a long interval between manual labeling and manual quality inspection, so overall data processing efficiency is low.
3. Labor cost is high and dependence on personnel is strong: a large number of labeling and quality inspection personnel with professional knowledge must be committed.
4. Error feedback is delayed: the labeling process lacks a real-time quality monitoring and error correction mechanism, so labeling errors are hard to find early and tend to accumulate until they are identified at the quality inspection step, which further prolongs the processing cycle and increases rework cost.
Therefore, how to improve data labeling efficiency in vertical fields, reduce excessive dependence on professional manpower, shorten the labeling cycle, and achieve real-time quality control and timely error correction during labeling has become a key technical problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to solve the technical problems in the prior art of low data labeling efficiency, long cycles, excessive dependence on manual quality inspection, and difficulty in finding and correcting labeling errors in time in vertical industries. To this end, the invention provides an automatic entity labeling method for vertical industry policies and regulations.
To achieve the above purpose, the invention adopts the following technical scheme: the invention provides an automatic entity labeling method for vertical industry policies and regulations, which comprises the following steps:
(1) document format parsing: after the document to be processed is obtained, parsing the document;
(2) determining whether it is a text document: determining whether the parsed document is a text document (including Word, PDF, TXT, and the like); if not, converting it into text by OCR or audio conversion technology and then proceeding to the next step;
(3) dynamic splitting into text blocks based on semantic integrity: using the chapter titles, section titles, article numbers, clause start markers, and paragraph separators in the text document as cut points, dividing the text document into text blocks that do not overlap one another and that each retain independent semantic integrity;
(4) text vectorization: vectorizing each segmented text block and generating a unique identifier, chunk-id, for each text block;
(5) entity identification and attribute assignment: for the segmented and vectorized text blocks, analyzing and identifying entities one by one using a large language model (LLM), and assigning to each entity an entity type (used to distinguish entity categories such as 'service objects' and 'service scenes'), an entity description, an entity name, and an associated chunk-id (that is, the unique identifier of the text block from which the entity is derived);
(6) entity vectorization: combining the entity type, entity description, entity name, and associated chunk-id assigned in step (5) and vectorizing the combination to obtain an entity combination vector;
(seventh) base