CN-122021823-A - Method for constructing public security industry knowledge base based on large model
Abstract
The invention belongs to the technical field of data processing and discloses a method for constructing a public security industry knowledge base based on a large model. The method comprises: first acquiring data from the internal systems of public security agencies and from external public channels; cleaning and preliminarily integrating the collected data; extracting entities, entity relations and entity attributes from the cleaned and integrated data and outputting a knowledge triplet table; then fusing the extracted knowledge, storing the fused, clean knowledge in internal tables, and synchronizing the knowledge in the internal tables to GDB (a graph database) to form the public security industry knowledge base. Through data acquisition, data preprocessing, knowledge extraction, and knowledge fusion with conflict resolution, the invention realizes data storage and knowledge base construction for the public security industry, and solves the problem that public security system data are scattered across different service systems in different formats, which makes the data difficult to integrate and use efficiently and thereby reduces working efficiency.
Inventors
- GAO LINGLING
- MA LEI
- ZENG QIAO
- NIAN DONGYONG
- WANG YUQI
- DUAN HAIBIN
- WANG JINCHAO
- YANG GAO
- ZHANG YI
- ZHANG ZITAO
Assignees
- 中电科联海创智信息科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251222
Claims (8)
- 1. A method for constructing a public security industry knowledge base based on a large model, characterized by comprising the following steps: S1, data acquisition, namely connecting to the internal systems of a public security agency and to external public channels through a JDBC interface or a RESTful API interface to acquire structured data, unstructured text data, image data and audio data; S2, data preprocessing, namely cleaning and preliminarily integrating the structured data, unstructured text data, image data and audio data by means of the REGEXP_REPLACE, SUBSTR and SPLIT functions in MaxCompute; S3, knowledge extraction, namely inputting the preprocessed data into MaxCompute, calling the ERNIE large model through UDF and UDTF functions to extract entities, entity relations and entity attributes from the preprocessed data, and outputting a knowledge triplet table; S4, knowledge fusion and conflict resolution, namely performing entity disambiguation, relation alignment and conflict detection on the extracted entities, entity relations and entity attributes, respectively, by means of string-similarity matching functions in SQL logic in MaxCompute; and S5, knowledge base construction, namely storing the fused, clean knowledge in the respective internal tables of MaxCompute, and periodically synchronizing the knowledge in the internal tables to GDB to form the public security industry knowledge base.
- 2. The method of claim 1, wherein in S1, the structured data includes a case file list, a household information list, and a rule and regulation database, the unstructured text data includes a case description, an interrogation record, and a policy file, the image data includes a monitoring screenshot, a certificate photograph, and an on-site investigation image, and the audio data includes an alarm receiving recording and an interrogation recording.
- 3. The method for constructing a public security industry knowledge base based on a large model according to claim 1, wherein in S2, cleaning and integrating the data comprises: replacing runs of consecutive whitespace characters in the structured data with a single space using the string-processing function REGEXP_REPLACE(content, '\\s+', ' '); calling the Alibaba Cloud NLP basic service through a UDTF function to perform word segmentation, part-of-speech tagging and named entity recognition on long text in the structured data and on the unstructured text data, and outputting a structured word-segmentation result table; reading the OSS path of each picture from the table ods_image_table through a UDF function, calling the UDF get_image_feature to extract the feature vector of the picture, and storing the result in a new feature table under the column name feature_vector; and calling the Alibaba Cloud ASR service through a function to convert the audio data into text, generating new text data.
- 4. The method for constructing a public security industry knowledge base based on a large model according to claim 1, wherein in S2, the cleaned and integrated data is further checked and corrected, specifically: using the COUNT(*) aggregate function to query the number of records with missing key fields in the cleaned and integrated data; using the COALESCE function or a CASE WHEN expression to fill and mark the missing key fields; and, if duplicate data exists in the cleaned and integrated data, removing it using the ROW_NUMBER() OVER (PARTITION BY ...) function.
- 5. The method of claim 1, wherein in S3, the output format of the knowledge triplet table is: document ID, entity relationship and entity attribute.
- 6. The method of claim 1, wherein in S4, when performing entity disambiguation, the similarity of entity names is calculated by the SOUNDEX function or an edit-distance UDF, and similar entities are then assigned a single unique entity by recursive logic or iterative calculation; relation alignment unifies the expression of entity relations using the GROUP BY keyword and CASE WHEN expressions; and for conflicting knowledge, the max_by and min_by functions are used to select the most reliable value according to data-source confidence or timestamp.
- 7. The method of claim 1, wherein in S5, the internal tables of MaxCompute include an entity table and a relationship fact table, wherein the entity table stores the standardized information of all entities, the relationship fact table stores all relationships between entities, and the knowledge in the relationship fact table is stored in time-based partitions.
- 8. The method of claim 1, wherein in S5, the internal tables of MaxCompute are synchronized to GDB using the data integration function of DataWorks or DataX, and in GDB the knowledge in the internal tables is built into a visual knowledge graph with entities as nodes and entity relations as edges.
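For illustration only, the cleaning and fusion logic recited in claims 3, 4 and 6 can be sketched outside MaxCompute in plain Python. This is not the claimed SQL/UDF implementation: `difflib.SequenceMatcher` stands in for the SOUNDEX function or edit-distance UDF, a union-find grouping is one possible form of the "iterative calculation", and a comparison on (confidence, timestamp) plays the role of max_by. All function names, field names, the 0.8 threshold and the sample values are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace runs into one space (analogue of REGEXP_REPLACE in claim 3)."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(rows, key):
    """Keep the first row per key value, similar in effect to filtering on
    ROW_NUMBER() OVER (PARTITION BY key) = 1 (claim 4)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def similar(a, b, threshold=0.8):
    """Stand-in similarity test; the threshold is an assumed value."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def assign_entities(names, threshold=0.8):
    """Map each name to one canonical name per group of similar names (claim 6).

    Uses union-find; the lexicographically smallest name in a group is
    chosen as the canonical entity, which is an arbitrary convention.
    """
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    ordered = sorted(parent)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if similar(a, b, threshold):
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
    return {n: find(n) for n in ordered}


def resolve_conflicts(facts):
    """Per (entity, attribute), keep the value with the highest confidence,
    breaking ties by the latest ISO-format timestamp (analogue of max_by)."""
    best = {}
    for fact in facts:
        key = (fact["entity"], fact["attribute"])
        rank = (fact["confidence"], fact["timestamp"])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, fact["value"])
    return {key: value for key, (rank, value) in best.items()}


if __name__ == "__main__":
    raw = ["ACME   Trading\tCo", "ACME Trading Co.", "Harbor Street"]
    print(assign_entities([normalize(n) for n in raw]))
```

The pairwise comparison in `assign_entities` is quadratic in the number of names, so a production pipeline would normally restrict comparisons to candidate blocks first; the sketch omits that step for brevity.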
Description
Method for constructing public security industry knowledge base based on large model
Technical Field
The invention relates to the technical field of data processing, and in particular to a method for constructing a public security industry knowledge base based on a large model.
Background
With the wide application of information technology in the public security field and the continuous expansion of public security business, public security systems have accumulated large-scale and varied data resources. These data include case files, rules and regulations, intelligence information and business specifications, all of which play a vital role in public security work. Case files record in detail the whole process of each case, from occurrence and investigation to closure, and provide a direct reference for subsequent case analysis, the summarizing of experience and the handling of similar cases. Rules and regulations are the basis of public security law enforcement and ensure that every enforcement action has a basis to rely on. Intelligence information is key to observing trends in abnormal behavior, preventing such activity and precisely targeting illegal conduct. Business specifications ensure that public security business processes are standardized, thereby improving overall working efficiency and quality.
However, in the current public security system these data are scattered across different service systems, and because those systems differ in construction time, technical architecture and development team, their data formats and standards are not uniform. Some systems store structured data such as population information and case files in traditional relational databases, while for unstructured data such as video surveillance and online public opinion, different systems may each use their own storage and management methods.
The diversity of data formats and the inconsistency of standards make it difficult to integrate and use the data efficiently, which seriously affects the efficiency of public security work. Public security data is highly complex and contains a large amount of unstructured and semi-structured data, such as accident-scene investigation reports, witness interview records and text descriptions of surveillance video, which conventional methods struggle to process efficiently. Conventional methods typically rely on predefined data structures and schemas; they are efficient for structured data but cannot automatically extract and integrate knowledge from unstructured and semi-structured data. For example, when processing an accident-scene investigation report, a conventional method may require key information, such as the time and place of the incident and the characteristics of the scene, to be extracted manually and converted into a structured format before it can be stored in a knowledge base. This consumes a great deal of manpower and time, and is prone to omissions and errors. According to statistics, in some grassroots public security agencies, manually extracting the key information from one complex unstructured case file takes several hours on average, with an error rate as high as 10% to 15%.
Therefore, it is necessary to design a method for constructing a public security industry knowledge base based on a large model, so as to solve the problem in the prior art that public security system data are scattered across different service systems in different formats, which makes the data difficult to integrate and use efficiently and thereby reduces working efficiency.
Disclosure of Invention
The invention aims to provide a method for constructing a public security industry knowledge base based on a large model, so as to solve the problem in the prior art that public security system data are scattered across different service systems in different formats, which makes the data difficult to integrate and use efficiently and thereby reduces working efficiency.
To achieve this purpose, the basic scheme provided by the invention is a method for constructing a public security industry knowledge base based on a large model, comprising the following steps: S1, data acquisition, namely connecting to the internal systems of a public security agency and to external public channels through a JDBC interface or a RESTful API interface to directly acquire structured data, unstructured text data, image data and audio data; S2, data preprocessing, namely cleaning and preliminarily integrating the structured data, unstructured text data, image data and audio data by means of the REGEXP_REPLACE, SUBSTR and SPLIT functions in MaxCompute; S3, knowledge extraction, namely inputting the preprocessed data into MaxComput