CN-122019656-A - Intelligent collection, extraction and analysis system and method for computing power center information based on large model
Abstract
The invention discloses a large-model-based system and method for the intelligent collection, extraction and analysis of computing power center information. The system comprises a task scheduling module, a multi-source information acquisition module, a content preprocessing module, an AI intelligent analysis module, a rule checking and enhancement module, a multi-round directional search completion module, an intelligent merging and deduplication module, and a result processing and output module. The method comprises: automatically starting tasks; searching multiple information sources in parallel and preprocessing the results; extracting key information with a large language model; checking and standardizing it against domain rules; evaluating information completeness and dynamically initiating multi-round directional searches to fill in missing data; performing intelligent merging and deduplication based on the fusion of text, geographic and multi-dimensional features; and finally outputting structured data with visual display. By combining AI with rule-based processing, closed-loop optimization and multi-dimensional intelligent deduplication, the invention achieves efficient, accurate and fully automatic collection and analysis of computing power center information.
Inventors
- WANG DONGSHENG
- LIU HENGZHI
- RU JIALIANG
- HU XIAOTING
- TU XINYI
Assignees
- 上海国创驭算人工智能科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260120
Claims (9)
- 1. An intelligent collection, extraction and analysis system for computing power center information based on a large model, characterized by comprising: a task scheduling module, used for automatically starting a computing power center information collection task according to a preset time strategy or event trigger condition, and for monitoring the task's execution state in real time and logging it; a multi-source information acquisition module, connected to the task scheduling module, used for receiving task instructions, searching a plurality of internet information sources and databases in parallel based on dynamically generated search terms, and acquiring initial web page content or data interface responses containing information related to computing power centers; a content preprocessing module, connected to the multi-source information acquisition module, used for cleaning the format of the acquired initial content and applying predefined keyword-matching rules for rapid filtering, so as to reject content clearly irrelevant to the topic of computing power centers and output valid content; an AI intelligent analysis module, connected to the content preprocessing module, used for receiving the preprocessed valid content, constructing a dedicated prompt containing a system instruction, a user instruction and the text to be analyzed, calling a large language model to perform semantic understanding and analysis, and extracting predefined structured key information fields from the valid content; a rule checking and enhancement module, connected to the AI intelligent analysis module, used for loading a rule base of professional knowledge in the computing power field and for checking and standardizing the key information fields extracted by the AI intelligent analysis module, wherein the checking and standardization at least comprises unified conversion of computing power units; a multi-round directional search completion module, connected to the output of the rule checking and enhancement module and to the input of the multi-source information acquisition module, used for evaluating the completeness of the checked information and, when a missing key information field is identified, dynamically generating supplementary search terms based on the current information and feeding them back to the multi-source information acquisition module to start a new round of directional search and information extraction, thereby forming closed-loop feedback; an intelligent merging and deduplication module, connected to the rule checking and enhancement module and to the multi-round directional search completion module, used for comparing, merging and deduplicating computing power center information records from different retrieval rounds and different information sources, wherein the comparison is based on a multi-dimensional feature strategy that fuses text similarity, geographic information consistency and computing power information matching degree to determine whether multiple records refer to the same computing power center entity, and records referring to the same entity are fused and their redundancy eliminated; and a result processing and output module, connected to the intelligent merging and deduplication module, used for aggregating and formatting the final merged and deduplicated high-quality structured computing power center information data, displaying it visually through a graphical user interface and/or providing export of structured data files.
- 2. The large-model-based intelligent collection, extraction and analysis system for computing power center information of claim 1, wherein the dedicated prompt preset in the AI intelligent analysis module at least comprises: a system instruction part, used for defining the task role of the large language model as a computing power center information extraction expert; a user instruction part, used for explicitly enumerating the list of computing power center key information fields to be extracted from the text and for specifying the output format; and a processing rule part, used for providing at least one of the following domain constraints: name normalization of the computing power center, geographic information level determination, computing power data association verification, and data integrity checking.
- 3. The large-model-based intelligent collection, extraction and analysis system for computing power center information of claim 1, wherein the rule checking and enhancement module predefines a unified conversion rule for computing power units and specifically performs the following operations: converting the computing power value described in the original text into a value in PFLOPS using the conversion formula $\text{standard value (PFLOPS)} = \text{original number} \times \text{order coefficient} \times \text{unit coefficient} \times 10^{-15}$, wherein the order coefficient is determined from Chinese order-of-magnitude words and the unit coefficient is determined from computing power unit words; and performing a plausibility check on the result and filtering out abnormal values that fall outside a preset reasonable range.
- 4. The large-model-based intelligent collection, extraction and analysis system for computing power center information of claim 1, wherein, in the intelligent merging and deduplication module: the text similarity calculation uses the longest common subsequence (LCS) algorithm to compute a similarity score between computing power center names; the geographic information consistency judgment requires that the provincial-level administrative division names in the information items be exactly identical, and that the similarity of administrative division names at the municipal level or finer granularity be no lower than a first preset threshold; the computing power information matching judgment requires that, after uniform conversion to PFLOPS, the difference between the computing power values be within a second preset threshold range; and the multi-dimensional feature strategy assigns different decision weights to text similarity, geographic information consistency and computing power information matching degree, with text similarity weighted highest.
- 5. The large-model-based intelligent collection, extraction and analysis system for computing power center information of claim 1, wherein, in the multi-round directional search completion module, the closed-loop feedback terminates when the fill rate of all required fields reaches a preset completion threshold or the number of loop iterations reaches a preset maximum.
- 6. An intelligent collection, extraction and analysis method for computing power center information based on a large model, characterized by comprising the following steps: S1, automatically starting a computing power center information collection task through a preset trigger mechanism; S2, dynamically constructing initial search terms from a computing power center feature word library, and executing a multi-source parallel search to acquire original content data; S3, preprocessing the original content data obtained in step S2, the preprocessing comprising denoising, format standardization and rule-based preliminary screening; S4, using a large language model and a dedicated prompt to extract key information fields of the computing power center in structured form from the content preprocessed in step S3; S5, applying predefined computing power domain rules to verify and normalize the extracted key information fields, wherein the verification and normalization at least comprises unified conversion of computing power units; S6, judging the completeness of the information processed in step S5 and, if a missing key information field is identified, dynamically generating supplementary search terms based on the missing fields and returning to step S2 to execute a new round of directional information collection and processing; S7, performing intelligent comparison, merging and deduplication with multi-dimensional feature fusion on the multiple pieces of information obtained through multi-round directional information collection and processing, wherein the multi-dimensional features at least comprise text similarity, geographic information consistency and computing power information matching degree; and S8, outputting the processed computing power center information as a structured data file, with support for visual display and manual interaction.
- 7. The intelligent collection, extraction and analysis method for computing power center information according to claim 6, wherein constructing the dedicated prompt in step S4 comprises: providing a system instruction that defines the large language model as a computing power center information extraction expert; providing a user instruction that explicitly lists the fields to be extracted and the output format; and providing constraints that include processing rules specific to the computing power domain.
- 8. The intelligent collection, extraction and analysis method for computing power center information according to claim 6, wherein the intelligent merging and deduplication with multi-dimensional feature fusion in step S7 specifically comprises: calculating the text similarity of computing power center names between different information items; judging the consistency of geographic information between different information items, wherein provincial-level information is required to be exactly identical; after unifying the units, comparing whether the computing power values of the different information items are within an acceptable error range; and combining the judgment results of these dimensions and making a decision according to a preset weight distribution to determine whether the information items describe the same computing power center entity.
- 9. The intelligent collection, extraction and analysis method for computing power center information according to claim 6, wherein the closed-loop optimization procedure in step S6 terminates when the fill rate of a preset set of required fields meets the requirement or the number of loop iterations reaches an upper limit.
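The unit conversion of claim 3 and step S5 can be illustrated with a minimal Python sketch. It applies the claimed formula (original number × order coefficient × unit coefficient × 10^-15) to a text fragment. The values for 百, 千, 万 and 亿 are the standard numeral values and the unit coefficients follow SI prefixes, but the table contents, the regular expression, the plausibility bounds and the name `to_pflops` are illustrative assumptions, not taken from the patent.

```python
import re
from typing import Optional

# Chinese order-of-magnitude words -> order coefficient (standard numeral values).
ORDER_COEFF = {"": 1.0, "百": 1e2, "千": 1e3, "万": 1e4, "亿": 1e8}
# Computing power unit words -> unit coefficient in FLOPS (SI prefixes).
UNIT_COEFF = {"FLOPS": 1.0, "GFLOPS": 1e9, "TFLOPS": 1e12,
              "PFLOPS": 1e15, "EFLOPS": 1e18}

# Hypothetical plausibility bounds in PFLOPS; the patent only says "preset range".
MIN_PFLOPS, MAX_PFLOPS = 1e-3, 1e6

# Assumed surface form: number, optional single order word, unit word.
# Compound order words such as "千万" are not handled by this sketch.
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(百|千|万|亿)?\s*([GTPE]?FLOPS)",
                      re.IGNORECASE)


def to_pflops(text: str) -> Optional[float]:
    """Return the first computing power figure in `text` in PFLOPS, or None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1))
    order = ORDER_COEFF[match.group(2) or ""]
    unit = UNIT_COEFF[match.group(3).upper()]
    # Claimed formula: number x order coefficient x unit coefficient x 10^-15.
    value = number * order * unit * 1e-15
    # Plausibility check: discard values outside the assumed reasonable range.
    if not (MIN_PFLOPS <= value <= MAX_PFLOPS):
        return None
    return value
```

For example, `to_pflops("总算力达1.5万PFLOPS")` yields a value of about 15000 PFLOPS, while a figure outside the assumed bounds is rejected and returns `None`.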
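The multi-dimensional deduplication decision of claims 4 and 8 can be sketched as follows. The sketch computes a longest common subsequence (LCS) similarity between names, checks geographic consistency (exact provincial match plus a city-level similarity threshold), compares computing power values in PFLOPS, and combines the three with weights in which text similarity is highest. The normalization 2·LCS/(len(a)+len(b)), the `Record` type, and every weight and threshold value are assumptions chosen for illustration; the patent only states that such presets exist.

```python
from dataclasses import dataclass
from typing import Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


@dataclass
class Record:  # hypothetical record shape
    name: str
    province: str
    city: str
    pflops: Optional[float]


# Hypothetical presets; text similarity carries the highest weight per claim 4.
W_TEXT, W_GEO, W_COMPUTE = 0.5, 0.3, 0.2
CITY_THRESHOLD = 0.8       # "first preset threshold"
COMPUTE_REL_TOL = 0.1      # "second preset threshold", taken here as relative
DECISION_THRESHOLD = 0.75


def geo_consistent(r1: Record, r2: Record) -> bool:
    """Provinces must match exactly; city names must be similar enough."""
    if r1.province != r2.province:
        return False
    return lcs_similarity(r1.city, r2.city) >= CITY_THRESHOLD


def compute_match(r1: Record, r2: Record) -> bool:
    """Values (already in PFLOPS) must differ by at most the relative tolerance."""
    if r1.pflops is None or r2.pflops is None:
        return False
    return abs(r1.pflops - r2.pflops) <= COMPUTE_REL_TOL * max(r1.pflops, r2.pflops)


def same_entity(r1: Record, r2: Record) -> bool:
    """Weighted decision on whether two records describe the same center."""
    score = (W_TEXT * lcs_similarity(r1.name, r2.name)
             + W_GEO * float(geo_consistent(r1, r2))
             + W_COMPUTE * float(compute_match(r1, r2)))
    return score >= DECISION_THRESHOLD
```

With these assumed weights, two records with identical names and computing power but different provinces score 0.7 and are kept separate, which is one way to let the provincial check influence the outcome without a hard veto.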
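The closed-loop completion of step S6 and its termination condition (claims 5 and 9) can be sketched as a loop that stops when the fill rate of the required fields reaches a completion threshold or the round count reaches a maximum. The field names, the threshold values, the query format and the `search_and_extract` callback (standing in for steps S2 to S5) are hypothetical and introduced only for this illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical required fields and presets; the patent does not enumerate them.
REQUIRED_FIELDS = ["name", "province", "city", "compute_pflops", "operator"]
COMPLETION_THRESHOLD = 0.8
MAX_ROUNDS = 3


def is_filled(value: object) -> bool:
    return value not in (None, "")


def fill_rate(record: Dict[str, object]) -> float:
    """Fraction of required fields that currently hold a value."""
    filled = sum(1 for field in REQUIRED_FIELDS if is_filled(record.get(field)))
    return filled / len(REQUIRED_FIELDS)


def missing_fields(record: Dict[str, object]) -> List[str]:
    return [f for f in REQUIRED_FIELDS if not is_filled(record.get(f))]


def build_supplementary_query(record: Dict[str, object], missing: List[str]) -> str:
    """Assumed query format: known center name followed by the missing field names."""
    known_name = record.get("name") or "算力中心"
    return f"{known_name} " + " ".join(missing)


def complete_record(
    record: Dict[str, object],
    search_and_extract: Callable[[str], Dict[str, object]],
) -> Tuple[Dict[str, object], int]:
    """Run directional search rounds until complete enough or out of rounds."""
    current = dict(record)
    rounds = 0
    while fill_rate(current) < COMPLETION_THRESHOLD and rounds < MAX_ROUNDS:
        missing = missing_fields(current)
        query = build_supplementary_query(current, missing)
        found = search_and_extract(query)  # stands in for steps S2 to S5
        for field in missing:
            if is_filled(found.get(field)):
                current[field] = found[field]
        rounds += 1
    return current, rounds
```

The round cap guarantees termination even when a field cannot be found in any source, at the cost of returning a partially filled record.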
Description
Intelligent collection, extraction and analysis system and method for computing power center information based on large model
Technical Field
The invention relates to the technical field of information processing and data mining, in particular to an intelligent collection, extraction and analysis system and method for computing power center information based on a large model.
Background
With the rapid development of the digital economy and artificial intelligence technology, computing power centers have become key infrastructure for centrally providing computing, storage and network services, and information such as their construction scale, regional distribution and service capability has become an important basis for industrial planning, resource scheduling and market analysis. However, information related to computing power centers is generally scattered across many kinds of sources, such as public government reports, enterprise websites, industry media and academic papers. It is characterized by heterogeneous data sources, unstructured formats, highly specialized content and frequent updates, which poses a serious challenge to efficient, accurate and comprehensive information collection and analysis.
At present, the collection and processing of computing power center information mainly relies on the following technical schemes, each of which has significant limitations in practice:
(1) Manual collation and semi-automatic collection, which rely on manual retrieval, reading and data entry, or on simple scripts to assist crawling.
This approach has the following main drawbacks: 1) low efficiency: it cannot keep up with the large volume and rapid updating of computing power center information, manual processing is time-consuming and labor-intensive, and real-time collection of large-scale information is difficult to achieve; 2) poor consistency: different personnel understand the same information and the entry standards differently, leading to disordered data formats and uneven quality; 3) weak scalability: it cannot adapt to the dynamic expansion of multi-source, multi-type information sources, and system maintenance costs are high.
(2) Rule-based web crawling, which automatically acquires information from specific websites through preset crawling rules and parsing rules based on regular expressions or XPath. Its limitations are significant. First, writing and maintaining the rules is a heavy workload, and the rules depend heavily on the structure of the target website: once the website is redesigned the rules fail, so adaptability is poor. Second, this traditional approach has difficulty processing unstructured and semi-structured text effectively; it cannot understand the technical terms, ambiguous tables and contextual semantics common in descriptions of computing power centers, so extraction accuracy is low. Finally, such techniques usually lack intelligent cross-source alignment and deduplication mechanisms, tend to produce large amounts of redundant and contradictory data, and yield data of poor usability.
(3) Directly calling a general-purpose large language model. Although large language models have strong natural language understanding and generation capabilities, applying them directly to the task of extracting computing power center information still has clear limitations: 1) lack of domain knowledge: general-purpose large models have no dedicated training for the computing power industry and poorly understand professional knowledge such as computing power unit conversion, industry terminology and policy specifications; 2) high processing cost: directly calling the API of a commercial large model to process massive amounts of text is extremely expensive and unsuitable for large-scale industrial application; 3) limited long-text capability: reports or pages about computing power centers are often long and exceed the model's single-pass length limit, so information extraction is incomplete.
(4) Deep research models. Such models focus on in-depth reasoning and multi-step retrieval for complex problems and are suited to writing analysis reports or solving comprehensive problems, but they are mismatched with the computing power center information collection scenario in the following ways: 1) inconsistent task positioning: collecting computing power center information is a "breadth" task that emphasizes efficiently gathering large amounts of structured, atomic information from multiple sources rather than performing in-depth analysis and comprehensive reasoning; 2) low processing efficiency: the deep research model is usually used for multi-round and long-link re