CN-121980024-A - Automatic generation method and system based on dynamic industry analysis report
Abstract
The invention discloses a method and system for automatically generating dynamic industry analysis reports, and relates to the technical field of artificial intelligence. In the information acquisition stage, the method uses an LLM to perform semantic understanding and value judgment on each page and its internal blocks, realizing a semantically guided crawling process and reducing invalid data at the source. It constructs an intelligent recognition and processing mechanism for the block-level structure of webpages: a page is decomposed into content blocks of different types, the value of external-link blocks is evaluated, and the crawler is guided to adjust its crawling depth and direction accordingly. Multimodal analysis is integrated into information extraction, so that pictures and chart content become an integral part of the industry analysis. Industry focus points are configured, and an automatic generation system for dynamic industry analysis reports is built around them, enabling long-term information accumulation and automatic report generation with the focus points at the core. The invention enables efficient, accurate and interpretable automatic information acquisition and in-depth analysis of industry dynamics.
Inventors
- ZHANG XUEJUN
- WANG CHAOMIN
- FEI WENYUE
- WU XIAOJIE
- REN SHUAI
- LOU JIANJUN
- SONG JIAFENG
Assignees
- 北京集智未来人工智能产业创新基地有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251225
Claims (10)
- 1. A method for automatically generating a dynamic industry analysis report, characterized in that the method is implemented by an automatic generation system for dynamic industry analysis reports comprising a focus configuration and task scheduling module, a semantic-driven information acquisition module, a webpage analysis and multimodal fusion module, an information extraction and storage module, and a dynamic analysis and report generation module, and comprises the following steps: S1, the focus configuration and task scheduling module defines the structured fields of the IT industry focus points to be crawled, and performs task scheduling by fixed-period triggering to generate a task object; S2, the semantic-driven information acquisition module constructs an initial information source from the generated task object, preprocesses it to obtain an initial URL set, and submits the initial URL set to a webpage capture engine to obtain the original HTML content of the webpages; S3, the webpage analysis and multimodal fusion module decomposes each webpage into blocks based on the crawling result to obtain a plurality of candidate blocks, calculates the text ratio of each candidate block, inputs the text content and text ratio of each candidate block into an LLM to distinguish text blocks, external-link blocks, advertisement blocks and navigation blocks, identifies the value of the links in the external-link blocks, directly filters out low-value links, adds high-value links to the crawling queue, uses a VLM to recognize the text in webpage pictures and generate structured picture descriptions, and extracts meta-information fields by combining the LLM with rule matching; S4, the text blocks, the structured picture descriptions and the meta-information fields are input into the information extraction and storage module, and information is extracted by the LLM to generate an information extraction result; S5, the information extraction result is input into the dynamic analysis and report generation module for scoring and ranking; the information is screened by score to obtain a screened information set; the screened information set is clustered to obtain an information clustering result; an event chain is constructed from the information clustering result; based on the event chain and the information clustering result, a chapter structure is automatically generated according to the report framework preset for the industry focus point; the information subset corresponding to each chapter is selected and input into the LLM to generate the text of that chapter; and the texts of all chapters are concatenated to generate the final IT industry analysis report.
- 2. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein the structured fields of the industry focus point comprise industry category, subdivision topic, analysis dimension, language preference, update frequency and semantic description fields; wherein the industry category identifies the industry to which the focus point belongs; the subdivision topic describes a particular track or topic; the analysis dimension represents the aspects of greater interest to the user; the language preference is the language in which the user expects the final report to be written; the update frequency determines how often the scheduling module generates a new analysis report for the focus point, typically in units of hours, days or weeks; and the semantic description field gives an extended natural-language explanation of the focus point.
- 3. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein, in S2, using the LLM to judge whether the original HTML content of a page is high-value content and, if so, preferentially crawling its internal high-value links to obtain the crawling result comprises: using the LLM to judge the relevance of the whole page to the current industry focus point; when the relevance is low and the internal links point to content irrelevant to the focus point, the system marks the URL as a low-value source and does not recursively crawl along the links of that page; when the LLM identifies that the page contains text blocks highly relevant to the current focus point and that the internal links include high-value candidates pointing to deeper information, the system assigns a priority to the high-value links and adds them to the crawling queue for the next round.
- 4. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein, in step S3, decomposing the page into blocks to obtain a plurality of candidate blocks, calculating the text ratio of each candidate block, and inputting the text content and text ratio of each candidate block into the LLM to distinguish text blocks, external-link blocks, advertisement blocks and navigation blocks comprises: S31, dividing the page into a plurality of candidate blocks according to DOM structure levels, paragraph tags, text lengths and visual layout information; S32, inputting the text content, the position in the page, the tag information of the parent node and the text ratio of each candidate block into the LLM to judge the type of the candidate block, and outputting a block type judgment result; S33, performing weighted fusion of the block type judgment result and the text ratio to distinguish text blocks, external-link blocks, advertisement blocks and navigation blocks, wherein a block judged to be a text block is included in the key processing scope for information extraction, and a block judged to be an external-link block or an advertisement block with a low text ratio is classified as a non-key area.
- 5. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein using the VLM to recognize the text in a page picture and generate a structured picture description comprises: extracting the link address and surrounding text information of each picture, wherein the surrounding text information comprises the picture title, the descriptive text below the picture, the alt attribute and the figure caption; inputting the picture and its surrounding text information into the VLM, recognizing the text content contained in the picture, extracting the title, axis labels, data labels and legend descriptions in the picture through multimodal understanding, and converting the text, numbers and structural relationships contained in the picture into a readable text description.
- 6. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein the information extraction result comprises a comprehensive content summary, a structured element list and an original page information sub-table; the comprehensive content summary describes, in the form of continuous paragraphs, the key information of the page related to the focus point; the structured element list extracts key information items from the text, including enterprise names, countries and regions, product or technology names, project names, funding amounts, production capacity values, time points, policy names, issuing authorities and event categories; and the original page information sub-table records the URL, site domain name, crawl time, page type, overall relevance score and overall summary of each crawled webpage.
- 7. The method for automatically generating a dynamic industry analysis report according to claim 1, wherein inputting the information extraction result into the dynamic analysis and report generation module for scoring and ranking, screening the information by score to obtain the screened information set, clustering the screened information set to obtain the information clustering result, and constructing the event chain from the information clustering result comprises: S51, inputting the information extraction result into the dynamic analysis and report generation module and scoring it against a plurality of preset indicators to obtain scores, wherein high-scoring information is included in the main sections of the report, and low-scoring information is merged and summarized or listed in an appendix; S52, based on the screened information set, converting each piece of information into a vector representation using a text embedding model, and grouping similar information into the same topic cluster by a clustering algorithm; S53, within each topic cluster, linking the information in order of publication time to form a time-series chain, and consolidating and summarizing the chain with the LLM to obtain the event chain.
- 8. A system for automatically generating a dynamic industry analysis report, for implementing the method for automatically generating a dynamic industry analysis report according to any one of claims 1 to 7, characterized in that the system comprises: a focus configuration and task scheduling module, used for defining the structured fields of the IT industry focus points to be crawled, and performing task scheduling by fixed-period triggering to generate a task object; a semantic-driven information acquisition module, used for constructing an initial information source from the generated task object, preprocessing it to obtain an initial URL set, and submitting the initial URL set to a webpage capture engine to obtain the original HTML content of the webpages; a webpage analysis and multimodal fusion module, used for decomposing each webpage into blocks based on the crawling result to obtain a plurality of candidate blocks, calculating the text ratio of each candidate block, inputting the text content and text ratio of each candidate block into an LLM to distinguish text blocks, external-link blocks, advertisement blocks and navigation blocks, identifying the value of the links in the external-link blocks, directly filtering out low-value links, adding high-value links to the crawling queue, using a VLM to recognize the text in webpage pictures and generate structured picture descriptions, and extracting meta-information fields by combining the LLM with rule matching; an information extraction and storage module, used for extracting information from the text blocks, the structured picture descriptions and the meta-information fields by the LLM to generate an information extraction result; and a dynamic analysis and report generation module, used for scoring and ranking the information extraction result, screening the information by score to obtain a screened information set, clustering the screened information set to obtain an information clustering result, constructing an event chain from the information clustering result, automatically generating a chapter structure according to the report framework preset for the industry focus point based on the event chain and the information clustering result, selecting the information subset corresponding to each chapter, inputting the information subset into the LLM to generate the text of that chapter, and concatenating the texts of all chapters to generate the final IT industry analysis report.
- 9. A device for automatically generating a dynamic industry analysis report, characterized in that the device comprises: a processor; and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 7.
- 10. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 7.
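As a non-normative illustration of the text-ratio calculation and weighted fusion recited in claim 4 (S32 to S33), the following Python sketch may help. The claims do not define the text ratio or the fusion weights, so everything specific here is an assumption: the ratio is taken as visible text characters divided by the total length of the block's HTML, the LLM judgment is represented by a hypothetical `BlockJudgment` (label plus confidence) rather than a real model call, and the weights and thresholds in the hypothetical `fuse` function are placeholder values.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Labels treated as low-value candidates (names are assumptions).
LOW_VALUE_LABELS = {"external_link", "advertisement"}


class _VisibleTextCounter(HTMLParser):
    """Counts non-markup characters in an HTML fragment.

    For brevity, script/style content is not excluded.
    """

    def __init__(self) -> None:
        super().__init__()
        self.chars = 0

    def handle_data(self, data: str) -> None:
        self.chars += len(data.strip())


def text_ratio(block_html: str) -> float:
    """Assumed definition: visible text chars / total chars of the block's HTML."""
    if not block_html:
        return 0.0
    counter = _VisibleTextCounter()
    counter.feed(block_html)
    counter.close()  # flush any buffered text
    return counter.chars / len(block_html)


@dataclass
class BlockJudgment:
    """Stand-in for the LLM output of S32 (hypothetical structure)."""
    label: str         # "text", "external_link", "advertisement", "navigation"
    confidence: float  # 0.0 to 1.0


@dataclass
class FusedBlock:
    label: str
    score: float
    region: str        # "key", "non_key" or "undecided"


def fuse(judgment: BlockJudgment, ratio: float,
         w_llm: float = 0.7, w_ratio: float = 0.3,
         key_threshold: float = 0.6, low_ratio: float = 0.3) -> FusedBlock:
    """Weighted fusion of the LLM judgment and the text ratio (S33).

    All weights and thresholds are illustrative assumptions.
    """
    # Crude evidence that the block is body text, derived from the LLM label.
    text_evidence = (judgment.confidence if judgment.label == "text"
                     else 1.0 - judgment.confidence)
    score = w_llm * text_evidence + w_ratio * ratio
    if score >= key_threshold:
        return FusedBlock("text", score, "key")
    if judgment.label in LOW_VALUE_LABELS and ratio < low_ratio:
        return FusedBlock(judgment.label, score, "non_key")
    return FusedBlock(judgment.label, score, "undecided")


if __name__ == "__main__":
    body = "<p>" + "A" * 94 + "</p>"
    links = '<ul><li><a href="https://example.com/a">More</a></li></ul>'
    print(fuse(BlockJudgment("text", 0.9), text_ratio(body)))
    print(fuse(BlockJudgment("external_link", 0.8), text_ratio(links)))
```

In this sketch a paragraph-heavy block with a confident "text" judgment lands in the key region, while a markup-heavy link list judged as an external-link block lands in the non-key region. A real system would presumably tune the weights and add the position and parent-tag signals that S32 also names.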
Description
Automatic generation method and system based on dynamic industry analysis report
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and system for automatically generating dynamic industry analysis reports.
Background
Among approaches that use large language models to generate reports automatically, a class of LLM-based systems for automatically generating industry reports has appeared in recent years. Representative solutions in this prior art typically share the following features. First, they capture web content related to industry keywords from the internet or designated sites using a search engine or a simple crawler, and uniformly clean the content into plain text. Second, they summarize, generalize and rewrite the text with a large language model to generate structured or semi-structured report content, such as sections on 'industry profile', 'main events', 'market size' and 'competitive landscape'. Third, in some schemes, an LLM outputs the report text in a specified format according to a predefined template, and a front-end or document generation module then renders it as a PDF or Word document.
In these schemes, information collection is usually handled by a relatively independent crawler or search module. In a typical approach, the system periodically calls a search interface, retrieves relevant webpage links by keyword, and then uniformly crawls the content those links point to. Each collected page is generally treated as a single unit of text, with little fine-grained division of its internal structure. The system then merges or concatenates the texts and hands them to the LLM, which performs key information extraction, summarization and report writing.
Some schemes add structure on this basis through simple entity extraction (e.g., company name, amount and time), but most are still dominated by a "full-text summary + paragraph generation" pattern. Some prior art also attempts to chain multiple rounds of LLM calls into a pipeline: one model or prompt extracts the keywords of a document, a second call produces extended descriptions or trend analysis of those keywords, and a final call assembles an industry report. However, most of these schemes still operate at the document level: the input is the text of a single article or of a small number of screened documents, and the LLM typically takes no part in deciding or controlling crawling, link expansion, webpage structure understanding or picture understanding, which remain decoupled from LLM processing.
Although the prior art described above has introduced LLMs for report generation, it has several shortcomings.
First, the acquisition phase and the LLM analysis phase remain separate in most schemes. The crawler or search module generally has no semantic understanding capability; it can only fetch data according to preset keywords, a fixed site list or a fixed crawling depth, and cannot dynamically judge during crawling whether a given page or link is of high value to a specific industry focus point. Because of this separation, a large number of irrelevant, duplicate or low-value pages are captured and passed to the LLM, which both increases computational cost and dilutes the quality of the report content.
Second, fine-grained understanding of the internal structure of a page is insufficient.
Existing schemes often process a complete webpage as a single text string. They cannot distinguish the main text area, recommended external-link area, advertisement area and navigation area at the block level, and they make no overall use of structural features of the webpage such as link density and text density. With such coarse-grained processing, reports generated by the LLM are easily disturbed by large amounts of noise, or miss key information buried deep in the main text, while links in recommendation lists go without intelligent screening.
Third, processing of multimodal content is extremely limited. A large amount of industry information is carried in charts, flowcharts, product comparison charts, policy interpretation charts or screenshots containing text. Traditional report generation schemes either ignore picture content entirely or simply keep the picture links, without any structured analysis of the numbers, titles and semantic information in the pictures, so the resulting industry reports lack key data support and visual presentation.
Fourth, existing solutions, while using LLM models to generate reports, are still relatively primitive in terms of dy