CN-122019861-A - Webpage main body information extraction method, equipment and medium

Abstract

The application discloses a webpage main body information extraction method, equipment and medium, and relates to the technical field of Internet information processing. The method comprises: obtaining, through a headless browser, a Document Object Model (DOM) tree of a webpage and visual rendering information of at least one HTML element; dividing the webpage into at least one visual block using a visual separation algorithm based on the visual rendering information, and calculating a visual feature score for each visual block; performing weighted fusion of the visual feature score and the structural feature score of each DOM node in the DOM tree to obtain a comprehensive weight score for the DOM node; and determining the DOM nodes whose comprehensive weight score is greater than a preset weight threshold as container nodes of the main content, from which information is extracted to obtain the main body information of the webpage. The method thereby aims to extract webpage main body information with high accuracy, good robustness and high efficiency.
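The weighted fusion and thresholding step summarized above can be sketched as follows. This is a minimal illustration only: the `NodeScore` class, the fusion weight `alpha`, the threshold value, and the assumption that both scores lie in [0, 1] are hypothetical choices that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeScore:
    """Hypothetical per-node record; both scores are assumed to lie in [0, 1]."""
    node_id: str
    visual: float       # visual feature score of the visual block mapped to the node
    structural: float   # structural feature score of the node itself


def composite_score(score: NodeScore, alpha: float = 0.5) -> float:
    """Weighted fusion: alpha weights the visual score, (1 - alpha) the structural score."""
    return alpha * score.visual + (1.0 - alpha) * score.structural


def select_containers(scores: List[NodeScore],
                      threshold: float = 0.6,
                      alpha: float = 0.5) -> List[str]:
    """Return the ids of nodes whose comprehensive weight score exceeds the threshold."""
    return [s.node_id for s in scores if composite_score(s, alpha) > threshold]


if __name__ == "__main__":
    nodes = [NodeScore("article", 0.9, 0.8), NodeScore("nav", 0.2, 0.1)]
    print(select_containers(nodes))
```

In practice the weight and threshold would presumably be tuned per deployment; the values here are placeholders.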

Inventors

YANG SHANGYONG
HOU HUAN
ZHOU XIANGLONG
LI YUFENG

Assignees

Shandong Inspur Scientific Research Institute Co., Ltd. (山东浪潮科学研究院有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-11

Claims (10)

1. A webpage main body information extraction method, characterized by comprising the following steps: obtaining, through a headless browser, a Document Object Model (DOM) tree of a webpage and visual rendering information of at least one HTML element; dividing the webpage into at least one visual block using a visual separation algorithm based on the visual rendering information, and calculating a visual feature score of the visual block; performing weighted fusion of the visual feature scores of the DOM nodes in the DOM tree with the structural feature scores of the DOM nodes to calculate a comprehensive weight score of each DOM node, wherein the visual feature score of a DOM node is the visual feature score of the visual block corresponding to that DOM node; determining a DOM node whose comprehensive weight score is greater than a preset weight threshold as a container node of the main content; and extracting information from the container node of the main content to obtain the main body information of the webpage.
2. The method of claim 1, wherein before the dividing of the webpage into at least one visual block using a visual separation algorithm based on the visual rendering information, the method further comprises: filtering invisible elements out of the DOM tree; and determining the browser viewport of the webpage as an initial visual block.
3. The method of claim 2, wherein the dividing of the webpage into at least one visual block using a visual separation algorithm based on the visual rendering information comprises: determining the boundaries of the sub-elements of the current visual block based on the visual rendering information, wherein a sub-element is an element of the DOM tree that is actually displayed; determining candidate separators of the visual block in the vertical direction and the horizontal direction based on the boundaries of the sub-elements; determining a candidate separator as a target separator when the strength of the candidate separator is greater than a preset strength threshold, wherein the strength is determined based on the width of the blank area between adjacent sub-elements and the style difference between them; dividing the current visual block into two sub-blocks based on the target separator, wherein the sub-blocks are new visual blocks; and repeating the above steps for the sub-blocks until no target separator exists in a visual block.
4. The method of claim 1, wherein the visual feature score comprises a visual importance score and a content density score; and the calculating of the visual feature score of the visual block comprises: determining the visual importance score of the visual block according to the area ratio and the center position ratio of the visual block, wherein the visual importance score indicates how important the visual block is within the user's field of view; and determining the content density score of the visual block according to the ratio of the area occupied by the text nodes of the visual block to the area of the visual block, wherein the content density score represents how dense the text information in the visual block is.
5. The method of claim 1, wherein the weighted fusion of the visual feature scores of the DOM nodes in the DOM tree with the structural feature scores of the DOM nodes to calculate the comprehensive weight score of the DOM nodes comprises: establishing a mapping between the visual blocks and the DOM nodes according to the degree of overlap between the visual rectangle of each DOM node in the DOM tree and the visual blocks; calculating the structural feature score of the DOM node based on the structural features of the DOM node; and performing weighted fusion of the visual feature score of the DOM node and the structural feature score of the DOM node to obtain the comprehensive weight score of the DOM node.
6. The method of claim 5, wherein the calculating of the structural feature score of the DOM node based on the structural features of the DOM node comprises: determining a tag score of the DOM node according to the tag of the DOM node and a predefined weight dictionary; determining a semantic score of the DOM node by regular-expression matching against the class name and the ID of the DOM node; determining a link text ratio of the DOM node as the ratio of the text length within the hyperlink tags of the DOM node to the total text length of the DOM node; and determining the structural feature score of the DOM node based on the tag score, the semantic score and the link text ratio.
7. The method of claim 1, wherein the determining of the DOM node whose comprehensive weight score is greater than a preset weight threshold as a container node of the main content comprises: traversing the DOM tree bottom-up using a greedy algorithm; and, when a target ratio is greater than a preset ratio threshold and the area of a parent node is greater than a preset area threshold, determining the parent node as a container node of the main content, wherein the target ratio is the ratio between the comprehensive weight score of the parent node and the sum of the comprehensive weight scores of the child nodes of the parent node.
8. The method according to claim 1, wherein the extracting of information from the container node of the main content to obtain the main body information of the webpage comprises: obtaining a title based on the text of an <h1> or <h2> tag, or of a bold text block, within the container node; traversing all text nodes in the container node and segmenting them according to paragraph tags or visual block separation to obtain the body text; and searching the container node and its sibling nodes for text matching a date-time format, using predefined regular expression patterns, to obtain the release time.
9. A webpage main body information extraction apparatus, characterized in that the apparatus comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the webpage main body information extraction method as claimed in any one of claims 1 to 8.
10. A computer storage medium storing computer-executable instructions which, when executed, implement the webpage main body information extraction method as claimed in any one of claims 1 to 8.
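The feature scores named in claims 4 and 6 can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the `Rect` type, the tag weights, the regular expressions, the fallback values, and the equal-weight combinations are hypothetical, since the claims name the inputs to each score but not the formulas.

```python
import re
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


# Hypothetical weight dictionary and class/ID patterns (not given in the claims).
TAG_WEIGHTS = {"article": 1.0, "main": 0.9, "section": 0.6, "div": 0.4, "nav": 0.0, "footer": 0.0}
POSITIVE = re.compile(r"content|article|post|body|main", re.I)
NEGATIVE = re.compile(r"nav|footer|sidebar|comment|advert|banner", re.I)


def visual_importance(block: Rect, viewport: Rect, w_area: float = 0.5) -> float:
    """Combine the area ratio and a horizontal center position ratio (assumed form)."""
    area_ratio = min(block.area / viewport.area, 1.0)
    block_cx = block.x + block.width / 2
    view_cx = viewport.x + viewport.width / 2
    # 1.0 when the block is horizontally centered, falling to 0.0 at the viewport edge.
    center_ratio = max(0.0, 1.0 - abs(block_cx - view_cx) / (viewport.width / 2))
    return w_area * area_ratio + (1.0 - w_area) * center_ratio


def content_density(text_area: float, block: Rect) -> float:
    """Ratio of the area occupied by text nodes to the area of the block."""
    return min(text_area / block.area, 1.0) if block.area > 0 else 0.0


def structural_score(tag: str, class_name: str, node_id: str,
                     link_text_len: int, total_text_len: int) -> float:
    """Average of tag score, semantic score and (1 - link text ratio); assumed form."""
    tag_score = TAG_WEIGHTS.get(tag.lower(), 0.3)  # 0.3 is an assumed default
    ident = f"{class_name} {node_id}"
    semantic = 0.5
    if POSITIVE.search(ident):
        semantic += 0.5
    if NEGATIVE.search(ident):
        semantic -= 0.5
    # Nodes without text are treated as pure link containers (assumption).
    link_ratio = link_text_len / total_text_len if total_text_len else 1.0
    return (tag_score + semantic + (1.0 - link_ratio)) / 3.0
```

With these assumed weights, an `<article class="post-content">` node with little link text scores close to 1, while a `<nav>` node dominated by link text scores close to 0.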

Description

Webpage main body information extraction method, equipment and medium

Technical Field

The present application relates to the field of Internet information processing technologies, and in particular to a method, an apparatus, and a medium for extracting webpage main body information.

Background

With the explosive growth of Internet information, automatically extracting main body content (such as news text or commodity descriptions) from webpages has become a key technology for web crawlers, search engines, data aggregation and similar applications. Existing mainstream extraction techniques fall into two categories.

The first is based on the structure of the Document Object Model (DOM). It analyzes the HTML DOM tree of the webpage and locates the main content using preset rules or machine learning models built on features such as tags, class names and IDs. Its disadvantage is pronounced: it depends heavily on the specific DOM structure. Once a website is redesigned and its HTML tags or style names change, the preset rules stop working immediately, manual re-adaptation is required, and the maintenance cost is high. In addition, noise content on the page, such as navigation bars, advertisements and related recommendations, may resemble the main text in DOM structure, resulting in incorrect extraction.

The second is based on computer vision. It processes the entire webpage as an image and uses layout analysis algorithms to identify the main content area. This approach is not affected by DOM structure changes and is highly robust. However, it is computationally expensive and slow, cannot meet the requirements of large-scale crawling, and has difficulty extracting content accurately at the text level.
Disclosure of Invention

The embodiments of the present application provide a webpage main body information extraction method, equipment and medium, to solve the technical problem of how to extract webpage main body information with high accuracy, good robustness and high efficiency.

In a first aspect, an embodiment of the present application provides a webpage main body information extraction method. In the method, a Document Object Model (DOM) tree of a webpage and visual rendering information of at least one HTML element are obtained through a headless browser; the webpage is divided into at least one visual block using a visual separation algorithm based on the visual rendering information, and visual feature scores of the visual blocks are calculated; the visual feature scores of the DOM nodes in the DOM tree are weighted and fused with the structural feature scores of the DOM nodes to calculate comprehensive weight scores of the DOM nodes, wherein the visual feature score of a DOM node is the visual feature score of the visual block corresponding to that DOM node; the DOM nodes whose comprehensive weight scores are greater than a preset weight threshold are determined to be container nodes of the main content; and information is extracted from the container nodes of the main content to obtain the main body information of the webpage.

In a second aspect, an embodiment of the present application further provides a webpage main body information extraction device. The device includes at least one processor and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the webpage main body information extraction method according to the first aspect.
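As an illustration of the visual separation step of the method described above, the following sketch recursively splits a set of element bounding boxes at the widest blank gap. It is a simplified stand-in, not the patented algorithm: separator strength is assumed to be the gap width alone (the style difference between neighbors is ignored), and the `Box` tuple layout, function names and `min_strength` value are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical box layout: (left, top, right, bottom) in CSS pixels.
Box = Tuple[float, float, float, float]


def widest_gap(boxes: List[Box], axis: int) -> Optional[Tuple[float, float]]:
    """Find the widest blank gap along one axis.

    axis 0 scans along x (a vertical separator); axis 1 scans along y
    (a horizontal separator). Returns (gap_width, split_position) or None.
    """
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best = None
    reach = spans[0][1]  # furthest edge covered by the boxes seen so far
    for start, end in spans[1:]:
        gap = start - reach
        if gap > 0 and (best is None or gap > best[0]):
            best = (gap, reach + gap / 2)
        reach = max(reach, end)
    return best


def split_blocks(boxes: List[Box], min_strength: float = 20.0) -> List[List[Box]]:
    """Recursively split boxes into visual blocks until no strong separator remains."""
    if len(boxes) < 2:
        return [boxes]
    best = None
    for axis in (0, 1):
        gap = widest_gap(boxes, axis)
        if gap and (best is None or gap[0] > best[0][0]):
            best = (gap, axis)
    # Stop when no candidate separator exceeds the strength threshold.
    if best is None or best[0][0] <= min_strength:
        return [boxes]
    (_, position), axis = best
    first = [b for b in boxes if b[axis + 2] <= position]
    second = [b for b in boxes if b[axis] >= position]
    return split_blocks(first, min_strength) + split_blocks(second, min_strength)


if __name__ == "__main__":
    header = (0, 0, 1000, 80)
    article = (0, 120, 700, 900)
    sidebar = (740, 120, 1000, 900)
    for block in split_blocks([header, article, sidebar]):
        print(block)
```

Because a gap is only reported where no box spans it, every box falls on exactly one side of the chosen separator, so each split yields two non-empty sub-blocks and the recursion terminates.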
In a third aspect, an embodiment of the present application further provides a computer storage medium storing computer-executable instructions which, when executed, implement the webpage main body information extraction method according to the first aspect.

The webpage main body information extraction method, equipment and medium provided by the embodiments of the present application have the following beneficial effects:

A Document Object Model (DOM) tree of the webpage and visual rendering information of at least one HTML element are obtained through a headless browser; the webpage is divided into at least one visual block using a visual separation algorithm based on the visual rendering information, and visual feature scores of the visual blocks are calculated; the visual feature scores of the DOM nodes in the DOM tree and the structural feature scores of the DOM nodes are weighted and fused to obtain comprehensive weight scores of the DOM nodes; and finally the DOM nodes whose comprehensive weight scores are greater than a preset weight threshold are determined to be container nodes of the main content, so that information extraction is performed on the container nodes and the main body information of the webpage is