CN-121980109-A - Webpage element similarity detection method based on self-adaptive clustering
Abstract
The invention discloses a webpage element similarity detection method based on self-adaptive clustering, comprising the following steps: S1, input field XPath processing, in which the XPath paths of at least one target field are received as input and their format validity is verified; S2, element path extraction, in which the HTML elements of the corresponding webpage are located according to the verified XPath paths and the complete tag path of each HTML element from its own node to the webpage root node is extracted; S3, similarity calculation; S4, distribution analysis, in which the similarities of all element pairs obtained in step S3 are assembled into a symmetric similarity matrix and kernel density estimation is used to analyze the statistical distribution of the comprehensive similarity values in the matrix; S5, self-adaptive clustering, in which multi-objective clustering is performed on the HTML elements based on the distribution peaks and significant intervals identified in step S4; S6, XPath generation; and S7, output of the optimized general XPath expression. The method offers strong adaptability, high accuracy, good robustness, high efficiency, and strong universality.
Inventors
- LIU BAOQIANG
- XIAO YUNFEI
- HU RUNRONG
Assignees
- 深圳数阔信息技术有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260407
Claims (6)
- 1. A webpage element similarity detection method based on self-adaptive clustering, characterized by comprising the following steps:
  S1, input field XPath processing: receiving the XPath paths of at least one target field as input, verifying the format validity of the XPath paths, and, if an XPath path contains a syntax error, returning an error prompt and terminating the current flow;
  S2, element path extraction: locating the corresponding webpage HTML elements according to the verified XPath paths, extracting the complete tag path of each HTML element from its own node to the webpage root node, constructing element identifiers containing the tag identifiers of the HTML elements, and caching the complete tag paths and the element identifiers;
  S3, similarity calculation: pairing the HTML elements corresponding to all complete tag paths extracted in step S2 two by two to obtain element pairs, calculating a path structure similarity and a class name similarity for each element pair, and fusing the two by weighted combination to obtain the comprehensive similarity of the element pair, wherein the path structure similarity is calculated as the ratio of the common prefix length to the longest path length of the complete tag paths of the two elements, and the class name similarity is calculated as the Jaccard similarity coefficient of the class name sets of the two elements;
  S4, distribution analysis: constructing a symmetric similarity matrix from the comprehensive similarities of all element pairs obtained in step S3, performing statistical distribution analysis on the comprehensive similarity values in the matrix using kernel density estimation (KDE), identifying the distribution peaks of the comprehensive similarity values and the corresponding significant intervals, and dynamically adjusting the parameters of the KDE analysis to optimize the distribution identification result;
  S5, self-adaptive clustering: performing multi-objective clustering on the HTML elements based on the distribution peaks and significant intervals identified in step S4, dynamically adjusting the tolerance range of the clustering to fit the characteristics of the significant intervals, merging clusters whose similarity meets a preset merging condition, and performing a quality assessment on each merged cluster, wherein the quality assessment covers the number of elements in the cluster, the standard deviation of the comprehensive similarity of the elements in the cluster, and the path consistency of the elements in the cluster;
  S6, XPath generation: extracting the common prefix of the complete tag paths of all HTML elements in the cluster with the best quality assessment result from step S5, generating a general XPath expression based on the common prefix, and optimizing the hierarchical structure of the general XPath expression to improve its adaptability to changes in webpage structure;
  S7, output: outputting the optimized general XPath expression, wherein the general XPath expression is used to identify the similar HTML elements corresponding to the target fields in the webpage and supports subsequent applications such as webpage data capture and automated testing.
- 2. The method for detecting web page element similarity based on adaptive clustering according to claim 1, wherein in step S2 the complete tag path and the element identifier are cached as follows: the complete tag path and the corresponding element identifier are stored in association in a memory cache or a local disk cache, and a cache validity period is set according to the update frequency of the web page data; if the complete tag path and the element identifier are needed again within the cache validity period, they are read directly from the cache, and the HTML elements do not need to be located again nor the paths extracted again.
- 3. The method for detecting web page element similarity based on adaptive clustering according to claim 1, wherein in step S3 the weight of the path structure similarity in the weighted combination ranges from 0.5 to 0.8 and the weight of the class name similarity ranges from 0.2 to 0.5, and the path structure similarity is calculated by the following steps:
  S311, segmenting the two complete tag paths of the element pair, splitting each complete tag path into a plurality of path segments by tag level, wherein each path segment corresponds to one HTML tag node;
  S312, starting from the first path segments of the two complete tag paths, comparing the corresponding path segments one by one; if the tag numbers and fixed attributes of two path segments are completely consistent, the path segments are judged to be identical; counting the number of consecutive identical path segments and taking that number as the common prefix length $L_{common}$;
  S313, counting the total number of path segments of each of the two complete tag paths to obtain the path lengths $L_1$ and $L_2$, and taking the larger of $L_1$ and $L_2$ as the longest path length $L_{max} = \max(L_1, L_2)$;
  S314, calculating the path structure similarity $S_{path} = L_{common} / L_{max}$, wherein $S_{path}$ takes values in $[0, 1]$, and the closer $S_{path}$ is to 1, the more similar the path structures of the two elements.
- 4. The method for detecting web page element similarity based on adaptive clustering according to claim 3, wherein in step S3 the class name similarity is calculated by the following steps:
  S321, extracting the class names of the two HTML elements in the element pair; if a single HTML element has a plurality of class names, all of its class names form the class name set of that element, giving the class name sets $C_1$ and $C_2$; if an HTML element has no class name, its class name set is the empty set;
  S322, calculating the intersection $I$ and the union $U$ of the class name sets $C_1$ and $C_2$, wherein $I = C_1 \cap C_2$ and $U = C_1 \cup C_2$;
  S323, calculating the class name similarity $S_{class} = |I| / |U|$, wherein $|I|$ is the number of elements in the intersection $I$, $|U|$ is the number of elements in the union $U$, and $S_{class}$ takes values in $[0, 1]$; if $U$ is empty, then $S_{class} = 1$, indicating that there is no difference in the class name dimension.
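- 5. The method for detecting web page element similarity based on adaptive clustering according to claim 4, wherein in step S3 the weight of the path structure similarity in the weighted combination is set to 0.6 and the weight of the class name similarity is set to 0.4, so that the comprehensive similarity is $S = 0.6 \times S_{path} + 0.4 \times S_{class}$, where $S_{path}$ is the path structure similarity and $S_{class}$ is the class name similarity.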
- 6. The method for detecting web page element similarity based on adaptive clustering according to claim 1, wherein in step S4 dynamically adjusting the parameters of the KDE analysis includes adjusting the kernel function type and the bandwidth, wherein the kernel function type is any one of a Gaussian kernel function, an Epanechnikov kernel function, and a triangular kernel function, and the bandwidth is dynamically adjusted according to the number of samples of the comprehensive similarity values and the degree of dispersion of their distribution.
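The similarity calculation of step S3 and the distribution analysis of step S4 can be illustrated with a minimal Python sketch that uses only the standard library. It is not the patented implementation. All function names, the dict-based element representation (`"path"` and `"classes"` keys), the 101-point evaluation grid, and the 0.01 bandwidth floor are assumptions made for illustration. The claims state only that the bandwidth depends on the sample count and the dispersion; the sketch substitutes the common normal-reference rule of thumb h = 1.06 · σ · n^(-1/5). Path segments are compared as plain strings, two empty class name sets are treated as similarity 1, and only peak detection is shown (significant intervals are omitted).

```python
import math
import statistics

def split_path(path):
    """Split a tag path such as '/html/body/div' into its segments."""
    return [seg for seg in path.split("/") if seg]

def path_similarity(path_a, path_b):
    """Common-prefix length divided by the longer path's length."""
    a, b = split_path(path_a), split_path(path_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty paths count as identical
    common = 0
    for seg_a, seg_b in zip(a, b):
        if seg_a != seg_b:  # simplification: whole-segment string equality
            break
        common += 1
    return common / longest

def class_similarity(classes_a, classes_b):
    """Jaccard coefficient of two class name sets."""
    set_a, set_b = set(classes_a), set(classes_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: neither element has classes, so no difference
    return len(set_a & set_b) / len(union)

def combined_similarity(elem_a, elem_b, w_path=0.6, w_class=0.4):
    """Weighted fusion; the default weights follow claim 5."""
    return (w_path * path_similarity(elem_a["path"], elem_b["path"])
            + w_class * class_similarity(elem_a["classes"], elem_b["classes"]))

def similarity_matrix(elements):
    """Symmetric matrix of pairwise comprehensive similarities."""
    n = len(elements)
    matrix = [[1.0] * n for _ in range(n)]  # an element is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = combined_similarity(elements[i], elements[j])
    return matrix

# The three kernel types named in claim 6.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "triangular": lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,
}

def rule_of_thumb_bandwidth(samples):
    """Assumed rule: shrinks with more samples, grows with dispersion."""
    sigma = statistics.pstdev(samples)
    return max(1.06 * sigma * len(samples) ** -0.2, 0.01)  # floor avoids h = 0

def kde_peaks(samples, kernel="gaussian", grid_size=101):
    """Evaluate a KDE on an evenly spaced grid over [0, 1]; return local maxima."""
    h = rule_of_thumb_bandwidth(samples)
    k = KERNELS[kernel]
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    density = [sum(k((x - s) / h) for s in samples) / (len(samples) * h) for x in xs]
    padded = [-1.0] + density + [-1.0]  # sentinels let grid endpoints be peaks
    return [xs[i] for i in range(grid_size)
            if density[i] > 0 and padded[i] < density[i] >= padded[i + 2]]
```

In use, the off-diagonal entries of `similarity_matrix(...)` would be flattened into the `samples` list passed to `kde_peaks`, and the returned peak positions would seed the clustering of step S5. `kde_peaks` expects a non-empty sample list.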
Description
Webpage element similarity detection method based on self-adaptive clustering
Technical Field
The invention belongs to the technical field of webpage structured data processing, and in particular relates to a webpage element similarity detection method based on self-adaptive clustering.
Background
With the rapid development of internet technology, the value of webpage data is increasingly prominent, and the extraction of structured data from webpages has become a core requirement in fields such as data analysis, business decision-making, and automated testing. When a webpage contains multiple similar data records, such as the product names, prices, and sales volumes in an e-commerce product list, or the titles, release times, and authors in a news list, accurately identifying which field each webpage element belongs to is a key prerequisite for accurate data extraction. Similar data records in webpages typically have similar, but not identical, HTML structures.
At present, industry solutions for webpage element similarity detection and field attribution identification are mainly based on fixed rules or preset templates, and they have several pain points in practical application:
1. Limited coverage of conditional judgments. Condition judgment methods based on fixed rules (for example, a preset rule that the class name of the product name element contains 'name', or that the price element is a particular tag whose class name contains 'price') can cover only a limited range of webpage structure variations. When the class name, tag level, or attribute value of a webpage element falls outside the preset rules (for example, the product name class name is changed to 'item-title' and no longer contains 'name'), the rule fails and the target element cannot be identified.
2. Insufficient scene adaptability. Existing solutions usually design their judgment logic for a specific scenario (such as a particular e-commerce platform or a particular news website) and lack the ability to adapt to diverse scenarios. For example, product price identification logic designed for e-commerce platform A cannot be applied directly to e-commerce platform B, because the price element on platform A uses the class name 'price-red' while the price element on platform B uses 'price-green'. When unforeseen webpage structure changes occur (for example, a webpage is adapted from PC to mobile and its tag hierarchy grows from 3 levels to 5), the existing judgment logic often cannot be adjusted quickly, and field attribution identification fails.
3. High sensitivity to structural changes. Even a minor adjustment to a webpage's HTML structure can cause a traditional field recognition scheme to fail completely. For example, if the class name of a webpage element changes from 'news-title' to 'news-headline', a difference of only one word, a fixed rule based on class name matching becomes invalid. Because of this sensitivity, traditional schemes require frequent manual maintenance, and the maintenance cost is extremely high.
4. Lack of versatility. Solutions optimized for a particular website often rely on that website's unique structural features and cannot be migrated directly to other websites with different structural features. For example, a user information extraction scheme designed for one blog website cannot be reused on others, because that website concentrates user information in div containers with the class name 'user-info', whereas other blog websites may scatter user information across several containers. Even websites of the same type (such as different news websites) can have significantly different element structures owing to differences in development frameworks and design styles, so traditional schemes must be designed separately for each website, and development efficiency is low.
5. Insufficient similarity recognition capability. The prior art lacks an effective similarity analysis mechanism, so data records whose structures are similar but not identical are difficult to recognize. Traditional schemes typically use single-dimensional matching (for example, class name matching alone or tag path matching alone) and cannot evaluate element similarity comprehensively.
In summary, existing webpage element similarity detection and field attribution identification technology has obvious defects in adaptability, universality, robustness, and accuracy, cannot meet the webpage data processing field's need to adapt to diverse webpage structures, and cannot cope with challenges