CN-116150533-B - Webpage data processing method and system

Abstract

The invention discloses a method and a system for processing webpage data, applied in the field of analysis. The method comprises: in response to a change in the webpage data of a preset webpage, acquiring new webpage data of the preset webpage; separately processing the title content and the text content in the new webpage data to obtain a keyword set corresponding to the title content and a text vector corresponding to the text content; determining a screening coefficient of the new webpage data based on the keyword set and the text vector; and storing the new webpage data in response to the screening coefficient being greater than a preset screening coefficient. The invention solves the technical problem that screened text has a low degree of matching.

Inventors

MA LONGFEI
WANG JIAN
HU CAIE
ZENG JIANI
XU HUI
LU SIYUE
WANG LIYONG
LI XIANGLONG
DING YIFENG
ZHOU WENBIN
GAO XIN

Assignees

State Grid Beijing Electric Power Company (国网北京市电力公司)
State Grid Corporation of China (国家电网有限公司)

Dates

Publication Date: 20260508
Application Date: 20221229

Claims (11)

1. A method for processing webpage data, comprising: in response to a change in the webpage data of a preset webpage, acquiring new webpage data of the preset webpage; separately processing the title content and the text content in the new webpage data to obtain a keyword set corresponding to the title content and a text vector corresponding to the text content; determining a screening coefficient of the new webpage data based on the keyword set and the text vector, wherein the screening coefficient represents the degree to which the new webpage data matches a preset screening condition; and storing the new webpage data in response to the screening coefficient being greater than a preset screening coefficient; wherein determining the screening coefficient of the new webpage data based on the keyword set and the text vector comprises: obtaining a preset keyword set and a preset text vector set; matching each keyword in the keyword set with the preset keyword set to obtain a first coefficient of the keyword set, wherein the first coefficient represents the degree to which the keyword set matches the preset keyword set; matching the text vector with the preset text vector set to obtain a second coefficient of the text vector, wherein the second coefficient represents the degree to which the text vector matches the preset text vector set; and computing a weighted sum of the first coefficient and the second coefficient to obtain the screening coefficient.
2. The method of claim 1, wherein matching each keyword in the keyword set with the preset keyword set to obtain the first coefficient of the keyword set comprises: determining the number of keywords in the keyword set to obtain a first number; matching each keyword in the keyword set with the preset keyword set to obtain a score of each keyword, wherein the score represents whether the keyword was successfully matched with the preset keyword set; obtaining the sum of the scores of all keywords in the keyword set to obtain a total score; obtaining the product of the first number and a preset score to obtain a target score, wherein the preset score represents the average value of the scores of all preset keywords in the preset keyword set; and obtaining the ratio of the total score to the target score to obtain the first coefficient.
3. The method of claim 1, wherein matching the text vector with the preset text vector set to obtain the second coefficient of the text vector comprises: determining the number of preset text vectors in the preset text vector set to obtain a second number, wherein different preset text vectors in the preset text vector set represent text vectors of templates of different types of text content; obtaining the similarity between the text vector and each of a plurality of preset text vectors in the preset text vector set to obtain a plurality of similarities; obtaining the sum of the plurality of similarities to obtain a total similarity; obtaining the product of the second number and a preset similarity to obtain a target similarity; and obtaining the ratio of the total similarity to the target similarity to obtain the second coefficient.
4. The method of claim 1, wherein separately processing the title content and the text content in the new webpage data to obtain the keyword set corresponding to the title content and the text vector corresponding to the text content comprises: screening the new webpage data to obtain a text to be screened; performing word segmentation processing on the title content in the text to be screened to obtain the keyword set; and performing semantic analysis on the text content in the text to be screened to obtain the text vector.
5. The method of claim 4, wherein performing word segmentation processing on the title content in the text to be screened to obtain the keyword set comprises: filtering the title content to obtain a first text; and performing word segmentation on the first text by using a word segmentation algorithm to obtain the keyword set.
6. The method of claim 4, wherein performing semantic analysis on the text content in the text to be screened to obtain the text vector comprises: performing word segmentation processing on the text content by using a word segmentation algorithm to obtain a second text; determining weights of words in the second text; acquiring a vector representation of the second text; and determining the text vector based on the weights and the vector representation.
7. The method of claim 6, wherein determining the weights of words in the second text comprises: processing a word by using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain an initial weight of the word; determining a first weight parameter of the word based on the word and a first word located on a different page in the second text; determining a second weight parameter of the word based on the word and a second word located in a different line in the second text; determining a third weight parameter of the word based on the word and all words in the second text; and adjusting the initial weight based on the first weight parameter, the second weight parameter and the third weight parameter to obtain the weight of the word in the second text.
8. A webpage data processing apparatus, comprising: an acquisition module, configured to acquire new webpage data of a preset webpage in response to a change in the webpage data of the preset webpage; a processing module, configured to separately process the title content and the text content in the new webpage data to obtain a keyword set corresponding to the title content and a text vector corresponding to the text content; a determining module, configured to determine a screening coefficient of the new webpage data based on the keyword set and the text vector, wherein the screening coefficient represents the degree to which the new webpage data matches a preset screening condition; and a storage module, configured to store the new webpage data in response to the screening coefficient being greater than a preset screening coefficient; wherein the determining module is further configured to: obtain a preset keyword set and a preset text vector set; match each keyword in the keyword set with the preset keyword set to obtain a first coefficient of the keyword set, wherein the first coefficient represents the degree to which the keyword set matches the preset keyword set; match the text vector with the preset text vector set to obtain a second coefficient of the text vector, wherein the second coefficient represents the degree to which the text vector matches the preset text vector set; and compute a weighted sum of the first coefficient and the second coefficient to obtain the screening coefficient.
9. A system for processing webpage data, comprising: a monitoring module, configured to acquire new webpage data of a preset webpage in response to a change in the webpage data of the preset webpage; a screening module, connected to the monitoring module and configured to separately process the title content and the text content in the new webpage data to obtain a keyword set corresponding to the title content and a text vector corresponding to the text content, and to determine a screening coefficient of the new webpage data based on the keyword set and the text vector, wherein the screening coefficient represents the degree to which the new webpage data matches a preset screening condition; and a database module, configured to store the new webpage data in response to the screening coefficient being greater than a preset screening coefficient; wherein the screening module is further configured to: obtain a preset keyword set and a preset text vector set; match each keyword in the keyword set with the preset keyword set to obtain a first coefficient of the keyword set, wherein the first coefficient represents the degree to which the keyword set matches the preset keyword set; match the text vector with the preset text vector set to obtain a second coefficient of the text vector, wherein the second coefficient represents the degree to which the text vector matches the preset text vector set; and compute a weighted sum of the first coefficient and the second coefficient to obtain the screening coefficient.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program, when run, controls a device in which the computer readable storage medium is located to perform the method of any one of claims 1 to 7.
11. A processor for running a program, wherein the program when run performs the method of any one of claims 1 to 7.
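
As an illustration of the text-vector construction recited in claims 6 and 7, the following is a minimal sketch and not the patented implementation. It assumes the text has already been segmented into tokens, uses a smoothed TF-IDF formula for the initial weights, and combines the weights with per-word vectors by a weighted average. The names `tf_idf_weights`, `text_vector`, `embeddings` and `adjust` are hypothetical; the three weight parameters of claim 7 are not specified in the claims and are represented here only by the optional `adjust` callback.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus):
    """Return an initial TF-IDF weight for each distinct word in doc_tokens.

    corpus is a list of token lists used to count document frequencies.
    The smoothed IDF below is an assumption; the claims only name TF-IDF.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = tf * idf
    return weights


def text_vector(doc_tokens, corpus, embeddings, adjust=None):
    """Weighted average of word vectors, using (optionally adjusted) TF-IDF weights.

    embeddings maps a word to a list of floats (hypothetical toy vectors).
    adjust, if given, maps a word to a multiplicative factor standing in for
    the first, second and third weight parameters of claim 7.
    """
    weights = tf_idf_weights(doc_tokens, corpus)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total_weight = 0.0
    for word, weight in weights.items():
        if word not in embeddings:
            continue  # words without a vector are skipped
        if adjust is not None:
            weight *= adjust(word)
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
        total_weight += weight
    if total_weight == 0.0:
        return vec
    return [v / total_weight for v in vec]
```

With toy one-hot vectors for two words, the resulting text vector leans toward the word that is frequent in the document and rare in the corpus, which is the behaviour TF-IDF weighting is meant to produce.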

Description

Webpage data processing method and system

Technical Field

The invention relates to the field of analysis, and in particular to a webpage data processing method and system.

Background

At present, the external environment of the power industry changes constantly. If important policies and directions cannot be captured in time, the operation and development of a company are adversely affected, so the relevant data must be obtainable quickly when texts are analyzed. However, in the prior art, when webpage content is screened, the title on the webpage is segmented into words and matching is performed on the resulting segments to judge whether the webpage content meets the screening requirement. Because a search based on the title cannot fully match the content, the screening result is inaccurate.

No effective solution to the above problem has yet been proposed.

Disclosure of Invention

The embodiments of the invention provide a method and a system for processing webpage data, to at least solve the technical problem that screened text has a low degree of matching.
According to one aspect of the embodiments of the invention, a method for processing webpage data is provided, comprising: in response to a change in the webpage data of a preset webpage, acquiring new webpage data of the preset webpage; separately processing the title content and the text content in the new webpage data to obtain a keyword set corresponding to the title content and a text vector corresponding to the text content; determining a screening coefficient of the new webpage data based on the keyword set and the text vector, wherein the screening coefficient represents the degree to which the new webpage data matches a preset screening condition; and storing the new webpage data in response to the screening coefficient being greater than a preset screening coefficient. Determining the screening coefficient of the new webpage data based on the keyword set and the text vector comprises: obtaining a preset keyword set and a preset text vector set; matching each keyword in the keyword set with the preset keyword set to obtain a first coefficient of the keyword set, wherein the first coefficient represents the degree to which the keyword set matches the preset keyword set; matching the text vector with the preset text vector set to obtain a second coefficient of the text vector, wherein the second coefficient represents the degree to which the text vector matches the preset text vector set; and computing a weighted sum of the first coefficient and the second coefficient to obtain the screening coefficient.
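The final two steps described above, the weighted sum and the storage decision, can be sketched as follows. This is an illustrative sketch only: the function names and the default weights of 0.5 are assumptions, since the text does not state how the two coefficients are weighted.

```python
def screening_coefficient(first_coefficient, second_coefficient,
                          keyword_weight=0.5, vector_weight=0.5):
    """Weighted sum of the keyword-based and vector-based coefficients.

    The weights are hypothetical defaults; the source does not specify them.
    """
    return keyword_weight * first_coefficient + vector_weight * second_coefficient


def should_store(first_coefficient, second_coefficient,
                 preset_screening_coefficient,
                 keyword_weight=0.5, vector_weight=0.5):
    """Return True when the new webpage data should be stored.

    Storage happens only if the screening coefficient is strictly greater
    than the preset screening coefficient, as the method states.
    """
    coefficient = screening_coefficient(
        first_coefficient, second_coefficient, keyword_weight, vector_weight
    )
    return coefficient > preset_screening_coefficient
```

For example, with equal weights, coefficients of 0.8 and 0.4 give a screening coefficient of about 0.6, which passes a preset threshold of 0.5, while coefficients of 0.2 and 0.4 do not.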
Optionally, matching each keyword in the keyword set with the preset keyword set to obtain the first coefficient of the keyword set comprises: determining the number of keywords in the keyword set to obtain a first number; matching each keyword in the keyword set with the preset keyword set to obtain a score of each keyword, wherein the score represents whether the keyword was successfully matched with the preset keyword set; obtaining the sum of the scores of all keywords in the keyword set to obtain a total score; obtaining the product of the first number and a preset score to obtain a target score, wherein the preset score represents the average value of the scores of all preset keywords in the preset keyword set; and obtaining the ratio of the total score to the target score to obtain the first coefficient.
Optionally, matching the text vector with the preset text vector set to obtain the second coefficient of the text vector comprises: determining the number of preset text vectors in the preset text vector set to obtain a second number, wherein different preset text vectors in the preset text vector set represent text vectors of templates of different types of text content; obtaining the similarity between the text vector and each of a plurality of preset text vectors in the preset text vector set to obtain a plurality of similarities; obtaining the sum of the plurality of similarities to obtain a total similarity; obtaining the product of the second number and a preset similarity to obtain a target similarity; and obtaining the ratio of the total similarity to the target similarity to obtain the second coefficient.
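The two ratios described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: a keyword scores 1 on a successful match and 0 otherwise, cosine similarity stands in for the unspecified similarity measure, and the `preset_score` and `preset_similarity` defaults of 1.0 are hypothetical.

```python
import math


def first_coefficient(keywords, preset_keywords, preset_score=1.0):
    """Ratio of the total keyword score to the target score.

    Each keyword scores 1.0 if it appears in preset_keywords, else 0.0
    (an assumed scoring rule). Target score = first number * preset score.
    """
    first_number = len(keywords)
    if first_number == 0:
        return 0.0
    total_score = sum(1.0 for keyword in keywords if keyword in preset_keywords)
    target_score = first_number * preset_score
    return total_score / target_score


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def second_coefficient(text_vector, preset_vectors, preset_similarity=1.0):
    """Ratio of the total similarity to the target similarity.

    Target similarity = second number * preset similarity.
    """
    second_number = len(preset_vectors)
    if second_number == 0:
        return 0.0
    total_similarity = sum(
        cosine_similarity(text_vector, preset) for preset in preset_vectors
    )
    target_similarity = second_number * preset_similarity
    return total_similarity / target_similarity
```

For instance, if two of three title keywords appear in the preset keyword set, the first coefficient is 2/3; if the text vector is identical to one of two preset template vectors and orthogonal to the other, the second coefficient is 0.5.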
Optionally, separately processing the title content and the text content in the new webpage data comprises: screening the new webpage data to obtain a text to be screened; performing word segmentation on the title content in the text to be screened to obtain the keyword set; and performing semantic analysis on the text content in the text to be screened to obtain the text vector. Optionally, performing word segmentation on the title content in the text to be screened to obtain the keyword set comprises: filtering the title content to obtain a first text, and word segmentation is carrie