CN-121979942-A - Internet information integration method and device and electronic equipment
Abstract
The present invention relates to the field of data integration technologies, and in particular, to an internet information integration method, an internet information integration device, and an electronic device. The method comprises: preprocessing internet multi-source information; carrying out data analysis and structuring on the internet multi-source information to obtain a standardized data model; constructing a multi-source assessment model based on the standardized data model to evaluate the credibility of the internet multi-source information; carrying out information fusion and conflict resolution on the internet multi-source information based on comprehensive credibility scores; and constructing an integrated database based on the internet multi-source information after the information fusion and conflict resolution. By constructing a multi-source acquisition engine, a heterogeneous parser, a credibility evaluation matrix and an intelligent fusion engine, the invention achieves automated, structured and credible integration of enterprise-related internet information.
Inventors
- Min Xuejuan
Assignees
- 元象互动(北京)科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260123
Claims (10)
- 1. An internet information integration method, comprising: preprocessing internet multi-source information; carrying out data analysis and structuring on the internet multi-source information to obtain a standardized data model; constructing a multi-source assessment model based on the standardized data model so as to evaluate the credibility of the internet multi-source information; carrying out information fusion and conflict resolution on the internet multi-source information based on a comprehensive credibility score; and constructing an integrated database based on the internet multi-source information after information fusion and conflict resolution.
- 2. The internet information integration method according to claim 1, wherein word segmentation is performed on the text content in the internet multi-source information, and stop words contained in the text content are removed to obtain text keywords; an MD5 hash value is calculated for each text keyword to obtain a 128-bit binary vector; the vector is weighted according to the word frequency weight of the text keyword in the text content, wherein when the word frequency weight is a positive value, the weight of the bits corresponding to 1 in the 128-bit binary vector is set equal to the word frequency weight and the weight of the bits corresponding to 0 is set to the word frequency weight, and when the word frequency weight is a negative value, the weight of the bits corresponding to 1 in the 128-bit binary vector is set equal to the word frequency weight and the weight of the bits corresponding to 0 is set to the word frequency weight; the weighted vectors of all text keywords are added bit by bit to obtain a comprehensive weight vector; each bit of the comprehensive weight vector is evaluated, and the bit is set to 1 if its weight is greater than 0 and to 0 if its weight is less than or equal to 0, so as to obtain a 128-bit semantic fingerprint; the Hamming distance between the semantic fingerprint of newly acquired internet multi-source information and the semantic fingerprints of stored internet multi-source information is analyzed, and when the Hamming distance is less than or equal to 3, the newly acquired internet multi-source information is judged to be highly similar content and is removed.
- 3. The internet information integration method according to claim 2, wherein an initial parsing template is generated for the internet multi-source information of each information source category; for HTML/XML pages, key fields are extracted using XPath combined with CSS selectors; JSON/API return data are parsed using key-value mapping rules; for unstructured text, key elements are extracted using entity recognition and relation extraction techniques; and the parser converts the extracted raw internet multi-source information according to a preset field mapping table, so as to uniformly map it into the standardized data model.
- 4. The internet information integration method according to claim 3, wherein a hierarchical assignment method is adopted to set information source authority scores for internet multi-source information of different information source categories; for the same field j, the values collected from N different information sources are V(j) = {v1, v2, ..., vN}, and a coefficient of variation CV(j) is calculated, wherein when CV(j) ≤ 0.1 the information consistency score is set to 1, when 0.1 < CV(j) < 0.5 the information consistency score is set to 1 - (CV(j) - 0.1)/0.4, and when CV(j) ≥ 0.5 the information consistency score is set to 0; the time interval between the release time of the internet multi-source information and the current time is taken as the released duration of the information, and an information timeliness score is determined based on the released duration, wherein information timeliness score = max(0, 1 - released duration / preset timeliness parameter); and a required field set is predefined for each type of internet multi-source information, the number of fields successfully filled during parsing is counted, and the ratio of the number of successfully filled fields to the number of fields in the required field set is taken as an information integrity score.
- 5. The method of claim 4, wherein the information source authority score, the information consistency score, the information timeliness score, and the information integrity score of the internet multi-source information are weighted to determine a comprehensive credibility score, wherein the comprehensive credibility score=authority weight×information source authority score+consistency weight×information consistency score+timeliness weight×information timeliness score+integrity weight×information integrity score.
- 6. The internet information integration method according to claim 5, wherein the value of the same field in the internet multi-source information is v(i) and the corresponding comprehensive credibility score is T(i), from which the final value of the field is determined; text fields in the internet multi-source information are fused by extracting the record with the highest comprehensive credibility score for the field, taking the content of that record as the main version, traversing the other internet multi-source information for the same field, and analyzing the text similarity between the main version and each of them using a text similarity algorithm; and content whose text similarity is lower than 0.7 is stored in an association list as an alternative description.
- 7. The internet information integration method according to claim 6, wherein conflict resolution is performed on logical conflicts in the internet multi-source information: when the values of the same field in different items of internet multi-source information are inconsistent, the information source authority scores of those items are compared; if the absolute difference between the information source authority scores is less than 0.1, the content with the later release time is adopted; and if the absolute difference between the information source authority scores is greater than or equal to 0.1, the content with the higher information source authority score is adopted.
- 8. The internet information integration method according to claim 7, wherein the unified social credit code of an enterprise is used as the primary key, and an enterprise portrait summary table and sub-tables for a plurality of different information source categories are built in the integrated database; the integrated internet multi-source information is aggregated by enterprise through the parsing process of the internet multi-source information, and timelines are sorted in reverse order of release time; and the internet multi-source information is labeled with a credibility level according to its comprehensive credibility score, being marked as high-credibility data when the comprehensive credibility score is greater than or equal to 0.8, as medium-credibility data when the comprehensive credibility score is greater than or equal to 0.5 and less than 0.8, and as low-credibility data to be verified when the comprehensive credibility score is less than 0.5.
- 9. An internet information integration apparatus applied to the internet information integration method according to any one of claims 1 to 8, comprising: an information processing module for preprocessing the internet multi-source information; an information analysis module for carrying out data analysis and structuring on the internet multi-source information so as to obtain a standardized data model; a credibility evaluation module for constructing a multi-source assessment model based on the standardized data model so as to evaluate the credibility of the internet multi-source information; a fusion analysis module for carrying out information fusion and conflict resolution on the internet multi-source information based on the comprehensive credibility score; and a database construction module for constructing an integrated database based on the internet multi-source information after information fusion and conflict resolution.
- 10. An electronic device, comprising: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the internet information integration method of any one of claims 1 to 8.
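The deduplication step recited in claim 2 resembles the SimHash technique, and can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the stop-word list, the whitespace tokenizer (standing in for word segmentation) and the use of raw term counts as word frequency weights are all hypothetical. The translated claim does not preserve a sign convention for the bit weights, so the sketch assumes the usual SimHash convention in which a 1 bit adds the weight and a 0 bit subtracts it.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 128   # width of an MD5 digest
HAMMING_THRESHOLD = 3    # similarity threshold stated in claim 2

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def extract_keywords(text):
    """Whitespace splitting stands in for the word segmentation step."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def semantic_fingerprint(text):
    """Return the 128-bit semantic fingerprint of `text` as a Python int."""
    term_weights = Counter(extract_keywords(text))  # assumption: raw term counts
    totals = [0] * FINGERPRINT_BITS
    for word, weight in term_weights.items():
        digest = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest(), "big")
        for bit in range(FINGERPRINT_BITS):
            # Assumed convention: a 1 bit adds the weight, a 0 bit subtracts it.
            totals[bit] += weight if (digest >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:  # greater than 0 gives 1; less than or equal to 0 gives 0
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_text, stored_fingerprints):
    """True if `new_text` is within the threshold of any stored fingerprint."""
    fp = semantic_fingerprint(new_text)
    return any(hamming_distance(fp, s) <= HAMMING_THRESHOLD
               for s in stored_fingerprints)
```

In use, a newly acquired item would be dropped when `is_near_duplicate` returns `True`, and its fingerprint appended to the stored set otherwise. A linear scan over stored fingerprints is the simplest choice; a large store would likely need an index, which the source does not describe.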
Description
Internet information integration method and device and electronic equipment
Technical Field
The present invention relates to the field of data integration technologies, and in particular, to an internet information integration method, an internet information integration device, and an electronic device.
Background
With the rapid development of internet technology, the volume of information has grown explosively, and multi-source information from websites, platforms, databases and other sources coexists, providing rich data sources for applications such as enterprise data integration, network information monitoring and market analysis. However, the heterogeneity, duplication, varying credibility and potential logical conflicts of multi-source information make information integration inefficient, make credibility hard to guarantee, and make it difficult to directly support accurate decision-making and efficient application.
The prior art generally lacks systematic deduplication and similar-content recognition mechanisms for internet information integration, resulting in a large amount of redundant information in the integrated database. In the information parsing stage, it often depends on fixed templates, adapts poorly to diverse and dynamically changing web page structures, and has a high parsing failure rate. Credibility evaluation of multi-source information typically considers only a single factor (such as source authority) and lacks a multi-dimensional, comprehensive quantitative evaluation model, so the evaluation result is one-sided. During fusion, information is simply stacked or randomly selected, with no credibility-based intelligent fusion and conflict resolution strategy, so the accuracy and consistency of the fused information are difficult to ensure. The final constructed database is often loosely structured, is not effectively aggregated by business entity (such as enterprise), and lacks intuitive labels for data credibility, which hinders subsequent efficient use and in-depth analysis.
Disclosure of Invention
The invention aims to provide an internet information integration method, an internet information integration device and electronic equipment, so as to solve at least one of the problems in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme: an internet information integration method, comprising: preprocessing internet multi-source information; carrying out data analysis and structuring on the internet multi-source information to obtain a standardized data model; constructing a multi-source assessment model based on the standardized data model so as to evaluate the credibility of the internet multi-source information; carrying out information fusion and conflict resolution on the internet multi-source information based on the comprehensive credibility score; and constructing an integrated database based on the internet multi-source information after information fusion and conflict resolution.
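The credibility evaluation, conflict resolution and credibility labeling in this scheme follow the thresholds and formulas recited in claims 4, 5, 7 and 8, and might be sketched in Python as below. The sketch rests on several assumptions that the source does not state: the weight values in `DEFAULT_WEIGHTS` are hypothetical, the coefficient of variation uses the population standard deviation, a zero mean is treated as consistent only when every value is zero, and the record layout (`authority`, `published`, `value`) is invented for illustration.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical weights (authority, consistency, timeliness, integrity).
DEFAULT_WEIGHTS = (0.4, 0.2, 0.2, 0.2)


def consistency_score(values):
    """Piecewise score from the coefficient of variation CV(j)."""
    mu = mean(values)
    if mu == 0:  # CV is undefined here; this handling is an assumption
        return 1.0 if all(v == 0 for v in values) else 0.0
    cv = pstdev(values) / abs(mu)
    if cv <= 0.1:
        return 1.0
    if cv >= 0.5:
        return 0.0
    return 1.0 - (cv - 0.1) / 0.4


def timeliness_score(released_days, timeliness_param_days):
    """max(0, 1 - released duration / preset timeliness parameter)."""
    return max(0.0, 1.0 - released_days / timeliness_param_days)


def integrity_score(record, required_fields):
    """Share of required fields that were successfully filled."""
    filled = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    return filled / len(required_fields)


def comprehensive_score(authority, consistency, timeliness, integrity,
                        weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four sub-scores."""
    wa, wc, wt, wi = weights
    return wa * authority + wc * consistency + wt * timeliness + wi * integrity


def credibility_label(score):
    """Labels with the 0.8 and 0.5 cut-offs."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"  # low-credibility data to be verified


def resolve_conflict(a, b):
    """Prefer the later release when authority scores differ by less than 0.1;
    otherwise prefer the record with the higher authority score."""
    if abs(a["authority"] - b["authority"]) < 0.1:
        return a if a["published"] >= b["published"] else b
    return a if a["authority"] > b["authority"] else b
```

For example, a record with authority 1.0, consistency 1.0, timeliness 0.75 and integrity 0.5 scores roughly 0.85 under the hypothetical weights and would be labeled `"high"`. Authority scores that differ by almost exactly 0.1 can fall on either side of the threshold because of floating-point rounding, so a production version would probably compare scaled integers or use a tolerance.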
Further, word segmentation is performed on the text content in the internet multi-source information, and stop words contained in the text content are removed to obtain text keywords; an MD5 hash value is calculated for each text keyword to obtain a 128-bit binary vector; the vector is weighted according to the word frequency weight of the text keyword in the text content, wherein when the word frequency weight is a positive value, the weight of the bits corresponding to 1 in the 128-bit binary vector is set equal to the word frequency weight and the weight of the bits corresponding to 0 is set to the word frequency weight, and when the word frequency weight is a negative value, the weight of the bits corresponding to 1 in the 128-bit binary vector is set equal to the word frequency weight and the weight of the bits corresponding to 0 is set to the word frequency weight; the weighted vectors of all text keywords are added bit by bit to obtain a comprehensive weight vector; each bit of the comprehensive weight vector is evaluated, and the bit is set to 1 if its weight is greater than 0 and to 0 if its weight is less than or equal to 0, so as to obtain a 128-bit semantic fingerprint; the Hamming distance between the semantic fingerprint of newly acquired internet multi-source information and the semantic fingerprints of stored internet multi-source information is analyzed, and when the Hamming distance is less than or equal to 3, the newly acquired internet multi-source information is judged to be highly similar content and is removed. Further, an initial parsing template is generated for the internet multi-source information of each information source category; for HTML/XML pages, key fields are extracted using XPath combined with CSS selectors; JSON/API return data are analyzed by ado