CN-122019523-A - Big data information acquisition method


Abstract

The invention relates to a big data information acquisition method comprising the following steps: acquiring search data, browsing data, consultation data and auxiliary data of multiple users in an application system; performing invalid data removal and noise data filtering on the acquired data, wherein the invalid data removal comprises rejection of null values and meaningless data, deduplication of repeated data, and format standardization, and the noise data filtering comprises rule filtering and statistical filtering; and constructing an invalid and noise data judgment model based on the processed valid data and on the invalid and noise data. Double screening, first by data preprocessing and then by the judgment model, effectively removes invalid data and noise data and thereby improves the accuracy of data information collection. Automatic judgment by the model, combined with an incremental updating mechanism, lets the model be optimized continuously as data accumulates, which reduces both the manual screening workload and the misjudgment rate of automatic judgment.

Inventors

JI CHENCHEN

Assignees

河北知度知识产权服务有限公司

Dates

Publication Date: 20260512
Application Date: 20260130

Claims (10)

1. A big data information acquisition method, characterized by comprising the following steps: Step 1, acquiring search data, browsing data, consultation data and auxiliary data of multiple users in an application system, wherein the auxiliary data comprises user identity tags; Step 2, performing invalid data removal and noise data filtering on the data acquired in Step 1, wherein the invalid data removal comprises rejection of null values and meaningless data, deduplication of repeated data, and format standardization, and the noise data filtering comprises rule filtering and statistical filtering; Step 3, constructing an invalid and noise data judgment model based on the valid data processed in Step 2 and on the invalid and noise data; Step 4, judging the data with the invalid and noise data judgment model and outputting a judgment result of valid data, invalid data or noise data, triggering manual rechecking of data that falls within a preset suspicious interval, and feeding the recheck results back to Step 3 as a supervision signal that drives the model update.
2. The big data information acquisition method according to claim 1, wherein the invalid and noise data judgment model in Step 3 is established as follows: Step 31, extracting quantitative features of the valid data processed in Step 2, wherein the quantitative features comprise at least three of keyword relevance, normalized browsing duration, text topic matching degree, attachment integrity coefficient and user identity matching degree; Step 32, completing initial training of the model using a manually labeled sample set, wherein the calculation formula of the invalid and noise data judgment model is: $\hat{y}_t = \sigma(w_t \cdot x + b_t)$, where $\hat{y}_t$ is the predicted value of the model at time $t$, $\sigma$ is the activation function, which maps the weighted feature sum to the $[0,1]$ probability interval, $w_t$ is the feature weight vector at time $t$, $x$ is the sample feature vector, and $b_t$ is the bias term at time $t$.
3. The big data information acquisition method according to claim 2, wherein the model update driven in Step 4 comprises updating the feature weight vector and the bias term, and the update formulas are: $w_{t+1} = w_t + \eta\,(y - \hat{y}_t)\,x$ and $b_{t+1} = b_t + \eta\,(y - \hat{y}_t)$, where $w_{t+1}$ and $b_{t+1}$ are the updated feature weight vector and bias term respectively, $\eta$ is the learning rate, and $y$ is the true label of the sample.
4. The big data information acquisition method according to claim 1, wherein the search data comprises search keywords, search count, search result click count, and search time window.
5. The big data information acquisition method according to claim 4, wherein the browsing data comprises browsed page type, single-page stay time, page interaction behavior, and browsing path.
6. The big data information acquisition method according to claim 5, wherein the consultation data comprises consultation text content, consultation attachments, consultation channel, consultation time and reply interaction records.
7. The big data information acquisition method according to claim 1, wherein the rejection of null values and meaningless data is configured to reject data with an empty search keyword, data whose consultation text is shorter than a certain number of characters, and browsing-page data with an invalid user identity tag.
8. The big data information acquisition method according to claim 7, wherein the deduplication of repeated data removes fully duplicated behavior records based on the combination of user identity tag, behavior type and behavior time.
9. The big data information acquisition method according to claim 8, wherein the format standardization unifies the text encoding format and removes garbled characters and meaningless special symbols.
10. The big data information acquisition method according to claim 1, wherein the rule filtering sets thresholds to reject data whose browsing duration is shorter than a certain duration, whose search keywords have low relevance to the service topic, or whose consultation text contains a large proportion of irrelevant topic words, and the statistical filtering calculates the normal distribution of the behavior data and rejects extreme outliers that fall outside the range covering 99% of user behavior.
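
The preprocessing recited in claims 7 to 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Record` type, the thresholds `MIN_CONSULT_CHARS` and `MIN_BROWSE_SECONDS`, and the choice of NFKC normalization are all assumptions, since the claims leave these details open. The statistical filter assumes that "99% of user behavior range" means the central 99% of a fitted normal distribution (about 2.576 standard deviations). The keyword-relevance and topic-word rules are omitted because the source does not specify a relevance measure.

```python
import statistics
import unicodedata
from dataclasses import dataclass, replace

MIN_CONSULT_CHARS = 5      # hypothetical threshold; the claims do not fix a value
MIN_BROWSE_SECONDS = 2.0   # hypothetical threshold; the claims do not fix a value
Z_LIMIT = 2.576            # central 99% of a normal distribution (assumed reading)


@dataclass(frozen=True)
class Record:              # hypothetical record layout for illustration only
    user_tag: str          # user identity tag (auxiliary data)
    kind: str              # "search", "browse" or "consult"
    timestamp: int         # behavior time, e.g. epoch seconds
    text: str = ""         # search keyword or consultation text
    duration: float = 0.0  # single-page stay time in seconds


def normalize(text: str) -> str:
    """Format standardization: unify the text form and drop garbled characters."""
    text = unicodedata.normalize("NFKC", text)
    kept = "".join(
        ch for ch in text
        if ch != "\ufffd"  # Unicode replacement character left by bad decoding
        and (ch.isspace() or unicodedata.category(ch)[0] != "C")
    )
    return " ".join(kept.split())  # collapse runs of whitespace


def is_invalid(r: Record) -> bool:
    """Null-value and meaningless-data rejection."""
    if not r.user_tag:
        return True
    if r.kind == "search" and not r.text:
        return True
    return r.kind == "consult" and len(r.text) < MIN_CONSULT_CHARS


def remove_invalid(records):
    """Normalize, reject invalid records, and deduplicate by (tag, type, time)."""
    seen, out = set(), []
    for r in records:
        r = replace(r, user_tag=r.user_tag.strip(), text=normalize(r.text))
        key = (r.user_tag, r.kind, r.timestamp)
        if is_invalid(r) or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


def filter_noise(records):
    """Rule filtering on short stays, then statistical filtering of outliers."""
    ruled = [r for r in records
             if not (r.kind == "browse" and r.duration < MIN_BROWSE_SECONDS)]
    durations = [r.duration for r in ruled if r.kind == "browse"]
    if len(durations) < 2:
        return ruled
    mean = statistics.fmean(durations)
    std = statistics.pstdev(durations)
    if std == 0:
        return ruled
    return [r for r in ruled
            if r.kind != "browse" or abs(r.duration - mean) <= Z_LIMIT * std]
```

In this sketch the rule filter runs before the statistics are computed, so that very short stays do not distort the mean and standard deviation used by the outlier test.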

Description

Big data information acquisition method

Technical Field

The invention relates to the technical field of data processing, in particular to a big data information acquisition method.

Background

With the popularization of internet applications, application systems accumulate ever more user behavior data (such as search, browsing and consultation data), and this data is the core basis for analyzing user needs and optimizing the service experience. However, conventional data acquisition and processing relies on fixed-rule filtering. When the service scenario or the information entered by users changes, the rules must be adjusted manually and frequently, which makes operation cumbersome and the response slow. In addition, user actions such as accidental clicks, repeated submissions and meaningless input generate a large amount of invalid data, and noise such as search keywords unrelated to the service and very short browsing records is mixed in, so the utilization rate of valid data is low.

In view of the above, there is a need for a big data information acquisition method that can collect user behavior data, reject invalid and noise data, and continuously improve judgment accuracy through model updates.

Disclosure of Invention

(I) Technical problem to be solved

The invention provides a big data information acquisition method that solves the problems described in the background.

(II) Technical scheme

To achieve the above purpose, the invention provides the following technical scheme: a big data information acquisition method comprising the following steps:

Step 1, acquiring search data, browsing data, consultation data and auxiliary data of multiple users in an application system, wherein the auxiliary data comprises user identity tags;

Step 2, performing invalid data removal and noise data filtering on the data acquired in Step 1, wherein the invalid data removal comprises rejection of null values and meaningless data, deduplication of repeated data, and format standardization, and the noise data filtering comprises rule filtering and statistical filtering;

Step 3, constructing an invalid and noise data judgment model based on the valid data processed in Step 2 and on the invalid and noise data;

Step 4, judging the data with the invalid and noise data judgment model and outputting a judgment result of valid data, invalid data or noise data, triggering manual rechecking of data that falls within a preset suspicious interval, and feeding the recheck results back to Step 3 as a supervision signal that drives the model update.
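
The routing in Step 4 can be sketched as follows. This is only an illustration under stated assumptions: the model score is taken to be the probability that a record is valid, invalid and noise data are merged into one verdict, and the interval bounds `SUSPICIOUS_LOW` and `SUSPICIOUS_HIGH` are hypothetical, since the source says only that the suspicious interval is preset. The `reviewer` callback stands in for the manual recheck.

```python
from enum import Enum

SUSPICIOUS_LOW = 0.4    # hypothetical lower bound of the suspicious interval
SUSPICIOUS_HIGH = 0.6   # hypothetical upper bound of the suspicious interval


class Verdict(Enum):
    VALID = "valid"
    INVALID_OR_NOISE = "invalid_or_noise"
    NEEDS_REVIEW = "needs_review"


def triage(p_valid: float) -> Verdict:
    """Map a model score (assumed probability of 'valid') to a verdict."""
    if not 0.0 <= p_valid <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if SUSPICIOUS_LOW <= p_valid <= SUSPICIOUS_HIGH:
        return Verdict.NEEDS_REVIEW
    if p_valid > SUSPICIOUS_HIGH:
        return Verdict.VALID
    return Verdict.INVALID_OR_NOISE


def collect_feedback(scored, reviewer):
    """Return (features, label) pairs for records routed to manual recheck.

    `scored` is an iterable of (features, p_valid) pairs; `reviewer` is a
    callable that returns the human label (1 = valid, 0 = invalid or noise).
    """
    feedback = []
    for features, p_valid in scored:
        if triage(p_valid) is Verdict.NEEDS_REVIEW:
            feedback.append((features, reviewer(features)))
    return feedback
```

The pairs returned by `collect_feedback` play the role of the supervision signal that Step 4 feeds back to Step 3.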

Preferably, the invalid and noise data judgment model in Step 3 is established as follows:

Step 31, extracting quantitative features of the valid data processed in Step 2, wherein the quantitative features comprise at least three of keyword relevance, normalized browsing duration, text topic matching degree, attachment integrity coefficient and user identity matching degree;

Step 32, completing initial training of the model using a manually labeled sample set, wherein the calculation formula of the invalid and noise data judgment model is:

$$\hat{y}_t = \sigma(w_t \cdot x + b_t)$$

where $\hat{y}_t$ is the predicted value of the model at time $t$, $\sigma$ is the activation function, which maps the weighted feature sum to the $[0,1]$ probability interval, $w_t$ is the feature weight vector at time $t$, $x$ is the sample feature vector, and $b_t$ is the bias term at time $t$.

In a further preferred embodiment, the model update driven in Step 4 comprises updating the feature weight vector and the bias term according to:

$$w_{t+1} = w_t + \eta\,(y - \hat{y}_t)\,x, \qquad b_{t+1} = b_t + \eta\,(y - \hat{y}_t)$$

where $w_{t+1}$ and $b_{t+1}$ are the updated feature weight vector and bias term respectively, $\eta$ is the learning rate, and $y$ is the true label of the sample.

In a further preferred embodiment, the search data includes search keywords, search count, search result click count, and search time window; the browsing data includes browsed page type, single-page stay time, page interaction behavior, and browsing path; and the consultation data includes consultation text content, consultation attachments, consultation channel, consultation time, and reply interaction records.
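
The judgment model and its incremental update can be sketched in a few lines of Python. The sketch assumes that the model is a single logistic (sigmoid) unit over the quantitative features and that the update is one stochastic gradient step on the log loss. This reading is suggested by the symbol definitions (activation function mapping to [0,1], weight vector, bias term, learning rate, true label), but the exact formulas of the source are not recoverable here, so the function names and the default learning rate are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x) -> float:
    """Predicted value: sigmoid of the weighted feature sum plus the bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def update(w, b, x, y, lr=0.1):
    """One incremental update from a rechecked sample (x, y).

    Assumed rule: w <- w + lr * (y - y_hat) * x and b <- b + lr * (y - y_hat),
    where y is the true label (1 = valid, 0 = invalid or noise).
    """
    err = y - predict(w, b, x)
    new_w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    new_b = b + lr * err
    return new_w, new_b


# Usage: start from zero weights and apply one manually rechecked sample.
w, b = [0.0, 0.0, 0.0], 0.0
x = [1.0, 0.5, 0.0]  # e.g. keyword relevance, normalized duration, topic match
w, b = update(w, b, x, y=1)
```

After the single update the prediction for `x` moves from 0.5 toward the label 1, which is the behavior expected of a supervision signal that drives the model update.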
In a further preferred embodiment, the rejection of null values and meaningless data is configured to reject data with an empty search keyword, data whose consultation text is shorter than a certain number of characters, and browsing-page data with an invalid user identity tag; the deduplication of repeated data removes fully duplicated behavior records based on the combination of user identity tag, behavior type and behavior time; and the format standardization unifies the text encoding format and removes garbled characters and meaningless special symbols. In a further preferred embodiment, the rule filtering is to