CN-121980399-A - GOIP fraud point identification method, device, equipment, medium and product
Abstract
The invention discloses a GOIP fraud point identification method, device, equipment, medium and product. The method comprises: collecting a training sample data set based on a real-time stream and preprocessing it; training a BERT model on the preprocessed training sample data set to obtain a GOIP fraud number identification model; acquiring sample data to be tested from all GOIP equipment based on the real-time stream and inputting it into the GOIP fraud number identification model to obtain suspected GOIP fraud numbers; clustering the suspected GOIP fraud numbers according to a preset clustering algorithm to obtain suspected GOIP fraud number clusters; and performing statistical analysis on the suspected GOIP fraud numbers in each cluster according to position information, dividing them into high, medium and low grades, and determining the positions of the fraud points by triangulation. The invention can identify GOIP fraud points efficiently and accurately.
Inventors
- HUANG QIXIONG
- LIN LVFAN
- ZHAO ZHIYONG
- YE YUNFANG
- CHEN ZHEN
- LI JIANMING
Assignees
- China Mobile Group Fujian Co., Ltd. (中国移动通信集团福建有限公司)
- China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20251225
Claims (10)
- 1. A GOIP fraud point identification method, comprising: collecting a training sample data set based on a real-time stream, and preprocessing the training sample data set, wherein the training sample data set comprises communication information of GOIP fraud equipment and communication information of normal communication equipment; training a BERT model according to the preprocessed training sample data set to obtain a GOIP fraud number identification model; acquiring sample data to be detected from all GOIP equipment based on the real-time stream, and inputting the sample data to be detected into the GOIP fraud number identification model to obtain suspected GOIP fraud numbers output by the GOIP fraud number identification model; grouping the suspected GOIP fraud numbers according to a preset grouping algorithm to obtain suspected GOIP fraud number groups, wherein the grouping algorithm is a combination of a KD tree, a decision tree algorithm, a random forest algorithm and a clustering algorithm; and performing statistical analysis on the suspected GOIP fraud numbers in each suspected GOIP fraud number group according to position information, dividing them into high, medium and low grades, and determining the position of a fraud point by triangulation.
- 2. The GOIP fraud point identification method of claim 1, wherein the collecting a training sample data set based on a real-time stream and preprocessing the training sample data set comprises: collecting real-time communication information of the GOIP fraud equipment and communication information of the normal communication equipment by using a Kafka message queue, wherein the communication information comprises a calling number, a called number, a call duration, a call frequency, a call time, a call direction, call number characteristics, position information and network behavior; preprocessing the collected communication information of the GOIP fraud equipment and the communication information of the normal communication equipment in real time based on a Flink distributed computing system, wherein the preprocessing comprises data cleaning and data standardization; and performing real-time analysis and feature extraction on the preprocessed data according to a streaming machine learning algorithm, and annotating the data processed by the streaming machine learning algorithm to obtain the preprocessed training sample data set.
- 3. The GOIP fraud point identification method of claim 2, wherein the training the BERT model according to the preprocessed training sample data set to obtain the GOIP fraud number identification model comprises: dividing the preprocessed training sample data set into a training set, a verification set and a test set according to a preset proportion; performing word segmentation and encoding on the telephone numbers in the training set to generate an ID sequence and an attention mask; inputting the ID sequence and the attention mask into a pre-trained BERT model, mapping the output of the BERT model to the output space of the binary classification task through a fully connected layer with an output dimension of 2, and converting the output of the BERT model into a probability distribution by using a Softmax function to obtain a prediction result of the BERT model; calculating the difference between the prediction result of the BERT model and the real label based on a binary cross entropy loss function; calculating the gradient of the binary cross entropy loss function with respect to the model parameters through a back propagation algorithm, and updating the model parameters by using an AdamW optimizer; and evaluating the model according to the verification set and the test set until the optimal model parameters are obtained, and outputting the trained GOIP fraud number identification model.
- 4. The GOIP fraud point identification method of claim 3, wherein the calculating the gradient of the binary cross entropy loss function with respect to the model parameters through a back propagation algorithm and updating the model parameters by using an AdamW optimizer comprises: back-propagating layer by layer, calculating the gradient of the binary cross entropy loss function with respect to the output of the fully connected layer; calculating the gradient of the binary cross entropy loss function with respect to the weights and biases of the fully connected layer; calculating the gradient of the binary cross entropy loss function with respect to the hidden layer of the BERT model; calculating the gradient of the binary cross entropy loss function with respect to each layer of the BERT model; and updating the weights of the model according to the gradients by using the AdamW optimizer.
- 5. The GOIP fraud point identification method of claim 1, wherein the grouping the suspected GOIP fraud numbers according to a preset grouping algorithm to obtain the suspected GOIP fraud number groups comprises: constructing a KD tree according to the position information, the call duration, the call frequency, the call time and the call direction of the suspected GOIP fraud numbers; for each suspected GOIP fraud number, searching for other numbers with similar position information, call duration, call frequency, call time and call direction by using a nearest neighbor search algorithm of the KD tree, to obtain a plurality of adjacent number sets; for each adjacent number set, performing feature analysis by using a decision tree algorithm to obtain a decision tree of each adjacent number set; for the decision tree of each adjacent number set, performing pattern recognition by using a random forest algorithm to obtain an initial classification result; and performing cluster analysis on the numbers in the initial classification result through a K-means algorithm, and dividing the numbers into different groups according to feature similarity to obtain a plurality of suspected GOIP fraud number groups.
- 6. The GOIP fraud point identification method of claim 5, wherein the constructing a KD tree according to the position information, the call duration, the call frequency, the call time and the call direction of the suspected GOIP fraud numbers comprises: calculating the variances of the longitude, the latitude, the call duration, the call frequency, the call time and the call direction respectively, and selecting the dimension with the largest variance to divide the data set of the suspected GOIP fraud numbers into two subtrees; for each subtree, calculating the variances of the dimensions other than the dimension with the largest variance, selecting from them the dimension with the largest variance, and dividing the subtree into a left subtree and a right subtree; and repeating the variance calculation and division for the left subtree and the right subtree until a preset stopping condition is met.
- 7. A GOIP fraud point identification device, comprising: a data acquisition module, used for collecting a training sample data set based on a real-time stream and preprocessing the training sample data set, wherein the training sample data set comprises communication information of GOIP fraud equipment and communication information of normal communication equipment; a model training module, used for training a BERT model according to the preprocessed training sample data set to obtain a GOIP fraud number identification model; a fraud identification module, used for acquiring sample data to be detected from all GOIP equipment based on the real-time stream, and inputting the sample data to be detected into the GOIP fraud number identification model to obtain suspected GOIP fraud numbers output by the GOIP fraud number identification model; a fraud clustering module, used for clustering the suspected GOIP fraud numbers according to a preset clustering algorithm to obtain suspected GOIP fraud number clusters, wherein the clustering algorithm is a combination of a KD tree, a decision tree algorithm, a random forest algorithm and a clustering algorithm; and a fraud point locating module, used for performing statistical analysis on the suspected GOIP fraud numbers in each suspected GOIP fraud number cluster according to position information, dividing them into high, medium and low grades, and determining the positions of the fraud points by triangulation.
- 8. A terminal device, comprising a processor and a memory, wherein a computer program is stored in the memory and configured to be executed by the processor, and the processor, when executing the computer program, implements the GOIP fraud point identification method according to any of claims 1 to 6.
- 9. A computer readable storage medium, wherein the computer readable storage medium stores a computer program, and wherein the device in which the computer readable storage medium is located implements the GOIP fraud point identification method according to any of claims 1 to 6 when the computer program is executed.
- 10. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the GOIP fraud point identification method according to any of claims 1 to 6.
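The KD tree construction recited in claim 6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The claims do not say how a node is divided, what the stopping condition is, or whether excluded dimensions accumulate down the tree, so the sketch assumes a median split, a maximum leaf size (the hypothetical parameter `leaf_size`), and exclusion of only the parent's splitting dimension. The feature order and the sample values are also made up for illustration.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import List, Optional, Sequence

# Hypothetical feature order: longitude, latitude, call duration,
# call frequency, call time, call direction (encoding is an assumption).


@dataclass
class KDNode:
    points: List[Sequence[float]]      # records kept at a leaf (empty for inner nodes)
    axis: Optional[int] = None         # index of the splitting dimension
    threshold: Optional[float] = None  # split value (median, an assumption)
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kd_tree(points, excluded_axis=None, leaf_size=2):
    """Split on the highest-variance dimension, skipping the parent's dimension."""
    leaf_size = max(leaf_size, 1)
    if len(points) <= leaf_size:       # assumed stopping condition
        return KDNode(points=list(points))
    # Candidate dimensions: all except the one the parent node split on.
    dims = [d for d in range(len(points[0])) if d != excluded_axis]
    axis = max(dims, key=lambda d: pvariance([p[d] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        points=[],
        axis=axis,
        threshold=ordered[mid][axis],
        left=build_kd_tree(ordered[:mid], axis, leaf_size),
        right=build_kd_tree(ordered[mid:], axis, leaf_size),
    )


# Made-up records for illustration only.
samples = [
    (119.30, 26.08, 12.0, 40.0, 9.0, 1.0),
    (119.31, 26.07, 15.0, 42.0, 10.0, 1.0),
    (119.29, 26.09, 11.0, 38.0, 9.5, 1.0),
    (118.10, 24.48, 180.0, 3.0, 20.0, 0.0),
    (118.11, 24.47, 200.0, 2.0, 21.0, 1.0),
    (118.09, 24.49, 160.0, 4.0, 19.0, 0.0),
]
root = build_kd_tree(samples)
```

In this sketch the raw values are compared directly, so the dimension with the largest numeric spread (call duration in the made-up data) is chosen at the root. With the data standardization recited in claim 2 applied first, the variances of the dimensions would be on a comparable scale.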
Description
GOIP fraud point identification method, device, equipment, medium and product
Technical Field
The invention relates to the technical field of IT applications, and in particular to a GOIP fraud point identification method, device, equipment, medium and product based on a BERT model.
Background
With the rapid development of information technology, telecom network fraud methods have become increasingly diverse, and one common mode is fraud carried out with GOIP equipment. GOIP is a network technology that enables voice calls worldwide. However, some lawbreakers use GOIP equipment for fraudulent activities, which causes great harm to society. How to effectively identify and crack down on GOIP-based fraud points has therefore become an important problem.
Existing fraud point identification methods have several problems. In terms of the clustering algorithm, density-based clustering is affected by data quality and GMM clustering is sensitive to initial values, so the clustering effect is poor. In terms of data timeliness, no real-time computing framework is used, so numbers are not obtained in time. In terms of data dependency, LSTM requires a large amount of labeled data for training, whereas in practice the data sets for GOIP fraud equipment and normal communication equipment are small, which affects prediction accuracy. As a result, it is currently difficult to identify GOIP fraud points efficiently and accurately.
Disclosure of Invention
The invention aims to provide a GOIP fraud point identification method, device, equipment, medium and product based on a BERT model. Data are collected and preprocessed as a real-time stream, suspected GOIP fraud numbers are identified by the BERT model without depending on a large amount of labeled data, and grouping is performed by a KD tree combined with a decision tree algorithm, a random forest algorithm and a clustering algorithm, so that the accuracy and efficiency of grouping are effectively improved and GOIP fraud points are identified efficiently and accurately.
In order to achieve the above object, an embodiment of the present invention provides a GOIP fraud point identification method based on a BERT model, including: collecting a training sample data set based on a real-time stream, and preprocessing the training sample data set, wherein the training sample data set comprises communication information of GOIP fraud equipment and communication information of normal communication equipment; training a BERT model according to the preprocessed training sample data set to obtain a GOIP fraud number identification model; acquiring sample data to be detected from all GOIP equipment based on the real-time stream, and inputting the sample data to be detected into the GOIP fraud number identification model to obtain suspected GOIP fraud numbers output by the GOIP fraud number identification model; grouping the suspected GOIP fraud numbers according to a preset grouping algorithm to obtain suspected GOIP fraud number groups, wherein the grouping algorithm is a combination of a KD tree, a decision tree algorithm, a random forest algorithm and a clustering algorithm; and performing statistical analysis on the suspected GOIP fraud numbers in each suspected GOIP fraud number group according to position information, dividing them into high, medium and low grades, and determining the position of a fraud point by triangulation.
As an improvement of the above solution, the collecting a training sample data set based on a real-time stream and preprocessing the training sample data set includes: collecting real-time communication information of the GOIP fraud equipment and communication information of the normal communication equipment by using a Kafka message queue, wherein the communication information comprises a calling number, a called number, a call duration, a call frequency, a call time, a call direction, call number characteristics, position information and network behavior; preprocessing the collected communication information of the GOIP fraud equipment and the communication information of the normal communication equipment in real time based on a Flink distributed computing system, wherein the preprocessing comprises data cleaning and data standardization; and performing real-time analysis and feature extraction on the preprocessed data according to a streaming machine learning algorithm, and annotating the data processed by the streaming machine learning algorithm to obtain the preprocessed training sample data set.
As an improvement of the above solution, the training the BERT model according to the preprocessed training sample data set to obtain the GOIP fraud number identification model comprises: dividing the preprocessed training sample data set into a training set, a verification set