CN-116956213-B - Phishing mail identification model and method based on multi-level multi-feature deep learning


Abstract

The invention provides a phishing mail identification model and a phishing mail identification method based on multi-level multi-feature and deep learning. The identification model comprises a multi-channel input layer, a multi-level embedding layer, a BiLSTM layer, a feature combination layer and a fully connected layer. The multi-channel input layer comprises a plurality of sub-channel input layers, the total number of which is configured to correspond to the total number of input multi-level feature types. The multi-level embedding layer comprises a word-level embedding layer and a character-level embedding layer; the total number of sub-embedding layers is configured to be the total number of types of the first feature, and each sub-embedding layer takes the first feature as input and maps it from a high dimension to a low dimension, outputting a third feature. The BiLSTM layers are arranged in correspondence with, and connected to, the sub-embedding layers; they take the third feature as input, and their outputs are spliced to form a fourth feature. The feature combination layer takes the second feature and the fourth feature as input and outputs a combined feature after fusion. The fully connected layer has a multi-layer neuron structure; the combined feature is processed by the multi-layer neuron structure and output through the output layer structure.
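
The last two stages of the data flow described above (feature combination, then a fully connected stack ending in a single output neuron) can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names `combine_features`, `dense` and `classify`, the ReLU and sigmoid activations, the single hidden layer, and the choice of which vectors count as second and fourth features are all assumptions made for illustration. The embedding and BiLSTM layers are not modeled; their output is represented by plain lists.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def combine_features(second: Sequence[Vector], fourth: Sequence[Vector]) -> Vector:
    """Splice one-dimensional feature vectors end to end into one combined vector."""
    combined: Vector = []
    for vec in list(second) + list(fourth):
        combined.extend(vec)
    return combined


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(x: Vector, weights: Sequence[Vector], biases: Vector,
          activation: Callable[[float], float]) -> Vector:
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def classify(combined: Vector, hidden_w: Sequence[Vector], hidden_b: Vector,
             out_w: Vector, out_b: float) -> float:
    """Hidden layer (ReLU, assumed) followed by an output layer with one neuron."""
    hidden = dense(combined, hidden_w, hidden_b, relu)
    return dense(hidden, [out_w], [out_b], sigmoid)[0]


# Hypothetical inputs: two scalar "second" features and two spliced
# BiLSTM outputs standing in for the "fourth" feature.
second_features = [[0.8], [0.5]]
fourth_features = [[0.3, 0.4], [0.1, -0.2]]
combined_vector = combine_features(second_features, fourth_features)
```

In a trained model the weights would come from the pre-training step; here they are free parameters that the caller supplies.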

Inventors

JIANG YU
FANG BINXING
SUN YANBIN
LI MOHAN
XIAO YIHAN
SHE JIAPING

Assignees

Guangzhou University (广州大学)

Dates

Publication Date: 20260512
Application Date: 20230630

Claims (5)

1. A phishing mail recognition model based on multi-level multi-feature and deep learning, characterized by comprising: a multi-channel input layer comprising a plurality of sub-channel input layers, wherein the total number of the sub-channel input layers is configured to correspond to the total number of input multi-level feature types, and each sub-channel input layer is further configured to output, as direct features, to a different neural network; the direct features comprise a first feature and a second feature, wherein the first feature is a feature representation with text context, and the rest are the second feature; a multi-level embedding layer comprising a plurality of sub-embedding layers, the total number of sub-embedding layers being configured as the total number of types of the first feature, the sub-embedding layers taking the first feature as input and mapping it from a high dimension to a low dimension to output a third feature; BiLSTM layers configured to be arranged in correspondence with, and connected to, the sub-embedding layers, taking the third feature as input, their outputs being spliced to form a fourth feature; a feature combination layer configured to take the second feature and the fourth feature as inputs and to output a combined feature after fusion; a fully connected layer configured to have a multi-layer neuron structure including at least one output layer structure with a single neuron, the combined feature being processed by the multi-layer neuron structure and output through the output layer structure; the multi-channel input layer comprises at least a word feature input layer, a semantic feature input layer, an emotion feature input layer, a URL feature input layer, an attachment name association feature input layer and a text correlation coefficient feature input layer; when the feature combination layer performs fusion, the vectors of the second feature and the fourth feature are spliced along a designated axis, and a plurality of one-dimensional features are spliced into the combined feature, the combined feature being a set of one-dimensional feature vectors.
2. A phishing mail identification method based on multi-level multi-feature and deep learning, applied to the phishing mail identification model based on multi-level multi-feature and deep learning as claimed in claim 1, characterized by comprising the following steps: S100, acquiring mail log data; S200, parsing the mail log data and performing preprocessing; S300, extracting multi-level features from the mail; S400, constructing and pre-training the phishing mail recognition model; S500, inputting the mail and its corresponding multi-level features into the recognition model, and identifying the type of the user's mail.
3. The phishing mail recognition method based on multi-level multi-feature and deep learning of claim 2, wherein in step S200, performing the preprocessing includes the steps of: identifying URL links in the mail body and the attachments, normalizing each URL into a standard format so as to eliminate differences between formats, and representing each URL as a character sequence; if the same mail contains a plurality of URL features, joining the extracted URLs with a separator that does not appear in the URLs; performing word segmentation on the other text in the mail, excluding the URL features; and obtaining a Content field (the mail body content, including Chinese and English content but not URLs), a URL field (the URLs in all body text of the mail), and an attach field (the mail attachment names and suffixes).
4. The phishing mail recognition method based on multi-level multi-feature and deep learning of claim 3, wherein the multi-level features include basic layer word features, logical layer semantic features, cognitive layer emotion features, character layer URL features, attachment name association features and text correlation coefficient features.
5. The method for identifying phishing mails based on multi-level multi-feature and deep learning of claim 4, wherein: when the word features are extracted, the relative frequency of the words appearing in the mail is calculated, the top k Chinese and English feature words are selected as the word features, and the word features are encoded using one-hot encoding; when the semantic features are extracted, a single Chinese-English mixed multilingual Word2Vec model is trained on mail text to capture the meaning and context relations of words in phishing mails; when the emotion features are extracted, an emotion text corpus is built for the emotional atmospheres that phishing mails create (fear, curiosity and urgency), and an emotion score is calculated for the mail; when the URL features are extracted, word vectors capable of representing the semantics of special symbols are obtained using n-gram segmentation and character-level encoding; when the attachment name association features are extracted, the attachment names are segmented at special characters and the resulting phrases are encoded to obtain more expressive high-level features; and when the text correlation coefficient features are extracted, the number of words in the attachment names that appear in the body text is counted to obtain the text correlation feature of the attachment.
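
The preprocessing of claim 3 and three of the feature extractors of claim 5 can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the claimed implementation: the regular expression for URLs, the normalization rules (lower-casing scheme and host, dropping the fragment), the space separator, substring matching for the correlation count, and every function name below are hypothetical choices.

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)  # assumed URL matcher
URL_SEPARATOR = " "  # whitespace cannot occur inside a URL matched above


def normalize_url(url: str) -> str:
    """Bring a URL into one standard form (assumed rules, see lead-in)."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def split_fields(body: str, attachment_names: List[str]) -> Dict[str, object]:
    """Build the Content, URL and attach fields from one mail."""
    urls = [normalize_url(u) for u in URL_PATTERN.findall(body)]
    content = " ".join(URL_PATTERN.sub(" ", body).split())  # body without URLs
    return {"Content": content,
            "URL": URL_SEPARATOR.join(urls),
            "attach": list(attachment_names)}


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Character-level n-grams, e.g. as input tokens for the URL feature."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def split_attachment_name(name: str) -> List[str]:
    """Segment an attachment name at special characters."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z\u4e00-\u9fff]+", name) if t]


def text_correlation(name: str, content: str) -> int:
    """Count attachment-name tokens that also occur in the body text."""
    lowered = content.lower()
    return sum(1 for token in split_attachment_name(name) if token in lowered)


fields = split_fields(
    "Verify your account at HTTP://Example.COM/Login?id=1 or http://bank.example.org now",
    ["Invoice_2023.pdf"],
)
```

Keeping URLs out of the Content field means the word-level and character-level channels each see only the text they are meant to encode.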

Description

Phishing mail identification model and method based on multi-level multi-feature deep learning

Technical Field

The invention belongs to the technical field of text data processing, and particularly relates to a phishing mail recognition model and method based on multi-level multi-feature and deep learning.

Background

Phishing refers to the use of disguised mail to deceive a recipient into replying with information such as account numbers and passwords to a designated recipient, or to lead the recipient, through the mail content, to a special webpage. The webpage is usually disguised to look like a real website, such as a bank or financial management webpage, so that the person logging in trusts it and enters credit card or bank card numbers, account names, passwords and the like, which are then stolen. It is therefore necessary to examine mail to determine whether it is phishing mail.

Phishing mail recognition methods are based on either machine learning or deep learning. The basic idea of the machine-learning-based method is to extract features from the mail to be detected and to build a classification model from labeled phishing mails and legitimate mails using a machine learning algorithm, so as to recognize phishing mail. This method is simple to implement, but the machine learning algorithm depends heavily on feature extraction, and the feature extraction method must be designed manually, which may overlook some potential features, so the accuracy may be relatively low. The deep-learning-based method can extract features automatically by learning from data, without manually designed features. This greatly reduces the burden of data preprocessing and feature engineering and improves the accuracy and generalization ability of the algorithm.
However, current deep-learning-based phishing mail identification methods treat phishing mail identification as a single text classification task: they feed the mail body, the subject or the mail header as a whole into the model as one text feature and expect the model to learn deep features automatically. A deep learning model usually needs a large amount of training data to perform well, but in the phishing mail recognition scenario, samples of malicious mail are very rare compared with ordinary mail, which makes it difficult for the model to learn deep features. In addition, distinctive URL features in the text, phishing mail attachment features and the like are ignored, so that the mail feature analysis is incomplete and the feature representation is not systematic, which in turn degrades the recognition performance of the model.

Furthermore, because traditional deep-learning-based methods take the mail body, the subject or the mail header as a whole as the text feature input, the resulting text feature data is usually high-dimensional, which increases the noise and outliers in the data. The model is easily disturbed by noisy data, so its accuracy drops and its robustness is poor.

Disclosure of Invention

Therefore, the technical problem to be solved by the invention is to provide a phishing mail identification model and method based on multi-level multi-feature and deep learning that have higher accuracy and better robustness.
In a first aspect of the present invention, there is provided a phishing mail recognition model based on multi-level multi-feature and deep learning, comprising: a multi-channel input layer comprising a plurality of sub-channel input layers, wherein the total number of the sub-channel input layers is configured to correspond to the total number of input multi-level feature types, and each sub-channel input layer is further configured to output, as direct features, to a different neural network; the direct features comprise a first feature and a second feature, wherein the first feature is a feature representation with text context, and the rest are the second feature; a multi-level embedding layer comprising a plurality of sub-embedding layers, the total number of sub-embedding layers being configured as the total number of types of the first feature, the sub-embedding layers taking the first feature as input and mapping it from a high dimension to a low dimension to output a third feature; BiLSTM layers configured to be arranged in correspondence with, and connected to, the sub-embedding layers, taking the third feature as input, their outputs being spliced to form a fourth feature; a feature combination layer configured to take the second feature and the fourth feature as inputs and to output a combined feature after fusion; and a fully connected layer configured to have a multi-layer neuron structure including at least one output layer structure with a single neuron, and the combin