CN-115883111-B - Phishing website identification method and device, electronic equipment and storage medium
Abstract
The application discloses a phishing website identification method, a phishing website identification device, electronic equipment and a storage medium. The method comprises the steps of extracting at least one first feature and at least one second feature of a website to be identified, wherein the first feature represents URL related features, the second feature represents website page related features, inputting the at least one first feature and the at least one second feature into a set feature fusion network model to obtain a first probability, wherein the first probability represents the probability that the website to be identified is a phishing website, and when the first probability is larger than a set threshold, the website to be identified is determined to be the phishing website.
Inventors
- SUN XIANGXUN
- CHENG BAOPING
- XIE XIAOYAN
Assignees
- 中移(杭州)信息技术有限公司
- 中国移动通信集团有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20210813
Claims (9)
- 1. A phishing website identification method, the method comprising: extracting at least one first feature and at least one second feature of a website to be identified, wherein the first feature characterizes a feature related to a uniform resource locator (URL), the second feature characterizes a feature related to a website page, the at least one first feature comprises character similarity between the URL of the website to be identified and each URL in a set white list, and a feature vector of the URL of the website to be identified, the feature vector is determined based on each character in the URL of the website to be identified and has deep semantic information of the URL of the website to be identified, the character similarity between the URL of the website to be identified and each URL in the set white list is calculated based on the ratio of a first numerical value corresponding to the URL in the set white list to a second numerical value corresponding to the URL, the first numerical value is the difference between a third numerical value corresponding to the URL in the set white list and a fourth numerical value, the third numerical value is the maximum value in the length of the URL of the website to be identified and the URL of the website to be identified, the fourth numerical value is the maximum value in the length of the website to be identified and the website to be identified, and the feature value in the set white list is the length of the website to be identified and the first numerical value corresponding to the URL of the URL to be identified, and the feature in the set keyword to be identified is the first numerical value corresponding to the URL in the set to be edited; inputting the at least one first feature and the at least one second feature into a set feature fusion network model to obtain a first probability, wherein the first probability represents the probability that the website to be identified is a phishing website; and when the first probability is larger than a set threshold value, determining that the website to be identified is a phishing website.
- 2. The phishing website identification method of claim 1, wherein the extracting at least one first feature and at least one second feature of the website to be identified comprises: matching the set field part of the URL of the website to be identified with the set field part of each URL in the set white list and the set black list respectively to obtain a matching result; and extracting at least one first feature and at least one second feature of the website to be identified under the condition that the matching result indicates that the set field part of the URL of the website to be identified is not matched with the set field part of each URL in the set white list or the set black list.
- 3. The phishing website identification method of claim 2, wherein the matching the set field part of the URL of the website to be identified with the set field part of each URL in the set white list and the set black list respectively comprises: preprocessing the URL of the website to be identified, and converting the URL into a URL in a set format; and matching the set field part of the URL in the set format of the website to be identified with the set field part of each URL in the set white list and the set black list respectively.
- 4. The phishing website identification method of claim 2 or 3, wherein the method further comprises: outputting an identification result corresponding to the website to be identified under the condition that the matching result indicates that the set field part of the URL of the website to be identified is matched with the set field part of any URL in the set white list or the set black list.
- 5. The phishing website identification method of claim 1, wherein the method further comprises: calculating, based on the character length of the URL of the website to be identified, the character length of each URL in the set white list and the edit distance, the character similarity between the URL of the website to be identified and each URL in the set white list; and/or inputting a character vector corresponding to each character in the URL of the website to be identified into a set feature extraction model to obtain a vector output by the set feature extraction model, and inputting the vector output by the set feature extraction model into a set pooling layer to perform dimension reduction processing to obtain the feature vector of the URL of the website to be identified.
- 6. The phishing website identification method of claim 1, wherein the at least one second feature comprises a feature of the Logo in the website page to be identified, and the inputting the at least one first feature and the at least one second feature into the set feature fusion network model comprises: matching the feature of the Logo in the website page to be identified with each Logo feature in a set Logo feature library to obtain a first matching degree; and inputting the first matching degree into the set feature fusion network model, wherein the Logo feature library is obtained based on feature extraction of the Logo in the website page corresponding to each URL in the set white list.
- 7. A phishing website identification apparatus, the apparatus comprising: an extraction unit, used for extracting at least one first feature and at least one second feature of a website to be identified, wherein the first feature characterizes the feature related to a URL, the second feature characterizes the feature related to a website page, the at least one first feature comprises character similarity between the URL of the website to be identified and each URL in a set white list, and a feature vector of the URL of the website to be identified, the feature vector is determined based on each character in the URL of the website to be identified and has deep semantic information of the URL of the website to be identified, the character similarity between the URL of the website to be identified and each URL in the set white list is calculated based on the ratio of a first numerical value corresponding to the URL in the set white list to a second numerical value corresponding to the URL, the first numerical value is the difference between a third numerical value corresponding to the URL in the set white list and a fourth numerical value, the third numerical value is the maximum value in the length of the URL of the website to be identified and the URL of the website to be identified, the fourth numerical value is the maximum value in the length of the URL to be identified and the website to be identified, and the feature value in the set white list is the first numerical value corresponding to the URL of the URL to be identified and the first numerical value in the set keyword to be identified, and the feature of the first numerical value in the set website to be identified is the first numerical value is the keyword to be edited; an input unit, used for inputting the at least one first feature and the at least one second feature into a set feature fusion network model to obtain a first probability, wherein the first probability represents the probability that the website to be identified is a phishing website; and a determining unit, used for determining that the website to be identified is a phishing website when the first probability is larger than a set threshold value.
- 8. An electronic device comprising a processor and a memory for storing a computer program capable of running on the processor, wherein the processor is adapted to perform the steps of the method of any of claims 1-6 when the computer program is run.
- 9. A storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method according to any of claims 1-6.
Description
Phishing website identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information security technologies, and in particular, to a phishing website identification method, a phishing website identification device, an electronic device, and a storage medium.
Background
Phishing websites are fake websites that deceive users. In the related art, phishing websites are mainly identified in three ways: (1) by judging the visual information of the web page; (2) by judging the features of the logo (Logo) of the web page; and (3) by judging the features of the uniform resource locator (URL, Uniform Resource Locator) corresponding to the website. However, these methods have a high misjudgment rate and low identification efficiency.
Disclosure of Invention
Accordingly, the main objective of the embodiments of the present application is to provide a phishing website identification method, device, electronic apparatus and storage medium, so as to solve the problems of high misjudgment rate and low identification efficiency in the related art.
In order to achieve the above objective, the technical solution of the embodiments of the present application is as follows.
An embodiment of the application provides a phishing website identification method, which comprises: extracting at least one first feature and at least one second feature of a website to be identified, wherein the first feature represents URL related features, and the second feature represents website page related features; inputting the at least one first feature and the at least one second feature into a set feature fusion network model to obtain a first probability, wherein the first probability represents the probability that the website to be identified is a phishing website; and when the first probability is larger than a set threshold value, determining that the website to be identified is a phishing website.
In the above solution, the extracting at least one first feature and at least one second feature of the website to be identified includes: matching the set field part of the URL of the website to be identified with the set field part of each URL in the set white list and the set black list respectively to obtain a matching result; and extracting at least one first feature and at least one second feature of the website to be identified under the condition that the matching result indicates that the set field part of the URL of the website to be identified does not match the set field part of any URL in the set white list or the set black list.
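The two-stage flow described above (list matching first, feature extraction and thresholding only when no list entry matches) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the "set field part" is assumed to be the hostname, a white-list match is assumed to yield "legitimate" and a black-list match "phishing", and `score_fn` is a hypothetical stand-in for feature extraction plus the feature fusion network model. The names `host_of`, `identify` and the default threshold of 0.5 are illustrative only.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname (assumed here to be the 'set field part')."""
    # urlsplit only recognises the host when a scheme is present,
    # so a default scheme is added for bare inputs such as "example.com/x".
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def identify(url, whitelist, blacklist, score_fn, threshold=0.5):
    """Classify url as 'legitimate' or 'phishing'.

    score_fn is a hypothetical callable returning the first probability,
    i.e. the estimated probability that url is a phishing website.
    """
    host = host_of(url)
    # Stage 1: a list match yields a result without running the model.
    if host in {host_of(u) for u in whitelist}:
        return "legitimate"
    if host in {host_of(u) for u in blacklist}:
        return "phishing"
    # Stage 2: no match, so score the site and compare with the threshold.
    return "phishing" if score_fn(url) > threshold else "legitimate"
```

Because the comparison is strictly "larger than", a score exactly equal to the threshold is treated as legitimate in this sketch.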
In the above solution, the matching the set field part of the URL of the website to be identified with the set field part of each URL in the set white list and the set black list includes: preprocessing the URL of the website to be identified, and converting the URL into a URL in a set format; and matching the set field part of the URL in the set format of the website to be identified with the set field part of each URL in the set white list and the set black list respectively.
In the above solution, the method further comprises: outputting an identification result corresponding to the website to be identified under the condition that the matching result indicates that the set field part of the URL of the website to be identified matches the set field part of any URL in the set white list or the set black list.
In the above solution, the at least one first feature includes at least one of: character similarity between the URL of the website to be identified and each URL in the set white list; and a feature vector determined based on each character in the URL of the website to be identified.
In the above solution, the method further comprises: calculating, based on the character length of the URL of the website to be identified, the character length of each URL in the set white list and the edit distance, the character similarity between the URL of the website to be identified and each URL in the set white list; and/or inputting a character vector corresponding to each character in the URL of the website to be identified into a set feature extraction model to obtain a vector output by the set feature extraction model, and inputting the vector output by the set feature extraction model into a set pooling layer to perform dimension reduction processing to obtain the feature vector of the URL of the website to be identified.
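As an illustration of the character similarity computed from the two URL lengths and the edit distance, the sketch below uses the Levenshtein distance and normalises it by the longer length, i.e. (max length - edit distance) / max length. This exact formula is an assumed reading; the text above names only the three inputs. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def char_similarity(url: str, white_url: str) -> float:
    """Assumed formula: (max_len - edit_distance) / max_len, in [0, 1]."""
    longest = max(len(url), len(white_url))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - edit_distance(url, white_url)) / longest


def whitelist_similarities(url: str, whitelist: list[str]) -> list[float]:
    """One similarity value per white-list URL, in white-list order."""
    return [char_similarity(url, w) for w in whitelist]
```

Under this reading, a URL that differs from a white-list entry by a single character scores close to 1, which is the kind of near-miss spelling a look-alike domain would produce.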
In the above solution, the at least one second feature includes at least one of: features of the Logo in the web page to be identified; form features in the web page to be identified; the ratio of the number of links of the set type to the total number of links in the web page to be identified; and the number of the sensitive