EP-4740124-A1 - MALICIOUS CODE INJECTION DETECTOR

Abstract

The present disclosure provides various systems, methods, and devices for detecting a feature of interest in source code. For example, in various aspects, a computer-implemented method is provided. The computer-implemented method can include receiving a candidate source code and segmenting the candidate source code into snippets. The computer-implemented method can further include transforming, by an encoder model, the snippets to vectors. Each of the vectors may correspond to a different one of the snippets. The computer-implemented method can further include generating a tensor based on the vectors and applying the tensor to a decoder model. The computer-implemented method can further include determining, by the decoder model, a probability of a feature of interest being in the candidate source code. In one aspect, the feature of interest may include a malicious code injection.
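The segment-encode-stack pipeline in the abstract can be illustrated with a minimal, self-contained sketch. Everything here is an assumption for illustration: the fixed five-line snippet size, and the hash-based stand-in encoder (the disclosure instead contemplates a pre-trained model for programming language processing) are hypothetical choices, not the claimed implementation.

```python
import hashlib
import numpy as np

def segment(source: str, max_lines: int = 5) -> list[str]:
    # Hypothetical segmentation: split the candidate source code into
    # fixed-size groups of lines; the disclosure does not fix a snippet size.
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]

def encode(snippet: str, dim: int = 8) -> np.ndarray:
    # Stand-in encoder: derives a fixed-length vector from a hash of the
    # snippet text. A real encoder would be a pre-trained programming
    # language processing model producing learned embeddings.
    digest = hashlib.sha256(snippet.encode()).digest()
    return np.frombuffer(digest[:dim], dtype=np.uint8).astype(np.float32) / 255.0

def build_tensor(source: str) -> np.ndarray:
    # Stack one vector per snippet into a 2-D tensor (snippets x dim),
    # which is then applied to the decoder model.
    snippets = segment(source)
    return np.stack([encode(s) for s in snippets])

tensor = build_tensor("var a = 1;\n" * 12)
print(tensor.shape)  # one row per snippet
```

Each row of the resulting tensor corresponds to a different one of the snippets, matching the one-vector-per-snippet correspondence recited in the claims.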

Inventors

  • YU, Junjun
  • CLEVELAND, Sam
  • SONG, Jia

Assignees

  • Visa International Service Association

Dates

Publication Date
2026-05-13
Application Date
2024-07-01

Claims (20)

  1. A computer-implemented method, comprising: receiving, by a text preprocessing module, a candidate source code; segmenting, by the text preprocessing module, the candidate source code into snippets; transforming, by an encoder model, the snippets to vectors, wherein each of the vectors corresponds to a different one of the snippets; generating, by an embedding processing module, a tensor based on the vectors; applying, by the embedding processing module, the tensor to a decoder model; and determining, by the decoder model, a probability of a feature of interest being in the candidate source code.
  2. The computer-implemented method of Claim 1, wherein receiving the candidate source code comprises receiving a JavaScript source code corresponding to a webpage.
  3. The computer-implemented method of Claim 2, wherein the feature of interest comprises a malicious code injected into the candidate source code.
  4. The computer-implemented method of Claim 1, wherein determining, by the decoder model, the probability of a feature of interest being in the candidate source code comprises: applying the tensor to a bidirectional long short-term memory (Bi-LSTM) layer to generate a Bi-LSTM layer output.
  5. The computer-implemented method of Claim 4, wherein determining, by the decoder model, the probability of a feature of interest being in the candidate source code further comprises: applying the Bi-LSTM layer output to an average pooling layer to generate an average pooling layer output; applying the Bi-LSTM layer output to a max pooling layer to generate a max pooling layer output; and combining the average pooling layer output and the max pooling layer output to generate a concentrated output.
  6. The computer-implemented method of Claim 5, wherein determining, by the decoder model, the probability of a feature of interest being in the candidate source code further comprises: applying the concentrated output to a first linear layer to generate a first linear layer output; applying the first linear layer output to a rectified linear unit (ReLU) layer to generate a ReLU layer output; applying the ReLU layer output to a dropout layer to generate a dropout layer output; applying the dropout layer output to a second linear layer to generate a second linear layer output; and applying the second linear layer output to a sigmoid layer to generate the probability of a feature of interest being in the candidate source code.
  7. The computer-implemented method of Claim 4, wherein transforming, by the encoder model, the snippets to the vectors comprises: applying the snippets to a pre-trained model for programming language processing.
  8. The computer-implemented method of Claim 1, further comprising: training, by a training module, the decoder model based on a labeled dataset, wherein the labeled dataset comprises positive samples of training source code comprising the feature of interest.
  9. The computer-implemented method of Claim 8, further comprising: labeling, by the training module, the candidate source code as an additional positive sample based on the probability satisfying a predetermined threshold; and retraining, by the training module, the decoder model based on the candidate source code.
  10. The computer-implemented method of Claim 1, further comprising classifying, by a classification module, the candidate source code as comprising the feature of interest based on the probability exceeding a predetermined threshold.
  11. A source code classification system, comprising: an online model inference configured to determine a probability that a candidate source code includes a feature of interest, wherein the online model inference comprises: a text preprocessing module configured to segment the candidate source code into snippets; an encoder model configured to transform the snippets to vectors, wherein each of the vectors corresponds to a different one of the snippets; an embedding processing module configured to generate a tensor based on the vectors; and a decoder model configured to determine the probability that the candidate source code includes the feature of interest based on the tensor.
  12. The source code classification system of Claim 11, further comprising: an offline training model for retraining the online model inference, wherein the offline training model is configured to update the online model inference based on a labeled dataset.
  13. The source code classification system of Claim 12, wherein the labeled dataset comprises positive samples of source code that each include the feature of interest.
  14. The source code classification system of Claim 13, wherein the feature of interest comprises a malicious code injection.
  15. The source code classification system of Claim 14, wherein the candidate source code comprises JavaScript source code.
  16. The source code classification system of Claim 15, wherein the online model inference is configured to receive the candidate source code from an active webpage and determine the probability that the candidate source code from the active webpage includes the feature of interest.
  17. The source code classification system of Claim 16, wherein the offline training model is configured to automatically apply the candidate source code from the active webpage as a labeled dataset for updating the online model based on the probability that the candidate source code from the active webpage includes the feature of interest satisfying a probability threshold.
  18. The source code classification system of Claim 12, wherein the decoder model comprises a bidirectional long short-term memory (Bi-LSTM) neural network.
  19. The source code classification system of Claim 18, wherein the Bi-LSTM neural network comprises: an average pooling layer; a max pooling layer, wherein outputs of the average pooling layer and the max pooling layer are combined to generate a concentrated output; a first linear layer; a rectified linear unit (ReLU) layer; a dropout layer; a second linear layer; and a sigmoid layer.
  20. The source code classification system of Claim 18, wherein the encoder model comprises a Bidirectional Encoder Representations from Transformers (BERT) model.
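Read together, Claims 4 through 6 and 19 describe a pooling-and-projection head applied to the Bi-LSTM output. The following is a minimal NumPy sketch of that head only; the Bi-LSTM is assumed to have already produced its output matrix, and all shapes, weights, and the seeded random generator are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_head(bilstm_out: np.ndarray,
                 w1: np.ndarray, b1: np.ndarray,
                 w2: np.ndarray, b2: float,
                 dropout_p: float = 0.0) -> float:
    # bilstm_out: (timesteps, features) output of the Bi-LSTM layer.
    avg = bilstm_out.mean(axis=0)          # average pooling layer
    mx = bilstm_out.max(axis=0)            # max pooling layer
    combined = np.concatenate([avg, mx])   # combine the two pooled outputs
    h = combined @ w1 + b1                 # first linear layer
    h = np.maximum(h, 0.0)                 # ReLU layer
    if dropout_p > 0.0:                    # dropout layer (training only)
        mask = rng.random(h.shape) >= dropout_p
        h = h * mask / (1.0 - dropout_p)
    z = h @ w2 + b2                        # second linear layer
    return float(1.0 / (1.0 + np.exp(-z)))  # sigmoid layer -> probability

# Hypothetical shapes: 4 timesteps, 6 Bi-LSTM features, hidden size 5.
x = rng.standard_normal((4, 6))
w1 = rng.standard_normal((12, 5)); b1 = np.zeros(5)
w2 = rng.standard_normal(5); b2 = 0.0
p = decoder_head(x, w1, b1, w2, b2)
```

The scalar `p` is the probability recited in the claims; per Claim 10, a classification module could then compare it against a predetermined threshold (the threshold value itself is a design choice the claims leave open).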
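Claims 8, 9, and 17 describe a self-labeling feedback loop: candidates scored above a threshold are added to the labeled dataset as positive samples, and the model is retrained on the enlarged dataset. A toy sketch of that bookkeeping follows; the class name, the 0.9 threshold, and the retrain flag are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingModule:
    # Illustrative sketch of the offline retraining flow in Claims 8, 9,
    # and 17; the threshold value is not specified by the disclosure.
    threshold: float = 0.9
    positives: list[str] = field(default_factory=list)
    retrain_requested: bool = False

    def observe(self, candidate_source: str, probability: float) -> None:
        # Label high-probability candidates as additional positive
        # samples and flag the decoder model for retraining.
        if probability >= self.threshold:
            self.positives.append(candidate_source)
            self.retrain_requested = True

tm = TrainingModule()
tm.observe("benign();", 0.12)   # below threshold: ignored
tm.observe("skimmer();", 0.97)  # above threshold: becomes a positive sample
```

This is the mechanism by which the system can pick up new malware variants without a manually curated signature list.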

Description

TITLE

MALICIOUS CODE INJECTION DETECTOR

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of and priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Serial No. 63/511,746, filed July 3, 2023, entitled “MALICIOUS CODE INJECTION DETECTOR,” the contents of which are hereby incorporated by reference in their entirety.

TECHNICAL FIELD

[0002] At least some aspects of the present disclosure relate to Programming Language Processing (PLP), such as, for example, detecting a feature of interest in source code using a neural network model.

BACKGROUND

[0003] Malicious actors may target e-commerce websites as a means for stealing consumers’ personal information. For example, credit card fraudsters may attempt to inject malicious code into the source code of an e-commerce website’s checkout webpage. This malicious code injection can be designed to collect credit card credentials provided by consumers while conducting a transaction via the website’s checkout webpage. The fraudsters may then attempt to conduct fraudulent transactions with the stolen credit card credentials. These types of attacks can be mitigated by detecting and removing the malicious code injections.

[0004] Detecting malicious code injections can be a complex and time-consuming task. For example, a transaction service provider may have an interest in detecting and addressing malicious code injections that may exist in the source code of thousands of different e-commerce websites. Current approaches to detecting malicious code injections can therefore rely on comparing the source code for each webpage against a list of known malware code signatures. The list of known malware signatures is often manually created and may need to be manually updated as additional known malware signatures are identified. Thus, current approaches to detecting malicious code injections may not be able to detect new and/or unknown malware signatures that have yet to be identified and added to the list of known malware signatures.

[0005] Accordingly, there exists a need for alternate systems, methods, and devices for detecting features of interest in source code, such as, for example, malicious code injections in source code. The present disclosure provides various solutions that employ a neural network model for detecting features of interest in source code.

SUMMARY

[0006] In various aspects, the present disclosure provides a computer-implemented method for detecting a feature of interest in source code. The computer-implemented method can include receiving, by a text preprocessing module, a candidate source code and segmenting, by the text preprocessing module, the candidate source code into snippets. The computer-implemented method can further include transforming, by an encoder model, the snippets to vectors. Each of the vectors may correspond to a different one of the snippets. The computer-implemented method can further include generating, by an embedding processing module, a tensor based on the vectors and applying, by the embedding processing module, the tensor to a decoder model. The computer-implemented method can further include determining, by the decoder model, a probability of a feature of interest being in the candidate source code.

[0007] In various aspects, the present disclosure provides a source code classification system. The source code classification system can include an online model inference configured to determine a probability that a candidate source code includes a feature of interest. The online model inference can include a text preprocessing module, an encoder model, an embedding processing module, and a decoder model. The text preprocessing module can be configured to segment the candidate source code into snippets. The encoder model can be configured to transform the snippets to vectors. Each of the vectors may correspond to a different one of the snippets. The embedding processing module can be configured to generate a tensor based on the vectors. The decoder model can be configured to determine the probability that the candidate source code includes the feature of interest based on the tensor.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] In the description, for purposes of explanation and not limitation, specific details are set forth, such as particular aspects, procedures, techniques, etc. to provide a thorough understanding of the present technology. However, it will be apparent to one skilled in the art that the present technology may be practiced in other aspects that depart from these specific details.

[0009] The accompanying drawings, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate aspects of concepts that include the claimed disclosure and explain various principles and advantages of those aspects.

[0010] The apparatuses and methods disclosed herein have been represented where appropriate by conventional sym