CN-121984748-A - System for detecting password guess attack

Abstract

A system for detecting password guess attacks comprises a data collection and preprocessing module, a time sequence analysis module, a cluster analysis module, an attack path analysis module and a dynamic risk scoring module. The data collection and preprocessing module performs data collection, data preprocessing and data distribution. The time sequence analysis module analyzes time series data with a multivariate time series model and identifies abnormal login behavior. The cluster analysis module identifies user behavior patterns and distinguishes normal login behavior from potentially abnormal behavior. The attack path analysis module identifies possible attack paths and traces an attacker's movement through the system: it applies graph theory and path analysis to user behavior data, constructs an attack path model, and detects potential attacks through algorithmic analysis. Through these modules, the system improves the security and reliability of a password authentication system and helps protect user accounts against password guessing attacks.

Inventors

LIU JIAN
WANG ZHICHUAN

Assignees

Liaoning University (辽宁大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-30

Claims (7)

1. A system for detecting password guess attacks, characterized by comprising a data collection and preprocessing module, a time sequence analysis module, a cluster analysis module, an attack path analysis module and a dynamic risk scoring module; the data collection and preprocessing module is used for data collection, data preprocessing and data distribution; the time sequence analysis module is used for analyzing time series data with an improved multivariate time series model and identifying abnormal login behavior, so that the system, by means of the improved multivariate time series analysis model, achieves multi-dimensional collaborative detection across the time dimension, the frequency dimension and the association dimension of login behavior; the cluster analysis module is used for identifying user behavior patterns, grouping large volumes of login data through cluster analysis, and distinguishing normal login behavior from potentially abnormal behavior; and the attack path analysis module is used for identifying possible attack paths by monitoring and analyzing login behavior in the system and tracing an attacker's movement through the system, applying graph theory and path analysis to user behavior data to construct an attack path model and detect potential attacks through algorithmic analysis.
2. The system for detecting password guess attacks according to claim 1, wherein the data collection collects all login requests from server logs, network traffic monitoring and other sources, each login request including an IP address, a user name and a timestamp; the data preprocessing divides the raw data by time window to generate time series data, calculates the login frequency vector of each IP address within one day, and finally generates the node and edge information required by the login graph structure; and the data distribution sends the time series data to the time sequence analysis module, the login frequency vectors to the cluster analysis module, and the node and edge information to the attack path analysis module.
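The preprocessing described in claim 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the record layout `(ip, username, ISO-8601 timestamp)`, the function name `preprocess`, the one-hour window, and the choice of IP-to-username edges for the login graph are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def preprocess(records):
    """Split one day of login records into the three outputs named in claim 2.

    records: iterable of (ip, username, iso_timestamp) tuples (assumed layout).
    Returns (hourly_counts, freq_vectors, nodes, edges).
    """
    hourly_counts = [0] * 24                      # time series for the time sequence module
    freq_vectors = defaultdict(lambda: [0] * 24)  # per-IP frequency vectors for clustering
    edges = Counter()                             # (ip, username) -> number of attempts

    for ip, user, ts in records:
        hour = datetime.fromisoformat(ts).hour    # index of the one-hour time window
        hourly_counts[hour] += 1
        freq_vectors[ip][hour] += 1
        edges[(ip, user)] += 1

    # Nodes are every IP address and user name that appears on an edge (assumption).
    nodes = {ip for ip, _ in edges} | {user for _, user in edges}
    return hourly_counts, dict(freq_vectors), nodes, dict(edges)
```

Each of the returned values corresponds to one downstream consumer: the hourly counts go to the time sequence analysis module, the frequency vectors to the cluster analysis module, and the nodes and edges to the attack path analysis module.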
3. The system for detecting password guess attacks according to claim 1, wherein the time sequence analysis module comprises construction of time series data and definition of a multivariate time series model.

Construction of time series data: the collected login request data is divided by time window to form a plurality of time series, each time series representing the login requests within a given time window; with the time window set to one hour and the number of login requests in each window taken as one data point, the time series data for one day contains 24 data points.

Definition of the multivariate time series model: let $R_i = \{r_{i1}, r_{i2}, \dots, r_{in}\}$ denote the set of login requests within the i-th time window, where $r_{ij}$ is the j-th login request within the i-th time window. The multivariate time series model is defined as:

$$X_t = A X_{t-1} + B U_t$$

where $X_t$ is the multivariate state vector of the t-th time window, composed of the n login requests $r_{t1}$ to $r_{tn}$ within that window; $A$ is the transfer matrix describing how strongly the state at the previous moment, $X_{t-1}$, influences the current state; $U_t$ is the random noise term; and $B$ is the noise matrix measuring how strongly the noise term acts on the current state. This linear equation models how the login request state evolves over time.

The model parameters are estimated by maximum likelihood. Assume that $U_t$ is a zero-mean Gaussian noise term, i.e. $U_t \sim \mathcal{N}(0, \Sigma)$. Given $A$ and $B$, the conditional distribution of $X_t$ is also Gaussian, and the objective is to find the optimal parameter estimates $\hat{A}$ and $\hat{B}$:

$$(\hat{A}, \hat{B}) = \arg\max_{A,B} L(A, B \mid X)$$

where the operator $\arg\max_{A,B}$ ranges over all possible values of the matrices $A$ and $B$ and selects the combination that maximizes the likelihood function $L(A, B \mid X)$. The likelihood function is the probability of observing the current time series data $X$ under the given parameters $A$ and $B$; maximizing it yields the best fit between the model parameters and the observed data.

The likelihood function is expressed as:

$$L(A, B \mid X) = \prod_{t=1}^{T} P(X_t \mid X_{t-1}, A, B)$$

This decomposes the overall likelihood into a product of conditional probability density functions, one per time window, where $\prod$ is the product operator, $T$ is the total length of the time series (the total number of time windows), and $P(X_t \mid X_{t-1}, A, B)$ is the conditional probability density of the current state $X_t$ given the previous state $X_{t-1}$ and the model parameters. The product combines the probability contribution of each time window into the likelihood of the whole series and supports numerical solution of the parameter estimation.

Maximizing the log-likelihood function:

$$\log L(A \mid X) = \sum_{t=1}^{T} \log P(X_t \mid X_{t-1}, A)$$

Taking the logarithm of the likelihood turns the product into a sum and reduces the computational complexity of solving for the parameters. Here $\log L(A \mid X)$ is the log-likelihood of the observed sequence $X$ given the parameter $A$; $\sum$ is the summation operator that replaces the product; $T$ is the total length of the time series; and $\log P(X_t \mid X_{t-1}, A)$ is the conditional log probability density of a single time window, i.e. the logarithm of the probability of the current state $X_t$ given the previous state $X_{t-1}$ and the parameter $A$.

Assuming that the noise terms $U_t$ are independent and identically distributed, we obtain:

$$P(X_t \mid X_{t-1}, A) \propto \exp\left(-\frac{1}{2} (X_t - A X_{t-1})^{\top} \Sigma^{-1} (X_t - A X_{t-1})\right)$$

This follows from the assumption of independent and identically distributed Gaussian noise and characterizes the conditional distribution of $X_t$. The symbol $\propto$ means "proportional to" and omits normalization constants that do not depend on the parameters being optimized; $\exp$ is the exponential function; $X_t - A X_{t-1}$ is the residual vector between the observed value and the model prediction and reflects the fitting error of the model; the superscript $\top$ denotes the transpose of a matrix or vector; and $\Sigma$ is the covariance matrix of the noise term. The exponent quantifies the influence of the residual on the conditional probability: the smaller the residual, the larger the conditional probability.

The log-likelihood function is therefore expressed as:

$$\log L(A \mid X) = -\frac{T}{2} \log |\Sigma| - \frac{1}{2} \sum_{t=1}^{T} (X_t - A X_{t-1})^{\top} \Sigma^{-1} (X_t - A X_{t-1})$$

This is the result of substituting the Gaussian probability density into the log-likelihood. The first term, $-\frac{T}{2} \log |\Sigma|$, is a constant that depends only on the covariance matrix, with $|\Sigma|$ denoting its determinant. The second term is the weighted sum of squares of the residuals, with the weights determined by $\Sigma^{-1}$; it is the term that takes part in the parameter optimization.

Since $\Sigma$ is a constant term, the quantity to be maximized is:

$$\hat{A} = \arg\max_{A} \sum_{t=1}^{T} -(X_t - A X_{t-1})^{\top} (X_t - A X_{t-1})$$

This is the simplified optimization objective obtained by ignoring the constant term $\Sigma$. Searching for the parameter $A$ that maximizes this sum is equivalent to searching for the parameter $A$ that minimizes the unweighted sum of squares of the residuals; the smaller that sum, the better the model fits the observed data. Minimizing the sum of squares of the prediction errors is written as:

$$\hat{A} = \arg\min_{A} \sum_{t=1}^{T} (X_t - A X_{t-1})^{\top} (X_t - A X_{t-1})$$

where $\arg\min_{A}$ selects the matrix $A$ that minimizes the subsequent sum, $\sum_{t=1}^{T}$ sums over all windows of the time series, $X_t$ is the observed state vector of the t-th time window, and $A X_{t-1}$ is the predicted state vector computed from the previous state vector $X_{t-1}$ and the transfer matrix $A$. The product of the transposed error vector with itself gives the squared error of a single window, and summing over all windows gives the overall optimization objective.

Let:

$$E_t = X_t - A X_{t-1}$$

This is shorthand for the prediction error of a single time window: $E_t$ is the prediction error vector of the t-th time window, i.e. the difference between the observed state vector $X_t$ and the predicted state vector $A X_{t-1}$, where $A$ is the transfer matrix to be solved for. Introducing $E_t$ simplifies the error expression and makes the optimization objective easier to rewrite.

The objective function becomes:

$$\sum_{t=1}^{T} E_t^{\top} E_t$$

where $E_t^{\top} E_t$ is equivalent to $(X_t - A X_{t-1})^{\top} (X_t - A X_{t-1})$; both are the sum of squares of the prediction error of a single time window. In other words, the least squares method finds the transfer matrix $A$ that minimizes the accumulated prediction deviation over the whole time series, that is, the sum of squared errors between the observed state vector $X_t$ of the t-th time window and the predicted state vector $A X_{t-1}$:

$$\sum_{t=1}^{T} E_t^{\top} E_t = \sum_{t=1}^{T} (X_t - A X_{t-1})^{\top} (X_t - A X_{t-1})$$

The left side is the simplified expression of the total squared prediction error and the right side is the original form; the two are exactly equivalent, and the substitution does not change the nature of the objective function.

Expanding gives:

$$\sum_{t=1}^{T} \left( X_t^{\top} X_t - X_t^{\top} A X_{t-1} - X_{t-1}^{\top} A^{\top} X_t + X_{t-1}^{\top} A^{\top} A X_{t-1} \right)$$

where $X_t^{\top} X_t$ is the sum of squares of the observed state vector and does not depend on the parameter $A$; $X_t^{\top} A X_{t-1}$ and $X_{t-1}^{\top} A^{\top} X_t$ are the cross terms between the observed vector and the predicted vector, both of which depend on $A$; and $X_{t-1}^{\top} A^{\top} A X_{t-1}$ is the sum of squares of the predicted state vector, which also depends on $A$. The expansion relies on the transpose of a matrix product, $(A X_{t-1})^{\top} = X_{t-1}^{\top} A^{\top}$.

Because each cross term is a scalar and the transpose of a scalar equals itself, $X_{t-1}^{\top} A^{\top} X_t = X_t^{\top} A X_{t-1}$, so the two cross terms can be combined while the other two terms remain unchanged:

$$\sum_{t=1}^{T} \left( X_t^{\top} X_t - 2 X_t^{\top} A X_{t-1} + X_{t-1}^{\top} A^{\top} A X_{t-1} \right)$$

Taking the derivative with respect to $A$ and setting it to zero:

$$\frac{\partial}{\partial A} \sum_{t=1}^{T} \left( X_t^{\top} X_t - 2 X_t^{\top} A X_{t-1} + X_{t-1}^{\top} A^{\top} A X_{t-1} \right) = 0$$

This is the key step in solving for the optimal $A$: $\frac{\partial}{\partial A}$ denotes the partial derivative with respect to the matrix $A$, and the 0 on the right side is a zero matrix, meaning that the objective function reaches its minimum at that point. Carrying out the differentiation gives:

$$-2 \sum_{t=1}^{T} X_t X_{t-1}^{\top} + 2 A \sum_{t=1}^{T} X_{t-1} X_{t-1}^{\top} = 0$$

where $T$ is the total length of the time series, $X_t$ and $X_{t-1}$ are the observed state vectors at times $t$ and $t-1$, and $\sum_{t=1}^{T} X_t X_{t-1}^{\top}$ and $\sum_{t=1}^{T} X_{t-1} X_{t-1}^{\top}$ are accumulation matrices of vector products. Rearranging gives:

$$A \sum_{t=1}^{T} X_{t-1} X_{t-1}^{\top} = \sum_{t=1}^{T} X_t X_{t-1}^{\top}$$

Finally:

$$\hat{A} = \left( \sum_{t=1}^{T} X_t X_{t-1}^{\top} \right) \left( \sum_{t=1}^{T} X_{t-1} X_{t-1}^{\top} \right)^{-1}$$

where $\hat{A}$ is the optimal transfer matrix and $\left( \sum_{t=1}^{T} X_{t-1} X_{t-1}^{\top} \right)^{-1}$ is the inverse of the accumulation matrix; $\hat{A}$ is the least squares solution that minimizes the sum of squares of the prediction errors.

The matrix $\hat{A}$ is the transition matrix of the multivariate time series model and describes the state transition relation from one time window to the next. Analyzing its elements and eigenvalues yields information about the dynamics of the system, including the strength and direction of association between state dimensions, the stability of the system, and the trend and periodicity of state changes. Each element $a_{ij}$ quantifies how strongly variable $j$ at the previous time $t-1$ influences variable $i$ at the current time $t$: $a_{ij} = 1$ means no distortion, the state of variable $i$ being entirely determined by variable $j$; $a_{ij} > 1$ means the influence is enhanced, corresponding to a positive feedback or accumulation effect; and $a_{ij} < 1$ means the influence is attenuated, corresponding to a negative feedback or loss effect.
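The least squares estimate of the transfer matrix in claim 3 can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the patented implementation: the function names, the list-of-lists matrix representation, and the use of Gauss-Jordan elimination in place of an explicit matrix inverse are all choices made for this example.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def solve(G, C):
    """Solve G Y = C for Y by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    aug = [[float(v) for v in G[i]] + [float(v) for v in C[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("accumulation matrix is singular; more varied data is needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def estimate_transfer_matrix(X):
    """X is a list of state vectors, one per time window, in time order.

    Returns A_hat = (sum X_t X_{t-1}^T) (sum X_{t-1} X_{t-1}^T)^{-1}.
    """
    prev, curr = X[:-1], X[1:]
    gram = matmul(transpose(prev), prev)   # sum over t of X_{t-1} X_{t-1}^T
    cross = matmul(transpose(curr), prev)  # sum over t of X_t X_{t-1}^T
    # A * gram = cross; gram is symmetric, so gram * A^T = cross^T.
    return transpose(solve(gram, transpose(cross)))


def squared_errors(X, A):
    """Return E_t^T E_t for each window, with E_t = X_t - A X_{t-1}."""
    errors = []
    for prev, curr in zip(X[:-1], X[1:]):
        predicted = [sum(a * x for a, x in zip(row, prev)) for row in A]
        errors.append(sum((c - p) ** 2 for c, p in zip(curr, predicted)))
    return errors
```

One plausible use, not stated in the claim, is to fit the matrix on historical windows and treat windows whose squared error is unusually large as candidates for abnormal login behavior.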
4. The system for detecting password guess attacks according to claim 1, wherein the cluster analysis module comprises cluster analysis and cluster result processing; the cluster analysis clusters the frequency vectors with a K-means clustering algorithm, divides them into different clusters, and identifies normal and abnormal login behavior; the cluster result processing sends the clustering result and the identified abnormal login behavior information to the dynamic risk scoring module, and the behavior pattern and strategy of the attacker are identified by analyzing the attack path. The login requests of each IP address within one day are defined as a set $R_{i,d}$, where $i$ is the IP address and $d$ is the date. The login frequency vector of each IP address is calculated as:

$$F_i = (f_{i1}, f_{i2}, \dots, f_{in})$$

where $f_{ij}$ is the number of logins of IP address $i$ in time window $j$ and $n$ is the number of time windows in the day. The frequency vectors are clustered with the K-means clustering algorithm:

$$\arg\min_{C} \sum_{k=1}^{K} \sum_{F_i \in C_k} \lVert F_i - \mu_k \rVert^2$$

where $C_k$ is the k-th cluster and $\mu_k$ is the mean vector of the k-th cluster. The cluster analysis module thus clusters the IP login frequency vectors with the K-means clustering algorithm, separates them into clusters to distinguish normal login behavior from abnormal login behavior, and transmits the clustering result and the abnormality information to the dynamic risk scoring module.
5. The system for detecting password guess attacks according to claim 4, wherein the cluster analysis module works as follows:

S1, data preprocessing: collecting login information within a set time period, the login information comprising login time, IP address, user ID and login result; data cleaning, namely removing invalid or duplicate data and handling missing values; and extracting the important features of the login information, namely login time interval, number of failures and IP address frequency;

S2, feature standardization: standardizing or normalizing the extracted feature values to eliminate the differences in scale between features:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the feature value, $\mu$ is the feature mean and $\sigma$ is the feature standard deviation;

S3, calculating the cluster centers: in the K-means algorithm, the update formula of the cluster center $\mu_k$ is:

$$\mu_k = \frac{1}{|C_k|} \sum_{f_i \in C_k} f_i$$

where $|C_k|$ is the number of data points in the k-th cluster and $f_i$ is a frequency vector belonging to the k-th cluster; the sum of squared errors is minimized, the K-means algorithm converging by iteratively optimizing the objective function:

$$J = \sum_{k=1}^{K} \sum_{f_i \in C_k} \lVert f_i - \mu_k \rVert^2$$

S3.1, initialization: randomly selecting K initial cluster centers;
S3.2, cluster assignment: assigning each data point to its nearest cluster center;
S3.3, cluster center update: calculating the new center of each cluster;
repeating S3.2 and S3.3 until the cluster centers no longer change, namely until the absolute change of every coordinate of every cluster center is less than or equal to a preset threshold, at which point the cluster centers are considered to have converged and the algorithm terminates, or until the maximum number of iterations is reached;

S4, clustering the login data with the K-means clustering algorithm: determining the number of clusters K using the elbow method or the silhouette coefficient method; and performing the clustering calculation on the standardized feature data to obtain a cluster label for each login record;

S5, analyzing the clustering result: identifying the clusters of normal login behavior and the clusters of abnormal login behavior, computing feature statistics for each cluster, and determining the behavior pattern. Behavior is judged normal when any of the following conditions is met: the success rate of login attempts is not lower than a threshold α; the fluctuation of the login interval does not exceed a threshold β and falls within a conventional time range; or the login frequency is not higher than a threshold γ. Behavior is judged abnormal when any of the following conditions is met: the success rate of login attempts is lower than a threshold δ; the login interval is shorter than a threshold ε and shows dense login characteristics; the login frequency is higher than a threshold ζ and is concentrated in a specific period; the number of login attempts on the same account from the same IP address within a short time exceeds a threshold η; or login attempts on the same account are initiated from no fewer than θ different IP addresses within a short time.
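Steps S2 and S3 of claim 5 can be sketched as below. This is a minimal sketch, assuming plain Python lists as feature vectors; the function names, the fixed random seed, the fallback of 1.0 for a zero standard deviation, and the handling of empty clusters (keeping the old center) are assumptions for illustration and are not taken from the claim.

```python
import math
import random


def standardize(rows):
    """S2: z = (x - mu) / sigma, computed per feature column."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant feature has sigma = 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """S3: K-means with random initialization; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]        # S3.1 initialization
    labels = [0] * len(points)
    for _ in range(max_iter):
        # S3.2: assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # S3.3: move every center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])                # keep an empty cluster's center
        # Converged when no coordinate of any center moved by more than tol.
        shift = max(abs(a - b) for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers
```

The convergence test mirrors the claim's stopping rule: the loop ends when the largest coordinate change of any center is at most the preset threshold, or when the maximum number of iterations is reached.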
6. The system for detecting password guess attacks according to claim 1, wherein the attack path analysis module operates as follows: S1, constructing an attack path graph, namely representing the login behavior in the system as a graph in which nodes represent system resources or users and edges represent access behavior; S2, path detection and analysis, namely using a shortest path algorithm to detect and analyze suspicious paths in the graph; S3, identifying abnormal paths through predefined security rules and behavior characteristics, and further analyzing possible attack paths. The specific method comprises: constructing a graph structure G = (V, E) of the login requests, wherein V is the set of user nodes and E is the set of login attempt paths between users; defining weights between nodes; and calculating the attack path using a shortest path algorithm.
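The shortest path step in claim 6 can be illustrated with Dijkstra's algorithm, which is one common shortest path algorithm; the claim does not say which algorithm is used. The sketch below assumes non-negative edge weights supplied by the caller, since the weight definition is not reproduced in the text above; the adjacency-dictionary layout and the function name are likewise hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over graph = {node: {neighbor: weight}} with non-negative weights.

    Returns (total_weight, path); (inf, []) when the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    visited = set()
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue                      # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```

With weights chosen so that more suspicious login attempts cost less (an assumption), the lowest-cost path between two nodes would be the most likely movement track to inspect against the predefined security rules.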
7. The system for detecting password guess attacks according to claim 1, wherein the dynamic risk scoring module is configured to update and evaluate the risk score of each login request in real time. A risk score $R(r)$ is defined for each login request $r$ and calculated from the IP address, the time and the success rate, for example:

$$R(r) = \alpha S_{ip}(r) + \beta S_{time}(r) + \gamma S_{succ}(r)$$

where $\alpha$, $\beta$ and $\gamma$ are weight coefficients. The risk score is updated using the Exponentially Weighted Moving Average (EWMA) method:

$$R_t = \lambda R(r_t) + (1 - \lambda) R_{t-1}$$

where $\lambda$ is the smoothing coefficient. The parameters have the following meanings: $R(r)$ is the initial risk score of a single login request $r$; $S_{ip}(r)$ is the trustworthiness score of the IP address of the login request; $S_{time}(r)$ is the time dimension score, based on characteristics such as login time interval and period rationality; $S_{succ}(r)$ is the score corresponding to the historical login success rate of the IP address or account; $\alpha$, $\beta$ and $\gamma$ are weight coefficients satisfying $\alpha + \beta + \gamma = 1$ that set the relative importance of the three scores in the initial risk score and can be calibrated to the security requirements of the system; $R_t$ is the real-time risk score at time $t$; $R_{t-1}$ is the historical risk score at time $t-1$; and $\lambda$, with $0 < \lambda < 1$, is the smoothing coefficient that balances the influence of the current login behavior against the historical risk state in the real-time score.
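The scoring in claim 7 can be sketched as a weighted sum followed by an exponentially weighted moving average. The sketch rests on assumptions: the function and parameter names are hypothetical, the default weights are arbitrary, and the smoothing coefficient is applied to the current score (the common EWMA convention), which the text above does not confirm.

```python
def initial_risk(ip_score, time_score, success_score, alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted sum of the three component scores; the weights must sum to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * ip_score + beta * time_score + gamma * success_score


def update_risk(previous_risk, current_risk, smoothing=0.3):
    """EWMA update: the smoothing coefficient weights the current request's score."""
    if not 0.0 < smoothing < 1.0:
        raise ValueError("smoothing must lie strictly between 0 and 1")
    return smoothing * current_risk + (1.0 - smoothing) * previous_risk


def score_stream(component_scores, smoothing=0.3, start=0.0):
    """Fold a sequence of (ip, time, success) score triples into running risk scores."""
    risk = start
    history = []
    for ip_score, time_score, success_score in component_scores:
        risk = update_risk(risk, initial_risk(ip_score, time_score, success_score), smoothing)
        history.append(risk)
    return history
```

A larger smoothing coefficient makes the running score react faster to the current login request, while a smaller one lets the historical risk state dominate.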

Description

System for detecting password guess attack

Technical Field

The invention relates to the field of network security, in particular to a system for detecting password guessing attacks.

Background

Password guessing attacks are a major problem in information security. With the spread of the internet and the rapid development of information technology, password authentication systems have become key to protecting user privacy and data security. However, password guessing attacks, particularly distributed low-speed attacks and complex high-frequency attacks, remain one of the main threats facing password authentication systems today. By repeatedly trying different passwords, an attacker attempts to gain unauthorized access, which not only endangers the user's privacy but can also cause serious financial and reputational damage.

Existing research has focused mainly on traditional rule-based methods and machine learning techniques for detecting and defending against password guessing attacks. For example, the paper "Araña: Discovering and Characterizing Password Guessing Attacks in Practice" discovers and describes the patterns and features of various password guessing attacks by analyzing real login request data. Although these approaches improve detection efficiency and accuracy to some extent, many challenges remain in the face of distributed low-speed attacks and highly complex attack patterns. For example, existing methods have difficulty detecting low-frequency, multi-source attacks in real time and are computationally expensive on large-scale data.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides a system for detecting password guessing attacks. First, the improved multivariate time series analysis model identifies abnormal login behavior by capturing how login requests change over time. Second, the clustering-based anomaly detection algorithm uses cluster analysis to distinguish normal login behavior from abnormal login behavior. Third, the graph-theory-based attack path analysis reveals the behavior pattern and strategy of the attacker by constructing and analyzing the graph structure of the login requests. Finally, the dynamic risk scoring system combines multiple factors to update and evaluate the risk of each login request in real time.

To solve the above technical problems, the technical solution of the invention is a system for detecting password guessing attacks, comprising a data collection and preprocessing module, a time sequence analysis module, a cluster analysis module, an attack path analysis module and a dynamic risk scoring module. The data collection and preprocessing module is used for data collection, data preprocessing and data distribution. The time sequence analysis module is used for analyzing time series data with an improved multivariate time series model and identifying abnormal login behavior, so that the system, by means of the improved multivariate time series analysis model, achieves multi-dimensional collaborative detection across the time dimension, the frequency dimension and the association dimension of login behavior. The cluster analysis module is used for identifying user behavior patterns, grouping large volumes of login data through cluster analysis, and distinguishing normal login behavior from potentially abnormal behavior. The attack path analysis module is used for identifying possible attack paths by monitoring and analyzing login behavior in the system and tracing an attacker's movement through the system, applying graph theory and path analysis to user behavior data to construct an attack path model and detect potential attacks through algorithmic analysis.
The data collection collects all login requests from server logs, network traffic monitoring and other sources, each login request including an IP address, a user name and a timestamp. The data preprocessing divides the raw data by time window to generate time series data, calculates the login frequency vector of each IP address within one day, and finally generates the node and edge information required by the login graph structure. The data distribution sends the time series data to the time sequence analysis module, the login frequency vectors to the cluster analysis module, and the node and edge information to the attack path analysis module. The time sequence analysis module comprises the construction of time series data and the definition of a multivariate time series model; 2.1 The time series data is constructed by dividing the collected login request data according to time windows to form a plurality of time series, wherein each time series represents the login request condition in a certain ti