CN-122019611-A - Electronic data evidence obtaining system and method based on big data analysis
Abstract
The invention discloses an electronic data evidence obtaining system and method based on big data analysis, and relates to the technical field of electronic data evidence obtaining. The method comprises the following steps: collecting current case information and dividing case types; counting the evidence obtaining keywords of different types of historical cases to construct a historical word stock; obtaining basic keywords based on the current case type; reading the electronic data of the current user equipment and obtaining evidence from the relevant electronic data under the basic keywords; screening similar historical cases based on user operation behaviors, case time rules and the case similarity between the current case and historical cases; and, in combination with the evidence obtaining keywords of recent cases, obtaining candidate words through association rules and vector reasoning to obtain an innovation keyword set. By performing expanded secondary evidence obtaining on dynamic keywords derived from big data analysis and marking key evidence, the invention enhances the practicability and functionality of electronic data evidence obtaining.
Inventors
- YE WENSHAN
- ZHAO MIN
- ZHAO JING
- WANG RONGYAN
Assignees
- 杭州太斗科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260129
Claims (9)
- 1. An electronic data evidence obtaining method based on big data analysis, characterized by comprising the following steps: S1, collecting current case information, dividing case types, counting the evidence obtaining keywords of different types of historical cases to construct a historical word stock, obtaining basic keywords based on the current case type, reading the electronic data of the current user equipment, and obtaining evidence from the relevant electronic data under the basic keywords; S2, screening similar historical cases based on user operation behaviors, case time rules and the case similarity between the current case and historical cases, and, in combination with the evidence obtaining keywords of recent cases, obtaining candidate words through association rules and vector reasoning to obtain an innovation keyword set; and S3, performing secondary evidence obtaining based on the obtained innovation keyword set in combination with the electronic data content of the user equipment, integrating the electronic data evidence under the basic keywords with the electronic data evidence under the innovation keywords, identifying suspicious time periods and suspicious evidence in the current user's evidence, and marking them.
- 2. The electronic data evidence obtaining method based on big data analysis according to claim 1, wherein S1 comprises the following steps: S11, collecting basic information of the current case, including the case name, the persons involved and the case description, classifying the case into a predefined type by using a text classification model, and outputting a case type label, specifically comprising: preprocessing the basic information of the current case, including segmenting the case description with jieba word segmentation, filtering meaningless words based on a stop word list, and normalizing word forms through a domain dictionary; inputting the preprocessed text into a BERT model pre-trained on a large-scale general corpus and a legal corpus to obtain context-dependent vectors; collecting labeled historical case data and training a neural network classifier to obtain a classification model, wherein the historical case data comprises historical case texts, type labels and feature vectors; and inputting the feature vector of the current case into the classification model to obtain the current case type label; S12, collecting the evidence obtaining keyword lists of different historical cases, and calculating an importance score of each word w for a specific case type C, wherein the score is computed from the frequency of occurrence of the word w among all case keywords of type C, the total number N of case types, and the number CF(w) of case types containing the word w; building a keyword list for each case type in descending order of importance score to obtain the historical word stock; and extracting the Top-N words from the historical word stock based on the current case type label to serve as a basic keyword set; S13, according to the basic keyword set, reading the electronic data of the current user equipment, and capturing every data fragment, file or record in that electronic data which contains any one of the keywords.
- 3. The electronic data evidence obtaining method based on big data analysis according to claim 2, wherein S2 comprises the following steps: S21, screening historical cases based on the user operation behaviors, case time rules and case similarity extracted from the equipment of the current case to obtain a similar historical case set; S22, extracting all effective evidence obtaining keywords in the similar historical case set together with recent case keywords, obtaining candidate words through association rules and vector reasoning, and obtaining the innovation keywords.
- 4. The electronic data evidence obtaining method based on big data analysis according to claim 1, wherein S21 comprises the following steps: S211, performing background data capture on the user equipment of the current case and extracting feature vectors, wherein the feature vectors comprise a behavior pattern feature vector, a time rule feature vector and a case content feature vector, specifically comprising: classifying the user operation logs in the user equipment and defining the behavior states of the user to determine the set of all unique operation types, where m is the number of operation types; for each pair of consecutive operations in the operation sequence, counting the number of transitions from the first operation to the second, and calculating the transition probabilities to obtain a Markov transition probability matrix; meanwhile, dividing the day into 24 hourly intervals, mapping the timestamp of every operation to its interval, and counting the operations in each hour to calculate the current user's activity (liveness) distribution, that is, the proportion of operation events in each hour to the total number of operation events, to obtain an activity distribution vector A; counting the proportion of each type of operation to obtain an operation type distribution vector; and concatenating these to obtain the behavior pattern feature vector; extracting electronic data from the user equipment, collecting the timestamp sequence of all behaviors, performing a Fourier transform on the time density function of the behaviors to obtain spectral coefficients, simultaneously calculating the mean, standard deviation, skewness and kurtosis of the time intervals between consecutive behaviors, and concatenating these to form the time rule feature vector; for the case description text of the current case, taking the output vector at the [CLS] position of the BERT model as the semantic representation vector of the whole case text, and simultaneously recording a keyword set; S212, calculating the comprehensive similarity between the current case and each historical case, wherein the behavior pattern similarity combines three terms computed between the current case C and the i-th historical case, namely the similarity of the activity distributions, the negative exponential of the KL divergence between the operation type distributions, and the matrix similarity, using weight coefficients whose sum is 1; the time correlation degree is obtained from the cosine similarity of the Fourier coefficients and the similarity of the statistical features of the current case and the i-th historical case; and the content relevance is obtained from the semantic representation vector of the case text of the current case, the semantic representation vector of the case text of the i-th historical case, and the keyword sets of the current case and of the i-th historical case; S213, combining the content relevance, the behavior pattern similarity and the time correlation degree, and calculating the comprehensive similarity by weighting with a behavior pattern similarity weight, a time correlation degree weight and a content relevance weight; screening similar cases based on the comprehensive similarity, arranging the cases whose comprehensive similarity is greater than a similarity threshold in descending order, and retaining the Top-n cases as the similar historical case set.
- 5. The electronic data evidence obtaining method based on big data analysis according to claim 4, wherein S22 comprises the following steps: S221, extracting similar-case keywords from the keyword sets of the historical cases in the similar historical case set; simultaneously extracting recent-case keywords, using the current time, the time of occurrence of each historical case and a time window for recent cases; and constructing a comprehensive seed set from the similar-case keywords, the recent-case keywords and the basic keyword set; S222, constructing a transaction database from the keywords in the comprehensive seed set, calculating the support, confidence and lift through the Apriori algorithm, screening the rules that satisfy the minimum threshold conditions, namely a minimum support threshold, a minimum confidence threshold and a minimum lift threshold, and further extracting candidate words; S223, outputting a corresponding vector for each keyword k through a pre-trained word vector model for semantic reasoning, wherein a query vector is formed using a target semantic direction vector, a semantic direction vector to be avoided and a direction adjustment factor; calculating the top K words in the vocabulary that are most similar to the query vector; and taking the union of those words with the candidate words to obtain the innovation keyword set.
- 6. The electronic data evidence obtaining method based on big data analysis according to claim 5, wherein S3 comprises the following steps: S31, extracting relevant evidence from the electronic data content of the current user equipment through the keywords in the innovation keyword set, and obtaining a current equipment evidence data set in combination with the electronic data evidence under the basic keywords; S32, combining the electronic data evidence under the basic keywords with the electronic data evidence under the innovation keywords, summarizing the patterns in the data, and marking suspicious time periods and suspicious evidence.
- 7. The method for electronic data forensics based on big data analysis according to claim 6 wherein S31 comprises the steps of: S311, aiming at the innovation keyword set, reading the current user equipment electronic data, and capturing all data fragments, files or records containing any keyword in the innovation keyword set in the current user equipment electronic data to obtain an innovation keyword evidence set; S312, merging the evidence set captured through the basic keywords with the evidence set of the innovative keywords to form a total evidence data set.
- 8. The electronic data evidence obtaining method based on big data analysis according to claim 7, wherein S32 comprises the following step: S321, according to the obtained total evidence data set, extracting the timestamp of each piece of evidence and counting the average frequency of each type of behavior in different time periods; drawing a 24-hour activity (liveness) distribution map and determining the typical active time periods of the current user based on the 3-sigma criterion; identifying evidence that is densely concentrated in a non-active time period as suspicious evidence; and marking that time period in red in the total evidence data set as a suspicious time period.
- 9. An electronic data evidence obtaining system based on big data analysis, characterized in that the system adopts the electronic data evidence obtaining method based on big data analysis according to any one of claims 1-8, and comprises a data acquisition and basic keyword matching module, an innovation keyword generation module, and a secondary evidence obtaining and suspicious marking module: the data acquisition and basic keyword matching module is used for collecting current case information, dividing case types, counting the evidence obtaining keywords of different historical cases to construct a historical word stock, obtaining basic keywords based on the current case type, reading the electronic data of the current user equipment, and obtaining evidence from the relevant electronic data under the basic keywords; the innovation keyword generation module is used for screening historical cases based on the user operation behaviors, case time rules and case similarity of the current case user to obtain a similar historical case set, extracting all effective evidence obtaining keywords in the similar historical case set together with recent case keywords, obtaining candidate words through association rules and vector reasoning, and obtaining an innovation keyword set; the secondary evidence obtaining and suspicious marking module is used for extracting relevant evidence from the electronic data content of the current user equipment through the keywords in the innovation keyword set, obtaining a current equipment evidence data set in combination with the electronic data evidence under the basic keywords, and, by combining the electronic data evidence under the basic keywords with that under the innovation keywords and summarizing the patterns in the data, marking suspicious time periods and suspicious evidence.
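The behavior pattern features recited in claim 4 (step S211) can be illustrated with a minimal Python sketch. It builds a row-normalized Markov transition matrix from consecutive operations, a 24-bin hourly activity distribution, and an operation type distribution, then concatenates them. The function names, the alphabetical ordering of operation types, and the concatenation order are assumptions made for illustration only; the claim does not specify them.

```python
from collections import Counter
from datetime import datetime
from typing import List, Sequence, Tuple


def transition_matrix(ops: Sequence[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the operation types and a row-normalized transition matrix.

    Entry [i][j] estimates the probability that operation type i is
    immediately followed by operation type j in the log.
    """
    types = sorted(set(ops))  # assumption: types ordered alphabetically
    index = {op: i for i, op in enumerate(types)}
    m = len(types)
    counts = [[0] * m for _ in range(m)]
    # Count each pair of consecutive operations.
    for current, following in zip(ops, ops[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A type with no outgoing transition keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return types, matrix


def hourly_activity(timestamps: Sequence[datetime]) -> List[float]:
    """Proportion of operation events falling in each of the 24 hours."""
    counts = Counter(ts.hour for ts in timestamps)
    total = len(timestamps)
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]


def operation_distribution(ops: Sequence[str], types: Sequence[str]) -> List[float]:
    """Proportion of each operation type among all operations."""
    counts = Counter(ops)
    total = len(ops)
    return [counts[t] / total if total else 0.0 for t in types]


def pattern_feature_vector(ops: Sequence[str],
                           timestamps: Sequence[datetime]) -> List[float]:
    """Concatenate flattened matrix, activity vector and type distribution.

    The order of concatenation is an assumption of this sketch.
    """
    types, matrix = transition_matrix(ops)
    flattened = [p for row in matrix for p in row]
    return flattened + hourly_activity(timestamps) + operation_distribution(ops, types)
```

For a log with m distinct operation types, the resulting vector has m × m + 24 + m entries, so vectors from two cases are directly comparable only when both are built over the same set of operation types.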
Description
Electronic data evidence obtaining system and method based on big data analysis
Technical Field
The invention relates to the technical field of electronic data evidence obtaining, and in particular to an electronic data evidence obtaining system and method based on big data analysis.
Background
In the digital age, electronic data has become core evidence in case investigation and judicial adjudication, and the comprehensiveness, accuracy and timeliness of electronic data evidence obtaining directly affect case-handling quality. With the rapid development of network technology, case types are becoming more complex and more numerous, and new kinds of network crime and cross-platform crime emerge continually.
Existing electronic data evidence obtaining relies on a fixed keyword library and a single evidence obtaining flow, and lacks deep mining and dynamic reuse of historical case data. Case handlers often set evidence obtaining keywords from experience, and evidence obtaining standards are not uniform across case types, so the process is insufficiently standardized and inefficient. Meanwhile, traditional evidence obtaining keyword libraries are updated slowly and are limited to the fixed vocabulary accumulated in historical cases, so they cannot adapt to the concealed and evolving keywords that appear in new kinds of cases. As a result, key electronic data is easily missed, the evidence obtaining scope is limited, and no varied evidence obtaining ideas are offered to inspire the user's approach to solving the case, which leads to low practicality and low functionality.
Disclosure of Invention
In view of the problems in the related art, the invention provides an electronic data evidence obtaining system and method based on big data analysis, so as to overcome the technical problems in the existing related art.
For this purpose, the invention adopts the following specific technical scheme: An electronic data evidence obtaining method based on big data analysis, the method comprising the following steps: S1, collecting current case information, dividing case types, counting the evidence obtaining keywords of different types of historical cases to construct a historical word stock, obtaining basic keywords based on the current case type, reading the electronic data of the current user equipment, and obtaining evidence from the relevant electronic data under the basic keywords; S2, screening similar historical cases based on user operation behaviors, case time rules and the case similarity between the current case and historical cases, and, in combination with the evidence obtaining keywords of recent cases, obtaining candidate words through association rules and vector reasoning to obtain an innovation keyword set; and S3, performing secondary evidence obtaining based on the obtained innovation keyword set in combination with the electronic data content of the user equipment, integrating the electronic data evidence under the basic keywords with the electronic data evidence under the innovation keywords, identifying suspicious time periods and suspicious evidence in the current user's evidence, and marking them.
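Step S1 can be illustrated with a minimal Python sketch that builds a per-type historical word stock, picks the basic keywords for a case type, and matches them against device records. The scoring form used here, frequency within the type multiplied by log(N / CF(w)), is an assumption in the style of TF-IDF suggested by the quantities the method names (word frequency within a type, the number N of case types, and the number CF(w) of types containing the word); the text here does not spell out the exact formula. All function names, the alphabetical tie-breaking, and the substring matching rule are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def build_word_stock(history: Dict[str, List[List[str]]]) -> Dict[str, List[str]]:
    """Rank keywords per case type by an assumed TF-IDF-style importance score.

    `history` maps a case type to the keyword lists of its historical cases.
    """
    n_types = len(history)
    tf: Dict[str, Dict[str, float]] = {}
    for case_type, cases in history.items():
        counts = Counter(word for keywords in cases for word in keywords)
        total = sum(counts.values())
        tf[case_type] = {w: c / total for w, c in counts.items()}
    # CF(w): number of case types whose keywords contain the word w.
    cf = Counter(word for freqs in tf.values() for word in freqs)
    stock: Dict[str, List[str]] = {}
    for case_type, freqs in tf.items():
        # Assumed form: score = TF(w, C) * log(N / CF(w)).
        scored = {w: f * math.log(n_types / cf[w]) for w, f in freqs.items()}
        # Descending score; ties broken alphabetically for determinism.
        stock[case_type] = sorted(scored, key=lambda w: (-scored[w], w))
    return stock


def basic_keywords(stock: Dict[str, List[str]], case_type: str, top_n: int) -> List[str]:
    """Top-N words of the word stock for the given case type label."""
    return stock.get(case_type, [])[:top_n]


def match_records(records: List[str], keywords: List[str]) -> List[str]:
    """Keep every record that contains at least one keyword (substring match)."""
    return [r for r in records if any(k in r for k in keywords)]
```

Under this assumed form, a word that appears in every case type scores zero and sinks to the bottom of each list, so the basic keywords favor terms that distinguish one case type from the others.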
As a preferred embodiment, the step S1 includes the following steps: S11, collecting basic information of the current case, including the case name, the persons involved and the case description, classifying the case into a predefined type by using a text classification model, and outputting a case type label, specifically comprising: preprocessing the basic information of the current case, including segmenting the case description with jieba word segmentation, filtering meaningless words based on a stop word list, and normalizing word forms through a domain dictionary; inputting the preprocessed text into a BERT model pre-trained on a large-scale general corpus and a legal corpus to obtain context-dependent vectors; collecting labeled historical case data and training a neural network classifier to obtain a classification model, wherein the historical case data comprises historical case texts, type labels and feature vectors; and inputting the feature vector of the current case into the classification model to obtain the current case type label; S12, collecting the evidence obtaining keyword lists of different historical cases, and calculating an importance score of each word w for a specific case type C, wherein the score is computed from the frequency of occurrence of the word w among all case keywords of type C, the total number N of case types, and the number CF(w) of case types containing the word w; building a keyword list for each case type in descending order of importance score to obtain the historical word stock; and extracting the Top-N words from the historical word stock based on the current case type label to serve as a basic keyword set; S13, according to the basic keyword set