
CN-122020706-A - Method and device for extracting case-related document in electronic data


Abstract

A method for extracting case-related documents from electronic data comprises the following steps. First, content analysis and preprocessing enhancement are performed on a document in the electronic data to obtain the standardized text and structured basic information corresponding to the document. Second, basic text features and domain-specific features are extracted from the standardized text and structured basic information to form a multi-dimensional feature set of the document. Third, a multi-layer classification strategy is applied to the multi-dimensional feature set to judge whether the document is case-related and, if so, its specific case-related type. Finally, if the document is judged to be case-related, the structured extraction technique corresponding to its specific case-related type is used to extract the key case-related information from the document. The invention combines multiple technologies to achieve multi-source, multi-format data adaptation, accurate identification and classification, and structured information extraction, ensuring the efficiency, integrity, and validity of evidence collection.
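
The four stages named in the abstract can be pictured as a simple chain of functions. The sketch below is only an illustration under stated assumptions, not the patented implementation: the `Document` class, the keyword lists, the `min_hits` cut-off, and the amount pattern are all hypothetical stand-ins for the parsers, learned features, classifiers, and extractors the patent describes.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional

# Zero-width characters sometimes used to obfuscate text (illustrative subset).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical keyword lists standing in for the domain-specific features.
DOMAIN_KEYWORDS = {
    "fund_flow": ("transfer", "account", "balance"),
    "contract": ("party a", "party b", "clause"),
}

# Hypothetical pattern: an amount preceded by a currency marker.
AMOUNT_RE = re.compile(r"(?:cny|rmb)\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


@dataclass
class Document:
    raw: str
    text: str = ""
    features: dict = field(default_factory=dict)
    case_type: Optional[str] = None
    extracted: dict = field(default_factory=dict)


def preprocess(doc):
    """Stage 1: normalise the text and strip zero-width characters."""
    normalised = unicodedata.normalize("NFKC", doc.raw)
    doc.text = normalised.translate(ZERO_WIDTH).lower()
    return doc


def extract_features(doc):
    """Stage 2: count keyword hits per hypothetical case type."""
    doc.features = {
        name: sum(doc.text.count(k) for k in keys)
        for name, keys in DOMAIN_KEYWORDS.items()
    }
    return doc


def classify(doc, min_hits=2):
    """Stage 3: keep the best-scoring type, or None for an ordinary document."""
    best = max(doc.features, key=doc.features.get)
    doc.case_type = best if doc.features[best] >= min_hits else None
    return doc


def extract(doc):
    """Stage 4: type-specific extraction (only amounts for fund_flow here)."""
    if doc.case_type == "fund_flow":
        doc.extracted["amounts"] = AMOUNT_RE.findall(doc.text)
    return doc


def run_pipeline(raw):
    doc = Document(raw=raw)
    for stage in (preprocess, extract_features, classify, extract):
        doc = stage(doc)
    return doc
```

Calling `run_pipeline` on a string returns a `Document` whose `case_type` is `None` for an ordinary document. Keeping each stage as a separate function mirrors the module split in the device claim and lets any toy stage be swapped for a real component.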

Inventors

  • LUO FENG
  • CHEN JUNSHAN
  • SU ZAITIAN
  • HUANG LONG
  • LI KUN

Assignees

  • 厦门市美亚柏科信息安全研究所有限公司

Dates

Publication Date
20260512
Application Date
20260106

Claims (10)

  1. A method for extracting case-related documents from electronic data, characterized by comprising the following steps: performing content analysis and preprocessing enhancement on a document in the electronic data to obtain standardized text and structured basic information corresponding to the document; extracting basic text features and domain-specific features based on the obtained standardized text and structured basic information to form a multi-dimensional feature set of the document; applying a multi-layer classification strategy to the multi-dimensional feature set to judge whether the document is a case-related document and, if so, its specific case-related type; and, if the document is judged to be a case-related document, applying the structured extraction technique corresponding to its specific case-related type to extract key case-related information from the document.
  2. The method for extracting case-related documents from electronic data according to claim 1, wherein the content analysis extracts and converts the core content and associated information of the document in the electronic data by combining analysis tools with custom analysis logic.
  3. The method for extracting case-related documents from electronic data according to claim 1, wherein the preprocessing enhancement optimizes the document after content analysis and comprises: attempting automatic password cracking on encrypted documents; repairing damaged documents; performing forced restoration, or identification and removal, on obfuscated documents containing hidden characters, white fonts, or zero-width characters; and recursively decompressing compressed archives and scanning their contents.
  4. The method for extracting case-related documents from electronic data according to claim 1, wherein the standardized text is plain text data in a uniform format that can be used directly for text analysis, comprising the core text obtained by analyzing and converting the various documents and the visible text obtained after preprocessing enhancement; and the structured basic information is non-plain-text information with a fixed format or logical structure that is extracted during the analysis process, comprising table row and column layout, paragraph division, metadata, content classification labels, and the basic data structure of the document.
  5. The method for extracting case-related documents from electronic data according to claim 1, wherein the basic text features comprise TF-IDF-based word frequency features, semantic vector features based on a pre-trained language model, and syntactic structure features; and the domain-specific features comprise fund features, fraud-script features, report features, customs declaration features, and contract features.
  6. The method for extracting case-related documents from electronic data according to claim 1, wherein: based on the multi-dimensional feature set, a BERT + BiLSTM deep learning model performs a preliminary classification of the document as an ordinary document or a case-related document; and a plurality of dedicated type discriminators, combined with differentiated domain knowledge rules designed for the characteristics of each document type, perform weighted fusion scoring of the general attributes and the corresponding type-specific attributes to obtain a comprehensive score for each type, and the specific case-related type of the document is determined according to the maximum comprehensive score and a preset threshold.
  7. The method of claim 6, wherein the general attributes include document length, number of paragraphs, average sentence length, TF-IDF vector, and BERT semantic vector; the type-specific attributes are the core characterization parameters of the fund features, fraud-script features, report features, customs declaration features, and contract features; there are five dedicated type discriminators, corresponding one-to-one to the five case-related types of fund flow records, fraud scripts, business reports, customs declarations, and contracts; each dedicated type discriminator performs a weighted fusion calculation based on the type-specific attributes of its type and the general attributes, combined with the differentiated domain knowledge rules, and outputs a comprehensive score for that type; and the type with the highest comprehensive score that also exceeds the preset threshold is selected as the specific case-related type of the document.
  8. The method for extracting case-related documents from electronic data according to claim 1, wherein applying the structured extraction technique corresponding to the specific case-related type to extract key case-related information in the document from the multi-dimensional feature set specifically comprises: for a fund flow document, identifying key transaction fields with a conditional random field sequence labeling model and constructing a fund flow diagram after amount logic verification; for a fraud-script document, identifying inducing language with a hybrid rule-based and deep learning method, classifying the type of fraud script, and tracking how the script changes; for a business report document, recognizing the table structure through OCR, associating data across multiple reports, and detecting numerical anomalies; for a customs declaration document, recognizing the form through OCR, extracting key information with a large model, and comparing and verifying it against a customs database; and for a contract document, extracting the content with Apache Tika, extracting core terms with a large model, comparing them against standard templates, and verifying the validity of the electronic signature.
  9. The method for extracting case-related documents from electronic data according to claim 1, characterized by further comprising result integration, which comprises: associating and integrating case-related documents from different data sources; providing multiple visualization modes, including at least a fund flow diagram, a fraud-script heat map, and a data trend chart, to display the extraction results; providing targeted case analysis suggestions and investigative breakthrough points for each case-related type based on the extracted case-related documents; and recording the original location and extraction path of each case-related document so that the forensic process can be verified.
  10. A device for extracting case-related documents from electronic data, characterized by comprising: a document preprocessing module for performing content analysis and preprocessing enhancement on a document in the electronic data to obtain standardized text and structured basic information corresponding to the document; a feature extraction module for extracting basic text features and domain-specific features based on the standardized text and structured basic information of the document to form a multi-dimensional feature set of the document; a type identification module for applying a multi-layer classification strategy to the multi-dimensional feature set to judge whether the document is a case-related document and, if so, its specific case-related type; and a structured extraction module for applying, when the document is judged to be a case-related document, the structured extraction technique corresponding to its specific case-related type to extract key case-related information from the document.
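
The weighted fusion scoring of claims 6 and 7 can be illustrated with a short sketch. The claims do not give the fusion formula, so everything concrete here is an assumption: the linear form `w_general * general + w_specific * specific + rule(doc)`, the clipping to [0, 1], the weights, the threshold value, and the `has_signature` rule are hypothetical. Scalar scores stand in for the attribute vectors the claims list.

```python
from dataclasses import dataclass
from typing import Callable

CASE_TYPES = (
    "fund_flow",
    "fraud_script",
    "business_report",
    "customs_declaration",
    "contract",
)


@dataclass(frozen=True)
class Discriminator:
    """One dedicated discriminator per case-related type (hypothetical form)."""
    case_type: str
    w_general: float               # weight of the general-attribute score
    w_specific: float              # weight of the type-specific score
    rule: Callable[[dict], float]  # domain rule returning a bonus or penalty


def fused_score(disc, general, specific, doc):
    """Weighted fusion plus a rule adjustment, clipped to [0, 1]."""
    score = (
        disc.w_general * general
        + disc.w_specific * specific.get(disc.case_type, 0.0)
        + disc.rule(doc)
    )
    return max(0.0, min(1.0, score))


def pick_case_type(discriminators, general, specific, doc, threshold=0.6):
    """Return (type or None, all scores); the winner must exceed the threshold."""
    scores = {
        d.case_type: fused_score(d, general, specific, doc)
        for d in discriminators
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores


def no_rule(doc):
    return 0.0


def signature_rule(doc):
    # Hypothetical domain rule: a detected signature block favours "contract".
    return 0.1 if doc.get("has_signature") else 0.0


DISCRIMINATORS = [
    Discriminator(t, 0.3, 0.7, signature_rule if t == "contract" else no_rule)
    for t in CASE_TYPES
]
```

For example, `pick_case_type(DISCRIMINATORS, 0.5, {"contract": 0.8}, {"has_signature": True})` selects `"contract"`, while a document whose type-specific scores are all zero falls below the threshold and yields `None`. Returning the full score dictionary alongside the decision keeps the classification auditable, which suits a forensic setting.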

Description

Method and device for extracting case-related document in electronic data

Technical Field

The invention relates to the field of electronic data forensics, and in particular to a method and a device for extracting case-related documents from electronic data.

Background

The invention belongs to the technical field of electronic forensics and is suited to judicial evidence collection in criminal cases such as economic crime, fraud, and smuggling. In recent years, cybercrime and economic crime have grown increasingly complex, electronic forensics has become increasingly important in judicial practice, and case-related documents (including fund flow records, fraud scripts, business reports, customs declarations, and contracts or agreements) are key evidence for establishing the facts of a case. However, the extraction of case-related documents still faces several technical bottlenecks. Traditional methods generally rely on keyword search and simple rule matching: for example, fund flow documents are extracted merely by identifying keywords such as "amount", "account", and "transfer", and contract-type documents are screened merely by features such as "Party A", "Party B", and "clause". Such approaches struggle to meet the accuracy requirements of judicial forensics.
Traditional methods have four obvious limitations. First, case-related documents are expressed in many different ways, and wording differs widely between cases and between parties, so simple keyword matching readily produces large numbers of false positives and false negatives, and accuracy is hard to guarantee. Second, format support is narrow: for case-related documents in non-text formats such as images and scanned PDFs, traditional text analysis cannot effectively recognize the text they contain, so key evidence is missed. Third, coverage of veiled wording is insufficient: documents such as fraud scripts often use industry jargon and other veiled expressions to evade screening, conventional keywords can hardly cover them comprehensively, and the ability to recognize new criminal scripts is extremely weak. Fourth, structural adaptability is poor: documents such as customs declarations and contracts have specific fields and format requirements, a single rule cannot adapt to structured documents of different layouts, and it is difficult to extract standardized structured information from them, which hinders subsequent case analysis.

In summary, the core technical problem to be solved in this field is how to overcome the limitations of traditional methods, namely false positives caused by keyword matching, difficulty in processing non-text formats, insufficient recognition of veiled expressions and new scripts, and poor adaptability to structured documents, so as to extract case-related documents from massive electronic data quickly and accurately and to output structured information.
Disclosure of Invention

The main purpose of the invention is to overcome the defects of the prior art in extracting case-related documents from electronic data, namely low extraction efficiency, low accuracy, difficulty in handling special documents, and insufficient adaptability to different document types, and to provide a method and a device for extracting case-related documents from electronic data that combine multiple technologies to achieve multi-source, multi-format data adaptation, accurate identification and classification, and structured information extraction, ensuring the efficiency, integrity, and validity of evidence collection.

The invention adopts the following technical scheme:

A method for extracting case-related documents from electronic data comprises the following steps:

performing content analysis and preprocessing enhancement on a document in the electronic data to obtain standardized text and structured basic information corresponding to the document;

extracting basic text features and domain-specific features based on the obtained standardized text and structured basic information to form a multi-dimensional feature set of the document;

applying a multi-layer classification strategy to the multi-dimensional feature set to judge whether the document is a case-related document and, if so, its specific case-related type;

if the document is judged to be a case-related document, applying the structured extraction technique corresponding to its specific case-related type to extract key case-related information from the document.

The content analysis refers to extracting and converting the core content and associated information of the document in the electronic data by combining an