CN-121834311-B - Simulation enhancement-based few-sample sea area illegal event feature extraction method


Abstract

The application provides a simulation-enhancement-based method for extracting features of few-sample sea area illegal events, and relates to the field of data processing. The method targets sea area illegal cases in which text samples are scarce and the cases span multiple heterogeneous causes of action. A normalized feature system is constructed to unify the expression of business features across different causes of action. On the basis of a small-scale labeled sample set, a text-span-based feature extraction framework is introduced to stably identify nested and overlapping features. A resource-aware robustness evaluation strategy is then used to precisely locate the features whose real performance is weakest, and simulation enhancement is applied to those features in a directed manner while the feature system is kept unchanged. Simulation samples and manually labeled samples are safely fused through a type divide-and-conquer and boundary arbitration mechanism, and the model is updated under deterministic training constraints, thereby achieving high-precision, stable and automatic extraction of sea area illegal event features under low-resource conditions. Implementing this technical scheme improves the robustness of few-sample sea area illegal event feature extraction.

Inventors

YU HONGCHU
XU XIAOHAN

Assignees

Wuhan University of Technology (武汉理工大学)

Dates

Publication Date: 20260508
Application Date: 20260311

Claims (9)

1. A simulation-enhancement-based few-sample sea area illegal event feature extraction method, characterized by comprising the following steps:
acquiring case texts of multiple types of sea area illegal cases, and performing semantic merging and legal attribute verification on business-oriented candidate features in the case texts to construct a normalized feature system;
selecting a labeled sample set from the case texts based on the normalized feature system, and performing structure reconstruction and unified coding on the labeled sample set to generate a training sample set;
introducing a text-span-based feature extraction framework on the basis of the training sample set, and performing feature extraction on the case texts to obtain a feature extraction result containing a plurality of normalized features;
characterizing, according to the feature extraction result, the performance stability of each normalized feature under different data partitioning conditions by introducing a resource-aware robustness evaluation strategy, and determining an enhancement priority of each normalized feature in combination with the sample distribution information of that feature in the training sample set, so as to determine the features to be enhanced according to the enhancement priority;
constructing an evaluation data pool while keeping the normalized feature system and the training sample set unchanged, and binding a sample identifier, a case identifier and a normalized feature labeling result to each sample in the evaluation data pool;
performing multiple rounds of data partitioning on the evaluation data pool to generate a plurality of groups of data partitioning conditions by dimension, performing feature extraction on the corresponding evaluation subset with the feature extraction framework under each data partitioning condition, and aligning the obtained feature extraction results with the ground-truth labeling results feature by feature, so as to obtain performance statistics of every normalized feature under the different data partitioning conditions;
aggregating the performance statistics of each normalized feature over the plurality of groups of data partitioning conditions, and characterizing the performance stability of the normalized feature on that basis;
obtaining sample distribution information of each normalized feature in the training sample set, and jointly analyzing the sample distribution information and the performance stability to form a resource-aware description, wherein the sample distribution information comprises the number of samples and the cross-case coverage;
ranking the normalized features by enhancement priority based on the resource-aware description, generating an enhancement priority list according to a combined ordering of performance stability and sample distribution sparsity, and determining the normalized features located within a preset number of top positions of the enhancement priority list as the features to be enhanced;
for the features to be enhanced, generating simulated case texts with a simulation enhancement large model while keeping the normalized feature system unchanged, and performing adaptive evaluation and screening on the simulated case texts based on the labeled sample set to obtain a simulation sample set;
and introducing a type divide-and-conquer and boundary arbitration mechanism between the simulation sample set and the labeled sample set to construct a joint training sample set, and updating the simulation enhancement large model through the joint training sample set under the constraints of a fixed model structure, a fixed hyper-parameter configuration and a deterministic training strategy, so as to output a structured sea area illegal event feature result.
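The robustness evaluation and priority ranking steps of claim 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the per-partition performance statistic is an F1 score, that instability is measured by the population standard deviation, and that the three terms are summed with equal weight; the function name `enhancement_priority`, the feature names and all numbers are hypothetical and are not taken from the application.

```python
from statistics import mean, pstdev

def enhancement_priority(f1_by_feature, sample_counts, case_coverage, top_k=1):
    """Rank features so that weak, unstable and sparsely sampled ones come first.

    f1_by_feature: feature -> list of F1 scores, one per data partition (assumed metric)
    sample_counts: feature -> number of training samples containing the feature
    case_coverage: feature -> number of distinct case types in which it appears
    """
    max_count = max(sample_counts.values())
    max_cov = max(case_coverage.values())
    scores = {}
    for feat, f1s in f1_by_feature.items():
        weakness = 1.0 - mean(f1s)      # low average performance
        instability = pstdev(f1s)       # spread across data partitions
        sparsity = 1.0 - 0.5 * (sample_counts[feat] / max_count
                                + case_coverage[feat] / max_cov)
        scores[feat] = weakness + instability + sparsity  # equal weights (assumption)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, ranked[:top_k]

# Hypothetical evaluation results over three data partitions.
f1_by_feature = {
    "vessel": [0.92, 0.90, 0.91],
    "location": [0.85, 0.80, 0.88],
    "operation_mode": [0.40, 0.65, 0.30],
}
sample_counts = {"vessel": 120, "location": 90, "operation_mode": 12}
case_coverage = {"vessel": 4, "location": 4, "operation_mode": 1}

ranked, to_enhance = enhancement_priority(
    f1_by_feature, sample_counts, case_coverage, top_k=1)
```

In this toy data, `operation_mode` is both the least stable and the most sparsely sampled feature, so it ends up at the head of the list and is the only feature selected when the preset number of positions is 1.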
2. The simulation-enhancement-based few-sample sea area illegal event feature extraction method according to claim 1, wherein acquiring case texts of multiple types of sea area illegal cases, and performing semantic merging and legal attribute verification on business-oriented candidate features in the case texts to construct a normalized feature system, specifically comprises the following steps:
determining a cause-of-action set formed by multiple types of sea area illegal causes of action, binding a unique cause-of-action identifier to each cause of action in the set, acquiring, based on the cause-of-action identifiers, case texts consistently associated with those identifiers from a case management system, a law enforcement document system and a case entry system, and binding a text identifier, a cause-of-action identifier, a generation time identifier and a source identifier to each case text to form a case text library;
performing character normalization, noise segment removal and sentence segmentation on the case texts in the case text library to generate normalized case texts bound with sentence segment identifiers and position identifiers;
on the basis of the normalized case texts, performing word segmentation, part-of-speech tagging and stop-word filtering on the sentence segment sequence, and performing statistical significance analysis on the resulting tokens separately for each cause of action to form a preliminary candidate feature set;
performing synonym merging, entity form merging and cross-cause-of-action co-occurrence aggregation on the preliminary candidate feature set to generate a candidate feature context profile comprising a co-occurring word set, a dependency relation fragment set and a trigger phrase set;
under the constraint of a pre-established top-level feature set, performing semantic discrimination on each candidate feature in combination with the candidate feature context profile to determine the mapping relation between the candidate feature and the top-level features, and forming, according to the mapping relation, a draft normalized feature system covering multiple causes of action;
and performing legal attribute verification on the normalized features in the draft normalized feature system based on a legal element set, and, by verifying the consistency among the normalized features, the legal elements and the applicable causes of action, correcting the mapping relation and eliminating same-name-different-meaning conflicts and different-name-same-meaning redundancies, so as to form a normalized feature system comprising feature identifiers, feature definitions, applicable cause-of-action sets, trigger phrase sets and negative-example constraint sets.
3. The simulation-enhancement-based few-sample sea area illegal event feature extraction method according to claim 1, wherein selecting a labeled sample set from the case texts based on the normalized feature system, and performing structure reconstruction and unified coding on the labeled sample set to generate a training sample set, specifically comprises the following steps:
grouping the case texts by cause of action, and counting, within each group, the occurrence frequency and the number of covered sentence segments of each normalized feature in the case texts to form a feature distribution profile along the cause-of-action dimension;
under the constraint of the feature distribution profile, selecting from each cause-of-action group case texts that simultaneously contain a plurality of normalized features and are semantically complete as candidate labeling texts, and keeping the number of candidate labeling texts consistent across the groups, so as to construct a labeled sample set balanced across causes of action;
performing fine-grained labeling on the labeled sample set based on the normalized feature system, using the normalized feature name as the unique label, performing span-level labeling on text fragments in the case text that are semantically consistent with the normalized feature name, and binding a text identifier, a normalized feature name, a start character position and an end character position to each labeling result;
performing structure reconstruction on the labeled samples, uniformly reconstructing the labeling results into a flattened sample structure comprising a text identifier, a text content field and a feature list field formed by a plurality of feature objects, wherein each feature object comprises a normalized feature name, a start character position and an end character position;
on the basis of the flattened sample structure, performing unified coding on the labeled sample set, generating for each labeled sample a globally unique sample identifier containing the cause-of-action identifier and the original text identifier, performing dictionary mapping on the normalized feature names to ensure that the same normalized feature is coded consistently in different samples, and checking the consistency between the feature span positions and the text content;
and, after the unified coding and consistency check are completed, partitioning the labeled sample set in a balanced manner according to the cause-of-action identifiers to generate a training sample set, a validation sample set and a test sample set.
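The flattened sample structure and the span consistency check described in claim 3 can be sketched as follows. This is an illustrative assumption rather than the applicant's data format: the field names (`sample_id`, `features`, `feature_id`), the exclusive end offset, the `cause:text` identifier layout and the example sentence are all hypothetical.

```python
def build_flat_sample(cause_id, text_id, text, annotations, label2id):
    """Build one flattened sample and verify every span against the text.

    annotations: list of (feature_name, start, end, surface) tuples, where
    [start, end) are character offsets (exclusive end is an assumed convention).
    """
    features = []
    for name, start, end, surface in annotations:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is out of range for {text_id}")
        if text[start:end] != surface:
            raise ValueError(f"span ({start}, {end}) does not match {surface!r}")
        features.append({
            "feature": name,
            "feature_id": label2id[name],  # dictionary mapping keeps codes consistent
            "start": start,
            "end": end,
        })
    return {
        "sample_id": f"{cause_id}:{text_id}",  # globally unique identifier (assumed layout)
        "text_id": text_id,
        "text": text,
        "features": features,
    }

# Hypothetical label dictionary and case text.
label2id = {"vessel": 0, "illegal_behavior": 1, "location": 2}
text = "Vessel Minhai 08 dredged sand near Dongshan Bay."

def locate(surface):
    """Return (start, end, surface) for the first occurrence of surface in text."""
    start = text.index(surface)
    return start, start + len(surface), surface

annotations = [
    ("vessel", *locate("Minhai 08")),
    ("illegal_behavior", *locate("dredged sand")),
]
sample = build_flat_sample("sand_mining", "T001", text, annotations, label2id)
```

Rejecting a sample as soon as one span fails the check keeps offset errors from reaching the training, validation and test partitions.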
4. The simulation-enhancement-based few-sample sea area illegal event feature extraction method according to claim 3, wherein introducing a text-span-based feature extraction framework on the basis of the training sample set, and performing feature extraction on the case texts to obtain a feature extraction result containing a plurality of normalized features, specifically comprises the following steps:
encoding the case text and the normalized feature names separately through a shared semantic encoder to obtain a context representation sequence of the case text and a set of normalized feature semantic prototypes;
enumerating candidate text spans in the context representation sequence based on a preset span length, and screening the candidate text spans by validity constraints to form a candidate text span set;
for each candidate text span in the candidate text span set, constructing a corresponding text span semantic representation from the context representation sequence, wherein the text span semantic representation comprises span-internal semantic information, span start boundary information, span end boundary information and span length information;
performing semantic matching between the text span semantic representation and the set of normalized feature semantic prototypes to obtain a matching result of the candidate text span over all normalized features, and determining, according to the matching result, the normalized feature name or the non-feature category corresponding to the candidate text span;
and performing aggregation and conflict resolution on the candidate text spans judged to be normalized features, generating a feature extraction result containing a plurality of normalized features while retaining overlapping text spans and nested text spans, and binding to each normalized feature in the feature extraction result a text identifier, a normalized feature name and text span position information consistent with the training sample set.
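The span enumeration and prototype matching steps of claim 4 can be sketched in a few lines. The sketch makes several simplifying assumptions that are not in the application: the preset span length is treated as a maximum token count, the validity constraint only rejects spans that begin or end with punctuation, and each span is represented by a ready-made toy vector instead of an encoder output. The names `enumerate_spans`, `classify_span` and the threshold value are hypothetical.

```python
import math

def enumerate_spans(tokens, max_len, invalid_edges=frozenset({",", ".", ";"})):
    """List (start, end) token spans of up to max_len tokens, end exclusive."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            # Validity constraint (assumed): no punctuation on either boundary.
            if tokens[start] in invalid_edges or tokens[end - 1] in invalid_edges:
                continue
            spans.append((start, end))
    return spans

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_span(span_vec, prototypes, threshold=0.8):
    """Return the best-matching feature name, or NON_FEATURE below the threshold."""
    best_name, best_score = "NON_FEATURE", threshold
    for name, proto in prototypes.items():
        score = cosine(span_vec, proto)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

tokens = ["vessel", "dredged", "sand", "."]
spans = enumerate_spans(tokens, max_len=2)

# Toy two-dimensional prototypes standing in for encoded feature names.
prototypes = {"vessel": [1.0, 0.0], "illegal_behavior": [0.0, 1.0]}
label_a = classify_span([1.0, 0.1], prototypes)
label_b = classify_span([1.0, 1.0], prototypes)
```

Because every span is classified independently, overlapping and nested spans such as `(1, 2)` and `(1, 3)` can both receive a feature label, which is the property that sequence labeling with one tag per token cannot offer.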
5. The simulation-enhancement-based few-sample sea area illegal event feature extraction method according to claim 1, wherein, for the features to be enhanced, generating simulated case texts with the simulation enhancement large model while keeping the normalized feature system unchanged, and performing adaptive evaluation and screening on the simulated case texts based on the labeled sample set to obtain a simulation sample set, comprises the following steps:
while keeping the normalized feature names, semantic definitions and applicable causes of action in the normalized feature system unchanged, constructing a simulation enhancement constraint set for each feature to be enhanced, wherein the simulation enhancement constraint set limits the semantic role, the context co-occurrence relations and the cause-of-action adaptation range of the feature to be enhanced in case texts;
extracting, from the labeled samples in the labeled sample set that are consistently associated with the feature to be enhanced and its applicable causes of action, the corresponding case texts and context fragments to construct a demonstration sample set, and inputting the demonstration sample set together with the simulation enhancement constraint set as generation constraints into the simulation enhancement large model to generate candidate simulated case texts containing the feature to be enhanced;
performing text normalization and feature positioning on the candidate simulated case texts, and removing candidate simulated case texts in which the text span of the feature to be enhanced cannot be located or which do not satisfy the semantic role constraint, to obtain first candidate simulated case texts;
constructing a real-distribution benchmark description based on the labeled sample set, comparing the text length characteristic, the normalized feature density characteristic and the semantic similarity characteristic of the first candidate simulated case texts with the real-distribution benchmark description, and eliminating the candidate simulated case texts that do not satisfy the real-distribution constraint, to obtain second candidate simulated case texts;
and performing adaptive multi-index comprehensive evaluation on the second candidate simulated case texts to determine the simulated case texts that meet the requirements of semantic consistency and feature coverage, and storing those simulated case texts together with the corresponding to-be-enhanced feature identifiers, cause-of-action identifiers and text span information to form the simulation sample set.
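The comparison against the real-distribution benchmark in claim 5 can be illustrated with a small filter. The sketch assumes that the benchmark is a mean and population standard deviation computed over the labeled samples, that a candidate is kept when it lies within `k` standard deviations, and that only text length and feature density are checked; the semantic similarity check is left out, and all names and numbers are hypothetical.

```python
from statistics import mean, pstdev

def benchmark(values):
    """Summarize a real-sample statistic as (mean, population std deviation)."""
    return mean(values), pstdev(values)

def within(value, bench, k=2.0):
    """True if value lies within k standard deviations of the benchmark mean."""
    mu, sigma = bench
    return abs(value - mu) <= k * sigma

def density(sample):
    """Number of labeled features per character of text."""
    return len(sample["features"]) / len(sample["text"])

def filter_candidates(candidates, labeled, k=2.0):
    """Keep simulated texts whose length and feature density resemble real data."""
    len_bench = benchmark([len(s["text"]) for s in labeled])
    den_bench = benchmark([density(s) for s in labeled])
    return [c for c in candidates
            if within(len(c["text"]), len_bench, k)
            and within(density(c), den_bench, k)]

def toy(sample_id, length, n_features):
    """Make a placeholder sample with the given text length and feature count."""
    return {"id": sample_id, "text": "x" * length, "features": [None] * n_features}

labeled = [toy("r1", 40, 2), toy("r2", 50, 2), toy("r3", 60, 3)]
candidates = [
    toy("A", 45, 2),    # close to the real samples
    toy("B", 200, 2),   # far too long
    toy("C", 50, 10),   # far too dense
]
kept = filter_candidates(candidates, labeled)
```

With only three real samples the benchmark is very narrow, so in practice the tolerance `k` would need tuning against the size of the labeled sample set.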
6. The simulation-enhancement-based few-sample sea area illegal event feature extraction method according to claim 1, wherein introducing a type divide-and-conquer and boundary arbitration mechanism between the simulation sample set and the labeled sample set to construct a joint training sample set, and updating the simulation enhancement large model through the joint training sample set under the constraints of a fixed model structure, a fixed hyper-parameter configuration and a deterministic training strategy to output a structured sea area illegal event feature result, specifically comprises:
dividing the normalized features into a to-be-enhanced feature set and a non-to-be-enhanced feature set, performing type divide-and-conquer processing on the normalized feature labels in the simulation sample set while keeping the normalized feature system unchanged, and retaining the simulation labels belonging to the to-be-enhanced feature set as target supervision information;
for the non-to-be-enhanced feature labels removed from the simulation sample set, generating predicted labels for the corresponding simulated case texts with the feature extraction framework, and taking the predicted labels whose normalized features belong to the non-to-be-enhanced feature set as candidate backfill labels;
performing boundary arbitration between the candidate backfill labels and the to-be-enhanced feature labels, and, when a candidate backfill label coincides exactly with any to-be-enhanced feature label in text span position, removing that candidate backfill label according to the principle that to-be-enhanced feature labels take priority;
grading the candidate backfill labels retained after boundary arbitration by prediction confidence into confident backfill labels, fuzzy backfill labels and rejected backfill labels, wherein the confident backfill labels participate in training as effective supervision, the rejected backfill labels are removed, and the fuzzy backfill labels are marked as gradient suppression positions;
combining the samples in the labeled sample set with the samples in the simulation sample set that have undergone type divide-and-conquer processing, boundary arbitration and confidence grading, to construct a joint training sample set containing source-based divide-and-conquer mask information;
and, under the constraints of a fixed model structure, a fixed hyper-parameter configuration and a deterministic training strategy, updating the simulation enhancement large model based on the joint training sample set, and outputting the structured sea area illegal event feature result, wherein noise supervision is suppressed by blocking gradient backpropagation at the fuzzy backfill label positions while ensuring normal backpropagation at the to-be-enhanced feature label positions and the confident backfill label positions.
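The boundary arbitration and confidence grading of claim 6 can be sketched as a single pass over the predicted backfill labels. The two confidence thresholds, the `loss_mask` field (1 for positions that contribute to the loss, 0 for positions whose gradient is blocked) and the example labels are assumptions made for illustration; the application does not state concrete values.

```python
def arbitrate_and_grade(target_labels, backfill_preds, hi=0.9, lo=0.5):
    """Drop backfill labels that collide with target labels, then grade the rest.

    target_labels: simulated labels of to-be-enhanced features (always kept)
    backfill_preds: predicted labels of other features, each with a confidence
    hi, lo: hypothetical thresholds separating confident / fuzzy / rejected
    """
    target_spans = {(t["start"], t["end"]) for t in target_labels}
    graded = []
    for pred in backfill_preds:
        # Boundary arbitration: on an identical span the target label wins.
        if (pred["start"], pred["end"]) in target_spans:
            continue
        if pred["confidence"] >= hi:
            graded.append({**pred, "grade": "confident", "loss_mask": 1})
        elif pred["confidence"] >= lo:
            # Fuzzy labels stay in the sample but contribute no gradient.
            graded.append({**pred, "grade": "fuzzy", "loss_mask": 0})
        # Anything below lo is rejected and dropped.
    return graded

target_labels = [{"feature": "operation_mode", "start": 10, "end": 18}]
backfill_preds = [
    {"feature": "vessel", "start": 0, "end": 6, "confidence": 0.97},
    {"feature": "location", "start": 10, "end": 18, "confidence": 0.95},
    {"feature": "time", "start": 20, "end": 26, "confidence": 0.60},
    {"feature": "person", "start": 30, "end": 34, "confidence": 0.20},
]
graded = arbitrate_and_grade(target_labels, backfill_preds)
```

During training, a loss computed per span position could be multiplied by `loss_mask` so that fuzzy positions neither reward nor penalize the model, which is one common way to realize the gradient blocking the claim describes.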
7. A simulation-enhancement-based few-sample sea area illegal event feature extraction device, characterized in that the device is used for executing the simulation-enhancement-based few-sample sea area illegal event feature extraction method according to any one of claims 1 to 6, and the device comprises an acquisition module and a processing module, wherein:
the acquisition module is used for acquiring case texts of multiple types of sea area illegal cases, and performing semantic merging and legal attribute verification on business-oriented candidate features in the case texts to construct a normalized feature system;
the processing module is used for selecting a labeled sample set from the case texts based on the normalized feature system, and performing structure reconstruction and unified coding on the labeled sample set to generate a training sample set;
the processing module is further used for introducing a text-span-based feature extraction framework on the basis of the training sample set, and performing feature extraction on the case texts to obtain a feature extraction result containing a plurality of normalized features;
the processing module is further used for characterizing, according to the feature extraction result, the performance stability of each normalized feature under different data partitioning conditions by introducing a resource-aware robustness evaluation strategy, and determining an enhancement priority of each normalized feature in combination with the sample distribution information of that feature in the training sample set, so as to determine the features to be enhanced according to the enhancement priority;
the processing module is further used for generating, for the features to be enhanced, simulated case texts with a simulation enhancement large model while keeping the normalized feature system unchanged, and performing adaptive evaluation and screening on the simulated case texts based on the labeled sample set to obtain a simulation sample set;
and the processing module is further used for introducing a type divide-and-conquer and boundary arbitration mechanism between the simulation sample set and the labeled sample set to construct a joint training sample set, and updating the simulation enhancement large model through the joint training sample set under the constraints of a fixed model structure, a fixed hyper-parameter configuration and a deterministic training strategy, so as to output a structured sea area illegal event feature result.
8. An electronic device comprising a processor, a memory, a user interface and a network interface, wherein the memory is used for storing instructions, the user interface and the network interface are each used for communicating with other devices, and the processor is used for executing the instructions stored in the memory to cause the electronic device to perform the method of any one of claims 1 to 6.
9. A non-transitory computer readable storage medium storing instructions which, when executed, perform the method of any one of claims 1 to 6.

Description

Simulation enhancement-based few-sample sea area illegal event feature extraction method

Technical Field

The application relates to the technical field of data processing, and in particular to a simulation-enhancement-based method for extracting features of few-sample sea area illegal events.

Background

With the continuous advancement of comprehensive sea area governance, sea area illegal cases have become markedly more complex in number, case type and form of expression. Illegal sand mining, illegal fishing, illegal pollutant discharge, illegal dumping and other illegal behaviors are often interwoven in law enforcement practice, and case descriptions are mainly recorded as natural language text, which is characterized by free-form expression, loose structure, inconsistent wording and a large amount of implicit business semantics. This unstructured case text contains key information such as vessels, personnel, places, times, illegal behaviors, operation modes and case-related items, and is an important basis for case analysis, law enforcement assessment and intelligent supervision. In practical engineering applications, however, sea-related data is highly sensitive, costly to label and only sparsely available for publication, so the number of finely labeled samples available for model training is usually very small. The distribution of feature types across different illegal cases is clearly heterogeneous and long-tailed, and the same feature is expressed in noticeably different ways under different cases, so general-domain feature extraction models are difficult to reuse directly.

Existing text feature extraction methods based on sequence labeling or fixed classification heads generally depend on large-scale labeled data and are highly sensitive to sample size and data partitioning. Existing data enhancement approaches, meanwhile, rely on empirical rules or static indicators and cannot objectively identify which features are the real performance weak points; they readily introduce invalid or even harmful simulation samples and cause negative transfer on features that are already stable. Consequently, when such a processing method is applied to a server for few-sample sea area illegal event feature extraction, the robustness of feature extraction is greatly reduced. A simulation-enhancement-based method for extracting features of few-sample sea area illegal events is therefore urgently needed.

Disclosure of Invention

The application provides a simulation-enhancement-based method for extracting features of few-sample sea area illegal events, which improves the robustness of a server in extracting features of few-sample sea area illegal events.
A simulation-enhancement-based few-sample sea area illegal event feature extraction method includes: acquiring case texts of multiple types of sea area illegal cases, and performing semantic merging and legal attribute verification on business-oriented candidate features in the case texts to construct a normalized feature system; selecting a labeled sample set from the case texts based on the normalized feature system, and performing structure reconstruction and unified coding on the labeled sample set to generate a training sample set; introducing a text-span-based feature extraction framework on the basis of the training sample set, and performing feature extraction on the case texts to obtain a feature extraction result containing multiple normalized features; characterizing, according to the feature extraction result, the performance stability of each normalized feature under different data partitioning conditions by introducing a resource-aware robustness evaluation strategy, and determining the enhancement priority of each normalized feature in combination with its sample distribution information in the training sample set, so as to determine the features to be enhanced according to the enhancement priority; for the features to be enhanced, generating simulated case texts with a simulation enhancement large model while keeping the normalized feature system unchanged, and performing adaptive evaluation and screening on the simulated case texts based on the labeled sample set to obtain a simulation sample set; and introducing a type divide-and-conquer and boundary arbitration mechanism between the simulation sample set and the labeled sample set to construct a joint training sample set, and updating the simulation enhancement large model through the joint training sample set under the constraints of a fixed model structure, a fixed hyper-parameter configuration and a deterministic training strategy so as to