CN-122021634-A - Judicial text entity relation joint extraction method based on regional vertex labeling


Abstract

The invention discloses a judicial text entity relation joint extraction method based on regional vertex labeling. The method encodes cleaned judicial text data with the pre-trained language model BERT to obtain the corresponding vector representation and embedded representations. The head entity and the tail entity of each triple in the original judicial text data form a rectangular region in the target representation set, so a triple is identified by identifying the four vertices of its region. The method calculates, for each target representation in the set, the probability scores of belonging to each of the four vertices, and extracts the triples of the original judicial text data through a loss function in combination with the characters to which labels have been assigned. The invention converts the key information in the original judicial text data into formatted triples, accurately identifies the named entities in judicial text, locates the core relations between entities, gives irregular text a structured representation, and thereby helps judicial staff understand a case.

Inventors

SUN YUANYUAN
YUE YINGYING

Assignees

Dalian University of Technology (大连理工大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-11

Claims (8)

1. A judicial text entity relation joint extraction method based on regional vertex labeling, characterized by comprising the following steps:
S1, acquiring original judicial text data and cleaning it to obtain judicial text cleaning data, wherein the original judicial text data comprises a theft-class data set and a fraud-class data set;
S2, encoding the judicial text cleaning data with the pre-trained language model BERT to obtain a corresponding vector representation;
S3, determining the embedded representation of each word pair in the judicial text from the vectors corresponding to the characters in the vector representation and the distance embedding corresponding to the word pair;
S4, forming a two-dimensional table from the plurality of embedded representations, enhancing the word-pair representations of the two-dimensional table through CNNs to obtain dilated convolution outputs, and concatenating the dilated convolution outputs of all the different dilation rates to obtain a multi-dilated convolution output;
S5, operating on the two-dimensional table and the multi-dilated convolution output with an activation function to obtain a target representation set of character pairs;
S6, the head entity and the tail entity of a triple corresponding to the original judicial text data forming a rectangular region in the target representation set, identifying the triple by identifying the four vertices of the rectangular region, wherein the four vertices are labeled TL for the upper-left vertex, TR for the upper-right vertex, BL for the lower-left vertex and BR for the lower-right vertex, and calculating, for the target representations in the target representation set, the probability scores of belonging to each of the four vertices; and
S7, extracting the triples of the original judicial text data through a loss function in combination with the characters of the assigned labels.
2. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein encoding the judicial text cleaning data with the pre-trained language model BERT to obtain the corresponding vector representation is implemented by an expression that maps the judicial text S to its corresponding vector representation, in which each character of S has a corresponding vector.
3. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein determining the embedded representation of each word pair in the judicial text from the vectors corresponding to the characters in the vector representation and the distance embedding corresponding to the word pair is implemented by an expression that yields the embedded representation of a word pair in the judicial text S from the vectors corresponding to its characters and the distance embedding corresponding to that word pair, the expression using two learnable parameters.
4. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein enhancing the word-pair representations of the two-dimensional table through CNNs to obtain the dilated convolution outputs is implemented by an expression defined over the two-dimensional table formed by the n × n embedded representations, in which each dilated convolution output corresponds to a dilation rate l, and C is the multi-dilated convolution output.
5. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein operating on the two-dimensional table and the multi-dilated convolution output with an activation function to obtain the target representation set of character pairs is implemented by an expression in which C is the multi-dilated convolution output obtained by concatenating the dilated convolution outputs of the different dilation rates, and in which an activation function is applied to C and the two-dimensional table formed by the n × n embedded representations to give the target representation set of character pairs.
6. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein calculating the probability scores of the target representations in the target representation set belonging to the four vertices is implemented by expressions that give, for each character pair, the probability score of being assigned the label TL, the probability score of being assigned the label TR, the probability score of being assigned the label BL, and the probability score of being assigned the label BR; wherein each score is computed for a relation class between a head entity and a tail entity from a target representation in the target representation set; a sigmoid function compresses the output to between 0 and 1 so that it can be read as a probability in a binary classification problem; and the four weight matrices and the four bias terms associated with the relation class are all learnable parameters.
7. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein extracting the triples of the original judicial text data through a loss function in combination with the characters of the assigned labels comprises: first extracting a plurality of first triples, then extracting a plurality of second triples, and merging the plurality of first triples and the plurality of second triples to obtain a target triple set;
the extraction process of the plurality of first triples being:
S71, identifying and selecting all character pairs labeled TL as starting points;
S72, following the direction TL, TR, BR, starting from the character pair at each label TL and searching for the nearest character pair labeled TR;
S73, starting from the character pair at each label TR and searching for the nearest character pair labeled BR;
S74, determining the rectangular region of a triple from the three identified vertices TL, TR and BR, and extracting the plurality of first triples according to the rectangular regions;
the extraction process of the plurality of second triples being:
S75, decoding in the direction BR, BL, TL, that is, starting from the lower-right corner and searching for the character pair at the label BL and the character pair at the label TL, and extracting the plurality of second triples according to these character pairs.
8. The judicial text entity relation joint extraction method based on regional vertex labeling according to claim 1, wherein the loss function used in extracting the triples of the original judicial text data in combination with the characters of the assigned labels is defined with R = {TL, TR, BL, BR} and N denoting the number of entity relations; wherein, if a character pair is assigned the labels TL and BL under a relation class, the two corresponding label variables of that character pair under that relation class are set accordingly; an intermediate variable is used in the computation; and the loss is computed over the probability scores of being assigned the label TL, the label TR, the label BL or the label BR.
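
The two-pass decoding recited in claim 7 can be illustrated with a minimal sketch. It is not the claimed implementation: the function names `encode`, `decode` and `span`, the example sentence and the relation names are hypothetical. The sketch assumes that rows of each relation's table index head-entity characters and columns index tail-entity characters, and that "nearest" means nearest along the same row or column. Sets of labeled cells stand in for thresholded probability scores.

```python
from collections import defaultdict

LABELS = ("TL", "TR", "BL", "BR")


def encode(triples):
    """Turn ((head_start, head_end), relation, (tail_start, tail_end)) triples
    into per-relation sets of labeled cells. Spans are inclusive.

    Assumption: rows index head characters, columns index tail characters.
    """
    tables = defaultdict(lambda: {label: set() for label in LABELS})
    for (hs, he), rel, (ts, te) in triples:
        cells = tables[rel]
        cells["TL"].add((hs, ts))  # upper-left vertex
        cells["TR"].add((hs, te))  # upper-right vertex
        cells["BL"].add((he, ts))  # lower-left vertex
        cells["BR"].add((he, te))  # lower-right vertex
    return tables


def decode(tables):
    """Recover triples with the two passes of claim 7 and merge the results."""
    found = set()
    for rel, cells in tables.items():
        # First pass (S71-S74): TL -> nearest TR to the right -> nearest BR below.
        for i, j in cells["TL"]:
            cols = [c for r, c in cells["TR"] if r == i and c >= j]
            if not cols:
                continue
            k = min(cols)
            rows = [r for r, c in cells["BR"] if c == k and r >= i]
            if rows:
                found.add(((i, min(rows)), rel, (j, k)))
        # Second pass (S75): BR -> nearest BL to the left -> nearest TL above.
        for m, k in cells["BR"]:
            cols = [c for r, c in cells["BL"] if r == m and c <= k]
            if not cols:
                continue
            j = max(cols)
            rows = [r for r, c in cells["TL"] if c == j and r <= m]
            if rows:
                found.add(((max(rows), m), rel, (j, k)))
    return found


def span(text, entity):
    """Inclusive character span of the first occurrence of `entity`."""
    start = text.index(entity)
    return (start, start + len(entity) - 1)


# Hypothetical example sentence and relation names.
text = "ZhangSan stole a phone from LiSi"
triples = {
    (span(text, "ZhangSan"), "steals", span(text, "phone")),
    (span(text, "ZhangSan"), "steals_from", span(text, "LiSi")),
}
decoded = decode(encode(triples))
for (hs, he), rel, (ts, te) in sorted(decoded):
    print(text[hs:he + 1], rel, text[ts:te + 1])
```

In this sketch each pass needs only three of the four vertices, so a triple whose TR or BL label is missing can still be recovered by the other pass.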

Description

Judicial text entity relation joint extraction method based on regional vertex labeling

Technical Field

The invention relates to information extraction within natural language processing, and in particular to a judicial text entity relation joint extraction method based on regional vertex labeling.

Background

Judicial documents are documents with legal effect or legal significance that judicial organs, litigants, their agents and other parties produce and use when handling litigation cases and non-litigation matters according to legal procedure. Their formats are complex and their forms varied, which makes joint entity-relation extraction very challenging. Applying information extraction to judicial text can markedly improve information-processing efficiency and provides key support for downstream judicial tasks, such as accurate classification of charges, automatic generation of auxiliary judicial documents, and systematic construction of judicial knowledge graphs.

To let judicial personnel quickly locate the important information in a judicial document, the invention trains its algorithm on a data set focused on three high-incidence, high-attention categories of crime (theft, fraud and related crimes) and uses a vertex labeling algorithm to extract the triples in judicial documents. The invention is convenient for judicial staff and speeds up judicial work.

Because the entity relations in judicial text are complex and varied, and triples with overlapping entities and triples with overlapping entity pairs occur frequently, existing methods have difficulty extracting these triples.
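
The two overlap patterns named above can be made concrete with a small sketch. It is only an illustration: the entity names, relation names and the function `overlap_kinds` are hypothetical, and the definitions used (two triples share exactly the same entity pair, or share one entity but not the whole pair) are the ones commonly used in the relation extraction literature rather than wording taken from this document.

```python
def overlap_kinds(triple, all_triples):
    """Return the overlap patterns that `triple` takes part in."""
    head, _, tail = triple
    kinds = set()
    for other in all_triples:
        if other == triple:
            continue
        o_head, _, o_tail = other
        if {head, tail} == {o_head, o_tail}:
            # Same two entities linked by another relation.
            kinds.add("entity-pair overlap")
        elif {head, tail} & {o_head, o_tail}:
            # An entity is shared, but not the whole pair.
            kinds.add("entity overlap")
    return kinds or {"no overlap"}


# Hypothetical triples in (head, relation, tail) form.
triples = [
    ("ZhangSan", "steals", "phone"),
    ("ZhangSan", "defrauds", "LiSi"),
    ("ZhangSan", "acquainted_with", "LiSi"),
    ("WangWu", "resides_in", "Dalian"),
]
for t in triples:
    print(t, sorted(overlap_kinds(t, triples)))
```

A method that assigns each character or entity a single role cannot represent the second and third triples at the same time, which is the difficulty the background describes.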
The invention constructs a table for each entity relation. The two entities of a triple form a rectangular region in the table of their relation, and the triple is extracted by identifying the four vertices of that region (the upper-left, upper-right, lower-left and lower-right corners). The method also introduces distance embedding to capture information about the distance between characters.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a judicial text entity relation joint extraction method, device, computer device and storage medium based on regional vertex labeling.

A judicial text entity relation joint extraction method based on regional vertex labeling comprises the following steps:
S1, acquiring original judicial text data and cleaning it to obtain judicial text cleaning data, wherein the original judicial text data comprises a theft-class data set and a fraud-class data set;
S2, encoding the judicial text cleaning data with the pre-trained language model BERT to obtain a corresponding vector representation;
S3, determining the embedded representation of each word pair in the judicial text from the vectors corresponding to the characters in the vector representation and the distance embedding corresponding to the word pair;
S4, forming a two-dimensional table from the plurality of embedded representations, enhancing the word-pair representations of the two-dimensional table through CNNs to obtain dilated convolution outputs, and concatenating the dilated convolution outputs of all the different dilation rates to obtain a multi-dilated convolution output;
S5, operating on the two-dimensional table and the multi-dilated convolution output with an activation function to obtain a target representation set of character pairs;
S6, the head entity and the tail entity of a triple corresponding to the original judicial text data forming a rectangular region in the target representation set, identifying the triple by identifying the four vertices of the rectangular region, wherein the four vertices are labeled TL for the upper-left vertex, TR for the upper-right vertex, BL for the lower-left vertex and BR for the lower-right vertex, and calculating, for the target representations in the target representation set, the probability scores of belonging to each of the four vertices; and
S7, extracting the triples of the original judicial text data through a loss function in combination with the characters of the assigned labels.

In one embodiment, encoding the judicial text cleaning data with the pre-trained language model BERT to obtain the corresponding vector representation is implemented by an expression that maps the judicial text S to its corresponding vector representation, in which each character of S has a corresponding vector. In one embodiment, the embedding of word pairs in judicial text is determined based on the embedding of the vectors corres