CN-116245167-B - Text encoder training method, similar-case retrieval method, apparatus, and electronic device


Abstract

The text encoder training method comprises: acquiring a plurality of case groups; encoding the case information of at least two cases using a text encoder to obtain case features of the cases; determining, according to the case features, the intra-class feature similarity within each case group and the inter-class feature similarity between each case and cases of different types; determining a contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information corresponding to the cases; and training the text encoder based on the contrastive learning loss. According to embodiments of the disclosure, the text encoder can be trained with fine-grained legal knowledge, so that the trained text encoder outputs high-quality case features, which improves the accuracy of similar-case retrieval and makes the retrieved similar cases interpretable.

Inventors

MA YIXIAO
WU YUEYUE
LIU YIQUN
SU WEIHANG
AI QINGYAO

Assignees

Tsinghua University (清华大学)

Dates

Publication Date: 20260508
Application Date: 20230328

Claims (12)

1. A text encoder training method, comprising: acquiring a case set, wherein the case set comprises a plurality of case groups, each case group comprises case information of at least two cases of the same type and similarity information corresponding to each of the at least two cases, the similarity information is determined according to the similarity between the judgment information of a case and the set of disambiguated legal provisions involved in the case, cases in different case groups are of different types, and the set of disambiguated legal provisions comprises at least one disambiguated legal provision; encoding the case information of each case in the plurality of case groups using a text encoder to obtain case features of each case in the plurality of case groups; determining, according to the case features of each case in the plurality of case groups, the intra-class feature similarity between the at least two cases in each case group and the inter-class feature similarity between the at least two cases in each case group and cases of different types, wherein the intra-class feature similarity between the at least two cases in each case group comprises the similarity of case features between every two cases in each case group; and determining a contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information corresponding to the at least two cases in each case group of the case set, and training the text encoder based on the contrastive learning loss; wherein determining the contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information corresponding to the at least two cases in each case group of the case set comprises: determining a correlation weight between every two cases in each case group according to the similarity information corresponding to the at least two cases in each case group; determining a first convergence control parameter according to the correlation weight between every two cases in each case group and the intra-class feature similarity between every two cases in each case group; determining a second convergence control parameter according to the feature similarity between the at least two cases in each case group and the cases of different types; and determining the contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, the first convergence control parameter, and the second convergence control parameter.
2. The method of claim 1, wherein acquiring the case set comprises: acquiring judgment information of each of a plurality of cases in a preset case library, and acquiring the set of disambiguated legal provisions involved in each case in the preset case library, wherein the judgment information comprises the reasons for the judgment of the case; determining similarity information corresponding to each case in the preset case library according to the similarity between the judgment information of each case in the preset case library and each disambiguated legal provision in the set of disambiguated legal provisions involved in that case; determining the similar cases of each case in the preset case library according to the similarity information corresponding to each case in the preset case library; and determining at least one case set for at least one training batch according to the similar cases of each case in the preset case library and the similarity information corresponding to each case.
3. The method of claim 2, wherein acquiring the set of disambiguated legal provisions involved in each case in the preset case library comprises: acquiring a plurality of original legal provisions involved in the plurality of cases in the preset case library, wherein each case involves at least one original legal provision; splitting each original legal provision into at least one branch clause, and extracting keywords from each branch clause to obtain at least one disambiguated legal provision under each original legal provision; and determining the set of disambiguated legal provisions involved in each case in the preset case library according to the at least one original legal provision involved in each case in the preset case library and the at least one disambiguated legal provision under each original legal provision.
4. The method of claim 2, wherein the preset case library includes I cases, I being a positive integer, and determining the similar cases of each case in the preset case library according to the similarity information corresponding to each case in the preset case library comprises: for the i-th case in the preset case library, determining, from the preset case library, J initial similar cases that involve the same original legal provision as the i-th case, wherein i is less than or equal to I, and J is less than or equal to I; determining the correlation weight between the i-th case and each initial similar case according to the similarity information corresponding to the i-th case and the similarity information corresponding to the J initial similar cases; and determining the similar cases of the i-th case from the J initial similar cases according to the correlation weight between the i-th case and each initial similar case.
5. The method of claim 4, wherein determining the correlation weight between the i-th case and each initial similar case according to the similarity information corresponding to the i-th case and the similarity information corresponding to the J initial similar cases comprises: determining, according to the similarity information corresponding to the i-th case, a first disambiguated legal provision having the greatest similarity to the judgment information of the i-th case, and determining, according to the similarity information corresponding to the j-th initial similar case among the J initial similar cases, a second disambiguated legal provision having the greatest similarity to the judgment information of the j-th initial similar case, wherein j is less than or equal to J; and, in the case that the first disambiguated legal provision is the same as the second disambiguated legal provision, determining the correlation weight between the i-th case and the j-th initial similar case according to the degree of overlap between the set of disambiguated legal provisions involved in the i-th case and the set of disambiguated legal provisions involved in the j-th initial similar case, wherein the degree of overlap is positively correlated with the correlation weight; or, in the case that the first disambiguated legal provision is different from the second disambiguated legal provision, determining the correlation weight between the i-th case and the j-th initial similar case according to the degree of overlap and a similarity score between the similarity information corresponding to the i-th case and the similarity information corresponding to the j-th initial similar case, wherein the similarity score is positively correlated with the correlation weight.
6. The method of claim 1, wherein a portion of the real words in the case information of each case in the case set is masked according to a preset masking ratio, the method further comprising: predicting the masked real words in the case information of each case based on the case features corresponding to each case in the case set, to obtain the predicted words corresponding to each case; and determining a masked language modeling loss according to the predicted words corresponding to each case and the masked real words in the case information of each case, and training the text encoder according to the masked language modeling loss.
7. A similar-case retrieval method, comprising: acquiring case information of a target case for which similar cases are to be retrieved; encoding the case information of the target case using a text encoder to obtain case features corresponding to the target case, wherein the text encoder is trained by the text encoder training method according to any one of claims 1 to 6; and determining similar cases of the target case from a target case library based on the case features corresponding to the target case and the case features of a plurality of decided cases in the target case library.
8. The method of claim 7, further comprising: predicting an original legal provision and/or a disambiguated legal provision involved in the target case according to the case features corresponding to the target case.
9. A text encoder training apparatus, comprising: an acquisition module configured to acquire a case set, wherein the case set comprises a plurality of case groups, each case group comprises case information of at least two cases of the same type and similarity information corresponding to each of the at least two cases, the similarity information is determined according to the similarity between the judgment information of a case and the set of disambiguated legal provisions involved in the case, cases in different case groups are of different types, and the set of disambiguated legal provisions comprises at least one disambiguated legal provision; an encoding module configured to encode the case information of each case in the plurality of case groups using a text encoder to obtain case features of each case in the plurality of case groups; a determining module configured to determine, according to the case features of each case in the plurality of case groups, the intra-class feature similarity between the at least two cases in each case group and the inter-class feature similarity between the at least two cases in each case group and cases of different types, wherein the intra-class feature similarity between the at least two cases in each case group comprises the similarity of case features between every two cases in each case group; and a training module configured to determine a contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information corresponding to the at least two cases in each case group of the case set, and to train the text encoder based on the contrastive learning loss; wherein determining the contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information corresponding to the at least two cases in each case group of the case set comprises: determining a correlation weight between every two cases in each case group according to the similarity information corresponding to the at least two cases in each case group; determining a first convergence control parameter according to the correlation weight between every two cases in each case group and the intra-class feature similarity between every two cases in each case group; determining a second convergence control parameter according to the feature similarity between the at least two cases in each case group and the cases of different types; and determining the contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, the first convergence control parameter, and the second convergence control parameter.
10. A similar-case retrieval apparatus, comprising: an information acquisition module configured to acquire case information of a target case for which similar cases are to be retrieved; an information encoding module configured to encode the case information of the target case using a text encoder to obtain case features corresponding to the target case, wherein the text encoder is trained by the text encoder training method according to any one of claims 1 to 6; and a case determining module configured to determine similar cases of the target case from a target case library based on the case features corresponding to the target case and the case features of a plurality of decided cases in the target case library.
11. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the method of any one of claims 1 to 8 when executing the instructions stored in the memory.
12. A non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
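
As an illustration of the loss recited in claim 1, the following is a minimal sketch rather than the patented formula, which the claims do not spell out. It assumes a Circle-Loss-style form in which each similarity term is scaled by a convergence control parameter. The function names, the `margin` and `gamma` hyperparameters, and the way the correlation weight enters the first parameter are all hypothetical choices made for illustration, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def weighted_contrastive_loss(features, group_ids, weights, margin=0.25, gamma=8.0):
    """Average a Circle-Loss-style loss over all anchor cases.

    features  : list of case feature vectors (one per case)
    group_ids : group_ids[i] is the case group of case i
    weights   : {(i, j): w} with i < j, the correlation weight of an
                intra-group pair; missing pairs default to 1.0 (assumption)
    """
    total = 0.0
    for i, anchor in enumerate(features):
        pos_sum = 0.0  # accumulates intra-class terms
        neg_sum = 0.0  # accumulates inter-class terms
        for j, other in enumerate(features):
            if i == j:
                continue
            s = cosine(anchor, other)
            if group_ids[i] == group_ids[j]:
                w = weights.get((min(i, j), max(i, j)), 1.0)
                # First convergence control parameter (assumed form):
                # built from the correlation weight and the intra-class similarity.
                alpha_p = w * max(0.0, 1.0 + margin - s)
                pos_sum += math.exp(-gamma * alpha_p * (s - (1.0 - margin)))
            else:
                # Second convergence control parameter (assumed form):
                # built from the similarity to a case of a different type.
                alpha_n = max(0.0, s + margin)
                neg_sum += math.exp(gamma * alpha_n * (s - margin))
        total += math.log1p(pos_sum * neg_sum)
    return total / len(features)


# Toy usage: two case groups with two cases each.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_groups = [0, 0, 1, 1]
toy_weights = {(0, 1): 1.0, (2, 3): 0.5}
toy_loss = weighted_contrastive_loss(toy_features, toy_groups, toy_weights)
```

With features that place same-group cases close together, the sketch yields a smaller loss than with features that mix the groups, which is the behaviour a contrastive objective is meant to reward.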

Description

Text encoder training method, similar-case retrieval method, apparatus, and electronic device

Technical Field

The disclosure relates to the field of computer technology, and in particular to a text encoder training method, a similar-case retrieval method, an apparatus, and an electronic device.

Background

Similar-case retrieval is defined as follows: given a query case, similar cases related to the query case are retrieved from a candidate case library. Similar cases generally refer to cases with the same or similar element facts and case facts. For modern judicial systems, similar-case retrieval is important for ensuring that like cases are judged alike and for promoting judicial fairness, and similar cases can serve as a reference for judges deciding cases. In recent years, pre-trained language models (PLMs) have performed well on natural language processing and retrieval tasks, so PLM technology has been introduced into legal similar-case retrieval, and how to improve the performance of PLM-based similar-case retrieval has become a research hotspot. One prior-art approach proposes BERT-XS, which is based on BERT (a PLM) and is a BERT model pre-trained on legal documents; BERT-XS and BERT share the same model structure, and only the training corpus is replaced with legal documents. Another approach uses Lawformer, a legal-domain PLM based on the Longformer model, which takes the length of legal texts into account by extending the input length limit and combines global and local attention mechanisms to help the model capture the context of long texts.
However, neither of the two prior-art approaches optimizes the model by deeply integrating legal knowledge. They merely replace the training corpus with legal text or optimize for surface features such as text length, and in essence still use a general-purpose PLM. Relevance between legal texts requires stronger legal knowledge and differs markedly from conventional text relevance, so such a PLM does not truly understand the concept of similar cases at the legal level. As a result, its accuracy in similar-case retrieval is low, and the retrieved similar cases lack good interpretability.

Disclosure of Invention

In view of this, the present disclosure proposes a text encoder training method, a similar-case retrieval method, an apparatus, and an electronic device that introduce fine-grained legal knowledge to train a text encoder, so that the trained text encoder outputs high-quality case features. This effectively improves the accuracy of similar-case retrieval when the case features output by the trained text encoder are used to retrieve similar cases, and the retrieved similar cases are interpretable because they are determined based on legal knowledge.
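
The retrieval use of the case features can be sketched as a nearest-neighbour search. This is a minimal sketch under stated assumptions: the case features are taken as already computed by some trained encoder, cosine similarity is assumed as the comparison measure (the text here does not name one), and `retrieve_similar_cases`, `library`, and `top_k` are hypothetical names introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_similar_cases(query_feature, library, top_k=3):
    """Rank decided cases by similarity to the target case.

    query_feature : feature vector of the target case
    library       : {case_id: feature vector} for decided cases
    Returns up to top_k (case_id, score) pairs, most similar first.
    """
    scored = [(case_id, cosine(query_feature, feat)) for case_id, feat in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy usage with hand-written two-dimensional features.
toy_library = {"case_A": [1.0, 0.0], "case_B": [0.0, 1.0], "case_C": [0.8, 0.2]}
toy_ranking = retrieve_similar_cases([1.0, 0.1], toy_library, top_k=2)
```

In a real system the linear scan would typically be replaced by an approximate nearest-neighbour index, but the ranking principle stays the same.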
According to one aspect of the disclosure, a text encoder training method is provided, comprising: acquiring a case set, wherein the case set comprises a plurality of case groups, each case group comprises case information of at least two cases that are similar cases and similarity information corresponding to each case, the similarity information is determined according to the similarity between the judgment information of a case and the set of disambiguated legal provisions involved in the case, cases in different case groups are cases of different types, and the set of disambiguated legal provisions comprises at least one disambiguated legal provision; encoding the case information of each case in the plurality of case groups using a text encoder to obtain case features of each case in the plurality of case groups; determining, according to the case features of each case in the plurality of case groups, the intra-class feature similarity between the at least two cases in each case group and the inter-class feature similarity between the at least two cases in each case group and cases of different types; determining a contrastive learning loss according to the intra-class feature similarity, the inter-class feature similarity, and the similarity information; and training the text encoder based on the contrastive learning loss. In one possible implementation, acquiring the case set includes: acquiring judgment information of each of a plurality of cases in a preset case library, and acquiring the set of disambiguated legal provisions involved in each case in the preset case library, wherein the judgment information includes the reasons for the judgment of the case; determining similarity information corresponding to each case in the preset case library according to the similarity between the judgment information of each case in the preset case library and each disambiguated legal provision in the set involved in that case; determining similar cases of each case in the preset case libra