CN-122021974-A - AI model safety alignment method, equipment and system based on vertical domain detection engine

CN122021974ACN 122021974 ACN122021974 ACN 122021974ACN-122021974-A

Abstract

The invention belongs to the technical field of artificial intelligence, and particularly relates to an AI model safety alignment method, equipment and system based on a vertical domain detection engine. Firstly, feature extraction, field classification and risk assessment are carried out on a sentence A input by a user through a vertical domain detection engine, and corresponding feature codes B and C are obtained. And then, retrieving and generating a safety prompt word segment D from the template library according to the feature codes, and splicing the safety prompt word segment D with the statement A to form a final input E, and sending the final input E into a large language model. After the model generates a reply speech segment F based on E, the system checks the reply speech segment F, namely, if the verification is passed, the F is output to a user side, and if the verification is failed, the F is used as an illegal sample to be recorded back to a vertical domain detection engine so as to perfect the risk identification capability. The invention realizes hot plug and accurate customization by decoupling safety constraint, has no fine tuning protection performance, combines weight and prompt word double-layer protection, and constructs a high-efficiency, low-loss and robust vertical field safety system.

Inventors

CHEN GONG
CHEN KAIPING
Jin Yuke
NI GAOWEI

Assignees

杭州安泉数智科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260416

Claims (10)

1. The AI model safety alignment method based on the vertical domain detection engine is characterized by comprising the following steps: s1, carrying out feature extraction, field classification and risk assessment on a sentence A input by a user through a vertical domain detection engine, and obtaining a feature code B of the field classification and a risk level C of the risk assessment; s2, retrieving and generating a safety prompt word segment D from a template library according to the feature code B and the risk level C; S3, splicing the safety prompt word segment D and the user input sentence A to form a final input E, and inputting the final input E into a large language model; s4, the large language model generates a reply language segment F based on the final input E; S5, checking the reply speech segment F, outputting the reply speech segment F to a user side if the check is passed, and recording the reply speech segment F as an illegal sample in the vertical domain detection engine if the check is failed.
2. The AI model security alignment method based on the vertical domain detection engine of claim 1, wherein in step S1, the feature extraction, domain classification and risk assessment method of the vertical domain detection engine comprises: S11, performing topic modeling on the user input sentence A, including lexical analysis, syntactic analysis and entity identification, and extracting a feature vector A1 in the user input sentence A; S12, inputting the feature vector A1 into a pre-trained multi-label classification model, and outputting one or more vertical field labels as the feature codes B; s13, evaluating and outputting the risk level C of the risk evaluation according to the safety risk content in the user input statement A.
3. The AI model security alignment method of claim 2, wherein in step S12, the feature vector A1 is input into a pre-trained multi-label classification model, and the confidence level B1 corresponding to the feature code B is output.
4. The AI model security alignment method of claim 2, wherein in step S13, the security risk content includes sensitive words and jail-breaking try sentences.
5. The AI-model security alignment method based on the vertical domain detection engine of any one of claims 1-4, wherein the step S2 includes: S21, taking the feature code B and the risk level C as a compound key, carrying out accurate matching in a safety prompt word template library, and outputting a basic template or an enhanced template or a refused template; s22, when the output in S21 is a basic template, carrying out variable filling and customized adjustment on the basic template by combining context information in a user input sentence A, and generating the safety prompt word segment D; S23, when the output in S21 is the reinforced template, carrying out variable filling and customized adjustment by combining the reinforced template with the context information in the sentence A input by the user, and adding a hard constraint instruction to generate the safety prompt word segment D; And S33, when the output in S21 is a reject template, performing variable filling and customized adjustment by combining the enhanced template with the context information in the sentence A input by the user, and adding a forced reject instruction to generate the safety prompt word segment D.
6. The AI model security alignment method based on a vertical domain detection engine of claim 1, wherein the priority of the security prompt word segment D in the final input E is the highest priority.
7. The AI model security alignment method based on a vertical domain detection engine of claim 6, wherein the security prompt word segment D is a system prompt word in the final input E, and the security prompt word segment D is located at the forefront end of the final input E.
8. An AI model security alignment system based on a vertical domain detection engine, comprising: the knowledge base management unit is used for storing and maintaining a knowledge base in the vertical field, a classification system and a prompt word template; the vertical domain detection unit is used for realizing the functions of feature extraction, domain classification and risk assessment; The prompting word generation and injection unit is used for realizing prompting word searching, customizing and injection functions; A model reasoning unit in which a large language model is deployed; and the feedback optimization unit is used for post-verifying and feeding back a verification result to the vertical domain detection unit and the prompt word generation and injection unit.
9. An electronic device comprising a processor and a memory; the processor is connected with the memory; The memory is used for storing executable program codes; The processor runs a program corresponding to executable program code stored in the memory by reading the executable program code for performing the method according to any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any of claims 1-7.

Description

AI model safety alignment method, equipment and system based on vertical domain detection engine Technical Field The invention belongs to the technical field of artificial intelligence, and particularly relates to an AI model safety alignment method, equipment and system based on a vertical domain detection engine. Background Traditional LLM security alignment methods rely mainly on fine tuning of model weights (e.g., SFT, RLHF), and such static alignment methods have the following limitations: 1. alignment is costly and inflexible, requiring data re-collection, model weight re-training, long iteration cycles, and high cost whenever there is a change in regulatory or compliance requirements in the vertical domain. 2. The long tail risk is difficult to cover, the safety risk in the vertical field is often long tail and fine, and static alignment is difficult to exhaust all potential illegal scenes. 3. The "alignment tax" problem still exists in that even with Parameter Efficient Fine Tuning (PEFT) techniques, any modification of model weights can result in unpredictable performance losses to the expertise capabilities of the model. Disclosure of Invention In order to solve the defects of the prior art, the invention provides an innovative dynamic safety protection mechanism, and the safety Prompt words (promtt) are generated and injected in a targeted manner by detecting the type of the vertical field related to user input in real time so as to realize flexible, accurate and low-cost safety alignment of the vertical field with strict compliance requirements on finance, education, medical treatment and the like. The dynamic security protection mechanism is specifically an AI model security alignment method based on a vertical domain detection engine, which comprises the following steps: S1, carrying out feature extraction, field classification and risk assessment on a user input statement A through a vertical domain detection engine, and obtaining a feature code B of the field classification and a risk level C of the risk assessment S2, retrieving and generating a safety prompt word segment D from a template library according to the feature code B and the risk level C; S3, splicing the safety prompt word segment D and the user input sentence A to form a final input E, and inputting the final input E into a large language model; s4, the large language model generates a reply language segment F based on the final input E; S5, checking the reply speech segment F, outputting the reply speech segment F to a user side if the check is passed, and recording the reply speech segment F as an illegal sample in the vertical domain detection engine if the check is failed. In the above-mentioned security alignment method, the domain detection engine needs to perform feature extraction, domain classification and risk assessment based on a domain knowledge base, and the domain knowledge base needs to collect and structure in advance the rules, industry specifications, typical rule-breaking cases, etc. of the storage target vertical domain to form a domain knowledge base. Based on the domain knowledge base, a set of multi-level, fine-grained vertical domain taxonomies (e.g., finance-anti-fraud, finance-compliance marketing, education-cognitive security, education-content suitability) is established. And designing a set of safety prompt word templates for each fine-granularity vertical domain in the classification system as a template library. Optionally, in step S1, the feature extraction, domain classification and risk assessment method of the vertical domain detection engine includes: S11, performing topic modeling on the user input sentence A, including lexical analysis, syntactic analysis and entity identification, and extracting a feature vector A1 in the user input sentence A; S12, inputting the feature vector A1 into a pre-trained multi-label classification model, and outputting one or more vertical field labels as the feature codes B; s13, evaluating and outputting the risk level C of the risk evaluation according to the safety risk content in the user input statement A. Optionally, in the step S12, the feature vector A1 is input into the pretrained multi-label classification model, and the confidence coefficient B1 corresponding to the feature code B is also output. The confidence level B1 is a quantization index, and is used for measuring the grasping degree of the model on the feature code B, and filtering, sorting or weighting the feature code B is performed. Optionally, in the step S13, the security risk content includes a sensitive word and a jail-breaking attempt statement. Optionally, the step S2 includes: S21, taking the feature code B and the risk level C as a compound key, carrying out accurate matching in a safety prompt word template library, and outputting a basic template or an enhanced template or a refused template; s22, when the output in S21 is a basic template, carrying out variable filling and cus