CN-122021893-A - Security control method, apparatus, device, storage medium, and computer program product
Abstract
The application relates to the technical field of artificial intelligence and discloses a security control method, apparatus, device, storage medium, and computer program product. The method comprises: in response to an input prompt, invoking a security control module of a large language model, wherein the security control module is pre-trained on a pre-constructed chain-of-thought path, the chain-of-thought path representing the hierarchical decision logic of a hierarchical security policy, and the hierarchical security policy comprising a global security policy and a user security policy; performing hierarchical security-policy inference on the input prompt by the security control module, and determining a response action mode of the large language model according to the inference result; and generating, by the large language model, an output response corresponding to the input prompt based on the response action mode. The method thereby avoids excessive rejection and insufficient constraint, balancing safety with helpfulness.
Inventors
- SUN LIN
- SI JIANFENG
- REN HAIFENG
- ZHANG XIANGZHENG
Assignees
- 北京奇虎科技有限公司 (Beijing Qihoo Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-21
Claims (10)
- 1. A security control method, characterized in that the security control method comprises: in response to an input prompt, invoking a security control module of a large language model, wherein the security control module is pre-trained on a pre-constructed chain-of-thought path, the chain-of-thought path representing the hierarchical decision logic of a hierarchical security policy, and the hierarchical security policy comprising a global security policy and a user security policy; performing hierarchical security-policy inference on the input prompt by the security control module, and determining a response action mode of the large language model according to the inference result; and generating, by the large language model, an output response corresponding to the input prompt based on the response action mode.
- 2. The security control method of claim 1, wherein performing hierarchical security-policy inference on the input prompt by the security control module and determining a response action mode of the large language model according to the inference result comprises: detecting, by the security control module, whether the input prompt matches a global security policy risk tag; and, if the input prompt matches the global security policy risk tag, taking a guidance mode or a rejection mode as the response action mode of the large language model.
- 3. The security control method of claim 2, further comprising, after detecting whether the input prompt matches a global security policy risk tag: if the input prompt does not match the global security policy risk tag, detecting, by the security control module, whether a predefined user security policy exists; if a predefined user security policy exists, detecting, by the security control module, whether the input prompt matches a user security policy risk tag; and, if the input prompt matches the user security policy risk tag, taking the action mode corresponding to that risk tag as the response action mode of the large language model, wherein the action modes corresponding to user security policy risk tags comprise an adherence mode, a guidance mode, and a rejection mode.
- 4. The security control method of claim 3, further comprising: if no predefined user security policy exists or the input prompt does not match a user security policy risk tag, detecting whether the input prompt is safe by having the security control module invoke the general capability of the large language model; and, if the input prompt is safe, taking the adherence mode as the response action mode of the large language model.
- 5. The security control method of claim 4, further comprising, after detecting whether the input prompt is safe: if the input prompt is unsafe, taking a guidance mode or a rejection mode as the response action mode of the large language model.
- 6. The security control method of any one of claims 1 to 5, further comprising, before invoking the security control module of the large language model in response to the input prompt: acquiring an input prompt sample, and loading a preset global security policy and a randomly sampled user security policy; performing multi-directional self-distillation on the input prompt sample based on the global security policy and the user security policy to obtain a training data set; constructing a chain-of-thought path for the input prompt sample based on the training data set; and performing supervised fine-tuning of the large language model based on the chain-of-thought path, thereby constructing the security control module of the large language model.
- 7. A security control device, characterized in that the security control device comprises: a model invocation module for invoking, in response to an input prompt, a security control module of a large language model, wherein the security control module is pre-trained on a pre-constructed chain-of-thought path, the chain-of-thought path representing the hierarchical decision logic of a hierarchical security policy, and the hierarchical security policy comprising a global security policy and a user security policy; a mode determination module for performing hierarchical security-policy inference on the input prompt by the security control module and determining a response action mode of the large language model according to the inference result; and a response generation module for generating, by the large language model, an output response corresponding to the input prompt based on the response action mode.
- 8. A security control apparatus comprising a memory, a processor, and a security control program stored in the memory and executable on the processor, wherein the security control program, when executed by the processor, implements the security control method according to any one of claims 1 to 6.
- 9. A storage medium having stored thereon a security control program which, when executed by a processor, implements the security control method of any one of claims 1 to 6.
- 10. A computer program product, characterized in that the computer program product comprises a security control program which, when executed by a processor, implements the security control method according to any one of claims 1 to 6.
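The layered inference flow of claims 2 to 5 (global policy first, then any predefined user policy, then the model's own safety judgment) can be sketched as a small decision function. Everything below is illustrative: the risk tags, function names, and mode names are hypothetical, since the claims do not fix a concrete API.

```python
from enum import Enum

class ActionMode(Enum):
    ADHERE = "adhere"  # answer the prompt normally
    GUIDE = "guide"    # steer toward a safe, constructive answer
    REJECT = "reject"  # decline to answer

# Hypothetical global risk-tag table; the patent does not enumerate tags.
GLOBAL_RISK_TAGS = {"violence", "self_harm"}

def decide_action(prompt_tags: set,
                  user_policy,
                  is_safe_fallback) -> ActionMode:
    """Hierarchical security-policy inference sketched from claims 2-5."""
    # 1. Global security policy takes precedence (claim 2).
    if prompt_tags & GLOBAL_RISK_TAGS:
        return ActionMode.REJECT  # a guidance mode is equally permitted
    # 2. Predefined user security policy, if one exists (claim 3).
    if user_policy:
        for tag, mode in user_policy.items():
            if tag in prompt_tags:
                return mode  # adherence, guidance, or rejection
    # 3. Otherwise fall back to the model's general capability (claims 4-5).
    return ActionMode.ADHERE if is_safe_fallback() else ActionMode.REJECT

# Example: a prompt tagged "medical" under a user policy that guides on it.
mode = decide_action({"medical"}, {"medical": ActionMode.GUIDE}, lambda: True)
```

Note that the ordering encodes the hierarchy: a global-policy match short-circuits any user policy, and the model's general safety judgment is consulted only when neither policy layer matches.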
Description
Security control method, apparatus, device, storage medium, and computer program product

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a security control method, apparatus, device, storage medium, and computer program product.

Background

With the rapid development of artificial intelligence technology, large language models (LLMs) play an increasingly important role. However, while improving social productivity, large language models also expose serious security risks, such as being induced to output prohibited answers through jailbreak attacks. Related large language models are typically controlled with a one-size-fits-all security policy, which cannot balance safety against helpfulness and leads to excessive rejection or insufficient constraint.

Disclosure of Invention

The main object of the present application is to provide a security control method, apparatus, device, storage medium, and computer program product, aiming to solve the technical problem that the one-size-fits-all security policy of related large language models cannot balance safety against helpfulness, resulting in excessive rejection or insufficient constraint.
To achieve the above object, the present application provides a security control method comprising: in response to an input prompt, invoking a security control module of a large language model, wherein the security control module is pre-trained on a pre-constructed chain-of-thought path, the chain-of-thought path representing the hierarchical decision logic of a hierarchical security policy, and the hierarchical security policy comprising a global security policy and a user security policy; performing hierarchical security-policy inference on the input prompt by the security control module, and determining a response action mode of the large language model according to the inference result; and generating, by the large language model, an output response corresponding to the input prompt based on the response action mode.

Optionally, the step of performing hierarchical security-policy inference on the input prompt by the security control module and determining a response action mode of the large language model according to the inference result includes: detecting, by the security control module, whether the input prompt matches a global security policy risk tag; and, if the input prompt matches the global security policy risk tag, taking a guidance mode or a rejection mode as the response action mode of the large language model.
Optionally, after the security control module detects whether the input prompt matches a global security policy risk tag, the method further includes: if the input prompt does not match the global security policy risk tag, detecting, by the security control module, whether a predefined user security policy exists; if a predefined user security policy exists, detecting, by the security control module, whether the input prompt matches a user security policy risk tag; and, if the input prompt matches the user security policy risk tag, taking the action mode corresponding to that risk tag as the response action mode of the large language model, wherein the action modes corresponding to user security policy risk tags comprise an adherence mode, a guidance mode, and a rejection mode.

Optionally, the security control method further includes: if no predefined user security policy exists or the input prompt does not match a user security policy risk tag, detecting whether the input prompt is safe by having the security control module invoke the general capability of the large language model; and, if the input prompt is safe, taking the adherence mode as the response action mode of the large language model.

Optionally, after detecting whether the input prompt is safe, the method further includes: if the input prompt is unsafe, taking a guidance mode or a rejection mode as the response action mode of the large language model.
Optionally, before invoking the security control module of the large language model in response to the input prompt, the method further comprises: acquiring an input prompt sample, and loading a preset global security policy and a randomly sampled user security policy; performing multi-directional self-distillation on the input prompt sample based on the global security policy and the user security policy to obtain a training data set; constructing a chain-of-thought path for the input prompt sample based on the training data set; and performing supervised fine-tuning of the large language model based on the chain-of-thought path, thereby constructing the security control module of the large language model.
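The training pipeline outlined above (sample acquisition, policy loading, multi-directional self-distillation, chain-of-thought construction, supervised fine-tuning) can be sketched as a data-preparation step. This is a minimal illustration under assumptions: the policy texts, record schema, and helper names are invented, and the real self-distillation would query the base model rather than merely record the policy views.

```python
import json
import random

# Hypothetical policy texts; the patent names the steps without fixing them.
GLOBAL_POLICY = ["no instructions for violence", "no self-harm encouragement"]
USER_POLICIES = [["stay within medical-disclaimer bounds"],
                 ["no personalized financial advice"]]

def self_distill(prompt: str, global_policy, user_policy) -> dict:
    """Stand-in for multi-directional self-distillation: a real pipeline
    would have the base model answer under each policy view; here we only
    record the views that would be fed back into training."""
    return {"prompt": prompt,
            "global_view": global_policy,
            "user_view": user_policy}

def build_cot_record(sample: dict) -> dict:
    """Assemble a chain-of-thought path mirroring the layered decision
    logic: global policy check -> user policy check -> general check."""
    steps = [
        f"Check global policy: {sample['global_view']}",
        f"Check user policy: {sample['user_view']}",
        "Check general safety via the model's own capability",
    ]
    return {"prompt": sample["prompt"], "cot_path": steps}

prompts = ["How do I treat a minor burn?"]  # input prompt samples
dataset = [build_cot_record(self_distill(p, GLOBAL_POLICY,
                                         random.choice(USER_POLICIES)))
           for p in prompts]
# Each record would then drive supervised fine-tuning (SFT) of the model.
print(json.dumps(dataset[0], indent=2))
```

The chain-of-thought records encode the same three-layer ordering the inference stage follows, which is what lets the fine-tuned module reproduce the hierarchical decision logic at run time.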