CN-121997330-A - Large language model security test method and system based on small language cross-language attack
Abstract
The invention discloses a large language model security test method and system based on small-language cross-language attacks. The method comprises constructing a small-language selection strategy library, designing a cross-language semantic conversion engine, constructing a multidimensional language barrier, implementing an adaptive language-switching mechanism, and establishing a cross-language security assessment system. The invention bypasses conventional single-language safety detection through a dual language barrier: the attack content is presented in a small (low-resource) language, while the target model is required to reply in Chinese, which opens a language-understanding gap between input and output. A dynamic language-switching strategy makes the cross-language attack adaptive, so the attack method can be adjusted in real time according to the defensive responses of the target model. The invention markedly improves the adaptability, concealment, and overall test effectiveness of the attack, and provides a systematic and efficient technical means for evaluating large-model security in multilingual environments.
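Step S1 selects target small languages with a multi-dimensional evaluation system. Claim 2 names the dimensions (speaker population, coverage in training data, scarcity of technical documents) but does not say how they are combined. The Python sketch below assumes a weighted sum of min-max-normalized scores. The `LanguageProfile` fields, the weights, and the placeholder languages `lang_a`/`lang_b`/`lang_c` with their numbers are all hypothetical, not values from the patent.

```python
from dataclasses import dataclass


@dataclass
class LanguageProfile:
    """Hypothetical attributes for the three claim-2 dimensions."""
    name: str
    speakers_millions: float   # size of the speaker population
    training_coverage: float   # estimated share of LLM training data, 0..1
    doc_scarcity: float        # scarcity of technical documents, 0..1 (1 = scarcest)


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_languages(profiles, weights=(0.3, 0.4, 0.3), top_k=2):
    """Return the top_k (name, score) pairs, highest score first.

    Fewer speakers, lower training-data coverage, and scarcer technical
    documents all raise the score. The weights are illustrative assumptions.
    """
    w_pop, w_cov, w_doc = weights
    pop = min_max([p.speakers_millions for p in profiles])
    cov = min_max([p.training_coverage for p in profiles])
    doc = min_max([p.doc_scarcity for p in profiles])
    scored = [
        (p.name, round(w_pop * (1 - s_pop) + w_cov * (1 - s_cov) + w_doc * s_doc, 3))
        for p, s_pop, s_cov, s_doc in zip(profiles, pop, cov, doc)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Placeholder languages and made-up numbers, for illustration only.
    candidates = [
        LanguageProfile("lang_a", speakers_millions=80.0, training_coverage=0.002, doc_scarcity=0.9),
        LanguageProfile("lang_b", speakers_millions=300.0, training_coverage=0.05, doc_scarcity=0.2),
        LanguageProfile("lang_c", speakers_millions=30.0, training_coverage=0.001, doc_scarcity=0.7),
    ]
    print(rank_languages(candidates))  # → [('lang_a', 0.936), ('lang_c', 0.914)]
```

Min-max normalization puts the three dimensions on a common [0, 1] scale, so the weights directly express their relative importance. Any other aggregation the patent may intend (for example, hard thresholds per dimension) would slot into `rank_languages` in the same way.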
Inventors
- BAI YINGDONG
- XU MENG
- LI WEIZHU
- LIU WEI
- WANG ZIHAO
Assignees
- 北京灵云数科信息技术有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251222
Claims (8)
- 1. A large language model security test method based on small-language cross-language attack, characterized by comprising the following steps: S1, constructing a small-language selection strategy library and selecting at least one target small language based on a multi-dimensional evaluation system; S2, designing a cross-language semantic conversion engine that converts the attack intention into an expression in the target small language, enhancing the ability to bypass security detection; S3, constructing a multidimensional language barrier that creates security-detection blind spots through the dual language difference between small-language input and Chinese output; S4, implementing an adaptive language-switching mechanism that dynamically adjusts the small languages used for the attack according to the defensive responses of the target model; S5, establishing a cross-language security assessment system that quantitatively assesses the attack effect and the cross-language security protection capability of the target model.
- 2. The method for testing security of a large language model based on small-language cross-language attack according to claim 1, wherein the multi-dimensional evaluation system in step S1 comprises at least one of: the size of the small language's speaker population, its coverage in large language model training data, and the scarcity of technical documents in that language.
- 3. The method for testing security of a large language model based on small-language cross-language attack according to claim 1, wherein the cross-language semantic conversion engine in step S2 comprises: a mixed-language construction module, used for constructing a multilingual mixed input structure on the basis of single-language attacks in combination with the multilingual mixed-embedding attack principle; a dynamic position optimization module, used for determining the optimal position of the attack request within the multilingual sequence according to multilingual characteristics; a language feature mining module, used for performing multidimensional language-feature attacks that exploit the grammatical, lexical, or cultural-background features of small languages; and an attention dispersing module, used for dispersing the model's security-detection attention through the complex expressions and structural characteristics of small languages.
- 4. The method according to claim 1, wherein constructing the multidimensional language barrier in step S3 includes mixing at least two small languages in the input, constructing a deceptive context within the small-language input, requiring the target model to respond in Chinese, and building a cross-language understanding barrier through the combined effect of the input-output language difference and the complexity of the language structure.
- 5. The method for testing the security of a large language model based on small-language cross-language attack according to claim 1, wherein the adaptive language-switching mechanism in step S4 comprises: analyzing the defense strength and response pattern of the target model toward different small languages; evaluating the breakthrough effect and success rate of each small language based on historical attack data; automatically selecting the optimal attack language according to the defense effect; and dynamically combining multiple small languages within a single attack.
- 6. The large language model security test method based on small-language cross-language attack according to claim 1, characterized in that the cross-language security evaluation system in step S5 comprises: testing the recognition accuracy and processing capability of the security system on small-language input; evaluating the semantic-understanding integrity of the target model in a multilingual environment; quantifying the effect of the input-output language difference in bypassing security detection; and calculating breakthrough-rate and success-rate indicators for different small-language strategies.
- 7. The method for testing security of a large language model based on small-language cross-language attack according to any one of claims 1 to 6, wherein the target small language includes at least one of Hausa, Javanese, Lao, and Swahili.
- 8. A large language model security test system based on small-language cross-language attack, comprising: a strategy library construction module, used for constructing a small-language selection strategy library and selecting at least one target small language based on a multi-dimensional evaluation system; a semantic conversion engine module, used for designing a cross-language semantic conversion engine that converts the attack intention into an expression in the target small language and enhances the ability to bypass security detection; a language barrier construction module, used for constructing a multidimensional language barrier and creating security-detection blind spots through the dual language difference between small-language input and Chinese output; a language switching control module, used for implementing an adaptive language-switching mechanism and dynamically adjusting the small languages used for the attack according to the defensive responses of the target model; and a security evaluation module, used for establishing a cross-language security evaluation system and quantitatively evaluating the attack effect and the cross-language security protection capability of the target model.
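Claims 5 and 6 describe two related tasks without naming an algorithm. One is choosing the optimal attack language from historical success rates. The other is computing success-rate indicators per small-language strategy. One plausible realization, swapped in here as an assumption, is the UCB1 multi-armed-bandit rule: each candidate language is an arm, and each test round records whether the target model's safety layer blocked the test case. The class name, the `exploration` constant, and the binary outcome labels are hypothetical. The outcome labels are assumed to come from a separate judge, which is not shown. The `bypass_uplift` indicator is also hypothetical.

```python
import math
from collections import defaultdict


class AdaptiveLanguageSelector:
    """UCB1 bandit over candidate small languages (hypothetical sketch of S4/S5).

    Each test round is recorded as 1 if the target model's safety layer did not
    block the test case and 0 if it refused. Labeling those outcomes is assumed
    to be done by a separate judge, which is not shown here.
    """

    def __init__(self, languages, exploration=1.4):
        self.languages = list(languages)
        self.exploration = exploration  # larger -> more probing of rarely tried languages
        self.trials = defaultdict(int)
        self.successes = defaultdict(int)

    def success_rate(self, lang):
        n = self.trials[lang]
        return self.successes[lang] / n if n else 0.0

    def select(self):
        # Try every language once before relying on the UCB score.
        for lang in self.languages:
            if self.trials[lang] == 0:
                return lang
        total = sum(self.trials[lang] for lang in self.languages)

        def ucb(lang):
            # Observed success rate plus an exploration bonus that shrinks with more trials.
            bonus = self.exploration * math.sqrt(math.log(total) / self.trials[lang])
            return self.success_rate(lang) + bonus

        return max(self.languages, key=ucb)

    def record(self, lang, success):
        self.trials[lang] += 1
        self.successes[lang] += int(success)

    def report(self):
        # Per-language (number of trials, success rate), as in the S5 indicators.
        return {lang: (self.trials[lang], round(self.success_rate(lang), 3))
                for lang in self.languages}


def bypass_uplift(cross_lingual_rate, baseline_rate):
    """Hypothetical S5 indicator: increase in pass-through rate when the same test
    set is submitted with small-language input and Chinese output instead of
    Chinese only."""
    return cross_lingual_rate - baseline_rate


if __name__ == "__main__":
    selector = AdaptiveLanguageSelector(["lang_a", "lang_b"])
    for outcome in (False, True):  # illustrative outcomes, not real results
        lang = selector.select()
        selector.record(lang, outcome)
    print(selector.report())  # → {'lang_a': (1, 0.0), 'lang_b': (1, 1.0)}
    print(selector.select())  # → lang_b
```

UCB1 balances two pulls: re-testing languages that have already broken through, and probing languages that have rarely been tried. This loosely matches the "dynamically adjusting" behavior of claim 5. A ranking produced in step S1 could seed the initial order of `languages`.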
Description
Technical Field
The invention relates to the technical field of artificial intelligence security, in particular to a large language model security test method and system based on small-language cross-language attack.
Background
At present, security testing of large language models relies mainly on rule-based content detection, manual security testing, and fixed-template attacks. Rule-based static detection filters content against a preset sensitive-vocabulary library; its rules are fixed, it cannot understand context, and it is easy to bypass. Adversarial-sample testing generates attack samples through gradient perturbation, but its effectiveness against language models is limited. Manual security testing depends on expert experience and domain knowledge, so it is costly and has a long test cycle. Template-based automated testing lacks adaptability and its attacks are easy to identify.
Recent research has revealed a deeper multilingual security vulnerability, exploited by the multilingual mixed-embedding attack: a malicious request is embedded in a question written in a low-resource language, and the safety mechanism is bypassed by exploiting the attention-blink phenomenon. This technique is limited in that its attack pattern is relatively fixed and it lacks a dynamic optimization mechanism.
In summary, the prior art has the following technical problems:
1. The prior art performs security detection mainly in a single language (Chinese) and lacks defenses against cross-language attacks, leaving security blind spots when large language models operate in multilingual environments; in particular, its ability to recognize and process small-language input is insufficient.
2. Limitations of multilingual hybrid attacks: although existing mixed-embedding attacks exploit the multilingual environment, their attack pattern is relatively fixed, relying mainly on embedding at specific positions and deceptive prefixes, and they cannot optimize dynamically based on language characteristics and model responses.
3. Underuse of language characteristics: existing attacks exploit low-resource languages mainly for their scarce training data, and do not fully use multi-dimensional features such as grammatical complexity, cultural-background differences, and semantic-expression characteristics.
4. Existing hybrid embedding attacks usually follow a fixed attack pattern and cannot adjust the attack strategy dynamically according to the target model's responses and the language characteristics.
There is currently no effective solution to the above problems.
Disclosure of Invention
To address the above technical problems in the related art, the invention provides a large language model security test method and system based on small-language cross-language attack. By using small-language input and Chinese output to construct a dual language barrier, the invention overcomes the limitations of conventional single-language security detection, improves the depth and effectiveness of security testing, and remedies the deficiencies of the prior art.
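The background criticizes rule-based static detection: it relies on a preset sensitive-vocabulary library and cannot understand context. That criticism can be made concrete with a minimal keyword filter. The same filter can measure the per-language recognition rate that claim 6 tests. The filter class, the `detection_rate_by_language` helper, and the placeholder terms `alpha`/`beta` are illustrative assumptions. The two terms stand for one concept written in two different languages.

```python
import re


class KeywordFilter:
    """Rule-based filter that flags text containing any blocklisted term.

    Terms are matched as literal, case-insensitive substrings: there is no
    model of context or language, which is the limitation the background describes.
    """

    def __init__(self, blocklist):
        if not blocklist:
            raise ValueError("blocklist must not be empty")
        pattern = "|".join(re.escape(term) for term in blocklist)
        self._regex = re.compile(pattern, re.IGNORECASE)

    def flags(self, text):
        return self._regex.search(text) is not None


def detection_rate_by_language(filter_, samples):
    """samples: (language_tag, text) pairs that should all be flagged.
    Returns the fraction of samples flagged for each language tag."""
    totals, hits = {}, {}
    for lang, text in samples:
        totals[lang] = totals.get(lang, 0) + 1
        hits[lang] = hits.get(lang, 0) + int(filter_.flags(text))
    return {lang: hits[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    # "alpha" and "beta" are placeholders for one concept written in two languages;
    # the blocklist was built for lang_a only.
    kw = KeywordFilter(["alpha"])
    samples = [
        ("lang_a", "a question about alpha"),
        ("lang_a", "ALPHA again"),
        ("lang_b", "a question about beta"),
    ]
    print(detection_rate_by_language(kw, samples))  # → {'lang_a': 1.0, 'lang_b': 0.0}
```

Because matching is purely lexical, a blocklist compiled for one language gives no coverage of the same concept expressed in another. This is the blind spot the background attributes to single-language detection.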
In order to achieve the above technical purpose, the technical solution of the invention is realized as follows. A large language model security test method based on small-language cross-language attack comprises the following steps: S1, constructing a small-language selection strategy library and selecting at least one target small language based on a multi-dimensional evaluation system; S2, designing a cross-language semantic conversion engine that converts the attack intention into an expression in the target small language, enhancing the ability to bypass security detection; S3, constructing a multidimensional language barrier that creates security-detection blind spots through the dual language difference between small-language input and Chinese output; S4, implementing an adaptive language-switching mechanism that dynamically adjusts the small languages used for the attack according to the defensive responses of the target model; S5, establishing a cross-language security assessment system that quantitatively assesses the attack effect and the cross-language security protection capability of the target model. Further, the multi-dimensional evaluation system in step S1 comprises at least one of: the size of the small language's speaker population, its coverage in the training data of the large language model, and the scarcity of technical documents in that language. Further, the cross-language semantic conversion engine in step S2 includes: The mixed language construction module is used for constructing a multi-language mixed input structure by combining a multi-languag