CN-122020317-A - Fine-grained large language model lineage identification method for black-box scenarios


Abstract

The application relates to a fine-grained large language model lineage identification method for black-box scenarios. The method builds a black-box detection framework for large language models that combines multi-task domain knowledge with representative sampling: it constructs a task query pool covering multiple dimensions such as medical question answering and mathematical reasoning, and uses a screening algorithm based on TF-IDF vectorization and KMeans clustering to extract the most representative probe samples from each task, forming a target query set. It then constructs a task-level knowledge fingerprint extraction model based on semantic topology and structural fluctuation: by running multiple independent generation experiments on a model, it computes a relative deviation vector of the response texts with respect to the origin (query) semantics and a stability statistic (coefficient of variation) that reflects knowledge stability. On this basis, it realizes a fine-grained lineage evolution tracing fingerprint based on multi-level joint weighted aggregation. The fingerprint automatically weights different queries according to how strong or weak they are in the stability ranking, and applies weight attenuation and compensation using task-level semantic confidence, so that the evolution similarity of the model under test relative to each candidate source model is output accurately.
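
The representative screening step can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the applicant's implementation: the whitespace tokenizer, the smoothed IDF formula, the farthest-point seeding and the function names (`tfidf_vectors`, `kmeans`, `select_representatives`) are all hypothetical choices, and the heuristic-score fallback mentioned in the claims is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """L2-normalised TF-IDF vectors over whitespace tokens (smoothed IDF, assumed)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        v = [d[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v, centers):
    return min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-point seeding (assumed)."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    for _ in range(iters):
        labels = [nearest(v, centers) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(v, centers) for v in vectors], centers


def select_representatives(texts, k):
    """Return indices of the query closest to each cluster center."""
    vectors = tfidf_vectors(texts)
    labels, centers = kmeans(vectors, k)
    reps = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(vectors[i], centers[j])))
    return sorted(reps)
```

In practice a library vectorizer and clustering routine would likely replace the hand-written helpers; the point of the sketch is only the selection rule, namely one query per cluster, chosen by distance to the cluster center.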

Inventors

ZHOU XIN
FAN JUN
XU JINWEI
ZHANG HE

Assignees

Nanjing University (南京大学)

Dates

Publication Date: 20260512
Application Date: 20260318

Claims (10)

1. A fine-grained large language model lineage identification method for black-box scenarios, characterized by comprising the following steps: step S1, constructing a multi-task query pool for large language model lineage tracing, the multi-task query pool comprising a medical question-answering task, a mathematical reasoning task, a code generation task, a legal question-answering task and a safety alignment task; step S2, performing representative screening on the candidate query texts of each task to obtain a target query set for black-box probing; step S3, inputting the target query set into the model under test and the candidate source models, and performing multiple independent generations for each query to obtain a response text set for each model; step S4, performing semantic vector encoding on the query texts and the response texts, calculating relative deviation vectors and stability statistics based on the response text sets, and constructing the task-level knowledge fingerprint of each model; step S5, comparing the task-level knowledge fingerprint of the model under test with that of each candidate source model, and performing joint weighted aggregation over different queries and different tasks to obtain the fine-grained lineage evolution similarity of the model under test relative to each candidate source model; and step S6, outputting results such as the lineage family, the candidate parent model and the known model identity of the model under test according to detection results such as the lineage similarity and the evolution similarity.
2. The fine-grained large language model lineage identification method for black-box scenarios according to claim 1, wherein the representative screening in step S2 comprises: 2-1, cleaning the candidate query texts to remove abnormal characters, redundant whitespace and invalid fragments; 2-2, grouping the candidate query texts by topic using a clustering method; 2-3, selecting, for each cluster, the representative query closest to the cluster center, and supplementing with heuristic scores when the representative queries are insufficient, so as to improve task coverage and sampling stability.
3. The fine-grained large language model lineage identification method for black-box scenarios according to claim 2, wherein the heuristic score jointly considers text length, number of sentences, lexical diversity and average word length, so as to preferentially retain query samples with high information content and large semantic differences.
4. The fine-grained large language model lineage identification method for black-box scenarios according to claim 1, wherein in step S3 each target query is submitted to each model through several black-box calls to obtain multiple stochastic responses to the same query, thereby capturing the cognitive fluctuation characteristics of the model under black-box sampling conditions.
5. The fine-grained large language model lineage identification method for black-box scenarios according to claim 1, wherein constructing the task-level knowledge fingerprint in step S4 comprises: 4-1, encoding the query text into a query semantic vector; 4-2, encoding the multiple response texts corresponding to the same query into a response semantic vector set; 4-3, constructing a relative deviation vector based on the difference between the mean vector of the response semantic vector set and the query semantic vector; 4-4, calculating a stability statistic based on the degree of dispersion of the response semantic vector set in each semantic dimension, namely: 4-4-1, dimension-wise statistical feature extraction: for the set of t response semantic vectors generated for any query, calculating the mean and the sample standard deviation of the set in each semantic dimension; 4-4-2, dimension-wise coefficient-of-variation calculation: to eliminate the influence of differences in activation intensity among semantic dimensions, calculating a coefficient of variation for each dimension, that is, dividing the standard deviation of the dimension by the absolute value of its mean, and taking the standard deviation directly as the coefficient of variation if the dimension mean approaches zero, so as to obtain a dimension-wise coefficient-of-variation vector; 4-4-3, cross-dimensional stability aggregation: taking the arithmetic mean of the dimension-wise coefficient-of-variation vector over all dimensions to obtain a stability statistic reflecting the semantic consistency for the query; 4-5, forming the task-level knowledge fingerprint from the relative deviation vectors and stability statistics corresponding to the plurality of queries.
6. The fine-grained large language model lineage identification method for black-box scenarios according to claim 5, wherein the stability statistic is obtained by computing, in each dimension, the ratio of the standard deviation of the response semantic vector set to the absolute value of its mean, and averaging over all dimensions, thereby forming a coefficient-of-variation-type stability index for a single query.
7. The fine-grained large language model lineage identification method for black-box scenarios according to claim 5, characterized in that the relative deviation vector is obtained by first calculating the arithmetic mean of all vectors in the response semantic vector set to obtain a center vector, and then taking the difference between the center vector and the query semantic vector to obtain the raw deviation vector. The principle is that the center vector averages out the random semantic noise of the model's generation process, so as to extract the model's deterministic cognitive offset on specific knowledge points; at the same time, the direction information of the raw deviation vector is preserved, so that the topological displacement of the model under test relative to the reference semantic space is fully characterized.
8. The fine-grained large language model lineage identification method for black-box scenarios according to claim 1, wherein, in the joint weighting over different queries in step S5, query weights are calculated according to the stability ranking positions of the model under test and the candidate source model on the corresponding queries, and query-level similarities are then obtained using the cosine similarity between relative deviation vectors; the further a stability rank deviates from the middle position, the stronger or weaker the discriminative capability of the corresponding query, and the stability rank is converted into a query weight by a preset function so as to emphasize response patterns with high discriminative capability.
9. The fine-grained large language model lineage identification method for black-box scenarios, characterized in that, in the joint weighted aggregation over different tasks in step S5, a task confidence is constructed based on the distribution entropy of the query weights, the more concentrated the weight distribution, the higher the task confidence; the task similarity, the task confidence and the task weight are then jointly aggregated to obtain the final lineage similarity.
10. The fine-grained large language model lineage identification method for black-box scenarios according to claim 1, further comprising a cache reuse step: persistently storing historical black-box response results and knowledge fingerprint results; when the knowledge fingerprint extraction rules are adjusted but the original response results remain valid, reconstructing the task-level knowledge fingerprints from the cached response results without calling the model under test again, thereby improving the efficiency of parameter tuning and iterative verification.
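
The fingerprint quantities described in claims 5 to 9 can be sketched as follows. This is a hedged illustration only. It assumes the semantic vectors have already been produced by some sentence encoder (not shown), takes the deviation as center minus query (the claims do not fix the sign), uses a hypothetical threshold `EPS` for "the mean approaches zero", and uses `1 - H(p) / log(n)` as one possible entropy-based confidence. The preset rank-to-weight function of claim 8 is not specified in the claims, so query weights are taken as given inputs here.

```python
import math
from statistics import mean, stdev

EPS = 1e-8  # assumed threshold for "the dimension mean approaches zero"


def relative_deviation(query_vec, response_vecs):
    """Center of the responses minus the query vector (sign convention assumed)."""
    center = [mean(col) for col in zip(*response_vecs)]
    return [c - q for c, q in zip(center, query_vec)]


def stability_statistic(response_vecs, eps=EPS):
    """Mean over dimensions of the per-dimension coefficient of variation."""
    cvs = []
    for col in zip(*response_vecs):
        mu, sigma = mean(col), stdev(col)  # stdev is the sample (n-1) form
        # Fall back to the raw standard deviation when the mean is near zero.
        cvs.append(sigma if abs(mu) < eps else sigma / abs(mu))
    return mean(cvs)


def cosine(a, b):
    """Cosine similarity between two deviation vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def task_similarity(query_sims, query_weights):
    """Weighted mean of query-level similarities within one task."""
    total = sum(query_weights)
    return sum(s * w for s, w in zip(query_sims, query_weights)) / total


def entropy_confidence(query_weights):
    """Assumed form: 1 - H(p) / log(n), so concentrated weights score higher."""
    n = len(query_weights)
    if n < 2:
        return 1.0
    total = sum(query_weights)
    probs = [w / total for w in query_weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(n)
```

Because cosine similarity is unchanged when both vectors are negated, the choice between center minus query and query minus center does not affect the query-level similarity as long as it is applied consistently to both models.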

Description

Fine-grained large language model lineage identification method for black-box scenarios

Technical Field

The application relates to a fine-grained model lineage evolution tracing method for black-box scenarios, and belongs to the technical field of artificial intelligence model security in software engineering.

Background

With the rapid development of generative artificial intelligence, large language models and their derivative models are widely used in scenarios such as intelligent question answering, code generation, content creation, medical consultation, legal assistance and industrial automation. Unlike traditional software systems, a generative artificial intelligence model usually takes a large-scale pre-trained model as its base and continues to evolve through instruction fine-tuning, domain fine-tuning, parameter-efficient fine-tuning, reinforcement learning alignment, distillation and compression, quantized deployment, model merging, adapter injection and similar means, forming complex model family structures and lineage inheritance relationships. This mode of evolution improves model capability and deployment efficiency, but it also raises new problems in model identification, provenance verification, intellectual property protection, supply chain security auditing and model asset management.
However, large-scale application of generative artificial intelligence models also faces significant challenges. First, the training, fine-tuning and redistribution of models are highly complex: different parties can develop many derivative models with similar functions but different origins around the same base model, so the true inheritance relationship between models is difficult to judge directly from external names, interface descriptions or functional behavior.
Second, more and more models are offered as closed-source services or through standardized application programming interfaces; auditors often cannot access model parameters, training logs, intermediate-layer representations or output probabilities, and can observe model behavior only through a limited number of input-output interactions. Third, model theft, unauthorized distillation, unauthorized secondary fine-tuning and release under a new name are increasingly common, so provenance identification that relies only on manual experience or simple content comparison can hardly meet high-confidence audit requirements.
To address these challenges, research on identity recognition and provenance verification of generative artificial intelligence models has gradually formed a body of model tracing techniques. According to how much of the target model's internal information is accessible to the auditor, existing model tracing techniques fall into three categories: white-box fingerprint tracing, gray-box fingerprint tracing and black-box fingerprint tracing. White-box fingerprinting usually assumes that the auditor has full access to model parameters or intermediate representations, and extracts stable lineage features by directly analyzing weight distributions, feature representations, gradient information, neuron sensitivity or activation patterns. Such methods observe the internal structure of the model precisely and discriminate well under parameter perturbation, pruning, permutation and some post-processing operations, but they require the target model to be highly transparent and are therefore difficult to apply directly to commercial closed-source models and online interface models.
Gray-box fingerprint tracing adopts an active probing paradigm: the target model must be controlled or customized to some degree during the fingerprint construction stage, and identity verification is then completed through black-box queries in the actual use stage. Such methods generally identify a specific target model quickly by constructing a highly specific query-response pattern, trigger prompt or statistical signature, but generating the fingerprint in advance still depends on developer permissions or control over the model. By contrast, black-box fingerprint tracing relies only on interaction with the target model through standardized input-output interfaces and extracts identity features by analyzing the model's output behavior; it has the greatest generality and deployment flexibility, and best matches the way models are currently served and distributed through interfaces.
Although black-box tracing has the better application prospects, existing research still faces multiple difficulties. The existing black box method can be generally summarized into three routes, n