CN-122027849-A - Video forensics method based on Transformer and quantum features

Abstract

The invention provides a video forensics method based on a Transformer and quantum features, belonging to the intersection of multimedia content security and computer vision, and comprising the following steps: S1, multimodal raw data acquisition; S2, multimodal data preprocessing; S3, preliminary single-modality authenticity detection; S4, quantum-optimized multimodal feature fusion; and S5, comprehensive multimodal authenticity judgment. Through multimodal cooperation and quantum-technology innovation, the method effectively addresses the insufficient robustness and imprecise localization of traditional forensic techniques, and provides an efficient, accurate, and deployable technical scheme for verifying the authenticity of video content.

Inventors

  • Shan Wuyang
  • Su Hang
  • Si Jingwei
  • Guo Junjie
  • Xie Yijie
  • Wang Yan
  • Lu Xiaoyu
  • Yu Xiaoqin
  • Cao Jiarui
  • Yang Ye

Assignees

  • Chengdu University of Technology (成都理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (10)

  1. A video forensics method based on a Transformer and quantum features, characterized by comprising the following steps: S1, multimodal raw data acquisition: acquiring the visual, audio, and text modality data and the metadata of a video to form a multimodal raw data set; S2, multimodal data preprocessing: denoising the multimodal raw data, unifying formats, and aligning time axes, and outputting a preprocessed multimodal data set; S3, preliminary single-modality authenticity detection: based on the multimodal data set, counting the number of anomalous frames in the visual modality, the total duration of anomalous segments in the audio modality, the semantic mismatch duration and number of inconsistent watermark characters in the text modality, and the number of anomalous core fields in the metadata, while also computing the average anomaly confidence of each modality, and outputting four single-modality anomaly reports; S4, quantum-optimized multimodal feature fusion, comprising: S4.1, extracting the anomaly ratio and average anomaly confidence of each modality from the single-modality anomaly reports to form standardized single-modality feature sub-vectors; S4.2, optimizing the standardized single-modality feature sub-vectors with a quantum surface fitting algorithm, constructing quantum states by mapping feature dimensions to qubits, minimizing the feature fitting error based on a Hamiltonian, and outputting quantum-optimized single-modality feature sub-vectors together with a quantum fitting confidence; S4.3, combining the per-modality average anomaly confidence from S3 with the quantum fitting confidence to compute dual-confidence fusion weights; S4.4, weighting the quantum-optimized single-modality feature sub-vectors element-wise by the dual-confidence fusion weights and concatenating them to output a multimodal fused feature vector; S5, comprehensive multimodal authenticity judgment: feeding the multimodal fused feature vector into a pre-trained Transformer model, outputting a preliminary authentic/tampered verdict, the main tampering type, and an overall judgment confidence, while cross-checking against the quantum-optimized single-modality feature sub-vectors to refine the tampering time range and spatial region, and outputting a comprehensive judgment report.
  2. The Transformer and quantum feature based video forensics method of claim 1, further comprising: S6, closed-loop verification and parameter optimization: constructing a validation data set with ground-truth labels, computing the judgment accuracy, recall, and F1 value from the comprehensive judgment report, while also evaluating the average fitting error and mean quantum confidence of the S4.2 quantum surface fitting algorithm, and adjusting the S2 preprocessing parameters, the S3 detection thresholds, and the S4.2 quantum surface fitting algorithm parameters according to the evaluation results, forming a closed-loop forensic workflow.
  3. The video forensics method based on a Transformer and quantum features according to claim 1, wherein the specific processing of the quantum surface fitting algorithm in S4.2 comprises: S4.2.1, quantum state mapping: determining the number of qubits from the dimension of the standardized single-modality feature sub-vector, mapping each feature component to a single-qubit state whose ground-state probability amplitude encodes the component, and combining the single-qubit states into a multi-qubit state via tensor product; S4.2.2, Hamiltonian construction: targeting the feature fitting error, initializing the weight coefficient of each qubit to the corresponding feature component and building the Hamiltonian from Pauli-Z operators; S4.2.3, model training: iteratively optimizing the Hamiltonian weight coefficients by gradient descent, using the sum of fitting errors over all training samples as the stopping condition, and storing the trained weight coefficients; S4.2.4, feature optimization and confidence computation: correcting the standardized single-modality feature sub-vector with the trained weight coefficients to obtain the quantum-optimized feature sub-vector, and computing the quantum fitting confidence from the L2-norm relative error between the original and optimized features.
  4. The video forensics method based on Transformer and quantum features according to claim 1, wherein the computation of the dual-confidence fusion weights in S4.3 comprises: determining the weight coefficient alpha of the single-modality average anomaly confidence and the weight coefficient beta of the quantum fitting confidence by testing the comprehensive-judgment F1 value of different coefficient combinations on a validation data set; for each modality, computing the product of alpha and that modality's average anomaly confidence and the product of beta and that modality's quantum fitting confidence, and summing them to obtain the modality's composite confidence; and dividing each modality's composite confidence by the sum of the composite confidences over all modalities to obtain normalized dual-confidence fusion weights, the fusion weights of all modalities summing to 1.
  5. The video forensics method based on a Transformer and quantum features according to claim 1, wherein the processing of the Transformer model in S5 comprises: setting the dimension of the model input layer to the dimension of the S4.4 multimodal fused feature vector, with the hidden-layer dimension determined by cross-validation; the model training data comprising a public video tampering data set and quantum-optimized feature samples, each quantum-optimized feature sample consisting of a video's S4.4 multimodal fused feature vector and its ground-truth label; and the model output layer comprising a binary branch and a multi-class branch, the binary branch outputting the authentic and tampered probabilities, the multi-class branch outputting the probabilities of visual, audio, text, metadata, and multimodal tampering, with the highest-probability class taken as the main tampering type.
  6. The video forensics method based on Transformer and quantum features according to claim 5, wherein the specific processing of tampered-region localization in S5 comprises: converting the S3 visual anomaly frame indices into timestamps via the video frame rate, and unifying the visual anomaly frame timestamps, audio anomaly segment timestamps, and text anomaly subtitle timestamps onto the overall video time axis; cross-checking against the anomaly ratios in the S4.2 quantum-optimized single-modality feature sub-vectors, merging the anomaly timestamps of all modalities, and determining the tampering start and end times; and, according to the main tampering type, extracting the corresponding modality's anomalous-region information from the S3 single-modality anomaly report and marking high-density tampered regions in combination with the confidence of the S4.2 quantum-optimized features, forming a refined tampered-region description.
  7. The video forensics method based on Transformer and quantum features according to claim 2, wherein adjusting the S4.2 quantum surface fitting algorithm parameters in S6 comprises: if the average fitting error exceeds a preset threshold, reducing the gradient-descent learning rate or enlarging the training data set, where new training samples must contain standardized single-modality feature sub-vectors of different tampering types together with the corresponding ideal feature sub-vectors; and if the mean quantum confidence falls below a preset threshold, tightening the convergence threshold of the Hamiltonian training or adjusting the quantum state mapping rule by increasing the number of qubits corresponding to the feature dimensions.
  8. The video forensics method based on Transformer and quantum features according to claim 1, wherein the specific process of multimodal time-axis alignment in S2 comprises: visual modality frame alignment, comprising extracting SIFT feature points of each frame, matching them against the previous frame's feature points to compute the offset, and correcting the current frame position so that the inter-frame alignment error is at most 1 pixel; audio modality timestamp verification, comprising computing the total duration of all audio frames after framing and comparing it with the total audio duration recorded in S1, judging the alignment qualified when the error is at most 10 ms and otherwise re-tuning the frame length and frame shift parameters; and text modality time-axis matching, comprising comparing subtitle start/end timestamps against the overall video time axis, ensuring the subtitle time range lies within the video duration, and marking any exceeding portion as an anomalous subtitle.
  9. The video forensics method based on Transformer and quantum features according to claim 1, wherein the dynamic adjustment of the single-modality detection thresholds in S3 comprises: the visual modality inter-frame consistency detection threshold, adjusted by video scene type, with the mean inter-frame gray-level difference threshold set to 5 for static scenes and 15 for dynamic scenes; the audio modality voiceprint consistency detection threshold, set, based on the voiceprint similarity distribution of the training data set, to the minimum similarity reached by 95% of normal samples; and the text modality semantic matching detection threshold, set, via validation data set testing, to the similarity value at which the semantic matching error rate is at most 5%.
  10. The video forensics method based on Transformer and quantum features according to claim 2, wherein the evaluation indices of the closed-loop verification in S6 comprise: judgment performance indices, with accuracy = (true positives + true negatives) / total samples, recall = true positives / (true positives + false negatives), and F1 value = 2 × accuracy × recall / (accuracy + recall); quantum optimization indices, with average fitting error = the sum over all samples of the relative error between original and optimized features / the number of samples, and mean quantum confidence = the sum of the quantum fitting confidences of all samples / the number of samples; judging the current parameter configuration qualified when the F1 value meets or exceeds its preset threshold and the average fitting error does not exceed its preset threshold; and the S6 closed-loop optimization terminating when, over 3 consecutive verification rounds, both the judgment performance indices and the quantum optimization indices meet the qualification standard and the fluctuation between adjacent rounds is at most 5%; if the termination condition is not met, the S6 parameter adjustment and verification process is repeated until it is met or the number of adjustments reaches a preset upper limit, at which point an optimal parameter configuration report is output.
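
The quantum surface fitting step of claims 1 and 3 can be illustrated with a minimal classical simulation. The sketch below is an interpretation, not the patent's implementation: it assumes features normalized to [0, 1], ideal feature vectors available as fitting targets, the amplitude mapping |psi_i> = sqrt(1-x_i)|0> + sqrt(x_i)|1>, a Hamiltonian H = sum_i w_i Z_i (whose expectation on the product state is sum_i w_i (1 - 2 x_i), so the full tensor-product state never needs to be materialized), and an element-wise correction rule.

```python
import numpy as np

def qubit_z_expectations(x):
    """Map each feature x_i in [0, 1] to |psi_i> = sqrt(1-x_i)|0> + sqrt(x_i)|1>.
    For the product state, each qubit's Pauli-Z expectation is
    <Z_i> = (1 - x_i) - x_i = 1 - 2*x_i."""
    return 1.0 - 2.0 * np.asarray(x, dtype=float)

def fit_weights(features, ideals, lr=0.05, epochs=500, tol=1e-6):
    """Train the weights of H = sum_i w_i * Z_i by gradient descent so that
    <H> tracks each training sample's target value (claim 3, S4.2.2-S4.2.3)."""
    w = features.mean(axis=0).copy()  # init from feature components (assumed: mean over samples)
    targets = qubit_z_expectations(ideals).sum(axis=1)  # assumed target observable
    for _ in range(epochs):
        z = qubit_z_expectations(features)       # (n_samples, n_qubits)
        err = z @ w - targets
        if float(np.sum(err ** 2)) < tol:        # error sum as stop condition
            break
        w -= lr * 2.0 * (z * err[:, None]).mean(axis=0)
    return w

def optimize_subvector(x, w):
    """S4.2.4: correct one standardized sub-vector with the trained weights and
    derive the quantum fitting confidence from the L2 relative error."""
    x = np.asarray(x, dtype=float)
    x_opt = np.clip(w * x, 0.0, 1.0)             # assumed element-wise correction
    rel_err = np.linalg.norm(x - x_opt) / (np.linalg.norm(x) + 1e-12)
    return x_opt, max(0.0, 1.0 - rel_err)

rng = np.random.default_rng(0)
ideals = rng.random((32, 4))                     # 4 features -> 4 qubits
noisy = np.clip(ideals + 0.05 * rng.normal(size=ideals.shape), 0.0, 1.0)
w = fit_weights(noisy, ideals)
x_opt, q_conf = optimize_subvector(noisy[0], w)
```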
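
The dual-confidence fusion of claim 4 and the S4.4 weighting-and-concatenation step reduce to a short computation. In the sketch below, the alpha and beta values and the example sub-vectors are placeholders; the patent selects alpha and beta by the comprehensive-judgment F1 on a validation set.

```python
import numpy as np

def fusion_weights(avg_anom_conf, quantum_conf, alpha=0.6, beta=0.4):
    """Claim 4: composite confidence c_m = alpha*a_m + beta*q_m per modality,
    normalized so the weights sum to 1. alpha/beta here are assumed values."""
    c = (alpha * np.asarray(avg_anom_conf, float)
         + beta * np.asarray(quantum_conf, float))
    return c / c.sum()

# S4.4: element-wise weighting of each quantum-optimized sub-vector, then
# concatenation into the multimodal fused feature vector.
w = fusion_weights([0.9, 0.4, 0.6, 0.2], [0.95, 0.7, 0.8, 0.5])
subvecs = [np.array([0.8, 0.9]), np.array([0.3, 0.7]),
           np.array([0.5, 0.8]), np.array([0.1, 0.5])]
fused = np.concatenate([wi * v for wi, v in zip(w, subvecs)])
assert abs(w.sum() - 1.0) < 1e-9
```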
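
Claim 5's dual-branch Transformer can be sketched in PyTorch. Everything below beyond the two output branches is an assumption: the choice to treat each modality's weighted sub-vector (anomaly ratio plus average anomaly confidence, hence feat_dim=2) as one token, the mean pooling, and all layer sizes, which the patent determines by cross-validation.

```python
import torch
import torch.nn as nn

class ForensicsTransformer(nn.Module):
    def __init__(self, feat_dim=2, d_model=64, n_heads=4,
                 n_layers=2, n_types=5):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # per-modality token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.binary_head = nn.Linear(d_model, 2)        # authentic vs tampered
        self.type_head = nn.Linear(d_model, n_types)    # 5 tamper-type classes

    def forward(self, x):
        # x: (batch, 4 modalities, feat_dim) -- the fused vector split back
        # into its weighted sub-vectors before tokenization (assumed).
        h = self.encoder(self.proj(x)).mean(dim=1)      # pooled representation
        return self.binary_head(h), self.type_head(h)

model = ForensicsTransformer()
bin_logits, type_logits = model(torch.randn(1, 4, 2))
p_tampered = bin_logits.softmax(-1)[0, 1]
main_type = int(type_logits.softmax(-1).argmax(-1))     # highest-probability type
```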
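
The time-range localization of claim 6 rests on two operations: converting anomalous frame indices to timestamps via the frame rate, and merging the per-modality anomaly timestamps into tamper intervals. The merge gap below is an assumed parameter, not from the patent.

```python
def frames_to_times(frame_indices, fps):
    """Convert anomalous frame indices to timestamps on the video timeline."""
    return [i / fps for i in frame_indices]

def merge_intervals(stamps, gap=0.5):
    """Merge anomaly timestamps from all modalities into [start, end] tamper
    ranges; timestamps closer than `gap` seconds join the same range."""
    out = []
    for t in sorted(stamps):
        if out and t - out[-1][1] <= gap:
            out[-1][1] = t
        else:
            out.append([t, t])
    return out

# Example: visual anomalies at frames 120-150 of a 30 fps clip plus an
# audio anomaly segment at 4.1-4.9 s collapse into one tamper range.
stamps = frames_to_times(range(120, 151), fps=30) + [4.1, 4.9]
print(merge_intervals(stamps))   # ~[[4.0, 5.0]]
```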
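
Claim 8's alignment checks map to standard tooling. The sketch below uses OpenCV's SIFT API for the visual offset and a duration check for the audio frames; the median-shift estimate is an assumed robustification, while the 1-pixel and 10 ms tolerances come from the claim.

```python
import cv2
import numpy as np

def interframe_offset(prev_gray, cur_gray):
    """Estimate the (dx, dy) shift of the current frame relative to the
    previous one from matched SIFT keypoints; claim 8 requires the residual
    alignment error after correction to be at most 1 pixel."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(cur_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    shifts = np.array([np.subtract(kp2[m.trainIdx].pt, kp1[m.queryIdx].pt)
                       for m in matches])
    return np.median(shifts, axis=0)             # robust offset estimate

def audio_alignment_ok(frame_len_s, hop_s, n_frames, recorded_total_s):
    """Claim 8: the framed total duration must match the duration recorded in
    S1 within 10 ms, otherwise frame length / frame shift need re-tuning."""
    framed_total = (n_frames - 1) * hop_s + frame_len_s
    return abs(framed_total - recorded_total_s) <= 0.010
```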
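
The threshold rules of claim 9 are simple to state in code. The quantile reading of "the minimum similarity reached by 95% of normal samples" and the threshold-selection loop for the semantic rule are interpretations of the claim's wording.

```python
import numpy as np

def gray_diff_threshold(scene_type):
    """Claim 9: mean inter-frame gray-level difference threshold,
    5 for static scenes and 15 for dynamic scenes."""
    return 5.0 if scene_type == "static" else 15.0

def voiceprint_threshold(normal_sims, coverage=0.95):
    """Claim 9: the similarity threshold is the lowest similarity still
    reached by 95% of normal samples, i.e. the (1 - coverage) quantile of
    the normal-sample similarity distribution."""
    return float(np.quantile(np.asarray(normal_sims, float), 1.0 - coverage))

def semantic_threshold(sims, is_match, max_error=0.05):
    """Claim 9: pick the smallest similarity cut-off whose matching error
    rate on validation pairs is at most 5% (selection rule assumed)."""
    for t in sorted(set(sims)):
        pred = [s >= t for s in sims]
        err = sum(p != m for p, m in zip(pred, is_match)) / len(sims)
        if err <= max_error:
            return t
    return max(sims)
```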
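
Finally, the closed loop of claims 2, 7, and 10 can be summarized as one evaluate-adjust-converge cycle. The metric formulas follow claim 10 verbatim (its F1 is defined from accuracy, where the standard definition uses precision); the concrete threshold values and update factors below are assumptions.

```python
def evaluate_round(tp, tn, fp, fn, fit_errors, q_confs):
    """Claim 10 metrics, reproduced as claimed (F1 built from accuracy)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    f1 = 2 * accuracy * recall / (accuracy + recall)
    avg_fit_error = sum(fit_errors) / len(fit_errors)
    mean_q_conf = sum(q_confs) / len(q_confs)
    return f1, avg_fit_error, mean_q_conf

def adjust_parameters(params, avg_fit_error, mean_q_conf,
                      err_max=0.05, conf_min=0.8):
    """Claim 7 adjustments; thresholds and update factors are assumed."""
    if avg_fit_error > err_max:
        params["learning_rate"] *= 0.5          # lower the gradient-descent LR
        params["train_set_scale"] *= 2          # ...or enlarge the training set
    if mean_q_conf < conf_min:
        params["convergence_tol"] *= 0.1        # tighten Hamiltonian training
        params["qubits_per_feature"] += 1       # remap with more qubits
    return params

def converged(history, f1_min=0.9, err_max=0.05):
    """Claim 10 stop rule: 3 consecutive qualified rounds whose adjacent
    F1 / fitting-error values fluctuate by at most 5% (f1_min, err_max
    stand in for the claim's unspecified preset thresholds)."""
    if len(history) < 3:                        # entries: (f1, err, q_conf)
        return False
    rounds = history[-3:]
    qualified = all(f1 >= f1_min and err <= err_max for f1, err, _ in rounds)
    stable = all(abs(b[i] - a[i]) <= 0.05 * max(abs(a[i]), 1e-12)
                 for a, b in zip(rounds, rounds[1:]) for i in (0, 1))
    return qualified and stable
```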

Description

Video forensics method based on Transformer and quantum features

Technical Field

The invention relates to the intersection of multimedia content security and computer vision, and in particular to a video forensics method based on a Transformer and quantum features.

Background

With the popularity of video editing tools and the rapid development of AI-generated content (AIGC) technology, video tampering techniques (such as visual frame splicing/replacement, audio synthesis/clipping, text subtitle tampering, and metadata falsification) are increasingly covert, and conventional video authenticity forensics faces significant limitations.

Reliance on a single modality: the prior art judges from a single modality alone (such as visual edge detection or audio voiceprint comparison) and is easily evaded by targeted tampering; for example, visual tampering can hide splicing traces by smoothing edges, and audio tampering can imitate the original speaker by adjusting voiceprint parameters, so single-modality detection is prone to false positives and misses.

Shallow multimodal fusion: some multimodal schemes merely perform simple weighted fusion of per-modality detection results; noise in single-modality features (such as anomalous-frame false positives caused by occasional salt-and-pepper noise in video, or energy-detection bias caused by environmental noise in audio) is not effectively corrected, and the fusion weights depend only on single-modality detection confidence while ignoring feature accuracy, so the fused result is easily misled by low-quality features.

Lack of efficient feature optimization: traditional feature processing (such as Gaussian filtering and wavelet denoising) removes only surface noise and cannot correct the systematic bias that detection-logic limitations introduce into single-modality features (such as the feature deviation produced by speech-to-text errors in text semantic matching); meanwhile, the fitting advantages of quantum computation on high-dimensional features have not been exploited, making it hard to break the efficiency and robustness bottlenecks of classical feature optimization.

No closed-loop optimization mechanism: existing schemes are mostly one-way acquisition-detection-judgment pipelines without a parameter feedback mechanism based on verification results, so detection thresholds, fusion weights, and feature optimization parameters cannot be adapted to the forensic demands of different scenes (such as static versus dynamic video, or different capture devices), limiting adaptability and continuous performance improvement.

These pain points make it difficult for conventional schemes to achieve judgment accuracy, localization precision, and general applicability simultaneously in complex tampering scenarios, and they fall short of increasingly strict requirements for video content authenticity forensics.
Disclosure of Invention

The invention provides a video forensics method based on a Transformer and quantum features, which, through multimodal cooperation and quantum-technology innovation, effectively addresses the insufficient robustness and imprecise localization of traditional forensic techniques, and offers an efficient, accurate, and deployable technical scheme for verifying the authenticity of video content. To this end, the invention adopts the following technical scheme: a video forensics method based on a Transformer and quantum features, comprising: S1, collecting multimodal raw data, obtaining the visual, audio, and text modality data and the metadata of a video to form a multimodal raw data set; S2, preprocessing the multimodal raw data by denoising, unifying formats, and aligning time axes, and outputting the preprocessed multimodal data set; S3, preliminarily detecting single-modality authenticity by counting the number of anomalous frames in the visual modality, the total duration of anomalous segments in the audio modality, the semantic mismatch duration and number of mismatched watermark characters in the text modality, and the number of anomalous core fields in the metadata, computing the average anomaly confidence of each modality, and outputting four single-modality anomaly reports; S4, quantum-optimized multimodal feature fusion: S4.1, extracting the anomaly ratio and average anomaly confidence of each modality from the single-modality anomaly reports to form standardized single-modality feature vectors; S4.2, optimizing the standardized single-modality feature sub-vectors based