CN-121561680-B - Multi-modal data time synchronization, annotation and edge reasoning method and system
Abstract
The invention provides a multi-modal data time synchronization, annotation and edge reasoning method and system, relating to the technical field of cognitive assessment. The method comprises: collecting multi-modal data; aligning the time of the multi-modal data using a global logic clock, the network time protocol and sliding-window drift correction; identifying 3 dimensions of task stage, behavior event and emotion state and outputting 3 classes of annotation labels; deploying an edge reasoning model that automatically triggers cloud rechecking when the confidence is below a threshold; and generating a unique timestamp and task identifier for each acquisition frame, with the cloud establishing a unified log database. By deeply combining a timing-correction algorithm with AI model annotation logic, the invention achieves millisecond-level time alignment and intelligent annotation of multi-modal data, solving the technical problems of asynchronous data, high delay and heavy manual annotation in cognitive assessment and rehabilitation training scenarios.
Inventors
- XIA MINGYUE
- TANG WEI
- YANG CHUAN
- LI WENBO
Assignees
- 华院计算技术(上海)股份有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-21
Claims (14)
- 1. A multi-modal data time synchronization, annotation and edge reasoning method, characterized by comprising the following steps: S1, multi-modal data acquisition during the cognitive evaluation process, wherein the data collected by one sensor within one hardware sampling period is defined as an acquisition frame, and the data modalities comprise an audio stream, a video stream, touch and handwriting trajectories, action/skeleton trajectories, and a task event stream; S2, time synchronization and drift correction, namely synchronizing the hardware sampling period based on a reference sampling signal from the master-control MCU, aligning the time of the multi-modal data using a global logic clock, the network time protocol and sliding-window drift correction, and performing drift correction when the error exceeds 2 ms, the sliding-window drift correction dynamically updating the compensation parameter based on the offsets of the past N frames recorded in a sliding window; S3, AI intelligent annotation, namely uniformly encoding the multi-modal data into feature vectors, identifying 3 dimensions of task stage, behavior event and emotion state, outputting 3 classes of annotation labels, smoothing the annotation results, performing conflict detection, and screening conflicting or abnormal annotations; S4, edge reasoning and cloud-edge cooperation, namely deploying a lightweight AI model at the device end for immediate judgment and output of valid labels, with cloud rechecking automatically triggered when the confidence is below a threshold; and S5, generating a unique timestamp and task identifier for each acquisition frame and establishing a unified cloud log database recording the acquisition, reasoning, storage and AI annotation information of the multi-modal data, so as to realize full-flow tracing, playback and quality assessment of the data.
- 2. The method according to claim 1, wherein the data parameters of step S1 include: the video sampling frame rate is ≥ 25 fps and the resolution is ≥ 1080p; the audio sampling rate is ≥ 16 kHz/24 bit and the signal-to-noise ratio is ≥ 60 dB; the touch sampling frequency is ≥ 60 Hz and the trajectory precision is ≤ 2 mm; the motion detection module uses a high-precision skeleton recognition model with an inter-frame drift of ≤ 5 px.
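A minimal Python sketch of checking acquired streams against the claim-2 thresholds; the dictionary keys and function name are illustrative assumptions, not from the patent:

```python
# Minima and maxima taken from claim 2; field names are hypothetical.
MINIMA = {"video_fps": 25, "video_res_p": 1080, "audio_hz": 16000,
          "audio_snr_db": 60, "touch_hz": 60}
MAXIMA = {"track_error_mm": 2.0, "frame_drift_px": 5.0}

def meets_acquisition_spec(params: dict) -> bool:
    """True if every reported parameter satisfies its claim-2 bound."""
    ok_min = all(params.get(k, 0) >= v for k, v in MINIMA.items())
    ok_max = all(params.get(k, float("inf")) <= v for k, v in MAXIMA.items())
    return ok_min and ok_max
```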
- 3. The method according to claim 1, wherein the specific method of step S2 comprises correcting the local time as T_global = T_local + K(t) + Δ_NTP, wherein T_global is the unified time, T_local is the local time obtained by the sensor, K(t) is the dynamic compensation parameter calculated from the sliding-window drift, and Δ_NTP is the network time synchronization deviation; the dynamic compensation parameter is updated as K(t+1) = K(t) + α · E_avg, wherein E_avg is the average drift error per unit time in the sliding window, K(t+1) is the corrected dynamic compensation parameter, K(t) is the current dynamic compensation parameter, and α is the dynamic learning rate.
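A minimal sketch of the claim-3 correction and update rule, assuming the additive forms reconstructed above (the original equations are not recoverable from the source text); names and the α value are illustrative:

```python
def to_global_time(t_local: float, k: float, delta_ntp: float) -> float:
    """T_global = T_local + K(t) + Δ_NTP (reconstructed additive form)."""
    return t_local + k + delta_ntp

def update_compensation(k: float, e_avg: float, alpha: float = 0.1) -> float:
    """K(t+1) = K(t) + α · E_avg; the learning rate chosen here is illustrative."""
    return k + alpha * e_avg
```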
- 4. The method of claim 3, wherein the calculation of E_avg comprises a simplified method and a robust method: the simplified method computes E_avg = (1/N) Σ_i (e_i / Δt_i), wherein e_i is the current alignment residual and Δt_i is the local time increment, measured from the anchor point, of the sampling point corresponding to the i-th residual sample in the sliding window, the anchor point being the drift-correction reference moment established by a time-setting event; the robust method computes E_avg as the slope of a linear regression, E_avg = Σ_i (x_i − x̄)(e_i − ē) / Σ_i (x_i − x̄)², wherein x is the set of the local time increments x_i, e is the set of the residuals e_i, and x̄ and ē are their means over the window.
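A pure-Python sketch of the two E_avg estimators as reconstructed in claim 4; function names are illustrative:

```python
def e_avg_simple(residuals, increments):
    """Simplified method: mean of the per-sample drift rates e_i / Δt_i."""
    rates = [e / dt for e, dt in zip(residuals, increments)]
    return sum(rates) / len(rates)

def e_avg_robust(residuals, increments):
    """Robust method: slope of the least-squares regression of e on x."""
    n = len(residuals)
    mean_x = sum(increments) / n
    mean_e = sum(residuals) / n
    num = sum((x - mean_x) * (e - mean_e) for x, e in zip(increments, residuals))
    den = sum((x - mean_x) ** 2 for x in increments)
    return num / den
```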
- 5. The method of claim 3, wherein automatic drift retraining is triggered to perform drift correction when the synchronization error E = |T_global − T_sensor| > 2 ms, wherein T_sensor is the sensor time.
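A small sketch of the claim-5 trigger; the 2 ms threshold is from the claim, names are illustrative:

```python
SYNC_ERROR_THRESHOLD_MS = 2.0  # threshold from claim 5

def needs_drift_retraining(t_global_ms: float, t_sensor_ms: float) -> bool:
    """Trigger drift correction when E = |T_global − T_sensor| exceeds 2 ms."""
    return abs(t_global_ms - t_sensor_ms) > SYNC_ERROR_THRESHOLD_MS
```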
- 6. The method according to claim 1, wherein the specific steps of step S3 comprise: S31, multi-modal feature fusion, namely using a Transformer/time-series fusion vector-encoding model to encode the visual key-frame features, the audio spectrum features, the motion trajectories, the eye-movement sequence features and the touch-trajectory features with cross-modal attention and temporal position encoding, and outputting a feature vector of fixed dimension; S32, annotation task output, namely identifying through a classifier and outputting 3 classes of annotation labels: task stage labels comprising preparation, execution, completion and hesitation; behavior event labels comprising error, repetition, slowness and incompleteness; and emotion state labels comprising tension, pleasure, concentration and confusion, wherein the annotation labels are bound to the original data in a JSON structure; S33, confidence smoothing and conflict detection, namely performing voting smoothing over a time window ΔT to smooth the annotation results and filter jitter labels: confidence' = (1/N) Σ_i confidence_i, where confidence' is the smoothed confidence, confidence_i is the class confidence of the i-th result within the window, and N is the number of results in the window.
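A minimal sketch of the claim-6 voting smoothing, assuming (as an illustration) that a label occurring only once in the ΔT window counts as jitter; all names are illustrative:

```python
from collections import defaultdict

def smooth_labels(window_results):
    """window_results: list of (label, confidence) pairs within one ΔT window.
    Returns the majority label with confidence' = (1/N) Σ confidence_i,
    or None when the winner is a lone 'jitter' occurrence."""
    if not window_results:
        return None
    buckets = defaultdict(list)
    for label, conf in window_results:
        buckets[label].append(conf)
    label, confs = max(buckets.items(), key=lambda kv: len(kv[1]))
    if len(confs) < 2:  # assumption: a singleton in the window is jitter
        return None
    return label, sum(confs) / len(confs)
```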
- 7. The method according to claim 6, wherein step S33 comprises: merging identical events by annotation ID, computing the segment and smoothed confidence of each event class over a sliding window stepped every 50 ms, and then performing conflict detection, specifically: when same-class labels from different source modalities conflict within 100 ms, fusing them or retaining multiple candidates; for long-term conflicts, outputting a conflict flag and reducing the weight.
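A sketch of the claim-7 conflict check, interpreting a conflict as two different label values in the same annotation dimension arriving from different source modalities within 100 ms; the event fields are illustrative:

```python
CONFLICT_WINDOW_MS = 100  # window from claim 7

def detect_conflicts(events):
    """events: dicts with 't_ms', 'dimension', 'label', 'modality' (hypothetical keys).
    Returns pairs whose labels disagree within one dimension inside 100 ms."""
    conflicts = []
    events = sorted(events, key=lambda e: e["t_ms"])
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            if b["t_ms"] - a["t_ms"] > CONFLICT_WINDOW_MS:
                break  # later events are outside the window
            if (a["dimension"] == b["dimension"]
                    and a["modality"] != b["modality"]
                    and a["label"] != b["label"]):
                conflicts.append((a, b))
    return conflicts
```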
- 8. The method according to claim 1, wherein the specific steps of step S4 comprise: S41, deploying a lightweight edge model, namely compressing the original AI model by 70% using 8-bit quantization and distillation, and deploying the compressed model to NPU, DSP or CPU device ends to perform voice keyword recognition, facial emotion recognition and behavior detection; S42, edge decision reasoning, namely outputting a valid label when the confidence of the device-end recognition result is not less than 0.85, and otherwise marking the data segment for cloud recheck; S43, cloud-edge cooperation, namely uploading the device-side reasoning result and timestamp to the cloud, where the master model checks consistency.
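A minimal sketch of the claim-8 edge decision rule; the 0.85 threshold is from the claim and the output field names are borrowed from the claim-11 log structure, while the function name and flag values are illustrative:

```python
EDGE_CONFIDENCE_THRESHOLD = 0.85  # threshold from claim 8

def edge_decide(label: str, confidence: float) -> dict:
    """Emit a valid label locally, or mark the segment for cloud recheck."""
    flag = "valid" if confidence >= EDGE_CONFIDENCE_THRESHOLD else "cloud_recheck"
    return {"edgeTag": label, "edgeDecisionFlag": flag, "confidence": confidence}
```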
- 9. The method according to claim 8, wherein in step S43 the master model verifies consistency via a consistency index C = 1 − |Score_edge − Score_cloud|, wherein Score_edge is the calibrated confidence of the edge node for a given event within the alignment time window and Score_cloud is the calibrated confidence of the cloud master model for the same event within the same window; if C < 0.6, the edge model is automatically marked as biased.
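A sketch of the claim-9 bias check, assuming the reconstructed index C = 1 − |Score_edge − Score_cloud| (the exact formula cannot be recovered from the source text); names are illustrative:

```python
BIAS_THRESHOLD = 0.6  # threshold from claim 9

def is_edge_biased(score_edge: float, score_cloud: float) -> bool:
    """Mark the edge model as biased when the consistency index C falls below 0.6."""
    c = 1.0 - abs(score_edge - score_cloud)  # assumed form of the index
    return c < BIAS_THRESHOLD
```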
- 10. The method of claim 1, wherein step S5 comprises backtracking and auditing based on the unified cloud log database: the edge end uploads summaries in real time, and the cloud aggregates them and builds a time-axis index to enable multi-modal search, alignment, replay and audit; in offline scenarios the data are cached locally and, after the network is restored, retransmitted and merged with the cloud log.
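A minimal sketch of the claim-10 offline-cache-and-retransmit behavior; the class, queue, and `upload` callable are illustrative assumptions, not part of the patent:

```python
from collections import deque

class LogUplink:
    """Buffers edge summaries while offline and merges them after reconnection."""

    def __init__(self, upload):
        self.upload = upload   # callable sending one summary to the cloud log
        self.cache = deque()   # local cache used in offline scenarios

    def push(self, summary: dict, online: bool) -> None:
        if online:
            while self.cache:  # retransmit the backlog first, in order
                self.upload(self.cache.popleft())
            self.upload(summary)
        else:
            self.cache.append(summary)
```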
- 11. The method of claim 1, wherein the constructed data structures comprise: in step S1, meta information added to each acquisition frame: {timestamp_local, sensorID, modalityType, frameIndex, deviceID}, wherein timestamp_local is the local timestamp, sensorID is the sensor identifier, modalityType is the modality type, frameIndex is the frame number, and deviceID is the device ID; in step S3, the data of the output annotation label: {timestamp_global, eventType, eventConfidence, modalitySource, sensorID, annotationID}, wherein timestamp_global is the unified timestamp, eventType is the event type, eventConfidence is the recognition confidence, modalitySource is the source modality, sensorID is the sensor identifier, and annotationID is the annotation ID; in step S5, the data recorded per frame: {taskID, sessionToken, timestamp_global, sensorID, deviceID, syncError, modelVersion, edgeDecisionFlag, edgeTag, confidence}, wherein taskID is the task ID, sessionToken is the session identifier, timestamp_global is the unified timestamp, sensorID is the sensor identifier, deviceID is the device ID, syncError is the time error, modelVersion is the model version, edgeDecisionFlag is the edge decision flag, edgeTag is the decision path tag, and confidence is the decision confidence.
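An illustrative instance of the three claim-11 structures as Python dicts; field names are from the claim, while every value is made up for demonstration:

```python
# Step S1: per-frame meta information (values hypothetical).
frame_meta = {"timestamp_local": 1735689600.012, "sensorID": "cam01",
              "modalityType": "video", "frameIndex": 4821, "deviceID": "edge-07"}

# Step S3: output annotation label bound to the original data.
annotation = {"timestamp_global": 1735689600.010, "eventType": "hesitation",
              "eventConfidence": 0.91, "modalitySource": "video",
              "sensorID": "cam01", "annotationID": "ann-000193"}

# Step S5: unified per-frame log record for tracing, playback and audit.
log_record = {"taskID": "task-12", "sessionToken": "sess-ab3f",
              "timestamp_global": 1735689600.010, "sensorID": "cam01",
              "deviceID": "edge-07", "syncError": 0.0008, "modelVersion": "v2.3",
              "edgeDecisionFlag": "valid", "edgeTag": "hesitation",
              "confidence": 0.91}
```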
- 12. A multi-modal data time synchronization, annotation and edge reasoning system, the parts of which cooperate to implement the steps of the method of any one of claims 1-11, the system comprising: a multi-modal acquisition end, specifically comprising a voice acquisition module, an eye-movement acquisition module, a video acquisition module and an action/touch acquisition module; a device end/edge computing node, specifically comprising a time synchronization and drift correction controller, a multi-modal buffering and alignment module, an AI intelligent annotation module, and a log and data packaging module; and a cloud center system, specifically comprising a cloud master-model reasoning and recheck module, a time reference synchronization module, a log and tracing database, and a data management and reporting system; wherein the cloud master-model reasoning and recheck module performs analysis with a high-precision AI model whose results are used to update the task identifiers and tracing information generated by the log and data packaging module, and the time reference synchronization module applies NTP/PTP global-clock corrections to the time synchronization and drift correction controller.
- 13. A computer device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of claims 1-11.
- 14. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-11.
Description
Multi-modal data time synchronization, annotation and edge reasoning method and system

Technical Field

The invention relates to the technical field of cognitive assessment, and in particular to a multi-modal data time synchronization, annotation and edge reasoning method and system.

Background

During cognitive evaluation and rehabilitation training, a subject usually has to perform multiple interactive tasks such as spoken answers, hand actions, drawing tasks, touch operations and facial expression responses, and the system must synchronously acquire multiple source modalities such as voice, video, motion trajectories and touch data while completing time alignment within a millisecond range to ensure accurate recognition of behavior events. In traditional settings, multi-modal data often depend on manual annotation or after-the-fact offline analysis; time offsets are severe and a unified timing reference is lacking, so event localization is inaccurate, annotation efficiency is low, and real-time feedback is difficult. This leads to a series of problems: (1) Asynchronous timing: the sampling frequencies and time references of the individual sensors are inconsistent, so the acquired data exhibit millisecond-level or greater skew, degrading task-event alignment and AI feature-extraction precision. (2) Low manual annotation efficiency: the multi-modal data volume is large, and manually marking task stages, emotion states or erroneous behaviors frame by frame is costly and highly subjective. (3) High cloud reasoning delay: existing systems generally upload the raw data to the cloud for unified analysis, causing feedback delay and making them unsuitable for real-time training and real-time evaluation. (4) Lack of a unified timing audit and traceability mechanism: in multi-device cooperative acquisition scenarios, the timing accuracy and source integrity of the data cannot be tracked rapidly. A unified system architecture implementing multi-modal data time synchronization, intelligent annotation, edge reasoning and tracing is therefore urgently needed to improve the real-time performance and reliability of data processing.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a multi-modal data time synchronization, annotation and edge reasoning method and system, suitable for intelligent medical and behavior-evaluation scenarios with strong multi-modal sensing requirements and high sensitivity to time precision and behavior-event localization, including cognitive screening, psychological evaluation, neural function monitoring, rehabilitation training evaluation, intelligent companion devices and other application fields. Through hardware-triggered synchronization, logic-clock fusion, automatic AI annotation and an edge reasoning mechanism, the invention realizes a complete "acquisition, synchronization, recognition, annotation, immediate edge decision" framework, suitable for intelligent terminals in hospitals, community digital-health sites, elderly-care assessment equipment, mobile rehabilitation equipment, household intelligent companion systems and the like.
The system ensures that the multi-modal data carry a highly consistent time reference from the acquisition stage onward and have a traceable, low-delay and extensible structure, making it an essential basic capability of intelligent medical equipment. In a first aspect, the invention provides a multi-modal data time synchronization, annotation and edge reasoning method, comprising the steps of: S1, multi-modal data acquisition during the cognitive evaluation process, wherein the data collected by one sensor within one hardware sampling period is defined as an acquisition frame, and the data modalities comprise an audio stream, a video stream, touch and handwriting trajectories, action/skeleton trajectories, and a task event stream; S2, time synchronization and drift correction, namely synchronizing the hardware sampling period based on a reference sampling signal from the master-control MCU, aligning the time of the multi-modal data using a global logic clock, the network time protocol and sliding-window drift correction, and controlling the error within 2 ms; S3, AI intelligent annotation, namely uniformly encoding the multi-modal data into feature vectors, identifying 3 dimensions of task stage, behavior event and emotion state, outputting 3 classes of annotation labels, smoothing the annotation results, performing conflict detection, and screening conflicting or abnormal annotations; S4, edge reasoning and cloud-edge cooperation, namely deploying a lightweight AI model at the device end for immediate judgment and label generation, with cloud rechecking automatically triggered when the confidence is below a threshold.