CN-122024329-A - False fall filtering method and system based on multi-modal spatio-temporal feature fusion
Abstract
An embodiment of the application provides a false fall filtering method and system based on multi-modal spatio-temporal feature fusion. The method comprises: acquiring continuous-time video stream data containing a target person; performing skeleton keypoint recognition and posture index calculation on the video stream data through a predetermined pose keypoint recognition model to determine a primary suspected fall event; acquiring multi-modal data within a preset time period based on the time corresponding to the primary suspected fall event, and performing multi-modal analysis and multi-modal fusion based on the multi-modal data to determine a secondary suspected fall event; performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine a final determination result; and triggering an alarm mechanism in the case that the final determination result indicates that the target person has truly fallen. With this scheme, the accuracy of fall identification results can be improved.
Inventors
- LING XINGYU
- HE HUA
Assignees
- Xidian University (西安电子科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-10
Claims (10)
- 1. A false fall filtering method based on multi-modal spatio-temporal feature fusion, the method comprising: acquiring continuous-time video stream data containing a target person; performing skeleton keypoint recognition and posture index calculation on the video stream data through a predetermined pose keypoint recognition model, and determining a primary suspected fall event; acquiring multi-modal data within a preset time period based on the time corresponding to the primary suspected fall event, and performing multi-modal analysis and multi-modal fusion based on the multi-modal data to determine a secondary suspected fall event; performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event, and determining a final determination result; and triggering an alarm mechanism in the case that the final determination result indicates that the target person has truly fallen.
- 2. The method of claim 1, wherein performing skeleton keypoint recognition and posture index calculation on the video stream data through the predetermined pose keypoint recognition model to determine the primary suspected fall event comprises: performing skeleton keypoint recognition on the target person in the video stream data through the pose keypoint recognition model to obtain human skeleton keypoints and first coordinate information corresponding to the human skeleton keypoints; calculating a posture index based on the first coordinate information corresponding to the human skeleton keypoints, and determining a trunk main-axis tilt angle and a head height variation; and determining the primary suspected fall event in the case that the trunk main-axis tilt angle is greater than a preset tilt angle threshold and the head height variation is greater than a preset height threshold.
- 3. The method of claim 2, wherein calculating the posture index based on the first coordinate information corresponding to the human skeleton keypoints, and determining the trunk main-axis tilt angle and the head height variation, comprises: determining target coordinate information corresponding to two groups of target skeleton keypoints related to the trunk posture based on the first coordinate information corresponding to the human skeleton keypoints, wherein the two groups of target skeleton keypoints are upper-body skeleton keypoints and lower-body skeleton keypoints, respectively; calculating the trunk main-axis tilt angle based on the target coordinate information; selecting longitudinal coordinate values of head or neck keypoints based on the first coordinate information corresponding to the human skeleton keypoints, and determining a keypoint longitudinal coordinate sequence; and calculating the head height variation based on the keypoint longitudinal coordinate sequence.
- 4. The method of claim 1, wherein the multi-modal data comprises an initial human skeleton keypoint sequence, an initial triaxial acceleration sequence, and an initial environmental audio sequence; and performing multi-modal analysis and multi-modal fusion based on the multi-modal data to determine the secondary suspected fall event comprises: performing time alignment processing on the initial human skeleton keypoint sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence to obtain a human skeleton keypoint sequence, a triaxial acceleration sequence, and an environmental audio sequence; performing feature analysis and multi-modal fusion on the human skeleton keypoint sequence, the triaxial acceleration sequence, and the environmental audio sequence, respectively, to determine a fusion confidence; acquiring an attention weight for adjusting a dynamic threshold, performing dynamic threshold adjustment based on the attention weight, and determining a corrected fusion determination threshold for the region where the target person is located; and determining the secondary suspected fall event in the case that the fusion confidence is greater than the corrected fusion determination threshold.
- 5. The method of claim 4, wherein performing feature analysis and multi-modal fusion on the human skeleton keypoint sequence, the triaxial acceleration sequence, and the environmental audio sequence, respectively, to determine the fusion confidence comprises: calculating, through a predetermined fall behavior recognition neural network model, a similarity between the human skeleton keypoint sequence and a preset real fall pattern sequence to obtain a fall confidence; performing feature extraction on the triaxial acceleration sequence to obtain energy distribution and frequency features of a vibration signal, and calculating a vibration confidence based on the energy distribution and the frequency features; performing feature extraction on the environmental audio sequence to obtain transient impact features, and calculating an acoustic-mode confidence based on the transient impact features; and performing multi-modal fusion based on the fall confidence, the vibration confidence, and the acoustic-mode confidence to determine the fusion confidence.
- 6. The method of claim 1, wherein performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine the final determination result comprises: performing, based on the secondary suspected fall event, skeleton keypoint recognition on the video stream data through the pose keypoint recognition model to determine final human skeleton keypoints and second coordinate information corresponding to the final human skeleton keypoints, wherein the final human skeleton keypoints comprise head keypoints, shoulder keypoints, and hip keypoints; calculating displacement variances corresponding to the head keypoints, the shoulder keypoints, and the hip keypoints, respectively, based on the final human skeleton keypoints and the second coordinate information corresponding to the final human skeleton keypoints; determining a displacement variance mean based on the displacement variances corresponding to the head keypoints, the shoulder keypoints, and the hip keypoints, and comparing the displacement variance mean with a preset static threshold to determine a static posture determination result; performing audio feature extraction and sensitive speech segment recognition on the video stream data to determine a sensitive speech segment recognition determination result; and determining the final determination result based on the static posture determination result and the sensitive speech segment recognition determination result.
- 7. The method of claim 6, wherein performing audio feature extraction and sensitive speech segment recognition on the video stream data to determine the sensitive speech segment recognition determination result comprises: performing audio feature extraction on the video stream data to obtain an audio feature sequence; performing feature extraction on the audio feature sequence through a predetermined semantic-sensitive branch, a predetermined non-semantic-anomaly branch, and a predetermined silence-detection branch, respectively, to obtain a first feature map, a second feature map, and a third feature map corresponding to the semantic-sensitive branch, the non-semantic-anomaly branch, and the silence-detection branch, respectively; performing feature fusion on the first feature map, the second feature map, and the third feature map to obtain a fused feature; performing sensitive speech segment recognition based on the fused feature to obtain a three-dimensional vector, wherein the three-dimensional vector comprises confidences corresponding to semantic sensitivity, non-semantic anomaly, and silence detection; and comparing the confidences corresponding to semantic sensitivity, non-semantic anomaly, and silence detection with preset confidence thresholds corresponding to semantic sensitivity, non-semantic anomaly, and silence detection, respectively, to determine the sensitive speech segment recognition determination result.
- 8. A false fall filtering system based on multi-modal spatio-temporal feature fusion, comprising an acquisition module, a determination module, and an alarm module, wherein: the acquisition module is configured to acquire continuous-time video stream data containing a target person; the determination module is configured to perform skeleton keypoint recognition and posture index calculation on the video stream data through a predetermined pose keypoint recognition model to determine a primary suspected fall event, acquire multi-modal data within a preset time period based on the time corresponding to the primary suspected fall event, perform multi-modal analysis and multi-modal fusion on the multi-modal data to determine a secondary suspected fall event, and perform static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine a final determination result; and the alarm module is configured to trigger an alarm mechanism in the case that the final determination result indicates that the target person has truly fallen.
- 9. A false fall filtering device based on multi-modal spatio-temporal feature fusion, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to call and run the computer program from the memory to perform the method of any one of claims 1 to 7.
- 10. A computer-readable storage medium storing executable instructions for causing a processor to perform the method of any one of claims 1 to 7.
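The multi-modal fusion step of claims 4 and 5 can be sketched as follows. The fusion weights, the base threshold, and the use of a simple multiplicative attention scaling for the corrected threshold are illustrative assumptions for this sketch; the patent does not specify these values or the exact fusion formula.

```python
# Hedged sketch of claims 4-5: fuse per-modality confidences and compare
# against an attention-adjusted threshold. Weights/thresholds are assumed.

def fusion_confidence(fall_conf: float, vib_conf: float, acoustic_conf: float,
                      weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the fall, vibration, and acoustic-mode confidences.

    The weights are illustrative; the patent only states that the three
    confidences are fused into a single fusion confidence.
    """
    return (weights[0] * fall_conf
            + weights[1] * vib_conf
            + weights[2] * acoustic_conf)

def secondary_fall_event(fall_conf: float, vib_conf: float,
                         acoustic_conf: float,
                         base_threshold: float = 0.6,
                         attention_weight: float = 1.0):
    """Return (is_secondary_event, fused_confidence).

    The corrected fusion determination threshold for the target person's
    region is modeled here as the base threshold scaled by the attention
    weight -- one plausible reading of the claimed dynamic adjustment.
    """
    fused = fusion_confidence(fall_conf, vib_conf, acoustic_conf)
    corrected_threshold = base_threshold * attention_weight
    return fused > corrected_threshold, fused
```

Under these assumed weights, a high skeleton-sequence similarity combined with moderate vibration and acoustic evidence (e.g. 0.9, 0.8, 0.7) fuses to 0.83 and exceeds the default threshold, while raising the attention weight for a low-risk region suppresses borderline events.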
Description
False fall filtering method and system based on multi-modal spatio-temporal feature fusion

Technical Field

The application relates to the technical field of fall identification, and in particular to a false fall filtering method and system based on multi-modal spatio-temporal feature fusion.

Background

With the continuous deepening of population aging, the risk of elderly people falling in daily life has increased markedly, and falls are often accompanied by serious consequences such as fractures and brain injuries, making them one of the leading causes of disability and even death among the elderly. How to identify fall events of the elderly timely and accurately in daily home or caregiving scenarios has become a key problem to be solved in the fields of intelligent companionship and health monitoring. Existing fall detection techniques rely mainly on single-modality vision analysis, wearable sensors, or simple threshold rules. Vision-based methods are susceptible to illumination changes, occlusion, and complex backgrounds, and have limited ability to distinguish daily actions such as bending and sitting; methods based on wearable devices depend on the elderly actively wearing them and therefore suffer from low compliance; and schemes based on a single sensing signal or fixed rules struggle to cope with the highly diverse human behavior patterns in real living environments. In addition, falls among the elderly often occur in scenarios such as kitchens and bathrooms, where activity is frequent and movement changes are drastic, and the normal range of human posture variation differs significantly across scenarios; adopting a unified decision threshold therefore easily produces false alarms or missed detections. Meanwhile, after a fall occurs, distinguishing a "true fall" from a "brief fall followed by self-recovery" also places higher demands on system reliability.
Disclosure of Invention

The embodiments of the application aim to provide a false fall filtering method and system based on multi-modal spatio-temporal feature fusion, which can improve the accuracy of fall identification results. The technical scheme of the application is realized as follows. In a first aspect, an embodiment of the present application provides a false fall filtering method based on multi-modal spatio-temporal feature fusion, the method comprising: acquiring continuous-time video stream data containing a target person; performing skeleton keypoint recognition and posture index calculation on the video stream data through a predetermined pose keypoint recognition model to determine a primary suspected fall event; acquiring multi-modal data within a preset time period based on the time corresponding to the primary suspected fall event, and performing multi-modal analysis and multi-modal fusion based on the multi-modal data to determine a secondary suspected fall event; performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine a final determination result; and triggering an alarm mechanism in the case that the final determination result indicates that the target person has truly fallen.
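The first-stage screening described above can be sketched in terms of the two claimed posture indices. The keypoint coordinate convention (image coordinates with y growing downward), the use of shoulder/hip midpoints as the two trunk keypoint groups, and the numeric thresholds are illustrative assumptions, not values taken from the patent.

```python
import math

def trunk_tilt_angle(upper_pt, lower_pt):
    """Angle in degrees between the trunk main axis and the vertical.

    upper_pt / lower_pt are assumed (x, y) image coordinates of, e.g.,
    the shoulder midpoint (upper-body group) and hip midpoint
    (lower-body group); image y is assumed to grow downward.
    """
    dx = upper_pt[0] - lower_pt[0]
    dy = upper_pt[1] - lower_pt[1]
    # For an upright trunk the upper point lies above (smaller y), so
    # atan2(dx, -dy) is 0 when vertical and 90 degrees when horizontal.
    return abs(math.degrees(math.atan2(dx, -dy)))

def head_height_variation(head_y_sequence):
    """Drop in head height over a window, in pixels.

    In image coordinates a fall moves the head keypoint downward,
    i.e. increases its y value relative to the start of the window.
    """
    return max(head_y_sequence) - head_y_sequence[0]

def primary_fall_event(upper_pt, lower_pt, head_y_sequence,
                       tilt_threshold=45.0, height_threshold=80.0):
    """Flag a primary suspected fall event when both posture indices
    exceed their preset thresholds (threshold values are illustrative)."""
    return (trunk_tilt_angle(upper_pt, lower_pt) > tilt_threshold
            and head_height_variation(head_y_sequence) > height_threshold)
```

For example, a trunk that has rotated from vertical to horizontal gives a tilt angle of 90 degrees, and a head keypoint whose y coordinate jumps from 100 to 300 pixels within the window gives a height variation of 200, so both thresholds are exceeded and a primary suspected fall event is flagged.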
In the above scheme, performing skeleton keypoint recognition and posture index calculation on the video stream data through the predetermined pose keypoint recognition model to determine the primary suspected fall event includes: performing skeleton keypoint recognition on the target person in the video stream data through the pose keypoint recognition model to obtain human skeleton keypoints and first coordinate information corresponding to the human skeleton keypoints; calculating a posture index based on the first coordinate information corresponding to the human skeleton keypoints, and determining a trunk main-axis tilt angle and a head height variation; and determining the primary suspected fall event in the case that the trunk main-axis tilt angle is greater than a preset tilt angle threshold and the head height variation is greater than a preset height threshold. In the above scheme, calculating the posture index based on the first coordinate information corresponding to the human skeleton keypoints, and determining the trunk main-axis tilt angle and the head height variation, includes: determining target coordinate information corresponding to two groups of target skeleton keypoints related to the trunk posture based on the first coordinate information corresponding to the human skeleton keypoints, wherein the two groups of target skeleton keypoints are upper-body skeleton keypoints and lower-body skeleton keypoints, respectively; calculating the trunk main-axis tilt angle based on the target coordinate information; selecting a longitudinal coordinate valu