CN-115438725-B - State detection method, device, equipment and storage medium
Abstract
The application discloses a state detection method, apparatus, device, and storage medium. The state detection method comprises: acquiring video and audio of a target object; extracting at least a first facial feature of the target object from the video, extracting a first voice feature of the target object from the audio, and obtaining, based on the audio, a semantic integrity feature of text information corresponding to the audio; and obtaining a state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature, wherein the state detection result is used for determining whether the target object has a preset state. In this way, the accuracy of state detection for the target object can be improved.
Inventors
- YANG PENG
- KONG CHANGQING
- WAN GENSHUN
- PAN JIA
- LIU CONG
- HU GUOPING
- LIU QINGFENG
Assignees
- 科大讯飞股份有限公司
Dates
- Publication Date
- 20260421
- Application Date
- 20220823
- Priority Date
- 20220823
Claims (9)
- 1. A state detection method, the method comprising: acquiring video and audio related to a target object, wherein the video comprises multiple frames of facial images of the target object, and the audio comprises multiple voice frames of the target object; extracting at least a first facial feature of the target object from the video, extracting a first voice feature of the target object from the audio, and obtaining, based on the audio, a semantic integrity feature of text information corresponding to the audio, wherein the first facial feature is obtained by using second facial features respectively extracted from at least one frame of the facial images, the first voice feature is obtained by using second voice features extracted from at least one of the voice frames, the semantic integrity feature is obtained by processing the text information, the text information is obtained by performing voice recognition on a voice fusion feature of the audio, the voice fusion feature is obtained by fusing the second voice features corresponding to each voice frame, and the semantic integrity feature is used for quantifying the integrity of the semantic expression of the text information and for representing the semantic expression state of the target object; and obtaining a state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature, wherein the state detection result is used for determining whether the target object has a preset state caused by nervous system degeneration.
- 2. The method of claim 1, wherein extracting the second facial features from at least one frame of the facial images, respectively, comprises: for each frame of the facial images, extracting a plurality of face key points from the facial image; obtaining a spatial relation value between a first line segment and each second line segment, respectively, wherein the first line segment is formed by a line connecting at least two of the face key points, and each second line segment is formed by a line connecting the face key point at one end of the first line segment and a respective one of the face key points; and ordering the spatial relation values corresponding to the facial image according to a preset order to obtain the second facial feature of the facial image.
- 3. The method of claim 1, wherein the obtaining a state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature comprises: fusing the first facial feature and the first voice feature to obtain a first fusion feature; fusing the first fusion feature and the semantic integrity feature to obtain a second fusion feature; and performing state detection on the second fusion feature to obtain the state detection result about the target object.
- 4. The method of claim 3, wherein the first facial feature comprises the second facial feature of each frame of the facial images, the first voice feature comprises the second voice feature of each of the voice frames, and the fusing the first facial feature and the first voice feature to obtain a first fusion feature comprises: fusing the second facial features to obtain a face fusion feature, and fusing the second voice features to obtain a voice fusion feature, wherein the face fusion feature is the first facial feature, and the voice fusion feature is the first voice feature; fusing the face fusion feature and the voice fusion feature to obtain a third fusion feature; and processing the third fusion feature by using a processing model to obtain the first fusion feature; and/or, the performing state detection on the second fusion feature to obtain the state detection result about the target object comprises: processing the second fusion feature by using a classification model to obtain the state detection result about the target object.
- 5. The method of claim 1, wherein there are multiple sets of video and audio related to the target object, and the state detection result includes a probability that the target object has the preset state; after obtaining the state detection result corresponding to each set of the video and the audio, the method further comprises: obtaining a final state detection result about the target object based on the probabilities in the state detection results corresponding to the sets of video and audio.
- 6. The method of claim 1, wherein the first voice feature is obtained using a speech recognition toolkit, and the first facial feature is obtained using a feature extraction tool in an open-source face recognition library.
- 7. A state detection apparatus, the apparatus comprising: an acquisition module, configured to acquire video and audio of a target object, wherein the video comprises multiple frames of facial images of the target object, and the audio comprises multiple voice frames of the target object; a feature extraction module, configured to extract at least a first facial feature of the target object from the video, extract a first voice feature of the target object from the audio, and obtain, based on the audio, a semantic integrity feature of text information corresponding to the audio, wherein the first facial feature is obtained by using second facial features respectively extracted from at least one frame of the facial images, the first voice feature is obtained by using second voice features extracted from at least one of the voice frames, the semantic integrity feature is obtained by processing the text information, the text information is obtained by performing voice recognition on a voice fusion feature of the audio, the voice fusion feature is obtained by fusing the second voice features corresponding to each voice frame, and the semantic integrity feature is used for quantifying the semantic integrity of the text information and for representing the semantic expression state of the target object; and a state detection module, configured to obtain a state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature, wherein the state detection result is used for determining whether the target object has a preset state caused by nervous system degeneration.
- 8. An electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions stored in the memory to implement the method of any one of claims 1-6.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium is for storing program instructions, the program instructions being executable to implement the method of any one of claims 1-6.
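The two-stage fusion of claims 3 and 4 and the multi-set aggregation of claim 5 can be sketched as follows. This is only an illustration under stated assumptions: the claims do not name the fusion operator, the processing model, the classification model, or the aggregation rule, so the sketch assumes mean pooling over frames, concatenation across modalities, caller-supplied callables for the two models, and averaging with a 0.5 threshold. All function names are hypothetical.

```python
from math import exp
from statistics import fmean


def mean_pool(frames):
    """Fuse per-frame feature vectors into one vector (assumed: element-wise mean)."""
    return [fmean(column) for column in zip(*frames)]


def concat(*vectors):
    """Fuse several feature vectors (assumed: plain concatenation)."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def linear_sigmoid(features, weights, bias):
    """Toy stand-in for a classification model: logistic regression."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + exp(-z))


def detect_state(face_frames, voice_frames, semantic_integrity,
                 processing_model, classification_model):
    """Return the probability that the preset state is present for one video/audio set."""
    face_fusion = mean_pool(face_frames)      # second facial features -> face fusion feature
    voice_fusion = mean_pool(voice_frames)    # second voice features -> voice fusion feature
    third_fusion = concat(face_fusion, voice_fusion)
    first_fusion = processing_model(third_fusion)
    second_fusion = concat(first_fusion, [semantic_integrity])
    return classification_model(second_fusion)


def final_result(probabilities, threshold=0.5):
    """Combine per-set probabilities (assumed: mean, then threshold)."""
    probability = fmean(probabilities)
    return probability, probability >= threshold


if __name__ == "__main__":
    face = [[1.0, 3.0], [3.0, 5.0]]           # two frames, two values per frame
    voice = [[0.0], [2.0]]                    # two voice frames, one value per frame
    identity = lambda v: v                    # placeholder processing model
    classifier = lambda v: linear_sigmoid(v, [0.1] * len(v), -1.0)
    per_set = [detect_state(face, voice, 0.5, identity, classifier) for _ in range(3)]
    print(final_result(per_set))
```

In practice the placeholder callables would be replaced by trained models; the sketch only fixes the order in which the features are combined.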
Description
State detection method, device, equipment and storage medium
Technical Field
The present application relates to the field of intelligent detection technologies, and in particular, to a state detection method, apparatus, device, and storage medium.
Background
In daily life, a target object (for example, a person) exhibits various states, and these states are generally used to describe the condition of the target object. For various reasons, however, a target object may be in a certain state without being aware of it. Take the state exhibited by Parkinson's patients as an example: many target objects already exhibit this state, but because of insufficient awareness, or because the state is not yet pronounced, the state remains unknown or uncertain to them, which delays effective countermeasures. Obtaining a corresponding state detection result by detecting the state is therefore of great significance.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a state detection method, apparatus, device, and storage medium that can improve the accuracy of state detection for a target object. To solve this technical problem, the application provides a state detection method, which comprises: acquiring video and audio of a target object; extracting at least a first facial feature of the target object from the video, extracting a first voice feature of the target object from the audio, and obtaining, based on the audio, a semantic integrity feature of text information corresponding to the audio; and obtaining a state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature, wherein the state detection result is used for determining whether the target object has a preset state.
The video comprises multiple frames of facial images of the target object, and extracting the first facial feature of the target object from the video comprises: extracting second facial features from at least one frame of the facial images, respectively, to obtain the first facial feature. And/or, the audio comprises multiple voice frames of the target object, and extracting the first voice feature of the target object from the audio comprises: extracting second voice features of at least one of the voice frames to obtain the first voice feature. And/or, obtaining the semantic integrity feature based on the audio comprises: extracting the second voice feature corresponding to each voice frame, fusing the second voice features to obtain a voice fusion feature of the audio, performing voice recognition on the voice fusion feature to obtain the text information corresponding to the audio, and processing the text information to obtain the semantic integrity feature.
Extracting the second facial features from at least one frame of the facial images, respectively, comprises: for each frame of the facial images, extracting a plurality of face key points from the facial image; obtaining a spatial relation value between a first line segment and each second line segment, respectively, wherein the first line segment is formed by a line connecting at least two of the face key points, and each second line segment is formed by a line connecting the face key point at one end of the first line segment and a respective one of the face key points; and ordering the spatial relation values corresponding to the facial image according to a preset order to obtain the second facial feature of the facial image.
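As an illustration of the key-point step, the sketch below computes one possible spatial relation value per second line segment. The text here does not define the spatial relation value, so the signed angle between the first segment and each second segment is an assumption (a length ratio would be another candidate), as are the 2D coordinates, the choice of key points 0 and 1 as the first segment, and key-point index order as the preset order. The function name is hypothetical.

```python
import math


def spatial_relation_features(keypoints, first_segment=(0, 1)):
    """Build a per-frame facial feature from 2D face key points.

    keypoints: list of (x, y) tuples for one facial image.
    first_segment: indices (anchor, other) of the key points forming the first line segment.
    Returns the signed angles (radians) between the first segment and the segment from
    the anchor to every other key point, in key-point index order.
    """
    anchor_idx, other_idx = first_segment
    ax, ay = keypoints[anchor_idx]
    bx, by = keypoints[other_idx]
    ref_x, ref_y = bx - ax, by - ay
    if ref_x == 0 and ref_y == 0:
        raise ValueError("first line segment has zero length")

    features = []
    for idx, (px, py) in enumerate(keypoints):
        if idx == anchor_idx:
            continue  # the segment from the anchor to itself is degenerate
        vx, vy = px - ax, py - ay
        cross = ref_x * vy - ref_y * vx   # proportional to sin(angle)
        dot = ref_x * vx + ref_y * vy     # proportional to cos(angle)
        features.append(math.atan2(cross, dot))
    return features


if __name__ == "__main__":
    # Toy key points; real ones would come from a face landmark detector.
    frame = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, -1.0)]
    print(spatial_relation_features(frame))
```

Angles relative to a reference segment do not change when the face is translated or uniformly scaled in the image, which is one reason such a relation could serve as a per-frame feature.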
Obtaining the state detection result about the target object based at least on the first facial feature, the first voice feature, and the semantic integrity feature comprises: fusing the first facial feature and the first voice feature to obtain a first fusion feature; fusing the first fusion feature and the semantic integrity feature to obtain a second fusion feature; and performing state detection on the second fusion feature to obtain the state detection result about the target object. The video comprises multiple frames of facial images of the target object, and the first facial feature comprises the second facial feature of each frame of facial image; the audio comprises multiple voice frames of the target object, and the first voice feature comprises the second voice feature of each voice frame. Fusing the first facial feature and the first voice feature to obtain the first fusion feature comprises: fusing the second facial features to obtain a face fusion feature, and fusing the second voice features to obtain a voice fusion feature, wherein the face fusion feature is the first facial feature and the voice fusion feature is the first voice feature; fusing the face fusion feature and the voice fusion feature to obtain a third fusion feature; and processing the third fusion feature by using a processing model to obtain a first fuse