US-12620261-B2 - System and method for reading and analysing behaviour including verbal, body language and facial expressions in order to determine a person's congruence

US 12620261 B2

Abstract

According to the invention, there is provided a data processing system for determining congruence or incongruence between the body language and the Speech of a person, comprising a self-learning machine, such as a neural network, arranged for receiving as input a dataset including approved data of a collection of analysed Speeches of persons, said approved data comprising, for each analysed Speech: a set of video sequences, comprising audio sequences and visual sequences, each audio sequence corresponding to one visual sequence; and an approved congruence indicator for each of said video sequences; said self-learning machine being trained so that the data processing system is able to deliver as output a congruence indicator.
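Purely as an illustration of this training setup, the following sketch shows one possible encoding of the dataset the abstract describes: aligned audio/visual sequence pairs, each carrying an approved congruence label, flattened into (features, label) pairs for a supervised learner. All class, field and function names are hypothetical, not taken from the patent.

```python
# Illustrative sketch only: one possible layout for the training data
# described in the abstract (all names are assumptions, not the patent's).
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class VideoSequence:
    visual_features: List[float]   # e.g. facial-expression / posture features
    audio_features: List[float]    # e.g. pitch, rhythm, volume features
    approved_congruence: int       # approved indicator: +1, 0 or -1

@dataclass
class AnalysedSpeech:
    sequences: List[VideoSequence]  # n aligned audio/visual sequence pairs

def training_pairs(dataset: List[AnalysedSpeech]) -> Iterator[Tuple[List[float], int]]:
    """Flatten approved Speeches into (features, label) pairs for training."""
    for speech in dataset:
        for seq in speech.sequences:
            yield seq.visual_features + seq.audio_features, seq.approved_congruence

# Usage: one Speech containing a single approved, congruent sequence.
speech = AnalysedSpeech([VideoSequence([0.8, 0.1], [0.7, 0.2], +1)])
for features, label in training_pairs([speech]):
    print(features, label)  # [0.8, 0.1, 0.7, 0.2] 1
```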

Inventors

  • Caroline MATTEUCCI
  • Joanna BESSERT-NETTELBECK

Assignees

  • PLENIUM AG

Dates

Publication Date
2026-05-05
Application Date
2019-12-20
Priority Date
2018-12-20

Claims (20)

  1. Method for providing indicators of congruence or incongruence between the body language and the Speech of a person, comprising the following steps: a) providing a video recording device adapted to record images of a subject including the face and at least some parts of the body, b) recording a video of the Speech of that person with said video recording device, said video being divided into n video sequences comprising n sequences of images and n corresponding audio sequences, c) for each sequence of images, detecting at least one Visual cue Vc and attributing at least one rating among positive Vc+, neutral Vc0 or negative Vc− for each Visual cue Vc, d) for each audio sequence, detecting at least one Audio cue Ac and attributing at least one rating among positive Ac+, neutral Ac0 or negative Ac− for each Audio cue Ac, e) for each video sequence, comparing the rating of said Audio cue Ac with the rating of said Visual cue Vc, and giving a congruence indicator which is a positive congruence indicator if both ratings are either positive (Vc+ and Ac+) or negative (Vc− and Ac−), a negative congruence indicator if one of the ratings is positive and the other one is negative (Vc+ and Ac−, or Vc− and Ac+), and a neutral congruence indicator if one of the ratings is neutral (Vc0 or Ac0).
  2. Method according to claim 1, wherein said Visual cue Vc is one of the following: all facial expressions or body language cues, including a visual sign of discomfort, a visual sign of comfort or a visual pacificator sign.
  3. Method according to claim 1, wherein said Audio cue Ac is one of the following: for the voice: rhythm, speed, high volume, low volume, pitch, high tone, low tone; a negative or positive emotional voice; for the verbal style: linguistics, inquiry, word count, change of verbal style; and a positive or negative sentiment expressed in the audio sequence, an audio sign of discomfort, an audio sign of comfort and an audio pacificator sign.
  4. Method according to claim 1, further comprising a reference table with the rating correspondence(s) of the Visual cue Vc and of the Audio cue Ac.
  5. Method according to claim 1, wherein it further comprises, before step b), a preliminary step b0) for baseline establishment, during which the following sub-steps are implemented: i) a reference film is shown to said person, said reference film comprising m reference film sequences, at least some of the reference film sequences being emotionally charged; ii) during the showing of the film, a reference video of the person is recorded; iii) the reference video is divided into m reference video sequences, each reference video sequence corresponding to a reference film sequence of said film; iv) for each reference video sequence, at least one Visual cue Vc of a micro expression is detected and memorised in a baseline table of said person.
  6. A method according to claim 1, wherein the Speech of the person takes place in front of another person considered as an interviewer, so that the Speech forms an interview between said person, or interviewee, and an interviewer, the method further comprising the following steps: f) providing a second video recording device adapted to record images of said interviewer including the face and at least some parts of the body, g) recording also a video of the Speech of that interviewer with said second video recording device, said video being divided into n video sequences comprising n sequences of images and n corresponding audio sequences, h) detecting at least one Visual cue Vc of the interviewer for each sequence of images and detecting at least one Audio cue Ac of the interviewer for each audio sequence, i) for each video sequence, analysing the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewee with respect to the Visual cue Vc and Audio cue Ac of the interviewer, thereby establishing a positive or negative influence indicator, wherein the influence indicator is positive when there is a detected influence of the Visual cue Vc and Audio cue Ac of the interviewer on the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewee, and wherein the influence indicator is negative when there is no detected influence of the Visual cue Vc and Audio cue Ac of the interviewer on the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewee.
  7. A method according to claim 6, wherein said influence indicator is used to provide to the interviewer a series of formulations of hypotheses in the form of affirmations and/or questions.
  8. A method according to claim 1, wherein the Speech of the person takes place in front of another person considered as an interviewer, so that the Speech forms an interview between said person, or interviewee, and an interviewer, the method further comprising the following steps: f) providing a second video recording device adapted to record images of said interviewer including the face and at least some parts of the body, g) recording also a video of the Speech of that interviewer with said second video recording device, said video being divided into n video sequences comprising n sequences of images and n corresponding audio sequences, h) detecting at least one Visual cue Vc of the interviewer for each sequence of images and detecting at least one Audio cue Ac of the interviewer for each audio sequence, i) for each video sequence, analysing the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewer with respect to the Visual cue Vc and Audio cue Ac of the interviewee, thereby establishing a positive or negative influence indicator, wherein the influence indicator is positive when there is a detected influence of the Visual cue Vc and Audio cue Ac of the interviewee on the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewer, and wherein the influence indicator is negative when there is no detected influence of the Visual cue Vc and Audio cue Ac of the interviewee on the rating of the Audio cue Ac and of the Visual cue Vc of the person forming the interviewer.
  9. System for providing indicators of congruence or incongruence between the body language and a person's Speech, comprising: a self-learning machine programmed to receive as input, on the one hand, several sets of audio sequences of a person's Speech, wherein each audio sequence corresponds to one Audio cue Ac, and, on the other hand, a set of sequences of images of said person during said Speech, wherein said images comprise the face and at least some parts of the body and wherein each sequence of images corresponds to one Visual cue Vc, said self-learning machine having been trained so that said system is able to deliver as output, after analysing a video sequence comprising one sequence of images and one corresponding audio sequence, both at least one identified Visual cue Vc based on said sequence of images and at least one identified Audio cue Ac based on said audio sequence, which form a pair or a group of identified cues (Vc+Ac) and point to a congruence or incongruence, wherein for each Audio cue Ac and each Visual cue Vc at least one rating among a positive rating, a neutral rating and a negative rating is attributed.
  10. System according to claim 9, wherein said Visual cue Vc is either a facial expression or a body language cue.
  11. System according to claim 10, wherein said system further comprises a Visual cue detector able to analyse said video sequences and to provide one or several corresponding identified Visual cues Vc.
  12. System according to claim 10, wherein said system further comprises an Audio cue detector able to analyse said audio sequences and to provide one or several corresponding identified Audio cues Ac.
  13. System according to claim 10, wherein said self-learning machine comprises a clustering or multi-output artificial neural network.
  14. System according to claim 10, wherein said self-learning machine comprises an artificial neural network with a multiplicity of layers.
  15. System according to claim 10, wherein said self-learning machine is a deep learning machine.
  16. System according to claim 10, wherein said self-learning machine will, with enough data, infer the best and most accurate cues that determine the congruence and incongruence between the Audio cues (Ac), the Visual cues (Vc) and the cues themselves.
  17. A system according to claim 9, wherein said Audio cue Ac comprises at least one of the following: voice, negative emotional voice, positive emotional voice and verbal style.
  18. System according to claim 9, wherein said self-learning machine further receives as input a reference table with the rating correspondence of each of the Visual cues Vc and of each of the Audio cues Ac, and wherein, based on said identified Visual cue Vc and on said identified Audio cue Ac of the analysed video sequence and based on said reference table, said system is further able to deliver as output both at least one Visual cue Vc rating and at least one Audio cue Ac rating, which form a pair or a group of cue ratings.
  19. System according to claim 18, wherein said system is further able, through said pair or said group of cue ratings corresponding to the analysed video sequence, to deliver as output an indicator of congruence or incongruence of the analysed video sequence.
  20. System according to claim 19, wherein said indicator of congruence or of incongruence is a positive congruence indicator for the analysed video sequence when the Visual cue Vc rating and the Audio cue Ac rating are the same, and a negative congruence indicator for the analysed video sequence when the Visual cue Vc rating and the Audio cue Ac rating are different and one of the ratings is positive and the other one is negative, or the cue in itself displays a sign of incongruence or congruence.
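As a worked illustration of the rating-comparison rule recited in claim 1 (and restated for cue ratings in claim 20), here is a minimal sketch assuming ratings are encoded as +1 (positive), 0 (neutral) and -1 (negative); the encoding is an assumption for clarity, not the patented implementation.

```python
def congruence_indicator(vc_rating: int, ac_rating: int) -> int:
    """Rating-comparison rule of claim 1 (encoding assumed: +1, 0, -1).

    Positive congruence (+1) when both ratings are positive or both negative,
    negative congruence (-1) when one is positive and the other negative,
    neutral congruence (0) when either rating is neutral.
    """
    if vc_rating == 0 or ac_rating == 0:
        return 0                     # step e): Vc0 or Ac0 -> neutral indicator
    if vc_rating == ac_rating:
        return 1                     # Vc+ and Ac+, or Vc- and Ac-
    return -1                        # Vc+ and Ac-, or Vc- and Ac+

# Usage: a positive Visual cue paired with a negative Audio cue is incongruent.
assert congruence_indicator(+1, -1) == -1
assert congruence_indicator(-1, -1) == +1
assert congruence_indicator(0, +1) == 0
```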
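Claims 6 and 8 leave open how an influence of one party's cues on the other's ratings is "detected". Purely as an assumption, one simple heuristic would be a lagged agreement score between the interviewer's and the interviewee's rating streams; the sketch below, including its lag and threshold parameters, is hypothetical and not described in the patent.

```python
from typing import Sequence

def influence_indicator(interviewer_ratings: Sequence[int],
                        interviewee_ratings: Sequence[int],
                        lag: int = 1,
                        threshold: float = 0.5) -> int:
    """Hypothetical heuristic for the influence indicator of claims 6/8.

    Pairs each interviewer cue rating with the interviewee rating `lag`
    video sequences later and counts non-neutral agreements; the lag and
    threshold are illustrative assumptions, not taken from the patent.
    Returns +1 (influence detected) or -1 (no influence detected).
    """
    pairs = list(zip(interviewer_ratings, interviewee_ratings[lag:]))
    if not pairs:
        return -1  # nothing to compare: treat as no detected influence
    matches = sum(1 for a, b in pairs if a == b and a != 0)
    return 1 if matches / len(pairs) >= threshold else -1

# Usage: the interviewee's ratings echo the interviewer's one sequence later.
print(influence_indicator([1, -1, 1, 0], [0, 1, -1, 1]))  # -> 1
```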

Description

RELATED APPLICATIONS

The present application is a national phase of PCT/IB2019/061184, filed Dec. 20, 2019, which claims the benefit of Swiss Patent Application No. CH 01571/18, filed Dec. 20, 2018, the entire disclosures of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention concerns a method and a system for providing indicators of congruence or incongruence between the body language (including all facial expressions) and the Speech of a person. This method and system are useful to provide, by reading and analysing the body language (including all facial expressions) and the features of the Speech, which could be done totally or partially automatically, the congruence of a person's behaviour or Speech in relation to the situation (comfort, part of the 6C of the congruence method: calibration, comfort, context, change, combination, consciousness).

Many situations exist where there is a need for establishing the congruence of a person's Speech. Such a tool would notably be useful and applicable to both the business and legal worlds. The fields of use are therefore defined as follows, in a non-limitative way: human resources management (recruitment, conflict management, talent integration, communications, etc.), all insurance purposes (medical consultant, insurance fraud, etc.), social services (coaches, psychologists, psychiatrists, telemedicine, etc.), all justice and/or police departments (police investigation, judges, lawyers, etc.), security services (migration, customs, airports, security agents, etc.) and all calls supported by a camera (interviews, conference calls, business calls, telemedicine, etc.).

Such an analysis is part of the personality profiling field in psychology, notably known as an investigative tool used by law enforcement agencies to identify likely suspects. This method and system take into account all the bodily and verbal cues necessary for reading and analysing behaviour, making it possible to establish the congruence or incongruence of an individual, namely his/her consistency or incoherence, as well as his/her behavioural profile.

DESCRIPTION OF RELATED ART

There exist numerous prior art references presenting systems and methods for detecting truth or deceit in the speech of a subject. For instance, in US20080260212A1, images of the subject's face are recorded, and a mathematical model of a face defined by a set of facial feature locations and textures, together with a mathematical model of facial behaviours that correlate to truth or deceit, is used. The facial feature locations are compared to the image to provide a set of matched facial feature locations, and the mathematical model of facial behaviours is compared to the matched facial feature locations in order to provide a deceit indication as a function of the comparison.

CN104537361 relates to a video-based lie-detection method. This method includes the steps of detecting visual behaviour characteristics of a detected object according to video images, detecting physiological parameter characteristics of the detected object according to the video images, and obtaining lying probability data by combining the visual behaviour characteristics with the physiological parameter characteristics.

WO2008063527 relates to procedures allowing an indication of truth or lie to be deduced, notably (a) monitoring the activation of a plurality of regions of a subject's brain while the subject answers questions and (b) measuring one or more physiological parameters while the subject answers questions, and combining the results of (a) and (b) to form a composite evaluation indicative of the truth or lie in the subject's response.

US2016354024 concerns detection of deception and prediction of interviewer accuracy. Physiological information of the interviewer during the interview is recorded by at least a first sensor, including a time series of physiological data. By processing the recorded physiological information, a computer calculates an interview assessment indicating at least one of whether a statement made by the interviewee is likely to be deceitful and whether the interviewer is likely to be accurate in estimating the truthfulness of the interviewee.

WO2008063155 relates to deception detection via functional near-infrared spectroscopy. More precisely, functional near-infrared (fNIR) neuroimaging is used to detect deception. Oxygenation levels