CN-121998600-A - Interview evaluation method and interview evaluation system based on multi-mode information
Abstract
The invention discloses an interview evaluation method and system based on multimodal information, in the technical field of artificial intelligence. The method comprises the following steps: starting an interview client, which synchronously collects a video stream and an audio stream of a candidate and generates a corresponding interactive text stream in real time; forming multimodal interview data from the interactive text stream, video stream, and audio stream corresponding to the same timestamp; and locally caching and preprocessing the interactive text stream, video stream, and audio stream in the multimodal interview data to obtain multimodal time-series data. Through the arrangement of an adaptive buffer and content priorities, the invention intelligently balances the trade-off between delay and packet loss while ensuring the alignment and integrity of the multimodal time-series data, provides high-quality, high-reliability input for a deep neural network model, and ultimately achieves accurate and stable remote automatic interview evaluation.
Inventors
- Xiong Quanlang
- Yang Qiong
- Huang Wucheng
- Liu Jingjing
Assignees
- 武汉梦软科技有限公司 (Wuhan Mengruan Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-17
Claims (8)
- 1. An interview evaluation method based on multimodal information, characterized by comprising the following steps: starting an interview client, synchronously collecting a video stream and an audio stream of a candidate, and generating a corresponding interactive text stream in real time, wherein the interactive text stream, the video stream, and the audio stream corresponding to the same timestamp form multimodal interview data; locally caching and preprocessing the interactive text stream, the video stream, and the audio stream in the multimodal interview data to obtain multimodal time-series data; transmitting the multimodal time-series data to a server, the server aligning and synchronizing the received multimodal time-series data and compensating and correcting the multimodal time-series data; inputting the compensated and corrected multimodal time-series data into a trained deep neural network model to obtain a comprehensive feature vector of the candidate; inputting the comprehensive feature vector into a capability model of a target post to obtain predicted values of a plurality of evaluation indexes; and generating a multi-dimensional evaluation report containing the evaluation indexes, key multimodal time-series data segments, and an interpretability analysis.
- 2. The interview evaluation method based on multimodal information according to claim 1, wherein starting the interview client, the interview client synchronously collecting the video stream and the audio stream of the candidate and generating the corresponding interactive text stream in real time, comprises: the interview client synchronously collecting the video stream and the audio stream of the candidate, and performing real-time speech recognition on the audio stream by utilizing a speech recognition engine integrated in the interview client to generate the interactive text stream corresponding to the audio stream, wherein, when collection starts, starting timestamps based on the same clock source are assigned to the video stream, the audio stream, and the interactive text stream.
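The shared-clock timestamping described in claim 2 can be sketched minimally as follows; the stream names, session structure, and function name are illustrative assumptions, not taken from the patent:

```python
import time

def start_capture_session():
    """Assign a starting timestamp from a single shared clock source to all
    three streams, so that later server-side alignment can place every
    modality on one common time axis."""
    t0 = time.monotonic()  # one clock source for every modality
    return {stream: {"start_ts": t0, "frames": []}
            for stream in ("video", "audio", "text")}

session = start_capture_session()
# All three streams share the identical starting timestamp.
assert (session["video"]["start_ts"]
        == session["audio"]["start_ts"]
        == session["text"]["start_ts"])
```

Using one monotonic clock (rather than per-stream wall clocks) is what makes the later timestamp-based alignment well defined.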
- 3. The interview evaluation method based on multimodal information according to claim 1, wherein locally caching and preprocessing the interactive text stream, the video stream, and the audio stream in the multimodal interview data to obtain the multimodal time-series data comprises: dynamically monitoring the current network state, and adjusting the coding resolution, frame rate, and coding bitrate of the video stream and the audio stream according to the network state; performing face detection and tracking on the video stream, and extracting a face-region image sequence and head-posture parameters as primary visual features; calculating time-series statistics of volume, speech rate, and fundamental frequency on the audio stream as primary audio features; and directly taking the interactive text stream as primary text features, wherein the primary visual features, the primary audio features, and the primary text features are all time series, and together form the multimodal time-series data.
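One of the "time-series statistics" named in claim 3, a windowed volume series over raw audio samples, can be sketched as below. The RMS choice, the window size, and the float sample format are assumptions for illustration; the patent does not fix them:

```python
import math

def volume_series(samples, window=4):
    """Compute a windowed RMS volume time series from raw audio samples.
    Each output value summarizes `window` consecutive samples, yielding a
    primary audio feature that is itself a time series."""
    series = []
    for i in range(0, len(samples) - window + 1, window):
        frame = samples[i:i + window]
        series.append(math.sqrt(sum(s * s for s in frame) / window))
    return series

# A quiet frame followed by a louder frame yields a rising volume series.
vols = volume_series([0.1, -0.1, 0.1, -0.1, 0.5, -0.5, 0.5, -0.5])
assert vols[1] > vols[0]
```

Speech rate and fundamental frequency would be computed per window in the same fashion, so that all primary audio features share one frame index.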
- 4. The interview evaluation method based on multimodal information according to claim 1, wherein transmitting the multimodal time-series data to the server, the server aligning and synchronizing the received multimodal time-series data and compensating and correcting the multimodal time-series data, comprises: transmitting the multimodal time-series data; setting an adaptive buffer at the server, and aligning data of different modalities to the same time axis according to the timestamps; and repairing the audio stream by adopting forward error correction and a packet-loss concealment algorithm, and repairing the video stream by adopting a temporal error concealment algorithm.
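The timestamp-based alignment step of claim 4 can be sketched as grouping packets from the different modalities around anchor timestamps. This is a simplified illustration under stated assumptions: the field names, the use of video as the anchor, and the tolerance value are not from the patent:

```python
def align_streams(streams, tolerance=0.02):
    """Merge packets from different modalities onto one time axis: for each
    video packet, audio/text packets whose timestamps fall within
    `tolerance` seconds are joined into a single multimodal record.
    A missing modality is recorded as None, marking it for the repair
    (concealment) step."""
    aligned = []
    for ts, frame in streams["video"]:
        record = {"ts": ts, "video": frame}
        for mod in ("audio", "text"):
            match = [p for t, p in streams[mod] if abs(t - ts) <= tolerance]
            record[mod] = match[0] if match else None
        aligned.append(record)
    return aligned

streams = {"video": [(0.00, "v0"), (0.04, "v1")],
           "audio": [(0.01, "a0"), (0.05, "a1")],
           "text":  [(0.00, "t0")]}
records = align_streams(streams)
# The second record has no text within tolerance, so it needs concealment.
assert records[1]["text"] is None
```

In a real pipeline the `None` slots are exactly where forward error correction and concealment algorithms would reconstruct data before model input.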
- 5. The interview evaluation method based on multimodal information according to claim 4, wherein setting the adaptive buffer comprises: continuously monitoring real-time network performance indexes throughout the interview, the real-time network performance indexes comprising instantaneous one-way delay, instantaneous jitter, and instantaneous packet-loss rate; marking the transmitted multimodal time-series data at the client, and assigning each item of multimodal time-series data a content priority label, the content priority label being one of high priority, standard priority, and low priority; dynamically adjusting the depth of the adaptive buffer according to the real-time network performance indexes and the priority distribution of the current multimodal time-series data, the dynamic adjustment comprising a first adjustment rule, a second adjustment rule, and a third adjustment rule; recording the alignment success rate and the newly added delay after each adjustment of the adaptive buffer depth; and feeding the alignment success rate and the newly added processing delay as feedback signals into a control algorithm that dynamically adjusts, online, the proportionality coefficients and thresholds involved in the first and second adjustment rules, thereby achieving adaptive optimization based on the current network environment and data-stream characteristics.
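Claim 5 names three adjustment rules but does not spell them out, so the sketch below is one plausible reading under explicit assumptions: the rule bodies, the coefficient `k_jitter`, the loss threshold, and the per-priority depth caps are all illustrative, not from the patent:

```python
def adjust_buffer_depth(depth_ms, jitter_ms, loss_rate, priority,
                        k_jitter=2.0, loss_threshold=0.05):
    """Return a new adaptive-buffer depth (ms). Deepen the buffer when
    jitter or packet loss is high (favoring completeness), but cap the
    depth for high-priority, latency-sensitive data (favoring low delay).
    """
    # Rule 1 (assumed): scale depth with instantaneous jitter.
    target = depth_ms + k_jitter * jitter_ms
    # Rule 2 (assumed): add headroom when packet loss exceeds a threshold.
    if loss_rate > loss_threshold:
        target *= 1.5
    # Rule 3 (assumed): clamp by content priority label.
    caps = {"high": 200, "standard": 500, "low": 1000}
    return min(target, caps[priority])

# Stable network, high-priority stream: depth stays small.
assert adjust_buffer_depth(100, 5, 0.0, "high") == 110
# Lossy network, low-priority stream: depth grows for completeness.
assert adjust_buffer_depth(100, 40, 0.1, "low") == 270.0
```

The claim's feedback loop would then tune `k_jitter` and `loss_threshold` online from the recorded alignment success rate and added delay.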
- 6. The interview evaluation method based on multimodal information according to claim 1, wherein inputting the compensated and corrected multimodal time-series data into the trained deep neural network model to obtain the comprehensive feature vector of the candidate comprises: collecting and preprocessing historical multimodal interview data and corresponding historical comprehensive feature vectors, and training a deep neural network model with the preprocessed historical multimodal interview data and the corresponding historical comprehensive feature vectors to obtain the trained deep neural network model; and inputting the compensated and corrected multimodal time-series data into the trained deep neural network model, the trained deep neural network model outputting the comprehensive feature vector of the candidate.
- 7. The method according to claim 1, wherein inputting the comprehensive feature vector into the capability model of the target post to obtain the predicted values of the plurality of evaluation indexes, and generating the multi-dimensional evaluation report containing the evaluation indexes, the key multimodal time-series data segments, and the interpretability analysis, comprises: inputting the comprehensive feature vector into a configurable capability model corresponding to the target post, and calculating the predicted values of the plurality of evaluation indexes through the mapping relations defined by the capability model; locating, based on the internal mechanism of the deep neural network that generates the comprehensive feature vector and the parameters of the capability model, the key multimodal time-series data segments that influence the scores; and performing association analysis on the predicted values of the plurality of evaluation indexes and the key multimodal time-series data segments to generate the multi-dimensional evaluation report.
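The configurable capability model of claim 7 maps the comprehensive feature vector to evaluation-index predictions. A minimal sketch, assuming the mapping is a set of linear combinations; the weights, index names, and the "developer" post below are hypothetical:

```python
def score_indexes(feature_vec, capability_model):
    """Apply a per-post capability model: each evaluation index is a
    weighted combination of the comprehensive feature vector, so the same
    features yield different scores under different post configurations."""
    return {index: sum(w * f for w, f in zip(weights, feature_vec))
            for index, weights in capability_model.items()}

# A hypothetical capability model for a "developer" post.
model = {"communication":   [0.7, 0.3, 0.0],
         "technical_depth": [0.0, 0.2, 0.8]}
scores = score_indexes([0.9, 0.5, 0.6], model)
assert abs(scores["communication"] - 0.78) < 1e-9
```

The per-index weights are also what an interpretability step could inspect to locate which feature dimensions, and hence which data segments, drove each score.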
- 8. An interview evaluation system based on multimodal information, for implementing the interview evaluation method based on multimodal information according to any one of claims 1-7, comprising: an acquisition module, configured to start an interview client, synchronously collect a video stream and an audio stream of a candidate, and generate a corresponding interactive text stream in real time, wherein the interactive text stream, the video stream, and the audio stream corresponding to the same timestamp form multimodal interview data; a preprocessing module, configured to locally cache and preprocess the interactive text stream, the video stream, and the audio stream in the multimodal interview data to obtain multimodal time-series data; a transmission module, configured to transmit the multimodal time-series data to a server, the server aligning and synchronizing the received multimodal time-series data and compensating and correcting it; a model processing module, configured to input the compensated and corrected multimodal time-series data into a trained deep neural network model to obtain a comprehensive feature vector of the candidate; and an evaluation module, configured to input the comprehensive feature vector into a capability model of a target post to obtain predicted values of a plurality of evaluation indexes, and to generate a multi-dimensional evaluation report containing the evaluation indexes, key multimodal time-series data segments, and an interpretability analysis.
Description
Interview Evaluation Method and Interview Evaluation System Based on Multimodal Information

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to an interview evaluation method and system based on multimodal information.

Background

With the development of artificial intelligence technology, automatic interview evaluation systems based on multimodal information (such as video, audio, and text) have become an important tool for talent screening and evaluation. Such a system builds an algorithmic model that predicts post competence by analyzing multi-dimensional data, such as the candidate's facial expressions, voice intonation, and language content during the interview, with the aim of improving recruitment efficiency. However, in remote network interview scenarios it faces a long-standing core technical bottleneck that severely restricts evaluation effectiveness: the instability of network transmission, in particular network delay, jitter, and packet loss, causes serious timing misalignment and quality damage when the multimodal data streams are transmitted to the server. It is therefore desirable to provide an interview evaluation method and system based on multimodal information that solves the above problems.

Disclosure of Invention

The invention aims to provide an interview evaluation method and system based on multimodal information to remedy the deficiencies described in the background above.
To achieve the above purpose, the invention provides an interview evaluation method based on multimodal information, comprising the following steps: starting an interview client, synchronously collecting a video stream and an audio stream of a candidate, and generating a corresponding interactive text stream in real time, wherein the interactive text stream, the video stream, and the audio stream corresponding to the same timestamp form multimodal interview data; locally caching and preprocessing the interactive text stream, the video stream, and the audio stream in the multimodal interview data to obtain multimodal time-series data; transmitting the multimodal time-series data to a server, the server aligning and synchronizing the received multimodal time-series data and compensating and correcting it; inputting the compensated and corrected multimodal time-series data into a trained deep neural network model to obtain a comprehensive feature vector of the candidate; and generating a multi-dimensional evaluation report containing the evaluation indexes, key multimodal time-series data segments, and an interpretability analysis.

In a preferred embodiment, starting the interview client, the interview client synchronously collecting the video stream and the audio stream of the candidate and generating the corresponding interactive text stream in real time, comprises: the interview client synchronously collecting the video stream and the audio stream of the candidate, and performing real-time speech recognition on the audio stream by utilizing a speech recognition engine integrated in the interview client to generate the interactive text stream corresponding to the audio stream, wherein, when collection starts, starting timestamps based on the same clock source are assigned to the video stream, the audio stream, and the interactive text stream.
In a preferred embodiment, locally caching and preprocessing the interactive text stream, the video stream, and the audio stream in the multimodal interview data to obtain the multimodal time-series data comprises: dynamically monitoring the current network state, and adjusting the coding resolution, frame rate, and coding bitrate of the video stream and the audio stream according to the network state; performing face detection and tracking on the video stream, and extracting a face-region image sequence and head-posture parameters as primary visual features; calculating time-series statistics of volume, speech rate, and fundamental frequency on the audio stream as primary audio features; and directly taking the interactive text stream as primary text features, wherein the primary visual features, the primary audio features, and the primary text features are all time series and together form the multimodal time-series data.

In a preferred embodiment, transmitting the multimodal time-series data to the server, the server aligning and synchronizing the received multimodal time-series data and compensating and correcting it, comprises: transmitting the multimodal time-series data; setting an adaptive buffer at the server, and aligning data of different modalities to the same time axis according to the timestamps.