CN-121482755-B - Method and system for realizing man-machine collaborative scene guidance based on multi-source vision
Abstract
The application discloses a method and a system for realizing human-machine collaborative scene guidance based on multi-source vision. A scene representation is first formed by time-aligning and coordinate-unifying forward-view, surround-view and depth data and extracting passable areas and obstacle distribution from them. Only when a fork is detected are three discrete guidance elements (left, straight ahead and right) overlaid on the forward-view picture for manual selection, and the selected result is output to an external system as guidance parameters for prompting or assisted decision-making. One-to-one correspondence between guidance information and the real scene is thereby achieved, interference at non-critical moments is reduced, the selection space is compressed, cognitive load is lowered, and selection speed and safety are improved; at the same time, the parameterized output decouples the method from downstream systems, facilitating secondary safety checks and cross-system adaptation.
Inventors
- WANG FEI
- ZHANG SIHAO
Assignees
- 弈芯科技(杭州)有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260106
Claims (12)
- 1. A method for implementing human-machine collaborative scene guidance based on multi-source vision, comprising: acquiring a forward-view video stream, a surround-view video stream and depth data, and time-aligning each data path to form a synchronous data packet; performing geometric calibration and coordinate unification based on the synchronous data packet, and extracting passable areas, obstacle distribution and dynamic risk from it to form a scene representation; detecting a fork event based on the formed scene representation, wherein a fork event is determined to occur when at least two mutually separated feasible channels exist in the scene and simultaneously meet the following conditions: a geometric condition, namely that the width of each feasible channel is larger than a preset width threshold and the included angle between the channels is larger than a preset angle threshold; and, on the premise that the geometric condition is met, a dynamic-risk condition, namely that the dynamic occupancy probability of each feasible channel is smaller than a preset probability threshold or its minimum time to collision is larger than a preset time threshold; when a feasible channel meets the geometric condition but not the dynamic-risk constraint, marking the corresponding position as a fork limited by dynamic risk and only prompting the low-risk direction or outputting risk-prompt information; only when a fork event is detected, overlaying three guidance elements on the forward-view video, the three guidance elements corresponding to the left, straight-ahead and right directions respectively; calculating a safety coefficient for each candidate local path and marking its grade; based on an intention-aware human-machine collaborative decision mechanism, combining the guidance selection of an external system with the operator's intention over the AI candidate paths to make a joint decision, comprising: overlaying indication information for each candidate local path on the display interface of the forward-view video, the indication information comprising visual identifiers representing the path direction and its safety coefficient; and receiving the operator's selection of one of the three guidance elements, and outputting the candidate local path corresponding to the selected guidance element to an external system in the form of guidance parameters, the guidance parameters comprising a direction, a path-segment identifier, an expected speed range and a validity time limit, and being used for prompting or assisting the external system to perform a secondary safety check before execution.
- 2. The method of claim 1, further comprising: removing noise from the synchronous data packet using an improved bilateral filtering algorithm while preserving edge details; and, for outliers in the depth data, rejecting abnormal points using a combination of statistical filtering and radius filtering.
- 3. The method of claim 1, further comprising extracting dynamic risk, comprising: acquiring, through multi-modal sensing fusion and dynamic-obstacle prediction, dynamic obstacle targets that may move, cross or approach within the scene on a short time scale; and computing the correlation between each obstacle target's future trajectory and the candidate passing paths among the local path candidates, and generating, based on that correlation, the dynamic-risk level of the direction corresponding to each candidate passing path.
- 4. The method of claim 3, wherein acquiring the obstacle targets comprises: adopting a Transformer-based multi-modal fusion network, inputting features of the forward-view video stream, the surround-view video stream and the depth data into the network, realizing cross-source feature interaction through a cross-attention mechanism, and outputting a unified fused feature map; and performing dynamic-obstacle detection and tracking on the fused feature map to obtain the obstacle targets; and wherein computing the future trajectory of an obstacle target comprises: predicting the obstacle target's future movement trend with a time-series prediction model to obtain its future trajectory.
- 5. The method of any of claims 1-4, further comprising, upon reaching the next fork at which a fork event occurs, returning to and re-executing the step of overlaying three guidance elements on the forward-view video.
- 6. The method of any of claims 1-4, wherein generating a corresponding candidate local path for each guidance element according to the passable area and obstacle distribution comprises: modeling the passable region and executing a global path search; and locally optimizing path segments whose curvature exceeds a preset safety threshold in the global path, taking maximal path smoothness as the optimization target and taking non-entry into risk areas and satisfaction of the chassis's minimum turning radius as constraints, to obtain three optimized candidate local paths in different directions.
- 7. A computer-readable storage medium storing computer-executable instructions which, when executed, perform the method for realizing human-machine collaborative scene guidance based on multi-source vision of any one of claims 1-6.
- 8. A system for realizing human-machine collaborative scene guidance based on multi-source vision, comprising a data acquisition module, a geometric calibration and coordinate unification module, a scene representation construction module, a fork event detection module, a guidance overlay module, a human-machine interaction and selection module, and a guidance parameter output module, wherein: the data acquisition module is used to acquire the forward-view video stream, the surround-view video stream and the depth data, and to align each data path to the same reference timestamp to form a synchronous data packet; the geometric calibration and coordinate unification module is used to perform geometric calibration and coordinate unification based on the synchronous data packet; the scene representation construction module is used to extract passable areas, obstacle distribution and dynamic risk from the coordinate-unified data to form a scene representation; the fork event detection module is used to detect fork events based on the formed scene representation, determining that a fork event occurs when at least two mutually separated feasible channels exist in the scene and simultaneously meet the following conditions: the width of each feasible channel is larger than a preset width threshold and the included angle between the channels is larger than a preset angle threshold; the feasible channels meet the safety clearance requirement; and, on the premise that the geometric conditions are met, the dynamic occupancy probability of each feasible channel is smaller than a preset probability threshold or its minimum time to collision is larger than a preset time threshold; when a feasible channel meets the geometric conditions but not the dynamic-risk constraint, the corresponding position is marked as a fork limited by dynamic risk, and only the low-risk direction is prompted or risk-prompt information is output; the guidance overlay module is used to overlay three guidance elements on the forward-view video only when a fork event is detected, the three guidance elements corresponding to the left, straight-ahead and right directions respectively, and to generate a corresponding candidate local path for each guidance element according to the passable area and the obstacle distribution; the human-machine interaction and selection module is used to calculate a safety coefficient for each candidate local path and mark its grade, and, based on an intention-aware human-machine collaborative decision mechanism, to combine the guidance selection of an external system with the operator's intention over the AI candidate paths to make a joint decision, comprising: overlaying indication information for each candidate local path on the display interface of the forward-view video, the indication information comprising visual identifiers representing the path direction and its safety coefficient; adjusting the visual weight of each candidate local path's indication information according to the operator's current intention probability so that the safety coefficient is reflected by color; outputting the path parameters and predicted obstacle trajectories of each candidate local path to assist the operator's judgment; presenting the forward-view video with the three overlaid guidance elements on a display; and receiving the operator's selection of one guidance element; and the guidance parameter output module is used to output, according to the selected guidance element, the corresponding candidate local path to an external system in the form of guidance parameters, the guidance parameters comprising a direction, a path-segment identifier, an expected speed range and a validity time limit, and being used for prompting or assisting the external system to perform a secondary safety check before execution.
- 9. The system of claim 8, wherein the data acquisition module is further configured to remove noise from the synchronous data packet using an improved bilateral filtering algorithm while preserving edge details, and to reject outliers in the depth data using a combination of statistical filtering and radius filtering.
- 10. The system of claim 8, wherein generating the corresponding candidate local paths for the guidance elements according to the passable area and the obstacle distribution in the guidance overlay module comprises: modeling the passable area and executing a global path search; and locally optimizing path segments whose curvature exceeds a preset safety threshold in the global path, taking maximal path smoothness as the optimization target and taking non-entry into risk areas and satisfaction of the chassis's minimum turning radius as constraints, to obtain three optimized candidate local paths in different directions.
- 11. The system of claim 8, wherein the data acquisition module comprises a forward-view camera, a surround-view camera, and a depth/distance sensor.
- 12. The system of claim 8, wherein the scene representation construction module is further configured to extract dynamic risk, the scene representation being a ternary scene representation.
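As an illustrative sketch of the fork-detection conditions recited in claim 1, the geometric check (channel width and inter-channel angle) and the dynamic-risk check (occupancy probability or time to collision) could be combined as below. All threshold values and the `Channel` structure are hypothetical placeholders; the claims only state that preset thresholds exist, without giving numbers.

```python
from dataclasses import dataclass

# Hypothetical preset thresholds (the patent does not specify values).
MIN_WIDTH_M = 1.2       # preset width threshold
MIN_ANGLE_DEG = 25.0    # preset inter-channel angle threshold
MAX_OCCUPANCY_P = 0.3   # preset dynamic-occupancy probability threshold
MIN_TTC_S = 5.0         # preset minimum time-to-collision threshold

@dataclass
class Channel:
    width_m: float       # measured channel width
    heading_deg: float   # channel direction in the unified coordinate frame
    occupancy_p: float   # predicted dynamic occupancy probability
    ttc_s: float         # minimum time to collision along the channel

def geometric_ok(a: Channel, b: Channel) -> bool:
    """Geometric condition: both channels wide enough and angularly separated."""
    angle = abs(a.heading_deg - b.heading_deg)
    angle = min(angle, 360.0 - angle)
    return (a.width_m > MIN_WIDTH_M and b.width_m > MIN_WIDTH_M
            and angle > MIN_ANGLE_DEG)

def dynamic_ok(c: Channel) -> bool:
    """Dynamic-risk constraint: low occupancy probability OR large enough TTC."""
    return c.occupancy_p < MAX_OCCUPANCY_P or c.ttc_s > MIN_TTC_S

def classify_fork(a: Channel, b: Channel) -> str:
    """Apply the claimed rules to two mutually separated feasible channels."""
    if not geometric_ok(a, b):
        return "no fork"
    if dynamic_ok(a) and dynamic_ok(b):
        return "fork"                # overlay the three guidance elements
    return "risk-limited fork"       # prompt only the low-risk direction
```

A channel pair that passes the geometric test but fails the dynamic-risk constraint is classified as a risk-limited fork, matching the claim's requirement to prompt only the low-risk direction in that case.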
Description
Method and system for realizing man-machine collaborative scene guidance based on multi-source vision
Technical Field
The application relates to the technical field of computer vision and human-machine interaction, in particular to a method and a system for realizing human-machine collaborative scene guidance based on multi-source vision.
Background
With the development of tele-operation and semi-automated systems, forward-view/surround-view cameras and depth sensors are widely used to capture the field environment, and prompts are superimposed on the display to help users understand feasible areas and potential obstacles, following a human-in-the-loop approach. However, related technical schemes generally have the following problems: the triggering conditions for candidate prompts are unclear, causing interference or information overload at non-critical times; candidate sets are unstable or too large, increasing cognitive burden and slowing decisions; interaction timing and rollback mechanisms are missing, leading to misoperation or a suspended state; quantitative safety indicators are not bound to the visualization, making the risks and costs of different choices hard to compare intuitively; and rendering does not adapt to end-to-end latency and bandwidth fluctuation, affecting real-time performance and stability. A scene guidance technology is therefore needed to improve understandability, interaction efficiency and robustness in complex environments.
Disclosure of Invention
The application provides a method and a system for realizing human-machine collaborative scene guidance based on multi-source vision, which can improve understandability, interaction efficiency and robustness in a complex environment.
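The parameterized guidance output described above (direction, path-segment identifier, expected speed range, validity time limit) can be sketched as a small data structure. The field names, units and the `emit_guidance` helper are hypothetical illustrations; the document only names the four parameter kinds and states that the external system performs a secondary safety check before execution.

```python
from dataclasses import dataclass

@dataclass
class GuidanceParams:
    """Hypothetical parameterized guidance output (names are illustrative)."""
    direction: str            # "left" | "straight" | "right"
    path_segment_id: str      # identifier of the selected candidate local path
    speed_range_mps: tuple    # (min, max) expected speed in m/s
    valid_until: float        # absolute expiry timestamp (validity time limit)

    def is_valid(self, now: float) -> bool:
        """The external system should re-check validity before executing."""
        return now < self.valid_until

def emit_guidance(direction: str, segment_id: str, now: float,
                  ttl_s: float = 2.0) -> GuidanceParams:
    """Package an operator selection as a decoupled, parameterized output."""
    return GuidanceParams(direction, segment_id, (0.5, 2.0), now + ttl_s)
```

Expressing the selection as plain parameters rather than a rendered command is what allows the secondary safety check and cross-system adaptation mentioned in the abstract.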
The embodiment of the application provides a method for realizing human-machine collaborative scene guidance based on multi-source vision, comprising the following steps: acquiring a forward-view video stream, a surround-view video stream and depth data, and time-aligning each data path to form a synchronous data packet; performing geometric calibration and coordinate unification based on the synchronous data packet, and extracting passable areas and obstacle distribution from it to form a scene representation; detecting a fork event based on the formed scene representation and, when a fork event occurs, overlaying three guidance elements on the forward-view video, the three guidance elements corresponding to the left, straight-ahead and right directions respectively; and outputting the guidance parameters of the corresponding candidate local path to an external system according to the selected guidance element. In an exemplary embodiment, the method further comprises removing noise from the synchronous data packet using an improved bilateral filtering algorithm while preserving edge details, and rejecting abnormal points in the depth data using a combination of statistical filtering and radius filtering. In one illustrative example, the method further comprises extracting dynamic risk, comprising: acquiring, through multi-modal sensing fusion and dynamic-obstacle prediction, dynamic obstacle targets that may move, cross or approach within the scene on a short time scale; and computing the correlation between each obstacle target's future trajectory and the candidate passing paths among the local path candidates, and generating, based on that correlation, the dynamic-risk level of the direction corresponding to each candidate passing path.
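The combined statistical and radius filtering of depth outliers mentioned above can be sketched with NumPy as follows. This is a minimal brute-force version (pairwise distances, no spatial index); the parameter values and function names are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def statistical_filter(points: np.ndarray, k: int = 8,
                       std_ratio: float = 2.0) -> np.ndarray:
    """Drop points whose mean k-NN distance exceeds mean + std_ratio * std."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self (distance 0)
    keep = knn <= knn.mean() + std_ratio * knn.std()
    return points[keep]

def radius_filter(points: np.ndarray, radius: float = 0.5,
                  min_neighbors: int = 3) -> np.ndarray:
    """Drop points with fewer than min_neighbors within `radius`."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = (d < radius).sum(axis=1) - 1           # exclude the point itself
    return points[neighbors >= min_neighbors]

def remove_depth_outliers(points: np.ndarray) -> np.ndarray:
    """Combination algorithm: statistical filtering followed by radius filtering."""
    return radius_filter(statistical_filter(points))
```

A production system would use a KD-tree (e.g. the equivalents in Open3D or PCL) rather than the O(n^2) distance matrix used here for clarity.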
In an illustrative example, acquiring the obstacle targets includes: adopting a Transformer-based multi-modal fusion network, inputting features of the forward-view video stream, the surround-view video stream and the depth data into the network, realizing cross-source feature interaction through a cross-attention mechanism, and outputting a unified fused feature map; and performing dynamic-obstacle detection and tracking on the fused feature map to obtain the obstacle targets. Computing the future trajectory of an obstacle target comprises predicting its future movement trend with a time-series prediction model to obtain its future trajectory. In one illustrative example, the method further comprises introducing an intention-aware human-machine collaborative decision mechanism so that the guided selection of the external system makes a joint decision based on the AI candidate paths in combination with the operator's intention. In an exemplary embodiment, detecting a fork event based on the formed scene representation includes: determining that a fork event occurs when at least two mutually separated feasible channels that meet the preset judging rules exist in the scene and each feasible channel meets the safety clearance requirement.
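The cross-source feature interaction via cross-attention mentioned above can be sketched in a stripped-down form: forward-view feature tokens act as queries attending to surround-view and depth tokens, and the attended contexts are concatenated into one fused feature map. This omits the learned query/key/value projections, multi-head structure and positional encodings of a real Transformer; shapes and function names are illustrative assumptions.

```python
import numpy as np

def cross_attention(query: np.ndarray, key_value: np.ndarray) -> np.ndarray:
    """Scaled dot-product cross-attention (no learned projections, single head)."""
    d_k = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d_k)        # (n_q, n_kv)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over kv tokens
    return weights @ key_value                         # (n_q, d)

def fuse_sources(front: np.ndarray, around: np.ndarray,
                 depth: np.ndarray) -> np.ndarray:
    """Forward-view tokens attend to surround-view and depth tokens;
    concatenate the attended contexts into a unified fused feature map."""
    ctx_around = cross_attention(front, around)
    ctx_depth = cross_attention(front, depth)
    return np.concatenate([front, ctx_around, ctx_depth], axis=-1)
```

Each fused token thus carries its own forward-view feature plus a surround-view context and a depth context, which a downstream detector could consume for dynamic-obstacle detection and tracking.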