
CN-115169951-B - Multi-feature fusion automatic driving curriculum reinforcement learning training method


Abstract

The invention discloses an automatic driving curriculum reinforcement learning training method based on multi-feature fusion. The method comprises: carrying out global path planning according to the training task; collecting the vehicle's surrounding environment with vehicle-mounted sensors to obtain a traffic state information vector; building a local occupancy grid map of the vehicle's current position in the Frenet coordinate system; designing the action space and reward function for behavior-decision reinforcement learning; constructing a deep reinforcement learning backbone network model; grading the subtasks of the automatic driving training process by difficulty; feeding the extracted multi-element features into the constructed backbone model to output discrete optimal driving behaviors; and, according to the optimal driving behavior, generating the current optimal trajectory of the unmanned vehicle in real time through a sampling-based multi-resolution trajectory planning algorithm. Through this data-driven approach, the automatic driving vehicle spontaneously learns the optimal driving strategy and thereby completes navigation tasks in urban structured road environments.
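The decide-then-plan pipeline summarized above can be sketched as a single control cycle. Everything here is illustrative: the function names, the rule-based stand-in for the learned policy, and the sample numbers are assumptions, not from the patent; only the five-action set and the normalized-speed feature v = v_ego / v_max come from the claims.

```python
# Illustrative sketch of the per-cycle pipeline from the abstract.
# All names and the hand-written decision rule are hypothetical stand-ins
# for the learned SAC policy described in the patent.

ACTIONS = ["change_left", "change_right", "accelerate", "decelerate", "keep_lane"]

def extract_features(ego_speed, max_speed, risk):
    """Fuse ego state into the [v, gamma] vehicle state vector from claim 1."""
    return [ego_speed / max_speed, risk]

def decide(features):
    """Stand-in policy: map fused features to a discrete behavior index."""
    v, risk = features
    if risk > 0.8:   # illustrative threshold: high collision risk -> slow down
        return ACTIONS.index("decelerate")
    if v < 0.5:      # illustrative threshold: well under the limit -> speed up
        return ACTIONS.index("accelerate")
    return ACTIONS.index("keep_lane")

features = extract_features(ego_speed=5.0, max_speed=12.0, risk=0.2)
behavior = ACTIONS[decide(features)]
```

In the patent the decision step is a trained Soft Actor-Critic network and the chosen behavior is then handed to the sampling-based trajectory planner; the rule above only fixes the input/output shapes of that step.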

Inventors

  • LIANG HUAWEI
  • LI ZHIYUAN
  • ZHANG SONG
  • WANG JIAN
  • ZHOU PENGFEI

Assignees

  • Hefei Institutes of Physical Science, Chinese Academy of Sciences (中国科学院合肥物质科学研究院)

Dates

Publication Date
2026-05-08
Application Date
2022-07-26

Claims (1)

  1. A multi-feature fusion automatic driving curriculum reinforcement learning training method, characterized by comprising the following specific steps:

Step S1: global path planning is carried out according to the training task, vehicle-mounted sensors are used to collect the vehicle's surrounding environment to obtain a traffic state information vector, and a local occupancy grid map of the vehicle's current position is built in the Frenet coordinate system. Specifically:

Set a road scene and a training task, and generate a lane-level global path with an A*-based global planning algorithm as the global reference path for training.

Acquire multi-sensor data from the vehicle-mounted equipment and process it into a traffic state information vector, which consists of a vehicle state vector and a traffic environment element vector.

The vehicle state vector is [v, γ], where v is the normalized speed and γ is the collision risk in the current lane. The normalized speed is computed as v = v_ego / v_max, where v_ego is the speed of the ego vehicle and v_max is the maximum allowed speed of the current lane or a preset maximum speed.

The safe distance d_s between the ego vehicle and the vehicle ahead is computed as:

d_s = v_ego·t + v_ego² / (2·a_ego) − v₁² / (2·a₁)

where v₁ is the speed of the vehicle ahead, a_ego and a₁ are the decelerations of the ego vehicle and the vehicle ahead, respectively, and t is the reaction time of the ego vehicle.

The collision risk in the current lane is computed as γ = (d / d_s) / γ_max, where d is the actual distance between the ego vehicle and the vehicle ahead, and γ_max is the maximum ratio of actual distance to safe distance.

The traffic environment element vector is [e_l, e_c, e_r, l_l, l_c, l_r, r, g, b, δ₁, δ₂], where [e_l, e_c, e_r] indicate whether a left, middle, and right lane exist, [l_l, l_c, l_r] indicate whether lane changes are permitted by traffic regulations on the left, middle, and right lanes, [r, g, b] represents the traffic light state, and [δ₁, δ₂] are the normalized distance to the next road junction and the normalized distance to the end point of the global path, respectively.

A local occupancy grid map is constructed in the Frenet coordinate system; any position in the map is treated as a linear combination of a lateral vector and a longitudinal vector.

Step S2: design the action space and reward function for behavior-decision reinforcement learning, and build the deep reinforcement learning backbone network model. Specifically:

Design the action space: design actions sufficient to complete all target tasks while compressing the dimensionality of the solution space, giving five discrete driving actions A = {change lane left, change lane right, accelerate, decelerate, keep lane}.

Design the reward function, which mainly comprises sparse mainline rewards and dense trajectory-planning feedback rewards.

The sparse mainline reward R₁(s) is obtained in special states of the unmanned vehicle and mainly comprises an episode-termination reward and a navigation-progress reward [formulas omitted in the source text], where k₁ is a hyperparameter and v is the current speed of the unmanned vehicle.

The dense trajectory-planning feedback reward jointly optimizes trajectory planning and behavior decision: the expected cost of the trajectory planning results within one behavior-decision period is used as the input of the cost-to-reward (C-R) mapping:

R₂ = −k₂·(C − b), with C = (1/N)·Σₙ cₙ

where N is the number of trajectory plannings completed in the current behavior-decision period, cₙ is the cost of the n-th trajectory planning result in that period, and k₂ and b are positive hyperparameters.

Build the deep reinforcement learning backbone network model on the Soft Actor-Critic algorithm; the model comprises one Actor network and four Critic networks.

Step S3: grade the subtasks of the automatic driving training process by difficulty, feed the extracted multi-element features into the constructed deep reinforcement learning backbone network model, and output discrete optimal driving behaviors. Specifically:

Classify subtasks of different difficulty according to the difficulty-grading principle of curriculum tasks; the grading dimensions include global route length, route curvature, traffic light status, other traffic participants, and interaction with dynamic vehicles.

During curriculum reinforcement learning training, decide whether to perform curriculum scheduling according to the completion of the curriculum tasks over the last 20 episodes: if the task success rate exceeds 90% and the normalized root-mean-square deviation NRMSD is below the threshold 0.1, the current curriculum learning task is complete and training proceeds to the next curriculum task.

The NRMSD is computed as:

NRMSD = (1/r′)·√( (1/N)·Σᵢ (rᵢ − r′)² )

where r′ is the expected return, rᵢ is the return of the i-th episode, and N is the total number of episodes.

Step S4: according to the obtained optimal driving behavior, generate the current optimal trajectory of the unmanned vehicle in real time through a sampling-based multi-resolution trajectory planning algorithm. Specifically:

Select a suitable preview point on the global reference path according to the current vehicle speed, and sample laterally to both sides of the reference path around the preview point to generate a series of terminal states.

Use third-order Bézier curves to generate a curvature-continuous curve cluster from the current vehicle position to the sampled terminal-state positions as the candidate trajectory cluster of the unmanned vehicle's current position. The third-order Bézier curve is defined as:

P(t) = Σᵢ Bᵢ(t)·Hᵢ, Bᵢ(t) = C(3, i)·tⁱ·(1 − t)³⁻ⁱ, i = 0, 1, 2, 3

where Hᵢ are the control points of the third-order Bézier curve, Bᵢ(t) are the corresponding coefficients, and the parameter t ranges over [0, 1]. The control points Hᵢ are computed [formula omitted in the source text] from S_x and S_y, the lateral and longitudinal offsets of the terminal state relative to the vehicle; ω, the relative angle between the terminal state direction and the X-axis; and l, the distance between control points H₀ and H₁, which is also the distance between H₂ and H₃.

For the resulting trajectory cluster, establish a multi-attribute evaluation function and select the minimum-cost trajectory as the current optimal trajectory:

τ* = argminⱼ ( λ_L·C_j^L + λ_S·C_j^S + λ_D·C_j^D + λ_K·C_j^K ), j = 1, …, N

where N is the number of trajectories in the trajectory set; C_j^L, C_j^S, C_j^D, and C_j^K are the trajectory length, safety, lateral offset, and curvature-smoothness costs of the j-th candidate trajectory; and λ_L, λ_S, λ_D, and λ_K are the corresponding weights. The individual evaluation indices are computed [formulas omitted in the source text] from: L_j, the length of the j-th candidate trajectory; L_max and L_min, the maximum and minimum lengths of candidate trajectories in the set; d_max and d_min, the maximum and minimum trajectory-to-obstacle distances in the set; d_j, the distance between the j-th candidate trajectory and the obstacles; d_i^j, the lateral offset between the i-th point of the j-th candidate trajectory and the global reference path; m, the number of points in the j-th candidate trajectory; d_j^max, the maximum lateral offset between the j-th candidate trajectory and the global reference path; κ_i^j, the curvature at the i-th point of the j-th candidate trajectory; and κ_j^max, the maximum curvature in the j-th candidate trajectory.
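The safe-distance and collision-risk features of claim 1 can be sketched as follows. The source text does not reproduce the d_s formula image, so the braking-distance form below (reaction-time travel plus ego braking distance minus the leader's braking distance) is a standard reconstruction consistent with the variables the claim defines, and the reading of γ is likewise an assumption.

```python
# Sketch of claim 1's vehicle-state features. The exact d_s and gamma formulas
# are not reproduced in the source text; these forms are assumptions.

def safe_distance(v_ego, v_lead, a_ego, a_lead, t_react):
    """Assumed braking-distance form of d_s: reaction-time travel plus ego
    braking distance minus the leading vehicle's braking distance."""
    return v_ego * t_react + v_ego**2 / (2 * a_ego) - v_lead**2 / (2 * a_lead)

def collision_risk(d_actual, d_safe, gamma_max):
    """Assumed reading of the claim: the actual/safe distance ratio,
    normalized by its maximum gamma_max."""
    return (d_actual / d_safe) / gamma_max

# Example: ego at 15 m/s, leader at 10 m/s, both braking at 5 m/s^2, 1 s reaction.
d_s = safe_distance(v_ego=15.0, v_lead=10.0, a_ego=5.0, a_lead=5.0, t_react=1.0)
```

With these numbers d_s = 15 + 22.5 − 10 = 27.5 m; an actual gap of 55 m with γ_max = 4 then gives γ = 0.5.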
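The candidate-trajectory generation in step S4 uses standard third-order Bézier curves. A minimal evaluation of such a curve is sketched below with the standard Bernstein basis B_i(t) = C(3, i)·tⁱ·(1 − t)³⁻ⁱ; the patent's control-point construction from (S_x, S_y, ω, l) is not reproduced in the source text, so the control points here are illustrative values passed in directly.

```python
from math import comb

def bezier3(H, t):
    """Point on a third-order Bezier curve with control points H = [H0, H1, H2, H3],
    using the standard Bernstein basis B_i(t) = C(3, i) * t**i * (1 - t)**(3 - i)."""
    x = sum(comb(3, i) * t**i * (1 - t)**(3 - i) * hx for i, (hx, _) in enumerate(H))
    y = sum(comb(3, i) * t**i * (1 - t)**(3 - i) * hy for i, (_, hy) in enumerate(H))
    return x, y

# Illustrative control points: start at the origin heading along +x, end at a
# sampled terminal state (10, 2), with tangent spacing l = 3 on both ends.
H = [(0.0, 0.0), (3.0, 0.0), (7.0, 2.0), (10.0, 2.0)]
start, end = bezier3(H, 0.0), bezier3(H, 1.0)
```

The curve interpolates H₀ at t = 0 and H₃ at t = 1, which is why the claim's construction pins H₀ to the current vehicle position and H₃ to the sampled terminal state; H₁ and H₂ fix the start and end tangent directions.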
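The multi-attribute trajectory selection at the end of step S4 is a weighted-sum minimum over the candidate cluster. A minimal sketch is below; the four per-candidate cost values are assumed to already carry the claim's normalizations (length, safety, lateral offset, curvature smoothness), and the sample numbers and equal weights are illustrative only.

```python
def best_trajectory(costs, weights):
    """Index of the minimum weighted-cost candidate (claim 1, step S4).
    costs[j] = (C_L, C_S, C_D, C_K) for candidate j, already normalized;
    weights = (lam_L, lam_S, lam_D, lam_K)."""
    totals = [sum(lam * c for lam, c in zip(weights, cj)) for cj in costs]
    return min(range(len(totals)), key=totals.__getitem__)

# Illustrative normalized costs for three candidate trajectories.
candidates = [
    (0.9, 0.1, 0.5, 0.2),   # long but far from obstacles
    (0.2, 0.9, 0.1, 0.1),   # short but close to obstacles
    (0.4, 0.3, 0.2, 0.2),   # balanced
]
weights = (0.25, 0.25, 0.25, 0.25)
j_star = best_trajectory(candidates, weights)
```

With equal weights the balanced candidate wins; in practice the λ weights would be tuned, e.g. a larger λ_S to favor clearance over path length.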

Description

Multi-feature fusion automatic driving curriculum reinforcement learning training method

Technical Field

The invention relates to the technical field of automatic driving, and in particular to an automatic driving curriculum reinforcement learning training method with multi-feature fusion.

Background

Unmanned vehicles play an important role in guaranteeing traffic safety, reducing transportation cost, improving vehicle efficiency, and reducing air pollution, and have broad application scenarios. Behavior decision acts as the brain of the unmanned vehicle: it is supported by the environment perception system above and drives planning and control below, playing a key role in improving the autonomy, intelligence, and safety of the unmanned vehicle. Research on behavior decision methods for unmanned vehicles is therefore of great significance for advancing their development.

The behavior decision layer is the strategy-making layer for the driving behavior of the unmanned vehicle. Taking as input the local environment information produced by the perception module, including the vehicle's own state and the local traffic state, it uses an intelligent decision algorithm to output intelligent, reasonable, and lawful high-level driving behaviors.

The prior art has the defect that the currently more conventional approach is the rule-based behavior decision method, in which a behavior decision rule knowledge base is established from traffic laws and regulations and the prior driving knowledge of experienced drivers, according to an analysis of the environmental states of different scenes and road conditions.
Although the rule-based decision method has a clear structure and is easy to implement, as a reactive method it suffers from poor flexibility, poor extensibility, poor maintainability, suboptimal solutions, and a low degree of intelligence. On the one hand, to cover as many scenarios as possible, a finite state machine model would require an ever-growing number of rules, and expanding such decision diagrams is labor-intensive engineering; a rule-based behavior decision model therefore demands a great deal of time and effort to convert human prior experience into rules that can cope with traffic driving scenes of realistic complexity. On the other hand, as the rule base expands, the couplings among rules become complex, making rules harder to maintain and replace.

Disclosure of the Invention

The invention aims to overcome the above defects in the prior art by providing an automatic driving curriculum reinforcement learning training method with multi-feature fusion.
A multi-feature fusion automatic driving curriculum reinforcement learning training method comprises the following specific steps: Step S1, global path planning is carried out according to the training task, vehicle-mounted sensors are used to collect the vehicle's surrounding environment to obtain a traffic state information vector, and a local occupancy grid map of the vehicle's current position is built in the Frenet coordinate system; Step S2, the action space and reward function of behavior-decision reinforcement learning are designed, and a deep reinforcement learning backbone network model is constructed; Step S3, the subtasks of the automatic driving training process are graded by difficulty, the extracted multi-element features are fed into the constructed deep reinforcement learning backbone network model, and discrete optimal driving behaviors are output; Step S4, according to the obtained optimal driving behavior, the current optimal trajectory of the unmanned vehicle is generated in real time through a sampling-based multi-resolution trajectory planning algorithm.
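The curriculum-scheduling rule of step S3 (advance when the task success rate over the last 20 episodes exceeds 90% and the NRMSD of episode returns is below 0.1) can be sketched as a simple check. The NRMSD formula image is not reproduced in the source text, so the form used here (RMS deviation of episode returns from the expected return, normalized by that expected return) is an assumption.

```python
import math

def should_advance(returns, successes, target_return,
                   success_thresh=0.9, nrmsd_thresh=0.1):
    """Curriculum-scheduling check over a window of recent episodes (step S3).
    The NRMSD form is an assumed reconstruction: RMS deviation of episode
    returns from the expected return, normalized by the expected return."""
    n = len(returns)
    success_rate = sum(successes) / n
    rms = math.sqrt(sum((r - target_return) ** 2 for r in returns) / n)
    nrmsd = rms / target_return
    return success_rate > success_thresh and nrmsd < nrmsd_thresh

# Example window: 20 episodes with returns close to a target of 100,
# 19 of which completed the task (95% success rate).
returns = [98.0, 102.0, 100.0, 99.0, 101.0] * 4
successes = [True] * 19 + [False]
advance = should_advance(returns, successes, target_return=100.0)
```

Here the success rate is 0.95 and the NRMSD is roughly 0.014, so the check passes and training would move on to the next curriculum task.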
As a further technical scheme of the invention, the specific steps of step S1 comprise: setting a road scene and a training task, and generating a lane-level global path with an A*-based global planning algorithm as the global reference path for training; acquiring multi-sensor data from the vehicle-mounted equipment, and processing the data to obtain a traffic state information vector, which consists of a vehicle state vector and a traffic environment element vector; the vehicle state vector is [v, γ], where v is the normalized speed and γ is the collision risk in the current lane; the normalized speed is computed as v = v_ego / v_max, where v_ego is the speed of the ego vehicle and v_max is the maximum allowed speed of the current lane or a preset maximum speed; the safe distance d_s between the ego vehicle and the vehicle ahead is obtained, where v₁ is the speed of the vehicle ahead, and a_ego and a₁ are the