CN-122024208-A - Traffic signal joint perception method and device, electronic equipment and storage medium

CN122024208A

Abstract

The invention provides a traffic signal joint perception method and device, an electronic device, and a storage medium, relating to the technical field of computer vision. The method comprises inputting a traffic signal image sequence to be detected into a joint perception model to obtain traffic light and traffic sign detection results. The model is trained based on a sample traffic signal image sequence, the corresponding sample traffic light and traffic sign detection results, preset semantic association information, and preset state transition prior information, and performs joint perception based on enhanced feature vectors for the traffic light and the traffic sign and on a verified traffic light state. The enhanced feature vectors are determined from the traffic signal image sequence to be detected and the preset semantic association information, and the verified traffic light state is obtained by a temporal consistency check against the preset state transition prior information. The invention can improve the accuracy and reliability of traffic signal detection results.

Inventors

  • Wang Fengyan
  • Qiu Zengyu
  • Wen Ziteng
  • Hu Jinshui
  • Guo Tao

Assignees

  • iFLYTEK Co., Ltd. (科大讯飞股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (11)

  1. A traffic signal joint perception method, comprising: acquiring a traffic signal image sequence to be detected; and inputting the traffic signal image sequence to be detected into a joint perception model to obtain a traffic light detection result and a traffic sign detection result output by the joint perception model; wherein the joint perception model is trained based on a sample traffic signal image sequence, a sample traffic light detection result and a sample traffic sign detection result corresponding to the sample traffic signal image sequence, preset semantic association information, and preset state transition prior information; the joint perception model performs joint perception based on a traffic light enhanced feature vector, a traffic sign enhanced feature vector, and a verified traffic light state; the traffic light enhanced feature vector and the traffic sign enhanced feature vector are determined based on the traffic signal image sequence to be detected and the preset semantic association information, and the verified traffic light state is obtained by performing a temporal consistency check based on the preset state transition prior information.
  2. The traffic signal joint perception method according to claim 1, wherein inputting the traffic signal image sequence to be detected into the joint perception model to obtain the traffic light detection result and the traffic sign detection result output by the joint perception model comprises: extracting, through a feature extraction layer of the joint perception model, a multi-scale feature map corresponding to each frame in the traffic signal image sequence to be detected; processing the multi-scale feature map through a feature map processing layer of the joint perception model to obtain a traffic light feature vector and a traffic sign feature vector; performing, through a semantic interaction layer of the joint perception model, semantic interaction enhancement on the traffic light feature vector and the traffic sign feature vector based on the preset semantic association information to obtain a traffic light enhanced feature vector and a traffic sign enhanced feature vector; performing, through a temporal verification layer of the joint perception model, a temporal consistency check on a predicted traffic light state based on the preset state transition prior information to obtain a verified traffic light state; and determining the traffic light detection result based on the traffic light enhanced feature vector and the verified traffic light state through a traffic light detection head of the joint perception model, and determining the traffic sign detection result based on the traffic sign enhanced feature vector through a traffic sign detection head of the joint perception model.
  3. The traffic signal joint perception method according to claim 2, wherein processing the multi-scale feature map through the feature map processing layer of the joint perception model to obtain the traffic light feature vector and the traffic sign feature vector comprises: performing candidate box extraction on the multi-scale feature map through a shared candidate box generation network in the feature map processing layer of the joint perception model to obtain a traffic light candidate box set and a traffic sign candidate box set, wherein the traffic light candidate box set and the traffic sign candidate box set are each generated based on their respective anchor box parameters; and extracting, from the multi-scale feature map through a region-of-interest alignment operation, a traffic light feature vector for each candidate box in the traffic light candidate box set and a traffic sign feature vector for each candidate box in the traffic sign candidate box set.
  4. The traffic signal joint perception method according to claim 3, wherein performing semantic interaction enhancement on the traffic light feature vector and the traffic sign feature vector based on the preset semantic association information through the semantic interaction layer of the joint perception model to obtain the traffic light enhanced feature vector and the traffic sign enhanced feature vector comprises: acquiring, through the semantic interaction layer of the joint perception model, a semantic association matrix constructed based on the preset semantic association information, wherein the semantic association matrix represents the association strength between each traffic light state and each traffic sign category; calculating a first target semantic attention weight based on the category distribution probability of the traffic light candidate box set and the semantic association matrix, and calculating a second target semantic attention weight based on the category distribution probability of the traffic sign candidate box set and the semantic association matrix; performing, based on the first target semantic attention weight, a weighted summation of the network-mapped traffic light feature vectors to obtain a first semantic context vector, and fusing the first semantic context vector with the traffic sign feature vector to obtain the traffic sign enhanced feature vector; and performing, based on the second target semantic attention weight, a weighted summation of the network-mapped traffic sign feature vectors to obtain a second semantic context vector, and fusing the second semantic context vector with the traffic light feature vector to obtain the traffic light enhanced feature vector.
  5. The traffic signal joint perception method according to claim 2, wherein performing, through the temporal verification layer of the joint perception model, the temporal consistency check on the predicted traffic light state based on the preset state transition prior information to obtain the verified traffic light state comprises: acquiring, through the temporal verification layer of the joint perception model, a state transition probability matrix constructed based on the preset state transition prior information, wherein the state transition probability matrix represents the probability of transition between different traffic light states; acquiring a historical state memory queue for a same traffic light instance, wherein the historical state memory queue stores the historical state detection results of that traffic light instance over historical traffic signal image frames; calculating a temporal support score for each candidate traffic light state based on the state transition probability matrix and the historical state memory queue, and acquiring an original predicted state and its original state prediction distribution for the current traffic signal image frame; performing a weighted fusion of the original state prediction distribution and the temporal support scores to obtain a smoothed state confidence distribution; and determining the verified traffic light state based on the smoothed state confidence distribution.
  6. The traffic signal joint perception method according to any one of claims 2 to 5, wherein before processing the multi-scale feature map through the feature map processing layer of the joint perception model to obtain the traffic light feature vector and the traffic sign feature vector, the method further comprises: performing, through a feature alignment layer of the joint perception model, temporal alignment on a target deep feature map in the multi-scale feature map to obtain an aligned target deep feature map; and fusing the aligned target deep feature map with a shallow feature map to obtain an updated multi-scale feature map, wherein the shallow feature map is a feature map in the multi-scale feature map other than the target deep feature map; and wherein processing the multi-scale feature map through the feature map processing layer of the joint perception model to obtain the traffic light feature vector and the traffic sign feature vector comprises: processing the updated multi-scale feature map through the feature map processing layer of the joint perception model to obtain the traffic light feature vector and the traffic sign feature vector.
  7. The traffic signal joint perception method according to claim 6, wherein before determining the traffic light detection result based on the traffic light enhanced feature vector and the verified traffic light state through the traffic light detection head of the joint perception model, and determining the traffic sign detection result based on the traffic sign enhanced feature vector through the traffic sign detection head of the joint perception model, the method further comprises: performing global pooling on the aligned target deep feature map through a weight adjustment layer of the joint perception model to obtain a scene-level semantic feature vector; calculating, based on the scene-level semantic feature vector, a first gating weight for the traffic light detection task and a second gating weight for the traffic sign detection task; and performing weighted modulation on the traffic light enhanced feature vector based on the first gating weight to obtain a modulated traffic light enhanced feature vector, and performing weighted modulation on the traffic sign enhanced feature vector based on the second gating weight to obtain a modulated traffic sign enhanced feature vector; and wherein determining the traffic light detection result and the traffic sign detection result comprises: determining the traffic light detection result based on the modulated traffic light enhanced feature vector and the verified traffic light state through the traffic light detection head of the joint perception model, and determining the traffic sign detection result based on the modulated traffic sign enhanced feature vector through the traffic sign detection head of the joint perception model.
  8. The traffic signal joint perception method according to any one of claims 1 to 5, wherein the joint perception model is trained with a preset loss function, the preset loss function being determined based on a traffic light detection loss, a traffic sign detection loss, a semantic consistency loss, and a temporal smoothing loss; the semantic consistency loss is determined based on the preset semantic association information, a sample traffic light state prediction confidence, and a sample traffic sign category prediction confidence; and the temporal smoothing loss is determined based on the preset state transition prior information and the sample state prediction distributions corresponding to adjacent image frames in the sample traffic signal image sequence.
  9. A traffic signal joint perception device, comprising: an image sequence acquisition module, configured to acquire a traffic signal image sequence to be detected; and a detection result acquisition module, configured to input the traffic signal image sequence to be detected into a joint perception model to obtain a traffic light detection result and a traffic sign detection result output by the joint perception model; wherein the joint perception model is trained based on a sample traffic signal image sequence, a sample traffic light detection result and a sample traffic sign detection result corresponding to the sample traffic signal image sequence, preset semantic association information, and preset state transition prior information; the joint perception model performs joint perception based on a traffic light enhanced feature vector, a traffic sign enhanced feature vector, and a verified traffic light state; the traffic light enhanced feature vector and the traffic sign enhanced feature vector are determined based on the traffic signal image sequence to be detected and the preset semantic association information, and the verified traffic light state is obtained by performing a temporal consistency check based on the preset state transition prior information.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the traffic signal joint perception method of any one of claims 1 to 8 when executing the computer program.
  11. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the traffic signal joint perception method of any one of claims 1 to 8.
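The semantic interaction of claim 4 can be illustrated with a minimal numerical sketch. Everything below is an assumption made for illustration: the function names, the additive fusion, the softmax over association strengths, and the identity-style mapping matrices are not specified by the patent. The sketch only shows how a state-to-category association matrix can drive cross-task attention between traffic light and traffic sign candidate features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_enhance(light_feats, sign_feats, p_light, p_sign, assoc, W_l, W_s):
    """Cross-task semantic enhancement (illustrative only).

    light_feats: (Nl, D) traffic light candidate features
    sign_feats:  (Ns, D) traffic sign candidate features
    p_light:     (Nl, L) per-candidate light-state distributions
    p_sign:      (Ns, S) per-candidate sign-category distributions
    assoc:       (L, S)  semantic association matrix (state vs. category strength)
    W_l, W_s:    (D, D)  network mapping matrices (hypothetical learned weights)
    """
    # Expected association strength between every light/sign candidate pair.
    strength = p_light @ assoc @ p_sign.T            # (Nl, Ns)
    attn_for_sign = softmax(strength.T)              # (Ns, Nl): signs attend over lights
    attn_for_light = softmax(strength)               # (Nl, Ns): lights attend over signs
    # Weighted sums of mapped features give the semantic context vectors.
    ctx_for_sign = attn_for_sign @ (light_feats @ W_l)   # first context vector
    ctx_for_light = attn_for_light @ (sign_feats @ W_s)  # second context vector
    # Additive fusion stands in for whatever fusion the model actually uses.
    return light_feats + ctx_for_light, sign_feats + ctx_for_sign
```

A candidate confidently classified as "red light" would thus pull context mainly from sign candidates whose categories associate strongly with the red state in `assoc`, such as stop-related signs.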
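The temporal consistency check of claim 5 can be sketched as follows. The transition probabilities, the exponential recency weighting, and the fusion coefficient `alpha` are illustrative assumptions, not values from the patent; the sketch only shows how a state transition prior and a per-instance history queue can suppress physically implausible per-frame flickers.

```python
import numpy as np

STATES = ["red", "yellow", "green"]
# Illustrative transition prior: red -> red/green, yellow -> red/yellow,
# green -> green/yellow (rows sum to 1).
TRANS = np.array([
    [0.90, 0.00, 0.10],   # from red
    [0.30, 0.70, 0.00],   # from yellow
    [0.00, 0.10, 0.90],   # from green
])

def verify_state(p_pred, history, alpha=0.6, decay=0.8):
    """Fuse the raw per-frame state distribution with transition-prior support.

    p_pred:  (3,) original state prediction distribution for the current frame
    history: list of past state indices for the same light instance (oldest first)
    """
    if history:
        # Temporal support score: recency-weighted average of the transition
        # rows selected by the historical states in the memory queue.
        weights = np.array([decay ** i for i in range(len(history))])
        support = np.zeros(len(STATES))
        for w, s in zip(weights, reversed(history)):
            support += w * TRANS[s]
        support /= support.sum()
    else:
        support = np.full(len(STATES), 1.0 / len(STATES))
    # Weighted fusion -> smoothed state confidence distribution.
    smoothed = alpha * p_pred + (1 - alpha) * support
    return int(np.argmax(smoothed)), smoothed
```

With a history of red states, a marginal per-frame preference for "yellow" (a transition the prior rules out) is overridden and the verified state stays "red".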
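The scene-adaptive gating of claim 7 amounts to deriving two task gates from a pooled scene descriptor. The sigmoid gating form and the projection matrices below are assumptions for illustration; the patent specifies only global pooling, per-task gating weights, and weighted modulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_modulation(deep_fmap, light_enh, sign_enh, Wg_light, Wg_sign):
    """Scene-level gating of the enhanced features (illustrative only).

    deep_fmap:          (C, H, W) aligned target deep feature map
    light_enh:          (Nl, D) traffic light enhanced feature vectors
    sign_enh:           (Ns, D) traffic sign enhanced feature vectors
    Wg_light, Wg_sign:  (C, D) gate projections (hypothetical learned weights)
    """
    scene = deep_fmap.mean(axis=(1, 2))       # global average pooling -> (C,)
    g_light = sigmoid(scene @ Wg_light)       # first gating weight, in (0, 1)
    g_sign = sigmoid(scene @ Wg_sign)         # second gating weight, in (0, 1)
    # Channel-wise weighted modulation of each task's enhanced features.
    return light_enh * g_light, sign_enh * g_sign
```

Because the gates lie in (0, 1), modulation can only attenuate channels, letting the scene context down-weight whichever task is less informative in the current frame.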
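The four-term loss of claim 8 can be sketched as below. The concrete forms chosen here, a negative log of the expected association strength for the semantic consistency term and a KL divergence against the prior-propagated previous distribution for the temporal smoothing term, are assumptions; the patent only states which quantities each term depends on.

```python
import numpy as np

def joint_loss(det_light, det_sign, p_light, p_sign, assoc,
               p_t, p_tm1, trans, lam_sem=0.1, lam_temp=0.1):
    """Illustrative composite training loss.

    det_light, det_sign: scalar detection losses (cls + box), computed elsewhere
    p_light (L,), p_sign (S,): prediction confidences for a matched light/sign pair
    assoc (L, S):  semantic association matrix with entries in [0, 1]
    p_t, p_tm1 (L,): state distributions of the same light in adjacent frames
    trans (L, L):  state transition prior (rows sum to 1)
    """
    eps = 1e-8
    # Semantic consistency: confident predictions should land on strongly
    # associated (state, category) pairs.
    sem = -np.log(p_light @ assoc @ p_sign + eps)
    # Temporal smoothing: the current distribution should stay close to the
    # previous one propagated through the transition prior (KL divergence).
    expected = p_tm1 @ trans
    temp = np.sum(p_t * (np.log(p_t + eps) - np.log(expected + eps)))
    return det_light + det_sign + lam_sem * sem + lam_temp * temp
```

Both auxiliary terms are non-negative under these assumptions, so they can only add penalty on top of the detection losses rather than cancel them.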

Description

Traffic signal joint perception method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer vision, and in particular to a traffic signal joint perception method, apparatus, electronic device, and storage medium.

Background

In urban-road autonomous driving scenes, traffic lights and traffic signs are the core basis for a vehicle's compliant traffic decisions, so accurate and stable detection of traffic lights and traffic signs is an important task that provides key information for subsequent safe, compliant driving. Traffic lights provide dynamic, real-time traffic instructions, such as red/yellow/green state switching, while traffic signs provide static, long-term, regulatory constraints, such as "stop and yield", "no left turn", or "speed limit 60". In terms of driving semantics the two are not isolated: they are naturally strongly coupled and mutually explanatory in semantic logic, spatial layout, and temporal behavior. For example, the meaning of a "red light" usually combines with "stop line" and "stop and yield" signs to define a complete stopping instruction; the effective passing direction of a "green light" is further restricted by "straight/left-turn arrow" or lane-guidance signs; and the warning meaning of a "yellow light" can be mutually reinforced by signs such as "watch for pedestrians" or "slow down and yield". However, existing autonomous driving perception systems generally handle traffic light recognition and traffic sign detection as two independent tasks, typically employing parallel or serial perception architectures.
This divide-and-conquer approach has notable limitations. First, it ignores the rich contextual association between the two tasks, so the perceptual uncertainty of either task is amplified in isolation under challenging conditions such as complex illumination, partial occlusion, bad weather, or distant targets; the system lacks the ability to cross-validate and assist reasoning with reliable cues from the other task, and overall perceptual robustness is insufficient. Second, when the independent perception results are output to the downstream planning and control module, temporal asynchrony, confidence conflicts, or semantic ambiguity (for example, a green light detected simultaneously with a no-entry sign) can cause decision delays or conflicts, introducing unnecessary safety risks and performance bottlenecks into the autonomous driving system. In addition, running two independent models brings redundant computing resources and increased system complexity. Therefore, as higher-level autonomous driving places ever greater demands on the reliability, real-time performance, and decision-friendliness of the perception system, the industry needs a new perception framework that can explicitly model and exploit the multi-level, multi-dimensional contextual dependencies between traffic lights and traffic signs, so as to improve the performance and safety level of the overall autonomous driving system.
Disclosure of Invention

The invention provides a traffic signal joint perception method, device, electronic device, and storage medium, which deeply fuse semantic interaction features and a temporally verified state into a single model for joint perception decisions, so as to provide more accurate, reliable, and semantically unified traffic light and traffic sign detection results for the downstream planning and control module and to improve the performance and safety level of the overall autonomous driving system. The invention provides a traffic signal joint perception method, comprising: acquiring a traffic signal image sequence to be detected; and inputting the traffic signal image sequence to be detected into a joint perception model to obtain a traffic light detection result and a traffic sign detection result output by the joint perception model; wherein the joint perception model is trained based on a sample traffic signal image sequence, a sample traffic light detection result and a sample traffic sign detection result corresponding to the sample traffic signal image sequence, preset semantic association information, and preset state transition prior information; the joint perception model performs joint perception based on the traffic light enhanced feature vector, the traffic sign enhanced feature vector, and the verified traffic light state; the traffic light enhanced feature vector and the traffic sign enhanced feature vector are determined based on the traffic signal image sequence to be detected and the