CN-121999071-A - Real-time scene graph generation method, device and medium based on relation-driven positioning

CN121999071A

Abstract

A real-time scene graph generation method, device and medium based on relation-driven positioning. A relation-priority decoding mechanism is introduced in which a fixed number of relation queries directly locate interaction regions, and subject and object features are generated by lightweight linear mapping and gated interaction, reducing the entity-pairing computational complexity of traditional methods from O(N²) to O(N) and significantly cutting model operations and redundant structure. Unified relation-region supervision binds predicate classification to relation-region regression, effectively resolving the inconsistency between semantic prediction and spatial positioning. An auxiliary detection branch provides dense supervision to enhance feature stability and training convergence. The decoding structure is highly lightweight and supports dynamically adjusting its depth in the inference stage, so the model can reach real-time or near-real-time inference speed on common general-purpose single-card GPU (Tesla V100) equipment. The invention achieves higher relation-positioning precision, lower computational cost, faster inference speed and stronger training stability.
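The abstract's central complexity claim can be illustrated with a small count (an illustrative sketch, not part of the patent text): pairwise entity matching enumerates every ordered (subject, object) pair of distinct entities, N·(N−1) of them, while a fixed budget of relation queries is constant in the number of detected entities. The entity and query counts below are invented for illustration.

```python
from itertools import permutations

def pairwise_candidates(n_entities):
    # Traditional two-stage pairing: every ordered (subject, object)
    # pair of distinct entities, i.e. N * (N - 1) candidates -> O(N^2).
    return list(permutations(range(n_entities), 2))

def relation_queries(n_queries=100):
    # Relation-driven positioning: a fixed number of learnable queries,
    # each decoded into one candidate triplet -> cost linear in the query count.
    return list(range(n_queries))

for n in (10, 50, 100):
    # prints: 10 90 100 / 50 2450 100 / 100 9900 100
    print(n, len(pairwise_candidates(n)), len(relation_queries()))
```

At 100 entities the pairwise scheme already evaluates 9,900 candidate pairs, versus a constant 100 relation queries.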

Inventors

  • CHEN MEIWEN
  • WANG ZHIYU
  • CHENG FAN
  • MA TIANYI
  • FANG XIN

Assignees

  • Anhui University (安徽大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-16

Claims (9)

  1. A real-time scene graph generation method based on relation-driven positioning, characterized in that the following steps are executed by computer equipment: S1, processing an input image with a feature encoding module consisting of a CNN backbone network and a Transformer encoder, extracting a global multi-scale feature map, and generating an encoded sequence carrying context information; S2, inputting N learnable relation query vectors together with the encoded sequence into a relation decoder, which captures the spatial and semantic information of interaction regions through self-attention and cross-attention mechanisms and outputs feature vectors representing potential interaction relations; S3, parsing predicate and entity features in parallel, which specifically comprises: predicate prediction, namely applying a relation prediction head to the feature vectors to perform predicate classification and relation-box regression; entity refinement, namely using a linear projection and a gated interaction unit in an entity refinement module to derive subject vectors and object vectors directly from the feature vectors, and further predicting the classes and bounding boxes of subject and object, avoiding expensive cross-attention computation; S4, based on the results of relation prediction and entity refinement, converting each query output into a complete subject-predicate-object triplet through a triplet joint construction module, and generating the final scene graph through confidence screening and post-processing.
  2. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 1, wherein S1 comprises: the system first receives an original input image and converts it into a high-dimensional semantic representation through the feature encoding module; processing logic, namely first extracting a multi-scale feature map from the image through the CNN backbone network, then performing global context modeling through the Transformer encoder; the Transformer encoder in the feature encoding module adopts a hybrid encoding structure with multi-scale feature aggregation, which performs decoupled interaction and cross-scale fusion on the feature maps of different scales output by the CNN backbone network, so as to generate the encoded feature sequence.
  3. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 2, wherein S2 comprises: based on the encoded features, the system locates interaction regions in the image using a relation-priority mechanism instead of the traditional entity pairing logic; processing logic, namely the relation decoder receives N learnable relation query vectors; implementation details, namely the decoder consists of L Transformer layers and outputs the relation features through self-attention and cross-attention mechanisms; each layer contains self-attention, cross-attention and a feed-forward network, and supports running with a smaller L in the inference phase, trading speed against accuracy by executing only the top k decoder layers; dynamic adjustment, namely the module supports dynamically setting the number of decoding layers L in the inference phase.
  4. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 3, wherein S3 comprises: starting from the relation features, the system parses predicate information in parallel and derives the corresponding subject and object features; predicate parsing, namely a relation head (RelationHead) performs classification and regression; predicate classification, namely the relation features are classified over predicates, outputting |Cp| predicate class scores using a linear layer with softmax/logits; during training, a Varifocal Loss (VFL) function is used to account for class imbalance and bounding-box confidence; relation-box regression, namely an MLP outputs four regression parameters (x_c, y_c, w, h) normalized to [0, 1], trained with an L1 + GIoU loss; entity refinement, namely the EntityRefinementModule avoids expensive extra cross-attention through top-down mapping; projection mapping, namely linear projection layers generate initial subject and object feature vectors of dimension d' = 128; gated interaction, namely information exchange between subject and object features is realized by applying Concat, LayerNorm + ReLU and Gate units to the re-mapped feature vectors; entity prediction, namely separate entity heads predict the class and regression box of the subject and object respectively.
  5. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 4, wherein S3 further comprises introducing an auxiliary detection branch into the system during training to enhance feature stability and positioning accuracy, comprising: processing logic, namely mapping the relation features to detection features of feature dimension 128, and performing a DETR-style object detection task.
  6. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 5, wherein the processing steps of the triplet joint construction module in S4 comprise: triplet assembly, namely the triplet joint construction module integrates the outputs into complete triplets <s, p, o>; post-processing, namely performing optional confidence-threshold screening, non-maximum suppression (NMS) and ranking, and outputting the final scene graph.
  7. The method for generating a real-time scene graph based on relation-driven positioning as recited in claim 6, wherein S4 further comprises training with the following matching and loss strategy: Hungarian matching, namely performing a one-to-one Hungarian matching between the N predicted triplets and the M ground-truth triplets, where the matching cost integrates the classification and regression costs of each component of the triplet, the classification cost being cross-entropy or VFL; triplet loss calculation, namely after bipartite matching is completed, the system computes a triplet-level loss over the successfully matched prediction pairs and uses it as the direct objective of model optimization; processing logic, namely the loss function aims to simultaneously supervise the semantic accuracy and the spatial-positioning consistency of each component in the triplet, where an indicator function ensures that the bounding-box regression loss acts only on non-background positive samples matched to real targets; total loss calculation, namely the final training objective is composed of the triplet loss and the auxiliary task loss together, so as to further enhance feature stability, where the object detection loss produced by the auxiliary detection branch strengthens the entity-level supervision signal and is weighted by an adjustable balance coefficient.
  8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
  9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the computer program, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
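The gated entity refinement of claims 1 and 4 (linear projections from each relation feature, then Concat → LayerNorm → ReLU → Gate) can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the random weights, the residual add and the sizes d = 256, d' = 128, N = 100 are assumptions taken loosely from the claim text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, N = 256, 128, 100  # illustrative sizes (d' = 128 per claim 4)

def linear(x, W, b):
    return x @ W + b

# Randomly initialized stand-ins for the learned projection and gate weights.
W_s, b_s = 0.02 * rng.standard_normal((d, d_prime)), np.zeros(d_prime)
W_o, b_o = 0.02 * rng.standard_normal((d, d_prime)), np.zeros(d_prime)
W_g, b_g = 0.02 * rng.standard_normal((2 * d_prime, 2 * d_prime)), np.zeros(2 * d_prime)

def refine(rel_feat):
    # rel_feat: (N, d) -- one vector per relation query, so the cost is O(N),
    # with no pairwise cross-attention between entities.
    s = linear(rel_feat, W_s, b_s)                      # subject projection
    o = linear(rel_feat, W_o, b_o)                      # object projection
    joint = np.concatenate([s, o], axis=-1)             # Concat
    joint = (joint - joint.mean(-1, keepdims=True)) / (
        joint.std(-1, keepdims=True) + 1e-5)            # LayerNorm
    joint = np.maximum(joint, 0.0)                      # ReLU
    gate = 1.0 / (1.0 + np.exp(-linear(joint, W_g, b_g)))  # sigmoid Gate
    s_out, o_out = np.split(joint * gate, 2, axis=-1)   # gated information exchange
    return s + s_out, o + o_out                         # refined subject/object features

subj, obj = refine(rng.standard_normal((N, d)))
print(subj.shape, obj.shape)  # (100, 128) (100, 128)
```

Separate entity heads (not shown) would then map these two (N, 128) feature sets to subject/object classes and boxes, as in the claims.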

Description

Real-time scene graph generation method, device and medium based on relation-driven positioning

Technical Field

The invention relates to the technical field of scene graph generation, in particular to a real-time scene graph generation method, device and storage medium based on relation-driven positioning.

Background

Scene graph generation (SGG) is a visual understanding technique that automatically extracts object entities and their semantic relationships from images. By constructing a graph structure composed of "subject-predicate-object" triplets, SGG can provide high-level semantic information for tasks such as visual question answering, human-machine interaction, image retrieval, autonomous driving and robot decision making. In recent years, with the development of deep learning detection frameworks, scene graph generation technology has formed two mainstream routes: the two-stage relation reasoning framework and the single-stage end-to-end framework. Both types of techniques rely on deep learning models running on hardware platforms such as GPUs.

(I) Two-stage scene graph generation technique

The two-stage model is typically based on a "target detector + relation classifier" architecture. Typical systems include Faster R-CNN, Neural Motifs, etc., and their software systems are generally composed of the following functional modules:

1. Target detection module: all objects in the image are detected by a Region Proposal Network (RPN), and object categories and bounding boxes are output.
2. Feature extraction and ROI pooling module: each detected region is mapped to a high-dimensional visual feature space for subsequent relation reasoning.
3. Entity pairing module: the N target entities are combined pairwise at O(N²) cost to construct candidate subject-object pairs.
4. Relation classification module: the predicate category is predicted from the embedded entity-pair features, context information and semantics.

Although the two-stage method has high recognition accuracy, it has the following defects: high computational complexity, since entity-pair combination leads to O(N²)-level complexity and limits inference speed; serious structural redundancy, since target detection and relation reasoning are executed in two independent stages and features are repeatedly computed; error accumulation, since target detection errors propagate into relation prediction; and difficulty with real-time processing, since the typical inference speed of 10-15 FPS cannot satisfy real-time scenarios such as autonomous driving.

(II) Single-stage end-to-end scene graph generation technique

With the advent of end-to-end detection frameworks (e.g., DETR), part of the research formulates SGG as a set prediction problem, outputting complete subject-predicate-object triplets simultaneously through a Transformer decoder. These methods avoid explicit entity pairing, but are often difficult to run in real time (below 30 FPS) due to the large number of queries, high-dimensional features and deep attention networks required. Typical schemes such as RelTR, SGTR, etc., are generally composed of the following software modules:

1. Image encoding module: global multi-scale features are extracted using a CNN or Transformer backbone.
2. Multi-head Transformer decoder: a large number of query tokens (typically 100-300) are input, and the categories and positions of subjects, predicates and objects are output in parallel.
3. Triplet matching and optimization module: predictions and labels are assigned one-to-one through Hungarian matching, followed by joint loss training.
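The one-to-one assignment performed by the matching module above can be illustrated with a toy brute-force version. Real systems use the Hungarian algorithm proper (e.g. `scipy.optimize.linear_sum_assignment`); the 3×3 cost matrix here is invented, standing in for the combined classification + regression cost per prediction/ground-truth pair.

```python
from itertools import permutations

def match(cost):
    # cost[i][j]: matching cost between prediction i and ground-truth j.
    # Brute-force search over all one-to-one assignments -- only viable for
    # tiny examples, but it returns the same optimum a Hungarian solver would.
    n_pred, n_gt = len(cost), len(cost[0])
    best = None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, list(zip(perm, range(n_gt))))
    return best

cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.4, 0.05],
]
total, pairs = match(cost)
print(total, pairs)  # minimal cost with pairs (prediction, ground-truth)
```

Here the optimum assigns prediction 1 to ground truth 0, prediction 0 to ground truth 1, and prediction 2 to ground truth 2; unmatched predictions (when N > M) would be treated as background.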
Although the single-stage method avoids entity pairing and has a more unified structure, the following disadvantages remain: the number of queries is large and the structure is bulky, since many query tokens are needed to cover enough triplet combinations, making the decoder computation heavy; the triplet structure is redundant, since the three predictions (subject/predicate/object) reuse similar features, wasting resources; real-time performance is poor, as most models cannot achieve real-time (above 30 FPS) inference on common general-purpose GPU equipment; and relation positioning is inaccurate, as existing methods lack unified relation-region supervision, so subject/object positioning is inconsistent with predicate prediction.

Disclosure of the Invention

According to the real-time scene graph generation method, device and storage medium based on relation-driven positioning, fast positioning and efficient subject-object parsing of interaction regions are realized through a relation-priority decoding mechanism, and the reason