US-12626374-B1 - Method for tracking multiple objects by using multiple cameras based on temporal-spatial graph neural network utilizing geometric features for each of the multiple objects and object tracking device using the same
Abstract
There is provided a method for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects, including steps of: (a) acquiring, by an object tracking device, each of a (1_t_1)-th object descriptor to an (n_t_j)-th object descriptor; (b) calculating similarities of each of object descriptor pairs, and inputting a (1_t_1)-th node to an (n_t_j)-th node and spatial edges into the GNN, to thereby instruct the GNN to merge nodes corresponding to a same object, and thus generate a t-th spatial graph; and (c) generating correspondence information, and inputting the (t_1)-th node to the (t_k)-th node, the ((t−1)_1)-th node to the ((t−1)_p)-th node, and temporal edges into the GNN, to thereby instruct the GNN to merge nodes corresponding to the same object, and thus generate a t-th temporal-spatial graph, and finally, track the multiple objects within the specific space.
Inventors
- Jungkwon Lee
- Eun Soo Im
Assignees
- SUPERB AI CO., LTD.
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-11-13
- Priority Date: 2025-09-24
Claims (20)
- 1 . A method for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects, comprising steps of: (a) acquiring, by an object tracking device, each of a (1_t_1)-th object descriptor to a (1_t_i)-th object descriptor corresponding to each of a (1_t_1)-th object to a (1_t_i)-th object on a (1_t)-th frame through each of an (n_t_1)-th object descriptor to an (n_t_j)-th object descriptor corresponding to each of an (n_t_1)-th object to an (n_t_j)-th object on an (n_t)-th frame, wherein each of the (1_t)-th frame to the (n_t)-th frame is captured at a present time t from each of a first camera to an n-th camera installed in a specific space, wherein the n is 2 or more, and the first camera to the n-th camera have different viewing frustums, wherein if an object exists on the (1_t)-th frame, the i is 1 or more, and each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (1_t_1)-th object to the (1_t_i)-th object, and if the object does not exist on the (1_t)-th frame, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor do not exist, wherein if an object exists on the (n_t)-th frame, the j is 1 or more, and each of the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (n_t_1)-th object to the (n_t_j)-th object, and if the object does not exist on the (n_t)-th frame, the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor do not exist; (b) calculating, by the object tracking device, similarities of each of object descriptor pairs by referring to the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, wherein each of the object descriptor pairs is each of pairs having one object 
descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor and another object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and inputting, by the object tracking device, a (1_t_1)-th node corresponding to the (1_t_1)-th object descriptor to an (n_t_j)-th node corresponding to the (n_t_j)-th object descriptor and each of spatial edges corresponding to each of the similarities into the GNN (Graph Neural Network), to thereby instruct the GNN to (i) merge nodes corresponding to a same object among the (1_t_1)-th node to the (n_t_j)-th node by referring to the (1_t_1)-th node to the (n_t_j)-th node and the spatial edges, and thus (ii) generate a t-th spatial graph including a (t_1)-th node to a (t_k)-th node, wherein if the (1_t_1)-th node to the (n_t_j)-th node exist, the k is 1 or more, and if the (1_t_1)-th node to the (n_t_j)-th node do not exist, the (t_1)-th node to the (t_k)-th node do not exist; and (c) generating, by the object tracking device, each piece of correspondence information between each of the (t_1)-th node to the (t_k)-th node of the t-th spatial graph and each of a ((t−1)_1)-th node to a ((t−1)_p)-th node of a (t−1)-th spatial graph, wherein the (t−1)-th spatial graph is generated at a previous time (t−1), and wherein if an object detected within the specific space in the previous time (t−1) exists, the p is 1 or more, otherwise, the ((t−1)_1)-th node to the ((t−1)_p)-th node do not exist, and inputting, by the object tracking device, the (t_1)-th node to the (t_k)-th node, the ((t−1)_1)-th node to the ((t−1)_p)-th node, and temporal edges corresponding to each piece of the correspondence information into the GNN, to thereby instruct the GNN to (i) merge nodes corresponding to the same object between the (t_1)-th node to the (t_k)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the (t_1)-th node to the (t_k)-th node, the ((t−1)_1)-th node to the 
((t−1)_p)-th node and the temporal edges, and thus generate a t-th temporal-spatial graph, and finally, (ii) track the multiple objects within the specific space.
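For illustration only (not part of the claims), the node-merging behavior that the GNN learns in steps (b) and (c) — merging nodes whose connecting edge indicates the same object — can be approximated by thresholding pairwise similarities and taking connected components. The similarity function, threshold, and greedy union-find below are hypothetical stand-ins for the trained edge classifier, not the claimed GNN itself.

```python
from itertools import combinations

def merge_same_object(descriptors, similarity, threshold=0.5):
    """Greedy connected-component merge over pairwise similarities.

    Illustrative stand-in for the GNN edge classification of claim 1:
    nodes whose edge similarity meets `threshold` are merged into a
    single node of the resulting graph.
    """
    parent = list(range(len(descriptors)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(descriptors)), 2):
        if similarity(descriptors[a], descriptors[b]) >= threshold:
            parent[find(a)] = find(b)  # union the two components

    groups = {}
    for idx in range(len(descriptors)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())
```

For example, with scalar "descriptors" and similarity `1 − |a − b|`, descriptors 0.0 and 0.05 merge into one node while 1.0 stays separate.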
- 2 . The method of claim 1 , wherein, at the step of (a), a specific object descriptor, which is one of the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, includes specific appearance feature information and specific geometric feature information, wherein the specific appearance feature information has embedded features which are acquired by embedding at least one of (i) a cropped image or its feature information, wherein the cropped image is obtained by cropping a specific object area corresponding to a specific object from a specific frame, wherein the specific frame is captured by a specific camera corresponding to the specific object descriptor, wherein the specific camera is one of the first camera to the n-th camera, (ii) clothing color information of the specific object, (iii) hairstyle information of the specific object, (iv) hair color information of the specific object, (v) skin color information of the specific object, and (vi) facial feature information of the specific object, wherein the specific geometric feature information has at least one of (i) BEV (Bird's Eye View) coordinate information obtained by projecting a center coordinate of a bounding box corresponding to the specific object onto a ground plane based on camera parameters of the specific camera, (ii) object displacement information, which is a displacement of the specific object estimated by referring to the BEV coordinate, (iii) 3D epipolar similarity information for the specific object calculated based on the specific camera and at least one other camera, wherein the at least one other camera has another viewing frustum that at least partially overlaps with a specific viewing frustum of the specific camera, (iv) object key point information according to a pose of the specific object, and (v) body shape information of the specific object.
- 3 . The method of claim 2 , wherein the specific appearance feature information further includes reliability information indicating a reliability of the embedded features due to an occlusion of the specific object between the specific frame and other frames.
- 4 . The method of claim 1 , wherein, at the step of (c), in generating specific correspondence information between a (t_specific)-th node and a ((t−1)_specific)-th node, wherein the (t_specific)-th node is one of the (t_1)-th node to the (t_k)-th node and the ((t−1)_specific)-th node is one of the ((t−1)_1)-th node to the ((t−1)_p)-th node, the object tracking device generates the specific correspondence information such that the specific correspondence information includes at least one of (i) time interval information between the present time t and the previous time (t−1), (ii) movement distance information between the (t_specific)-th node and the ((t−1)_specific)-th node within the specific space, and (iii) velocity difference information between the (t_specific)-th node and the ((t−1)_specific)-th node.
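Under the assumption that each node carries a BEV position and a velocity estimate (hypothetical field names `pos` and `vel`, not from the specification), the three pieces of correspondence information in claim 4 reduce to simple differences:

```python
import math

def correspondence_info(node_t, node_prev, t, t_prev):
    """Temporal-edge attributes per claim 4: time interval, movement
    distance, and velocity difference between corresponding nodes."""
    return {
        "time_interval": t - t_prev,
        "distance": math.dist(node_t["pos"], node_prev["pos"]),
        "velocity_diff": math.dist(node_t["vel"], node_prev["vel"]),
    }
```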
- 5 . The method of claim 1 , wherein, at the step of (a), the object tracking device receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device corresponding to the first camera to an n-th edge device corresponding to the n-th camera, wherein, at the step of (b), the object tracking device synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 6 . The method of claim 5 , wherein, at the step of (a), a specific edge device, which is one of the first edge device to the n-th edge device, performs an object detection on a specific frame, which is captured by a specific camera corresponding to the specific edge device, to thereby detect specific objects on the specific frame, wherein the specific camera is one of the first camera to the n-th camera, generates each piece of specific appearance feature information of each of the specific objects by referring to each of detection results of each of the specific objects, generates each piece of specific geometric feature information for each of the specific objects based on at least the specific camera, and generates each of specific object descriptors for each of the specific objects by referring to each piece of the specific appearance feature information and each piece of the specific geometric feature information.
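The per-edge-device pipeline of claim 6 — detect objects, embed appearance features, compute geometric features, then assemble descriptors — might be organized as sketched below. `ObjectDescriptor`, `embed`, and `geometry_of` are illustrative placeholders, since the specification does not fix concrete data structures.

```python
from dataclasses import dataclass

@dataclass
class ObjectDescriptor:
    """Per-object descriptor assembled by an edge device (claim 6).
    Field names are illustrative, not taken from the specification."""
    appearance: list   # embedded appearance features
    geometry: dict     # e.g. BEV coordinate, keypoints
    camera_id: int
    timestamp: float

def build_descriptors(detections, camera_id, timestamp, embed, geometry_of):
    """Map each detection to a descriptor; `embed` and `geometry_of`
    stand in for the appearance-embedding and geometric-feature steps."""
    return [ObjectDescriptor(embed(d), geometry_of(d), camera_id, timestamp)
            for d in detections]
```

An edge device would then transmit the resulting list to the object tracking device, which synchronizes descriptors from all cameras to the present time t.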
- 7 . The method of claim 1 , wherein, at the step of (a), the object tracking device receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device to an n′-th edge device, wherein the first edge device to the n′-th edge device correspond to first grouped cameras to n′-th grouped cameras, wherein the first grouped cameras to the n′-th grouped cameras are acquired by clustering the first camera to the n-th camera into one or more groups, wherein, at the step of (b), the object tracking device synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 8 . A method for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects, comprising steps of: (a) acquiring, by an object tracking device, each of a (1_t_1)-th object descriptor to a (1_t_i)-th object descriptor corresponding to each of a (1_t_1)-th object to a (1_t_i)-th object on a (1_t)-th frame through each of an (n_t_1)-th object descriptor to an (n_t_j)-th object descriptor corresponding to each of an (n_t_1)-th object to an (n_t_j)-th object on an (n_t)-th frame, wherein each of the (1_t)-th frame to the (n_t)-th frame is captured at a present time t from each of a first camera to an n-th camera installed in a specific space, wherein the n is 2 or more, and the first camera to the n-th camera have different viewing frustums, wherein if an object exists on the (1_t)-th frame, the i is 1 or more, and each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (1_t_1)-th object to the (1_t_i)-th object, and if the object does not exist on the (1_t)-th frame, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor do not exist, wherein if an object exists on the (n_t)-th frame, the j is 1 or more, and each of the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (n_t_1)-th object to the (n_t_j)-th object, and if the object does not exist on the (n_t)-th frame, the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor do not exist; and (b) generating, by the object tracking device, similarities of each of spatial object descriptor pairs by referring to the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and each piece of correspondence information between each of the (1_t_1)-th 
object descriptor to the (n_t_j)-th object descriptor and each of a ((t−1)_1)-th node to a ((t−1)_p)-th node, wherein each of the ((t−1)_1)-th node to the ((t−1)_p)-th node is a node of a (t−1)-th temporal-spatial graph generated at a previous time (t−1), wherein each of the spatial object descriptor pairs is each of pairs having one object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor and another object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and wherein if an object detected within the specific space in the previous times exists, the p is 1 or more, otherwise, the ((t−1)_1)-th node to the ((t−1)_p)-th node do not exist, and inputting, by the object tracking device, a (1_t_1)-th node corresponding to the (1_t_1)-th object descriptor to an (n_t_j)-th node corresponding to the (n_t_j)-th object descriptor, the ((t−1)_1)-th node to the ((t−1)_p)-th node, spatial edges corresponding to the similarities, and temporal edges corresponding to the correspondence information into the GNN (Graph Neural Network), to thereby instruct the GNN to (i) perform one of sub-processes of: (i−1) merging nodes corresponding to a same object among the (1_t_1)-th node to the (n_t_j)-th node by referring to the spatial edges, and thus generating a (t_1)-th node to a (t_k)-th node, merging nodes corresponding to the same object among the (t_1)-th node to the (t_k)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the temporal edges, and thus generating a t-th temporal-spatial graph; and (i−2) merging nodes corresponding to the same object among the (1_t_1)-th node to the (n_t_j)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the temporal edges, identifying merged nodes and unmerged nodes among the (1_t_1)-th node to the (n_t_j)-th node, merging nodes corresponding to the same object among the merged nodes and the unmerged nodes, and thus generating the 
t-th temporal-spatial graph; and finally, (ii) track the multiple objects within the specific space, wherein if the (1_t_1)-th node to the (n_t_j)-th node exist, the k is 1 or more, and if the (1_t_1)-th node to the (n_t_j)-th node do not exist, the (t_1)-th node to the (t_k)-th node do not exist.
- 9 . The method of claim 8 , wherein, at the step of (a), the object tracking device receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device corresponding to the first camera to an n-th edge device corresponding to the n-th camera, wherein, at the step of (b), the object tracking device synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 10 . The method of claim 8 , wherein, at the step of (a), the object tracking device receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device to an n′-th edge device, wherein the first edge device to the n′-th edge device correspond to first grouped cameras to n′-th grouped cameras, wherein the first grouped cameras to the n′-th grouped cameras are acquired by clustering the first camera to the n-th camera into one or more groups, wherein, at the step of (b), the object tracking device synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 11 . An object tracking device for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects, comprising: at least one memory which saves instructions for tracking the multiple objects by using the multiple cameras based on the temporal-spatial GNN utilizing the geometric features for each of the multiple objects; and at least one processor configured to perform an operation for tracking the multiple objects by using the multiple cameras based on the temporal-spatial GNN utilizing the geometric features for each of the multiple objects according to the instructions stored in the memory to perform processes of: (I) acquiring each of a (1_t_1)-th object descriptor to a (1_t_i)-th object descriptor corresponding to each of a (1_t_1)-th object to a (1_t_i)-th object on a (1_t)-th frame through each of an (n_t_1)-th object descriptor to an (n_t_j)-th object descriptor corresponding to each of an (n_t_1)-th object to an (n_t_j)-th object on an (n_t)-th frame, wherein each of the (1_t)-th frame to the (n_t)-th frame is captured at a present time t from each of a first camera to an n-th camera installed in a specific space, wherein the n is 2 or more, and the first camera to the n-th camera have different viewing frustums, wherein if an object exists on the (1_t)-th frame, the i is 1 or more, and each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (1_t_1)-th object to the (1_t_i)-th object, and if the object does not exist on the (1_t)-th frame, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor do not exist, wherein if an object exists on the (n_t)-th frame, the j is 1 or more, and each of the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor includes respective appearance feature information and respective geometric 
feature information for each of the (n_t_1)-th object to the (n_t_j)-th object, and if the object does not exist on the (n_t)-th frame, the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor do not exist; (II) calculating similarities of each of object descriptor pairs by referring to the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, wherein each of the object descriptor pairs is each of pairs having one object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor and another object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and inputting a (1_t_1)-th node corresponding to the (1_t_1)-th object descriptor to an (n_t_j)-th node corresponding to the (n_t_j)-th object descriptor and each of spatial edges corresponding to each of the similarities into the GNN (Graph Neural Network), to thereby instruct the GNN to (i) merge nodes corresponding to a same object among the (1_t_1)-th node to the (n_t_j)-th node by referring to the (1_t_1)-th node to the (n_t_j)-th node and the spatial edges, and thus (ii) generate a t-th spatial graph including a (t_1)-th node to a (t_k)-th node, wherein if the (1_t_1)-th node to the (n_t_j)-th node exist, the k is 1 or more, and if the (1_t_1)-th node to the (n_t_j)-th node do not exist, the (t_1)-th node to the (t_k)-th node do not exist; and (III) generating each piece of correspondence information between each of the (t_1)-th node to the (t_k)-th node of the t-th spatial graph and each of a ((t−1)_1)-th node to a ((t−1)_p)-th node of a (t−1)-th spatial graph, wherein the (t−1)-th spatial graph is generated at a previous time (t−1), and wherein if an object detected within the specific space in the previous time (t−1) exists, the p is 1 or more, otherwise, the ((t−1)_1)-th node to the ((t−1)_p)-th node do not exist, and inputting the (t_1)-th node to the (t_k)-th node, the ((t−1)_1)-th node to the ((t−1)_p)-th node, and temporal 
edges corresponding to each piece of the correspondence information into the GNN, to thereby instruct the GNN to (i) merge nodes corresponding to the same object between the (t_1)-th node to the (t_k)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the (t_1)-th node to the (t_k)-th node, the ((t−1)_1)-th node to the ((t−1)_p)-th node and the temporal edges, and thus generate a t-th temporal-spatial graph, and finally, (ii) track the multiple objects within the specific space.
- 12 . The object tracking device of claim 11 , wherein a specific object descriptor, which is one of the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, includes specific appearance feature information and specific geometric feature information, wherein the specific appearance feature information has embedded features which are acquired by embedding at least one of (i) a cropped image or its feature information, wherein the cropped image is obtained by cropping a specific object area corresponding to a specific object from a specific frame, wherein the specific frame is captured by a specific camera corresponding to the specific object descriptor, wherein the specific camera is one of the first camera to the n-th camera, (ii) clothing color information of the specific object, (iii) hairstyle information of the specific object, (iv) hair color information of the specific object, (v) skin color information of the specific object, and (vi) facial feature information of the specific object, wherein the specific geometric feature information has at least one of (i) BEV (Bird's Eye View) coordinate information obtained by projecting a center coordinate of a bounding box corresponding to the specific object onto a ground plane based on camera parameters of the specific camera, (ii) object displacement information, which is a displacement of the specific object estimated by referring to the BEV coordinate, (iii) 3D epipolar similarity information for the specific object calculated based on the specific camera and at least one other camera, wherein the at least one other camera has another viewing frustum that at least partially overlaps with a specific viewing frustum of the specific camera, (iv) object key point information according to a pose of the specific object, and (v) body shape information of the specific object.
- 13 . The object tracking device of claim 12 , wherein the specific appearance feature information further includes reliability information indicating a reliability of the embedded features due to an occlusion of the specific object between the specific frame and other frames.
- 14 . The object tracking device of claim 11 , wherein, at the process of (III), in generating specific correspondence information between a (t_specific)-th node and a ((t−1)_specific)-th node, wherein the (t_specific)-th node is one of the (t_1)-th node to the (t_k)-th node and the ((t−1)_specific)-th node is one of the ((t−1)_1)-th node to the ((t−1)_p)-th node, the processor generates the specific correspondence information such that the specific correspondence information includes at least one of (i) time interval information between the present time t and the previous time (t−1), (ii) movement distance information between the (t_specific)-th node and the ((t−1)_specific)-th node within the specific space, and (iii) velocity difference information between the (t_specific)-th node and the ((t−1)_specific)-th node.
- 15 . The object tracking device of claim 11 , wherein, at the process of (I), the processor receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device corresponding to the first camera to an n-th edge device corresponding to the n-th camera, wherein, at the process of (II), the processor synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 16 . The object tracking device of claim 15 , wherein, at the process of (I), a specific edge device, which is one of the first edge device to the n-th edge device, performs an object detection on a specific frame, which is captured by a specific camera corresponding to the specific edge device, to thereby detect specific objects on the specific frame, wherein the specific camera is one of the first camera to the n-th camera, generates each piece of specific appearance feature information of each of the specific objects by referring to each of detection results of each of the specific objects, generates each piece of specific geometric feature information for each of the specific objects based on at least the specific camera, and generates each of specific object descriptors for each of the specific objects by referring to each piece of the specific appearance feature information and each piece of the specific geometric feature information.
- 17 . The object tracking device of claim 11 , wherein, at the process of (I), the processor receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device to an n′-th edge device, wherein the first edge device to the n′-th edge device correspond to first grouped cameras to n′-th grouped cameras, wherein the first grouped cameras to the n′-th grouped cameras are acquired by clustering the first camera to the n-th camera into one or more groups, wherein, at the process of (II), the processor synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 18 . An object tracking device for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects, comprising: at least one memory which saves instructions for tracking the multiple objects by using the multiple cameras based on the temporal-spatial GNN utilizing the geometric features for each of the multiple objects; and at least one processor configured to perform an operation for tracking the multiple objects by using the multiple cameras based on the temporal-spatial GNN utilizing the geometric features for each of the multiple objects according to the instructions stored in the memory to perform processes of: (I) acquiring each of a (1_t_1)-th object descriptor to a (1_t_i)-th object descriptor corresponding to each of a (1_t_1)-th object to a (1_t_i)-th object on a (1_t)-th frame through each of an (n_t_1)-th object descriptor to an (n_t_j)-th object descriptor corresponding to each of an (n_t_1)-th object to an (n_t_j)-th object on an (n_t)-th frame, wherein each of the (1_t)-th frame to the (n_t)-th frame is captured at a present time t from each of a first camera to an n-th camera installed in a specific space, wherein the n is 2 or more, and the first camera to the n-th camera have different viewing frustums, wherein if an object exists on the (1_t)-th frame, the i is 1 or more, and each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor includes respective appearance feature information and respective geometric feature information for each of the (1_t_1)-th object to the (1_t_i)-th object, and if the object does not exist on the (1_t)-th frame, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor do not exist, wherein if an object exists on the (n_t)-th frame, the j is 1 or more, and each of the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor includes respective appearance feature information and respective geometric 
feature information for each of the (n_t_1)-th object to the (n_t_j)-th object, and if the object does not exist on the (n_t)-th frame, the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor do not exist; and (II) generating similarities of each of spatial object descriptor pairs by referring to the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and each piece of correspondence information between each of the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor and each of a ((t−1)_1)-th node to a ((t−1)_p)-th node, wherein each of the ((t−1)_1)-th node to the ((t−1)_p)-th node is a node of a (t−1)-th temporal-spatial graph generated at a previous time (t−1), wherein each of the spatial object descriptor pairs is each of pairs having one object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor and another object descriptor among the (1_t_1)-th object descriptor to the (n_t_j)-th object descriptor, and wherein if an object detected within the specific space in the previous times exists, the p is 1 or more, otherwise, the ((t−1)_1)-th node to the ((t−1)_p)-th node do not exist, and inputting a (1_t_1)-th node corresponding to the (1_t_1)-th object descriptor to an (n_t_j)-th node corresponding to the (n_t_j)-th object descriptor, the ((t−1)_1)-th node to the ((t−1)_p)-th node, spatial edges corresponding to the similarities, and temporal edges corresponding to the correspondence information into the GNN (Graph Neural Network), to thereby instruct the GNN to (i) perform one of sub-processes of: (i−1) merging nodes corresponding to a same object among the (1_t_1)-th node to the (n_t_j)-th node by referring to the spatial edges, and thus generating a (t_1)-th node to a (t_k)-th node, merging nodes corresponding to the same object among the (t_1)-th node to the (t_k)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the temporal edges, and thus generating a t-th 
temporal-spatial graph; and (i−2) merging nodes corresponding to the same object among the (1_t_1)-th node to the (n_t_j)-th node and the ((t−1)_1)-th node to the ((t−1)_p)-th node by referring to the temporal edges, identifying merged nodes and unmerged nodes among the (1_t_1)-th node to the (n_t_j)-th node, merging nodes corresponding to the same object among the merged nodes and the unmerged nodes, and thus generating the t-th temporal-spatial graph; and finally, (ii) track the multiple objects within the specific space, wherein if the (1_t_1)-th node to the (n_t_j)-th node exist, the k is 1 or more, and if the (1_t_1)-th node to the (n_t_j)-th node do not exist, the (t_1)-th node to the (t_k)-th node do not exist.
- 19 . The object tracking device of claim 18 , wherein, at the process of (I), the processor receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device corresponding to the first camera to an n-th edge device corresponding to the n-th camera, wherein, at the process of (II), the processor synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
- 20 . The object tracking device of claim 18 , wherein, at the process of (I), the processor receives each of the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor from each of a first edge device to an n′-th edge device, wherein the first edge device to the n′-th edge device correspond to first grouped cameras to n′-th grouped cameras, wherein the first grouped cameras to the n′-th grouped cameras are acquired by clustering the first camera to the n-th camera into one or more groups, wherein, at the process of (II), the processor synchronizes, to the present time t, the (1_t_1)-th object descriptor to the (1_t_i)-th object descriptor through the (n_t_1)-th object descriptor to the (n_t_j)-th object descriptor.
Description
CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority to Korean Patent Application No. 10-2025-0137897, filed on Sep. 24, 2025, the entire contents of which are incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure relates to Multi-Target Multi-Camera (MTMC) tracking and, more specifically, to a method for tracking multiple objects by using multiple cameras based on a temporal-spatial GNN utilizing geometric features for each of the multiple objects and an object tracking device using the same.

BACKGROUND OF THE DISCLOSURE

A security system utilizing cameras, such as CCTV, is commonly used to ensure the security of a specific space. These cameras are installed in large buildings such as hypermarkets, department stores, research centers, and public institutions, or in smaller buildings such as homes, daycare centers, convenience stores, and banks, as well as in public spaces such as parks and roads. Images captured by the cameras are used to conduct real-time surveillance or analyze information about the corresponding space. In order to ensure the security of the specific space through the images captured by the cameras, multiple objects such as people and cars must be tracked, and MOT (Multiple Object Tracking) technology has been used to detect each of the locations of each of the multiple objects within a single camera or a single sequence for each of the frames and consistently link each of the unique IDs of each of the multiple objects between the frames. Furthermore, there is an MTMC (Multi-Target Multi-Camera) tracking method, which extends the MOT to multiple cameras and continuously synchronizes object IDs among the multiple cameras, wherein each of the multiple cameras has different locations and different viewing frustums.
This type of object tracking re-identifies the same object observed from different viewpoints or different cameras, and thus the tracking is conducted by following a trajectory of the object. Also, a filter-based technique and an AI-based technique are primarily used to re-identify the object, i.e., to associate data. The filter-based technique estimates a state of the object by repeating an observation and a prediction, and performs data association while correcting the state of the object to maintain continuity between the frames by using a dynamic model, such as a Kalman filter, an extended Kalman filter, or an unscented Kalman filter, or performs the data association while correcting the state of the object to maintain the continuity between the frames by using a probabilistic hypothesis on a distribution of a measurement noise, such as JPDA (Joint Probabilistic Data Association) and MHT (Multiple Hypothesis Tracking). In addition, the AI-based technique directly performs or assists the data association by allowing a network to learn an object feature embedding and a spatiotemporal relationship. In particular, a tracking-by-attention method, which utilizes a transformer architecture and an attention mechanism, treats each piece of appearance/location information of each of the objects as tokens and automatically determines a matching of the object IDs by learning inter-frame interactions. However, since many elements that make up the algorithm of a conventional MTMC/MOT-based tracking model, such as a feature extraction, an association rule, and a distance threshold, are manually designed, it is difficult to apply a tracking model, once optimized for a specific capturing environment, such as a camera placement, lighting, and a crowd density, to another capturing environment.
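The filter-based cycle described above (predict the state, observe, correct, then associate) can be sketched in miniature. This is a hedged illustration under strong simplifying assumptions, not any of the cited methods: it uses a one-dimensional state per object and greedy nearest-neighbor gating in place of JPDA or MHT, and the class, function, and parameter names are invented for the example.

```python
class Kalman1D:
    """Toy one-dimensional Kalman filter tracking a single scalar position."""

    def __init__(self, x0, p0=1.0, q=0.01, r=0.1):
        self.x = x0   # state estimate (e.g., object position)
        self.p = p0   # estimate variance
        self.q = q    # process noise variance
        self.r = r    # measurement noise variance

    def predict(self):
        # Prediction step: uncertainty grows between observations.
        self.p += self.q
        return self.x

    def update(self, z):
        # Correction step: blend prediction and measurement by the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)

def associate(tracks, detections, gate=1.0):
    """Greedily match each track's predicted position to the nearest ungated detection."""
    pairs, used = [], set()
    for ti, t in enumerate(tracks):
        pred = t.predict()
        best, best_d = None, gate
        for di, z in enumerate(detections):
            d = abs(z - pred)
            if di not in used and d < best_d:
                best, best_d = di, d
        if best is not None:
            used.add(best)
            t.update(detections[best])   # correct the matched track
            pairs.append((ti, best))
    return pairs
```

Real trackers replace the scalar state with position-velocity vectors, the gate with a Mahalanobis distance, and the greedy loop with an optimal assignment, but the observe-predict-correct structure the passage describes is the same.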
In addition, a tracking model designed based on rules takes a lot of time for parameter adjustment, and it is difficult to sufficiently reflect complex interactions such as an object occlusion, an object reappearance, and an object movement between asynchronous cameras. Accordingly, there are limitations in the performance of the tracking model and its applicability to real-world situations. Furthermore, as the performance of the tracking model depends on training data, recently emerging AI-based tracking models suffer from a significant performance degradation when applied to commercial applications due to insufficient training data. In particular, the MTMC presents numerous variables, such as differences in field of view between the cameras, time deviations, and various movement paths and reappearances. This makes it difficult for tracking models trained in a single domain to accurately re-identify the objects when expanded to different spaces, seasons, or event conditions.

SUMMARY OF THE DISCLOSURE

It is an object of the present disclosure to solve all the aforementioned problems. It is another object of the present disclosure to enable an accurate object tracking regardless of a capturing environment of a camera. It is still another object of the present disclosure to enable a more accur