US-12620191-B1 - Method and system for semantic perception editing of objects in three-dimensional scene
Abstract
A method for semantic perception editing of objects in a three-dimensional scene includes the following steps: input point cloud data of a three-dimensional scene is analyzed by using a three-dimensional instance segmentation network, and a current position and a bounding box of a target object are determined based on an editing instruction. A set of semantic relationship rules between multiple object instances is learned from a historical editing case database by using a Bayesian dynamic updating mechanism, and dynamic semantic constraints are dynamically generated. Editing operations are converted into a standardized mathematical representation by using a unified parametric framework, and verified editing operations are optimized based on a comprehensive loss minimizing function to obtain optimized editing parameters. The three-dimensional scene is transformed based on the optimized editing parameters to obtain the edited three-dimensional scene.
Inventors
- Fengkai LUAN
- Hu ZHANG
- Jiaxing YANG
- Zhengqi LIANG
- Siliang Sun
- Yuyang XIA
Assignees
- WUHAN UNIVERSITY OF TECHNOLOGY
- Wuhan Qikai Information Technology Development Co., LTD
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-10-11
- Priority Date: 2025-08-27
Claims (20)
- 1. A method for semantic perception editing of objects in a three-dimensional scene, the method comprising the following steps: S1, analyzing, using a three-dimensional instance segmentation network, input point cloud data of the three-dimensional scene to obtain multiple object instances and semantic labels of the multiple object instances, a set of supporting surfaces of the multiple object instances, a spatial relationship graph among the multiple object instances, and scene environmental features, and determining a current position and a bounding box of a target object based on an editing instruction, comprising: S11, decomposing, using the three-dimensional instance segmentation network, the input point cloud data of the three-dimensional scene into the multiple object instances, wherein each of the multiple object instances comprises a point set and a corresponding one of the semantic labels; S12, extracting global spatial structural information of the three-dimensional scene, wherein the global spatial structural information comprises the set of supporting surfaces capable of bearing the multiple object instances, free space regions, and the spatial relationship graph among the multiple object instances; S13, obtaining the scene environmental features through analyzing the input point cloud data of the three-dimensional scene, wherein the scene environmental features comprise supporting surface load-bearing information, object material properties, and spatial distribution information; and S14, receiving the editing instruction and determining the current position and the bounding box of the target object in the three-dimensional scene based on the editing instruction; S2, learning, using a Bayesian dynamic updating mechanism, a set of semantic relationship rules between the multiple object instances from a historical editing case database, and dynamically generating dynamic semantic constraints based on the set of semantic relationship rules and the scene environmental features; S3, converting, using a unified parametric framework, editing operations into a standardized mathematical representation, calculating a comprehensive semantic constraint score based on the dynamic semantic constraints to verify rationality of the editing operations, thereby determining verified editing operations, and optimizing, based on a comprehensive loss minimizing function, the verified editing operations to obtain optimized editing parameters; and S4, transforming, using a unified editing operation execution engine, the three-dimensional scene based on the optimized editing parameters, and updating the spatial relationship graph and the set of supporting surfaces to obtain the edited three-dimensional scene.
- 2. The method as claimed in claim 1, wherein the step S2 comprises: S21, establishing a database A = {(s_i, α_i) | s_i ∈ S} associating the semantic labels of the multiple object instances with semantic attribute vectors of the multiple object instances, where s_i represents a corresponding semantic label of an i-th object instance of the multiple object instances, S represents a set of the semantic labels, and α_i represents a semantic attribute vector of the i-th object instance, which includes a weight, a volume, a material hardness, a surface roughness, and a thermal conductivity; S22, defining the set of semantic relationship rules R_semantic = {r_ij | r_ij: s_i × s_j → [0,1], s_i, s_j ∈ S} among the multiple object instances, where r_ij represents a compatibility score function between the semantic label s_i and a semantic label s_j of a j-th object instance of the multiple object instances, i.e., a semantic relationship rule between the i-th object instance and the j-th object instance; S23, treating the semantic relationship rule r_ij as a posterior probability following a Beta distribution Beta(α_ij, β_ij), where α_ij represents an accumulated parameter of successful evidence corresponding to the semantic relationship rule r_ij, and β_ij represents an accumulated parameter of failed evidence corresponding to the semantic relationship rule r_ij; S24, updating posterior parameters based on a feedback loss from the historical editing case database {(E^(h), Q_total^(h)) | h = 1, 2, …, N_H}, where E^(h) represents an h-th historical editing operation, Q_total^(h) represents a quality score corresponding to the h-th historical editing operation, and N_H represents a total number of historical cases in the historical editing case database; and S25, dynamically generating the dynamic semantic constraints based on the learned set of semantic relationship rules and the scene environmental features, wherein the dynamic semantic constraints comprise a support constraint, a stability constraint, and a compatibility constraint.
- 3. The method as claimed in claim 2, wherein in the step S24, formulas for updating the posterior parameters are expressed as follows:
α_ij^(t+1) = α_ij^(t) + (1 − L_feedback^(h)) · κ
β_ij^(t+1) = β_ij^(t) + L_feedback^(h) · κ
r_ij^(t+1) = α_ij^(t+1) / (α_ij^(t+1) + β_ij^(t+1))
where L_feedback^(h) = 1 − Q_total^(h) represents the feedback loss, t represents a current time step, t+1 represents an updated time step, and κ represents an update intensity factor configured to control feedback sensitivity; wherein the method further comprises: after updating the posterior parameters α_ij^(t+1) and β_ij^(t+1), taking the posterior mean r_ij^(t+1) = α_ij^(t+1) / (α_ij^(t+1) + β_ij^(t+1)) as a new rule value.
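For illustration only (not part of the claims): a minimal Python sketch of the Beta-posterior update described in claim 3. The initial pseudo-counts and the default value of κ here are assumptions, not values taken from the patent.

```python
# Illustrative sketch of the Bayesian rule update (S24 / claim 3).
# Variable names mirror the claim's symbols; kappa and the starting
# pseudo-counts are assumed values for demonstration.

def update_rule(alpha_ij: float, beta_ij: float, q_total: float, kappa: float = 1.0):
    """One Beta-posterior update from a historical case with quality score q_total."""
    l_feedback = 1.0 - q_total                  # feedback loss: L_feedback = 1 - Q_total
    alpha_ij += (1.0 - l_feedback) * kappa      # accumulate successful evidence
    beta_ij += l_feedback * kappa               # accumulate failed evidence
    r_ij = alpha_ij / (alpha_ij + beta_ij)      # posterior mean becomes the new rule value
    return alpha_ij, beta_ij, r_ij

# Example: a high-quality edit (Q_total = 0.9) pushes the rule value upward.
a, b, r = update_rule(alpha_ij=2.0, beta_ij=2.0, q_total=0.9)
```

Because the update uses the posterior mean of a Beta distribution, repeated high-quality feedback monotonically strengthens a rule while a single failure only nudges it, which is the feedback-sensitivity behavior κ is meant to control.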
- 4. The method as claimed in claim 2, wherein formulas for calculating the comprehensive semantic constraint score are expressed as follows:
S_semantic = γ_1 · C_support + γ_2 · C_stability + γ_3 · C_compatibility
C_support = [is_support_surface(f_dest) ∧ w_target ≤ w_max(f_dest)]
C_stability = [contact_area(O_target, f_dest) ≥ A_min]
C_compatibility = [r_target,env ≥ ρ_threshold]
where S_semantic represents the comprehensive semantic constraint score; C_support represents the support constraint; C_stability represents the stability constraint; C_compatibility represents the compatibility constraint; γ_1, γ_2, and γ_3 represent weight coefficients for C_support, C_stability, and C_compatibility, respectively, and satisfy γ_1 + γ_2 + γ_3 = 1; [·] represents an indicator function; f_dest represents a target support surface selected from the set of supporting surfaces; is_support_surface(·) represents a support surface determination function; w_target represents a weight of the target object; w_max(f_dest) represents a maximum load-bearing capacity of the target support surface f_dest; ∧ represents a logical AND operator; contact_area(·,·) represents a contact area calculation function; O_target represents the target object; A_min represents a minimum contact area required to ensure stability; r_target,env represents a compatibility score between the target object and the environment calculated based on the set of semantic relationship rules R_semantic; and ρ_threshold represents a compatibility determination threshold.
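For illustration only: a sketch of the claim 4 score, with the three indicator constraints reduced to booleans over precomputed scene quantities. The weight split (0.4, 0.3, 0.3) is an assumption; the claim only requires that the weights sum to 1.

```python
# Illustrative comprehensive semantic constraint score (claim 4).
# All inputs are assumed to be precomputed by the scene analysis step (S1).

def semantic_score(is_support_surface: bool, w_target: float, w_max: float,
                   contact_area: float, a_min: float,
                   r_target_env: float, rho_threshold: float,
                   gammas=(0.4, 0.3, 0.3)) -> float:
    c_support = float(is_support_surface and w_target <= w_max)   # support constraint
    c_stability = float(contact_area >= a_min)                    # stability constraint
    c_compat = float(r_target_env >= rho_threshold)               # compatibility constraint
    g1, g2, g3 = gammas                                           # must satisfy g1+g2+g3 = 1
    return g1 * c_support + g2 * c_stability + g3 * c_compat
```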
- 5. The method as claimed in claim 2, wherein in the step S3, formulas of the comprehensive loss function are expressed as follows:
L_total = ω_1 · L_orientation + ω_2 · L_semantic + ω_3 · L_interference
L_orientation = ‖α_target ⊙ (θ_pose − θ_ideal)‖²_2
L_semantic = ‖E(G_spatial) − E(Ĝ_spatial)‖_F
L_interference = Σ_{v ∈ V_voxel} exp(−r_v/τ) · [occ(v) > occ_thresh]
where L_total represents the comprehensive loss function; L_orientation represents an orientation deviation loss; L_semantic represents a graph embedding matching loss; L_interference represents a voxel semantic interference penalty; ω_1, ω_2, and ω_3 represent weight coefficients for L_orientation, L_semantic, and L_interference, respectively, and satisfy ω_1 + ω_2 + ω_3 = 1; α_target represents a semantic attribute vector of the target object; ⊙ represents element-wise multiplication; θ_pose represents a current orientation; θ_ideal represents an ideal orientation; ‖·‖²_2 represents a squared L2 norm; E(·) represents an embedding matrix generated by a graph neural network; G_spatial represents a current spatial relationship graph; Ĝ_spatial represents a predicted post-editing spatial relationship graph; ‖·‖_F represents a Frobenius norm; V_voxel represents a voxel grid partition of the three-dimensional scene; r_v represents an average compatibility score between relevant objects within a voxel v; τ represents a temperature parameter; occ(v) represents an occupancy rate of the voxel v; occ_thresh represents an occupancy threshold; and [·] represents an indicator function.
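For illustration only: a NumPy sketch of L_total from claim 5. The graph embeddings are taken as given 2-D matrices (the patent obtains them from an unspecified graph neural network), and the default τ, occupancy threshold, and equal weights are assumptions.

```python
import numpy as np

# Illustrative comprehensive loss (claim 5). emb_before / emb_after stand in
# for E(G_spatial) and E(G_hat_spatial), assumed to be 2-D embedding matrices
# of equal shape produced elsewhere by a graph neural network.

def total_loss(theta_pose, theta_ideal, a_target,
               emb_before, emb_after,
               voxel_scores, voxel_occ,
               tau=0.1, occ_thresh=0.5, w=(1/3, 1/3, 1/3)):
    # orientation deviation loss: squared L2 norm, attribute-weighted
    l_orient = np.sum((a_target * (theta_pose - theta_ideal)) ** 2)
    # graph embedding matching loss: Frobenius norm of the embedding difference
    l_semantic = np.linalg.norm(emb_before - emb_after, ord='fro')
    # voxel semantic interference penalty over occupied voxels
    l_interf = np.sum(np.exp(-voxel_scores / tau) * (voxel_occ > occ_thresh))
    return w[0] * l_orient + w[1] * l_semantic + w[2] * l_interf
```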
- 6. The method as claimed in claim 1, wherein the editing operations comprise a deletion operation, a generation operation, a movement operation, and a swap operation, wherein the deletion operation is used to define a removal area and an area of influence, the generation operation is used to determine a generation location and object parameters, the movement operation comprises a transformation matrix, and the swap operation is used to define a pair of swap transformations for two objects of the multiple object instances.
- 7. The method as claimed in claim 1, further comprising: scene reconstructing and visual consistency maintaining, comprising: defining void regions and repair masks, wherein the void regions are areas left vacant after object removal, and the repair masks are used to identify points to be repaired; performing, using a multi-modal generation network, in-situ repair on the edited three-dimensional scene, filling the void regions left after the object removal with surrounding environmental context information, and outputting a repaired point cloud and texture; and performing new location adaptation processing on the repaired point cloud and texture to generate interaction effects between the multiple object instances and the environment, comprising: calculating a set of contact points, generating shadow projection effects, simulating contact deformation, and adjusting material reflection, thereby performing the visual consistency maintaining.
- 8. The method as claimed in claim 7, wherein the scene reconstructing comprises: in-situ repairing and new location adapting; filling the void regions left after object removal using contextual information from the surrounding environment and defining the void regions and the repair masks by employing a reconstruction algorithm; and generating the interaction effects between the multiple object instances and the environment, calculating a set of contact points between the multiple object instances and the environment, generating contact shadows and minor environmental indentations, and simulating realistic physical interactions through local geometric adjustments.
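For illustration only: a minimal sketch of the contact-point step in claim 8, under the simplifying assumption that the supporting surface is a horizontal plane at height plane_z. The tolerance value is an illustrative assumption, not a value from the patent.

```python
import numpy as np

# Illustrative contact-point extraction (claim 8): object points lying within
# a small tolerance of a horizontal support plane are treated as contacts.

def contact_points(obj_points: np.ndarray, plane_z: float, tol: float = 0.005):
    """obj_points: (N, 3) array; returns the subset touching the plane z = plane_z."""
    mask = np.abs(obj_points[:, 2] - plane_z) <= tol
    return obj_points[mask]
```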
- 9. The method as claimed in claim 8, wherein the reconstruction algorithm is expressed as follows:
(P_repaired, C_repaired) = G_multi(P, C_origin, M_repair, s_guide, L_light)
where P represents a point cloud of the three-dimensional scene, P_repaired represents a repaired point cloud of the three-dimensional scene after the object removal and subsequent hole filling, C_repaired represents repaired color data associated with the repaired point cloud P_repaired, G_multi represents the multi-modal generative network, C_origin represents original texture color data, M_repair represents the repair masks, s_guide represents a semantic guiding label, and L_light represents lighting parameters.
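For illustration only: the patent does not specify the internals of G_multi, so the sketch below substitutes a simple nearest-neighbor color fill from the intact context as a stand-in for the multi-modal generative network; it shows the call interface (points, colors, repair mask in; repaired cloud and colors out), not the claimed network.

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for G_multi (claim 9): fill masked points by copying the color of
# the nearest intact context point. A placeholder technique, not the patent's
# multi-modal generative network.

def repair_scene(points: np.ndarray, colors: np.ndarray, repair_mask: np.ndarray):
    """points: (N, 3), colors: (N, 3), repair_mask: (N,) bool, True where repair is needed."""
    tree = cKDTree(points[~repair_mask])           # index the intact context points
    _, nn = tree.query(points[repair_mask], k=1)   # nearest intact neighbor per hole point
    repaired_colors = colors.copy()
    repaired_colors[repair_mask] = colors[~repair_mask][nn]
    return points, repaired_colors                 # (P_repaired, C_repaired) analogue
```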
- 10. The method as claimed in claim 1, further comprising: quality assessing and semantic constraint rule updating, comprising: calculating, using a multi-dimensional quality assessment system, a comprehensive quality score of geometric consistency, physical plausibility, and semantic consistency; adding a current editing case and the comprehensive quality score to the historical editing case database; and automatically updating, based on the comprehensive quality score as a feedback loss, the set of semantic relationship rules through a Bayesian update formula to achieve feedback learning and the semantic constraint rule updating.
- 11. The method as claimed in claim 1, wherein the step S3 further comprises: S31, receiving the editing instruction obtained from the step S1, establishing the unified parametric framework for the editing operations, and converting the editing operations into the standardized mathematical representation; S32, calculating the comprehensive semantic constraint score based on the dynamic semantic constraints obtained from the step S2 to verify the rationality of the editing operations; S33, constructing the comprehensive loss minimizing function for the verified editing operations; and S34, minimizing, using a gradient descent algorithm, the comprehensive loss minimizing function to thereby obtain the optimized editing parameters.
- 12. The method as claimed in claim 11, wherein the step S31 further comprises: for the deletion operation, defining a removal region R_remove and an area of influence R_affect, where r_affect represents a radius of influence; for the generation operation, determining a generation location p_gen and object parameters φ_gen = (s_type, d_size, θ_pose), where s_type represents a type of a generated object, d_size represents a size parameter of the generated object, and θ_pose represents a pose parameter of the generated object; for the movement operation, wherein core parameters of the movement operation comprise a transformation matrix T_move = [R|t], where a rotation matrix R and a translation vector t are transformation parameters; and for the swap operation, defining a pair of swap transformations for two objects.
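For illustration only: hypothetical parametric containers for the four operations of claim 12. The field names mirror the claim's symbols, but the classes themselves are an illustrative encoding, not the claimed unified parametric framework.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DeleteOp:
    remove_region: np.ndarray   # R_remove: points or a bounding volume to delete
    r_affect: float             # radius of influence around the removal region

@dataclass
class GenerateOp:
    p_gen: np.ndarray           # generation location, shape (3,)
    s_type: str                 # semantic type of the generated object
    d_size: np.ndarray          # size parameters
    theta_pose: np.ndarray      # pose parameters

@dataclass
class MoveOp:
    rotation: np.ndarray        # R: (3, 3) rotation matrix of T_move = [R|t]
    translation: np.ndarray     # t: (3,) translation vector of T_move = [R|t]

@dataclass
class SwapOp:
    transform_a: MoveOp         # transformation applied to the first object
    transform_b: MoveOp         # transformation applied to the second object
```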
- 13. The method as claimed in claim 11, wherein the step S34 further comprises: calculating a gradient of the comprehensive loss minimizing function L_total with respect to the editing parameters via back propagation:
∇_Θ L_total = ω_1 ∇_Θ L_orientation + ω_2 ∇_Θ L_semantic + ω_3 ∇_Θ L_interference
and updating the parameters: Θ^(k+1) = Θ^(k) − η ∇_Θ L_total, where Θ represents a set of operation parameters, ∇_Θ represents a gradient operation with respect to the set of operation parameters Θ, and L_total represents the comprehensive loss function; L_orientation represents an orientation deviation loss, calculating the deviation of the object's orientation relative to the ideal orientation; L_semantic represents a graph embedding matching loss, quantifying the expected deviation of semantic relationships before and after editing; L_interference represents a voxel semantic interference penalty, assessing the degree of spatial interference; ω_1, ω_2, and ω_3 represent weight coefficients for L_orientation, L_semantic, and L_interference, respectively, and satisfy ω_1 + ω_2 + ω_3 = 1; η represents a learning rate, k represents an iteration count, and k+1 represents a next iteration count; and in response to a convergence condition |L_total^(k+1) − L_total^(k)| < ε being met, with ε > 0 representing a convergence threshold, or a maximum number of iterations being reached, outputting the optimized editing parameters Θ*.
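For illustration only: a minimal gradient-descent loop for claim 13. The patent obtains gradients via back-propagation; this sketch swaps in central finite differences so it stays self-contained, and assumes Θ is a 1-D NumPy array and loss_fn is any scalar-valued callable implementing L_total. The step sizes and defaults are assumptions.

```python
import numpy as np

def optimize(theta: np.ndarray, loss_fn, lr=0.01, eps=1e-6, max_iters=500):
    """Minimize loss_fn by gradient descent with the claim's stopping rule."""
    prev = loss_fn(theta)
    for _ in range(max_iters):
        grad = np.zeros_like(theta)
        for i in range(theta.size):              # finite-difference gradient (stand-in
            d = np.zeros_like(theta)             # for the patent's back-propagation)
            d[i] = 1e-5
            grad[i] = (loss_fn(theta + d) - loss_fn(theta - d)) / 2e-5
        theta = theta - lr * grad                # update: theta_(k+1) = theta_(k) - eta * grad
        cur = loss_fn(theta)
        if abs(cur - prev) < eps:                # convergence: |L_(k+1) - L_(k)| < epsilon
            break
        prev = cur
    return theta                                 # optimized editing parameters
```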
- 14. The method as claimed in claim 1, wherein the step S4 comprises: establishing the unified editing operation execution engine to perform transformations on the three-dimensional scene based on the optimized editing parameters, comprising: conducting a target object extraction phase to separate a target object point cloud O_target from the three-dimensional scene, and recording contextual information of the surrounding environment C_context = {p_i ∈ P | ‖p_i − centroid(O_target)‖ ≤ r_context}, where r_context represents a context radius parameter, O_target represents the target object point cloud, P represents the point cloud data of the three-dimensional scene, and p_i represents an i-th point in the point cloud data P; executing transformation operations corresponding to the editing operations; synchronously updating supporting surface information F′ = {f′_k | k = 1, 2, …, K′_f}, where K′_f represents a total number of supporting surfaces after the transformation operations, f′_k represents a k-th supporting surface in the edited scene, and k represents an index of the supporting surface, configured to identify different supporting surfaces; re-detecting horizontal supporting surfaces in the edited three-dimensional scene and updating a load-bearing capacity and available area information for each horizontal supporting surface; and defining an edited complete scene as P_edited = P_new ∪ {O′_target} ∪ {environment point cloud}, where P_new represents new point cloud data after the editing operation, O′_target represents the transformed point cloud of the target object, and the environment point cloud represents a background environmental point cloud in the scene excluding the target object, including static environmental elements such as walls and floors.
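For illustration only: a direct sketch of the context-recording step in claim 14, collecting all scene points within r_context of the target object's centroid. Function and argument names are illustrative.

```python
import numpy as np

# Illustrative C_context extraction (claim 14): points of the scene cloud P
# within radius r_context of centroid(O_target).

def extract_context(scene_points: np.ndarray, target_points: np.ndarray,
                    r_context: float) -> np.ndarray:
    """scene_points: (N, 3); target_points: (M, 3); returns the context subset."""
    centroid = target_points.mean(axis=0)                      # centroid(O_target)
    dists = np.linalg.norm(scene_points - centroid, axis=1)    # ||p_i - centroid||
    return scene_points[dists <= r_context]                    # C_context
```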
- 15. The method as claimed in claim 14, wherein the executing transformation operations corresponding to the editing operations comprises: for the generation operation, creating a new object O_new = G_shape(φ_gen, p_gen) via a shape generation network G_shape and executing P_new = P ∪ O_new, where φ_gen represents the object parameters, p_gen represents target position coordinates for object generation, G_shape represents the shape generation network, configured to generate a corresponding three-dimensional geometric shape based on a semantic type and parameters, P_new represents new point cloud data after the editing operation, O_new represents point cloud data of the newly created object, and P represents the original point cloud data of the three-dimensional scene before the editing operation; for the movement operation, applying a calculated transformation matrix to the target object to obtain a point cloud of the moved object O′_target; and for the swap operation, simultaneously applying two transformation matrices to two objects to obtain transformed objects O′_1 and O′_2.
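For illustration only: the movement operation of claim 15 applied to a point cloud, using the T_move = [R|t] parameters from claim 12. A swap is just this transform applied once per object.

```python
import numpy as np

# Illustrative move operation (claim 15): apply T_move = [R | t] to the
# target object's points. R and t are the claim's transformation parameters.

def move_object(obj_points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """obj_points: (N, 3); R: (3, 3) rotation; t: (3,) translation."""
    return obj_points @ R.T + t   # rotate each point, then translate

# A swap applies two such transforms, one per object:
# obj1_moved = move_object(obj1, R_12, t_12)
# obj2_moved = move_object(obj2, R_21, t_21)
```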
- 16 . A system for the semantic perception editing of objects in a three-dimensional scene configured to perform the method as claimed in claim 1 , comprising: a three-dimensional scene analyzing module, configured to analyze, using the three-dimensional instance segmentation network, the input point cloud data of the three-dimensional scene to recognize the multiple object instances and the semantic labels of the multiple object instances in the three-dimensional scene, extract the global spatial structural information of the three-dimensional scene and the scene environmental features, and determine the current position of the target object in the three-dimensional scene; a semantic relationship learning module, configured to learn, using the Bayesian dynamic updating mechanism, the set of semantic relationship rules between the multiple object instances, establish an object semantic attribute database, and dynamically generate the dynamic semantic constraints based on the set of semantic relationship rules and the scene environmental features; and an editing operation execution module, configured to receive the editing instruction and perform parametric processing, verify the rationality of the editing operations based on the dynamic semantic constraints, optimize the verified editing operations through the comprehensive loss minimizing function, and execute the corresponding editing operations to generate the edited three-dimensional scene; wherein each of the three-dimensional scene analyzing module, the semantic relationship learning module, and the editing operation execution module is embodied by at least one processor and at least one memory coupled to the at least one processor, and the at least one memory is stored with computer programs executable by the at least one processor.
- 17. The system as claimed in claim 16, further comprising: a dynamic constraint verification module, configured to generate the dynamic semantic constraints and calculate the comprehensive semantic constraint score; a loss function optimization module, configured to construct the comprehensive loss minimizing function, wherein the comprehensive loss minimizing function comprises orientation deviation, semantic matching, and voxel interference terms; and a scene updating module, configured to reconstruct the set of supporting surfaces and update the spatial relationship graph; wherein each of the dynamic constraint verification module, the loss function optimization module, and the scene updating module is embodied by the at least one processor and the at least one memory coupled to the at least one processor, and the at least one memory is stored with the computer programs executable by the at least one processor.
- 18. The system as claimed in claim 16, wherein the editing operation execution module comprises a scene reconstructor and a quality assessment component, the scene reconstructor is configured to repair void regions and generate physical interaction effects through a multi-modal generative network, thereby ensuring visual consistency of the editing results, and the quality assessment component is configured to evaluate a quality of the edited three-dimensional scene from geometric, physical, and semantic perspectives and convey feedback information to the semantic relationship learning module.
- 19. A non-transitory computer-readable storage medium stored with a computer program, wherein the computer program is configured to, when executed by a processor, implement the method for semantic perception editing of the objects in the three-dimensional scene as claimed in claim 1.
- 20. A method for semantic perception editing of objects in a three-dimensional scene, comprising the following steps: S1, analyzing, using a three-dimensional instance segmentation network, input point cloud data of the three-dimensional scene to obtain multiple object instances and semantic labels of the multiple object instances, a set of supporting surfaces of the multiple object instances, a spatial relationship graph among the multiple object instances, and scene environmental features, and determining a current position and a bounding box of a target object based on an editing instruction; S2, learning, using a Bayesian dynamic updating mechanism, a set of semantic relationship rules between the multiple object instances from a historical editing case database, and dynamically generating dynamic semantic constraints based on the set of semantic relationship rules and the scene environmental features, comprising: S21, establishing a database A = {(s_i, α_i) | s_i ∈ S} associating the semantic labels of the multiple object instances with semantic attribute vectors of the multiple object instances, where s_i represents a corresponding semantic label of an i-th object instance of the multiple object instances, S represents a set of the semantic labels, and α_i represents a semantic attribute vector of the i-th object instance, which includes a weight, a volume, a material hardness, a surface roughness, and a thermal conductivity; S22, defining the set of semantic relationship rules R_semantic = {r_ij | r_ij: s_i × s_j → [0,1], s_i, s_j ∈ S} among the multiple object instances, where r_ij represents a compatibility score function between the semantic label s_i and a semantic label s_j of a j-th object instance of the multiple object instances, i.e., a semantic relationship rule between the i-th object instance and the j-th object instance; S23, treating the semantic relationship rule r_ij as a posterior probability following a Beta distribution Beta(α_ij, β_ij), where α_ij represents an accumulated parameter of successful evidence corresponding to the semantic relationship rule r_ij, and β_ij represents an accumulated parameter of failed evidence corresponding to the semantic relationship rule r_ij; S24, updating posterior parameters based on a feedback loss from the historical editing case database {(E^(h), Q_total^(h)) | h = 1, 2, …, N_H}, where E^(h) represents an h-th historical editing operation, Q_total^(h) represents a quality score corresponding to the h-th historical editing operation, and N_H represents a total number of historical cases in the historical editing case database; and S25, dynamically generating the dynamic semantic constraints based on the learned set of semantic relationship rules and the scene environmental features, wherein the dynamic semantic constraints comprise a support constraint, a stability constraint, and a compatibility constraint; S3, converting, using a unified parametric framework, editing operations into a standardized mathematical representation, calculating a comprehensive semantic constraint score based on the dynamic semantic constraints to verify rationality of the editing operations, thereby determining verified editing operations, and optimizing, based on a comprehensive loss minimizing function, the verified editing operations to obtain optimized editing parameters; and S4, transforming, using a unified editing operation execution engine, the three-dimensional scene based on the optimized editing parameters, and updating the spatial relationship graph and the set of supporting surfaces to obtain the edited three-dimensional scene.
Description
TECHNICAL FIELD

The disclosure relates to the field of three-dimensional editing technologies, and particularly to a method and system for semantic perception editing of objects in a three-dimensional scene.

BACKGROUND

With the rapid development of embodied intelligence and robotics, scene perception and understanding capabilities of robots in complex three-dimensional environments have become a key technological bottleneck. To train and test the scene perception algorithms of robots, a large number of diverse three-dimensional scene datasets are required. However, traditional data collection methods are costly and offer limited scene diversity. Three-dimensional scene editing technologies provide an important means to expand training datasets for embodied intelligence. However, existing three-dimensional scene editing is mainly based on geometric transformations and lacks understanding of, and constraints on, semantic relationships between objects. When performing editing operations such as object movement, existing three-dimensional scene editing often does not consider physical plausibility and semantic consistency, which can easily lead to unreasonable phenomena such as floating objects, interpenetration, or semantic mismatches. These problems seriously affect the quality of the generated datasets and the training effectiveness of robot scene perception algorithms.

The Chinese patent with the publication No. CN117237575A discloses a method for indoor scene generation, a control device, and a readable storage medium. The patent achieves automated generation of virtual indoor scene data through a technical solution that involves constructing a static three-dimensional scene based on a virtual scene, generating a task trajectory according to a task type of the robot, performing collision checks to obtain check results, and ultimately generating an indoor scene. This method employs a hierarchical construction, which includes the random placement of environmental layout assets, large virtual scene assets, small virtual scene assets, and decorative virtual scene assets. It ensures the validity of the scene by performing collision checks based on a task trajectory of the robot and generates various data formats such as red, green, and blue (RGB) images, depth maps, and semantic information through a physics simulation engine. However, the patent mainly focuses on the automated generation of scenes and lacks the ability to intelligently edit existing scenes. It cannot perform semantic perception-based object operations and scene modifications on existing three-dimensional scenes according to user instructions.

SUMMARY

In view of the above, the disclosure provides a method and system for semantic perception editing of objects in a three-dimensional scene. By establishing a semantic relationship learning mechanism and a multi-dimensional constraint verification system, the disclosure achieves intelligent editing operations on objects in three-dimensional scenes, ensuring that the editing results meet the requirements of physical plausibility and semantic consistency. This provides high-quality and diverse training datasets for embodied intelligence and robot scene perception algorithms, enhancing the ability of robots to understand and interact with complex three-dimensional environments. The technical solutions of the disclosure are as follows.
In one aspect, the disclosure provides a method for semantic perception editing of objects in a three-dimensional scene, which includes steps S1-S4. In the step S1, input point cloud data of a three-dimensional scene is analyzed by using a three-dimensional instance segmentation network to obtain multiple object instances and semantic labels of the multiple object instances, a set of supporting surfaces of the multiple object instances, a spatial relationship graph among the multiple object instances, and scene environmental features, and a current position and a bounding box of a target object are determined based on an editing instruction. In the step S2, a set of semantic relationship rules between the multiple object instances is learned from a historical editing case database by using a Bayesian dynamic updating mechanism, and dynamic semantic constraints are dynamically generated based on the set of semantic relationship rules and the scene environmental features. In the step S3, editing operations are converted into a standardized mathematical representation by using a unified parametric framework, a comprehensive semantic constraint score is calculated based on the dynamic semantic constraints to verify rationality of the editing operations, thereby determining verified editing operations, and the verified editing operations are optimized based on a comprehensive loss minimizing function to obtain optimized editing parameters. In the step S4, the three-dimensional scene is transformed based on the optimized editing parameters by using a unified editing operation execution engine, and the spatial relationship graph and the set of supporting surfaces are updated to obtain the edited three-dimensional scene.