CN-121980241-A - Method and system for automatically generating training data set
Abstract
The invention provides a method and a system for automatically generating a training data set in the technical field of artificial intelligence (AI). The method comprises: obtaining building specifications under different scenes and labeling them to obtain first-class data; obtaining CAD drawings under different scenes and labeling them to obtain second-class data; performing feature extraction on the first-class data with a building data processing model and on the second-class data with a building drawing processing model to obtain first-level, second-level, and third-level features; constructing a core component library and a secondary component library based on the features of each level; constructing combination rules for different scenes based on the features of each level, the core component library, and the secondary component library; presetting a generation scene, loading the corresponding combination rules, randomly generating a CAD model and CAD raw data according to those rules while synchronously generating the three-dimensional label data corresponding to all components; and repeating these steps until the CAD model data set meets the requirements. The method can quickly construct massive, diverse training data.
Inventors
- YANG XUN
- LIU BIN
Assignees
- 重庆特斯联启智科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260121
Claims (10)
- 1. A method of automatically generating a training data set, comprising the steps of: acquiring building specifications under different scenes and labeling them to obtain first-class data, and acquiring CAD drawings under different scenes and labeling them to obtain second-class data, wherein the labeled contents comprise scenes and scene grades; performing feature extraction on the first-class data with a constructed building data processing model and on the second-class data with a constructed building drawing processing model to obtain first-level features, second-level features, and third-level features, wherein each later level of features adds refined scene features on the basis of the previous level; constructing a core component library and a secondary component library based on each level of features, wherein the priority of the core component library is higher than that of the secondary component library, and every component comprises a component type, CAD (computer-aided design) raw data, and three-dimensional label data; constructing combination rules for different scenes based on each level of features, the core component library, and the secondary component library, wherein the combination rules comprise spatial rules, configuration rules, and conflict rules; and presetting a generation scene, loading the combination rules corresponding to the generation scene, randomly generating a CAD model and CAD raw data for the generation scene according to the combination rules, synchronously generating the three-dimensional label data corresponding to all components, and repeating these steps until the number of samples in the CAD model data set reaches the requirement.
- 2. The method of automatically generating a training data set of claim 1, wherein the scene grades comprise three levels: a first general level, a second scene level, and a third professional level, wherein the first general level comprises general building design specifications for each industry; the second scene level comprises general building design specifications under different scenes; and the third professional level comprises specific building design specifications for a specific scene.
- 3. The method of automatically generating a training data set of claim 1, wherein the building data processing model comprises a modified BERT-base model comprising, in order, a first input layer, an embedding layer, a first coding layer, a task head layer, and a first output layer, wherein the first input layer is used for inputting the first-class data and converting the corresponding labels into label embeddings passed to the embedding layer; the embedding layer is used for performing text semantic coding on the data from the input layer to obtain semantic vectors; the first coding layer comprises a Transformer encoder and a hierarchical attention mechanism and is used for performing feature extraction and weight enhancement on the semantic vectors to obtain semantic features fusing the weights of all scenes; the task head layer is used for performing structured extraction and classification on the semantic features to obtain the features corresponding to different feature indicators; and the first output layer outputs the data from the task head layer to obtain the first-level, second-level, and third-level features, wherein the feature indicators comprise spatial combination logic, component type, quantity, area, and position constraints.
- 4. The method of claim 1, wherein the building drawing processing model comprises, in order, a second input layer, a second coding layer, a modeling layer, a scene adaptation layer, and a second output layer, wherein the second input layer is used for inputting the second-class data and vectorizing it, while converting the corresponding labels into label embeddings passed to the second coding layer; the second coding layer comprises a multi-layer perceptron and is used for extracting geometric attributes from the data transmitted by the input layer to obtain geometric feature vectors; the modeling layer comprises a GNN model and is used for fusing the geometric feature vectors with topological relations to obtain topological feature vectors; the scene adaptation layer comprises three parallel fully connected branches and is used for performing scene weight reinforcement on the topological feature vectors based on the scene grade labels to output scene feature vectors adapted to different scene grades; and the second output layer outputs the data from the scene adaptation layer to obtain the first-level, second-level, and third-level features.
- 5. The method of claim 1, wherein the core component library corresponds to the component attributes of the first-level features, and the secondary component library corresponds to the component attributes of the second-level and third-level features; each element in the core component library and the secondary component library is composed of CAD raw data, the CAD raw data being basic data constructed from standard CAD drawings and serving as the minimum data unit for generating a CAD model; and the types of CAD raw data comprise the line types supported by CAD and the corresponding line data.
- 6. The method of automatically generating a training data set according to claim 1, wherein constructing the combination rules for different scenes based on each level of features, the core component library, and the secondary component library, the combination rules comprising spatial rules, configuration rules, and conflict rules, comprises the steps of: manually constructing core constraints, and converting the core constraints, each level of features, the core component library, and the secondary component library into rule parameters; matching, based on the core constraints, corresponding components and rules for different scenes, and generating rule sets for the different scenes; and performing conflict detection on the rule sets and correcting them with the core constraints as the highest priority to obtain conflict-free combination rules for the different scenes, wherein the spatial rules comprise the combination logic of the components in the core component library and the secondary component library; the configuration rules comprise the quantity, area, and position constraints of each kind of component; and the conflict rules comprise the priorities between components.
- 7. The method of automatically generating a training data set as claimed in claim 1, wherein randomly generating a CAD model for the generation scene according to the combination rules comprises: on the premise of satisfying the generation scene, randomly selecting components from the core component library according to the combination rules, merging and splicing the components through random edges to generate a core space, and selecting components from the secondary component library according to the configuration rules and arranging them in the core space to obtain an initial space; and detecting the initial space based on the conflict rules and adjusting it according to the priorities to obtain the CAD model.
- 8. The method of automatically generating a training data set in accordance with claim 5, wherein the generation of the CAD raw data and the three-dimensional label data comprises: recording the line types and line data generated for each space, together with the offset coordinates, scaling, and rotation angle of each component relative to the center point of its space, to obtain the CAD raw data and the three-dimensional label data.
- 9. A system for automatically generating a training data set, for implementing the method of automatically generating a training data set according to any one of claims 1-8, the system comprising: a data acquisition module, used for acquiring building specifications under different scenes and labeling them to obtain first-class data, and acquiring CAD drawings under different scenes and labeling them to obtain second-class data, wherein the labeled contents comprise scenes and scene grades; a feature extraction module, connected with the data acquisition module and used for performing feature extraction on the first-class data with a constructed building data processing model and on the second-class data with a constructed building drawing processing model to obtain first-level features, second-level features, and third-level features; a component construction module, connected with the feature extraction module and used for constructing a core component library and a secondary component library based on each level of features, wherein the priority of the core component library is higher than that of the secondary component library, and every component comprises a component type, CAD (computer-aided design) raw data, and three-dimensional label data; a rule generation module, connected with the component construction module and used for constructing combination rules for different scenes based on each level of features, the core component library, and the secondary component library, the combination rules comprising spatial rules, configuration rules, and conflict rules; and a CAD generation module, connected with the rule generation module and used for presetting a generation scene, loading the combination rules corresponding to the generation scene, randomly generating a CAD model and CAD raw data for the generation scene according to the combination rules, synchronously generating the three-dimensional label data corresponding to all components, and repeating these steps until the number of samples in the CAD model data set reaches the requirement.
- 10. An electronic device comprising a memory, a processor, and a computer program, wherein the computer program is stored in the memory and configured to be executed by the processor to implement a method of automatically generating a training data set according to any of claims 1 to 8.
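The generation loop in claims 1 and 6-8 (select core components per the combination rules, splice them into a core space, configure secondary components, and record offset, scale, and rotation alongside line data as CAD raw data and three-dimensional labels) can be sketched in Python. This is a minimal illustration only: the component libraries, rule values, and record field names below are assumptions for the sketch, not taken from the patent.

```python
import random
from dataclasses import dataclass

@dataclass
class Component:
    """Illustrative component: an axis-aligned rectangle."""
    kind: str
    width: float
    depth: float
    priority: int  # conflict rule: higher-priority components win on overlap

# Hypothetical core and secondary libraries (claim 5: elements are CAD raw data).
CORE_LIBRARY = [Component("room", 4.0, 3.0, 2), Component("corridor", 6.0, 1.5, 2)]
SECONDARY_LIBRARY = [Component("door", 0.9, 0.1, 1), Component("window", 1.2, 0.1, 1)]

# Hypothetical combination rule for one preset generation scene (claim 6):
# configuration rules bound component counts; the spatial rule used below
# splices cores edge to edge, which keeps them conflict-free by construction.
RULES = {"office": {"core_count": (2, 4), "secondary_per_core": (1, 2)}}

def rect_lines(x, y, w, d):
    """CAD raw data for one component: the four line segments of its rectangle."""
    p = [(x, y), (x + w, y), (x + w, y + d), (x, y + d)]
    return [(p[i], p[(i + 1) % 4]) for i in range(4)]

def generate_sample(scene, rng):
    """Randomly assemble one CAD model plus matching 3D label data (claims 7-8)."""
    rule = RULES[scene]
    labels, x = [], 0.0
    for _ in range(rng.randint(*rule["core_count"])):
        core = rng.choice(CORE_LIBRARY)  # spatial rule: cores form the core space
        labels.append({"kind": core.kind, "offset": (x, 0.0), "scale": 1.0,
                       "rotation": 0.0,
                       "lines": rect_lines(x, 0.0, core.width, core.depth)})
        for _ in range(rng.randint(*rule["secondary_per_core"])):
            sec = rng.choice(SECONDARY_LIBRARY)  # configuration rule: fit-out per core
            sx = x + rng.uniform(0.0, core.width - sec.width)
            labels.append({"kind": sec.kind, "offset": (sx, 0.0), "scale": 1.0,
                           "rotation": 0.0,
                           "lines": rect_lines(sx, 0.0, sec.width, sec.depth)})
        x += core.width  # advance so core spaces never overlap
    return labels

def build_dataset(scene, n_samples, seed=0):
    """Repeat generation until the data set reaches the required sample count."""
    rng = random.Random(seed)
    return [generate_sample(scene, rng) for _ in range(n_samples)]
```

Each record pairs minimal CAD raw data (line segments) with the pose fields claim 8 records for the three-dimensional label data. A fuller implementation would run the explicit conflict-detection and priority-based adjustment pass of claim 7 rather than avoiding overlaps by construction.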
Description
Method and system for automatically generating training data set
Technical Field
The disclosure relates to the technical field of artificial intelligence (AI), and in particular to a method and a system for automatically generating a training data set.
Background
At present, industries in many fields are moving toward automation and intelligence. In the digital twin field, a three-dimensional twin model often needs to be built as a 1:1 reproduction of an actual building floor layout for three-dimensional visualization and precise management, for example in house renting and selling scenes and in property management scenes for buildings and parks. To automate the path from indoor layout to three-dimensional twin, large-model technology is currently the most effective approach: a large CAD recognition model is constructed to automatically complete CAD drawing recognition and to build three-dimensional twin models of buildings, floors, rooms, and the like. However, training a large CAD recognition model faces a serious data problem: current training data is generally derived from CAD drawings of actual buildings, so the data volume is insufficient, the data diversity is small, and the data cannot be put into correspondence with three-dimensional twin models; manual construction of the data entails huge personnel and time costs; and the resulting large CAD recognition model has poor recognition capability and low generalization. To solve the problems of the small quantity of training data and the low generalization of large CAD recognition models, a method for automatically generating a training data set for floor-level CAD recognition is provided, by which massive and diverse training data can be quickly constructed.
Accordingly, there is an urgent need to develop a method and system for automatically generating a training data set to solve the above problems.
Disclosure of Invention
The present disclosure provides a method and a system for automatically generating a training data set, which solve the above problems of the small quantity of training data and the low generalization of large CAD recognition models in the prior art. According to a first aspect of the present disclosure, a method of automatically generating a training data set is provided, comprising: acquiring building specifications under different scenes and labeling them to obtain first-class data, and acquiring CAD drawings under different scenes and labeling them to obtain second-class data, wherein the labeled contents comprise scenes and scene grades; performing feature extraction on the first-class data with a constructed building data processing model and on the second-class data with a constructed building drawing processing model to obtain first-level features, second-level features, and third-level features, wherein each later level of features adds refined scene features on the basis of the previous level; constructing a core component library and a secondary component library based on each level of features, wherein the priority of the core component library is higher than that of the secondary component library, and every component comprises a component type, CAD (computer-aided design) raw data, and three-dimensional label data; constructing combination rules for different scenes based on each level of features, the core component library, and the secondary component library, wherein the combination rules comprise spatial rules, configuration rules, and conflict rules; and presetting a generation scene, loading the combination rules corresponding to the generation scene, randomly generating a CAD model and CAD raw data for the generation scene according to the combination rules, synchronously generating the three-dimensional label data corresponding to all components, and repeating these steps until the number of samples in the CAD model data set reaches the requirement. Further, the scene grades comprise three levels: a first general level, a second scene level, and a third professional level, wherein the first general level comprises general building design specifications for each industry; the second scene level comprises general building design specifications under different scenes; and the third professional level comprises specific building design specifications for a specific scene. Further, the building data processing model comprises a modified BERT-base model comprising, in order, a first input layer, an embedding layer, a first coding layer, a task head layer, and a first output layer, wherein the first input layer is used for inputting the first-class data and converting the corresponding labels into label embeddings passed to the embedding layer; the embedding layer is used for performing text semantic coding on the data transmi