CN-121981030-A - Bidirectional data synthesis method and system based on register transmission level code
Abstract
The application discloses a bidirectional data synthesis method based on register transmission level codes, which comprises the steps of collecting a register transmission level RTL code dataset in the real world, analyzing signals in each RTL data, decomposing each verification specification in the RTL data into a plurality of natural language properties, inputting each natural language property and the corresponding RTL code into a large language model to generate an assertion SVA corresponding to the natural language property, strengthening semantic consistency between the natural language property and the SVA by using bidirectional screening, executing two translations from the SVA to the natural language to the SVA, verifying logical equivalence of the two translations SVA based on formal equivalence check aiming at the translated bidirectional language, and filtering out unequal data to generate a natural language and SVA pair with the corresponding RTL code. The application realizes obtaining a large number of SVA data sets which are high-quality and accord with real world distribution, and improves the nature language property in the data sets and the semantic consistency between SVAs.
Inventors
- HU XING
- WU YUTONG
- JIN PENGWEI
- HUANG DI
- ZHAO YONGWEI
- ZHANG RUI
- ZHANG XISHAN
- Du Zidong
- GUO QI
Assignees
- 中国科学院计算技术研究所
Dates
- Publication Date
- 20260505
- Application Date
- 20260203
Claims (10)
- 1. A method for bi-directional data synthesis based on register transfer level codes, the method comprising: collecting a register transmission level RTL code dataset in the real world, analyzing signals in each RTL data, decomposing each verification specification in the RTL data into a plurality of natural language properties, inputting each natural language property and a corresponding RTL code into a large language model, and generating an assertion SVA corresponding to the natural language property; And a step of bi-directionally screening data, which is to use bi-directional screening to strengthen the nature of natural language and semantic consistency between SVAs, execute two translations from SVAs to natural language to SVAs, verify the logical equivalence of the two translations of SVAs based on formal equivalence check aiming at the bi-directional language after translation and filter out unequal data to generate natural language and SVA pairs with corresponding RTL codes.
- 2. The method of bi-directional data synthesis based on register transfer level codes as recited in claim 1, further comprising: A data enhancement step, namely enhancing the data quality of the natural language and SVA pair by using a data enhancement technology; and a step of supervising and fine-tuning, namely fine-tuning the large language model by using the obtained training data of the natural language and SVA.
- 3. A bi-directional data synthesis method based on register transfer level code according to claim 1 or 2, wherein said predicate synthesis step comprises: SVA synthesis is carried out by adopting an RTL code open source data set, wherein the data set comprises an RTL code and a verification specification generated by a large model; Analyzing signals in each RTL code, screening out a plurality of examples containing clock and reset signals, decomposing each verification specification into a plurality of independent natural language properties by using a large language model, and generating a plurality of natural language properties; And inputting each synthesized natural language property and the corresponding RTL code into the large language model again, generating an assertion SVA corresponding to the natural language property, and primarily screening verifiable assertion SVA data.
- 4. The bi-directional data synthesis method based on register transfer level codes according to claim 1 or 2, wherein said bi-directional screening data step comprises: Translation of an asserted SVA to natural language and then to an asserted SVA, translating the asserted SVA from a large language model to natural language properties, and then re-translating back to another asserted SVA, wherein in the translation of the SVA to natural language, a few examples are used to guide the model to generate high-level natural language properties instead of directly describing signal relationships; based on the data screening of formalized equivalence check, formalized verifying the logical equivalence of two SVAs and filtering out unequal data to obtain a plurality of natural language and SVA pairs with corresponding RTL codes.
- 5. The method of bi-directional data synthesis based on register transfer level codes as recited in claim 2, wherein said data enhancement step comprises: Analyzing errors which cannot be identified by the bidirectional translation by adopting a large language model review mode combined with expert priori, and selecting natural language and SVA pairs which do not contain the errors by using a large language model; performing difficulty filtering by using a general large language model, generating a plurality of SVAs for each natural language property, removing samples of all SVAs equivalent to answers in a data set, and filtering out simple data; Adding a long reasoning track to the data set by using the large language model with the reasoning enhancement, and re-inputting each natural language property in the data set into the large language model to obtain a reasoning track and a final SVA answer.
- 6. The method of bi-directional data synthesis based on register transfer level code of claim 2, wherein said step of supervisory fine tuning comprises: Training data of a plurality of natural language and SVA pairs are obtained in the data synthesis process to finely tune a general large language model, RTL codes and natural language properties are spliced to be used as the input of the model, the format of the large language model is followed, an inference track is placed between special marks, and an SVA answer is placed after the special marks.
- 7. A bidirectional data synthesizing system based on a register transfer level code, employing the bidirectional data synthesizing method based on a register transfer level code as recited in any one of claims 1 to 6, said system comprising: the assertion synthesis module is used for collecting a register transmission level RTL code dataset in the real world, analyzing signals in each RTL data, decomposing each verification specification in the RTL data into a plurality of natural language properties, inputting each natural language property and a corresponding RTL code into a large language model, and generating assertion SVA corresponding to the natural language property; And the bidirectional screening data module is used for strengthening the nature of natural language and semantic consistency between SVAs by bidirectional screening, executing twice translation from SVA to natural language to SVA, verifying the logical equivalence of twice translated SVA based on formal equivalence check aiming at the translated bidirectional language, filtering out unequal data, and generating natural language and SVA pairs with corresponding RTL codes.
- 8. The bi-directional data synthesis system based on register transfer level code of claim 7, wherein said system further comprises: The data enhancement module is used for enhancing the data quality of the natural language and SVA pair by using a data enhancement technology; And the supervision fine tuning module is used for fine tuning the large language model by obtaining training data of the natural language and SVA.
- 9. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the method for bi-directional data synthesis based on register transfer level code as claimed in any one of claims 1-6.
- 10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the method of bi-directional data synthesis based on register transfer level code as claimed in any one of claims 1 to 6.
Description
Bidirectional data synthesis method and system based on register transmission level code Technical Field The invention relates to the field of large language models and hardware verification, in particular to a bidirectional data synthesis technical scheme based on register transmission level RTL codes. Background Formal verification based on assertions plays a key role in digital hardware design flow. Based on natural language specifications and RTL (REGISTER TRANSFER LEVEL ) code, a verification engineer needs to write a logic assertion, systemVerilog Assertions (SVA, systemVerilog assertion), and with the aid of a simulation or formal verification tool (e.g., jasperGold) ensure that the RTL code meets the constraints corresponding to these SVAs. Since the writing of high quality SVAs requires a great deal of manual work and expertise, the automatic generation of SVAs has become an important research direction. Traditional rule-based methods (e.g., HARM, ARTmine) mine SVAs from simulated waveforms, but rely on existing "correct RTL" for difficult application in real industrial verification scenarios. Much recent research has used Large Language Model (LLM) analysis specifications and RTL to express validation properties in natural language and translate them into corresponding SVAs. The prior art relies heavily on a generic large language model (e.g., deepSeek-V3.1) for natural language to SVA translation, which has the following problems: because RTL and hardware-form verification code in the published data are rare, the general large language model often lacks the background knowledge required for hardware verification and therefore performs poorly on the specialized task of SVA generation. In addition, advanced large language models (such as GPT-5 or DeepSeek-R1) either require external interfaces to be invoked, fail to meet the privacy requirements of hardware companies, or require large amounts of computing resources for deployment, which is cost prohibitive. Therefore, it is important to develop a large language model that can both accurately generate SVAs and afford deployment costs. Therefore, it is needed to study a method for synthesizing bidirectional data based on RTL to train a large language model dedicated to hardware assertion generation, which can solve the problems that the general large language model lacks hardware verification knowledge and is difficult to deploy. Disclosure of Invention In order to solve the problems that a general large language model lacks hardware verification knowledge and is difficult to deploy in the prior art, the application provides a bidirectional data synthesis method based on RTL (real time kinematic) so as to train the large language model special for hardware assertion generation. In a first aspect, an embodiment of the present application provides a bidirectional data synthesis method based on a register transmission level code, where the method includes: Collecting a register transmission level RTL code dataset in the real world, analyzing signals in each RTL data, decomposing each verification specification in the RTL data into a plurality of natural language properties, inputting each natural language property and a corresponding RTL code into a large language model, and generating an assertion SVA corresponding to the natural language property; And a step of bi-directionally screening data, which is to use bi-directional screening to strengthen the nature of natural language and semantic consistency between SVAs, execute two translations from SVAs to natural language to SVAs, verify the logical equivalence of the two translations of SVAs based on formal equivalence check aiming at the bi-directional language after translation and filter out unequal data to generate natural language and SVA pairs with corresponding RTL codes. In an embodiment of the present invention, the method for synthesizing bidirectional data based on a register transmission level code is characterized in that the method further includes: a data enhancement step, namely enhancing the data quality of the natural language and SVA pair by using a data enhancement technology; And a step of supervision fine tuning, namely fine tuning the training data into a large language model by using the obtained natural language and SVA. In an embodiment of the present invention, the above-mentioned assertion synthesis step includes: SVA synthesis is carried out by adopting an RTL code open source data set, wherein the data set comprises an RTL code and a verification specification generated by a large model; Analyzing signals in each RTL code, screening out a plurality of examples containing clock and reset signals, decomposing each verification specification into a plurality of independent natural language properties by using a large language model, and generating a plurality of natural language properties; and inputting each synthesized natural language property and the corresponding RTL cod