CN-121980575-A - Software-entity-based automated vulnerability-to-code-repository mapping framework, vulnerability code snapshot acquisition method, system and storage medium
Abstract
The invention relates to a software-entity-based automated vulnerability-to-code-repository mapping framework, a vulnerability code snapshot acquisition method, a system and a storage medium. The method comprises the following steps: step S101, collecting vulnerability-related information and performing link filtering and verification to obtain an initial CVE-to-code-repository mapping set; step S102, grouping CVEs by their associated software entity to obtain a CVE group for each software entity, constructing a mapping between the software entity and a code repository, and applying that mapping to the other CVEs in the group; step S103, locating the precise code state that contains the vulnerability; step S104, integrating the result of step S103 and outputting the final structured data; and step S105, updating and verifying the knowledge base of software-entity-to-code-repository mappings as new CVE records continue to arrive, so that the system evolves continuously. The invention addresses bottlenecks of the prior art, namely low coverage, isolated per-CVE processing and the lack of version-level code snapshots.
Inventors
- GU ZHAOQUAN
- ZHANG CHENHUI
- WANG HAIYAN
- ZHU JUNYI
Assignees
- 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院)
Dates
- Publication Date
- 20260505
- Application Date
- 20260409
Claims (9)
- 1. A software-entity-based automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition method, characterized by comprising the following steps: Step S101, multi-source data fusion and direct mapping: collecting vulnerability-related information from a plurality of data sources, and performing link filtering and verification on the collected data to obtain an initial CVE-to-code-repository mapping set; Step S102, software entity grouping and heuristic mapping: grouping the CVEs by their associated software entities to obtain a CVE group for each software entity, constructing a mapping between the software entity and a code repository from the CVEs in the group that already have a mapping, and applying that mapping to the other CVEs in the group; Step S103, precise version localization: associating the affected-version information disclosed in the CVE record with a specific release version in the code repository, so as to locate the specific version code snapshot that contains the vulnerability; Step S104, vulnerability mapping generation: integrating the result of step S103 and outputting final structured data, wherein the final structured data comprises a CVE-to-repository mapping table and a set of URLs from each CVE to its specific version code snapshots; Step S105, knowledge base iteration and system evolution: updating and verifying the knowledge base of software-entity-to-code-repository mappings as new CVE records continue to arrive, so as to support both the mapping of new CVEs and the retrospective enhancement of historical mappings, thereby realizing continuous evolution of the system.
- 2. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition method of claim 1, wherein step S101 comprises: Step S101-1, multi-source data acquisition: acquiring CVE records, reference links and affected-version information from a plurality of data sources, wherein the data sources comprise a Common Vulnerabilities and Exposures (CVE) database, GitHub Security Advisories, the open source vulnerability database OSV, an exploit database and GitHub repositories; Step S101-2, link filtering and verification: filtering out links that do not point to a code repository by means of a keyword mechanism, verifying the availability of the remaining links, and resolving redirects so that the links remain durable; and Step S101-3, active discovery: for CVEs that have no explicit link mapping after filtering, initiating a targeted GitHub search using the vendor-product combination derived from the CPE, so as to compensate for the missing explicit link mapping.
- 3. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition method of claim 1, wherein step S102 comprises: Step S102-1, entity grouping: normalizing the (vendor, product) tuples derived from the CPE data of the CVEs, and placing the CVEs of the same entity into one group; Step S102-2, intra-group mapping check: traversing each entity group generated in step S102-1 and checking whether any CVE entry in the group has established a mapping to a GitHub repository in step S101; if such a mapping exists, executing step S102-3; if no CVE in the group has a mapping, skipping the entity group; and continuing with the next group until all entity groups have been traversed; Step S102-3, intra-group inference: if any CVE in the group has established a repository link that passes the verification rules, causing all other CVEs in the group to inherit that mapping automatically, based on the rule that the vulnerabilities of one software entity are concentrated in its main code repository.
- 4. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition method of claim 1, wherein step S103 comprises: Step S103-1, version collection: automatically acquiring all historical tags and branches of the target code repository through the GitHub API, and constructing a complete repository version list to serve as the matching target library; Step S103-2, two-layer matching algorithm: the first layer is exact matching, in which the affected version number explicitly stated in the CVE record is matched directly against the version identifiers of the repository; if exact matching fails, the second layer, multi-pattern fault-tolerant matching, is triggered automatically, in which the system generates common naming variants from the initial version number and searches the version library again using those variants; for each successfully matched repository version, the system automatically generates and outputs the official GitHub download link for the snapshot of the corresponding vulnerable version.
- 5. A software-entity-based automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition system, comprising: a multi-source data fusion and direct mapping unit, configured to collect vulnerability-related information from a plurality of data sources and to perform link filtering and verification on the collected data to obtain an initial CVE-to-code-repository mapping set; a software entity grouping and heuristic mapping unit, configured to group the CVEs by their associated software entities to obtain a CVE group for each software entity, to construct a mapping between the software entity and a code repository from the CVEs in the group that already have a mapping, and to apply that mapping to the other CVEs in the group; a precise version localization unit, configured to associate the affected-version information disclosed in the CVE record with a specific release version in the code repository, so as to locate the specific version code snapshot that contains the vulnerability; a vulnerability mapping generation unit, configured to integrate the result of the precise version localization unit and to output final structured data, wherein the final structured data comprises a CVE-to-repository mapping table and a set of URLs from each CVE to its specific version code snapshots; and a knowledge base iteration and system evolution unit, configured to update and verify the knowledge base of software-entity-to-code-repository mappings as new CVE records continue to arrive, so as to support both the mapping of new CVEs and the retrospective enhancement of historical mappings, thereby realizing continuous evolution of the system.
- 6. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition system of claim 5, wherein the multi-source data fusion and direct mapping unit comprises: a multi-source data acquisition module, configured to acquire CVE records, reference links and affected-version information from a plurality of data sources, wherein the data sources comprise a Common Vulnerabilities and Exposures (CVE) database, GitHub Security Advisories, the open source vulnerability database OSV, an exploit database and GitHub repositories; a link filtering and verification module, configured to filter out links that do not point to a code repository by means of a keyword mechanism, to verify the availability of the remaining links, and to resolve redirects so that the links remain durable; and an active discovery module, configured to initiate, for CVEs that have no explicit link mapping after filtering, a targeted GitHub search using the vendor-product combination derived from the CPE, so as to compensate for the missing explicit link mapping.
- 7. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition system of claim 5, wherein the software entity grouping and heuristic mapping unit comprises: a software entity grouping module, configured to normalize the (vendor, product) tuples derived from the CPE data of the CVEs and to place the CVEs of the same entity into one group; an intra-group mapping check module, configured to traverse each entity group generated by the software entity grouping module and to check whether any CVE entry in the group has established a mapping to a GitHub repository in the multi-source data fusion and direct mapping unit, to invoke the intra-group inference module if such a mapping exists, to skip the entity group if no CVE in the group has a mapping, and to continue with the next group until all entity groups have been traversed; and an intra-group inference module, configured to cause, if any CVE in the group has established a repository link that passes the verification rules, all other CVEs in the group to inherit that mapping automatically, based on the rule that the vulnerabilities of one software entity are concentrated in its main code repository.
- 8. The automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition system of claim 5, wherein the precise version localization unit comprises: a version collection module, configured to automatically acquire all historical tags and branches of the target code repository through the GitHub API and to construct a complete repository version list to serve as the matching target library; and a two-layer matching algorithm module, wherein the first layer is exact matching, in which the affected version number explicitly stated in the CVE record is matched directly against the version identifiers of the repository; if exact matching fails, the second layer, multi-pattern fault-tolerant matching, is triggered automatically, in which the system generates common naming variants from the initial version number and searches the version library again using those variants; and for each successfully matched repository version, the system automatically generates and outputs the official GitHub download link for the snapshot of the corresponding vulnerable version.
- 9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program configured to implement the steps of the method according to any one of claims 1-4 when called by a processor.
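For illustration only, the keyword-based link filtering of step S101-2 (claim 2) might look like the following Python sketch. The keyword set `NON_REPO_FIRST_SEGMENTS` and the function names are hypothetical and are not taken from the claims; the availability check and redirect handling recited in the claim require network access and are left out here.

```python
from typing import Dict, List, Optional, Set
from urllib.parse import urlparse

# Hypothetical keyword list: first path segments on github.com that do not
# name a repository owner (advisory pages, search pages, and so on).
NON_REPO_FIRST_SEGMENTS = {
    "advisories", "security", "orgs", "topics", "search",
    "about", "features", "marketplace", "sponsors",
}


def extract_repo(url: str) -> Optional[str]:
    """Return 'owner/repo' if the URL points into a GitHub repository, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0].lower() in NON_REPO_FIRST_SEGMENTS:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}".lower()


def direct_mapping(references: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build an initial CVE-to-repository mapping from reference links."""
    mapping: Dict[str, Set[str]] = {}
    for cve_id, urls in references.items():
        repos = {r for r in (extract_repo(u) for u in urls) if r is not None}
        if repos:
            mapping[cve_id] = repos
    return mapping
```

Commit, issue and pull-request links all reduce to the same `owner/repo` key in this sketch, so several references to one project collapse into a single mapping entry. CVEs that end up with no entry would be candidates for the active discovery of step S101-3.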
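The entity grouping and intra-group inference of steps S102-1 to S102-3 (claim 3) can be sketched as follows, again as an assumption-laden illustration. The sketch assumes CPE 2.3 formatted strings, splits them naively on `:` (escaped colons inside a field are not handled), and, when a group contains links to more than one repository, takes the most frequently linked one as the "main" repository; that tie-breaking rule is an assumption of this sketch and is not stated in the claims.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple

Entity = Tuple[str, str]  # normalized (vendor, product)


def normalize_entity(cpe: str) -> Optional[Entity]:
    """Extract a lower-cased (vendor, product) tuple from a CPE 2.3 string."""
    fields = cpe.split(":")
    # Layout: cpe:2.3:<part>:<vendor>:<product>:<version>:...
    if len(fields) < 5 or fields[0] != "cpe" or fields[1] != "2.3":
        return None
    return fields[3].strip().lower(), fields[4].strip().lower()


def propagate(cve_cpes: Dict[str, List[str]], direct: Dict[str, str]):
    """Group CVEs by entity and let unmapped CVEs inherit the group's repository.

    cve_cpes: CVE id -> list of CPE strings
    direct:   CVE id -> 'owner/repo' obtained from step S101
    Returns (entity -> repo knowledge base, CVE id -> inherited repo).
    """
    groups: Dict[Entity, Set[str]] = defaultdict(set)
    for cve_id, cpes in cve_cpes.items():
        for cpe in cpes:
            entity = normalize_entity(cpe)
            if entity is not None:
                groups[entity].add(cve_id)

    entity_repo: Dict[Entity, str] = {}
    inherited: Dict[str, str] = {}
    for entity, members in groups.items():
        counts = Counter(direct[c] for c in members if c in direct)
        if not counts:
            continue  # no CVE in this group has a mapping: skip the group
        # Assumption: the most frequently linked repository is the main one;
        # ties are broken alphabetically so the result is deterministic.
        main_repo = min(counts, key=lambda r: (-counts[r], r))
        entity_repo[entity] = main_repo
        for cve_id in members:
            if cve_id not in direct:
                inherited[cve_id] = main_repo
    return entity_repo, inherited
```

The returned `entity_repo` dictionary plays the role of the software-entity-to-repository knowledge base: a new CVE whose CPE normalizes to a known entity could be mapped by a single lookup, which is the behaviour step S105 relies on.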
Description
Software-entity-based automated vulnerability-to-code-repository mapping framework, vulnerability code snapshot acquisition method, system and storage medium
Technical Field
The invention relates to the technical field of computer software processing, and in particular to a software-entity-based automated vulnerability-to-code-repository mapping framework, a vulnerability code snapshot acquisition method, a system and a storage medium.
Background
In the prior art, the mapping of CVEs to code repositories mainly follows a "patch-driven" paradigm, whose core idea is to process individual CVE records in isolation through the explicit reference links (such as a GitHub commit hash) found in vulnerability databases. Representative works include:
CVEfixes: as an early representative, this method simply extracts the direct GitHub repository links from the "references" field of the NVD and builds a CVE-to-code-repository mapping from these pre-existing links. Its coverage is low because it relies entirely on the limited explicit links in the NVD.
MoreFixes: this method extends CVEfixes and attempts to raise coverage by integrating multi-source data (e.g., adding GitHub Security Advisories, GHSA, as a supplementary source) and by simulating GitHub searches based on vendor-product names. It is one of the more advanced tools currently available, but its coverage is still limited and it remains, in essence, an aggregation of explicit links.
What these approaches have in common is a heavy reliance on the direct reference links already present in authoritative vulnerability databases (e.g., NVD, GHSA). Their core limitation is that they wait passively for the database to supply ready-made link information rather than actively building the mapping.
The prior art has the following key problems and defects, which severely restrict the scale and efficiency of vulnerability analysis:
1. Coverage is very low and subject to a ceiling effect. Because most CVE records lack direct reference links, the paradigm that relies on explicit links has already reached its ceiling. The root cause is that existing methods depend passively on the information already in the database and cannot process CVE records with sparse information.
2. The isolated processing mode results in poor scalability. Existing tools process each CVE as an independent, atomic record and do not exploit the relationships among CVEs that belong to the same software entity (e.g., a vendor-product pair). For example, if a piece of software (e.g., Apache Tomcat) has several CVEs that were mapped successfully, but a new CVE cannot be mapped because its description is terse or it lacks a direct link, the prior art cannot use the known mappings within the group for inference. This makes the mapping process inefficient.
3. The mapping granularity is insufficient and a code snapshot layer is missing. Downstream tasks (such as vulnerability reproduction and patch verification) need precise version snapshots. However, existing datasets (e.g., CVEfixes) rarely provide a mapping from CVEs to specific versions (e.g., Git tags), so researchers have to locate versions manually, which is time-consuming and error-prone and limits the feasibility of large-scale security studies.
Disclosure of Invention
The invention provides a software-entity-based automated vulnerability-to-code-repository mapping framework, a vulnerability code snapshot acquisition method, a system and a storage medium, and aims to overcome the above defects of the prior art.
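To make the version-level mapping concrete, the two-layer matching recited in claim 4 (exact match first, then fault-tolerant matching on common naming variants) can be sketched in Python as below. The prefix and separator patterns are assumptions chosen for illustration, since the claims do not enumerate the variants, and the tag list is passed in directly rather than fetched through the GitHub API. The snapshot link uses GitHub's `archive/refs/tags/<tag>.zip` download URL form.

```python
from typing import Iterable, List, Optional
from urllib.parse import quote

# Hypothetical naming patterns; the claims only speak of "common naming variants".
PREFIXES = ["", "v", "release-", "release_", "rel-"]


def variant_candidates(version: str) -> List[str]:
    """Generate candidate tag names for a version number, without duplicates."""
    v = version.strip()
    bare = v[1:] if v.lower().startswith("v") else v
    stems = [bare, bare.replace(".", "_"), bare.replace(".", "-")]
    seen, out = set(), []
    for stem in stems:
        for prefix in PREFIXES:
            cand = prefix + stem
            if cand not in seen:
                seen.add(cand)
                out.append(cand)
    return out


def match_version(version: str, tags: Iterable[str]) -> Optional[str]:
    """Layer 1: exact match. Layer 2: case-insensitive match on naming variants."""
    tag_list = list(tags)
    if version in tag_list:
        return version
    lookup = {t.lower(): t for t in tag_list}
    for cand in variant_candidates(version):
        hit = lookup.get(cand.lower())
        if hit is not None:
            return hit
    return None


def snapshot_url(repo: str, tag: str) -> str:
    """Build the GitHub archive download link for a tagged snapshot."""
    return f"https://github.com/{repo}/archive/refs/tags/{quote(tag)}.zip"


tags = ["v1.2.3", "release-2.0.0"]
matched = match_version("2.0.0", tags)
if matched is not None:
    print(snapshot_url("example-owner/example-repo", matched))
```

Running the exact layer first keeps the tolerant layer from overriding a tag that already matches literally; a version that matches neither layer yields `None` and would be left without a snapshot link.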
The invention provides a software-entity-based automated vulnerability-to-code-repository mapping framework and vulnerability code snapshot acquisition method, which comprises the following steps: Step S101, multi-source data fusion and direct mapping: collecting vulnerability-related information from a plurality of data sources, and performing link filtering and verification on the collected data to obtain an initial CVE-to-code-repository mapping set; Step S102, software entity grouping and heuristic mapping: grouping the CVEs by their associated software entities to obtain a CVE group for each software entity, constructing a mapping between the software entity and a code repository from the CVEs in the group that already have a mapping, and applying that mapping to the other CVEs in the group; Step S103, precise version localization: associating the affected-version information disclosed in the CVE record with a specific release version in the code repository, so as to locate the specific version code snapshot that contains the vulnerability; Step S104, vulnerability mapping generation: integrating the result of step S103 and outputting final structured data, wherein the final structured data comprises a CVE-to-repository mapping table and a set of URLs from each CVE to its specific version code snapshots; Step S105, knowledge base iteration and system evolution: updating and verifying the mapping knowledge base of the software entity and the code repository based