CN-122020256-A - Enterprise industry classification method and system based on associated enterprise information and BERT model
Abstract
This divisional application discloses an enterprise industry classification method and system based on associated enterprise information and a BERT model, and relates to the technical field of computers. The method comprises: obtaining enterprise information of an enterprise to be classified; extracting enterprise keywords of the enterprise from the enterprise information, and determining an enterprise keyword set based on the enterprise keywords; calculating, through a pre-trained BERT model, the similarity between the enterprise keyword set and each preset industry keyword set; selecting the industry category corresponding to each industry keyword set whose similarity is greater than a similarity threshold as an alternative industry of the enterprise; if the alternative industry is not unique, determining the associated enterprises of the enterprise to be classified, processing the similarity between the associated enterprise information of the associated enterprises and each alternative industry according to a preset comprehensive industry score calculation rule to obtain a comprehensive industry score for each alternative industry, and selecting the alternative industry with the highest score as the industry to which the enterprise belongs. The accuracy of enterprise industry classification is thereby improved.
Inventors
- TANG JINGYI
- ZUO XIAOLEI
Assignees
- 企知道科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-07-14
Claims (10)
- 1. An enterprise industry classification method based on associated enterprise information and a BERT model, the method comprising the steps of: acquiring enterprise information of an enterprise to be classified; extracting enterprise keywords of the enterprise to be classified according to the enterprise information, and determining an enterprise keyword set of the enterprise to be classified based on the enterprise keywords; encoding the enterprise keyword set and the industry keyword set through a pre-trained BERT model to obtain, respectively, an enterprise word set vector of the enterprise keyword set and an industry word set vector of the industry keyword set, and then calculating the cosine similarity between the enterprise word set vector and the industry word set vector; selecting the industry category corresponding to each industry keyword set whose similarity is greater than a similarity threshold as an alternative industry of the enterprise to be classified; if the alternative industry of the enterprise to be classified is not unique, determining associated enterprises of the enterprise to be classified, and acquiring associated enterprise information of each associated enterprise, wherein the associated enterprise information comprises associated enterprise association relationship information and associated enterprise industry classification information; processing the similarity between the associated enterprise information and each alternative industry according to a preset comprehensive industry score calculation rule, and calculating a comprehensive industry score of each alternative industry; and selecting the alternative industry with the highest comprehensive industry score as the industry to which the enterprise to be classified belongs.
- 2. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 1, wherein processing the similarity between the associated enterprise information and each alternative industry according to a preset comprehensive industry score calculation rule, and calculating a comprehensive industry score of each alternative industry, specifically comprises: for one alternative industry, taking the similarity between the alternative industry and the enterprise to be classified as a first industry score; determining, from the associated enterprises of the enterprise to be classified, the same-industry associated enterprises whose industry category is the same as the alternative industry; calculating a second industry score according to the associated enterprise information of the same-industry associated enterprises; and performing a weighted calculation on the first industry score and the second industry score to obtain the comprehensive industry score of the alternative industry.
- 3. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 1, further comprising an industry keyword set creation method performed before acquiring the enterprise information of the enterprise to be classified, wherein the industry keyword set creation method specifically comprises: acquiring a national economy industry classification file; creating a plurality of industry keyword sets corresponding to the industry categories specified by the national economy industry classification file; acquiring industry keywords of the industry category corresponding to each industry keyword set; and storing the industry keywords of each industry category into the corresponding industry keyword set, thereby completing the creation of each industry keyword set.
- 4. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 3, wherein acquiring industry keywords of the industry category corresponding to each industry keyword set specifically comprises: for one industry keyword set, acquiring a first industry keyword according to the industry notes of the industry category corresponding to the industry keyword set in the national economy industry classification file; acquiring enterprise information of a same-industry enterprise in a preset enterprise database, wherein the industry of the same-industry enterprise is the same as the industry category corresponding to the industry keyword set; and acquiring a second industry keyword according to the enterprise information of the same-industry enterprise.
- 5. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 2, wherein calculating a second industry score according to the associated enterprise information of the same-industry associated enterprises specifically comprises: classifying each associated enterprise based on the associated enterprise association relationship information of each associated enterprise, and determining the same-industry associated enterprises corresponding to each of the plurality of alternative industries of the enterprise to be classified; and calculating the second industry score of an alternative industry according to the associated enterprise information of the same-industry associated enterprises of that alternative industry.
- 6. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 5, wherein calculating the second industry score of an alternative industry according to the associated enterprise information of the same-industry associated enterprises of that alternative industry specifically comprises: calculating an association degree score between each same-industry associated enterprise and the enterprise to be classified; and calculating the second industry score according to the association degree scores of all same-industry associated enterprises of the alternative industry.
- 7. The enterprise industry classification method based on associated enterprise information and a BERT model according to claim 6, wherein calculating the association degree score between a same-industry associated enterprise and the enterprise to be classified specifically comprises: determining, according to a preset cooperation relationship degree evaluation standard, the degree of association from the cooperation years, cooperation transaction amount and cooperation project quantity indicators between the same-industry associated enterprise and the enterprise to be classified, so as to obtain the association degree score.
- 8. An enterprise industry classification system based on associated enterprise information and a BERT model, the system comprising: an enterprise information acquisition module, used for acquiring enterprise information of an enterprise to be classified; an enterprise keyword extraction module, used for extracting enterprise keywords of the enterprise to be classified according to the enterprise information and determining an enterprise keyword set of the enterprise to be classified based on the enterprise keywords; a similarity calculation module, used for encoding the enterprise keyword set and the industry keyword set through a pre-trained BERT model to obtain, respectively, an enterprise word set vector of the enterprise keyword set and an industry word set vector of the industry keyword set, and then calculating the cosine similarity between the enterprise word set vector and the industry word set vector; an alternative industry determining module, used for selecting the industry category corresponding to each industry keyword set whose similarity is greater than a similarity threshold as an alternative industry of the enterprise to be classified; an associated enterprise determining module, used for determining associated enterprises of the enterprise to be classified and acquiring associated enterprise information of each associated enterprise if the alternative industry of the enterprise to be classified is not unique, wherein the associated enterprise information comprises associated enterprise association relationship information and associated enterprise industry classification information; a comprehensive score calculation module, used for processing the similarity between the associated enterprise information and each alternative industry according to a preset comprehensive industry score calculation rule, and calculating a comprehensive industry score of each alternative industry; and an industry determining module, used for selecting the alternative industry with the highest comprehensive industry score as the industry to which the enterprise to be classified belongs.
- 9. An electronic device comprising a processor (401), a memory (405), a user interface (403) and a network interface (404), the memory (405) being configured to store instructions, the user interface (403) and the network interface (404) being configured to communicate with other devices, the processor (401) being configured to execute the instructions stored in the memory (405) to cause the electronic device (400) to perform the method according to any of claims 1-7.
- 10. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when executed, perform the method steps of any of claims 1-7.
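The scoring steps recited in claims 2, 6 and 7 can be illustrated with a minimal Python sketch. The claims do not disclose the actual weights, the cooperation relationship evaluation standard, or how association degree scores are aggregated, so everything numeric here is an assumption made for illustration: the names `AssociatedEnterprise`, `association_score` and `comprehensive_scores` are hypothetical, the association degree is taken as the mean of three capped indicators, the second industry score as the mean over same-industry associated enterprises, and the weights as 0.6 and 0.4.

```python
from dataclasses import dataclass


@dataclass
class AssociatedEnterprise:
    """Hypothetical record of one associated enterprise."""
    industry: str
    cooperation_years: float
    transaction_amount: float
    project_count: int


def association_score(e, max_years=10.0, max_amount=1_000_000.0, max_projects=20):
    """Assumed evaluation standard: mean of three indicators, each capped at 1."""
    parts = [
        min(e.cooperation_years / max_years, 1.0),
        min(e.transaction_amount / max_amount, 1.0),
        min(e.project_count / max_projects, 1.0),
    ]
    return sum(parts) / len(parts)


def comprehensive_scores(similarities, associated, w_first=0.6, w_second=0.4):
    """Weighted sum of the first score (similarity) and the second score
    (assumed here to be the mean association score of same-industry peers)."""
    result = {}
    for industry, first_score in similarities.items():
        peers = [e for e in associated if e.industry == industry]
        if peers:
            second_score = sum(association_score(e) for e in peers) / len(peers)
        else:
            second_score = 0.0  # no same-industry associated enterprise
        result[industry] = w_first * first_score + w_second * second_score
    return result


# Two alternative industries that both passed the similarity threshold.
similarities = {"software": 0.82, "it_services": 0.85}
associated = [
    AssociatedEnterprise("software", 10, 1_000_000, 20),
    AssociatedEnterprise("software", 5, 500_000, 10),
    AssociatedEnterprise("it_services", 1, 100_000, 2),
]
scores = comprehensive_scores(similarities, associated)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy input the alternative industry with the slightly lower similarity ends up with the higher comprehensive score, because its same-industry associated enterprises have stronger cooperation indicators.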
Description
Enterprise industry classification method and system based on associated enterprise information and BERT model
This application is a divisional application of the patent application with application number 202310869145.4, filed on July 14, 2023 and entitled 'an enterprise industry classification method, system, equipment and medium based on big data'.
Technical Field
The application relates to the technical field of computers, and in particular to an enterprise industry classification method and system based on associated enterprise information and a BERT model.
Background
The national economy industry classification is a standardized method for classifying industries according to the characteristics of their production and operation activities. At present, the "National Economy Industry Classification" issued by the National Bureau of Statistics is the most commonly used industry classification standard. For each enterprise, the industry label is a very important field, since it reflects the enterprise's main business. Enterprises in an enterprise database therefore need to be classified by industry in order to determine their industry labels. Existing enterprise industry classification methods generally classify an enterprise based on a single indicator or a few indicators, and are easily limited by the information available about the enterprise itself, so the resulting classification is inaccurate.
Disclosure of Invention
In order to improve the accuracy of enterprise industry classification, the application provides an enterprise industry classification method and system based on associated enterprise information and a BERT model.
In a first aspect, the present application provides an enterprise industry classification method based on big data, the method comprising the steps of: acquiring enterprise information of an enterprise to be classified; extracting enterprise keywords of the enterprise to be classified according to the enterprise information, and determining an enterprise keyword set of the enterprise to be classified based on the enterprise keywords; calculating, through a preset similarity calculation model, the similarity between the enterprise keyword set and each preset industry keyword set; and selecting the industry category corresponding to each industry keyword set whose similarity is greater than a similarity threshold as the industry to which the enterprise to be classified belongs. With this technical solution, the enterprise keyword set of the enterprise to be classified is determined from its enterprise information, the industry keyword set of each industry category is determined from that category's industry keywords, and the alternative industries of the enterprise to be classified are determined by calculating the similarity between the enterprise keyword set and each industry keyword set. Because multidimensional enterprise information and industry data are considered when the enterprise keyword set and the industry keyword sets are determined, both the enterprise to be classified and the industry categories are described more accurately, which improves the accuracy of enterprise industry classification.
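The similarity and threshold steps of the first aspect can be sketched in Python as follows. This is only an illustration under stated assumptions: the vectors are toy 3-dimensional values standing in for word set vectors that a pre-trained BERT encoder would produce, the threshold of 0.7 is arbitrary, and the function names `cosine_similarity` and `select_industries` are hypothetical rather than taken from the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_industries(enterprise_vec, industry_vecs, threshold):
    """Return {industry: similarity} for industries strictly above the threshold."""
    scores = {
        name: cosine_similarity(enterprise_vec, vec)
        for name, vec in industry_vecs.items()
    }
    return {name: s for name, s in scores.items() if s > threshold}


# Toy vectors; in practice these would come from encoding the keyword sets.
enterprise_vec = [1.0, 0.0, 0.0]
industry_vecs = {
    "software": [1.0, 0.0, 0.0],
    "retail": [0.0, 1.0, 0.0],
    "it_services": [1.0, 1.0, 0.0],
}
selected = select_industries(enterprise_vec, industry_vecs, threshold=0.7)
print(sorted(selected))
```

With these toy values two industry categories exceed the threshold, which is exactly the non-unique case that the associated enterprise information is meant to resolve.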
Optionally, after selecting the industry category corresponding to each industry keyword set whose similarity is greater than the similarity threshold as the industry to which the enterprise to be classified belongs, the method further includes: judging whether the alternative industry of the enterprise to be classified is unique; if not, determining the associated enterprises of the enterprise to be classified, and acquiring associated enterprise information of each associated enterprise, wherein the associated enterprise information comprises associated enterprise association relationship information and associated enterprise industry classification information; and determining the industry of the enterprise to be classified from among the plurality of alternative industries according to the associated enterprise information. With this technical solution, when the enterprise to be classified is classified by industry, the industry to which it belongs may be ambiguous because of the attributes of the enterprise itself. In that case the enterprise may match a plurality of alternative industries, each with a similarity greater than the similarity threshold. The enterprise to be classified is then further classified based on the associated enterprise information of the associated enterprises of the enterpris