CN-121998781-A - Tax risk identification method and system based on multi-source data comparison

CN121998781A

Abstract

The invention discloses a tax risk identification method and system based on multi-source data comparison. The method comprises: collecting multi-source data of a target subject and its associated subjects from tax, invoice, and business credit systems and constructing a data set; extracting time-series data of preset key risk indicators within a preset observation period, calculating their volatility and degree of deviation, and generating a dynamic risk feature vector representing the risk condition of each subject; constructing a risk conduction network based on the association relations between the target subject and the associated subjects; calculating, within a specific time window, the time-series correlation between the dynamic risk feature vectors of the target subject node and of each associated subject node; when the correlation coefficient exceeds a preset threshold, determining that risk synergy exists and marking the associated subjects corresponding to the plurality of associated subject nodes as a high-risk cluster; and evaluating the direct tax risk. The method thereby enables dynamic, accurate identification and quantitative prevention and control of tax risk, improving both accuracy and efficiency.
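
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The source does not give formulas for volatility or deviation, so the definitions below are assumptions: volatility is taken as the coefficient of variation, and deviation as the relative gap between the latest value and a reference baseline. Each subject is reduced to a single scalar risk series for simplicity, and the names `risk_features`, `pearson`, `high_risk_cluster`, and the default threshold of 0.8 are hypothetical.

```python
from statistics import mean, pstdev


def risk_features(series, baseline):
    """Return (volatility, deviation) for one indicator series.

    Assumed definitions (not specified in the source):
    volatility = population standard deviation / |mean|,
    deviation  = (latest value - baseline) / |baseline|.
    """
    mu = mean(series)
    volatility = pstdev(series) / abs(mu) if mu else 0.0
    deviation = (series[-1] - baseline) / abs(baseline) if baseline else 0.0
    return volatility, deviation


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return cov / norm if norm else 0.0


def high_risk_cluster(target_series, associated, threshold=0.8):
    """Names of associated subjects whose risk series correlate with the
    target's series above the (assumed) threshold within the window."""
    return [
        name
        for name, series in associated.items()
        if pearson(target_series, series) > threshold
    ]


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0, 4.0]
    associated = {
        "distributor_a": [2.0, 4.0, 6.0, 8.0],  # moves with the target
        "cro_b": [4.0, 3.0, 2.0, 1.0],          # moves against the target
    }
    print(high_risk_cluster(target, associated))
```

In this sketch only subjects whose series rise and fall together with the target's are flagged; a real system would compare full feature vectors per time step rather than a single scalar.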

Inventors

HUANG YANWEI
ZHOU WEILI
LIU AIHONG
WANG QIAOLIN
YU JIANMEI
CHEN SIYU
LIU CHENG

Assignees

华润江中药业股份有限公司

Dates

Publication Date: 20260508
Application Date: 20251225

Claims (10)

1. A tax risk identification method based on multi-source data comparison, characterized by comprising the following steps: collecting multi-source data of a target subject and associated subjects from tax, invoice, and business credit systems according to a preset period, and constructing a corresponding data set; extracting, from the data set, time-series data of preset key risk indicators within a preset observation period, calculating a volatility and a degree of deviation from the time-series data of each indicator, and generating a dynamic risk feature vector representing the risk condition of each subject; constructing a risk conduction network based on the association relations between the target subject and the associated subjects, wherein network nodes represent the target subject and all associated subjects, and directed edges represent risk conduction paths; calculating, in the risk conduction network and within a specific time window, the time-series correlation between the dynamic risk feature vectors of the target subject node and of each associated subject node; when the correlation coefficient exceeds a preset threshold, determining that risk synergy exists and marking the associated subjects corresponding to the plurality of associated subject nodes as a high-risk cluster; and evaluating the direct and associated tax risks posed to the target subject by the associated subjects in the high-risk cluster, and outputting quantified risk levels and response strategy data.
2. The tax risk identification method based on multi-source data comparison according to claim 1, wherein the step of constructing a risk conduction network based on the association relations between the target subject and the associated subjects comprises: acquiring association relation data between the target subject and each associated subject, the association relation data comprising an equity investment ratio, a share of annual transaction amount, and a number of enterprises with overlapping shareholdings; mapping the target subject and each associated subject to network nodes, and establishing directed edges between network nodes according to the directionality of the association relations to generate a preliminary network topology; and calculating an association strength value for each directed edge from the association relation data by a weighted fusion algorithm, and assigning the association strength value as the weight of the corresponding directed edge to complete the construction of the risk conduction network.
3. The tax risk identification method based on multi-source data comparison according to claim 2, wherein the step of calculating the association strength value of each directed edge by a weighted fusion algorithm according to the association relation data, and assigning the association strength value as the weight of the corresponding directed edge to complete the construction of the risk conduction network comprises: normalizing the equity investment ratio, the share of annual transaction amount, and the number of enterprises with overlapping shareholdings in the association relation data, respectively, to obtain a standardized value for each indicator; configuring a preset weight coefficient for each type of association relation indicator, and performing linear weighted fusion of the standardized values of the association relation indicators corresponding to each directed edge with their respective preset weight coefficients to obtain the association strength value of the directed edge; and mapping the association strength value to a visual marking of the corresponding directed edge, wherein the association strength value is positively correlated with the width and color depth of the edge, and generating a risk conduction network comprising node attributes and weighted edge attributes.
4. The tax risk identification method based on multi-source data comparison according to claim 3, wherein the step of configuring a preset weight coefficient for each type of association relation indicator, performing linear weighted fusion of the standardized values of the association relation indicators corresponding to each directed edge with their respective preset weight coefficients, and obtaining the association strength value of the directed edge comprises: calculating the information entropy of the equity investment ratio, the share of annual transaction amount, and the number of enterprises with overlapping shareholdings, respectively, and determining the degree of variation of each association relation indicator from its information entropy so as to dynamically configure the preset weight coefficient of each association relation indicator; and multiplying the standardized values of the equity investment ratio, the share of annual transaction amount, and the number of enterprises with overlapping shareholdings by their respective preset weight coefficients and summing the products to obtain the association strength value of the directed edge, and normalizing the result.
5. The tax risk identification method based on multi-source data comparison according to any one of claims 1 to 4, wherein the step of calculating, in the risk conduction network, the time-series correlation between the dynamic risk feature vectors of the target subject node and of the associated subject node comprises: calculating, in the risk conduction network, the Pearson correlation coefficient between the time series of the dynamic risk feature vectors of the target subject node and of the associated subject node, and taking the Pearson correlation coefficient as a first correlation index; and calculating a dynamic time warping distance between the time series, and converting the dynamic time warping distance into a similarity score serving as a second correlation index.
6. The tax risk identification method based on multi-source data comparison according to claim 5, wherein the step of calculating the Pearson correlation coefficient between the time series of the dynamic risk feature vectors of the target subject node and of the associated subject node as the first correlation index comprises: mean-centering the time series of the dynamic risk feature vectors of the target subject node and of the associated subject node; and calculating the covariance and the standard deviations from the centered time series, and obtaining the Pearson correlation coefficient as the ratio of the covariance to the product of the standard deviations.
7. The tax risk identification method based on multi-source data comparison according to claim 6, wherein the step of calculating a dynamic time warping distance between the time series and converting it into a similarity score as the second correlation index comprises: constructing a Euclidean distance matrix between the time series of the target subject node and of the associated subject node, and searching for an optimal warping path by a dynamic programming algorithm; and calculating the accumulated distance from the start point to the end point of the optimal warping path, and mapping the accumulated distance to a similarity score between 0 and 1.
8. A tax risk identification system based on multi-source data comparison, applied to the method of any one of claims 1-7, the system comprising: a data acquisition module for collecting multi-source data of the target subject and the associated subjects from tax, invoice, and business credit systems according to a preset period, and constructing a corresponding data set; a vector generation module for extracting, from the data set, time-series data of preset key risk indicators within a preset observation period, calculating a volatility and a degree of deviation from the time-series data of each indicator, and generating a dynamic risk feature vector representing the risk condition of each subject; a network construction module for constructing a risk conduction network based on the association relations between the target subject and the associated subjects, wherein network nodes represent the target subject and all associated subjects, and directed edges represent risk conduction paths; a time-series calculation module for calculating, in the risk conduction network and within a specific time window, the time-series correlation between the dynamic risk feature vectors of the target subject node and of each associated subject node; a cluster marking module for determining, when the correlation coefficient exceeds a preset threshold, that risk synergy exists and marking the associated subjects corresponding to the plurality of associated subject nodes as a high-risk cluster; and a risk evaluation module for evaluating the direct and associated tax risks posed to the target subject by the associated subjects in the high-risk cluster, and outputting quantified risk levels and response strategy data.
9. A readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1-7.
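
The edge weighting of claims 2-4 and the second correlation index of claims 5 and 7 can be sketched as follows. This is an illustrative reading, not the claimed implementation: the claims name information entropy and dynamic time warping but give no formulas, so the sketch assumes the standard entropy weight method with min-max normalization, and assumes the mapping 1 / (1 + distance) to bring the warping distance into (0, 1]. All function names are hypothetical, and each series is treated as scalar, so the Euclidean distance reduces to an absolute difference.

```python
import math


def min_max(column):
    """Min-max normalize one indicator column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def entropy_weights(columns):
    """Entropy weight method: indicators that vary more across edges
    (lower entropy) receive larger weights. Assumed formulation."""
    n = len(columns[0])
    divergences = []
    for col in columns:
        total = sum(col)
        if total == 0 or n < 2:
            divergences.append(0.0)  # constant indicator carries no information
            continue
        probs = [v / total for v in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    if s == 0:
        return [1.0 / len(columns)] * len(columns)
    return [d / s for d in divergences]


def edge_strengths(raw_columns):
    """Linear weighted fusion of normalized indicators, one value per edge.
    Weights sum to 1 and inputs lie in [0, 1], so results stay in [0, 1]."""
    cols = [min_max(c) for c in raw_columns]
    weights = entropy_weights(cols)
    n = len(cols[0])
    return [sum(w * col[i] for w, col in zip(weights, cols)) for i in range(n)]


def dtw_distance(a, b):
    """Accumulated cost along the optimal warping path (dynamic programming)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dtw_similarity(a, b):
    """Map the warping distance to (0, 1]; the mapping is an assumption."""
    return 1.0 / (1.0 + dtw_distance(a, b))


if __name__ == "__main__":
    equity_ratio = [0.1, 0.5, 0.9]      # one value per directed edge
    transaction_share = [0.2, 0.2, 0.2]
    overlap_count = [0, 1, 2]
    print(edge_strengths([equity_ratio, transaction_share, overlap_count]))
    print(dtw_similarity([1, 2, 3], [1, 2, 2, 3]))
```

An indicator that is identical across all edges, such as `transaction_share` in the usage example, receives zero weight under this formulation, which matches the intent of weighting by degree of variation.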

Description

Tax risk identification method and system based on multi-source data comparison

Technical Field

The invention relates to the technical field of tax risk identification, and in particular to a tax risk identification method and system based on multi-source data comparison.

Background

The pharmaceutical industry is a strategic industry bearing on the national economy and people's livelihood, and its tax risk management is particularly complex because of the business models, regulatory policies, and supply chain structures specific to the industry. Pharmaceutical enterprises typically conduct research and development that is high in investment, long in cycle, and high in risk; they make wide use of policies such as the additional deduction of research and development expenses and preferential tax treatment for high-tech enterprises, and their tax treatment is complex. Meanwhile, the multi-link circulation system running from active pharmaceutical ingredients and preparations to commercial distribution, together with strict regulatory policies such as the two-invoice system and centralized volume-based procurement, forms a tax risk environment unique to pharmaceutical enterprises. The complex networks these enterprises form with CROs, CMOs, CSOs, and distributors further increase the concealment and linkage of risk conduction. Most existing tax risk identification methods are general-purpose frameworks that struggle to address the particular challenges of pharmaceutical enterprises. These methods generally perform static ratio analysis or threshold comparison on financial and tax report data from a single point in time, and rely on common indicators such as tax burden rate and invoice matching degree.
This one-size-fits-all mode of analysis cannot dynamically capture how risk evolves over the full life cycle of a pharmaceutical enterprise, makes it difficult to design effective risk monitoring indicators for policy shocks (such as the large changes in price and revenue structure caused by volume-based procurement), and lacks a dedicated algorithmic model for potential coordinated abnormal behavior in the enterprise's complex related-party transaction network, so its applicability and accuracy for pharmaceutical enterprises are seriously insufficient. The key bottleneck in the prior art is the lack of dynamic topology awareness of the complex supply chain networks of pharmaceutical enterprises. Existing analysis methods cannot construct a network model that reflects the real-time association relations between a pharmaceutical enterprise and its CROs, CMOs, CSOs, and distributors; as a result, a system cannot describe the conduction paths of risk across multiple nodes, and the key risk hubs in the network are difficult to locate. Without this capability, risk identification remains at the level of isolated, point-by-point analysis and cannot give early warning of chain reactions triggered by the risks of related parties. This is the main technical obstacle to accurate prevention and control of tax risk in pharmaceutical enterprises today.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a tax risk identification method and system based on multi-source data comparison, so as to solve the problems described in the Background.
A first aspect of the invention provides a tax risk identification method based on multi-source data comparison, comprising the following steps: collecting multi-source data of a target subject and associated subjects from tax, invoice, and business credit systems according to a preset period, and constructing a corresponding data set; extracting, from the data set, time-series data of preset key risk indicators within a preset observation period, calculating a volatility and a degree of deviation from the time-series data of each indicator, and generating a dynamic risk feature vector representing the risk condition of each subject; constructing a risk conduction network based on the association relations between the target subject and the associated subjects, wherein network nodes represent the target subject and all associated subjects, and directed edges represent risk conduction paths; calculating, in the risk conduction network and within a specific time window, the time-series correlation between the dynamic risk feature vectors of the target subject node and of each associated subject node; when the correlation coefficient exceeds a preset threshold, determining that risk synergy exists and marking the associated subjects corresponding to the plurality of associated subject nodes as a high-risk cluster; and evaluating the direct and associated tax risks posed to the target subject by the associated subjects in the high-risk cluster, and outputting quantified risk levels and response strategy data. According to an aspect of the above technical solution, the step of constructing a risk conduction network ba