CN-122022568-A - Tobacco leaf group formula similarity evaluation method, system and storage medium

Abstract

The invention relates to the field of cigarette product design, and in particular to an autoencoder-based method, system and storage medium for evaluating the similarity of tobacco leaf group formulas. The method comprises: obtaining formula sample data and preprocessing it to obtain a training data set and a test data set; constructing and training an autoencoder model; obtaining latent space representations of the training data set and the test data set from the trained autoencoder model; obtaining a latent space attention-weighted average distance from the two latent space representations; calculating a reconstruction error from the latent space representation of the test data set; and obtaining a comprehensive similarity from the latent space attention-weighted average distance and the reconstruction error. The invention uses the autoencoder for nonlinear dimension reduction and combines the latent space nearest-neighbor distance with the reconstruction error, which improves the handling of high-dimensional data and the accuracy and robustness of the similarity evaluation.

Inventors

ZHENG CHENGXU
TIAN YUNONG
ZHANG LILI
PENG YUHAN
JIANG JIALEI
BI YIMING
SONG WEI
ZHAO ZHENJIE
HE WENMIAO
LI YONGSHENG
HAO XIANWEI
LIAO FU
ZHAO LIFEI
LI QINGXIANG
LI SHITOU

Assignees

China Tobacco Zhejiang Industrial Co., Ltd. (浙江中烟工业有限责任公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-20

Claims (10)

1. A method for evaluating the similarity of tobacco leaf group formulas, the method comprising: acquiring formula sample data and preprocessing the formula sample data to obtain a training data set and a test data set; constructing and training an autoencoder model; obtaining a latent space representation of the training data set and a latent space representation of the test data set based on the trained autoencoder model; obtaining a latent space attention-weighted average distance according to the latent space representation of the training data set and the latent space representation of the test data set; calculating a reconstruction error from the latent space representation of the test data set; and obtaining a comprehensive similarity according to the latent space attention-weighted average distance and the reconstruction error.
2. The method of claim 1, wherein constructing and training the autoencoder model comprises: constructing an autoencoder model with learnable index weights and setting a target dimension for dimension reduction; weighting each index dimension of the training data set and the test data set using an index weighting mechanism; and constructing a joint loss function to train the autoencoder model based on the weighted training data set and test data set.
3. The method of claim 2, wherein weighting each index dimension of the training data set and the test data set using the index weighting mechanism comprises: calculating the weight coefficient corresponding to each index according to formula (1) and formula (2): [equation (1) not reproduced], [equation (2) not reproduced], where the terms denote, in order: the weight coefficient of the j-th index of the i-th training sample; the feature vector of the i-th training sample; a trainable parameter matrix; the weight coefficient of the j-th index of the i-th test sample; and the feature vector of the i-th test sample.
4. The method of claim 2, wherein constructing the joint loss function to train the autoencoder model based on the weighted training data set and test data set comprises: constructing the joint loss function according to formulas (3) to (6): [equations (3) to (6) not reproduced], where the terms denote, in order: the joint loss function; the reconstruction error loss function; the similarity constraint loss function; the non-similar separation constraint loss function; two weight coefficients, each applied to one of the loss terms; the feature vector of the i-th training sample; the autoencoder's reconstruction of that feature vector; the batch size; the latent representation of the i-th training sample; the latent representation of the j-th training sample; the set of similar sample pairs; the set of non-similar sample pairs; and a preset minimum margin in the latent space.
5. The method of claim 1, wherein obtaining the latent space attention-weighted average distance according to the latent space representation of the training data set and the latent space representation of the test data set comprises: computing the distance between the latent space representation of each test data set sample and the latent space representation of each training data set sample and, according to a preset number of nearest-neighbor samples, retaining the k points with the smallest distances; calculating sample-level attention weights according to formula (7): [equation (7) not reproduced], where the terms denote, in order: the normalized weight that the i-th test sample assigns to its j-th nearest neighbor; the distance between the latent space representation of the i-th test sample and its j-th nearest-neighbor training sample; the sum of the similarity scores over the k nearest neighbors; and the number of nearest neighbors k; and calculating the weighted average distance according to formula (8): [equation (8) not reproduced], where the result is the weighted average distance between the latent space representation of the i-th test sample and its nearest-neighbor samples.
6. The method of claim 5, wherein computing the distance between the latent space representation of each training data set sample and the latent space representation of each test data set sample, and retaining the points with the smallest distances according to the preset number of nearest-neighbor samples, comprises: calculating the distance according to formula (9): [equation (9) not reproduced], where the terms denote, in order: the Euclidean distance; the value of the training data set sample in a given feature dimension; the value of the test data set sample in that feature dimension; and the variance of that feature dimension.
7. The method of claim 1, wherein calculating the reconstruction error from the latent space representation of the test data set comprises: calculating the reconstruction error according to formula (10): [equation (10) not reproduced], where the terms denote, in order: the reconstruction error; the original features of the test data set; and the latent space representation of the test data set.
8. The method of claim 1, wherein obtaining the comprehensive similarity according to the latent space attention-weighted average distance and the reconstruction error comprises: calculating the pairwise distances between the samples of the training data set in the latent space output by the encoder, and obtaining the maximum distance; calculating the reconstruction error of each sample of the training data set, and obtaining the maximum reconstruction error; and calculating the formula similarity according to formula (11): [equation (11) not reproduced], where the terms denote, in order: the similarity between the formula to be evaluated and the historical formulas; the latent space attention-weighted average distance; the maximum pairwise distance between all training data set samples in the latent space output by the encoder; the reconstruction error; the maximum reconstruction error among the training data set samples; the distance error weight coefficient; and the reconstruction error weight coefficient.
9. A tobacco leaf group formula similarity evaluation system, characterized in that the system comprises a processor configured to perform the method of any one of claims 1 to 8.
10. A computer readable storage medium having instructions stored thereon which, when executed by a processor, implement the method of any of claims 1 to 8.
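
The scoring steps of claims 5 to 8 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the claimed formulas, which are not reproduced in this text. The attention weights are assumed to be a softmax over negative distances, the reconstruction error a squared L2 norm, and the comprehensive similarity one minus a weighted sum of the two normalized terms. All function names and the parameters `k`, `alpha` and `beta` are hypothetical, and a PCA projection stands in for the trained autoencoder only so that the example runs on its own.

```python
import numpy as np


def scaled_distances(z_a, z_b, var):
    """Euclidean distance between rows of z_a and z_b, with each latent
    dimension divided by its variance (assumed reading of claim 6)."""
    diff = z_a[:, None, :] - z_b[None, :, :]
    return np.sqrt((diff ** 2 / var).sum(axis=-1))


def attention_weighted_distance(dist, k):
    """Keep the k smallest distances per row and average them with
    softmax(-distance) weights (assumed form of the attention weights)."""
    nearest = np.sort(dist, axis=1)[:, :k]
    scores = np.exp(-(nearest - nearest[:, :1]))  # shift by row minimum for stability
    weights = scores / scores.sum(axis=1, keepdims=True)
    return (weights * nearest).sum(axis=1)


def reconstruction_error(x, x_hat):
    """Squared L2 error per sample (assumed form)."""
    return ((x - x_hat) ** 2).sum(axis=1)


def composite_similarity(d_bar, d_max, err, err_max, alpha=0.5, beta=0.5):
    """Assumed form: 1 - (alpha * d/d_max + beta * e/e_max), clipped to [0, 1]."""
    return np.clip(1.0 - (alpha * d_bar / d_max + beta * err / err_max), 0.0, 1.0)


# Demo on synthetic data (not real formula data).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 8))        # 50 historical formulas, 8 indexes
x_test = np.vstack([x_train[0] + 0.01,    # close to a historical formula
                    x_train[0] + 20.0])   # far from all of them

# Stand-in for a trained autoencoder: projection onto 3 principal axes (PCA).
mean = x_train.mean(axis=0)
_, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
proj = vt[:3].T


def encode(x):
    return (x - mean) @ proj


def decode(z):
    return z @ proj.T + mean


z_train, z_test = encode(x_train), encode(x_test)
var = z_train.var(axis=0)  # per-dimension variance of the training latents

d_bar = attention_weighted_distance(scaled_distances(z_test, z_train, var), k=5)
d_max = scaled_distances(z_train, z_train, var).max()
err = reconstruction_error(x_test, decode(z_test))
err_max = reconstruction_error(x_train, decode(z_train)).max()

similarity = composite_similarity(d_bar, d_max, err, err_max)
print(similarity)  # the near sample should score higher than the far one
```

Normalizing both terms by their training-set maxima, as claim 8 describes, puts the distance and the reconstruction error on a comparable scale before they are weighted. The clipping to [0, 1] is an added assumption that keeps the score bounded for samples lying outside the range of the training data.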

Description

Tobacco leaf group formula similarity evaluation method, system and storage medium

Technical Field

The invention relates to the field of cigarette product design, and in particular to an autoencoder-based tobacco leaf group formula similarity evaluation method, system and storage medium.

Background

In the design and maintenance of cigarette products, characterizing the similarity between different leaf group formulas plays a vital role in formula design, raw material substitution, product screening and other tasks. Similarity evaluation of tobacco leaf group formulas mainly measures how similar each leaf group formula under test is to a set of historical standard leaf group formulas. Traditional evaluation methods rely mainly on manual experience, such as sensory evaluation, or on simple statistical indexes, such as the Euclidean distance, the correlation coefficient, or the weighted difference of component contents. These methods have several limitations. First, sensory evaluation or a few key chemical indexes alone cannot comprehensively reflect the differences and correlations among cigarette formulas, so the similarity evaluation is one-sided. Second, when more chemical indexes are used for modeling, the commonly used linear dimension reduction methods, such as principal component analysis (PCA) and factor analysis (FA), are limited in handling nonlinear feature mappings and cannot effectively extract a low-dimensional embedding space that reflects the essential features of a formula. As a result, the latent structural similarity between cigarette formulas cannot be accurately quantified, and the similarity evaluation result deviates from the actual objective behavior.
In addition, when the historical formula data set used as the evaluation reference is irregularly distributed, existing distance-based or correlation-based similarity measures struggle to represent how similar a sample under evaluation is to the overall distribution of the historical formulas, and they are strongly affected by the distribution and outliers of the historical formula set. A method that comprehensively and accurately characterizes the similarity between leaf group formulas is therefore needed, to address the one-sided evaluation indexes, the insufficient nonlinear feature handling of linear dimension reduction methods, and the susceptibility of similarity measures to irregular data distributions in the prior art.

Disclosure of Invention

The embodiments of the invention aim to provide an autoencoder-based tobacco leaf group formula similarity evaluation method, system and storage medium, to solve the prior-art problems of one-sided evaluation dimensions, insufficient nonlinear feature extraction capability, and similarity measurements that are easily disturbed when the data distribution is irregular.
To achieve the above object, an embodiment of the invention provides a method for evaluating the similarity of tobacco leaf group formulas, comprising: acquiring formula sample data and preprocessing the formula sample data to obtain a training data set and a test data set; constructing and training an autoencoder model; obtaining a latent space representation of the training data set and a latent space representation of the test data set based on the trained autoencoder model; obtaining a latent space attention-weighted average distance according to the latent space representation of the training data set and the latent space representation of the test data set; calculating a reconstruction error from the latent space representation of the test data set; and obtaining a comprehensive similarity according to the latent space attention-weighted average distance and the reconstruction error.

Optionally, constructing and training the autoencoder model comprises: constructing an autoencoder model with learnable index weights and setting a target dimension for dimension reduction; weighting each index dimension of the training data set and the test data set using an index weighting mechanism; and constructing a joint loss function to train the autoencoder model based on the weighted training data set and test data set.

Optionally, weighting each index dimension of the training data set and the test data set using the index weighting mechanism comprises: calculating the weight coefficient corresponding to each index according to formula (1) and formula (2), ,(1) ,(2) Wherein, the Is the firstThe first training sampleThe weight