CN-121455956-B - PeakFit data automatic processing method and system
Abstract
The invention provides a PeakFit data automatic processing method and system, relating to the technical field of data processing. The method comprises: collecting result files in a PeakFit output directory through regular-expression matching; checking the format and integrity of the result files; extracting key parameters from the checked result files to generate a parameter association table of the key parameters and metadata; cleaning the key parameters in the parameter association table using statistical characteristics and dynamic thresholds; standardizing the format of the cleaned key parameters; verifying the accuracy of the cleaned and format-standardized key parameters by cross-comparison verification; and sorting the key parameters that pass accuracy verification into an Excel worksheet according to a preset format.
Inventors
- LI YANXIA
- Zhong Richen
- LING YIFAN
Assignees
- 北京科技大学
Dates
- Publication Date
- 20260505
- Application Date
- 20251128
Claims (8)
- 1. A PeakFit data automatic processing method, characterized by comprising the following steps: S1, collecting result files in a PeakFit output directory through regular-expression matching; S2, checking the format and integrity of the result files; S3, extracting key parameters from the checked result files, and generating a parameter association table of the key parameters and metadata; S4, cleaning the key parameters in the parameter association table using statistical characteristics and dynamic thresholds, and standardizing the format of the cleaned key parameters; S5, verifying the accuracy of the cleaned and format-standardized key parameters by cross-comparison verification; S6, sorting the key parameters that pass accuracy verification into an Excel worksheet according to a preset format; wherein the check specifically comprises structural entropy verification for the format check and semantic topology verification for the integrity check; wherein S2 specifically comprises: S201, aggregating the first N rows, the middle N rows and the last N rows of each result file by segmented sampling, and counting the separator distribution across all aggregated rows; S202, calculating a structural entropy value for each result file based on the aggregated separator distribution; S203, querying the structural entropy mean and the structural entropy standard deviation of historical valid result files, and calculating a structural entropy dynamic threshold based on the structural entropy mean and the structural entropy standard deviation; S204, judging whether the structural entropy value of each result file is smaller than the structural entropy dynamic threshold; if so, marking the result file as a result file to be checked and proceeding to S205; otherwise, marking the result file as structurally damaged and placing it in an isolation area to wait for manual processing; S205, pre-constructing a standardized keyword vector library for the PeakFit field; S206, vectorizing the column names in the result file to be checked using a pre-trained Word2Vec word embedding to obtain column name vectors; S207, calculating a cosine similarity matrix between the column name vectors and the keyword vectors in the standardized keyword vector library; S208, judging, based on the cosine similarity matrix, whether the difference between the highest similarity and the second-highest similarity of each column name vector is smaller than a preset difference; if so, placing the column name vector in the isolation area to wait for manual processing; otherwise, judging the match to be correct, mapping the column name vector to the standardized keyword vector library, and completing the format and integrity check.
- 2. The PeakFit data automatic processing method according to claim 1, wherein S1 specifically comprises: S101, establishing a named regular expression suited to the result files; S102, applying the named regular expression to each result file in the PeakFit output directory one by one; S103, judging whether the regular-expression match of each result file succeeds; if so, extracting the metadata of the result file; otherwise, adding the result file to an isolation area; S104, storing the metadata in an SQLite temporary database to form a target mapping table representing the mapping between result files and metadata, wherein the metadata comprises a file path, a file name, a file creation time, an experiment date and a batch ID.
- 3. The PeakFit data automatic processing method according to claim 1, wherein the key parameters comprise a peak area, a peak center value and a peak height, and S3 specifically comprises: S301, locating the key parameter columns in the checked result file according to the mapping result of the column name vectors; S302, reading the numerical data of the key parameter columns with the pandas library of Python, and storing the numerical data in a DataFrame data structure; S303, joining the numerical data in the DataFrame with the target mapping table using the sample ID as the key, and generating the parameter association table of the key parameters and the metadata.
- 4. The PeakFit data automatic processing method according to claim 1, wherein S4 specifically comprises: S401, calculating the mean, standard deviation, first quartile and third quartile of each parameter column corresponding to the key parameters in the parameter association table; S402, determining the distribution type of each parameter column using the Shapiro-Wilk test, wherein the distribution type is either normal or non-normal; S403, cleaning normally distributed parameter columns using the three-sigma rule, and cleaning non-normally distributed parameter columns using the box-plot rule; S404, standardizing each cleaned parameter column, wherein the standardization comprises converting parameter values from different sources into unified standard units and converting all parameter values to a floating-point type.
- 5. The PeakFit data automatic processing method according to claim 1, wherein S5 specifically comprises: S501, judging, according to the sample ID of the PeakFit result file, whether the cleaned and format-standardized key parameters belong to repeated sample data; if so, proceeding to S502; otherwise, proceeding to S503; S502, calculating the relative error between a pair of repeated sample data, and judging whether the relative error is smaller than a preset relative error; if so, judging the pair of repeated sample data to be valid; otherwise, marking the pair as data to be checked and returning to S4 for re-cleaning; S503, calculating a predicted value for the non-repeated sample data using a pre-trained association model, determining the absolute error between the predicted value and the actual value, and judging whether the absolute error is smaller than a preset absolute error; if so, judging the non-repeated sample data to be valid; otherwise, marking the non-repeated sample data as data to be checked and returning to S4 for re-cleaning.
- 6. The PeakFit data automatic processing method according to claim 1, wherein S6 specifically comprises: S601, setting an output template for the Excel worksheet, wherein the output template comprises the naming rules and the column order of the Excel worksheet; S602, automatically filling the key parameters that pass accuracy verification into the Excel worksheet according to the output template; S603, generating an Excel result file from the filled Excel worksheet, and storing the Excel result file in a specified directory.
- 7. The PeakFit data automatic processing method according to claim 6, further comprising, after S6: S7, calculating a hash value of the Excel result file, and verifying the integrity of the file based on the hash value when the Excel result file is called or transmitted.
- 8. A PeakFit data automatic processing system, comprising: a processor; and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the PeakFit data automatic processing method according to any one of claims 1 to 7.
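The structural entropy check of steps S201 to S204 in claim 1 can be illustrated with a minimal sketch. The claims do not give the entropy formula or the form of the dynamic threshold, so the sketch assumes Shannon entropy over the separator counts and a threshold of the historical mean plus k standard deviations. The separator set, the default values of `n` and `k`, and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev

# Assumed candidate separators; the source does not list them.
SEPARATORS = (",", "\t", ";", " ")


def sample_lines(lines, n):
    """S201: take the first, middle and last n lines; short files are used whole."""
    if len(lines) <= 3 * n:
        return list(lines)
    mid = (len(lines) - n) // 2
    return lines[:n] + lines[mid:mid + n] + lines[-n:]


def structural_entropy(lines, n=5):
    """S202 (assumed form): Shannon entropy, in bits, of the separator distribution."""
    counts = Counter(
        ch for line in sample_lines(lines, n) for ch in line if ch in SEPARATORS
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probs)


def dynamic_threshold(history, k=3.0):
    """S203 (assumed form): historical mean plus k standard deviations."""
    return mean(history) + k * stdev(history)


def check_structure(lines, history, n=5, k=3.0):
    """S204: below the threshold means 'to_be_checked', otherwise 'structural_damage'."""
    entropy = structural_entropy(lines, n)
    if entropy < dynamic_threshold(history, k):
        return "to_be_checked"
    return "structural_damage"
```

Under these assumptions, a file that uses a single separator consistently has low entropy and moves on to the column name check, while a file that mixes several separators scores higher and is sent to the isolation area.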
Description
PeakFit data automatic processing method and system
Technical Field
The invention relates to the technical field of data processing, and in particular to a PeakFit data automatic processing method and system.
Background
As scientific research and engineering applications demand ever higher data processing efficiency and precision, peak fitting has come into wide use in spectral analysis, chromatographic analysis, thermal analysis, biomedical analysis and related fields. Peak fitting not only helps researchers quickly identify characteristic peaks in data, but also supports subsequent structural analysis, chemical composition analysis and pharmacodynamic research through parameter extraction. Among the many peak fitting packages, PeakFit is widely used for its powerful functions, good compatibility and broad applicability.
Currently, PeakFit provides basic data processing and result output functions. After peak fitting is completed, the user can save a single result file, or copy the fitting parameters into Excel or other data processing software. For small amounts of data, this approach meets the basic requirements.
However, the prior art relies mainly on manual copying and pasting or on saving files one at a time, which is inefficient and error-prone. When large amounts of data are processed in batches, omissions and format confusion occur easily. Automation and standardization support is also lacking, so the accuracy and consistency of the data are difficult to ensure and data processing efficiency suffers severely.
Disclosure of Invention
The invention provides a PeakFit data automatic processing method and system, which aim to solve the following technical problems: the prior art relies mainly on manual copying and pasting or on saving files one at a time, which is inefficient and error-prone; batch processing of large amounts of data consumes considerable time and effort and easily leads to omissions and format confusion; and, because automation and standardization support is lacking, the accuracy and consistency of the data are difficult to ensure and data processing efficiency is severely affected.
The technical scheme provided by the embodiment of the invention is as follows:
First aspect: the embodiment of the invention provides a PeakFit data automatic processing method, which comprises the following steps:
- S1, collecting result files in a PeakFit output directory through regular-expression matching;
- S2, checking the format and integrity of the result files;
- S3, extracting key parameters from the checked result files, and generating a parameter association table of the key parameters and metadata;
- S4, cleaning the key parameters in the parameter association table using statistical characteristics and dynamic thresholds, and standardizing the format of the cleaned key parameters;
- S5, verifying the accuracy of the cleaned and format-standardized key parameters by cross-comparison verification;
- S6, sorting the key parameters that pass accuracy verification into an Excel worksheet according to a preset format.
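As an illustration of step S1, the following minimal sketch matches file names against a named regular expression and stores the extracted metadata in an SQLite table. The file naming scheme, the table layout and the function name `collect` are hypothetical, since the source does not specify them. The sketch records only the file name, experiment date, batch ID and sample ID, and leaves out the file path and creation time.

```python
import re
import sqlite3

# Hypothetical naming scheme: <experiment date>_<batch ID>_<sample ID>.txt
RESULT_NAME = re.compile(
    r"(?P<experiment_date>\d{8})_(?P<batch_id>B\d+)_(?P<sample_id>S\d+)\.txt"
)


def collect(file_names, conn):
    """Match each file name and store the metadata of every successful match.

    Returns the names that failed to match (the "isolation area").
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_mapping ("
        "file_name TEXT PRIMARY KEY, experiment_date TEXT, "
        "batch_id TEXT, sample_id TEXT)"
    )
    isolated = []
    for name in file_names:
        match = RESULT_NAME.fullmatch(name)
        if match is None:
            isolated.append(name)  # left for manual handling
            continue
        conn.execute(
            "INSERT OR REPLACE INTO target_mapping VALUES (?, ?, ?, ?)",
            (name, match["experiment_date"], match["batch_id"], match["sample_id"]),
        )
    conn.commit()
    return isolated


# Usage with an in-memory database standing in for the temporary database.
conn = sqlite3.connect(":memory:")
isolated = collect(["20251128_B01_S003.txt", "notes.docx"], conn)
```

Named groups keep the extraction readable and tie each captured field directly to a column of the mapping table, which is one plausible reason for preferring a named expression here.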
Second aspect: the embodiment of the invention provides a PeakFit data automatic processing system, which comprises: a processor; and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the PeakFit data automatic processing method according to the first aspect.
Third aspect: the embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the PeakFit data automatic processing method according to the first aspect when executed by a processor.
The technical scheme provided by the embodiment of the invention has at least the following beneficial effects:
According to the invention, the PeakFit output result files are automatically collected and checked, the key parameters are extracted, the parameter association table is generated, and the processing results are then imported directly into Excel according to the preset format, which avoids the inefficiency and errors caused by manual copying and pasting or by saving files one by one. In addition, data cleaning is combined with format standardization, and a cross-verification mechanism ensures the consistency and accuracy of the key parameters, so that automation and standardization support is provided alongside efficient batch processing, and the reliability and overall efficiency of data processing are significantly improved.
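The data cleaning step (S4) selects an outlier rule according to the distribution of each parameter column. The sketch below is one possible reading and makes several assumptions. The normality decision is passed in as a flag and could come from a Shapiro-Wilk test such as `scipy.stats.shapiro`. The box-plot rule uses the conventional fences at 1.5 times the interquartile range, a multiplier the source does not state. The function name `clean_column` is hypothetical.

```python
from statistics import mean, quantiles, stdev


def clean_column(values, is_normal):
    """Drop outliers from one parameter column and return the rest as floats.

    is_normal: result of a normality test performed elsewhere (assumption).
    """
    if is_normal:
        # Three-sigma rule: keep values within mean +/- 3 standard deviations.
        mu, sigma = mean(values), stdev(values)
        low, high = mu - 3 * sigma, mu + 3 * sigma
    else:
        # Box-plot rule with assumed 1.5 * IQR fences.
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [float(v) for v in values if low <= v <= high]
```

Returning floats also covers the type-conversion part of format standardization; unit conversion is left out because it depends on the source of each parameter.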