CN-114765059-B - Global progenitor source estimation method and system based on principal component analysis
Abstract
The invention provides a global ancestral source estimation method and system based on principal component analysis. The method comprises: projecting, by principal component analysis of single nucleotide mutation data, the principal components of an offspring sample to be tested and the principal components of an ancestor sample into the same two-dimensional plane, and dividing the projected two-dimensional plane into regions to obtain a plurality of corresponding population regions; fitting the population region of each ancestor sample to obtain an ancestor population fitting region, and obtaining the region density of each ancestor population fitting region; and extending each ancestor population fitting region outward by the same proportion, stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, and obtaining the ancestor population proportion of the currently contained offspring sample from the extended region area and the region density. Because the ancestor proportion of an offspring individual or population is calculated directly from the principal component analysis result, the accuracy is higher.
Inventors
- ZHAO YIQIANG
- LIU QI
- DAI YUNPING
Assignees
- China Agricultural University (中国农业大学)
Dates
- Publication Date: 20260508
- Application Date: 20210113
Claims (8)
- 1. A global progenitor estimation method based on principal component analysis, comprising: projecting, based on a principal component analysis method and according to single nucleotide mutation data, principal components of an offspring sample to be tested and principal components of an ancestor sample into the same two-dimensional plane, and dividing the projected two-dimensional plane into regions to obtain a plurality of corresponding population regions; performing ellipse fitting on the population region of each ancestor sample by a least squares method to obtain ancestor population fitting regions, and obtaining the region density of each ancestor population fitting region; and extending each ancestor population fitting region outward by the same proportion, stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, and obtaining the ancestor population proportion of the currently contained offspring sample to be tested according to the extended region area and the region density, which comprises: obtaining, from the region area Sl after the current extension process and the region density N/S of the ancestor population fitting region, the current extension multiple ni of the ancestor population fitting region through an extension multiple formula, wherein the extension multiple formula is as follows: [formula not reproduced in the source text]; wherein N represents the number of sample points contained in the ancestor population fitting region before extension, and S represents the area of the ancestor population fitting region before extension; and calculating, according to the extension multiple and an ancestor population proportion formula, the ancestor population proportion p of the offspring sample to be tested, wherein the ancestor population proportion formula is as follows: [formula not reproduced in the source text]; where k denotes k offspring samples to be tested.
- 2. The global progenitor estimation method based on principal component analysis according to claim 1, wherein projecting, based on the principal component analysis method and according to single nucleotide mutation data, the principal components of the offspring sample to be tested and the principal components of the ancestor sample into the same two-dimensional plane, and dividing the projected two-dimensional plane into regions to obtain a plurality of corresponding population regions, comprises: obtaining a first feature vector from the single nucleotide mutation data of the offspring sample to be tested, and obtaining a second feature vector from the single nucleotide mutation data of the ancestor sample; projecting the offspring sample to be tested and the ancestor sample onto the same two-dimensional plane according to the first feature vector and the second feature vector; and marking each sample on the projected two-dimensional plane, and dividing the population regions according to the marking result.
- 3. The global progenitor estimation method based on principal component analysis according to claim 1, wherein performing ellipse fitting on the population region of each ancestor sample by the least squares method to obtain an ancestor population fitting region, and obtaining the region density of each ancestor population fitting region, comprises: fitting the population region of each ancestor sample to an elliptical region by a least squares ellipse fitting method, and extending the elliptical region outward until all ancestor samples in the current population region are covered, to obtain the ancestor population fitting region; obtaining the number of ancestor sample points in each ancestor population fitting region, and calculating the elliptical area of the ancestor population fitting region according to an ellipse area formula; and obtaining the region density of each ancestor population fitting region from the number of ancestor sample points and the elliptical area.
- 4. The global progenitor estimation method based on principal component analysis according to claim 1, wherein extending each ancestor population fitting region outward by the same proportion, and stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, comprises: extending each ancestor population fitting region outward by the same proportion using a progressive (asymptotic) method while keeping the eccentricity of each ancestor population fitting region unchanged, until an extended ancestor population fitting region contains an offspring sample to be tested; then stopping the current extension process and calculating the area of the extended ancestor population fitting region.
- 5. The method according to claim 1, wherein, after extending each ancestor population fitting region outward by the same proportion, stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, and obtaining the ancestor population proportion of the currently contained offspring sample to be tested according to the extended region area and the region density, the method further comprises: step S1, after the ancestor population proportion of the offspring sample to be tested contained in the previous extension process has been obtained, continuing to extend each ancestor population fitting region outward by the same proportion; step S2, stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, obtaining the current extension multiple of that ancestor population fitting region from the region area after the current extension process and the region density of the ancestor population fitting region, and obtaining, according to the extension multiple, the ancestor population proportion of the offspring sample to be tested that is contained in the ancestor population fitting region during the current extension process; and step S3, repeating steps S1 to S2 until the ancestor population proportions of all offspring samples to be tested have been obtained.
- 6. A global progenitor estimation system based on principal component analysis, characterized in that the system is adapted to implement the global progenitor estimation method based on principal component analysis according to any of claims 1 to 5.
- 7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the principal component analysis based global progenitor estimation method according to any of claims 1 to 5 when the computer program is executed.
- 8. A non-transitory computer readable storage medium, having stored thereon a computer program, which when executed by a processor, implements the steps of the principal component analysis based global progenitor estimation method according to any of claims 1 to 5.
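As an illustration of the fitting step recited in claim 3, the following Python sketch builds a covering ellipse for one ancestor population and computes its region density N/S. It is a stand-in under stated assumptions, not the claimed procedure: a covariance-based (moment) ellipse replaces the least squares ellipse fit, and the names `covering_ellipse` and `region_density` as well as the sample coordinates are hypothetical.

```python
import math

def covering_ellipse(points):
    """Return (cx, cy, a, b, theta) of an ellipse covering all points.

    Assumption: the ellipse shape comes from the 2x2 covariance matrix
    (a moment fit), not from the least squares fit named in the claims.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    mid = (sxx + syy) / 2.0
    root = math.hypot((sxx - syy) / 2.0, sxy)
    a = math.sqrt(mid + root)
    b = math.sqrt(max(mid - root, 0.0))
    if b == 0.0:
        raise ValueError("points are collinear; no ellipse can be fitted")
    # Extend outward, keeping the axis ratio, until every sample is covered.
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    t = max(
        math.hypot(((x - cx) * cos_t + (y - cy) * sin_t) / a,
                   (-(x - cx) * sin_t + (y - cy) * cos_t) / b)
        for x, y in points
    )
    return cx, cy, a * t, b * t, theta

def region_density(points, ellipse):
    """Density N/S: number of ancestor points over the ellipse area pi*a*b."""
    _, _, a, b, _ = ellipse
    return len(points) / (math.pi * a * b)

# Hypothetical PCA coordinates for one ancestor population.
ancestors = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]
ellipse = covering_ellipse(ancestors)
density = region_density(ancestors, ellipse)
```

Scaling both semi-axes by the same factor keeps the eccentricity fixed, which matches the way the claims describe extending a fitted region.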
Description
Global progenitor source estimation method and system based on principal component analysis
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a global progenitor source estimation method and system based on principal component analysis.
Background
Tracing the origin of animal breeds can effectively predict the magnitude and direction of trait variation in hybrid offspring, and global ancestral source estimation helps detect the factors that have strongly influenced a breed during animal evolution and breeding, which is of great significance for cross-breeding. By estimating the global ancestry of an animal breed and calculating its ancestor proportions, the breed can be traced more accurately, and the evolutionary relationships among different breeds of the same animal species, as well as the evolutionary history of that animal, can be understood, providing a basis for exploring the origin of the species. However, existing global progenitor estimation is still relatively inefficient, and its accuracy needs further improvement. Therefore, a global progenitor estimation method and system based on principal component analysis is needed to solve the above problems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a global progenitor source estimation method and system based on principal component analysis.
The invention provides a global progenitor source estimation method based on principal component analysis, comprising: projecting, based on a principal component analysis method and according to single nucleotide mutation data, the principal components of an offspring sample to be tested and the principal components of an ancestor sample into the same two-dimensional plane, and dividing the projected two-dimensional plane into regions to obtain a plurality of corresponding population regions; fitting the population region of each ancestor sample to obtain an ancestor population fitting region, and obtaining the region density of each ancestor population fitting region; and extending each ancestor population fitting region outward by the same proportion, stopping the outward extension when any ancestor population fitting region contains an offspring sample to be tested, and obtaining the ancestor population proportion of the currently contained offspring sample to be tested according to the extended region area and the region density.
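The extension step can be sketched in Python as follows. Because every ellipse grows by the same factor with its axis ratio fixed, the first ellipse to reach an offspring point is the one that needs the smallest scale factor, so this sketch computes that factor in closed form instead of growing the regions step by step. The ellipse tuple layout `(cx, cy, a, b, theta)`, the function names, and the example values are illustrative assumptions. The sketch stops at the extended area: the extension multiple and proportion formulas are not reproduced in this text, so no proportion is computed.

```python
import math

def required_scale(ellipse, point):
    """Factor by which an ellipse must grow (axis ratio fixed) to reach point."""
    cx, cy, a, b, theta = ellipse
    dx, dy = point[0] - cx, point[1] - cy
    # Rotate the offset into the ellipse's own axis frame.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return math.hypot(u / a, v / b)

def first_containing(ellipses, point):
    """Grow all ellipses by one shared factor; report the first to contain point.

    Returns (population name, shared scale factor, extended ellipse area).
    """
    scales = {name: required_scale(e, point) for name, e in ellipses.items()}
    name = min(scales, key=scales.get)
    factor = max(scales[name], 1.0)  # a point already inside needs no growth
    _, _, a, b, _ = ellipses[name]
    extended_area = math.pi * (a * factor) * (b * factor)
    return name, factor, extended_area

# Hypothetical fitted regions for two ancestor populations and one offspring point.
ellipses = {
    "A": (0.0, 0.0, 2.0, 1.0, 0.0),
    "B": (10.0, 0.0, 1.0, 1.0, 0.0),
}
name, factor, area = first_containing(ellipses, (4.0, 0.0))
```

In this example population "A" reaches the offspring point first, at twice its fitted size, while "B" would need a sixfold extension.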
According to the global ancestral source estimation method based on principal component analysis provided by the invention, projecting the principal components of the offspring sample to be tested and the principal components of the ancestor sample into the same two-dimensional plane according to single nucleotide mutation data, and dividing the projected two-dimensional plane into regions to obtain a plurality of corresponding population regions, comprises: obtaining a first feature vector from the single nucleotide mutation data of the offspring sample to be tested, and obtaining a second feature vector from the single nucleotide mutation data of the ancestor sample; projecting the offspring sample to be tested and the ancestor sample onto the same two-dimensional plane according to the first feature vector and the second feature vector; and marking each sample on the projected two-dimensional plane, and dividing the population regions according to the marking result.
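A minimal Python sketch of the projection step, under stated assumptions: genotypes are coded as 0/1/2 allele counts (the text only says single nucleotide mutation data), ancestor and offspring samples are stacked and decomposed jointly so that they share one plane, and power iteration with deflation stands in for a library eigen-solver. All function names, labels, and the toy data are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each SNP column's mean so PCA sees deviations, not raw counts."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_axis(rows, iters=500):
    """Leading principal axis of centered rows, found by power iteration."""
    dim = len(rows[0])
    v = [1.0 / (j + 1) for j in range(dim)]  # arbitrary non-zero start vector
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project_2d(genotypes):
    """Return (PC1, PC2) coordinates for every sample in one shared plane."""
    x = center_columns(genotypes)
    pc1 = top_axis(x)
    s1 = [sum(a * b for a, b in zip(row, pc1)) for row in x]
    # Deflate: remove the PC1 component from every row, then find PC2.
    deflated = [[a - s * b for a, b in zip(row, pc1)] for row, s in zip(x, s1)]
    pc2 = top_axis(deflated)
    s2 = [sum(a * b for a, b in zip(row, pc2)) for row in x]
    return list(zip(s1, s2))

# Toy data: two ancestor populations and two offspring, four SNPs each.
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2],   # ancestor population A
    [2, 2, 0, 0], [2, 2, 0, 1],   # ancestor population B
    [1, 1, 1, 1], [1, 2, 1, 1],   # offspring samples to be tested
]
labels = ["A", "A", "B", "B", "offspring", "offspring"]
coords = project_2d(genotypes)

# Mark each projected sample and group the points by population label.
regions = {}
for label, xy in zip(labels, coords):
    regions.setdefault(label, []).append(xy)
```

In this toy data the two ancestor populations fall on opposite sides of the first principal component and the offspring land between them, which is the layout the later fitting and extension steps rely on. The sign of each component is arbitrary, so only relative positions are meaningful.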
According to the global progenitor estimation method based on principal component analysis provided by the invention, fitting the population region of each ancestor sample to obtain an ancestor population fitting region, and obtaining the region density of each ancestor population fitting region, comprises: fitting the population region of each ancestor sample to an elliptical region by a least squares ellipse fitting method, and extending the elliptical region outward until all ancestor samples in the current population region are covered, to obtain the ancestor population fitting region; obtaining the number of ancestor sample points in each ancestor population fitting region, and calculating the elliptical area of the ancestor population fitting region according to an ellipse area formula; and obtaining the region density of each ancestor population fitting region from the number of ancestor sample points and the elliptical area.
According to the global progenitor source estimation method based on principal component analysis, each ancestor population fitting region is extended outward by the same proportion, and when any ancestor population fitting region contains an offspring sample to be tested, the outwar