Principal components analysis

allel.stats.decomposition.pca(gn, n_components=10, copy=True, scaler='patterson', ploidy=2)

Perform principal components analysis of genotype data, via singular value decomposition.

Parameters:

gn : array_like, float, shape (n_variants, n_samples)

Genotypes at biallelic variants, coded as the number of alternate alleles per call (i.e., 0 = hom ref, 1 = het, 2 = hom alt).

n_components : int, optional

Number of components to keep.

copy : bool, optional

If False, the input array is overwritten in place during fitting instead of being copied.

scaler : {‘patterson’, ‘standard’, None}

Scaling method: ‘patterson’ applies the method of Patterson et al. (2006); ‘standard’ scales to unit variance; None centers the data only.

ploidy : int, optional

Sample ploidy, only relevant if ‘patterson’ scaler is used.

Returns:

coords : ndarray, float, shape (n_samples, n_components)

Transformed coordinates for the samples.

model : GenotypePCA

Model instance containing the variance ratio explained and the stored components (a.k.a., loadings). Can be used to project further data into the same principal components space via the transform() method.
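The fit and projection steps described above can be sketched in plain NumPy. This is an illustration only, not the library's GenotypePCA: the class name ToyGenotypePCA and its attribute names are hypothetical, and the scaling assumes the Patterson-style normalisation divides each centered variant by sqrt(p * (1 - p)), where p is the mean genotype divided by the ploidy.

```python
import numpy as np


class ToyGenotypePCA:
    """Hypothetical stand-in for illustration; not the library's model class."""

    def __init__(self, n_components=10, ploidy=2):
        self.n_components = n_components
        self.ploidy = ploidy

    def _scale(self, gn):
        # Apply the per-variant mean and scale learned at fit time.
        return (np.asarray(gn, dtype=float) - self.mean_) / self.std_

    def fit_transform(self, gn):
        gn = np.asarray(gn, dtype=float)
        # Per-variant mean and (assumed) Patterson scaling factor.
        self.mean_ = gn.mean(axis=1, keepdims=True)
        p = self.mean_ / self.ploidy
        self.std_ = np.sqrt(p * (1 - p))
        # Samples become rows: shape (n_samples, n_variants).
        x = self._scale(gn).T
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        k = self.n_components
        self.components_ = vt[:k]  # loadings, shape (k, n_variants)
        self.explained_variance_ratio_ = (s ** 2 / np.sum(s ** 2))[:k]
        return u[:, :k] * s[:k]  # sample coordinates

    def transform(self, gn):
        # Project further samples into the same principal components space.
        return self._scale(gn).T @ self.components_.T


# Toy data: 300 variants, 20 samples, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
af = rng.uniform(0.2, 0.8, size=(300, 1))
gn = rng.binomial(2, af, size=(300, 20))
train, held_out = gn[:, :15], gn[:, 15:]

# Keep variants that vary among the training samples, so that the
# scaling factor is never zero.
seg = train.min(axis=1) != train.max(axis=1)
train, held_out = train[seg], held_out[seg]

model = ToyGenotypePCA(n_components=3)
coords = model.fit_transform(train)
new_coords = model.transform(held_out)
```

The held-out samples are scaled with the statistics learned from the training samples, which is what keeps both sets of coordinates in the same space.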

Notes

Before calling this function, filter the genotype data to remove variants in linkage disequilibrium.
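One way to do such filtering is a greedy pass that drops any variant too strongly correlated with a recently kept one. The sketch below is a simplified illustration in NumPy, not the library's pruning routine; the function name ld_prune and its window and threshold parameters are hypothetical, and r² is computed as the squared Pearson correlation between genotype vectors.

```python
import numpy as np


def ld_prune(gn, window=50, threshold=0.1):
    """Return indices of variants kept by a greedy r**2 filter (sketch).

    A variant is kept only if its squared correlation with each of the
    previous `window` kept variants is below `threshold`.
    """
    gn = np.asarray(gn, dtype=float)
    centered = gn - gn.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1))
    kept = []
    for i in range(gn.shape[0]):
        if norms[i] == 0:
            continue  # monomorphic variant, nothing to correlate
        recent = kept[-window:]
        if recent:
            r = centered[recent] @ centered[i] / (norms[recent] * norms[i])
            if np.any(r ** 2 >= threshold):
                continue  # in LD with a variant already kept
        kept.append(i)
    return np.array(kept, dtype=int)


# Toy data in which every variant is immediately followed by an exact copy.
rng = np.random.default_rng(0)
base = rng.binomial(2, 0.5, size=(100, 100))
gn = np.repeat(base, 2, axis=0)

kept = ld_prune(gn, window=50, threshold=0.5)
gn_pruned = gn[kept]
```

In this toy input the second copy of each variant is always dropped, because it has r² = 1 with the first copy.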

allel.stats.decomposition.randomized_pca(gn, n_components=10, copy=True, iterated_power=3, random_state=None, scaler='patterson', ploidy=2)

Perform principal components analysis of genotype data, via an approximate truncated singular value decomposition using randomization to speed up the computation.

Parameters:

gn : array_like, float, shape (n_variants, n_samples)

Genotypes at biallelic variants, coded as the number of alternate alleles per call (i.e., 0 = hom ref, 1 = het, 2 = hom alt).

n_components : int, optional

Number of components to keep.

copy : bool, optional

If False, the input array is overwritten in place during fitting instead of being copied.

iterated_power : int, optional

Number of iterations for the power method.

random_state : int or RandomState instance or None (default)

Seed control for the pseudo-random number generator. If None, use the numpy.random singleton.

scaler : {‘patterson’, ‘standard’, None}

Scaling method: ‘patterson’ applies the method of Patterson et al. (2006); ‘standard’ scales to unit variance; None centers the data only.

ploidy : int, optional

Sample ploidy, only relevant if ‘patterson’ scaler is used.

Returns:

coords : ndarray, float, shape (n_samples, n_components)

Transformed coordinates for the samples.

model : GenotypeRandomizedPCA

Model instance containing the variance ratio explained and the stored components (a.k.a., loadings). Can be used to project further data into the same principal components space via the transform() method.

Notes

Before calling this function, filter the genotype data to remove variants in linkage disequilibrium.

Based on the sklearn.decomposition.RandomizedPCA implementation.
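The general idea behind a randomized truncated SVD with power iterations can be sketched as follows. This is a generic NumPy illustration of the technique, not the code used by this function or by scikit-learn; the function name randomized_svd and the n_oversamples and seed parameters are assumptions made for the example, and only iterated_power mirrors a parameter documented above.

```python
import numpy as np


def randomized_svd(x, n_components, n_oversamples=10, iterated_power=3, seed=None):
    """Approximate truncated SVD of x via random projection (sketch)."""
    rng = np.random.default_rng(seed)
    n_random = n_components + n_oversamples
    # Sample the range of x with a random Gaussian test matrix.
    y = x @ rng.normal(size=(x.shape[1], n_random))
    # Power iterations sharpen the separation between large and small
    # singular values; re-orthonormalising keeps them numerically stable.
    for _ in range(iterated_power):
        q, _r = np.linalg.qr(y)
        y = x @ (x.T @ q)
    q, _r = np.linalg.qr(y)
    # Exact SVD of the much smaller projected matrix.
    b = q.T @ x
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_b
    k = n_components
    return u[:, :k], s[:k], vt[:k]


# Toy matrix of exact rank 5, with samples as rows.
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 500))

u, s, vt = randomized_svd(x, n_components=3, seed=0)
coords = u * s  # sample coordinates on the first 3 components
s_exact = np.linalg.svd(x, compute_uv=False)[:3]
```

Because only a thin matrix of n_components + n_oversamples columns is ever orthonormalised or decomposed, the cost grows with the number of components requested rather than with the full rank of the data.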