Pairwise distance and ordination

allel.stats.distance.pairwise_distance(x, metric)

Compute pairwise distance between individuals (e.g., samples or haplotypes).

Parameters:

x : array_like, shape (n, m, ...)

Array of m observations (e.g., samples or haplotypes) in a space with n dimensions (e.g., variants). Note that the first two dimensions are swapped relative to scipy.spatial.distance.pdist, which expects observations in rows, i.e., shape (m, n).

metric : string or function

Distance metric. See documentation for the function scipy.spatial.distance.pdist() for a list of built-in distance metrics.

Returns:

dist : ndarray, shape (m * (m - 1) / 2,)

Distance matrix in condensed form.

Notes

If x is a bcolz carray, a chunk-wise implementation will be used to avoid loading the entire input array into memory. This means that a distance matrix will be calculated for each chunk in the input array, and the results will be summed to produce the final output. For some distance metrics this will return a different result from the standard implementation, although the relative distances may be equivalent.
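
The effect described in this note can be reproduced without bcolz. The following is a minimal pure-Python sketch, not the library's implementation: the helper names (condensed, chunked) and the toy data are assumptions made for illustration. It splits the variants into chunks, computes a condensed distance matrix per chunk, and sums the results.

```python
import math

# Toy data laid out like the `x` argument: rows are variants (dimensions),
# columns are samples (observations). Values are made up for illustration.
x = [
    [0, 1, 2],
    [1, 2, 2],
    [2, 2, 0],
    [0, 0, 1],
]


def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def condensed(rows, metric):
    """Condensed pairwise distances between the columns of `rows`."""
    cols = list(zip(*rows))
    m = len(cols)
    return [metric(cols[i], cols[j]) for i in range(m) for j in range(i + 1, m)]


def chunked(rows, metric, chunk_size):
    """Sum the condensed distances computed separately on each chunk of rows."""
    total = None
    for k in range(0, len(rows), chunk_size):
        part = condensed(rows[k:k + chunk_size], metric)
        total = part if total is None else [a + b for a, b in zip(total, part)]
    return total


full_cb = condensed(x, cityblock)
chunk_cb = chunked(x, cityblock, chunk_size=2)
full_eu = condensed(x, euclidean)
chunk_eu = chunked(x, euclidean, chunk_size=2)

print(full_cb, chunk_cb)  # identical for the city-block metric
print(full_eu, chunk_eu)  # not identical for the Euclidean metric
```

For the city-block metric the chunk-wise sum equals the full computation, because the metric is itself a sum over variants. For the Euclidean metric the chunk-wise result is a sum of per-chunk square roots, which is never smaller than the square root taken over all variants and is generally larger.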

Examples

>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 1], [1, 1]],
...                                [[0, 1], [1, 1], [1, 2]],
...                                [[0, 2], [2, 2], [-1, -1]]])
>>> d = allel.stats.pairwise_distance(g.to_n_alt(), metric='cityblock')
>>> d
array([ 3.,  4.,  3.])
>>> import scipy.spatial
>>> scipy.spatial.distance.squareform(d)
array([[ 0.,  3.,  4.],
       [ 3.,  0.,  3.],
       [ 4.,  3.,  0.]])

allel.stats.distance.pairwise_dxy(pos, gac, start=None, stop=None, is_accessible=None)

Convenience function to calculate a pairwise distance matrix using nucleotide divergence (a.k.a. Dxy) as the distance metric.

Parameters:

pos : array_like, int, shape (n_variants,)

Variant positions.

gac : array_like, int, shape (n_variants, n_samples, n_alleles)

Per-genotype allele counts.

start : int, optional

Start position of region to use.

stop : int, optional

Stop position of region to use.

is_accessible : array_like, bool, shape (len(contig),), optional

Boolean array indicating accessibility status for all positions in the chromosome/contig.

Returns:

dist : ndarray

Distance matrix in condensed form.

See also

allel.model.GenotypeArray.to_allele_counts

allel.stats.distance.pcoa(dist)

Perform principal coordinate analysis of a distance matrix, a.k.a. classical multi-dimensional scaling.

Parameters:

dist : array_like

Distance matrix in condensed form.

Returns:

coords : ndarray, shape (n_samples, n_dimensions)

Transformed coordinates for the samples.

explained_ratio : ndarray, shape (n_dimensions,)

Proportion of variance explained by each dimension.
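
To make the transformation concrete, here is a minimal NumPy sketch of classical multi-dimensional scaling. It is an illustration under stated assumptions, not the library's code: the function name pcoa_sketch is hypothetical, the condensed input is assumed to follow the same ordering as scipy.spatial.distance.pdist, and axes whose eigenvalues are not clearly positive are simply dropped, which may differ from what pcoa itself does.

```python
import numpy as np


def pcoa_sketch(dist_condensed):
    """Classical MDS on a condensed distance matrix (hypothetical helper)."""
    d = np.asarray(dist_condensed, dtype=float)
    # Recover the number of samples m from the condensed length m * (m - 1) / 2.
    m = int(round((1 + np.sqrt(1 + 8 * d.size)) / 2))
    # Expand to a symmetric square matrix (row-major upper triangle).
    sq = np.zeros((m, m))
    sq[np.triu_indices(m, k=1)] = d
    sq = sq + sq.T
    # Double-centre -0.5 * D**2 so that rows and columns have zero mean.
    e = -0.5 * sq ** 2
    f = e - e.mean(axis=0, keepdims=True) - e.mean(axis=1, keepdims=True) + e.mean()
    # Eigendecomposition of the symmetric centred matrix, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(f)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumption of this sketch: keep only clearly positive eigenvalues.
    keep = eigvals > 1e-10 * eigvals.max()
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    # Scale each eigenvector by the square root of its eigenvalue.
    coords = eigvecs * np.sqrt(eigvals)
    explained_ratio = eigvals / eigvals.sum()
    return coords, explained_ratio


# Condensed distances for three samples: d(0,1)=3, d(0,2)=4, d(1,2)=3.
coords, explained_ratio = pcoa_sketch([3.0, 4.0, 3.0])
print(coords.shape)
print(explained_ratio)
```

Because these three distances form a valid triangle, the Euclidean distances between the rows of coords reproduce the input distances, and the explained ratios sum to 1.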

allel.stats.distance.condensed_coords(i, j, n)

Transform square distance matrix coordinates to the corresponding index into a condensed, 1D form of the matrix.

Parameters:

i : int

Row index.

j : int

Column index.

n : int

Size of the square matrix (length of first or second dimension).

Returns:

ix : int

Index into the condensed form of the matrix.
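
The mapping from square coordinates to a condensed index can be sketched in a few lines. This is an illustration only: condensed_index is a hypothetical name, and the sketch assumes the condensed form stores the upper triangle row by row, as scipy.spatial.distance.pdist does.

```python
def condensed_index(i, j, n):
    """Index of square-matrix entry (i, j) in the condensed form (hypothetical helper)."""
    if i == j or not (0 <= i < n and 0 <= j < n):
        raise ValueError("invalid coordinates: (%d, %d)" % (i, j))
    # The matrix is symmetric, so order the pair such that i < j.
    i, j = min(i, j), max(i, j)
    # Number of entries in the rows above row i, plus the offset within row i.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
indices = [condensed_index(i, j, n) for i, j in pairs]
print(indices)  # one index per pair, in upper-triangle row-major order
```

Under that assumption, the six pairs of a 4 x 4 matrix map to indices 0 through 5, and swapping i and j gives the same index.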

allel.stats.distance.condensed_coords_within(pop, n)

Return indices into a condensed distance matrix for all pairwise comparisons within the given population.

Parameters:

pop : array_like, int

Indices of samples or haplotypes within the population.

n : int

Size of the square matrix (length of first or second dimension).

Returns:

indices : ndarray, int

Indices into the condensed distance matrix.

allel.stats.distance.condensed_coords_between(pop1, pop2, n)

Return indices into a condensed distance matrix for all pairwise comparisons between two populations.

Parameters:

pop1 : array_like, int

Indices of samples or haplotypes within the first population.

pop2 : array_like, int

Indices of samples or haplotypes within the second population.

n : int

Size of the square matrix (length of first or second dimension).

Returns:

indices : ndarray, int

Indices into the condensed distance matrix.
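
The within-population and between-population index sets can be illustrated with a small pure-Python sketch. The names condensed_index, within and between are hypothetical, the condensed form is assumed to follow the ordering used by scipy.spatial.distance.pdist, and the two populations are assumed not to share any sample.

```python
from itertools import combinations, product


def condensed_index(i, j, n):
    """Index of square-matrix entry (i, j) in the condensed form (hypothetical helper)."""
    if i == j or not (0 <= i < n and 0 <= j < n):
        raise ValueError("invalid coordinates: (%d, %d)" % (i, j))
    i, j = min(i, j), max(i, j)
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def within(pop, n):
    """Condensed indices for every pair of samples inside one population."""
    return [condensed_index(i, j, n) for i, j in combinations(sorted(pop), 2)]


def between(pop1, pop2, n):
    """Condensed indices for every pair with one sample from each population."""
    return [condensed_index(i, j, n) for i, j in product(sorted(pop1), sorted(pop2))]


n = 5                       # five samples in total
pop1, pop2 = [0, 1, 2], [3, 4]
within1 = within(pop1, n)
within2 = within(pop2, n)
between12 = between(pop1, pop2, n)
print(within1, within2, between12)
```

When the two populations together cover all n samples, the within and between index sets partition the full condensed matrix, so indexing a condensed distance array with them separates within-population from between-population distances.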