Pairwise distance

allel.stats.distance.pairwise_distance(x, metric)

Compute pairwise distance between individuals (e.g., samples or haplotypes).

Parameters:

x : array_like, shape (n, m, ...)

Array of m observations (e.g., samples or haplotypes) in a space with n dimensions (e.g., variants). Note that the order of the first two dimensions is swapped compared to what is expected by scipy.spatial.distance.pdist.

metric : string or function

Distance metric. See documentation for the function scipy.spatial.distance.pdist() for a list of built-in distance metrics.
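The axis convention described for x can be illustrated with SciPy alone. This is a minimal sketch, assuming a small hand-written array of alternate-allele counts with variants along the first axis and samples along the second; it only shows the transpose that scipy.spatial.distance.pdist needs and makes no claim about how pairwise_distance is implemented internally.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical input: 3 variants (rows) by 3 samples (columns),
# i.e. the (n, m) layout described for the x parameter.
x = np.array([[0, 1, 2],
              [1, 2, 2],
              [1, 2, 0]])

# pdist expects one observation per ROW, so the samples must be
# moved to the first axis before calling it directly.
d = pdist(x.T, metric='cityblock')
print(d)
```

The three values are the distances for the sample pairs (0, 1), (0, 2) and (1, 2), in that order.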

Returns:

dist : ndarray, shape (n_individuals * (n_individuals - 1) / 2,)

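Distance matrix in condensed form, i.e., the upper triangle of the square distance matrix flattened row by row (the same layout that scipy.spatial.distance.pdist returns).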
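To read a single pair out of a condensed distance array without expanding it to a square matrix, the index arithmetic documented for SciPy's condensed layout can be used. The helper name condensed_index below is hypothetical and not part of allel or SciPy.

```python
def condensed_index(i, j, m):
    """Position of the pair (i, j) in a condensed distance array
    built from m individuals (hypothetical helper)."""
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i  # the layout only stores pairs with i < j
    return m * i + j - ((i + 2) * (i + 1)) // 2

# Condensed distances for 3 individuals, pairs (0, 1), (0, 2), (1, 2).
d = [3.0, 4.0, 3.0]
m = 3
d_02 = d[condensed_index(0, 2, m)]
print(d_02)
```

For large numbers of individuals this avoids allocating the full m-by-m matrix when only a few pairs are needed.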

Notes

If x is a bcolz carray, a chunk-wise implementation is used to avoid loading the entire input array into memory: a distance matrix is calculated for each chunk of the input array, and the per-chunk results are summed to produce the final output. For metrics that are not additive across dimensions (e.g., 'euclidean'), this returns a different result from the standard implementation, although the relative distances may be equivalent.
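The effect of summing per-chunk distances can be reproduced with plain NumPy arrays. The following is a sketch under stated assumptions: chunked_pdist is a hypothetical stand-in for the chunk-wise strategy described above (it does not use bcolz and is not the library's code), and chunks are taken along the first (variants) axis.

```python
import numpy as np
from scipy.spatial.distance import pdist

def chunked_pdist(x, metric, chunk_size):
    """Sum condensed distance arrays computed on successive blocks
    of rows (hypothetical illustration of a chunk-wise strategy)."""
    x = np.asarray(x)
    total = None
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        d = pdist(chunk.T, metric=metric)  # samples as observations
        total = d if total is None else total + d
    return total

x = np.array([[0, 1, 2],
              [1, 2, 2],
              [1, 2, 0]])

# City-block distance is a sum over dimensions, so chunking is exact.
city_full = pdist(x.T, metric='cityblock')
city_chunked = chunked_pdist(x, 'cityblock', chunk_size=2)

# Euclidean distance takes a square root over all dimensions at once,
# so summing per-chunk distances gives different values.
eucl_full = pdist(x.T, metric='euclidean')
eucl_chunked = chunked_pdist(x, 'euclidean', chunk_size=2)

print(np.allclose(city_full, city_chunked))
print(np.allclose(eucl_full, eucl_chunked))
```

If exact agreement with the in-memory result matters, a metric that decomposes as a sum over variants (such as 'cityblock') is the safer choice for chunked input.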

Examples

>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 1], [1, 1]],
...                                [[0, 1], [1, 1], [1, 2]],
...                                [[0, 2], [2, 2], [-1, -1]]])
>>> d = allel.stats.pairwise_distance(g.to_n_alt(), metric='cityblock')
>>> d
array([ 3.,  4.,  3.])
>>> import scipy.spatial
>>> scipy.spatial.distance.squareform(d)
array([[ 0.,  3.,  4.],
       [ 3.,  0.,  3.],
       [ 4.,  3.,  0.]])

allel.stats.distance.pairwise_dxy(pos, gac, start=None, stop=None, is_accessible=None)

Convenience function to calculate a pairwise distance matrix using nucleotide divergence (a.k.a. Dxy) as the distance metric.

Parameters:

pos : array_like, int, shape (n_variants,)

Variant positions.

gac : array_like, int, shape (n_variants, n_samples, n_alleles)

Per-genotype allele counts.

start : int, optional

Start position of region to use.

stop : int, optional

Stop position of region to use.

is_accessible : array_like, bool, shape (len(contig),), optional

Boolean array indicating accessibility status for all positions in the chromosome/contig.

Returns:

dist : ndarray

Distance matrix in condensed form.
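To make the inputs concrete, here is a NumPy-only sketch of a pairwise Dxy calculation. It assumes the textbook definition of nucleotide divergence (mean proportion of differing allele pairs per site, summed over sites and divided by the number of accessible bases), 1-based inclusive positions, and that sites where either sample has no called alleles are skipped. The names dxy_pair and pairwise_dxy_sketch are hypothetical, and the library's own handling of missing data and accessibility may differ.

```python
import itertools
import numpy as np

def dxy_pair(ac1, ac2, n_bases):
    """Divergence between two samples from per-site allele counts
    of shape (n_variants, n_alleles). Hypothetical helper."""
    ac1 = np.asarray(ac1, dtype=float)
    ac2 = np.asarray(ac2, dtype=float)
    n_pairs = ac1.sum(axis=1) * ac2.sum(axis=1)  # allele pairs per site
    n_same = (ac1 * ac2).sum(axis=1)             # pairs with equal alleles
    called = n_pairs > 0                         # skip sites with missing calls
    mpd = (n_pairs[called] - n_same[called]) / n_pairs[called]
    return mpd.sum() / n_bases

def pairwise_dxy_sketch(pos, gac, start, stop, is_accessible=None):
    """Condensed Dxy matrix over all sample pairs (assumed behaviour)."""
    pos = np.asarray(pos)
    gac = np.asarray(gac)[(pos >= start) & (pos <= stop)]
    if is_accessible is None:
        n_bases = stop - start + 1
    else:
        # Assumption: position p maps to index p - 1 in is_accessible.
        n_bases = np.count_nonzero(np.asarray(is_accessible)[start - 1:stop])
    pairs = itertools.combinations(range(gac.shape[1]), 2)
    return np.array([dxy_pair(gac[:, i], gac[:, j], n_bases) for i, j in pairs])

# 3 variants, 3 diploid samples, 3 alleles; the last genotype is missing.
pos = [2, 5, 9]
gac = [[[2, 0, 0], [1, 1, 0], [0, 2, 0]],
       [[1, 1, 0], [0, 2, 0], [0, 1, 1]],
       [[1, 0, 1], [0, 0, 2], [0, 0, 0]]]
dist = pairwise_dxy_sketch(pos, gac, start=1, stop=10)
print(dist)
```

The pairs come out in the order (0, 1), (0, 2), (1, 2), so the result can be passed to scipy.spatial.distance.squareform in the same way as the output of pairwise_distance.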