Statistics¶
This module provides statistical functions for use with variant call data.
Diversity & divergence¶
- allel.stats.mean_pairwise_diversity(ac, fill=nan)[source]¶
Calculate for each variant the mean number of pairwise differences between haplotypes within a single population.
Parameters: ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide diversity, a.k.a. pi.
Examples
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac = h.count_alleles() >>> allel.stats.mean_pairwise_diversity(ac) array([ 0. , 0.5 , 0.66666667, 0.5 , 0. , 0.83333333, 0.83333333, 1. ])
- allel.stats.sequence_diversity(pos, ac, start=None, stop=None, is_accessible=None)[source]¶
Calculate nucleotide diversity within a given region.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: pi : ndarray, float, shape (n_windows,)
Nucleotide diversity.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi = allel.stats.sequence_diversity(pos, ac, start=1, stop=31) >>> pi 0.13978494623655915
- allel.stats.windowed_diversity(pos, ac, size, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶
Calculate nucleotide diversity in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
The value to use where a window is completely inaccessible.
Returns: pi : ndarray, float, shape (n_windows,)
Nucleotide diversity in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi, windows, n_bases, counts = allel.stats.windowed_diversity( ... pos, ac, size=10, start=1, stop=31 ... ) >>> pi array([ 0.11666667, 0.21666667, 0.09090909]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
- allel.stats.mean_pairwise_divergence(ac1, ac2, fill=nan)[source]¶
Calculate for each variant the mean number of pairwise differences between haplotypes from two different populations.
Parameters: ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the second population.
fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide divergence between two populations, a.k.a. Dxy.
Examples
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac1 = h.take([0, 1], axis=1).count_alleles() >>> ac2 = h.take([2, 3], axis=1).count_alleles() >>> allel.stats.mean_pairwise_divergence(ac1, ac2) array([ 0. , 0.5 , 1. , 0.5 , 0. , 1. , 0.75, nan])
- allel.stats.sequence_divergence(pos, ac1, ac2, start=None, stop=None, is_accessible=None)[source]¶
Calculate nucleotide divergence between two populations within a given region.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: Dxy : ndarray, float, shape (n_windows,)
Nucleotide divergence.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> h1 = h.subset(haplotypes=[0, 1]) >>> h2 = h.subset(haplotypes=[2, 3]) >>> ac1 = h1.count_alleles() >>> ac2 = h2.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy = sequence_divergence(pos, ac1, ac2, start=1, stop=31) >>> dxy 0.12096774193548387
- allel.stats.windowed_divergence(pos, ac1, ac2, size, start=None, stop=None, step=None, is_accessible=None, fill=nan)[source]¶
Calculate nucleotide divergence between two populations in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
The value to use where a window is completely inaccessible.
Returns: Dxy : ndarray, float, shape (n_windows,)
Nucleotide divergence in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> h1 = h.subset(haplotypes=[0, 1]) >>> h2 = h.subset(haplotypes=[2, 3]) >>> ac1 = h1.count_alleles() >>> ac2 = h2.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy, windows, n_bases, counts = windowed_divergence( ... pos, ac1, ac2, size=10, start=1, stop=31 ... ) >>> dxy array([ 0.15 , 0.225, 0. ]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
Pairwise distance¶
- allel.stats.pairwise_distance(x, metric)[source]¶
Compute pairwise distance between individuals (e.g., samples or haplotypes).
Parameters: x : array_like, shape (n, m, ...)
Array of m observations (e.g., samples or haplotypes) in a space with n dimensions (e.g., variants). Note that the order of the first two dimensions is swapped compared to what is expected by scipy.spatial.distance.pdist.
metric : string or function
Distance metric. See documentation for the function scipy.spatial.distance.pdist() for a list of built-in distance metrics.
Returns: dist : ndarray, shape (n_individuals * (n_individuals - 1) / 2,)
Distance matrix in condensed form.
See also
Notes
If x is a bcolz carray, a chunk-wise implementation will be used to avoid loading the entire input array into memory. This means that a distance matrix will be calculated for each chunk in the input array, and the results will be summed to produce the final output. For some distance metrics this will return a different result from the standard implementation, although the relative distances may be equivalent.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 1], [1, 1]], ... [[0, 1], [1, 1], [1, 2]], ... [[0, 2], [2, 2], [-1, -1]]]) >>> d = allel.stats.pairwise_distance(g.to_n_alt(), metric='cityblock') >>> d array([ 3., 4., 3.]) >>> import scipy.spatial >>> scipy.spatial.distance.squareform(d) array([[ 0., 3., 4.], [ 3., 0., 3.], [ 4., 3., 0.]])
Hardy-Weinberg equilibrium¶
- allel.stats.heterozygosity_observed(g, fill=nan)[source]¶
Calculate the rate of observed heterozygosity for each variant.
Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype array.
fill : float, optional
Use this value for variants where all calls are missing.
Returns: ho : ndarray, float, shape (n_variants,)
Observed heterozygosity
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> allel.stats.heterozygosity_observed(g) array([ 0. , 0.33333333, 0. , 0.5 ])
- allel.stats.heterozygosity_expected(af, ploidy, fill=nan)[source]¶
Calculate the expected rate of heterozygosity for each variant under Hardy-Weinberg equilibrium.
Parameters: af : array_like, float, shape (n_variants, n_alleles)
Allele frequencies array.
fill : float, optional
Use this value for variants where allele frequencies do not sum to 1.
Returns: he : ndarray, float, shape (n_variants,)
Expected heterozygosity
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> af = g.count_alleles().to_frequencies() >>> allel.stats.heterozygosity_expected(af, ploidy=2) array([ 0. , 0.5 , 0.66666667, 0.375 ])
- allel.stats.inbreeding_coefficient(g, fill=nan)[source]¶
Calculate the inbreeding coefficient for each variant.
Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype array.
fill : float, optional
Use this value for variants where the expected heterozygosity is zero.
Returns: f : ndarray, float, shape (n_variants,)
Inbreeding coefficient.
Notes
The inbreeding coefficient is calculated as 1 - (Ho/He) where Ho is the observed heterozygosity and He is the expected heterozygosity.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> allel.stats.inbreeding_coefficient(g) array([ nan, 0.33333333, 1. , -0.33333333])
Window utilities¶
- allel.stats.moving_statistic(values, statistic, size, start=0, stop=None, step=None)[source]¶
Calculate a statistic in a moving window over values.
Parameters: values : array_like
The data to summarise.
statistic : function
The statistic to compute within each window.
size : int
The window size (number of values).
start : int, optional
The index at which to start.
stop : int, optional
The index at which to stop.
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
Returns: out : ndarray, shape (n_windows,)
Examples
>>> import allel >>> values = [2, 5, 8, 16] >>> allel.stats.moving_statistic(values, np.sum, size=2) array([ 7, 24]) >>> allel.stats.moving_statistic(values, np.sum, size=2, step=1) array([ 7, 13, 24])
- allel.stats.windowed_count(pos, size=None, start=None, stop=None, step=None, windows=None)[source]¶
Count the number of items in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
The item positions in ascending order, using 1-based coordinates..
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
Returns: counts : ndarray, int, shape (n_windows,)
The number of items in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
Notes
The window stop positions are included within a window.
The final window will be truncated to the specified stop position, and so may be smaller than the other windows.
Examples
Non-overlapping windows:
>>> import allel >>> pos = [1, 7, 12, 15, 28] >>> counts, windows = allel.stats.windowed_count(pos, size=10) >>> counts array([2, 2, 1]) >>> windows array([[ 1, 10], [11, 20], [21, 28]])
Half-overlapping windows:
>>> counts, windows = allel.stats.windowed_count(pos, size=10, step=5) >>> counts array([2, 3, 2, 0, 1]) >>> windows array([[ 1, 10], [ 6, 15], [11, 20], [16, 25], [21, 28]])
- allel.stats.windowed_statistic(pos, values, statistic, size, start=None, stop=None, step=None, windows=None, fill=nan)[source]¶
Calculate a statistic from items in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
The item positions in ascending order, using 1-based coordinates..
values : array_like, int, shape (n_items,)
The values to summarise.
statistic : function
The statistic to compute.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
fill : object, optional
The value to use where a window is empty, i.e., contains no items.
Returns: out : ndarray, shape (n_windows,)
The value of the statistic for each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
counts : ndarray, int, shape (n_windows,)
The number of items in each window.
Notes
The window stop positions are included within a window.
The final window will be truncated to the specified stop position, and so may be smaller than the other windows.
Examples
Count non-zero (i.e., True) items in non-overlapping windows:
>>> import allel >>> pos = [1, 7, 12, 15, 28] >>> values = [True, False, True, False, False] >>> nnz, windows, counts = allel.stats.windowed_statistic( ... pos, values, statistic=np.count_nonzero, size=10 ... ) >>> nnz array([1, 1, 0]) >>> windows array([[ 1, 10], [11, 20], [21, 28]]) >>> counts array([2, 2, 1])
Compute a sum over items in half-overlapping windows:
>>> values = [3, 4, 2, 6, 9] >>> x, windows, counts = allel.stats.windowed_statistic( ... pos, values, statistic=np.sum, size=10, step=5, fill=0 ... ) >>> x array([ 7, 12, 8, 0, 9]) >>> windows array([[ 1, 10], [ 6, 15], [11, 20], [16, 25], [21, 28]]) >>> counts array([2, 3, 2, 0, 1])
- allel.stats.per_base(x, windows, is_accessible=None, fill=nan)[source]¶
Calculate the per-base value of a windowed statistic.
Parameters: x : array_like, shape (n_windows,)
The statistic to average per-base.
windows : array_like, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions using 1-based coordinates.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
Use this value where there are no accessible bases in a window.
Returns: y : ndarray, float, shape (n_windows,)
The input array divided by the number of (accessible) bases in each window.
n_bases : ndarray, int, shape (n_windows,)
The number of (accessible) bases in each window