Diversity & divergence¶
Nucleotide diversity & divergence¶
- allel.stats.diversity.mean_pairwise_difference(ac, an=None, fill=nan)[source]¶
Calculate for each variant the mean number of pairwise differences between chromosomes sampled from within a single population.
Parameters: ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
an : array_like, int, shape (n_variants,), optional
Allele numbers. If not provided, will be calculated from ac.
fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide diversity, a.k.a. pi.
Examples
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac = h.count_alleles() >>> allel.stats.mean_pairwise_difference(ac) array([ 0. , 0.5 , 0.66666667, 0.5 , 0. , 0.83333333, 0.83333333, 1. ])
- allel.stats.diversity.mean_pairwise_difference_between(ac1, ac2, an1=None, an2=None, fill=nan)[source]¶
Calculate for each variant the mean number of pairwise differences between chromosomes sampled from two different populations.
Parameters: ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the second population.
an1 : array_like, int, shape (n_variants,), optional
Allele numbers for the first population. If not provided, will be calculated from ac1.
an2 : array_like, int, shape (n_variants,), optional
Allele numbers for the second population. If not provided, will be calculated from ac2.
fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide divergence between two populations, a.k.a. Dxy.
Examples
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> allel.stats.mean_pairwise_difference_between(ac1, ac2) array([ 0. , 0.5 , 1. , 0.5 , 0. , 1. , 0.75, nan])
- allel.stats.diversity.sequence_diversity(pos, ac, start=None, stop=None, is_accessible=None)[source]¶
Estimate nucleotide diversity within a given region.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: pi : ndarray, float, shape (n_windows,)
Nucleotide diversity.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi = allel.stats.sequence_diversity(pos, ac, start=1, stop=31) >>> pi 0.13978494623655915
- allel.stats.diversity.sequence_divergence(pos, ac1, ac2, an1=None, an2=None, start=None, stop=None, is_accessible=None)[source]¶
Estimate nucleotide divergence between two populations within a given region.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: Dxy : ndarray, float, shape (n_windows,)
Nucleotide divergence.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy = sequence_divergence(pos, ac1, ac2, start=1, stop=31) >>> dxy 0.12096774193548387
- allel.stats.diversity.windowed_diversity(pos, ac, size, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶
Estimate nucleotide diversity in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
The value to use where a window is completely inaccessible.
Returns: pi : ndarray, float, shape (n_windows,)
Nucleotide diversity in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi, windows, n_bases, counts = allel.stats.windowed_diversity( ... pos, ac, size=10, start=1, stop=31 ... ) >>> pi array([ 0.11666667, 0.21666667, 0.09090909]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
- allel.stats.diversity.windowed_divergence(pos, ac1, ac2, size, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶
Estimate nucleotide divergence between two populations in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
The value to use where a window is completely inaccessible.
Returns: Dxy : ndarray, float, shape (n_windows,)
Nucleotide divergence in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.model.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy, windows, n_bases, counts = windowed_divergence( ... pos, ac1, ac2, size=10, start=1, stop=31 ... ) >>> dxy array([ 0.15 , 0.225, 0. ]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
F-statistics¶
- allel.stats.diversity.weir_cockerham_fst(g, subpops, max_allele=None)[source]¶
Compute the variance components from the analyses of variance of allele frequencies according to Weir and Cockerham (1984).
Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype array.
subpops : sequence of sequences of ints
Sample indices for each subpopulation.
max_allele : int, optional
The highest allele index to consider.
Returns: a : ndarray, float, shape (n_variants, n_alleles)
Component of variance between populations.
b : ndarray, float, shape (n_variants, n_alleles)
Component of variance between individuals within populations.
c : ndarray, float, shape (n_variants, n_alleles)
Component of variance between gametes within individuals.
Examples
Calculate variance components from some genotype data:
>>> import allel >>> g = [[[0, 0], [0, 0], [1, 1], [1, 1]], ... [[0, 1], [0, 1], [0, 1], [0, 1]], ... [[0, 0], [0, 0], [0, 0], [0, 0]], ... [[0, 1], [1, 2], [1, 1], [2, 2]], ... [[0, 0], [1, 1], [0, 1], [-1, -1]]] >>> subpops = [[0, 1], [2, 3]] >>> a, b, c = allel.stats.weir_cockerham_fst(g, subpops) >>> a array([[ 0.5 , 0.5 , 0. ], [ 0. , 0. , 0. ], [ 0. , 0. , 0. ], [ 0. , -0.125, -0.125], [-0.375, -0.375, 0. ]]) >>> b array([[ 0. , 0. , 0. ], [-0.25 , -0.25 , 0. ], [ 0. , 0. , 0. ], [ 0. , 0.125 , 0.25 ], [ 0.41666667, 0.41666667, 0. ]]) >>> c array([[ 0. , 0. , 0. ], [ 0.5 , 0.5 , 0. ], [ 0. , 0. , 0. ], [ 0.125 , 0.25 , 0.125 ], [ 0.16666667, 0.16666667, 0. ]])
Estimate the parameter theta (a.k.a., Fst) for each variant and each allele individually:
>>> fst = a / (a + b + c) >>> fst array([[ 1. , 1. , nan], [ 0. , 0. , nan], [ nan, nan, nan], [ 0. , -0.5, -0.5], [-1.8, -1.8, nan]])
Estimate Fst for each variant individually (averaging over alleles):
>>> fst = (np.sum(a, axis=1) / ... (np.sum(a, axis=1) + np.sum(b, axis=1) + np.sum(c, axis=1))) >>> fst array([ 1. , 0. , nan, -0.4, -1.8])
Estimate Fst averaging over all variants and alleles:
>>> fst = np.sum(a) / (np.sum(a) + np.sum(b) + np.sum(c)) >>> fst -4.3680905886891398e-17
Note that estimated Fst values may be negative.
- allel.stats.diversity.hudson_fst(ac1, ac2, fill=nan)[source]¶
Calculate the numerator and denominator for Fst estimation using the method of Hudson (1992) elaborated by Bhatia et al. (2013).
Parameters: ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the second population.
fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: num : ndarray, float, shape (n_variants,)
Heterozygosity between the two populations minus average of heterozygosity within each population.
den : ndarray, float, shape (n_variants,)
Heterozygosity between the two populations.
Examples
Calculate numerator and denominator for Fst estimation:
>>> import allel >>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [1, 1], [1, 1]], ... [[0, 1], [0, 1], [0, 1], [0, 1]], ... [[0, 0], [0, 0], [0, 0], [0, 0]], ... [[0, 1], [1, 2], [1, 1], [2, 2]], ... [[0, 0], [1, 1], [0, 1], [-1, -1]]]) >>> subpops = [[0, 1], [2, 3]] >>> ac1 = g.count_alleles(subpop=subpops[0]) >>> ac2 = g.count_alleles(subpop=subpops[1]) >>> num, den = allel.stats.hudson_fst(ac1, ac2) >>> num array([ 1. , -0.16666667, 0. , -0.125 , -0.33333333]) >>> den array([ 1. , 0.5 , 0. , 0.625, 0.5 ])
Estimate Fst for each variant individually:
>>> fst = num / den >>> fst array([ 1. , -0.33333333, nan, -0.2 , -0.66666667])
Estimate Fst averaging over variants:
>>> fst = np.sum(num) / np.sum(den) >>> fst 0.1428571428571429
- allel.stats.diversity.windowed_weir_cockerham_fst(pos, g, subpops, size, start=None, stop=None, step=None, windows=None, fill=nan, max_allele=None)[source]¶
Estimate average Fst in windows over a single chromosome/contig, following the method of Weir and Cockerham (1984).
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
g : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype array.
subpops : sequence of sequences of ints
Sample indices for each subpopulation.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
fill : object, optional
The value to use where there are no variants within a window.
max_allele : int, optional
The highest allele index to consider.
Returns: fst : ndarray, float, shape (n_windows,)
Average Fst in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
- allel.stats.diversity.windowed_hudson_fst(pos, ac1, ac2, size, start=None, stop=None, step=None, windows=None, fill=nan)[source]¶
Estimate average Fst in windows over a single chromosome/contig, following the method of Hudson (1992) elaborated by Bhatia et al. (2013).
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the second population.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
fill : object, optional
The value to use where there are no variants within a window.
Returns: fst : ndarray, float, shape (n_windows,)
Average Fst in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Fixed differences¶
- allel.stats.diversity.windowed_df(pos, ac1, ac2, size, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶
Calculate the density of fixed differences between two populations in windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
size : int
The window size (number of bases).
start : int, optional
The position at which to start (1-based).
stop : int, optional
The position at which to stop (1-based).
step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
fill : object, optional
The value to use where a window is completely inaccessible.
Returns: df : ndarray, float, shape (n_windows,)
Per-base density of fixed differences in each window.
windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
See also