Diversity & divergence¶
-
allel.
mean_pairwise_difference
(ac, an=None, fill=nan)[source]¶ Calculate for each variant the mean number of pairwise differences between chromosomes sampled from within a single population.
Parameters: - ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- an : array_like, int, shape (n_variants,), optional
Allele numbers. If not provided, will be calculated from ac.
- fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: - mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide diversity, a.k.a. pi.
Examples
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac = h.count_alleles() >>> allel.mean_pairwise_difference(ac) array([0. , 0.5 , 0.66666667, 0.5 , 0. , 0.83333333, 0.83333333, 1. ])
-
allel.
sequence_diversity
(pos, ac, start=None, stop=None, is_accessible=None)[source]¶ Estimate nucleotide diversity within a given region, which is the average proportion of sites (including monomorphic sites not present in the data) that differ between randomly chosen pairs of chromosomes.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- start : int, optional
The position at which to start (1-based). Defaults to the first position.
- stop : int, optional
The position at which to stop (1-based). Defaults to the last position.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: - pi : float
Nucleotide diversity.
Notes
If start and/or stop are not provided, uses the difference between the last and the first position as a proxy for the total number of sites, which can overestimate the sequence diversity.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi = allel.sequence_diversity(pos, ac, start=1, stop=31) >>> pi 0.13978494623655915
-
allel.
windowed_diversity
(pos, ac, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶ Estimate nucleotide diversity in windows over a single chromosome/contig.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- size : int, optional
The window size (number of bases).
- start : int, optional
The position at which to start (1-based).
- stop : int, optional
The position at which to stop (1-based).
- step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
- fill : object, optional
The value to use where a window is completely inaccessible.
Returns: - pi : ndarray, float, shape (n_windows,)
Nucleotide diversity in each window.
- windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
- n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
- counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> pi, windows, n_bases, counts = allel.windowed_diversity( ... pos, ac, size=10, start=1, stop=31 ... ) >>> pi array([0.11666667, 0.21666667, 0.09090909]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
-
allel.
mean_pairwise_difference_between
(ac1, ac2, an1=None, an2=None, fill=nan)[source]¶ Calculate for each variant the mean number of pairwise differences between chromosomes sampled from two different populations.
Parameters: - ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the first population.
- ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array from the second population.
- an1 : array_like, int, shape (n_variants,), optional
Allele numbers for the first population. If not provided, will be calculated from ac1.
- an2 : array_like, int, shape (n_variants,), optional
Allele numbers for the second population. If not provided, will be calculated from ac2.
- fill : float
Use this value where there are no pairs to compare (e.g., all allele calls are missing).
Returns: - mpd : ndarray, float, shape (n_variants,)
See also
Notes
The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide divergence between two populations, a.k.a. Dxy.
Examples
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> allel.mean_pairwise_difference_between(ac1, ac2) array([0. , 0.5 , 1. , 0.5 , 0. , 1. , 0.75, nan])
-
allel.
sequence_divergence
(pos, ac1, ac2, an1=None, an2=None, start=None, stop=None, is_accessible=None)[source]¶ Estimate nucleotide divergence between two populations within a given region, which is the average proportion of sites (including monomorphic sites not present in the data) that differ between randomly chosen pairs of chromosomes, one from each population.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
- ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
- an1 : array_like, int, shape (n_variants,), optional
Allele numbers for the first population. If not provided, will be calculated from ac1.
- an2 : array_like, int, shape (n_variants,), optional
Allele numbers for the second population. If not provided, will be calculated from ac2.
- start : int, optional
The position at which to start (1-based). Defaults to the first position.
- stop : int, optional
The position at which to stop (1-based). Defaults to the last position.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: - Dxy : float
Nucleotide divergence.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy = sequence_divergence(pos, ac1, ac2, start=1, stop=31) >>> dxy 0.12096774193548387
-
allel.
windowed_divergence
(pos, ac1, ac2, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶ Estimate nucleotide divergence between two populations in windows over a single chromosome/contig.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
- ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
- size : int, optional
The window size (number of bases).
- start : int, optional
The position at which to start (1-based).
- stop : int, optional
The position at which to stop (1-based).
- step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
- fill : object, optional
The value to use where a window is completely inaccessible.
Returns: - Dxy : ndarray, float, shape (n_windows,)
Nucleotide divergence in each window.
- windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
- n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
- counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
Simplest case, two haplotypes in each population:
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 0], ... [0, 0, 0, 1], ... [0, 0, 1, 1], ... [0, 1, 1, 1], ... [1, 1, 1, 1], ... [0, 0, 1, 2], ... [0, 1, 1, 2], ... [0, 1, -1, -1], ... [-1, -1, -1, -1]]) >>> ac1 = h.count_alleles(subpop=[0, 1]) >>> ac2 = h.count_alleles(subpop=[2, 3]) >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> dxy, windows, n_bases, counts = windowed_divergence( ... pos, ac1, ac2, size=10, start=1, stop=31 ... ) >>> dxy array([0.15 , 0.225, 0. ]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
-
allel.
watterson_theta
(pos, ac, start=None, stop=None, is_accessible=None)[source]¶ Calculate the value of Watterson’s estimator over a given region.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- start : int, optional
The position at which to start (1-based). Defaults to the first position.
- stop : int, optional
The position at which to stop (1-based). Defaults to the last position.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
Returns: - theta_hat_w : float
Watterson’s estimator (theta hat per base).
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> theta_hat_w = allel.watterson_theta(pos, ac, start=1, stop=31) >>> theta_hat_w 0.10557184750733138
-
allel.
windowed_watterson_theta
(pos, ac, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶ Calculate the value of Watterson’s estimator in windows over a single chromosome/contig.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- size : int, optional
The window size (number of bases).
- start : int, optional
The position at which to start (1-based).
- stop : int, optional
The position at which to stop (1-based).
- step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
- fill : object, optional
The value to use where a window is completely inaccessible.
Returns: - theta_hat_w : ndarray, float, shape (n_windows,)
Watterson’s estimator (theta hat per base).
- windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
- n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
- counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> theta_hat_w, windows, n_bases, counts = allel.windowed_watterson_theta( ... pos, ac, size=10, start=1, stop=31 ... ) >>> theta_hat_w array([0.10909091, 0.16363636, 0.04958678]) >>> windows array([[ 1, 10], [11, 20], [21, 31]]) >>> n_bases array([10, 10, 11]) >>> counts array([3, 4, 2])
-
allel.
tajima_d
(ac, pos=None, start=None, stop=None, min_sites=3)[source]¶ Calculate the value of Tajima’s D over a given region.
Parameters: - ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- pos : array_like, int, shape (n_items,), optional
Variant positions, using 1-based coordinates, in ascending order.
- start : int, optional
The position at which to start (1-based). Defaults to the first position.
- stop : int, optional
The position at which to stop (1-based). Defaults to the last position.
- min_sites : int, optional
Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3.
Returns: - D : float
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> allel.tajima_d(ac) 3.1445848780213814 >>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27] >>> allel.tajima_d(ac, pos=pos, start=7, stop=25) 3.8779735196179366
-
allel.
windowed_tajima_d
(pos, ac, size=None, start=None, stop=None, step=None, windows=None, min_sites=3)[source]¶ Calculate the value of Tajima’s D in windows over a single chromosome/contig.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- size : int, optional
The window size (number of bases).
- start : int, optional
The position at which to start (1-based).
- stop : int, optional
The position at which to stop (1-based).
- step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
- min_sites : int, optional
Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3.
Returns: - D : ndarray, float, shape (n_windows,)
Tajima’s D.
- windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
- counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> pos = [2, 4, 7, 14, 15, 20, 22, 25, 27] >>> D, windows, counts = allel.windowed_tajima_d(pos, ac, size=20, step=10, start=1, stop=31) >>> D array([1.36521524, 4.22566622]) >>> windows array([[ 1, 20], [11, 31]]) >>> counts array([6, 6])
-
allel.
moving_tajima_d
(ac, size, start=0, stop=None, step=None, min_sites=3)[source]¶ Calculate the value of Tajima’s D in moving windows of size variants.
Parameters: - ac : array_like, int, shape (n_variants, n_alleles)
Allele counts array.
- size : int
The window size (number of variants).
- start : int, optional
The index at which to start.
- stop : int, optional
The index at which to stop.
- step : int, optional
The number of variants between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- min_sites : int, optional
Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3.
Returns: - d : ndarray, float, shape (n_windows,)
Tajima’s D.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 0], [1, 1]], ... [[0, 1], [1, 1]], ... [[1, 1], [1, 1]], ... [[0, 0], [1, 2]], ... [[0, 1], [1, 2]], ... [[0, 1], [-1, -1]], ... [[-1, -1], [-1, -1]]]) >>> ac = g.count_alleles() >>> D = allel.moving_tajima_d(ac, size=4, step=2) >>> D array([0.1676558 , 2.01186954, 5.70029703])
-
allel.
windowed_df
(pos, ac1, ac2, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]¶ Calculate the density of fixed differences between two populations in windows over a single chromosome/contig.
Parameters: - pos : array_like, int, shape (n_items,)
Variant positions, using 1-based coordinates, in ascending order.
- ac1 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the first population.
- ac2 : array_like, int, shape (n_variants, n_alleles)
Allele counts array for the second population.
- size : int, optional
The window size (number of bases).
- start : int, optional
The position at which to start (1-based).
- stop : int, optional
The position at which to stop (1-based).
- step : int, optional
The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows.
- windows : array_like, int, shape (n_windows, 2), optional
Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters.
- is_accessible : array_like, bool, shape (len(contig),), optional
Boolean array indicating accessibility status for all positions in the chromosome/contig.
- fill : object, optional
The value to use where a window is completely inaccessible.
Returns: - df : ndarray, float, shape (n_windows,)
Per-base density of fixed differences in each window.
- windows : ndarray, int, shape (n_windows, 2)
The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.
- n_bases : ndarray, int, shape (n_windows,)
Number of (accessible) bases in each window.
- counts : ndarray, int, shape (n_windows,)
Number of variants in each window.
See also
allel.model.locate_fixed_differences