Diversity & divergence¶

`allel.``mean_pairwise_difference`(ac, an=None, fill=nan)[source]

Calculate for each variant the mean number of pairwise differences between chromosomes sampled from within a single population.

Parameters: ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. an : array_like, int, shape (n_variants,), optional Allele numbers. If not provided, will be calculated from ac. fill : float Use this value where there are no pairs to compare (e.g., all allele calls are missing). mpd : ndarray, float, shape (n_variants,)

Notes

The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide diversity, a.k.a. pi.

Examples

```>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 0, 0, 1],
...                           [0, 0, 1, 1],
...                           [0, 1, 1, 1],
...                           [1, 1, 1, 1],
...                           [0, 0, 1, 2],
...                           [0, 1, 1, 2],
...                           [0, 1, -1, -1]])
>>> ac = h.count_alleles()
>>> allel.mean_pairwise_difference(ac)
array([0.        , 0.5       , 0.66666667, 0.5       , 0.        ,
0.83333333, 0.83333333, 1.        ])
```
`allel.``sequence_diversity`(pos, ac, start=None, stop=None, is_accessible=None)[source]

Estimate nucleotide diversity within a given region, which is the average proportion of sites (including monomorphic sites not present in the data) that differ between randomly chosen pairs of chromosomes.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. start : int, optional The position at which to start (1-based). Defaults to the first position. stop : int, optional The position at which to stop (1-based). Defaults to the last position. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. pi : float Nucleotide diversity.

Notes

If start and/or stop are not provided, uses the difference between the last and the first position as a proxy for the total number of sites, which can overestimate the sequence diversity.

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> pi = allel.sequence_diversity(pos, ac, start=1, stop=31)
>>> pi
0.13978494623655915
```
`allel.``windowed_diversity`(pos, ac, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]

Estimate nucleotide diversity in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. size : int, optional The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. pi : ndarray, float, shape (n_windows,) Nucleotide diversity in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> pi, windows, n_bases, counts = allel.windowed_diversity(
...     pos, ac, size=10, start=1, stop=31
... )
>>> pi
array([0.11666667, 0.21666667, 0.09090909])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 31]])
>>> n_bases
array([10, 10, 11])
>>> counts
array([3, 4, 2])
```
`allel.``mean_pairwise_difference_between`(ac1, ac2, an1=None, an2=None, fill=nan)[source]

Calculate for each variant the mean number of pairwise differences between chromosomes sampled from two different populations.

Parameters: ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array from the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array from the second population. an1 : array_like, int, shape (n_variants,), optional Allele numbers for the first population. If not provided, will be calculated from ac1. an2 : array_like, int, shape (n_variants,), optional Allele numbers for the second population. If not provided, will be calculated from ac2. fill : float Use this value where there are no pairs to compare (e.g., all allele calls are missing). mpd : ndarray, float, shape (n_variants,)

Notes

The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide divergence between two populations, a.k.a. Dxy.

Examples

```>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 0, 0, 1],
...                           [0, 0, 1, 1],
...                           [0, 1, 1, 1],
...                           [1, 1, 1, 1],
...                           [0, 0, 1, 2],
...                           [0, 1, 1, 2],
...                           [0, 1, -1, -1]])
>>> ac1 = h.count_alleles(subpop=[0, 1])
>>> ac2 = h.count_alleles(subpop=[2, 3])
>>> allel.mean_pairwise_difference_between(ac1, ac2)
array([0.  , 0.5 , 1.  , 0.5 , 0.  , 1.  , 0.75,  nan])
```
`allel.``sequence_divergence`(pos, ac1, ac2, an1=None, an2=None, start=None, stop=None, is_accessible=None)[source]

Estimate nucleotide divergence between two populations within a given region, which is the average proportion of sites (including monomorphic sites not present in the data) that differ between randomly chosen pairs of chromosomes, one from each population.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the second population. an1 : array_like, int, shape (n_variants,), optional Allele numbers for the first population. If not provided, will be calculated from ac1. an2 : array_like, int, shape (n_variants,), optional Allele numbers for the second population. If not provided, will be calculated from ac2. start : int, optional The position at which to start (1-based). Defaults to the first position. stop : int, optional The position at which to stop (1-based). Defaults to the last position. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. Dxy : float Nucleotide divergence.

Examples

Simplest case, two haplotypes in each population:

```>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 0, 0, 1],
...                           [0, 0, 1, 1],
...                           [0, 1, 1, 1],
...                           [1, 1, 1, 1],
...                           [0, 0, 1, 2],
...                           [0, 1, 1, 2],
...                           [0, 1, -1, -1],
...                           [-1, -1, -1, -1]])
>>> ac1 = h.count_alleles(subpop=[0, 1])
>>> ac2 = h.count_alleles(subpop=[2, 3])
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> dxy = sequence_divergence(pos, ac1, ac2, start=1, stop=31)
>>> dxy
0.12096774193548387
```
`allel.``windowed_divergence`(pos, ac1, ac2, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]

Estimate nucleotide divergence between two populations in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the second population. size : int, optional The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. Dxy : ndarray, float, shape (n_windows,) Nucleotide divergence in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

Simplest case, two haplotypes in each population:

```>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 0, 0, 1],
...                           [0, 0, 1, 1],
...                           [0, 1, 1, 1],
...                           [1, 1, 1, 1],
...                           [0, 0, 1, 2],
...                           [0, 1, 1, 2],
...                           [0, 1, -1, -1],
...                           [-1, -1, -1, -1]])
>>> ac1 = h.count_alleles(subpop=[0, 1])
>>> ac2 = h.count_alleles(subpop=[2, 3])
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> dxy, windows, n_bases, counts = windowed_divergence(
...     pos, ac1, ac2, size=10, start=1, stop=31
... )
>>> dxy
array([0.15 , 0.225, 0.   ])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 31]])
>>> n_bases
array([10, 10, 11])
>>> counts
array([3, 4, 2])
```
`allel.``watterson_theta`(pos, ac, start=None, stop=None, is_accessible=None)[source]

Calculate the value of Watterson’s estimator over a given region.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. start : int, optional The position at which to start (1-based). Defaults to the first position. stop : int, optional The position at which to stop (1-based). Defaults to the last position. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. theta_hat_w : float Watterson’s estimator (theta hat per base).

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> theta_hat_w = allel.watterson_theta(pos, ac, start=1, stop=31)
>>> theta_hat_w
0.10557184750733138
```
`allel.``windowed_watterson_theta`(pos, ac, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]

Calculate the value of Watterson’s estimator in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. size : int, optional The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. theta_hat_w : ndarray, float, shape (n_windows,) Watterson’s estimator (theta hat per base). windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> theta_hat_w, windows, n_bases, counts = allel.windowed_watterson_theta(
...     pos, ac, size=10, start=1, stop=31
... )
>>> theta_hat_w
array([0.10909091, 0.16363636, 0.04958678])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 31]])
>>> n_bases
array([10, 10, 11])
>>> counts
array([3, 4, 2])
```
`allel.``tajima_d`(ac, pos=None, start=None, stop=None, min_sites=3)[source]

Calculate the value of Tajima’s D over a given region.

Parameters: ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. pos : array_like, int, shape (n_items,), optional Variant positions, using 1-based coordinates, in ascending order. start : int, optional The position at which to start (1-based). Defaults to the first position. stop : int, optional The position at which to stop (1-based). Defaults to the last position. min_sites : int, optional Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3. D : float

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                    [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> allel.tajima_d(ac)
3.1445848780213814
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> allel.tajima_d(ac, pos=pos, start=7, stop=25)
3.8779735196179366
```
`allel.``windowed_tajima_d`(pos, ac, size=None, start=None, stop=None, step=None, windows=None, min_sites=3)[source]

Calculate the value of Tajima’s D in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. size : int, optional The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. min_sites : int, optional Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3. D : ndarray, float, shape (n_windows,) Tajima’s D. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 20, 22, 25, 27]
>>> D, windows, counts = allel.windowed_tajima_d(pos, ac, size=20, step=10, start=1, stop=31)
>>> D
array([1.36521524, 4.22566622])
>>> windows
array([[ 1, 20],
[11, 31]])
>>> counts
array([6, 6])
```
`allel.``moving_tajima_d`(ac, size, start=0, stop=None, step=None, min_sites=3)[source]

Calculate the value of Tajima’s D in moving windows of size variants.

Parameters: ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. size : int The window size (number of variants). start : int, optional The index at which to start. stop : int, optional The index at which to stop. step : int, optional The number of variants between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. min_sites : int, optional Minimum number of segregating sites for which to calculate a value. If there are fewer, np.nan is returned. Defaults to 3. d : ndarray, float, shape (n_windows,) Tajima’s D.

Examples

```>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 0], [1, 1]],
...                          [[0, 1], [1, 1]],
...                          [[1, 1], [1, 1]],
...                          [[0, 0], [1, 2]],
...                          [[0, 1], [1, 2]],
...                          [[0, 1], [-1, -1]],
...                          [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> D = allel.moving_tajima_d(ac, size=4, step=2)
>>> D
array([0.1676558 , 2.01186954, 5.70029703])
```
`allel.``windowed_df`(pos, ac1, ac2, size=None, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]

Calculate the density of fixed differences between two populations in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the second population. size : int, optional The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. df : ndarray, float, shape (n_windows,) Per-base density of fixed differences in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

See also

`allel.model.locate_fixed_differences`