# Statistics¶

This module provides statistical functions for use with variant call data.

## Diversity & divergence¶

allel.stats.mean_pairwise_diversity(ac, fill=nan)[source]

Calculate for each variant the mean number of pairwise differences between haplotypes within a single population.

Parameters: ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. fill : float Use this value where there are no pairs to compare (e.g., all allele calls are missing). mpd : ndarray, float, shape (n_variants,)

Notes

The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide diversity, a.k.a. pi.

Examples

```>>> import allel
>>> h = allel.model.HaplotypeArray([[0, 0, 0, 0],
...                                 [0, 0, 0, 1],
...                                 [0, 0, 1, 1],
...                                 [0, 1, 1, 1],
...                                 [1, 1, 1, 1],
...                                 [0, 0, 1, 2],
...                                 [0, 1, 1, 2],
...                                 [0, 1, -1, -1]])
>>> ac = h.count_alleles()
>>> allel.stats.mean_pairwise_diversity(ac)
array([ 0.        ,  0.5       ,  0.66666667,  0.5       ,  0.        ,
0.83333333,  0.83333333,  1.        ])
```
allel.stats.sequence_diversity(pos, ac, start=None, stop=None, is_accessible=None)[source]

Calculate nucleotide diversity within a given region.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. pi : ndarray, float, shape (n_windows,) Nucleotide diversity.

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]],
...                                [[0, 0], [0, 1]],
...                                [[0, 0], [1, 1]],
...                                [[0, 1], [1, 1]],
...                                [[1, 1], [1, 1]],
...                                [[0, 0], [1, 2]],
...                                [[0, 1], [1, 2]],
...                                [[0, 1], [-1, -1]],
...                                [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> pi = allel.stats.sequence_diversity(pos, ac, start=1, stop=31)
>>> pi
0.13978494623655915
```
allel.stats.windowed_diversity(pos, ac, size, start=None, stop=None, step=None, windows=None, is_accessible=None, fill=nan)[source]

Calculate nucleotide diversity in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac : array_like, int, shape (n_variants, n_alleles) Allele counts array. size : int The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. pi : ndarray, float, shape (n_windows,) Nucleotide diversity in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 0]],
...                                [[0, 0], [0, 1]],
...                                [[0, 0], [1, 1]],
...                                [[0, 1], [1, 1]],
...                                [[1, 1], [1, 1]],
...                                [[0, 0], [1, 2]],
...                                [[0, 1], [1, 2]],
...                                [[0, 1], [-1, -1]],
...                                [[-1, -1], [-1, -1]]])
>>> ac = g.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> pi, windows, n_bases, counts = allel.stats.windowed_diversity(
...     pos, ac, size=10, start=1, stop=31
... )
>>> pi
array([ 0.11666667,  0.21666667,  0.09090909])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 31]])
>>> n_bases
array([10, 10, 11])
>>> counts
array([3, 4, 2])
```
allel.stats.mean_pairwise_divergence(ac1, ac2, an1=None, an2=None, fill=nan)[source]

Calculate for each variant the mean number of pairwise differences between haplotypes from two different populations.

Parameters: ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array from the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array from the second population. an1 : array_like, int, shape (n_variants,), optional Allele numbers for the first population. If not provided, will be calculated from ac1. an2 : array_like, int, shape (n_variants,), optional Allele numbers for the second population. If not provided, will be calculated from ac2. fill : float Use this value where there are no pairs to compare (e.g., all allele calls are missing). mpd : ndarray, float, shape (n_variants,)

Notes

The values returned by this function can be summed over a genome region and divided by the number of accessible bases to estimate nucleotide divergence between two populations, a.k.a. Dxy.

Examples

```>>> import allel
>>> h = allel.model.HaplotypeArray([[0, 0, 0, 0],
...                                 [0, 0, 0, 1],
...                                 [0, 0, 1, 1],
...                                 [0, 1, 1, 1],
...                                 [1, 1, 1, 1],
...                                 [0, 0, 1, 2],
...                                 [0, 1, 1, 2],
...                                 [0, 1, -1, -1]])
>>> ac1 = h.take([0, 1], axis=1).count_alleles()
>>> ac2 = h.take([2, 3], axis=1).count_alleles()
>>> allel.stats.mean_pairwise_divergence(ac1, ac2)
array([ 0.  ,  0.5 ,  1.  ,  0.5 ,  0.  ,  1.  ,  0.75,   nan])
```
allel.stats.sequence_divergence(pos, ac1, ac2, an1=None, an2=None, start=None, stop=None, is_accessible=None)[source]

Calculate nucleotide divergence between two populations within a given region.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the second population. start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. Dxy : ndarray, float, shape (n_windows,) Nucleotide divergence.

Examples

Simplest case, two haplotypes in each population:

```>>> import allel
>>> h = allel.model.HaplotypeArray([[0, 0, 0, 0],
...                                 [0, 0, 0, 1],
...                                 [0, 0, 1, 1],
...                                 [0, 1, 1, 1],
...                                 [1, 1, 1, 1],
...                                 [0, 0, 1, 2],
...                                 [0, 1, 1, 2],
...                                 [0, 1, -1, -1],
...                                 [-1, -1, -1, -1]])
>>> h1 = h.subset(haplotypes=[0, 1])
>>> h2 = h.subset(haplotypes=[2, 3])
>>> ac1 = h1.count_alleles()
>>> ac2 = h2.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> dxy = sequence_divergence(pos, ac1, ac2, start=1, stop=31)
>>> dxy
0.12096774193548387
```
allel.stats.windowed_divergence(pos, ac1, ac2, size, start=None, stop=None, step=None, is_accessible=None, fill=nan)[source]

Calculate nucleotide divergence between two populations in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) Variant positions, using 1-based coordinates, in ascending order. ac1 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the first population. ac2 : array_like, int, shape (n_variants, n_alleles) Allele counts array for the second population. size : int The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional The value to use where a window is completely inaccessible. Dxy : ndarray, float, shape (n_windows,) Nucleotide divergence in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. n_bases : ndarray, int, shape (n_windows,) Number of (accessible) bases in each window. counts : ndarray, int, shape (n_windows,) Number of variants in each window.

Examples

Simplest case, two haplotypes in each population:

```>>> import allel
>>> h = allel.model.HaplotypeArray([[0, 0, 0, 0],
...                                 [0, 0, 0, 1],
...                                 [0, 0, 1, 1],
...                                 [0, 1, 1, 1],
...                                 [1, 1, 1, 1],
...                                 [0, 0, 1, 2],
...                                 [0, 1, 1, 2],
...                                 [0, 1, -1, -1],
...                                 [-1, -1, -1, -1]])
>>> h1 = h.subset(haplotypes=[0, 1])
>>> h2 = h.subset(haplotypes=[2, 3])
>>> ac1 = h1.count_alleles()
>>> ac2 = h2.count_alleles()
>>> pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
>>> dxy, windows, n_bases, counts = windowed_divergence(
...     pos, ac1, ac2, size=10, start=1, stop=31
... )
>>> dxy
array([ 0.15 ,  0.225,  0.   ])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 31]])
>>> n_bases
array([10, 10, 11])
>>> counts
array([3, 4, 2])
```

## Pairwise distance¶

allel.stats.pairwise_distance(x, metric)[source]

Compute pairwise distance between individuals (e.g., samples or haplotypes).

Parameters: x : array_like, shape (n, m, ...) Array of m observations (e.g., samples or haplotypes) in a space with n dimensions (e.g., variants). Note that the order of the first two dimensions is swapped compared to what is expected by scipy.spatial.distance.pdist. metric : string or function Distance metric. See documentation for the function scipy.spatial.distance.pdist() for a list of built-in distance metrics. dist : ndarray, shape (n_individuals * (n_individuals - 1) / 2,) Distance matrix in condensed form.

Notes

If x is a bcolz carray, a chunk-wise implementation will be used to avoid loading the entire input array into memory. This means that a distance matrix will be calculated for each chunk in the input array, and the results will be summed to produce the final output. For some distance metrics this will return a different result from the standard implementation, although the relative distances may be equivalent.

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 1], [1, 1]],
...                                [[0, 1], [1, 1], [1, 2]],
...                                [[0, 2], [2, 2], [-1, -1]]])
>>> d = allel.stats.pairwise_distance(g.to_n_alt(), metric='cityblock')
>>> d
array([ 3.,  4.,  3.])
>>> import scipy.spatial
>>> scipy.spatial.distance.squareform(d)
array([[ 0.,  3.,  4.],
[ 3.,  0.,  3.],
[ 4.,  3.,  0.]])
```
allel.stats.pairwise_dxy(pos, gac, start=None, stop=None, is_accessible=None)[source]

Convenience function to calculate a pairwise distance matrix using nucleotide divergence (a.k.a. Dxy) as the distance metric.

Parameters: pos : array_like, int, shape (n_variants,) Variant positions. gac : array_like, int, shape (n_variants, n_samples, n_alleles) Per-genotype allele counts. start : int, optional Start position of region to use. stop : int, optional Stop position of region to use. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. dist : ndarray Distance matrix in condensed form.

## Hardy-Weinberg equilibrium¶

allel.stats.heterozygosity_observed(g, fill=nan)[source]

Calculate the rate of observed heterozygosity for each variant.

Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy) Genotype array. fill : float, optional Use this value for variants where all calls are missing. ho : ndarray, float, shape (n_variants,) Observed heterozygosity

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]],
...                                [[0, 0], [0, 1], [1, 1]],
...                                [[0, 0], [1, 1], [2, 2]],
...                                [[1, 1], [1, 2], [-1, -1]]])
>>> allel.stats.heterozygosity_observed(g)
array([ 0.        ,  0.33333333,  0.        ,  0.5       ])
```
allel.stats.heterozygosity_expected(af, ploidy, fill=nan)[source]

Calculate the expected rate of heterozygosity for each variant under Hardy-Weinberg equilibrium.

Parameters: af : array_like, float, shape (n_variants, n_alleles) Allele frequencies array. fill : float, optional Use this value for variants where allele frequencies do not sum to 1. he : ndarray, float, shape (n_variants,) Expected heterozygosity

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]],
...                                [[0, 0], [0, 1], [1, 1]],
...                                [[0, 0], [1, 1], [2, 2]],
...                                [[1, 1], [1, 2], [-1, -1]]])
>>> af = g.count_alleles().to_frequencies()
>>> allel.stats.heterozygosity_expected(af, ploidy=2)
array([ 0.        ,  0.5       ,  0.66666667,  0.375     ])
```
allel.stats.inbreeding_coefficient(g, fill=nan)[source]

Calculate the inbreeding coefficient for each variant.

Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy) Genotype array. fill : float, optional Use this value for variants where the expected heterozygosity is zero. f : ndarray, float, shape (n_variants,) Inbreeding coefficient.

Notes

The inbreeding coefficient is calculated as 1 - (Ho/He) where Ho is the observed heterozygosity and He is the expected heterozygosity.

Examples

```>>> import allel
>>> g = allel.model.GenotypeArray([[[0, 0], [0, 0], [0, 0]],
...                                [[0, 0], [0, 1], [1, 1]],
...                                [[0, 0], [1, 1], [2, 2]],
...                                [[1, 1], [1, 2], [-1, -1]]])
>>> allel.stats.inbreeding_coefficient(g)
array([        nan,  0.33333333,  1.        , -0.33333333])
```

## Window utilities¶

allel.stats.moving_statistic(values, statistic, size=None, start=0, stop=None, step=None)[source]

Calculate a statistic in a moving window over values.

Parameters: values : array_like The data to summarise. statistic : function The statistic to compute within each window. size : int The window size (number of values). start : int, optional The index at which to start. stop : int, optional The index at which to stop. step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. out : ndarray, shape (n_windows,)

Examples

```>>> import allel
>>> values = [2, 5, 8, 16]
>>> allel.stats.moving_statistic(values, np.sum, size=2)
array([ 7, 24])
>>> allel.stats.moving_statistic(values, np.sum, size=2, step=1)
array([ 7, 13, 24])
```
allel.stats.windowed_count(pos, size=None, start=None, stop=None, step=None, windows=None)[source]

Count the number of items in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) The item positions in ascending order, using 1-based coordinates.. size : int The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. counts : ndarray, int, shape (n_windows,) The number of items in each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates.

Notes

The window stop positions are included within a window.

The final window will be truncated to the specified stop position, and so may be smaller than the other windows.

Examples

Non-overlapping windows:

```>>> import allel
>>> pos = [1, 7, 12, 15, 28]
>>> counts, windows = allel.stats.windowed_count(pos, size=10)
>>> counts
array([2, 2, 1])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 28]])
```

Half-overlapping windows:

```>>> counts, windows = allel.stats.windowed_count(pos, size=10, step=5)
>>> counts
array([2, 3, 2, 0, 1])
>>> windows
array([[ 1, 10],
[ 6, 15],
[11, 20],
[16, 25],
[21, 28]])
```
allel.stats.windowed_statistic(pos, values, statistic, size, start=None, stop=None, step=None, windows=None, fill=nan)[source]

Calculate a statistic from items in windows over a single chromosome/contig.

Parameters: pos : array_like, int, shape (n_items,) The item positions in ascending order, using 1-based coordinates.. values : array_like, int, shape (n_items,) The values to summarise. statistic : function The statistic to compute. size : int The window size (number of bases). start : int, optional The position at which to start (1-based). stop : int, optional The position at which to stop (1-based). step : int, optional The distance between start positions of windows. If not given, defaults to the window size, i.e., non-overlapping windows. windows : array_like, int, shape (n_windows, 2), optional Manually specify the windows to use as a sequence of (window_start, window_stop) positions, using 1-based coordinates. Overrides the size/start/stop/step parameters. fill : object, optional The value to use where a window is empty, i.e., contains no items. out : ndarray, shape (n_windows,) The value of the statistic for each window. windows : ndarray, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions, using 1-based coordinates. counts : ndarray, int, shape (n_windows,) The number of items in each window.

Notes

The window stop positions are included within a window.

The final window will be truncated to the specified stop position, and so may be smaller than the other windows.

Examples

Count non-zero (i.e., True) items in non-overlapping windows:

```>>> import allel
>>> pos = [1, 7, 12, 15, 28]
>>> values = [True, False, True, False, False]
>>> nnz, windows, counts = allel.stats.windowed_statistic(
...     pos, values, statistic=np.count_nonzero, size=10
... )
>>> nnz
array([1, 1, 0])
>>> windows
array([[ 1, 10],
[11, 20],
[21, 28]])
>>> counts
array([2, 2, 1])
```

Compute a sum over items in half-overlapping windows:

```>>> values = [3, 4, 2, 6, 9]
>>> x, windows, counts = allel.stats.windowed_statistic(
...     pos, values, statistic=np.sum, size=10, step=5, fill=0
... )
>>> x
array([  7.,  12.,   8.,   0.,   9.])
>>> windows
array([[ 1, 10],
[ 6, 15],
[11, 20],
[16, 25],
[21, 28]])
>>> counts
array([2, 3, 2, 0, 1])
```
allel.stats.per_base(x, windows, is_accessible=None, fill=nan)[source]

Calculate the per-base value of a windowed statistic.

Parameters: x : array_like, shape (n_windows,) The statistic to average per-base. windows : array_like, int, shape (n_windows, 2) The windows used, as an array of (window_start, window_stop) positions using 1-based coordinates. is_accessible : array_like, bool, shape (len(contig),), optional Boolean array indicating accessibility status for all positions in the chromosome/contig. fill : object, optional Use this value where there are no accessible bases in a window. y : ndarray, float, shape (n_windows,) The input array divided by the number of (accessible) bases in each window. n_bases : ndarray, int, shape (n_windows,) The number of (accessible) bases in each window