Statistics

This module defines statistical functions for use with variant call data.

Windowed statistics

allel.stats.windowed_statistic(pos, values, window, statistic, start=None, stop=None)[source]

Compute a statistic in non-overlapping windows over a single chromosome/contig.

Parameters:

pos : array_like, int, shape (n_variants,)

Positions array (1-based).

values : array_like, shape (n_variants,)

Values to be summarised within each window.

window : int

Window size.

statistic : string or function

Statistic to compute.

start : int, optional

Start position of first window (1-based).

stop : int, optional

Stop position of last window (1-based).

Returns:

s : ndarray, shape (n_windows,)

Computed statistic.

edges : ndarray, shape (n_windows + 1,)

Window edge positions.

allel.stats.windowed_nnz(pos, b, window, start=None, stop=None)[source]

Count nonzero (i.e., True) elements in non-overlapping windows over a single chromosome/contig.

Parameters:

pos : array_like, int, shape (n_variants,)

Positions array.

b : array_like, bool, shape (n_variants,) or (n_variants, n_samples)

Boolean array.

window : int

Window size.

start : int, optional

Start position of first window (1-based).

stop : int, optional

Stop position of last window (1-based).

Returns:

counts : ndarray, int, shape (n_windows,) or (n_windows, n_samples)

Counts array.

edges : ndarray, shape (n_windows + 1,)

Window edge positions.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 1], [0, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> pos = allel.PositionIndex([2, 14, 15, 27])
>>> b = g.is_variant()
>>> counts, edges = allel.windowed_nnz(pos, b, window=10)
>>> edges
array([ 2, 12, 22, 32])
>>> counts
array([0, 2, 1])
>>> counts, edges = allel.windowed_nnz(pos, b, window=10, start=1,
...                                      stop=27)
>>> edges
array([ 1, 11, 21, 27])
>>> counts
array([0, 2, 1])
>>> b = g.is_het()
>>> counts, edges = allel.windowed_nnz(pos, b, window=10)
>>> edges
array([ 2, 12, 22, 32])
>>> counts
array([[0, 0],
       [1, 2],
       [0, 0]])
allel.stats.windowed_mean_per_base(pos, values, window, start=None, stop=None, is_accessible=None, fill=nan)[source]

Calculate the mean per genome position of the given values in non-overlapping windows over a single chromosome/contig.

Parameters:

pos : array_like, int, shape (n_variants,)

Positions array.

values : array_like, shape (n_variants,)

Values to be summarised within each window.

window : int

Window size.

start : int, optional

Start position of first window (1-based).

stop : int, optional

Stop position of last window (1-based).

is_accessible : array_like, bool, shape (len(contig),), optional

Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.

fill : optional

Fill value to use if window size is zero.

Returns:

m : ndarray, shape (n_windows,)

Mean per base of values in each window.

edges : ndarray, shape (n_windows + 1,)

Window edge positions.

widths : array_like, int, shape (n_windows,)

Number of accessible positions in each window.

allel.stats.windowed_nnz_per_base(pos, b, window, start=None, stop=None, is_accessible=None, fill=nan)[source]

Calculate the per-base-pair density of nonzero (i.e., True) elements in non-overlapping windows over a single chromosome/contig.

Parameters:

pos : array_like, int, shape (n_variants,)

Positions array.

b : array_like, bool, shape (n_variants,) or (n_variants, n_samples)

Boolean array.

window : int

Window size.

start : int, optional

Start position of first window (1-based).

stop : int, optional

Stop position of last window (1-based).

is_accessible : array_like, bool, shape (len(contig),), optional

Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.

fill : optional

Fill value to use if window size is zero.

Returns:

densities : ndarray, shape (n_windows,) or (n_windows, n_samples)

Per base density of True values in each window.

edges : ndarray, shape (n_windows + 1,)

Window edge positions.

counts : ndarray, int, shape (n_windows,) or (n_windows, n_samples)

Counts array.

widths : array_like, int, shape (n_windows,)

Number of accessible positions in each window.

Examples

Assuming all positions are accessible:

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 1], [0, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> pos = allel.PositionIndex([2, 14, 15, 27])
>>> b = g.is_variant()
>>> densities, edges, counts, widths = allel.windowed_nnz_per_base(
...     pos, b, window=10
... )
>>> edges
array([ 2, 12, 22, 32])
>>> widths
array([10, 10, 11])
>>> counts
array([0, 2, 1])
>>> densities
array([ 0.        ,  0.2       ,  0.09090909])

Density calculations can take into account the number of accessible positions within each window, e.g.:

>>> is_accessible = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
...                           0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
...                           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...                           dtype=bool)
>>> densities, edges, counts, widths = allel.windowed_nnz_per_base(
...     pos, b, start=1, stop=31, window=10,
...     is_accessible=is_accessible, fill=np.nan
... )
>>> edges
array([ 1, 11, 21, 31])
>>> widths
array([ 0,  5, 11])
>>> counts
array([0, 2, 1])
>>> densities
array([        nan,  0.4     ,  0.09090909])
allel.stats.windowed_nucleotide_diversity(g, pos, window, start=None, stop=None, is_accessible=None, fill=nan)[source]

Calculate nucleotide diversity in non-overlapping windows over a single chromosome/contig.

Parameters:

g : array_like, int, shape (n_variants, n_samples, ploidy)

Genotype array.

pos : array_like, int, shape (n_variants,)

Positions array.

window : int

Window size.

start : int, optional

Start position of first window (1-based).

stop : int, optional

Stop position of last window (1-based).

is_accessible : array_like, bool, shape (len(contig),), optional

Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.

fill : optional

Fill value to use if window size is zero.

Returns:

pi : ndarray, shape (n_windows,)

Nucleotide diversity in each window.

edges : ndarray, shape (n_windows + 1,)

Window edge positions.

widths : array_like, int, shape (n_windows,)

Number of accessible positions in each window.