Data structures¶
This module defines array classes for variant call data.
GenotypeArray¶
- class allel.model.GenotypeArray[source]¶
Array of discrete genotype calls.
Parameters: data : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype data.
**kwargs : keyword arguments
All keyword arguments are passed through to numpy.array().
Notes
This class represents data on discrete genotype calls as a 3-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the samples genotyped, and the third dimension corresponds to the ploidy of the samples.
Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, ... and -1 (or any other negative integer) is a missing allele call. A single byte integer dtype (int8) can represent up to 127 distinct alleles, which is usually sufficient. The actual alleles (i.e., the alternate nucleotide sequences) and the physical positions of the variants within the genome of an organism are stored in separate arrays, discussed elsewhere.
In many cases the number of distinct alleles for each variant is small, e.g., less than 10, or even 2 (all variants are biallelic). In these cases a genotype array is not the most compact way of storing genotype data in memory. This class defines functions for bit-packing diploid genotype calls into single bytes, and for transforming genotype arrays into sparse matrices, which can assist in cases where memory usage needs to be minimised. Note however that these more compact representations do not allow the same flexibility in terms of using numpy universal functions to access and manipulate data.
Arrays of this class can store either phased or unphased genotype calls. If the genotypes are phased (i.e., haplotypes have been resolved) then individual haplotypes can be extracted by converting to a HaplotypeArray then indexing the second dimension. If the genotype calls are unphased then the ordering of alleles along the third (ploidy) dimension is arbitrary. N.B., this means that an unphased diploid heterozygous call could be stored as (0, 1) or equivalently as (1, 0).
A genotype array can store genotype calls with any ploidy > 1. For haploid calls, use a HaplotypeArray. Note that genotype arrays are not capable of storing calls for samples with differing or variable ploidy.
Examples
Instantiate a genotype array:
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]], dtype='i1') >>> g.dtype dtype('int8') >>> g.ndim 3 >>> g.shape (3, 2, 2) >>> g.n_variants 3 >>> g.n_samples 2 >>> g.ploidy 2
Genotype calls for a single variant at all samples can be obtained by indexing the first dimension, e.g.:
>>> g[1] array([[0, 1], [1, 1]], dtype=int8)
Genotype calls for a single sample at all variants can be obtained by indexing the second dimension, e.g.:
>>> g[:, 1] array([[ 0, 1], [ 1, 1], [-1, -1]], dtype=int8)
A genotype call for a single sample at a single variant can be obtained by indexing the first and second dimensions, e.g.:
>>> g[1, 0] array([0, 1], dtype=int8)
A genotype array can store polyploid calls, e.g.:
>>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]], ... [[0, 1, 1], [1, 1, 1]], ... [[0, 1, 2], [-1, -1, -1]]], dtype='i1') >>> g.ploidy 3
- is_called()[source]¶
Find non-missing genotype calls.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.is_called() array([[ True, True], [ True, True], [ True, False]], dtype=bool)
- is_missing()[source]¶
Find missing genotype calls.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.is_missing() array([[False, False], [False, False], [False, True]], dtype=bool)
- is_hom(allele=None)[source]¶
Find genotype calls that are homozygous.
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_hom() array([[ True, False], [False, True], [ True, False]], dtype=bool) >>> g.is_hom(allele=1) array([[False, False], [False, True], [False, False]], dtype=bool)
- is_hom_ref()[source]¶
Find genotype calls that are homozygous for the reference allele.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.is_hom_ref() array([[ True, False], [False, False], [False, False]], dtype=bool)
- is_hom_alt()[source]¶
Find genotype calls that are homozygous for any alternate (i.e., non-reference) allele.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_hom_alt() array([[False, False], [False, True], [ True, False]], dtype=bool)
- is_het()[source]¶
Find genotype calls that are heterozygous.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype call matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.is_het() array([[False, True], [ True, False], [ True, False]], dtype=bool)
- is_call(call)[source]¶
Find genotypes with a given call.
Parameters: call : array_like, int, shape (ploidy,)
The genotype call to find.
Returns: out : ndarray, bool, shape (n_variants, n_samples)
Array where elements are True if the genotype is call.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.is_call((0, 2)) array([[False, False], [False, False], [ True, False]], dtype=bool)
- view_haplotypes()[source]¶
Reshape a genotype array to view it as haplotypes by dropping the ploidy dimension.
Returns: h : HaplotypeArray, shape (n_variants, n_samples * ploidy)
Haplotype array (sharing same underlying buffer).
Notes
If genotype calls are unphased, the haplotypes returned by this function will bear no resemblance to the true haplotypes.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]]) >>> g.view_haplotypes() HaplotypeArray([[ 0, 0, 0, 1], [ 0, 1, 1, 1], [ 0, 2, -1, -1]], n_variants=3, n_haplotypes=4)
- to_n_alt(fill=0)[source]¶
Transform each genotype call into the number of non-reference alleles.
Parameters: fill : int, optional
Use this value to represent missing calls.
Returns: out : ndarray, int, shape (n_variants, n_samples)
Array of non-ref alleles per genotype call.
Notes
This function simply counts the number of non-reference alleles, it makes no distinction between different alternate alleles.
By default this function returns 0 for missing genotype calls and for homozygous reference genotype calls. Use the fill argument to change how missing calls are represented.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.to_n_alt() array([[0, 1], [1, 2], [2, 0]], dtype=int8) >>> g.to_n_alt(fill=-1) array([[ 0, 1], [ 1, 2], [ 2, -1]], dtype=int8)
- to_allele_counts(alleles=None)[source]¶
Transform genotype calls into allele counts per call.
Parameters: alleles : sequence of ints, optional
If not None, count only the given alleles. (By default, count all alleles.)
Returns: out : ndarray, uint8, shape (n_variants, n_samples, len(alleles))
Array of allele counts per call.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.to_allele_counts() array([[[2, 0, 0], [1, 1, 0]], [[1, 0, 1], [0, 2, 0]], [[0, 0, 2], [0, 0, 0]]], dtype=uint8) >>> g.to_allele_counts(alleles=(0, 1)) array([[[2, 0], [1, 1]], [[1, 0], [0, 2]], [[0, 0], [0, 0]]], dtype=uint8)
- to_packed(boundscheck=True)[source]¶
Pack diploid genotypes into a single byte for each genotype, using the left-most 4 bits for the first allele and the right-most 4 bits for the second allele. Allows single byte encoding of diploid genotypes for variants with up to 15 alleles.
Parameters: boundscheck : bool, optional
If False, do not check that minimum and maximum alleles are compatible with bit-packing.
Returns: packed : ndarray, uint8, shape (n_variants, n_samples)
Bit-packed genotype array.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]], dtype='i1') >>> g.to_packed() array([[ 0, 1], [ 2, 17], [ 34, 239]], dtype=uint8)
- static from_packed(packed)[source]¶
Unpack diploid genotypes that have been bit-packed into single bytes.
Parameters: packed : ndarray, uint8, shape (n_variants, n_samples)
Bit-packed diploid genotype array.
Returns: g : GenotypeArray, shape (n_variants, n_samples, 2)
Genotype array.
Examples
>>> import allel >>> import numpy as np >>> packed = np.array([[0, 1], ... [2, 17], ... [34, 239]], dtype='u1') >>> allel.GenotypeArray.from_packed(packed) GenotypeArray([[[ 0, 0], [ 0, 1]], [[ 0, 2], [ 1, 1]], [[ 2, 2], [-1, -1]]], dtype=int8, n_variants=3, n_samples=2, ploidy=2)
- to_sparse(format='csr', **kwargs)[source]¶
Convert into a sparse matrix.
Parameters: format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’}
Sparse matrix format.
kwargs : keyword arguments
Passed through to sparse matrix constructor.
Returns: m : scipy.sparse.spmatrix
Sparse matrix
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 1], [0, 1]], ... [[1, 1], [0, 0]], ... [[0, 0], [-1, -1]]], dtype='i1') >>> m = g.to_sparse(format='csr') >>> m <4x4 sparse matrix of type '<class 'numpy.int8'>' with 6 stored elements in Compressed Sparse Row format> >>> m.data array([ 1, 1, 1, 1, -1, -1], dtype=int8) >>> m.indices array([1, 3, 0, 1, 2, 3], dtype=int32) >>> m.indptr array([0, 0, 2, 4, 6], dtype=int32)
- static from_sparse(m, ploidy, order=None, out=None)[source]¶
Construct a genotype array from a sparse matrix.
Parameters: m : scipy.sparse.spmatrix
Sparse matrix
ploidy : int
The sample ploidy.
order : {‘C’, ‘F’}, optional
Whether to store data in C (row-major) or Fortran (column-major) order in memory.
out : ndarray, shape (n_variants, n_samples), optional
Use this array as the output buffer.
Returns: g : GenotypeArray, shape (n_variants, n_samples, ploidy)
Genotype array.
Examples
>>> import allel >>> import numpy as np >>> import scipy.sparse >>> data = np.array([ 1, 1, 1, 1, -1, -1], dtype=np.int8) >>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32) >>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32) >>> m = scipy.sparse.csr_matrix((data, indices, indptr)) >>> g = allel.GenotypeArray.from_sparse(m, ploidy=2) >>> g GenotypeArray([[[ 0, 0], [ 0, 0]], [[ 0, 1], [ 0, 1]], [[ 1, 1], [ 0, 0]], [[ 0, 0], [-1, -1]]], dtype=int8, n_variants=4, n_samples=2, ploidy=2)
- allelism()[source]¶
Determine the number of distinct alleles for each variant.
Returns: n : ndarray, int, shape (n_variants,)
Allelism array.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.allelism() array([2, 3, 1])
- allele_number()[source]¶
Count the number of non-missing allele calls per variant.
Returns: an : ndarray, int, shape (n_variants,)
Allele number array.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.allele_number() array([4, 4, 2])
- allele_count(allele=1)[source]¶
Count the number of calls of the given allele per variant.
Parameters: allele : int, optional
Allele index.
Returns: ac : ndarray, int, shape (n_variants,)
Allele count array.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.allele_count(allele=1) array([1, 2, 0]) >>> g.allele_count(allele=2) array([0, 1, 2])
- allele_frequency(allele=1, fill=0)[source]¶
Calculate the frequency of the given allele per variant.
Parameters: allele : int, optional
Allele index.
fill : int, optional
The value to use where all genotype calls are missing for a variant.
Returns: af : ndarray, float, shape (n_variants,)
Allele frequency array.
ac : ndarray, int, shape (n_variants,)
Allele count array (numerator).
an : ndarray, int, shape (n_variants,)
Allele number array (denominator).
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> af, ac, an = g.allele_frequency(allele=1) >>> af array([ 0.25, 0.5 , 0. ]) >>> af, ac, an = g.allele_frequency(allele=2) >>> af array([ 0. , 0.25, 1. ])
- allele_counts(alleles=None)[source]¶
Count the number of calls of each allele per variant.
Parameters: alleles : sequence of ints, optional
The alleles to count. If None, all alleles will be counted.
Returns: ac : ndarray, int, shape (n_variants, len(alleles))
Allele counts array.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.allele_counts() array([[3, 1, 0], [1, 2, 1], [0, 0, 2]], dtype=int32) >>> g.allele_counts(alleles=(1, 2)) array([[1, 0], [2, 1], [0, 2]], dtype=int32)
- allele_frequencies(alleles=None, fill=0)[source]¶
Calculate the frequency of each allele per variant.
Parameters: alleles : sequence of ints, optional
The alleles to calculate frequency of. If None, all allele frequencies will be calculated.
fill : int, optional
The value to use where all genotype calls are missing for a variant.
Returns: af : ndarray, float, shape (n_variants, len(alleles))
Allele frequencies array.
ac : ndarray, int, shape (n_variants, len(alleles))
Allele counts array (numerator).
an : ndarray, int, shape (n_variants,)
Allele number array (denominator).
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> af, ac, an = g.allele_frequencies() >>> af array([[ 0.75, 0.25, 0. ], [ 0.25, 0.5 , 0.25], [ 0. , 0. , 1. ]]) >>> af, ac, an = g.allele_frequencies(alleles=(1, 2)) >>> af array([[ 0.25, 0. ], [ 0.5 , 0.25], [ 0. , 1. ]])
- is_variant()[source]¶
Find variants with at least one non-reference allele call.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_variant() array([False, True, True, True], dtype=bool)
- is_non_variant()[source]¶
Find variants with no non-reference allele calls.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_non_variant() array([ True, False, False, False], dtype=bool)
- is_segregating()[source]¶
Find segregating variants (where more than one allele is observed).
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_segregating() array([False, True, True, False], dtype=bool)
- is_non_segregating(allele=None)[source]¶
Find non-segregating variants (where at most one allele is observed).
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[2, 2], [-1, -1]]]) >>> g.is_non_segregating() array([ True, False, False, True], dtype=bool) >>> g.is_non_segregating(allele=2) array([False, False, False, True], dtype=bool)
- is_singleton(allele=1)[source]¶
Find variants with a single call for the given allele.
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [0, 1]], ... [[1, 1], [1, 2]], ... [[2, 2], [-1, -1]]]) >>> g.is_singleton(allele=1) array([False, True, False, False], dtype=bool) >>> g.is_singleton(allele=2) array([False, False, True, False], dtype=bool)
- is_doubleton(allele=1)[source]¶
Find variants with exactly two calls for the given allele.
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 0], [1, 1]], ... [[1, 1], [1, 2]], ... [[2, 2], [-1, -1]]]) >>> g.is_doubleton(allele=1) array([False, True, False, False], dtype=bool) >>> g.is_doubleton(allele=2) array([False, False, False, True], dtype=bool)
- haploidify_samples()[source]¶
Construct representative haplotypes by randomly selecting an allele from each genotype call.
Returns: h : HaplotypeArray Examples
>>> import allel >>> import numpy as np >>> np.random.seed(42) >>> g = allel.GenotypeArray([[[0, 0], [0, 1]], ... [[0, 2], [1, 1]], ... [[1, 2], [2, 1]], ... [[2, 2], [-1, -1]]]) >>> g.haploidify_samples() HaplotypeArray([[ 0, 1], [ 0, 1], [ 1, 1], [ 2, -1]], n_variants=4, n_haplotypes=2) >>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]], ... [[0, 1, 1], [1, 1, 1]], ... [[0, 1, 2], [-1, -1, -1]]]) >>> g.haploidify_samples() HaplotypeArray([[ 0, 0], [ 1, 1], [ 2, -1]], n_variants=3, n_haplotypes=2)
- heterozygosity_observed(fill=0)[source]¶
Calculate the rate of observed heterozygosity for each variant.
Parameters: fill : float, optional
Use this value for variants where all calls are missing.
Returns: ho : ndarray, float, shape (n_variants,)
Observed heterozygosity
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> g.heterozygosity_observed() array([ 0. , 0.33333333, 0. , 0.5 ])
- heterozygosity_expected(fill=0)[source]¶
Calculate the expected rate of heterozygosity for each variant under Hardy-Weinberg equilibrium.
Parameters: fill : float, optional
Use this value for variants where all calls are missing.
Returns: he : ndarray, float, shape (n_variants,)
Expected heterozygosity
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> g.heterozygosity_expected() array([ 0. , 0.5 , 0.66666667, 0.375 ])
- inbreeding_coefficient(fill=0)[source]¶
Calculate the inbreeding coefficient for each variant.
Parameters: fill : float, optional
Use this value for variants where the expected heterozygosity is zero.
Returns: he : ndarray, float, shape (n_variants,)
Expected heterozygosity
Notes
The inbreeding coefficient is calculated as 1 - (Ho/He) where Ho is the observed heterozygosity and He is the expected heterozygosity.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0], [0, 0]], ... [[0, 0], [0, 1], [1, 1]], ... [[0, 0], [1, 1], [2, 2]], ... [[1, 1], [1, 2], [-1, -1]]]) >>> g.heterozygosity_observed() array([ 0. , 0.33333333, 0. , 0.5 ]) >>> g.heterozygosity_expected() array([ 0. , 0.5 , 0.66666667, 0.375 ]) >>> g.inbreeding_coefficient() array([ 0. , 0.33333333, 1. , -0.33333333])
HaplotypeArray¶
- class allel.model.HaplotypeArray[source]¶
Array of haplotypes.
Parameters: data : array_like, int, shape (n_variants, n_haplotypes)
Haplotype data.
**kwargs : keyword arguments
All keyword arguments are passed through to numpy.array().
Notes
This class represents haplotype data as a 2-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the haplotypes.
Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, ... and -1 (or any other negative integer) is a missing allele call.
If adjacent haplotypes originate from the same sample, then a haplotype array can also be viewed as a genotype array. However, this is not a requirement.
Examples
Instantiate a haplotype array:
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 1], ... [0, 1, 1, 1], ... [0, 2, -1, -1]], dtype='i1') >>> h.dtype dtype('int8') >>> h.ndim 2 >>> h.shape (3, 4) >>> h.n_variants 3 >>> h.n_haplotypes 4
Allele calls for a single variant at all haplotypes can be obtained by indexing the first dimension, e.g.:
>>> h[1] array([0, 1, 1, 1], dtype=int8)
A single haplotype can be obtained by indexing the second dimension, e.g.:
>>> h[:, 1] array([0, 1, 2], dtype=int8)
An allele call for a single haplotype at a single variant can be obtained by indexing the first and second dimensions, e.g.:
>>> h[1, 0] 0
View haplotypes as diploid genotypes:
>>> h.view_genotypes(ploidy=2) GenotypeArray([[[ 0, 0], [ 0, 1]], [[ 0, 1], [ 1, 1]], [[ 0, 2], [-1, -1]]], dtype=int8, n_variants=3, n_samples=2, ploidy=2)
- view_genotypes(ploidy)[source]¶
Reshape a haplotype array to view it as genotypes by restoring the ploidy dimension.
Parameters: ploidy : int
The sample ploidy.
Returns: g : ndarray, int, shape (n_variants, n_samples, ploidy)
Genotype array (sharing same underlying buffer).
Examples
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 1], ... [0, 1, 1, 1], ... [0, 2, -1, -1]], dtype='i1') >>> h.view_genotypes(ploidy=2) GenotypeArray([[[ 0, 0], [ 0, 1]], [[ 0, 1], [ 1, 1]], [[ 0, 2], [-1, -1]]], dtype=int8, n_variants=3, n_samples=2, ploidy=2)
- to_sparse(format='csr', **kwargs)[source]¶
Convert into a sparse matrix.
Parameters: format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’}
Sparse matrix format.
kwargs : keyword arguments
Passed through to sparse matrix constructor.
Returns: m : scipy.sparse.spmatrix
Sparse matrix
Examples
>>> import allel >>> h = allel.HaplotypeArray([[0, 0, 0, 0], ... [0, 1, 0, 1], ... [1, 1, 0, 0], ... [0, 0, -1, -1]], dtype='i1') >>> m = h.to_sparse(format='csr') >>> m <4x4 sparse matrix of type '<class 'numpy.int8'>' with 6 stored elements in Compressed Sparse Row format> >>> m.data array([ 1, 1, 1, 1, -1, -1], dtype=int8) >>> m.indices array([1, 3, 0, 1, 2, 3], dtype=int32) >>> m.indptr array([0, 0, 2, 4, 6], dtype=int32)
- static from_sparse(m, order=None, out=None)[source]¶
Construct a haplotype array from a sparse matrix.
Parameters: m : scipy.sparse.spmatrix
Sparse matrix
order : {‘C’, ‘F’}, optional
Whether to store data in C (row-major) or Fortran (column-major) order in memory.
out : ndarray, shape (n_variants, n_samples), optional
Use this array as the output buffer.
Returns: h : HaplotypeArray, shape (n_variants, n_haplotypes)
Haplotype array.
Examples
>>> import allel >>> import numpy as np >>> import scipy.sparse >>> data = np.array([ 1, 1, 1, 1, -1, -1], dtype=np.int8) >>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32) >>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32) >>> m = scipy.sparse.csr_matrix((data, indices, indptr)) >>> h = allel.HaplotypeArray.from_sparse(m) >>> h HaplotypeArray([[ 0, 0, 0, 0], [ 0, 1, 0, 1], [ 1, 1, 0, 0], [ 0, 0, -1, -1]], dtype=int8, n_variants=4, n_haplotypes=4)
- allelism()[source]¶
Determine the number of distinct alleles for each variant.
Returns: n : ndarray, int, shape (n_variants,)
Allelism array.
- allele_number()[source]¶
Count the number of non-missing allele calls per variant.
Returns: an : ndarray, int, shape (n_variants,)
Allele number array.
- allele_count(allele=1)[source]¶
Count the number of calls of the given allele per variant.
Parameters: allele : int, optional
Allele index.
Returns: ac : ndarray, int, shape (n_variants,)
Allele count array.
- allele_frequency(allele=1, fill=0)[source]¶
Calculate the frequency of the given allele per variant.
Parameters: allele : int, optional
Allele index.
fill : int, optional
The value to use where all genotype calls are missing for a variant.
Returns: af : ndarray, float, shape (n_variants,)
Allele frequency array.
ac : ndarray, int, shape (n_variants,)
Allele count array (numerator).
an : ndarray, int, shape (n_variants,)
Allele number array (denominator).
- allele_counts(alleles=None)[source]¶
Count the number of calls of each allele per variant.
Parameters: alleles : sequence of ints, optional
The alleles to count. If None, all alleles will be counted.
Returns: ac : ndarray, int, shape (n_variants, len(alleles))
Allele counts array.
- allele_frequencies(alleles=None, fill=0)[source]¶
Calculate the frequency of each allele per variant.
Parameters: alleles : sequence of ints, optional
The alleles to calculate frequency of. If None, all allele frequencies will be calculated.
fill : int, optional
The value to use where all genotype calls are missing for a variant.
Returns: af : ndarray, float, shape (n_variants, len(alleles))
Allele frequencies array.
ac : ndarray, int, shape (n_variants, len(alleles))
Allele counts array (numerator).
an : ndarray, int, shape (n_variants,)
Allele number array (denominator).
- is_variant()[source]¶
Find variants with at least one non-reference allele call.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
- is_non_variant()[source]¶
Find variants with no non-reference allele calls.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
- is_segregating()[source]¶
Find segregating variants (where more than one allele is observed).
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
- is_non_segregating(allele=None)[source]¶
Find non-segregating variants (where at most one allele is observed).
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
- is_singleton(allele=1)[source]¶
Find variants with a single call for the given allele.
Parameters: allele : int, optional
Allele index.
Returns: out : ndarray, bool, shape (n_variants,)
Boolean array where elements are True if variant matches the condition.
PositionIndex¶
- class allel.model.PositionIndex[source]¶
Array of positions from a single chromosome or contig.
Parameters: data : array_like, int
Positions (1-based) in ascending order.
**kwargs : keyword arguments
All keyword arguments are passed through to numpy.array().
Notes
This class represents the genomic positions of a set of variants as a 1-dimensional numpy array of integers.
Each integer within the array is a 1-based coordinate position within a single chromosome or contig. Positions must be given in ascending order, although duplicate positions may be present (i.e., positions must be monotonically increasing).
Examples
>>> import allel >>> pos = allel.PositionIndex([2, 5, 14, 15, 42, 42, 77], dtype='i4') >>> pos.dtype dtype('int32') >>> pos.ndim 1 >>> pos.shape (7,) >>> pos.is_unique False
- locate_key(key)[source]¶
Get index location for the requested key.
Parameters: key : int
Position to locate.
Returns: loc : int or slice
Location of key (will be slice if there are duplicate entries).
Examples
>>> import allel >>> pos = allel.PositionIndex([3, 6, 6, 11]) >>> pos.locate_key(3) 0 >>> pos.locate_key(11) 3 >>> pos.locate_key(6) slice(1, 3, None) >>> try: ... pos.locate_key(2) ... except KeyError as e: ... print(e) ... 2
- locate_keys(keys, strict=True)[source]¶
Get index locations for the requested keys.
Parameters: keys : array_like, int
Array of keys to locate.
strict : bool, optional
If True, raise KeyError if any keys are not found in the index.
Returns: loc : ndarray, bool
Boolean array with location of positions.
Examples
>>> import allel >>> pos1 = allel.PositionIndex([3, 6, 11, 20, 35]) >>> pos2 = allel.PositionIndex([4, 6, 20, 39]) >>> loc = pos1.locate_keys(pos2, strict=False) >>> loc array([False, True, False, True, False], dtype=bool) >>> pos1[loc] PositionIndex([ 6, 20])
- locate_intersection(other)[source]¶
Locate the intersection with another position array.
Parameters: other : array_like, int
Array of positions to intersect.
Returns: loc : ndarray, bool
Boolean array with location of intersection.
loc_other : ndarray, bool
Boolean array with location in other of intersection.
Examples
>>> import allel >>> pos1 = allel.PositionIndex([3, 6, 11, 20, 35]) >>> pos2 = allel.PositionIndex([4, 6, 20, 39]) >>> loc1, loc2 = pos1.locate_intersection(pos2) >>> loc1 array([False, True, False, True, False], dtype=bool) >>> loc2 array([False, True, True, False], dtype=bool) >>> pos1[loc1] PositionIndex([ 6, 20]) >>> pos2[loc2] PositionIndex([ 6, 20])
- intersect(other)[source]¶
Intersect with other positions.
Parameters: other : array_like, int
Array of positions to locate.
Returns: out : PositionIndex
Positions in common.
Examples
>>> import allel >>> pos1 = allel.PositionIndex([3, 6, 11, 20, 35]) >>> pos2 = allel.PositionIndex([4, 6, 20, 39]) >>> pos1.intersect(pos2) PositionIndex([ 6, 20])
- locate_range(start=None, stop=None)[source]¶
Locate slice of index containing all entries within start and stop positions inclusive.
Parameters: start : int, optional
Start position.
stop : int, optional
Stop position.
Returns: loc : slice
Slice object.
Examples
>>> import allel >>> pos = allel.PositionIndex([3, 6, 11, 20, 35]) >>> loc = pos.locate_range(4, 32) >>> loc slice(1, 4, None) >>> pos[loc] PositionIndex([ 6, 11, 20])
- intersect_range(start=None, stop=None)[source]¶
Intersect with range defined by start and stop positions inclusive.
Parameters: start : int, optional
Start position.
stop : int, optional
Stop position.
Returns: pos : PositionIndex
Examples
>>> import allel >>> pos = allel.PositionIndex([3, 6, 11, 20, 35]) >>> pos.intersect_range(4, 32) PositionIndex([ 6, 11, 20])
- locate_ranges(starts, stops, strict=True)[source]¶
Locate items within the given ranges.
Parameters: starts : array_like, int
Range start positions.
stops : array_like, int
Range stop positions.
strict : bool, optional
If True, raise KeyError if any ranges contain no entries.
Returns: loc : ndarray, bool
Boolean array with location of entries found.
Examples
>>> import allel >>> import numpy as np >>> pos = allel.PositionIndex([3, 6, 11, 20, 35]) >>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35], ... [100, 120]]) >>> starts = ranges[:, 0] >>> stops = ranges[:, 1] >>> loc = pos.locate_ranges(starts, stops, strict=False) >>> loc array([False, True, True, False, True], dtype=bool) >>> pos[loc] PositionIndex([ 6, 11, 35])
- locate_intersection_ranges(starts, stops)[source]¶
Locate the intersection with a set of ranges.
Parameters: starts : array_like, int
Range start positions.
stops : array_like, int
Range stop positions.
Returns: loc : ndarray, bool
Boolean array with location of entries found.
loc_ranges : ndarray, bool
Boolean array with location of ranges containing one or more entries.
Examples
>>> import allel >>> import numpy as np >>> pos = allel.PositionIndex([3, 6, 11, 20, 35]) >>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35], ... [100, 120]]) >>> starts = ranges[:, 0] >>> stops = ranges[:, 1] >>> loc, loc_ranges = pos.locate_intersection_ranges(starts, stops) >>> loc array([False, True, True, False, True], dtype=bool) >>> loc_ranges array([False, True, False, True, False], dtype=bool) >>> pos[loc] PositionIndex([ 6, 11, 35]) >>> ranges[loc_ranges] array([[ 6, 17], [31, 35]])
- intersect_ranges(starts, stops)[source]¶
Intersect with a set of ranges.
Parameters: starts : array_like, int
Range start positions.
stops : array_like, int
Range stop positions.
Returns: pos : PositionIndex
Examples
>>> import allel >>> import numpy as np >>> pos = allel.PositionIndex([3, 6, 11, 20, 35]) >>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35], ... [100, 120]]) >>> starts = ranges[:, 0] >>> stops = ranges[:, 1] >>> pos.intersect_ranges(starts, stops) PositionIndex([ 6, 11, 35])
LabelIndex¶
- class allel.model.LabelIndex[source]¶
Array of labels (e.g., variant or sample identifiers).
Parameters: data : array_like
Labels.
**kwargs : keyword arguments
All keyword arguments are passed through to numpy.array().
Notes
This class represents an arbitrary set of unique labels, e.g., sample or variant identifiers.
There is no need for labels to be sorted. However, all labels must be unique within the array, and must be hashable objects.
Examples
>>> import allel >>> lbl = allel.LabelIndex(['A', 'C', 'B', 'F']) >>> lbl.dtype dtype('<U1') >>> lbl.ndim 1 >>> lbl.shape (4,)
- locate_key(key)[source]¶
Get index location for the requested key.
Parameters: key : object
Key to locate.
Returns: loc : int
Location of key.
Examples
>>> import allel >>> lbl = allel.LabelIndex(['A', 'C', 'B', 'F']) >>> lbl.locate_key('A') 0 >>> lbl.locate_key('B') 2 >>> try: ... lbl.locate_key('X') ... except KeyError as e: ... print(e) ... 'X'
- locate_keys(keys, strict=True)[source]¶
Get index locations for the requested keys.
Parameters: keys : array_like
Array of keys to locate.
strict : bool, optional
If True, raise KeyError if any keys are not found in the index.
Returns: loc : ndarray, bool
Boolean array with location of keys.
Examples
>>> import allel >>> lbl = allel.LabelIndex(['A', 'C', 'B', 'F']) >>> lbl.locate_keys(['F', 'C']) array([False, True, False, True], dtype=bool) >>> lbl.locate_keys(['X', 'F', 'G', 'C', 'Z'], strict=False) array([False, True, False, True], dtype=bool)
- locate_intersection(other)[source]¶
Locate the intersection with another array.
Parameters: other : array_like
Array to intersect.
Returns: loc : ndarray, bool
Boolean array with location of intersection.
loc_other : ndarray, bool
Boolean array with location in other of intersection.
Examples
>>> import allel >>> lbl1 = allel.LabelIndex(['A', 'C', 'B', 'F']) >>> lbl2 = allel.LabelIndex(['X', 'F', 'G', 'C', 'Z']) >>> loc1, loc2 = lbl1.locate_intersection(lbl2) >>> loc1 array([False, True, False, True], dtype=bool) >>> loc2 array([False, True, False, True, False], dtype=bool) >>> lbl1[loc1] LabelIndex(['C', 'F'], dtype='<U1') >>> lbl2[loc2] LabelIndex(['F', 'C'], dtype='<U1')
- intersect(other)[source]¶
Intersect with other.
Parameters: other : array_like
Array to intersect.
Returns: out : LabelIndex
Examples
>>> import allel >>> lbl1 = allel.LabelIndex(['A', 'C', 'B', 'F']) >>> lbl2 = allel.LabelIndex(['X', 'F', 'G', 'C', 'Z']) >>> lbl1.intersect(lbl2) LabelIndex(['C', 'F'], dtype='<U1') >>> lbl2.intersect(lbl1) LabelIndex(['F', 'C'], dtype='<U1')