Runs of homozygosity (ROH)

allel.roh_mhmm(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=1e-06, min_roh=0, is_accessible=None, contig_size=None)[source]

Call ROH (runs of homozygosity) in a single individual given a genotype vector.

This function computes the likely ROH using a Multinomial HMM model. There are 3 observable states at each position in a chromosome/contig: 0 = Hom, 1 = Het, 2 = inaccessible (i.e., unobserved).

The model is provided with a probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het in a non-ROH, as this probability may not be constant across the genome (phet_nonroh).

Parameters:
gv : array_like, int, shape (n_variants, ploidy)

Genotype vector.

pos: array_like, int, shape (n_variants,)

Positions of variants, same 0th dimension as gv.

phet_roh: float, optional

Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.

phet_nonroh: tuple of floats, optional

One or more probabilites of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.

transition: float, optional

Probability of moving between states.

min_roh: integer, optional

Minimum size (bp) to condsider as a ROH. Will depend on contig size and recombination rate.

is_accessible: array_like, bool, shape (`contig_size`,), optional

Boolean array for each position in contig describing whether accessible or not.

contig_size: int, optional

If is_accessible not known/not provided, allows specification of total length of contig.

Returns:
df_roh: DataFrame

Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1-based, stop-inclusive.

froh: float

Proportion of genome in a ROH.

Notes

This function requires hmmlearn to be installed.

This function currently requires around 4GB memory for a contig size of ~50Mbp.

allel.roh_poissonhmm(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=0.001, window_size=1000, min_roh=0, is_accessible=None, contig_size=None)[source]

Call ROH (runs of homozygosity) in a single individual given a genotype vector.

This function computes the likely ROH using a Poisson HMM model. The chromosome is divided into equally accessible windows of specified size, then the number of hets observed in each is used to fit a Poisson HMM. Note this is much faster than roh_mhmm, but at the cost of some resolution.

The model is provided with a probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het in a non-ROH, as this probability may not be constant across the genome (phet_nonroh).

Parameters:
gv : array_like, int, shape (n_variants, ploidy)

Genotype vector.

pos: array_like, int, shape (n_variants,)

Positions of variants, same 0th dimension as gv.

phet_roh: float, optional

Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.

phet_nonroh: tuple of floats, optional

One or more probabilites of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.

transition: float, optional

Probability of moving between states. This is based on windows, so a larger window size may call for a larger transitional probability

window_size: integer, optional

Window size (equally accessible bases) to consider as a potential ROH. Setting this window too small may result in spurious ROH calls, while too large will result in a lack of resolution.

min_roh: integer, optional

Minimum size (bp) to condsider as a ROH. Will depend on contig size and recombination rate.

is_accessible: array_like, bool, shape (`contig_size`,), optional

Boolean array for each position in contig describing whether accessible or not. Although optional, highly recommended so invariant sites are distinguishable from sites where variation is inaccessible

contig_size: integer, optional

If is_accessible is not available, use this to specify the size of the contig, and assume all sites are accessible.

Returns:
df_roh: DataFrame

Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1-based, stop-inclusive.

froh: float

Proportion of genome in a ROH.

Notes

This function requires pomegranate (>= 0.9.0) to be installed.