Runs of homozygosity (ROH)¶

allel.
roh_mhmm
(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=1e06, min_roh=0, is_accessible=None, contig_size=None)[source]¶ Call ROH (runs of homozygosity) in a single individual given a genotype vector.
This function computes the likely ROH using a Multinomial HMM model. There are 3 observable states at each position in a chromosome/contig: 0 = Hom, 1 = Het, 2 = inaccessible (i.e., unobserved).
The model is provided with a probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het in a nonROH, as this probability may not be constant across the genome (phet_nonroh).
Parameters:  gv : array_like, int, shape (n_variants, ploidy)
Genotype vector.
 pos: array_like, int, shape (n_variants,)
Positions of variants, same 0th dimension as gv.
 phet_roh: float, optional
Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.
 phet_nonroh: tuple of floats, optional
One or more probabilites of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.
 transition: float, optional
Probability of moving between states.
 min_roh: integer, optional
Minimum size (bp) to condsider as a ROH. Will depend on contig size and recombination rate.
 is_accessible: array_like, bool, shape (`contig_size`,), optional
Boolean array for each position in contig describing whether accessible or not.
 contig_size: int, optional
If is_accessible not known/not provided, allows specification of total length of contig.
Returns:  df_roh: DataFrame
Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1based, stopinclusive.
 froh: float
Proportion of genome in a ROH.
Notes
This function requires hmmlearn to be installed.
This function currently requires around 4GB memory for a contig size of ~50Mbp.

allel.
roh_poissonhmm
(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=0.001, window_size=1000, min_roh=0, is_accessible=None, contig_size=None)[source]¶ Call ROH (runs of homozygosity) in a single individual given a genotype vector.
This function computes the likely ROH using a Poisson HMM model. The chromosome is divided into equally accessible windows of specified size, then the number of hets observed in each is used to fit a Poisson HMM. Note this is much faster than roh_mhmm, but at the cost of some resolution.
The model is provided with a probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het in a nonROH, as this probability may not be constant across the genome (phet_nonroh).
Parameters:  gv : array_like, int, shape (n_variants, ploidy)
Genotype vector.
 pos: array_like, int, shape (n_variants,)
Positions of variants, same 0th dimension as gv.
 phet_roh: float, optional
Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.
 phet_nonroh: tuple of floats, optional
One or more probabilites of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.
 transition: float, optional
Probability of moving between states. This is based on windows, so a larger window size may call for a larger transitional probability
 window_size: integer, optional
Window size (equally accessible bases) to consider as a potential ROH. Setting this window too small may result in spurious ROH calls, while too large will result in a lack of resolution.
 min_roh: integer, optional
Minimum size (bp) to condsider as a ROH. Will depend on contig size and recombination rate.
 is_accessible: array_like, bool, shape (`contig_size`,), optional
Boolean array for each position in contig describing whether accessible or not. Although optional, highly recommended so invariant sites are distinguishable from sites where variation is inaccessible
 contig_size: integer, optional
If is_accessible is not available, use this to specify the size of the contig, and assume all sites are accessible.
Returns:  df_roh: DataFrame
Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1based, stopinclusive.
 froh: float
Proportion of genome in a ROH.
Notes
This function requires pomegranate (>= 0.9.0) to be installed.