Runs of homozygosity (ROH)

allel.roh_mhmm(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=1e-06, min_roh=0, is_accessible=None, contig_size=None)

Call ROH (runs of homozygosity) in a single individual given a genotype vector.

This function infers likely ROH using a multinomial hidden Markov model (HMM). Each position in a chromosome/contig has one of three observable states: 0 = hom, 1 = het, 2 = inaccessible (i.e., unobserved).

The model is given the probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het outside a ROH (phet_nonroh). More than one value may be given for phet_nonroh because this probability may not be constant across the genome.
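
As a rough illustration of how these probabilities could parameterize the model, the sketch below builds an emission matrix and a transition matrix with NumPy. The helper names, the p_accessible argument and the even split of the transition probability across the other states are assumptions made for illustration; they are not confirmed to match the library's internals.

```python
import numpy as np


def emission_matrix(phet_roh, phet_nonroh, p_accessible):
    """Hypothetical helper: one row per hidden state (ROH first, then one
    row per non-ROH het probability); columns are the observable states
    0 = hom, 1 = het, 2 = inaccessible."""
    het_px = np.concatenate([[phet_roh], np.atleast_1d(phet_nonroh)])
    return np.column_stack([
        (1 - het_px) * p_accessible,             # accessible and hom
        het_px * p_accessible,                   # accessible and het
        np.full(het_px.size, 1 - p_accessible),  # inaccessible
    ])


def transition_matrix(transition, n_states):
    """Hypothetical helper: stay in the current state with probability
    1 - transition, otherwise move to one of the other states with
    equal probability (an assumption)."""
    mx = np.full((n_states, n_states), transition / (n_states - 1))
    np.fill_diagonal(mx, 1 - transition)
    return mx


# Default probabilities from the signature; p_accessible=0.8 is made up.
E = emission_matrix(0.001, (0.0025, 0.01), p_accessible=0.8)
T = transition_matrix(1e-6, n_states=E.shape[0])
```

With the default arguments this gives three hidden states (one ROH state and two non-ROH states), and every row of both matrices sums to 1.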

Parameters

gv: array_like, int, shape (n_variants, ploidy)

Genotype vector.

pos: array_like, int, shape (n_variants,)

Positions of variants, same 0th dimension as gv.

phet_roh: float, optional

Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.

phet_nonroh: tuple of floats, optional

One or more probabilities of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.

transition: float, optional

Probability of moving between states.

min_roh: integer, optional

Minimum size (bp) to consider as a ROH. Appropriate values will depend on contig size and recombination rate.

is_accessible: array_like, bool, shape (`contig_size`,), optional

Boolean array with one entry per position in the contig, indicating whether that position is accessible.

contig_size: int, optional

Total length of the contig. Use this if is_accessible is not known or not provided.
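
To make the relationship between gv, pos and is_accessible concrete, here is a minimal sketch of how the three inputs could be combined into a per-position observation sequence using the 0 = hom, 1 = het, 2 = inaccessible coding described above. The function name is hypothetical, and the sketch assumes 1-based positions, missing alleles coded as -1, missing calls treated as unobserved, and inaccessibility taking precedence over a het call; none of these choices is confirmed to match the library.

```python
import numpy as np


def observation_sequence(gv, pos, is_accessible):
    """Hypothetical helper: per-position codes 0 = hom, 1 = het,
    2 = inaccessible/unobserved."""
    gv = np.asarray(gv)
    pos = np.asarray(pos)
    is_accessible = np.asarray(is_accessible, dtype=bool)

    # Assumption: accessible positions without a variant are hom.
    obs = np.zeros(is_accessible.size, dtype=np.int8)

    # A genotype is called if no allele is missing (assumed coded as -1).
    called = np.all(gv >= 0, axis=1)
    # A called genotype is het if any allele differs from the first one.
    het = called & np.any(gv != gv[:, :1], axis=1)

    obs[pos[het] - 1] = 1       # convert 1-based positions to 0-based
    obs[pos[~called] - 1] = 2   # assumption: missing calls are unobserved
    obs[~is_accessible] = 2     # assumption: inaccessibility wins
    return obs


# Toy data: four diploid variants on a contig of 8 bp.
gv = [[0, 0], [0, 1], [1, 1], [-1, -1]]
pos = [2, 4, 5, 7]
is_accessible = np.ones(8, dtype=bool)
is_accessible[0] = False  # position 1 is inaccessible

obs = observation_sequence(gv, pos, is_accessible)
```

In this toy example only position 4 is coded as het, while positions 1 and 7 are coded as unobserved.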

Returns

df_roh: DataFrame

Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1-based, stop-inclusive.

froh: float

Proportion of genome in a ROH.
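
The two return values can be related with a small worked example. The sketch below uses made-up runs and a made-up contig length; it assumes that an interval with 1-based, stop-inclusive coordinates spans stop - start + 1 bases, that runs shorter than min_roh are dropped before the proportion is computed, and that the proportion is taken over the whole contig length. Whether the library computes its 'length' column and froh in exactly this way is not confirmed here.

```python
# Hypothetical runs as (start, stop), 1-based and stop-inclusive.
runs = [(101, 200), (501, 1000)]
contig_size = 10_000  # made-up contig length in bp
min_roh = 200         # made-up minimum run size in bp

# Inclusive coordinates: an interval covers stop - start + 1 bases.
lengths = [stop - start + 1 for start, stop in runs]

# Keep only runs that are at least min_roh bases long.
kept = [n for n in lengths if n >= min_roh]

# Proportion of the contig covered by the retained runs.
froh = sum(kept) / contig_size
```

Here the first run (100 bp) falls below the threshold, so only the 500 bp run contributes and the proportion is 0.05.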


Notes

This function requires hmmlearn to be installed.

This function currently requires around 4 GB of memory for a contig size of ~50 Mbp.