Runs of homozygosity (ROH)¶
roh_mhmm(gv, pos, phet_roh=0.001, phet_nonroh=(0.0025, 0.01), transition=1e-06, min_roh=0, is_accessible=None, contig_size=None)¶
Call ROH (runs of homozygosity) in a single individual given a genotype vector.
This function computes the likely ROH using a Multinomial HMM model. There are 3 observable states at each position in a chromosome/contig: 0 = Hom, 1 = Het, 2 = inaccessible (i.e., unobserved).
The model is provided with a probability of observing a het in a ROH (phet_roh) and one or more probabilities of observing a het in a non-ROH, as this probability may not be constant across the genome (phet_nonroh).
- gv : array_like, int, shape (n_variants, ploidy)
- pos: array_like, int, shape (n_variants,)
Positions of variants, same 0th dimension as gv.
- phet_roh: float, optional
Probability of observing a heterozygote in a ROH. Appropriate values will depend on de novo mutation rate and genotype error rate.
- phet_nonroh: tuple of floats, optional
One or more probabilites of observing a heterozygote outside of ROH. Appropriate values will depend primarily on nucleotide diversity within the population, but also on mutation rate and genotype error rate.
- transition: float, optional
Probability of moving between states.
- min_roh: integer, optional
Minimum size (bp) to condsider as a ROH. Will depend on contig size and recombination rate.
- is_accessible: array_like, bool, shape (`contig_size`,), optional
Boolean array for each position in contig describing whether accessible or not.
- contig_size: int, optional
If is_accessible not known/not provided, allows specification of total length of contig.
- df_roh: DataFrame
Data frame where each row describes a run of homozygosity. Columns are ‘start’, ‘stop’, ‘length’ and ‘is_marginal’. Start and stop are 1-based, stop-inclusive.
- froh: float
Proportion of genome in a ROH.
This function requires hmmlearn to be installed.
This function currently requires around 4GB memory for a contig size of ~50Mbp.