Input/output utilities

Variant Call Format (VCF)

allel.read_vcf(input, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)

Read data from a VCF file into NumPy arrays.

Changed in version 1.12.0: Now returns None if no variants are found in the VCF file or matching the requested region.
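As a minimal sketch of the behaviour described above, the following reads a small in-memory VCF and handles the None return value. The VCF content, the EXAMPLE_VCF constant and the load_example helper are made up for illustration; scikit-allel is assumed to be installed and is imported inside the helper so that the constant can be inspected without it.

```python
import io

# A tiny, made-up VCF used only for illustration (two variants, two samples).
EXAMPLE_VCF = (
    b"##fileformat=VCFv4.2\n"
    b"##contig=<ID=2L>\n"
    b'##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    b'##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    b"2L\t100\t.\tA\tT\t50\tPASS\tDP=12\tGT\t0/1\t1/1\n"
    b"2L\t250\t.\tG\tC,T\t60\tPASS\tDP=30\tGT\t0/0\t1/2\n"
)


def load_example():
    """Read EXAMPLE_VCF into a dict of arrays (hypothetical helper)."""
    import allel  # scikit-allel, assumed to be installed

    data = allel.read_vcf(
        io.BytesIO(EXAMPLE_VCF),
        fields=["variants/CHROM", "variants/POS", "variants/DP", "calldata/GT"],
        types={"variants/DP": "i8"},  # store INFO/DP as 64-bit integers
    )
    # Since version 1.12.0 the function returns None when no variants are found.
    if data is None:
        return {}
    return data
```

Calling load_example() should give a dictionary keyed by the full field paths, such as 'variants/POS' and 'calldata/GT'.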

Parameters:

input : string or file-like: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
rename_fields : dict[str -> str], optional: Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
samples : list of strings: Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.

Returns:

data : dict[str, ndarray]: A dictionary holding arrays, or None if no variants were found.
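The priority rule for prefix-free field names (see the fields parameter) can be sketched in plain Python. This is not the library's implementation: resolve_field, FIXED_VARIANT_FIELDS and the choice to raise ValueError for undeclared names are assumptions made for illustration only.

```python
# The fixed VCF columns, which always belong under 'variants/'.
FIXED_VARIANT_FIELDS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"}


def resolve_field(name, info_ids, format_ids):
    """Return the full path for a possibly prefix-free field name.

    info_ids and format_ids are the sets of IDs declared in the INFO and
    FORMAT header lines respectively.
    """
    if name.startswith(("variants/", "calldata/")):
        return name  # already a complete path
    # INFO (variants) takes priority over FORMAT (calldata).
    if name in FIXED_VARIANT_FIELDS or name in info_ids:
        return "variants/" + name
    if name in format_ids:
        return "calldata/" + name
    raise ValueError(f"field {name!r} is not declared in the header")


info_ids = {"DP", "AC"}
format_ids = {"GT", "DP"}
resolved = [resolve_field(f, info_ids, format_ids) for f in ["CHROM", "POS", "DP", "GT"]]
```

Here 'DP' resolves to 'variants/DP' because it is declared in both headers; to get the per-sample depth, request 'calldata/DP' explicitly.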

allel.vcf_to_npz(input, output, compressed=True, overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix=True, samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)

Read data from a VCF file into NumPy arrays and save as a .npz file.

Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.
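Because no output file is created when no variants are found, callers may want to check for the file before loading it. The sketch below assumes scikit-allel and NumPy are installed; convert_and_load and load_npz_if_created are hypothetical helper names. The allow_pickle=True argument is used on the assumption that string fields are stored as object arrays.

```python
import os


def load_npz_if_created(npz_path):
    """Load all arrays from npz_path, or return None if the file is absent."""
    if not os.path.exists(npz_path):
        return None
    import numpy as np  # imported lazily; only needed when a file exists

    # Object-dtype arrays are pickled inside .npz files, so loading them
    # requires allow_pickle=True.
    with np.load(npz_path, allow_pickle=True) as npz:
        return {key: npz[key] for key in npz.files}


def convert_and_load(vcf_path, npz_path, region=None):
    """Convert a VCF file to .npz and load the result (None if no variants)."""
    import allel  # scikit-allel, assumed to be installed

    # Remove any stale output first, so that a missing file afterwards
    # reliably means "no variants found".
    if os.path.exists(npz_path):
        os.remove(npz_path)
    allel.vcf_to_npz(vcf_path, npz_path, fields="*", region=region)
    return load_npz_if_created(npz_path)
```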

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
output : string: File-system path to write output to.
compressed : bool, optional: If True (default), save with compression.
overwrite : bool, optional: If False (default), do not overwrite an existing file.
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
rename_fields : dict[str -> str], optional: Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
samples : list of strings: Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.
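The difference between the region filter described above and default tabix behaviour can be illustrated with a small pure-Python sketch. The function names are hypothetical and the parsing is deliberately simple (it assumes the chromosome name contains no colon); it is not how the library or tabix is implemented.

```python
def parse_region(region):
    """Split 'chrom' or 'chrom:begin-end' into (chrom, begin, end)."""
    chrom, sep, span = region.partition(":")
    if not sep:
        return chrom, None, None  # whole chromosome requested
    begin, end = span.split("-")
    return chrom, int(begin), int(end)


def starts_in_region(chrom, pos, region):
    """Rule described above: keep a variant only if POS is within the range."""
    r_chrom, begin, end = parse_region(region)
    if chrom != r_chrom:
        return False
    return begin is None or begin <= pos <= end


def overlaps_region(chrom, pos, ref, region):
    """Overlap rule: keep a variant if its reference allele touches the range."""
    r_chrom, begin, end = parse_region(region)
    if chrom != r_chrom:
        return False
    if begin is None:
        return True
    last_ref_base = pos + len(ref) - 1  # coordinates are 1-based, inclusive
    return pos <= end and last_ref_base >= begin


# A deletion whose POS lies just before the region but whose REF spans into it.
region = "2L:100000-200000"
kept_by_pos_rule = starts_in_region("2L", 99998, region)
kept_by_overlap_rule = overlaps_region("2L", 99998, "ACGTA", region)
```

The deletion at position 99998 is dropped by the POS-based rule but would be kept by an overlap-based query.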

allel.vcf_to_hdf5(input, output, group='/', compression='gzip', compression_opts=1, shuffle=False, overwrite=False, vlen=True, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)

Read data from a VCF file and load into an HDF5 file.

Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
output : string: File-system path to write output to.
group : string: Group within destination HDF5 file to store data in.
compression : string: Compression algorithm, e.g., ‘gzip’ (default).
compression_opts : int: Compression level, e.g., 1 (default).
shuffle : bool: Use byte shuffling, which may improve compression (default is False).
overwrite : bool: If False (default), do not overwrite an existing file.
vlen : bool: If True, store variable length strings. Note that there is considerable storage overhead for variable length strings in HDF5, and leaving this option as True (default) may lead to large file sizes. If False, all strings will be stored in the HDF5 file as fixed length strings, even if they are specified as ‘object’ type. In this case, the string length for any field with ‘object’ type will be determined based on the maximum length of strings found in the first chunk, and this may cause values to be truncated if longer values are found in later chunks. To avoid truncation and large file sizes, manually set the type for all string fields to an explicit fixed length string type, e.g., ‘S10’ for a field where you know at most 10 characters are required.
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
rename_fields : dict[str -> str], optional: Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
samples : list of strings: Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
chunk_width : int, optional: Width (number of samples) to use when storing chunks in output.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.
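The truncation hazard described under vlen can be illustrated without HDF5 at all. The sketch below is a simplified model, not the library's code: width_from_first_chunk and store_fixed_width are hypothetical helpers that mimic choosing a fixed string width from the first chunk and then storing later chunks at that width.

```python
def width_from_first_chunk(first_chunk):
    """Pick a fixed string width from the longest value in the first chunk."""
    return max(len(value) for value in first_chunk)


def store_fixed_width(values, width):
    """Mimic fixed-length storage: anything longer than width is cut off."""
    return [value[:width] for value in values]


first_chunk = [b"A", b"AT"]   # e.g. REF alleles seen in the first chunk
later_chunk = [b"ATTTG"]      # a longer allele appearing in a later chunk

inferred_width = width_from_first_chunk(first_chunk)
truncated = store_fixed_width(later_chunk, inferred_width)

# Declaring an explicit width (the analogue of types={'variants/REF': 'S10'})
# leaves room for the longer value.
preserved = store_fixed_width(later_chunk, 10)
```

With the inferred width of 2 the later allele is stored as b"AT", whereas the explicit width of 10 keeps it intact.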

allel.vcf_to_zarr(input, output, group='/', compressor='default', overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)

Read data from a VCF file and load into a Zarr on-disk store.

Changed in version 1.12.0: Now will not create any output files if no variants are found in the VCF file or matching the requested region.

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
output : string: File-system path to write output to.
group : string: Group within destination Zarr hierarchy to store data in.
compressor : compressor: Compression algorithm, e.g., zarr.Blosc(cname='zstd', clevel=1, shuffle=1).
overwrite : bool: If False (default), do not overwrite an existing file.
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
rename_fields : dict[str -> str], optional: Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
samples : list of strings: Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
chunk_width : int, optional: Width (number of samples) to use when storing chunks in output.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.
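To get a feel for how chunk_length and chunk_width interact, the following sketch counts the chunks needed for a two-dimensional calldata field. It assumes such a field is split along the variants dimension by chunk_length and along the samples dimension by chunk_width; chunk_grid is a hypothetical helper and the example sizes are arbitrary.

```python
import math


def chunk_grid(n_variants, n_samples, chunk_length=65536, chunk_width=64):
    """Return (chunks along variants, chunks along samples)."""
    return (
        math.ceil(n_variants / chunk_length),
        math.ceil(n_samples / chunk_width),
    )


# One million variants genotyped in 1,000 samples, using the default settings.
rows, cols = chunk_grid(1_000_000, 1000)
total_chunks = rows * cols
```

Smaller chunks generally mean more, smaller stored pieces; larger chunks mean fewer pieces but more data to read when only a few samples are needed.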

allel.vcf_to_dataframe(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)

Read data from a VCF file into a pandas DataFrame.

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.

Returns:

df : pandas.DataFrame

allel.vcf_to_csv(input, output, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None, **kwargs)

Read data from a VCF file and write out to a comma-separated values (CSV) file.

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
output : string: File-system path to write output to.
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.
kwargs : keyword arguments: All remaining keyword arguments are passed through to pandas.DataFrame.to_csv(). E.g., to write a tab-delimited file, provide sep='\t'.
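A short sketch of the keyword pass-through, assuming scikit-allel and pandas are installed. The vcf_to_tsv helper and the TSV_OPTIONS name are hypothetical; the only extra argument used is sep, which belongs to pandas.DataFrame.to_csv().

```python
# Extra keyword arguments that are handed on to pandas.DataFrame.to_csv().
TSV_OPTIONS = {"sep": "\t"}


def vcf_to_tsv(vcf_path, tsv_path):
    """Write selected variants fields from a VCF file as tab-delimited text."""
    import allel  # scikit-allel, assumed to be installed

    allel.vcf_to_csv(
        vcf_path,
        tsv_path,
        fields=["variants/CHROM", "variants/POS", "variants/DP"],
        **TSV_OPTIONS,
    )
```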

allel.vcf_to_recarray(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)

Read data from a VCF file into a NumPy recarray.

Parameters:

input : string: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but is the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class, which implements post-processing of data from SnpEff.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.
log : file-like, optional: A file-like object (e.g., sys.stderr) to print progress information.

Returns:

ra : np.rec.array
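
The region rule described above (include a variant only if its POS falls inside the requested range) can be illustrated with a small pure-Python sketch. The helper names `parse_region`, `pos_in_region` and `ref_overlaps_region` are hypothetical and not part of the library; the parser handles only the two region forms mentioned above and assumes chromosome names without a colon when no range is given.

```python
def parse_region(region):
    """Split a tabix-style region string into (chrom, start, end).

    start and end are 1-based and inclusive; both are None when only a
    chromosome name is given. Hypothetical helper for illustration.
    """
    chrom, sep, span = region.rpartition(':')
    if not sep:
        return region, None, None
    start, end = span.split('-')
    return chrom, int(start), int(end)


def pos_in_region(chrom, pos, region):
    """Rule described above: only the start position (POS) is tested."""
    r_chrom, start, end = parse_region(region)
    if chrom != r_chrom:
        return False
    if start is None:
        return True
    return start <= pos <= end


def ref_overlaps_region(chrom, pos, ref, region):
    """Overlap rule: the span covered by the reference allele is tested."""
    r_chrom, start, end = parse_region(region)
    if chrom != r_chrom:
        return False
    if start is None:
        return True
    last = pos + len(ref) - 1  # last reference base covered by the variant
    return pos <= end and last >= start


# A deletion starting just before the region: its reference allele
# overlaps the region, but its POS lies outside it.
print(ref_overlaps_region('2L', 99998, 'ACGTA', '2L:100000-200000'))
print(pos_in_region('2L', 99998, '2L:100000-200000'))
```

Under the POS-only rule the deletion in this example is excluded, even though an overlap-based query would report it.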

allel.iter_vcf_chunks(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536)

Iterate over chunks of data from a VCF file as NumPy arrays.

Parameters:

input : string or file-like: Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
fields : list of strings, optional: Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
exclude_fields : list of strings, optional: Fields to exclude. E.g., for use in combination with fields='*'.
types : dict, optional: Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
numbers : dict, optional: Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
alt_number : int, optional: Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
fills : dict, optional: Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
region : string, optional: Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
tabix : string, optional: Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but is the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
samples : list of strings, optional: Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
transformers : list of transformer objects, optional: Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class, which implements post-processing of data from SnpEff.
buffer_size : int, optional: Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
chunk_length : int, optional: Length (number of variants) of chunks in which data are processed.

Returns:

fields : list of strings: Normalised names of fields that will be extracted.
samples : ndarray: Samples for which data will be extracted.
headers : VCFHeaders: Tuple of metadata extracted from VCF headers.
it : iterator: Chunk iterator.
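
The interplay of `numbers`, `alt_number` and `fills` described in the parameter list can be sketched as follows. This is an illustration of the idea, not the library's implementation: `expected_number` and `to_fixed_width` are hypothetical helpers, the mapping of ‘A’ and ‘R’ follows the VCF convention (one value per alternate allele, and one value per allele including the reference), and the truncation of surplus values is an assumption.

```python
def expected_number(declared, alt_number=3):
    """Resolve a VCF 'Number' declaration to a fixed count of stored values.

    Only 'A', 'R' and plain integers are handled in this sketch; other
    declarations such as 'G' or '.' are deliberately left out.
    """
    if declared == 'A':  # one value per alternate allele
        return alt_number
    if declared == 'R':  # one value per allele, reference included
        return alt_number + 1
    return int(declared)


def to_fixed_width(values, number, fill):
    """Return exactly `number` values: pad with `fill`, or drop the surplus.

    Dropping surplus values is an assumption made for this sketch.
    """
    values = list(values)[:number]
    return values + [fill] * (number - len(values))


# A variant with a single ALT allele, stored with room for three.
print(to_fixed_width(['T'], expected_number('A'), fill=''))
# An allele-count field with more values than the requested number.
print(to_fixed_width([1, 2, 3, 4, 5, 6], 5, fill=-1))
```

Choosing a larger `alt_number` (or an explicit entry in `numbers`) trades memory for the ability to keep values from highly multi-allelic sites.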

allel.read_vcf_headers(input): Read headers from a VCF file.

class allel.ANNTransformer
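
The `transformers` parameter documented above expects objects with a “transform()” method that receives a dict holding one chunk of data. A minimal sketch of a custom transformer is shown here. `LowDepthFlagger` and the ‘variants/low_DP’ key are hypothetical, the sketch assumes the chunk dict is updated in place and that ‘variants/DP’ was among the requested fields, and the library may expect further methods on a transformer beyond the one described here, so check ANNTransformer before relying on it.

```python
import numpy as np


class LowDepthFlagger:
    """Hypothetical transformer that flags variants with low INFO depth."""

    def __init__(self, min_depth=10):
        self.min_depth = min_depth

    def transform(self, chunk):
        # chunk maps field names to NumPy arrays for one block of variants.
        dp = chunk['variants/DP']
        # Add a derived boolean field; assumes in-place modification is what
        # the caller expects. Missing values stored as a negative fill are
        # flagged as low depth as well.
        chunk['variants/low_DP'] = dp < self.min_depth


# Stand-in chunk for illustration; real chunks come from the VCF reader.
chunk = {'variants/DP': np.array([5, 30, -1])}
LowDepthFlagger(min_depth=10).transform(chunk)
print(chunk['variants/low_DP'])
```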

allel.write_vcf(path, callset, rename=None, number=None, description=None, fill=None, write_header=True): Preliminary support for writing a VCF file. Currently does not support sample data. Needs further work.

GFF3

allel.gff3_to_dataframe(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', **kwargs)

Load data from a GFF3 file into a pandas DataFrame.

Parameters:

path : string: Path to input file.
attributes : list of strings, optional: List of columns to extract from the “attributes” field.
region : string, optional: Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
score_fill : int, optional: Value to use where score field has a missing value.
phase_fill : int, optional: Value to use where phase field has a missing value.
attributes_fill : object or list of objects, optional: Value(s) to use where attribute field(s) have a missing value.
tabix : string, optional: Tabix command.

Returns:

pandas.DataFrame
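
To make the roles of `attributes`, `score_fill`, `phase_fill` and `attributes_fill` concrete, the sketch below parses a single GFF3 line in plain Python. `parse_gff3_line` is a hypothetical helper, not the library's parser: it assumes nine tab-separated columns, treats ‘.’ as the missing-value marker, accepts a single fill value for all attributes, and does not decode URL-escaped characters.

```python
def parse_gff3_line(line, attributes=(), score_fill=-1, phase_fill=-1,
                    attributes_fill='.'):
    """Parse one GFF3 feature line into a flat tuple (illustrative only)."""
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip('\n').split('\t')
    # Replace missing score and phase with the requested fill values.
    score = score_fill if score == '.' else float(score)
    phase = phase_fill if phase == '.' else int(phase)
    # Column 9 holds semicolon-separated key=value pairs.
    pairs = dict(item.split('=', 1) for item in attrs.split(';') if '=' in item)
    # One output column per requested attribute, filled when absent.
    extra = tuple(pairs.get(name, attributes_fill) for name in attributes)
    return (seqid, source, ftype, int(start), int(end),
            score, strand, phase) + extra


line = '2L\tVectorBase\tgene\t157348\t186936\t.\t-\t.\tID=AGAP004677;biotype=protein_coding'
print(parse_gff3_line(line, attributes=['ID', 'Parent']))
```

Here the record has no ‘Parent’ attribute, so that column receives the fill value, and the missing score and phase are replaced in the same way.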

allel.gff3_to_recarray(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', dtype=None)

Load data from a GFF3 file into a NumPy recarray.

Parameters:

path : string: Path to input file.
attributes : list of strings, optional: List of columns to extract from the “attributes” field.
region : string, optional: Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
score_fill : int, optional: Value to use where score field has a missing value.
phase_fill : int, optional: Value to use where phase field has a missing value.
attributes_fill : object or list of objects, optional: Value(s) to use where attribute field(s) have a missing value.
tabix : string, optional: Tabix command.
dtype : dtype, optional: Override dtype.

Returns:

np.recarray

allel.iter_gff3(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix')

Iterate over records in a GFF3 file.

Parameters:

path : string: Path to input file.
attributes : list of strings, optional: List of columns to extract from the “attributes” field.
region : string, optional: Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
score_fill : int, optional: Value to use where score field has a missing value.
phase_fill : int, optional: Value to use where phase field has a missing value.
attributes_fill : object or list of objects, optional: Value(s) to use where attribute field(s) have a missing value.
tabix : string, optional: Tabix command.

Returns:

Iterator

FASTA

allel.write_fasta(path, sequences, names, mode='w', width=80)

Write nucleotide sequences stored as numpy arrays to a FASTA file.

Parameters:

path : string: File path.
sequences : sequence of arrays: One or more ndarrays of dtype ‘S1’ containing the sequences.
names : sequence of strings: Names of the sequences.
mode : string, optional: Use ‘a’ to append to an existing file.
width : int, optional: Maximum line width.
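
The effect of `width` can be illustrated with a short sketch that formats one record in the conventional FASTA layout (a ‘>’ header line followed by the sequence wrapped at a fixed width). `format_fasta_record` is a hypothetical helper and the exact output of write_fasta is assumed rather than confirmed here. The sequence is taken as an iterable of single-byte values, which is how an array of dtype ‘S1’ behaves when iterated.

```python
def format_fasta_record(name, sequence, width=80):
    """Format one sequence as FASTA text, wrapping lines at `width`."""
    # Join single-byte items (e.g. the elements of an 'S1' array) into text.
    text = b''.join(sequence).decode('ascii')
    lines = ['>' + name]
    lines.extend(text[i:i + width] for i in range(0, len(text), width))
    return '\n'.join(lines) + '\n'


# Ten bases wrapped at four characters per line.
bases = [bytes([b]) for b in b'ACGTACGTAC']
print(format_fasta_record('chr1', bases, width=4), end='')
```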