Input/output utilities

Variant Call Format (VCF)

allel.read_vcf(input, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)[source]

Read data from a VCF file into NumPy arrays.

Changed in version 1.12.0: Now returns None if no variants are found in the VCF file or matching the requested region.

Parameters:
input : string or file-like

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

rename_fields : dict[str -> str], optional

Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

samples : list of strings

Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.

Returns:
data : dict[str, ndarray]

A dictionary holding arrays, or None if no variants were found.
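
For illustration, a minimal sketch of typical usage ('example.vcf' is a hypothetical path and the fields shown are arbitrary). The returned dictionary is keyed by the normalised field names, and genotype calls can be wrapped in an allel.GenotypeArray for downstream work:

>>> import allel
>>> callset = allel.read_vcf('example.vcf',
...                          fields=['variants/CHROM', 'variants/POS', 'calldata/GT'])
>>> # callset is None if no variants were found or matched the requested region
>>> pos = callset['variants/POS']                     # one position per variant
>>> gt = allel.GenotypeArray(callset['calldata/GT'])  # wrap genotype calls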

allel.vcf_to_npz(input, output, compressed=True, overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)[source]

Read data from a VCF file into NumPy arrays and save as a .npz file.

Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

output : string

File-system path to write output to.

compressed : bool, optional

If True (default), save with compression.

overwrite : bool, optional

If False (default), do not overwrite an existing file.

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

rename_fields : dict[str -> str], optional

Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

samples : list of strings

Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.
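
A hedged sketch of the round trip ('example.vcf' and 'example.npz' are hypothetical paths): the saved arrays can be loaded back with numpy.load, keyed by the same field names:

>>> import numpy as np
>>> import allel
>>> allel.vcf_to_npz('example.vcf', 'example.npz',
...                  fields=['variants/POS', 'calldata/GT'], overwrite=True)
>>> # allow_pickle may be needed for string fields stored with object dtype
>>> callset = np.load('example.npz', allow_pickle=True)
>>> pos = callset['variants/POS']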

allel.vcf_to_hdf5(input, output, group='/', compression='gzip', compression_opts=1, shuffle=False, overwrite=False, vlen=True, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)[source]

Read data from a VCF file and load into an HDF5 file.

Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

output : string

File-system path to write output to.

group : string

Group within destination HDF5 file to store data in.

compression : string

Compression algorithm, e.g., ‘gzip’ (default).

compression_opts : int

Compression level, e.g., 1 (default).

shuffle : bool

Use byte shuffling, which may improve compression (default is False).

overwrite : bool

If False (default), do not overwrite an existing file.

vlen : bool

If True, store variable length strings. Note that there is considerable storage overhead for variable length strings in HDF5, and leaving this option as True (default) may lead to large file sizes. If False, all strings will be stored in the HDF5 file as fixed length strings, even if they are specified as ‘object’ type. In this case, the string length for any field with ‘object’ type will be determined based on the maximum length of strings found in the first chunk, and this may cause values to be truncated if longer values are found in later chunks. To avoid truncation and large file sizes, manually set the type for all string fields to an explicit fixed length string type, e.g., ‘S10’ for a field where you know at most 10 characters are required.

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

rename_fields : dict[str -> str], optional

Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

samples : list of strings

Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

chunk_width : int, optional

Width (number of samples) to use when storing chunks in output.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.
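
A minimal sketch of the round trip, assuming h5py is installed ('example.vcf' and 'example.h5' are hypothetical paths). Arrays are stored as HDF5 datasets under the requested group, keyed by field name:

>>> import h5py
>>> import allel
>>> allel.vcf_to_hdf5('example.vcf', 'example.h5',
...                   fields=['variants/POS', 'calldata/GT'], overwrite=True)
>>> with h5py.File('example.h5', mode='r') as h5f:
...     pos = h5f['variants/POS'][:]
...     gt = allel.GenotypeArray(h5f['calldata/GT'][:])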

allel.vcf_to_zarr(input, output, group='/', compressor='default', overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)[source]

Read data from a VCF file and load into a Zarr on-disk store.

Changed in version 1.12.0: Now will not create any output files if no variants are found in the VCF file or matching the requested region.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

output : string

File-system path to write output to.

group : string

Group within destination Zarr hierarchy to store data in.

compressor : compressor

Compression algorithm, e.g., zarr.Blosc(cname='zstd', clevel=1, shuffle=1).

overwrite : bool

If False (default), do not overwrite an existing file.

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

rename_fields : dict[str -> str], optional

Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

samples : list of strings

Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

chunk_width : int, optional

Width (number of samples) to use when storing chunks in output.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.
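
A minimal sketch of the round trip, assuming zarr is installed ('example.vcf' and 'example.zarr' are hypothetical paths). The resulting store can be opened as a Zarr group and arrays accessed by field name:

>>> import zarr
>>> import allel
>>> allel.vcf_to_zarr('example.vcf', 'example.zarr',
...                   fields=['variants/POS', 'calldata/GT'], overwrite=True)
>>> callset = zarr.open_group('example.zarr', mode='r')
>>> gt = allel.GenotypeArray(callset['calldata/GT'][:])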

allel.vcf_to_dataframe(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)[source]

Read data from a VCF file into a pandas DataFrame.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.

Returns:
df : pandas.DataFrame
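
For illustration ('example.vcf' is a hypothetical path); note there is no samples argument, so this function is normally used with variants fields, which map onto DataFrame columns:

>>> import allel
>>> df = allel.vcf_to_dataframe('example.vcf',
...                             fields=['variants/CHROM', 'variants/POS', 'variants/DP'])
>>> # column names correspond to the requested variants fields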

allel.vcf_to_csv(input, output, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None, **kwargs)[source]

Read data from a VCF file and write out to a comma-separated values (CSV) file.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

output : string

File-system path to write output to.

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.

kwargs : keyword arguments

All remaining keyword arguments are passed through to pandas.DataFrame.to_csv(). E.g., to write a tab-delimited file, provide sep='\t'.
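
For illustration, a sketch writing a tab-delimited file (hypothetical paths; sep is passed through to pandas.DataFrame.to_csv()):

>>> import allel
>>> allel.vcf_to_csv('example.vcf', 'example.txt',
...                  fields=['variants/CHROM', 'variants/POS', 'variants/QUAL'],
...                  sep='\t')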

allel.vcf_to_recarray(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)[source]

Read data from a VCF file into a NumPy recarray.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

log : file-like, optional

A file-like object (e.g., sys.stderr) to print progress information.

Returns:
ra : np.rec.array
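
For illustration ('example.vcf' is a hypothetical path); the recarray dtype names can be inspected to see which columns were extracted:

>>> import allel
>>> ra = allel.vcf_to_recarray('example.vcf',
...                            fields=['variants/CHROM', 'variants/POS', 'variants/QUAL'])
>>> ra.dtype.names  # columns extracted
>>> ra.shape        # one record per variant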

allel.iter_vcf_chunks(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536)[source]

Iterate over chunks of data from a VCF file as NumPy arrays.

Parameters:
input : string

Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).

fields : list of strings, optional

Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.

exclude_fields : list of strings, optional

Fields to exclude. E.g., for use in combination with fields='*'.

types : dict, optional

Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.

numbers : dict, optional

Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.

alt_number : int, optional

Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.

fills : dict, optional

Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.

region : string, optional

Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.

tabix : string, optional

Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.

samples : list of strings

Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.

transformers : list of transformer objects, optional

Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.

buffer_size : int, optional

Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.

chunk_length : int, optional

Length (number of variants) of chunks in which data are processed.

Returns:
fields : list of strings

Normalised names of fields that will be extracted.

samples : ndarray

Samples for which data will be extracted.

headers : VCFHeaders

Tuple of metadata extracted from VCF headers.

it : iterator

Chunk iterator.
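
A hedged sketch of streaming through a large file chunk by chunk (hypothetical path and fields). Each item yielded by the iterator carries a chunk dictionary of arrays as its first element, covering up to chunk_length variants:

>>> import allel
>>> fields, samples, headers, it = allel.iter_vcf_chunks(
...     'example.vcf', fields=['variants/POS', 'calldata/GT'], chunk_length=10000)
>>> n_variants = 0
>>> for chunk_data in it:
...     chunk = chunk_data[0]  # dict of arrays for this chunk
...     n_variants += chunk['variants/POS'].shape[0]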

allel.read_vcf_headers(input)[source]

Read headers from a VCF file.

class allel.ANNTransformer

allel.write_vcf(path, callset, rename=None, number=None, description=None, fill=None, write_header=True)[source]

Preliminary support for writing a VCF file. Currently does not support sample data. Needs further work.
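
As a hedged sketch (hypothetical paths), a callset of variants data returned by read_vcf can be written back out; calldata is not supported:

>>> import allel
>>> callset = allel.read_vcf('example.vcf', fields=['variants/*'])
>>> if callset is not None:
...     allel.write_vcf('example_out.vcf', callset)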

GFF3

allel.gff3_to_dataframe(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', **kwargs)[source]

Load data from a GFF3 into a pandas DataFrame.

Parameters:
path : string

Path to input file.

attributes : list of strings, optional

List of columns to extract from the “attributes” field.

region : string, optional

Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.

score_fill : int, optional

Value to use where score field has a missing value.

phase_fill : int, optional

Value to use where phase field has a missing value.

attributes_fill : object or list of objects, optional

Value(s) to use where attribute field(s) have a missing value.

tabix : string, optional

Tabix command.

Returns:
pandas.DataFrame
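
For illustration ('example.gff3' is a hypothetical path; 'ID' and 'Parent' are commonly used attribute names and may differ in your annotation). Core GFF3 columns such as seqid, type, start and end become DataFrame columns alongside the requested attributes:

>>> import allel
>>> df = allel.gff3_to_dataframe('example.gff3', attributes=['ID', 'Parent'])
>>> genes = df[df['type'] == 'gene']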

allel.gff3_to_recarray(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', dtype=None)[source]

Load data from a GFF3 into a NumPy recarray.

Parameters:
path : string

Path to input file.

attributes : list of strings, optional

List of columns to extract from the “attributes” field.

region : string, optional

Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.

score_fill : int, optional

Value to use where score field has a missing value.

phase_fill : int, optional

Value to use where phase field has a missing value.

attributes_fill : object or list of objects, optional

Value(s) to use where attribute field(s) have a missing value.

tabix : string, optional

Tabix command.

dtype : dtype, optional

Override dtype.

Returns:
np.recarray

allel.iter_gff3(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix')[source]

Iterate over records in a GFF3 file.

Parameters:
path : string

Path to input file.

attributes : list of strings, optional

List of columns to extract from the “attributes” field.

region : string, optional

Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.

score_fill : int, optional

Value to use where score field has a missing value.

phase_fill : int, optional

Value to use where phase field has a missing value.

attributes_fill : object or list of objects, optional

Value(s) to use where attribute field(s) have a missing value.

tabix : string

Tabix command.

Returns:
Iterator

Fasta

allel.write_fasta(path, sequences, names, mode='w', width=80)[source]

Write nucleotide sequences stored as numpy arrays to a FASTA file.

Parameters:
path : string

File path.

sequences : sequence of arrays

One or more ndarrays of dtype ‘S1’ containing the sequences.

names : sequence of strings

Names of the sequences.

mode : string, optional

Use ‘a’ to append to an existing file.

width : int, optional

Maximum line width.
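
A minimal sketch with a toy sequence ('example.fasta' and 'chr1' are hypothetical); sequences must be arrays of dtype 'S1', one byte per base:

>>> import numpy as np
>>> import allel
>>> seq = np.frombuffer(b'ACGTACGTACGT', dtype='S1')
>>> allel.write_fasta('example.fasta', [seq], ['chr1'], width=60)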