Input/output utilities
Variant Call Format (VCF)
allel.read_vcf(input, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)
Read data from a VCF file into NumPy arrays.
Changed in version 1.12.0: Now returns None if no variants are found in the VCF file or matching the requested region.
Parameters:
- input : string or file-like
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- rename_fields : dict[str -> str], optional
Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- samples : list of strings
Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
Returns:
- data : dict[str, ndarray]
A dictionary holding arrays, or None if no variants were found.
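As a hedged usage sketch of the None return value documented above: the helper below wraps allel.read_vcf and normalises the "no variants" case to an empty dict. The helper name load_fields and its reader parameter are hypothetical (the latter exists only so the sketch can be exercised without a VCF file or scikit-allel installed); only read_vcf, its fields and region parameters, and the None return come from this page.

```python
def load_fields(path, fields, region=None, reader=None):
    """Read selected VCF fields, returning {} when no variants match.

    `reader` is a hypothetical injection point used for testing; by
    default the function calls allel.read_vcf (requires scikit-allel).
    """
    if reader is None:
        import allel  # imported lazily so the sketch loads without scikit-allel
        reader = allel.read_vcf
    data = reader(path, fields=fields, region=region)
    # Since version 1.12.0, read_vcf returns None rather than a dict
    # when the file (or the requested region) contains no variants.
    return {} if data is None else data


# Stand-in reader that mimics the "no variants found" case.
def _empty_reader(path, fields=None, region=None):
    return None


result = load_fields("example.vcf", ["variants/POS", "calldata/GT"],
                     region="2L:100000-200000", reader=_empty_reader)
print(result)  # {}
```

Callers that index straight into the result (e.g., data['variants/POS']) would otherwise fail with a TypeError on None, so checking for None first is the safer pattern.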
allel.vcf_to_npz(input, output, compressed=True, overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix=True, samples=None, transformers=None, buffer_size=16384, chunk_length=65536, log=None)
Read data from a VCF file into NumPy arrays and save as a .npz file.
Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- output : string
File-system path to write output to.
- compressed : bool, optional
If True (default), save with compression.
- overwrite : bool, optional
If False (default), do not overwrite an existing file.
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- rename_fields : dict[str -> str], optional
Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- samples : list of strings
Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
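The prefix-matching rule described for the fields parameter above can be sketched in plain Python. This is only an illustration of the stated rule (variants take priority over calldata when an ID appears in both INFO and FORMAT headers), not the library's implementation; resolve_fields, FIXED_FIELDS, and the choice to raise on unknown names are assumptions made for the example.

```python
# Fixed VCF columns treated as variants fields in this sketch (an assumption;
# FILTER handling is deliberately left out).
FIXED_FIELDS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL"}


def resolve_fields(fields, info_ids, format_ids):
    """Expand unprefixed field names to 'variants/...' or 'calldata/...'."""
    resolved = []
    for name in fields:
        if name.startswith(("variants/", "calldata/")):
            resolved.append(name)  # already a complete path
        elif name in FIXED_FIELDS or name in info_ids:
            # INFO (variants) wins when the ID is also declared in FORMAT.
            resolved.append("variants/" + name)
        elif name in format_ids:
            resolved.append("calldata/" + name)
        else:
            # Sketch choice only; the library may handle this differently.
            raise ValueError("field not declared in header: %r" % name)
    return resolved


info_ids = {"DP", "AC"}          # IDs declared in INFO headers
format_ids = {"GT", "DP", "GQ"}  # IDs declared in FORMAT headers
print(resolve_fields(["CHROM", "POS", "DP", "GT"], info_ids, format_ids))
# ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']
```

Note how 'DP' resolves to 'variants/DP' here; to obtain the per-sample depth, the complete path 'calldata/DP' has to be given explicitly.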
allel.vcf_to_hdf5(input, output, group='/', compression='gzip', compression_opts=1, shuffle=False, overwrite=False, vlen=True, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)
Read data from a VCF file and load into an HDF5 file.
Changed in version 1.12.0: Now will not create any output file if no variants are found in the VCF file or matching the requested region.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- output : string
File-system path to write output to.
- group : string
Group within destination HDF5 file to store data in.
- compression : string
Compression algorithm, e.g., ‘gzip’ (default).
- compression_opts : int
Compression level, e.g., 1 (default).
- shuffle : bool
Use byte shuffling, which may improve compression (default is False).
- overwrite : bool
If False (default), do not overwrite an existing file.
- vlen : bool
If True, store variable length strings. Note that there is considerable storage overhead for variable length strings in HDF5, and leaving this option as True (default) may lead to large file sizes. If False, all strings will be stored in the HDF5 file as fixed length strings, even if they are specified as ‘object’ type. In this case, the string length for any field with ‘object’ type will be determined based on the maximum length of strings found in the first chunk, and this may cause values to be truncated if longer values are found in later chunks. To avoid truncation and large file sizes, manually set the type for all string fields to an explicit fixed length string type, e.g., ‘S10’ for a field where you know at most 10 characters are required.
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- rename_fields : dict[str -> str], optional
Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- samples : list of strings
Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- chunk_width : int, optional
Width (number of samples) to use when storing chunks in output.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
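The truncation risk described for vlen=False above can be illustrated with a small pure-Python simulation. It is not how the library stores data; fixed_width is a hypothetical helper that mimics what a fixed-length 'S<width>' string type does to values that are too long, and the allele strings are made up for the example.

```python
def fixed_width(values, width):
    """Encode strings as byte strings cut to `width`, like an 'S<width>' dtype."""
    return [v.encode("ascii")[:width] for v in values]


first_chunk = ["A", "AT", "ACG"]   # made-up REF alleles in the first chunk
later_chunk = ["ACGTACGT"]         # a longer allele appearing later

# Width inferred from the first chunk only, as described for vlen=False
# when a field is left with 'object' type.
inferred = max(len(v) for v in first_chunk)
print(inferred)                            # 3
print(fixed_width(later_chunk, inferred))  # [b'ACG']  (value truncated)

# Width chosen explicitly, analogous to types={'variants/REF': 'S10'}.
print(fixed_width(later_chunk, 10))        # [b'ACGTACGT']
```

Setting an explicit fixed-length type per string field, sized from what you know about the data, avoids both the silent truncation and the storage overhead of variable length strings.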
allel.vcf_to_zarr(input, output, group='/', compressor='default', overwrite=False, fields=None, exclude_fields=None, rename_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536, chunk_width=64, log=None)
Read data from a VCF file and load into a Zarr on-disk store.
Changed in version 1.12.0: Now will not create any output files if no variants are found in the VCF file or matching the requested region.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- output : string
File-system path to write output to.
- group : string
Group within destination Zarr hierarchy to store data in.
- compressor : compressor
Compression algorithm, e.g., zarr.Blosc(cname='zstd', clevel=1, shuffle=1).
- overwrite : bool
If False (default), do not overwrite an existing file.
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- rename_fields : dict[str -> str], optional
Fields to be renamed. Should be a dictionary mapping old to new names, giving the complete path, e.g., {'variants/FOO': 'variants/bar'}.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- samples : list of strings
Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- chunk_width : int, optional
Width (number of samples) to use when storing chunks in output.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
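To get a feel for how chunk_length and chunk_width interact, the sketch below counts the chunks needed for one calldata array. It assumes, without confirmation from this page, that such an array has shape (n_variants, n_samples, ...) and is chunked along its first dimension by chunk_length and along its second by chunk_width; chunk_grid and the example sizes are hypothetical.

```python
def chunk_grid(n_variants, n_samples, chunk_length=65536, chunk_width=64):
    """Return (chunks along variants, chunks along samples).

    Uses integer ceiling division: -(-a // b) == ceil(a / b) for positive ints.
    """
    n_row_chunks = -(-n_variants // chunk_length)
    n_col_chunks = -(-n_samples // chunk_width)
    return n_row_chunks, n_col_chunks


# One million variants and 765 samples with the default chunk settings.
rows, cols = chunk_grid(1_000_000, 765)
print(rows, cols)    # 16 12
print(rows * cols)   # 192 chunks in total under the stated assumption
```

Smaller chunks make it cheaper to read a narrow slice of samples or variants but increase the number of stored pieces; the defaults shown in the signature are a reasonable starting point.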
allel.vcf_to_dataframe(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)
Read data from a VCF file into a pandas DataFrame.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
Returns:
- df : pandas.DataFrame
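The difference between the region filter described above (POS must fall inside the range) and tabix-style overlap can be made concrete with a short sketch. The helpers parse_region, pos_in_region, and overlaps_region are hypothetical and handle only the two region forms named on this page; the variant used is made up.

```python
def parse_region(region):
    """Split 'chrom' or 'chrom:start-end' into (chrom, start, end).

    Coordinates are 1-based and inclusive; start and end are None when
    only a chromosome name is given.
    """
    chrom, sep, span = region.rpartition(":")
    if not sep:
        return region, None, None
    start, end = span.split("-")
    return chrom, int(start), int(end)


def pos_in_region(chrom, pos, region):
    """Rule described on this page: keep a variant only if POS is in range."""
    r_chrom, start, end = parse_region(region)
    if chrom != r_chrom:
        return False
    return start is None or start <= pos <= end


def overlaps_region(chrom, pos, ref, region):
    """Overlap rule: the reference allele spans [pos, pos + len(ref) - 1]."""
    r_chrom, start, end = parse_region(region)
    if chrom != r_chrom:
        return False
    return start is None or (pos <= end and pos + len(ref) - 1 >= start)


# A deletion starting just before the region whose REF allele reaches into it.
region = "2L:100000-200000"
print(pos_in_region("2L", 99998, region))             # False
print(overlaps_region("2L", 99998, "ACGTA", region))  # True
```

A variant like this one would be reported by an overlap query but, per the description above, is left out of the data returned by these functions.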
allel.vcf_to_csv(input, output, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None, **kwargs)
Read data from a VCF file and write out to a comma-separated values (CSV) file.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- output : string
File-system path to write output to.
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class which implements post-processing of data from SNPEFF.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
- kwargs : keyword arguments
All remaining keyword arguments are passed through to pandas.DataFrame.to_csv(). E.g., to write a tab-delimited file, provide sep='\t'.
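A minimal sketch of the keyword pass-through for tab-delimited output follows. The wrapper vcf_to_tsv and its writer parameter are hypothetical (writer exists only so the example runs without scikit-allel or an input file); sep and na_rep are existing pandas.DataFrame.to_csv() options.

```python
def vcf_to_tsv(vcf_path, out_path, writer=None, **kwargs):
    """Write a tab-delimited file by forwarding sep='\\t' to vcf_to_csv."""
    if writer is None:
        import allel  # imported lazily so the sketch loads without scikit-allel
        writer = allel.vcf_to_csv
    # Default to a tab separator unless the caller chose one explicitly.
    kwargs.setdefault("sep", "\t")
    return writer(vcf_path, out_path, **kwargs)


# Stand-in writer that records what would be passed to allel.vcf_to_csv.
calls = []


def _fake_writer(vcf_path, out_path, **kwargs):
    calls.append((vcf_path, out_path, kwargs))


vcf_to_tsv("example.vcf", "example.tsv", writer=_fake_writer,
           fields=["CHROM", "POS"], na_rep=".")
print(calls[0][2]["sep"] == "\t")  # True
```

Note that the separator is the tab character written as the escape sequence '\t', not the letter t.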
allel.vcf_to_recarray(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', transformers=None, buffer_size=16384, chunk_length=65536, log=None)
Read data from a VCF file into a NumPy recarray.
Parameters:
- input : string
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists both in INFO and FORMAT headers. I.e., ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’ which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields) provide 'variants/*'. To extract all calldata fields (i.e., defined in FORMAT headers) provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but is the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class, which implements post-processing of data from SnpEff.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
- log : file-like, optional
A file-like object (e.g., sys.stderr) to print progress information.
Returns: - ra : np.rec.array
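The prefix-matching rule described for the fields parameter can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper name resolve_field, the set of fixed column names, and the choice to raise an error for undeclared names are all assumptions made for illustration. It only shows how an unprefixed name could be resolved with variants (INFO) taking priority over calldata (FORMAT).

```python
# Hypothetical helper, not part of allel: resolves an unprefixed field name
# against IDs declared in the VCF header, with INFO taking priority over FORMAT.
FIXED_COLUMNS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL"}  # assumed subset


def resolve_field(name, info_ids, format_ids):
    """Return a fully qualified field path for a possibly unprefixed name."""
    # Names that already carry a prefix are taken as given.
    if name.startswith(("variants/", "calldata/")):
        return name
    # Fixed columns and INFO fields belong to the variants group.
    if name in FIXED_COLUMNS or name in info_ids:
        return "variants/" + name
    # Only consult FORMAT if the name was not found among the variants fields.
    if name in format_ids:
        return "calldata/" + name
    # Raising here is a choice of this sketch, not documented behaviour.
    raise ValueError("field not declared in header: %r" % name)


info_ids = {"DP", "AC"}          # IDs from INFO header lines (example values)
format_ids = {"GT", "DP", "GQ"}  # IDs from FORMAT header lines (example values)
resolved = [resolve_field(f, info_ids, format_ids)
            for f in ["CHROM", "POS", "DP", "GT"]]
print(resolved)
```

Because ‘DP’ is declared in both groups in this example, it resolves to 'variants/DP'; requesting the per-sample depth requires the explicit path 'calldata/DP'.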
-
allel.
iter_vcf_chunks
(input, fields=None, exclude_fields=None, types=None, numbers=None, alt_number=3, fills=None, region=None, tabix='tabix', samples=None, transformers=None, buffer_size=16384, chunk_length=65536)[source]¶ Iterate over chunks of data from a VCF file as NumPy arrays.
Parameters: - input : string or file-like
Path to VCF file on the local file system. May be uncompressed or gzip-compatible compressed file. May also be a file-like object (e.g., io.BytesIO).
- fields : list of strings, optional
Fields to extract data for. Should be a list of strings, e.g., ['variants/CHROM', 'variants/POS', 'variants/DP', 'calldata/GT']. The ‘variants/’ and ‘calldata/’ prefixes may be dropped, in which case the fields will be matched against fields declared in the VCF header, with variants taking priority over calldata if a field with the same ID exists in both INFO and FORMAT headers. For example, ['CHROM', 'POS', 'DP', 'GT'] will work, although watch out for fields like ‘DP’, which can be both INFO and FORMAT. For convenience, some special string values are also recognized. To extract all fields, provide just the string '*'. To extract all variants fields (including all INFO fields), provide 'variants/*'. To extract all calldata fields (i.e., those defined in FORMAT headers), provide 'calldata/*'.
- exclude_fields : list of strings, optional
Fields to exclude. E.g., for use in combination with fields='*'.
- types : dict, optional
Override data types. Should be a dictionary mapping field names to NumPy data types. E.g., providing the dictionary {'variants/DP': 'i8', 'calldata/GQ': 'i2'} will mean the ‘variants/DP’ field is stored in a 64-bit integer array, and the ‘calldata/GQ’ field is stored in a 16-bit integer array.
- numbers : dict, optional
Override the expected number of values. Should be a dictionary mapping field names to integers. E.g., providing the dictionary {'variants/ALT': 5, 'variants/AC': 5, 'calldata/HQ': 2} will mean that, for each variant, 5 values are stored for the ‘variants/ALT’ field, 5 values are stored for the ‘variants/AC’ field, and for each sample, 2 values are stored for the ‘calldata/HQ’ field.
- alt_number : int, optional
Assume this number of alternate alleles and set expected number of values accordingly for any field declared with number ‘A’ or ‘R’ in the VCF meta-information.
- fills : dict, optional
Override the fill value used for empty values. Should be a dictionary mapping field names to fill values.
- region : string, optional
Genomic region to extract variants for. If provided, should be a tabix-style region string, which can be either just a chromosome name (e.g., ‘2L’), or a chromosome name followed by 1-based beginning and end coordinates (e.g., ‘2L:100000-200000’). Note that only variants whose start position (POS) is within the requested range will be included. This is slightly different from the default tabix behaviour, where a variant (e.g., deletion) may be included if its position (POS) occurs before the requested region but its reference allele overlaps the region - such a variant will not be included in the data returned by this function.
- tabix : string, optional
Name or path to tabix executable. Only required if region is given. Setting tabix to None will cause a fall-back to scanning through the VCF file from the beginning, which may be much slower than tabix but is the only option if tabix is not available on your system and/or the VCF file has not been tabix-indexed.
- samples : list of strings, optional
Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples.
- transformers : list of transformer objects, optional
Transformers for post-processing data. If provided, should be a list of Transformer objects, each of which must implement a “transform()” method that accepts a dict containing the chunk of data to be transformed. See also the ANNTransformer class, which implements post-processing of data from SnpEff.
- buffer_size : int, optional
Size in bytes of the I/O buffer used when reading data from the underlying file or tabix stream.
- chunk_length : int, optional
Length (number of variants) of chunks in which data are processed.
Returns: - fields : list of strings
Normalised names of fields that will be extracted.
- samples : ndarray
Samples for which data will be extracted.
- headers : VCFHeaders
Tuple of metadata extracted from VCF headers.
- it : iterator
Chunk iterator.
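The region semantics described above (only variants whose POS falls inside the requested range are kept, even if a preceding deletion overlaps the range) can be sketched in plain Python. The helpers parse_region and pos_in_region are hypothetical names, not library functions, and the sketch assumes both coordinates are inclusive, following the usual tabix convention.

```python
# Hypothetical helpers illustrating the POS-based region filter described above.
def parse_region(region):
    """Split a tabix-style region string into (chrom, begin, end)."""
    if ":" not in region:
        # Just a chromosome name, e.g. '2L'.
        return region, None, None
    chrom, _, span = region.rpartition(":")
    begin, _, end = span.partition("-")
    return chrom, int(begin), int(end)


def pos_in_region(chrom, pos, region):
    """True if a variant starting at (chrom, pos) is inside the region."""
    r_chrom, begin, end = parse_region(region)
    if chrom != r_chrom:
        return False
    if begin is None:
        return True
    # Assumed inclusive, 1-based coordinates on both ends.
    return begin <= pos <= end


# (CHROM, POS, REF) records; the first is a deletion whose reference allele
# overlaps the region although its POS lies before it.
records = [
    ("2L", 99998, "ACGTA"),
    ("2L", 100000, "A"),
    ("2L", 200000, "G"),
    ("2L", 200001, "C"),
    ("3R", 150000, "T"),
]
kept = [pos for chrom, pos, ref in records
        if pos_in_region(chrom, pos, "2L:100000-200000")]
print(kept)
```

The deletion at position 99998 is excluded here because only POS is tested, whereas an overlap-based query such as default tabix behaviour could include it.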
-
class
allel.
ANNTransformer
¶
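As a rough illustration of the transformer interface mentioned for the transformers parameter, the sketch below defines a class with a transform() method that accepts a dict of field data. The class name RefLengthTransformer and the derived field 'variants/REF_LEN' are hypothetical, the sketch assumes the chunk is modified in place, and it uses plain lists where real chunks would hold NumPy arrays.

```python
# Hypothetical transformer; only the transform(chunk) method shape comes from
# the parameter description above. Everything else is an assumption.
class RefLengthTransformer:
    """Add a derived field holding the length of each reference allele."""

    def transform(self, chunk):
        refs = chunk.get("variants/REF")
        if refs is None:
            # Nothing to do if the REF field was not requested.
            return
        # Assumed convention: modify the chunk dict in place.
        chunk["variants/REF_LEN"] = [len(ref) for ref in refs]


# Stand-in for one chunk of data; real chunks would contain NumPy arrays.
chunk = {
    "variants/POS": [100, 250],
    "variants/REF": ["A", "ACG"],
}
RefLengthTransformer().transform(chunk)
print(chunk["variants/REF_LEN"])
```

An instance of such a class would be passed in a list, e.g. transformers=[RefLengthTransformer()], provided the fields it relies on are among those extracted.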
GFF3¶
-
allel.
gff3_to_dataframe
(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', **kwargs)[source]¶ Load data from a GFF3 into a pandas DataFrame.
Parameters: - path : string
Path to input file.
- attributes : list of strings, optional
List of columns to extract from the “attributes” field.
- region : string, optional
Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
- score_fill : int, optional
Value to use where score field has a missing value.
- phase_fill : int, optional
Value to use where phase field has a missing value.
- attributes_fill : object or list of objects, optional
Value(s) to use where attribute field(s) have a missing value.
- tabix : string, optional
Tabix command.
Returns: - pandas.DataFrame
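To make the attributes and fill parameters more concrete, here is a minimal standard-library sketch of how selected keys might be pulled out of a GFF3 “attributes” column, with a fill value substituted where a key or numeric field is missing. The helper names parse_attributes and parse_numeric are hypothetical, and details such as URL-decoding of values are deliberately left out.

```python
# Hypothetical helpers, not the library's parser.
def parse_attributes(column, wanted, fill="."):
    """Extract the wanted keys from a 'key=value;key=value' attributes column."""
    found = {}
    for pair in column.split(";"):
        if "=" in pair:
            key, _, value = pair.partition("=")
            found[key.strip()] = value.strip()
    # Missing keys get the fill value, analogous to attributes_fill.
    return tuple(found.get(key, fill) for key in wanted)


def parse_numeric(value, fill=-1, cast=float):
    """Convert a score or phase column, using fill where GFF3 has '.'."""
    return fill if value == "." else cast(value)


row = parse_attributes("ID=gene0001;Name=abc", ["ID", "Name", "Parent"])
score = parse_numeric(".")            # missing score, analogous to score_fill
phase = parse_numeric("2", cast=int)  # phase present
print(row, score, phase)
```

In this example ‘Parent’ is absent from the record, so the corresponding value is filled with '.', and the missing score becomes -1.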
-
allel.
gff3_to_recarray
(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix', dtype=None)[source]¶ Load data from a GFF3 into a NumPy recarray.
Parameters: - path : string
Path to input file.
- attributes : list of strings, optional
List of columns to extract from the “attributes” field.
- region : string, optional
Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
- score_fill : int, optional
Value to use where score field has a missing value.
- phase_fill : int, optional
Value to use where phase field has a missing value.
- attributes_fill : object or list of objects, optional
Value(s) to use where attribute field(s) have a missing value.
- tabix : string, optional
Tabix command.
- dtype : dtype, optional
Override dtype.
Returns: - np.recarray
-
allel.
iter_gff3
(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', tabix='tabix')[source]¶ Iterate over records in a GFF3 file.
Parameters: - path : string
Path to input file.
- attributes : list of strings, optional
List of columns to extract from the “attributes” field.
- region : string, optional
Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
- score_fill : int, optional
Value to use where score field has a missing value.
- phase_fill : int, optional
Value to use where phase field has a missing value.
- attributes_fill : object or list of objects, optional
Value(s) to use where attribute field(s) have a missing value.
- tabix : string, optional
Tabix command.
Returns: - Iterator
Fasta¶
-
allel.
write_fasta
(path, sequences, names, mode='w', width=80)[source]¶ Write nucleotide sequences stored as numpy arrays to a FASTA file.
Parameters: - path : string
File path.
- sequences : sequence of arrays
One or more ndarrays of dtype ‘S1’ containing the sequences.
- names : sequence of strings
Names of the sequences.
- mode : string, optional
Use ‘a’ to append to an existing file.
- width : int, optional
Maximum line width.
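The effect of the width parameter can be illustrated with a short sketch that wraps a sequence into lines of at most width characters under a FASTA header line. This is not the library's implementation: format_fasta_record is a hypothetical helper, and it takes a plain string where write_fasta expects arrays of dtype ‘S1’.

```python
# Hypothetical helper showing FASTA line wrapping at a maximum width.
def format_fasta_record(name, sequence, width=80):
    """Return one FASTA record with sequence lines of at most `width` chars."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# A 20-base sequence wrapped at width 8 gives lines of 8, 8 and 4 bases.
record = format_fasta_record("chr_demo", "ACGT" * 5, width=8)
print(record, end="")
```

Appending a second record to the same text corresponds to calling write_fasta again with mode 'a'.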