bcolz arrays (deprecated)

This module provides alternative implementations of array classes defined in the allel.model.ndarray module, using bcolz compressed arrays instead of numpy arrays for data storage.

Note

This module is deprecated and will be removed in a future release. It has been superseded by the allel.model.chunked module, which supports both bcolz and HDF5 as the underlying storage layer.

GenotypeCArray

class allel.model.bcolz.GenotypeCArray(data=None, copy=False, **kwargs)

Alternative implementation of the allel.model.ndarray.GenotypeArray class, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_samples, ploidy), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.

Examples

Instantiate a compressed genotype array from existing data:

>>> import allel
>>> g = allel.GenotypeCArray([[[0, 0], [0, 1]],
...                           [[0, 1], [1, 1]],
...                           [[0, 2], [-1, -1]]], dtype='i1')
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes := 12; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 4096; chunksize: 16384; blocksize: 0
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Obtain a numpy ndarray from a compressed array by slicing:

>>> g[:]
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Build incrementally:

>>> import bcolz
>>> data = bcolz.zeros((0, 2, 2), dtype='i1')
>>> data.append([[0, 0], [0, 1]])
>>> data.append([[0, 1], [1, 1]])
>>> data.append([[0, 2], [-1, -1]])
>>> g = allel.GenotypeCArray(data)
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes := 12; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 4096; chunksize: 16384; blocksize: 0
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Load from HDF5:

>>> import h5py
>>> with h5py.File('test1.h5', mode='w') as h5f:
...     h5f.create_dataset('genotype',
...                        data=[[[0, 0], [0, 1]],
...                              [[0, 1], [1, 1]],
...                              [[0, 2], [-1, -1]]],
...                        dtype='i1',
...                        chunks=(2, 2, 2))
...
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
>>> g = allel.GenotypeCArray.from_hdf5('test1.h5', 'genotype')
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes := 12; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 4096; chunksize: 16384; blocksize: 0
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Where possible, methods of this class return bcolz carrays rather than numpy ndarrays. For example:

>>> g.take([0, 2], axis=0)
GenotypeCArray((2, 2, 2), int8)
  nbytes := 8; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 4096; chunksize: 16384; blocksize: 0
[[[ 0  0]
  [ 0  1]]
 [[ 0  2]
  [-1 -1]]]
>>> g.is_called()
CArrayWrapper((3, 2), bool)
  nbytes := 6; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 8192; chunksize: 16384; blocksize: 0
[[ True  True]
 [ True  True]
 [ True False]]
>>> g.to_haplotypes()
HaplotypeCArray((3, 4), int8)
  nbytes := 12; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 4096; chunksize: 16384; blocksize: 0
[[ 0  0  0  1]
 [ 0  1  1  1]
 [ 0  2 -1 -1]]
>>> g.count_alleles()
AlleleCountsCArray((3, 3), int32)
  nbytes := 36; cbytes := 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 1365; chunksize: 16380; blocksize: 0
[[3 1 0]
 [1 3 0]
 [1 0 1]]
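
To make the outputs above easier to check by hand, the following pure-Python sketch recomputes them on nested lists. The helper functions are illustrative stand-ins written for this example, not the allel implementations, and the definition of a "called" genotype used here (every allele non-negative) is an assumption based on the output shown above.

```python
# Pure-Python sketch; these helpers are illustrative stand-ins, not allel APIs.
g = [[[0, 0], [0, 1]],
     [[0, 1], [1, 1]],
     [[0, 2], [-1, -1]]]


def is_called(genotypes):
    """Treat a call as present when every allele is non-negative (assumption)."""
    return [[all(a >= 0 for a in call) for call in variant]
            for variant in genotypes]


def to_haplotypes(genotypes):
    """Flatten the sample and ploidy dimensions into one haplotype axis."""
    return [[a for call in variant for a in call] for variant in genotypes]


def count_alleles(genotypes):
    """Count each allele per variant, skipping missing (negative) alleles."""
    haplotypes = to_haplotypes(genotypes)
    n_alleles = max(a for row in haplotypes for a in row) + 1
    counts = []
    for row in haplotypes:
        row_counts = [0] * n_alleles
        for a in row:
            if a >= 0:
                row_counts[a] += 1
        counts.append(row_counts)
    return counts


called = is_called(g)
haps = to_haplotypes(g)
ac = count_alleles(g)
```

The three results have shapes (3, 2), (3, 4) and (3, 3), matching the shapes reported for the compressed arrays above.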

HaplotypeCArray

class allel.model.bcolz.HaplotypeCArray(data=None, copy=False, **kwargs)

Alternative implementation of the allel.model.ndarray.HaplotypeArray class, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_haplotypes), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.

AlleleCountsCArray

class allel.model.bcolz.AlleleCountsCArray(data=None, copy=False, **kwargs)

Alternative implementation of the allel.model.ndarray.AlleleCountsArray class, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_alleles), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.

VariantCTable

class allel.model.bcolz.VariantCTable(data=None, copy=False, index=None, **kwargs)

Alternative implementation of the allel.model.ndarray.VariantTable class, using a bcolz.ctable as the backing store.

Parameters:

data : tuple or list of column objects, optional

The list of column data to build the ctable object. This can also be a pure NumPy structured array. May also be a bcolz ctable, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz ctable.

index : string or pair of strings, optional

If a single string, name of column to use for a sorted index. If a pair of strings, name of columns to use for a sorted multi-index.

**kwargs : keyword arguments

Passed through to the bcolz ctable constructor.

Examples

Instantiate from existing data:

>>> import allel
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 7, 3, 9, 6]
>>> dp = [35, 12, 78, 22, 99]
>>> qd = [4.5, 6.7, 1.2, 4.4, 2.8]
>>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
>>> vt = allel.VariantCTable([chrom, pos, dp, qd, ac],
...                           names=['CHROM', 'POS', 'DP', 'QD', 'AC'],
...                           index=('CHROM', 'POS'))
>>> vt
VariantCTable((5,), [('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))])
  nbytes: 220; cbytes: 80.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(b'chr1', 2, 35, 4.5, [1, 2]) (b'chr1', 7, 12, 6.7, [3, 4])
 (b'chr2', 3, 78, 1.2, [5, 6]) (b'chr2', 9, 22, 4.4, [7, 8])
 (b'chr3', 6, 99, 2.8, [9, 10])]

Slicing rows returns allel.model.ndarray.VariantTable:

>>> vt[:2]
VariantTable((2,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))]))
[(b'chr1', 2, 35, 4.5, array([1, 2])) (b'chr1', 7, 12, 6.7, array([3, 4]))]

Accessing columns returns allel.model.bcolz.VariantCTable:

>>> vt[['DP', 'QD']]
VariantCTable((5,), [('DP', '<i8'), ('QD', '<f8')])
  nbytes: 80; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(35, 4.5) (12, 6.7) (78, 1.2) (22, 4.4) (99, 2.8)]

Use the index to locate variants:

>>> loc = vt.index.locate_range(b'chr2', 1, 10)
>>> vt[loc]
VariantTable((2,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))]))
[(b'chr2', 3, 78, 1.2, array([5, 6])) (b'chr2', 9, 22, 4.4, array([7, 8]))]
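
The idea behind a sorted multi-index lookup can be sketched with the standard library alone. The function below is a hypothetical stand-in for this example, not the allel index class. It assumes the rows are sorted by (CHROM, POS) and that both ends of the range are inclusive.

```python
from bisect import bisect_left, bisect_right

chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
pos = [2, 7, 3, 9, 6]


def locate_range(chrom, pos, key, start, stop):
    """Return a slice over rows on `key` with start <= pos <= stop.

    Hypothetical helper: assumes rows are sorted by (chrom, pos) and
    treats both ends of the range as inclusive.
    """
    keys = list(zip(chrom, pos))
    lo = bisect_left(keys, (key, start))   # first row at or after (key, start)
    hi = bisect_right(keys, (key, stop))   # one past the last row <= (key, stop)
    return slice(lo, hi)


loc = locate_range(chrom, pos, b'chr2', 1, 10)
rows = list(zip(chrom, pos))[loc]
```

Because the lookup yields a contiguous slice, only the matching rows need to be read from the compressed table.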

FeatureCTable

class allel.model.bcolz.FeatureCTable(data=None, copy=False, **kwargs)

Alternative implementation of the allel.model.ndarray.FeatureTable class, using a bcolz.ctable as the backing store.

Parameters:

data : tuple or list of column objects, optional

The list of column data to build the ctable object. This can also be a pure NumPy structured array. May also be a bcolz ctable, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz ctable.

index : pair or triplet of strings, optional

Names of columns to use for positional index, e.g., (‘start’, ‘stop’) if table contains ‘start’ and ‘stop’ columns and records from a single chromosome/contig, or (‘seqid’, ‘start’, ‘end’) if table contains records from multiple chromosomes/contigs.

**kwargs : keyword arguments

Passed through to the bcolz ctable constructor.

Utility functions

allel.model.bcolz.carray_block_map(domain, f, out=None, blen=None, wrap=None, **kwargs)
allel.model.bcolz.carray_block_sum(carr, axis=None, blen=None, transform=None)
allel.model.bcolz.carray_block_max(carr, axis=None, blen=None)
allel.model.bcolz.carray_block_min(carr, axis=None, blen=None)
allel.model.bcolz.carray_block_compress(carr, condition, axis, blen=None, **kwargs)
allel.model.bcolz.carray_block_take(carr, indices, axis, blen=None, **kwargs)
allel.model.bcolz.carray_block_vstack(tup, blen=None, **kwargs)
allel.model.bcolz.carray_block_hstack(tup, blen=None, **kwargs)
allel.model.bcolz.carray_from_hdf5(*args, **kwargs)

Load a bcolz carray from an HDF5 dataset.

Either provide an h5py dataset as a single positional argument, or provide two positional arguments giving the HDF5 file path and the dataset node path within the file.

The following optional parameters may be given. Any other keyword arguments are passed through to the bcolz.carray constructor.

Parameters:

start : int, optional

Index to start loading from.

stop : int, optional

Index to finish loading at.

condition : array_like, bool, optional

A 1-dimensional boolean array of the same length as the first dimension of the dataset to load, indicating a selection of rows to load.

blen : int, optional

Block size to use when loading.
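
One way to picture how start, stop, condition and blen could interact is the pure-Python sketch below. It is not the library's implementation: load_blocks is a hypothetical helper, a list of rows stands in for the HDF5 dataset, and the choice to index condition by absolute row position is an assumption drawn from the parameter description above.

```python
def load_blocks(dataset, start=0, stop=None, condition=None, blen=2):
    """Yield rows of `dataset` one block at a time, optionally filtered.

    Hypothetical helper: `condition` has one boolean per row of the whole
    dataset and is sliced with the same bounds as each block.
    """
    if stop is None:
        stop = len(dataset)
    if condition is not None and len(condition) != len(dataset):
        raise ValueError('condition must match the first dimension')
    for i in range(start, stop, blen):
        j = min(i + blen, stop)
        block = dataset[i:j]
        if condition is not None:
            block = [row for row, keep in zip(block, condition[i:j]) if keep]
        yield block


dataset = [[0, 0], [0, 1], [1, 1], [0, 2], [-1, -1]]
condition = [True, False, True, True, False]

out = []
for block in load_blocks(dataset, start=1, stop=5, condition=condition, blen=2):
    out.extend(block)  # stands in for appending each block to a carray
```

Reading in blocks keeps at most blen rows uncompressed in memory at a time, which is the motivation for a block size parameter when the source dataset is large.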

allel.model.bcolz.carray_to_hdf5(carr, parent, name, **kwargs)

Write a bcolz carray to an HDF5 dataset.

Parameters:

carr : bcolz.carray

Data to write.

parent : string or h5py group

Parent HDF5 file or group. If a string, will be treated as HDF5 file name.

name : string

Name or path of dataset to write data into.

**kwargs : keyword arguments

Passed through to the h5py require_dataset() method.

Returns:

h5d : h5py dataset

allel.model.bcolz.ctable_block_compress(ctbl, condition, blen=None, **kwargs)
allel.model.bcolz.ctable_block_take(ctbl, indices, **kwargs)
allel.model.bcolz.ctable_from_hdf5_group(*args, **kwargs)

Load a bcolz ctable from columns stored as separate datasets within an HDF5 group.

Either provide an h5py group as a single positional argument, or provide two positional arguments giving the HDF5 file path and the group node path within the file.

The following optional parameters may be given. Any other keyword arguments are passed through to the bcolz.carray constructor.

Parameters:

start : int, optional

Index to start loading from.

stop : int, optional

Index to finish loading at.

condition : array_like, bool, optional

A 1-dimensional boolean array of the same length as the columns of the table to load, indicating a selection of rows to load.

blen : int, optional

Block size to use when loading.

allel.model.bcolz.ctable_to_hdf5_group(ctbl, parent, name, **kwargs)

Write each column in a bcolz ctable to a dataset in an HDF5 group.

Parameters:

ctbl : bcolz.ctable

Data to write.

parent : string or h5py group

Parent HDF5 file or group. If a string, will be treated as HDF5 file name.

name : string

Name or path of group to write data into.

**kwargs : keyword arguments

Passed through to the h5py require_dataset() method.

Returns:

h5g : h5py group