Compressed arrays (bcolz)

This module provides alternative implementations of the array interfaces defined in the allel.model module, using bcolz compressed arrays (bcolz.carray) instead of numpy arrays for data storage. Compressed arrays can be held in main memory or stored on disk. In either case, they enable analysis of data that are too large to fit uncompressed into main memory.

GenotypeCArray

class allel.bcolz.GenotypeCArray(data=None, copy=True, **kwargs)

Alternative implementation of the allel.model.GenotypeArray interface, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_samples, ploidy), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.

Examples

Instantiate a compressed genotype array from existing data:

>>> import allel
>>> g = allel.bcolz.GenotypeCArray([[[0, 0], [0, 1]],
...                                 [[0, 1], [1, 1]],
...                                 [[0, 2], [-1, -1]]], dtype='i1')
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes: 12; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Obtain a numpy ndarray from a compressed array by slicing:

>>> g[:]
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Build incrementally:

>>> import bcolz
>>> data = bcolz.zeros((0, 2, 2), dtype='i1')
>>> data.append([[0, 0], [0, 1]])
>>> data.append([[0, 1], [1, 1]])
>>> data.append([[0, 2], [-1, -1]])
>>> g = allel.bcolz.GenotypeCArray(data, copy=False)
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes: 12; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Load from HDF5:

>>> import h5py
>>> with h5py.File('example.h5', mode='w') as h5f:
...     h5f.create_dataset('genotype',
...                        data=[[[0, 0], [0, 1]],
...                              [[0, 1], [1, 1]],
...                              [[0, 2], [-1, -1]]],
...                        dtype='i1',
...                        chunks=(2, 2, 2))
...
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
>>> g = allel.bcolz.GenotypeCArray.from_hdf5('example.h5', 'genotype')
>>> g
GenotypeCArray((3, 2, 2), int8)
  nbytes: 12; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Note that methods of this class return bcolz carrays rather than numpy ndarrays where possible. For example:

>>> g.take([0, 2], axis=0)
GenotypeCArray((2, 2, 2), int8)
  nbytes: 8; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[[ 0  0]
  [ 0  1]]
 [[ 0  2]
  [-1 -1]]]
>>> g.is_called()
carray((3, 2), bool)
  nbytes: 6; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[ True  True]
 [ True  True]
 [ True False]]
>>> g.to_haplotypes()
HaplotypeCArray((3, 4), int8)
  nbytes: 12; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[ 0  0  0  1]
 [ 0  1  1  1]
 [ 0  2 -1 -1]]
>>> g.count_alleles()
AlleleCountsCArray((3, 3), int32)
  nbytes: 36; cbytes: 16.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[3 1 0]
 [1 3 0]
 [1 0 1]]
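
To make the outputs above easier to follow, here is a minimal sketch of the same two computations on an uncompressed numpy array. It is not the library's implementation. It assumes that negative values (here -1) encode missing allele calls, as the is_called() output above suggests.

```python
import numpy as np

# Same genotype calls as in the examples above.
# Assumption: -1 marks a missing allele call.
g = np.array([[[0, 0], [0, 1]],
              [[0, 1], [1, 1]],
              [[0, 2], [-1, -1]]], dtype='i1')

# A genotype counts as "called" when every allele is non-missing (>= 0).
is_called = np.all(g >= 0, axis=2)

# Haplotypes: fold the ploidy axis into the sample axis, so that
# (n_variants, n_samples, ploidy) becomes (n_variants, n_samples * ploidy).
n_variants, n_samples, ploidy = g.shape
haplotypes = g.reshape(n_variants, n_samples * ploidy)

print(is_called)
print(haplotypes)
```

The printed arrays contain the same values as the is_called() and to_haplotypes() results shown above, but as plain numpy arrays.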

HaplotypeCArray

class allel.bcolz.HaplotypeCArray(data=None, copy=True, **kwargs)

Alternative implementation of the allel.model.HaplotypeArray interface, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_haplotypes), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.

AlleleCountsCArray

class allel.bcolz.AlleleCountsCArray(data=None, copy=True, **kwargs)

Alternative implementation of the allel.model.AlleleCountsArray interface, using a bcolz.carray as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_alleles), optional

Data to initialise the array with. May be a bcolz carray, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based array).

copy : bool, optional

If True, copy the input data into a new bcolz carray.

**kwargs : keyword arguments

Passed through to the bcolz carray constructor.
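
The (n_variants, n_alleles) layout can be illustrated with a short numpy sketch that derives allele counts from genotype calls. The helper count_alleles below is hypothetical and written for illustration only. It is not the library's method, and it assumes that negative values encode missing calls.

```python
import numpy as np

g = np.array([[[0, 0], [0, 1]],
              [[0, 1], [1, 1]],
              [[0, 2], [-1, -1]]], dtype='i1')


def count_alleles(g, max_allele=None):
    """Hypothetical helper: count occurrences of each allele per variant."""
    if max_allele is None:
        max_allele = int(g.max())
    ac = np.zeros((g.shape[0], max_allele + 1), dtype='i4')
    for allele in range(max_allele + 1):
        # Compare against one allele at a time, summing over samples and
        # ploidy; missing calls (-1) never match a non-negative allele.
        ac[:, allele] = np.sum(g == allele, axis=(1, 2))
    return ac


ac = count_alleles(g)
print(ac)
```

Each row holds the counts for one variant and each column the count of one allele, so row 0 is [3, 1, 0]: three copies of allele 0, one of allele 1 and none of allele 2.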

VariantCTable

class allel.bcolz.VariantCTable(data=None, copy=True, index=None, **kwargs)

Alternative implementation of the allel.model.VariantTable interface, using a bcolz.ctable as the backing store.

Parameters:

data : tuple or list of column objects, optional

Column data from which to build the ctable. May also be a NumPy structured array, or a bcolz ctable, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based table).

copy : bool, optional

If True, copy the input data into a new bcolz ctable.

index : string or pair of strings, optional

If a single string, the name of the column to use for a sorted index. If a pair of strings, the names of the columns to use for a sorted multi-index.

**kwargs : keyword arguments

Passed through to the bcolz ctable constructor.

Examples

Instantiate from existing data:

>>> import allel
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 7, 3, 9, 6]
>>> dp = [35, 12, 78, 22, 99]
>>> qd = [4.5, 6.7, 1.2, 4.4, 2.8]
>>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
>>> vt = allel.bcolz.VariantCTable([chrom, pos, dp, qd, ac],
...                                names=['CHROM', 'POS', 'DP', 'QD', 'AC'],
...                                index=('CHROM', 'POS'))
>>> vt
VariantCTable((5,), [('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))])
  nbytes: 220; cbytes: 80.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(b'chr1', 2, 35, 4.5, [1, 2]) (b'chr1', 7, 12, 6.7, [3, 4])
 (b'chr2', 3, 78, 1.2, [5, 6]) (b'chr2', 9, 22, 4.4, [7, 8])
 (b'chr3', 6, 99, 2.8, [9, 10])]

Slicing rows returns an allel.model.VariantTable:

>>> vt[:2]
VariantTable((2,), dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))])
[(b'chr1', 2, 35, 4.5, array([1, 2])) (b'chr1', 7, 12, 6.7, array([3, 4]))]

Accessing columns returns an allel.bcolz.VariantCTable:

>>> vt[['DP', 'QD']]
VariantCTable((5,), [('DP', '<i8'), ('QD', '<f8')])
  nbytes: 80; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(35, 4.5) (12, 6.7) (78, 1.2) (22, 4.4) (99, 2.8)]

Use the index to locate variants:

>>> loc = vt.index.locate_range(b'chr2', 1, 10)
>>> vt[loc]
VariantTable((2,), dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('DP', '<i8'), ('QD', '<f8'), ('AC', '<i8', (2,))])
[(b'chr2', 3, 78, 1.2, array([5, 6])) (b'chr2', 9, 22, 4.4, array([7, 8]))]
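
For intuition about what a sorted multi-index lookup involves, the following is a minimal sketch using only the standard library. The helper locate_range here is hypothetical and is not the library's implementation. It assumes the rows are sorted by chromosome and then by position, and that both ends of the range are inclusive.

```python
from bisect import bisect_left, bisect_right

chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
pos = [2, 7, 3, 9, 6]


def locate_range(chrom, pos, key, start, stop):
    """Hypothetical helper: slice of rows on `key` with start <= pos <= stop."""
    # Step 1: find the contiguous block of rows for this chromosome.
    lo = bisect_left(chrom, key)
    hi = bisect_right(chrom, key)
    if lo == hi:
        raise KeyError(key)
    # Step 2: positions are sorted within the block, so bisect again,
    # restricting the search to rows lo..hi.
    first = bisect_left(pos, start, lo, hi)
    last = bisect_right(pos, stop, lo, hi)
    if first == last:
        raise KeyError((key, start, stop))
    return slice(first, last)


loc = locate_range(chrom, pos, b'chr2', 1, 10)
print(loc)  # slice(2, 4, None)
```

The resulting slice selects rows 2 and 3, the two chr2 variants returned by vt[loc] in the example above. Because both steps are binary searches, the lookup never scans the whole table.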

FeatureCTable

class allel.bcolz.FeatureCTable(data=None, copy=True, **kwargs)

Alternative implementation of the allel.model.FeatureTable interface, using a bcolz.ctable as the backing store.

Parameters:

data : tuple or list of column objects, optional

Column data from which to build the ctable. May also be a NumPy structured array, or a bcolz ctable, which will not be copied if copy=False. May also be None, in which case rootdir must be provided (disk-based table).

copy : bool, optional

If True, copy the input data into a new bcolz ctable.

index : pair or triplet of strings, optional

Names of columns to use for the positional index, e.g., ('start', 'stop') if the table contains 'start' and 'stop' columns and records from a single chromosome/contig, or ('seqid', 'start', 'end') if the table contains records from multiple chromosomes/contigs.

**kwargs : keyword arguments

Passed through to the bcolz ctable constructor.

Utility functions

allel.bcolz.carray_block_map(carr, f, out=None, blen=None, **kwargs)
allel.bcolz.carray_block_sum(carr, axis=None, blen=None, transform=None)
allel.bcolz.carray_block_max(carr, axis=None, blen=None)
allel.bcolz.carray_block_min(carr, axis=None, blen=None)
allel.bcolz.carray_block_compress(carr, condition, axis, blen=None, **kwargs)
allel.bcolz.carray_block_take(carr, indices, axis, **kwargs)
allel.bcolz.carray_from_hdf5(*args, **kwargs)

Load a bcolz carray from an HDF5 dataset.

Either provide an h5py dataset as a single positional argument, or provide two positional arguments giving the HDF5 file path and the dataset node path within the file.

All keyword arguments are passed through to the bcolz.carray constructor.
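
The carray_block_* functions listed above are not described further here. As a rough illustration of block-wise processing in general, the sketch below sums an array while reading only a few rows at a time. The helper block_sum is hypothetical and uses plain numpy. Treating blen as a block length in rows, and transform as a function applied to each block before summing, are assumptions based on the parameter names, not documented behaviour.

```python
import numpy as np


def block_sum(arr, axis=None, blen=2, transform=None):
    """Hypothetical helper: sum `arr` by reading `blen` rows at a time.

    Only axis=None or a non-negative integer axis is handled.
    """
    total = None
    for start in range(0, arr.shape[0], blen):
        # Only this block needs to be held uncompressed in memory.
        block = np.asarray(arr[start:start + blen])
        if transform is not None:
            block = transform(block)
        partial = np.sum(block, axis=axis)
        if axis is None or axis == 0:
            # The reduction crosses the row axis: accumulate partial sums.
            total = partial if total is None else total + partial
        else:
            # The reduction keeps the row axis: append per-row results.
            total = partial if total is None else np.concatenate([total, partial])
    return total


data = np.arange(24).reshape(6, 4)
total = block_sum(data, blen=4)
per_column = block_sum(data, axis=0, blen=4)
per_row = block_sum(data, axis=1, blen=4)
```

Because each iteration touches only blen rows, peak memory use depends on the block size rather than on the full array, which is the point of block-wise operations on compressed data.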

allel.bcolz.ctable_block_compress(ctbl, condition, blen=None, **kwargs)
allel.bcolz.ctable_block_take(ctbl, indices, **kwargs)
allel.bcolz.ctable_from_hdf5_group(*args, **kwargs)

Load a bcolz ctable from columns stored as separate datasets within an HDF5 group.

Either provide an h5py group as a single positional argument, or provide two positional arguments giving the HDF5 file path and the group node path within the file.

All keyword arguments are passed through to the bcolz.ctable constructor.