Chunked arrays

This module provides alternative implementations of array and table classes defined in the allel.model.ndarray module, using chunked arrays for data storage. Chunked arrays can be compressed and optionally stored on disk, providing a means for working with data too large to fit uncompressed in main memory.
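
The idea behind chunked storage can be pictured with a small standard-library sketch. It is not how allel, h5py or bcolz implement it; the class name TinyChunkedArray and its attributes are hypothetical and exist only for illustration. Each fixed-size block is compressed independently, so reading one item only requires decompressing the block that holds it.

```python
import zlib
from array import array


class TinyChunkedArray:
    """Hypothetical 1-D int8 array held as zlib-compressed chunks."""

    def __init__(self, values, chunklen):
        self.chunklen = chunklen
        self.length = len(values)
        # Compress each fixed-size block of values independently.
        self.chunks = [
            zlib.compress(array('b', values[i:i + chunklen]).tobytes())
            for i in range(0, len(values), chunklen)
        ]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Locate the chunk, then decompress only that chunk.
        chunk_id, offset = divmod(index, self.chunklen)
        block = array('b')
        block.frombytes(zlib.decompress(self.chunks[chunk_id]))
        return block[offset]

    @property
    def nbytes(self):
        # Uncompressed size: one byte per int8 item.
        return self.length

    @property
    def cbytes(self):
        # Compressed size: total length of all compressed chunks.
        return sum(len(c) for c in self.chunks)


# Highly repetitive data compresses well, so cbytes is far below nbytes.
data = [0, 1] * 50000
arr = TinyChunkedArray(data, chunklen=4096)
print(len(arr), len(arr.chunks), arr.nbytes, arr.cbytes)
```

Real storage layers add multidimensional chunk shapes, faster compressors and optional on-disk persistence, but the trade-off is the same: a little decompression work per access in exchange for a much smaller memory footprint.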

HDF5 (via h5py), bcolz, or zarr can serve as the underlying storage layer. Select the storage layer with the storage keyword argument, which all class methods accept. The argument is either a string naming one of the predefined storage layer configurations (for example 'hdf5mem_zlib1' or 'zarrmem') or an object implementing the chunked storage API. For details on controlling storage, see the allel.chunked module.
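
A minimal sketch of how a storage argument of this kind could be resolved is shown below. Everything in it is an assumption made for illustration: the registry, the 'listmem' name, the ListStorage class and the resolve_storage function are hypothetical, and the real lookup lives in the allel.chunked module.

```python
class ListStorage:
    """Hypothetical in-memory storage layer; not part of allel."""

    def array(self, data, **kwargs):
        # A real storage layer would create a compressed, chunked array here.
        return list(data)


# Hypothetical registry mapping configuration names to storage objects.
STORAGE_REGISTRY = {'listmem': ListStorage()}
DEFAULT_STORAGE = 'listmem'


def resolve_storage(storage=None):
    """Return a storage object given a name, an object, or None (default)."""
    if storage is None:
        storage = DEFAULT_STORAGE
    if isinstance(storage, str):
        # A string selects one of the predefined configurations.
        if storage not in STORAGE_REGISTRY:
            raise ValueError('unknown storage: %r' % storage)
        return STORAGE_REGISTRY[storage]
    if hasattr(storage, 'array'):
        # Any object offering the expected method is accepted as-is.
        return storage
    raise TypeError('storage must be a name or provide an array() method')


default = resolve_storage()
custom = resolve_storage(ListStorage())
print(type(default).__name__, type(custom).__name__)
```

Accepting either a name or an object keeps the common case short while still letting callers pass in a storage layer configured to their own needs.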

GenotypeChunkedArray

class allel.model.chunked.GenotypeChunkedArray(data)

Alternative implementation of the allel.model.ndarray.GenotypeArray class, wrapping a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_samples, ploidy)

Genotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

Examples

Wrap an HDF5 dataset:

>>> import h5py
>>> with h5py.File('callset.h5', mode='w') as h5f:
...     h5g = h5f.create_group('/3L/calldata')
...     h5g.create_dataset('genotype',
...                        data=[[[0, 0], [0, 1]],
...                              [[0, 1], [1, 1]],
...                              [[0, 2], [-1, -1]]],
...                        dtype='i1', chunks=(2, 2, 2),
...                        compression='gzip', compression_opts=1)
...
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
>>> import allel
>>> callset = h5py.File('callset.h5', mode='r')
>>> g = allel.GenotypeChunkedArray(callset['/3L/calldata/genotype'])
>>> g
GenotypeChunkedArray((3, 2, 2), int8, chunks=(2, 2, 2))
  nbytes: 12; cbytes: 30; cratio: 0.4;
  compression: gzip; compression_opts: 1;
  data: h5py._hl.dataset.Dataset
>>> g.data
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">

Obtain a numpy array by slicing, e.g.:

>>> g[:]
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

Most methods return a chunked array, stored using either the default chunked storage (bcolz carray) or the storage specified via the storage keyword argument. For example:

>>> g.copy()
GenotypeChunkedArray((3, 2, 2), int8, chunks=(4096, 2, 2))
  nbytes: 12; cbytes: 16.0K; cratio: 0.0;
  compression: blosc; compression_opts: cparams(clevel=5, shuffle=1, cname='lz4', quantize=0);
  data: bcolz.carray_ext.carray
>>> g.copy(storage='zarrmem')
GenotypeChunkedArray((3, 2, 2), int8, chunks=(2, 2, 2))
  nbytes: 12; cbytes: 379; cratio: 0.0;
  compression: blosc; compression_opts: {'shuffle': 1, 'cname': 'lz4', 'clevel': 5};
  data: zarr.core.Array
>>> g.copy(storage='hdf5mem_zlib1')
GenotypeChunkedArray((3, 2, 2), int8, chunks=(262144, 2, 2))
  nbytes: 12; cbytes: 4.5K; cratio: 0.0;
  compression: gzip; compression_opts: 1;
  data: h5py._hl.dataset.Dataset
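
The figures in the reprs above are consistent with cratio being nbytes divided by cbytes, rounded to one decimal place. The sketch below checks that reading against two of the reported values and shows why a tiny array can end up larger after compression. The helper name compression_ratio is hypothetical, and zlib is used here only as a stand-in for the gzip filter.

```python
import zlib


def compression_ratio(nbytes, cbytes):
    """Uncompressed size over compressed size, to one decimal place."""
    return round(nbytes / cbytes, 1)


# Values taken from the reprs above.
print(compression_ratio(12, 30))   # HDF5 dataset: 0.4
print(compression_ratio(12, 379))  # zarr copy: 0.0

# The 12 genotype bytes from the example, with -1 stored as 255 (int8).
raw = bytes([0, 0, 0, 1, 0, 1, 1, 1, 0, 2, 255, 255])
packed = zlib.compress(raw, 1)

# Headers and checksums outweigh any savings on such a small input.
print(len(raw), len(packed))
```

A ratio below 1 on toy data is therefore expected. Compression pays off once chunks hold enough data for the fixed per-chunk overhead to become negligible.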

HaplotypeChunkedArray

class allel.model.chunked.HaplotypeChunkedArray(data)

Alternative implementation of the allel.model.ndarray.HaplotypeArray class, using a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_haplotypes)

Haplotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

AlleleCountsChunkedArray

class allel.model.chunked.AlleleCountsChunkedArray(data)

Alternative implementation of the allel.model.ndarray.AlleleCountsArray class, using a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_alleles)

Allele counts data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

VariantChunkedTable

class allel.model.chunked.VariantChunkedTable(data, names=None, index=None)

Alternative implementation of the allel.model.ndarray.VariantTable class, using a chunked table as the backing store.

Parameters:

data : table_like

Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.

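names : sequence of strings, optional

Column names.

index : string or pair of strings, optional

Names of the columns to use for the positional index, e.g., 'POS' if the table holds records from a single chromosome or contig, or ('CHROM', 'POS') if it holds records from several.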

Examples

Wrap columns stored as datasets within an HDF5 group:

>>> import h5py
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 7, 3, 9, 6]
>>> dp = [35, 12, 78, 22, 99]
>>> qd = [4.5, 6.7, 1.2, 4.4, 2.8]
>>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
>>> with h5py.File('callset.h5', mode='w') as h5f:
...     h5g = h5f.create_group('/3L/variants')
...     h5g.create_dataset('CHROM', data=chrom, chunks=True)
...     h5g.create_dataset('POS', data=pos, chunks=True)
...     h5g.create_dataset('DP', data=dp, chunks=True)
...     h5g.create_dataset('QD', data=qd, chunks=True)
...     h5g.create_dataset('AC', data=ac, chunks=True)
...
<HDF5 dataset "CHROM": shape (5,), type "|S4">
<HDF5 dataset "POS": shape (5,), type "<i8">
<HDF5 dataset "DP": shape (5,), type "<i8">
<HDF5 dataset "QD": shape (5,), type "<f8">
<HDF5 dataset "AC": shape (5, 2), type "<i8">
>>> import allel
>>> callset = h5py.File('callset.h5', mode='r')
>>> vt = allel.VariantChunkedTable(callset['/3L/variants'],
...                                names=['CHROM', 'POS', 'AC', 'QD', 'DP'])
>>> vt
VariantChunkedTable(5)
  nbytes: 220; cbytes: 220; cratio: 1.0;
  data: h5py._hl.group.Group

Obtain a single row:

>>> vt[0]
row(CHROM=b'chr1', POS=2, AC=array([1, 2]), QD=4.5, DP=35)

Obtain a numpy array by slicing:

>>> vt[:]
VariantTable((5,), dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ...
[(b'chr1', 2, [1, 2], 4.5, 35) (b'chr1', 7, [3, 4], 6.7, 12)
 (b'chr2', 3, [5, 6], 1.2, 78) (b'chr2', 9, [7, 8], 4.4, 22)
 (b'chr3', 6, [9, 10], 2.8, 99)]

Access a subset of columns:

>>> vt[['CHROM', 'POS']]
VariantChunkedTable(5)
  nbytes: 60; cbytes: 60; cratio: 1.0;
  data: builtins.list
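
The access patterns above (a row by integer, a sub-table by list of column names) follow naturally from a column-oriented layout. The sketch below mimics them with plain Python lists. ColumnTable is a hypothetical class written for this illustration and is not the allel implementation.

```python
from collections import namedtuple


class ColumnTable:
    """Hypothetical column-oriented table built from equal-length columns."""

    def __init__(self, columns, names):
        if len(columns) != len(names):
            raise ValueError('need exactly one name per column')
        if len({len(c) for c in columns}) > 1:
            raise ValueError('columns must have equal length')
        self.names = list(names)
        self.columns = dict(zip(self.names, columns))
        self._row = namedtuple('row', self.names)

    def __len__(self):
        return len(next(iter(self.columns.values()), ()))

    def __getitem__(self, item):
        if isinstance(item, int):
            # A row is assembled by taking one item from every column.
            return self._row(*(self.columns[n][item] for n in self.names))
        if isinstance(item, str):
            # A single name returns that column.
            return self.columns[item]
        if isinstance(item, list):
            # A list of names returns a new table over the chosen columns.
            return ColumnTable([self.columns[n] for n in item], item)
        raise TypeError('unsupported index: %r' % (item,))


chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
pos = [2, 7, 3, 9, 6]
dp = [35, 12, 78, 22, 99]
table = ColumnTable([chrom, pos, dp], names=['CHROM', 'POS', 'DP'])

print(table[0])
print(table[['CHROM', 'POS']].names)
```

Because each column is stored separately, selecting a subset of columns needs no copying of the underlying data, which is consistent with the sub-table above reporting a plain list of columns as its data.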

Most methods return a chunked table, stored using either the default chunked storage (bcolz ctable) or the storage specified via the storage keyword argument. For example:

>>> vt.copy()
VariantChunkedTable(5)
  nbytes: 220; cbytes: 80.0K; cratio: 0.0;
  data: bcolz.ctable.ctable
>>> vt.copy(storage='zarr')
VariantChunkedTable(5)
  nbytes: 220; cbytes: 1.7K; cratio: 0.1;
  data: allel.chunked.storage_zarr.ZarrTable
>>> vt.copy(storage='hdf5mem_zlib1')
VariantChunkedTable(5)
  nbytes: 220; cbytes: 22.5K; cratio: 0.0;
  data: h5py._hl.files.File

FeatureChunkedTable

class allel.model.chunked.FeatureChunkedTable(data, names=None)

Alternative implementation of the allel.model.ndarray.FeatureTable class, using a chunked table as the backing store.

Parameters:

data : table_like

Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.

names : sequence of strings

Column names.

AlleleCountsChunkedTable

class allel.model.chunked.AlleleCountsChunkedTable(data, names=None)

Table of allele counts, using a chunked table as the backing store.

Parameters:

data : table_like

Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.

names : sequence of strings

Column names.