Chunked arrays

This module is maintained for backwards-compatibility, however it is recommended to migrate to use the equivalent classes from the allel.model.dask module where possible.

This module provides alternative implementations of array and table classes defined in the allel.model.ndarray module, using chunked arrays for data storage. Chunked arrays can be compressed and optionally stored on disk, providing a means for working with data too large to fit uncompressed in main memory.

Either Zarr, HDF5 (via h5py) or bcolz can be used as the underlying storage layer. Choice of storage layer can be made via the storage keyword argument which all class methods accept. This argument can either be a string identifying one of the predefined storage layer configurations, or an object implementing the chunked storage API. For more information about controlling storage see the allel.chunked module.

GenotypeChunkedArray

class allel.GenotypeChunkedArray(data)[source]

Alternative implementation of the allel.model.ndarray.GenotypeArray class, wrapping a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_samples, ploidy)

Genotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

Examples

Wrap an HDF5 dataset:

>>> import h5py
>>> with h5py.File('callset.h5', mode='w') as h5f:
...     h5g = h5f.create_group('/3L/calldata')
...     h5g.create_dataset('genotype',
...                        data=[[[0, 0], [0, 1]],
...                              [[0, 1], [1, 1]],
...                              [[0, 2], [-1, -1]]],
...                        dtype='i1', chunks=(2, 2, 2),
...                        compression='gzip', compression_opts=1)
...
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
>>> import allel
>>> callset = h5py.File('callset.h5', mode='r')
>>> g = allel.GenotypeChunkedArray(callset['/3L/calldata/genotype'])
>>> g
<GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(2, 2, 2)
   nbytes=12 cbytes=30 cratio=0.4
   compression=gzip compression_opts=1
   values=h5py._hl.dataset.Dataset>
>>> g.values
<HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">

Obtain a numpy array by slicing, e.g.:

>>> g[:]
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.

Note that most methods will return a chunked array, using whatever chunked storage is set as default (bcolz carray) or specified directly via the storage keyword argument. E.g.:

>>> g.copy()
<GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(3, 2, 2)
   ...
>>> g.copy(storage='bcolzmem')
<GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(4096, 2, 2)
   ...
>>> g.copy(storage='hdf5mem_zlib1')
<GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(3, 2, 2)
   ...

HaplotypeChunkedArray

class allel.HaplotypeChunkedArray(data)[source]

Alternative implementation of the allel.model.ndarray.HaplotypeArray class, using a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_haplotypes)

Haplotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

AlleleCountsChunkedArray

class allel.AlleleCountsChunkedArray(data)[source]

Alternative implementation of the allel.model.ndarray.AlleleCountsArray class, using a chunked array as the backing store.

Parameters:

data : array_like, int, shape (n_variants, n_alleles)

Allele counts data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.

VariantChunkedTable

class allel.VariantChunkedTable(data, names=None, index=None)[source]

Alternative implementation of the allel.model.ndarray.VariantTable class, using a chunked table as the backing store.

Parameters:

data: table_like

Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.

names : sequence of strings

Column names.

Examples

Wrap columns stored as datasets within an HDF5 group:

>>> import h5py
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 7, 3, 9, 6]
>>> dp = [35, 12, 78, 22, 99]
>>> qd = [4.5, 6.7, 1.2, 4.4, 2.8]
>>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
>>> with h5py.File('callset.h5', mode='w') as h5f:
...     h5g = h5f.create_group('/3L/variants')
...     h5g.create_dataset('CHROM', data=chrom, chunks=True)
...     h5g.create_dataset('POS', data=pos, chunks=True)
...     h5g.create_dataset('DP', data=dp, chunks=True)
...     h5g.create_dataset('QD', data=qd, chunks=True)
...     h5g.create_dataset('AC', data=ac, chunks=True)
...
<HDF5 dataset "CHROM": shape (5,), type "|S4">
<HDF5 dataset "POS": shape (5,), type "<i8">
<HDF5 dataset "DP": shape (5,), type "<i8">
<HDF5 dataset "QD": shape (5,), type "<f8">
<HDF5 dataset "AC": shape (5, 2), type "<i8">
>>> import allel
>>> callset = h5py.File('callset.h5', mode='r')
>>> vt = allel.VariantChunkedTable(callset['/3L/variants'],
...                                names=['CHROM', 'POS', 'AC', 'QD', 'DP'])
>>> vt
<VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ...
   ...

Obtain a single row:

>>> vt[0]
row(CHROM=b'chr1', POS=2, AC=array([1, 2]), QD=4.5, DP=35)

Obtain a numpy array by slicing:

>>> vt[:]  
<VariantTable shape=(5,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<i8'), ...
[(b'chr1', 2, [ 1,  2], 4.5, 35) (b'chr1', 7, [ 3,  4], 6.7, 12)
 (b'chr2', 3, [ 5,  6], 1.2, 78) (b'chr2', 9, [ 7,  8], 4.4, 22)
 (b'chr3', 6, [ 9, 10], 2.8, 99)]

Access a subset of columns:

>>> vt[['CHROM', 'POS']]
<VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8')]
   nbytes=60 cbytes=60 cratio=1.0
   values=builtins.list>

Note that most methods will return a chunked table, using whatever chunked storage is set as default (bcolz ctable) or specified directly via the storage keyword argument. E.g.:

>>> vt.copy()  
<VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ...
   nbytes=220 cbytes=1.7K cratio=0.1
   values=allel.chunked.storage_zarr.ZarrTable>
>>> vt.copy(storage='bcolzmem')  
<VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ...
   nbytes=220 cbytes=80.0K cratio=0.0
   values=bcolz.ctable.ctable>
>>> vt.copy(storage='hdf5mem_zlib1')  
<VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ...
   nbytes=220 cbytes=131 cratio=1.7
   values=h5py._hl.files.File>

AlleleCountsChunkedTable

class allel.AlleleCountsChunkedTable(data, names=None)[source]