dataset
– Unified data format
Common module to structure data in same way for all codes.
Dataset definition and initialisation
A dataset is composed by 5 members:
value: a
numpy.ndarray
or anumpy.generic
(scalar from numpy represented like the arrays, with dim, etc)error: an object of same type as value
bins: an
collections.OrderedDict
(optional and named argument)name: name of the dataset (optional and named argument)
what: can be used to store the name of the quantity represented by the dataset (optional and named argument)
The bins object should have the same dimension as the value, the order matches
the dimensions. If no bins are available it is still possible to use an empty
collections.OrderedDict
.
>>> from valjean.eponine.dataset import Dataset
>>> import numpy as np
>>> from collections import OrderedDict
>>> bins = OrderedDict([('e', np.array([1, 2, 3])), ('t', np.arange(5))])
>>> ds1 = Dataset(np.arange(10).reshape(2, 5),
... np.array([0.3]*10).reshape(2, 5),
... bins=bins, name='ds1', what='spam')
>>> ds1.name
'ds1'
>>> ds1.what
'spam'
>>> len(bins) == ds1.ndim
True
>>> ds1.error.shape == ds1.value.shape
True
>>> ds1.value.shape == (2, 5)
True
>>> np.array_equal(ds1.value, [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])
True
>>> np.array_equal(ds1.error,
... [[0.3, 0.3, 0.3, 0.3, 0.3], [0.3, 0.3, 0.3, 0.3, 0.3]])
True
>>> list(bins.keys())
['e', 't']
>>> np.array_equal(bins['e'], [1, 2, 3])
True
>>> np.array_equal(bins['t'], [0, 1, 2, 3, 4])
True
A new dataset can also be created from an already existing one, using copy
.
No matter how the dataset is generated, attributes can be changed afterwards:
>>> nds = ds1.copy()
>>> nds.name = 'egg'
>>> nds.what = ''
>>> print(f'name: ds1={ds1.name!r}, nds={nds.name!r}')
name: ds1='ds1', nds='egg'
>>> print(f'what: ds1={ds1.what!r}, nds={nds.what!r}')
what: ds1='spam', nds=''
>>> np.array_equal(ds1.value, nds.value)
True
Errors are emitted if the arguments do not have the expected type or if the shapes or dimensions are not consistent:
>>> tds = Dataset([1, 2, 3], [0.5, 0.5, 0.5])
Traceback (most recent call last):
[...]
TypeError: value does not have the expected type (numpy.ndarray or numpy.generic = scalar)
>>> tds = Dataset(np.arange(6).reshape(2, 3), np.arange(6).reshape(3, 2))
Traceback (most recent call last):
[...]
ValueError: Value and error do not have the same shape
>>> tds = Dataset(np.arange(6).reshape(2, 3),
... np.array([0.5]*6).reshape(2, 3),
... bins={'spam': [1, 2], 'egg': [1, 2, 3]})
Traceback (most recent call last):
[...]
TypeError: bins should be an OrderedDict
>>> tds = Dataset(np.arange(6).reshape(2, 3),
... np.array([0.5]*6).reshape(2, 3),
... bins=OrderedDict([('spam', [1, 2])]))
Traceback (most recent call last):
[...]
ValueError: Number of dimensions of bins does not correspond to number of dimensions of value
Squeezing a dataset
If there are useless dimensions, the dataset can be squeezed:
>>> vals = np.arange(6).reshape(1, 2, 1, 3)
>>> errs = np.array([0.1]*6).reshape(1, 2, 1, 3)
>>> bins2 = OrderedDict([('bacon', np.array([0, 1])),
... ('egg', np.array([0, 2, 4])),
... ('sausage', np.array([10, 20])),
... ('spam', np.array([-5, 0, 5, 10]))])
>>> ds = Dataset(vals, errs, bins=bins2)
>>> ds.value.shape == (1, 2, 1, 3)
True
>>> len(ds.bins) == 4
True
>>> np.array_equal(ds.value, np.array([[[[0, 1, 2]], [[3, 4, 5]]]]))
True
>>> sds = ds.squeeze()
>>> sds.ndim
2
>>> len(sds.bins) == 2
True
>>> sds.shape
(2, 3)
>>> list(len(x)-1 for x in sds.bins.values()) # edges of bins, so N+1
[2, 3]
>>> list(sds.bins.keys()) == ['egg', 'spam']
True
>>> np.array_equal(sds.value, np.array([[0, 1, 2], [3, 4, 5]]))
True
The dimensions with only one bin are squeezed, the same is done on the bins.
>>> nsds = ds.squeeze()
>>> np.array_equal(nsds.value, sds.value)
True
>>> nsds.bins == sds.bins
True
Operations available on Datasets
Standard operations, (+, -, *, /, [])
, are available for Dataset
.
Examples of their use are given, including failing cases. Some are used in
various methods but shown only once.
All operations conserve the name of the initial dataset.
The what attribute can be used to store the name of the quantity represented by the dataset. It is updated if it involves another dataset:
in addition or substraction, if what is different it contains both separated by the symbol
in multiplication or division it always contains both separeted by the symbol
Addition and subtraction
Addition and subtraction are possible between a Dataset
and a scalar
(numpy.generic
), a numpy.ndarray
and another Dataset
.
The operator to use for addition is +
and for subtraction -
.
Restrictions in addition or subtraction with a numpy.ndarray
are handled
by NumPy.
The addition or subtraction of two Dataset
can be done if
both values have the same shape (
consistent_datasets
)bins conditions (
same_coords
):EITHER the second
Dataset
does not have any binsOR bins are the same, i.e. have the same keys and bins values
Example addition of a scalar value (only on value)
>>> ds1p10 = ds1 + 10
>>> np.array_equal(ds1p10.value,
... [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]])
True
>>> np.array_equal(ds1p10.error, ds1.error)
True
>>> ds1p10.bins == ds1.bins
True
>>> ds1p10.what == ds1.what
True
As expected it ‘only’ acts on the value, error and bins are unchanged.
Example of subtraction of a numpy.ndarray
>>> a = np.array([100]*10).reshape(2, 5)
>>> ds1.value.shape == a.shape
True
>>> ds1ma = ds1 - a
>>> ds1ma.__class__
<class 'valjean.eponine.dataset.Dataset'>
>>> np.array_equal(ds1ma.value, [[-100, -99, -98, -97, -96],
... [-95, -94, -93, -92, -91]])
True
>>> np.array_equal(ds1ma.error, ds1.error)
True
>>> ds1ma.bins == ds1.bins
True
>>> ds1ma.what == ds1.what
True
a
and ds1
have the same shape to everything is fine.
>>> b = np.array([100]*10)
>>> ds1.value.shape == b.shape
False
>>> ds1pb = ds1 + b
Traceback (most recent call last):
...
ValueError: operands could not be broadcast together with shapes (2,5) (10,)
If the shapes are not the same NumPy raises an exception (and ds1pb is not defined).
Example of addition or subtraction of another Dataset
>>> ds2 = Dataset(value=np.arange(20, 30).reshape(2, 5),
... error=np.array([0.4]*10).reshape(2, 5),
... bins=bins, name='ds2', what='spam')
>>> ds1.bins == ds2.bins
True
>>> ds1p2 = ds1 + ds2
>>> ds1p2.name == ds1.name
True
>>> ds1p2.what == ds1.what
True
>>> np.array_equal(ds1p2.value, [[20, 22, 24, 26, 28],
... [30, 32, 34, 36, 38]])
True
>>> np.array_equal(ds1p2.error, np.array([0.5]*10).reshape(2, 5))
True
The error is calulated considering both datasets are independent, so quadratically ().
Datasets without binning can also be added:
>>> ds3 = Dataset(value=np.arange(200, 300, 10).reshape(2, 5),
... error=np.array([0.4]*10).reshape(2, 5), name='ds3')
>>> ds3.bins
OrderedDict()
>>> ds1p3 = ds1 + ds3
>>> ds1p3.name == ds1.name
True
>>> ds1p3.what == ds1.what
False
>>> ds1p3.what
'spam+'
>>> np.array_equal(ds1p3.value, [[200, 211, 222, 233, 244],
... [255, 266, 277, 288, 299]])
True
>>> np.array_equal(ds1p3.error, np.array([0.5]*10).reshape(2, 5))
True
>>> same_coords(ds1p3, ds1)
True
Bins of the dataset on the left are kept.
Like in NumPy array addition, values need to have the same shape:
>>> ds4 = Dataset(np.arange(5), np.array([0.01]*5), name='ds4')
>>> f"shape ds1 {ds1.value.shape}, ds4 {ds4.value.shape} -> comp = {ds1.value.shape == ds4.value.shape}"
'shape ds1 (2, 5), ds4 (5,) -> comp = False'
>>> ds1 + ds4
Traceback (most recent call last):
[...]
ValueError: Datasets to add do not have same shape
If bins are given, they need to have the same keys and the same values.
>>> bins5 = OrderedDict([('E', np.array([1, 2, 3])), ('t', np.arange(5))])
>>> ds5 = Dataset(np.arange(0, -10, -1).reshape(2, 5),
... np.array([0.01]*10).reshape(2, 5),
... bins=bins5, name='ds5')
>>> ds1 + ds5
Traceback (most recent call last):
[...]
ValueError: Datasets to add do not have same bin names
>>> f"bins ds1: {list(ds1.bins.keys())}, bins ds5: {list(ds5.bins.keys())}"
"bins ds1: ['e', 't'], bins ds5: ['E', 't']"
>>> bins6 = OrderedDict([('e', np.array([1, 2, 30])), ('t', np.arange(5))])
>>> ds6 = Dataset(np.arange(0, -10, -1).reshape(2, 5),
... np.array([0.01]*10).reshape(2, 5),
... bins=bins6, name='ds6')
>>> ds1 - ds6
Traceback (most recent call last):
[...]
ValueError: Datasets to subtract do not have the same bins
>>> same_coords(ds1, ds6)
False
>>> list(ds1.bins.keys()) == list(ds6.bins.keys())
True
>>> np.array_equal(ds1.bins['e'], ds6.bins['e'])
False
>>> np.array_equal(ds1.bins['t'], ds6.bins['t'])
True
Multiplication and division
Multiplication and division are possible between a Dataset
and a
scalar (numpy.generic
), a numpy.ndarray
and another
Dataset
.
The operator to use for multiplication is *
and for division /
.
Restrictions in multiplication and division with a numpy.ndarray
are
handled by NumPy.
The multiplication or division of 2 Dataset
can be done if
both values have the same shape (
consistent_datasets
)bins conditions (
same_coords
):EITHER the second
Dataset
does not have any binsOR bins are the same, i.e. have the same keys and bins values
Division by zero, nan or inf are handled by NumPy and return a RuntimeWarning from NumPy (only in the zero case).
Example multiplication of a scalar value
>>> ds1m10 = ds1 * 10
>>> ds1m10.__class__
<class 'valjean.eponine.dataset.Dataset'>
>>> np.array_equal(ds1m10.value, [[0, 10, 20, 30, 40],
... [50, 60, 70, 80, 90]])
True
>>> np.array_equal(ds1m10.error, np.array([3.]*10).reshape(2, 5))
True
>>> same_coords(ds1m10, ds1)
True
As expected it acts on the value and on the error. Bins are unchanged.
Example of division of a numpy.ndarray
>>> ds1da = ds1 / a
>>> ds1da.name == ds1.name
True
>>> ds1da.what == ds1.what
True
>>> np.array_equal(ds1da.value, [[0., 0.01, 0.02, 0.03, 0.04],
... [0.05, 0.06, 0.07, 0.08, 0.09]])
True
>>> np.array_equal(ds1da.error, np.array([0.003]*10).reshape(2, 5))
True
>>> same_coords(ds1da, ds1)
True
a
and ds1
have the same shape to everything is fine.
>>> ds1 / b
Traceback (most recent call last):
...
ValueError: operands could not be broadcast together with shapes (2,5) (10,)
If the shapes are not the same NumPy raises an exception.
If the numpy.ndarray
: contains 0
, nan
or inf
, NumPy deals
with them. It sends a RunningWarning about the division by zero.
>>> c = np.array([[2., 3., np.nan, 4., 0.], [1., np.inf, 5., 10., 0.]])
>>> ds1 / c
class: <class 'valjean.gavroche.dataset.Dataset'>, data type: <class 'numpy.ndarray'>
name: ds1, with shape (2, 5),
value: [[0. 0.33333333 nan 0.75 inf]
[5. 0. 1.4 0.8 inf]],
error: [[0.15 0.03333333 nan 0.075 inf]
[0.3 0. 0.06 0.03 inf]],
bins: OrderedDict([('e', array([1, 2, 3])), ('t', array([0, 1, 2, 3, 4]))])
>>> # prints *** RuntimeWarning: divide by zero encountered in true_divide
Example of multiplication or division of another Dataset
>>> ds1m2 = ds1 * ds2
>>> ds1m2
class: <class 'valjean.gavroche.dataset.Dataset'>, data type: <class 'numpy.ndarray'>
shape (2, 5),
value: [[ 0 21 44 69 96]
[125 156 189 224 261]],
error: [[6. 6.31268564 6.64830806 7.00357052 7.37563557]
[7.76208735 8.16088231 8.57029754 8.98888202 9.4154129 ]],
bins: OrderedDict([('e', array([1, 2, 3])), ('t', array([0, 1, 2, 3, 4]))])
>>> ds1m2.what
'spam*spam'
>>> same_coords(ds1m2, ds2)
True
>>> ds1o2 = ds1 / ds2
>>> ds1o2
class: <class 'valjean.gavroche.dataset.Dataset'>, data type: <class 'numpy.ndarray'>
shape (2, 5),
value: [[0. 0.04761905 0.09090909 0.13043478 0.16666667]
[0.2 0.23076923 0.25925926 0.28571429 0.31034483]],
error: [[0.015 0.01431448 0.01373617 0.01323926 0.01280492]
[0.01241934 0.01207231 0.01175624 0.01146541 0.0111955 ]],
bins: OrderedDict([('e', array([1, 2, 3])), ('t', array([0, 1, 2, 3, 4]))])
>>> ds1o2.what
'spam/spam'
In both cases the error is calulated considering both datasets are independent, so quadratically .
The same restictions on bins as for addition and subtraction are set for
multiplication and division, same AssertError
are raised, see
Example of addition or subtraction of another Dataset.
About the division by 0
, nan
or inf
, it acts like in the
multiplication or division by a numpy.ndarray
, see
Example of division of a numpy.ndarray (warnings and nan
, inf
, etc.)
>>> np.isnan((ds1/ds1).value[0][0])
True
>>> np.isinf(((ds1+1)/ds1).value[0][0])
True
Indexing and slicing
It is only possible to get a slice of a dataset, getting an Dataset at a given index is not possible (for dimensions consistency reasons). Requiring a given index can then be done using a slice.
Getting a subset of the Dataset
Time is the second dimension, to remove first and last bins the usual slice is
[1:-1]
, the first dimension, energy, is conserved, so its slice is [:]
.
The slice to apply is then [:, 1:-1]
.
>>> ds1sltfl = ds1[:, 1:-1]
>>> ds1sltfl.__class__
<class 'valjean.eponine.dataset.Dataset'>
>>> np.array_equal(ds1sltfl.value, [[1, 2, 3], [6, 7, 8]])
True
>>> np.array_equal(ds1sltfl.error, np.array([0.3]*6).reshape(2, 3))
True
>>> list(ds1sltfl.bins.keys()) == list(ds1.bins.keys())
True
>>> np.array_equal(ds1sltfl.bins['e'], ds1.bins['e'])
True
>>> np.array_equal(ds1sltfl.bins['t'], ds1.bins['t'])
False
>>> np.array_equal(ds1sltfl.bins['t'], [1, 2, 3])
True
>>> ds1sltfl.name == ds1.name
True
Slicing is also applied on bins.
Warning
Requiring a slice when there are not enough elements on the dimension give empty arrays.
For example: removing first and last bin in energy on ds1
. The slice is
[1:-1, :]
in that case, but ds1
has only 2 bins in energy.
>>> ds1slefl = ds1[1:-1, :]
>>> ds1slefl.value.shape == (0, 5)
True
>>> np.array_equal(ds1slefl.value, np.array([]).reshape(0, 5))
True
>>> np.array_equal(ds1slefl.error, np.array([]).reshape(0, 5))
True
>>> np.array_equal(ds1slefl.bins['t'], ds1.bins['t'])
True
>>> np.array_equal(ds1slefl.bins['e'], ds1.bins['e'])
False
>>> np.array_equal(ds1slefl.bins['e'], [2])
True
Note that in this case, as bins are in reality the edges of the bins, so we have N+1 values in the bins compared to the value/error where we have N values. Slicing then give one value in energy bins, so unusable here (it would be empty if we have values and centers of bins instead of edges of bins).
All dimensions have to be present in the slice
Use :
for the untouched dimensions. Number of ,
has to be dimension -1.
Let’s consider a new Dataset, with 4 dimensions:
>>> bins2 = OrderedDict([('e', np.arange(4)), ('t', np.arange(3)),
... ('mu', np.arange(3)), ('phi', np.arange(5))])
>>> ds6 = Dataset(np.arange(48).reshape(3, 2, 2, 4),
... np.array([0.5]*48).reshape(3, 2, 2, 4),
... bins=bins2, name='ds6')
>>> ds6.value.ndim == 4
True
To remove first bin on energy dimension and last bin on phi dimension, the
slice to be used is: [1:, :, :, :-1]
.
>>> ds6_1 = ds6[1:, :, :, :-1]
>>> ds6.value.shape == (3, 2, 2, 4)
True
>>> ds6_1.value.shape == (2, 2, 2, 3)
True
>>> np.array_equal(ds6_1.value, [[[[16, 17, 18], [20, 21, 22]],
... [[24, 25, 26], [28, 29, 30]]],
... [[[32, 33, 34], [36, 37, 38]],
... [[40, 41, 42], [44, 45, 46]]]])
True
>>> np.array_equal(ds6_1.error, np.array([0.5]*24).reshape(2, 2, 2, 3))
True
>>> list(ds6_1.bins.keys()) == list(ds6.bins.keys())
True
>>> np.array_equal(ds6_1.bins['e'], ds6.bins['e'][1:])
True
>>> np.array_equal(ds6_1.bins['t'], ds6.bins['t'])
True
>>> np.array_equal(ds6_1.bins['mu'], ds6.bins['mu'])
True
>>> np.array_equal(ds6_1.bins['phi'], ds6.bins['phi'][:-1])
True
If we only want the second bin in time keeping all bins in energy and direction
angles, the slice is [:, 1:2, :, :]
.
>>> ds6_2 = ds6[:, 1:2, :, :]
>>> ds6_2.value.shape == (3, 1, 2, 4)
True
>>> np.array_equal(ds6_2.value, [[[[8, 9, 10, 11], [12, 13, 14, 15]]],
... [[[24, 25, 26, 27], [28, 29, 30, 31]]],
... [[[40, 41, 42, 43], [44, 45, 46, 47]]]])
True
>>> list(ds6_2.bins.keys()) == list(ds6.bins.keys())
True
>>> all(x.size == y+1
... for x, y in zip(ds6_2.bins.values(), ds6_2.value.shape)) == True
True
>>> np.array_equal(ds6_2.bins['t'], [1, 2])
True
Bins are changed accordingly.
Warning
Comparison to NumPy: index and ellipsis are other slicing possibilities
on numpy.ndarray
(see numpy indexing for current version of
NumPy), but they are disabled here to avoid confusions. Errors are raised
if required.
>>> ds1[1]
Traceback (most recent call last):
[...]
TypeError: Index can only be a slice or a tuple of slices
>>> ds6_2e = ds6[:, 1, :, :]
Traceback (most recent call last):
[...]
TypeError: Index can only be a slice or a tuple of slices
>>> ds6e = ds6[1:, ..., :-1]
Traceback (most recent call last):
[...]
TypeError: Index can only be a slice or a tuple of slices
It also need to have the same dimension as the value:
>>> ds6_2e = ds6[:, 1:2]
Traceback (most recent call last):
[...]
ValueError: len(index) should have the same dimension as the value numpy.ndarray, i.e. (# ',' = dim-1).
':' can be used for a slice (dimension) not affected by the selection.
Slicing is only possible if ndim == 1
A single slice is only possible for arrays for dimension 1:
>>> ds6_2f = ds6[1:2]
Traceback (most recent call last):
[...]
ValueError: len(index) should have the same dimension as the value numpy.ndarray, i.e. (# ',' = dim-1).
':' can be used for a slice (dimension) not affected by the selection.
Slicing is only possible if ndim == 1
>>> ds7 = Dataset(value=np.arange(48), error=np.array([1]*48))
>>> ds7.ndim
1
>>> ds7.shape
(48,)
>>> ds7_extract = ds7[20:24]
>>> np.array_equal(ds7_extract.value, [20, 21, 22, 23])
True
Warning
Slicing can also only be applied on numpy.ndarray
, not on
numpy.generic
:
>>> ds8 = Dataset(value=100, error=1)
>>> ds8[0:1]
Traceback (most recent call last):
[...]
TypeError: [] (__getitem__) can only be applied on numpy.ndarrays
Masked datasets
In some case it can be useful to mask some elements of a dataset. This functionality is provided by the numpy masked arrays module. In the case of datasets, once the mask is given it is applied to both the value and the error.
>>> mask = np.ma.masked_greater(ds1.value, 6).mask
>>> np.array_equal(mask, [[False, False, False, False, False],
... [False, False, True, True, True]])
True
>>> mds = ds1.mask(mask)
>>> np.ma.is_masked(ds1.value)
False
>>> np.ma.is_masked(mds.value)
True
>>> np.ma.is_masked(mds.error)
True
Bins and shape are kept.
>>> mds.shape == ds1.shape
True
>>> np.array_equal(mds.bins['e'], ds1.bins['e'])
True
>>> np.array_equal(mds.bins['t'], ds1.bins['t'])
True
The mask is propagated when performing operations on dataset.
>>> np.sum(ds1.value) == 45
True
>>> np.sum(mds.value) == 21
True
>>> sds = ds1 + mds
>>> np.ma.is_masked(sds.value)
True
>>> np.ma.is_masked(sds.error)
True
>>> np.sum(sds.value) == 42
True
- class valjean.eponine.dataset.Dataset(value, error, *, bins=None, name='', what='')[source]
Common class for data from all codes.
For the moment, units are not treated (removed).
Todo
Think about units. Possibility: using a units package from scipy.
Todo
How to deal with bins of N values (= center of bins)
- __init__(value, error, *, bins=None, name='', what='')[source]
Dataset class initialization.
- Parameters:
value (int or float or numpy.ndarray or numpy.generic) – array of N dimensions representing the values
error (int or float or numpy.ndarray or numpy.generic) – array of N dimensions representing the absolute errors
bins (collections.OrderedDict (str, numpy.ndarray)) – bins corresponding to value (named optional parameter)
name (str) – name of the dataset (used in test representation)
what (str) – name of the quantity represented by the dataset
- squeeze()[source]
Squeeze dataset: remove useless dimensions.
Squeeze is based on the shape and on the bins dim ??? To confirm… First squeeze bins, then arrays. Edges, if only one bin are not kept. Example: spectrum with one bin in energy (quite common)
- property shape
Return the data shape, as a read-only property.
- property ndim
Return the data dimension, as a read-only property.
- property size
Return the data size (total number of elements in the array), as a read-only property.
- data()[source]
Generator yielding objects supporting the buffer protocol that (as a whole) represent a serialized version of self.
>>> from valjean.fingerprint import fingerprint >>> vals = np.arange(10).reshape(2, 5) >>> errs = np.array([1]*10).reshape(2, 5) >>> ds = Dataset(vals, errs) >>> fingerprint(ds) 'a8411470d7766c543e90f0f38241dc918b9448d1b9d19b0a9b8b6c91f61944d0' >>> bins = OrderedDict([('bacon', np.arange(3)), ... ('egg', np.arange(5))]) >>> ds = Dataset(vals, errs, bins=bins) >>> fingerprint(ds) 'dfec4e6ac118d30b7a83cfc500fefaec94f93a76a024d7550732cd08fbfb0fee' >>> ds = Dataset(vals, errs, bins=bins, name='name') >>> fingerprint(ds) '441623c096bf917414013ebbfe345c66124f9a8dfbbd7051cf733ea20d7b651a' >>> ds = Dataset(vals, errs, bins=bins, name='name', what='what!') >>> fingerprint(ds) '9a42c91381d696c1af651665f84efc43efa14d9c471c4ac851239ed0779a48a0'