diagnostics – Diagnostic tests

Test and task statistics

This module defines a few tests that report on the success/failure status of other tests and tasks.

Three different tests are available:

  • statistics over all the tasks done, generated with task_stats

  • statistics over all the tests performed, generated with test_stats

  • statistics over the tests performed, classified by the labels given at test initialization, generated with test_stats_by_labels

The two tests over the test results are performed using the TestResult objects, so the statistics only include the tests that were actually run.

valjean.gavroche.diagnostics.stats.stats_worker(test_fn, name, description, tasks, **kwargs)[source]

Create the test for all the required tasks (for example, summary tests over test tasks, or summary tests over all tasks).

Parameters:
  • test_fn – function generating the test to apply to tasks

  • name (str) – the name of the stat task

  • description (str) – its description

  • tasks (list(Task)) – the list of tasks to be tested.

Returns:

an EvalTestTask that evaluates the diagnostic test.

Return type:

EvalTestTask

valjean.gavroche.diagnostics.stats.task_stats(*, name, description='', labels=None, tasks)[source]

Create a TestStatsTasks from a list of tasks.

The TestStatsTasks class must be instantiated with the list of task results, which are not available to the user when the tests are specified in the job file. Therefore, the creation of the TestStatsTasks must be delayed until the other tasks have finished and their results are available in the environment. For this purpose it is necessary to wrap the instantiation of TestStatsTasks in a Use wrapper, and to evaluate the resulting test using an EvalTestTask.

This function hides this bit of complexity from the user. Assume you have a list of tasks that you would like to produce statistics about (we will use DelayTask objects for our example):

>>> from valjean.cosette.task import DelayTask
>>> my_tasks = [DelayTask(1), DelayTask(3), DelayTask(0.2)]

Here is how you make a TestStatsTasks:

>>> stats = task_stats(name='delays', tasks=my_tasks)
>>> from valjean.gavroche.eval_test_task import EvalTestTask
>>> isinstance(stats, EvalTestTask)
True
>>> print(stats.depends_on)
{Task('delays.stats')}
>>> create_stats = next(task for task in stats.depends_on)

Here create_stats is the task that actually creates the TestStatsTasks. It soft-depends on the tasks in my_tasks:

>>> [task in create_stats.soft_depends_on for task in my_tasks]
[True, True, True]

The reason why the dependency is soft is that we want to collect statistics about the task outcome in any case, even (especially!) if some of the tasks failed.

Parameters:
  • name (str) – the name of the task to create.

  • description (str) – its description.

  • tasks (list(Task)) – the list of tasks to be tested.

Returns:

an EvalTestTask that evaluates the diagnostic test.

Return type:

EvalTestTask

class valjean.gavroche.diagnostics.stats.NameFingerprint(name, fingerprint=None)[source]

A small helper class to store a name and an optional fingerprint for the referenced item.

__init__(name, fingerprint=None)[source]
__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

__eq__(other)[source]

Return self==value.

__lt__(other)[source]

Return self<value.

__ge__(other)

Return a >= b. Computed by @total_ordering from (not a < b).

__gt__(other)

Return a > b. Computed by @total_ordering from (not a < b) and (a != b).

__hash__ = None
__le__(other)

Return a <= b. Computed by @total_ordering from (a < b) or (a == b).
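The derived comparisons above come from functools.total_ordering, which only requires __eq__ and __lt__ to be defined. As a minimal, self-contained sketch, here is a simplified stand-in class (NameKey is a hypothetical name for illustration, not part of valjean):

```python
from functools import total_ordering


@total_ordering
class NameKey:
    """Simplified stand-in for NameFingerprint, ordered by (name, fingerprint)."""

    def __init__(self, name, fingerprint=None):
        self.name = name
        self.fingerprint = fingerprint

    def __eq__(self, other):
        # defining __eq__ without __hash__ implicitly sets __hash__ = None,
        # which matches the __hash__ = None entry above
        return (self.name, self.fingerprint) == (other.name, other.fingerprint)

    def __lt__(self, other):
        return (self.name, self.fingerprint) < (other.name, other.fingerprint)


# @total_ordering derives __le__, __gt__ and __ge__ from __eq__ and __lt__
print(NameKey('alpha') < NameKey('beta'))   # True
print(NameKey('beta') >= NameKey('alpha'))  # True
```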

class valjean.gavroche.diagnostics.stats.TestStatsTasks(*, name, description='', labels=None, task_results)[source]

A test that evaluates statistics about the success/failure status of the given tasks.

__init__(*, name, description='', labels=None, task_results)[source]

Instantiate a TestStatsTasks.

Parameters:
  • name (str) – the test name.

  • description (str) – the test description.

  • task_results (list(dict(str, stuff))) – a list of task results, intended as the contents of the environment sections associated with the executed tasks. This test notably inspects the 'status' key to see if the task succeeded.

evaluate()[source]

Evaluate this test and turn it into a TestResultStatsTasks.

data()[source]

Generator yielding objects supporting the buffer protocol that (as a whole) represent a serialized version of self.

class valjean.gavroche.diagnostics.stats.TestResultStatsTasks(*, test, classify)[source]

The result of the evaluation of a TestStatsTasks. The test is considered successful if all the observed tasks have successfully completed (TaskStatus.DONE).

__init__(*, test, classify)[source]

Instantiate a TestResultStatsTasks.

Parameters:
  • test (TestStatsTasks) – the test producing this result.

  • classify (dict(TaskStatus, list(str))) – a dictionary mapping the task status to the list of task names with the given status.

__bool__()[source]

Returns True if all the observed tasks have succeeded.
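The success criterion can be sketched with plain data structures. In this hypothetical illustration, the TaskStatus enum is a reduced stand-in for valjean.cosette.task.TaskStatus, and classify mirrors the dict(TaskStatus, list(str)) mapping described above:

```python
from enum import Enum


class TaskStatus(Enum):
    """Reduced stand-in for valjean.cosette.task.TaskStatus."""
    DONE = 1
    FAILED = 2


def stats_tasks_success(classify):
    """Sketch of __bool__: True iff every observed task is in the DONE bucket."""
    return all(status == TaskStatus.DONE for status in classify)


print(stats_tasks_success({TaskStatus.DONE: ['t1', 't2']}))  # True
print(stats_tasks_success({TaskStatus.DONE: ['t1'],
                           TaskStatus.FAILED: ['t2']}))      # False
```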

class valjean.gavroche.diagnostics.stats.TestOutcome(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enumeration that represents the possible outcomes of a test:

SUCCESS

represents tests that have been evaluated and have succeeded;

FAILURE

represents tests that have been evaluated and have failed;

MISSING

represents tasks that did not generate any 'result' key;

NOT_A_TEST

represents tasks that did not generate a TestResult object as a result;

__format__(format_spec, /)

Default object formatter.

__new__(value)
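The classification described by these four values can be sketched as follows. FakeTestResult and classify_result are hypothetical illustrations of the logic, not valjean API:

```python
from enum import Enum


class TestOutcome(Enum):
    """Mirror of the four outcomes documented above."""
    SUCCESS = 1
    FAILURE = 2
    MISSING = 3
    NOT_A_TEST = 4


class FakeTestResult:
    """Stand-in for a TestResult; truthiness encodes success."""
    def __init__(self, success):
        self.success = success

    def __bool__(self):
        return self.success


def classify_result(task_result):
    """Classify one task-result dict into a TestOutcome."""
    if 'result' not in task_result:
        return TestOutcome.MISSING
    results = task_result['result']
    if not all(isinstance(res, FakeTestResult) for res in results):
        return TestOutcome.NOT_A_TEST
    return TestOutcome.SUCCESS if all(results) else TestOutcome.FAILURE


print(classify_result({}))                                  # TestOutcome.MISSING
print(classify_result({'result': ['not a TestResult']}))    # TestOutcome.NOT_A_TEST
print(classify_result({'result': [FakeTestResult(True)]}))  # TestOutcome.SUCCESS
```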
valjean.gavroche.diagnostics.stats.test_stats(*, name, description='', labels=None, tasks)[source]

Create a TestStatsTests from a list of tests.

The TestStatsTests class must be instantiated with the list of test results, which are not available to the user when the tests are specified in the job file. Therefore, the creation of the TestStatsTests must be delayed until the test tasks have finished and their results are available in the environment. For this purpose it is necessary to wrap the instantiation of TestStatsTests in a Use wrapper, and to evaluate the resulting test using an EvalTestTask.

This function hides this bit of complexity from the user. Assume you have a list of tasks that evaluate some tests, and that you would like to produce statistics about the test results. Let us construct a toy dataset first:

>>> from collections import OrderedDict
>>> import numpy as np
>>> from valjean.eponine.dataset import Dataset
>>> x = np.linspace(-5., 5., num=100)
>>> y = x**2
>>> error = np.zeros_like(y)
>>> bins = OrderedDict([('x', x)])
>>> parabola = Dataset(y, error, bins=bins, name='parabola')
>>> parabola2 = Dataset(y*(1+1e-6), error, bins=bins, name='parabola2')

Now we write a function that generates dummy tests for the parabola dataset:

>>> from valjean.cosette.task import TaskStatus
>>> from valjean.gavroche.test import TestEqual, TestApproxEqual
>>> def test_generator():
...     result = [TestEqual(parabola, parabola2, name='equal?').evaluate(),
...               TestApproxEqual(parabola, parabola2,
...                               name='approx_equal?').evaluate()]
...     return {'test_generator': {'result': result}}, TaskStatus.DONE

We need to wrap this function in a PythonTask so that it can be executed as a part of the dependency graph:

>>> from valjean.cosette.pythontask import PythonTask
>>> create_tests_task = PythonTask('test_generator', test_generator)

Here is how you make a TestStatsTests to collect statistics about the results of the generated tests:

>>> stats = test_stats(name='equal', tasks=[create_tests_task])
>>> from valjean.gavroche.eval_test_task import EvalTestTask
>>> isinstance(stats, EvalTestTask)
True

Here stats evaluates the test that gathers the statistics, and it depends on a special task that generates the TestStatsTests instance:

>>> print(stats.depends_on)
{Task('equal.stats')}
>>> create_stats = next(task for task in stats.depends_on)

In turn, create_stats has a soft dependency on the task that generates our test, create_tests_task:

>>> create_tests_task in create_stats.soft_depends_on
True

The reason why the dependency is soft is that we want to collect statistics about the test outcome in any case, even (especially!) if some of the tests failed or threw exceptions.

Let’s run the tests:

>>> from valjean.config import Config
>>> config = Config()
>>> from valjean.cosette.env import Env
>>> env = Env()
>>> for task in [create_tests_task, create_stats, stats]:
...     env_up, status = task.do(env=env, config=config)
...     env.apply(env_up)
>>> print(status)
TaskStatus.DONE

The results are stored in a list under the key 'result':

>>> print(len(env[stats.name]['result']))
1
>>> stats_res = env[stats.name]['result'][0]
>>> print("SUCCESS:", stats_res.classify[TestOutcome.SUCCESS])
SUCCESS: ['approx_equal?']
>>> print("FAILURE:", stats_res.classify[TestOutcome.FAILURE])
FAILURE: ['equal?']
Parameters:
  • name (str) – the name of the task to create.

  • description (str) – its description.

  • tasks (list(Task)) – the list of tasks that generate the tests to observe.

Returns:

an EvalTestTask that evaluates the diagnostic test.

Return type:

EvalTestTask

class valjean.gavroche.diagnostics.stats.TestStatsTests(*, name, description='', labels=None, task_results)[source]

A test that evaluates statistics about the success/failure of the given tests.

__init__(*, name, description='', labels=None, task_results)[source]

Instantiate a TestStatsTests from a collection of task results. The tasks are expected to generate TestResult objects, which must appear in the 'result' key of the task result.

evaluate()[source]

Evaluate this test and turn it into a TestResultStatsTests.

data()[source]

Generator yielding objects supporting the buffer protocol that (as a whole) represent a serialized version of self.

class valjean.gavroche.diagnostics.stats.TestResultStatsTests(*, test, classify)[source]

The result of the evaluation of a TestStatsTests. The test is considered successful if all the observed tests have been successfully evaluated and have succeeded.

__bool__()[source]

Returns True if all the observed tests have succeeded.

valjean.gavroche.diagnostics.stats.test_stats_by_labels(*, name, description='', labels=None, tasks, by_labels)[source]

Create a TestStatsTestsByLabels from a list of tests.

See test_stats for the generalities about this function.

Compared to test_stats it takes one extra argument, 'by_labels', used to classify the tests and then build statistics based on these labels. The order of the labels matters, as they are selected successively.

Let’s define three menus:

>>> menu1 = {'food': 'egg + spam', 'drink': 'beer'}
>>> menu2 = {'food': 'egg + bacon', 'drink': 'beer'}
>>> menu3 = {'food': 'lobster thermidor', 'drink': 'brandy'}

These three menus are ordered by pairs of customers. The restaurant keeps statistics on the meals using TestMetadata. The goal of the tests is to know whether both members of a pair ordered the same menu, and when they did so.

orders = [TestMetadata({'Graham': menu1, 'Terry': menu1},
                       name='gt_wday_lunch',
                       labels={'day': 'Wednesday', 'meal': 'lunch'}),
          TestMetadata({'Michael': menu1, 'Eric': menu2},
                       name='me_wday_dinner',
                       labels={'day': 'Wednesday', 'meal': 'dinner'}),
          TestMetadata({'John': menu2, 'Terry': menu2},
                       name='jt_wday',
                       labels={'day': 'Wednesday'}),
          TestMetadata({'Terry': menu3, 'John': menu3},
                       name='Xmasday',
                       labels={'day': "Christmas Eve"})]

The restaurant owner uses test_stats_by_labels to build statistics on his menus and on the habits of his customers.

For example, the menus filtered on day will give:

=============  ============  =============
      day       % success      % failure
=============  ============  =============
Christmas Eve      1/1            0/1
    Wednesday      2/3            1/3
=============  ============  =============

These results mean that, for the requested tests, both customers of the pair ordered the same meal on Christmas Eve. On Wednesday, one pair of customers out of three did not order the same menu.

The same kind of statistics can be done based on the meal:

==========  ============  =============
   meal       % success     % failure
==========  ============  =============
  dinner         0/1            1/1
   lunch         1/1            0/1
==========  ============  =============

In this case two tests were not taken into account, as they did not have a 'meal' label.

It is also possible to make selections on multiple labels. In that case the order matters: the classification is performed following the order of the requested labels. For example, 'meal' then 'day':

==========  =========  ============  =============
   meal        day       % success     % failure
==========  =========  ============  =============
  dinner    Wednesday       0/1            1/1
   lunch    Wednesday       1/1            0/1
==========  =========  ============  =============
  • Only two tests are selected, due to the 'meal' selection.

  • Requesting 'day' then 'meal' would only swap the first two columns in this case, and would emit a warning: a preselection on 'day' is done first, and since the Christmas Eve test does not provide a 'meal' label, the selection cannot be performed for it. For the Wednesday tests there is no problem: 'meal' appears in at least one of them (two in our case).

Finally, if the request involves a label that does not exist in any test, an exception is raised mentioning the offending label.
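The label-based classification illustrated in the tables above can be sketched with plain dictionaries. This is a hypothetical illustration, not the valjean implementation (which builds an Index); the flat records below mirror the four TestMetadata orders:

```python
from collections import defaultdict

# one record per test: its labels plus a hypothetical boolean outcome
tests = [{'day': 'Wednesday', 'meal': 'lunch',  'ok': True},   # gt_wday_lunch
         {'day': 'Wednesday', 'meal': 'dinner', 'ok': False},  # me_wday_dinner
         {'day': 'Wednesday',                   'ok': True},   # jt_wday
         {'day': 'Christmas Eve',               'ok': True}]   # Xmasday


def stats_by_labels(tests, by_labels):
    """Group tests by the requested labels; skip tests missing any label."""
    counts = defaultdict(lambda: {'OK': 0, 'KO': 0, 'total': 0})
    for test in tests:
        if not all(label in test for label in by_labels):
            continue  # e.g. jt_wday and Xmasday have no 'meal' label
        key = tuple(test[label] for label in by_labels)
        counts[key]['OK' if test['ok'] else 'KO'] += 1
        counts[key]['total'] += 1
    return dict(counts)


# reproduces the 'day' table: Wednesday 2/3 success, Christmas Eve 1/1
print(stats_by_labels(tests, ('day',)))
```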

Parameters:
  • name (str) – the name of the task to create.

  • description (str) – its description.

  • tasks (list(Task)) – the list of tasks that generate the tests to observe.

  • by_labels (tuple(str)) – labels from the tests on which the classification will be based

Returns:

an EvalTestTask that evaluates the diagnostic test.

Return type:

EvalTestTask

exception valjean.gavroche.diagnostics.stats.TestStatsTestsByLabelsException[source]

Exception raised during the diagnostic test on TestResult when a classification by labels is required.

class valjean.gavroche.diagnostics.stats.TestStatsTestsByLabels(*, name, description='', labels=None, task_results, by_labels)[source]

A test that evaluates statistics about the success/failure of the given tests using their labels to classify them.

Usually more than one test is performed for each tested case. This test summarizes the tests done for a given category, defined by the user in the usual tests (TestStudent, TestMetadata, etc.).

During the evaluation a list of dictionaries of labels is built, one per test. These labels are given by the user at the initialization of the test. Each dictionary also contains the name of the test (the name of the task) and its result (success or failure). From this list of dictionaries an Index is built.

The result of the evaluation is given as a list of dictionaries containing the strings corresponding to the chosen labels under the 'labels' key, along with the number of results OK, KO and total.

__init__(*, name, description='', labels=None, task_results, by_labels)[source]

Instantiate a TestStatsTestsByLabels from a collection of task results. The tasks are expected to generate TestResult objects, which must appear in the 'result' key of the task result.

Parameters:
  • name (str) – the test name

  • description (str) – the test description

  • task_results – a list of task results; each task result normally contains a TestResult that is used in this test.

  • by_labels (tuple) – ordered labels to sort the test results. These labels are the test labels.

evaluate()[source]

Evaluate this test and turn it into a TestResultStatsTestsByLabels.

data()[source]

Generator yielding objects supporting the buffer protocol that (as a whole) represent a serialized version of self.

class valjean.gavroche.diagnostics.stats.TestResultStatsTestsByLabels(*, test, classify, n_labels)[source]

The result of the evaluation of a TestStatsTestsByLabels. The test is considered successful if all the observed tests have been successfully evaluated and have succeeded.

An oracle is available for each individual test (usually what is required here).

self.classify is here a list of dictionaries with the following keys: ['labels', 'OK', 'KO', 'total'].

__init__(*, test, classify, n_labels)[source]

Instantiate a TestResultStatsTestsByLabels.

Parameters:
  • test (TestStatsTestsByLabels) – the test producing this result.

  • classify (list(dict)) – a list of dictionaries with the keys 'labels', 'OK', 'KO' and 'total'.

__bool__()[source]

Test is successful if all tests are.

oracles()[source]

Test if each test is successful.

Return type:

list(bool)

nb_missing_labels()[source]

Return the number of tests where at least one of the required labels was missing.

Return type:

int
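Given the classify structure described above, the overall success criterion can be sketched as follows. This is a hypothetical illustration (function name and sample values are assumptions; the real oracles() works per individual test):

```python
# classify, as described above, is a list of dictionaries with the keys
# 'labels', 'OK', 'KO' and 'total' (sample values below are made up)
classify = [{'labels': ('Wednesday', 'lunch'),  'OK': 1, 'KO': 0, 'total': 1},
            {'labels': ('Wednesday', 'dinner'), 'OK': 0, 'KO': 1, 'total': 1}]


def overall_success(classify):
    """Sketch of __bool__: successful only if no category saw a failure."""
    return all(item['KO'] == 0 for item in classify)


print(overall_success(classify))  # False: the dinner category has a failure
```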

valjean.gavroche.diagnostics.stats.classification_counts(classify, status_first)[source]

Count the occurrences of different statuses in the classify dictionary.

Parameters:
  • classify (dict) – a dictionary associating things to statuses. The statuses must have the same type as status_first.

  • status_first – the status that is considered a success. This must be a member of an enum class.

Returns:

a pair of lists of equal length. The first element of the pair is the list of statuses appearing in classify (status_first is guaranteed to come first in this list); the second element is the number of times the corresponding status appears in classify.
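The documented behaviour can be sketched as follows. This is a hypothetical re-implementation for illustration (Status and classification_counts_sketch are made-up names), following the described contract: a pair of equal-length lists with status_first guaranteed to come first:

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    """Hypothetical status enum; SUCCESS plays the role of status_first."""
    SUCCESS = 1
    FAILURE = 2


def classification_counts_sketch(classify, status_first):
    """Count status occurrences; status_first sorts to the front."""
    counter = Counter(classify.values())
    # False < True, so status_first ends up first; sorted() is stable
    statuses = sorted(counter, key=lambda status: status is not status_first)
    return statuses, [counter[status] for status in statuses]


classify = {'t1': Status.SUCCESS, 't2': Status.FAILURE, 't3': Status.SUCCESS}
statuses, counts = classification_counts_sketch(classify, Status.SUCCESS)
print(statuses, counts)  # Status.SUCCESS first, with counts [2, 1]
```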

Test of metadata

This module defines tests for metadata.

class valjean.gavroche.diagnostics.metadata.Missing[source]

Class for missing metadata.

__str__()[source]

Return str(self).

class valjean.gavroche.diagnostics.metadata.TestResultMetadata(test, dict_res)[source]

Results of metadata comparisons.

__init__(test, dict_res)[source]

Initialisation of TestResultMetadata.

per_key()[source]

Test result sorted by key.

only_failed_comparisons()[source]

Return only the failed comparisons. The structure is the same as that of dict_res.

class valjean.gavroche.diagnostics.metadata.TestMetadata(dict_md, name, description='', labels=None, exclude=('results', 'index', 'score_index', 'response_index', 'response_type'))[source]

A test that compares metadata.

Todo

Document the parameters…

__init__(dict_md, name, description='', labels=None, exclude=('results', 'index', 'score_index', 'response_index', 'response_type'))[source]

Initialisation of TestMetadata.

Parameters:
  • name (str) – local name of the test

  • description (str) – specific description of the test

  • labels (dict) – labels to be used for test classification in reports, for example category, input file name, type of result, …

  • exclude (tuple) – a tuple of keys that will not be considered as metadata. Default: ('results', 'index', 'score_index', 'response_index', 'response_type')

build_metadata_dict()[source]

Build the dictionary of metadata.

Contains all the metadata for all samples.

compare_metadata()[source]

The metadata are compared with respect to those of the first sample.
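This comparison can be sketched with plain dictionaries. Everything below (the data layout, compare_to_first, the sample values) is a hypothetical illustration of the idea, not the valjean implementation; only the Missing placeholder mirrors the class documented above:

```python
class Missing:
    """Stand-in for the Missing placeholder class above."""
    def __str__(self):
        return 'metadata not found'


MISSING = Missing()


def compare_to_first(all_md):
    """Compare each sample's metadata to the first sample's, key by key.

    all_md maps sample name -> metadata dict; returns, for each key of the
    reference (first) sample, the per-sample equality against it.
    """
    samples = list(all_md)
    reference = all_md[samples[0]]
    return {key: {sample: all_md[sample].get(key, MISSING) == reference[key]
                  for sample in samples}
            for key in reference}


all_md = {'run1': {'code': 'T4', 'version': '11'},
          'run2': {'code': 'T4', 'version': '12'}}
result = compare_to_first(all_md)
print(result['code'])     # {'run1': True, 'run2': True}
print(result['version'])  # {'run1': True, 'run2': False}
```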

evaluate()[source]

Evaluate this test and turn it into a TestResultMetadata.

data()[source]

Generator yielding objects supporting the buffer protocol that (as a whole) represent a serialized version of self.