benchmarks

Utilities for computing statistics on benchmark data.

Translated from https://github.com/jupyterlab/jupyterlab/blob/82df0b635dae2c1a70a7c41fe7ee7af1c1caefb2/galata/src/benchmarkReporter.ts#L150-L244, which was originally added in https://github.com/jupyterlab/benchmarks/blob/f55db969bf4d988f9d627ba187e28823a50153ba/src/compare.ts#L136-L213.

Distribution dataclass

Statistical description of a distribution

Source code in lineapy/utils/benchmarks.py, lines 41-52:

```python
@dataclass
class Distribution:
    """
    Statistical description of a distribution
    """

    mean: float
    variance: float

    @classmethod
    def from_data(cls, data: List[float]) -> Distribution:
        return cls(mean(data), variance(data))
```
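
A minimal usage sketch (assuming the module's `mean` and `variance` helpers come from Python's standard `statistics` library):

```python
from lineapy.utils.benchmarks import Distribution

# Summarize a list of benchmark timings (e.g. seconds) as a mean and variance.
dist = Distribution.from_data([1.2, 1.4, 1.1, 1.3])
print(dist.mean)      # 1.25
print(dist.variance)  # ~0.0167 (sample variance, under the statistics.variance assumption)
```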

DistributionChange dataclass

Change between two distributions

Source code in lineapy/utils/benchmarks.py, lines 17-38:

```python
@dataclass
class DistributionChange:
    """
    Change between two distributions
    """

    # Mean value
    mean: float
    # Spread around the mean value
    confidence_interval: float
    # The confidence interval level, i.e. 0.95 for a 95% confidence interval
    confidence_interval_level: float

    def __str__(self):
        """
        Format a performance change like `between 20.1% slower and 30.3% faster (95% CI)`.
        """
        return (
            f"between {format_percent(self.mean + self.confidence_interval)} "
            f"and {format_percent(self.mean - self.confidence_interval)} "
            f"({self.confidence_interval_level * 100}% CI)"
        )
```

__str__()

Format a performance change like `between 20.1% slower and 30.3% faster (95% CI)`.

Source code in lineapy/utils/benchmarks.py, lines 30-38:

```python
def __str__(self):
    """
    Format a performance change like `between 20.1% slower and 30.3% faster (95% CI)`.
    """
    return (
        f"between {format_percent(self.mean + self.confidence_interval)} "
        f"and {format_percent(self.mean - self.confidence_interval)} "
        f"({self.confidence_interval_level * 100}% CI)"
    )
```
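
For illustration, constructing a change by hand and printing it (a sketch: the values below are made up, and the exact percentage text depends on the module's `format_percent` helper):

```python
from lineapy.utils.benchmarks import DistributionChange

change = DistributionChange(
    mean=0.92,                       # hypothetical mean performance change
    confidence_interval=0.15,        # hypothetical spread around the mean
    confidence_interval_level=0.95,  # 95% confidence interval
)
# Prints a human-readable range of the form
# "between ...% slower and ...% faster (95% CI)"
print(change)
```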

distribution_change(old_measures, new_measures, confidence_interval=0.95)

Compute the performance change based on a number of old and new measurements.

Based on the work by Tomas Kalibera and Richard Jones. See their paper "Quantifying Performance Changes with Effect Size Confidence Intervals", section 6.2, formula "Quantifying Performance Change".

Note: The measurements must have the same length. As a fallback, you could truncate both sets to the minimum size of the two measurement sets, as sketched below.
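
A minimal sketch of that fallback (the wrapper name is hypothetical, not part of the module):

```python
from lineapy.utils.benchmarks import distribution_change

def distribution_change_truncated(old_measures, new_measures, confidence_interval=0.95):
    # Hypothetical helper: trim both series to their common length before comparing.
    n = min(len(old_measures), len(new_measures))
    return distribution_change(old_measures[:n], new_measures[:n], confidence_interval)
```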

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `old_measures` | `List[float]` | The list of timings from the old system. | required |
| `new_measures` | `List[float]` | The list of timings from the new system. | required |
| `confidence_interval` | `float` | The confidence interval for the results. The default is a 95% confidence interval (95% of the time the true mean will be between the resulting mean +- the resulting CI). | `0.95` |

Test against the example in the paper, from Table V, on pages 18-19:

```python
from math import isclose
from statistics import mean

from lineapy.utils.benchmarks import distribution_change

res = distribution_change(
    old_measures=[
        round(mean([9, 11, 5, 6]), 1),
        round(mean([16, 13, 12, 8]), 1),
        round(mean([15, 7, 10, 14]), 1),
    ],
    new_measures=[
        round(mean([10, 12, 6, 7]), 1),
        round(mean([9, 1, 11, 4]), 1),
        round(mean([8, 5, 3, 2]), 1),
    ],
    confidence_interval=0.95,
)
assert isclose(res.mean, 68.3 / 74.5, rel_tol=0.05)
assert isclose(res.confidence_interval, 60.2 / 74.5, rel_tol=0.05)
```
Source code in lineapy/utils/benchmarks.py, lines 102-157:

````python
def distribution_change(
    old_measures: List[float],
    new_measures: List[float],
    confidence_interval: float = 0.95,
) -> DistributionChange:
    """
    Compute the performance change based on a number of old and new measurements.

    Based on the work by Tomas Kalibera and Richard Jones. See their paper
    "Quantifying Performance Changes with Effect Size Confidence Intervals", section 6.2,
    formula "Quantifying Performance Change".

    Note: The measurements must have the same length. As a fallback, you could use the minimum
    size of the two measurement sets.

    Parameters
    ----------
    old_measures: List[float]
        The list of timings from the old system
    new_measures: List[float]
        The list of timings from the new system
    confidence_interval: float
        The confidence interval for the results.
        The default is a 95% confidence interval (95% of the time the true mean will be
        between the resulting mean +- the resulting CI)

    Test against the example in the paper, from Table V, on pages 18-19

    ```python
    res = distribution_change(
        old_measures=[
            round(mean([9, 11, 5, 6]), 1),
            round(mean([16, 13, 12, 8]), 1),
            round(mean([15, 7, 10, 14]), 1),
        ],
        new_measures=[
            round(mean([10, 12, 6, 7]), 1),
            round(mean([9, 1, 11, 4]), 1),
            round(mean([8, 5, 3, 2]), 1),
        ],
        confidence_interval=0.95
    )
    from math import isclose
    assert isclose(res.mean, 68.3 / 74.5, rel_tol=0.05)
    assert isclose(res.confidence_interval, 60.2 / 74.5, rel_tol=0.05)
    ```
    """
    n = len(old_measures)
    if n != len(new_measures):
        raise ValueError("Data have different length")
    return performance_change(
        Distribution.from_data(old_measures),
        Distribution.from_data(new_measures),
        n,
        confidence_interval,
    )
````
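
A typical end-to-end use, with illustrative timing data (the numbers below are made up):

```python
from lineapy.utils.benchmarks import distribution_change

old_timings = [1.32, 1.29, 1.35, 1.31, 1.30]  # seconds on the old system
new_timings = [1.21, 1.24, 1.19, 1.22, 1.23]  # seconds on the new system

change = distribution_change(old_timings, new_timings, confidence_interval=0.95)
print(change)  # e.g. "between ...% slower and ...% faster (95% CI)"
```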
