TL;DR: time_in_nanos is reported incorrectly; we need to implement a more advanced algorithm to compute it.
Full version.
I wanted to check the influence of the hyperion framework itself on the measured data. To do that I used the same technique as @Shimuuar did for criterion. For criterion, the result was that collecting statistics during the run had a very high impact on the results, and the solution was to write the raw data to a temporary file and analyze it later; so I wanted to see whether hyperion is affected in the same way.
The idea is to run a large number of very simple measurements:
```haskell
#!/usr/bin/env stack
-- stack --no-nix-pure runghc --package hyperion
{-# LANGUAGE OverloadedLists #-}
module Main where

import Hyperion.Benchmark
import Hyperion.Main
import Hyperion.Run
import Data.Monoid
import Text.Printf

benchmarks :: [Benchmark]
benchmarks =
  [ bench (printf "%03i" n) $ nf exp (0.73 :: Double) | n <- [1..999 :: Int] ]

main :: IO ()
main =
  -- defaultMainWith defaultConfig{configMonoidSamplingStrategy = First $ Just $ fixed 1} "hyperion-example-micro-benchmarks" benchmarks
  defaultMain "hyperion-example-micro-benchmarks" benchmarks
```
I expected all benchmarks to show more or less equal results. I used R to analyze the results:
```
qnikst@qwork ~/workspace/tweag/hyperion $ ./dist-newstyle/build/hyperion-0.1.0.0/build/hyperion-sanity-check/hyperion-sanity-check --flat output.flat.json
```
```r
> library(jsonlite)
> dat <- fromJSON("output.flat.json")
> summary(dat$time_in_nanos)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  160.1   336.8   501.0   524.8   662.5 12990.0
> mean(dat$time_in_nanos)
[1] 524.8087
> sd(dat$time_in_nanos)
[1] 485.5656
> d <- dat[with(dat, order(bench_name)),]
> plot(d$time_in_nanos)
```
We see that the standard deviation is almost equal to the mean. We also see that time_in_nanos grows with the test index. Up to this point we might conclude that the measurement machinery affects the experiment in an undesirable way. As mentioned above, criterion had exactly that problem, so I modified hyperion a bit to use Storable vectors instead of Unboxed ones, moving the samples off the Haskell heap, and this actually helped: the measurements no longer show that trend. Still, what we can tell from the graph is that we have a very large deviation in the measurements, and that rings a bell.
In addition, even with Storable vectors and the program run with --pure, we still see an increasing time_in_nanos trend over the experiment number. So that solution is not complete.
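One way to quantify that trend, rather than eyeballing the plot (a suggestion of mine, not part of the original session; it reuses `d` from above, which is ordered by bench_name and hence by run order):

```r
# Regress the reported time on the run index; a positive, significant
# slope confirms that time_in_nanos grows as the run progresses.
d$index <- seq_len(nrow(d))
summary(lm(time_in_nanos ~ index, data = d))$coefficients
```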
Happily, hyperion does provide the raw measurement data, so I tried to analyze it myself. Skipping all the iterations and surprises along the way, I ended up with the following code.
```
qnikst@qwork ~/workspace/tweag/hyperion $ ./dist-newstyle/build/hyperion-0.1.0.0/build/hyperion-sanity-check/hyperion-sanity-check --raw --flat output.flat.json
```
Because, like criterion, hyperion runs each measurement as a number of batches, we can fit a linear model, duration = overhead + batchSize * t, where t is the per-iteration time: the constant per-measurement cost ends up in the intercept and is thereby ruled out of the result. See https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/ for more details.
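To illustrate why the slope is the right estimator, here is a small simulation (my own sketch; the overhead and per-iteration numbers are made up, loosely matching the magnitudes seen above):

```r
# Simulated batches with a fixed per-measurement overhead and a true
# per-iteration cost. The regression slope recovers the per-iteration
# cost, while the naive total/iterations estimate is inflated by the
# overhead spread over the batches.
batchSize <- 1:20
duration  <- 400 + batchSize * 115 + rnorm(20, sd = 10)  # 400ns overhead, 115ns/iter
lm(duration ~ batchSize)$coefficients["batchSize"]       # close to 115
sum(duration) / sum(batchSize)                           # noticeably larger (~153)
```

With that in mind, here is the analysis over the real data: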
```r
dat <- fromJSON("output.flat.json")            # read the raw data
dat <- dat$results                             # ignore metadata
dat <- dat[, c("bench_name", "measurements")]  # select only the interesting fields
dat <- dat[with(dat, order(bench_name)),]      # order the data
# fit a linear model for each experiment and keep the batchSize slope
result <- apply(dat, 1, function(x)
  lm(formula = duration ~ batchSize, data = x$measurements)$coefficients["batchSize"])
# reorder the data so it's reported in the same order as it ran
r <- result[order(names(result))]
plot(r)
```
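One detail the issue leaves out is how the variable `cleared` used below was computed; a minimal sketch, assuming a standard 1.5-IQR filter over the per-benchmark slopes `r`, could be:

```r
# Hypothetical outlier filter: keep slopes within 1.5 IQR of the quartiles.
q <- quantile(r, c(0.25, 0.75))
iqr <- q[2] - q[1]
cleared <- r[r > q[1] - 1.5 * iqr & r < q[2] + 1.5 * iqr]
```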
Ignoring the outliers (those will be the subject of further analysis), we see a flat line! After removing them we get:
```r
> mean(cleared)
[1] 115.1444
> sd(cleared)
[1] 24.45192
```
These are perfect results! They mean that hyperion actually does its job right, and that a correct analysis of the data leads to correct results.
So what was the problem? After doing this I went to the sources and saw the following:
hyperion/src/Hyperion/Analysis.hs, lines 49 to 66 at commit ccdccfa:
```haskell
    , _reportTimeInNanos =
        totalDuration / trueNumIterations
    , _reportCycles = Nothing
    , _reportAlloc = Nothing
    , _reportGarbageCollections = Nothing
    , _reportMeasurements = Just samp
    }
  where
    totalDuration =
      ala
        Sum
        (foldMapOf (measurements.each.duration.to realToFrac))
        samp
    trueNumIterations =
      ala
        Sum
        (foldMapOf (measurements.each.batchSize.to realToFrac))
        samp
```
Instead of fitting a linear regression, we simply sum all the measured times and divide by the total number of iterations! As a result, the per-measurement overhead is not eliminated. I understand the purpose of this: such an analysis works for all sampling strategies, both fixed and incremental batches. However, as the analysis above shows, it is unreliable.
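To make the difference concrete, one can compare the two estimators on the raw data directly. This is my own sketch, assuming `dat` from the analysis above, where each element of `dat$measurements` is a data frame with `batchSize` and `duration` columns:

```r
m <- dat$measurements[[1]]                   # measurements of the first benchmark
naive <- sum(m$duration) / sum(m$batchSize)  # what Analysis.hs computes
slope <- lm(duration ~ batchSize, data = m)$coefficients["batchSize"]
c(naive = naive, regression = slope)         # the naive estimate carries the overhead
```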