bunsen ai/ml testsuite analysis prototype

public inbox for buildbot@sourceware.org

From: "Frank Ch. Eigler" <fche@redhat.com>
To: bunsen@sourceware.org, buildbot@sourceware.org
Subject: bunsen ai/ml testsuite analysis prototype
Date: Mon, 6 Feb 2023 11:05:07 -0500
Message-ID: <20230206160507.GA31394@redhat.com>

Hi -

As the initial fruits of several weeks of learning & playing, I've
committed a pytorch-based script to predict (dejagnu) test results in
bunsen.  Specifically, given a testrun's metadata plus the expfile and
subtest names, the code attempts to predict the most likely test
result (PASS/FAIL/etc.).  This prediction can then be compared to the
actual test results to identify those results that are unexpected
(mispredicted), with respect to the historical corpus the underlying
neural net has been trained on.
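
To make the compare step concrete, here is a minimal sketch in plain
Python.  It is an illustration only, not the code in
bin/r-dejagnu-testcase-classify: the outcome list, the function names,
and the idea that the network emits one raw score per outcome are all
assumptions made for this example.

```python
import math

# Hypothetical outcome classes; the real script's label set may differ.
OUTCOMES = ["PASS", "FAIL", "XFAIL", "UNSUPPORTED"]

def softmax(scores):
    """Turn raw per-outcome scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return (most likely outcome, its probability)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return OUTCOMES[best], probs[best]

def unexpected(results):
    """Keep only the test cases whose actual outcome was mispredicted.

    results: iterable of (subtest name, actual outcome, raw scores).
    Returns a list of (subtest name, actual, predicted, probability).
    """
    flagged = []
    for name, actual, scores in results:
        predicted, prob = predict(scores)
        if predicted != actual:
            flagged.append((name, actual, predicted, prob))
    return flagged
```

Feeding this a testrun's (subtest, actual result, scores) triples
yields only the rows where the prediction and the actual result
disagree, along with how confident the prediction was.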

The intent is that by identifying these "unexpected" results, a
developer can focus on them, in a similar way that he or she could
look at regressions.  The difference is that regressions are computed
with respect to specific given testruns, whereas these neural net
predictions are done with respect to the entire training corpus, so
are not manually parametrized.  Therefore, these results show up in
the web ui as associated with the target testrun only.  See:

https://tinyurl.com/4c9ha9f8

The idea is that the last column's nonempty cells identify those test
cases that may well deserve more scrutiny.  The probability of the
prediction is also shown, and for eye candy purposes is also encoded
into opacity.

This initial version of the bin/r-dejagnu-testcase-classify script
focuses on dejagnu results only, but the concept applies generally and
the code should generalize easily.  It relies on the "torch" (pytorch)
package, which is a large package that is unfortunately not generally
available on distros, so one's stuck doing a "pip3 install torch"
first.

Run it thusly:

% wget https://builder.sourceware.org/bunsen-dejagnu.pth  # once
% r-dejagnu-testcase-classify  --mode=predict TESTRUN ... 

and you may start from a 250MB neural network state snapshot given
above.  I'm running an ongoing training job on a fast workstation, and
plan to periodically update the model file.  We have roughly 2 billion
dejagnu test cases in bunsen, which will take several weeks of
processing time just to train the network initially, plus we have
millions of new results every day.  This means that the results will
vary over time, perhaps rapidly.

My hope is that as the neural network learns more characteristics
about our enormous dataset, it'll figure out what's common, what's
random noise, and what's abnormal.  It should gradually adapt to but
not entirely normalize outliers.

The same script also implements the training operation, so you
can play completely at home:

% r-dejagnu-testcase-classify  --mode=train TESTRUN ... 

to create and/or update the neural net stored in the model file.  It's
possible to create smaller (faster and probably dumber) networks, for
example if one's dataset is small.  I'm planning to add more training
related workflows and parameters.

Finally, the testrun dejagnu view now also shows off the testsuite
"entropy" numbers, which have been active but hidden until now.  That
quantity is a statistical measurement of the amount of variation of
results for the same testsuite (expfile) among various clusters that
any particular testrun is a member of.  Recall that bunsen "clusters"
are just groups of testruns that have a matching piece of metadata.
(Think: "same compiler", "same date", or "same source snapshot".)

Low entropy (0) means that all other testruns in the same cluster had
matching results (same number of PASS/FAIL/etc.).  A high entropy
means that there is lots of variation between results: it's a noisy
test case within that cluster.  My hope is that the data behind this
statistic can help tune the testsuite itself.  Some test cases that
always behave the exact same way may not be needed at all; test cases
that are highly noisy may not be worth paying attention to.  These
numbers quantify that characteristic, on a per-cluster basis.
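
For a feel of how such a number behaves, here is a small sketch that
assumes plain Shannon entropy over the distinct result summaries seen
within one cluster.  The formula bunsen actually uses is not spelled
out above, so both the function name and the choice of Shannon entropy
are assumptions for illustration.

```python
import math
from collections import Counter

def cluster_entropy(summaries):
    """Shannon entropy (in bits) of result summaries within one cluster.

    summaries: one hashable summary per testrun for a given expfile,
    e.g. a tuple of (outcome, count) pairs.  This input shape is
    hypothetical.
    """
    counts = Counter(summaries)
    total = sum(counts.values())
    # Sum p * log2(1/p) over each distinct summary; zero when all agree.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Every testrun in the cluster agrees: entropy is zero.
same = (("PASS", 10), ("FAIL", 0))
quiet = cluster_entropy([same, same, same])

# Half the testruns differ from the other half: entropy is one bit.
other = (("PASS", 9), ("FAIL", 1))
noisy = cluster_entropy([same, same, other, other])
```

Under this assumption, an expfile whose results never vary within a
cluster scores 0, and the score grows as the results spread over more
distinct summaries.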

Please play with it and offer any feedback you have.

- FChE
