Date: Mon, 6 Feb 2023 11:05:07 -0500
From: "Frank Ch. Eigler"
To: bunsen@sourceware.org, buildbot@sourceware.org
Subject: bunsen ai/ml testsuite analysis prototype
Message-ID: <20230206160507.GA31394@redhat.com>

Hi -

The initial fruits of a bunch of weeks of learning & playing: I've
committed a pytorch-based script to predict (dejagnu) test results in
bunsen.  Specifically, given a testrun's metadata plus the expfile and
subtest names, the code attempts to predict the most likely test
result (PASS/FAIL/etc.).  This prediction can then be compared to the
actual test results to identify those results that are unexpected
(mispredicted) with respect to the historical corpus the underlying
neural net has been trained on.

The intent is that by identifying these "unexpected" results, a
developer can focus on them, much as he or she would look at
regressions.  The difference is that regressions are computed with
respect to specific given testruns, whereas these neural-net
predictions are made with respect to the entire training corpus, so
they are not manually parametrized.  Therefore, these results show up
in the web UI as associated with the target testrun only.  See:

    https://tinyurl.com/4c9ha9f8

The idea is that the last column's nonempty cells identify those test
cases that may well deserve more scrutiny.  The probability of the
prediction is also shown, and for eye-candy purposes is also encoded
into opacity.
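(The mail doesn't spell out how the script encodes a testrun's metadata
and test names for the network.  As a rough illustration of the general
idea only - the field names below are hypothetical, not bunsen's actual
schema - arbitrary strings can be folded into a fixed-size numeric input
vector with a hashing trick:)

```python
import hashlib

def hash_features(strings, dim=64):
    """Hashing trick: fold arbitrary metadata/test-name strings into a
    fixed-size float vector suitable as neural-net input."""
    vec = [0.0] * dim
    for s in strings:
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

# Hypothetical testrun fields -- bunsen's real metadata keys may differ.
features = hash_features([
    "arch=x86_64",
    "tool=gdb",
    "expfile=gdb.base/break.exp",
    "subtest=break main",
])
print(len(features), sum(features))  # -> 64 4.0
```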
This initial version of the bin/r-dejagnu-testcase-classify script
focuses on dejagnu results only, but the concept applies generally and
the code should generalize easily.  It relies on the "torch" (pytorch)
package, which is a large package that is unfortunately not generally
available in distros, so one's stuck doing a "pip3 install torch"
first.  Run it thusly:

    % wget https://builder.sourceware.org/bunsen-dejagnu.pth   # once
    % r-dejagnu-testcase-classify --mode=predict TESTRUN ...

and you may start from the 250MB neural-network state snapshot given
above.  I'm running an ongoing training job on a fast workstation, and
plan to update the model file periodically.  We have roughly 2 billion
dejagnu test cases in bunsen, which will take several weeks of
processing time just to train the network initially, plus we get
millions of new results every day.  This means that the predictions
will vary over time, perhaps rapidly.  My hope is that as the neural
network learns more characteristics of our enormous dataset, it'll
figure out what's common, what's random noise, and what's abnormal.
It should gradually adapt to outliers but not entirely normalize them.

The same script also implements the training operation, so you can
play completely at home:

    % r-dejagnu-testcase-classify --mode=train TESTRUN ...

to create and/or update the neural net stored in the model file.  It's
possible to create smaller (faster and probably dumber) networks, for
example if one's dataset is small.  I'm planning to add more
training-related workflows and parameters.

Finally, the testrun dejagnu view now also shows off the testsuite
"entropy" numbers, which have been active but hidden until now.  That
quantity is a statistical measurement of the amount of variation in
results for the same testsuite (expfile) among the various clusters
that any particular testrun is a member of.  Recall that bunsen
"clusters" are just groups of testruns that have a matching piece of
metadata.
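(The architecture behind bunsen-dejagnu.pth isn't described here, so the
following is only a toy sketch of the predict/train split the two modes
imply: a tiny pytorch classifier over a stand-in feature vector, with a
guessed subset of dejagnu result labels.  The real 250MB model is of
course far larger:)

```python
import torch
import torch.nn as nn

RESULTS = ["PASS", "FAIL", "XPASS", "XFAIL", "UNTESTED", "UNSUPPORTED"]

# Hypothetical tiny network; layer sizes are placeholders.
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, len(RESULTS)),
)

x = torch.zeros(1, 64)            # stand-in for an encoded testrun/subtest
logits = model(x)
probs = torch.softmax(logits, dim=1)

# --mode=predict: report the most likely result and its probability,
# which the web UI could compare against the actual result.
prob, idx = probs.max(dim=1)
print(RESULTS[idx.item()], round(prob.item(), 3))

# --mode=train: one cross-entropy step against the observed result.
target = torch.tensor([RESULTS.index("PASS")])
loss = nn.functional.cross_entropy(logits, target)
loss.backward()
```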
(Think: "same compiler", "same date", or "same source snapshot".)  Low
entropy (0) means that all the other testruns in the same cluster had
matching results (the same number of PASS/FAIL/etc.).  High entropy
means that there is lots of variation between results: it's a noisy
test case within that cluster.

My hope is that the data behind this statistic can help tune the
testsuites themselves.  Test cases that always behave the exact same
way may not be needed at all; test cases that are highly noisy may not
be worth paying attention to.  These numbers quantify that
characteristic on a per-cluster basis.

Please play with it and offer any feedback you have.

- FChE