Date: Mon, 6 Feb 2023 11:05:07 -0500
From: "Frank Ch. Eigler"
To: bunsen@sourceware.org, buildbot@sourceware.org
Subject: bunsen ai/ml testsuite analysis prototype
Message-ID: <20230206160507.GA31394@redhat.com>

Hi -

The initial fruits of a bunch of weeks of learning & playing: I've
committed a pytorch-based script to predict (dejagnu) test results in
bunsen.  Specifically, given a testrun's metadata plus the expfile and
subtest names, the code attempts to predict the most likely test
result (PASS/FAIL/etc.).  This prediction can then be compared to the
actual test results to identify those results that are unexpected
(mispredicted) with respect to the historical corpus the underlying
neural net has been trained on.

The intent is that by identifying these "unexpected" results, a
developer can focus on them, much as he or she would look at
regressions.  The difference is that regressions are computed with
respect to specific given testruns, whereas these neural-net
predictions are made with respect to the entire training corpus, so
they are not manually parametrized.  Therefore, these results show up
in the web UI as associated with the target testrun only.  See:

    https://tinyurl.com/4c9ha9f8

The idea is that the last column's nonempty cells identify those test
cases that may well deserve more scrutiny.  The probability of the
prediction is also shown, and for eye-candy purposes is also encoded
into opacity.
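(The mail doesn't spell out how the script encodes a testrun's metadata
and test names for the network.  As a rough illustration of the general
idea only - the field names below are hypothetical, not bunsen's actual
schema - arbitrary strings can be folded into a fixed-size numeric input
vector with a hashing trick:)

```python
import hashlib

def hash_features(strings, dim=64):
    """Hashing trick: fold arbitrary metadata/test-name strings into a
    fixed-size float vector suitable as neural-net input."""
    vec = [0.0] * dim
    for s in strings:
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

# Hypothetical testrun fields -- bunsen's real metadata keys may differ.
features = hash_features([
    "arch=x86_64",
    "tool=gdb",
    "expfile=gdb.base/break.exp",
    "subtest=break main",
])
print(len(features), sum(features))  # -> 64 4.0
```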
This initial version of the bin/r-dejagnu-testcase-classify script
focuses on dejagnu results only, but the concept applies generally and
the code should generalize easily.  It relies on the "torch" (pytorch)
package, which is a large package that is unfortunately not generally
available in distros, so one's stuck doing a "pip3 install torch"
first.  Run it thusly:

    % wget https://builder.sourceware.org/bunsen-dejagnu.pth   # once
    % r-dejagnu-testcase-classify --mode=predict TESTRUN ...

and you may start from the 250MB neural-network state snapshot given
above.  I'm running an ongoing training job on a fast workstation, and
plan to update the model file periodically.  We have roughly 2 billion
dejagnu test cases in bunsen, which will take several weeks of
processing time just to train the network initially, plus we get
millions of new results every day.  This means that the predictions
will vary over time, perhaps rapidly.  My hope is that as the neural
network learns more characteristics of our enormous dataset, it'll
figure out what's common, what's random noise, and what's abnormal.
It should gradually adapt to outliers but not entirely normalize them.

The same script also implements the training operation, so you can
play completely at home:

    % r-dejagnu-testcase-classify --mode=train TESTRUN ...

to create and/or update the neural net stored in the model file.  It's
possible to create smaller (faster and probably dumber) networks, for
example if one's dataset is small.  I'm planning to add more
training-related workflows and parameters.

Finally, the testrun dejagnu view now also shows off the testsuite
"entropy" numbers, which have been active but hidden until now.  That
quantity is a statistical measurement of the amount of variation in
results for the same testsuite (expfile) among the various clusters
that any particular testrun is a member of.  Recall that bunsen
"clusters" are just groups of testruns that have a matching piece of
metadata.
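(The architecture behind bunsen-dejagnu.pth isn't described here, so the
following is only a toy sketch of the predict/train split the two modes
imply: a tiny pytorch classifier over a stand-in feature vector, with a
guessed subset of dejagnu result labels.  The real 250MB model is of
course far larger:)

```python
import torch
import torch.nn as nn

RESULTS = ["PASS", "FAIL", "XPASS", "XFAIL", "UNTESTED", "UNSUPPORTED"]

# Hypothetical tiny network; layer sizes are placeholders.
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, len(RESULTS)),
)

x = torch.zeros(1, 64)            # stand-in for an encoded testrun/subtest
logits = model(x)
probs = torch.softmax(logits, dim=1)

# --mode=predict: report the most likely result and its probability,
# which the web UI could compare against the actual result.
prob, idx = probs.max(dim=1)
print(RESULTS[idx.item()], round(prob.item(), 3))

# --mode=train: one cross-entropy step against the observed result.
target = torch.tensor([RESULTS.index("PASS")])
loss = nn.functional.cross_entropy(logits, target)
loss.backward()
```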
(Think: "same compiler", "same date", or "same source snapshot".)  Low
entropy (0) means that all the other testruns in the same cluster had
matching results (the same number of PASS/FAIL/etc.).  High entropy
means that there is lots of variation between results: it's a noisy
test case within that cluster.

My hope is that the data behind this statistic can help tune the
testsuites themselves.  Test cases that always behave the exact same
way may not be needed at all; test cases that are highly noisy may not
be worth paying attention to.  These numbers quantify that
characteristic on a per-cluster basis.

Please play with it and offer any feedback you have.

- FChE