From: fche@redhat.com (Frank Ch. Eigler)
To: "Serhei Makarov" <serhei@serhei.io>
Cc: bunsen@sourceware.org
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers,
 testrun representation
References: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com>
 <20220310180414.GC28310@redhat.com>
 <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com>
Date: Thu, 10 Mar 2022 18:00:58 -0500
In-Reply-To: <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com> (Serhei
 Makarov's message of "Thu, 10 Mar 2022 15:00:13 -0500")
Message-ID: <87ee39fm51.fsf@redhat.com>


> (When thinking in SQLite first, I can immediately think of a number of
> 'metaknowledge' bits that are currently hardcoded in the corresponding parsing
> scripts, but might conveniently be maintained and extended in tables.
> [...])

(Pure configuration could well just live outside too.)


>>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>>> identifier? We could conceivably allow a single set of testlogs to
>>> define several testruns for the *same* project, which would break the
>>> uniqueness property for <project>+<bunsen_commit_id>
>>
>> Given that we're reconsidering use of git for purposes of storing this
>> analysis data, there is no obvious hash for it.
> Disagree. The bunsen_commit_id hash is based on the commit storing the testlogs,
> not the commit storing the testruns JSON files. So moving the testruns to SQLite
> wouldn't eliminate the possibility of using this hash in a testrun
> identifier.

I believe you were talking about identification (unique key) for
testruns rather than testlogs.

>> OTOH, given that it
>> is the result of analysis, we have a lot of rich data to query by.  I
>> suspect we don't need much of a standard identifier scheme at all.  An
>> arbitrary nickname for query shortcutting could be a field associated
>> with the "testrun" object itself.

> I don't get it.
>
> The testrun ID could be stored in a field associated with the "testrun" object.
> It is indeed an 'arbitrary nickname' that we are free to generate when the
> testrun is created. But we do need such a nickname to exist in order to store
> tuples in the (testrun_id, keyword, value) schema, as well as to issue
> unambiguous commands like 'delete this testrun'.

As far as the database is concerned, a testrun ID can just be some
random unique integer that the user never even sees.  A user may see the
nickname or some other synthetic description (kinda like git describe?)
when needed, which may be mappable back to the related rows.  The
testrun objects would relate to the testlog commit#.  (If that
relationship happens to be one-to-one, then that commit# could be a good
nickname.)
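
Roughly, as a minimal sketch only, assuming SQLite via Python's sqlite3
module. The table and column names (testrun, nickname, testlog_commit)
and the sample values are hypothetical, not an agreed bunsen schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: the integer id is purely internal, the nickname is
# what a user would type, and testlog_commit points at the archived testlog.
db.execute("""
    create table testrun (
        id integer primary key,        -- opaque key, assigned by SQLite
        nickname text unique,          -- optional human-facing label
        testlog_commit text not null   -- commit holding the raw testlogs
    )""")
db.execute("insert into testrun (nickname, testlog_commit) values (?, ?)",
           ("example-run-1", "0123abcd"))

def resolve(nickname):
    """Map a user-visible nickname back to the internal row id (or None)."""
    row = db.execute("select id from testrun where nickname = ?",
                     (nickname,)).fetchone()
    return row[0] if row else None

def delete_testrun(nickname):
    """'delete this testrun' stays unambiguous without exposing the id."""
    db.execute("delete from testrun where id = ?", (resolve(nickname),))
```

The user only ever handles the nickname; the integer id is what other
tables (e.g. the key/value tuples) would reference.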


> In my prior email I was trying to improve on 'arbitrary hexsha' and
> design a 'nickname' that identifies the testrun somewhat and can be
> handled by human minds when necessary. The bunsen_commit_id 'hexsha'
> was sufficiently unique but it was also very confusing from a UI
> perspective since users will be thinking of commit IDs in the project
> Git repo whenever they see a bare hexsha.

Yeah.  I suspect an identification-by-query that happens to expose a
convenient nickname would work here.  And you're right, it'd be
unfortunate to have ambiguity between the testlog & testrun objects.
(OTOH it could be that the bunsen CLIs generally deal with testruns
and such things.)


>> Noting we switched to talking about testlogs - the raw test result log
>> files that are archived in git.  These (rather than analysis data
>> stored somewhere else) are indeed worth pushing & pulling &
>> transporting between bunsen installations.  I suspect that imposing a
>> naming standard here is not necessary either.  As long as we
>> standardize what bunsen consumes from these repos (every tag?  every
>> commit?  every branch?  every ref matching a given regex?), a human or
>> other tooling could decide their favorite naming convention.
>
> I'll need to think about this carefully. My intuition screams a strong
> disagreement.
>
> If we just treat the transported data as a set of unlabeled testlogs,
> and say that each Bunsen instance is going to be naming them separately
> & differently....

Well, if they are usually identified externally by a common nickname, or
by predicates like time/tool/host or something, the naming need not vary
from installation to installation.


>>> #1: Testrun Schema, very quick take1
>> [...] the list of
>> key names can be left open in the tooling core, so queries can use any
>> set of them.  i.e., the schema for "testrun" analysis objects could be:
>>
>>    (id, keyword, value)
>>
>> and then a user can find testruns by combination of keywords having
>> particular values (or range or pattern).  That can map
>> straightforwardly to a relational filter (select query) if that's how
>> we end up storing these things.
>
> As I understand it, this is a neat solution to my earlier question of needing to
> define metadata fields in an SQLite schema and end up having to deal with
> migrations, etc during the lifecycle of a Bunsen repo. Of course, it means we
> definitely don't get Django-like ORM functionality for free [...]

Yes, right.  If one doesn't reify the set of keys to describe a testrun,
they won't show up as ORM columns/fields.
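
For what it's worth, finding testruns by a combination of keyword
values does not need an ORM.  A sketch, assuming a hypothetical
testrunkv table laid out as (id, keyword, value) and made-up sample
rows; find_testruns is an illustrative helper, not an existing bunsen
function:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table testrunkv (id integer, keyword text, value text);
    insert into testrunkv values
        (1, 'arch', 'x86_64'),  (1, 'osver', 'fedora-35'),
        (2, 'arch', 'aarch64'), (2, 'osver', 'rhel-8'),
        (3, 'arch', 'x86_64'),  (3, 'osver', 'rhel-8');
""")

def find_testruns(**wanted):
    """Return ids of testruns matching every keyword=value pair given."""
    if not wanted:
        sql = "select distinct id from testrunkv order by id"
        return [r[0] for r in db.execute(sql)]
    # One subquery per condition, intersected: a testrun must satisfy all.
    clause = "select id from testrunkv where keyword = ? and value = ?"
    sql = " intersect ".join([clause] * len(wanted)) + " order by id"
    params = [x for pair in wanted.items() for x in pair]
    return [r[0] for r in db.execute(sql, params)]
```

The set of keys stays open: a new keyword needs no schema change, only
new rows.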



> [...]
> To test the 'straightforward' part of your claim about querying, what might a
> 'select' query look like for 'all configuration fields of {set of testruns}'?
>
> We would want it to produce a table along the lines of:
>
>     testrun arch    osver     kernel_ver
>     id1     x86_64  fedora-35 4.6-whatever
>     id2     aarch64 rhel-8    4.7-whatever

In SQL, it'd be a join like this:

select tr.id, kv1.value, kv2.value, kv3.value
  from testrun tr, testrunkv kv1, testrunkv kv2, testrunkv kv3
 where kv1.id = tr.id and kv1.keyword = 'arch'
   and kv2.id = tr.id and kv2.keyword = 'osver'
   and kv3.id = tr.id and kv3.keyword = 'kernel_ver';

(modulo testrun nickname).
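
A runnable sketch of that kind of self-join, assuming SQLite through
Python's sqlite3 module.  The key column is called keyword here,
following the (id, keyword, value) schema quoted earlier; the sample
rows are the ones from your table, and config_table is a hypothetical
helper:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table testrun (id integer primary key);
    create table testrunkv (id integer, keyword text, value text);
    insert into testrun (id) values (1), (2);
    insert into testrunkv values
        (1, 'arch', 'x86_64'),
        (1, 'osver', 'fedora-35'),
        (1, 'kernel_ver', '4.6-whatever'),
        (2, 'arch', 'aarch64'),
        (2, 'osver', 'rhel-8'),
        (2, 'kernel_ver', '4.7-whatever');
""")

def config_table(keys):
    """Join testrunkv once per requested key; rows are (id, v1, v2, ...)."""
    cols = ["tr.id"] + [f"kv{i}.value" for i in range(len(keys))]
    joins = [
        f"join testrunkv kv{i} on kv{i}.id = tr.id and kv{i}.keyword = ?"
        for i in range(len(keys))
    ]
    sql = "select {} from testrun tr {} order by tr.id".format(
        ", ".join(cols), " ".join(joins))
    return db.execute(sql, list(keys)).fetchall()

rows = config_table(["arch", "osver", "kernel_ver"])
```

Note that an inner join drops any testrun lacking one of the keys; a
left join would keep it with a NULL in that column.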


> (Another functionality [Q] came up for distribution & kernel version:
> we may want to store the most exact version of each component
> but then specify a granularity for analysis
> which combines some 'similar-enough' versions into
> the same configuration
> (e.g. treat all 5.x kernels as one configuration,
> then treat all 4.x kernels as another configuration).
>
> If we see a problem arise on 5.x but not 4.x,
> *then* we would want to look at the detailed history of
> changes within 5.x.)

That should be expressible in a variety of ways, even within SQL ...

   where .... and kv3.value like '4.%';


>> These could also go as additional key/value tuples into the testrun.
>> (Prefix their names with "dejagnu-" to identify the analysis/parse
>> tool that created them.)
> Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
> That could work, with an iterator like:
>
>     for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
>         key = 'dejagnu-{}-count'.format(outcome)
>         val = testrun[key]
>         yield outcome, key, val
>         ...

Yeah.  Just a place to stash summaries.

Or, why not: a whole separate derived analysis table with only
a handful of rows:
   dejagnu-TOOL-counts (testrun_id, outcome, count)
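
A sketch of such a derived table, assuming SQLite.  The table is named
dejagnu_counts here (a hyphenated name would need quoting in SQL), and
store_counts / counts_for are hypothetical helpers:

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("""
    create table dejagnu_counts (
        testrun_id integer,
        outcome text,
        "count" integer,
        primary key (testrun_id, outcome)
    )""")

def store_counts(testrun_id, outcomes):
    """Tally a list of outcomes ('PASS', 'FAIL', ...) into summary rows."""
    tally = Counter(outcomes)
    db.executemany(
        "insert or replace into dejagnu_counts values (?, ?, ?)",
        [(testrun_id, outcome, n) for outcome, n in tally.items()])

def counts_for(testrun_id):
    """Return {outcome: count} for one testrun."""
    sql = ('select outcome, "count" from dejagnu_counts '
           'where testrun_id = ?')
    return dict(db.execute(sql, (testrun_id,)))
```

Being derived data, the rows can be dropped and regenerated from the
testlogs whenever the parser changes.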


- FChE