From: fche@redhat.com (Frank Ch. Eigler)
To: "Serhei Makarov"
Cc: bunsen@sourceware.org
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
References: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com> <20220310180414.GC28310@redhat.com> <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com>
Date: Thu, 10 Mar 2022 18:00:58 -0500
In-Reply-To: <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com> (Serhei Makarov's message of "Thu, 10 Mar 2022 15:00:13 -0500")
Message-ID: <87ee39fm51.fsf@redhat.com>

> (When thinking in SQLite first, I can immediately think of a number of
> 'metaknowledge' bits that are currently hardcoded in the corresponding parsing
> scripts, but might conveniently be maintained and extended in tables.
> [...])

(Pure configuration could well just live outside too.)

>>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>>> identifier? We could conceivably allow a single set of testlogs to
>>> define several testruns for the *same* project, which would break the
>>> uniqueness property for +
>>
>> Given that we're reconsidering use of git for purposes of storing this
>> analysis data, there is no obvious hash for it.

> Disagree.
> The bunsen_commit_id hash is based on the commit storing the testlogs,
> not the commit storing the testruns JSON files. So moving the testruns to SQLite
> wouldn't eliminate the possibility of using this hash in a testrun
> identifier.

I believe you were talking about identification (unique key) for
testruns rather than testlogs.

>> OTOH, given that it
>> is the result of analysis, we have a lot of rich data to query by. I
>> suspect we don't need much of a standard identifier scheme at all. An
>> arbitrary nickname for query shortcutting could be a field associated
>> with the "testrun" object itself.

> I don't get it.
>
> The testrun ID could be stored in a field associated with the "testrun" object.
> It is indeed an 'arbitrary nickname' that we are free to generate when the
> testrun is created. But we do need such a nickname to exist in order to store
> tuples in the (testrun_id, keyword, value) schema, as well as to issue
> unambiguous commands like 'delete this testrun'.

As far as the database is concerned, a testrun ID can just be some
random unique integer that the user never even sees. A user may see
the nickname or some other synthetic description (kinda like git
describe?) when needed, which may be mappable back to the related
rows. The testrun objects would relate to the testlog commit#. (If
that relationship happens to be one-to-one, then that commit# could be
a good nickname.)

> In my prior email I was trying to improve on 'arbitrary hexsha' and
> design a 'nickname' that identifies the testrun somewhat and can be
> handled by human minds when necessary. The bunsen_commit_id 'hexsha'
> was sufficiently unique but it was also very confusing from a UI
> perspective since users will be thinking of commit IDs in the project
> Git repo whenever they see a bare hexsha.

Yeah. I suspect an identification-by-query that happens to expose a
convenient nickname would work here.
And you're right, it'd be unfortunate to have
ambiguity between the testlog & testrun objects. (OTOH it could be
that the bunsen CLIs generally deal with testrun such things.)

>> Noting we switched to talking about testlogs - the raw test result log
>> files that are archived in git. These (rather than analysis data
>> stored somewhere else) are indeed worth pushing & pulling &
>> transporting between bunsen installations. I suspect that imposing a
>> naming standard here is not necessary either. As long as we
>> standardize what bunsen consumes from these repos (every tag? every
>> commit? every branch? every ref matching a given regex?), a human or
>> other tooling could decide their favorite naming convention.
>
> I'll need to think about this carefully. My intuition screams a strong
> disagreement.
>
> If we just treat the transported data as a set of unlabeled testlogs,
> and say that each Bunsen instance is going to be naming them separately
> & differently....

Well, if they are externally identified usually by common nickname, or
by predicates like time/tool/host or something, it need not vary from
installation to installation.

>>> #1: Testrun Schema, very quick take1
>> [...] the list of
>> key names can be left open in the tooling core, so queries can use any
>> set of them. i.e., the schema for "testrun" analysis objects could be:
>>
>> (id, keyword, value)
>>
>> and then a user can find testruns by combination of keywords having
>> particular values (or range or pattern). That can map
>> straightforwardly to a relational filter (select query) if that's how
>> we end up storing these things.
>
> As I understand it, this is a neat solution to my earlier question of needing to
> define metadata fields in an SQLite schema and end up having to deal with
> migrations, etc during the lifecycle of a Bunsen repo. Of course, it means we
> definitely don't get Django-like ORM functionality for free [...]

Yes, right.
If one doesn't reify the set of keys to describe a testrun,
they won't show up as ORM columns/fields.

> [...]
> To test the 'straightforward' part of your claim about querying, what might a
> 'select' query look like for 'all configuration fields of {set of testruns}'?
>
> We would want it to produce a table along the lines of:
>
> testrun  arch     osver      kernel_ver
> id1      x86_64   fedora-35  4.6-whatever
> id2      aarch64  rhel-8     4.7-whatever

In SQL, it'd be a join like this:

   select tr.id, kv1.value, kv2.value, kv3.value
   from testrun tr, testrunkv kv1, testrunkv kv2, testrunkv kv3
   where kv1.id = tr.id and kv1.name = 'arch'
     and kv2.id = tr.id and kv2.name = 'osver'
     and kv3.id = tr.id and kv3.name = 'kernel_ver';

(modulo testrun nickname).

> (Another functionality [Q] came up for distribution & kernel version:
> we may want to store the most exact version of each component
> but then specify a granularity for analysis
> which combines some 'similar-enough' versions into
> the same configuration
> (e.g. treat all 5.x kernels as one configuration,
> then treat all 4.x kernels as another configuration).
>
> If we see a problem arise on 5.x but not 4.x,
> *then* we would want to look at the detailed history of
> changes within 5.x.)

That should be expressible in a variety of ways, even within sql

   ... where .... and kv3.value like '4.%';

>> These could also go as additional key/value tuples into the testrun.
>> (Prefix their names with "dejagnu-" to identify the analysis/parse
>> tool that created them.)

> Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
> That could work, with an iterator like:
>
> for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
>     key = 'dejagnu-{}-count'.format(outcome)
>     val = testrun[key]
>     yield outcome, key, val
> ...

Yeah. Just a place to stash summaries. Or: why not, a whole separate
derived analysis table with only a handful of rows:

   dejagnu-TOOL-counts (testrun_id, outcome, count)

- FChE