From: "Serhei Makarov"
To: "Keith Seitz", "goldgold098123 at gmail dot com via Bunsen"
Subject: Re: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 19:09:18 -0400
List-Id: Bunsen mailing list
Hello Keith,

Thanks for the extensive work and experimentation. I'm not sure your show_test
script should be that slow, and I hope to get to the bottom of why that's
happening for you. At the same time, your SQLite database work should be very
useful and necessary for other reasons.

On Wed, Sep 16, 2020, at 6:18 PM, Keith Seitz via Bunsen wrote:
> The first question to ask is: Is this an oversimplified/naive implementation
> of this script?

It looks right to me. I briefly suspected the lambda might do some additional
object copying, but that doesn't seem to be the case.

> Continuing: This script takes over twenty seconds to complete on my computer.
> IMO that is a show-stopping problem.

Hmm. Very odd. A script would take very long to complete if it were a 'column
query' (e.g. looking at one testcase across *all* versions would require
parsing the JSON for every testrun -- essentially the entire repo -- and
throwing away most of that data). There are good reasons to run queries like
that, which makes your concerns legitimate.

This script doesn't look like one of those, though: it reads just one testrun.
My brief tests of your script yielded ~3s to run against a SystemTap testrun
and ~1.5s against a GDB buildbot run. Since it reads one particular testrun,
the scale of the repo is immaterial (my SystemTap repo has ~2000-3000 runs in
it).

I suspect my code for building the repo has some bug when using
consolidate_pass=False. Could you place the Git/JSON repo you built somewhere
I have access to?

(Also, you could try the +diff_runs script on your repo. If the JSON parsing
is the source of the slowdown and reading one run took 20s, logically reading
two runs should take about 40s.)

> There are two quantitative measures that I concentrated on: speed and size.
>
> First up: Speed. The "show_test.py" script above now completes in less
> than one second (down from 20 seconds). Even as an unoptimized proof-of-concept,
> that is, IMO, adequate performance. [Additional gains could be realized by
> bypassing serialization altogether and using only sql queries.]
>
> I took two measurements (with consolidate_pass=False to get all test results
> stored):

I'll need to look further into whether that option is appropriate.

> 2) With eight additional, different gdb.{log,sum} files imported on top of
> #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
> That's a 6.6% difference in just ten runs.

IMO the comparison has to be done with 100s to 1000s of similar test runs,
since Git's de-duplication must be compared to whatever SQLite does at that
scale of data. I doubt it's important, though: for this use case we have disk
space to burn, and the query speedup alone would justify keeping both forms of
storage.

> Is this an approach that seems viable as a supplement (replacement?)
> to JSON output? Is this approach something worth pursuing?

Definitely worth pursuing, due to the aforementioned possibility of 'column
queries', which I don't see any way of handling well with the design I
currently have. I'm not sure whether SQLite is better used as a replacement
for the JSON/Git storage or as a supplemental cache built from it and used to
speed up queries. (Also, the original log files must be retained in any case.)

All the best,
      Serhei
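
P.S. To make the 'supplemental cache' idea above a bit more concrete, here is
a rough sketch of how a cache table could answer a 'column query' for one
testcase across all testruns without parsing the JSON for every run. The table
layout, column names, and file name are hypothetical illustrations, not
anything Bunsen implements today:

    import sqlite3

    # Hypothetical cache layout: one row per (testrun, testcase) outcome,
    # populated whenever a testrun's .sum/.log files are parsed into the repo.
    conn = sqlite3.connect("bunsen-cache.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS testcase_outcomes (
            testrun_id TEXT,   -- e.g. the commit id of the testrun in the Git repo
            exp_file   TEXT,   -- e.g. 'gdb.base/break.exp'
            subtest    TEXT,   -- subtest name from the .sum file
            outcome    TEXT    -- PASS/FAIL/KFAIL/...
        )
    """)

    # The 'column query': one testcase across *all* testruns, answered from
    # the cache instead of reading every testrun's JSON.
    rows = conn.execute(
        "SELECT testrun_id, outcome FROM testcase_outcomes"
        " WHERE exp_file = ? AND subtest = ?",
        ("gdb.base/break.exp", "break at main"),
    ).fetchall()
    for testrun_id, outcome in rows:
        print(testrun_id, outcome)

In a scheme like this the Git/JSON data would stay the source of truth, and
the cache could always be rebuilt from it.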