public inbox for bunsen@sourceware.org
From: "Serhei Makarov" <me@serhei.io>
To: "Keith Seitz" <keiths@redhat.com>,
	"goldgold098123 at gmail dot com via Bunsen"
	<bunsen@sourceware.org>
Subject: Re: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 19:09:18 -0400
Message-ID: <ae67ec10-d36b-498d-bbef-cc49c1503dbd@www.fastmail.com>
In-Reply-To: <30950cc2-5d7f-eb93-42b1-d1c7a9138e81@redhat.com>

Hello Keith,

Thanks for the extensive work and experimentation.
I don't think your show_test script should be that slow, and I hope to get to the bottom of why that's happening for you.
At the same time, your SQLite database work should be very useful and necessary for other reasons.

On Wed, Sep 16, 2020, at 6:18 PM, Keith Seitz via Bunsen wrote:
> The first question to ask is: Is this an oversimplified/naive implementation
> of this script?
It looks right to me. I briefly suspected the lambda might do some additional
object copying, but that doesn't seem to be the case.

> Continuing: This script takes over twenty seconds to complete on my computer.
> IMO that is a show-stopping problem.
Hmm. Very odd.

A script would take very long to complete if it were a 'column query'
(e.g. looking at one testcase across *all* versions would require parsing the JSON
for every testrun -- essentially the entire repo -- and throwing away most of that data).
There are good reasons to run queries like that, which makes your concerns legitimate.
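
To illustrate the distinction, here is a minimal sketch. The data layout
(one JSON document per testrun with a "testcases" list) and all function names
are assumptions made up for this example, not Bunsen's actual schema or API.

```python
import json

# Hypothetical stand-in for a repo: one JSON document per testrun.
# The "testcases"/"name"/"outcome" layout is assumed for illustration only.
def make_repo(n_runs, n_tests):
    return {
        f"run{r}": json.dumps({
            "testcases": [
                {"name": f"test{t}", "outcome": "PASS" if (r + t) % 7 else "FAIL"}
                for t in range(n_tests)
            ]
        })
        for r in range(n_runs)
    }

def row_query(repo, run_id):
    """Read one testrun: parses exactly one JSON document."""
    return json.loads(repo[run_id])["testcases"]

def column_query(repo, test_name):
    """Follow one testcase across all runs: parses every JSON document
    and discards everything except the matching entry."""
    results = {}
    for run_id, blob in repo.items():
        for tc in json.loads(blob)["testcases"]:
            if tc["name"] == test_name:
                results[run_id] = tc["outcome"]
                break
    return results

repo = make_repo(n_runs=50, n_tests=200)
one_run = row_query(repo, "run0")        # cost independent of repo size
history = column_query(repo, "test3")    # cost grows with the number of runs
```

The row query touches one document no matter how many runs exist, while the
column query has to parse all of them to extract a single entry from each.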

Your script doesn't look like a column query, though. It reads just one testrun.
My brief tests of your script took ~3s to run against a SystemTap testrun
and ~1.5s against a GDB buildbot run.
Since it reads one particular testrun, the scale of the repo is immaterial
(my SystemTap repo has ~2000-3000 runs in it).

I suspect my code for building the repo has a bug when consolidate_pass=False is used.
Could you place the Git/JSON repo you built somewhere I have access to?

(Also, you could try the +diff_runs script on your repo. If the JSON parsing is the source
of the slowdown and reading one run took 20s,
then reading two runs should take about 40s.)
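
The scaling argument can also be checked outside of Bunsen. The sketch below
times plain JSON parsing for one and for two copies of a synthetic testrun;
the data and the timed_load helper are hypothetical, and real numbers would
have to come from the actual repo.

```python
import json
import time

def timed_load(blobs):
    """Parse each JSON string and return (parsed objects, elapsed seconds)."""
    start = time.perf_counter()
    parsed = [json.loads(b) for b in blobs]
    return parsed, time.perf_counter() - start

# Synthetic stand-in for one large testrun (assumed layout, not Bunsen's).
blob = json.dumps(
    {"testcases": [{"name": f"t{i}", "outcome": "PASS"} for i in range(100_000)]}
)

parsed_one, t_one = timed_load([blob])
parsed_two, t_two = timed_load([blob, blob])
print(f"one run: {t_one:.3f}s, two runs: {t_two:.3f}s, ratio: {t_two / t_one:.1f}")
```

If parsing dominates, the ratio should be close to 2. A ratio far from that
would point at some other cost, such as repo access or object construction.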

> There are two quantitative measures that I concentrated on: speed and size.
> 
> First up: Speed. The "show_test.py" script above now completes in less
> than one second (down from 20 seconds). Even as an unoptimized proof-of-concept,
> that is, IMO, adequate performance. [Additional gains could be realized by
> bypassing serialization altogether and using only sql queries.]
> 
> I took two measurements (with consolidate_pass=False to get all test results
> stored):
I'll need to look further into whether that option is appropriate.

> 2) With eight additional, different gdb.{log,sum} files imported on top of
>    #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
>    That's a 6.6% difference in just ten runs.
IMO the comparison has to be done with 100s to 1000s of similar test runs,
since Git's de-duplication must be compared to whatever SQLite does
at that scale of data.
I doubt it's important, though: for this use case we have disk space to burn,
and the query speedup even justifies keeping both forms of storage.
 
> Is this an approach that seems viable as a supplement (replacement?)
> to JSON output? Is this approach something worth pursuing?
Definitely worth pursuing, due to the aforementioned possibility of 'column queries',
which I don't see any way of handling well with the design I currently have.

I'm not sure if SQLite is better used as a replacement for the JSON/Git storage
or as a supplemental cache built from it and used to speed up queries.
(Also, the original log files must be retained in any case.)
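
As a rough sketch of the 'supplemental cache' option: the JSON stays the source
of truth, and an SQLite table is derived from it to answer queries quickly.
The table layout, the JSON layout, and the helper names below are all
assumptions for illustration, not an existing Bunsen interface.

```python
import json
import sqlite3

def build_cache(conn, repo):
    """Populate a flat results table from per-testrun JSON documents."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (run_id TEXT, name TEXT, outcome TEXT)"
    )
    # An index on the testcase name makes per-testcase lookups cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_results_name ON results (name)")
    for run_id, blob in repo.items():
        rows = [
            (run_id, tc["name"], tc["outcome"])
            for tc in json.loads(blob)["testcases"]
        ]
        conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    conn.commit()

def history_of(conn, name):
    """Return (run_id, outcome) for one testcase across all cached runs."""
    cur = conn.execute(
        "SELECT run_id, outcome FROM results WHERE name = ? ORDER BY run_id",
        (name,),
    )
    return cur.fetchall()

# Tiny hypothetical repo: two testruns, two testcases each.
repo = {
    "run1": json.dumps({"testcases": [
        {"name": "a.exp", "outcome": "PASS"},
        {"name": "b.exp", "outcome": "PASS"},
    ]}),
    "run2": json.dumps({"testcases": [
        {"name": "a.exp", "outcome": "FAIL"},
        {"name": "b.exp", "outcome": "PASS"},
    ]}),
}

conn = sqlite3.connect(":memory:")
build_cache(conn, repo)
print(history_of(conn, "a.exp"))
```

The JSON is parsed once when the cache is built; afterwards a per-testcase
history is a single indexed SELECT rather than a pass over every testrun.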

All the best,
     Serhei
