public inbox for bunsen@sourceware.org
From: Keith Seitz <keiths@redhat.com>
To: bunsen@sourceware.org
Subject: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 15:18:44 -0700	[thread overview]
Message-ID: <30950cc2-5d7f-eb93-42b1-d1c7a9138e81@redhat.com> (raw)

Hi,

I've spent the last week (or two) playing with Bunsen, reading sources, and
otherwise attempting to comprehend how it all fits together, and I have some
initial findings on which I would like to seek comment before forging too
much further ahead.

First off, a reminder: my use case is release analysis. That is not far off
from Bunsen's intended use model at all. However, a consequence of my use
case is that I need access to /all/ the results in a test run.

To start playing around, I wrote a simple script to display the results of
a given test in a given Bunsen commit, show_test.py (uninteresting bits
removed):

----- meat of show_test.py -----

    import sys

    import bunsen

    # info and cmdline_args are defined in the elided portion of the script.
    b = bunsen.Bunsen()
    opts = b.cmdline_args(sys.argv, info=info, args=cmdline_args,
                          required_args=['test_name', 'commit'])

    testrun = b.testrun(opts.commit)
    all_tests = testrun.testcases

    found_tests = list(filter(lambda t: t['name'] == opts.test_name, all_tests))

    if found_tests:
        print("found {} matching tests for \"{}\"".format(len(found_tests),
                                                          opts.test_name))
        for t in found_tests:
            print(t)
    else:
        print("found no tests matching \"{}\"".format(opts.test_name))

----- end of show_test.py -----

The first question to ask is: Is this an oversimplified/naive implementation
of this script?

Continuing: This script takes over twenty seconds to complete on my computer.
IMO that is a show-stopping problem.

After investigating this, I've determined that the majority of the slowness
is in the deserialization of the Testrun from JSON.

Maybe there's a better way to deserialize the data from JSON? I don't know.
Certainly nothing obvious jumps out at me. [Warning: I'm no expert in the
area of JSON parsing in Python.]
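For anyone who wants to reproduce the measurement, a minimal sketch of how
I'd time the JSON parse in isolation. This is not Bunsen's actual API; the
file layout (a top-level object with a "testcases" list) is an assumption
for illustration:

```python
import json
import time

def time_json_load(path):
    """Parse a testrun JSON file and report how long json.load() took."""
    start = time.perf_counter()
    with open(path) as f:
        data = json.load(f)
    elapsed = time.perf_counter() - start
    print("parsed {} testcases in {:.2f}s".format(
        len(data.get('testcases', [])), elapsed))
    return data
```

On my machine, timing at this level is what pointed the finger at
deserialization rather than the git access itself.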

So I experimented with replacing the JSON Testrun data with a sqlite
database. Why a database? I chose this route for one simple reason:
database queries are a staple of modern *-as-a-service designs on the web,
and that is how I intend to implement a front-end for my work.

Last year, I wrote a proof-of-concept which parsed gdb.sum (much like Bunsen
does today) and output a sqlite database. Two trivial database queries could
reproduce the summary lines of a given test run (# of passed tests, # of
failed tests, etc.) in under 0.1 seconds. That shows the potential of a
database-backed data model.
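Those summary queries reduce to a single GROUP BY. A sketch of the idea; the
table and column names here are assumptions for illustration, not the actual
schema of either proof-of-concept:

```python
import sqlite3

# Hypothetical schema: one row per testcase result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testcase (run_id INTEGER, name TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO testcase VALUES (?, ?, ?)",
    [(1, "gdb.base/foo.exp", "PASS"),
     (1, "gdb.base/bar.exp", "FAIL"),
     (1, "gdb.base/baz.exp", "PASS")])

# The gdb.sum-style summary lines fall out of one grouped query:
for outcome, count in conn.execute(
        "SELECT outcome, COUNT(*) FROM testcase "
        "WHERE run_id = ? GROUP BY outcome", (1,)):
    print("# of {}: {}".format(outcome.lower(), count))
```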

This week, I've completed a proof-of-concept of this idea in Bunsen, and I'd
like to present my findings here.

There are two quantitative measures that I concentrated on: speed and size.

First up: speed. The "show_test.py" script above now completes in less than
one second (down from over 20 seconds). Even for an unoptimized
proof-of-concept, that is, IMO, adequate performance. [Additional gains
could be realized by bypassing deserialization altogether and using only
SQL queries.]
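To illustrate that last point: the show_test.py lookup above, which
deserializes every testcase and filters in Python, could become a single
indexed query. Again a sketch with an assumed schema, not the
proof-of-concept's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testcase (run_id INTEGER, name TEXT, outcome TEXT)")
# An index on the test name makes the by-name lookup cheap.
conn.execute("CREATE INDEX testcase_name ON testcase (name)")
conn.execute("INSERT INTO testcase VALUES (1, 'gdb.base/foo.exp', 'PASS')")

# Let sqlite do the filtering instead of list(filter(...)) over all tests:
rows = conn.execute(
    "SELECT name, outcome FROM testcase WHERE run_id = ? AND name = ?",
    (1, "gdb.base/foo.exp")).fetchall()
for name, outcome in rows:
    print(name, outcome)
```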

Second: size. Bunsen stores its data in a git repository located in
bunsen_upload, IIUC. I measured the size of this directory under each of the
two implementations.

I took two measurements (with consolidate_pass=False to get all test results
stored):

1) With two identical gdb.{log,sum} files (different commits). In this case,
   JSON-based Bunsen used 5.2MB of storage; the database-based approach used
   5.0MB. IMO there is no meaningful difference.

2) With eight additional, different gdb.{log,sum} files imported on top of
   #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
   That's a 6.6% difference in just ten runs.
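For reference, the 6.6% figure is just the relative growth of the
database-backed repository over the JSON-backed one:

```python
json_kb = 46340  # JSON-based storage after ten runs
db_kb = 49400    # database-based storage after the same ten runs

overhead = (db_kb - json_kb) / json_kb
print("{:.1%}".format(overhead))  # → 6.6%
```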

I consider this 6.6% storage trade-off acceptable for the massive increase
in speed (and the ability to avoid custom serialization altogether).

Does this approach seem viable as a supplement to (or replacement for?) the
JSON storage? Is it worth pursuing?

Keith



Thread overview: 5+ messages
2020-09-16 22:18 Keith Seitz [this message]
2020-09-16 23:09 ` Serhei Makarov
2020-09-18 16:16   ` Keith Seitz
2020-09-21 20:08     ` Serhei Makarov
2020-09-21 20:16       ` Keith Seitz
