public inbox for bunsen@sourceware.org
* Initial findings of bunsen performance and questions
@ 2020-09-16 22:18 Keith Seitz
  2020-09-16 23:09 ` Serhei Makarov
From: Keith Seitz @ 2020-09-16 22:18 UTC (permalink / raw)
  To: bunsen

Hi,

I've spent the last week or two playing with Bunsen, reading sources, and
otherwise attempting to comprehend how it all fits together, and I have some
initial findings on which I would like to seek comment before forging too
much further ahead.

First off, a reminder. My use case is release analysis. It's not too far off
from the Bunsen use model at all. However, a consequence of my use case is
that I need access to /all/ the results in the test run.

To start playing around, I wrote a simple script to display the results of
a given test in a given Bunsen commit, show_test.py (uninteresting bits
removed):

----- meat of show_test.py -----

    import sys

    import bunsen

    # The definitions of info and cmdline_args are among the removed bits.
    b = bunsen.Bunsen()
    opts = b.cmdline_args(sys.argv, info=info, args=cmdline_args,
                          required_args=['test_name', 'commit'])

    # Load the whole testrun for the commit, then scan every testcase.
    testrun = b.testrun(opts.commit)
    all_tests = testrun.testcases

    found_tests = [t for t in all_tests if t['name'] == opts.test_name]

    if found_tests:
        print("found {} matching tests for \"{}\"".format(len(found_tests),
                                                          opts.test_name))
        for t in found_tests:
            print(t)
    else:
        print("found no tests matching \"{}\"".format(opts.test_name))

----- end of show_test.py -----

The first question to ask is: Is this an oversimplified/naive implementation
of this script?

Continuing: This script takes over twenty seconds to complete on my computer.
IMO that is a show-stopping problem.

After investigating this, I've determined that the majority of the slowness
is in the deserialization of Testrun (from JSON).

Maybe there's a better way to deserialize the data from JSON? I don't know.
Certainly nothing obvious jumps out at me. [Warning: I'm no expert in the
area of JSON parsing in Python.]
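
One way to narrow down where the time goes is to time the raw JSON decode
separately from everything else. The following is only a sketch on synthetic
data: the field names and the testcase count are assumptions, not Bunsen's
real on-disk format. If the bare json.loads() call turns out to be fast, the
cost is more likely in building the Testrun objects than in parsing itself.

```python
import json
import time

# Synthetic stand-in for a testrun's testcase list. The field names and
# the count are assumptions for illustration only.
testcases = [
    {
        "name": "gdb.base/test{}.exp".format(i),
        "outcome": "PASS",
        "subtest": "step {}".format(i),
    }
    for i in range(50000)
]
blob = json.dumps(testcases)

# Time only the decode step, nothing else.
start = time.perf_counter()
decoded = json.loads(blob)
elapsed = time.perf_counter() - start

print("decoded {} testcases in {:.3f}s".format(len(decoded), elapsed))
```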

So I experimented with replacing the JSON Testrun data with an SQLite
database. Why a database? I chose this route for one simple reason:
database queries are standard in the modern *-as-a-service designs that
predominate on the web, and this is how I intend to implement a front-end
for my work.

Last year, I wrote a proof-of-concept which parsed gdb.sum (much like Bunsen
does today) and output an SQLite database. Two trivial database queries could
reproduce the summary lines of the given test run (e.g., # passed tests,
# failed, etc.) in under 0.1 seconds. This shows the potential of
database-based data models.
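
To illustrate the kind of query meant here, the sketch below counts results
per outcome with Python's sqlite3 module. The table name, column names, and
sample rows are hypothetical, not the schema of the proof-of-concept, and it
uses a single grouped query rather than the two queries mentioned above.

```python
import sqlite3

# Hypothetical schema: one row per test result. The table and column
# names are assumptions, not the proof-of-concept's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [
        ("gdb.base/break.exp: run to main", "PASS"),
        ("gdb.base/break.exp: step over", "PASS"),
        ("gdb.base/list.exp: list line 1", "FAIL"),
        ("gdb.base/list.exp: list line 2", "XFAIL"),
    ],
)


def summarize(conn):
    """Return {outcome: count}, similar to the tail of a .sum file."""
    rows = conn.execute(
        "SELECT outcome, COUNT(*) FROM results GROUP BY outcome"
    )
    return dict(rows)


summary = summarize(conn)
for outcome, count in sorted(summary.items()):
    print("# of {} results: {}".format(outcome, count))
```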

This week, I've completed a proof-of-concept of this idea in Bunsen, and I'd
like to present my findings here.

There are two quantitative measures that I concentrated on: speed and size.

First up: Speed. The "show_test.py" script above now completes in less
than one second (down from over 20 seconds). Even as an unoptimized
proof-of-concept, that is, IMO, adequate performance. [Additional gains could
be realized by bypassing serialization altogether and using only SQL queries.]
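
As a rough idea of what "only SQL queries" could look like for the
show_test.py lookup, the sketch below lets SQLite filter by test name instead
of loading every testcase into Python first. The schema, the index, and the
find_tests() helper are all hypothetical and are not part of Bunsen.

```python
import sqlite3


def find_tests(conn, test_name):
    """Fetch only the rows for one test name; SQLite does the filtering."""
    cur = conn.execute(
        "SELECT name, outcome FROM results WHERE name = ?", (test_name,)
    )
    return cur.fetchall()


# Hypothetical schema with an index on the column being searched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, outcome TEXT)")
conn.execute("CREATE INDEX results_name ON results (name)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [
        ("gdb.base/break.exp: run to main", "PASS"),
        ("gdb.base/list.exp: list line 1", "FAIL"),
    ],
)

found = find_tests(conn, "gdb.base/break.exp: run to main")
if found:
    print("found {} matching tests".format(len(found)))
    for row in found:
        print(row)
else:
    print("found no matching tests")
```

The parameterized "?" placeholder keeps the test name out of the SQL text,
which matters if the name ever comes from a web front-end.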

Second: Size. Bunsen stores its data in a git repository located in bunsen_upload,
IIUC. I've measured the size of this directory under each of the two implementations.

I took two measurements (with consolidate_pass=False to get all test results
stored):

1) With two identical gdb.{log,sum} files (different commits). In this case,
   JSON-based Bunsen used 5.2MB of storage. The database-based approach
   used 5.0MB. IMO there is no meaningful difference.

2) With eight additional, different gdb.{log,sum} files imported on top of
   #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
   That's a 6.6% difference in just ten runs.
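
For reference, the 6.6% figure follows directly from the two sizes in
measurement #2. The variable names below are just for illustration.

```python
# Sizes from measurement #2, in kB.
json_kb = 46340
db_kb = 49400

# Relative storage overhead of the database approach over JSON.
overhead_pct = (db_kb - json_kb) / json_kb * 100
print("{:.1f}%".format(overhead_pct))
```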

I consider this 6.6% storage trade-off acceptable for the massive increase
in speed (and the ability to avoid custom serialization altogether).

Is this an approach that seems viable as a supplement (replacement?)
to JSON output? Is this approach something worth pursuing?

Keith




Thread overview: 5+ messages
2020-09-16 22:18 Initial findings of bunsen performance and questions Keith Seitz
2020-09-16 23:09 ` Serhei Makarov
2020-09-18 16:16   ` Keith Seitz
2020-09-21 20:08     ` Serhei Makarov
2020-09-21 20:16       ` Keith Seitz
