public inbox for bunsen@sourceware.org
* Initial findings of bunsen performance and questions
@ 2020-09-16 22:18 Keith Seitz
  2020-09-16 23:09 ` Serhei Makarov
From: Keith Seitz @ 2020-09-16 22:18 UTC (permalink / raw)
  To: bunsen

Hi,

I've spent the last week (or two) playing with Bunsen, reading sources, and
otherwise attempting to comprehend how it all fits together, and I have some
initial findings on which I would like to seek comment before forging too
much further ahead.

First off, a reminder: my use case is release analysis. That is not far off
from the Bunsen use model, but one consequence of my use case is that I need
access to /all/ the results in a test run.

To start playing around, I wrote a simple script to display the results of
a given test in a given Bunsen commit, show_test.py (uninteresting bits
removed):

----- meat of show_test.py -----

    b = bunsen.Bunsen()
    opts = b.cmdline_args(sys.argv, info=info, args=cmdline_args,
                          required_args=['test_name', 'commit'])

    testrun = b.testrun(opts.commit)
    all_tests = testrun.testcases

    found_tests = list(filter(lambda t: t['name'] == opts.test_name, all_tests))

    if found_tests:
        print("found {} matching tests for \"{}\"".format(len(found_tests),
                                                          opts.test_name))
        for t in found_tests:
            print(t)
    else:
        print("found no tests matching \"{}\"".format(opts.test_name))

----- end of show_test.py -----

The first question to ask is: Is this an oversimplified/naive implementation
of this script?

Continuing: This script takes over twenty seconds to complete on my computer.
IMO that is a show-stopping problem.

After investigating this, I've determined that the majority of the slowness
is in the deserialization of the Testrun from JSON.

Maybe there's a better way to deserialize the data from JSON? I don't know.
Certainly nothing obvious jumps out at me. [Warning: I'm no expert in the
area of JSON parsing in Python.]
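
(For what it's worth, a simple way to confirm where the time goes is the
standard cProfile module. This is a generic sketch, reusing the `b' and
`opts' objects from show_test.py above; nothing Bunsen-specific:)

    # Profile just the Testrun lookup and print the top cumulative-time calls.
    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    testrun = b.testrun(opts.commit)   # the call under suspicion
    profiler.disable()

    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)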

So I experimented with replacing the JSON Testrun data with a sqlite
database. Why a database? I chose this route for one simple reason:
database queries are the standard for modern *-as-a-service designs, which
predominate on the web, and that is how I intend to implement a front-end
for my work.

Last year, I wrote a proof-of-concept which parsed gdb.sum (much like Bunsen
does today) and output a sqlite database. Two trivial database queries could
reproduce the summary lines of a given test run (# of passed tests, # of
failed tests, etc.) in under 0.1 seconds. That shows the potential of a
database-backed data model.
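
(To illustrate what I mean by "trivial queries": the schema below is made up
for the example -- one row per testcase result -- and is not the actual
schema from that proof-of-concept.)

    import sqlite3

    conn = sqlite3.connect('gdb-results.db')   # hypothetical database file
    run_id = 1                                 # stand-in for a real run id

    # One GROUP BY reproduces the summary counts (# of passes, # of failures, ...).
    for outcome, count in conn.execute(
            'SELECT outcome, COUNT(*) FROM testcases'
            ' WHERE run_id = ? GROUP BY outcome', (run_id,)):
        print('# of {}: {}'.format(outcome, count))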

This week, I've completed a proof-of-concept of this idea in Bunsen, and I'd
like to present my findings here.

There are two quantitative measures that I concentrated on: speed and size.

First up: Speed. The "show_test.py" script above now completes in less
than one second (down from 20 seconds). Even as an unoptimized proof-of-concept,
that is, IMO, adequate performance. [Additional gains could be realized by
bypassing the Testrun deserialization altogether and using only SQL queries.]
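
(For example, with a hypothetical testcases(run_id, name, subtest, outcome)
table, the lookup done by show_test.py reduces to a single query. Again,
just a sketch with made-up names:)

    import sqlite3

    conn = sqlite3.connect('bunsen.db')          # hypothetical database file
    run_id = 1                                   # stand-in for opts.commit
    test_name = 'gdb.base/break.exp'             # stand-in for opts.test_name

    rows = conn.execute(
        'SELECT name, outcome, subtest FROM testcases'
        ' WHERE run_id = ? AND name = ?', (run_id, test_name)).fetchall()

    if rows:
        print('found {} matching tests for "{}"'.format(len(rows), test_name))
        for row in rows:
            print(row)
    else:
        print('found no tests matching "{}"'.format(test_name))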

Second: Size. Bunsen stores its data in a git repository located in
bunsen_upload, IIUC. I measured the size of this directory for each of the
two implementations.

I took two measurements (with consolidate_pass=False to get all test results
stored):

1) With two identical gdb.{log,sum} files (different commits). In this case,
   JSON-based Bunsen used 5.2MB of storage. The database-based approach used
   5.0MB. IMO there is no real difference.

2) With eight additional, different gdb.{log,sum} files imported on top of
   #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
   That's a 6.6% difference in just ten runs.

I consider this 6.6% storage trade-off acceptable for the massive increase
in speed (and the ability to avoid custom serialization altogether).

Is this an approach that seems viable as a supplement (replacement?)
to JSON output? Is this approach something worth pursuing?

Keith


* Re: Initial findings of bunsen performance and questions
  2020-09-16 22:18 Initial findings of bunsen performance and questions Keith Seitz
@ 2020-09-16 23:09 ` Serhei Makarov
  2020-09-18 16:16   ` Keith Seitz
From: Serhei Makarov @ 2020-09-16 23:09 UTC (permalink / raw)
  To: Keith Seitz, goldgold098123 at gmail dot com via Bunsen

Hello Keith,

Thanks for the extensive work and experimentation.
I'm not sure your show_test script should be that slow, and I hope to get to the bottom of why that's happening for you.
At the same time, your SQLite database work should be very useful, and necessary for other reasons.

On Wed, Sep 16, 2020, at 6:18 PM, Keith Seitz via Bunsen wrote:
> The first question to ask is: Is this an oversimplified/naive implementation
> of this script?
It looks right to me. I briefly suspected the lambda might do some additional
object copying, but that doesn't seem to be the case.

> Continuing: This script takes over twenty seconds to complete on my computer.
> IMO that is a show-stopping problem.
Hmm. Very odd.

A script would take very long to complete if it were a 'column query'
(e.g. looking at one testcase across *all* versions would require parsing the JSON
for every testrun -- essentially the entire repo -- and throwing away most of that data).
There are good reasons to run queries like that, which makes your concerns legitimate.
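
(A sketch of the shape of such a query -- the testrun iteration here is
pseudocode, not the exact bunsen API:)

    # 'Column query': pull one testcase out of every testrun in the repo.
    # Every testrun's JSON must be parsed in full just to keep a few rows.
    matches = []
    for testrun in all_testruns:              # i.e. walk the whole repo
        for tc in testrun.testcases:          # full parse per testrun ...
            if tc['name'] == 'gdb.base/break.exp':
                matches.append((testrun, tc)) # ... keeping only this one test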

Your script doesn't look like that case, though; it reads just one testrun. My brief tests of it
took ~3s against a SystemTap testrun and ~1.5s against a GDB buildbot run.
Since it reads one particular testrun, the size of the repo is immaterial
(my SystemTap repo has ~2000-3000 runs in it).

I suspect my code for building the repo has some bug when using consolidate_pass=False.
Could you place the Git/JSON repo you built somewhere I have access to?

(Also, you could try the +diff_runs script on your repo. If the JSON parsing is the source
of the slowdown and reading one run took 20s,
logically reading two runs would take you 40s.)

> There are two quantitative measures that I concentrated on: speed and size.
> 
> First up: Speed. The "show_test.py" script above now completes in less
> than one second (down from 20 seconds). Even as an unoptimized proof-of-concept,
> that is, IMO, adequate performance. [Additional gains could be realized by
> bypassing the Testrun deserialization altogether and using only SQL queries.]
> 
> I took two measurements (with consolidate_pass=False to get all test results
> stored):
I'll need to look further into whether that option is appropriate.

> 2) With eight additional, different gdb.{log,sum} files imported on top of
>    #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
>    That's a 6.6% difference in just ten runs.
IMO the comparison has to be done with 100s to 1000s of similar test runs
since Git's de-duplication must be compared to whatever SQLite does,
at that scale of data.
I doubt it's important, though; for this use case we have disk space to burn,
and the query speedup would even justify keeping both forms of storage.
 
> Is this an approach that seems viable as a supplement (replacement?)
> to JSON output? Is this approach something worth pursuing?
Definitely worth pursuing due to the aforementioned possibility of 'column queries'
which I don't see any way of handling well with the design I currently have.

I'm not sure if SQLite is better used as a replacement for the JSON/Git storage
or as a supplemental cache built from it and used to speed up queries.
(Also, the original log files must be retained in any case.)

All the best,
     Serhei

* Re: Initial findings of bunsen performance and questions
  2020-09-16 23:09 ` Serhei Makarov
@ 2020-09-18 16:16   ` Keith Seitz
  2020-09-21 20:08     ` Serhei Makarov
From: Keith Seitz @ 2020-09-18 16:16 UTC (permalink / raw)
  To: Serhei Makarov; +Cc: bunsen

On 9/16/20 4:09 PM, Serhei Makarov wrote:

> On Wed, Sep 16, 2020, at 6:18 PM, Keith Seitz via Bunsen wrote:
>> The first question to ask is: Is this an oversimplified/naive implementation
>> of this script?
> It looks right to me. I briefly suspected the lambda might do some additional
> object copying, but that doesn't seem to be the case.

I profiled the code at one time to ascertain where all the time was being
spent. I don't have the specific results handy, but IIRC, most of the time
was being spent in Testrun.__init__, specifically:

            for field in defer_fields:
                self[field] = self._deserialize_testrun_field(field, self[field],
                                                              cursor_commit_ids)


> I suspect my code for building the repo has some bug when using consolidate_pass=False.
> Could you place the Git/JSON repo you built somewhere I have access to?

Yes, I will send you (privately) what I've been benchmarking. This is a nearly unaltered
bunsen repo, but I've added a few simple patches to fix some existing problems.

> (Also, you could try the +diff_runs script on your repo. If the JSON
> parsing is the source of the slowdown and reading one run took 20s, 
> logically reading two runs would take you 40s.)

That is exactly what happens:

$ time ./bunsen.py +diff_runs 42563db d2a72bd
...
real    0m57.300s
user    0m49.599s
sys     0m17.683s

> IMO the comparison has to be done with 100s to 1000s of similar test runs
> since Git's de-duplication must be compared to whatever SQLite does,
> at that scale of data.
> I doubt it's important, though; for this use case we have disk space to burn,
> and the query speedup would even justify keeping both forms of storage.

Indeed -- the sample space is too small, but I just wanted to get a feeling for
how this is going.

>> Is this an approach that seems viable as a supplement (replacement?)
>> to JSON output? Is this approach something worth pursuing?
> Definitely worth pursuing due to the aforementioned possibility of 'column queries'
> which I don't see any way of handling well with the design I currently have.
> 
> I'm not sure if SQLite is better used as a replacement for the JSON/Git storage
> or as a supplemental cache built from it and used to speed up queries.
> (Also, the original log files must be retained in any case.)

All my current proof-of-concept does is replace the *test* data (.JSON -> .db).
So there are still JSON files used to describe things like the Testrun metadata;
I haven't attempted to change that since I haven't fully investigated it.

Right now, my plan is to make this a configurable option. It really only
needs to be configured when data is imported; we should be able to otherwise
handle it transparently behind the scenes.

So I will just continue on my way and try to get something review-ready.

Thank you for your input!

Keith


* Re: Initial findings of bunsen performance and questions
  2020-09-18 16:16   ` Keith Seitz
@ 2020-09-21 20:08     ` Serhei Makarov
  2020-09-21 20:16       ` Keith Seitz
From: Serhei Makarov @ 2020-09-21 20:08 UTC (permalink / raw)
  To: Keith Seitz; +Cc: Bunsen

On Fri, Sep 18, 2020, at 12:16 PM, Keith Seitz wrote:
> Yes, I will send you (privately) what I've been benchmarking. This is 
> nearly unaltered
> bunsen repo, but I've added a few simple patches to fix some existing 
> problems.
The repo looked ok, so I compared the time to load 10,000 testcases
while skipping different parts of the testrun deserialization process.
I found that the main performance culprit is the deserialization code in Cursor,
which does a regexp match operation for every one of the 60,000 testcase results.
That neatly explains the slowdown.

For the record: as I understood, you wanted to store individual testcases
(consolidate_pass=False) since you wanted to have a Cursor giving the line
range for every passing testcase in the .log.

So fixing the Cursor deserialization performance is really necessary.

Thankfully, it's possible to make the regexp parsing into an on-demand operation
that occurs when a particular Cursor's line_start/line_end/testlog fields are accessed.
commit 2ed3d13f76c3 should solve the issue, reducing the observed 58sec parsing time to 2s.
https://sourceware.org/git/?p=bunsen.git;a=commitdiff;h=2ed3d13f76c37418cfd4180b149bca0e533ec92f
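
(Roughly, it's the usual lazy-attribute pattern. The sketch below is generic
Python to show the idea; it is not the actual Cursor code, and the serialized
format shown is made up:)

    import re

    # Hypothetical serialized form "logfile:start-end", parsed only on demand.
    _cursor_re = re.compile(r'(?P<testlog>[^:]+):(?P<start>\d+)-(?P<end>\d+)')

    class LazyCursor:
        def __init__(self, raw):
            self._raw = raw          # keep the raw string at load time
            self._match = None       # regexp work deferred until needed

        def _parse(self):
            if self._match is None:
                self._match = _cursor_re.match(self._raw)
            return self._match

        @property
        def line_start(self):
            return int(self._parse().group('start'))

        @property
        def line_end(self):
            return int(self._parse().group('end'))

        @property
        def testlog(self):
            return self._parse().group('testlog')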

* Re: Initial findings of bunsen performance and questions
  2020-09-21 20:08     ` Serhei Makarov
@ 2020-09-21 20:16       ` Keith Seitz
From: Keith Seitz @ 2020-09-21 20:16 UTC (permalink / raw)
  To: Serhei Makarov; +Cc: Bunsen

On 9/21/20 1:08 PM, Serhei Makarov wrote:

> For the record: as I understood, you wanted to store individual testcases
> (consolidate_pass=False) since you wanted to have a Cursor giving the line
> range for every passing testcase in the .log.

Yes, that's correct.

> So fixing the Cursor deserialization performance is really necessary.
> 
> Thankfully, it's possible to make the regexp parsing into an on-demand operation
> that occurs when a particular Cursor's line_start/line_end/testlog fields are accessed.
> commit 2ed3d13f76c3 should solve the issue, reducing the observed 58sec parsing time to 2s.
> https://sourceware.org/git/?p=bunsen.git;a=commitdiff;h=2ed3d13f76c37418cfd4180b149bca0e533ec92f
> 

Awesome. That fixes things nicely (and sufficiently). I think this will
simplify the API design/implementation for database-based use cases. 

Thanks,

Keith

