public inbox for systemtap@sourceware.org
* [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64
@ 2016-11-14 18:57 dsmith at redhat dot com
  2016-11-16 21:26 ` [Bug runtime/20820] " dsmith at redhat dot com
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-14 18:57 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

            Bug ID: 20820
           Summary: another "soft lockup" BUG on RHEL7 ppc64
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: dsmith at redhat dot com
  Target Milestone: ---

When running the testsuite in parallel mode, I'm fairly consistently seeing the
following kernel BUG on 3.10.0-327.el7.ppc64:

========
Nov 11 14:48:47 kernel: BUG: soft lockup - CPU#15 stuck for 22s! [stapio:12395]
Nov 11 14:48:47 kernel: Modules linked in:
stap_26acea073d2390b6b4b93e91ac0f9692_13862(OE)
stap_0c9ef08fe0dae436a4433f8136bb310_13790(OE)
stap_bc2a857b167e703bc87d6cae0b9ba9d_13731(OE)
stap_b1cc2deb9fb8eb377283fff95e42020_12899(OE)
stap_0bc821bb685c8518da835f4f413c3a9e_12842(OE) PROCFS_BUFFER8(OE)
stap_6764e915e63f55da6e9c6bb630e1f83_12395(OE)
stap_4d4f80217a831e70ed5d551c209bfdf_12296(OE)
stap_1fee09fddaf3e217bc1be5bca0b707ca_21658(OE) binfmt_misc tun sg pseries_rng
nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp
ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
stap_f9cfa97efba72cc8292ebcc4094674e_11936]
Nov 11 14:48:47 kernel: CPU: 15 PID: 12395 Comm: stapio Tainted: G           OE
 ------------   3.10.0-327.el7.ppc64 #1
Nov 11 14:48:47 kernel: task: c0000005f04be910 ti: c0000005eaf6c000 task.ti:
c0000005eaf6c000
Nov 11 14:48:47 kernel: NIP: d000000005dc3b80 LR: d000000005dc6e6c CTR:
c0000000004b9440
Nov 11 14:48:47 kernel: REGS: c0000005eaf6f710 TRAP: 0901   Tainted: G         
 OE  ------------    (3.10.0-327.el7.ppc64)
Nov 11 14:48:47 kernel: MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 88004428
 XER: 00000000
Nov 11 14:48:47 kernel: CFAR: d000000005dc3ba0 SOFTE: 1
                                                             GPR00:
d000000005dc6e6c c0000005eaf6f990 d000000005dd6a58 d000000007230000
                                                             GPR04:
000000000000bcd7 00000000000001e2 f9b5fab84824369d ffffffffc2b2ae35
                                                             GPR08:
000000000001ffff d000000007844910 000000000003029b d000000005dca4b0
                                                             GPR12:
c0000000004b9440 c000000007b38700
Nov 11 14:48:47 kernel: NIP [d000000005dc3b80] ._stp_map_set_ii+0x100/0x250
[stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: LR [d000000005dc6e6c] .probe_2893+0x11c/0x1b00
[stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: Call Trace:
Nov 11 14:48:47 kernel: [c0000005eaf6f990] [d000000005dc3bdc]
._stp_map_set_ii+0x15c/0x250 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
(unreliable)
Nov 11 14:48:47 kernel: [c0000005eaf6fa30] [d000000005dc6e6c]
.probe_2893+0x11c/0x1b00 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fb30] [d000000005dc674c]
.enter_be_probe+0x14c/0x240 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fbd0] [d000000005dc9a8c]
.systemtap_module_init+0x7fc/0x850 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fcd0] [d000000005dc9ccc]
._stp_ctl_write_cmd+0x1ec/0x644 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fd80] [c0000000002f089c]
.SyS_write+0x14c/0x400
Nov 11 14:48:47 kernel: [c0000005eaf6fe30] [c00000000000a17c]
system_call+0x38/0xb4
Nov 11 14:48:47 kernel: Instruction dump:
Nov 11 14:48:47 kernel: 7f9c4038 7b9c0020 3b9c000c 7b9c1f24 7d3ee02a 7f5ee214
2fa90000 419e0104
Nov 11 14:48:47 kernel: 3929fff0 60000000 60000000 60420000 <2fa90000> 419e0020
e9490020 7fbf5000
========

This typically happens once per testsuite run. I don't yet know which test (or
set of tests) triggers this problem.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
@ 2016-11-16 21:26 ` dsmith at redhat dot com
  2016-11-18 17:22 ` dsmith at redhat dot com
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-16 21:26 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #1 from David Smith <dsmith at redhat dot com> ---
Here's a similar lockup BUG, but on the debug kernel (3.10.0-327.el7.ppc64.debug)
and with the testsuite running in non-parallel mode:

====
Nov 16 10:23:47kernel: BUG: soft lockup - CPU#9 stuck for 23s! [stapio:24254]
Nov 16 10:23:47kernel: Modules linked in:
stap_7764bfaa23b11806ef1c60b256715ff_24254(OE) sg pseries_rng nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif
crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt
dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
stap_1f80bbd166a09dc1edf99396614ca76_23951]
Nov 16 10:23:47kernel: irq event stamp: 0
Nov 16 10:23:47kernel: hardirqs last  enabled at (0): [<          (null)>]     
     (null)
Nov 16 10:23:47kernel: hardirqs last disabled at (0): [<c0000000000ce5dc>]
.copy_process.part.22+0x5dc/0x1040
Nov 16 10:23:47kernel: softirqs last  enabled at (0): [<c0000000000ce5dc>]
.copy_process.part.22+0x5dc/0x1040
Nov 16 10:23:47kernel: softirqs last disabled at (0): [<          (null)>]     
     (null)
Nov 16 10:23:47kernel: CPU: 9 PID: 24254 Comm: stapio Tainted: G           OE 
------------   3.10.0-327.el7.ppc64.debug #1
Nov 16 10:23:47kernel: task: c0000005ebe0ee70 ti: c0000005f0a80000 task.ti:
c0000005f0a80000
Nov 16 10:23:47kernel: NIP: d00000000aba4a78 LR: d00000000aba742c CTR:
c00000000053e540
Nov 16 10:23:47kernel: REGS: c0000005f0a83740 TRAP: 0901   Tainted: G          
OE  ------------    (3.10.0-327.el7.ppc64.debug)
Nov 16 10:23:47kernel: MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 84004422 
XER: 00000007
Nov 16 10:23:47kernel: CFAR: d00000000aba7428 SOFTE: 1
                                                             GPR00:
d00000000aba742c c0000005f0a839c0 d00000000abba0f0 d00000000ce40000
                                                             GPR04:
000000000004c318 00000000000000e9 002db0ae77b1770f ffffffffc2b2ae35
                                                             GPR08:
000000000001ffff 0f670cfa7d56e501 c5ebf91ebfada51d 000000000fea18bf
                                                             GPR12:
c00000000053e540 c000000007b35100
Nov 16 10:23:47kernel: NIP [d00000000aba4a78] ._stp_map_set_ii+0xc8/0x180
[stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: LR [d00000000aba742c] .probe_2893+0xfc/0x1a60
[stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: Call Trace:
Nov 16 10:23:47kernel: [c0000005f0a839c0] [d00000000aba1048]
._new_map_create+0x58/0xf0 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
(unreliable)
Nov 16 10:23:47kernel: [c0000005f0a83a50] [d00000000aba742c]
.probe_2893+0xfc/0x1a60 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: [c0000005f0a83b20] [d00000000aba9acc]
.enter_be_probe+0x1bc/0x360 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: [c0000005f0a83bc0] [d00000000abaaef8]
.systemtap_module_init+0xa18/0xaa0 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: [c0000005f0a83cd0] [d00000000abab52c]
._stp_ctl_write_cmd+0x5ac/0x7c0 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47kernel: [c0000005f0a83d80] [c00000000034cf6c]
.SyS_write+0x14c/0x400
Nov 16 10:23:47kernel: [c0000005f0a83e30] [c00000000000a188]
system_call+0x38/0xb4
Nov 16 10:23:47kernel: Instruction dump:
Nov 16 10:23:47kernel: 7d494a78 7d4941d6 8103005c 55499b7e 7d2a5278 7d2a39d6
5524843e 7c844a78
Nov 16 10:23:47kernel: 7c844038 78840020 3884000c 78841f24 <7d23202a> 7c832214
2fa90000 419e008c
====

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
  2016-11-16 21:26 ` [Bug runtime/20820] " dsmith at redhat dot com
@ 2016-11-18 17:22 ` dsmith at redhat dot com
  2016-11-18 19:01 ` dsmith at redhat dot com
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-18 17:22 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #2 from David Smith <dsmith at redhat dot com> ---
I've tracked this down to the systemtap.base/optim_stats.exp testcase. I get
the "soft lockup" bug on ppc64 every time I've run this test (admittedly on a
very small sample).

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
  2016-11-16 21:26 ` [Bug runtime/20820] " dsmith at redhat dot com
  2016-11-18 17:22 ` dsmith at redhat dot com
@ 2016-11-18 19:01 ` dsmith at redhat dot com
  2016-11-18 20:15 ` mcermak at redhat dot com
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-18 19:01 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #3 from David Smith <dsmith at redhat dot com> ---
(In reply to David Smith from comment #2)
> I've tracked this down to the systemtap.base/optim_stats.exp testcase. I get
> the "soft lockup" bug on ppc64 every time I've run this test (admittedly on
> a very small sample).

The optim_stats.exp testcase runs 2 scripts: systemtap.base/optim_stats1.stp
and systemtap.base/optim_stats2.stp. The latter script causes the BUG:

====
# stap -g --suppress-time-limits ../src/testsuite/systemtap.base/optim_stats2.stp

Message from syslogd@ibm-p8-01-lp7 at Nov 18 13:59:04 ...
 kernel:BUG: soft lockup - CPU#9 stuck for 23s! [stapio:27761]
ERROR: division by 0 near operator '/' at ../src/testsuite/systemtap.base/optim_stats2.stp:48:28
IGNORE 300000000
IGNORE 300000000 29999850000000 0 199999
WARNING: Number of errors: 1, skipped probes: 0
WARNING: /usr/local/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]
====

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (2 preceding siblings ...)
  2016-11-18 19:01 ` dsmith at redhat dot com
@ 2016-11-18 20:15 ` mcermak at redhat dot com
  2016-11-24 16:06 ` mcermak at redhat dot com
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-11-18 20:15 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

Martin Cermak <mcermak at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mcermak at redhat dot com

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (3 preceding siblings ...)
  2016-11-18 20:15 ` mcermak at redhat dot com
@ 2016-11-24 16:06 ` mcermak at redhat dot com
  2016-11-24 17:30 ` fche at redhat dot com
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-11-24 16:06 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #4 from Martin Cermak <mcermak at redhat dot com> ---
I'll be inexact but terse: the testcase verifies that the aggregation operator
'<<<' runs faster for stats using only computationally simple operators such as
@count than for stats using computationally complex operators like @variance.
For a more verbose description, please refer to [1].


Currently we support 6 stat operators: @count, @sum, @min, @max, @avg, and
@variance.  The optimization in question relies on GCC optimizing the following
inlined function according to its parameters:

=======
$ grep -A 33 __stp_stat_add runtime/stat-common.c
static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val,
                                  int stat_op_count, int stat_op_sum, int stat_op_min,
                                  int stat_op_max, int stat_op_variance)
{
        int n;
        int delta = 0;

        sd->shift = st->bit_shift;
        sd->stat_ops = st->stat_ops;
        if (sd->count == 0) {
                sd->count = 1;
                sd->sum = sd->min = sd->max = val;
                sd->avg_s = val << sd->shift;
                sd->_M2 = 0;
        } else {
                if(stat_op_count)
                        sd->count++;
                if(stat_op_sum)
                        sd->sum += val;
                if (stat_op_min && (val > sd->max))
                        sd->max = val;
                if (stat_op_max && (val < sd->min))
                        sd->min = val;
                /*
                 * Below, we use Welford's online algorithm for computing variance.
                 * https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
                 */
                if (stat_op_variance) {
                    delta = (val << sd->shift) - sd->avg_s;
                    sd->avg_s += _stp_div64(NULL, delta, sd->count);
                    sd->_M2 += delta * ((val << sd->shift) - sd->avg_s);
                    sd->variance_s = (sd->count < 2) ? -1 : _stp_div64(NULL, sd->_M2, (sd->count - 1));
                }
        }
$
=======

For example, if @variance isn't used with a given stat, stat_op_variance is set
to 0 and GCC is expected to optimize the respective computations out.  Looking
at the code snippet above, it's easy to see that the effect of optimizing out
the @variance computations is much more significant than the effect of
optimizing out the other stat-op computations.  Of course, the effect of such
optimizations is also architecture and compiler dependent.
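
To make that concrete, here is a hypothetical, much simplified C sketch of the
idea (the struct, function names, and call site below are invented for
illustration only; the real code is the __stp_stat_add() above plus the call
sites the translator generates):

====
#include <stdint.h>

/* Simplified stand-in for stat_data; invented for this sketch. */
typedef struct {
        int64_t count, sum, avg_s, _M2;
        int shift;
} sketch_stat;

static inline void sketch_stat_add(sketch_stat *sd, int64_t val,
                                   int op_count, int op_sum, int op_variance)
{
        if (op_count)
                sd->count++;
        if (op_sum)
                sd->sum += val;
        if (op_variance && sd->count > 0) {
                /* Welford update: the expensive part, including a division. */
                int64_t delta = (val << sd->shift) - sd->avg_s;
                sd->avg_s += delta / sd->count;
                sd->_M2 += delta * ((val << sd->shift) - sd->avg_s);
        }
}

void sketch_add_count_only(sketch_stat *sd, int64_t val)
{
        /* op_variance is the compile-time constant 0 here, so GCC can prove
         * the whole Welford branch dead and drop it from this function. */
        sketch_stat_add(sd, val, 1, 0, 0);
}
====

That dead-branch elimination at the call site is the effect the testcase tries
to measure.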

The testcase tries to detect all of these optimizations and confirm they are
present.  Detecting the optimizations for @count, @sum, @min, @max, and @avg is
relatively tricky: it's hard to distinguish their effect from the noise, the
test results are of low quality, and the test generates a lot of load.  On the
other hand, detecting and verifying the @variance optimization is relatively
simple, so testing it makes good sense.

I've just been running the testcase in its original form, and it gives all
expected passes on most of the supported RHEL 6 and 7 arches, but sometimes
'kernel:NMI watchdog: BUG: soft lockup' errors still happen.  However, this was
using the testsuite's serial mode, which certainly gives better results than
the parallel mode.

So I propose dropping the first subtest (optim_stats1.stp) for the @count,
@sum, @min, and @max optimizations altogether, since it's "not so much fun for
a lot of money", but keeping the second subtest (optim_stats2.stp) for the
@variance optimization.  Also, the high iteration counts in optim_stats2.stp
can be lowered (the values were copied from optim_stats1.stp but appear to be
unnecessarily high).  The following seems to help:

=======
$ git diff
diff --git a/testsuite/systemtap.base/optim_stats.exp b/testsuite/systemtap.base/optim_stats.exp
index e46de40..1955853 100644
--- a/testsuite/systemtap.base/optim_stats.exp
+++ b/testsuite/systemtap.base/optim_stats.exp
@@ -8,7 +8,7 @@ if {![installtest_p]} {
     return
 }
 
-for {set i 1} {$i <= 2} {incr i} {
+for {set i 2} {$i <= 2} {incr i} {
     foreach runtime [get_runtime_list] {
        if {$runtime != ""} {
            spawn stap --runtime=$runtime -g --suppress-time-limits $srcdir/$subdir/$test$i.stp
diff --git a/testsuite/systemtap.base/optim_stats2.stp b/testsuite/systemtap.base/optim_stats2.stp
index 53bbc69..65fe06d 100644
--- a/testsuite/systemtap.base/optim_stats2.stp
+++ b/testsuite/systemtap.base/optim_stats2.stp
@@ -2,9 +2,9 @@
  * Analogy to optim_stats1.stp, but for pmaps.  See optim_stats1.stp for comments.
  */
 
-@define RANDCNT %( 200000 %)
+@define RANDCNT %( 2000 %)
 @define RANDMAX %( 1000 %)
-@define ITERS %( 1500 %)
+@define ITERS %( 15 %)
 
 @define feed(agg, tagg)
 %(
$
=======


Thoughts?


------------------------
[1]
https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=blob;f=testsuite/systemtap.base/optim_stats1.stp;h=2144b7bb210ee8f0c620487ac63fffba14e0d1bf;hb=HEAD

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (4 preceding siblings ...)
  2016-11-24 16:06 ` mcermak at redhat dot com
@ 2016-11-24 17:30 ` fche at redhat dot com
  2016-11-28 16:55 ` dsmith at redhat dot com
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: fche at redhat dot com @ 2016-11-24 17:30 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

Frank Ch. Eigler <fche at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fche at redhat dot com

--- Comment #5 from Frank Ch. Eigler <fche at redhat dot com> ---
Would it be fair to summarize the prior comment by saying that these test cases
deliberately eat CPU in order to test the performance of those statistics
optimizations?  If so, the "soft lockup" is only an indication of the test
running a long time, not of any sort of deep problem.  One could, for example,
reduce the loop iterations by a factor of two on slower machines.

I don't know about the division-by-zero part.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (5 preceding siblings ...)
  2016-11-24 17:30 ` fche at redhat dot com
@ 2016-11-28 16:55 ` dsmith at redhat dot com
  2016-11-28 17:48 ` mcermak at redhat dot com
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-28 16:55 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #6 from David Smith <dsmith at redhat dot com> ---
(In reply to Martin Cermak from comment #4)

... stuff deleted ...

> Thoughts?

With the patch, I see no "soft lockup" BUGs on either the RHEL7 ppc64 system
I've been using or the rawhide-32 VM that also hit the BUG.

Note that on RHEL7 ppc64 I'm getting the divide-by-0 error, which
optim_stats.exp doesn't expect, so there is no pass or fail when using the
default linux runtime.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (6 preceding siblings ...)
  2016-11-28 16:55 ` dsmith at redhat dot com
@ 2016-11-28 17:48 ` mcermak at redhat dot com
  2016-11-29 13:47 ` mcermak at redhat dot com
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-11-28 17:48 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
Guys, I am still running the testcase in a loop with various parameters so that
I can see which parameters might improve its behaviour.  About to finish.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (7 preceding siblings ...)
  2016-11-28 17:48 ` mcermak at redhat dot com
@ 2016-11-29 13:47 ` mcermak at redhat dot com
  2016-11-29 13:48 ` mcermak at redhat dot com
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-11-29 13:47 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #8 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9679
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9679&action=edit
test data

Based on the attached test data, I've updated the RANDCNT value so that the
testcase generates less CPU load while still keeping the test results
reasonable on most arches.

Commit 71aa65f2e9ee4f1d5b33c76dfb4df66600b81fcf.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (8 preceding siblings ...)
  2016-11-29 13:47 ` mcermak at redhat dot com
@ 2016-11-29 13:48 ` mcermak at redhat dot com
  2016-11-30 19:10 ` dsmith at redhat dot com
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-11-29 13:48 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

Martin Cermak <mcermak at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Martin Cermak <mcermak at redhat dot com> ---
Closing. Let's reopen in case of need.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (9 preceding siblings ...)
  2016-11-29 13:48 ` mcermak at redhat dot com
@ 2016-11-30 19:10 ` dsmith at redhat dot com
  2016-12-01 13:04 ` mcermak at redhat dot com
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-11-30 19:10 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #10 from David Smith <dsmith at redhat dot com> ---
I played around with this one a bit more, trying to fix the "division by zero"
error. To do that, I switched from gettimeofday_us() to the monotonic clock
function cpu_clock_us() on the linux runtime. I then started getting MAXNESTING
errors with the dyninst runtime, so I turned the time() function into a macro.

Here's the patch:

====
diff --git a/testsuite/systemtap.base/optim_stats1.stp b/testsuite/systemtap.base/optim_stats1.stp
index 4348ec2..884f753 100644
--- a/testsuite/systemtap.base/optim_stats1.stp
+++ b/testsuite/systemtap.base/optim_stats1.stp
@@ -4,17 +4,24 @@

 @define feed(agg, tagg)
 %(
-    t = time()
+    t = @time
     foreach(k in randvals)
        @agg <<< k
-    @tagg += time() - t
+    @tagg += @time - t
 %)

 global x, tx = 0, y, ty = 0
 global a, ta = 0, b, tb = 0
 global randvals[@RANDCNT]

-function time() { return gettimeofday_us() }
+@define time
+%(
+    %( runtime == "dyninst" %?
+       gettimeofday_us()
+    %:
+       cpu_clock_us(cpu())
+    %)
+%)

 probe begin
 {
diff --git a/testsuite/systemtap.base/optim_stats2.stp b/testsuite/systemtap.base/optim_stats2.stp
index 4e28bdd..4ca82a9 100644
--- a/testsuite/systemtap.base/optim_stats2.stp
+++ b/testsuite/systemtap.base/optim_stats2.stp
@@ -4,17 +4,24 @@

 @define feed(agg, tagg)
 %(
-    t = time()
+    t = @time
     foreach(k in randvals)
        @agg <<< k
-    @tagg += time() - t
+    @tagg += @time - t
 %)

 global x, tx = 0, y, ty = 0
 global a, ta = 0, b, tb = 0
 global randvals[@RANDCNT]

-function time() { return gettimeofday_us() }
+@define time
+%(
+    %( runtime == "dyninst" %?
+       gettimeofday_us()
+    %:
+       cpu_clock_us(cpu())
+    %)
+%)

 probe begin
 {
====

With these changes, I get no "division by zero" errors or MAXNESTING errors,
but I do start to see a FAIL.

====
Running /root/src/testsuite/systemtap.base/optim_stats.exp ...
PASS: TEST1  (0, 0)
PASS: TEST2  (10, 45)
PASS: TEST1 dyninst (0, 2)
PASS: TEST2 dyninst (10, 38)
FAIL: TEST3  (0, -4)
PASS: TEST4  (10, 16)
PASS: TEST3 dyninst (0, 8)
PASS: TEST4 dyninst (10, 26)
====

Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (10 preceding siblings ...)
  2016-11-30 19:10 ` dsmith at redhat dot com
@ 2016-12-01 13:04 ` mcermak at redhat dot com
  2016-12-01 16:11 ` dsmith at redhat dot com
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-12-01 13:04 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #11 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9681
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9681&action=edit
test results

Hi David, the attached test results show the effect of the above patch on my
test boxes.  I actually have 12 more ppc64 test boxes not reported there, but
they show similar results.  No division by zero has been observed on any of
them yet (although one of them in particular looks very close to yours).  I
wonder whether there is some difference in how we test.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (11 preceding siblings ...)
  2016-12-01 13:04 ` mcermak at redhat dot com
@ 2016-12-01 16:11 ` dsmith at redhat dot com
  2016-12-01 16:44 ` dsmith at redhat dot com
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-12-01 16:11 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #12 from David Smith <dsmith at redhat dot com> ---
(In reply to Martin Cermak from comment #11)
> Hi David, attached test results show the effect of the above patch on my
> test boxes.  Actually, I have 12 more ppc64 test boxes not reported there,
> but showing similar results.  No division by zero observed at any of them
> yet (although especially one of them looks very close to yours).  I wonder
> whether there is some difference in how we test.

There is probably a difference in our source code. I'm running with a patch to
the time initialization code that you don't have. As part of my work debugging
bug #20735, I updated/improved the __stp_get_freq() runtime C function (in
runtime/time.c).

Currently, what it does is get the cpu frequency in an arch-specific way. If
there isn't arch-specific support for a particular architecture, it tries to
estimate the cpu frequency. On ppc64, that estimated cpu frequency is used.

While looking at that code, I decided to add some arch-specific ppc64 code that
grabs the cpu frequency. Later, I then found a arch-independent way of getting
the cpu frequency on kernels that have CONFIG_CPU_FREQ.

I've been running with that patch for about a month now, but just haven't
checked it in yet. I'd guess the more accurate cpu frequency returned by the
new code is changing the values returned by the gettimeofday_us() tapset
function.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (12 preceding siblings ...)
  2016-12-01 16:11 ` dsmith at redhat dot com
@ 2016-12-01 16:44 ` dsmith at redhat dot com
  2016-12-01 19:29 ` jistone at redhat dot com
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-12-01 16:44 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #13 from David Smith <dsmith at redhat dot com> ---
Created attachment 9682
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9682&action=edit
proposed __stp_init_time() patch

In case anyone is interested, here's the __stp_init_time() patch I've been
testing.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (13 preceding siblings ...)
  2016-12-01 16:44 ` dsmith at redhat dot com
@ 2016-12-01 19:29 ` jistone at redhat dot com
  2016-12-01 20:38 ` dsmith at redhat dot com
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: jistone at redhat dot com @ 2016-12-01 19:29 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

Josh Stone <jistone at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jistone at redhat dot com

--- Comment #14 from Josh Stone <jistone at redhat dot com> ---
Comment on attachment 9682
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9682
proposed __stp_init_time() patch

> -    write_seqlock(&time->lock);
> +    write_seqlock_irqsave(&time->lock, flags2);
>      time->base_ns = ns;
>      time->base_cycles = cycles;
> -    write_sequnlock(&time->lock);
> +    write_sequnlock_irqrestore(&time->lock, flags2);
>  
>      local_irq_restore(flags);

Why do you need to save irq if it's already in a local_irq_save section?

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (14 preceding siblings ...)
  2016-12-01 19:29 ` jistone at redhat dot com
@ 2016-12-01 20:38 ` dsmith at redhat dot com
  2016-12-01 20:52 ` jistone at redhat dot com
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-12-01 20:38 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #15 from David Smith <dsmith at redhat dot com> ---
(In reply to Josh Stone from comment #14)
> Comment on attachment 9682 [details]
> proposed __stp_init_time() patch
> 
> > -    write_seqlock(&time->lock);
> > +    write_seqlock_irqsave(&time->lock, flags2);
> >      time->base_ns = ns;
> >      time->base_cycles = cycles;
> > -    write_sequnlock(&time->lock);
> > +    write_sequnlock_irqrestore(&time->lock, flags2);
> >  
> >      local_irq_restore(flags);
> 
> Why do you need to save irq if it's already in a local_irq_save section?

Good question. Everywhere I could find in the kernel where a function uses
*_irqsave() on 2 different locking items, it uses *_irqsave() on both locking
items. Several months back I looked at the source of write_seqlock() vs.
write_seqlock_irqsave(), and if I remember correctly the waiting method differs
between the two. So we're not using the 2nd write_seqlock_irqsave() call to
disable interrupts again, but to wait in the most appropriate way once
interrupts are already disabled.
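
A hypothetical sketch of the locking pattern under discussion (the struct and
function below are invented; the real code is the proposed __stp_init_time()
patch), with that rationale spelled out as a code comment:

====
#include <linux/types.h>
#include <linux/seqlock.h>
#include <linux/irqflags.h>

struct sketch_time {
        seqlock_t lock;
        s64 base_ns;
        s64 base_cycles;
};

static void sketch_update_base(struct sketch_time *time, s64 ns, s64 cycles)
{
        unsigned long flags, flags2;

        local_irq_save(flags);

        /* Interrupts are already disabled above, so the _irqsave variant is
         * not needed to disable them again; it is used so the seqlock writer
         * waits in the way appropriate for an interrupts-disabled context. */
        write_seqlock_irqsave(&time->lock, flags2);
        time->base_ns = ns;
        time->base_cycles = cycles;
        write_sequnlock_irqrestore(&time->lock, flags2);

        local_irq_restore(flags);
}
====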

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (15 preceding siblings ...)
  2016-12-01 20:38 ` dsmith at redhat dot com
@ 2016-12-01 20:52 ` jistone at redhat dot com
  2016-12-02 15:01 ` mcermak at redhat dot com
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: jistone at redhat dot com @ 2016-12-01 20:52 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #16 from Josh Stone <jistone at redhat dot com> ---
OK, thanks -- that deserves a brief comment at least.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (16 preceding siblings ...)
  2016-12-01 20:52 ` jistone at redhat dot com
@ 2016-12-02 15:01 ` mcermak at redhat dot com
  2016-12-02 17:30 ` dsmith at redhat dot com
  2017-02-08 17:05 ` dsmith at redhat dot com
  19 siblings, 0 replies; 21+ messages in thread
From: mcermak at redhat dot com @ 2016-12-02 15:01 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #17 from Martin Cermak <mcermak at redhat dot com> ---
David, will you check in the test case improvement from comment #10 as part of
your time initialization work?

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (17 preceding siblings ...)
  2016-12-02 15:01 ` mcermak at redhat dot com
@ 2016-12-02 17:30 ` dsmith at redhat dot com
  2017-02-08 17:05 ` dsmith at redhat dot com
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2016-12-02 17:30 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #18 from David Smith <dsmith at redhat dot com> ---
(In reply to Martin Cermak from comment #17)
> David, will you check the test case improvement from comment #10 in as a
> part of your time initialization work?

I don't mind checking it in if you are comfortable with it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug runtime/20820] another "soft lockup" BUG on RHEL7 ppc64
  2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
                   ` (18 preceding siblings ...)
  2016-12-02 17:30 ` dsmith at redhat dot com
@ 2017-02-08 17:05 ` dsmith at redhat dot com
  19 siblings, 0 replies; 21+ messages in thread
From: dsmith at redhat dot com @ 2017-02-08 17:05 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=20820

--- Comment #19 from David Smith <dsmith at redhat dot com> ---
Commit 0f7e138 adds those test case improvements.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-02-08 17:05 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-14 18:57 [Bug runtime/20820] New: another "soft lockup" BUG on RHEL7 ppc64 dsmith at redhat dot com
2016-11-16 21:26 ` [Bug runtime/20820] " dsmith at redhat dot com
2016-11-18 17:22 ` dsmith at redhat dot com
2016-11-18 19:01 ` dsmith at redhat dot com
2016-11-18 20:15 ` mcermak at redhat dot com
2016-11-24 16:06 ` mcermak at redhat dot com
2016-11-24 17:30 ` fche at redhat dot com
2016-11-28 16:55 ` dsmith at redhat dot com
2016-11-28 17:48 ` mcermak at redhat dot com
2016-11-29 13:47 ` mcermak at redhat dot com
2016-11-29 13:48 ` mcermak at redhat dot com
2016-11-30 19:10 ` dsmith at redhat dot com
2016-12-01 13:04 ` mcermak at redhat dot com
2016-12-01 16:11 ` dsmith at redhat dot com
2016-12-01 16:44 ` dsmith at redhat dot com
2016-12-01 19:29 ` jistone at redhat dot com
2016-12-01 20:38 ` dsmith at redhat dot com
2016-12-01 20:52 ` jistone at redhat dot com
2016-12-02 15:01 ` mcermak at redhat dot com
2016-12-02 17:30 ` dsmith at redhat dot com
2017-02-08 17:05 ` dsmith at redhat dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).