public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* [Bug runtime/31117] New: transport failures
@ 2023-12-06 15:39 mcermak at redhat dot com
  2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2023-12-06 15:39 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

            Bug ID: 31117
           Summary: transport failures
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: mcermak at redhat dot com
  Target Milestone: ---

Created attachment 15241
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15241&action=edit
testcase patch allowing to demonstrate the problem

In the recent past, there were some problems with transport failures (PR29108).
 This was fixed.  But apparently the situation still isn't 100% perfect. 
Pretty much
all the syscall tests are now failing on RHEL-9.4 with transport failures.  It
is reliably reproducible.  By default, the problem is hidden in the testsuite's
expect gates.  To visualize it, I've used the attached patch.  With that patch
applied (on top of systemtap commit b2a1261c434788cb and
kernel-5.14.0-394.el9.x86_64), I have:


el9 x86_64 # make installcheck RUNTESTFLAGS=syscall.exp CHECK_ONLY=futex

[ ... stuff deleted ... ]

Running /root/systemtap/testsuite/systemtap.syscall/syscall.exp ...

[ ... stuff deleted ... ]

Comparing ^futex: futex \(0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x[f]+, 0\) = [\-0-9]+
against futex: futex (0x0, FUTEX_CMP_REQUEUE, 1, 0, 0xffffffffffffffff, 0) =
-14 (EFAULT)
Comparing ^futex: futex \(0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x0, -1\) = [\-0-9]+
against futex: futex (0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x0, -1) = WARNING: There
were 1 transport failures. Try stap -s to increase the buffer size from 16.


I've tested that simpler syscall tests (e.g. those having most of their
<testcase>.c content commented out) the test result is sometimes PASS and
sometimes FAIL.  But using a "normal" syscall test involving 10+ system calls
and their checks, the failure is guaranteed..

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
@ 2023-12-06 21:17 ` wcohen at redhat dot com
  2023-12-13 21:27 ` mcermak at redhat dot com
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-06 21:17 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

William Cohen <wcohen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wcohen at redhat dot com

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
  2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
@ 2023-12-13 21:27 ` mcermak at redhat dot com
  2023-12-14 20:23 ` wcohen at redhat dot com
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2023-12-13 21:27 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

Martin Cermak <mcermak at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #15241|0                           |1
        is obsolete|                            |

--- Comment #1 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15258
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15258&action=edit
testcase patch allowing to demonstrate the problem

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
  2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
  2023-12-13 21:27 ` mcermak at redhat dot com
@ 2023-12-14 20:23 ` wcohen at redhat dot com
  2023-12-18 18:55 ` wcohen at redhat dot com
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-14 20:23 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

William Cohen <wcohen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #2 from William Cohen <wcohen at redhat dot com> ---
I am able to see issues with the transport layer in a number of the systemtap
tests.
Looking through the systemtap.log  There are a number of warnings.  Another
example warning output below is from systemtap.maps/map_hash_SSI.stp failure
(which has a significant amount of output):

    "WARNING: There were 2 transport failures. Try stap -s to increase the
buffer size from 16."
    "WARNING: There were 0 lost messages and 14 lost bytes."

Frank mentioned that the problem might be tied to transferring data more than a
page (4096 bytes).  Didn't see the issue when less than 4K, but then started
seeing it when size got to the 4K range.

Also saw cases where the lost messages counts looked questionable, such as the
one for clone:

clone: set_robust_list (0x7f54e661ba20, 24) = 0
clone: mprotect (0x7f54e67f2000, 16384, PROT_READ) = 0
clone: mprotect (0x7f54e68df000, 4096, PROT_READ) = 0
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=66: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=67: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=68: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=69: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=70: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=71: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=72: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=73: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=74: No data
available
WARNING: There were 4294967284 lost messages and 0 lost bytes.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (2 preceding siblings ...)
  2023-12-14 20:23 ` wcohen at redhat dot com
@ 2023-12-18 18:55 ` wcohen at redhat dot com
  2023-12-21 21:43 ` wcohen at redhat dot com
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-18 18:55 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #3 from William Cohen <wcohen at redhat dot com> ---
I did some investigation on this with the map_hash_SSI.stp transport failures. 
Some fixes of the diagnostic print code for the transport layer were made so it
compiles on fedora 39.  The __stp_print_flush code expects there to be enough
room to write out the _stp_trace header (12 bytes) in the space returned by
_stp_data_write_reserve().  However, there are cases where
_stp_data_write_reserve() returned less space (5 and 9 bytes).  The causes the
_stp_print_flush to fail and increment _stp_transport_failures.  That certainly
needs to be fixed.

I had some additional code in the runtime to print out each time
_stp_print_flush had a failure.  I didn't see that triggering with the sys.stp.
Looks like will need to additional diagnostic prints to figure out what is
going on with the sys.stp output.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (3 preceding siblings ...)
  2023-12-18 18:55 ` wcohen at redhat dot com
@ 2023-12-21 21:43 ` wcohen at redhat dot com
  2024-01-04 22:34 ` wcohen at redhat dot com
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-21 21:43 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #4 from William Cohen <wcohen at redhat dot com> ---
Thinking about this some more.  One of the problems that occurs is the
situation where variable sized writes cause the remaining space in the buffer
returned by _stp_data_write_reserve to be smaller than the 12 bytes required by
the struct _stp_trace header.  One thought is to make sure that the
_stp_data_write_reserve() provide allocations in 16 byte chunks that are are
aligned so when __stp_print_flush gets to the end of a subbuffer there is room
to write the header. However, the __stp_print_flush code assumes that
bytes_reserved returned by _stp_data_write_reserve() is <= size_request.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (4 preceding siblings ...)
  2023-12-21 21:43 ` wcohen at redhat dot com
@ 2024-01-04 22:34 ` wcohen at redhat dot com
  2024-01-05 14:50 ` mcermak at redhat dot com
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-04 22:34 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #5 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15285
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15285&action=edit
Patch reduce transport drops

The patch retries reserving space if it gets to the end of a subbuffer and
there isn't enough room for the header.  There probably needs to be some tweaks
to the code in staprun/relay.c to not count the padding as lost bytes when it
is looking for the the STAP_TRACE_MAGIC.

The attached patch improves the test results for the systemtap.map tests that
were transporting more than PAGE_SIZE amount of data.  It also improves the
test results for the syscall tests.

For:

sudo make installcheck RUNTESTFLAGS="systemtap.maps/*.exp"

Before the patch:

                === systemtap Summary ===

# of expected passes            83
# of unexpected failures        6
# of unsupported tests          1

                === systemtap Summary ===

# of expected passes            86
# of unexpected failures        3
# of unsupported tests          1

@@ -106 +95 @@
-FAIL: systemtap.maps/map_hash_SSI.stp
+PASS: systemtap.maps/map_hash_SSI.stp
@@ -108 +97 @@
-FAIL: systemtap.maps/map_hash_stat_II.stp
+PASS: systemtap.maps/map_hash_stat_II.stp
@@ -110 +99 @@
-FAIL: systemtap.maps/map_hash_stat_SI.stp
+PASS: systemtap.maps/map_hash_stat_SI.stp


The syscall test results look much better also.  However, on RHEL9/f39 there
are a number of unrelated failures on x86_64 for the 32-bit tests because the
check for the syscall names/numbers added by git commit
7fe1b7bb8e665c9edd38eaf34a499618ee763293

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (5 preceding siblings ...)
  2024-01-04 22:34 ` wcohen at redhat dot com
@ 2024-01-05 14:50 ` mcermak at redhat dot com
  2024-01-05 15:21 ` mcermak at redhat dot com
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-05 14:50 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #6 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15287
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15287&action=edit
possible patch for the 32-bit tests subproblem

(In reply to William Cohen from comment #5)
> The attached patch improves the test results for the systemtap.map tests
> that were transporting more than PAGE_SIZE amount of data.  It also improves
> the test results for the syscall tests.

Thank-you for looking into this Will!

> However, on RHEL9/f39 there
> are a number of unrelated failures on x86_64 for the 32-bit tests because
> the check for the syscall names/numbers added by git commit
> 7fe1b7bb8e665c9edd38eaf34a499618ee763293

Looking closer... Apparently on RHEL9 the @__compat_task macro <=
_stp_is_compat_task()  <= _stp_is_compat_task2() in runtime/compatdefs.h
doesn't work right.  Wow, I think I have a fix for that see attached.  Here is
how my thoughts went:

I've tested the problem is in _stp_is_compat_task2(), that's in
runtime/compatdefs.h.  I've found that a related systemtap commit is
ea81249b90b92a589d304c051efbc5bbad7ce073.  This mentions kernel commit
ff170cd05953, so I've looked at that, and found that kernel function
is_64bit_mm() seems to be doing what I need.  I've then looked at the modern
implementation of this function, and used that back in systemtap's
_stp_is_compat_task2().  And that seems to work on my specific test system,
which is 5.14.0-402.el9.x86_64.

The attached patch works for me on both RHEL8 and RHEL9.  BUT ... this seems to
be a case where this upstream kernel change introduced in vanilla v6.3 was
backported to lower distribution kernel version, so the condition "#if
LINUX_VERSION_CODE < KERNEL_VERSION(5,0,0)" is a bit problematic..

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (6 preceding siblings ...)
  2024-01-05 14:50 ` mcermak at redhat dot com
@ 2024-01-05 15:21 ` mcermak at redhat dot com
  2024-01-08 14:56 ` wcohen at redhat dot com
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-05 15:21 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I've separated this aforementioned 32-bit tests sub-problem into PR31215.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (7 preceding siblings ...)
  2024-01-05 15:21 ` mcermak at redhat dot com
@ 2024-01-08 14:56 ` wcohen at redhat dot com
  2024-01-09 12:11 ` mcermak at redhat dot com
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-08 14:56 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

William Cohen <wcohen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #15285|0                           |1
        is obsolete|                            |

--- Comment #8 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15289
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15289&action=edit
Partial fix to reduce transport drops

This WIP patch reduced the number of "transport failures" in the run of "make
installcheck" from 57 down to 6.  It also reduced the number of failures in the
syscall.exp tests  However, there still seems to be drops output in the tests. 
When running the syscall.exp testsuite multiple times see the PASS change to
FAIL and vice versa from run to run.  Maybe on shutdown not all the output is
getting flushed out.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (8 preceding siblings ...)
  2024-01-08 14:56 ` wcohen at redhat dot com
@ 2024-01-09 12:11 ` mcermak at redhat dot com
  2024-01-10 12:37 ` mcermak at redhat dot com
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-09 12:11 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #9 from Martin Cermak <mcermak at redhat dot com> ---
(In reply to William Cohen from comment #8)
> However, there still seems to be drops
> output in the tests.  When running the syscall.exp testsuite multiple times
> see the PASS change to FAIL and vice versa from run to run.  Maybe on
> shutdown not all the output is getting flushed out.

Agreed.  E.g. the following is based on syscall testsuite artifacts:





el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
4
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
0
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # 




BUT ... I seem to see improvements when I put all the CPU cores offline but the
first one:




el8 x86_64 # for i in `seq 1 23`; do echo 0 >
/sys/devices/system/cpu/cpu$i/online; done
el8 x86_64 #
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 #

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (9 preceding siblings ...)
  2024-01-09 12:11 ` mcermak at redhat dot com
@ 2024-01-10 12:37 ` mcermak at redhat dot com
  2024-01-22 15:54 ` wcohen at redhat dot com
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-10 12:37 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #10 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15294
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15294&action=edit
Workaround dejagnu hack FTR

For my reference, I'm attaching a patch that helps me work around current
transport failures.  It namely executes staprun repeatedly if needed as long as
it (probably) gives sufficient count of output lines.  This is a very bad hack,
but it helps me work on syscall tests while the actual bug is unfixed.  It also
gives better readable/debuggable output (subjective) compared to the default.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (10 preceding siblings ...)
  2024-01-10 12:37 ` mcermak at redhat dot com
@ 2024-01-22 15:54 ` wcohen at redhat dot com
  2024-01-23 18:24 ` wcohen at redhat dot com
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-22 15:54 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

William Cohen <wcohen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #15289|0                           |1
        is obsolete|                            |

--- Comment #11 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15317
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15317&action=edit
Patch request again for buffer to get contigous region for header and include
logic to avoid counting padding bytes as "lost bytes"

This significantly reduced the number of transport failures. With this patch
still saw one transport failure warning with the 32-bit syslog nd_syscall test.
 Need to instrument the runtime and see where where that tranport failure is
coming from. The syscall tests are not writing out so much data to overrun the
buffer space reserved. Also there still seem to be some *syscall.exp tests that
have truncated/missing output that is causing the some individual tests to
fail.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (11 preceding siblings ...)
  2024-01-22 15:54 ` wcohen at redhat dot com
@ 2024-01-23 18:24 ` wcohen at redhat dot com
  2024-02-02 15:07 ` mcermak at redhat dot com
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-23 18:24 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #12 from William Cohen <wcohen at redhat dot com> ---
Checked in commit b6abf4bd012a444e625296127155877fd8737931. This helps with the
transport failure issues and significantly reduces the " WARNING: There were 1
transport failures." messages. However, when running syscall.exp still observed
a "WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16." and there seems to e some FAIL: tests that look to be missing
some output.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (12 preceding siblings ...)
  2024-01-23 18:24 ` wcohen at redhat dot com
@ 2024-02-02 15:07 ` mcermak at redhat dot com
  2024-02-02 15:15 ` mcermak at redhat dot com
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-02-02 15:07 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #13 from Martin Cermak <mcermak at redhat dot com> ---
Hello Will, I think your patches improve the situation significantly. 
Currently on rhel9 x86_64 main branch, I'm getting:

# of expected passes            286
# of unexpected failures        54

When I apply the workaround from Comment #10 (which runs staprun repeatedly in
case of a clear loss of data), the test results on the same box are as follows:

# of expected passes            296
# of unexpected failures        44

Which means that the workaround doesn't make a huge difference any more.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (13 preceding siblings ...)
  2024-02-02 15:07 ` mcermak at redhat dot com
@ 2024-02-02 15:15 ` mcermak at redhat dot com
  2024-02-05 15:32 ` wcohen at redhat dot com
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-02-02 15:15 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #14 from Martin Cermak <mcermak at redhat dot com> ---
On another - rhel8 x86_64 system, the situation is as follows:  Clean main
branch stap code:

# of expected passes            158
# of unexpected failures        181

With patch from Comment #10 running staprun repeatedly in case of a clear data
loss:

# of expected passes            313
# of unexpected failures        26

This rhel8 system has 48 cores, while the rhel9 system from Comment 13 has only
20 cores.  In this case, the workaround still appears to play a significant
role.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (14 preceding siblings ...)
  2024-02-02 15:15 ` mcermak at redhat dot com
@ 2024-02-05 15:32 ` wcohen at redhat dot com
  2024-03-04 16:52 ` wcohen at redhat dot com
  2024-03-04 21:30 ` wcohen at redhat dot com
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-02-05 15:32 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #15 from William Cohen <wcohen at redhat dot com> ---
The fix merged in helps, but doesn't eliminate all the transport failures. 
Looking through the RHEL8/9 runs I did this weekend see "transport failures". 
Both had warning for  systemtap.base PR14546.  However, this test looks like it
might just have a huge volume of data as it is printing backtraces each time an
irq_* function is entered.

on RHEL9 see some "transport failures" on the syscall tests:

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit chmod nd_syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit handle_at nd_syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit openclose nd_syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit rt_signal nd_syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit statfs nd_syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit msg_queue syscall

For RHEL8 syscall tests it looks like the return probes failed to print for
mmap2 and read for these particular failures:

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 64-bit remap_file_pages syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 64-bit set_tid_address syscall

WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
Testing 64-bit setgroups syscall

Looking through the results syscall tests there certainly seem to be cases
where some output seems to be missing. On some seeing "WARNING: Child process
exited with status 255".  Might be some other issue going on with some of the
syscall tests.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (15 preceding siblings ...)
  2024-02-05 15:32 ` wcohen at redhat dot com
@ 2024-03-04 16:52 ` wcohen at redhat dot com
  2024-03-04 21:30 ` wcohen at redhat dot com
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-03-04 16:52 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

--- Comment #16 from William Cohen <wcohen at redhat dot com> ---
The test results are much improved after the fix for padding the end of the
small sub-buffer and writing the new packet in a new sub-buffer.  There are 
some RHEL8 and RHEL9 tests that appear to create a great amount of data and
warnings about transport failures are triggered (pr14546 and
systemtap.examples/general/callgraph in RHEL8,
systemtap.examples/profiling/sched_switch run in RHEL9).  The main place that
transport failures are still observed are the syscall tests.

f39
FAIL: 64-bit setitimer nd_syscall
FAIL: 32-bit sysctl nd_syscall
FAIL: 64-bit prctl syscall
FAIL: 32-bit aio syscall
FAIL: 32-bit prctl syscall

rawhide
FAIL: 32-bit arch_prctl nd_syscall
FAIL: 32-bit membarrier nd_syscall
FAIL: 32-bit arch_prctl syscall
FAIL: 32-bit exit_group syscall
FAIL: 32-bit modify_ldt syscall
FAIL: 32-bit openclose syscall
FAIL: 32-bit socketpair syscall

rhel8
FAIL: 32-bit memfd_create tp_syscall

rhel9
FAIL: 64-bit connect nd_syscall
FAIL: 64-bit readv nd_syscall
FAIL: 64-bit thread_area nd_syscall
FAIL: 32-bit syslog nd_syscall
FAIL: 64-bit fadvise64 syscall
FAIL: 64-bit getpeername syscall
FAIL: 64-bit getsockname syscall
FAIL: 64-bit readahead syscall
FAIL: 64-bit sigaltstack syscall
FAIL: 64-bit thread_area syscall
FAIL: 32-bit syslog syscall

There is a great deal of rhyme or reasons for which syscall tests are failing.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug runtime/31117] transport failures
  2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
                   ` (16 preceding siblings ...)
  2024-03-04 16:52 ` wcohen at redhat dot com
@ 2024-03-04 21:30 ` wcohen at redhat dot com
  17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-03-04 21:30 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=31117

William Cohen <wcohen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #17 from William Cohen <wcohen at redhat dot com> ---
This PR should be addressed by:

commit d1b74b12acc07f52c7d8f5b080176c485c34cd63 (HEAD -> master, origin/master,
origin/HEAD)
Author: William Cohen <wcohen@redhat.com>
Date:   Mon Mar 4 16:27:18 2024 -0500

    PR31117: Correct handling of transport layer allocated memory

    The _stp_print_flush() code was not correct.  There are four possible
    ranges of values compared to the header size (hlen)
    _stp_data_write_reserve() could return when beginning to write out
    log:

    <0      unable to allocate any space
    <hlen   pad out the allocated space and try another allocation
    ==hlen  just enough space for the initial header
    >hlen   write out the header and some portion of log

    The case where the space allocated was equal size of the header
    (==hlen) was not handled correctly.  In the cases where there was only
    enough room to write the header the _stp_transport_failures variable
    was incremented and none of the log data was written out.  The correct
    course of action in these cases would be to write the header out in
    the allocated space and start looping to write the rest of the log
    data.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-03-04 21:31 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
2023-12-13 21:27 ` mcermak at redhat dot com
2023-12-14 20:23 ` wcohen at redhat dot com
2023-12-18 18:55 ` wcohen at redhat dot com
2023-12-21 21:43 ` wcohen at redhat dot com
2024-01-04 22:34 ` wcohen at redhat dot com
2024-01-05 14:50 ` mcermak at redhat dot com
2024-01-05 15:21 ` mcermak at redhat dot com
2024-01-08 14:56 ` wcohen at redhat dot com
2024-01-09 12:11 ` mcermak at redhat dot com
2024-01-10 12:37 ` mcermak at redhat dot com
2024-01-22 15:54 ` wcohen at redhat dot com
2024-01-23 18:24 ` wcohen at redhat dot com
2024-02-02 15:07 ` mcermak at redhat dot com
2024-02-02 15:15 ` mcermak at redhat dot com
2024-02-05 15:32 ` wcohen at redhat dot com
2024-03-04 16:52 ` wcohen at redhat dot com
2024-03-04 21:30 ` wcohen at redhat dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).