* [Bug runtime/31117] New: transport failures
@ 2023-12-06 15:39 mcermak at redhat dot com
2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
` (17 more replies)
0 siblings, 18 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2023-12-06 15:39 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
Bug ID: 31117
Summary: transport failures
Product: systemtap
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: runtime
Assignee: systemtap at sourceware dot org
Reporter: mcermak at redhat dot com
Target Milestone: ---
Created attachment 15241
--> https://sourceware.org/bugzilla/attachment.cgi?id=15241&action=edit
testcase patch allowing to demonstrate the problem
In the recent past, there were some problems with transport failures (PR29108).
This was fixed. But apparently the situation still isn't 100% perfect.
Pretty much
all the syscall tests are now failing on RHEL-9.4 with transport failures. It
is reliably reproducible. By default, the problem is hidden in the testsuite's
expect gates. To visualize it, I've used the attached patch. With that patch
applied (on top of systemtap commit b2a1261c434788cb and
kernel-5.14.0-394.el9.x86_64), I have:
el9 x86_64 # make installcheck RUNTESTFLAGS=syscall.exp CHECK_ONLY=futex
[ ... stuff deleted ... ]
Running /root/systemtap/testsuite/systemtap.syscall/syscall.exp ...
[ ... stuff deleted ... ]
Comparing ^futex: futex \(0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x[f]+, 0\) = [\-0-9]+
against futex: futex (0x0, FUTEX_CMP_REQUEUE, 1, 0, 0xffffffffffffffff, 0) =
-14 (EFAULT)
Comparing ^futex: futex \(0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x0, -1\) = [\-0-9]+
against futex: futex (0x0, FUTEX_CMP_REQUEUE, 1, 0, 0x0, -1) = WARNING: There
were 1 transport failures. Try stap -s to increase the buffer size from 16.
I've tested that simpler syscall tests (e.g. those having most of their
<testcase>.c content commented out) the test result is sometimes PASS and
sometimes FAIL. But using a "normal" syscall test involving 10+ system calls
and their checks, the failure is guaranteed..
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
@ 2023-12-06 21:17 ` wcohen at redhat dot com
2023-12-13 21:27 ` mcermak at redhat dot com
` (16 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-06 21:17 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
William Cohen <wcohen at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |wcohen at redhat dot com
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
@ 2023-12-13 21:27 ` mcermak at redhat dot com
2023-12-14 20:23 ` wcohen at redhat dot com
` (15 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2023-12-13 21:27 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
Martin Cermak <mcermak at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #15241|0 |1
is obsolete| |
--- Comment #1 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15258
--> https://sourceware.org/bugzilla/attachment.cgi?id=15258&action=edit
testcase patch allowing to demonstrate the problem
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
2023-12-13 21:27 ` mcermak at redhat dot com
@ 2023-12-14 20:23 ` wcohen at redhat dot com
2023-12-18 18:55 ` wcohen at redhat dot com
` (14 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-14 20:23 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
William Cohen <wcohen at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
--- Comment #2 from William Cohen <wcohen at redhat dot com> ---
I am able to see issues with the transport layer in a number of the systemtap
tests.
Looking through the systemtap.log There are a number of warnings. Another
example warning output below is from systemtap.maps/map_hash_SSI.stp failure
(which has a significant amount of output):
"WARNING: There were 2 transport failures. Try stap -s to increase the
buffer size from 16."
"WARNING: There were 0 lost messages and 14 lost bytes."
Frank mentioned that the problem might be tied to transferring data more than a
page (4096 bytes). Didn't see the issue when less than 4K, but then started
seeing it when size got to the 4K range.
Also saw cases where the lost messages counts looked questionable, such as the
one for clone:
clone: set_robust_list (0x7f54e661ba20, 24) = 0
clone: mprotect (0x7f54e67f2000, 16384, PROT_READ) = 0
clone: mprotect (0x7f54e68df000, 4096, PROT_READ) = 0
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=66: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=67: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=68: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=69: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=70: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=71: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=72: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=73: No data
available
stapio:reader_thread_serialmode:299: ERROR: unexpected sequence=74: No data
available
WARNING: There were 4294967284 lost messages and 0 lost bytes.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (2 preceding siblings ...)
2023-12-14 20:23 ` wcohen at redhat dot com
@ 2023-12-18 18:55 ` wcohen at redhat dot com
2023-12-21 21:43 ` wcohen at redhat dot com
` (13 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-18 18:55 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #3 from William Cohen <wcohen at redhat dot com> ---
I did some investigation on this with the map_hash_SSI.stp transport failures.
Some fixes of the diagnostic print code for the transport layer were made so it
compiles on fedora 39. The __stp_print_flush code expects there to be enough
room to write out the _stp_trace header (12 bytes) in the space returned by
_stp_data_write_reserve(). However, there are cases where
_stp_data_write_reserve() returned less space (5 and 9 bytes). The causes the
_stp_print_flush to fail and increment _stp_transport_failures. That certainly
needs to be fixed.
I had some additional code in the runtime to print out each time
_stp_print_flush had a failure. I didn't see that triggering with the sys.stp.
Looks like will need to additional diagnostic prints to figure out what is
going on with the sys.stp output.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (3 preceding siblings ...)
2023-12-18 18:55 ` wcohen at redhat dot com
@ 2023-12-21 21:43 ` wcohen at redhat dot com
2024-01-04 22:34 ` wcohen at redhat dot com
` (12 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2023-12-21 21:43 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #4 from William Cohen <wcohen at redhat dot com> ---
Thinking about this some more. One of the problems that occurs is the
situation where variable sized writes cause the remaining space in the buffer
returned by _stp_data_write_reserve to be smaller than the 12 bytes required by
the struct _stp_trace header. One thought is to make sure that the
_stp_data_write_reserve() provide allocations in 16 byte chunks that are are
aligned so when __stp_print_flush gets to the end of a subbuffer there is room
to write the header. However, the __stp_print_flush code assumes that
bytes_reserved returned by _stp_data_write_reserve() is <= size_request.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (4 preceding siblings ...)
2023-12-21 21:43 ` wcohen at redhat dot com
@ 2024-01-04 22:34 ` wcohen at redhat dot com
2024-01-05 14:50 ` mcermak at redhat dot com
` (11 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-04 22:34 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #5 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15285
--> https://sourceware.org/bugzilla/attachment.cgi?id=15285&action=edit
Patch reduce transport drops
The patch retries reserving space if it gets to the end of a subbuffer and
there isn't enough room for the header. There probably needs to be some tweaks
to the code in staprun/relay.c to not count the padding as lost bytes when it
is looking for the the STAP_TRACE_MAGIC.
The attached patch improves the test results for the systemtap.map tests that
were transporting more than PAGE_SIZE amount of data. It also improves the
test results for the syscall tests.
For:
sudo make installcheck RUNTESTFLAGS="systemtap.maps/*.exp"
Before the patch:
=== systemtap Summary ===
# of expected passes 83
# of unexpected failures 6
# of unsupported tests 1
=== systemtap Summary ===
# of expected passes 86
# of unexpected failures 3
# of unsupported tests 1
@@ -106 +95 @@
-FAIL: systemtap.maps/map_hash_SSI.stp
+PASS: systemtap.maps/map_hash_SSI.stp
@@ -108 +97 @@
-FAIL: systemtap.maps/map_hash_stat_II.stp
+PASS: systemtap.maps/map_hash_stat_II.stp
@@ -110 +99 @@
-FAIL: systemtap.maps/map_hash_stat_SI.stp
+PASS: systemtap.maps/map_hash_stat_SI.stp
The syscall test results look much better also. However, on RHEL9/f39 there
are a number of unrelated failures on x86_64 for the 32-bit tests because the
check for the syscall names/numbers added by git commit
7fe1b7bb8e665c9edd38eaf34a499618ee763293
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (5 preceding siblings ...)
2024-01-04 22:34 ` wcohen at redhat dot com
@ 2024-01-05 14:50 ` mcermak at redhat dot com
2024-01-05 15:21 ` mcermak at redhat dot com
` (10 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-05 14:50 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #6 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15287
--> https://sourceware.org/bugzilla/attachment.cgi?id=15287&action=edit
possible patch for the 32-bit tests subproblem
(In reply to William Cohen from comment #5)
> The attached patch improves the test results for the systemtap.map tests
> that were transporting more than PAGE_SIZE amount of data. It also improves
> the test results for the syscall tests.
Thank-you for looking into this Will!
> However, on RHEL9/f39 there
> are a number of unrelated failures on x86_64 for the 32-bit tests because
> the check for the syscall names/numbers added by git commit
> 7fe1b7bb8e665c9edd38eaf34a499618ee763293
Looking closer... Apparently on RHEL9 the @__compat_task macro <=
_stp_is_compat_task() <= _stp_is_compat_task2() in runtime/compatdefs.h
doesn't work right. Wow, I think I have a fix for that see attached. Here is
how my thoughts went:
I've tested the problem is in _stp_is_compat_task2(), that's in
runtime/compatdefs.h. I've found that a related systemtap commit is
ea81249b90b92a589d304c051efbc5bbad7ce073. This mentions kernel commit
ff170cd05953, so I've looked at that, and found that kernel function
is_64bit_mm() seems to be doing what I need. I've then looked at the modern
implementation of this function, and used that back in systemtap's
_stp_is_compat_task2(). And that seems to work on my specific test system,
which is 5.14.0-402.el9.x86_64.
The attached patch works for me on both RHEL8 and RHEL9. BUT ... this seems to
be a case where this upstream kernel change introduced in vanilla v6.3 was
backported to lower distribution kernel version, so the condition "#if
LINUX_VERSION_CODE < KERNEL_VERSION(5,0,0)" is a bit problematic..
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (6 preceding siblings ...)
2024-01-05 14:50 ` mcermak at redhat dot com
@ 2024-01-05 15:21 ` mcermak at redhat dot com
2024-01-08 14:56 ` wcohen at redhat dot com
` (9 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-05 15:21 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I've separated this aforementioned 32-bit tests sub-problem into PR31215.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (7 preceding siblings ...)
2024-01-05 15:21 ` mcermak at redhat dot com
@ 2024-01-08 14:56 ` wcohen at redhat dot com
2024-01-09 12:11 ` mcermak at redhat dot com
` (8 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-08 14:56 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
William Cohen <wcohen at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #15285|0 |1
is obsolete| |
--- Comment #8 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15289
--> https://sourceware.org/bugzilla/attachment.cgi?id=15289&action=edit
Partial fix to reduce transport drops
This WIP patch reduced the number of "transport failures" in the run of "make
installcheck" from 57 down to 6. It also reduced the number of failures in the
syscall.exp tests However, there still seems to be drops output in the tests.
When running the syscall.exp testsuite multiple times see the PASS change to
FAIL and vice versa from run to run. Maybe on shutdown not all the output is
getting flushed out.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (8 preceding siblings ...)
2024-01-08 14:56 ` wcohen at redhat dot com
@ 2024-01-09 12:11 ` mcermak at redhat dot com
2024-01-10 12:37 ` mcermak at redhat dot com
` (7 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-09 12:11 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #9 from Martin Cermak <mcermak at redhat dot com> ---
(In reply to William Cohen from comment #8)
> However, there still seems to be drops
> output in the tests. When running the syscall.exp testsuite multiple times
> see the PASS change to FAIL and vice versa from run to run. Maybe on
> shutdown not all the output is getting flushed out.
Agreed. E.g. the following is based on syscall testsuite artifacts:
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
4
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
0
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
WARNING: There were 0 lost messages and 2 lost bytes.
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 #
BUT ... I seem to see improvements when I put all the CPU cores offline but the
first one:
el8 x86_64 # for i in `seq 1 23`; do echo 0 >
/sys/devices/system/cpu/cpu$i/online; done
el8 x86_64 #
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 # staprun -R /tmp/syscall_longlonglong.ko -c /tmp/adjtimex | wc -l
62
el8 x86_64 #
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (9 preceding siblings ...)
2024-01-09 12:11 ` mcermak at redhat dot com
@ 2024-01-10 12:37 ` mcermak at redhat dot com
2024-01-22 15:54 ` wcohen at redhat dot com
` (6 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-01-10 12:37 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #10 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 15294
--> https://sourceware.org/bugzilla/attachment.cgi?id=15294&action=edit
Workaround dejagnu hack FTR
For my reference, I'm attaching a patch that helps me work around current
transport failures. It namely executes staprun repeatedly if needed as long as
it (probably) gives sufficient count of output lines. This is a very bad hack,
but it helps me work on syscall tests while the actual bug is unfixed. It also
gives better readable/debuggable output (subjective) compared to the default.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (10 preceding siblings ...)
2024-01-10 12:37 ` mcermak at redhat dot com
@ 2024-01-22 15:54 ` wcohen at redhat dot com
2024-01-23 18:24 ` wcohen at redhat dot com
` (5 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-22 15:54 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
William Cohen <wcohen at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #15289|0 |1
is obsolete| |
--- Comment #11 from William Cohen <wcohen at redhat dot com> ---
Created attachment 15317
--> https://sourceware.org/bugzilla/attachment.cgi?id=15317&action=edit
Patch request again for buffer to get contigous region for header and include
logic to avoid counting padding bytes as "lost bytes"
This significantly reduced the number of transport failures. With this patch
still saw one transport failure warning with the 32-bit syslog nd_syscall test.
Need to instrument the runtime and see where where that tranport failure is
coming from. The syscall tests are not writing out so much data to overrun the
buffer space reserved. Also there still seem to be some *syscall.exp tests that
have truncated/missing output that is causing the some individual tests to
fail.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (11 preceding siblings ...)
2024-01-22 15:54 ` wcohen at redhat dot com
@ 2024-01-23 18:24 ` wcohen at redhat dot com
2024-02-02 15:07 ` mcermak at redhat dot com
` (4 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-01-23 18:24 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #12 from William Cohen <wcohen at redhat dot com> ---
Checked in commit b6abf4bd012a444e625296127155877fd8737931. This helps with the
transport failure issues and significantly reduces the " WARNING: There were 1
transport failures." messages. However, when running syscall.exp still observed
a "WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16." and there seems to e some FAIL: tests that look to be missing
some output.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (12 preceding siblings ...)
2024-01-23 18:24 ` wcohen at redhat dot com
@ 2024-02-02 15:07 ` mcermak at redhat dot com
2024-02-02 15:15 ` mcermak at redhat dot com
` (3 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-02-02 15:07 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #13 from Martin Cermak <mcermak at redhat dot com> ---
Hello Will, I think your patches improve the situation significantly.
Currently on rhel9 x86_64 main branch, I'm getting:
# of expected passes 286
# of unexpected failures 54
When I apply the workaround from Comment #10 (which runs staprun repeatedly in
case of a clear loss of data), the test results on the same box are as follows:
# of expected passes 296
# of unexpected failures 44
Which means that the workaround doesn't make a huge difference any more.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (13 preceding siblings ...)
2024-02-02 15:07 ` mcermak at redhat dot com
@ 2024-02-02 15:15 ` mcermak at redhat dot com
2024-02-05 15:32 ` wcohen at redhat dot com
` (2 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: mcermak at redhat dot com @ 2024-02-02 15:15 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #14 from Martin Cermak <mcermak at redhat dot com> ---
On another - rhel8 x86_64 system, the situation is as follows: Clean main
branch stap code:
# of expected passes 158
# of unexpected failures 181
With patch from Comment #10 running staprun repeatedly in case of a clear data
loss:
# of expected passes 313
# of unexpected failures 26
This rhel8 system has 48 cores, while the rhel9 system from Comment 13 has only
20 cores. In this case, the workaround still appears to play a significant
role.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (14 preceding siblings ...)
2024-02-02 15:15 ` mcermak at redhat dot com
@ 2024-02-05 15:32 ` wcohen at redhat dot com
2024-03-04 16:52 ` wcohen at redhat dot com
2024-03-04 21:30 ` wcohen at redhat dot com
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-02-05 15:32 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #15 from William Cohen <wcohen at redhat dot com> ---
The fix merged in helps, but doesn't eliminate all the transport failures.
Looking through the RHEL8/9 runs I did this weekend see "transport failures".
Both had warning for systemtap.base PR14546. However, this test looks like it
might just have a huge volume of data as it is printing backtraces each time an
irq_* function is entered.
on RHEL9 see some "transport failures" on the syscall tests:
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit chmod nd_syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit handle_at nd_syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit openclose nd_syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit rt_signal nd_syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit statfs nd_syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 32-bit msg_queue syscall
For RHEL8 syscall tests it looks like the return probes failed to print for
mmap2 and read for these particular failures:
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 64-bit remap_file_pages syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
FAIL: 64-bit set_tid_address syscall
WARNING: There were 1 transport failures. Try stap -s to increase the buffer
size from 16.
Testing 64-bit setgroups syscall
Looking through the results syscall tests there certainly seem to be cases
where some output seems to be missing. On some seeing "WARNING: Child process
exited with status 255". Might be some other issue going on with some of the
syscall tests.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (15 preceding siblings ...)
2024-02-05 15:32 ` wcohen at redhat dot com
@ 2024-03-04 16:52 ` wcohen at redhat dot com
2024-03-04 21:30 ` wcohen at redhat dot com
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-03-04 16:52 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
--- Comment #16 from William Cohen <wcohen at redhat dot com> ---
The test results are much improved after the fix for padding the end of the
small sub-buffer and writing the new packet in a new sub-buffer. There are
some RHEL8 and RHEL9 tests that appear to create a great amount of data and
warnings about transport failures are triggered (pr14546 and
systemtap.examples/general/callgraph in RHEL8,
systemtap.examples/profiling/sched_switch run in RHEL9). The main place that
transport failures are still observed are the syscall tests.
f39
FAIL: 64-bit setitimer nd_syscall
FAIL: 32-bit sysctl nd_syscall
FAIL: 64-bit prctl syscall
FAIL: 32-bit aio syscall
FAIL: 32-bit prctl syscall
rawhide
FAIL: 32-bit arch_prctl nd_syscall
FAIL: 32-bit membarrier nd_syscall
FAIL: 32-bit arch_prctl syscall
FAIL: 32-bit exit_group syscall
FAIL: 32-bit modify_ldt syscall
FAIL: 32-bit openclose syscall
FAIL: 32-bit socketpair syscall
rhel8
FAIL: 32-bit memfd_create tp_syscall
rhel9
FAIL: 64-bit connect nd_syscall
FAIL: 64-bit readv nd_syscall
FAIL: 64-bit thread_area nd_syscall
FAIL: 32-bit syslog nd_syscall
FAIL: 64-bit fadvise64 syscall
FAIL: 64-bit getpeername syscall
FAIL: 64-bit getsockname syscall
FAIL: 64-bit readahead syscall
FAIL: 64-bit sigaltstack syscall
FAIL: 64-bit thread_area syscall
FAIL: 32-bit syslog syscall
There is a great deal of rhyme or reasons for which syscall tests are failing.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [Bug runtime/31117] transport failures
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
` (16 preceding siblings ...)
2024-03-04 16:52 ` wcohen at redhat dot com
@ 2024-03-04 21:30 ` wcohen at redhat dot com
17 siblings, 0 replies; 19+ messages in thread
From: wcohen at redhat dot com @ 2024-03-04 21:30 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=31117
William Cohen <wcohen at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|ASSIGNED |RESOLVED
--- Comment #17 from William Cohen <wcohen at redhat dot com> ---
This PR should be addressed by:
commit d1b74b12acc07f52c7d8f5b080176c485c34cd63 (HEAD -> master, origin/master,
origin/HEAD)
Author: William Cohen <wcohen@redhat.com>
Date: Mon Mar 4 16:27:18 2024 -0500
PR31117: Correct handling of transport layer allocated memory
The _stp_print_flush() code was not correct. There are four possible
ranges of values compared to the header size (hlen)
_stp_data_write_reserve() could return when beginning to write out
log:
<0 unable to allocate any space
<hlen pad out the allocated space and try another allocation
==hlen just enough space for the initial header
>hlen write out the header and some portion of log
The case where the space allocated was equal size of the header
(==hlen) was not handled correctly. In the cases where there was only
enough room to write the header the _stp_transport_failures variable
was incremented and none of the log data was written out. The correct
course of action in these cases would be to write the header out in
the allocated space and start looping to write the rest of the log
data.
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-03-04 21:31 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-06 15:39 [Bug runtime/31117] New: transport failures mcermak at redhat dot com
2023-12-06 21:17 ` [Bug runtime/31117] " wcohen at redhat dot com
2023-12-13 21:27 ` mcermak at redhat dot com
2023-12-14 20:23 ` wcohen at redhat dot com
2023-12-18 18:55 ` wcohen at redhat dot com
2023-12-21 21:43 ` wcohen at redhat dot com
2024-01-04 22:34 ` wcohen at redhat dot com
2024-01-05 14:50 ` mcermak at redhat dot com
2024-01-05 15:21 ` mcermak at redhat dot com
2024-01-08 14:56 ` wcohen at redhat dot com
2024-01-09 12:11 ` mcermak at redhat dot com
2024-01-10 12:37 ` mcermak at redhat dot com
2024-01-22 15:54 ` wcohen at redhat dot com
2024-01-23 18:24 ` wcohen at redhat dot com
2024-02-02 15:07 ` mcermak at redhat dot com
2024-02-02 15:15 ` mcermak at redhat dot com
2024-02-05 15:32 ` wcohen at redhat dot com
2024-03-04 16:52 ` wcohen at redhat dot com
2024-03-04 21:30 ` wcohen at redhat dot com
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).