From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 418A83939C1F; Thu, 18 Jun 2020 03:27:13 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 418A83939C1F From: "agentzh at gmail dot com" To: systemtap@sourceware.org Subject: [Bug runtime/26131] New: Relay flushing results in garbled output data being read by staprun Date: Thu, 18 Jun 2020 03:27:12 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: systemtap X-Bugzilla-Component: runtime X-Bugzilla-Version: unspecified X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: agentzh at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: systemtap at sourceware dot org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: systemtap@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Systemtap mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jun 2020 03:27:13 -0000 https://sourceware.org/bugzilla/show_bug.cgi?id=3D26131 Bug ID: 26131 Summary: Relay flushing results in garbled output data being read by staprun Product: systemtap Version: unspecified Status: UNCONFIRMED Severity: normal Priority: P2 Component: runtime Assignee: systemtap at sourceware dot org Reporter: agentzh at gmail dot com Target Milestone: --- We observed some garbled data ('\0') would randomly appear in staprun's std= out output or `-o FILE` option's file output when the machine is under load. It= can be demonstrated by the following simple stap script: https://gist.github.com/agentzh/b44384ad9f3804d85ddfbf771094a718 When running this script's ko module with staprun in 20 ~ 30 threads, we can get a sample output with `\0` in it every 30 sec to 3 min. One bad sample output file is like this (generated by `staprun -o FILE`): https://gist.github.com/agentzh/5967e0a9ce90e494ee0c88cb4d9ecef6 It is reproducible with a very recent systemtap git master (commit fb59e8c7= 5). The `staprun -vvv` output for the bad runs is all similar to the following: https://gist.github.com/agentzh/283a15f617f751cf9faa786632b1e516 We can see that the staprun receives `STP_REQUEST_EXIT` and sends back a `STP_STOP` command back to ko before its reader thread has read anything fr= om the relay data debugfs file. The `reader_thread:237` log lines are added by myself. Digging deeper I found that there is a race condition in kernel's `relay_switch_subbuf` function which is called by `relay_flush()` in system= tap runtime's `_stp_transport_data_fs_stop` function. To be more specific, it happens on the following line of the kernel's `relay_switch_subbuf`: ``` irq_work_queue(&buf->wakeup_work); ``` When this line is being executed, the `relay_file_read()` function reads the relay buf data structures which is partially set by the `relay_switch_subbu= f` function. The latter needs to set the following fields in the `struct rchan_buf` data structure: * buf->offset * buf->padding[new_subbuf] * buf->subbufs_produced * buf->data But at the point of the `irq_work_queue()` call, only `buf->subbufs_produce= d` is assigned, leaving `buf` in an inconsistent state, confusing `relay_file_read()` to read the full newly switched empty buffer (which is usually full of \0 bytes), as evidenced by the following dmesg messages produced by my own `printk` calls in the systemtap runtime: ``` [203515.224386] return_30253: _stp_transport_data_fs_stop:334 [203515.224393] relay before flush: subbuf_size: 262144, n_subbufs: 8, chan global? 1, prev padding: 0, subbufs produced: 0, subbufs consumed: 0, bytes consumed: 21, offset: 21, paddings: 0 0, early bytes: 0 [203515.224531] return_30253: __stp_relay_switch_subbuf:106 [203515.224534] XXX line 106: subbufs_consumed changed! offset: 21, subbufs consumed; 1 [203515.224537] return_30253: __stp_relay_subbuf_start_callback:225 [203515.224540] subbuf start returned 1: overwrite: 0, relay buf full: 0 [203515.224542] return_30253: _stp_transport_data_fs_stop:350 [203515.224547] relay after flush: subbufs produced: 1, subbufs consumed: 1, bytes consumed: 0, offset: 0, paddings: 262123 0, early bytes: 0, buf =3D=3D per_cpu? 1 ``` Using the numbers in this dmesg output, we can successfully determine the following execution sequence on the relay reader side (kernel space only): ``` relay_file_read relay_file_read_avail relay_file_read_consume relay_subbufs_consumed(buf->chan, buf->cpu, 1) buf->subbufs_consumed +=3D subbufs_consumed; // subbufs_consumed = =3D=3D 1 buf->bytes_consumed =3D 0; return 1; read_start =3D relay_file_read_start_pos(*ppos, buf) // *ppos =3D=3D 0 read_pos =3D 262144; // subbuf_size return read_pos; // read_start =3D 262144; avail =3D relay_file_read_subbuf_avail(read_start, buf); // avail =3D=3D= 262144 ``` where 262144 is the default subbuf size set by the stap runtime (65536*4). After switching the `relay_flush()` call from the kernel version to systemt= ap's own version based on systemtap's own `__stp_relay_switch_subbuf()` can miti= gate the problem to the extend that it's very hard to reproduce the race with the simple .stp script given above. This is because systemtap's `__stp_relay_switch_subbuf()` intentionally avoids the `irq_work_queue()` c= all. But still, we can get garbled output using .stp scripts producing more outp= ut data, like this one: ``` global N =3D 20000 probe oneshot { exit() } probe end { for (i =3D 0; i < N; i++) printf("hello world %d !!! %d\n", i, i + 1) } ``` This script generates 130895 bytes of data (about 128KB). This is due to the fact that both implementations of `relay_switch_subbuf` has inherent race conditions with the concurrent reader. There is no locking protection around the statements setting those `buf->xxx` fields. So there's always a chance = that the reader would interrupt in the middle of that instruction sequence. The right way would be introduce a spinlock around the critical section of `relay_switch_subbuf` and also make the reader respect that. The simplest patch would be remove the `relay_flush()` call from the `_stp_transport_data_fs_stop` function. If no subbuf switching, then no rac= es at all. That flushing call is not necessary anyway because `relay_close()` would defer the deallocation of the relay bufs until their reference counti= ng reaches zero. And load testing using the script above reveals no data loss = in the end of the output stream. This is not a complete fix even for systemtap alone, however, because there are other places in the systemtap runtime whi= ch would call `__stp_relay_switch_subbuf()`, like in function `_stp_data_write_reserve`. A second patch I would suggest is to reorder the statements of the `buf` field assignments to minimize the race window even though completely eliminating it would not be possible if not adding a lock around the critical sections. Anyway, these two patches do not require kern= el patching and I'll prepare them shortly for review unless there are better ideas. As far as I can see, this racing bug exists in both older kernels like Cent= OS 7's 3.10 kernels, and also newer kernels found in Fedora (or the kernel's g= it master). Feedback welcome! Thanks! --=20 You are receiving this mail because: You are the assignee for the bug.=