From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 418A83939C1F; Thu, 18 Jun 2020 03:27:13 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 418A83939C1F
From: "agentzh at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: systemtap@sourceware.org
Subject: [Bug runtime/26131] New: Relay flushing results in garbled output
 data being read by staprun
Date: Thu, 18 Jun 2020 03:27:12 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: systemtap
X-Bugzilla-Component: runtime
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: agentzh at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: systemtap at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-26131-6586@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: systemtap@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Systemtap mailing list <systemtap.sourceware.org>
List-Unsubscribe: <http://sourceware.org/mailman/options/systemtap>,
 <mailto:systemtap-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/systemtap/>
List-Help: <mailto:systemtap-request@sourceware.org?subject=help>
List-Subscribe: <http://sourceware.org/mailman/listinfo/systemtap>,
 <mailto:systemtap-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 18 Jun 2020 03:27:13 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=3D26131

            Bug ID: 26131
           Summary: Relay flushing results in garbled output data being
                    read by staprun
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

We observed some garbled data ('\0') would randomly appear in staprun's std=
out
output or `-o FILE` option's file output when the machine is under load. It=
 can
be demonstrated by the following simple stap script:

https://gist.github.com/agentzh/b44384ad9f3804d85ddfbf771094a718

When running this script's ko module with staprun in 20 ~ 30 threads, we can
get a sample output with `\0` in it every 30 sec to 3 min. One bad sample
output file is like this (generated by `staprun -o FILE`):

https://gist.github.com/agentzh/5967e0a9ce90e494ee0c88cb4d9ecef6

It is reproducible with a very recent systemtap git master (commit fb59e8c7=
5).

The `staprun -vvv` output for the bad runs is all similar to the following:

https://gist.github.com/agentzh/283a15f617f751cf9faa786632b1e516

We can see that the staprun receives `STP_REQUEST_EXIT` and sends back a
`STP_STOP` command back to ko before its reader thread has read anything fr=
om
the relay data debugfs file. The `reader_thread:237` log lines are added by
myself.

Digging deeper I found that there is a race condition in kernel's
`relay_switch_subbuf` function which is called by `relay_flush()` in system=
tap
runtime's `_stp_transport_data_fs_stop` function. To be more specific, it
happens on the following line of the kernel's `relay_switch_subbuf`:

```
irq_work_queue(&buf->wakeup_work);
```

When this line is being executed, the `relay_file_read()` function reads the
relay buf data structures which is partially set by the `relay_switch_subbu=
f`
function. The latter needs to set the following fields in the `struct
rchan_buf` data structure:

* buf->offset
* buf->padding[new_subbuf]
* buf->subbufs_produced
* buf->data

But at the point of the `irq_work_queue()` call, only `buf->subbufs_produce=
d`
is assigned, leaving `buf` in an inconsistent state, confusing
`relay_file_read()` to read the full newly switched empty buffer (which is
usually full of \0 bytes), as evidenced by the following dmesg messages
produced by my own `printk` calls in the systemtap runtime:

```
[203515.224386] return_30253: _stp_transport_data_fs_stop:334
[203515.224393] relay before flush: subbuf_size: 262144, n_subbufs: 8, chan
global? 1, prev padding: 0, subbufs produced: 0, subbufs consumed: 0, bytes
consumed: 21, offset: 21, paddings: 0 0, early bytes: 0
[203515.224531] return_30253: __stp_relay_switch_subbuf:106
[203515.224534] XXX line 106: subbufs_consumed changed! offset: 21, subbufs
consumed; 1
[203515.224537] return_30253: __stp_relay_subbuf_start_callback:225
[203515.224540] subbuf start returned 1: overwrite: 0, relay buf full: 0
[203515.224542] return_30253: _stp_transport_data_fs_stop:350
[203515.224547] relay after flush: subbufs produced: 1, subbufs consumed: 1,
bytes consumed: 0, offset: 0, paddings: 262123 0, early bytes: 0, buf =3D=3D
per_cpu? 1
```

Using the numbers in this dmesg output, we can successfully determine the
following execution sequence on the relay reader side (kernel space only):

```
relay_file_read
  relay_file_read_avail
    relay_file_read_consume
      relay_subbufs_consumed(buf->chan, buf->cpu, 1)
        buf->subbufs_consumed +=3D subbufs_consumed;  // subbufs_consumed =
=3D=3D 1
      buf->bytes_consumed =3D 0;
    return 1;
  read_start =3D relay_file_read_start_pos(*ppos, buf)  // *ppos =3D=3D 0
    read_pos =3D 262144; // subbuf_size
    return read_pos;
  // read_start =3D 262144;
  avail =3D relay_file_read_subbuf_avail(read_start, buf);  // avail =3D=3D=
 262144
```

where 262144 is the default subbuf size set by the stap runtime (65536*4).

After switching the `relay_flush()` call from the kernel version to systemt=
ap's
own version based on systemtap's own `__stp_relay_switch_subbuf()` can miti=
gate
the problem to the extend that it's very hard to reproduce the race with the
simple .stp script given above. This is because systemtap's
`__stp_relay_switch_subbuf()` intentionally avoids the `irq_work_queue()` c=
all.
But still, we can get garbled output using .stp scripts producing more outp=
ut
data, like this one:

```
global N =3D 20000

probe oneshot {
  exit()
}

probe end {
  for (i =3D 0; i < N; i++)
      printf("hello world %d !!! %d\n", i, i + 1)
}
```

This script generates 130895 bytes of data (about 128KB). This is due to the
fact that both implementations of `relay_switch_subbuf` has inherent race
conditions with the concurrent reader. There is no locking protection around
the statements setting those `buf->xxx` fields. So there's always a chance =
that
the reader would interrupt in the middle of that instruction sequence. The
right way would be introduce a spinlock around the critical section of
`relay_switch_subbuf` and also make the reader respect that.

The simplest patch would be remove the `relay_flush()` call from the
`_stp_transport_data_fs_stop` function. If no subbuf switching, then no rac=
es
at all. That flushing call is not necessary anyway because `relay_close()`
would defer the deallocation of the relay bufs until their reference counti=
ng
reaches zero. And load testing using the script above reveals no data loss =
in
the end of the output stream. This is not a complete fix even for systemtap
alone, however, because there are other places in the systemtap runtime whi=
ch
would call `__stp_relay_switch_subbuf()`, like in function
`_stp_data_write_reserve`. A second patch I would suggest is to reorder the
statements of the `buf` field assignments to minimize the race window even
though completely eliminating it would not be possible if not adding a lock
around the critical sections. Anyway, these two patches do not require kern=
el
patching and I'll prepare them shortly for review unless there are better
ideas.

As far as I can see, this racing bug exists in both older kernels like Cent=
OS
7's 3.10 kernels, and also newer kernels found in Fedora (or the kernel's g=
it
master).

Feedback welcome! Thanks!

--=20
You are receiving this mail because:
You are the assignee for the bug.=