public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* stap/eBPF language features brainstorm
@ 2018-07-11 20:13 Serhei Makarov
  2018-07-12 15:47 ` Serhei Makarov
  2018-07-19 15:27 ` William Cohen
  0 siblings, 2 replies; 5+ messages in thread
From: Serhei Makarov @ 2018-07-11 20:13 UTC (permalink / raw)
  To: systemtap

Some very rough ideas on how to jam further desirable SystemTap language features into the constraints of eBPF. The goal isn't for total feature-parity with the kernel backend (that's not feasible) but for the restrictions to be clear and understandable, e.g.:

- Strings are at most BPF_MAXSTRINGLEN=64 bytes.

- Certain operations (e.g. foreach) can only be done in userspace. It's clearly documented which probes run in userspace (e.g. begin, end, selected ?timer probes).

Unless otherwise specified, the below scenarios assume code running in the in-kernel eBPF interpreter which forbids looping and can only store data in local stack memory or eBPF maps. The userspace interpreter has a lot more flexibility, being able to perform arbitrary control-flow and define helpers that allocate userspace heap memory.

There are upstream eBPF and bcc planning docs which suggest intent to add bounded loops (https://github.com/iovisor/bcc/issues/574) and better string handling (https://github.com/iovisor/bcc/issues/691, https://github.com/iovisor/bcc/issues/900) and synchronization (https://github.com/iovisor/bcc/issues/1521) but the freshness / status of some of those plans was unclear to me.

# Current string handling

/* A string constant is pushed onto the stack (*r10) in 8 byte chunks (4 byte chunks also work): */
r1 = "Hello, "
*(u64 *)(r10-80)=r1
r1 = "World!\0\0"
*(u64 *)(r10-72)=r1
trace_printk(r10-80) /* pass as argument to a helper */

# Storing a string in an eBPF map

The string is loaded onto the stack as above, then its address is passed to bpf_map_{lookup,update}_elem.

# Loading kernel_string(addr) or user_string(addr) onto the stack

offset = 80
unrolled loop {
   *(u64 *)(r10-offset) = *(u64 *)addr
   addr += 8; offset -= 8
} until '\0' encountered or BPF_MAXSTRINGLEN bytes copied

/* loading kernel_string(addr) or user_string(addr) into an eBPF map is similar */

# String-indexed associative array

This seems to work straightforwardly with BPF_MAP_TYPE_HASH. Both the key and value are loaded onto the stack, then their addresses are passed to bpf_map_{lookup,update}_elem.

# Backtrace tapset

Backtraces can be recorded to eBPF's BPF_MAP_TYPE_STACK_TRACE. Backtrace analysis+printing can be handled in userspace. This requires replacing the trace_printk() functionality with a messaging protocol that offloads printing to userspace (PR22330), then adding a new message type that requests a backtrace to be printed.

# Timer probes

In addition to the perf-event based timers, there should be a timer probe that runs in user space. This is needed to support the common pattern of scripts that iterate through their associative arrays and print a report every N seconds.

# Statistical aggregates

These can be implemented with BPF_MAP_TYPE_PERCPU storing elements of type struct stat_data. Multiple aggregates can be stored in the same array, one per aggregate.

Then __stp_stat_add can be implemented in kernel-space as non-looping, non-locking eBPF code, while all other functions (reading stat value, histogram printing) can be implemented as userspace helpers that aggregate data from all CPUs.

As far as I can tell, it is not strictly necessary to lock a statistical aggregate when reading its value -- the kernel-module backend does this to guarantee a time-consistent snapshot of the different CPU's values, whereas without locking the result might be approximate.

# {TODO} More complex structures: arrays of statistical aggregates

Still investigating whether we can do this.

(0) There is no per-CPU version of BPF_HASH.

(1) A BPF_MAP_TYPE_PERCPU would be a contiguously indexed, preallocated array of aggregates, so a BPF_MAP_TYPE_HASH would be needed to map from sparse keys to indices into the BPF_MAP_TYPE_PERCPU. However, without synchronization, there is no way to allocate slots in the BPF_MAP_TYPE_PERCPU.

(2) /usr/include/linux/bpf.h mentions BPF_MAP_TYPE_HASH_OF_MAPS, but it's currently undocumented. Still need to read the code and investigate if it works for our purposes.

# More complex structures: multi-key associative arrays

It's theoretically possible to have a large struct as the key to a hash map. Thus, a key could literally be a structure with 2 (or more) embedded strings. It seemed to work fine when I used a 128-byte map key size, so two 64-byte string keys should be doable at the very least.

# Global variable locking semantics

Apparently not possible due to upstream eBPF limitations. I could not find any compare-and-swap-type operation for map elements. (bcc's lookup_or_init() compiles to code with potential to data-race. The only map modification helpers are lookup_elem, update_elem, and delete_elem, and any compare-and-swap code constructed from them will be subject to data races.)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stap/eBPF language features brainstorm
  2018-07-11 20:13 stap/eBPF language features brainstorm Serhei Makarov
@ 2018-07-12 15:47 ` Serhei Makarov
  2018-07-19 15:27 ` William Cohen
  1 sibling, 0 replies; 5+ messages in thread
From: Serhei Makarov @ 2018-07-12 15:47 UTC (permalink / raw)
  To: systemtap

On Wed, Jul 11, 2018, at 4:13 PM, Serhei Makarov wrote:
> (1) A BPF_MAP_TYPE_PERCPU would be a contiguously indexed, preallocated 
> array of aggregates, so a BPF_MAP_TYPE_HASH would be needed to map from 
> sparse keys to indices into the BPF_MAP_TYPE_PERCPU. However, without 
> synchronization, there is no way to allocate slots in the 
> BPF_MAP_TYPE_PERCPU.

Note that eBPF does have an atomic increment operation in the form of BPF_XADD.
If it returned the value at the memory location (either before or after the increment)
like a compare-and-swap operation, then it could be used to allocate array slots
in a thread-safe fashion.

Alas, I can't find any indication in the docs that a value is returned. Rather, the in-kernel
testsuite (https://github.com/torvalds/linux/blob/4e33d7d47943aaa84a5904472cf2f9c6d6b0a6ca/lib/test_bpf.c#L4306)
specifies that there should be no side-effects (as far as I can decipher the testcases).
The purpose of BPF_XADD seems to be to make sure that two increment operations
don't 'cancel each other out' by racing to read the same memory location.

For example, see the sample eBPF program in
 http://www.man7.org/linux/man-pages/man2/bpf.2.html
-- which contains the following atomic increment of a counter:

BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *) r0 += r1 */
 
> (2) /usr/include/linux/bpf.h mentions BPF_MAP_TYPE_HASH_OF_MAPS, but 
> it's currently undocumented. Still need to read the code and investigate 
> if it works for our purposes.

This might still be an option.

> # Global variable locking semantics
> 
> Apparently not possible due to upstream eBPF limitations. I could not 
> find any compare-and-swap-type operation for map elements. (bcc's 
> lookup_or_init() compiles to code with potential to data-race. The only 
> map modification helpers are lookup_elem, update_elem, and delete_elem, 
> and any compare-and-swap code constructed from them will be subject to 
> data races.)

As suggested by Frank, we could implement probe exclusion with a sequence
such as the following:

BPF_XADD(&lock_counter, 1);
value = read(lock_counter);
if (value <= 1) { ... execute probe ... } else skip probe
BPF_XADD(&lock_counter, -1);

In this case, it is possible for two probes to *both* cancel each other's execution,
but it should not be possible for two probes to execute simultaneously.
Will need to test how well this works in practice (i.e. how likely are two probes
to mutually cancel?)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stap/eBPF language features brainstorm
  2018-07-11 20:13 stap/eBPF language features brainstorm Serhei Makarov
  2018-07-12 15:47 ` Serhei Makarov
@ 2018-07-19 15:27 ` William Cohen
  2018-07-19 17:32   ` Serhei Makarov
  1 sibling, 1 reply; 5+ messages in thread
From: William Cohen @ 2018-07-19 15:27 UTC (permalink / raw)
  To: Serhei Makarov, systemtap

On 07/11/2018 04:13 PM, Serhei Makarov wrote:
> Some very rough ideas on how to jam further desirable SystemTap language features into the constraints of eBPF. The goal isn't for total feature-parity with the kernel backend (that's not feasible) but for the restrictions to be clear and understandable, e.g.:

... 
> - Certain operations (e.g. foreach) can only be done in userspace. It's clearly documented which probes run in userspace (e.g. begin, end, selected ?timer probes).

A number of the systemtap script have idioms like the following from iotop.stp:

global io_stat,device
global read_bytes,write_bytes

...

probe timer.ms(5000) {
  /* write out data */
  /* clear data */
  delete io_stat
  delete device
  read_bytes = 0
  write_bytes = 0  
}

Is there some way to clear out all the previous entries in the bpf map from the timer probe?

-Will

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stap/eBPF language features brainstorm
  2018-07-19 15:27 ` William Cohen
@ 2018-07-19 17:32   ` Serhei Makarov
  2018-07-19 17:56     ` William Cohen
  0 siblings, 1 reply; 5+ messages in thread
From: Serhei Makarov @ 2018-07-19 17:32 UTC (permalink / raw)
  To: systemtap

On Thu, Jul 19, 2018, at 11:27 AM, William Cohen wrote:
> probe timer.ms(5000) {
>   /* write out data */
>   /* clear data */
>   delete io_stat
>   delete device
>   read_bytes = 0
>   write_bytes = 0  
> }
> 
> Is there some way to clear out all the previous entries in the bpf map 
> from the timer probe?
> 
> -Will

The only ways I can think of to 'clear out' a map from eBPF kernel space are fairly contrived.

However, in the case of scripts like iotop.stp, there is no reason to run that 5-second timer in kernel-space. I don't see why it would require perf-like precision, so it could be handled by the userspace eBPF interpreter, where such operations are easier to support by iterating through the map elements.

Last time we discussed this, Aaron mentioned some plans to implement support for userspace stapBPF timer probes. I'm not sure what the best option would be for specifying which timer probes should run in userspace and which ones should be handled in the kernel. Perhaps it could be decided automatically depending on whether userspace-only operations such as foreach or map_clear are detected in the probe body.

All the best,
      Serhei

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stap/eBPF language features brainstorm
  2018-07-19 17:32   ` Serhei Makarov
@ 2018-07-19 17:56     ` William Cohen
  0 siblings, 0 replies; 5+ messages in thread
From: William Cohen @ 2018-07-19 17:56 UTC (permalink / raw)
  To: Serhei Makarov, systemtap

On 07/19/2018 01:32 PM, Serhei Makarov wrote:
> On Thu, Jul 19, 2018, at 11:27 AM, William Cohen wrote:
>> probe timer.ms(5000) {
>>   /* write out data */
>>   /* clear data */
>>   delete io_stat
>>   delete device
>>   read_bytes = 0
>>   write_bytes = 0  
>> }
>>
>> Is there some way to clear out all the previous entries in the bpf map 
>> from the timer probe?
>>
>> -Will
> 
> The only ways I can think of to 'clear out' a map from eBPF kernel space are fairly contrived.
> 
> However, in the case of scripts like iotop.stp, there is no reason to run that 5-second timer in kernel-space. I don't see why it would require perf-like precision, so it could be handled by the userspace eBPF interpreter, where such operations are easier to support by iterating through the map elements
> 
> Last time we discussed this, Aaron mentioned some plans to implement support for userspace stapBPF timer probes. I'm not sure what the best option would be for specifying which timer probes should run in userspace and which ones should be handled in the kernel. Perhaps it could be decided automatically depending on whether userspace-only operations such as foreach or map_clear are detected in the probe body.
> 
> All the best,
>       Serhei
> 


The delete in userspace timer could keep a shadow array of the array and use the delta between the live bpf map array and the shadow array to allow behavior like the current delete operation.  However, this doesn't address the problem where the bpf array get full.

-Will

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-07-19 17:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 20:13 stap/eBPF language features brainstorm Serhei Makarov
2018-07-12 15:47 ` Serhei Makarov
2018-07-19 15:27 ` William Cohen
2018-07-19 17:32   ` Serhei Makarov
2018-07-19 17:56     ` William Cohen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).