public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* stap/eBPF language features brainstorm
@ 2018-07-11 20:13 Serhei Makarov
  2018-07-12 15:47 ` Serhei Makarov
  2018-07-19 15:27 ` William Cohen
  0 siblings, 2 replies; 5+ messages in thread
From: Serhei Makarov @ 2018-07-11 20:13 UTC (permalink / raw)
  To: systemtap

Some very rough ideas on how to jam further desirable SystemTap language features into the constraints of eBPF. The goal isn't for total feature-parity with the kernel backend (that's not feasible) but for the restrictions to be clear and understandable, e.g.:

- Strings are at most BPF_MAXSTRINGLEN=64 bytes.

- Certain operations (e.g. foreach) can only be done in userspace. It's clearly documented which probes run in userspace (e.g. begin, end, selected ?timer probes).

Unless otherwise specified, the below scenarios assume code running in the in-kernel eBPF interpreter which forbids looping and can only store data in local stack memory or eBPF maps. The userspace interpreter has a lot more flexibility, being able to perform arbitrary control-flow and define helpers that allocate userspace heap memory.

There are upstream eBPF and bcc planning docs which suggest intent to add bounded loops (https://github.com/iovisor/bcc/issues/574) and better string handling (https://github.com/iovisor/bcc/issues/691, https://github.com/iovisor/bcc/issues/900) and synchronization (https://github.com/iovisor/bcc/issues/1521) but the freshness / status of some of those plans was unclear to me.

# Current string handling

/* A string constant is pushed onto the stack (*r10) in 8 byte chunks (4 byte chunks also work): */
r1 = "Hello, "
*(u64 *)(r10-80)=r1
r1 = "World!\0\0"
*(u64 *)(r10-72)=r1
trace_printk(r10-80) /* pass as argument to a helper */

# Storing a string in an eBPF map

The string is loaded onto the stack as above, then its address is passed to bpf_map_{lookup,update}_elem.

# Loading kernel_string(addr) or user_string(addr) onto the stack

offset = 80
unrolled loop {
   *(u64 *)(r10-offset) = *(u64 *)addr
   addr += 8; offset -= 8
} until '\0' encountered or BPF_MAXSTRINGLEN bytes copied

/* loading kernel_string(addr) or user_string(addr) into an eBPF map is similar */

# String-indexed associative array

This seems to work straightforwardly with BPF_MAP_TYPE_HASH. Both the key and value are loaded onto the stack, then their addresses are passed to bpf_map_{lookup,update}_elem.

# Backtrace tapset

Backtraces can be recorded to eBPF's BPF_MAP_TYPE_STACK_TRACE. Backtrace analysis+printing can be handled in userspace. This requires replacing the trace_printk() functionality with a messaging protocol that offloads printing to userspace (PR22330), then adding a new message type that requests a backtrace to be printed.

# Timer probes

In addition to the perf-event based timers, there should be a timer probe that runs in user space. This is needed to support the common pattern of scripts that iterate through their associative arrays and print a report every N seconds.

# Statistical aggregates

These can be implemented with BPF_MAP_TYPE_PERCPU storing elements of type struct stat_data. Multiple aggregates can be stored in the same array, one per aggregate.

Then __stp_stat_add can be implemented in kernel-space as non-looping, non-locking eBPF code, while all other functions (reading stat value, histogram printing) can be implemented as userspace helpers that aggregate data from all CPUs.

As far as I can tell, it is not strictly necessary to lock a statistical aggregate when reading its value -- the kernel-module backend does this to guarantee a time-consistent snapshot of the different CPU's values, whereas without locking the result might be approximate.

# {TODO} More complex structures: arrays of statistical aggregates

Still investigating whether we can do this.

(0) There is no per-CPU version of BPF_HASH.

(1) A BPF_MAP_TYPE_PERCPU would be a contiguously indexed, preallocated array of aggregates, so a BPF_MAP_TYPE_HASH would be needed to map from sparse keys to indices into the BPF_MAP_TYPE_PERCPU. However, without synchronization, there is no way to allocate slots in the BPF_MAP_TYPE_PERCPU.

(2) /usr/include/linux/bpf.h mentions BPF_MAP_TYPE_HASH_OF_MAPS, but it's currently undocumented. Still need to read the code and investigate if it works for our purposes.

# More complex structures: multi-key associative arrays

It's theoretically possible to have a large struct as the key to a hash map. Thus, a key could literally be a structure with 2 (or more) embedded strings. It seemed to work fine when I used a 128-byte map key size, so two 64-byte string keys should be doable at the very least.

# Global variable locking semantics

Apparently not possible due to upstream eBPF limitations. I could not find any compare-and-swap-type operation for map elements. (bcc's lookup_or_init() compiles to code with potential to data-race. The only map modification helpers are lookup_elem, update_elem, and delete_elem, and any compare-and-swap code constructed from them will be subject to data races.)

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-07-19 17:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 20:13 stap/eBPF language features brainstorm Serhei Makarov
2018-07-12 15:47 ` Serhei Makarov
2018-07-19 15:27 ` William Cohen
2018-07-19 17:32   ` Serhei Makarov
2018-07-19 17:56     ` William Cohen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).