public inbox for systemtap@sourceware.org
* RFC: major runtime map changes
@ 2005-10-19 22:36 Martin Hunt
  2005-10-20 15:49 ` Frank Ch. Eigler
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Hunt @ 2005-10-19 22:36 UTC (permalink / raw)
  To: systemtap

I am rewriting parts of the maps (associative arrays).

The big change is support for per-cpu or aggregated maps (pmaps).  pmaps
are really independent maps, one for each cpu.  They are ideal for
storing statistics.  Writes are fast because only the map for the
current cpu is updated, but reads require locking each per-cpu map,
reading it, and adding the results together.

Typically, reading statistics would be done only at probe end time, or
at periodic intervals, as in "top".

There are two possible ways to implement the locking. The obvious method
would be to have writers get a spinlock on their cpu-local map.
Readers would then get the spinlock on each cpu-local map in turn.
Because spinlocks are fast, this should have minimal overhead.
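
A rough sketch of the first option (the structures and names here are
illustrative, not a final API):

  /* Sketch only: one spinlock per cpu-local map. */
  typedef struct {
          spinlock_t lock;
          MAP map;                        /* this cpu's private map */
  } pmap_cpu_t;

  /* writer: probe context, preemption already disabled */
  static void pmap_write_example (pmap_cpu_t *maps, int64_t key, int64_t val)
  {
          pmap_cpu_t *p = &maps[smp_processor_id ()];
          spin_lock (&p->lock);
          /* _stp_map_set_ii (p->map, key, val); -- current cpu only */
          spin_unlock (&p->lock);
  }

  /* reader: take each cpu's lock in turn and sum that map */
  static void pmap_read_example (pmap_cpu_t *maps, MAP agg)
  {
          int cpu;
          for_each_possible_cpu (cpu) {
                  spin_lock (&maps[cpu].lock);
                  /* add maps[cpu].map into agg */
                  spin_unlock (&maps[cpu].lock);
          }
  }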

A second option would be to have no locks for reading or writing.
Instead, each cpu would be required to read and write only its own
cpu-local map. This would require a read of a pmap to have a
workqueue read the local copy of the map for each cpu and update the
global statistics. A further optimization might be to have this global
statistics value for the pmap updated automatically (say every 200ms)
so that any read would just grab the cached value.
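
A sketch of the second option, assuming a delayed work item refreshes
the cached aggregate every 200ms (the workqueue calls are illustrative
and their exact form varies by kernel version):

  /* Sketch only: no locks on the write path; readers use a cached
     aggregation map that is refreshed periodically. */
  static struct delayed_work agg_work;
  static MAP agg_cache;                 /* what readers actually see */

  static void agg_refresh (struct work_struct *work)
  {
          /* have each cpu fold its local map into agg_cache, then
             re-arm ourselves for the next refresh */
          schedule_delayed_work (&agg_work, msecs_to_jiffies (200));
  }

  /* at startup:
   *   INIT_DELAYED_WORK (&agg_work, agg_refresh);
   *   schedule_delayed_work (&agg_work, msecs_to_jiffies (200));
   */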

I'm planning to implement the first option, then when I get time,
implement the second and do some scalability comparisons.

The second big change is that I am planning to deprecate the current
two-part api to read and write to maps. Currently you call a function
to set a key, then either read or write that key's value. This seemed
like a good idea before I implemented the macros that dynamically
create the needed map functions, back when we didn't know how many
datatypes we would support.

So instead of something like:
  _stp_map_key_int64_int64_str (map, 1, 2, "Ohio");
  _stp_map_set_str (map, "Columbus");
I am planning on implementing:
  _stp_map_set_iiss (map, 1, 2, "Ohio", "Columbus");
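
Decoded, the suffix just spells out the argument types in order: "iiss"
is two int64 keys and a string key, with a string value.  A matching
one-call read might look like this (the _stp_map_get_iiss spelling is a
guess, not a settled name):

  /* key tuple (1, 2, "Ohio"), string value "Columbus" */
  _stp_map_set_iiss (map, 1, 2, "Ohio", "Columbus");

  /* hypothetical matching read of the same key tuple */
  char *city = _stp_map_get_iiss (map, 1, 2, "Ohio");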

Reads present a slight complication for locking because for strings and
stats we need to either return a pointer to the data in the map, or
copy that data to a supplied buffer.  For the former, locks need to be
held until the data is no longer being accessed.  What would the
translator guys prefer?

I am planning to make the pmap API identical, except "pmap" will be used
in place of "map".  So, _stp_pmap_new(), _stp_pmap_get(), etc.
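
In outline, a pmap's lifecycle would then look like this sketch (the
argument lists and MAXENTRIES are illustrative):

  PMAP pm = _stp_pmap_new_iiss (MAXENTRIES);  /* allocate the per-cpu maps */

  /* in a probe: fast path, touches only the current cpu's map */
  _stp_pmap_set_iiss (pm, 1, 2, "Ohio", "Columbus");

  /* at probe end: aggregate the per-cpu maps, then print */
  _stp_pmap_print (pm, "count");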

comments?



* Re: RFC: major runtime map changes
  2005-10-19 22:36 RFC: major runtime map changes Martin Hunt
@ 2005-10-20 15:49 ` Frank Ch. Eigler
  2005-10-20 17:15   ` Martin Hunt
  2005-10-20 17:53   ` Martin Hunt
  0 siblings, 2 replies; 9+ messages in thread
From: Frank Ch. Eigler @ 2005-10-20 15:49 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap


hunt wrote:

> [...]  The big change is support for per-cpu or aggregated maps
> (pmaps).  pmaps are really independent maps, one for each cpu. [...]

How would iteration/sorting work on these pmaps, considering that
different per-cpu maps will in general have different sets of index
tuples?

> There are two possible ways to implement the locking. The obvious
> method would be to have writers get a spinlock on their cpu-local
> map.  [...]

Yes, and if the theory holds that reading (printing/querying)
statistics objects is rare, then these locks will suffer no
contention.  Is it necessary for the runtime to implement any of this
locking, or can it continue to leave that entire chore to the
translator?

> A second option would be to have no locks for reading or writing.
> [...] This would require a read of a pmap to have a workqueue read
> the local copy of the map for each cpu and update the global
> statistics. [...]

I suggest not even prototyping this until someone can produce evidence
that the first style imposes unacceptable performance constraints.

> The second big change is that I am planning to deprecate the current
> two-part api to read and write to maps. [...]

Hurrah.  Will this satisfy the needs of bug #1275, so that we can
read-lock foreach again?

> Reads present a slight complication for locking because for strings
> and stats we need to either return a pointer to the data in the map,
> or copy that data to a supplied buffer.  [...]

Copying to a translator-supplied buffer would be fine.  But again,
since the translator is happy to hold read/write locks as needed
around runtime calls, the runtime can also just return pointers and
let the generated code do the copy.


- FChE


* Re: RFC: major runtime map changes
  2005-10-20 15:49 ` Frank Ch. Eigler
@ 2005-10-20 17:15   ` Martin Hunt
  2005-10-20 17:55     ` Frank Ch. Eigler
  2005-10-20 17:53   ` Martin Hunt
  1 sibling, 1 reply; 9+ messages in thread
From: Martin Hunt @ 2005-10-20 17:15 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

On Thu, 2005-10-20 at 11:48 -0400, Frank Ch. Eigler wrote:
> hunt wrote:
> 
> > [...]  The big change is support for per-cpu or aggregated maps
> > (pmaps).  pmaps are really independent maps, one for each cpu. [...]
> 
> How would iteration/sorting work on these pmaps, considering that
> different per-cpu maps will in general have different sets of index
> tuples?

The per-cpu maps are first aggregated into one map. 

Creating a pmap will actually create N+1 maps, where N = the number of
cpus.  The extra map will be the aggregation map.  When
_stp_pmap_start(), _stp_pmap_print(), or _stp_pmap_sort() is called,
the per-cpu maps are summed into the aggregation map.
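
A sketch of that aggregation step (the struct layout and the
_stp_map_add helper are assumptions):

  /* Fold each cpu-local map into the extra aggregation map; the
     result can then be used like any ordinary map. */
  static MAP _stp_pmap_agg_sketch (PMAP pmap)
  {
          int cpu;

          _stp_map_clear (pmap->agg);
          for_each_possible_cpu (cpu) {
                  /* add every key/value pair from this cpu's map,
                     summing values whose keys are already present */
                  _stp_map_add (pmap->agg, pmap->map[cpu]);
          }
          return pmap->agg;
  }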

> > There are two possible ways to implement the locking. The obvious
> > method would be to have writers get a spinlock on their cpu-local
> > map.  [...]
> 
> Yes, and if the theory holds that reading (printing/querying)
> statistics objects is rare, then these locks will suffer no
> contention.  Is it necessary for the runtime to implement any of this
> locking, or can it continue to leave that entire chore to the
> translator?

I'll probably leave it to the translator. 

> > A second option would be to have no locks for reading or writing.
> > [...] This would require a read of a pmap to have a workqueue read
> > the local copy of the map for each cpu and update the global
> > statistics. [...]
> 
> I suggest not even prototyping this until someone can produce evidence
> that the first style imposes unacceptable performance constraints.
> 
> > The second big change is that I am planning to deprecate the current
> > two-part api to read and write to maps. [...]
> 
> Hurrah.  Will this satisfy the needs of bug #1275, so that we can
> read-lock foreach again?

Yes.

> > Reads prevent a slight complication for locking because for strings
> > and stats we need to either return a pointer to the data in the map,
> > or copy that data to a supplied buffer.  [...]
> 
> Copying to a translator-supplied buffer would be fine.  But again,
> since the translator is happy to hold read/write locks as needed
> around runtime calls, the runtime can also just return pointers and
> let the generated code do the copy.

Sounds good.

Martin



* Re: RFC: major runtime map changes
  2005-10-20 15:49 ` Frank Ch. Eigler
  2005-10-20 17:15   ` Martin Hunt
@ 2005-10-20 17:53   ` Martin Hunt
  2005-10-20 17:57     ` Frank Ch. Eigler
  1 sibling, 1 reply; 9+ messages in thread
From: Martin Hunt @ 2005-10-20 17:53 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

While we're talking about changes to maps, what should happen when an
_stp_map_set_xxx() fails due to lack of space?  Currently this cannot
happen, because maps are set to wrap by default: the oldest entry in
the map is cleared and reused.  There is a BZ asking for that to be
changed.

So which of the following should happen?

1. wraps and returns an error code
2. fails (does not store value) and returns error code
3. increments a "drops" counter for the map
4. calls _stp_warn()



* Re: RFC: major runtime map changes
  2005-10-20 17:15   ` Martin Hunt
@ 2005-10-20 17:55     ` Frank Ch. Eigler
  2005-10-20 18:20       ` Martin Hunt
  0 siblings, 1 reply; 9+ messages in thread
From: Frank Ch. Eigler @ 2005-10-20 17:55 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

Hi -

hunt wrote:

> > different per-cpu maps will in general have different sets of index
> > tuples?
> 
> The per-cpu maps are first aggregated into one map. 
> 
> Creating a pmap will actually create N+1 maps, where N = the number of
> cpus.  The extra map will be the aggregation map.  When
> _stp_pmap_start(), _stp_pmap_print(), or _stp_pmap_sort() is called,
> the per-cpu maps are summed into the aggregation map. [...]

OK.  Maybe it will be useful to avoid such copying, and do it
virtually, something like this (the first case is sketched after the
list):

- for pmap->lookup(key) iterate over map[cpu]->lookup(key) and aggregate
  on the fly

- for iteration, iterate over map[0], aggregating same keys over
  map[1..N], then follow that with an iteration over map[1] that skips
  those key tuples already handled for map[0], and so on.  The
  complexity of this may not be smaller than plain merge-then-iterate,
  but would eliminate data copying.

- for sorted iteration, sort each map[cpu], then do per-step
  "external merge"


- FChE


* Re: RFC: major runtime map changes
  2005-10-20 17:53   ` Martin Hunt
@ 2005-10-20 17:57     ` Frank Ch. Eigler
  0 siblings, 0 replies; 9+ messages in thread
From: Frank Ch. Eigler @ 2005-10-20 17:57 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

Hi -

hunt wrote:

> While we're talking about changes to maps, what should happen when an
> _stp_map_set_xxx() fails due to lack of space? [...]
> So which of the following should happen?

I suggest the simplest:

> 2. fails (does not store value) and returns error code
> [...]

and let the translator/user decide whether these are big errors or
little ones.  I expect this will tie into bug #1379.
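
In runtime terms that could be as simple as this sketch (field names
illustrative):

  /* Map full and wrap disabled: refuse the insert and let the
     generated code decide how serious the error is. */
  if (map->num >= map->maxnum && !map->wrap)
          return -1;                    /* value not stored */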

- FChE


* Re: RFC: major runtime map changes
  2005-10-20 17:55     ` Frank Ch. Eigler
@ 2005-10-20 18:20       ` Martin Hunt
  2005-10-26  1:08         ` Frank Ch. Eigler
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Hunt @ 2005-10-20 18:20 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

On Thu, 2005-10-20 at 13:55 -0400, Frank Ch. Eigler wrote:

> OK.  Maybe it will be useful to avoid such copying, and do it
> virtually, something like this:
> 
> - for pmap->lookup(key) iterate over map[cpu]->lookup(key) and aggregate
>   on the fly
> 
> - for iteration, iterate over map[0], aggregating same keys over
>   map[1..N], then follow that with an iteration over map[1] that skips
>   those key tuples already handled for map[0], and so on.  The
>   complexity of this may not be smaller than plain merge-then-iterate,
>   but would eliminate data copying.
> 
> - for sorted iteration, sort each map[cpu], then do per-step
>   "external merge"

That is much more complex than what I proposed and all it does is save
some storage space. Sorts become a huge mess; it is not as easy as you
make it sound. And we also lose the ability to just call a simple
aggregation function and then treat the output exactly as a normal map,
reusing all the current map functions.

While I'm thinking about it, do we want to preserve the per-cpu data?
Would it ever make sense to print the statistics not aggregated over all
cpus, but per-cpu?

Having a separate aggregation map allows us to do that. It also allows
us to switch to option 2 and cache that map so it doesn't have to be
recomputed all the time.

If we do not want per-cpu data kept, then we have an option of having
the local maps cleared after each aggregation operation, increasing
efficiency and allowing us to save space by having smaller local maps.

Martin



* Re: RFC: major runtime map changes
  2005-10-20 18:20       ` Martin Hunt
@ 2005-10-26  1:08         ` Frank Ch. Eigler
  2005-10-26  8:40           ` Martin Hunt
  0 siblings, 1 reply; 9+ messages in thread
From: Frank Ch. Eigler @ 2005-10-26  1:08 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

Hi -

hunt wrote:

> [...]
> > OK.  Maybe it will be useful to avoid such copying, and do it
> > virtually, something like this:
> > - [3 algorithms]
> 
> That is much more complex than what I proposed and all it does is save
> some storage space. 

In addition, it saves copying and metadata churn (rehashing, reallocation).

> Sorts become a huge mess; it is not as easy as you make it sound.

(I'm somewhat curious what extra complications you foresee.  Remember,
sorting is interesting only for purposes of subsequent iteration.)

> And we also lose the ability to just call a simple aggregation
> function and then treat the output exactly as a normal map, reusing
> all the current map functions.

If it turns out that the pmap api is not too far from the map api, the
translator will have no trouble with either.

> While I'm thinking about it, do we want to preserve the per-cpu
> data?  [...]  If we do not want per-cpu data kept, then we have an
> option of having the local maps cleared after each aggregation
> operation [...]

That could work well, if it turned out that aggregates were not
aggregated frequently.  :-)  What I mean is that if the access 
is like

   var[idx] <<< exp

and [idx] recurs a lot (which it should - that's the whole point),
then there is a cost to clearing the per-cpu table: the reallocation
of the same index tuple at the next "<<<".


Anyway, Graydon and I are talking over the whole statistics issue.  It
may be that all we will end up needing from the runtime are ordinary
maps that store binary blob values of some initialization-time-fixed
width.  It would leave all the mathematics etc. to be open-coded by
the translator, or else added to new runtime sources.


- FChE


* Re: RFC: major runtime map changes
  2005-10-26  1:08         ` Frank Ch. Eigler
@ 2005-10-26  8:40           ` Martin Hunt
  0 siblings, 0 replies; 9+ messages in thread
From: Martin Hunt @ 2005-10-26  8:40 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

On Tue, 2005-10-25 at 21:08 -0400, Frank Ch. Eigler wrote:
> [...]
> > Sorts become a huge mess; it is not as easy as you make it sound.
> 
> (I'm somewhat curious what extra complications you foresee.  Remember,
> sorting is interesting only for purposes of subsequent iteration.)

If you are sorting based on keys, then your proposal works fine.  But
the more useful case will be sorting based on aggregated values, in
which case your proposal of sorting each per-cpu map is just a waste of
time.

> > And we also lose the ability to just call a simple aggregation
> > function and then treat the output exactly as a normal map, reusing
> > all the current map functions.
> 
> If it turns out that the pmap api is not too far from the map api, the
> translator will have no trouble with either.

It will be almost identical.

> > While I'm thinking about it, do we want to preserve the per-cpu
> > data?  [...]  If we do not want per-cpu data kept, then we have an
> > option of having the local maps cleared after each aggregation
> > operation [...]
> 
> That could work well, if it turned out that aggregates were not
> aggregated frequently.  :-)  What I mean is that if the access 
> is like
> 
>    var[idx] <<< exp
> 
> and [idx] recurs a lot (which it should - that's the whole point),
> then there is a cost to clearing the per-cpu table: the reallocation
> of the same index tuple at the next "<<<".

Reallocation consists of removing a node from the head of one list and
inserting it at the end of two others, which is just some pointer
manipulation.  The bigger cost is copying the keys into the node, which
might be strings. 
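
For reference, the pointer manipulation in question, sketched with the
kernel list primitives (field names illustrative):

  /* Take a node off the free list and put it on both the hash
     chain and the iteration list -- no allocation happens. */
  struct map_node *n = list_entry (map->pool.next, struct map_node, lnode);

  list_del (&n->lnode);                     /* off the free list       */
  /* 'bucket' is the hash chain head computed from the new key */
  hlist_add_head (&n->hnode, bucket);       /* onto the hash chain     */
  list_add_tail (&n->lnode, &map->head);    /* onto the iteration list */

  /* the real cost: copying the keys (possibly strings) into n */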

> Anyway, Graydon and I are talking over the whole statistics issue.  It
> may be that all we will end up needing from the runtime are ordinary
> maps that store binary blob values of some initialization-time-fixed
> width.  It would leave all the mathematics etc. to be open-coded by
> the translator, or else added to new runtime sources.

I don't understand why you wouldn't want common code like that in the
runtime. 

Martin


