public inbox for systemtap@sourceware.org
* Re: [pcp] suitability of PCP for event tracing
       [not found] <1780385660.592861283401180077.JavaMail.root@mail-au.aconex.com>
@ 2010-09-02  4:22 ` nathans
  2010-09-02  4:30   ` Greg Banks
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-09-02  4:22 UTC (permalink / raw)
  To: Greg Banks; +Cc: Frank Ch. Eigler, pcp, systemtap, Mark Goodwin


----- "Greg Banks" <gnb@evostor.com> wrote:

> Sure we could do it in pmproxy, but I don't see what it buys us other
> than not having to start one more daemon in the init script?

From someone who is administering a number of sites (i.e. me) that would
want to use both, it's a big win.  One less open port to register & worry
about, get to share all the code for dealing with multiplexing requests
already ... *shrug* ... why not?  Seems like a no-brainer choice - just
inject new code at pmproxy.c line 320 and 350 for web clients.

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-09-02  4:22 ` [pcp] suitability of PCP for event tracing nathans
@ 2010-09-02  4:30   ` Greg Banks
  0 siblings, 0 replies; 48+ messages in thread
From: Greg Banks @ 2010-09-02  4:30 UTC (permalink / raw)
  To: nathans; +Cc: Frank Ch. Eigler, pcp, systemtap, Mark Goodwin

nathans@aconex.com wrote:
> ----- "Greg Banks" <gnb@evostor.com> wrote:
>
>   
>> Sure we could do it in pmproxy, but I don't see what it buys us other
>> than not having to start one more daemon in the init script?
>>     
>
> From someone who is administering a number of sites (i.e. me) that would
> want to use both, it's a big win.  One less open port to register & worry
> about,
Oh, you want to run both protocols on the same port?  Wow, I was 
thinking a separate port, e.g. port 80 or 8000, for the new one.  So we 
don't have to futz around detecting which protocol is being used, and so 
that they can be firewalled separately.

>  get to share all the code for dealing with multiplexing requests
> already ... *shrug* ... why not?  Seems like a no-brainer choice - just
> inject new code at pmproxy.c line 320 and 350 for web clients.
>
>   
That main loop is the easiest 5% of the code involved :)

-- 
Greg.


* Re: [pcp] suitability of PCP for event tracing
  2010-12-03  9:40 ` nathans
@ 2010-12-03 11:13   ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-12-03 11:13 UTC (permalink / raw)
  To: nathans; +Cc: systemtap, pcp

On Fri, 2010-12-03 at 20:40 +1100, nathans@aconex.com wrote:
> ...
> > which reduces it to
> > 
> > int
> > pmRegisterAnon(char *name, int type)
> > 
> > which is not that different from what I've done.
> > 
> > How would that work?
> 
> Sounds fine.  Perhaps this should be an __pm routine though (this
> all sounds libpcp internal, right?) ... __pmRegisterStaticMetric?
> Or something along those lines.

Nod ... I like the "anonymous" adjective, so __pmRegisterAnon is likely
to be it.

> > Making the pmRegisterAnon() prototype more complicated than this
> > "minimalist" version would be really difficult given the way this is
> > spliced into the derived metrics definition parser, so I'd rather not
> > do
> > that unless there is a strong case (which the events record scenario
> > certainly does not suggest).
> 
> Shouldn't be a problem ... it seems an internal thing rather than a
> general-purpose user-exposed thing, so not really a problem that it
> allows the limited configurability.  The two uses we have would both
> have u32 metrics, I think ... so you could even drop the type at the
> current time and add it later if needed.

This is likely to be useful to a pmResult rewriter, so I'd rather keep
the type argument just in case ... it adds almost no complexity to the
implementation.

> Stepping back a bit, do we need to worry about the situation where we
> have multiple PMDAs (different domains) exporting trace data?  Is the
> one PMID for these metrics sufficient?  (there is only ever data in an
> event record array from one PMDA, correct?)

I don't think PMDAs can ever emit these things ... they should use the
metrics they really know about, or else the additional control information
in a pmEventRecord ... and on this topic, I'm moving to yet another
encoding of "missing records" in a pmEventRecord (if er_flags is
PM_ER_FLAG_MISSED then er_nparams is the number of missed records, not
the number of parameters for an event record).

The only place we're going to see the anonymous metrics is in the output
pmResult from pmUnpackEventRecords, which means pminfo -x, pmdumplog,
pmlogger, pmlogextract will all have to know about them, but no PMDA
should be using this stuff.

I think this will work OK in the case of multiple PMDAs emitting event
trace data, as event.flags and event.missing will be defined once and used
by pmUnpackEventRecords for all the PM_TYPE_EVENT data, no matter where it
comes from.

I sense I'm converging on something that is close to done here in terms
of definition ... thanks for your help and feedback.


* Re: [pcp] suitability of PCP for event tracing
       [not found] <534400126.208681291368854553.JavaMail.root@acxmail-au2.aconex.com>
@ 2010-12-03  9:40 ` nathans
  2010-12-03 11:13   ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-12-03  9:40 UTC (permalink / raw)
  To: kenj; +Cc: systemtap, pcp


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> On Fri, 2010-12-03 at 16:32 +1100, nathans@aconex.com wrote:
> > 
> Instead of always polluting the PMNS with anon.* metrics, I've made a
> less intrusive change that introduces pmRegisterAnon(void) to
> explicitly
> add the derived metrics for 3 anon metrics (32, 64 and DOUBLE).  So I
> still have anon.32 et al in the PMNS.
> 
> Now if we wanted to allow arbitrary names for the anon metrics, then
> I
> think pmRegisterAnon would have to be expanded ... below is overkill
> 
> int
> pmRegisterAnon(char *name, pmID pmid, pmInDom indom, int type, int
> sem,
> pmUnits units)
> 
> but would allow one to specify the additional metadata that cannot
> necessarily be deduced from a name like "event.records.flags" ... now
> some of these can be dropped by making unilateral decisions
> 
> pmid	- does it really matter? pmRegisterDerived (which sits behind
> all
> of this) makes the allocation in a way that avoids duplicate PMIDs

Don't think pmid matters, since name would be used by tools.

> indom	- there is no place for anonymous indoms in all of this,
> PM_INDOM_NULL is the only choice here
> 
> sem - since there are no PMDA-provided values, we could drop this
> also
> and stick with PM_SEM_DISCRETE as per what I've implemented
> 
> units - same argument as for sem, make this all zeros.

Yep.

> which reduces it to
> 
> int
> pmRegisterAnon(char *name, int type)
> 
> which is not that different from what I've done.
> 
> How would that work?

Sounds fine.  Perhaps this should be an __pm routine though (this
all sounds libpcp internal, right?) ... __pmRegisterStaticMetric?
Or something along those lines.

> Making the pmRegisterAnon() prototype more complicated than this
> "minimalist" version would be really difficult given the way this is
> spliced into the derived metrics definition parser, so I'd rather not
> do
> that unless there is a strong case (which the events record scenario
> certainly does not suggest).

Shouldn't be a problem ... it seems an internal thing rather than a
general-purpose user-exposed thing, so not really a problem that it
allows the limited configurability.  The two uses we have would both
have u32 metrics, I think ... so you could even drop the type at the
current time and add it later if needed.

Stepping back a bit, do we need to worry about the situation where we
have multiple PMDAs (different domains) exporting trace data?  Is the
one PMID for these metrics sufficient?  (there is only ever data in an
event record array from one PMDA, correct?)

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-12-03  5:32 ` nathans
@ 2010-12-03  6:08   ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-12-03  6:08 UTC (permalink / raw)
  To: nathans; +Cc: systemtap, pcp

On Fri, 2010-12-03 at 16:32 +1100, nathans@aconex.com wrote:
> 
> Yeah, sounds like this could indeed work & would be a lot more flexible.
> Not sure about the name "anon", nor the one metric per type - wouldn't
> it be better to have specific metrics for the specific items we know we
> need (so far, nmissed and flags, both of which would be u32 as well)?
> Then we can make the units/semantics/type make sense too.
> 
> And meaningful names like event.record.missed and event.record.flags
> (or something like that) might give a bit more of a clue as to what
> they are than anon.
> 
> > I'd appreciate thoughts on this.

Since I wrote the original "anonymous" metrics mail, I've done an
implementation and got everything working and passing QA (the latter was
harder than the former).

Instead of always polluting the PMNS with anon.* metrics, I've made a
less intrusive change that introduces pmRegisterAnon(void) to explicitly
add the derived metrics for 3 anon metrics (32, 64 and DOUBLE).  So I
still have anon.32 et al in the PMNS.

Now if we wanted to allow arbitrary names for the anon metrics, then I
think pmRegisterAnon would have to be expanded ... below is overkill

int
pmRegisterAnon(char *name, pmID pmid, pmInDom indom, int type, int sem,
pmUnits units)

but would allow one to specify the additional metadata that cannot
necessarily be deduced from a name like "event.records.flags" ... now
some of these can be dropped by making unilateral decisions

pmid	- does it really matter? pmRegisterDerived (which sits behind all
of this) makes the allocation in a way that avoids duplicate PMIDs

indom	- there is no place for anonymous indoms in all of this,
PM_INDOM_NULL is the only choice here

sem - since there are no PMDA-provided values, we could drop this also
and stick with PM_SEM_DISCRETE as per what I've implemented

units - same argument as for sem, make this all zeros.

which reduces it to

int
pmRegisterAnon(char *name, int type)

which is not that different from what I've done.

How would that work?

Making the pmRegisterAnon() prototype more complicated than this
"minimalist" version would be really difficult given the way this is
spliced into the derived metrics definition parser, so I'd rather not do
that unless there is a strong case (which the events record scenario
certainly does not suggest).



* Re: [pcp] suitability of PCP for event tracing
       [not found] <1949991220.207641291354253203.JavaMail.root@acxmail-au2.aconex.com>
@ 2010-12-03  5:32 ` nathans
  2010-12-03  6:08   ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-12-03  5:32 UTC (permalink / raw)
  To: kenj; +Cc: systemtap, pcp

Hi Ken,

----- "Ken McDonell" <kenj@internode.on.net> wrote:
> 
> I've been mulling over this issue (and the related matter of "missed"
> event records, where I don't think the implementation is sufficiently
> flexible).
> 
> I think _anything_ related to event records has to map onto a
> pmResult ... this is the only way avoid auxiliary data structures for
> pmUnpackEventRecords(), but _more_ importantly when pmlogextract
> knows
> how to unroll an archive and get all of the original pmResults and
> the
> event records pmResults temporally sorted in a single output archive
> the _only_ place to hide information will be in a pmResult.

OK, yep, makes sense.

> Try this one for size ...
> 
> Add a new sort of "anonymous" PCP metric, such that the PMID is
> hard-coded for a small number of anon metrics as is the pmDesc (I'm
> thinking one per simple data type, semantics discrete, no units, no
> instance domain) and the "name", e.g. anon.u32 for a PM_TYPE_U32
> metric with pmid 0:0:1.
> 
> Domain #0 is unused, so we have a convenient place in the range of
> PMIDs.

Yeah, sounds like this could indeed work & would be a lot more flexible.
Not sure about the name "anon", nor the one metric per type - wouldn't
it be better to have specific metrics for the specific items we know we
need (so far, nmissed and flags, both of which would be u32 as well)?
Then we can make the units/semantics/type make sense too.

And meaningful names like event.record.missed and event.record.flags
(or something like that) might give a bit more of a clue as to what
they are than anon.

> I'd appreciate thoughts on this.

Yep, I like the general idea, maybe with some cosmetic tweaks as above?

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-11-27  5:29   ` Ken McDonell
@ 2010-11-28 19:08     ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-11-28 19:08 UTC (permalink / raw)
  To: nathans; +Cc: pcp, systemtap

On Sat, 2010-11-27 at 16:28 +1100, Ken McDonell wrote:

> Try this one for size ...
> 
> Add a new sort of "anonymous" PCP metric, such that the PMID is
> hard-coded for a small number of anon metrics as is the pmDesc (I'm
> thinking one per simple data type, semantics discrete, no units, no
> instance domain) and the "name", e.g. anon.u32 for a PM_TYPE_U32 metric
> with pmid 0:0:1.

Turns out this is really simple to implement on the back of derived
metrics ...

$ diffstat -p1 </tmp/patch.anon
 src/libpcp/src/derive.c       |   57
+++++++++++++++++++++++++++++++++++++++---
 src/libpcp/src/derive.h       |    1 
 src/libpcp/src/derive_fetch.c |    4 ++
 3 files changed, 58 insertions(+), 4 deletions(-)

$ pminfo -df anon

anon.32
    Data Type: 32-bit int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!

anon.u32
    Data Type: 32-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!

anon.64
    Data Type: 64-bit int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!

anon.u64
    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!

anon.float
    Data Type: float  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!

anon.double
    Data Type: double  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: none
No value(s) available!



* Re: [pcp] suitability of PCP for event tracing
  2010-11-24 22:58 ` nathans
@ 2010-11-27  5:29   ` Ken McDonell
  2010-11-28 19:08     ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: Ken McDonell @ 2010-11-27  5:29 UTC (permalink / raw)
  To: nathans; +Cc: systemtap, pcp

On Thu, 2010-11-25 at 09:58 +1100, nathans@aconex.com wrote:
> ... 
> 
> > > | typedef struct {
> > > |     struct timeval	er_timestamp;
> > > |     int		er_nparams;	/* number of er_param[] entries */
> > > |     pmEventParameter	er_param[1];
> > > | } pmEventRecord;
> > > 
> > 
> > I considered er_flags but this seems to be dependent on prior
> > agreement of the producer and consumer, 
> 
> ...?  To my mind, it's the opposite.  The idea is that these flags
> encapsulate the generic characteristics of traces, which allows a
> generic client to make more sense of them.
> 
> For example, pmchart could do a decent job of displaying traces with
> these flags, but it has no chance without them (cannot tell what is
> a start event, what's an end event, nor how to build a hierarchy ...
> for an arbitrary trace).  Thinking specifically of the gantt chart
> (e.g. http://www.bootchart.org/images/bootchart.png) class of trace
> visualisation here, if that helps clarify wtf I'm on about.  That is
> feasible (and useful!) for arbitrary traces via er_flags, I think.
> 
> >  Also, there is no obvious place to store the
> > value when the packed array of event records is expanded (it cannot
> > go
> > in the pmResult, so needs another allocation and return parameter for
> > pmUnpackEventRecords, as you indicated later).
> 
> So, this seems to be more of an issue - yes it complicates the API a
> bit ... but it's the difference between something bland and something
> a lot more useful IMO.
> 
> > I'd prefer to see event "flags" or event "type" encoded in a specific
> > parameter of the event record.  This also needs prior agreement by
> > the
> > consumer and the producer, enables all of the same functionality that
> > er_flags would, and requires no additional API or data structure
> > changes.
> 
> The problem there, as you say, is that it means client tools must all
> be custom written (prior agreement)... there's really not much useful/
> generic that pmchart could do with this.

I've been mulling over this issue (and the related matter of "missed"
event records, where I don't think the implementation is sufficiently
flexible).

I think _anything_ related to event records has to map onto a
pmResult ... this is the only way to avoid auxiliary data structures for
pmUnpackEventRecords(), but _more_ importantly when pmlogextract knows
how to unroll an archive and get all of the original pmResults and the
event records pmResults temporally sorted in a single output archive the
_only_ place to hide information will be in a pmResult.

Try this one for size ...

Add a new sort of "anonymous" PCP metric, such that the PMID is
hard-coded for a small number of anon metrics as is the pmDesc (I'm
thinking one per simple data type, semantics discrete, no units, no
instance domain) and the "name", e.g. anon.u32 for a PM_TYPE_U32 metric
with pmid 0:0:1.

Domain #0 is unused, so we have a convenient place in the range of
PMIDs.

I think the only places where pmids and names are used is these routines
- pmLookupName, pmNameID, pmNameAll, pmLookupDesc, pmFetch and
pmLookupText - and I can see how to support anon metrics in all of these
places locally in libpcp with no reference to pmcd or an underlying
archive.  So we can stuff anon metrics into a pmResult without needing
any particular PMDA to provide the metadata for the anon metrics.

Now, pmUnpackEventRecords could _add_ anon.u32 to hold the er_flags
value if er_flags is not zero.

So pmchart calls pmUnpackEventRecords and checks each pmResult: if
anon.u32 is the first metric, then its value is er_flags, and Nathan is
off to the races.

And I could use something similar to encode ea_nmissed in situ so the
correct place in the event array could be marked ... I can see
implementation use cases where there are "missed" event records at the
start of the array (tail buffering), end of the array (quota buffering),
or middle of the array (resource pressure or other throttling of the
rate of event capture). 

I'd appreciate thoughts on this.



* Re: [pcp] suitability of PCP for event tracing
       [not found] <290445718.141091290639324786.JavaMail.root@acxmail-au2.aconex.com>
@ 2010-11-24 22:58 ` nathans
  2010-11-27  5:29   ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-11-24 22:58 UTC (permalink / raw)
  To: kenj; +Cc: systemtap, pcp


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> Typed strings is a concept that probably makes sense independent of
> the
> event records stuff, e.g. a simple metric that returns a
> PM_TYPE_STRING
> value where the producer and consumer agree the value is encoded in a
> particular way, e.g. JSON.

*nod*, very good point.

> One option would be to add more values to the sem (semantics) field
> of the pmDesc, e.g. PM_SEM_XML, PM_SEM_JSON, ... these additional
> semantic
> settings would need to imply PM_SEM_DISCRETE or PM_SEM_INSTANT.  This
> would
>       * make them available for all metrics (not just the parameter(s)
>         for an event record), and
>       * involve NO PDU, API or archive changes
>       * seem more natural as it is part of the semantic metadata for
>         the metric, rather than something to be encoded in the vtype
>         header
>         of a pmValueBlock (which is where the PM_TYPE_* data for the
>         composite data types are carried around in each fetch result)
> 
> Thoughts?

Yep, that does sound like a much better approach.

> > | typedef struct {
> > |     struct timeval	er_timestamp;
> > |     int		er_nparams;	/* number of er_param[] entries */
> > |     pmEventParameter	er_param[1];
> > | } pmEventRecord;
> > 
> 
> I considered er_flags but this seems to be dependent on prior
> agreement of the producer and consumer, 

...?  To my mind, it's the opposite.  The idea is that these flags
encapsulate the generic characteristics of traces, which allows a
generic client to make more sense of them.

For example, pmchart could do a decent job of displaying traces with
these flags, but it has no chance without them (cannot tell what is
a start event, what's an end event, nor how to build a hierarchy ...
for an arbitrary trace).  Thinking specifically of the gantt chart
(e.g. http://www.bootchart.org/images/bootchart.png) class of trace
visualisation here, if that helps clarify wtf I'm on about.  That is
feasible (and useful!) for arbitrary traces via er_flags, I think.

>  Also, there is no obvious place to store the
> value when the packed array of event records is expanded (it cannot
> go
> in the pmResult, so needs another allocation and return parameter for
> pmUnpackEventRecords, as you indicated later).

So, this seems to be more of an issue - yes it complicates the API a
bit ... but it's the difference between something bland and something
a lot more useful IMO.

> I'd prefer to see event "flags" or event "type" encoded in a specific
> parameter of the event record.  This also needs prior agreement by
> the
> consumer and the producer, enables all of the same functionality that
> er_flags would, and requires no additional API or data structure
> changes.

The problem there, as you say, is that it means client tools must all
be custom written (prior agreement)... there's really not much useful/
generic that pmchart could do with this.

> Not so ... the number of metrics per PMDA is 2^10 (items) x (2^12-1)
> (clusters) because I've only pinched one value for the cluster field.

Ah, my mistake - thats great!

> I suspect we need to wait for someone to use all this new functionality
> in anger before we will be in a position to consider tweaks to what
> has been done so far.

*nod*.  Will get a code review in too when time permits (not for a week
or two though).

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-11-10 23:49 ` nathans
  2010-11-11  1:46   ` Max Matveev
@ 2010-11-23 20:48   ` Ken McDonell
  1 sibling, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-11-23 20:48 UTC (permalink / raw)
  To: nathans; +Cc: systemtap, pcp

Now I'm done with the core implementation, I'm returning to some of the
peripheral issues raised in the (long) discussion on this topic ...

On Thu, 2010-11-11 at 10:49 +1100, nathans@aconex.com wrote:
> ...
> | There may be just one PMID and an associated PM_TYPE_STRING value
> | in cases where the event subsystem produces event records as 
> | structured strings, e.g. XML or JSON encodings. 
> 
> I think we should introduce PM_TYPE_JSON and PM_TYPE_XML so that the
> clients can differentiate these from unstructured strings... (and do
> the right thing, e.g. pcp-gui code can use Qt XML classes, QtScript
> for JSON, etc) - now would seem like a good time, since we're adding
> in PM_TYPE_EVENT to the set of types here.

I decided not to do this.  I'd like to keep PM_TYPE_EVENT as the sole
container for event records as this simplifies a lot of the new code.
Typed strings is a concept that probably makes sense independent of the
event records stuff, e.g. a simple metric that returns a PM_TYPE_STRING
value where the producer and consumer agree the value is encoded in a
particular way, e.g. JSON.

One option would be to add more values to the sem (semantics) field of
the pmDesc, e.g. PM_SEM_XML, PM_SEM_JSON, ... these additional semantic
settings would need to imply PM_SEM_DISCRETE or PM_SEM_INSTANT.  This
would
      * make them available for all metrics (not just the parameter(s)
        for an event record), and
      * involve NO PDU, API or archive changes
      * seem more natural as it is part of the semantic metadata for the
        metric, rather than something to be encoded in the vtype header
        of a pmValueBlock (which is where the PM_TYPE_* data for the
        composite data types are carried around in each fetch result)

Thoughts?

> | typedef struct {
> |     struct timeval	er_timestamp;
> |     int		er_nparams;	/* number of er_param[] entries */
> |     pmEventParameter	er_param[1];
> | } pmEventRecord;
> 
> I think we need an "er_type" or perhaps an "er_flags" field above too
> (there's room there for it, above nparams, so I'm getting in early!)
> 
> This might be "enum { PM_EVENT_POINT, PM_EVENT_START, PM_EVENT_END }",
> which will let clients know what class of trace data this is, which is
> not captured anywhere in the current proposal.
> 
> I'm thinking "er_flags" cos I think it will be useful for a client to
> be able to tell in a generic way whether there are identifiers, so we
> could extend that set to include PM_EVENT_TRACEID and PM_EVENT_PARENTID,
> optionally signifying "1st parameter is a unique trace ID", and (also
> optionally) "2nd parameter is the parent ID" ... ?  (or, something like
> that, anyway ...).

I considered er_flags but this seems to be dependent on prior agreement
of the producer and consumer, and other than transporting this in PDUs
and archives and across the APIs there is nothing PCP needs to do based
on the er_flags value.  Also, there is no obvious place to store the
value when the packed array of event records is expanded (it cannot go
in the pmResult, so needs another allocation and return parameter for
pmUnpackEventRecords, as you indicated later).

I'd prefer to see event "flags" or event "type" encoded in a specific
parameter of the event record.  This also needs prior agreement by the
consumer and the producer, enables all of the same functionality that
er_flags would, and requires no additional API or data structure
changes.

> | ... PM_CLUSTER_ERA ...
> 
> That name is confusing (ERA being both a word and a TLA).  Maybe just make
> another pmid_* interface (like pmid_item, pmid_cluster, pmid_domain)...
> and/or a #define alongside DYNAMIC_PMID like EVENTRECORD_PMID perhaps?

Agreed, the macro became PM_CLUSTER_EVENT.

> static inline int pmid_event_record_array(pmID id)
> {
>     return __pmid_int(&id)->cluster == EVENT_RECORD_PMID;
> }
> 
> Probably we should replace the open-coded uses of DYNAMIC_PMID with a new
> pmid_dynamic() routine too to keep those details all together in impl.h.

Not sure that this is worth it, the value is used only once in the
sample PMDA so far.  None of the libraries or tools need to know or care
about the specialness of this PMID encoding.

> Note that using -1 as the value for "ERA" limits PMDAs to 2^10 metrics...
> probably OK I guess (not great), not sure there's any other better way to
> attack this problem though... hmmm.

Not so ... the number of metrics per PMDA is 2^10 (items) x (2^12-1)
(clusters) because I've only pinched one value for the cluster field.
> 
> | Everywhere PM_TYPE_AGGREGATE is used in the existing code ...
> 
> This table doesn't cover libqmc in pcp-gui, where there's one extra use.

I haven't done the pcp-gui changes yet.

> | pmUnpackEventRecords(pmValueBlock *vbp, pmResult **rec, int *nmissed)
> 
> So, if the above er_flags is introduced, I think this needs to become
> more like those namespace routines (with optional status array) ...
> int pmUnpackEventRecords(pmValueBlock *vbp, pmResult **rec, int **flags, int *nmissed)

Don't like this ... see above.

> Either way, I think there also needs to be a
> void pmFreeEventRecords(int count, pmResult *rec, int *flags)

I added the convenience function pmFreeEventRecords() with one pmResult
** parameter (pmUnpackEventRecords returns an array of pmResult pointers
that is NULL terminated, so count is not needed).

> 
> And finally, pedantically, the strace example - if/when this PMDA is
> written, PMID 62.0.0 should definately have string type - the numbers
> change depending on Linux architecture, so this decoding should really
> be done in the PMDA.  OK, I did say pedantic!

Nod and wink.

Thanks for the feedback Nathan.

I suspect we need to wait for someone to use all this new functionality
in anger before we will be in a position to consider tweaks to what has
been done so far.


* Re: [pcp] suitability of PCP for event tracing
  2010-11-10 23:49 ` nathans
@ 2010-11-11  1:46   ` Max Matveev
  2010-11-23 20:48   ` Ken McDonell
  1 sibling, 0 replies; 48+ messages in thread
From: Max Matveev @ 2010-11-11  1:46 UTC (permalink / raw)
  To: nathans; +Cc: kenj, pcp, systemtap

On Thu, 11 Nov 2010 10:49:07 +1100 (EST), nathans  wrote:

 >> There may be just one PMID and an associated PM_TYPE_STRING value
 >> in cases where the event subsystem produces event records as 
 >> structured strings, e.g. XML or JSON encodings. 

 nathans> I think we should introduce PM_TYPE_JSON and PM_TYPE_XML so that the
 nathans> clients can differentiate these from unstructured strings...

If we're going to do that can we also agree who owns the memory allocated
to hold the string and who is responsible for freeing it? The current idea
that the pmda callback owns the memory makes life harder for dynamically
allocated strings. In fact, I think adding something like
PM_TYPE_YOUFREE_STRING would make life easier for PMDA writers.

max


* Re: [pcp] suitability of PCP for event tracing
       [not found] <1565492777.26861289432902163.JavaMail.root@acxmail-au2.aconex.com>
@ 2010-11-10 23:49 ` nathans
  2010-11-11  1:46   ` Max Matveev
  2010-11-23 20:48   ` Ken McDonell
  0 siblings, 2 replies; 48+ messages in thread
From: nathans @ 2010-11-10 23:49 UTC (permalink / raw)
  To: kenj; +Cc: systemtap, pcp


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> On Mon, 2010-10-11 at 19:02 +1100, Ken McDonell wrote:
> > It has taken me a while to work through all of the issues here, not
> to
> > mention more distractions than you could hit with a big stick.
> > 
> > Anyway putting an array of event records inside a pmResult is not
> going
> > to be too hard - see http://oss.sgi.com/~kenj/event-params.html
> 
> Now would be a good time for feedback if you have any.

Read through it again, looks good - here's some feedback...

| There may be just one PMID and an associated PM_TYPE_STRING value
| in cases where the event subsystem produces event records as 
| structured strings, e.g. XML or JSON encodings. 

I think we should introduce PM_TYPE_JSON and PM_TYPE_XML so that the
clients can differentiate these from unstructured strings... (and do
the right thing, e.g. pcp-gui code can use Qt XML classes, QtScript
for JSON, etc) - now would seem like a good time, since we're adding
in PM_TYPE_EVENT to the set of types here.

| ... Similarly that we do not need the inst field ...

(typo, no need for "that" there)

| typedef struct {
|     struct timeval	er_timestamp;
|     int		er_nparams;	/* number of er_param[] entries */
|     pmEventParameter	er_param[1];
| } pmEventRecord;

I think we need an "er_type" or perhaps an "er_flags" field above too
(there's room there for it, above nparams, so I'm getting in early!)

This might be "enum { PM_EVENT_POINT, PM_EVENT_START, PM_EVENT_END }",
which will let clients know what class of trace data this is, which is
not captured anywhere in the current proposal.

I'm thinking "er_flags" cos I think it will be useful for a client to
be able to tell in a generic way whether there are identifiers, so we
could extend that set to include PM_EVENT_TRACEID and PM_EVENT_PARENTID,
optionally signifying "1st parameter is a unique trace ID", and (also
optionally) "2nd parameter is the parent ID" ... ?  (or, something like
that, anyway ...).

| typedef struct {
|     int		ea_nrecords;	/* number of ea_record[] entries */
|     int		ea_nmissed;	/* number of missed event records */
|     pmEventRecord	*ea_record[1];
| } pmEventArray;

Is ea_record being a pointer array a typo or intentional...?

| ... PM_CLUSTER_ERA ...

That name is confusing (ERA being both a word and a TLA).  Maybe just make
another pmid_* interface (like pmid_item, pmid_cluster, pmid_domain)...
and/or a #define alongside DYNAMIC_PMID like EVENTRECORD_PMID perhaps?

static inline int pmid_event_record_array(pmID id)
{
    return __pmid_int(&id)->cluster == EVENT_RECORD_PMID;
}

Probably we should replace the open-coded uses of DYNAMIC_PMID with a new
pmid_dynamic() routine too to keep those details all together in impl.h.

Note that using -1 as the value for "ERA" limits PMDAs to 2^10 metrics...
probably OK I guess (not great), not sure there's any other better way to
attack this problem though... hmmm.

| Everywhere PM_TYPE_AGGREGATE is used in the existing code ...

This table doesn't cover libqmc in pcp-gui, where there's one extra use.

| pmUnpackEventRecords(pmValueBlock *vbp, pmResult **rec, int *nmissed)

So, if the above er_flags is introduced, I think this needs to become
more like those namespace routines (with optional status array) ...
int pmUnpackEventRecords(pmValueBlock *vbp, pmResult **rec, int **flags, int *nmissed)

Either way, I think there also needs to be a
void pmFreeEventRecords(int count, pmResult *rec, int *flags)


And finally, pedantically, the strace example - if/when this PMDA is
written, PMID 62.0.0 should definitely have string type - the numbers
change depending on Linux architecture, so this decoding should really
be done in the PMDA.  OK, I did say pedantic!

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-10-11  8:02                   ` Ken McDonell
  2010-10-11 12:34                     ` Nathan Scott
@ 2010-11-10  0:43                     ` Ken McDonell
  1 sibling, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-11-10  0:43 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Greg Banks, nathans, systemtap, pcp

On Mon, 2010-10-11 at 19:02 +1100, Ken McDonell wrote:
> It has taken me a while to work through all of the issues here, not to
> mention more distractions than you could hit with a big stick.
> 
> Anyway putting an array of event records inside a pmResult is not going
> to be too hard - see http://oss.sgi.com/~kenj/event-params.html

I've been investigating the event record data issues again.  The design
document at the URL above is now much more detailed and I think captures
the scope of all the work needed (as far as I understand it, ahead of
writing any real code).

Now would be a good time for feedback if you have any.

I think the execution of these changes is a bit messy rather than hard.

Cheers, Ken.


* Re: [pcp] suitability of PCP for event tracing
       [not found] <1362202390.1923851286924784463.JavaMail.root@mail-au.aconex.com>
@ 2010-10-12 23:07 ` nathans
  0 siblings, 0 replies; 48+ messages in thread
From: nathans @ 2010-10-12 23:07 UTC (permalink / raw)
  To: kenj; +Cc: Greg Banks, systemtap, pcp, Frank Ch. Eigler


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> Nathan,
> 
> Thanks for the feedback.

No worries.  Thanks for the examples, big help.

> On Mon, 2010-10-11 at 23:33 +1100, Nathan Scott wrote:
> > ----- "Ken McDonell" <kenj@internode.on.net> wrote:
> > 
> > > It has taken me a while to work through all of the issues here,
> not
> > > to
> > > mention more distractions than you could hit with a big stick.
> > > 
> > > Anyway putting an array of event records inside a pmResult is not
> > > going to be too hard - see
> http://oss.sgi.com/~kenj/event-params.html
> > 
> > | Event trace data includes the following for each event:
> > | timestamp
> > | event type
> > | event parameters (optional)
> > | The event type is really just another event parameter so we will
> not
> > | consider this as a separate data type, and event data consists of
> a
> > | timestamp and one or more parameters.
> > 
> > IMO, we also need Unique ID (refer X-trace and Dapper papers) and
> parent ID.
> > Preferably not optional (could be inserted in PCP layers if no
> > underlying tracing support for end-to-end style tracing).
> 
> I don't think tree-structured traces as in X-trace or Dapper are
> sufficiently generic that they warrant explicit support in the PCP
> infrastructure.  I've looked at the X-Trace data model, and it maps

Hmmm, not yet completely convinced because...

> simply to the PCP data model (TaskID is simply an event parameter
> that is repeated and has the same unique value in all events that are part
> of the same tree).
> 
> I've updated the proposal document to include some examples,
> including X-Trace.

Ah, that's great, esp. the strace example.  The problem I think we have
there is that there is no distinction between start and end of an event
and no way to say "this event completes that earlier event".

If we dive a bit deeper into the strace example, that's implemented on
Linux using ptrace(PTRACE_SYSCALL,...) - and the start (syscall addr &
args) and end (return code) are separately generated trace events ... I
think we really need to preserve that notion (2 separate events, in PCP,
not merging 'em into just one at the end) and so some mechanism will be
needed to be able to associate the two events later (a unique ID, ala
X-Trace/Dapper might be the go).  Perhaps this can be a mandatory metric
- 62.0.0 in strace example, before syscall ... not sure, but still seems
to me like it's needed.

Following on from that, it's not clear if PM_TYPE_EVENT is enough; we may
need START/IN-PROGRESS/FINISH granularity too, so client tools can make
sense of the flow for presenting the data sensibly.

> > I think in practice this will prove a bit optimistic (so much data,
> classifying
> > it all - all parameters, flags, etc - to the tight PCP definitions
> ... not sure
> > this is really feasible, or needed). ...
> 
> The number of classifications may be quite small, e.g. the generic
> data
> types in the strace example.  We don't need to capture deep semantics
> here, but there is some advantage in capturing shallow semantics.  I
> suspect that the tools that process the data are the places where the
> deep semantics is needed, and this is likely to be embedded within
> those applications rather than the PCP data stream.
> 

On this one, the strace example has convinced me it's doable & indeed,
as you say, the tracing PMDA author can choose the level of granularity
and/or build it up over time (e.g. on strace again, start with a core
set of syscalls of interest which are explicitly decoded, and for the
rest pass out an AGGREGATE and/or just the syscall identifier).

> > ...
> > I think a composite metric type like JSON
> ....
> This is not precluded by the proposal ... we can encode arbitrary
> strings as the "parameters" of an event if that makes most sense.

*nod*

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-10-11 12:34                     ` Nathan Scott
@ 2010-10-12 20:37                       ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-10-12 20:37 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Greg Banks, systemtap, pcp, Frank Ch. Eigler

Nathan,

Thanks for the feedback.

On Mon, 2010-10-11 at 23:33 +1100, Nathan Scott wrote:
> ----- "Ken McDonell" <kenj@internode.on.net> wrote:
> 
> > It has taken me a while to work through all of the issues here, not
> > to
> > mention more distractions than you could hit with a big stick.
> > 
> > Anyway putting an array of event records inside a pmResult is not
> > going to be too hard - see http://oss.sgi.com/~kenj/event-params.html
> 
> | Event trace data includes the following for each event:
> | timestamp
> | event type
> | event parameters (optional)
> | The event type is really just another event parameter so we will not
> | consider this as a separate data type, and event data consists of a
> | timestamp and one or more parameters.
> 
> IMO, we also need Unique ID (refer X-trace and Dapper papers) and parent ID.
Preferably not optional (could be inserted in PCP layers if no underlying
tracing support for end-to-end style tracing).

I don't think tree-structured traces as in X-trace or Dapper are
sufficiently generic that they warrant explicit support in the PCP
infrastructure.  I've looked at the X-Trace data model, and it maps
simply to the PCP data model (TaskID is simply an event parameter that
is repeated and has the same unique value in all events that are part of
the same tree).

I've updated the proposal document to include some examples, including
X-Trace.

> Not sure what the "event types" are that you're referring to?  An example trace
> (e.g. something simple - blktrace trace, or strace/ptrace info) encoded as
> PCP results/PDUs would really help me here I think.

I've added an strace example.

> | We want each of the data components to be a PCP metric, so we can leverage all
> | of the metadata services to describe the type and semantics of the event
> | data. So each data component would have a PMID (and a corresponding name
> | in the external PMNS).
> 
> I think in practice this will prove a bit optimistic (so much data, classifying
> it all - all parameters, flags, etc - to the tight PCP definitions ... not sure
> this is really feasible, or needed). ...

The number of classifications may be quite small, e.g. the generic data
types in the strace example.  We don't need to capture deep semantics
here, but there is some advantage in capturing shallow semantics.  I
suspect that the tools that process the data are the places where the
deep semantics is needed, and this is likely to be embedded within those
applications rather than the PCP data stream.

> ...
> I think a composite metric type like JSON
> (readable) or perhaps XML might be a good option to provide for tracers.
> Microsoft went with XML in ETW presentation tools FWIW, I think, but I'd lean
> more toward JSON having coded both forms recently - they would have chosen XML
> about 10 years ago and might not choose it a second time ;) given the chance.

This is not precluded by the proposal ... we can encode arbitrary
strings as the "parameters" of an event if that makes most sense.


> | We need to accommodate an array of event records in pmResult to support polled
> | retrieval where each pmFetch may return multiple event records. And the number
> | of parameters per event record may not be fixed, e.g. when the number of
> | parameters depends on the event type.
> ...
> | An event parameter can then be encoded using a pmID (this metric needs to be
> 
> Hmmm ... seems a bit heavyweight to require a PMID per parameter.  Allocating
> PMIDs dynamically is also quite tricky (cgroup stats in Linux PMDA does this,
> so I recently coded something like that ... not a lot of fun) and the numbering
> space is limited.

I don't think this would be done dynamically.  The PMDA for a particular
source of event records would have to decide which PMIDs and PMNS names
are required to describe the data it knows about.  I could imagine this
being spartan in some cases, and very rich in others.

> Think I lost the plot shortly after this bit, that diagram sure would help.
> Will talk soon.
> 
> cheers.
> 



* Re: [pcp] suitability of PCP for event tracing
  2010-10-11  8:02                   ` Ken McDonell
@ 2010-10-11 12:34                     ` Nathan Scott
  2010-10-12 20:37                       ` Ken McDonell
  2010-11-10  0:43                     ` Ken McDonell
  1 sibling, 1 reply; 48+ messages in thread
From: Nathan Scott @ 2010-10-11 12:34 UTC (permalink / raw)
  To: kenj; +Cc: Greg Banks, systemtap, pcp, Frank Ch. Eigler


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> It has taken me a while to work through all of the issues here, not
> to
> mention more distractions than you could hit with a big stick.
> 
> Anyway putting an array of event records inside a pmResult is not
> going to be too hard - see http://oss.sgi.com/~kenj/event-params.html

| Event trace data includes the following for each event:
| timestamp
| event type
| event parameters (optional)
| The event type is really just another event parameter so we will not
| consider this as a separate data type, and event data consists of a
| timestamp and one or more parameters.

IMO, we also need Unique ID (refer X-trace and Dapper papers) and parent ID.
Preferably not optionally (could be inserted in PCP layers if no underlying
tracing support for end-to-end style tracing.

Not sure what the "event types" are that you're referring to?  An example trace
(e.g. something simple - blktrace trace, or strace/ptrace info) encoded as
PCP results/PDUs would really help me here I think.

| We want each of the data components to be a PCP metric, so we can leverage all
| of the metadata services to describe the type and semantics of the event
| data. So each data component would have a PMID (and a corresponding name
| in the external PMNS).

I think in practice this will prove a bit optimistic (so much data, classifying
it all - all parameters, flags, etc - to the tight PCP definitions ... not sure
this is really feasible, or needed).  I think a composite metric type like JSON
(readable) or perhaps XML might be a good option to provide for tracers.
Microsoft went with XML in ETW presentation tools FWIW, I think, but I'd lean
more toward JSON having coded both forms recently - they would have chosen XML
about 10 years ago and might not choose it a second time ;) given the chance.

| We need to accommodate an array of event records in pmResult to support polled
| retrieval where each pmFetch may return multiple event records. And the number
| of parameters per event record may not be fixed, e.g. when the number of
| parameters depends on the event type.
...
| An event parameter can then be encoded using a pmID (this metric needs to be

Hmmm ... seems a bit heavyweight to require a PMID per parameter.  Allocating
PMIDs dynamically is also quite tricky (cgroup stats in Linux PMDA does this,
so I recently coded something like that ... not a lot of fun) and the numbering
space is limited.

Think I lost the plot shortly after this bit, that diagram sure would help.
Will talk soon.

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-09-23 22:15                 ` Frank Ch. Eigler
@ 2010-10-11  8:02                   ` Ken McDonell
  2010-10-11 12:34                     ` Nathan Scott
  2010-11-10  0:43                     ` Ken McDonell
  0 siblings, 2 replies; 48+ messages in thread
From: Ken McDonell @ 2010-10-11  8:02 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Greg Banks, nathans, systemtap, pcp

It has taken me a while to work through all of the issues here, not to
mention more distractions than you could hit with a big stick.

Anyway putting an array of event records inside a pmResult is not going
to be too hard - see http://oss.sgi.com/~kenj/event-params.html

Pushing client context through PMCD to PMDAs is a much more challenging
task - see http://oss.sgi.com/~kenj/per-client-ctx.html for a proposal
and discussion.

Lemme know if this sounds good, bad or too ugly to contemplate ... I'll
start the implementation as soon as we seem to be approaching some sort
of consensus.

Cheers.


* Re: [pcp] suitability of PCP for event tracing
  2010-09-16 15:53               ` Ken McDonell
@ 2010-09-23 22:15                 ` Frank Ch. Eigler
  2010-10-11  8:02                   ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-23 22:15 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Greg Banks, nathans, systemtap, pcp

Hi -

kenj wrote:

> [...]  I am assuming there is at most one source of event records
> per PMDA, which I think is OK (event records are of variant type,
> with the number of parameters being different, possibly based on
> event type) [...]

This sounds fine.  In stap land, we could adapt to either variant
types or multiple PMDAs with non-variant types.


> >For the former, it seems we'd need at least three functions in pmapi.h
> >to be forwarded to the pmda: one to start watching metrics of
> >interest, one to poll any values collected since last time, and one to
> >shut down the watch.
> 
> I have not considered the register (and filter if they are separate) and 
> unregister functions.  pmStore is the existing way to push this sort of 
> thing to a PMDA.  [...]
> ....  pmStore leverages existing protocols and can be used to both start 
> watching (register in my speak) and shutdown (unregister in my speak). 
> Is there something else that is needed that could not be implemented 
> using pmStore and "control" variables to set (and possibly return) the 
> status/profile/registration of current events of interest?

first pmda->pmStore (foo.bar.register, VALUE)
later pmda->pmFetch (foo.bar) -> { bunch of queued foo.bar events }
later pmda->pmFetch (foo.bar) -> { bunch of queued foo.bar events }
 last pmda->pmStore (foo.bar.unregister, 0)  # or .register = 0

That is probably fine, as long as the context-passing /
cleanup-callback data is available.

The initial .register VALUE offers some interesting possibilities.
Since they are typed, could we interpret the VALUE as a promised
polling interval?  Or buffer size?  Or event count?  (Or, if this is
itself a structured type, a filtering predicate!)

> There is no need to poll ... pmFetch will return a pmResult with numval 
> indicating if there are any event records in the pmResult payload.

(Right, the pmFetch would be the poll function.)


> [...]  It would help me provide some concrete understanding of your
> planned use cases if I could see some sample code that enables,
> collects and disables the sort of event records (or traces) you're
> considering. [...]

We have not worked out the most natural systemtap lingo to encode the
PCP metric / domain / forthcoming event mappings, since we're
infinitely programmable, so don't have to hard-code it all yet.
But to help think about it, consider a system call trace metric
"process.all.syscall" that returns events of the same general form
that strace gives: a timestamp, a pid, an entry/exit indication,
parameters/result values.  Obviously this would be a high-volume
event.

A similar "process.perpid.syscall", would carry events with a pid# as
a pcp instance variable.  For efficiency, we need to figure out how
the registration-time pmStore should identify the requested instance
numbers, so that other processes need not be affected.

A similar "process.percall.syscall" could be indexed by syscall#.
Whether composing .perpid.percall etc. makes sense within a PCP
client, dunno.

(We'd also have process.*.syscall.stats type metrics to return
rates/counts in a normal polled manner.)

A simple alternative that may serve many purposes may be nothing but a
timestamped lump of text constituting the event, like lines from a log
file or pipe, or unstructured printf()s from a systemtap script.


- FChE


* Re: [pcp] suitability of PCP for event tracing
       [not found] <2010549822.1115071284891293369.JavaMail.root@mail-au.aconex.com>
@ 2010-09-19 10:19 ` nathans
  0 siblings, 0 replies; 48+ messages in thread
From: nathans @ 2010-09-19 10:19 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Frank Ch. Eigler, Greg Banks, systemtap, pcp


----- "Ken McDonell" <kenj@internode.on.net> wrote:

> On 17/09/2010 9:18 AM, nathans@aconex.com wrote:
> > ...
> I _do_ think this is simple
> - doubly linked list (or similar) of events
> - reference count when event arrives based on number of matching
> client 
> registrations
> - scan list for each client gathering matching events, decrementing 
> reference counts
> - free event record when reference count is zero
> - tune buffer depth per client with pmStore
> - cull list if client is not keeping up and return PM_ERR_TOOSLOW

OK.  Obviously, will need to encapsulate that in a generic library
(likely libpcp_pmda) for all types of trace PMDA... but, that's kinda
obvious.  Sounds good, worth a shot.  I think we may also need to
provide a PMDA-side mechanism to ensure memory use doesn't grow out
of control on the PMDA side even if clients are keeping up, but that we
can trial in practice.

> Plus several variants around lists per client or bit maps per client
> to 
> reduce matching overhead on each pmFetch.
> 
> If my battery would last long enough, I think this could be done on a
> 
> plane between Copenhagen and Melbourne!
> 

Sounds optimistic to me... consider it a challenge.  :)

> 
> I think you'd need my "expand PM_TYPE_EVENT into a set of pmResults" 
> change to pmlogger to get close here.  But even with that, I'm not
> sure 
> what pmchart is going to do with event data records having timestamps
> of 
> 9:45:03.456, 9:45:03.501, 9:45:04.001, etc.  The event parameters are
> likely to be discrete, so the semantics is going to be hard for pmchart.

I think it will be able to handle it OK (with changes, but it all
sounds doable so far).  I suspect we can even make a decent stab
at presenting the formatted trace data there too (though not inline
in the actual charts).

> > ... Well, actually, not sure how this will look? - does a
> > trace have to end before a PMDA would see it?  that'd be a bit
> lame;
> > or would we export start and end events separately? ...
> 
> This depends on the underlying event tracing subsystem.  Some emit

Good.  I just wanted to be sure its not dependent on the PMDA / libpmda.

> 
> Not sure I follow.  I'm expecting the events to be emitted once
> tracing 
> is activated (or an interest is registered), so I'm not sure the
> concept 
> of "trace X is in-progress" will be visible outside the PMDA.

Good stuff.

> 
> > Could extend the existing temporal index to index start/end time
> for
> > traces so we can quickly find whether a client sample covers a
> trace?
> > Either way, I suspect "trace start" and "trace end" may need to
> each
> > be a new metric type (in addition to PM_TYPE_COUNTER,
> PM_TYPE_INSTANT
> > and PM_TYPE_DISCRETE that we have now, iow).
> 
> I think we need some input from those on the list likely to be the 
> generators of the events, as it seems Nathan and I don't have a common
> view on what data is going to be emitted.  In my mind, there are event
> 

I'm mostly just wondering out loud, don't take my earlier mail as any
disagreement.  :)

cheers.

-- 
Nathan


* Re: [pcp] suitability of PCP for event tracing
  2010-09-19  9:28     ` Max Matveev
@ 2010-09-19  9:49       ` Nathan Scott
  0 siblings, 0 replies; 48+ messages in thread
From: Nathan Scott @ 2010-09-19  9:49 UTC (permalink / raw)
  To: Max Matveev; +Cc: pcp, Frank Ch. Eigler, systemtap, Ken McDonell


----- "Max Matveev" <makc@gmx.co.uk> wrote:

> On Sun, 19 Sep 2010 00:21:34 +1000, Ken McDonell wrote:
> 
>  kenj> On 17/09/2010 9:18 AM, nathans@aconex.com wrote:
> 
>  kenj> For the existing tools, I think we'll probably end up adding a
>  kenj> routine to libpcp to turn a PM_TYPE_EVENT "blob" into a
>  kenj> pmResult ... this will work for pminfo and pmprobe where the
>  kenj> timestamps are ignored.  For pmie, pmval, pmdumptext, pmchart,
>  kenj> ... I'm not sure how they can make sense of the event trace
>  kenj> data in real time, expecting data from time t, and getting a
>  kenj> set of values with different timestamps smaller than t is
> going
>  kenj> to be a bit odd for these ones.
> 
> I've been trying to come up with the use-case where event data and
> statistical data could be useful in the live mode (current tools issue
> aside) and I cannot really see one - Nathan or Frank, what did you
> have in mind?

One example was mentioned earlier - http://www.bootchart.org/
I'd want to extend it well beyond that use case though.

>  kenj> Plus several variants around lists per client or bit maps per client to
>  kenj> reduce matching overhead on each pmFetch.
> 
> How would per-client list entries be trimmed?

Yeah, that's pretty much what I was wondering.  Have to at least ensure
memory use is bounded and small, and really can't have memory utilisation
dependent on number of clients, IMO.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-18 14:21   ` Ken McDonell
@ 2010-09-19  9:28     ` Max Matveev
  2010-09-19  9:49       ` Nathan Scott
  0 siblings, 1 reply; 48+ messages in thread
From: Max Matveev @ 2010-09-19  9:28 UTC (permalink / raw)
  To: Ken McDonell; +Cc: nathans, pcp, Frank Ch. Eigler, systemtap

On Sun, 19 Sep 2010 00:21:34 +1000, Ken McDonell wrote:

 kenj> On 17/09/2010 9:18 AM, nathans@aconex.com wrote:

 kenj> For the existing tools, I think we'll probably end up adding a
 kenj> routine to libpcp to turn a PM_TYPE_EVENT "blob" into a
 kenj> pmResult ... this will work for pminfo and pmprobe where the
 kenj> timestamps are ignored.  For pmie, pmval, pmdumptext, pmchart,
 kenj> ... I'm not sure how they can make sense of the event trace
 kenj> data in real time, expecting data from time t, and getting a
 kenj> set of values with different timestamps smaller than t is going
 kenj> to be a bit odd for these ones.

I've been trying to come up with the use-case where event data and
statistical data could be useful in the live mode (current tools issue
aside) and I cannot really see one - Nathan or Frank, what did you
have in mind?

I can see how the ability to combine events and statistical data into an
archive can be useful for post-mortem analysis (assuming there is
a suitable tool for such a task) but even there the usefulness is
limited to the fact that all the data is in one place.

 >> Main concerns center around the PMDA buffering scheme ... things like,
 >> how does a PMDA decide what a sensible timeframe for buffering data is
 >> (probably will need some kind of per-PMDA memory limit on buffer size,
 >> rather than time frame).  Also, will the PMDA have to keep track of
 >> which clients have been sent which (portions of?) buffered data?  (in
 >> case of multiple clients with different request frequencies ... might
 >> get a bit hairy?).

 kenj> I'm bald, so hairy is no threat ... 8^)>

Wouldn't it disqualify you from working on hairy problems?

 kenj> I _do_ think this is simple
 kenj> - doubly linked list (or similar) of events
 kenj> - reference count when event arrives based on number of matching client 
 kenj> registrations
 kenj> - scan list for each client gathering matching events, decrementing 
 kenj> reference counts
 kenj> - free event record when reference count is zero
 kenj> - tune buffer depth per client with pmStore
 kenj> - cull list if client is not keeping up and return PM_ERR_TOOSLOW

I think it should be even simpler - one list per pmda, fixed number of
entries (changeable via pmStore), "cookie" per event, client gets all
the events and is responsible for figuring out which ones it has seen
or if it was too slow and there is a gap in the event list. 
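
In sketch form, the client side of that could be as simple as this
(the cookie field, the gap handling and all the names here are made
up for illustration):

------------------------------------------------------------------------
#include <stdio.h>

typedef struct {
    unsigned int cookie;    /* sequence number assigned by the pmda */
    /* ... event parameters would follow ... */
} event_t;

static unsigned int last_seen;  /* cookie of the last event processed */

static void
consume(const event_t *ev, int nevents)
{
    int i;

    for (i = 0; i < nevents; i++) {
        if (ev[i].cookie <= last_seen)
            continue;                   /* duplicate from an earlier fetch */
        if (ev[i].cookie != last_seen + 1)
            /* client was too slow and the pmda's fixed list wrapped */
            printf("gap: missed %u events\n", ev[i].cookie - last_seen - 1);
        /* ... process ev[i] here ... */
        last_seen = ev[i].cookie;
    }
}
------------------------------------------------------------------------

The nice property is that the pmda stays stateless with respect to
clients; all the bookkeeping lives on the consumer side.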

 kenj> Plus several variants around lists per client or bit maps per client to 
 kenj> reduce matching overhead on each pmFetch.

How would per-client list entries be trimmed? Are you going to assume
a well-behaved client? And how do PM_CONTEXT_LOCAL and multiple clients
work?

max

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16 23:18 ` nathans
@ 2010-09-18 14:21   ` Ken McDonell
  2010-09-19  9:28     ` Max Matveev
  0 siblings, 1 reply; 48+ messages in thread
From: Ken McDonell @ 2010-09-18 14:21 UTC (permalink / raw)
  To: nathans; +Cc: Frank Ch. Eigler, Greg Banks, systemtap, pcp

On 17/09/2010 9:18 AM, nathans@aconex.com wrote:
> ...
> [sp. TCP]  :) ... local context mode could be used in that situation
> (PM_CONTEXT_LOCAL), which would map more closely to the current trace
> tools and doesn't use TCP.  I haven't seen any reason why this scheme
> won't work for our little-used local context friend, good thing we did
> not remove that code, eh Ken?  ;)

Yep.  And the good thing is that there are no extra PMDA or PMCD 
changes needed here ... build for the distributed case, then when the 
client is running on the same host as the data source, you can choose 
the low latency PM_CONTEXT_LOCAL path ... promoting PM_CONTEXT_LOCAL to 
be a first class citizen has meant that we can accommodate new PMDAs in 
this scheme, e.g. the event tracing PMDA.

> ...
> I guess it remains to be seen what (existing) tools will do with the
> trace data ... I'm guessing for the most part they will ignore it (as
> many of them do for the STRING/AGGREGATE types already: pmie, pmval, etc).
> So, there's still plenty of work to be done to do a good job of adding
> support to the client tools - almost certainly a new tracing-specific
> tool will be needed.

For the existing tools, I think we'll probably end up adding a routine 
to libpcp to turn a PM_TYPE_EVENT "blob" into a pmResult ... this will 
work for pminfo and pmprobe where the timestamps are ignored.  For pmie, 
pmval, pmdumptext, pmchart, ... I'm not sure how they can make sense of 
the event trace data in real time, expecting data from time t, and 
getting a set of values with different timestamps smaller than t is 
going to be a bit odd for these ones.

The tracing-specific tool will be expecting it, so that should be OK.

I've also thought we could even teach pmlogger about PM_TYPE_EVENT data 
and it could emit one pmResult per event record to capture the correct 
time sequences from the original event traces ... this only makes sense 
if there are tools that can use this sort of data from an archive.
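
As a sketch, the libpcp helper might be little more than this (the
name and signature are hypothetical at this stage):

------------------------------------------------------------------------
#include <pcp/pmapi.h>

/*
 * Expand a PM_TYPE_EVENT value block into one pmResult per event
 * record, each carrying the record's own timestamp.  Returns the
 * number of pmResults (the array comes back via *rlist), or a
 * negative error code; the caller frees each with pmFreeResult().
 */
extern int pmUnpackEventBlob(const pmValueBlock *blob, pmResult ***rlist);
------------------------------------------------------------------------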

> ...
> Main concerns center around the PMDA buffering scheme ... things like,
> how does a PMDA decide what a sensible timeframe for buffering data is
> (probably will need some kind of per-PMDA memory limit on buffer size,
> rather than time frame).  Also, will the PMDA have to keep track of
> which clients have been sent which (portions of?) buffered data?  (in
> case of multiple clients with different request frequencies ... might
> get a bit hairy?).

I'm bald, so hairy is no threat ... 8^)>

I _do_ think this is simple
- doubly linked list (or similar) of events
- reference count when event arrives based on number of matching client 
registrations
- scan list for each client gathering matching events, decrementing 
reference counts
- free event record when reference count is zero
- tune buffer depth per client with pmStore
- cull list if client is not keeping up and return PM_ERR_TOOSLOW

Plus several variants around lists per client or bit maps per client to 
reduce matching overhead on each pmFetch.
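
To make that concrete, the core of the list above might reduce to
something like this sketch ... the per-client matching, the pmStore
tuning and the PM_ERR_TOOSLOW culling are all elided, and the names
are illustrative only:

------------------------------------------------------------------------
#include <stdlib.h>

typedef struct event {
    struct event *prev, *next;  /* doubly linked list of buffered events */
    int          refcount;      /* matching registrations at arrival time */
    /* ... timestamp and event parameters ... */
} event_t;

static event_t *head, *tail;

/* event arrives; nclients is the number of matching registrations */
static void
event_arrive(event_t *ev, int nclients)
{
    ev->refcount = nclients;
    ev->prev = tail;
    ev->next = NULL;
    if (tail)
        tail->next = ev;
    else
        head = ev;
    tail = ev;
}

/* a client's pmFetch has gathered this event; drop one reference */
static void
event_deliver(event_t *ev)
{
    if (--ev->refcount == 0) {  /* last interested client has seen it */
        if (ev->prev) ev->prev->next = ev->next; else head = ev->next;
        if (ev->next) ev->next->prev = ev->prev; else tail = ev->prev;
        free(ev);
    }
}
------------------------------------------------------------------------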

If my battery would last long enough, I think this could be done on a 
plane between Copenhagen and Melbourne!

> Also, we've not really considered the additional requirements that we
> have in archive mode.  Unlike the sampled data, traces have explicit
> start and end points, which we will need to know about.  For example,
> if I construct a chart with starting offset (-S) at 10am and ending
> (-T) at 10:15, and a trace started at 9:45 which completes at 10:10,
> I'd expect to see that trace displayed, even though the trace data
> would (AIUI, in this proposal) all be stored at the time the trace
> was sampled? ...

I think you'd need my "expand PM_TYPE_EVENT into a set of pmResults" 
change to pmlogger to get close here.  But even with that, I'm not sure 
what pmchart is going to do with event data records having timestamps of 
9:45:03.456, 9:45:03.501, 9:45:04.001, etc.  The event parameters are 
likely to be discrete, so the semantics are going to be hard for pmchart.

> ... Well, actually, not sure how this will look? - does a
> trace have to end before a PMDA would see it?  that'd be a bit lame;
> or would we export start and end events separately? ...

This depends on the underlying event tracing subsystem.  Some emit start 
and end events, and then the consumer has to know how to match these up. 
Others emit completion events (which usually include time taken and 
other resources consumed to process the event, return status, etc).

As a generic tool, I'm not sure pmchart will be able to make a lot of 
sense of the raw event data.

> ... then we need a
> way to tie them back together in the client tools.  Or in this example
> of a long-running trace (relative to client sample time), does the
> PMDA report "trace X is in-progress" on each sample?  That'd be a bit
> wasteful on disk space ... hmm, not clear what the best approach here
> will be.

Not sure I follow.  I'm expecting the events to be emitted once tracing 
is activated (or an interest is registered), so I'm not sure the concept 
of "trace X is in-progress" will be visible outside the PMDA.

> Could extend the existing temporal index to index start/end time for
> traces so we can quickly find whether a client sample covers a trace?
> Either way, I suspect "trace start" and "trace end" may need to each
> be a new metric type (in addition to PM_TYPE_COUNTER, PM_TYPE_INSTANT
> and PM_TYPE_DISCRETE that we have now, iow).

I think we need some input from those on the list likely to be the 
generators of the events, as it seems Nathan and I don't have a common 
view on what data is going to be emitted.  In my mind, there are event 
records when a trace is active, and there are no event records when a 
trace is not active, so the notion of a "start or end of trace" event is 
not explicitly present.

> ...
> A lot of work here, but it's all fascinating stuff & gonna be great fun
> to code!

Agreed ... this sounds like fun.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] <1341556404.1064361284677032819.JavaMail.root@mail-au.aconex.com>
@ 2010-09-16 23:18 ` nathans
  2010-09-18 14:21   ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-09-16 23:18 UTC (permalink / raw)
  To: Frank Ch. Eigler, Ken McDonell, Greg Banks; +Cc: systemtap, pcp

Hi guys,

----- "Ken McDonell" <kenj@internode.on.net> wrote:

> On 16/09/2010 12:07 PM, Greg Banks wrote:
> > Frank Ch. Eigler wrote:
> 
> I think we should pursue the discussion of this approach a little
> further.  There is only one layer of buffering needed, at the PMDA.
> 
> So far, the obstacles to this approach would appear to be ...
> 
> 1. Need buffer management and per-client state in the PMDA (actually
> it is per PMAPI context which is a little more complicated, but doable)
> ... 
> I don't see either issue as a big deal, and together they are an order of
> magnitude simpler than supporting the sort of asynchronous callbacks 
> from PMCD that have been suggested.
> 
> 2. Latency for event notification ... the client can control the polling
> interval (down to a few milliseconds demonstrably works), so I expect
> you'd be able to tune the latency to match the semantic demands.  If
> really low latency is needed then any TPC-based mechanism is probably

[sp. TCP]  :) ... local context mode could be used in that situation
(PM_CONTEXT_LOCAL), which would map more closely to the current trace
tools and doesn't use TCP.  I haven't seen any reason why this scheme
won't work for our little-used local context friend, good thing we did
not remove that code, eh Ken?  ;)

> not going to work well, and PCP may be the wrong tool for that space.

Local context should be fine, and perhaps that should be the default
mode for any generic PCP tracing client tool (which, I imagine, we'll
soon be needing).

> 
> c. does not break any existing PMDAs or PMAPI clients
> 

I guess it remains to be seen what (existing) tools will do with the
trace data ... I'm guessing for the most part they will ignore it (as
many of them do for the STRING/AGGREGATE types already: pmie, pmval, etc).
So, there's still plenty of work to be done to do a good job of adding
support to the client tools - almost certainly a new tracing-specific
tool will be needed.

> d. be doable in a very short time ... for instance wrapping an array of
> events inside a "special" data aggregate is simple and isolated, and 
> there is already the basis for the required PMCD-PMDA interaction to 
> ensure the context id is known to the PMDA, and the existing context 
> cleanup code in PMCD provides the place to notify PMDAs that a context
> will no longer be requesting events.
> 
> So, can anyone mount a convincing argument that the requirements would
> demand changes to allow asynchronous behaviour between PMAPI clients 
> <---> PMCD <---> PMDAs?

Main concerns center around the PMDA buffering scheme ... things like,
how does a PMDA decide what a sensible timeframe for buffering data is
(probably will need some kind of per-PMDA memory limit on buffer size,
rather than time frame).  Also, will the PMDA have to keep track of
which clients have been sent which (portions of?) buffered data?  (in
case of multiple clients with different request frequencies ... might
get a bit hairy?).

Also, we've not really considered the additional requirements that we
have in archive mode.  Unlike the sampled data, traces have explicit
start and end points, which we will need to know about.  For example,
if I construct a chart with starting offset (-S) at 10am and ending
(-T) at 10:15, and a trace started at 9:45 which completes at 10:10,
I'd expect to see that trace displayed, even though the trace data
would (AIUI, in this proposal) all be stored at the time the trace
was sampled?  Well, actually, not sure how this will look? - does a
trace have to end before a PMDA would see it?  that'd be a bit lame;
or would we export start and end events separately?  then we need a
way to tie them back together in the client tools.  Or in this example
of a long-running trace (relative to client sample time), does the
PMDA report "trace X is in-progress" on each sample?  That'd be a bit
wasteful on disk space ... hmm, not clear what the best approach here
will be.

Could extend the existing temporal index to index start/end time for
traces so we can quickly find whether a client sample covers a trace?
Either way, I suspect "trace start" and "trace end" may need to each
be a new metric type (in addition to PM_TYPE_COUNTER, PM_TYPE_INSTANT
and PM_TYPE_DISCRETE that we have now, iow).

> If not, I strongly suggest we work to flesh out the changes needed to
> make a variable length array of structured event records available 
> through the existing poll-based APIs.

I'm not far away from sending out some prototype JSON PDU support (I
got distracted after starting with XML, then tossed it and switched to
JSON); it adds a libpcp_json library that I think would be handy here.

FWIW, the structured data approach should be just fine for capturing
the parent/child trace relationship which I want us to tackle as well
(from those papers I fwd'd); for traces that support this concept we
can add those as additional JSON maps (or XML elements, or...), so I
am content there.

A lot of work here, but it's all fascinating stuff & gonna be great fun
to code!

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16 14:24             ` Frank Ch. Eigler
@ 2010-09-16 15:53               ` Ken McDonell
  2010-09-23 22:15                 ` Frank Ch. Eigler
  0 siblings, 1 reply; 48+ messages in thread
From: Ken McDonell @ 2010-09-16 15:53 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Greg Banks, nathans, systemtap, pcp

On 17/09/2010 12:24 AM, Frank Ch. Eigler wrote:
> ...
>
> OK.  This suggests two separate steps: allowing more than one
> separately-timestamped metric tuple to come back from the new
> array-fetch functions; and allowing structured data through some new
> metric data type.

Yep.  I have a design for this ... it is in HTML, so won't go to the 
systemtap list ... I'll be home Mon next week and I'll put the document 
someplace where others can find it.  It is not very complicated and 
should be easy to implement.

I am assuming there is at most one source of event records per PMDA, 
which I think is OK (event records are of variant type, with the number 
of parameters being different, possibly based on event type) ... if 
there were two independent sources of event records, I think they'd have 
two different (kernel) APIs for registration and extraction and would 
sensibly belong in two different PMDAs.  Please let me know if this 
assumption is not likely to be correct/acceptable.

> For the former, it seems we'd need at least three functions in pmapi.h
> to be forwarded to the pmda: one to start watching metrics of
> interest, one to poll any values collected since last time, and one to
> shut down the watch.

I have not considered the register (and filter if they are separate) and 
unregister functions.  pmStore is the existing way to push this sort of 
thing to a PMDA.  Since I don't think there is a generic syntax for 
event registration and filtering, this probably needs to be opaque to 
PCP and known by the client and the PMDA, e.g. magic integers, strings, 
....  pmStore leverages existing protocols and can be used to both start 
watching (register in my speak) and shutdown (unregister in my speak). 
Is there something else that is needed that could not be implemented 
using pmStore and "control" variables to set (and possibly return) the 
status/profile/registration of current events of interest?
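
To illustrate (and only to illustrate ... the control metric name and
the filter string here are invented, and would be private to the
client and the PMDA in practice), the client side of a pmStore-based
registration might look like:

------------------------------------------------------------------------
#include <pcp/pmapi.h>
#include <stdlib.h>
#include <string.h>

static int
register_events(const char *filter)
{
    char         *name = "acme.event.control";  /* hypothetical metric */
    pmID         pmid;
    pmResult     *rp;
    pmValueSet   *vsp;
    pmValueBlock *vbp;
    size_t       need = PM_VAL_HDR_SIZE + strlen(filter) + 1;
    int          sts;

    if ((sts = pmLookupName(1, &name, &pmid)) < 0)
        return sts;

    vbp = (pmValueBlock *)malloc(need);         /* string payload */
    vbp->vtype = PM_TYPE_STRING;
    vbp->vlen = need;
    strcpy(vbp->vbuf, filter);

    vsp = (pmValueSet *)malloc(sizeof(pmValueSet));
    vsp->pmid = pmid;
    vsp->numval = 1;
    vsp->valfmt = PM_VAL_SPTR;
    vsp->vlist[0].inst = PM_IN_NULL;
    vsp->vlist[0].value.pval = vbp;

    rp = (pmResult *)calloc(1, sizeof(pmResult));
    rp->numpmid = 1;
    rp->vset[0] = vsp;

    sts = pmStore(rp);      /* an empty filter could mean "unregister" */
    free(vbp);
    free(vsp);
    free(rp);
    return sts;
}
------------------------------------------------------------------------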

We do need additional PMCD <--> PMDA protocols for the context id 
information and this will provide the hook for the cleanup in the case 
where the client chooses to die, rather than unregister their event 
collection(s).   8^)>

There is no need to poll ... pmFetch will return a pmResult with numval 
indicating if there are any event records in the pmResult payload.

> For the pmda side, since main thread control is to stay elsewhere,
> asynchronous data collection would seem to require multithreading in
> the typical cases, but perhaps not initially.

Agreed.  But provided the thread catching and caching the events does 
not call any PCP libraries, we're OK with the libraries as they stand 
... there are several existing PMDAs built this way.  We just need 
conventional mutex guards to protect the shared data structures that 
hold the cached events ... simple single producer and single consumer stuff.
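
Something along these lines (names illustrative only):

------------------------------------------------------------------------
#include <pthread.h>
#include <stddef.h>

typedef struct event {
    struct event *next;
    /* ... timestamp, parameters ... */
} event_t;

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static event_t *cache_head;

/* producer: runs on the event-catching thread, no libpcp calls here */
static void
cache_event(event_t *ev)
{
    pthread_mutex_lock(&cache_lock);
    ev->next = cache_head;
    cache_head = ev;
    pthread_mutex_unlock(&cache_lock);
}

/* consumer: called from the PMDA fetch path on the main thread */
static event_t *
drain_events(void)
{
    event_t *batch;

    pthread_mutex_lock(&cache_lock);
    batch = cache_head;
    cache_head = NULL;
    pthread_mutex_unlock(&cache_lock);
    return batch;       /* caller walks (and eventually frees) the list */
}
------------------------------------------------------------------------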

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16 12:40           ` Ken McDonell
@ 2010-09-16 14:24             ` Frank Ch. Eigler
  2010-09-16 15:53               ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-16 14:24 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Greg Banks, nathans, systemtap, pcp

Hi -

> [...]

Makes sense.

> So, can anyone mount a convincing argument that the requirements would 
> demand changes to allow asynchronous behaviour between PMAPI clients 
> <---> PMCD <---> PMDAs?

Not without more data, so your plan sounds good.

> If not, I strongly suggest we work to flesh out the changes needed
> to make a variable length array of structured event records
> available through the existing poll-based APIs.

OK.  This suggests two separate steps: allowing more than one
separately-timestamped metric tuple to come back from the new
array-fetch functions; and allowing structured data through some new
metric data type.

For the former, it seems we'd need at least three functions in pmapi.h
to be forwarded to the pmda: one to start watching metrics of
interest, one to poll any values collected since last time, and one to
shut down the watch.
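
As a rough sketch, with all names and signatures hypothetical:

------------------------------------------------------------------------
#include <pcp/pmapi.h>

/* start watching: register interest in the given metrics with the pmda */
extern int pmWatchStart(int numpmid, pmID pmidlist[]);

/* poll: return the 0..n results collected since the previous poll */
extern int pmWatchPoll(int *numres, pmResult ***reslist);

/* shut down the watch and release any pmda-side state */
extern int pmWatchEnd(void);
------------------------------------------------------------------------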

For the pmda side, since main thread control is to stay elsewhere,
asynchronous data collection would seem to require multithreading in
the typical cases, but perhaps not initially.

- FChE

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16  2:07         ` Greg Banks
@ 2010-09-16 12:40           ` Ken McDonell
  2010-09-16 14:24             ` Frank Ch. Eigler
  0 siblings, 1 reply; 48+ messages in thread
From: Ken McDonell @ 2010-09-16 12:40 UTC (permalink / raw)
  To: Greg Banks; +Cc: Frank Ch. Eigler, nathans, systemtap, pcp

On 16/09/2010 12:07 PM, Greg Banks wrote:
> Frank Ch. Eigler wrote:
> ...
>> Or do you imagine a pure polling-based API all around, with buffering
>> latencies at the intermediate layers?
>>
>>
> This could be made to work too.

I think we should pursue the discussion of this approach a little
further.  There is only one layer of buffering needed, at the PMDA.

So far, the obstacles to this approach would appear to be ...

1. Need buffer management and per-client state in the PMDA (actually it 
is per PMAPI context which is a little more complicated, but doable) ... 
I don't see either issue as a big deal, and together they are an order of 
magnitude simpler than supporting the sort of asynchronous callbacks 
from PMCD that have been suggested.

2. Latency for event notification ... the client can control the polling 
interval (down to a few milliseconds demonstrably works), so I expect 
you'd be able to tune the latency to match the semantic demands.  If 
really low latency is needed then any TPC-based mechanism is probably 
not going to work well, and PCP may be the wrong tool for that space.

3. Each pmFetch may return none, one or many event records.  None is not 
a problem, but to support many we need to devise an encapsulated data 
aggregate within a pmResult that stores event parameters (including 
timestamps I suspect) as first-class citizens in terms of PCP metadata 
... I don't think this is hard either.
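
One possible shape for that aggregate, as a sketch only (the field
names are hypothetical, and the variable-length layout glosses over
alignment):

------------------------------------------------------------------------
#include <pcp/pmapi.h>
#include <sys/time.h>

typedef struct {
    pmID            ep_pmid;        /* metadata handle for this parameter */
    pmValueBlock    ep_value;       /* typed, self-identifying value */
} pmEventParameter;

typedef struct {
    struct timeval  er_timestamp;   /* when this event occurred */
    int             er_nparams;     /* entries in er_param[] */
    pmEventParameter er_param[1];   /* variable length */
} pmEventRecord;

typedef struct {
    int             ea_nrecords;    /* events since the last pmFetch */
    pmEventRecord   ea_record[1];   /* variable length */
} pmEventArray;
------------------------------------------------------------------------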

I believe this approach would ...

a. allow event trace data to be intermixed with classical PCP data 
coming from PMCD to a PMAPI client

b. move the threadsafe PCP libraries work off into something that can be 
tackled independently (and probably should be done as the benefits are 
more general)

c. does not break any existing PMDAs or PMAPI clients

d. be doable in a very short time ... for instance wrapping an array of 
events inside a "special" data aggregate is simple and isolated, and 
there is already the basis for the required PMCD-PMDA interaction to 
ensure the context id is known to the PMDA, and the existing context 
cleanup code in PMCD provides the place to notify PMDAs that a context 
will no longer be requesting events.

So, can anyone mount a convincing argument that the requirements would 
demand changes to allow asynchronous behaviour between PMAPI clients 
<---> PMCD <---> PMDAs?

If not, I strongly suggest we work to flesh out the changes needed to 
make a variable length array of structured event records available 
through the existing poll-based APIs.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16  1:04       ` Frank Ch. Eigler
@ 2010-09-16  2:07         ` Greg Banks
  2010-09-16 12:40           ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Banks @ 2010-09-16  2:07 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: nathans, Ken McDonell, systemtap, pcp

Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Sep 16, 2010 at 10:21:16AM +1000, Greg Banks wrote:
>   
>
>   
>> How is such an application supposed to handle gracefully receiving a
>> signal?
>>     
>
> (The app would have to defer handling it till the next timeout
> callback, then stop the watch loop.)
>   

Which is hardly helpful if (e.g.) the server is not responding.
>
>   
>> As main loop designs go, I'm not very impressed.
>>     
>
> Understood, but it wasn't meant as a *main* loop, only an extended
> form of pmFetch that can return many results over time.  pmFetch
> itself can take a relatively long amount of time (a network RPC), but
> this hypothetical pmWatch would extend that time even more, so I see
> your point.  I'm not sure how to proceed though.
>
> Do you imagine an asynchronous callback for pmWatch-type functionality
> being modeled as another pmLoopRegister* sibling?  
I think that's an option, if you want to do push with a purely 
single-threaded client which also responds to user input (like ^C) using 
today's libpcp.  Another option would be to add a function which does a 
non-blocking poll on a context to check for any pending fetch results 
and/or pushed data, and call that from the single app thread when poll() 
reports activity on the context's fd.  A third option would be to rely 
on libpcp becoming threaded at some time in the future and add a 
pmWatch-like API which does a cancellable blocking poll in a dedicated 
thread.
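
For instance, the second option might look like this from the app's
side; pmContextFd() and pmCheckPending() are hypothetical additions,
not existing libpcp calls:

------------------------------------------------------------------------
#include <poll.h>
#include <pcp/pmapi.h>

extern int pmContextFd(int ctx);                  /* hypothetical */
extern int pmCheckPending(int ctx, pmResult **);  /* hypothetical */

void
main_loop(int ctx)
{
    struct pollfd pfd;
    pmResult *rp;

    pfd.fd = pmContextFd(ctx);
    pfd.events = POLLIN;

    for (;;) {
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* non-blocking check for fetch results and/or pushed data */
            while (pmCheckPending(ctx, &rp) > 0) {
                /* ... consume rp ... */
                pmFreeResult(rp);
            }
        }
        /* ... service timers, other sockets, the window system ... */
    }
}
------------------------------------------------------------------------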

> (OTOH, are there
> any open-source pcp clients that use the pmLoop* facility?  git grep &
> codesearch.google.com have no hits at all.)
>   

Not AFAIK.  The code it was created for lived only inside SGI.  It might 
still be alive, I don't know.
> Or do you imagine a pure polling-based API all around, with buffering
> latencies at the intermediate layers?
>
>
>   
This could be made to work too.

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-16  0:21     ` Greg Banks
@ 2010-09-16  1:04       ` Frank Ch. Eigler
  2010-09-16  2:07         ` Greg Banks
  0 siblings, 1 reply; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-16  1:04 UTC (permalink / raw)
  To: Greg Banks; +Cc: nathans, Ken McDonell, systemtap, pcp

Hi -

On Thu, Sep 16, 2010 at 10:21:16AM +1000, Greg Banks wrote:
> [...]
> Ok, so let me see if I understand the idea.  In this model the calling app
> [...]

Yeah.

> Also, what is the relationship between the pollInterval and the
> timeoutInterval values?  Which is larger than the other, and what
> are the semantics of specifying either or both as NULL or a zero
> struct timeval?  If pmWatch() is implemented by polling, does the
> polling happen immediately when called or one pollInterval later or
> at some other time?  Is any state kept between pmWatch() calls to
> make polls occur at regular times?

There are probably some reasonable answers to these questions, if it
matters, but:

> How is such an application supposed to handle gracefully receiving a
> signal?

(The app would have to defer handling it till the next timeout
callback, then stop the watch loop.)


> As main loop designs go, I'm not very impressed.

Understood, but it wasn't meant as a *main* loop, only an extended
form of pmFetch that can return many results over time.  pmFetch
itself can take a relatively long amount of time (a network RPC), but
this hypothetical pmWatch would extend that time even more, so I see
your point.  I'm not sure how to proceed though.

Do you imagine an asynchronous callback for pmWatch-type functionality
being modeled as another pmLoopRegister* sibling?  (OTOH, are there
any open-source pcp clients that use the pmLoop* facility?  git grep &
codesearch.google.com have no hits at all.)

Or do you imagine a pure polling-based API all around, with buffering
latencies at the intermediate layers?


> [...]
> I don't see any way of doing push without having some new behaviour in 
> the PMDAs that support it, however I think it would be wise to strongly 
> minimise and simplify the requirements on PMDAs.  [...]

Definitely.

- FChE

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-15 12:11   ` Frank Ch. Eigler
@ 2010-09-16  0:21     ` Greg Banks
  2010-09-16  1:04       ` Frank Ch. Eigler
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Banks @ 2010-09-16  0:21 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: nathans, Ken McDonell, systemtap, pcp

Frank Ch. Eigler wrote:
> Hi -
>
> On Wed, Sep 15, 2010 at 01:11:55PM +1000, Greg Banks wrote:
>   
>
>
>> Re Frank's API proposal: how does the client cancel a watch?
>>     
>
> In the pmWatch idea, control is handed to pmapi during the watch
> interval.  The client receives callbacks periodically, and at those
> times, it has the chance to cancel the watch.  This is what the
> poll/timeout intervals were for: to guarantee that the client will get
> some sort of callback no less frequently than the requested
> interval(s).
>   
Ok, so let me see if I understand the idea.  In this model the calling app

a) does not need to do anything else in the meanwhile, like run timers 
or respond to other network sockets or talk to the window system,

b) knows beforehand the complete set of pmids it needs to watch and can 
organise them into an array, to do a single global watch call,

c) knows how long its main loop will take,

d) never needs to cancel a watch asynchronously, i.e. at any time except 
at the pre-nominated expiry or as the return value of a callback.

Also, what is the relationship between the pollInterval and the 
timeoutInterval values?  Which is larger than the other, and what are 
the semantics of specifying either or both as NULL or a zero struct 
timeval?  If pmWatch() is implemented by polling, does the polling 
happen immediately when called or one pollInterval later or at some 
other time?  Is any state kept between pmWatch() calls to make polls 
occur at regular times?  How is such an application supposed to handle 
gracefully receiving a signal?

As main loop designs go, I'm not very impressed.

>
>   
>> What thread is doing the servicing of the socket to PMCD, and if the
>> main app thread, when?
>>     
>
> It would be the single pmapi/app thread.
>
>
> If one puts the burden of buffering and client-state-keeping onto a
> PMDA, probably the same flavour of scheme can work there too, with
> single-threaded polling.
>
>
>   

I don't see any way of doing push without having some new behaviour in 
the PMDAs that support it, however I think it would be wise to strongly 
minimise and simplify the requirements on PMDAs.  We have a lot of 
PMDAs, they're written by all sorts of people, and testing them is not 
always easy.

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-15  3:12 ` Greg Banks
@ 2010-09-15 12:11   ` Frank Ch. Eigler
  2010-09-16  0:21     ` Greg Banks
  0 siblings, 1 reply; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-15 12:11 UTC (permalink / raw)
  To: Greg Banks; +Cc: nathans, Ken McDonell, systemtap, pcp

Hi -

On Wed, Sep 15, 2010 at 01:11:55PM +1000, Greg Banks wrote:
> [...]
> Personally I'm a big fan of void *closure pointers; see the pmLoop*() 
> functions for examples.
> 
> Guys, let's please not take the existing async API calls as an example 
> of good design or as a precedent.  I think they should be considered a 
> short term expedient, and replaced with better design.

On the other hand, if the pmapi.h client-side model is likely to stay
single-threaded, then there is no need for passing back those
pointers: just use global variables.


> Re Frank's API proposal: how does the client cancel a watch?

In the pmWatch idea, control is handed to pmapi during the watch
interval.  The client receives callbacks periodically, and at those
times, it has the chance to cancel the watch.  This is what the
poll/timeout intervals were for: to guarantee that the client will get
some sort of callback no less frequently than the requested
interval(s).


> What thread is doing the servicing of the socket to PMCD, and if the
> main app thread, when?

It would be the single pmapi/app thread.


If one puts the burden of buffering and client-state-keeping onto a
PMDA, probably the same flavour of scheme can work there too, with
single-threaded polling.


- FChE

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] <105152664.981101284508372475.JavaMail.root@mail-au.aconex.com>
@ 2010-09-15  3:12 ` Greg Banks
  2010-09-15 12:11   ` Frank Ch. Eigler
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Banks @ 2010-09-15  3:12 UTC (permalink / raw)
  To: nathans; +Cc: Frank Ch. Eigler, Ken McDonell, systemtap, pcp

nathans@aconex.com wrote:
> ----- "Ken McDonell" <kenj@internode.on.net> wrote:
>
> One other random comment - wrt your code snippet, Frank, it'd
> probably be more consistent to do the timeout/interval setting
> via pmSetMode.  The other async requests that Greg/Max did do
> not have an opaque void* (passthru) parameter either ... so,
> just need to think about whether we want consistency or not
> there (and whether than concept needs to be available in those
> other async calls).
>
>   
Personally I'm a big fan of void *closure pointers; see the pmLoop*() 
functions for examples.

Guys, let's please not take the existing async API calls as an example 
of good design or as a precedent.  I think they should be considered a 
short term expedient, and replaced with better design.

Re Frank's API proposal: how does the client cancel a watch?  What 
thread is doing the servicing of the socket to PMCD, and if the main app 
thread, when?

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-13 13:29       ` Max Matveev
@ 2010-09-13 20:53         ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-09-13 20:53 UTC (permalink / raw)
  To: Max Matveev; +Cc: Frank Ch. Eigler, systemtap, pcp

On 13/09/2010 11:29 PM, Max Matveev wrote:
> On Mon, 13 Sep 2010 02:43:03 +1000, Ken McDonell wrote:
>
>   kenj>  Some of the suggestions to date include ...
>
>   kenj>  + data filtering predicates pushed from a client to pmcd and then on to
>   kenj>  a pmda to enable or restrict the types of events or conditions on event
>   kenj>  parameters that would be evaluated before asynchronously sending
>   kenj>  matching events to the client
>
> How would that work if multiple clients request mutually exclusive
> predicates?

I wonder about this also.  But my initial guess would be that predicates 
from different clients could be combined with a boolean OR to register 
the union of the events of interest, with each client's individual 
predicate then applied to the stream to produce the events for that 
client.  If the underlying event mechanism cannot support this, e.g. 
some process-based CPU event registers where only one process at a time 
can be traced, then "first in best dressed" is probably the only 
protocol that will work.


>   kenj>  + additional per-client state data being held in pmcd to allow rate
>   kenj>  aggregation (and similar temporal averaging) to be done at pmcd, rather
>   kenj>  than the client [note I have a long-standing objection to this approach
>   kenj>  based on the original design criteria that pmcd needs to be mean and
>   kenj>  lean to reduce impact, and data reduction and analysis should be pushed
>   kenj>  out to the clients where the computation can be done without impacting
>   kenj>  the system being monitored ... but maybe it is time to revisit this, as
>   kenj>  the current environments where PCP is being used may differ from those
>   kenj>  we were concerned with in 1994]
>
> Recent (2010) experience on a 3rd-rate platform suggests that this is
> still an issue - doing too much calculation in the pmda adds to the
> time it takes to fetch the data; unless pmcd can magically hide delays
> induced by calculations it has to make or calculations made by the pmda,
> ALL clients suffer.

Yep that is always a risk ... and as a general rule I'd like to see 
collection being restricted at the pmcd site (e.g. don't probe for stats 
that no one is asking for) and aggregated processing like rate 
calculations moved out to the clients.

>   kenj>  Depending on the set of goals we agree on, there may even be a place to
>   kenj>  consider maintaining the poll-based mechanism, but the export data is a
>   kenj>  variable length buffer of all event records (each aggregated and
>   kenj>  self-identifying as below) seen since the last poll-based sample.
>
> How will this work with multiple clients? Will the clients get a "snap
> time" to indicate when pmda updated its buffer or will pmda need to
> remember the state of each client (which would mean dragging client
> information into pmda, maintaining that information and somehow
> retiring per-client state without affecting the clients).
>

I think it would need per-client state in the pmda (which assumes a 
protocol change between pmcd and the pmdas) ... the implementation might 
involve a dynamic ring buffer, sequence numbers and last fetched 
sequence number per client.
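
For example (a sketch only, all names hypothetical):

------------------------------------------------------------------------
#define RING_SIZE 1024              /* tunable, e.g. via pmStore */

typedef struct {
    unsigned int seq;               /* monotonically increasing */
    /* ... timestamp, parameters ... */
} event_t;

static event_t ring[RING_SIZE];     /* event seq lives at ring[seq % RING_SIZE] */
static unsigned int next_seq;       /* sequence of the next arrival */

typedef struct {
    int          ctx;               /* PMAPI context id from pmcd */
    unsigned int last_fetched;      /* high-water mark for this client */
} client_state_t;

/* on pmFetch: how many events does this context still need to see? */
static unsigned int
pending(const client_state_t *cp)
{
    unsigned int n = next_seq - cp->last_fetched;

    return n > RING_SIZE ? RING_SIZE : n;   /* older entries overwritten */
}
------------------------------------------------------------------------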

The advantage is we maintain polled and synchronous protocols between 
the clients and pmcd ... it is exactly these sorts of pros and cons that 
I'd like to see us discussing.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-12 16:43     ` Ken McDonell
  2010-09-13  2:21       ` Greg Banks
  2010-09-13 13:29       ` Max Matveev
@ 2010-09-13 20:39       ` Frank Ch. Eigler
  2 siblings, 0 replies; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-13 20:39 UTC (permalink / raw)
  To: Ken McDonell; +Cc: pcp, systemtap

Hi, Ken -


> >(2) protocol extensions for live-push on pmda and pmcd-client interfaces
> >     This clearly larger effort is only worth undertaking with the
> >     community's sympathy and assistance.  It might have some
> >     interesting integration possibilities with the other tools,
> >     espectially pmie (the inference engine).
> 
> I'd like to go back to a comment Nathan made at the start of this 
> thread, namely to try and get a clear idea of the problem we're trying 
> to solve here and the typical use cases.  [...]

I guess the basic idea is to allow a single client tool to be able to
draw & analyze both gross performance metrics and the
underlying events that explain those metrics.


> Some of the suggestions to date include ...
> 
> + being able to push data from pmcd asynchronously to clients, as 
> opposed to the time-based pulling from the clients that we support today

Yes:

> [later:] Depending on the set of goals we agree on, there may even
> be a place to consider maintaining the poll-based mechanism, but the
> export data is a variable length buffer of all event records (each
> aggregated and self-identifying as below) seen since the last
> poll-based sample. [...]

As Max says, this would seem to require keeping some client state and
buffers in pmcd and/or pmda, to avoid missing events between
consecutive calls.

Instead of that, I'm starting to sketch out a hybrid scheme that, on
the pmapi side, is represented like this.  (Please excuse the
inclusion of actual code.  It makes things more concrete and easier to
discuss.)


------------------------------------------------------------------------
/*
 * Callback function from pmWatch(), supplying zero or more pmResult rows
 * accumulated during this pmWatch() interval.  The first argument gives
 * number of pmResults in the second argument.  The third argument is
 * a generic data pointer passed through from pmWatch().
 *
 * The function should not call pmFreeResult() on the incoming values.
 * The function may return 0 to indicate its desire to continue watching,
 * or a non-zero value to abort the watch.  This value will be returned
 * from pmWatch.
 */
typedef int (*pmWatchCallBack)(int resCount, const pmResult ** results,
                               void * data);

/*
 * Fetch metrics periodically, as if pmFetch() was called at the given
 * poll interval (if any).  First few parameters are as for pmFetch().
 * Each pmFetch() result is supplied via the given callback function.
 * The callback function can consume the data, and return a value
 * to dictate whether the polling loop is to continue or stop.
 *
 * In addition, if a PMDA pushes discrete metric updates during this
 * watch period, the callback function will be invoked more frequently.
 * (Other metric slots will have a NULL pmResult->vset[].)
 *
 * If given, approximately every poll interval, the callback function
 * is called (possibly with a zero resCount) to give the application a
 * chance to quit the loop.
 */
extern int pmWatch(int, pmID *,
                   pmWatchCallBack fn, void * data,
                   const struct timeval *pollInterval,
                   const struct timeval *timeoutInterval);
------------------------------------------------------------------------

So a pmapi client would make a single long-duration pmWatch call to
libpcp.  libpcp calls back into the application periodically (to poll
normal metric values) or whenever discrete events arrive.  Eventually
the app says "enough" by returning the appropriate rc.

At the pmda.h or PDU side, I don't have a corresponding sketch yet.  I
wonder if we could permit multithreading just for the corresponding
parts of the API:

    pmcd->pmda     (*pmdaInterface.version.five.watch)(..., callbackFn, cbKey, ...);
                   # pmda spawns a new thread, sets it up
                   =>  key (thread-id)
               
    pmda thread2   (*callbackFn) (n, "event data pmResult" [array], cbKey, ...)

    pmcd->pmda (*pmdaInterface.version.five.unwatch)(key);
                   # pmda kills thread2
                   => void

to register an interest in metrics with the PMDA, have a new thread
call back into PMCD only to supply new data via a dedicated
function, then eventually unregister.  This may require only
relatively small parts of libpcp/libpcp_pmda to be made thread-safe.


> + data filtering predicates pushed from a client to pmcd and then on to
> a pmda to enable or restrict the types of events or conditions on event
> parameters that would be evaluated before asynchronously sending
> matching events to the client

Right.  This would represent a pure performance optimization if there
were only a single concurrent client.  With more than one, a filtering
algebra would be needed.  I don't have a sketch for this yet.


> + handling event records with associated event parameters as an extended
> data type

Right.  Hiding JSON or somesuch in a string is probably OK, unless we
want to reify filtering and inferencing upon them.


> + additional per-client state data being held in pmcd to allow rate
> aggregation (and similar temporal averaging) to be done at pmcd, rather
> than the client [note I have a long-standing objection [...]

I guess it depends on what we could be saving by having pmcd perform
such conversions instead of clients.  Client-side CPU and storage
seem cheaper than network traffic if the data reduction is moderate,
but if it's high, it's probably the other way.  (In the systemtap
model, we encourage users to filter events aggressively at the source,
which turns the data firehose into a dribble.  To exploit this fully
in the pcp-intermediated world though, we'd have to pass filtering
parameters through.)


> + better support for web-based monitoring tools (although Javascript
> evolution may make this less pressing than it was 5 years ago)

Right, at this point it seems like a fatter javascript app should be
able to do this job without pmcd help; the web app just needs to
access the pmapi (through a proxy if necessary).


> + better support for analysis that spans the timeline between the
> current and the recent past

This sounds like useful but future work.  Until it is done, we could
have clients perform archive-vs-live data merging on their own, or
else have the users start clients early enough to absorb the "recent
past" data as live.


> Returning to Frank's point, I'm not sure pmie would be able to consume
> asynchronous events ... [...]

That's OK, it should at worst ignore such events.  At best, in the
future, it could gain some more general temporal/reactive-database
type facilities to do something meaningful.


- FChE

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-12 16:43     ` Ken McDonell
  2010-09-13  2:21       ` Greg Banks
@ 2010-09-13 13:29       ` Max Matveev
  2010-09-13 20:53         ` Ken McDonell
  2010-09-13 20:39       ` Frank Ch. Eigler
  2 siblings, 1 reply; 48+ messages in thread
From: Max Matveev @ 2010-09-13 13:29 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Frank Ch. Eigler, systemtap, pcp

On Mon, 13 Sep 2010 02:43:03 +1000, Ken McDonell wrote:

 kenj> Some of the suggestions to date include ...

 kenj> + data filtering predicates pushed from a client to pmcd and then on to 
 kenj> a pmda to enable or restrict the types of events or conditions on event 
 kenj> parameters that would be evaluated before asynchronously sending 
 kenj> matching events to the client

How would that work if multiple clients request mutually exclusive
predicates?

 kenj> + additional per-client state data being held in pmcd to allow rate 
 kenj> aggregation (and similar temporal averaging) to be done at pmcd, rather 
 kenj> than the client [note I have a long-standing objection to this approach 
 kenj> based on the original design criteria that pmcd needs to be mean and 
 kenj> lean to reduce impact, and data reduction and analysis should be pushed 
 kenj> out to the clients where the computation can be done without impacting 
 kenj> the system being monitored ... but maybe it is time to revisit this, as 
 kenj> the current environments where PCP is being used may differ from those 
 kenj> we were concerned with in 1994]

Recent (2010) experience on a 3rd-rate platform suggests that this is
still an issue - doing too much calculation in the pmda adds to the
time it takes to fetch the data; unless pmcd can magically hide delays
induced by calculations it has to make or calculations made by the pmda,
ALL clients suffer.

 kenj> Depending on the set of goals we agree on, there may even be a place to 
 kenj> consider maintaining the poll-based mechanism, but the export data is a 
 kenj> variable length buffer of all event records (each aggregated and 
 kenj> self-identifying as below) seen since the last poll-based sample.

How will this work with multiple clients? Will the clients get a "snap
time" to indicate when pmda updated its buffer or will pmda need to
remember the state of each client (which would mean dragging client
information into pmda, maintaining that information and somehow
retiring per-client state without affecting the clients).

max

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-12 16:43     ` Ken McDonell
@ 2010-09-13  2:21       ` Greg Banks
  2010-09-13 13:29       ` Max Matveev
  2010-09-13 20:39       ` Frank Ch. Eigler
  2 siblings, 0 replies; 48+ messages in thread
From: Greg Banks @ 2010-09-13  2:21 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Frank Ch. Eigler, systemtap, pcp

Ken McDonell wrote:
> Apologies for my tardiness in responding, but I'm travelling at the 
> moment (typing this on a train and then on a ferry somewhere in Norway).
>
>
>   
Sounds like fun :)
>
>   
> On 3/09/2010 5:39 AM, Frank Ch. Eigler wrote:
>   
>> ...
>>
>> OK, then it looks like we'd have at least a few separate pieces to
>> work on:
>>
>> * extensions to the PMCD<->PMDA API/protocol to allow PMDAs to push
>>    event data, and corresponding extensions for PMclients<->PMCD
>>     
>
> I'd really like to see some more discussion on how people think this is 
> going to work.  None of the PCP libraries are thread-safe (again a 
> deliberate design decision at the original point of conception),
I've made a brief survey of the places in the libpcp code which are not 
threadsafe; there's *lots* of them but most are easily fixed without 
breaking external interfaces.  I'd estimate a few weeks' work is 
involved.  I'm interested in helping on this for my own reasons (I'd 
like kmchart to be more robust when communication with pmdas is disrupted).

>  and 
> asynchronous delivery of data from pmdas through pmcd to clients 
> increases the likelihood that people will want to use multiple threads 
> to handle PCP calls.  There are some asynchronous calls that were 
> grafted onto libpcp later on, but these have very little use in existing 
> code and no QA coverage.
>
>   
They're also a right bugger to program with, as we've discovered.  I 
would be happy to see them deprecated in favour of full libpcp thread 
safety.

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-08-31 19:49   ` Frank Ch. Eigler
  2010-09-01  6:25     ` Mark Goodwin
@ 2010-09-12 16:43     ` Ken McDonell
  2010-09-13  2:21       ` Greg Banks
                         ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Ken McDonell @ 2010-09-12 16:43 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: pcp, systemtap

Apologies for my tardiness in responding, but I'm travelling at the 
moment (typing this on a train and then on a ferry somewhere in Norway).

On 1/09/2010 5:49 AM, Frank Ch. Eigler wrote:
> Hi -
>
> Thanks, Nathan, Ken, Greg, Mark, for clarifying the status quo and
> some of the history.
>
>  ...
>
> (2) protocol extensions for live-push on pmda and pmcd-client interfaces
>      This clearly larger effort is only worth undertaking with the
>      community's sympathy and assistance.  It might have some
>      interesting integration possibilities with the other tools,
>      espectially pmie (the inference engine).

I'd like to go back to a comment Nathan made at the start of this 
thread, namely to try and get a clear idea of the problem we're trying 
to solve here and the typical use cases.  I think it is important to get 
all of this on the table before we start too much of a discussion about 
possible evolutionary change for PCP (something I am very supportive of, 
in general terms).

Some of the suggestions to date include ...

+ being able to push data from pmcd asynchronously to clients, as 
opposed to the time-based pulling from the clients that we support today

+ data filtering predicates pushed from a client to pmcd and then on to 
a pmda to enable or restrict the types of events or conditions on event 
parameters that would be evaluated before asynchronously sending 
matching events to the client

+ handling event records with associated event parameters as an extended 
data type

+ additional per-client state data being held in pmcd to allow rate 
aggregation (and similar temporal averaging) to be done at pmcd, rather 
than the client [note I have a long-standing objection to this approach 
based on the original design criteria that pmcd needs to be mean and 
lean to reduce impact, and data reduction and analysis should be pushed 
out to the clients where the computation can be done without impacting 
the system being monitored ... but maybe it is time to revisit this, as 
the current environments where PCP is being used may differ from those 
we were concerned with in 1994]

+ better support for web-based monitoring tools (although Javascript 
evolution may make this less pressing than it was 5 years ago)

+ better support for analysis that spans the timeline between the 
current and the recent past

This is already a long list, with the work items spanning about 2 orders 
of magnitude of effort.  It would be good to drive towards consensus on 
this list of items, and then prioritizing them.

Depending on the set of goals we agree on, there may even be a place to 
consider maintaining the poll-based mechanism, but the export data is a 
variable length buffer of all event records (each aggregated and 
self-identifying as below) seen since the last poll-based sample.

Returning to Frank's point, I'm not sure pmie would be able to consume 
asynchronous events ... it is already a very complicated predicate 
engine with the notion of rules being scheduled and evaluated with fixed 
(but not necessarily identical) evaluation intervals for each rule. Some 
of the aggregation, existential, universal and percentile predicates 
don't have sound semantics in the presence of asynchrounous data 
arrival, e.g. some_inst(), all_inst(), count_sample(), etc.

> For the static-pmns issue, the possibility of dynamic instance
> domains, metric subspaces is probably sufficient, if the event
> parameters are limited to only 1-2 degrees of freedom.  (In contrast,
> imagine browsing a trace of NFS or kernel VFS operations; these have
> ~5 parameters.)

I am not sure this is a problem.  Each event has a unique timestamp, so 
each parameter could be encoded as a PCP metric of the appropriate type 
and semantics. If that is not adequate, then the best approach would 
seem to be to extend the base data types to include some sort of 
self-encoded aggregate ... I don't have a strong view on which of the 
existing "standards" should be adopted here, but it does not appear to 
be a hard problem at the PCP protocol layer ... constructing the 
aggregate would be PMDA (or similar) responsibility and interpretation 
of the aggregate is a client responsibility, although it would be more 
consistent with the PCP approach if the aggregate included the semantics 
and metadata for the parameters, even if this is only delivered once per 
client connection.

>...

On 3/09/2010 5:39 AM, Frank Ch. Eigler wrote:
> ...
>
> OK, then it looks like we'd have at least a few separate pieces to
> work on:
>
> * extensions to the PMCD<->PMDA API/protocol to allow PMDAs to push
>    event data, and corresponding extensions for PMclients<->PMCD

I'd really like to see some more discussion on how people think this is 
going to work.  None of the PCP libraries are thread-safe (again a 
deliberate design decision at the original point of conception), and 
asynchronous delivery of data from pmdas through pmcd to clients 
increases the likelihood that people will want to use multiple threads 
to handle PCP calls.  There are some asynchronous calls that were 
grafted onto libpcp later on, but these have very little use in existing 
code and no QA coverage.

> * teaching some of the existing clients to process such data

As I mentioned above, I think we need to preserve the metadata concepts 
in the PCP protocols so that this data does not become opaque and only 
understood by the producer and consumer (one of my long-standing 
complaints about SNMP and the MIB concept which PCP has so far done a 
better job of addressing).

> * a systemtap PMDA that listens to pmStore filtering/control instructions;
>    probably using plain type STRING for JSON payload

Currently there is no client identification (and hence no notion of a 
session) that is passed down from pmcd to the pmdas, so how would this 
filtering work in the presence of multiplexed requests coming from a number 
of clients?  And when would the filtering stop?  And is it possible that 
multiple clients could request filter predicates that are mutually 
exclusive?

> * a PMCD<->XMLRPC bridge

I am not sure that pmcd is the right place to put this ... if the bridge 
was a client of pmcd, this would be more PCP-like, and match the way in 
which pmproxy (and to a lesser extent derived metrics) are supported.

> * the web application itself

I'm not a web guy, but this seems the simplest piece of the puzzle ... 8^)>

I suspect the proposals here are substantive enough that they require a 
white paper and discussion, rather than a convoluted email thread.  If I 
could get some feedback and answers to my questions, I'd be happy to put 
together an initial document to guide the discussion ... if someone else 
wants to drive, that's fine by me also.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] <1740408065.702181283817937875.JavaMail.root@mail-au.aconex.com>
@ 2010-09-07  0:09 ` nathans
  0 siblings, 0 replies; 48+ messages in thread
From: nathans @ 2010-09-07  0:09 UTC (permalink / raw)
  To: Frank Ch. Eigler, Greg Banks, pcp, systemtap


----- "Frank Ch. Eigler" <fche@redhat.com> wrote:

> Greg Banks <gnb@evostor.com> writes:
> 
> >>> For the web-based frontend issue, yeah, javascript+svg+etc.
> sounds
> >>> most promising, especially if it can be made to speak the native
> wire
> >>> protocol to pmcd.
> 
> Sure; that could be tackled later / orthogonally in principle.  But
> since modern javascript appears to lack low level socket access APIs,
> this may have to be done whether we like it or not.  (Or go Java.)
> ...
> > [...]  Instead I would abandon all attempts at building a time
> > machine, push all the brains out to JS code in the browser, and
> > create a very simple stateless HTTP-to-PCP protocol bridge daemon to
> > allow PCP data to be shipped from pmcd to frontend code as either
> > XML or JSON.  Modern browsers have sufficiently fast and functional
> > JS engines that this is now feasible.
> 
> OK, then it looks like we'd have at least a few separate pieces to
> work on:
> ...
> * a PMCD<->XMLRPC bridge

Just as an FYI - I've begun prototyping this.  I have access to a bevy
of web programmers here too, so hopefully can leverage their knowledge
in this space too.  I'll be starting out with my pmproxy approach, but
if that becomes problematic will go the separate daemon route.

I expect we will be running this in our production environments before
too long too, now that it's occurred to us :) we have immediate users of
this info with a more web/Java-friendly protocol.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-01 15:05   ` David Smith
@ 2010-09-06 16:39     ` Ken McDonell
  0 siblings, 0 replies; 48+ messages in thread
From: Ken McDonell @ 2010-09-06 16:39 UTC (permalink / raw)
  To: David Smith; +Cc: Frank Ch. Eigler, systemtap

On 2/09/2010 1:05 AM, David Smith wrote:
> On 08/29/2010 10:54 AM, Ken McDonell wrote:
>
> ... stuff deleted ...
>
>> As Nathan has suggested, if event traces are intended for retrospective
>> analysis (as opposed to event counters being suited for either real time
>> or retrospective analysis), then there is an alternative approach,
>> namely to create a PCP archive directly from a source of data without
>> involving pmcd or a pmda or pmlogger.  We've recently reworked the
>> "pmimport" services to expose better APIs to support just this style of
>> use ... see LOGIMPORT(3) and sar2pcp(1) for an example.  I think this
>> approach is possibly a better semantic match between PCP and a stream of
>> event records.
>
> Hmm.  If I'm understanding all the acronyms correctly, I'm not seeing
> the benefit of using LOGIMPORT to create a PCP archive vs. involving
> pmcd/pmda/pmlogger.  Could you expand here?

David,

The "benefit" is that importing data to create a PCP archive is a data 
translation process that is not dependent on polled sampling of data ... 
you can consume a stream of timestamped data and create a corresponding 
PCP archive as an off-line or batch process.  Import tools are also 
comparatively simple to write.
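
A minimal sketch of this style of use against the C import API, with the
metric name, value and timestamp invented for illustration (a real
importer would check every return code):

    #include <pcp/pmapi.h>
    #include <pcp/import.h>

    int
    main(void)
    {
        /* create the output archive */
        pmiStart("myarchive", 0);
        pmiAddMetric("trace.vfs.read.count", PM_ID_NULL, PM_TYPE_U64,
                     PM_INDOM_NULL, PM_SEM_COUNTER,
                     pmiUnits(0, 0, 1, 0, 0, PM_COUNT_ONE));

        /* for each timestamped record in the input stream ... */
        pmiPutValue("trace.vfs.read.count", NULL, "42");
        pmiWrite(1283401180, 0);    /* seconds, microseconds */

        pmiEnd();
        return 0;
    }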

The "disadvantage" is that you're no closer to real-time monitoring with 
this approach, so it is usually used in cases where there is an existing 
body of historical data and one is interested in using pmie or pmchart 
or pmlogsummary for some retrospective analysis ... non-PCP tools like 
sar and monitoring subsystems that support "Export to Excel" are the 
most common examples.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-02  2:05       ` Greg Banks
@ 2010-09-02 19:40         ` Frank Ch. Eigler
  0 siblings, 0 replies; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-09-02 19:40 UTC (permalink / raw)
  To: Greg Banks; +Cc: Mark Goodwin, systemtap, pcp

Greg Banks <gnb@evostor.com> writes:

> [...]
> Firstly, do you need to view the actual parameters involved when
> fetching values, or just use those parameters for filtering purposes
> to select some subset of all VFS operations (e.g. "show me read()s and
> write()s to inode 12345 on /foo") ?

You mean whether they may be needed only for filtering control, and
not for display?  I'm sure it's needed for display too - else a user
might not know what to filter on.


> Secondly, there's a "convention" for encoding faux
> multiple-dimension instance names, but it's really just a horrible
> hack for encoding an arbitrary tuple as a single string, like awk
> does.

Yeah.  OTOH if filtering needs to be done in an intermediate layer
like the PMCD or PMLOG* or PMPROXY, then tuple-wide data and its
operations would need to be more first-class, instead of being
smuggled in a PM_TYPE_STRING.


>>> For the web-based frontend issue, yeah, javascript+svg+etc. sounds
>>> most promising, especially if it can be made to speak the native wire
>>> protocol to pmcd.

> It certainly could do, but for firewall and AJAX friendliness I'd vote
> for wrapping it in HTTP, XML-RPC style.

Sure; that could be tackled later / orthogonally in principle.  But
since modern javascript appears to lack low level socket access APIs,
this may have to be done whether we like it or not.  (Or go Java.)


>> [...]
>> Time averaging, aggregation and filtering were all ambitious aims
>> of the project Greg's talking about - I wonder if that code could
>> ever be resurrected and open sourced?
>
> Euurgh, dear Lord nonono :(
>
> Frank: that project didn't serve archives, it had a PMDA component
> which presented new metrics which were rate converted and averaged
> versions of existing metrics.  This wasn't the best of ideas:

I can see how one could interpret filtering in the middle as
necessitating computing virtualized metrics, and that does seem
complicated, but I was not trying to get into that area.


> [...]  Instead I would abandon all attempts at building a time
> machine, push all the brains out to JS code in the browser, and
> create a very simple stateless HTTP-to-PCP protocol bridge daemon to
> allow PCP data to be shipped from pmcd to frontend code as either
> XML or JSON.  Modern browsers have sufficiently fast and functional
> JS engines that this is now feasible.

OK, then it looks like we'd have at least a few separate pieces to
work on:

* extensions to the PMCD<->PMDA API/protocol to allow PMDAs to push
  event data, and corresponding extensions for PMclients<->PMCD

* teaching some of the existing clients to process such data

* a systemtap PMDA that listens to pmStore filtering/control instructions;
  probably using plain type STRING for JSON payload

* a PMCD<->XMLRPC bridge

* the web application itself


- FChE

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-02  3:42 ` nathans
@ 2010-09-02  4:11   ` Greg Banks
  0 siblings, 0 replies; 48+ messages in thread
From: Greg Banks @ 2010-09-02  4:11 UTC (permalink / raw)
  To: nathans; +Cc: Frank Ch. Eigler, pcp, systemtap, Mark Goodwin

nathans@aconex.com wrote:
> ----- "Greg Banks" <gnb@evostor.com> wrote:
>
>   
>>
>> [...] create a very 
>> simple stateless HTTP-to-PCP protocol bridge daemon [...]
>>     
>
> That echoes my thoughts; we should be able to extend pmproxy to do this
> too - instead of simply proxying the native protocol, it could convert
> XML/JSON client requests on the front end to regular PCP protocol on the
> backend (optionally, and only if the client requests come in that way,
> so the existing proxy protocol is unchanged).  Then we wouldn't need a
> new daemon, etc.
>
>   
Sure we could do it in pmproxy, but I don't see what it buys us other 
than not having to start one more daemon in the init script?

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] <1459138113.589721283398633993.JavaMail.root@mail-au.aconex.com>
@ 2010-09-02  3:42 ` nathans
  2010-09-02  4:11   ` Greg Banks
  0 siblings, 1 reply; 48+ messages in thread
From: nathans @ 2010-09-02  3:42 UTC (permalink / raw)
  To: Frank Ch. Eigler, Greg Banks; +Cc: pcp, systemtap, Mark Goodwin


----- "Greg Banks" <gnb@evostor.com> wrote:

> Mark Goodwin wrote:
> > On 09/01/2010 05:49 AM, Frank Ch. Eigler wrote:
> >> For the web-based frontend issue, yeah, javascript+svg+etc. sounds
> >> most promising, especially if it can be made to speak the native wire
> >> protocol to pmcd.
> It certainly could do, but for firewall and AJAX friendliness I'd vote
> for wrapping it in HTTP, XML-RPC style.
> ...
> push all the brains out to JS code in the browser, and create a very 
> simple stateless HTTP-to-PCP protocol bridge daemon to allow PCP data
> to be shipped from pmcd to frontend code as either XML or JSON.  Modern 
> browsers have sufficiently fast and functional JS engines that this is

That echoes my thoughts; we should be able to extend pmproxy to do this
too - instead of simply proxying the native protocol, it could convert
XML/JSON client requests on the front end to regular PCP protocol on the
backend (optionally, and only if the client requests come in that way,
so the existing proxy protocol is unchanged).  Then we wouldn't need a
new daemon, etc.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-09-01  6:25     ` Mark Goodwin
@ 2010-09-02  2:05       ` Greg Banks
  2010-09-02 19:40         ` Frank Ch. Eigler
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Banks @ 2010-09-02  2:05 UTC (permalink / raw)
  To: Mark Goodwin; +Cc: Frank Ch. Eigler, systemtap, pcp

Mark Goodwin wrote:
> On 09/01/2010 05:49 AM, Frank Ch. Eigler wrote:
>   
>   
>> For the static-pmns issue, the possibility of dynamic instance
>> domains, metric subspaces is probably sufficient, if the event
>> parameters are limited to only 1-2 degrees of freedom.  (In contrast,
>> imagine browsing a trace of NFS or kernel VFS operations; these have
>> ~5 parameters.)
>>     
>
> PCP instance domains are traditionally single dimensional, though there
> are a few exceptions such as kernel.percpu.interrupts. It's easy enough
> to split multi-dimensional data structures out into multiple metrics with
> a common instance domain.
>   
Two comments.

Firstly, do you need to view the actual parameters involved when 
fetching values, or just use those parameters for filtering purposes to 
select some subset of all VFS operations (e.g. "show me read()s and 
write()s to inode 12345 on /foo") ?

Secondly, there's a "convention" for encoding faux multiple-dimension 
instance names, but it's really just a horrible hack for encoding an 
arbitrary tuple as a single string, like awk does.
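
For the record, the hack amounts to something like this (field layout
invented for the example):

    #include <stdio.h>

    int
    main(void)
    {
        char            inst[64], op[16], path[32];
        unsigned long   inode;

        /* producer: pack the (op, inode, path) tuple into one
         * instance name string */
        snprintf(inst, sizeof(inst), "%s::%lu::%s",
                 "read", 12345UL, "/foo");

        /* consumer: split it back apart, awk-style */
        if (sscanf(inst, "%15[^:]::%lu::%31s", op, &inode, path) == 3)
            printf("op=%s inode=%lu path=%s\n", op, inode, path);
        return 0;
    }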

>
>   
>> For the web-based frontend issue, yeah, javascript+svg+etc. sounds
>> most promising, especially if it can be made to speak the native wire
>> protocol to pmcd.
It certainly could do, but for firewall and AJAX friendliness I'd vote 
for wrapping it in HTTP, XML-RPC style.

>> This would seem to argue for a stateful
>> archive-serving pmcd, or perhaps an archive-serving proxy, as in Greg's
>> old project.
>>     
>
> Time averaging, aggregation and filtering were all ambitious aims
> of the project Greg's talking about - I wonder if that code could
> ever be resurrected and open sourced? 

Euurgh, dear Lord nonono :(

Frank: that project didn't serve archives, it had a PMDA component which 
presented new metrics which were rate converted and averaged versions of 
existing metrics.  This wasn't the best of ideas:

> One abomination here was
> that a PMDA could also be a client - and potentially query itself
> for metrics(!)
>   
Doing it again (the fourth time!), I would not try that particular stunt 
again.  Instead I would abandon all attempts at building a time machine, 
push all the brains out to JS code in the browser, and create a very 
simple stateless HTTP-to-PCP protocol bridge daemon to allow PCP data to 
be shipped from pmcd to frontend code as either XML or JSON.  Modern 
browsers have sufficiently fast and functional JS engines that this is 
now feasible.

Alternately, and this is a lot more risky, I'd add rate conversion and 
time-averaging features to pmcd.

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-08-29 15:55 ` [pcp] " Ken McDonell
@ 2010-09-01 15:05   ` David Smith
  2010-09-06 16:39     ` Ken McDonell
  0 siblings, 1 reply; 48+ messages in thread
From: David Smith @ 2010-09-01 15:05 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Frank Ch. Eigler, systemtap

On 08/29/2010 10:54 AM, Ken McDonell wrote:

... stuff deleted ...

> As Nathan has suggested, if event traces are intended for retrospective
> analysis (as opposed to event counters being suited for either real time
> or retrospective analysis), then there is an alternative approach,
> namely to create a PCP archive directly from a source of data without
> involving pmcd or a pmda or pmlogger.  We've recently reworked the
> "pmimport" services to expose better APIs to support just this style of
> use ... see LOGIMPORT(3) and sar2pcp(1) for an example.  I think this
> approach is possibly a better semantic match between PCP and a stream of
> event records.

Hmm.  If I'm understanding all the acronyms correctly, I'm not seeing
the benefit of using LOGIMPORT to create a PCP archive vs. involving
pmcd/pmda/pmlogger.  Could you expand here?

Thanks.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-08-31 19:49   ` Frank Ch. Eigler
@ 2010-09-01  6:25     ` Mark Goodwin
  2010-09-02  2:05       ` Greg Banks
  2010-09-12 16:43     ` Ken McDonell
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Goodwin @ 2010-09-01  6:25 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: pcp, systemtap

On 09/01/2010 05:49 AM, Frank Ch. Eigler wrote:
> Hi -
>
> Thanks, Nathan, Ken, Greg, Mark, for clarifying the status quo and
> some of the history.
>
> We understand that the two problem domains are traditionally handled
> with the event-tracing -vs- stats-monitoring distinction.  We're trying
> to see where best to focus efforts to make some small steps to bridge
> the two, where plenty of compromises are possible.  We'd prefer to
> help build on an existing project with a nice community than to do new
> stuff.

yes certainly :)

> For the poll-based data gathering issue, a couple of approaches came up:
>
> (1) bypassing pmcd and generating a pmarchive file directly from
>      trace data.  This appears to imply continuing the archive-vs-live
>      dichotomy that makes it difficult for clients to process both
>      recent and current data seamlessly together.

one of the issues with the live vs archive dichotomy is that live
data is always available (since you're requesting it explicitly from
a PMDA that is otherwise passive), whereas the archive data is not
available unless configured to be collected beforehand (see pmlogger).
There is too much data to collect everything all the time - it's too
impractical and intrusive, so some form of filtering and/or aggregation
needs to be done (see pmlogsummary, and Greg's old project too).

> Since using such
>      files would probably also need a custom client, then we'd not be
>      using much of the pcp infrastructure, only as a passive data
>      encoding layer.  This may not be worthwhile.
>
> (2) protocol extensions for live-push on pmda and pmcd-client interfaces
>      This clearly larger effort is only worth undertaking with the
>      community's sympathy and assistance.  It might have some
>      interesting integration possibilities with the other tools,
>      especially pmie (the inference engine).

yep - I suspect Ken and maybe Nathan would have further comments on this

>
> For the static-pmns issue, the possibility of dynamic instance
> domains, metric subspaces is probably sufficient, if the event
> parameters are limited to only 1-2 degrees of freedom.  (In contrast,
> imagine browsing a trace of NFS or kernel VFS operations; these have
> ~5 parameters.)

PCP instance domains are traditionally single dimensional, though there
are a few exceptions such as kernel.percpu.interrupts. It's easy enough
to split multi-dimensional data structures out into multiple metrics with
a common instance domain.

> For the scalar-payloads issue, the BLOB/STRING metric types are indeed
> available but are opaque to other tools, so don't compose well.  Would
> you consider one additional data type, something like a JSON[1]
> string?  It would be self-describing, with pmie and general processing
> opportunities, though those numbers would lack the PMDA_PMUNITS
> dimensioning.

this could work using string or binary blob data types in the
existing protocols - though there is a size limit. And one of
the blessed features of PCP is that the client monitoring tools can
more or less monitor any metric - so any solution here would
also need specially crafted client tools. Extensions to the Perl
bindings would probably work best, e.g. interfacing with perl-JSON-*

> For the filtering issue, pmStore() is an interesting possibility,
> allowing the PMDAs to bear the brunt.  OTOH, if pmcd evolved into a
> data-push-capable widget, it could serve as a filtering proxy,
> requiring separate API or interpretation of the pmStore data.

well pmcd is already data-push capable using the pmstore interface,
allowing clients to store values for certain metrics in some of
the PMDAs. Filtering and parsing is done by the PMDA itself and
pmcd just acts as a proxy passthru (kind of a back-channel to
the pull interface).

pmstore hasn't really been used in anger like this though - more just
for setting config & control options and the like. The same (or similar)
protocol has also been used for a data source to open a socket directly
to a PMDA and tie into the PMDA's select loop, rather than going via pmcd.

>
> For the web-based frontend issue, yeah, javascript+svg+etc. sounds
> most promising, especially if it can be made to speak the native wire
> protocol to pmcd.  This would seem to argue for a stateful
> archive-serving pmcd, or perhaps an archive-serving proxy, as in Greg's
> old project.

Time averaging, aggregation and filtering were all ambitious aims
of the project Greg's talking about - I wonder if that code could
ever be resurrected and open sourced? One abomination here was
that a PMDA could also be a client - and potentially query itself
for metrics(!)

> Is this sounding reasonable?
>

it's going to take a lot more discussion, but enthusiasm seems to
be on our side :)

Cheers
-- Mark Goodwin

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] ` <4C7A7DFE.2040606@internode.on.net>
  2010-08-31  3:29   ` Greg Banks
@ 2010-08-31 19:49   ` Frank Ch. Eigler
  2010-09-01  6:25     ` Mark Goodwin
  2010-09-12 16:43     ` Ken McDonell
  1 sibling, 2 replies; 48+ messages in thread
From: Frank Ch. Eigler @ 2010-08-31 19:49 UTC (permalink / raw)
  To: pcp, systemtap

Hi -

Thanks, Nathan, Ken, Greg, Mark, for clarifying the status quo and
some of the history.

We understand that the two problem domains are traditionally handled
with the event-tracing -vs- stats-monitoring distinction.  We're trying
to see where best to focus efforts to make some small steps to bridge
the two, where plenty of compromises are possible.  We'd prefer to
help build on an existing project with a nice community than to do new
stuff.

For the poll-based data gathering issue, a couple of approaches came up:

(1) bypassing pmcd and generating a pmarchive file directly from
    trace data.  This appears to imply continuing the archive-vs-live
    dichotomy that makes it difficult for clients to process both
    recent and current data seamlessly together.  Since using such
    files would probably also need a custom client, then we'd not be
    using much of the pcp infrastructure, only as a passive data
    encoding layer.  This may not be worthwhile.

(2) protocol extensions for live-push on pmda and pmcd-client interfaces
    This clearly larger effort is only worth undertaking with the
    community's sympathy and assistance.  It might have some
    interesting integration possibilities with the other tools,
    especially pmie (the inference engine).

For the static-pmns issue, the possibility of dynamic instance
domains, metric subspaces is probably sufficient, if the event
parameters are limited to only 1-2 degrees of freedom.  (In contrast,
imagine browsing a trace of NFS or kernel VFS operations; these have
~5 parameters.)

For the scalar-payloads issue, the BLOB/STRING metric types are indeed
available but are opaque to other tools, so don't compose well.  Would
you consider one additional data type, something like a JSON[1]
string?  It would be self-describing, with pmie and general processing
opportunities, though those numbers would lack the PMDA_PMUNITS
dimensioning.

For the filtering issue, pmStore() is an interesting possibility,
allowing the PMDAs to bear the brunt.  OTOH, if pmcd evolved into a
data-push-capable widget, it could serve as a filtering proxy,
requiring separate API or interpretation of the pmStore data.

For the web-based frontend issue, yeah, javascript+svg+etc. sounds
most promising, especially if it can be made to speak the native wire
protocol to pmcd.  This would seem to argue for a stateful
archive-serving pmcd, or perhaps an archive-serving proxy, as in Greg's
old project.


Is this sounding reasonable?


- FChE

[1] http://en.wikipedia.org/wiki/JSON

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] ` <4C7A7DFE.2040606@internode.on.net>
@ 2010-08-31  3:29   ` Greg Banks
  2010-08-31 19:49   ` Frank Ch. Eigler
  1 sibling, 0 replies; 48+ messages in thread
From: Greg Banks @ 2010-08-31  3:29 UTC (permalink / raw)
  To: Ken McDonell; +Cc: Frank Ch. Eigler, systemtap, pcp

Ken McDonell wrote:
> On 28/08/2010 1:39 AM, Frank Ch. Eigler wrote:
>
> The table below outlines some of the differences ... these help to 
> explain why PCP is /a priori/ not necessarily suitable for event 
> tracing. 
>
> [HTML table lost in quoting: it compared the PCP design center with
> event tracing; the full text version appears in Ken's original
> message later in this thread]
>
I think another problem is the dynamic range of time scales.  Event 
tracing tends to require analysis of behaviour that manifests at 
wildly varying time scales in the same trace, from tens of seconds 
down to microseconds.  PCP's front ends are not very good at doing 
this kind of thing, and don't really handle zooming or LoD or 
bookmarking well.

>
>
>> * no web-based frontends
>>
>>   In our usage, it would be desirable to have some mini pcp-gui that
>>   is based on web technologies rather than QT.
>>     
>
> There are several examples of web interfaces driven by PCP data ... 
> but each of these has been developed as a proprietary and specific 
> application and hence is not included in the PCP open source 
> distribution.  The PCP APIs provide all the services needed to build 
> something like this.
>
Myself and at least one other person on the PCP list have been involved 
with designing three generations of one such proprietary web front end, 
and we found it quite a difficult problem to solve.  The main issue was 
that the PCP architecture is basically a stateless client-driven pull, 
so that any operation which needs to maintain state across multiple 
samples (like time averages, or rate conversion of counters) needs to be 
done all the way out in the client.  Our browser requirements prevented 
us from using Javascript, so we had no practical way to do that, and had 
to insert a caching/rate conversion/averaging daemon in between.  That 
daemon proved...troublesome.  These days a JS + AJAX + SVG solution 
would probably do the trick nicely, and would be interesting to write.

Also, Frank: you mentioned NFS in passing; I'm curious as to what 
exactly you're up to?

-- 
Greg.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
  2010-08-27 15:39 Frank Ch. Eigler
@ 2010-08-29 15:55 ` Ken McDonell
  2010-09-01 15:05   ` David Smith
       [not found] ` <4C7A7DFE.2040606@internode.on.net>
  1 sibling, 1 reply; 48+ messages in thread
From: Ken McDonell @ 2010-08-29 15:55 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

[resending as text so it makes it to the systemtap list]

On 28/08/2010 1:39 AM, Frank Ch. Eigler wrote:
 > Hi -
 >
 > We're investigating to what extent the PCP suite may be suitable for
 > more general low-level event tracing.  Just from docs / source gazing
 > (so please excuse my terminology errors), a few challenges would seem
 > to be:

G'day Frank and others.

Apologies for the length of this reply, but there are a number of
non-trivial issues at play here.

Nathan has already answered some of your questions.  I'd like to start
by providing some historical and design center context.  From the outset
PCP was *not* designed for event-tracing, but PCP *was* designed for a
specific class of performance monitoring and management scenarios.

The table below outlines some of the differences ... these help to
explain why PCP is /a priori/ not necessarily suitable for event
tracing.  This does not mean PCP could not evolve to support
event-tracing in the ways Nathan has suggested, we just need to
understand that the needs are different and make sure we do not end up
morphing PCP into something that no longer works for the original design
center and may not work all that well for event tracing.

Locality of data processing
     PCP Design Center
         Monitored system is typically not the same system that the
         collection and/or analysis is performed on.
     Event Tracing
         Data collection happens on the system being monitored, analysis
         may happen later on another system.

Real time analysis
     PCP Design Center
         Central to the design requirements.
     Event Tracing
         Often not required, other than edge-triggers to start and stop
         collection.

Retrospective analysis
     PCP Design Center
         Central to the design requirements.
     Event Tracing
         Central to the design requirements.

Time scales
     PCP Design Center
         We are typically concerned with large and complex systems where
         average levels of activity over periods of the order of tens of
         seconds are representative.
     Event Tracing
         Short-term and transients are often important, and inter-arrival
         time for events may be on the order of milliseconds.

Data rates
     PCP Design Center
         Moderate. Monitoring is often long-term, requiring broad and
         shallow data collection, with a small number of narrow and deep
         collections aligned to known or suspected problem areas.
     Event Tracing
         Very high.  Monitoring is most often narrow, deep and short-lived.

Data spread
     PCP Design Center
         Very broad ... interesting data may come from a number of places,
         e.g.  hardware instrumentation, operating system stats, service
         layers and libraries, applications and distributed applications.
     Event Tracing
         Very narrow ... one source and one host.

Data semantics
     PCP Design Center
         A very broad range, but the most common are activity levels and
         event *counters* (with little or no event parameter information)
     Event Tracing
         Very specific, being the record of an event and its parameters
         with a high resolution time stamp.

Data source extensibility
     PCP Design Center
         Critical.
     Event Tracing
         Rare.

So with this background, let's look at Frank's specific questions.

 > * poll-based data gathering
 >
 >    It seems as though PMDAs are used exclusively in 'polling' mode,
 >    meaning that underlying system statistics are periodically queried
 >    and summary results stored.  In our context, it would be useful if
 >    PMDAs could push event data into the stream as they occur - perhaps
 >    hundreds of times a second.

Yep, this would be a big change.  There is not really a data stream in
PCP ... there is a source of performance metrics (a host or an archive)
and clients connect to that source and pull data at a sample interval
defined by the client.

At the host source, the co-ordinating daemon (pmcd) maintains no cache
nor stream of recent data ... a client asks for a specific subset of the
available information, this is instantiated and returned to the client.
There is no requirement for the subsets of the requested information to
be the same for consecutive requests from a single client, and pmcd is
receiving requests from a number of clients that are handled completely
independently.

As Nathan has suggested, if event traces are intended for retrospective
analysis (as opposed to event counters being suited for either real time
or retrospective analysis), then there is an alternative approach,
namely to create a PCP archive directly from a source of data without
involving pmcd or a pmda or pmlogger.  We've recently reworked the
"pmimport" services to expose better APIs to support just this style of
use ... see LOGIMPORT(3) and sar2pcp(1) for an example.  I think this
approach is possibly a better semantic match between PCP and a stream of
event records.

 > * relatively static pmns
 >
 >    It would be desirable if PMNS metrics were parametrizable with
 >    strings/numbers, so that a PMDA engine could use it to synthesize
 >    metrics on demand from a large space.  (Example: have a
 >    "kernel-probe" PMNS namespace, parametrized by function name, which
 >    returns statistics of that function's execution.  There are too many
 >    kernel functions, and they vary from host to host enough, so that
 >    enumerating them as a static PMNS table would be impractical.)

This is not so much of a problem.  We've relaxed the PMNS services to
allow PMDAs to dynamically define new metrics on the fly.  And as Nathan
has pointed out, the instance domain provides a dynamic dimension for
the available metric values that may also be useful, e.g. this is how
all of procfs is instantiated.
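
A minimal sketch of that dynamic dimension, assuming the pmdaCache(3)
services and a hypothetical set of traced kernel function names:

    #include <pcp/pmapi.h>
    #include <pcp/impl.h>
    #include <pcp/pmda.h>

    /* Re-populate an instance domain at fetch time with whatever
     * function names are currently being traced. */
    static void
    refresh_functions(pmInDom indom, char **fnames, int nfnames)
    {
        int i;

        /* mark all existing instances stale ... */
        pmdaCacheOp(indom, PMDA_CACHE_INACTIVE);
        /* ... then (re)activate the ones seen this time around */
        for (i = 0; i < nfnames; i++)
            pmdaCacheStore(indom, PMDA_CACHE_ADD, fnames[i], NULL);
    }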

 > * scalar payloads
 >
 >    It seems as though each metric value provided by PMDAs is
 >    necessarily a scalar value, as opposed to some structured type.  For
 >    event tracing, it would be useful to have tuples.  Front-ends could
 >    choose the interesting fields to render.  (Example: tracing NFS
 >    calls, complete with decoded payloads.)
 >

We've tried really hard to make the PCP metadata rich enough (in the
data model and the API services) to enable clients to be data-driven,
based on what performance data happens to be available today from a host
or archive.  This is why the data aggregate (or blob) data type that
Nathan has mentioned is rarely used (although it is fully supported).

If there was a tight coupling between the source of the event data and
the client that interprets the event data, then the PCP data aggregate
could be used to provide a transport and storage encapsulation that is
consistent with the PCP APIs and protocols.  Of course, such a client
would be exposed to all of the word-size, endian and version issues that
plague other binary formats for performance data, e.g. the sar variants
based on AT&T UNIX.

 > * filtering
 >
 >    It would be desirable for the apps fetching metric values to
 >    communicate a filtering predicate associated with them, perhaps as
 >    per pmie rules.  This is to allow the data server daemon to reduce
 >    the amount of data sent to the gui frontends.  Perhaps also it could
 >    use them to inform PMDAs as a form of subscription, and in turn they
 >    could reduce the amount of data flow.

PMDAs are free to do as much or as little work as they choose.  Some are
totally demand-driven, instantiating only the information they are asked
for when they are asked for it.  Others use cacheing strategies to
refresh some or all of the information at each request.  Others maintain
timestamped caches and only refresh when the information is deemed
"stale".  Another class run a refresh thread that is contunally updating
a data cache, and requests are serviced from the cache.

The PMDA behaviour can be modal ... based on client requests, or more
interestingly as Nathan has suggested using the pmStore(3) API to allow
one or more clients to enable/disable collection (think about expensive,
detailed information that you don't want to collect unless some client
*really* wants it).  The values passed into the PMDA via pmStore(3) are
associated with PCP metrics, so they have the full richness of the PCP
data model to encode switches, text strings, blobs, etc.
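
A minimal sketch of the PMDA end of that (the "filter" control metric
and its cluster/item numbers are invented; the callback is wired into
the PMDA's dispatch table before pmdaInit):

    #include <stdlib.h>
    #include <string.h>
    #include <pcp/pmapi.h>
    #include <pcp/impl.h>
    #include <pcp/pmda.h>

    static char *filter;    /* current predicate, NULL if none */

    static int
    example_store(pmResult *result, pmdaExt *pmda)
    {
        int i;

        for (i = 0; i < result->numpmid; i++) {
            pmValueSet *vsp = result->vset[i];

            /* assume cluster 0, item 1 is the "filter" control metric */
            if (pmid_cluster(vsp->pmid) != 0 || pmid_item(vsp->pmid) != 1)
                return PM_ERR_PMID;
            if (vsp->numval != 1 || vsp->valfmt == PM_VAL_INSITU)
                return PM_ERR_CONV;
            free(filter);
            filter = strdup(vsp->vlist[0].value.pval->vbuf);
            /* ... push the new predicate into the collector here ... */
        }
        return 0;
    }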

 > * no web-based frontends
 >
 >    In our usage, it would be desirable to have some mini pcp-gui that
 >    is based on web technologies rather than QT.

There are several examples of web interfaces driven by PCP data ... but
each of these has been developed as a proprietary and specific
application and hence is not included in the PCP open source
distribution.  The PCP APIs provide all the services needed to build
something like this.

 >
 > To what extent could/should PCP be used/extended to cover this space?

I think this suggestion is worth further discussion, but we probably
need some more concrete examples of the sorts of event trace data that
is being considered, and the most likely use cases and patterns for that
data.

Cheers, Ken.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [pcp] suitability of PCP for event tracing
       [not found] <1010363924.405041282969375594.JavaMail.root@mail-au.aconex.com>
@ 2010-08-28  4:24 ` nathans
  0 siblings, 0 replies; 48+ messages in thread
From: nathans @ 2010-08-28  4:24 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap, pcp

Hi Frank & systemtap folks,

(long time systemtap user here, love your work - thanks!)

----- "Frank Ch. Eigler" <fche@redhat.com> wrote:

> 
> We're investigating to what extent the PCP suite may be suitable for
> more general low-level event tracing.  Just from docs / source gazing
> (so please excuse my terminology errors), a few challenges would seem
> to be:
> 
> * poll-based data gathering
> 
>   It seems as though PMDAs are used exclusively in 'polling' mode,
>   meaning that underlying system statistics are periodically queried
>   and summary results stored.  In our context, it would be useful if
>   PMDAs could push event data into the stream as they occur - perhaps
>   hundreds of times a second.

Yes, this is a bit of a square-peg-round-hole situation.  Would be worth
backing up a bit and trying to understand what the aim is here -- is
there particular functionality you're after?  If you're interested only
in storing trace data historically, we'd probably go down a different
route than if you want both live and historical (much more tricky).

Having said that, I can imagine protocol extensions to support a live
push mechanism though - it would be a fascinating research project and
could provide some fairly unique capabilities.

We would need an additional pmcd/client exchange to register initial
interest in receiving tracing information, which would have to be able
to identify from which pmda that information will originate.  We'd also
need an extension to the pmcd/pmda protocol to allow these out-of-band
pmda-driven events to be pushed to pmcd for it to multiplex out to any/
all interested client tools.

Mark, Ken or one of the other PCP guys might be able to envisage a way
to overlay this on the existing protocol, but I think it would take a
protocol rev (v3) to accomplish this.  Which is a fair bit of work, but
would be quite awesome IMHO.  There have definately been occassions on
which I've wanted to see trace data alongside sampled data in a chart
(something like Figure 1 & 2 in http://lwn.net/Articles/299483/) - to
solve this in a generic way (arbitrary sample-based metrics from PCP,
arbitrary trace-data from systemtap) would be quite powerful.

> * relatively static pmns
> 
>   It would be desirable if PMNS metrics were parametrizable with
>   strings/numbers, so that a PMDA engine could use it to synthesize
>   metrics on demand from a large space.  (Example: have a
>   "kernel-probe" PMNS namespace, parametrized by function name, which
>   returns statistics of that function's execution.  There are too
> many
>   kernel functions, and they vary from host to host enough, so that
>   enumerating them as a static PMNS table would be impractical.)

This one's easier - there are PMDAs that do this already, in particular
the MMV agent; the Linux kernel cgroup metrics are also generated on the
fly, with the namespace managed by the PMDA rather than by the static
files that were traditionally used in PCP (src/pmdas/mmv in the tree is a
complete example, src/pmdas/sample/src has a few simple examples too).

The other dimension that can be used there is "instances", which can be
dynamic as well, even for static metric definitions - so in your example
you might have a metric which is "number of times function entered" and
the set of instances might be each function name.

> 
> * scalar payloads
> 
>   It seems as though each metric value provided by PMDAs is
>   necessarily a scalar value, as opposed to some structured type.  For
>   event tracing, it would be useful to have tuples.  Front-ends could
>   choose the interesting fields to render.  (Example: tracing NFS
>   calls, complete with decoded payloads.)

It's not widely known or used, but there is an "aggregate" metric type
which is basically a blob (with an associated length).  You could certainly
take advantage of that (obviously, it requires tools to know what's in the
blob and agree on its format with the PMDA in order to make sense of it).
An example there would be the sample.sysinfo metric in src/pmdas/sample.
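
For illustration, the descriptor for such a metric might look like this
(the PMDA fills in the real pmid):

    #include <pcp/pmapi.h>

    static pmDesc event_blob_desc = {
        .pmid  = 0,                  /* assigned by the PMDA */
        .type  = PM_TYPE_AGGREGATE,  /* opaque blob plus length */
        .indom = PM_INDOM_NULL,      /* singular metric */
        .sem   = PM_SEM_DISCRETE,
        .units = { 0 },              /* a blob carries no dimension */
    };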

> * filtering
> 
>   It would be desirable for the apps fetching metric values to
>   communicate a filtering predicate associated with them, perhaps as
>   per pmie rules.  This is to allow the data server daemon to reduce
>   the amount of data sent to the gui frontends.  Perhaps also it could
>   use them to inform PMDAs as a form of subscription, and in turn they
>   could reduce the amount of data flow.

This is also feasible - there is a pmStore(3) component to the protocol
which allows clients to communicate with the PMDAs.  You could have a
metric which expresses the filtering expression, perhaps as a string,
which client tools could store into and then the PMDA would enact some
different tracing policy.
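
A minimal sketch of the client end of that, with the metric name and
predicate syntax invented - this is essentially what pmstore(1) does for
you from the command line:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pcp/pmapi.h>

    int
    main(void)
    {
        char         *name = "systemtap.filter";   /* hypothetical */
        const char   *expr = "inode == 12345";     /* hypothetical */
        pmID         pmid;
        pmResult     *rp;
        pmValueBlock *vbp;
        size_t       need;
        int          sts;

        if ((sts = pmNewContext(PM_CONTEXT_HOST, "localhost")) < 0 ||
            (sts = pmLookupName(1, &name, &pmid)) < 0) {
            fprintf(stderr, "setup: %s\n", pmErrStr(sts));
            exit(1);
        }

        /* build a one-metric, one-value result holding the string */
        need = PM_VAL_HDR_SIZE + strlen(expr) + 1;
        vbp = (pmValueBlock *)malloc(need);
        vbp->vlen = need;
        vbp->vtype = PM_TYPE_STRING;
        memcpy(vbp->vbuf, expr, strlen(expr) + 1);

        rp = (pmResult *)calloc(1, sizeof(pmResult));
        rp->numpmid = 1;
        rp->vset[0] = (pmValueSet *)malloc(sizeof(pmValueSet));
        rp->vset[0]->pmid = pmid;
        rp->vset[0]->numval = 1;
        rp->vset[0]->valfmt = PM_VAL_SPTR;
        rp->vset[0]->vlist[0].value.pval = vbp;

        if ((sts = pmStore(rp)) < 0)
            fprintf(stderr, "pmStore: %s\n", pmErrStr(sts));
        return sts < 0;
    }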

> * no web-based frontends
> 
>   In our usage, it would be desirable to have some mini pcp-gui that
>   is based on web technologies rather than QT.

I know Mark's spoken of tackling this area in the past ... but not really
an area of interest for myself (my needs covered already).  Could be done,
just needs someone with the itch to scratch at it.

> 
> To what extent could/should PCP be used/extended to cover this space?
> 

Definitely doable - the tricky work would all be in coding the PMDA, if
you want everything "live", and the client/pmcd/pmda protocol extensions.

If you are more interested in doing retrospective analysis, many tracing
tools (like blktrace for i/o, wireshark/argus/... for net tracing, xperf
on win32, etc - although not systemtap afaik?) - have a mechanism for
storing trace data on disk.  There's a scriptable API for taking such data
and producing PCP logs from it - so, that might be another avenue of
interest to you guys, perhaps.  It would be a good place to start with
prototyping too, to get sample- and trace- data together in a PCP archive
and then playback with PCP tools to see what's needed on the client side
to best explore that data.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2010-12-03 11:13 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1780385660.592861283401180077.JavaMail.root@mail-au.aconex.com>
2010-09-02  4:22 ` [pcp] suitability of PCP for event tracing nathans
2010-09-02  4:30   ` Greg Banks
     [not found] <534400126.208681291368854553.JavaMail.root@acxmail-au2.aconex.com>
2010-12-03  9:40 ` nathans
2010-12-03 11:13   ` Ken McDonell
     [not found] <1949991220.207641291354253203.JavaMail.root@acxmail-au2.aconex.com>
2010-12-03  5:32 ` nathans
2010-12-03  6:08   ` Ken McDonell
     [not found] <290445718.141091290639324786.JavaMail.root@acxmail-au2.aconex.com>
2010-11-24 22:58 ` nathans
2010-11-27  5:29   ` Ken McDonell
2010-11-28 19:08     ` Ken McDonell
     [not found] <1565492777.26861289432902163.JavaMail.root@acxmail-au2.aconex.com>
2010-11-10 23:49 ` nathans
2010-11-11  1:46   ` Max Matveev
2010-11-23 20:48   ` Ken McDonell
     [not found] <1362202390.1923851286924784463.JavaMail.root@mail-au.aconex.com>
2010-10-12 23:07 ` nathans
     [not found] <2010549822.1115071284891293369.JavaMail.root@mail-au.aconex.com>
2010-09-19 10:19 ` nathans
     [not found] <1341556404.1064361284677032819.JavaMail.root@mail-au.aconex.com>
2010-09-16 23:18 ` nathans
2010-09-18 14:21   ` Ken McDonell
2010-09-19  9:28     ` Max Matveev
2010-09-19  9:49       ` Nathan Scott
     [not found] <105152664.981101284508372475.JavaMail.root@mail-au.aconex.com>
2010-09-15  3:12 ` Greg Banks
2010-09-15 12:11   ` Frank Ch. Eigler
2010-09-16  0:21     ` Greg Banks
2010-09-16  1:04       ` Frank Ch. Eigler
2010-09-16  2:07         ` Greg Banks
2010-09-16 12:40           ` Ken McDonell
2010-09-16 14:24             ` Frank Ch. Eigler
2010-09-16 15:53               ` Ken McDonell
2010-09-23 22:15                 ` Frank Ch. Eigler
2010-10-11  8:02                   ` Ken McDonell
2010-10-11 12:34                     ` Nathan Scott
2010-10-12 20:37                       ` Ken McDonell
2010-11-10  0:43                     ` Ken McDonell
     [not found] <1740408065.702181283817937875.JavaMail.root@mail-au.aconex.com>
2010-09-07  0:09 ` nathans
     [not found] <1459138113.589721283398633993.JavaMail.root@mail-au.aconex.com>
2010-09-02  3:42 ` nathans
2010-09-02  4:11   ` Greg Banks
     [not found] <1010363924.405041282969375594.JavaMail.root@mail-au.aconex.com>
2010-08-28  4:24 ` nathans
2010-08-27 15:39 Frank Ch. Eigler
2010-08-29 15:55 ` [pcp] " Ken McDonell
2010-09-01 15:05   ` David Smith
2010-09-06 16:39     ` Ken McDonell
     [not found] ` <4C7A7DFE.2040606@internode.on.net>
2010-08-31  3:29   ` Greg Banks
2010-08-31 19:49   ` Frank Ch. Eigler
2010-09-01  6:25     ` Mark Goodwin
2010-09-02  2:05       ` Greg Banks
2010-09-02 19:40         ` Frank Ch. Eigler
2010-09-12 16:43     ` Ken McDonell
2010-09-13  2:21       ` Greg Banks
2010-09-13 13:29       ` Max Matveev
2010-09-13 20:53         ` Ken McDonell
2010-09-13 20:39       ` Frank Ch. Eigler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).