public inbox for systemtap@sourceware.org
* RE: [SCRIPT] NUMA page fault accounting.
@ 2006-03-21 18:58 Stone, Joshua I
  2006-03-21 20:37 ` Jose R. Santos
  2006-03-22 15:57 ` Jose R. Santos
  0 siblings, 2 replies; 4+ messages in thread
From: Stone, Joshua I @ 2006-03-21 18:58 UTC
  To: jrs; +Cc: systemtap

Jose R. Santos wrote:
>         page_faults [pid(), $write_access ? 1 : 0] ++
>         node_faults [pid(), addr_to_node($address)] ++

You could improve scalability of this script by using statistics to
maintain your count, e.g.:

         page_faults [pid(), $write_access ? 1 : 0] <<< 1
         node_faults [pid(), addr_to_node($address)] <<< 1

And then access the values with @count(page_faults[...]).
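
As background on why `<<<` can scale better than `++`: as far as I understand it, SystemTap keeps a statistics aggregate as per-CPU data and only merges it when an extractor such as @count reads it, so concurrent probe handlers do not contend on one shared counter. The sketch below is a plain user-space C model of that idea, not SystemTap's implementation; NCPUS, struct stat_agg, stat_add and stat_count are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 4  /* hypothetical CPU count for this model */

/* One slot per CPU; a handler only ever touches its own CPU's slot. */
struct stat_agg {
    long count[NCPUS];
};

/* Model of `s <<< 1` executed by a probe handler running on `cpu`. */
static void stat_add(struct stat_agg *s, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    s->count[cpu] += 1;
}

/* Model of `@count(s)`: the per-CPU slots are merged only on read. */
static long stat_count(const struct stat_agg *s)
{
    long total = 0;
    for (size_t i = 0; i < NCPUS; i++)
        total += s->count[i];
    return total;
}
```

The write path stays cheap and local, and the cost of merging is paid once, at report time.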

Other than that, this looks good.  It might be nice to start publishing
case studies on the website, so if you have a real problem that you
solved with this, please share!


Josh


* Re: [SCRIPT] NUMA page fault accounting.
  2006-03-21 18:58 [SCRIPT] NUMA page fault accounting Stone, Joshua I
@ 2006-03-21 20:37 ` Jose R. Santos
  2006-03-22 15:57 ` Jose R. Santos
  1 sibling, 0 replies; 4+ messages in thread
From: Jose R. Santos @ 2006-03-21 20:37 UTC
  To: Stone, Joshua I; +Cc: systemtap

Stone, Joshua I wrote:

>Jose R. Santos wrote:
>>         page_faults [pid(), $write_access ? 1 : 0] ++
>>         node_faults [pid(), addr_to_node($address)] ++
>
>You could improve scalability of this script by using statistics to
>maintain your count, e.g.:
>
>         page_faults [pid(), $write_access ? 1 : 0] <<< 1
>         node_faults [pid(), addr_to_node($address)] <<< 1
>
>And then access the values with @count(page_faults[...]).
>  
>
OK, I will play with this and send a revised script.

>Other than that, this looks good.  It might be nice to start publishing
>case studies on the website, so if you have a real problem that you
>solved with this, please share!
>
>
>Josh
>  
>

This script does not solve one particular problem; it is meant to narrow 
down common issues that we have seen on some of our customer workloads.  
I've been thinking of ways to use SystemTap in our performance area, and 
one possibility I'm currently working on is a set of small scripts 
designed to narrow down performance-related problems.  The inspiration 
behind this script is that we have had multiple cases where a customer 
has brought us performance issues with their application on our servers 
that are sometimes hard to narrow down.  NUMA-related issues caused by 
bad compiler optimization, non-optimal code design, or Linux kernel 
issues have appeared more than once.  This is the first of what I hope 
will be many scripts designed with this purpose in mind.

I will also be working on a script called why_idle.stp, which will be 
used to determine why a large system is not able to run at 100% CPU 
capacity.  This is another common problem that we have seen here.

Thanks for the comments

-JRS


* Re: [SCRIPT] NUMA page fault accounting.
  2006-03-21 18:58 [SCRIPT] NUMA page fault accounting Stone, Joshua I
  2006-03-21 20:37 ` Jose R. Santos
@ 2006-03-22 15:57 ` Jose R. Santos
  1 sibling, 0 replies; 4+ messages in thread
From: Jose R. Santos @ 2006-03-22 15:57 UTC
  To: Stone, Joshua I; +Cc: systemtap

Stone, Joshua I wrote:

>Jose R. Santos wrote:
>>         page_faults [pid(), $write_access ? 1 : 0] ++
>>         node_faults [pid(), addr_to_node($address)] ++
>
>You could improve scalability of this script by using statistics to
>maintain your count, e.g.:
>
>         page_faults [pid(), $write_access ? 1 : 0] <<< 1
>         node_faults [pid(), addr_to_node($address)] <<< 1
>
>And then access the values with @count(page_faults[...]).
>
>Other than that, this looks good.  It might be nice to start publishing
>case studies on the website, so if you have a real problem that you
>solved with this, please share!
>
>
>Josh
>  
>
Here is the updated script based on your suggestions.  Unfortunately, the 
machine that I use is an old pre-production box that gives me a lot of 
variability between runs, even when I'm not running the SystemTap script, 
so I can't verify whether the changes make the stream benchmark run faster.

Thanks

-JRS

#! /usr/local/bin/stap -g

global execnames, page_faults, node_faults

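function addr_to_node:long(addr:long)
%{
        /* Embedded C (requires -g): return the first online node whose
         * pfn span contains the page frame of addr. */
        int nid;
        /* unsigned long: a pfn can overflow int on large-memory systems */
        unsigned long pfn = __pa(THIS->addr) >> PAGE_SHIFT;
        for_each_online_node(nid)
                if ( node_start_pfn(nid) <= pfn &&
                        pfn < (node_start_pfn(nid) +
                        NODE_DATA(nid)->node_spanned_pages) )
                {
                        THIS->__retvalue = nid;
                        break;
                }
%}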

probe kernel.function("__handle_mm_fault") {
        execnames[pid()] = execname()
        page_faults [pid(), $write_access ? 1 : 0] <<< 1
        node_faults [pid(), addr_to_node($address)] <<< 1

}

function print_pf () {
        print ("            Execname\t     PID\tRead Faults\tWrite Faults\n")
        print ("====================\t========\t===========\t============\n")
        foreach (pid in execnames) {
                printf ("%20s\t%8d\t%11d\t%12d\t", execnames[pid], pid,
                        @count(page_faults[pid,0]),
                        @count(page_faults[pid,1]))

                foreach ([pid2,node+] in node_faults) {
                        if (pid2 == pid)
                                printf ("Node[%d]=%d\t", node,
                                        @count(node_faults[pid2, node]))
                }
                print ("\n")

        }
}

probe begin {
  print ("Starting pagefault counters \n")
}

probe end {
  print ("Printing counters: \n")
  print_pf ()
  print ("Done\n")
}
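
For readers who want to experiment with the lookup in addr_to_node outside the kernel, here is a minimal user-space sketch of the same range test. It is only a model: struct node_range and pfn_to_node are hypothetical names standing in for node_start_pfn(nid) and NODE_DATA(nid)->node_spanned_pages, and any node layout fed to it is made up. Unlike the script, it returns -1 explicitly when no node matches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-node data. */
struct node_range {
    unsigned long start_pfn;      /* models node_start_pfn(nid) */
    unsigned long spanned_pages;  /* models NODE_DATA(nid)->node_spanned_pages */
};

/* Return the index of the first node whose span contains pfn, or -1. */
static int pfn_to_node(const struct node_range *nodes, size_t n_nodes,
                       unsigned long pfn)
{
    assert(nodes != NULL || n_nodes == 0);
    for (size_t nid = 0; nid < n_nodes; nid++) {
        unsigned long start = nodes[nid].start_pfn;
        /* pfn - start < spanned is the same test as
         * pfn < start + spanned, without the risk of overflow. */
        if (pfn >= start && pfn - start < nodes[nid].spanned_pages)
            return (int)nid;
    }
    return -1;
}
```

With two made-up nodes of 1048576 pages each, pfn 1048575 maps to node 0, pfn 1048576 maps to node 1, and pfn 2097152 matches no node.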


* [SCRIPT] NUMA page fault accounting.
@ 2006-03-21 16:44 Jose R. Santos
  0 siblings, 0 replies; 4+ messages in thread
From: Jose R. Santos @ 2006-03-21 16:44 UTC
  To: systemtap

Hi folks,

My team has needed to analyze page faults on a per-NUMA-node basis, and 
I've come up with this simple SystemTap script that does this for me.  
I only have access to a PPC64 NUMA box, but I took care to use generic, 
architecture-independent NUMA code, and I think it should work for any 
kernel that has NUMA enabled.  Here is a short sample of output from 
running the stream benchmark with a 4.5GB array on an 8GB system.

Starting pagefault counters
Printing counters:
            Execname         PID    Read Faults    Write Faults
====================    ========    ===========    ============
              stream       22786             33          668186    Node[0]=521176  Node[1]=147043
                stpd       22803              1               3    Node[0]=3       Node[1]=1
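
A quick consistency check on the sample above: for the stream process the per-node counts add up to the read plus write faults (521176 + 147043 = 668219 = 33 + 668186), and roughly 78% of its faults land on node 0. A small C sketch of that arithmetic follows; node_share_percent is a hypothetical helper, not part of the script.

```c
#include <assert.h>

/* Share of faults attributed to one node, as a whole percentage
 * (integer division, so the result is rounded down). */
static long node_share_percent(long node_faults, long total_faults)
{
    assert(total_faults > 0);
    assert(node_faults >= 0 && node_faults <= total_faults);
    return node_faults * 100 / total_faults;
}
```

Calling it with the stream numbers (521176 of 668219) gives the node 0 share. A strongly skewed split like this is the kind of signal the script is meant to surface.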

and the script:

#! stap -g

global execnames, page_faults, node_faults

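function addr_to_node:long(addr:long)
%{
        /* Embedded C (requires -g): return the first online node whose
         * pfn span contains the page frame of addr. */
        int nid;
        /* unsigned long: a pfn can overflow int on large-memory systems */
        unsigned long pfn = __pa(THIS->addr) >> PAGE_SHIFT;
        for_each_online_node(nid)
                if ( node_start_pfn(nid) <= pfn &&
                        pfn < (node_start_pfn(nid) +
                        NODE_DATA(nid)->node_spanned_pages) )
                {
                        THIS->__retvalue = nid;
                        break;
                }
%}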

probe kernel.function("__handle_mm_fault") {
        execnames[pid()] = execname()
        page_faults [pid(), $write_access ? 1 : 0] ++
        node_faults [pid(), addr_to_node($address)] ++

}

function print_pf () {
        print ("            Execname\t     PID\tRead Faults\tWrite Faults\n")
        print ("====================\t========\t===========\t============\n")
        foreach (pid in execnames) {
                printf ("%20s\t%8d\t%11d\t%12d\t", execnames[pid], pid,
                        page_faults[pid,0], page_faults[pid,1])

                foreach ([pid2,node+] in node_faults) {
                        if (pid2 == pid)
                                printf ("Node[%d]=%d\t", node,
                                        node_faults[pid2, node])
                }
                print ("\n")

        }
}

probe begin {
  print ("Starting pagefault counters \n")
}

probe end {
  print ("Printing counters: \n")
  print_pf ()
  print ("Done\n")
}


Comments welcome - Enjoy

-JRS

