* Linux VFS cache hit rate script @ 2011-04-20 19:04 Jake Maul 2011-04-20 20:09 ` William Cohen 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-20 19:04 UTC (permalink / raw) To: systemtap

[-- Attachment #1: Type: text/plain, Size: 438 bytes --]

Greetings all,

I have recently put a script up on the War Stories page, and it was requested of me to also post it to the list, along with a .meta file. They're both attached.

http://sourceware.org/systemtap/wiki/WSCacheHitRate

Let me know what you think! It's my first SystemTap script, so I'm sure there are improvements that could be made. I'm by no means a kernel hacker, or even a C dev... just a lowly sysadmin. :)

Thanks,
Jake

[-- Attachment #2: cache-hit-rate.stp --]
[-- Type: application/octet-stream, Size: 780 bytes --]

global cache_bytes, disk_bytes, counter

probe vfs.read.return {
  if (bytes_read > 0) {
    if (devname == "N/A") { /* N/A means cache hit... right? */
      cache_bytes += bytes_read
    } else {
      disk_bytes += bytes_read
    }
  }
}

# print VFS hits and misses every 5 seconds, plus the hit rate in %
probe timer.s(5) {
  if (counter % 15 == 0) {
    printf ("\n%18s %18s %10s %10s\n", "Cache Reads (KB)", "Disk Reads (KB)", "Miss Rate", "Hit Rate")
  }
  counter++
  hitrate = 10000 * cache_bytes / (cache_bytes + disk_bytes)
  missrate = 10000 * disk_bytes / (cache_bytes + disk_bytes)
  printf ("%18d %18d %6d.%02d%% %6d.%02d%%\n", cache_bytes/1024, disk_bytes/1024, missrate/100, missrate%100, hitrate/100, hitrate%100)
  cache_bytes = 0
  disk_bytes = 0
}

[-- Attachment #3: cache-hit-rate.meta --]
[-- Type: application/octet-stream, Size: 536 bytes --]

title: VFS Cache Hit Rate
name: cache-hit-rate.stp
version: 1.0
author: Jake Maul
keywords: io profiling vfs cache
subsystem: io
status: production
exit: user-controlled
output: timed interval (vmstat-like)
scope: system-wide
description: The cache-hit-rate.stp script checks the bytes_read and devname fields of vfs.read.return, and uses them to determine how much data was read directly from disk vs how much was read from the Linux VFS cache.
test_check: stap -p4 cache-hit-rate.stp
test_installcheck: stap cache-hit-rate.stp -c "sleep 6"

^ permalink raw reply	[flat|nested] 17+ messages in thread
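[Editorial note: the fixed-point percentage arithmetic in the timer probe above can be sanity-checked in isolation. Here is a small Python sketch of the same math — the function name is illustrative only. One caveat it makes visible: the posted script does not guard the case where an interval sees no reads at all, in which case the division by (cache_bytes + disk_bytes) would fail; the sketch guards that case explicitly.]

```python
def hit_miss_rates(cache_bytes, disk_bytes):
    """Hit/miss rates scaled by 10000, as in the stap script, so that
    r // 100 and r % 100 give the whole-percent part and two decimals."""
    total = cache_bytes + disk_bytes
    if total == 0:  # the posted script does not guard this case;
                    # stap would raise a division-by-zero runtime error
        return "0.00%", "0.00%"
    hit = 10000 * cache_bytes // total
    miss = 10000 * disk_bytes // total
    as_pct = lambda r: "%d.%02d%%" % (r // 100, r % 100)
    return as_pct(hit), as_pct(miss)

print(hit_miss_rates(900 * 1024, 100 * 1024))  # -> ('90.00%', '10.00%')
```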
* Re: Linux VFS cache hit rate script 2011-04-20 19:04 Linux VFS cache hit rate script Jake Maul @ 2011-04-20 20:09 ` William Cohen 2011-04-20 20:46 ` Jake Maul 2011-04-21 20:17 ` Jake Maul 0 siblings, 2 replies; 17+ messages in thread From: William Cohen @ 2011-04-20 20:09 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap

On 04/20/2011 03:04 PM, Jake Maul wrote:
> Greetings all,

Hi Jake,

Thanks for taking the time to submit the example.

> I have recently put a script up on the War Stories page, and it was
> requested of me to also post it to the list, along with a .meta file.
> They're both attached.
>
> http://sourceware.org/systemtap/wiki/WSCacheHitRate

Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.

> Let me know what you think! It's my first SystemTap script, so I'm
> sure there are improvements that could be made. I'm by no means a
> kernel hacker, or even a C dev... just a lowly sysadmin. :)

Include a short header like the following at the beginning of the script:

#!/usr/bin/env stap
# cache-hit.stp
# Copyright (C) 2011 xxx <xxx@yyy.zzz>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.

Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.

The following line sounds unsure; need to check that that is really the case:

if (devname=="N/A") { /* N/A means cache hit... right? */

It would be good to have a better explanation for the "N/A" check.

-Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:09 ` William Cohen @ 2011-04-20 20:46 ` Jake Maul 2011-04-20 20:57 ` Josh Stone 2011-04-21 20:17 ` Jake Maul 1 sibling, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-20 20:46 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote:
> On 04/20/2011 03:04 PM, Jake Maul wrote:
>> Greetings all,
>
> Hi Jake,
>
> Thanks for taking the time to submit the example.
>
>> I have recently put a script up on the War Stories page, and it was
>> requested of me to also post it to the list, along with a .meta file.
>> They're both attached.
>>
>> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.

For some reason on that particular server, I get this error sometimes:

ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9

I freely admit that was voodoo administration on my part: I found something somewhere that indicated you can make that variable larger like that, and tried it out. It seems to help, but I haven't the slightest idea why. If anyone can shed some light on that, I'd be much obliged. :)

On a different server, it works fine without that. They're both RHEL5, but the working one is 64-bit and the one that sometimes errors is 32-bit PAE. Not sure if that matters. They do have completely different datasets and workloads.

>> Let me know what you think! It's my first SystemTap script, so I'm
>> sure there are improvements that could be made. I'm by no means a
>> kernel hacker, or even a C dev... just a lowly sysadmin. :)
>
> Include a short header like the following at the beginning of the script:
>
> #!/usr/bin/env stap
> # cache-hit.stp
> # Copyright (C) 2011 xxx <xxx@yyy.zzz>
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License version 2 as
> # published by the Free Software Foundation.

Good call, I'll add something.

> Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.

Also good.

> The following line sounds unsure; need to check that that is really the case:
>
> if (devname=="N/A") { /* N/A means cache hit... right? */
>
> It would be good to have a better explanation for the "N/A" check.

This is from the beginner's guide, actually:

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html

It bothers me too, because I can't find any documentation that says explicitly when you would get an 'N/A' devname. I don't know if it's a SystemTap thing or a kernel thing, but since I don't have experience working on either, I'm fairly lost.

Thanks for the feedback!

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:46 ` Jake Maul @ 2011-04-20 20:57 ` Josh Stone 2011-04-20 22:25 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-20 20:57 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/20/2011 01:45 PM, Jake Maul wrote: > On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote: >> Why the need for -DMAXMAPENTRIES=100000 on the second example on >> the wiki page? There doesn't seem to be any arrays in the script. > > For some reason on that particular server, I get this error sometimes: > > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:769:9 The error location looks bogus, which we should investigate, but this is probably from the array used in vfs.stp:__find_bdevname(). I'm not sure how you could be encountering so many unique "dev" though. >> The following line sounds unsure, need to check that that is really the case: >> >> if (devname=="N/A") { /* N/A means cache hit... right? */ >> >> It would be good to have a better explanation for the "N/A" check. > > This is from the beginner's guide, actually: > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html > > It bothers me too, because I can't find any documentation that says > explicitly when you would get an 'N/A' devname. I don't know if it's a > SystemTap thing or a kernel thing, but being that I don't have > experience working on either, I'm fairly lost. That string comes from dev.stp:bdevname(bdev), whenever the parameter is zero. I'm not sure about the particulars though, whether bdev==0 really means it's a cache hit... Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:57 ` Josh Stone @ 2011-04-20 22:25 ` Jake Maul 0 siblings, 0 replies; 17+ messages in thread From: Jake Maul @ 2011-04-20 22:25 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap

>> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at
>> /usr/share/systemtap/tapset/vfs.stp:769:9
>
> The error location looks bogus, which we should investigate, but this is
> probably from the array used in vfs.stp:__find_bdevname(). I'm not sure
> how you could be encountering so many unique "dev" though.

I made a quick test script: the only two devnames I run into are dm-0 and N/A on that system, and by far N/A is more common. However, this test script fails pretty quickly (less than 5 seconds), with this error:

# stap devnames.stp | sort | uniq -c
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed. Try again with another '--vp 00001' option.
     41 dm-0
  55180 N/A

>> This is from the beginner's guide, actually:
>>
>> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html
>>
>> It bothers me too, because I can't find any documentation that says
>> explicitly when you would get an 'N/A' devname. I don't know if it's a
>> SystemTap thing or a kernel thing, but being that I don't have
>> experience working on either, I'm fairly lost.
>
> That string comes from dev.stp:bdevname(bdev), whenever the parameter is
> zero. I'm not sure about the particulars though, whether bdev==0 really
> means it's a cache hit...
>
> Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:09 ` William Cohen 2011-04-20 20:46 ` Jake Maul @ 2011-04-21 20:17 ` Jake Maul 2011-04-21 22:31 ` William Cohen 1 sibling, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-21 20:17 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

New version of the script uploaded to War Stories, incorporating your suggested changes. I have no problem with the GPLv2, so I took your wording exactly. :)

http://sourceware.org/systemtap/wiki/WSCacheHitRate

Of course there's still the concern about devname "N/A", but I clarified the comment so as to make it obvious where I got that assumption from. At least then, if it does turn out to be incorrect, it will be obvious that something other than just this one script needs to be fixed.

Jake

On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote:
> On 04/20/2011 03:04 PM, Jake Maul wrote:
>> Greetings all,
>
> Hi Jake,
>
> Thanks for taking the time to submit the example.
>
>> I have recently put a script up on the War Stories page, and it was
>> requested of me to also post it to the list, along with a .meta file.
>> They're both attached.
>>
>> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.
>
>> Let me know what you think! It's my first SystemTap script, so I'm
>> sure there are improvements that could be made. I'm by no means a
>> kernel hacker, or even a C dev... just a lowly sysadmin. :)
>
> Include a short header like the following at the beginning of the script:
>
> #!/usr/bin/env stap
> # cache-hit.stp
> # Copyright (C) 2011 xxx <xxx@yyy.zzz>
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License version 2 as
> # published by the Free Software Foundation.
>
> Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.
>
> The following line sounds unsure; need to check that that is really the case:
>
> if (devname=="N/A") { /* N/A means cache hit... right? */
>
> It would be good to have a better explanation for the "N/A" check.
>
> -Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 20:17 ` Jake Maul @ 2011-04-21 22:31 ` William Cohen 2011-04-21 23:02 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: William Cohen @ 2011-04-21 22:31 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap

On 04/21/2011 04:17 PM, Jake Maul wrote:
> New version of the script uploaded to War Stories, incorporating your
> suggested changes. I have no problem with the GPLv2, so I took your
> wording exactly. :)
>
> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Of course there's still the concern about devname "N/A", but I
> clarified the comment so as to make it obvious where I got that
> assumption from. At least then, if it does turn out to be incorrect, it
> will be obvious that something other than just this one script needs
> to be fixed.
>
> Jake

Hi Jake,

I traced through vfs.stp and found where the "N/A" is coming from:

function bdevname:string(bdev:long)
{
    if (bdev == 0)
        return "N/A"
    ...
}

Whether "N/A" always indicates things are cached I don't know. Need to explore further.

Found a possible reason for the overflow of the associative array; there is an associative array used for caching names in vfs.stp:

global __devnames
function __find_bdevname:string(dev:long, bdev:long)
{
    if (dev in __devnames)
        return __devnames[dev]
    else
        return __devnames[dev] = bdevname(bdev)
}

The only way that is going to fill up is if there are a lot of different values for dev.

-Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 22:31 ` William Cohen @ 2011-04-21 23:02 ` Jake Maul 2011-04-21 23:46 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-21 23:02 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

On Thu, Apr 21, 2011 at 3:31 PM, William Cohen <wcohen@redhat.com> wrote:
> Found a possible reason for the overflow of the associative array; there is an associative array used for caching names in vfs.stp:
>
> global __devnames
> function __find_bdevname:string(dev:long, bdev:long)
> {
>     if (dev in __devnames)
>         return __devnames[dev]
>     else
>         return __devnames[dev] = bdevname(bdev)
> }
>
> The only way that is going to fill up is if there are a lot of different values for dev.
>
> -Will

I wrote a quick script to test this:

# cat devnames.stp
probe vfs.read.return { printf("\tdev: %d\tdevname: %s\n", dev, devname) }

Output on the affected system:

# stap -DSTP_NO_OVERLOAD devnames.stp -c "sleep 60" | sort | uniq -c
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9
WARNING: Number of errors: 1, skipped probes: 2035
Pass 5: run failed. Try again with another '--vp 00001' option.
      2 dev: 0         devname: N/A
 762956 dev: 16        devname: N/A
    520 dev: 18        devname: N/A
   4183 dev: 22        devname: N/A
      4 dev: 23        devname: N/A
   1288 dev: 265289728 devname: dm-0
      1 dev: 27        devname: N/A
    872 dev: 3         devname: N/A
   3094 dev: 5         devname: N/A
 380875 dev: 6         devname: N/A

Both this script and the cache-hit-rate.stp script generally die with that error in 30 seconds or less. The data above covers around 30 seconds or so.

That bizarrely long dev number might be relevant... or maybe that's just a normal quirk of LVM?

The number of entries when it crashes varies: 10 above, but it has also stopped with 7, 8, or 9 devs shown. I can't see why the limit would vary like that, nor why there would be a limit so low.

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 23:02 ` Jake Maul @ 2011-04-21 23:46 ` Josh Stone 2011-04-22 19:17 ` Jake Maul 2011-04-25 22:53 ` Josh Stone 0 siblings, 2 replies; 17+ messages in thread From: Josh Stone @ 2011-04-21 23:46 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap

On 04/21/2011 04:01 PM, Jake Maul wrote:
>      2 dev: 0         devname: N/A
> 762956 dev: 16        devname: N/A
>    520 dev: 18        devname: N/A
>   4183 dev: 22        devname: N/A
>      4 dev: 23        devname: N/A
>   1288 dev: 265289728 devname: dm-0
>      1 dev: 27        devname: N/A
>    872 dev: 3         devname: N/A
>   3094 dev: 5         devname: N/A
> 380875 dev: 6         devname: N/A

Oy - it's possible that the "dev" is meaningless (or means something different) when the name is returned "N/A" (bdev==0), so we may be caching noise as the index. Or in any case, caching "N/A" in so many different forms is not helpful. Would you be willing to try this modified function?

function __find_bdevname:string(dev:long, bdev:long)
{
    if (bdev == 0)
        return "N/A"
    if (dev in __devnames)
        return __devnames[dev]
    return __devnames[dev] = bdevname(bdev)
}

> Both this script and the cache-hit-rate.stp script generally die with
> that error in 30 seconds or less. The data above is around 30 seconds
> or so.
>
> That bizarrely long dev number might be relevant... or maybe that's
> just a normal quirk of LVM?

It's not so bizarre - kernel device numbers are (MAJOR<<20)|MINOR, so this turns out to be device 253,0. That also means all those low dev numbers have MAJOR==0, which I think supports my theory that they are not normal.

Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
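[Editorial note: the dev_t unpacking Josh describes can be checked with a couple of lines of Python. This is purely illustrative, using the (MAJOR<<20)|MINOR layout stated above; the function name is hypothetical.]

```python
def split_dev(dev):
    """Split a kernel dev_t into (major, minor): the minor number lives in
    the low 20 bits, the major number in the bits above them."""
    return dev >> 20, dev & ((1 << 20) - 1)

print(split_dev(265289728))  # -> (253, 0), i.e. device 253,0 (dm-0)
print(split_dev(16))         # -> (0, 16): MAJOR == 0, no real block device
```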
* Re: Linux VFS cache hit rate script 2011-04-21 23:46 ` Josh Stone @ 2011-04-22 19:17 ` Jake Maul 2011-04-22 20:28 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-22 19:17 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap

On Thu, Apr 21, 2011 at 4:45 PM, Josh Stone <jistone@redhat.com> wrote:
> Oy - it's possible that the "dev" is meaningless (or means something
> different) when the name is returned "N/A" (bdev==0), so we may be
> caching noise as the index. Or in any case, caching "N/A" in so many
> different forms is not helpful. Would you be willing to try this
> modified function?
>
> function __find_bdevname:string(dev:long, bdev:long)
> {
>     if (bdev == 0)
>         return "N/A"
>     if (dev in __devnames)
>         return __devnames[dev]
>     return __devnames[dev] = bdevname(bdev)
> }

No problem... seems to have no effect, though:

 339869 dev: 16        devname: N/A
    345 dev: 18        devname: N/A
   4569 dev: 22        devname: N/A
      5 dev: 23        devname: N/A
    798 dev: 265289728 devname: dm-0
    640 dev: 3         devname: N/A
   2594 dev: 5         devname: N/A
 250861 dev: 6         devname: N/A

I just commented out the existing version in /usr/share/systemtap/tapset/vfs.stp, and put your function in instead. Do I need to do anything more than that, like rebuild something? (not a systemtap dev)

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 19:17 ` Jake Maul @ 2011-04-22 20:28 ` Josh Stone 2011-04-22 21:47 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-22 20:28 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/22/2011 12:16 PM, Jake Maul wrote: > No problem... seems to have no effect though: > > 339869 dev: 16 devname: N/A > 345 dev: 18 devname: N/A > 4569 dev: 22 devname: N/A > 5 dev: 23 devname: N/A > 798 dev: 265289728 devname: dm-0 > 640 dev: 3 devname: N/A > 2594 dev: 5 devname: N/A > 250861 dev: 6 devname: N/A > > I just commented out the existing version in > /usr/share/systemtap/tapsets/vfs.stp, and put your function in > instead. Do I need to do anything more than that, like rebuild > something? (not a systemtap dev) That should do it. I should have been more clear - the effect I intended was just to relieve the need for MAXMAPENTRIES. The "__devnames" map should now only contain the real device names, and not waste any time/space on "N/A" entries. Your script's output would not be affected though. Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 20:28 ` Josh Stone @ 2011-04-22 21:47 ` Jake Maul 2011-04-25 21:54 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-22 21:47 UTC (permalink / raw) To: Josh Stone; +Cc: systemtap

On Fri, Apr 22, 2011 at 1:28 PM, Josh Stone <jistone@redhat.com> wrote:
> That should do it. I should have been more clear - the effect I
> intended was just to relieve the need for MAXMAPENTRIES. The
> "__devnames" map should now only contain the real device names, and not
> waste any time/space on "N/A" entries. Your script's output would not
> be affected though.
>
> Josh

Ah. Well, in that case, sorry, still no good. I cut out that bit of the output, but:

# time stap -DSTP_NO_OVERLOAD devnames.stp -c "sleep 35" | sort | uniq -c
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:780:9
WARNING: Number of errors: 1, skipped probes: 2126
Pass 5: run failed. Try again with another '--vp 00001' option.

Here's an interesting note though: the line numbers are different. The only difference between the runs is which of the 2 functions is commented out.

With stock function:
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9

With your function:
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:780:9

Here's that snippet in vfs.stp (pasting here because the line numbers are going to be off now, as compared to the stock file):

761 probe vfs.read = kernel.function("vfs_read")
762 {
763         file = $file
764         pos = $pos
765         buf = $buf
766         bytes_to_read = $count
767         dev = __file_dev($file)
768         devname = __find_bdevname(dev, __file_bdev($file))
769         ino = __file_ino($file)
770
771         name = "vfs.read"
772         argstr = sprintf("%d, %d, %p", $count, $pos, $buf)
773 }
774
775 probe vfs.read.return = kernel.function("vfs_read").return
776 {
777         name = "vfs.read"
778         retstr = sprintf("%d", $return)
779
780         file = $file
781         pos = $pos
782         buf = $buf
783         bytes_to_read = $count
784         dev = __file_dev($file)
785         devname = __find_bdevname(dev, __file_bdev($file))
786         ino = __file_ino($file)
787
788         ret = $return
789         bytes_read = $return > 0 ? $return : 0
790         error = $return < 0 ? $return : 0
791         error_str = error ? errno_str(error) : ""
792 }

So as you can see, one errors in vfs.read, the other in vfs.read.return. In both cases something to do with $file gets assigned to a normal variable.

Side note: I wonder if the "name" variables are messed up... vfs.read* and vfs.write* don't follow quite the same pattern:

probe vfs.read = kernel.function("vfs_read")
        name = "vfs.read"
probe vfs.read.return = kernel.function("vfs_read").return
        name = "vfs.read"
probe vfs.readv = kernel.function("vfs_readv")
        name = "vfs.read"
probe vfs.readv.return = kernel.function("vfs_readv").return
        name = "vfs.readv"
probe vfs.write = kernel.function("vfs_write")
        name = "vfs.write"
probe vfs.write.return = kernel.function("vfs_write").return
        name = "vfs.write"
probe vfs.writev = kernel.function("vfs_writev")
        name = "vfs.writev"
probe vfs.writev.return = kernel.function("vfs_writev").return
        name = "vfs.writev"

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 21:47 ` Jake Maul @ 2011-04-25 21:54 ` Josh Stone 2011-04-26 2:11 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-25 21:54 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap On 04/22/2011 02:47 PM, Jake Maul wrote: > Here's an interesting note though: the line numbers are different. The > only difference between the runs is which of the 2 functions is > commented out. > > With stock function: > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:769:9 > > With your function: > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:780:9 Was the "stock" one before you added my new version of the function? I think that would easily explain why the line numbers got shifted. Note that both are talking about $file at column 9, which in your snippet only makes sense for the "<TAB>file = $file". Anyway, I just realized that there is a way that $file could actually be involved with a map overflow (which I thought was wrong before). When accessing parameters in a .return probe, we first use an entry probe to save the value into a map, then read that back in the actual .return probe. So that saving map could be the thing that is overflowing. I notice that you also have a bunch of skipped probes, which is probably due to not enough kretprobe slots available. (-t can tell for sure.) I think this may be causing a bug that sometimes the saved $file is not being cleared, because the part that would use it is getting skipped. Your 2035 skipped seems not enough to accumulate past the default 10,000 MAXMAPENTRIES, but I don't have other ideas. I would suggest tweaking KRETACTIVE until there's no skipping, as described here: http://sourceware.org/systemtap/wiki/TipSkippedProbes I'm not sure there's a way we could more cleanly fail this case of skipped probe vs. stale saved $var. 
On kernels since 2.6.25 we actually use a different saving mechanism though, which wouldn't have the same mapping problem. > Side note: I wonder if the "name" variables are messed up... vfs.read* > and vfs.write* don't follow quite the same pattern: > > probe vfs.read = kernel.function("vfs_read") > name = "vfs.read" > probe vfs.read.return = kernel.function("vfs_read").return > name = "vfs.read" > probe vfs.readv = kernel.function("vfs_readv") > name = "vfs.read" > probe vfs.readv.return = kernel.function("vfs_readv").return > name = "vfs.readv" Yeah, looks like a typo - the third should probably be readv. Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-25 21:54 ` Josh Stone @ 2011-04-26 2:11 ` Jake Maul 0 siblings, 0 replies; 17+ messages in thread From: Jake Maul @ 2011-04-26 2:11 UTC (permalink / raw) To: Josh Stone; +Cc: systemtap On Mon, Apr 25, 2011 at 2:54 PM, Josh Stone <jistone@redhat.com> wrote: > On 04/22/2011 02:47 PM, Jake Maul wrote: >> Here's an interesting note though: the line numbers are different. The >> only difference between the runs is which of the 2 functions is >> commented out. >> >> With stock function: >> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at >> /usr/share/systemtap/tapset/vfs.stp:769:9 >> >> With your function: >> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at >> /usr/share/systemtap/tapset/vfs.stp:780:9 > > Was the "stock" one before you added my new version of the function? I > think that would easily explain why the line numbers got shifted. Note > that both are talking about $file at column 9, which in your snippet > only makes sense for the "<TAB>file = $file". After: the only difference is which was commented out. > Anyway, I just realized that there is a way that $file could actually be > involved with a map overflow (which I thought was wrong before). When > accessing parameters in a .return probe, we first use an entry probe to > save the value into a map, then read that back in the actual .return > probe. So that saving map could be the thing that is overflowing. > > I notice that you also have a bunch of skipped probes, which is probably > due to not enough kretprobe slots available. (-t can tell for sure.) I > think this may be causing a bug that sometimes the saved $file is not > being cleared, because the part that would use it is getting skipped. > Your 2035 skipped seems not enough to accumulate past the default 10,000 > MAXMAPENTRIES, but I don't have other ideas. 
I would suggest tweaking > KRETACTIVE until there's no skipping, as described here: > http://sourceware.org/systemtap/wiki/TipSkippedProbes Will give this a try tomorrow. > I'm not sure there's a way we could more cleanly fail this case of > skipped probe vs. stale saved $var. On kernels since 2.6.25 we actually > use a different saving mechanism though, which wouldn't have the same > mapping problem. If it won't happen on more modern systems, I don't see a major problem with ignoring it- it'll resolve itself. :) >> Side note: I wonder if the "name" variables are messed up... vfs.read* >> and vfs.write* don't follow quite the same pattern: >> >> probe vfs.read = kernel.function("vfs_read") >> name = "vfs.read" >> probe vfs.read.return = kernel.function("vfs_read").return >> name = "vfs.read" >> probe vfs.readv = kernel.function("vfs_readv") >> name = "vfs.read" >> probe vfs.readv.return = kernel.function("vfs_readv").return >> name = "vfs.readv" > > Yeah, looks like a typo - the third should probably be readv. > > > Josh > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 23:46 ` Josh Stone 2011-04-22 19:17 ` Jake Maul @ 2011-04-25 22:53 ` Josh Stone 2011-04-26 1:59 ` Jake Maul 1 sibling, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-25 22:53 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/21/2011 04:45 PM, Josh Stone wrote: > On 04/21/2011 04:01 PM, Jake Maul wrote: >> 2 dev: 0 devname: N/A >> 762956 dev: 16 devname: N/A >> 520 dev: 18 devname: N/A >> 4183 dev: 22 devname: N/A >> 4 dev: 23 devname: N/A >> 1288 dev: 265289728 devname: dm-0 >> 1 dev: 27 devname: N/A >> 872 dev: 3 devname: N/A >> 3094 dev: 5 devname: N/A >> 380875 dev: 6 devname: N/A [...] >> That bizarrely long dev number might be relevant... or maybe that's >> just a normal quirk of LVM? > > It's not so bizarre - kernel device numbers are (MAJOR<<20)|MINOR, so > this turns out to be device 253,0. That also means all those low dev > numbers have MAJOR==0, which I think supports my theory that they are > not normal. I was thinking about what could be causing such high N/A counts, even on my nearly-idle laptop. I'm pretty sure now that these MAJOR==0 are actually "emulated" filesystems, like sysfs, procfs, etc. So I don't think the N/A has anything to do with caching - it's just that there's literally no block device associated with the request. Then I think the high counts here are because stap is getting into a feedback loop as it reads your printfs over debugfs. A request comes in, your script printfs it; then stapio reads that printf and copies it to be read again by a tty emulator or your |sort|uniq pipe -- creating even more vfs_read events, in a never ending chain. So you should probably at least filter stap's own events out of your results with a condition like: if (pid() != stp_pid()) { printf... } Some of the other probes in the vfs tapset deal with the page cache directly, which I think you'll need to get true vfs caching rates. 
Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-25 22:53 ` Josh Stone @ 2011-04-26 1:59 ` Jake Maul 2011-04-26 2:44 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-26 1:59 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap On Mon, Apr 25, 2011 at 3:53 PM, Josh Stone <jistone@redhat.com> wrote: > I was thinking about what could be causing such high N/A counts, even on > my nearly-idle laptop. I'm pretty sure now that these MAJOR==0 are > actually "emulated" filesystems, like sysfs, procfs, etc. So I don't > think the N/A has anything to do with caching - it's just that there's > literally no block device associated with the request. Hmm, that would make a lot of sense. Can anyone verify this? Or, how is it verifiable? > Then I think the high counts here are because stap is getting into a > feedback loop as it reads your printfs over debugfs. A request comes > in, your script printfs it; then stapio reads that printf and copies it > to be read again by a tty emulator or your |sort|uniq pipe -- creating > even more vfs_read events, in a never ending chain. So you should > probably at least filter stap's own events out of your results with a > condition like: if (pid() != stp_pid()) { printf... } Heh... that would make some sense, indeed, at least for one of the dev: numbers in my output. > Some of the other probes in the vfs tapset deal with the page cache > directly, which I think you'll need to get true vfs caching rates. Which probes are you thinking about? I see probes on when things get added or removed from cache, but nothing jumps out at me as the ideal way to see what was a hit or miss. Jake ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script
  2011-04-26  1:59 ` Jake Maul
@ 2011-04-26  2:44 ` Josh Stone
  0 siblings, 0 replies; 17+ messages in thread
From: Josh Stone @ 2011-04-26 2:44 UTC (permalink / raw)
To: Jake Maul; +Cc: William Cohen, systemtap

On 04/25/2011 06:59 PM, Jake Maul wrote:
> On Mon, Apr 25, 2011 at 3:53 PM, Josh Stone <jistone@redhat.com> wrote:
>
>> I was thinking about what could be causing such high N/A counts, even on
>> my nearly-idle laptop.  I'm pretty sure now that these MAJOR==0 are
>> actually "emulated" filesystems, like sysfs, procfs, etc.  So I don't
>> think the N/A has anything to do with caching - it's just that there's
>> literally no block device associated with the request.
>
> Hmm, that would make a lot of sense. Can anyone verify this? Or, how
> is it verifiable?

I realized this when I did "lsof -c stapio", and I saw that the debugfs
handles were on device 0,7.  Then I did a systemwide lsof, and I could
only see MAJOR==0 on procfs files, pipes, anon, etc.

Plus it didn't make sense to me that the device parameters would have
any idea whether things were cached.  Even though you're looking in a
return probe, the examination of the parameters is no different than in
an entry probe.

>> Some of the other probes in the vfs tapset deal with the page cache
>> directly, which I think you'll need to get true vfs caching rates.
>
> Which probes are you thinking about? I see probes on when things get
> added or removed from cache, but nothing jumps out at me as the ideal
> way to see what was a hit or miss.

It's less straightforward, but I think it can be done with a combo of
probes.
Something like this (untested):

    global pages, active
    global total_read, total_paged

    # mark in-flight reads from everything except stap itself
    probe vfs.read {
        if (pid() != stp_pid())
            active[tid()] = 1
    }

    # count pages added to the page cache while a read is in flight
    probe vfs.add_to_page_cache {
        if (active[tid()])
            pages[tid()] += nrpages
    }

    probe vfs.read.return {
        if (active[tid()]) {
            total_read[pid(), tid(), execname()] <<< bytes_read
            total_paged[pid(), tid(), execname()] <<< pages[tid()]
            delete pages[tid()]
            delete active[tid()]
        }
    }

    probe end {
        foreach ([p,t,e] in total_read)
            printf("%5d %5d %16s read %d bytes, %d pages added\n",
                   p, t, e, @sum(total_read[p,t,e]), @sum(total_paged[p,t,e]))
    }

Problem is, I don't think you can tell this way how much of the new
pages went toward this read, since you don't know how it was aligned.
But in aggregate, it's probably pretty close.

Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
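The bytes-read vs. pages-added aggregates that Josh's sketch collects can be turned into an approximate hit rate in post-processing. A minimal illustration, assuming 4 KiB pages and treating every page added during a read as a miss for that read (the same alignment caveat Josh raises applies; the function name is hypothetical):

```python
PAGE_SIZE = 4096  # assumed page size; real value comes from getconf PAGESIZE

def approx_hit_rate(bytes_read, pages_added):
    """Estimate a cache hit rate for one thread's reads: the fraction
    of bytes that did not require pulling a new page into the page
    cache.  Approximate, since page additions may not align with the
    read that triggered them."""
    if bytes_read == 0:
        return 0.0
    missed = min(pages_added * PAGE_SIZE, bytes_read)
    return 1.0 - missed / bytes_read

print(approx_hit_rate(1048576, 0))    # fully cached 1 MiB read -> 1.0
print(approx_hit_rate(1048576, 256))  # 256 new pages = 1 MiB -> 0.0
```

In aggregate over many reads the alignment error should mostly wash out, which matches Josh's "probably pretty close" expectation.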
end of thread, other threads:[~2011-04-26  2:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-20 19:04 Linux VFS cache hit rate script Jake Maul
2011-04-20 20:09 ` William Cohen
2011-04-20 20:46 ` Jake Maul
2011-04-20 20:57 ` Josh Stone
2011-04-20 22:25 ` Jake Maul
2011-04-21 20:17 ` Jake Maul
2011-04-21 22:31 ` William Cohen
2011-04-21 23:02 ` Jake Maul
2011-04-21 23:46 ` Josh Stone
2011-04-22 19:17 ` Jake Maul
2011-04-22 20:28 ` Josh Stone
2011-04-22 21:47 ` Jake Maul
2011-04-25 21:54 ` Josh Stone
2011-04-26  2:11 ` Jake Maul
2011-04-25 22:53 ` Josh Stone
2011-04-26  1:59 ` Jake Maul
2011-04-26  2:44 ` Josh Stone
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).