* Linux VFS cache hit rate script @ 2011-04-20 19:04 Jake Maul 2011-04-20 20:09 ` William Cohen 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-20 19:04 UTC (permalink / raw) To: systemtap

[-- Attachment #1: Type: text/plain, Size: 438 bytes --]

Greetings all,

I have recently put a script up on the War Stories page, and it was requested of me to also post it to the list, along with a .meta file. They're both attached.

http://sourceware.org/systemtap/wiki/WSCacheHitRate

Let me know what you think! It's my first SystemTap script, so I'm sure there are improvements that could be made. I'm by no means a kernel hacker, or even a C dev... just a lowly sysadmin. :)

Thanks,
Jake

[-- Attachment #2: cache-hit-rate.stp --]
[-- Type: application/octet-stream, Size: 780 bytes --]

global cache_bytes, disk_bytes, counter

probe vfs.read.return {
  if (bytes_read > 0) {
    if (devname == "N/A") { /* N/A means cache hit... right? */
      cache_bytes += bytes_read
    } else {
      disk_bytes += bytes_read
    }
  }
}

# print VFS hits and misses every 5 seconds, plus the hit rate in %
probe timer.s(5) {
  if (counter % 15 == 0) {
    printf ("\n%18s %18s %10s %10s\n", "Cache Reads (KB)", "Disk Reads (KB)", "Miss Rate", "Hit Rate")
  }
  counter++
  hitrate = 10000 * cache_bytes / (cache_bytes + disk_bytes)
  missrate = 10000 * disk_bytes / (cache_bytes + disk_bytes)
  printf ("%18d %18d %6d.%02d%% %6d.%02d%%\n", cache_bytes/1024, disk_bytes/1024, missrate/100, missrate%100, hitrate/100, hitrate%100)
  cache_bytes = 0
  disk_bytes = 0
}

[-- Attachment #3: cache-hit-rate.meta --]
[-- Type: application/octet-stream, Size: 536 bytes --]

title: VFS Cache Hit Rate
name: cache-hit-rate.stp
version: 1.0
author: Jake Maul
keywords: io profiling vfs cache
subsystem: io
status: production
exit: user-controlled
output: timed interval (vmstat-like)
scope: system-wide
description: The cache-hit-rate.stp script checks the bytes_read and devname fields of vfs.read.return, and uses them to determine how much data was read directly from disk vs how much was read from the Linux VFS cache.
test_check: stap -p4 cache-hit-rate.stp
test_installcheck: stap cache-hit-rate.stp -c "sleep 6"

^ permalink raw reply	[flat|nested] 17+ messages in thread
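[Editorial note: the fixed-point percentage arithmetic in the timer probe above can be sanity-checked in isolation. Here is a small Python sketch of the same math — the function name is illustrative only. One caveat it makes visible: the posted script does not guard the case where an interval sees no reads at all, in which case the division by (cache_bytes + disk_bytes) would fail; the sketch guards that case explicitly.]

```python
def hit_miss_rates(cache_bytes, disk_bytes):
    """Hit/miss rates scaled by 10000, as in the stap script, so that
    r // 100 and r % 100 give the whole-percent part and two decimals."""
    total = cache_bytes + disk_bytes
    if total == 0:  # the posted script does not guard this case;
                    # stap would raise a division-by-zero runtime error
        return "0.00%", "0.00%"
    hit = 10000 * cache_bytes // total
    miss = 10000 * disk_bytes // total
    as_pct = lambda r: "%d.%02d%%" % (r // 100, r % 100)
    return as_pct(hit), as_pct(miss)

print(hit_miss_rates(900 * 1024, 100 * 1024))  # -> ('90.00%', '10.00%')
```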
* Re: Linux VFS cache hit rate script 2011-04-20 19:04 Linux VFS cache hit rate script Jake Maul @ 2011-04-20 20:09 ` William Cohen 2011-04-20 20:46 ` Jake Maul 2011-04-21 20:17 ` Jake Maul 0 siblings, 2 replies; 17+ messages in thread From: William Cohen @ 2011-04-20 20:09 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap

On 04/20/2011 03:04 PM, Jake Maul wrote:
> Greetings all,

Hi Jake,

Thanks for taking the time to submit the example.

> I have recently put a script up on the War Stories page, and it was
> requested of me to also post it to the list, along with a .meta file.
> They're both attached.
>
> http://sourceware.org/systemtap/wiki/WSCacheHitRate

Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.

> Let me know what you think! It's my first SystemTap script, so I'm
> sure there are improvements that could be made. I'm by no means a
> kernel hacker, or even a C dev... just a lowly sysadmin. :)

Include a short header like the following at the beginning of the script:

#!/usr/bin/env stap
# cache-hit.stp
# Copyright (C) 2011 xxx <xxx@yyy.zzz>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.

Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.

The following line sounds unsure; need to check that that is really the case:

if (devname=="N/A") { /* N/A means cache hit... right? */

It would be good to have a better explanation for the "N/A" check.

-Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:09 ` William Cohen @ 2011-04-20 20:46 ` Jake Maul 2011-04-20 20:57 ` Josh Stone 2011-04-21 20:17 ` Jake Maul 1 sibling, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-20 20:46 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote:
> On 04/20/2011 03:04 PM, Jake Maul wrote:
>> Greetings all,
>
> Hi Jake,
>
> Thanks for taking the time to submit the example.
>
>> I have recently put a script up on the War Stories page, and it was
>> requested of me to also post it to the list, along with a .meta file.
>> They're both attached.
>>
>> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.

For some reason on that particular server, I get this error sometimes:

ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9

I freely admit that was voodoo administration on my part: I found something somewhere that indicated you can make that variable larger like that, and tried it out. It seems to help, but I haven't the slightest idea why. If anyone can shed some light on that, I'd be much obliged. :)

On a different server, it works fine without that. They're both RHEL5, but the working one is 64-bit and the one that sometimes errors is 32-bit PAE. Not sure if that matters. They do have completely different datasets and workloads.

>> Let me know what you think! It's my first SystemTap script, so I'm
>> sure there are improvements that could be made. I'm by no means a
>> kernel hacker, or even a C dev... just a lowly sysadmin. :)
>
> Include a short header like the following at the beginning of the script:
>
> #!/usr/bin/env stap
> # cache-hit.stp
> # Copyright (C) 2011 xxx <xxx@yyy.zzz>
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License version 2 as
> # published by the Free Software Foundation.

Good call, I'll add something.

> Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.

Also good.

> The following line sounds unsure; need to check that that is really the case:
>
> if (devname=="N/A") { /* N/A means cache hit... right? */
>
> It would be good to have a better explanation for the "N/A" check.

This is from the beginner's guide, actually:

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html

It bothers me too, because I can't find any documentation that says explicitly when you would get an 'N/A' devname. I don't know if it's a SystemTap thing or a kernel thing, but since I don't have experience working on either, I'm fairly lost.

Thanks for the feedback!

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:46 ` Jake Maul @ 2011-04-20 20:57 ` Josh Stone 2011-04-20 22:25 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-20 20:57 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/20/2011 01:45 PM, Jake Maul wrote: > On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote: >> Why the need for -DMAXMAPENTRIES=100000 on the second example on >> the wiki page? There doesn't seem to be any arrays in the script. > > For some reason on that particular server, I get this error sometimes: > > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:769:9 The error location looks bogus, which we should investigate, but this is probably from the array used in vfs.stp:__find_bdevname(). I'm not sure how you could be encountering so many unique "dev" though. >> The following line sounds unsure, need to check that that is really the case: >> >> if (devname=="N/A") { /* N/A means cache hit... right? */ >> >> It would be good to have a better explanation for the "N/A" check. > > This is from the beginner's guide, actually: > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html > > It bothers me too, because I can't find any documentation that says > explicitly when you would get an 'N/A' devname. I don't know if it's a > SystemTap thing or a kernel thing, but being that I don't have > experience working on either, I'm fairly lost. That string comes from dev.stp:bdevname(bdev), whenever the parameter is zero. I'm not sure about the particulars though, whether bdev==0 really means it's a cache hit... Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:57 ` Josh Stone @ 2011-04-20 22:25 ` Jake Maul 0 siblings, 0 replies; 17+ messages in thread From: Jake Maul @ 2011-04-20 22:25 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap

>> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at
>> /usr/share/systemtap/tapset/vfs.stp:769:9
>
> The error location looks bogus, which we should investigate, but this is
> probably from the array used in vfs.stp:__find_bdevname(). I'm not sure
> how you could be encountering so many unique "dev" though.

I made a quick test script: the only two devnames I run into are dm-0 and N/A on that system, and by far N/A is more common. However, this test script fails pretty quickly (less than 5 seconds), with this error:

# stap devnames.stp | sort | uniq -c
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed. Try again with another '--vp 00001' option.
     41 dm-0
  55180 N/A

>> This is from the beginner's guide, actually:
>>
>> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/SystemTap_Beginners_Guide/mainsect-disk.html
>>
>> It bothers me too, because I can't find any documentation that says
>> explicitly when you would get an 'N/A' devname. I don't know if it's a
>> SystemTap thing or a kernel thing, but being that I don't have
>> experience working on either, I'm fairly lost.
>
> That string comes from dev.stp:bdevname(bdev), whenever the parameter is
> zero. I'm not sure about the particulars though, whether bdev==0 really
> means it's a cache hit...
>
> Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-20 20:09 ` William Cohen 2011-04-20 20:46 ` Jake Maul @ 2011-04-21 20:17 ` Jake Maul 2011-04-21 22:31 ` William Cohen 1 sibling, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-21 20:17 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

New version of the script uploaded to War Stories, incorporating your suggested changes. I have no problem with the GPLv2, so I took your wording exactly. :)

http://sourceware.org/systemtap/wiki/WSCacheHitRate

Of course there's still the concern about devname "N/A", but I clarified the comment so as to make it obvious where I got that assumption from. At least then, if it does turn out to be incorrect, it will be obvious that something other than just this one script needs to be fixed.

Jake

On Wed, Apr 20, 2011 at 1:08 PM, William Cohen <wcohen@redhat.com> wrote:
> On 04/20/2011 03:04 PM, Jake Maul wrote:
>> Greetings all,
>
> Hi Jake,
>
> Thanks for taking the time to submit the example.
>
>> I have recently put a script up on the War Stories page, and it was
>> requested of me to also post it to the list, along with a .meta file.
>> They're both attached.
>>
>> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Why the need for -DMAXMAPENTRIES=100000 on the second example on the wiki page? There don't seem to be any arrays in the script.
>
>> Let me know what you think! It's my first SystemTap script, so I'm
>> sure there are improvements that could be made. I'm by no means a
>> kernel hacker, or even a C dev... just a lowly sysadmin. :)
>
> Include a short header like the following at the beginning of the script:
>
> #!/usr/bin/env stap
> # cache-hit.stp
> # Copyright (C) 2011 xxx <xxx@yyy.zzz>
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License version 2 as
> # published by the Free Software Foundation.
>
> Should avoid long line lengths in the printf statements. Wrapped lines don't look good in manuals.
>
> The following line sounds unsure; need to check that that is really the case:
>
> if (devname=="N/A") { /* N/A means cache hit... right? */
>
> It would be good to have a better explanation for the "N/A" check.
>
> -Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 20:17 ` Jake Maul @ 2011-04-21 22:31 ` William Cohen 2011-04-21 23:02 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: William Cohen @ 2011-04-21 22:31 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap

On 04/21/2011 04:17 PM, Jake Maul wrote:
> New version of the script uploaded to War Stories, incorporating your
> suggested changes. I have no problem with the GPLv2, so I took your
> wording exactly. :)
>
> http://sourceware.org/systemtap/wiki/WSCacheHitRate
>
> Of course there's still the concern about devname "N/A", but I
> clarified the comment so as to make it obvious where I got that
> assumption from. At least then, if it does turn out to be incorrect, it
> will be obvious that something other than just this one script needs
> to be fixed.
>
> Jake

Hi Jake,

I traced through vfs.stp and found where the "N/A" is coming from:

function bdevname:string(bdev:long)
{
    if (bdev == 0)
        return "N/A"
    ...
}

Whether "N/A" always indicates things are cached I don't know. Need to explore further.

Found a possible reason for the overflow of the associative array; there is an associative array used for caching names in vfs.stp:

global __devnames
function __find_bdevname:string(dev:long, bdev:long)
{
    if (dev in __devnames)
        return __devnames[dev]
    else
        return __devnames[dev] = bdevname(bdev)
}

The only way that is going to fill up is if there are a lot of different values for dev.

-Will

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 22:31 ` William Cohen @ 2011-04-21 23:02 ` Jake Maul 2011-04-21 23:46 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-21 23:02 UTC (permalink / raw) To: William Cohen; +Cc: systemtap

On Thu, Apr 21, 2011 at 3:31 PM, William Cohen <wcohen@redhat.com> wrote:
> Found a possible reason for the overflow of the associative array; there is an associative array used for caching names in vfs.stp:
>
> global __devnames
> function __find_bdevname:string(dev:long, bdev:long)
> {
>     if (dev in __devnames)
>         return __devnames[dev]
>     else
>         return __devnames[dev] = bdevname(bdev)
> }
>
> The only way that is going to fill up is if there are a lot of different values for dev.
>
> -Will

I wrote a quick script to test this:

# cat devnames.stp
probe vfs.read.return { printf("\tdev: %d\tdevname: %s\n", dev, devname) }

Output on the affected system:

# stap -DSTP_NO_OVERLOAD devnames.stp -c "sleep 60" | sort | uniq -c
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9
WARNING: Number of errors: 1, skipped probes: 2035
Pass 5: run failed. Try again with another '--vp 00001' option.
      2 dev: 0         devname: N/A
 762956 dev: 16        devname: N/A
    520 dev: 18        devname: N/A
   4183 dev: 22        devname: N/A
      4 dev: 23        devname: N/A
   1288 dev: 265289728 devname: dm-0
      1 dev: 27        devname: N/A
    872 dev: 3         devname: N/A
   3094 dev: 5         devname: N/A
 380875 dev: 6         devname: N/A

Both this script and the cache-hit-rate.stp script generally die with that error in 30 seconds or less. The data above covers around 30 seconds or so.

That bizarrely long dev number might be relevant... or maybe that's just a normal quirk of LVM?

The number of entries when it crashes varies: 10 above, but it has also stopped with 7, 8, or 9 devs shown. I can't see why the limit would vary like that, nor why there would be a limit so low.

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 23:02 ` Jake Maul @ 2011-04-21 23:46 ` Josh Stone 2011-04-22 19:17 ` Jake Maul 2011-04-25 22:53 ` Josh Stone 0 siblings, 2 replies; 17+ messages in thread From: Josh Stone @ 2011-04-21 23:46 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap

On 04/21/2011 04:01 PM, Jake Maul wrote:
>      2 dev: 0         devname: N/A
> 762956 dev: 16        devname: N/A
>    520 dev: 18        devname: N/A
>   4183 dev: 22        devname: N/A
>      4 dev: 23        devname: N/A
>   1288 dev: 265289728 devname: dm-0
>      1 dev: 27        devname: N/A
>    872 dev: 3         devname: N/A
>   3094 dev: 5         devname: N/A
> 380875 dev: 6         devname: N/A

Oy - it's possible that the "dev" is meaningless (or means something different) when the name is returned "N/A" (bdev==0), so we may be caching noise as the index. Or in any case, caching "N/A" in so many different forms is not helpful. Would you be willing to try this modified function?

function __find_bdevname:string(dev:long, bdev:long)
{
    if (bdev == 0)
        return "N/A"
    if (dev in __devnames)
        return __devnames[dev]
    return __devnames[dev] = bdevname(bdev)
}

> Both this script and the cache-hit-rate.stp script generally die with
> that error in 30 seconds or less. The data above is around 30 seconds
> or so.
>
> That bizarrely long dev number might be relevant... or maybe that's
> just a normal quirk of LVM?

It's not so bizarre - kernel device numbers are (MAJOR<<20)|MINOR, so this turns out to be device 253,0. That also means all those low dev numbers have MAJOR==0, which I think supports my theory that they are not normal.

Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
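[Editorial note: the dev_t unpacking Josh describes can be checked with a couple of lines of Python. This is purely illustrative, using the (MAJOR<<20)|MINOR layout stated above; the function name is hypothetical.]

```python
def split_dev(dev):
    """Split a kernel dev_t into (major, minor): the minor number lives in
    the low 20 bits, the major number in the bits above them."""
    return dev >> 20, dev & ((1 << 20) - 1)

print(split_dev(265289728))  # -> (253, 0), i.e. device 253,0 (dm-0)
print(split_dev(16))         # -> (0, 16): MAJOR == 0, no real block device
```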
* Re: Linux VFS cache hit rate script 2011-04-21 23:46 ` Josh Stone @ 2011-04-22 19:17 ` Jake Maul 2011-04-22 20:28 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-22 19:17 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap

On Thu, Apr 21, 2011 at 4:45 PM, Josh Stone <jistone@redhat.com> wrote:
> Oy - it's possible that the "dev" is meaningless (or means something
> different) when the name is returned "N/A" (bdev==0), so we may be
> caching noise as the index. Or in any case, caching "N/A" in so many
> different forms is not helpful. Would you be willing to try this
> modified function?
>
> function __find_bdevname:string(dev:long, bdev:long)
> {
>     if (bdev == 0)
>         return "N/A"
>     if (dev in __devnames)
>         return __devnames[dev]
>     return __devnames[dev] = bdevname(bdev)
> }

No problem... seems to have no effect, though:

 339869 dev: 16        devname: N/A
    345 dev: 18        devname: N/A
   4569 dev: 22        devname: N/A
      5 dev: 23        devname: N/A
    798 dev: 265289728 devname: dm-0
    640 dev: 3         devname: N/A
   2594 dev: 5         devname: N/A
 250861 dev: 6         devname: N/A

I just commented out the existing version in /usr/share/systemtap/tapset/vfs.stp, and put your function in instead. Do I need to do anything more than that, like rebuild something? (not a systemtap dev)

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 19:17 ` Jake Maul @ 2011-04-22 20:28 ` Josh Stone 2011-04-22 21:47 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-22 20:28 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/22/2011 12:16 PM, Jake Maul wrote: > No problem... seems to have no effect though: > > 339869 dev: 16 devname: N/A > 345 dev: 18 devname: N/A > 4569 dev: 22 devname: N/A > 5 dev: 23 devname: N/A > 798 dev: 265289728 devname: dm-0 > 640 dev: 3 devname: N/A > 2594 dev: 5 devname: N/A > 250861 dev: 6 devname: N/A > > I just commented out the existing version in > /usr/share/systemtap/tapsets/vfs.stp, and put your function in > instead. Do I need to do anything more than that, like rebuild > something? (not a systemtap dev) That should do it. I should have been more clear - the effect I intended was just to relieve the need for MAXMAPENTRIES. The "__devnames" map should now only contain the real device names, and not waste any time/space on "N/A" entries. Your script's output would not be affected though. Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 20:28 ` Josh Stone @ 2011-04-22 21:47 ` Jake Maul 2011-04-25 21:54 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-22 21:47 UTC (permalink / raw) To: Josh Stone; +Cc: systemtap

On Fri, Apr 22, 2011 at 1:28 PM, Josh Stone <jistone@redhat.com> wrote:
> That should do it. I should have been more clear - the effect I
> intended was just to relieve the need for MAXMAPENTRIES. The
> "__devnames" map should now only contain the real device names, and not
> waste any time/space on "N/A" entries. Your script's output would not
> be affected though.
>
> Josh

Ah. Well, in that case, sorry, still no good. I cut out that bit of the output, but:

# time stap -DSTP_NO_OVERLOAD devnames.stp -c "sleep 35" | sort | uniq -c
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:780:9
WARNING: Number of errors: 1, skipped probes: 2126
Pass 5: run failed. Try again with another '--vp 00001' option.

Here's an interesting note though: the line numbers are different. The only difference between the runs is which of the 2 functions is commented out.

With stock function:
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:769:9

With your function:
ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at /usr/share/systemtap/tapset/vfs.stp:780:9

Here's that snippet in vfs.stp (pasting here because the line numbers are going to be off now, as compared to the stock file):

761 probe vfs.read = kernel.function("vfs_read")
762 {
763         file = $file
764         pos = $pos
765         buf = $buf
766         bytes_to_read = $count
767         dev = __file_dev($file)
768         devname = __find_bdevname(dev, __file_bdev($file))
769         ino = __file_ino($file)
770
771         name = "vfs.read"
772         argstr = sprintf("%d, %d, %p", $count, $pos, $buf)
773 }
774
775 probe vfs.read.return = kernel.function("vfs_read").return
776 {
777         name = "vfs.read"
778         retstr = sprintf("%d", $return)
779
780         file = $file
781         pos = $pos
782         buf = $buf
783         bytes_to_read = $count
784         dev = __file_dev($file)
785         devname = __find_bdevname(dev, __file_bdev($file))
786         ino = __file_ino($file)
787
788         ret = $return
789         bytes_read = $return > 0 ? $return : 0
790         error = $return < 0 ? $return : 0
791         error_str = error ? errno_str(error) : ""
792 }

So as you can see, one errors in vfs.read, the other in vfs.read.return. In both cases something to do with $file gets assigned to a normal variable.

Side note: I wonder if the "name" variables are messed up... vfs.read* and vfs.write* don't follow quite the same pattern:

probe vfs.read = kernel.function("vfs_read")
        name = "vfs.read"
probe vfs.read.return = kernel.function("vfs_read").return
        name = "vfs.read"
probe vfs.readv = kernel.function("vfs_readv")
        name = "vfs.read"
probe vfs.readv.return = kernel.function("vfs_readv").return
        name = "vfs.readv"
probe vfs.write = kernel.function("vfs_write")
        name = "vfs.write"
probe vfs.write.return = kernel.function("vfs_write").return
        name = "vfs.write"
probe vfs.writev = kernel.function("vfs_writev")
        name = "vfs.writev"
probe vfs.writev.return = kernel.function("vfs_writev").return
        name = "vfs.writev"

Jake

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-22 21:47 ` Jake Maul @ 2011-04-25 21:54 ` Josh Stone 2011-04-26 2:11 ` Jake Maul 0 siblings, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-25 21:54 UTC (permalink / raw) To: Jake Maul; +Cc: systemtap On 04/22/2011 02:47 PM, Jake Maul wrote: > Here's an interesting note though: the line numbers are different. The > only difference between the runs is which of the 2 functions is > commented out. > > With stock function: > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:769:9 > > With your function: > ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at > /usr/share/systemtap/tapset/vfs.stp:780:9 Was the "stock" one before you added my new version of the function? I think that would easily explain why the line numbers got shifted. Note that both are talking about $file at column 9, which in your snippet only makes sense for the "<TAB>file = $file". Anyway, I just realized that there is a way that $file could actually be involved with a map overflow (which I thought was wrong before). When accessing parameters in a .return probe, we first use an entry probe to save the value into a map, then read that back in the actual .return probe. So that saving map could be the thing that is overflowing. I notice that you also have a bunch of skipped probes, which is probably due to not enough kretprobe slots available. (-t can tell for sure.) I think this may be causing a bug that sometimes the saved $file is not being cleared, because the part that would use it is getting skipped. Your 2035 skipped seems not enough to accumulate past the default 10,000 MAXMAPENTRIES, but I don't have other ideas. I would suggest tweaking KRETACTIVE until there's no skipping, as described here: http://sourceware.org/systemtap/wiki/TipSkippedProbes I'm not sure there's a way we could more cleanly fail this case of skipped probe vs. stale saved $var. 
On kernels since 2.6.25 we actually use a different saving mechanism though, which wouldn't have the same mapping problem. > Side note: I wonder if the "name" variables are messed up... vfs.read* > and vfs.write* don't follow quite the same pattern: > > probe vfs.read = kernel.function("vfs_read") > name = "vfs.read" > probe vfs.read.return = kernel.function("vfs_read").return > name = "vfs.read" > probe vfs.readv = kernel.function("vfs_readv") > name = "vfs.read" > probe vfs.readv.return = kernel.function("vfs_readv").return > name = "vfs.readv" Yeah, looks like a typo - the third should probably be readv. Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-25 21:54 ` Josh Stone @ 2011-04-26 2:11 ` Jake Maul 0 siblings, 0 replies; 17+ messages in thread From: Jake Maul @ 2011-04-26 2:11 UTC (permalink / raw) To: Josh Stone; +Cc: systemtap On Mon, Apr 25, 2011 at 2:54 PM, Josh Stone <jistone@redhat.com> wrote: > On 04/22/2011 02:47 PM, Jake Maul wrote: >> Here's an interesting note though: the line numbers are different. The >> only difference between the runs is which of the 2 functions is >> commented out. >> >> With stock function: >> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at >> /usr/share/systemtap/tapset/vfs.stp:769:9 >> >> With your function: >> ERROR: Array overflow, check MAXMAPENTRIES near identifier '$file' at >> /usr/share/systemtap/tapset/vfs.stp:780:9 > > Was the "stock" one before you added my new version of the function? I > think that would easily explain why the line numbers got shifted. Note > that both are talking about $file at column 9, which in your snippet > only makes sense for the "<TAB>file = $file". After: the only difference is which was commented out. > Anyway, I just realized that there is a way that $file could actually be > involved with a map overflow (which I thought was wrong before). When > accessing parameters in a .return probe, we first use an entry probe to > save the value into a map, then read that back in the actual .return > probe. So that saving map could be the thing that is overflowing. > > I notice that you also have a bunch of skipped probes, which is probably > due to not enough kretprobe slots available. (-t can tell for sure.) I > think this may be causing a bug that sometimes the saved $file is not > being cleared, because the part that would use it is getting skipped. > Your 2035 skipped seems not enough to accumulate past the default 10,000 > MAXMAPENTRIES, but I don't have other ideas. 
I would suggest tweaking > KRETACTIVE until there's no skipping, as described here: > http://sourceware.org/systemtap/wiki/TipSkippedProbes Will give this a try tomorrow. > I'm not sure there's a way we could more cleanly fail this case of > skipped probe vs. stale saved $var. On kernels since 2.6.25 we actually > use a different saving mechanism though, which wouldn't have the same > mapping problem. If it won't happen on more modern systems, I don't see a major problem with ignoring it- it'll resolve itself. :) >> Side note: I wonder if the "name" variables are messed up... vfs.read* >> and vfs.write* don't follow quite the same pattern: >> >> probe vfs.read = kernel.function("vfs_read") >> name = "vfs.read" >> probe vfs.read.return = kernel.function("vfs_read").return >> name = "vfs.read" >> probe vfs.readv = kernel.function("vfs_readv") >> name = "vfs.read" >> probe vfs.readv.return = kernel.function("vfs_readv").return >> name = "vfs.readv" > > Yeah, looks like a typo - the third should probably be readv. > > > Josh > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-21 23:46 ` Josh Stone 2011-04-22 19:17 ` Jake Maul @ 2011-04-25 22:53 ` Josh Stone 2011-04-26 1:59 ` Jake Maul 1 sibling, 1 reply; 17+ messages in thread From: Josh Stone @ 2011-04-25 22:53 UTC (permalink / raw) To: Jake Maul; +Cc: William Cohen, systemtap On 04/21/2011 04:45 PM, Josh Stone wrote: > On 04/21/2011 04:01 PM, Jake Maul wrote: >> 2 dev: 0 devname: N/A >> 762956 dev: 16 devname: N/A >> 520 dev: 18 devname: N/A >> 4183 dev: 22 devname: N/A >> 4 dev: 23 devname: N/A >> 1288 dev: 265289728 devname: dm-0 >> 1 dev: 27 devname: N/A >> 872 dev: 3 devname: N/A >> 3094 dev: 5 devname: N/A >> 380875 dev: 6 devname: N/A [...] >> That bizarrely long dev number might be relevant... or maybe that's >> just a normal quirk of LVM? > > It's not so bizarre - kernel device numbers are (MAJOR<<20)|MINOR, so > this turns out to be device 253,0. That also means all those low dev > numbers have MAJOR==0, which I think supports my theory that they are > not normal. I was thinking about what could be causing such high N/A counts, even on my nearly-idle laptop. I'm pretty sure now that these MAJOR==0 are actually "emulated" filesystems, like sysfs, procfs, etc. So I don't think the N/A has anything to do with caching - it's just that there's literally no block device associated with the request. Then I think the high counts here are because stap is getting into a feedback loop as it reads your printfs over debugfs. A request comes in, your script printfs it; then stapio reads that printf and copies it to be read again by a tty emulator or your |sort|uniq pipe -- creating even more vfs_read events, in a never ending chain. So you should probably at least filter stap's own events out of your results with a condition like: if (pid() != stp_pid()) { printf... } Some of the other probes in the vfs tapset deal with the page cache directly, which I think you'll need to get true vfs caching rates. 
Josh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script 2011-04-25 22:53 ` Josh Stone @ 2011-04-26 1:59 ` Jake Maul 2011-04-26 2:44 ` Josh Stone 0 siblings, 1 reply; 17+ messages in thread From: Jake Maul @ 2011-04-26 1:59 UTC (permalink / raw) To: Josh Stone; +Cc: William Cohen, systemtap On Mon, Apr 25, 2011 at 3:53 PM, Josh Stone <jistone@redhat.com> wrote: > I was thinking about what could be causing such high N/A counts, even on > my nearly-idle laptop. I'm pretty sure now that these MAJOR==0 are > actually "emulated" filesystems, like sysfs, procfs, etc. So I don't > think the N/A has anything to do with caching - it's just that there's > literally no block device associated with the request. Hmm, that would make a lot of sense. Can anyone verify this? Or, how is it verifiable? > Then I think the high counts here are because stap is getting into a > feedback loop as it reads your printfs over debugfs. A request comes > in, your script printfs it; then stapio reads that printf and copies it > to be read again by a tty emulator or your |sort|uniq pipe -- creating > even more vfs_read events, in a never ending chain. So you should > probably at least filter stap's own events out of your results with a > condition like: if (pid() != stp_pid()) { printf... } Heh... that would make some sense, indeed, at least for one of the dev: numbers in my output. > Some of the other probes in the vfs tapset deal with the page cache > directly, which I think you'll need to get true vfs caching rates. Which probes are you thinking about? I see probes on when things get added or removed from cache, but nothing jumps out at me as the ideal way to see what was a hit or miss. Jake ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Linux VFS cache hit rate script
  2011-04-26  1:59 ` Jake Maul
@ 2011-04-26  2:44 ` Josh Stone
  0 siblings, 0 replies; 17+ messages in thread
From: Josh Stone @ 2011-04-26 2:44 UTC (permalink / raw)
To: Jake Maul; +Cc: William Cohen, systemtap

On 04/25/2011 06:59 PM, Jake Maul wrote:
> On Mon, Apr 25, 2011 at 3:53 PM, Josh Stone <jistone@redhat.com> wrote:
>
>> I was thinking about what could be causing such high N/A counts, even on
>> my nearly-idle laptop.  I'm pretty sure now that these MAJOR==0 are
>> actually "emulated" filesystems, like sysfs, procfs, etc.  So I don't
>> think the N/A has anything to do with caching - it's just that there's
>> literally no block device associated with the request.
>
> Hmm, that would make a lot of sense. Can anyone verify this? Or, how
> is it verifiable?

I realized this when I did "lsof -c stapio", and I saw that the debugfs
handles were on device 0,7.  Then I did a systemwide lsof, and I could
only see MAJOR==0 on procfs files, pipes, anon, etc.

Plus it didn't make sense to me that the device parameters would have
any idea whether things were cached.  Even though you're looking in a
return probe, the examination of the parameters is no different than in
an entry probe.

>> Some of the other probes in the vfs tapset deal with the page cache
>> directly, which I think you'll need to get true vfs caching rates.
>
> Which probes are you thinking about? I see probes on when things get
> added or removed from cache, but nothing jumps out at me as the ideal
> way to see what was a hit or miss.

It's less straightforward, but I think it can be done with a combo of
probes.
Something like this (untested):

    global pages, active
    global total_read, total_paged

    # mark in-flight reads from everything except stap itself
    probe vfs.read {
        if (pid() != stp_pid())
            active[tid()] = 1
    }

    # count pages added to the page cache while a read is in flight
    probe vfs.add_to_page_cache {
        if (active[tid()])
            pages[tid()] += nrpages
    }

    probe vfs.read.return {
        if (active[tid()]) {
            total_read[pid(), tid(), execname()] <<< bytes_read
            total_paged[pid(), tid(), execname()] <<< pages[tid()]
            delete pages[tid()]
            delete active[tid()]
        }
    }

    probe end {
        foreach ([p,t,e] in total_read)
            printf("%5d %5d %16s read %d bytes, %d pages added\n",
                   p, t, e, @sum(total_read[p,t,e]), @sum(total_paged[p,t,e]))
    }

Problem is, I don't think you can tell this way how much of the new
pages went toward this read, since you don't know how it was aligned.
But in aggregate, it's probably pretty close.

Josh

^ permalink raw reply	[flat|nested] 17+ messages in thread
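The bytes-read vs. pages-added aggregates that Josh's sketch collects can be turned into an approximate hit rate in post-processing. A minimal illustration, assuming 4 KiB pages and treating every page added during a read as a miss for that read (the same alignment caveat Josh raises applies; the function name is hypothetical):

```python
PAGE_SIZE = 4096  # assumed page size; real value comes from getconf PAGESIZE

def approx_hit_rate(bytes_read, pages_added):
    """Estimate a cache hit rate for one thread's reads: the fraction
    of bytes that did not require pulling a new page into the page
    cache.  Approximate, since page additions may not align with the
    read that triggered them."""
    if bytes_read == 0:
        return 0.0
    missed = min(pages_added * PAGE_SIZE, bytes_read)
    return 1.0 - missed / bytes_read

print(approx_hit_rate(1048576, 0))    # fully cached 1 MiB read -> 1.0
print(approx_hit_rate(1048576, 256))  # 256 new pages = 1 MiB -> 0.0
```

In aggregate over many reads the alignment error should mostly wash out, which matches Josh's "probably pretty close" expectation.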
end of thread, other threads:[~2011-04-26  2:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-20 19:04 Linux VFS cache hit rate script Jake Maul
2011-04-20 20:09 ` William Cohen
2011-04-20 20:46 ` Jake Maul
2011-04-20 20:57 ` Josh Stone
2011-04-20 22:25 ` Jake Maul
2011-04-21 20:17 ` Jake Maul
2011-04-21 22:31 ` William Cohen
2011-04-21 23:02 ` Jake Maul
2011-04-21 23:46 ` Josh Stone
2011-04-22 19:17 ` Jake Maul
2011-04-22 20:28 ` Josh Stone
2011-04-22 21:47 ` Jake Maul
2011-04-25 21:54 ` Josh Stone
2011-04-26  2:11 ` Jake Maul
2011-04-25 22:53 ` Josh Stone
2011-04-26  1:59 ` Jake Maul
2011-04-26  2:44 ` Josh Stone
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).