[Newbie] Help request: understanding slowdowns in a network file system

public inbox for systemtap@sourceware.org

From: Daniel Ankers @ 2012-07-27 14:37 UTC (permalink / raw)
  To: systemtap

Hi all,
I'm trying to understand what is causing occasional slowdowns in disk
I/O in a virtual environment I manage.

The disks are stored as image files on a Gluster
(http://www.gluster.org/) FUSE filesystem, and each image file is
stored on two different Gluster servers.  This means that any disk
request from an application on a virtual server goes through something
similar to the following layers:

(1) Linux VFS on guest system
(2) Hypervisor on host system
(3) Linux VFS on host system
(4) Gluster client FUSE module on host system
(5) Network layer on host system
(6) Physical network
(7) Network layer on Gluster server system
(8) Gluster server FUSE module on Gluster server system
(9) Linux VFS on Gluster server system
(10) Filesystem code on Gluster server system
(11) Physical disk on Gluster server system

The question I need to answer is "what do I need to upgrade to fix
this problem?" and I've not been able to find an answer using the usual
troubleshooting tools.  I've not even been able to find any evidence
beyond the behaviour observed on the guest system.

I'm reading the Systemtap Beginners Guide, which has some examples
that will help at certain layers (e.g. iotime.stp), but I'm struggling
to understand how to pull everything together to get helpful
diagnostic information.

The questions I have are:
1) Is Systemtap the right tool to help me get to the bottom of this
problem?  If not, the rest of the questions don't matter...
2) As an administrator rather than a developer I don't really know
which system calls I need to be monitoring.  What is the best way to
work this out?
3) Is there a neat way to tie together requests going out of the
client with requests coming into the server?
4) Are there any hints anyone can give on the best way to approach
troubleshooting across several different processes, layers and
services like this?

Thanks,
Dan

* Re: [Newbie] Help request: understanding slowdowns in a network file system
From: Frank Ch. Eigler @ 2012-07-27 17:21 UTC (permalink / raw)
  To: Daniel Ankers; +Cc: systemtap

Hi, Daniel -

md1clv wrote:

> I'm trying to understand what is causing occasional slowdowns in disk
> I/O in a virtual environment I manage. [...]
> This means that any disk request from an application on a virtual
> server goes through something similar to the following layers:

> (1) Linux VFS on guest system
> (2) Hypervisor on host system
> (3) Linux VFS on host system
> (4) Gluster client FUSE module on host system
> (5) Network layer on host system
> (6) Physical network
> (7) Network layer on Gluster server system
> (8) Gluster server FUSE module on Gluster server system
> (9) Linux VFS on Gluster server system
> (10) Filesystem code on Gluster server system
> (11) Physical disk on Gluster server system
> [...]

Yup.

> The questions I have are:
> 1) Is Systemtap the right tool to help me get to the bottom of this
> problem?  If not, the rest of the questions don't matter...

It is a plausible tool to gather data for your analysis; other tools
can do at least some of the job too.

> 2) As an administrator rather than a developer I don't really know
> which system calls I need to be monitoring.  What is the best way to
> work this out?

Hey, your list of affected layers/systems didn't even include the
syscalls/userspace!  But basically read/write, if those are the
dominant operations, as opposed to memory-mapped I/O.  (You can
speculatively trace all syscalls for a process, and e.g. take official
notice of only those that take a long time to complete.)
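
To make the "take notice of only the slow ones" idea concrete, here is
a minimal sketch in Python of the filtering step.  It assumes you have
already collected one record per syscall with entry and return
timestamps in microseconds; the SyscallRecord type, the slow_calls()
helper and the 10 ms threshold are all hypothetical names and values
chosen for illustration, not part of systemtap.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SyscallRecord(NamedTuple):
    """One traced syscall (hypothetical record layout)."""
    name: str
    entry_us: int    # timestamp at syscall entry, microseconds
    return_us: int   # timestamp at syscall return, microseconds


def slow_calls(records: Iterable[SyscallRecord],
               threshold_us: int = 10_000) -> List[Tuple[str, int]]:
    """Return (name, duration_us) for calls at or above the threshold."""
    result = []
    for rec in records:
        duration = rec.return_us - rec.entry_us
        if duration >= threshold_us:
            result.append((rec.name, duration))
    return result


# Made-up sample data: only the write exceeds the 10 ms threshold.
sample = [
    SyscallRecord("read", 1_000, 1_200),
    SyscallRecord("write", 2_000, 52_000),
    SyscallRecord("fsync", 60_000, 61_000),
]

for name, duration in slow_calls(sample):
    print(f"{name} took {duration} us")
```

The same entry/return bookkeeping can be done inside the tracing
script itself, so that fast calls are never reported at all.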

> 3) Is there a neat way to tie together requests going out of the
> client with requests coming into the server?

That's deeply protocol-specific, and is partly what makes such a
big-bang analysis job so difficult.  One needs to follow the data flow
throughout the layers, as it's encapsulated and transformed.  No
generic tool can do the job: one has to encode an understanding of all
these mappings at some point.  I'm aware of no tool that currently can
do all this, already hard-coded.  systemtap has the advantage of deep
programmability, so that you can experiment, and encode the knowledge
pretty directly (searching through data structures, following data as
it's being passed between code points, ...), without needing to
firehose-dump absolutely everything else that's going on on the
machine.

> 4) Are there any hints anyone can give on the best way to approach
> troubleshooting across several different processes, layers and
> services like this?

stap is probably suitable for gathering information on a per-host
basis, including tracking the data flow as it goes from userspace out
to the (virtual) network devices, without collecting too much extra data.

It sounds like you have multiple hosts whose (presumably timestamped)
data you'll need to combine and analyze further.  This would need
some sort of programmable viewer/aggregator, or simply
keen eyeballs.  stap per se (or most other basic tracing tools) will be of
no direct use here.  We're working on arranging smoothish data flow
into PCP (performance co-pilot), so that its graphical event viewers /
clients could be used for this; this is work in progress.
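
In the meantime, a very small aggregator can be sketched in Python.
This is only an illustration under stated assumptions: each host
writes a trace file whose lines look like "<seconds> <message>", each
file is already in time order, and the hosts' clocks are synchronized
closely enough (e.g. via NTP) for the interleaving to be meaningful.
The line format, the host names and the helper names are hypothetical.

```python
import heapq
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, host, message)


def events(host: str, lines: Iterable[str]) -> Iterator[Event]:
    """Parse '<seconds> <message>' lines from one host's trace."""
    for line in lines:
        stamp, _, message = line.partition(" ")
        yield (float(stamp), host, message.strip())


def merge_traces(traces: Dict[str, Iterable[str]]) -> Iterator[Event]:
    """Interleave per-host traces into one stream ordered by timestamp.

    heapq.merge assumes every input stream is already sorted.
    """
    return heapq.merge(*(events(host, lines)
                         for host, lines in traces.items()))


# Made-up sample traces for one write request.
client = ["100.000 write sent", "100.250 write acked"]
server = ["100.020 write received", "100.240 write on disk"]

merged = list(merge_traces({"client": client, "server": server}))
for stamp, host, message in merged:
    print(f"{stamp:.3f} {host:<6} {message}")
```

Reading the merged stream top to bottom shows where the time goes
between layers; matching a particular client request to its server
counterpart still needs the protocol-specific knowledge mentioned
above.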

- FChE
