From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <systemtap-return-13746-listarch-systemtap=sources.redhat.com@sourceware.org>
Received: (qmail 25044 invoked by alias); 30 Sep 2009 02:57:13 -0000
Received: (qmail 25029 invoked by uid 22791); 30 Sep 2009 02:57:09 -0000
X-SWARE-Spam-Status: No, hits=0.3 required=5.0 	tests=AWL,BAYES_00,SPF_HELO_PASS,SPF_PASS,TBC
X-Spam-Check-By: sourceware.org
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28)     by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Wed, 30 Sep 2009 02:57:01 +0000
Received: from int-mx03.intmail.prod.int.phx2.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.16]) 	by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id n8U2uxjH028236; 	Tue, 29 Sep 2009 22:56:59 -0400
Received: from [10.14.54.191] (dhcp-191.sjc.redhat.com [10.14.54.191]) 	by int-mx03.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id n8U2ux0n012621; 	Tue, 29 Sep 2009 22:56:59 -0400
Message-ID: <4AC2C8FB.6010701@redhat.com>
Date: Wed, 30 Sep 2009 02:57:00 -0000
From: Josh Stone <jistone@redhat.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4pre) Gecko/20090922 Fedora/3.0-2.7.b4.fc11 Lightning/1.0pre Thunderbird/3.0b4
MIME-Version: 1.0
To: "David J. Wilder" <dwilder@us.ibm.com>
CC: systemtap@sourceware.org
Subject: Re: [PATCH] trace tcp connection parameters
References: <1253848025.21411.2.camel@wilder.ibm.com>
In-Reply-To: <1253848025.21411.2.camel@wilder.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
Mailing-List: contact systemtap-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <systemtap.sourceware.org>
List-Subscribe: <mailto:systemtap-subscribe@sourceware.org>
List-Post: <mailto:systemtap@sourceware.org>
List-Help: <mailto:systemtap-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: systemtap-owner@sourceware.org
X-SW-Source: 2009-q3/txt/msg00897.txt.bz2

On 09/24/2009 08:07 PM, David J. Wilder wrote:
> This script (tcp_trace) can be used to trace tcp connection parameters
> and state changes.  This work was original inspired by Stephen Hemminger's
> TCP cwnd snooper (net/ipv4/tcp_probe.c). Tcp_trace is a helpful tool for
> troubleshooting connection performance issues.  I would like to included
> tcp_trace in the systemtap examples directory.

Thanks for this script David.  I don't know enough about networking to
offer much insight here, but I do have some suggestions that I think
will simplify the implementation.

> 
> Signed-off-by: Varun Chandramohan <varunc@linux.vnet.ibm.com>
> Signed-off-by: David Wilder <dwilder@us.ibm.com>
> ------------------------------------------------------
>  .../systemtap.examples/network/tcp_trace.meta      |   14 +
>  testsuite/systemtap.examples/network/tcp_trace.stp |  822 ++++++++++++++++++++
>  testsuite/systemtap.examples/network/tcp_trace.txt |   24 +
>  3 files changed, 860 insertions(+), 0 deletions(-)
> 
> diff --git a/testsuite/systemtap.examples/network/tcp_trace.meta b/testsuite/systemtap.examples/network/tcp_trace.meta
> new file mode 100644
> index 0000000..2783c89
> --- /dev/null
> +++ b/testsuite/systemtap.examples/network/tcp_trace.meta
> @@ -0,0 +1,14 @@
> +title: Tcp connection tracing utility.
> +name: tcp_trace.stp
> +version: 1.0
> +author: varuncha@in.ibm.com wilder@us.ibm.com
> +keywords: network trace
> +subsystem: network
> +status: production
> +exit: user-controlled
> +output: timed
> +scope: per-socket
> +arg_[0-9]+: tcp_trace.stp  filter=all|state|txq|rxq|srtt|snd_cwnd|snd_wnd|rexmit|pmtu|ssthresh|timer|rcvwnd [timeout=<sec>] [update=change|all] Rule
> +description: This scripts traces a given tcp connection based on the filter parameters given by the user. The indexing is done by the 4 tuples local address, remote address, local port, remote port.
> +test_check: stap -p4 tcp_trace.stp
> +test_installcheck: stap tcp_trace.stp 127.0.0.1:*-127.0.0.1:* timeout=1
> diff --git a/testsuite/systemtap.examples/network/tcp_trace.stp b/testsuite/systemtap.examples/network/tcp_trace.stp
> new file mode 100755
> index 0000000..04c2a63
> --- /dev/null
> +++ b/testsuite/systemtap.examples/network/tcp_trace.stp
> @@ -0,0 +1,822 @@
> +#! /usr/bin/env stap
> +/*
> + *	Copyright (C) 2009 IBM Corp.
> + *	This file is part of systemtap, and is free software.  You can
> + *	redistribute it and/or modify it under the terms of the GNU General
> + *	Public License (GPL); either version 2, or (at your option) any
> + *	later version.
> + *
> + *	Version 0.1     wilder@us.ibm.com	2009-05-13
> + *	Version 0.3     varuncha@in.ibm.com	2009-05-20
> + *	Version 1.0	wilder@us.ibm.com	2009-09-24
> + *
> + *	Tcp connection tracing utility.
> + *
> + *	Description:
> + *	This scripts traces a given tcp connection based on the filter
> + *	parameters given by the user. The indexing is done by the 4 tuples
> + *	local address, remote address, local port, remote port.
> + *
> + *	Synopsis:
> + *	tcp_trace.stp	[filter=all|state|txq|rxq|srtt|snd_cwnd|snd_wnd|rexmit\
> + *			|pmtu|ssthresh|timer|rcvwnd] [timeout=<N sec>]\
> + *			[update=change|all]\
> + *	Rule format:
> + *	  <local ip-address>:<local-port>-<remote ip-address>:<remote-port>
> + *
> + *	filter  tcp_trace.stp collects connection information from various 
> + *		probe points. This gives various tcp related parameters.
> + *		However it is not all parameters are needed when debuggin

+g

> + *		a tcp problem. Specifying only required parameters reduces
> + *		analysis time. Multiple parameters can be specified by using
> + *		','.
> + *
> + *	timeout (optional) When a timeout value (in seconds) is specified
> + *		tcpstat will automatically terminate it's run at the end of
> + *		the specified time and produce a report. When timeout is
> + *		omitted the script will run until the user terminates it by
> + *		typing a ^c.

I don't see anything that produces a report, although some kind of
summary might be interesting.

> + *
> + *	update (optional) By default the script only displays data if there
> + *		is a change in the parameters specified in the "filter". This
> + *		can be changed by specifying "all", which will ouput parameters
> + *		of every packet that hits the probe.

s/ouput/output/

> + *
> + *	Rule Format This is used to specify the particular connection being
> + *		traced. Only one connection can be specified at a time.
> + *		However, if needed the local and remote port can be a
> + *		wild-card (*). But the local & remote ip has to be a host ip.
> + *
> + *		The Rule Format is:
> + *		<local-address>:<local-port>-<remote address>:<remote-port>
> + *
> + *		-Address are specified as ipv4 dot notation address.
> + *		-Ports are specified as decimal numbers.

You implementation also supports (*) in the ip address, which should be
documented here.  Or for bonus points, it would be neat to support
explicit subnet/mask specs, as I think that's closer to what people
usually would deal with.

> + *
> + *	Examples:
> + *		Here are some examples of using tcp_trace:
> + *
> + *	Trace TCP connection from 172.16.15.160 to 172.16.15.1 on port 22 with
> + *	state,txq,rxq,pmtu filter
> + *	tcp_trace.stp filter=state,txq,rxq,pmtu timeout=100\
> + *		 172.16.15.160:*-172.16.15.1:22
> + *
> + *	Trace TCP connection from 172.16.15.1 to 172.16.15.160 on port 22 with
> + *	all parameters in filter
> + *	tcp_trace.stp filter=all 172.16.15.160:22-172.16.15.1:*
> + */
> +
> +global tcp_state;
> +global timer_info;
> +global filter;
> +global key_list;
> +global lastkey;
> +global tcp_param;
> +global timeout;
> +global update;
> +global count;
> +global length;
> +global tx_current_timer = 0;
> +global rx_current_timer = 0;
> +global find_timer = 0;
> +global argms;
> +global filter_flds = 0;
> +global snd_cwnd_flg = 0;
> +global snd_wnd_flg = 0;
> +global srtt_flg = 0;
> +global state_flg = 0;
> +global txq_flg = 0;
> +global rxq_flg = 0;
> +global rexmit_flg = 0;
> +global pmtu_flg = 0;
> +global ssthresh_flg = 0;
> +global timer_flg = 0;
> +global rcvwnd_flg = 0;

Just stylistic - it's not required to initialize globals to 0 or "".
The only real reason to do so is to force the type to long or string,
but we should be able to figure that out automatically.

I think some of these globals don't need to exist, but I'll try to
mention that in the places they are used.

> +
> +probe begin {
> +	timeout = -1;
> +	number_of_filters = process_cmdline()
> +
> +	if(number_of_filters < 1) {
> +		usage("No Connection Parameters Specified")
> +	} else if (number_of_filters > 1) {
> +		usage("Too many Connection Parameters Specified")
> +	}
> +	
> +	for(i =0; i < count; i++) {
> +		if(tcp_param[i] == "snd_cwnd") {
> +			snd_cwnd_flg = 1
> +		} else if(tcp_param[i] == "snd_wnd") {
> +			snd_wnd_flg = 1
> +		} else if(tcp_param[i] == "srtt") {
> +			srtt_flg = 1;
> +		} else if(tcp_param[i] == "state") {
> +			state_flg = 1;
> +		} else if(tcp_param[i] == "txq") {
> +			txq_flg = 1;
> +		} else if(tcp_param[i] == "rxq") {
> +			rxq_flg = 1;
> +		} else if(tcp_param[i] == "rexmit") {
> +			rexmit_flg = 1;
> +		} else if(tcp_param[i] == "pmtu") {
> +			pmtu_flg = 1;
> +		} else if(tcp_param[i] == "ssthresh") {
> +			ssthresh_flg = 1;
> +		} else if(tcp_param[i] == "timer") {
> +			timer_flg = 1;
> +		} else if(tcp_param[i] == "rcvwnd") {
> +			rcvwnd_flg = 1;
> +		} else if(tcp_param[i] == "all") {
> +			filter_flds = 1;
> +		} else {
> +			printf("Discard Unknown Filter = %s \n", tcp_param[i]);
> +		}
> +			
> +	}

Why not just make this part of process_cmdline()?  Then you don't need a
global for tcp_param[] or count.

> +
> +	if(!(snd_cwnd_flg | snd_wnd_flg | srtt_flg | state_flg | txq_flg |
> +		rxq_flg | rexmit_flg | pmtu_flg | ssthresh_flg | timer_flg |
> +		rcvwnd_flg))
> +		filter_flds = 1

I don't think you need the filter_flds global -- just set all of the
flags in this case.  That way every "if (foo_flg || filter_flds)" can be
just "if (foo_flg)".

> +
> +	if(state_flg || filter_flds) {
> +		tcp_state[1] = "ESTABLISHED"
> +		tcp_state[2] = "SYN_SENT"
> +		tcp_state[3] = "SYN_RECV"
> +		tcp_state[4] = "FIN_WAIT1"
> +		tcp_state[5] = "FIN_WAIT2"
> +		tcp_state[6] = "TIME_WAIT"
> +		tcp_state[7] = "CLOSE"
> +		tcp_state[8] = "CLOSE_WAIT"
> +		tcp_state[9] = "LAST_ACK"
> +		tcp_state[10] = "LISTEN"
> +		tcp_state[11] = "CLOSING"
> +	}
> +
> +	if(timer_flg || filter_flds) {
> +		timer_info[1] = "Rxmit"
> +		timer_info[2] = "Delack"
> +		timer_info[3] = "Probe"
> +		timer_info[4] = "Keepalv"
> +	}
> +		
> +	printf("Start TCP Probing.....\n");
> +	printf("\n");
> +	printf("\nSource Address         Dest Address           State       Len    Tx-Q  Rx-Q  PMTU  SndCwnd SndWnd  A.RWnd SRTT Ssthreshold Rexmit Timer\n");
> +}
> +
> +probe kernel.function("tcp_rcv_established"),
> +      kernel.function("tcp_rcv_state_process")

I see that you're not using any tapset probepoints in this script.  If
these are useful places to probe, would you consider adding them to one
of the tapsets?

Also, each of the probe bodies are nearly the same, except for a few
variations in variable sources (sk, length, r/w flag).  It would be nice
to factor that common chunk into a single function.

> +{
> +	sk = $sk
> +
> +	laddr = tcpmib_local_addr(sk);
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +	length = $skb->len
> +
> +	if ( filter_key(laddr, raddr, lport, rport, 0) == 0 ) 
> +		next;
> +
> +	if (is_packet_updated(laddr, raddr, lport, rport, 0, sk) == 0)
> +		next; 
> +
> +	print_packet_info(laddr, raddr, lport, rport, 0)
> +}
> +
> +probe kernel.function("tcp_transmit_skb")
> +{
> +	sk = $sk
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +	length = $skb->len
> +
> +	if ( filter_key(laddr, raddr, lport, rport, 1) == 0 ) 
> +		next;
> +
> +	if (is_packet_updated(laddr, raddr, lport, rport, 1, sk) == 0)
> +		next; 
> +
> +	print_packet_info(laddr, raddr, lport, rport, 1)
> +}
> +
> +probe kernel.function("tcp_keepalive_timer")
> +{
> +	sk = $data;
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +	length = 0
> +	tx_current_timer = 4;
> +
> +        if ( filter_key(laddr, raddr, lport, rport, 1) == 0 )
> +                next;
> +
> +        if (is_packet_updated(laddr, raddr, lport, rport, 1, sk) == 0)
> +                next;
> +
> +	print_packet_info(laddr, raddr, lport, rport, 1)
> +}
> +
> +probe kernel.function("tcp_delack_timer")
> +{
> +	sk = $data;
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +	length = 0
> +	tx_current_timer = 2;
> +
> +        if ( filter_key(laddr, raddr, lport, rport, 1) == 0 )
> +                next;
> +
> +        if (is_packet_updated(laddr, raddr, lport, rport, 1, sk) == 0)
> +                next;
> +
> +	print_packet_info(laddr, raddr, lport, rport, 1)
> +}
> +
> +probe kernel.function("tcp_send_probe0")
> +{
> +	sk = $sk
> +
> +	if(find_timer == 3) {
> +		tx_current_timer = 3
> +		next;
> +	}
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +	length = 0
> +
> +        if ( filter_key(laddr, raddr, lport, rport, 1) == 0 )
> +                next;
> +
> +        if (is_packet_updated(laddr, raddr, lport, rport, 1, sk) == 0)
> +                next;
> +
> +	print_packet_info(laddr, raddr, lport, rport, 1)
> +}
> +
> +probe kernel.function("tcp_write_timer")
> +{
> +	sk = $data
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +        if ( filter_key(laddr, raddr, lport, rport, 1) == 0 )
> +                next;
> +
> +	find_timer = @cast(sk, "inet_connection_sock")->icsk_pending
> +}

This use of find_timer and rx/tx_current_timer seems racy to me -- what
if there are multiple connections matching the filter and changing
timers at the same time?

> +
> +probe kernel.function("tcp_retransmit_skb")
> +{
> +	sk = $sk;
> +
> +	if(find_timer == 1) {
> +		tx_current_timer = 1
> +		next;
> +	}
> +
> +	laddr = tcpmib_local_addr(sk);  
> +	raddr = tcpmib_remote_addr(sk);
> +	lport = tcpmib_local_port(sk);
> +	rport = tcpmib_remote_port(sk);
> +
> +	length = $skb->len
> +
> +
> +	if ( filter_key(laddr, raddr, lport, rport, 1) == 0 )
> +		next;
> +
> +	if (is_packet_updated(laddr, raddr, lport, rport, 1, sk) == 0)
> +		next;
> +
> +	print_packet_info(laddr, raddr, lport, rport, 1)
> +}
> +
> +function print_packet_info:long (laddr:long, raddr:long, lport:long,
> +				 rport:long, flag:long )
> +{
> +	temp_addr = ip_ntop(htonl(laddr))
> +	local_addr = sprintf("%s:%d",temp_addr,lport)
> +	temp_addr = ip_ntop(htonl(raddr))
> +	remote_addr = sprintf("%s:%d",temp_addr,rport)
> +	argms[1] = 5
> +	argms[3] = length

The argms array is quite strange to me.  It is only used within this
function, so it doesn't really need to be global.  The indexes are magic
-- they don't appear to have real meaning.  Some of the indexes appear
to be field widths, and some are the field values.

Please just use separate local variables with descriptive names for each
of these.

> +	
> +	if(state_flg || filter_flds) {
> +		state_val = key_list[laddr, raddr, lport, rport, flag, 1]
> +		if(state_val < 1 || state_val > 11) {
> +			prnt_state = "UNKNOWN"
> +		} else {
> +			prnt_state = tcp_state[state_val]
> +		}
> +	} else {
> +		prnt_state = "-"
> +	}
> +
> +	if(txq_flg || filter_flds) { 
> +		argms[4] = key_list[laddr, raddr, lport, rport, flag, 2]
> +	} else {
> +		argms[4] = -1
> +	}

It's even more confusing when the magic indexes of key_list doesn't
match the the indexes used in argms...

> +
> +	if(rxq_flg || filter_flds) {
> +		argms[5] = key_list[laddr, raddr, lport, rport, flag, 3]
> +	} else {
> +		argms[5] = -1
> +	}
> +
> +	if(rexmit_flg || filter_flds) {
> +		argms[12] = key_list[laddr, raddr, lport, rport, flag, 4]
> +	} else {
> +		argms[12] = -1
> +	}
> +
> +	if(pmtu_flg || filter_flds) {
> +		argms[6] = key_list[laddr, raddr, lport, rport, flag, 5]
> +	} else {
> +		argms[6] = -1
> +	}
> +
> +	if(snd_cwnd_flg || filter_flds) {
> +		argms[7] = key_list[laddr, raddr, lport, rport, flag, 6]
> +	} else {
> +		argms[7] = -1
> +	}
> +
> +	if(snd_wnd_flg || filter_flds) {
> +		argms[8] = key_list[laddr, raddr, lport, rport, flag, 7]
> +	} else {
> +		argms[8] = -1
> +	}
> +
> +	if(srtt_flg || filter_flds) {
> +		argms[10] = key_list[laddr, raddr, lport, rport, flag, 8]
> +	} else {
> +		argms[10] = -1
> +	}
> +
> +	if(ssthresh_flg || filter_flds) {
> +		argms[11] = key_list[laddr, raddr, lport, rport, flag, 10]
> +	} else {
> +		argms[11] = -1
> +	}
> +
> +	if(timer_flg || filter_flds) {
> +		timer_val = key_list[laddr, raddr, lport, rport, flag, 11]
> +		if(timer_val < 1 || timer_val > 4) {
> +			prnt_timer = "N/A"
> +		} else {
> +			prnt_timer = timer_info[timer_val]
> +		}
> +	} else {
> +		prnt_timer = "-"
> +	}
> +	
> +	if(rcvwnd_flg || filter_flds) {
> +		argms[9] = key_list[laddr, raddr, lport, rport, flag, 12]
> +	} else {
> +		argms[9] = -1
> +	}
> +
> +	if(flag) {
> +                argms[0] = netmax(22,strlen(local_addr))
> +                argms[2] = netmax(22,strlen(remote_addr))
> +                buff = sprintf("%-*s %-*s %-*s %-*d %-*d %-*d %-*d %-*d %-*d %-*d %-*d %-*d",
> +		argms[0], local_addr, argms[2],remote_addr, 11, prnt_state, 6,
> +		argms[3], 5, argms[4], 5, argms[5], 5, argms[6], 7, argms[7],
> +		7, argms[8], 6, argms[9], 4, argms[10], 11, argms[11]);
> +                buff1 = sprintf("%-*d %-*s",6, argms[12], argms[1], prnt_timer);
> +
> +
> +        } else {
> +                argms[0] = netmax(22,strlen(remote_addr))
> +                argms[2] = netmax(22,strlen(local_addr))
> +                buff = sprintf("%-*s %-*s %-*s %-*d %-*d %-*d %-*d %-*d %-*d %-*d %-*d %-*d",
> +		argms[0], remote_addr, argms[2],local_addr, 11, prnt_state, 6,
> +		argms[3], 5, argms[4], 5, argms[5], 5, argms[6], 7, argms[7],
> +		7, argms[8], 6, argms[9], 4, argms[10], 11, argms[11]);
> +                buff1 = sprintf("%-*d %-*s",6, argms[12], argms[1], prnt_timer);
> +        }

The only difference I can see in this if/else is the order of the
local/remote addr, so it would probably be simpler to save which is the
source and dest, and then share the same sprintfs.

The dynamic field width using netmax is unnecessary.  A "ipv4addr:port"
string can't be more than 21 characters, so the 22 width is fine.  And
even if it could, the field width doesn't ever cause truncation, so the
whole thing will alway be printed anyway.

Actually, I don't understand why any of this is using dynamic field
widths, when you're just passing literals anyway.  The desired widths
are already designed by the header you print in the beginning, right?  I
would much prefer to see "%-22s ...".

Also, I think that the default right-alignment is easier to read for
numbers, because it's easier to compare and judge magnitudes at a glance
when the digit orders are lined up.

> +
> +	buff_final = str_replace(buff, "-1", " -")
> +	buff_final1 = str_replace(buff1, "-1", " -")
> +	printf("%s %s\n",buff_final,buff_final1)
> +}

I wouldn't copy these strings into local variables like this very often,
as our translator isn't too smart about copying strings around.  We
generate a fair amount less code if you avoid those buff_final
temporaries and pass the str_replaces directly to the printf.

> +
> +function netmax:long (val1:long, val2:long)
> +{
> +	if(val1 > val2)
> +		return val1
> +	else
> +		return val2
> +}
> +
> +function is_packet_updated:long ( laddr:long, raddr:long, lport:long,
> +					 rport:long, flag:long, sk:long )
> +{
> +	packet_updated = 0
> +
> +	if(state_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 1] =
> +				 @cast(sk, "sock_common")->skc_state
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 1]
> +			state = @cast(sk, "sock_common")->skc_state
> +			if(temp != state)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 1] = state
> +		}
> +	}
> +	if(txq_flg || filter_flds) {
> +		if(update) {
> +			if(@cast(sk, "sock_common")->skc_state == 10) 
> +				key_list[laddr, raddr, lport, rport, flag, 2] =
> +					 @cast(sk, "sock")->sk_max_ack_backlog
> +			else
> +				key_list[laddr, raddr, lport, rport, flag, 2] =
> +					(@cast(sk, "tcp_sock")->write_seq - 
> +					 @cast(sk, "tcp_sock")->snd_una)
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 2]
> +			if(@cast(sk, "sock_common")->skc_state == 10)
> +				tx_queue =
> +					 @cast(sk, "sock")->sk_max_ack_backlog
> +			else
> +				tx_queue = (@cast(sk, "tcp_sock")->write_seq -
> +						@cast(sk, "tcp_sock")->snd_una)
> +			if(temp != tx_queue)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 2] = tx_queue
> +		}
> +	}
> +	if(rxq_flg || filter_flds) {
> +		if(update) {
> +			if(@cast(sk, "sock_common")->skc_state == 10)
> +				key_list[laddr, raddr, lport, rport, flag, 3] =
> +					@cast(sk, "sock")->sk_ack_backlog
> +			else
> +				key_list[laddr, raddr, lport, rport, flag, 3] =
> +					(@cast(sk, "tcp_sock")->rcv_nxt - 						@cast(sk, "tcp_sock")->copied_seq)
> +			packet_updated  = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 3]
> +			if(@cast(sk, "sock_common")->skc_state == 10)
> +				rx_queue = @cast(sk, "sock")->sk_max_ack_backlog
> +			else
> +				rx_queue = (@cast(sk, "tcp_sock")->rcv_nxt -
> +					@cast(sk, "tcp_sock")->copied_seq)
> +			if(temp != rx_queue)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 3] = rx_queue
> +		}
> +	}
> +	if(rexmit_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 4] =
> +			 @cast(sk, "inet_connection_sock")->icsk_retransmits
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 4]
> +			retxmit =
> +			@cast(sk, "inet_connection_sock")->icsk_retransmits
> +			if(temp != retxmit)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 4] = retxmit
> +		}
> +	}
> +	if(pmtu_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 5] =
> +			 @cast(sk, "inet_connection_sock")->icsk_pmtu_cookie
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 5]
> +			pmtu =
> +			@cast(sk, "inet_connection_sock")->icsk_pmtu_cookie
> +			if(temp != pmtu)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 5] = pmtu
> +		}
> +	}
> +	if(snd_cwnd_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 6] =
> +			@cast(sk, "tcp_sock")->snd_cwnd
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 6]
> +			snd_cwnd = @cast(sk, "tcp_sock")->snd_cwnd
> +			if(temp != snd_cwnd)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 6] = snd_cwnd
> +		}
> +	}
> +	if(snd_wnd_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 7] =
> +				 @cast(sk, "tcp_sock")->snd_wnd
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 7]
> +			snd_wnd = @cast(sk, "tcp_sock")->snd_wnd
> +			if(temp != snd_wnd)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 7] = snd_wnd
> +		}
> +	}
> +	if(srtt_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 8] =
> +				(@cast(sk,"tcp_sock")->srtt) >> 3
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 8]
> +			srtt = (@cast(sk,"tcp_sock")->srtt) >> 3
> +			if(temp != srtt)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 8] = srtt
> +		}
> +	}
> +
> +	key_list[laddr, raddr, lport, rport, flag, 9] = length
> +
> +	if(ssthresh_flg || filter_flds) {
> +		if(update) {
> +			if ((1 << @cast(sk, "inet_connection_sock")->icsk_ca_state) & ((1 << 2) | (1 << 3)))
> +				key_list[laddr, raddr, lport, rport, flag, 10] =
> +					 @cast(sk, "tcp_sock")->snd_ssthresh
> +			else
> +				key_list[laddr, raddr, lport, rport, flag, 10] =
> +				 netmax(@cast(sk, "tcp_sock")->snd_ssthresh,
> +					((@cast(sk, "tcp_sock")->snd_cwnd >>1)+
> +					(@cast(sk, "tcp_sock")->snd_cwnd >> 2)))
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 10]
> +			if ((1 << @cast(sk, "inet_connection_sock")->icsk_ca_state) & ((1 << 2) | (1 << 3)))
> +				ssthresh = @cast(sk, "tcp_sock")->snd_ssthresh
> +			else
> +				ssthresh =
> +				 netmax(@cast(sk, "tcp_sock")->snd_ssthresh,
> +				((@cast(sk, "tcp_sock")->snd_cwnd >> 1) +
> +				(@cast(sk, "tcp_sock")->snd_cwnd >> 2)))
> +			if(temp != ssthresh)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 10] =
> +				 ssthresh
> +		}
> +	}
> +
> +	if(timer_flg || filter_flds) {
> +		if(update) {
> +			if(flag)
> +				key_list[laddr, raddr, lport, rport, flag, 11] =
> +					tx_current_timer
> +			else
> +				key_list[laddr, raddr, lport, rport, flag, 11] =
> +					rx_current_timer
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 11]
> +			if(flag) {
> +				if(temp != tx_current_timer)
> +					packet_updated = 1
> +				key_list[laddr, raddr, lport, rport, flag, 11] =
> +					tx_current_timer
> +			} else {
> +				if(temp != rx_current_timer)
> +					packet_updated = 1
> +				key_list[laddr, raddr, lport, rport, flag, 11] =
> +					rx_current_timer
> +			}
> +		}
> +	}
> +
> +	if(rcvwnd_flg || filter_flds) {
> +		if(update) {
> +			key_list[laddr, raddr, lport, rport, flag, 12] =
> +				 @cast(sk, "tcp_sock")->rcv_wnd
> +			packet_updated = 1
> +		} else {
> +			temp = key_list[laddr, raddr, lport, rport, flag, 12]
> +			rcvwnd = @cast(sk, "tcp_sock")->rcv_wnd
> +			if(temp != rcvwnd)
> +				packet_updated = 1
> +			key_list[laddr, raddr, lport, rport, flag, 12] = rcvwnd
> +		}
> +	}
> +	return packet_updated		
> +}

There's a lot of duplication between each "if (update) / else" branch.
I would just use the else branch on all of them, and then the return
value is "update || packet_updated".

> +
> +/* All command line arguments other than the filters are processed
> + * first and must be placed on the command line prior to any filters.
> + */
> +function process_cmdline:long ()
> +{
> +        filter_number = 0
> +        count = 0
> +        for (i=1; i <= argc; i++) {
> +                argument = tokenize(argv[i], "=")
> +
> +                if (argument == "filter") {
> +                        argv[i]=""
> +                        while(strlen(byte = tokenize(argv[i], ",")) != 0) {
> +                                argv[i] = ""
> +                                tcp_param[count] = byte
> +                                count++
> +                        }
> +                        continue;
> +                }
> +
> +                if ( argument == "timeout" ){
> +                        argv[i]=""
> +                        timeout=strtol(tokenize(argv[i], "="),10)
> +                        continue;
> +                }
> +
> +		if ( argument == "update") {
> +			argv[i]=""
> +			update_disp = tokenize(argv[i], "=")
> +			continue;
> +		}
> +
> +		/* Anything on the command line after this point must
> + 		 * be a filter.
> +		 */
> +		source = tokenize(argv[i], "-")
> +			argv[i] = ""
> +		dest = tokenize(argv[i], "-")
> +
> +		source_addr = tokenize(source, ":")
> +			source=""
> +		source_port = tokenize(source, ":")
> +
> +		dest_addr = tokenize(dest, ":")
> +			dest=""
> +		dest_port = tokenize(dest, ":")
> +
> +		/* stap bug */
> +		if ( dest_port == "fobar") i=i;
> +
> +		++filter_number;
> +		j=filter_number*6;
> +		filter[j]   = ipv4_pton(source_addr,0)  // Source address
> +		filter[j+1] = ipv4_pton(source_addr,1)  // Source address mask
> +		filter[j+2] = ipv4_portton(source_port) // Source port
> +		filter[j+3] = ipv4_pton(dest_addr,0)    // Dest address
> +		filter[j+4] = ipv4_pton(dest_addr,1)    // Dest address mask
> +		filter[j+5] = ipv4_portton(dest_port)   // Dest port

This is another case of magic indexes, which doesn't really need to be
an array at all.  These are distinct components, so why not just name
them filter_src_addr, filter_src_port, etc.?

> +
> +		if (filter[j]< -1 ||
> +			filter[j+1] < -1 ||
> +			filter[j+2] < -1 ||
> +			filter[j+3] < -1 ||
> +			filter[j+4] < -1 ||
> +			filter[j+5] < -1 ) return -1;
> +
> +	}
> +	if (update_disp == "all") {
> +		update = 1;
> +	} else if (update_disp == "change") {
> +		update = 0;
> +	} else {
> +		/*We default it as "change"*/
> +		update = 0;
> +	}

The way this flag is used, I would call it "always_update".

> +	return filter_number;
> +}
> +
> +/*
> + * Convert an ascii integer values between 0 and 65534 to a u16 port number.
> + * "*" are treated as wildcards and will be converted to 0xffff.
> + */
> +function ipv4_portton:long (port:string)
> +{
> +	if ( port == "*" ) port="65535";
> +		pport=strtol(port,10);
> +	if ( pport > 0xffff ){
> +		printf("Bad port number %s\n",port)
> +		return -22;
> +	}
> +	return pport
> +}

Since 65535 could be a valid port number, why not use something like -1
to represent the wildcard case?

> +
> +/*
> + * Convert ipv4 dot notation address into longs.
> + * Supports "*" in any field treating it as a wildcard by making the byte=0.
> + * If make_mask is set it creates a mask based on the "*" fields. All non='*'
> + * bytes are set to 0xff all * fields are set to 0x0;.
> + */
> +function ipv4_pton:long (addr:string, make_mask:long)
> +{
> +	i=32;
> +	ip=0;
> +	ips=addr;
> +	while(strlen(byte = tokenize(ips, ".")) != 0) {
> +		i-=8;
> +		ips="";
> +
> +		if ( byte == "*" ){
> +			byte = "0"
> +		} else {
> +			if ( make_mask ) byte = "255";
> +		}
> +
> +		j=strtol(byte,10);
> +		if ( j > 255 ){
> +			printf("bad address %s\n",addr)
> +			return -22;
> +		}
> +			ip=ip+(j<<i) // left shift the byte into the address
> +	}
> +	if ( i != 0 ){
> +		printf("bad address %s\n",addr)
> +		return -22;
> +	}
> +	return ip;
> +}
> +
> +/*
> + * Returns a unique value (stored in the global key_list) based on the socket
> + * address tuple and the global collection index value. A new value is created
> + * if one does not already exist.
> + */
> +function build_key:long (laddr:long, raddr:long, lport:long, rport:long, flag:long)
> +{
> +	__index__ = 0
> +
> +	if ( key_list[laddr, raddr, lport, rport, flag, __index__] ) {
> +		return  key_list[laddr, raddr, lport, rport, flag, __index__]
> +	} else {
> +		key_list[laddr, raddr, lport, rport, flag, __index__]=++lastkey
> +		key = key_list[laddr, raddr, lport, rport, flag, __index__]
> +		__index__ ++
> +		while(__index__ != 12) {
> +			key_list[laddr, raddr, lport, rport, flag, __index__]=0
> +			__index__ ++
> +		}
> +		return key
> +	}
> +}
> +
> +		
> +function filter_key:long (laddr:long, raddr:long, lport:long, rport:long, flag:long)
> +{
> +	j = 6;
> +	local_filter = remote_filter = 0
> +
> +	//Local Filter
> +	if ( (laddr&filter[j+1]) == filter[j] ) {
> +		if ( (filter[j+2] == 0xffff) || (lport == filter[j+2]))
> +			local_filter = 1;
> +	}
> +	// Remote filter
> +	if ( (raddr&filter[j+4]) == filter[j+3] ) {
> +		if ( (filter[j+5] == 0xffff) || (rport == filter[j+5]))
> +			remote_filter = 1;
> +	}
> +
> +	if(local_filter && remote_filter) {
> +		return build_key(laddr, raddr, lport, rport, flag);
> +	} else {
> +	        return 0;
> +	}
> +}

The result of filter_key is only used as a boolean, so build_key's value
is wasted.  (It does still have the side effect of clearing the other
key_list indexes though.)

However, I think you could use this key to simplify a lot of other
indexing.  The tuple (laddr,raddr,lport,rport,flag) only needs to be
checked once here, and then the resulting key could be passed around for
your index everywhere else.

> +
> +/* Terminates the run in timeout seconds, using global timeout value */
> +probe timer.s(1) {
> +	if ( timeout == -1 ) next
> +	if ( !timeout-- ) exit()
> +}

There's an off-by-one error here with the post-decrement.
(e.g. timeout=1 will wait for 2 seconds.)

> +
> +function usage (msg:string)
> +{
> +	printf("\nUsage:\n");
> +	printf("\ttcp_trace.stp  filter=all|state|txq|rxq|srtt|snd_cwnd|snd_wnd|rexmit|pmtu|ssthresh|timer|rcvwnd [timeout=<sec>] [update=change|all] Rule\n");
> +	printf("\tRule format:  <local ip-address>:<local-port>-<remote ip-address>:<remote-port>\n\n");
> +	error(msg);
> +}

Note that printf() goes to stdout, and error() goes to stderr, and
there's no strict ordering between the two.

> diff --git a/testsuite/systemtap.examples/network/tcp_trace.txt b/testsuite/systemtap.examples/network/tcp_trace.txt
> new file mode 100644
> index 0000000..46fe53f
> --- /dev/null
> +++ b/testsuite/systemtap.examples/network/tcp_trace.txt
> @@ -0,0 +1,24 @@
> +tcp_trace.stp filter=all 9.47.66.34:*-9.47.67.199:* 
> +Start TCP Probing.....
> +
> +Source Address         Dest Address           State       Len    Tx-Q  Rx-Q  PMTU  SndCwnd SndWnd  A.RWnd SRTT Ssthreshold Rexmit Timer
> +9.47.67.199:55587      9.47.66.34:22          SYN_RECV    20     0     0     1500  2       6144    5840   0    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 21     21    0     1500  3       6144    5840   0    2147483647  0      N/A  
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 20     21    0     1500  3       6144    5888   0    2147483647  0      N/A  
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 41     0     0     1500  4       6144    5888   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 0      0     21    1500  4       6144    5888   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 0      0     792   1500  4       6144    5888   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 784    784   792   1500  4       6144    7424   1    2147483647  0      N/A  
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 44     784   0     1500  4       6144    7424   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 152    152   0     1500  5       7680    7424   1    2147483647  0      N/A  
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 164    152   0     1500  5       7680    7424   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 0      0     0     1500  5       9216    7424   1    2147483647  0      Delack
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 720    720   0     1500  5       9216    9088   1    2147483647  0      Delack
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 36     720   0     1500  5       9216    9088   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 0      0     16    1500  5       10752   9088   1    2147483647  0      Delack
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 68     0     0     1500  5       10752   9088   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 0      0     0     1500  5       10752   9088   1    2147483647  0      Delack
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 48     48    0     1500  5       10752   9088   1    2147483647  0      Delack
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 84     48    0     1500  5       10752   9088   1    2147483647  0      N/A  
> +9.47.66.34:22          9.47.67.199:55587      ESTABLISHED 64     64    0     1500  5       10752   9088   1    2147483647  0      Delack
> +9.47.67.199:55587      9.47.66.34:22          ESTABLISHED 116    64    0     1500  5       10752   9088   1    2147483647  0      N/A  
> 
> 

I usually try to trim my reviews a bit more, but with the size of this
file it was easier for me to keep all of the context.  I hope it's still
legible for you.

Thanks,

Josh