public inbox for systemtap@sourceware.org
* [PATCH] Fix tokenize function.
@ 2009-06-13  3:29 Przemyslaw Pawelczyk
  2009-06-13 11:40 ` Przemysław Pawełczyk
  2009-06-13 12:38 ` [PATCH] Fix tokenize function and test Przemyslaw Pawelczyk
  0 siblings, 2 replies; 10+ messages in thread
From: Przemyslaw Pawelczyk @ 2009-06-13  3:29 UTC (permalink / raw)
  To: systemtap

The previous implementation was error-prone, because it allowed returning
empty tokens (mimicking strsep()), which is fine only where a NULL semantic
exists. Unfortunately SystemTap scripts have no NULL, only the blank string
(""), so testing the return value against it was misleading.
The solution is to return only non-empty tokens (mimicking strtok()).
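
For illustration only (not part of the patch), here is a minimal userspace
C sketch of the difference, assuming glibc's strsep()/strtok():

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
	char a[] = "before,,after", *p = a, *tok;
	char b[] = "before,,after";

	/* strsep() returns an empty token for consecutive delimiters */
	while ((tok = strsep(&p, ",")) != NULL)
		printf("strsep: \"%s\"\n", tok);	/* "before", "", "after" */

	/* strtok() skips empty tokens */
	for (tok = strtok(b, ","); tok != NULL; tok = strtok(NULL, ","))
		printf("strtok: \"%s\"\n", tok);	/* "before", "after" */

	return 0;
}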

It was also unsafe, because NUL-termination of the passed string wasn't
guaranteed, yet that is crucial for strsep() to work properly.

* tapset/string.stp: Fix tokenize.
* stapfuncs.3stap.in: Update tokenize description.
* doc/langref.tex: Ditto.
---
 doc/langref.tex    |    4 ++--
 stapfuncs.3stap.in |    6 +++---
 tapset/string.stp  |   19 +++++++++++--------
 3 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/doc/langref.tex b/doc/langref.tex
index 5aefa27..5a149d1 100644
--- a/doc/langref.tex
+++ b/doc/langref.tex
@@ -3160,8 +3160,8 @@ General syntax:
 tokenize:string (input:string, delim:string)
 \end{verbatim}
 \end{vindent}
-This function returns the next token in the given input string, where 
-the tokens are delimited by one of the characters in the delim string.
+This function returns the next non-empty token in the given input string,
+where the tokens are delimited by characters in the delim string.
 If the input string is non-NULL, it returns the first token. If the input string
 is NULL, it returns the next token in the string passed in the previous call
 to tokenize. If no delimiter is found, the entire remaining input string
diff --git a/stapfuncs.3stap.in b/stapfuncs.3stap.in
index b9326fd..b5f2502 100644
--- a/stapfuncs.3stap.in
+++ b/stapfuncs.3stap.in
@@ -166,11 +166,11 @@ specified by base.  For example, strtol("1000", 16) returns 4096.  Returns 0 if
 string cannot be converted.
 .TP
 tokenize:string (str:string, delim:string)
-Return the next token in the given str string, where the tokens are delimited
-by one of the characters in the delim string.  If the str string is not blank,
+Return the next non-empty token in the given str string, where the tokens are
+delimited by characters in the delim string.  If the str string is not blank,
 it returns the first token.  If the str string is blank, it returns the next
 token in the string passed in the previous call to tokenize. If no delimiter
-is found, the entire remaining str string is returned.  Returns blank when 
+is found, the entire remaining str string is returned.  Returns blank when
 no more tokens are left.
 
 .SS TIMESTAMP
diff --git a/tapset/string.stp b/tapset/string.stp
index 35ee9fa..cf70bc6 100644
--- a/tapset/string.stp
+++ b/tapset/string.stp
@@ -70,25 +70,28 @@ function text_strn:string(input:string, len:long, quoted:long)
 
 /*
  * tokenize - Given a string and a token delimiter,
- *            return the next token in the string
- * input  String to tokenize. If NULL, returns the next token in the
- *        string passed in the previous call to tokenize().
- * delim  Token delimiter. Note this is a string, but only the first
- *        character is used as the delimiter.
+ *            return the next non-empty token in the string
+ *            or blank when no more non-empty tokens are left
+ * input  String to tokenize. If NULL, returns the next non-empty token
+ *        in the string passed in the previous call to tokenize().
+ * delim  Token delimiter. Set of characters that delimit the tokens.
  */
 function tokenize:string(input:string, delim:string)
 %{ /* pure */
 	static char str[MAXSTRINGLEN];
 	static char *str_start;
+	static char *str_end;
 	char *token = NULL;
 
 	if (THIS->input[0]) {
-		strncpy(str, THIS->input, MAXSTRINGLEN);
+		strncpy(str, THIS->input, MAXSTRINGLEN - 1);
 		str_start = &str[0];
+		*(str_end = &str[0] + strnlen(str, MAXSTRINGLEN - 1)) = 0;
 	}
-	token = strsep(&str_start, THIS->delim);
+	while ((token = strsep(&str_start, THIS->delim)) && !token[0])
+		;
 	if (token)
-		strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
+		strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
 %}
 
 /*
-- 
1.5.6.5

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function.
  2009-06-13  3:29 [PATCH] Fix tokenize function Przemyslaw Pawelczyk
@ 2009-06-13 11:40 ` Przemysław Pawełczyk
  2009-06-13 12:38 ` [PATCH] Fix tokenize function and test Przemyslaw Pawelczyk
  1 sibling, 0 replies; 10+ messages in thread
From: Przemysław Pawełczyk @ 2009-06-13 11:40 UTC (permalink / raw)
  To: systemtap

On Sat, Jun 13, 2009 at 05:13, Przemyslaw Pawelczyk <przemyslaw@pawelczyk.it> wrote:
> The previous implementation was error-prone, because it allowed returning
> empty tokens (mimicking strsep()), which is fine only where a NULL semantic
> exists. Unfortunately SystemTap scripts have no NULL, only the blank string
> (""), so testing the return value against it was misleading.
> The solution is to return only non-empty tokens (mimicking strtok()).

testsuite/systemtap.string/tokenize.exp turned out to be bogus,
because (with the previous implementation of tokenize) the blank teststr4
enabled further tokenization of teststr3, which should already have been
finished. In the second version of the patch I'll also fix the test case.
Simplified example:

stap -e 'global s[2]; probe begin { s[1] = "before,,after"; s[2] = "";
  foreach (i in s) { println(i); tok = tokenize(s[i], ",");
  while (tok != "") { println(tok); tok = tokenize("", ","); } }; exit(); }'

Result from previous implementation:
1
before
2
after

Result from my implementation:
1
before
after
2

> It was also unsafe, because NUL-termination of the passed string wasn't
> guaranteed, yet that is crucial for strsep() to work properly.

I was unknowingly wrong here, sorry. I'll remove that part in the second version.

Regards.

-- 
Przemysław Pawełczyk

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Fix tokenize function and test.
  2009-06-13  3:29 [PATCH] Fix tokenize function Przemyslaw Pawelczyk
  2009-06-13 11:40 ` Przemysław Pawełczyk
@ 2009-06-13 12:38 ` Przemyslaw Pawelczyk
  2009-06-15 20:01   ` Josh Stone
  2009-06-17 23:59   ` [PATCH v2] " Przemyslaw Pawelczyk
  1 sibling, 2 replies; 10+ messages in thread
From: Przemyslaw Pawelczyk @ 2009-06-13 12:38 UTC (permalink / raw)
  To: systemtap

The previous implementation was error-prone, because it allowed returning
empty tokens (mimicking strsep()), which is fine only where a NULL semantic
exists. Unfortunately SystemTap scripts have no NULL, only the blank string
(""), so testing the return value against it was misleading.
The solution is to return only non-empty tokens (mimicking strtok()).

* tapset/string.stp: Fix tokenize.
* testsuite/systemtap.string/tokenize.stp: Improve and add case with
  more than one delimiter in the delim string.
* stapfuncs.3stap.in: Update tokenize description.
* doc/langref.tex: Ditto.
---
 doc/langref.tex                         |    4 +-
 stapfuncs.3stap.in                      |    6 +-
 tapset/string.stp                       |   17 ++++---
 testsuite/systemtap.string/tokenize.exp |   16 ++++++-
 testsuite/systemtap.string/tokenize.stp |   75 +++++++++++++++++++-----------
 5 files changed, 77 insertions(+), 41 deletions(-)

diff --git a/doc/langref.tex b/doc/langref.tex
index 5aefa27..5a149d1 100644
--- a/doc/langref.tex
+++ b/doc/langref.tex
@@ -3160,8 +3160,8 @@ General syntax:
 tokenize:string (input:string, delim:string)
 \end{verbatim}
 \end{vindent}
-This function returns the next token in the given input string, where 
-the tokens are delimited by one of the characters in the delim string.
+This function returns the next non-empty token in the given input string,
+where the tokens are delimited by characters in the delim string.
 If the input string is non-NULL, it returns the first token. If the input string
 is NULL, it returns the next token in the string passed in the previous call
 to tokenize. If no delimiter is found, the entire remaining input string
diff --git a/stapfuncs.3stap.in b/stapfuncs.3stap.in
index b9326fd..b5f2502 100644
--- a/stapfuncs.3stap.in
+++ b/stapfuncs.3stap.in
@@ -166,11 +166,11 @@ specified by base.  For example, strtol("1000", 16) returns 4096.  Returns 0 if
 string cannot be converted.
 .TP
 tokenize:string (str:string, delim:string)
-Return the next token in the given str string, where the tokens are delimited
-by one of the characters in the delim string.  If the str string is not blank,
+Return the next non-empty token in the given str string, where the tokens are
+delimited by characters in the delim string.  If the str string is not blank,
 it returns the first token.  If the str string is blank, it returns the next
 token in the string passed in the previous call to tokenize. If no delimiter
-is found, the entire remaining str string is returned.  Returns blank when 
+is found, the entire remaining str string is returned.  Returns blank when
 no more tokens are left.
 
 .SS TIMESTAMP
diff --git a/tapset/string.stp b/tapset/string.stp
index 35ee9fa..bfbc6c2 100644
--- a/tapset/string.stp
+++ b/tapset/string.stp
@@ -70,25 +70,28 @@ function text_strn:string(input:string, len:long, quoted:long)
 
 /*
  * tokenize - Given a string and a token delimiter,
- *            return the next token in the string
- * input  String to tokenize. If NULL, returns the next token in the
- *        string passed in the previous call to tokenize().
- * delim  Token delimiter. Note this is a string, but only the first
- *        character is used as the delimiter.
+ *            return the next non-empty token in the string
+ *            or blank when no more non-empty tokens are left
+ * input  String to tokenize. If NULL, returns the next non-empty token
+ *        in the string passed in the previous call to tokenize().
+ * delim  Token delimiter. Set of characters that delimit the tokens.
  */
 function tokenize:string(input:string, delim:string)
 %{ /* pure */
 	static char str[MAXSTRINGLEN];
 	static char *str_start;
+	static char *str_end;
 	char *token = NULL;
 
 	if (THIS->input[0]) {
 		strncpy(str, THIS->input, MAXSTRINGLEN);
 		str_start = &str[0];
+		str_end = &str[0] + strlen(str);
 	}
-	token = strsep(&str_start, THIS->delim);
+	while ((token = strsep(&str_start, THIS->delim)) && !token[0])
+		;
 	if (token)
-		strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
+		strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
 %}
 
 /*
diff --git a/testsuite/systemtap.string/tokenize.exp b/testsuite/systemtap.string/tokenize.exp
index 697b7c7..aa28f85 100644
--- a/testsuite/systemtap.string/tokenize.exp
+++ b/testsuite/systemtap.string/tokenize.exp
@@ -9,7 +9,9 @@ seven
 eight
 nine
 ten
+-
 one|two|three|four|five|six|seven|eight|nine|ten
+-
 a
 b
 c
@@ -17,10 +19,22 @@ d
 e
 f
 g
+-
 1
 2
 3
  
 4
-this is a string with no delimiters}
+-
+-
+this is a string with no delimiters
+-
+this
+is
+a
+string
+which
+has
+two
+delimiters}
 stap_run2 $srcdir/$subdir/$test.stp
diff --git a/testsuite/systemtap.string/tokenize.stp b/testsuite/systemtap.string/tokenize.stp
index 10703d9..1b253c8 100644
--- a/testsuite/systemtap.string/tokenize.stp
+++ b/testsuite/systemtap.string/tokenize.stp
@@ -5,42 +5,61 @@ probe begin
 	teststr3 = "1,,2,3, ,4"
 	teststr4 = ""
 	teststr5 = "this is a string with no delimiters"
+	teststr6 = "this is a string, which has two delimiters"
 
 	tok = tokenize(teststr1, "|")
 	while (tok != "") {
-		printf("%s\n", tok)
+		println(tok)
 		tok = tokenize("", "|")
 	}
 
-        tok = tokenize(teststr1, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", "|")
-        }
+	println("-")
+
+	tok = tokenize(teststr1, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", "|")
+	}
 	
+	println("-")
+
 	tok = tokenize(teststr2, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr3, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr4, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr5, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr3, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr4, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr5, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr6, ", ")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ", ")
+	}
 
 	exit()
 }
-- 
1.5.6.5

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function and test.
  2009-06-13 12:38 ` [PATCH] Fix tokenize function and test Przemyslaw Pawelczyk
@ 2009-06-15 20:01   ` Josh Stone
  2009-06-16 23:40     ` Przemysław Pawełczyk
  2009-06-17 23:59   ` [PATCH v2] " Przemyslaw Pawelczyk
  1 sibling, 1 reply; 10+ messages in thread
From: Josh Stone @ 2009-06-15 20:01 UTC (permalink / raw)
  To: Przemyslaw Pawelczyk; +Cc: systemtap

On 06/13/2009 05:33 AM, Przemyslaw Pawelczyk wrote:
> The previous implementation was error-prone, because it allowed returning
> empty tokens (mimicking strsep()), which is fine only where a NULL semantic
> exists. Unfortunately SystemTap scripts have no NULL, only the blank string
> (""), so testing the return value against it was misleading.
> The solution is to return only non-empty tokens (mimicking strtok()).
> 
> * tapset/string.stp: Fix tokenize.
> * testsuite/systemtap.string/tokenize.stp: Improve and add case with
>   more than one delimiter in the delim string.
> * stapfuncs.3stap.in: Update tokenize description.
> * doc/langref.tex: Ditto.
[...]

> -	token = strsep(&str_start, THIS->delim);
> +	while ((token = strsep(&str_start, THIS->delim)) && !token[0])
> +		;

do-while would be cleaner, please.
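
Something along these lines, I mean (untested sketch):

	do {
		token = strsep(&str_start, THIS->delim);
	} while (token && !token[0]);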

>  	if (token)
> -		strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
> +		strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
>  %}

Why do you need the explicit length computation?  There will always be a
NUL there anyway, right?  I tried reverting this part, and the tests
still pass, so I don't see what this is for.

Also, while we're at it, strlcpy would be preferred for better
termination semantics (guaranteed NUL and no extra NUL padding).
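
I.e., roughly (a sketch, assuming the kernel's strlcpy()):

	if (token)
		strlcpy(THIS->__retvalue, token, MAXSTRINGLEN);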


Josh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function and test.
  2009-06-15 20:01   ` Josh Stone
@ 2009-06-16 23:40     ` Przemysław Pawełczyk
  2009-06-17  0:54       ` Josh Stone
  0 siblings, 1 reply; 10+ messages in thread
From: Przemysław Pawełczyk @ 2009-06-16 23:40 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

On Mon, Jun 15, 2009 at 22:00, Josh Stone <jistone@redhat.com> wrote:
> On 06/13/2009 05:33 AM, Przemyslaw Pawelczyk wrote:
>> The previous implementation was error-prone, because it allowed returning
>> empty tokens (mimicking strsep()), which is fine only where a NULL semantic
>> exists. Unfortunately SystemTap scripts have no NULL, only the blank string
>> (""), so testing the return value against it was misleading.
>> The solution is to return only non-empty tokens (mimicking strtok()).
>>
>> * tapset/string.stp: Fix tokenize.
>> * testsuite/systemtap.string/tokenize.stp: Improve and add case with
>>   more than one delimiter in the delim string.
>> * stapfuncs.3stap.in: Update tokenize description.
>> * doc/langref.tex: Ditto.
> [...]
>
>> -     token = strsep(&str_start, THIS->delim);
>> +     while ((token = strsep(&str_start, THIS->delim)) && !token[0])
>> +             ;
>
> do-while would be cleaner, please.

I'll fix this in next version of patch.

>
>>       if (token)
>> -             strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
>> +             strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
>>  %}
>
> Why do you need the explicit length computation?  There will always be a
> NUL there anyway, right?  I tried reverting this part, and the tests
> still pass, so I don't see what this is for.
>
> Also, while we're at it, strlcpy would be preferred for better
> termination semantics (guaranteed NUL and no extra NUL padding).

You contradict yourself. There is no need to use strlcpy here,
because the current code does the same thing, but better -- without
implicit strlen calls (only one for the whole input string): guaranteed
NUL and no extra NUL padding.
Or maybe you're suggesting putting it into a separate patch? Yes, the
function works without this change, but I would say it is also a fix,
just for badly written code. Of course I can send it as a second patch
if you wish.

> Josh

Thank you for reviewing this patch.

Regards.

--
Przemysław Pawełczyk

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function and test.
  2009-06-16 23:40     ` Przemysław Pawełczyk
@ 2009-06-17  0:54       ` Josh Stone
  2009-06-17 18:39         ` Przemysław Pawełczyk
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Stone @ 2009-06-17  0:54 UTC (permalink / raw)
  To: Przemysław Pawełczyk; +Cc: systemtap

On 06/16/2009 04:40 PM, Przemysław Pawełczyk wrote:
>>>       if (token)
>>> -             strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
>>> +             strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
>>>  %}
>>
>> Why do you need the explicit length computation?  There will always be a
>> NUL there anyway, right?  I tried reverting this part, and the tests
>> still pass, so I don't see what this is for.
>>
>> Also, while we're at it, strlcpy would be preferred for better
>> termination semantics (guaranteed NUL and no extra NUL padding).
> 
> You contradict yourself. There is no need for using strlcpy here,
> because what is here does the same, but better -- without implicit
> strlen calls (only one for the whole input string). Guaranteed NUL and
> no extra NUL padding.

If you really want to be concerned with optimizing this, then you should
be using memcpy instead, since the position of the NUL is precisely known.

> Maybe you're talking about putting it into another patch? Yes, it
> works without this change, but I would say, that is also a fix, but
> for badly written code. Of course I can send it as a second patch if
> you wish.

I don't understand the hate for strlcpy, but I don't care so much to
argue over it.  Technically, we could even be using strcpy here, since
the input string is already known to be small enough.

I care that the code is clear and concise.  It wasn't immediately
obvious to me that (str_start ? str_start : str_end + 1) means "the end
of this token," and my question shows that I also didn't see why that
was a win.  I get it now, but I'm bound to forget the next time I look
at it. :)

Josh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function and test.
  2009-06-17  0:54       ` Josh Stone
@ 2009-06-17 18:39         ` Przemysław Pawełczyk
  2009-06-17 21:42           ` Josh Stone
  0 siblings, 1 reply; 10+ messages in thread
From: Przemysław Pawełczyk @ 2009-06-17 18:39 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

2009/6/17 Josh Stone <jistone@redhat.com>:
> On 06/16/2009 04:40 PM, Przemysław Pawełczyk wrote:
>>>>       if (token)
>>>> -             strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
>>>> +             strncpy(THIS->__retvalue, token, (str_start ? str_start : str_end + 1) - token);
>>>>  %}
>>>
>>> Why do you need the explicit length computation?  There will always be a
>>> NUL there anyway, right?  I tried reverting this part, and the tests
>>> still pass, so I don't see what this is for.
>>>
>>> Also, while we're at it, strlcpy would be preferred for better
>>> termination semantics (guaranteed NUL and no extra NUL padding).
>>
>> You contradict yourself. There is no need for using strlcpy here,
>> because what is here does the same, but better -- without implicit
>> strlen calls (only one for the whole input string). Guaranteed NUL and
>> no extra NUL padding.
>
> If you really want to be concerned with optimizing this, then you should
> be using memcpy instead, since the position of the NUL is precisely known.

I'm not so concerned about optimization, but I don't like doing
superfluous things. You're totally right about memcpy; will you accept
it if I use it in my patch?

>> Maybe you're talking about putting it into another patch? Yes, it
>> works without this change, but I would say, that is also a fix, but
>> for badly written code. Of course I can send it as a second patch if
>> you wish.
>
> I don't understand the hate for strlcpy, but I don't care so much to
> argue over it.  Technically, we could even be using strcpy here, since
> the input string is already known to be small enough.

I have nothing against strlcpy. Why do you think so? It's a useful
function and I don't deny that.
There is simply no reason to do an implicit strlen call on every
tokenize("", delim), because (as you already pointed out) the position
of the NUL is precisely known.
Maybe in a few years MAXSTRINGLEN will be a hundred times greater than
today, and then we will be changing this function once again?
IMHO assuming that the input string is known to be small enough is not
the right way.

> I care that the code is clear and concise.  It wasn't immediately
> obvious to me that (str_start ? str_start : str_end + 1) means "the end
> of this token," and my question shows that I also didn't see why that
> was a win.  I get it now, but I'm bound to forget the next time I look
> at it. :)

"The end of this token" semantic is consequence of how strsep works.
Nevertheless I can add some comments if you think it will make the
code more clear.
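
For instance, something along these lines (just an illustration, wording
to be adjusted):

	/*
	 * strsep() overwrites the found delimiter with NUL and advances
	 * str_start past it, so str_start - token == strlen(token) + 1.
	 * When no delimiter is left, strsep() sets str_start to NULL and
	 * the token runs up to str_end (the original terminating NUL),
	 * so (str_end + 1) - token == strlen(token) + 1 as well.
	 */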

> Josh

Regards.

-- 
Przemysław Pawełczyk

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix tokenize function and test.
  2009-06-17 18:39         ` Przemysław Pawełczyk
@ 2009-06-17 21:42           ` Josh Stone
  0 siblings, 0 replies; 10+ messages in thread
From: Josh Stone @ 2009-06-17 21:42 UTC (permalink / raw)
  To: Przemysław Pawełczyk; +Cc: systemtap

On 06/17/2009 11:39 AM, Przemysław Pawełczyk wrote:
> 2009/6/17 Josh Stone <jistone@redhat.com>:
>> If you really want to be concerned with optimizing this, then you should
>> be using memcpy instead, since the position of the NUL is precisely known.
> 
> I'm not so concerned about optimization, but I don't like doing
> superfluous things. You're totally right with memcpy, will you accept
> it if I use it in my patch?

Yes, that's fine.

>>> Maybe you're talking about putting it into another patch? Yes, it
>>> works without this change, but I would say, that is also a fix, but
>>> for badly written code. Of course I can send it as a second patch if
>>> you wish.
>>
>> I don't understand the hate for strlcpy, but I don't care so much to
>> argue over it.  Technically, we could even be using strcpy here, since
>> the input string is already known to be small enough.
> 
> I have no feelings towards strlcpy. Why do you think so? It's useful
> function and I don't negate this fact.

I misunderstood your comment about "badly written code."  There are some
notable people who are against strlcpy (e.g. Drepper won't add it to
glibc).  In a case like ours with fixed buffers, I think it makes
complete sense.

> Simply there is no reason for doing implicit strlen call every
> tokenize("", delim), because (as you pointed already) position of the
> NUL is precisely known.
> Maybe in a few years MAXSTRINGLEN will be hundred times greater than
> today, so then we will be changing this function once again?
> IMHO assuming that input string is known to be small enough is not the
> right way.

What is there to change?  The input and output buffers are both sized
MAXSTRINGLEN.  The tokens are subsets of the input string, thus they
must also be less than MAXSTRINGLEN.  But anyway, that was an off-handed
comment that doesn't need to be addressed in your patch.

>> I care that the code is clear and concise.  It wasn't immediately
>> obvious to me that (str_start ? str_start : str_end + 1) means "the end
>> of this token," and my question shows that I also didn't see why that
>> was a win.  I get it now, but I'm bound to forget the next time I look
>> at it. :)
> 
> "The end of this token" semantic is consequence of how strsep works.
> Nevertheless I can add some comments if you think it will make the
> code more clear.

Right, I was just distracted because a variable with "start" in the name
is also used as an end pointer.  A comment would be nice, or even just
self-documenting with a "token_end" temporary.

Josh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2] Fix tokenize function and test.
  2009-06-13 12:38 ` [PATCH] Fix tokenize function and test Przemyslaw Pawelczyk
  2009-06-15 20:01   ` Josh Stone
@ 2009-06-17 23:59   ` Przemyslaw Pawelczyk
  2009-06-18  1:29     ` Josh Stone
  1 sibling, 1 reply; 10+ messages in thread
From: Przemyslaw Pawelczyk @ 2009-06-17 23:59 UTC (permalink / raw)
  To: systemtap

The previous implementation was error-prone, because it allowed returning
empty tokens (mimicking strsep()), which is fine only where a NULL semantic
exists. Unfortunately SystemTap scripts have no NULL, only the blank string
(""), so testing the return value against it was misleading.
The solution is to return only non-empty tokens (mimicking strtok()).

* tapset/string.stp: Fix tokenize.
* testsuite/systemtap.string/tokenize.stp: Improve and add case with
  more than one delimiter in the delim string.
* testsuite/systemtap.string/tokenize.exp: Ditto.
* stapfuncs.3stap.in: Update tokenize description.
* doc/langref.tex: Ditto.
---
 doc/langref.tex                         |    4 +-
 stapfuncs.3stap.in                      |    6 +-
 tapset/string.stp                       |   23 ++++++---
 testsuite/systemtap.string/tokenize.exp |   16 ++++++-
 testsuite/systemtap.string/tokenize.stp |   75 +++++++++++++++++++-----------
 5 files changed, 82 insertions(+), 42 deletions(-)

diff --git a/doc/langref.tex b/doc/langref.tex
index 5aefa27..5a149d1 100644
--- a/doc/langref.tex
+++ b/doc/langref.tex
@@ -3160,8 +3160,8 @@ General syntax:
 tokenize:string (input:string, delim:string)
 \end{verbatim}
 \end{vindent}
-This function returns the next token in the given input string, where 
-the tokens are delimited by one of the characters in the delim string.
+This function returns the next non-empty token in the given input string,
+where the tokens are delimited by characters in the delim string.
 If the input string is non-NULL, it returns the first token. If the input string
 is NULL, it returns the next token in the string passed in the previous call
 to tokenize. If no delimiter is found, the entire remaining input string
diff --git a/stapfuncs.3stap.in b/stapfuncs.3stap.in
index 518ff2b..3d88b2e 100644
--- a/stapfuncs.3stap.in
+++ b/stapfuncs.3stap.in
@@ -166,11 +166,11 @@ specified by base.  For example, strtol("1000", 16) returns 4096.  Returns 0 if
 string cannot be converted.
 .TP
 tokenize:string (str:string, delim:string)
-Return the next token in the given str string, where the tokens are delimited
-by one of the characters in the delim string.  If the str string is not blank,
+Return the next non-empty token in the given str string, where the tokens are
+delimited by characters in the delim string.  If the str string is not blank,
 it returns the first token.  If the str string is blank, it returns the next
 token in the string passed in the previous call to tokenize. If no delimiter
-is found, the entire remaining str string is returned.  Returns blank when 
+is found, the entire remaining str string is returned.  Returns blank when
 no more tokens are left.
 
 .SS TIMESTAMP
diff --git a/tapset/string.stp b/tapset/string.stp
index 35ee9fa..cc84292 100644
--- a/tapset/string.stp
+++ b/tapset/string.stp
@@ -70,25 +70,32 @@ function text_strn:string(input:string, len:long, quoted:long)
 
 /*
  * tokenize - Given a string and a token delimiter,
- *            return the next token in the string
- * input  String to tokenize. If NULL, returns the next token in the
- *        string passed in the previous call to tokenize().
- * delim  Token delimiter. Note this is a string, but only the first
- *        character is used as the delimiter.
+ *            return the next non-empty token in the string
+ *            or blank when no more non-empty tokens are left
+ * input  String to tokenize. If NULL, returns the next non-empty token
+ *        in the string passed in the previous call to tokenize().
+ * delim  Token delimiter. Set of characters that delimit the tokens.
  */
 function tokenize:string(input:string, delim:string)
 %{ /* pure */
 	static char str[MAXSTRINGLEN];
 	static char *str_start;
+	static char *str_end;
 	char *token = NULL;
+	char *token_end = NULL;
 
 	if (THIS->input[0]) {
 		strncpy(str, THIS->input, MAXSTRINGLEN);
 		str_start = &str[0];
+		str_end = &str[0] + strlen(str);
+	}
+	do {
+		token = strsep(&str_start, THIS->delim);
+	} while (token && !token[0]);
+	if (token) {
+		token_end = (str_start ? str_start - 1 : str_end);
+		memcpy(THIS->__retvalue, token, token_end - token + 1);
 	}
-	token = strsep(&str_start, THIS->delim);
-	if (token)
-		strncpy (THIS->__retvalue, token, MAXSTRINGLEN);
 %}
 
 /*
diff --git a/testsuite/systemtap.string/tokenize.exp b/testsuite/systemtap.string/tokenize.exp
index 697b7c7..aa28f85 100644
--- a/testsuite/systemtap.string/tokenize.exp
+++ b/testsuite/systemtap.string/tokenize.exp
@@ -9,7 +9,9 @@ seven
 eight
 nine
 ten
+-
 one|two|three|four|five|six|seven|eight|nine|ten
+-
 a
 b
 c
@@ -17,10 +19,22 @@ d
 e
 f
 g
+-
 1
 2
 3
  
 4
-this is a string with no delimiters}
+-
+-
+this is a string with no delimiters
+-
+this
+is
+a
+string
+which
+has
+two
+delimiters}
 stap_run2 $srcdir/$subdir/$test.stp
diff --git a/testsuite/systemtap.string/tokenize.stp b/testsuite/systemtap.string/tokenize.stp
index 10703d9..1b253c8 100644
--- a/testsuite/systemtap.string/tokenize.stp
+++ b/testsuite/systemtap.string/tokenize.stp
@@ -5,42 +5,61 @@ probe begin
 	teststr3 = "1,,2,3, ,4"
 	teststr4 = ""
 	teststr5 = "this is a string with no delimiters"
+	teststr6 = "this is a string, which has two delimiters"
 
 	tok = tokenize(teststr1, "|")
 	while (tok != "") {
-		printf("%s\n", tok)
+		println(tok)
 		tok = tokenize("", "|")
 	}
 
-        tok = tokenize(teststr1, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", "|")
-        }
+	println("-")
+
+	tok = tokenize(teststr1, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", "|")
+	}
 	
+	println("-")
+
 	tok = tokenize(teststr2, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr3, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr4, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
-
-        tok = tokenize(teststr5, ",")
-        while (tok != "") {
-                printf("%s\n", tok)
-                tok = tokenize("", ",")
-        }
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr3, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr4, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr5, ",")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ",")
+	}
+
+	println("-")
+
+	tok = tokenize(teststr6, ", ")
+	while (tok != "") {
+		println(tok)
+		tok = tokenize("", ", ")
+	}
 
 	exit()
 }
-- 
1.5.6.5

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] Fix tokenize function and test.
  2009-06-17 23:59   ` [PATCH v2] " Przemyslaw Pawelczyk
@ 2009-06-18  1:29     ` Josh Stone
  0 siblings, 0 replies; 10+ messages in thread
From: Josh Stone @ 2009-06-18  1:29 UTC (permalink / raw)
  To: Przemyslaw Pawelczyk; +Cc: systemtap

On 06/17/2009 04:50 PM, Przemyslaw Pawelczyk wrote:
> The previous implementation was error-prone, because it allowed returning
> empty tokens (mimicking strsep()), which is fine only where a NULL semantic
> exists. Unfortunately SystemTap scripts have no NULL, only the blank string
> (""), so testing the return value against it was misleading.
> The solution is to return only non-empty tokens (mimicking strtok()).
> 
> * tapset/string.stp: Fix tokenize.
> * testsuite/systemtap.string/tokenize.stp: Improve and add case with
>   more than one delimiter in the delim string.
> * testsuite/systemtap.string/tokenize.exp: Ditto.
> * stapfuncs.3stap.in: Update tokenize description.
> * doc/langref.tex: Ditto.
> ---

Applied, thanks!

Josh

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2009-06-18  1:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-13  3:29 [PATCH] Fix tokenize function Przemyslaw Pawelczyk
2009-06-13 11:40 ` Przemysław Pawełczyk
2009-06-13 12:38 ` [PATCH] Fix tokenize function and test Przemyslaw Pawelczyk
2009-06-15 20:01   ` Josh Stone
2009-06-16 23:40     ` Przemysław Pawełczyk
2009-06-17  0:54       ` Josh Stone
2009-06-17 18:39         ` Przemysław Pawełczyk
2009-06-17 21:42           ` Josh Stone
2009-06-17 23:59   ` [PATCH v2] " Przemyslaw Pawelczyk
2009-06-18  1:29     ` Josh Stone
