public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* Re: Fix %Z parsing in strptime [BZ #16088]
@ 2023-07-17 12:57 Wilco Dijkstra
  2023-07-17 17:19 ` Paul Eggert
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2023-07-17 12:57 UTC (permalink / raw)
  To: lancasterharp; +Cc: 'GNU C Library'

Hi Stanley,

> +	  /* we recognize the format [-+a-zA-Z0-9]{3,} */

Is that really the correct format? Eg. should it be able to parse UTC offsets like UTC+04:30?

> 	  while (ISSPACE (*rp))
> 	    rp++;
>-	  while (!ISSPACE (*rp) && *rp != '\0')
>+	  
>+	  const char* stop_rp = rp + 3;
>+	  while (((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z') || (*rp >= '0' && *rp <= '9')) && (rp < stop_rp) && *rp != '\0')

This is basically isalnum() but not allowing '+' or '-'. And if timezone offsets are allowed,
we'd also need ':'. And rather than allowing any sequence of these characters, would it
not be better to scan just for isalpha() first, and only if you see '+' or '-' check the UTC
offset syntax?

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fix %Z parsing in strptime [BZ #16088]
  2023-07-17 12:57 Fix %Z parsing in strptime [BZ #16088] Wilco Dijkstra
@ 2023-07-17 17:19 ` Paul Eggert
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Eggert @ 2023-07-17 17:19 UTC (permalink / raw)
  To: Wilco Dijkstra, lancasterharp; +Cc: 'GNU C Library'

On 2023-07-17 05:57, Wilco Dijkstra via Libc-alpha wrote:
> Hi Stanley,
> 
>> +	  /* we recognize the format [-+a-zA-Z0-9]{3,} */
> 
> Is that really the correct format? Eg. should it be able to parse UTC offsets like UTC+04:30?

strftime doesn't generate such offsets, so this is asking whether 
strptime should accept forms that strftime doesn't generate, at the cost 
of strptime misparsing some unusual forms that strftime does generate 
(e.g., strftime with "%Z:%Y").


> This is basically isalnum() but not allowing '+' or '-'.

Yes, '+' and '-' should be allowed.


> rather than allowing any sequence of these characters, would it
> not be better to scan just for isalpha() first, and only if you see '+' or '-' check the UTC
> offset syntax?

POSIX allows settings like TZ='<-X->0', unfortunately, which generate 
time zone abbreviations like "-X-". Also, an actual isalpha would be 
dicey, as abbreviations use ASCII letters (OK, letters in the POSIX 
locale, basically the same thing), so it's arguably better to avoid 
isalpha here.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fix %Z parsing in strptime [BZ #16088]
  2023-07-18 17:17   ` Stanley Lancaster
  2023-07-18 17:22     ` Stanley Lancaster
@ 2023-07-19  2:13     ` Paul Eggert
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Eggert @ 2023-07-19  2:13 UTC (permalink / raw)
  To: Stanley Lancaster; +Cc: libc-alpha

Unfortunately that patch was corrupted by Gmail. Please use git 
send-email to send patches.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fix %Z parsing in strptime [BZ #16088]
  2023-07-18 17:17   ` Stanley Lancaster
@ 2023-07-18 17:22     ` Stanley Lancaster
  2023-07-19  2:13     ` Paul Eggert
  1 sibling, 0 replies; 7+ messages in thread
From: Stanley Lancaster @ 2023-07-18 17:22 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3816 bytes --]

I made several of the changes Paul suggested. And ran all the tests. This
patch does not fill out the tm_zone information because the manpage for the
api lists this a format character where "the  corresponding fields are
parsed, but no field in tm is changed." I feel that if this is the
behaviour developers have expected & written around in the past, that
behaviour should stay. However, if everyone here feels that a full
implementation should be written, I can do so.

On Tue, Jul 18, 2023 at 12:17 PM Stanley Lancaster <lancasterharp@gmail.com>
wrote:

> ---
>  time/strptime_l.c   | 16 +++++++++++-----
>  time/tst-strptime.c |  2 ++
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/time/strptime_l.c b/time/strptime_l.c
> index 85c3249fcc..2382defc92 100644
> --- a/time/strptime_l.c
> +++ b/time/strptime_l.c
> @@ -770,11 +770,17 @@ __strptime_internal (const char *rp, const char
> *fmt, struct tm *tmp,
>    break;
>   case 'Z':
>    /* Read timezone but perform no conversion.  */
> -  while (ISSPACE (*rp))
> -    rp++;
> -  while (!ISSPACE (*rp) && *rp != '\0')
> -    rp++;
> -  break;
> + {
> + while (ISSPACE (*rp))
> +     rp++;
> + /* Read time zone but perform no conversion. Recognize the format
> [-+a-zA-Z0-9]{3,}.  */
> + const char* start_rp = rp;
> + while ((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z') || (*rp
> >= '0' && *rp <= '9'))
> + rp++;
> + if (start_rp+3 < rp)
> + return NULL;
> + }
>   case 'z':
>    /* We recognize four formats:
>       1. Two digits specify hours.
> diff --git a/time/tst-strptime.c b/time/tst-strptime.c
> index 3dae9e0594..40145cb109 100644
> --- a/time/tst-strptime.c
> +++ b/time/tst-strptime.c
> @@ -48,6 +48,8 @@ static const struct
>      6, 0, 0, 1 },
>    { "en_US.ISO-8859-1", "2000-01-01 08:12:21 PM", "%Y-%m-%d %I:%M:%S %p",
>      6, 0, 0, 1 },
> +  { "en_US.ISO-8859-1", "2000-01-01 08:12:21 AM CST/", "%Y-%m-%d %I:%M:%S
> %p %Z/",
> +    6, 0, 0, 1},
>    { "ja_JP.EUC-JP", "2001 20 \xb7\xee", "%Y %U %a", 1, 140, 4, 21 },
>    { "ja_JP.EUC-JP", "2001 21 \xb7\xee", "%Y %W %a", 1, 140, 4, 21 },
>    /* Most of the languages do not need the declension of the month names
> --
> 2.39.3
>
> On Fri, Jul 14, 2023 at 11:11 AM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
>> On 2023-07-14 07:52, Stanley Lancaster via Libc-alpha wrote:
>>
>> >         /* Read timezone but perform no conversion.  */
>> > +       /* we recognize the format [-+a-zA-Z0-9]{3,} */
>>
>> Use GNU style in comment with active voice sentences and two spaces
>> after sentence end, e.g., "/* Read time zone but perform no conversion.
>> Recognize the format [-+a-zA-Z0-9]{3,}.  */".
>>
>>
>> > +       const char* stop_rp = rp + 3;
>>
>> Again, GNU style: "char *stop_rp" not "char* stop_rp".
>>
>> More important, this has undefined behavior if rp + 3 overflows.
>> Instead, count the number of bytes after the loop finishes, and make
>> sure it's 3 or more.
>>
>>
>> > +       while (((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <=
>> 'z') || (*rp >= '0' && *rp <= '9')) && (rp < stop_rp) && *rp != '\0')
>>
>> Omit "&& *rp != '\0'"; it's redundant. Reindent to 80 columns.
>>
>>
>> > +  { "C", "1999CST0502123412", "%Y%Z%m%d%H%M%S", 0, 121, 4, 2 },
>>
>> I don't see how this test passes. "CST0502123412" is treated as a time
>> zone abbreviation, so the only info is the year. Did you run the tests?
>> If so, why did this test pass? If not, please run.
>>
>> Since the patch does not fix BZ#16088, it needs a commit message that
>> describes what the patch does and why it's a win even though it doesn't
>> fix. In particular, the patch does not set tm_zone, and there's a reason
>> for that, and this should be explained.
>>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fix %Z parsing in strptime [BZ #16088]
  2023-07-14 16:11 ` Paul Eggert
@ 2023-07-18 17:17   ` Stanley Lancaster
  2023-07-18 17:22     ` Stanley Lancaster
  2023-07-19  2:13     ` Paul Eggert
  0 siblings, 2 replies; 7+ messages in thread
From: Stanley Lancaster @ 2023-07-18 17:17 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3104 bytes --]

---
 time/strptime_l.c   | 16 +++++++++++-----
 time/tst-strptime.c |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/time/strptime_l.c b/time/strptime_l.c
index 85c3249fcc..2382defc92 100644
--- a/time/strptime_l.c
+++ b/time/strptime_l.c
@@ -770,11 +770,17 @@ __strptime_internal (const char *rp, const char *fmt,
struct tm *tmp,
   break;
  case 'Z':
   /* Read timezone but perform no conversion.  */
-  while (ISSPACE (*rp))
-    rp++;
-  while (!ISSPACE (*rp) && *rp != '\0')
-    rp++;
-  break;
+ {
+ while (ISSPACE (*rp))
+     rp++;
+ /* Read time zone but perform no conversion. Recognize the format
[-+a-zA-Z0-9]{3,}.  */
+ const char* start_rp = rp;
+ while ((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z') || (*rp
>= '0' && *rp <= '9'))
+ rp++;
+ if (start_rp+3 < rp)
+ return NULL;
+ }
  case 'z':
   /* We recognize four formats:
      1. Two digits specify hours.
diff --git a/time/tst-strptime.c b/time/tst-strptime.c
index 3dae9e0594..40145cb109 100644
--- a/time/tst-strptime.c
+++ b/time/tst-strptime.c
@@ -48,6 +48,8 @@ static const struct
     6, 0, 0, 1 },
   { "en_US.ISO-8859-1", "2000-01-01 08:12:21 PM", "%Y-%m-%d %I:%M:%S %p",
     6, 0, 0, 1 },
+  { "en_US.ISO-8859-1", "2000-01-01 08:12:21 AM CST/", "%Y-%m-%d %I:%M:%S
%p %Z/",
+    6, 0, 0, 1},
   { "ja_JP.EUC-JP", "2001 20 \xb7\xee", "%Y %U %a", 1, 140, 4, 21 },
   { "ja_JP.EUC-JP", "2001 21 \xb7\xee", "%Y %W %a", 1, 140, 4, 21 },
   /* Most of the languages do not need the declension of the month names
-- 
2.39.3

On Fri, Jul 14, 2023 at 11:11 AM Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 2023-07-14 07:52, Stanley Lancaster via Libc-alpha wrote:
>
> >         /* Read timezone but perform no conversion.  */
> > +       /* we recognize the format [-+a-zA-Z0-9]{3,} */
>
> Use GNU style in comment with active voice sentences and two spaces
> after sentence end, e.g., "/* Read time zone but perform no conversion.
> Recognize the format [-+a-zA-Z0-9]{3,}.  */".
>
>
> > +       const char* stop_rp = rp + 3;
>
> Again, GNU style: "char *stop_rp" not "char* stop_rp".
>
> More important, this has undefined behavior if rp + 3 overflows.
> Instead, count the number of bytes after the loop finishes, and make
> sure it's 3 or more.
>
>
> > +       while (((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z')
> || (*rp >= '0' && *rp <= '9')) && (rp < stop_rp) && *rp != '\0')
>
> Omit "&& *rp != '\0'"; it's redundant. Reindent to 80 columns.
>
>
> > +  { "C", "1999CST0502123412", "%Y%Z%m%d%H%M%S", 0, 121, 4, 2 },
>
> I don't see how this test passes. "CST0502123412" is treated as a time
> zone abbreviation, so the only info is the year. Did you run the tests?
> If so, why did this test pass? If not, please run.
>
> Since the patch does not fix BZ#16088, it needs a commit message that
> describes what the patch does and why it's a win even though it doesn't
> fix. In particular, the patch does not set tm_zone, and there's a reason
> for that, and this should be explained.
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fix %Z parsing in strptime [BZ #16088]
  2023-07-14 14:52 Stanley Lancaster
@ 2023-07-14 16:11 ` Paul Eggert
  2023-07-18 17:17   ` Stanley Lancaster
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Eggert @ 2023-07-14 16:11 UTC (permalink / raw)
  To: Stanley Lancaster; +Cc: libc-alpha

On 2023-07-14 07:52, Stanley Lancaster via Libc-alpha wrote:

>   	  /* Read timezone but perform no conversion.  */
> +	  /* we recognize the format [-+a-zA-Z0-9]{3,} */

Use GNU style in comment with active voice sentences and two spaces 
after sentence end, e.g., "/* Read time zone but perform no conversion. 
Recognize the format [-+a-zA-Z0-9]{3,}.  */".


> +	  const char* stop_rp = rp + 3;

Again, GNU style: "char *stop_rp" not "char* stop_rp".

More important, this has undefined behavior if rp + 3 overflows. 
Instead, count the number of bytes after the loop finishes, and make 
sure it's 3 or more.


> +	  while (((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z') || (*rp >= '0' && *rp <= '9')) && (rp < stop_rp) && *rp != '\0')

Omit "&& *rp != '\0'"; it's redundant. Reindent to 80 columns.


> +  { "C", "1999CST0502123412", "%Y%Z%m%d%H%M%S", 0, 121, 4, 2 },

I don't see how this test passes. "CST0502123412" is treated as a time 
zone abbreviation, so the only info is the year. Did you run the tests? 
If so, why did this test pass? If not, please run.

Since the patch does not fix BZ#16088, it needs a commit message that 
describes what the patch does and why it's a win even though it doesn't 
fix. In particular, the patch does not set tm_zone, and there's a reason 
for that, and this should be explained.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Fix %Z parsing in strptime [BZ #16088]
@ 2023-07-14 14:52 Stanley Lancaster
  2023-07-14 16:11 ` Paul Eggert
  0 siblings, 1 reply; 7+ messages in thread
From: Stanley Lancaster @ 2023-07-14 14:52 UTC (permalink / raw)
  To: libc-alpha; +Cc: Stanley Lancaster

---
 time/strptime_l.c   | 5 ++++-
 time/tst-strptime.c | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/time/strptime_l.c b/time/strptime_l.c
index 85c3249fcc..5954015c4e 100644
--- a/time/strptime_l.c
+++ b/time/strptime_l.c
@@ -770,9 +770,12 @@ __strptime_internal (const char *rp, const char *fmt, struct tm *tmp,
 	  break;
 	case 'Z':
 	  /* Read timezone but perform no conversion.  */
+	  /* we recognize the format [-+a-zA-Z0-9]{3,} */
 	  while (ISSPACE (*rp))
 	    rp++;
-	  while (!ISSPACE (*rp) && *rp != '\0')
+	  
+	  const char* stop_rp = rp + 3;
+	  while (((*rp >= 'A' && *rp <= 'Z') || (*rp >= 'a' && *rp <= 'z') || (*rp >= '0' && *rp <= '9')) && (rp < stop_rp) && *rp != '\0')
 	    rp++;
 	  break;
 	case 'z':
diff --git a/time/tst-strptime.c b/time/tst-strptime.c
index 3dae9e0594..31d6945ef1 100644
--- a/time/tst-strptime.c
+++ b/time/tst-strptime.c
@@ -38,6 +38,7 @@ static const struct
   { "C", "03/03/00", "%D", 5, 62, 2, 3 },
   { "C", "9/9/99", "%x", 4, 251, 8, 9 },
   { "C", "19990502123412", "%Y%m%d%H%M%S", 0, 121, 4, 2 },
+  { "C", "1999CST0502123412", "%Y%Z%m%d%H%M%S", 0, 121, 4, 2 },
   { "C", "2001 20 Mon", "%Y %U %a", 1, 140, 4, 21 },
   { "C", "2001 21 Mon", "%Y %W %a", 1, 140, 4, 21 },
   { "C", "2001 21 Mon", "%2000Y %W %a", 1, 140, 4, 21 },
-- 
2.39.3


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-07-19  2:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-17 12:57 Fix %Z parsing in strptime [BZ #16088] Wilco Dijkstra
2023-07-17 17:19 ` Paul Eggert
  -- strict thread matches above, loose matches on Subject: below --
2023-07-14 14:52 Stanley Lancaster
2023-07-14 16:11 ` Paul Eggert
2023-07-18 17:17   ` Stanley Lancaster
2023-07-18 17:22     ` Stanley Lancaster
2023-07-19  2:13     ` Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).