bug fix for hp-timing.h (aarch64)

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* bug fix for hp-timing.h (aarch64)
@ 2023-01-11 16:44 Tang, Jun
  2023-01-11 17:22 ` Zack Weinberg
  0 siblings, 1 reply; 13+ messages in thread
From: Tang, Jun @ 2023-01-11 16:44 UTC (permalink / raw)
  To: libc-alpha; +Cc: Pop, Sebastian

[-- Attachment #1: Type: text/plain, Size: 909 bytes --]

On Graviton 3 the frequency is 1050000000 and the original function hp-timing.h no longer gives the correct answer. The following patch can fix the bug:

```
diff --git a/sysdeps/aarch64/hp-timing.h b/sysdeps/aarch64/hp-timing.h
index c8469e3011..c5a1204d2c 100644
--- a/sysdeps/aarch64/hp-timing.h
+++ b/sysdeps/aarch64/hp-timing.h
@@ -41,7 +41,7 @@ typedef uint64_t hp_timing_t;
#define HP_TIMING_DIFF(Diff, Start, End)                       \
({  hp_timing_t freq;                                          \
     __asm__ __volatile__ ("mrs %0, cntfrq_el0" : "=r" (freq)); \
-   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
+   (Diff) = (hp_timing_t)((End) - (Start)) * ((1000000000.0) / freq);  \
})
 #endif /* hp-timing.h */
```

I have also submitted a bug report with the proposed changes at -
https://sourceware.org/bugzilla/show_bug.cgi?id=29329

Regards,

Jun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-11 16:44 bug fix for hp-timing.h (aarch64) Tang, Jun
@ 2023-01-11 17:22 ` Zack Weinberg
  2023-01-31 14:47   ` Tang, Jun
  0 siblings, 1 reply; 13+ messages in thread
From: Zack Weinberg @ 2023-01-11 17:22 UTC (permalink / raw)
  To: GNU libc development

On Wed, Jan 11, 2023, at 11:44 AM, Tang, Jun wrote:
> On Graviton 3 the frequency is 1050000000 and the original function 
> hp-timing.h no longer gives the correct answer. The following patch can 
> fix the bug:

Please try this instead:

> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \

This avoids using floating-point arithmetic but should still do the rescale correctly.
(The multiplier and divisor are both guaranteed to be less than 2**32, so, as long as we do the multiplication in 64-bit arithmetic, we cannot lose bits via overflow.)

zw

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: bug fix for hp-timing.h (aarch64)
  2023-01-11 17:22 ` Zack Weinberg
@ 2023-01-31 14:47   ` Tang, Jun
  0 siblings, 0 replies; 13+ messages in thread
From: Tang, Jun @ 2023-01-31 14:47 UTC (permalink / raw)
  To: Zack Weinberg, GNU libc development

Zack,

After rounds of discussion, the suggested solution is the best option and works for all existing benches.  Let me know if I need to submit another request to get this merged

> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \

Regards,

Jun

-----Original Message-----
From: Zack Weinberg <zack@owlfolio.org> 
Sent: Wednesday, January 11, 2023 11:22 AM
To: GNU libc development <libc-alpha@sourceware.org>
Subject: Re: bug fix for hp-timing.h (aarch64)

On Wed, Jan 11, 2023, at 11:44 AM, Tang, Jun wrote:
> On Graviton 3 the frequency is 1050000000 and the original function 
> hp-timing.h no longer gives the correct answer. The following patch 
> can fix the bug:

Please try this instead:

> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \

This avoids using floating-point arithmetic but should still do the rescale correctly.
(The multiplier and divisor are both guaranteed to be less than 2**32, so, as long as we do the multiplication in 64-bit arithmetic, we cannot lose bits via overflow.)

zw

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-16 17:01             ` Noah Goldstein
@ 2023-01-16 18:35               ` Wilco Dijkstra
  0 siblings, 0 replies; 13+ messages in thread
From: Wilco Dijkstra @ 2023-01-16 18:35 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Zack Weinberg, Tang, Jun, GNU libc development

Hi Noah,

>> Many benchtests run a fixed number of iterations on varying inputs (eg.
>> string of size 1 vs 10000), so you wouldn't statically know which timer to use.
>> Targets that have timer overflow issues (eg. Alpha) don't use it at all in the
>> benchtests.
>
> You could probably estimate with a static number of iterations but payloads
> can be order of magnitude different i.e long strstr inputs vs short
> memset.

Yes, the iteration count remains fixed for all subtests while the inputs (and time
taken) vary by several orders of magnitude. So a single benchmark could report
inaccurate results on small inputs due to not running long enough and on large
inputs due to running for too long...

> How about then 3 APIs?
>
> `TIMING_NOW` -> statically choose `TIMING_NOW_SHORT` vs `TIMING_NOW_LONG`
> based on iter count, then explicit timers?

I'm still not sure what you're trying to solve here - the underlying problem is not
the timers but the fixed iteration counts. We could scale iteration counts based on
input size or the time it takes for that input.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-16 16:33           ` Wilco Dijkstra
@ 2023-01-16 17:01             ` Noah Goldstein
  2023-01-16 18:35               ` Wilco Dijkstra
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2023-01-16 17:01 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Zack Weinberg, Tang, Jun, GNU libc development

On Mon, Jan 16, 2023 at 8:34 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Noah,
>
> > If it's a long running benchmark (increased iteration count) then the
> > OS preemption
> > effect should scale up and the constant cost of the 2x function calls
> > / going to the
> > OS for getting the time should scale down proportionally.
> >
> > Maybe we just need two timer apis for `TIMING_NOW_LONG` and
> > `TIMING_NOW_SHORT`?
>
> Many benchtests run a fixed number of iterations on varying inputs (eg.
> string of size 1 vs 10000), so you wouldn't statically know which timer to use.
> Targets that have timer overflow issues (eg. Alpha) don't use it at all in the
> benchtests.

You could probably estimate with a static number of iterations but payloads
can be order of magnitude different i.e long strstr inputs vs short
memset.

How about then 3 APIs?

`TIMING_NOW` -> statically choose `TIMING_NOW_SHORT` vs `TIMING_NOW_LONG`
based on iter count, then explicit timers?

>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-12 20:51         ` Noah Goldstein
@ 2023-01-16 16:33           ` Wilco Dijkstra
  2023-01-16 17:01             ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Wilco Dijkstra @ 2023-01-16 16:33 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Zack Weinberg, Tang, Jun, GNU libc development

Hi Noah,

> If it's a long running benchmark (increased iteration count) then the
> OS preemption
> effect should scale up and the constant cost of the 2x function calls
> / going to the
> OS for getting the time should scale down proportionally.
>
> Maybe we just need two timer apis for `TIMING_NOW_LONG` and
> `TIMING_NOW_SHORT`?

Many benchtests run a fixed number of iterations on varying inputs (eg.
string of size 1 vs 10000), so you wouldn't statically know which timer to use. 
Targets that have timer overflow issues (eg. Alpha) don't use it at all in the
benchtests.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-12 20:32       ` Wilco Dijkstra
@ 2023-01-12 20:51         ` Noah Goldstein
  2023-01-16 16:33           ` Wilco Dijkstra
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2023-01-12 20:51 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Zack Weinberg, Tang, Jun, GNU libc development

On Thu, Jan 12, 2023 at 12:32 PM Wilco Dijkstra via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Hi Zack,
>
> >> On AArch64 the counter is absolutely solid - it's fixed frequency and system
> >> wide monotonically increasing. So there are no issues with varying clock
> >> frequency or rescheduling to a different core, you get elapsed time at
> >> nanosecond accuracy.
> >
> > ... but that's not what you _want_ for a benchmark!  You don't want the time
> > to tick up while the benchmark process is preempted (not executing on _any_
> > core), for instance.  Or am I misunderstanding what this is used for?
>
> Basically it is a lower overhead version of clock_get_time (CLOCK_MONOTONIC)
> so you can use it to benchmark even small bits of code (since it doesn't insert
> extra function calls or affect register allocation in a major way).
>
> We don't strictly need to scale the result, you could just subtract the counts.
> However it makes the result equivalent to clock_get_time and allows one to
> compute bytes copied/ns if needed or compare different CPUs.
>
> It's hard to account for the effects of interrupts or preemption. And GLIBC
> benchtests aren't all that smart either, so I often end up having to increase the
> iteration count to average out the OS effects (including frequency scaling).
> We could improve this, eg. take several samples at lower iteration counts
> and only report the fastest one.

If it's a long running benchmark (increased iteration count) then the
OS preemption
effect should scale up and the constant cost of the 2x function calls
/ going to the
OS for getting the time should scale down proportionally.

Maybe we just need two timer apis for `TIMING_NOW_LONG` and
`TIMING_NOW_SHORT`?

>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-12 18:54     ` Zack Weinberg
@ 2023-01-12 20:32       ` Wilco Dijkstra
  2023-01-12 20:51         ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Wilco Dijkstra @ 2023-01-12 20:32 UTC (permalink / raw)
  To: Zack Weinberg, Tang, Jun; +Cc: GNU libc development

Hi Zack,

>> On AArch64 the counter is absolutely solid - it's fixed frequency and system
>> wide monotonically increasing. So there are no issues with varying clock
>> frequency or rescheduling to a different core, you get elapsed time at
>> nanosecond accuracy.
>
> ... but that's not what you _want_ for a benchmark!  You don't want the time
> to tick up while the benchmark process is preempted (not executing on _any_
> core), for instance.  Or am I misunderstanding what this is used for?

Basically it is a lower overhead version of clock_get_time (CLOCK_MONOTONIC)
so you can use it to benchmark even small bits of code (since it doesn't insert
extra function calls or affect register allocation in a major way).

We don't strictly need to scale the result, you could just subtract the counts.
However it makes the result equivalent to clock_get_time and allows one to
compute bytes copied/ns if needed or compare different CPUs.

It's hard to account for the effects of interrupts or preemption. And GLIBC
benchtests aren't all that smart either, so I often end up having to increase the
iteration count to average out the OS effects (including frequency scaling).
We could improve this, eg. take several samples at lower iteration counts
and only report the fastest one.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-12 18:07   ` Wilco Dijkstra
@ 2023-01-12 18:54     ` Zack Weinberg
  2023-01-12 20:32       ` Wilco Dijkstra
  0 siblings, 1 reply; 13+ messages in thread
From: Zack Weinberg @ 2023-01-12 18:54 UTC (permalink / raw)
  To: Wilco Dijkstra, Tang, Jun; +Cc: GNU libc development

On Thu, Jan 12, 2023, at 1:07 PM, Wilco Dijkstra wrote:
> Hi Zack,
>
>> I'm surprised to hear you say that.  To me, 16 seconds is an absurdly _long_
>> interval for use of hp-timing.h.  For anything longer than a few tens of
>> milliseconds, the overhead of two system calls is negligible, but both CPU
>> frequency changes and getting descheduled in the middle of the
>> measurement become serious concerns.  So it seems more appropriate to
>> me to use clock_gettime with a per-process or per-thread clock ID for
>> measurements on the scale of tens of seconds.
>
> On AArch64 the counter is absolutely solid - it's fixed frequency and system
> wide monotonically increasing. So there are no issues with varying clock
> frequency or rescheduling to a different core, you get elapsed time at
> nanosecond accuracy.

... but that's not what you _want_ for a benchmark!  You don't want the time to tick up while the benchmark process is preempted (not executing on _any_ core), for instance.  Or am I misunderstanding what this is used for?

zw

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-12 14:38 ` Tang, Jun
@ 2023-01-12 18:07   ` Wilco Dijkstra
  2023-01-12 18:54     ` Zack Weinberg
  0 siblings, 1 reply; 13+ messages in thread
From: Wilco Dijkstra @ 2023-01-12 18:07 UTC (permalink / raw)
  To: Tang, Jun, zack; +Cc: 'GNU C Library'

Hi Zack,

> I'm surprised to hear you say that.  To me, 16 seconds is an absurdly _long_
> interval for use of hp-timing.h.  For anything longer than a few tens of
> milliseconds, the overhead of two system calls is negligible, but both CPU
> frequency changes and getting descheduled in the middle of the
> measurement become serious concerns.  So it seems more appropriate to
> me to use clock_gettime with a per-process or per-thread clock ID for
> measurements on the scale of tens of seconds.

On AArch64 the counter is absolutely solid - it's fixed frequency and system
wide monotonically increasing. So there are no issues with varying clock
frequency or rescheduling to a different core, you get elapsed time at
nanosecond accuracy.

With such a high quality counter it seems a waste to allow it to overflow
after a short time when there is a simple way to avoid it. Note we could
remove the volatile from the asm as that would allow more lifting of the
cntfreq and start the FP division earlier.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bug fix for hp-timing.h (aarch64)
  2023-01-11 22:49 Wilco Dijkstra
  2023-01-12 14:38 ` Tang, Jun
@ 2023-01-12 15:11 ` Zack Weinberg
  1 sibling, 0 replies; 13+ messages in thread
From: Zack Weinberg @ 2023-01-12 15:11 UTC (permalink / raw)
  To: GNU libc development

On Wed, Jan 11, 2023, at 5:49 PM, Wilco Dijkstra via Libc-alpha wrote:
>> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
>> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \
>>
>> This avoids using floating-point arithmetic but should still do the rescale correctly.
>
> However this will overflow after about 16 seconds - and that's short 
> even for basic benchmarking.

I'm surprised to hear you say that.  To me, 16 seconds is an absurdly _long_ interval for use of hp-timing.h.  For anything longer than a few tens of milliseconds, the overhead of two system calls is negligible, but both CPU frequency changes and getting descheduled in the middle of the measurement become serious concerns.  So it seems more appropriate to me to use clock_gettime with a per-process or per-thread clock ID for measurements on the scale of tens of seconds.

This isn't a sustained objection to the patch, I just want to understand where you're coming from.

zw

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: bug fix for hp-timing.h (aarch64)
  2023-01-11 22:49 Wilco Dijkstra
@ 2023-01-12 14:38 ` Tang, Jun
  2023-01-12 18:07   ` Wilco Dijkstra
  2023-01-12 15:11 ` Zack Weinberg
  1 sibling, 1 reply; 13+ messages in thread
From: Tang, Jun @ 2023-01-12 14:38 UTC (permalink / raw)
  To: Wilco Dijkstra, zack; +Cc: 'GNU C Library'

Zack, Wilco,

Thanks for the suggestions,

Hp-timing.h is used for glibc bench timing, and calculating in float is to avoid any future overflowing (the chance is probably very small). 

Regards,

Jun

-----Original Message-----
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com> 
Sent: Wednesday, January 11, 2023 4:49 PM
To: zack@owlfolio.org; Tang, Jun <juntangc@amazon.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>
Subject: [EXTERNAL] bug fix for hp-timing.h (aarch64)

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

Hi Zack,

> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \
>
> This avoids using floating-point arithmetic but should still do the rescale correctly.
> (The multiplier and divisor are both guaranteed to be less than 2**32, 
> so, as long as we do the multiplication in 64-bit arithmetic, we 
> cannot lose bits via overflow.)

However this will overflow after about 16 seconds - and that's short even for basic benchmarking. Since this code is not performance critical, using floating point here is reasonable.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug fix for hp-timing.h (aarch64)
@ 2023-01-11 22:49 Wilco Dijkstra
  2023-01-12 14:38 ` Tang, Jun
  2023-01-12 15:11 ` Zack Weinberg
  0 siblings, 2 replies; 13+ messages in thread
From: Wilco Dijkstra @ 2023-01-11 22:49 UTC (permalink / raw)
  To: zack, juntangc; +Cc: 'GNU C Library'

Hi Zack,

> -   (Diff) = ((End) - (Start)) * (UINT64_C(1000000000) / freq); \
> +   (Diff) = (((End) - (Start)) * UINT64_C(1000000000)) / freq; \
>
> This avoids using floating-point arithmetic but should still do the rescale correctly.
> (The multiplier and divisor are both guaranteed to be less than 2**32, so, as long
> as we do the multiplication in 64-bit arithmetic, we cannot lose bits via overflow.)

However this will overflow after about 16 seconds - and that's short even for basic
benchmarking. Since this code is not performance critical, using floating point here
is reasonable.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-01-31 14:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-11 16:44 bug fix for hp-timing.h (aarch64) Tang, Jun
2023-01-11 17:22 ` Zack Weinberg
2023-01-31 14:47   ` Tang, Jun
2023-01-11 22:49 Wilco Dijkstra
2023-01-12 14:38 ` Tang, Jun
2023-01-12 18:07   ` Wilco Dijkstra
2023-01-12 18:54     ` Zack Weinberg
2023-01-12 20:32       ` Wilco Dijkstra
2023-01-12 20:51         ` Noah Goldstein
2023-01-16 16:33           ` Wilco Dijkstra
2023-01-16 17:01             ` Noah Goldstein
2023-01-16 18:35               ` Wilco Dijkstra
2023-01-12 15:11 ` Zack Weinberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).