public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
@ 2017-10-11 21:20 Wilco Dijkstra
  2017-10-11 21:39 ` Adhemerval Zanella
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Wilco Dijkstra @ 2017-10-11 21:20 UTC (permalink / raw)
  To: adhemerval.zanella, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy

Adhemerval Zanella wrote:
> LGTM. Regarding the point raised by Szabolcs:
>
>  - I still prefer to have the macros in one file instead of multiple ones.
>    It makes checking all the possible code paths easier, and one could
>    theoretically build the desired memset by just defining the required
>    macros instead of pulling source code from different files.

At the design level, I don't think that's a good idea. It's essential that this
code is easy to read and understand. It must be 100% correct and is
extremely performance sensitive to minor changes (eg. instruction order
or alignment can have major impact). Inline macros and lots of complex
#ifdefs in the code mean getting this right becomes almost impossible.
I really don't want to create large blobs of unmaintainable assembly code
like several other targets.

Note we also should benchmark this on various other cores to see what
the impact is. I wrote this memset code for specific reasons - changing that
could have a big impact on some cores, so we need to show this doesn't
cause regressions.

Also we need to decide on how to read the ZVA size. I don't think there is
any point in supporting that many options (reading it, using a global, using
an ifunc and not using ZVA at all). The very first memset had a few of these
options, and I removed them precisely because it created way too many
permutations to test and benchmark. So if we add ifuncs, then we shouldn't
need to use a global too. For the dynamic linker we could just use a basic
memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
and no ZVA).

Finally we'll need to look into more detail at the new memcpy benchmarks -
while looping through memory seems like a good idea, it appears like it
only increments by 1. So at first sight it's still testing the exact same thing as
the existing benchmarks - all data is always cached in L1. For memset I guess
we're still missing a randomized benchmark based on real trace data.

Wilco


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:20 [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function Wilco Dijkstra
@ 2017-10-11 21:39 ` Adhemerval Zanella
  2017-10-11 21:44   ` Andrew Pinski
  2017-10-11 22:21   ` Wilco Dijkstra
  2017-10-12  1:48 ` Siddhesh Poyarekar
  2017-10-12  1:54 ` memcpy walk benchmark [was: Hoist ZVA check out of the memset function] Siddhesh Poyarekar
  2 siblings, 2 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2017-10-11 21:39 UTC (permalink / raw)
  To: Wilco Dijkstra, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy



On 11/10/2017 18:20, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>> LGTM. Regarding the point raised by Szabolcs:
>>
>>  - I still prefer to have the macros in one file instead of multiple ones.
>>    It makes checking all the possible code paths easier, and one could
>>    theoretically build the desired memset by just defining the required
>>    macros instead of pulling source code from different files.
> 
> At the design level, I don't think that's a good idea. It's essential that this
> code is easy to read and understand. It must be 100% correct and is
> extremely performance sensitive to minor changes (eg. instruction order
> or alignment can have major impact). Inline macros and lots of complex
> #ifdefs in the code mean getting this right becomes almost impossible.
> I really don't want to create large blobs of unmaintainable assembly code
> like several other targets.

My idea is to prevent having different code paths in different files that
are 'injected' by macros or ifdefs.  Getting all of them in one place is
better imho than spreading them over multiple files.  Now, what you are
advocating is a different topic: whether the modifications of the generic
implementation are valuable.

> 
> Note we also should benchmark this on various other cores to see what
> the impact is. I wrote this memset code for specific reasons - changing that
> could have a big impact on some cores, so we need to show this doesn't
> cause regressions.

Ideally I would prefer a more concise selection, such as:

   1. A memset that reads dczid_el0 using mrs (as the default one).
      This will be selected as the default and thus should not incur any
      regression.

   2. A memset that reads the ZVA size from a global variable and is
      selected solely for falkor (ideally handling both 64- and 128-byte
      cacheline sizes).

   3. A no-ZVA variant for the case where the ZVA size is 0.

> 
> Also we need to decide on how to read the ZVA size. I don't think there is
> any point in supporting that many options (reading it, using a global, using
> an ifunc and not using ZVA at all). The very first memset had a few of these
> options, and I removed them precisely because it created way too many
> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
> need to use a global too. For the dynamic linker we could just use a basic
> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
> and no ZVA).

Siddhesh can help me out, but the idea is *precisely* that Falkor is faster
using a global variable than accessing dczid_el0, as shown by its
benchmarks.  Now, the discussion is indeed whether this microarchitecture
difference is worth adding ifunc variants for.

The loader is already not using an ifunc (and afaik there is no support
to do so).

> 
> Finally we'll need to look into more detail at the new memcpy benchmarks -
> while looping through memory seems like a good idea, it appears like it
> only increments by 1. So at first sight it's still testing the exact same thing as
> the existing benchmarks - all data is always cached in L1. For memset I guess
> we're still missing a randomized benchmark based on real trace data.

I think we can move this discussion to the benchmark thread itself.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:39 ` Adhemerval Zanella
@ 2017-10-11 21:44   ` Andrew Pinski
  2017-10-11 21:47     ` Adhemerval Zanella
  2017-10-12  1:30     ` Siddhesh Poyarekar
  2017-10-11 22:21   ` Wilco Dijkstra
  1 sibling, 2 replies; 21+ messages in thread
From: Andrew Pinski @ 2017-10-11 21:44 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Wilco Dijkstra, siddhesh, libc-alpha, nd, Szabolcs Nagy

On Wed, Oct 11, 2017 at 2:39 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 11/10/2017 18:20, Wilco Dijkstra wrote:
>> Adhemerval Zanella wrote:
>>> LGTM. Regarding the point raised by Szabolcs:
>>>
>>>  - I still prefer to have the macros in one file instead of multiple ones.
>>>    It makes checking all the possible code paths easier, and one could
>>>    theoretically build the desired memset by just defining the required
>>>    macros instead of pulling source code from different files.
>>
>> At the design level, I don't think that's a good idea. It's essential that this
>> code is easy to read and understand. It must be 100% correct and is
>> extremely performance sensitive to minor changes (eg. instruction order
>> or alignment can have major impact). Inline macros and lots of complex
>> #ifdefs in the code mean getting this right becomes almost impossible.
>> I really don't want to create large blobs of unmaintainable assembly code
>> like several other targets.
>
> My idea is to prevent having different code paths in different files that
> are 'injected' by macros or ifdefs.  Getting all of them in one place is
> better imho than spreading them over multiple files.  Now, what you are
> advocating is a different topic: whether the modifications of the generic
> implementation are valuable.
>
>>
>> Note we also should benchmark this on various other cores to see what
>> the impact is. I wrote this memset code for specific reasons - changing that
>> could have a big impact on some cores, so we need to show this doesn't
>> cause regressions.
>
> Ideally I would prefer a more concise selection, such as:
>
>    1. A memset that reads dczid_el0 using mrs (as the default one).
>       This will be selected as the default and thus should not incur any
>       regression.
>
>    2. A memset that reads the ZVA size from a global variable and is
>       selected solely for falkor (ideally handling both 64- and 128-byte
>       cacheline sizes).
>
>    3. A no-ZVA variant for the case where the ZVA size is 0.
>
>>
>> Also we need to decide on how to read the ZVA size. I don't think there is
>> any point in supporting that many options (reading it, using a global, using
>> an ifunc and not using ZVA at all). The very first memset had a few of these
>> options, and I removed them precisely because it created way too many
>> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
>> need to use a global too. For the dynamic linker we could just use a basic
>> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
>> and no ZVA).
>
> Siddhesh can help me out, but the idea is *precisely* that Falkor is faster
> using a global variable than accessing dczid_el0, as shown by its
> benchmarks.  Now, the discussion is indeed whether this microarchitecture
> difference is worth adding ifunc variants for.

For at least the micro-archs I work with, reading dczid_el0 can and
will most likely be faster than reading from global memory.
Especially if the global memory is not in the L1 cache.  This is one
case where a micro-benchmark can fall down.  If the global memory is
in L1 cache the read is 3 cycles while reading from dczid_el0 is 4
cycles, but once it is out of L1 cache, reading becomes 10x worse plus
it pollutes the L1 cache.
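For reference, the per-call work being compared here is the mrs read plus the field decode. A minimal portable model of the DCZID_EL0 decoding is sketched below (the mrs instruction itself is aarch64-only and is not modeled; this helper only shows what the check computes):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the DCZID_EL0 decode: bits [3:0] (BS) hold log2 of the DC ZVA
   block size in 4-byte words, and bit 4 (DZP) set means DC ZVA is
   prohibited.  Returns the block size in bytes, or 0 if ZVA is unusable.  */
static unsigned
zva_block_size (uint64_t dczid)
{
  if (dczid & 0x10)              /* DZP: DC ZVA not permitted.  */
    return 0;
  return 4U << (dczid & 0xf);    /* BS = 4 gives the common 64 bytes.  */
}
```

A memset that reads the register pays for the mrs plus this shift on every call, which is the cost an ifunc or a cached global tries to avoid.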

Thanks,
Andrew

>
> The loader is already not using an ifunc (and afaik there is no support
> to do so).
>
>>
>> Finally we'll need to look into more detail at the new memcpy benchmarks -
>> while looping through memory seems like a good idea, it appears like it
>> only increments by 1. So at first sight it's still testing the exact same thing as
>> the existing benchmarks - all data is always cached in L1. For memset I guess
>> we're still missing a randomized benchmark based on real trace data.
>
> I think we can move this discussion to the benchmark thread itself.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:44   ` Andrew Pinski
@ 2017-10-11 21:47     ` Adhemerval Zanella
  2017-10-12  1:30     ` Siddhesh Poyarekar
  1 sibling, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2017-10-11 21:47 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Wilco Dijkstra, siddhesh, libc-alpha, nd, Szabolcs Nagy



On 11/10/2017 18:44, Andrew Pinski wrote:
> On Wed, Oct 11, 2017 at 2:39 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 11/10/2017 18:20, Wilco Dijkstra wrote:
>>> Adhemerval Zanella wrote:
>>>> LGTM. Regarding the point raised by Szabolcs:
>>>>
>>>>  - I still prefer to have the macros in one file instead of multiple ones.
>>>>    It makes checking all the possible code paths easier, and one could
>>>>    theoretically build the desired memset by just defining the required
>>>>    macros instead of pulling source code from different files.
>>>
>>> At the design level, I don't think that's a good idea. It's essential that this
>>> code is easy to read and understand. It must be 100% correct and is
>>> extremely performance sensitive to minor changes (eg. instruction order
>>> or alignment can have major impact). Inline macros and lots of complex
>>> #ifdefs in the code mean getting this right becomes almost impossible.
>>> I really don't want to create large blobs of unmaintainable assembly code
>>> like several other targets.
>>
>> My idea is to prevent having different code paths in different files that
>> are 'injected' by macros or ifdefs.  Getting all of them in one place is
>> better imho than spreading them over multiple files.  Now, what you are
>> advocating is a different topic: whether the modifications of the generic
>> implementation are valuable.
>>
>>>
>>> Note we also should benchmark this on various other cores to see what
>>> the impact is. I wrote this memset code for specific reasons - changing that
>>> could have a big impact on some cores, so we need to show this doesn't
>>> cause regressions.
>>
>> Ideally I would prefer a more concise selection, such as:
>>
>>    1. A memset that reads dczid_el0 using mrs (as the default one).
>>       This will be selected as the default and thus should not incur any
>>       regression.
>>
>>    2. A memset that reads the ZVA size from a global variable and is
>>       selected solely for falkor (ideally handling both 64- and 128-byte
>>       cacheline sizes).
>>
>>    3. A no-ZVA variant for the case where the ZVA size is 0.
>>
>>>
>>> Also we need to decide on how to read the ZVA size. I don't think there is
>>> any point in supporting that many options (reading it, using a global, using
>>> an ifunc and not using ZVA at all). The very first memset had a few of these
>>> options, and I removed them precisely because it created way too many
>>> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
>>> need to use a global too. For the dynamic linker we could just use a basic
>>> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
>>> and no ZVA).
>>
>> Siddhesh can help me out, but the idea is *precisely* that Falkor is faster
>> using a global variable than accessing dczid_el0, as shown by its
>> benchmarks.  Now, the discussion is indeed whether this microarchitecture
>> difference is worth adding ifunc variants for.
> 
> For at least the micro-archs I work with, reading dczid_el0 can and
> will most likely be faster than reading from global memory.
> Especially if the global memory is not in the L1 cache.  This is one
> case where a micro-benchmark can fall down.  If the global memory is
> in L1 cache the read is 3 cycles while reading from dczid_el0 is 4
> cycles, but once it is out of L1 cache, reading becomes 10x worse plus
> it pollutes the L1 cache.

My understanding is that this is a Falkor idiosyncrasy.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:39 ` Adhemerval Zanella
  2017-10-11 21:44   ` Andrew Pinski
@ 2017-10-11 22:21   ` Wilco Dijkstra
  2017-10-12  0:01     ` Adhemerval Zanella
  2017-10-12  1:38     ` Siddhesh Poyarekar
  1 sibling, 2 replies; 21+ messages in thread
From: Wilco Dijkstra @ 2017-10-11 22:21 UTC (permalink / raw)
  To: Adhemerval Zanella, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy

Adhemerval Zanella wrote:

> My idea is to prevent having different code paths in different files that
> are 'injected' by macros or ifdefs.  Getting all of them in one place is
> better imho than spreading them over multiple files.  Now, what you are
> advocating is a different topic: whether the modifications of the generic
> implementation are valuable.

Well I'm talking about the patch as proposed. I can no longer understand
the order or the logic, and it is impossible to figure out which of the many
possible cases may have changed alignment of the code - whether by
accident or on purpose.

> Ideally I would prefer a more concise selection, such as:
>
>   1. A memset that reads dczid_el0 using mrs (as the default one).
>      This will be selected as the default and thus should not incur any
>      regression.
>
>   2. A memset that reads the ZVA size from a global variable and is
>      selected solely for falkor (ideally handling both 64- and 128-byte
>      cacheline sizes).
>
>   3. A no-ZVA variant for the case where the ZVA size is 0.

I don't see how that makes sense. If we decide to use an ifunc, then you
no longer need to check the ZVA size. So I just don't understand why we
need a global at all...

Wilco

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 22:21   ` Wilco Dijkstra
@ 2017-10-12  0:01     ` Adhemerval Zanella
  2017-10-12  1:38     ` Siddhesh Poyarekar
  1 sibling, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2017-10-12  0:01 UTC (permalink / raw)
  To: Wilco Dijkstra, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy



On 11/10/2017 19:21, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
> 
>> My idea is to prevent having different code paths in different files that
>> are 'injected' by macros or ifdefs.  Getting all of them in one place is
>> better imho than spreading them over multiple files.  Now, what you are
>> advocating is a different topic: whether the modifications of the generic
>> implementation are valuable.
> 
> Well I'm talking about the patch as proposed. I can no longer understand
> the order or the logic, and it is impossible to figure out which of the many
> possible cases may have changed alignment of the code - whether by
> accident or on purpose.

Right, that is a fair point.  Which alternative do you prefer for this kind
of change?  Different implementations with more linear code (potentially
duplicating them), or a common implementation with macros in each file?

> 
>> Ideally I would prefer a more concise selection, such as:
>>
>>    1. A memset that reads dczid_el0 using mrs (as the default one).
>>       This will be selected as the default and thus should not incur any
>>       regression.
>>
>>    2. A memset that reads the ZVA size from a global variable and is
>>       selected solely for falkor (ideally handling both 64- and 128-byte
>>       cacheline sizes).
>>
>>    3. A no-ZVA variant for the case where the ZVA size is 0.
> 
> I don't see how that makes sense. If we decide to use an ifunc, then you
> no longer need to check the ZVA size. So I just don't understand why we
> need a global at all...

This takes into consideration your point about possible regressions, since
the first option would be essentially what we have now.  But I do think
Siddhesh's proposal is a good way forward, and I can't see how it might
regress on current hardware, since it in fact removes the requirement
to read the ZVA size (and thus the cycles to adjust the loop based on it).
So what kind of regressions do you have in mind with the current proposal?

Siddhesh also stated that it improves things on different micro-architectures
(it also handles the case where the configuration does not advertise a ZVA
size).  I will try to run this patch on the hardware I have access to here
to check for possible regressions, but I recall I already tested on APM
X-GENE and it showed no regression.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:44   ` Andrew Pinski
  2017-10-11 21:47     ` Adhemerval Zanella
@ 2017-10-12  1:30     ` Siddhesh Poyarekar
  1 sibling, 0 replies; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12  1:30 UTC (permalink / raw)
  To: Andrew Pinski, Adhemerval Zanella
  Cc: Wilco Dijkstra, libc-alpha, nd, Szabolcs Nagy

On Thursday 12 October 2017 03:14 AM, Andrew Pinski wrote:
> For at least the micro-archs I work with, reading dczid_el0 can and
> will most likely be faster than reading from global memory.
> Especially if the global memory is not in the L1 cache.  This is one
> case where a micro-benchmark can fall down.  If the global memory is
> in L1 cache the read is 3 cycles while reading from dczid_el0 is 4
> cycles, but once it is out of L1 cache, reading becomes 10x worse plus
> it pollutes the L1 cache.

This is a falkor caveat - mrs ends up being significantly slower.  Also,
the question about reading a global is moot since I've dropped that
code path.  It only affected non-standard zva sizes anyway, so it doesn't
affect current cores.

Siddhesh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 22:21   ` Wilco Dijkstra
  2017-10-12  0:01     ` Adhemerval Zanella
@ 2017-10-12  1:38     ` Siddhesh Poyarekar
  1 sibling, 0 replies; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12  1:38 UTC (permalink / raw)
  To: Wilco Dijkstra, Adhemerval Zanella; +Cc: libc-alpha, nd, Szabolcs Nagy

On Thursday 12 October 2017 03:51 AM, Wilco Dijkstra wrote:
> Well I'm talking about the patch as proposed. I can no longer understand
> the order or the logic, and it is impossible to figure out which of the many
> possible cases may have changed alignment of the code - whether by
> accident or on purpose.

I haven't tried to get the alignments right; the alignments should be
similar in the default case (i.e. with MEMSET_ZVA == 1) since the code
is pretty much unchanged for it.  The alignments in other cases will
have changed and may even be slightly sub-optimal, but that can be tuned
further.  This is a significant jump in performance that we should
take, and then move forward from there.

As for the macros and ifdefs, there are two ways to go about it - either
maintain multiple independent copies of memset or have a single copy
with ifdefs that is slightly more complicated but is still fundamentally
the same.  I agree that the latter is harder to read but as long as it
is not prohibitively hard and is not attempting to consolidate very
different algorithms, it should be the way to go since it is
significantly easier to maintain in the long run.

Siddhesh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-11 21:20 [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function Wilco Dijkstra
  2017-10-11 21:39 ` Adhemerval Zanella
@ 2017-10-12  1:48 ` Siddhesh Poyarekar
  2017-10-12 11:14   ` Szabolcs Nagy
  2017-10-12  1:54 ` memcpy walk benchmark [was: Hoist ZVA check out of the memset function] Siddhesh Poyarekar
  2 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12  1:48 UTC (permalink / raw)
  To: Wilco Dijkstra, adhemerval.zanella; +Cc: libc-alpha, nd, Szabolcs Nagy

[I've responded to your point about macros elsewhere in the thread]

On Thursday 12 October 2017 02:50 AM, Wilco Dijkstra wrote:
> Note we also should benchmark this on various other cores to see what
> the impact is. I wrote this memset code for specific reasons - changing that
> could have a big impact on some cores, so we need to show this doesn't
> cause regressions.

Yes, and that's something I'll lean on you to study and fix since I
don't have access to all that hardware :)

That said, the effect of dropping the zva checks should be positive on
every core for memset(0).  The alignments might change things a bit, but I
think we should fix those as we go along rather than wait for them to be
perfect, since this gain is too big to pass up.  This also adds an
incentive to document our alignment decisions, since they weren't
documented in the earlier code.

> Also we need to decide on how to read the ZVA size. I don't think there is
> any point in supporting that many options (reading it, using a global, using
> an ifunc and not using ZVA at all). The very first memset had a few of these
> options, and I removed them precisely because it created way too many
> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
> need to use a global too. For the dynamic linker we could just use a basic
> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
> and no ZVA).

Sorry, I just noticed that I had not updated the commit text; I dropped
the static variable after discussion with Szabolcs.  The code now simply
uses the old style memset for non-standard zva sizes since we agreed
that it's not a problem worth solving right now given that no announced
cores have non-standard zva sizes.

Siddhesh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* memcpy walk benchmark [was: Hoist ZVA check out of the memset function]
  2017-10-11 21:20 [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function Wilco Dijkstra
  2017-10-11 21:39 ` Adhemerval Zanella
  2017-10-12  1:48 ` Siddhesh Poyarekar
@ 2017-10-12  1:54 ` Siddhesh Poyarekar
  2017-10-16 13:12   ` Wilco Dijkstra
  2 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12  1:54 UTC (permalink / raw)
  To: Wilco Dijkstra, adhemerval.zanella; +Cc: libc-alpha, nd, Szabolcs Nagy

On Thursday 12 October 2017 02:50 AM, Wilco Dijkstra wrote:
> Finally we'll need to look into more detail at the new memcpy benchmarks -
> while looping through memory seems like a good idea, it appears like it
> only increments by 1. So at first sight it's still testing the exact same thing as
> the existing benchmarks - all data is always cached in L1. For memset I guess
> we're still missing a randomized benchmark based on real trace data.

That's a slightly incorrect description of the benchmark.  The benchmark
walks through two buffers, one forward and one backward.  While it
increments one buffer, it decrements the other.  Also, it interchanges
the source and destination buffers for alternate memcpy calls.
Altogether it ends up ensuring that the L1 hit impact is significantly
reduced; the reduction in impact is proportional to the size of the
memcpy, so larger sizes would almost never hit L1.
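The walk described above can be sketched roughly as follows (a simplified model with hypothetical names; the actual glibc bench-memcpy-walk harness differs in its details):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of the walking benchmark: one buffer is walked
   forward, the other backward, and the source/destination roles swap on
   alternate memcpy calls, so neither buffer stays resident in L1 for
   larger copy sizes.  */
static void
walk_memcpy (char *a, char *b, size_t buflen, size_t n)
{
  if (n == 0 || buflen < 2 * n)
    return;
  size_t fwd = 0;               /* forward offset into a */
  size_t bwd = buflen - n;      /* backward offset into b */
  for (int i = 0; ; i++)
    {
      if (i & 1)
        memcpy (a + fwd, b + bwd, n);   /* backward buffer as source */
      else
        memcpy (b + bwd, a + fwd, n);   /* backward buffer as destination */
      fwd += n;
      if (bwd < n || fwd + n > buflen)
        break;
      bwd -= n;
    }
}
```

Since every call touches fresh regions of both buffers, each copy of size n evicts its own footprint rather than rereading a hot line, which is the L1-reduction effect described above.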

We are indeed missing a randomized memset benchmark.  I would be happy
to review one :)

Siddhesh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-12  1:48 ` Siddhesh Poyarekar
@ 2017-10-12 11:14   ` Szabolcs Nagy
  2017-10-12 12:57     ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Szabolcs Nagy @ 2017-10-12 11:14 UTC (permalink / raw)
  To: siddhesh, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On 12/10/17 02:48, Siddhesh Poyarekar wrote:
> [I've responded to your point about macros elsewhere in the thread]
> 
> On Thursday 12 October 2017 02:50 AM, Wilco Dijkstra wrote:
>> Note we also should benchmark this on various other cores to see what
>> the impact is. I wrote this memset code for specific reasons - changing that
>> could have a big impact on some cores, so we need to show this doesn't
>> cause regressions.
> 
> Yes, and that's something I'll lean on you to study and fix since I
> don't have access to all that hardware :)
> 
> That said, the effect of dropping the zva checks should be positive on
> every core for memset(0).  The alignments might change things a bit, but I
> think we should fix those as we go along rather than wait for them to be
> perfect, since this gain is too big to pass up.  This also adds an
> incentive to document our alignment decisions, since they weren't
> documented in the earlier code.
> 

since the patch modifies the memset code that runs in
most cases, the impact of the change should be well
understood before it goes in.

in the static linked case it will introduce a plt
indirection, i don't expect a big gain except on falkor
and there is a risk of regression on other cores, it
also makes the code less maintainable.

we should not be in a hurry applying such patches.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-12 11:14   ` Szabolcs Nagy
@ 2017-10-12 12:57     ` Siddhesh Poyarekar
  2017-10-12 14:01       ` Szabolcs Nagy
  0 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12 12:57 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On Thursday 12 October 2017 04:43 PM, Szabolcs Nagy wrote:
> since the patch modifies the memset code that runs in
> most cases, the impact of the change should be well
> understood before it goes in.
> 
> in the static linked case it will introduce a plt
> indirection, i don't expect a big gain except on falkor
> and there is a risk of regression on other cores, it
> also makes the code less maintainable.

Why do you think this gives a gain only on falkor, especially when I
asserted that it gives a significant gain on falkor *and* mustang?  Why
do you think having a bunch of branches in live code (that's the bit
that gets hoisted out) would be beneficial to any core?

Siddhesh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-12 12:57     ` Siddhesh Poyarekar
@ 2017-10-12 14:01       ` Szabolcs Nagy
  2017-10-12 14:42         ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Szabolcs Nagy @ 2017-10-12 14:01 UTC (permalink / raw)
  To: siddhesh, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On 12/10/17 13:57, Siddhesh Poyarekar wrote:
> On Thursday 12 October 2017 04:43 PM, Szabolcs Nagy wrote:
>> since the patch modifies the memset code that runs in
>> most cases, the impact of the change should be well
>> understood before it goes in.
>>
>> in the static linked case it will introduce a plt
>> indirection, i don't expect a big gain except on falkor
>> and there is a risk of regression on other cores, it
>> also makes the code less maintainable.
> 
> Why do you think this gives a gain only on falkor, especially when I
> asserted that it gives a significant gain on falkor *and* mustang?  Why
> do you think having a bunch of branches in live code (that's the bit
> that gets hoisted out) would be beneficial to any core?
> 

there was no testing on real workloads, only on
a ubenchmark that has the issues outlined by wilco.

it is known that some cores are sensitive to the
exact code sequence in memset and you changed it
without testing the affected cores (cortex-a53),
probably only wilco knows why he wrote the code
exactly the way it is written, so i need evidence
that there is no regression or wilco's approval.

mustang is a devboard with x-gene cores, which is
not representative of the existing aarch64 cores
in use, but even on mustang it's not clear that the
ubenchmark speedup means an improvement in practice.
(string functions are hard to benchmark; there are
many cases and concerns, which is why i'm conservative
about accepting patches to them)

in principle the approach is fine, but it will take
more time to get confident about the patch.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-12 14:01       ` Szabolcs Nagy
@ 2017-10-12 14:42         ` Siddhesh Poyarekar
  2017-10-16 12:17           ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-12 14:42 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On Thursday 12 October 2017 07:31 PM, Szabolcs Nagy wrote:
> there was no testing on real workloads, only on
> a ubenchmark that has the issues outlined by wilco.

It is the kind of change that shouldn't even need a benchmark.  It is a
no-brainer - I am hoisting a check that is done inside every function
call out into the IFUNC, thus removing a branch inside the code.
Compilers do this all the time for loop-invariant ops and this is no
different.
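That hoisting analogy can be sketched in portable C (all names here are hypothetical, not glibc's; the real mechanism is an ELF ifunc whose resolver runs once at relocation time):

```c
#include <assert.h>

/* 'use_zva' stands in for an invariant fixed at startup, such as the
   ZVA size.  */
static int use_zva;

/* Pre-patch shape: the invariant is re-checked inside every call.  */
static long
op_checked (long x)
{
  if (use_zva)                  /* branch executed on each call */
    return x * 2;
  return x + 1;
}

/* Post-patch shape: the branch runs once in an ifunc-style resolver;
   later calls go straight to the chosen variant.  */
static long op_zva (long x)   { return x * 2; }
static long op_nozva (long x) { return x + 1; }

typedef long (*op_fn) (long);

static op_fn
resolve_op (void)
{
  return use_zva ? op_zva : op_nozva;
}
```

Both shapes compute the same result; the second just moves the invariant branch out of the per-call path, exactly as a compiler hoists a loop-invariant test out of a loop.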

I have split out the discussion about the benchmark because I believe
Wilco has misread the benchmark and I've tried to explain how in that
thread.

> it is known that some cores are sensitive to the
> exact code sequence in memset and you changed it
> without testing the affected cores (cortex-a53),
> probably only wilco knows why he wrote the code
> exactly the way it is written, so i need evidence
> that there is no regression or wilco's approval.

I've posted two patches and it looks like you're conflating the two.
The second patch needs more testing I agree but we're discussing the
first patch, which is far more straightforward.

> mustang is a devboard with x-gene cores, which is
> not representative of the existing aarch64 cores
> in use, but even on mustang it's not clear that the
> ubenchmark speed up means an improvement in practice.
> (string functions are hard to benchmark, there are
> many cases and concerns, which is why i'm conservative
> about accepting patches to them)
> 
> in principle the approach is fine, but it will take
> more time to get confident about the patch.

Like I said before, the benchmark has been misunderstood and I maintain
that the gains will be seen in every benchmark for the input range I
pointed out without causing any noticeable change elsewhere.

So barring those things I read your response as you thinking that the
patch is OK but you're afraid of some unknown unknowns hitting back.  In
that case I suggest we commit this now, let the CI loops take over (in
Cavium, Linaro, ARM, etc.) and then point out regressions if any.  A
valid patch lying around on the mailing list or an arbitrary branch is
never going to get tested.

Siddhesh


* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-12 14:42         ` Siddhesh Poyarekar
@ 2017-10-16 12:17           ` Siddhesh Poyarekar
  2017-10-16 13:08             ` Szabolcs Nagy
  0 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-16 12:17 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On Thursday 12 October 2017 08:11 PM, Siddhesh Poyarekar wrote:
> Like I said before, the benchmark has been misunderstood and I maintain
> that the gains will be seen in every benchmark for the input range I
> pointed out without causing any noticeable change elsewhere.
> 
> So barring those things I read your response as you thinking that the
> patch is OK but you're afraid of some unknown unknowns hitting back.  In
> that case I suggest we commit this now, let the CI loops take over (in
> Cavium, Linaro, ARM, etc.) and then point out regressions if any.  A
> valid patch lying around on the mailing list or an arbitrary branch is
> never going to get tested.

Szabolcs, have you thought about this some more?

Siddhesh


* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-16 12:17           ` Siddhesh Poyarekar
@ 2017-10-16 13:08             ` Szabolcs Nagy
  2017-10-16 13:36               ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Szabolcs Nagy @ 2017-10-16 13:08 UTC (permalink / raw)
  To: siddhesh, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On 16/10/17 13:17, Siddhesh Poyarekar wrote:
> On Thursday 12 October 2017 08:11 PM, Siddhesh Poyarekar wrote:
>> Like I said before, the benchmark has been misunderstood and I maintain
>> that the gains will be seen in every benchmark for the input range I
>> pointed out without causing any noticeable change elsewhere.
>>
>> So barring those things I read your response as you thinking that the
>> patch is OK but you're afraid of some unknown unknowns hitting back.  In
>> that case I suggest we commit this now, let the CI loops take over (in
>> Cavium, Linaro, ARM, etc.) and then point out regressions if any.  A
>> valid patch lying around on the mailing list or an arbitrary branch is
>> never going to get tested.
> 
> Szabolcs, have you thought about this some more?
> 

i think the patch can be reduced to have
3 variants instead of 4 (the nozva case does
not seem to buy much over the generic code)

and the alignments have to be checked for the
new fixed zva size cases because they seem to
be wrong.

it's not entirely clear what should happen
for libc internal memset calls, that needs
more thought.

quick test shows that the current benchmark was
not running enough iterations and not testing
enough cases, i'd expect the actual speed up
to be lower than previously reported.

overall this needs more work, as i said string
functions need to be modified very carefully
since many things depend on their behaviour.


* Re: memcpy walk benchmark [was: Hoist ZVA check out of the memset function]
  2017-10-12  1:54 ` memcpy walk benchmark [was: Hoist ZVA check out of the memset function] Siddhesh Poyarekar
@ 2017-10-16 13:12   ` Wilco Dijkstra
  2017-10-16 13:27     ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Wilco Dijkstra @ 2017-10-16 13:12 UTC (permalink / raw)
  To: adhemerval.zanella, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy

Siddhesh Poyarekar wrote:  
> On Thursday 12 October 2017 02:50 AM, Wilco Dijkstra wrote:
>> Finally we'll need to look into more detail at the new memcpy benchmarks -
>> while looping through memory seems like a good idea, it appears like it
>> only increments by 1. So at first sight it's still testing the exact same thing as
>> the existing benchmarks - all data is always cached in L1. For memset I guess
>> we're still missing a randomized benchmark based on real trace data.
>
> That's a slightly incorrect description of the benchmark.  The benchmark
> walks through two buffers, one forward and one backward.  While it
> increments one buffer, it decrements the other.  Also, it interchanges
> the source and destination buffers for alternate memcpy calls.
> Altogether it ends up ensuring that the L1 hit impact is significantly
> reduced; the reduction in impact is proportional to the size of the
> memcpy, so larger sizes would almost never hit L1.

No, this is not the case. For larger sizes it accesses less and less memory.
See eg. bench-memset-walk.c:

size_t i, iters = MIN_PAGE_SIZE / n;

   for (i = 0; i < iters && s <= s_end; s++, i++) 
     CALL (impl, s, c, n);

Clearly the working set is (MIN_PAGE_SIZE / n) + n. For n=256 it is
128KB, for n=1024 it is just 33KB. I wouldn't call either a 32MB walk...

Presumably the intention was to do s += n?
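
Wilco's arithmetic can be checked directly. The sketch below models the quoted loop: `MIN_PAGE_SIZE` stands for the benchmark's 32 MiB buffer, and since `s` advances by only one byte per call, the footprint is far smaller than a full walk.

```c
#include <stddef.h>

#define MIN_PAGE_SIZE (32 * 1024 * 1024) /* the benchmark's 32 MiB buffer */

/* With iters = MIN_PAGE_SIZE / n and s advancing one byte per call,
   the last call starts at offset iters - 1 and writes n bytes, so the
   bytes ever touched are (MIN_PAGE_SIZE / n) + n - 1 -- nowhere near
   32 MiB.  */
static size_t
working_set (size_t n)
{
  size_t iters = MIN_PAGE_SIZE / n;
  return (iters - 1) + n;
}
```

For n = 256 this gives roughly 128 KiB and for n = 1024 roughly 33 KiB, matching the figures above.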

Wilco


* Re: memcpy walk benchmark [was: Hoist ZVA check out of the memset function]
  2017-10-16 13:12   ` Wilco Dijkstra
@ 2017-10-16 13:27     ` Siddhesh Poyarekar
  2017-10-16 14:58       ` Wilco Dijkstra
  0 siblings, 1 reply; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-16 13:27 UTC (permalink / raw)
  To: Wilco Dijkstra, adhemerval.zanella; +Cc: libc-alpha, nd, Szabolcs Nagy

On Monday 16 October 2017 06:42 PM, Wilco Dijkstra wrote:
> No this is not the case. For larger sizes it accesses less and less memory.
> See eg. bench-memset-walk.c:
> 
> size_t i, iters = MIN_PAGE_SIZE / n;
> 
>    for (i = 0; i < iters && s <= s_end; s++, i++) 
>      CALL (impl, s, c, n);

Ahh wait, my intention was to start from s_end and go down to s,
decrementing along the way.  That should have different behaviour from
incrementing since the latter will tune hardware prefetchers for future
iterations, thus skewing readings.

> Clearly the working set is (MIN_PAGE_SIZE / n) + n. For n=256 it is
> 128KB, for n=1024 it is just 33KB. I wouldn't call either a 32MB walk...
> 
> Presumably the intention was to do s += n?

Right, except it was supposed to be s_end -= n.
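
The intended loop would then look roughly like this. It is a hedged sketch, not the benchmark's actual code: plain `memset` stands in for the harness's `CALL (impl, ...)` macro, the buffer is shrunk for illustration, and an offset replaces the raw `s_end` pointer so the final decrement stays well-defined.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE (1024 * 1024) /* small stand-in for the 32 MiB buffer */
static char buf[BUF_SIZE];

/* Walk backwards: start at the top of the buffer and step the write
   position down by n on every call, so each memset touches fresh
   memory and the descending addresses avoid training a forward
   hardware prefetcher.  */
static void
walk_backwards (int c, size_t n)
{
  size_t iters = BUF_SIZE / n;
  size_t off = BUF_SIZE - n; /* plays the role of s_end */
  for (size_t i = 0; i < iters; off -= n, i++)
    memset (buf + off, c, n);
}
```

When n divides the buffer size, the final iteration lands exactly at offset 0, so the whole buffer is covered once with no overlap between calls.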

I'll post a fix for both those issues and re-test memset.  Any other
comments on the test?  BTW, I thought you were talking about memcpy and
not memset.  Can you please review those tests as well?

Siddhesh


* Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
  2017-10-16 13:08             ` Szabolcs Nagy
@ 2017-10-16 13:36               ` Siddhesh Poyarekar
  0 siblings, 0 replies; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-16 13:36 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, adhemerval.zanella; +Cc: nd, libc-alpha

On Monday 16 October 2017 06:37 PM, Szabolcs Nagy wrote:
> i think the patch can be reduced to have
> 3 variants instead of 4 (the nozva case does
> not seem to buy much over the generic code)

How about this then for the least change: I'll just add an alternate
routine for zva == 64 and keep everything else the same.

> and the alignments have to be checked for the
> new fixed zva size cases because they seem to
> be wrong.

I'll keep the routine separate so that it's easier for you to see the
alignments of various bits.

> it's not entirely clear what should happen
> for libc internal memset calls, that needs
> more thought.

Default to the current memset?

> quick test shows that the current benchmark was
> not running enough iterations and not testing
> enough cases, i'd expect the actual speed up
> to be lower than previously reported.

It's running enough iterations, it's just not running across the array
like I had planned, so the iterations are overlapping and hence
ineffective.  I'll post a fix and also the new numbers.

Siddhesh


* Re: memcpy walk benchmark [was: Hoist ZVA check out of the memset function]
  2017-10-16 13:27     ` Siddhesh Poyarekar
@ 2017-10-16 14:58       ` Wilco Dijkstra
  2017-10-16 15:17         ` Siddhesh Poyarekar
  0 siblings, 1 reply; 21+ messages in thread
From: Wilco Dijkstra @ 2017-10-16 14:58 UTC (permalink / raw)
  To: adhemerval.zanella, siddhesh; +Cc: libc-alpha, nd, Szabolcs Nagy

Siddhesh Poyarekar wrote:
    
> Ahh wait, my intention was to start from s_end and go down to s,
> decrementing along the way.  That should have different behaviour from
> incrementing since the latter will tune hardware prefetchers for future
> iterations, thus skewing readings.

Right, that seems like a good approach.

> I'll post a fix for both those issues and re-test memset.  Any other
> comments on the test?  BTW, I thought you were talking about memcpy and
> not memset.  Can you please review those tests as well?

They are identical, so they have the same issues. For memcpy/memmove I don't
think it is a good idea to copy the same data back and forth, since that's not
a common usage scenario, but also because it might penalize cores that
bypass L1 for write streams.

Generally the tests don't run long enough (even if they do access all 32MB),
so I'd say they need an outer loop to repeat, say, 20 times. Also if we do
exactly the same amount of work for each possible size, printing the total
time would make comparing results between different sizes a bit easier.

Wilco


* Re: memcpy walk benchmark [was: Hoist ZVA check out of the memset function]
  2017-10-16 14:58       ` Wilco Dijkstra
@ 2017-10-16 15:17         ` Siddhesh Poyarekar
  0 siblings, 0 replies; 21+ messages in thread
From: Siddhesh Poyarekar @ 2017-10-16 15:17 UTC (permalink / raw)
  To: Wilco Dijkstra, adhemerval.zanella; +Cc: libc-alpha, nd, Szabolcs Nagy

On Monday 16 October 2017 08:28 PM, Wilco Dijkstra wrote:
> They are identical so have the same issues. For memcpy/memmove I don't
> think it is a good idea to copy the same data back and forth, since that's not
> a common usage scenario, but also because it might penalize cores that
> bypass L1 for write streams.

Those benchmarks actually emulate the behaviour of an internal
proprietary workload, i.e. it is not 1:1 exact behaviour but it tracks
the performance.  However it is a fair point.  I'll just make it a copy
that walks backwards, like memset and see what the difference is and if
it continues to track the internal workload.  My guess is that it simply
depends on invalidation, which automatically happens for falkor since it
bypasses L1 for write streams.

> Generally the tests don't run long enough (even if they do access all 32MB),
> so I'd say they need an outer loop to repeat say 20 times. Also if we do

The total run time of the benchmark is quite long.  If we repeat the
runs then maybe we should consider reducing the sizes that are measured,
maybe limiting them to just sizes > 128 bytes.

> exactly the same amount of work for each possible size, printing the total
> time would make comparing results between different sizes a bit easier.

Agreed.  I had done that with my first iteration with the transfer rate,
but put in time in the end to make it consistent with other tests.  A
total time per size is a better replacement than per-call time.

Siddhesh


end of thread, newest: ~2017-10-16 15:17 UTC

Thread overview: 21+ messages
2017-10-11 21:20 [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function Wilco Dijkstra
2017-10-11 21:39 ` Adhemerval Zanella
2017-10-11 21:44   ` Andrew Pinski
2017-10-11 21:47     ` Adhemerval Zanella
2017-10-12  1:30     ` Siddhesh Poyarekar
2017-10-11 22:21   ` Wilco Dijkstra
2017-10-12  0:01     ` Adhemerval Zanella
2017-10-12  1:38     ` Siddhesh Poyarekar
2017-10-12  1:48 ` Siddhesh Poyarekar
2017-10-12 11:14   ` Szabolcs Nagy
2017-10-12 12:57     ` Siddhesh Poyarekar
2017-10-12 14:01       ` Szabolcs Nagy
2017-10-12 14:42         ` Siddhesh Poyarekar
2017-10-16 12:17           ` Siddhesh Poyarekar
2017-10-16 13:08             ` Szabolcs Nagy
2017-10-16 13:36               ` Siddhesh Poyarekar
2017-10-12  1:54 ` memcpy walk benchmark [was: Hoist ZVA check out of the memset function] Siddhesh Poyarekar
2017-10-16 13:12   ` Wilco Dijkstra
2017-10-16 13:27     ` Siddhesh Poyarekar
2017-10-16 14:58       ` Wilco Dijkstra
2017-10-16 15:17         ` Siddhesh Poyarekar
