public inbox for gcc@gcc.gnu.org
* k-byte memset/memcpy/strlen builtins
@ 2017-01-11 16:16 Robin Dapp
  2017-01-11 17:42 ` Richard Biener
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Robin Dapp @ 2017-01-11 16:16 UTC
  To: gcc

Hi,

When examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets. Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instructions (or similar).

For 1-byte memset/memcpy the builtin functions provide a straightforward
way to achieve this. At first sight it seemed possible to extend
tree-loop-distribution.c to include the additional variants we need.
However, multibyte memsets/memcpys are not covered by the C standard and
I'm therefore unsure if such an approach is preferable or if there are
more idiomatic ways or places to add this functionality.
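
To make the question concrete, here is a minimal sketch of the kind of loop meant by a "2-byte memset", next to its 1-byte counterpart. The function names are hypothetical and the code is not taken from the test cases mentioned above. The 1-byte loop has the shape that can be replaced by a plain memset call; the 2-byte loop has no standard C library equivalent to map to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: store the same 2-byte value into n
   consecutive elements ("2-byte memset").  */
void
fill16 (uint16_t *dst, uint16_t value, size_t n)
{
  assert (n == 0 || dst != NULL);
  for (size_t i = 0; i < n; i++)
    dst[i] = value;
}

/* The 1-byte counterpart, which has the same shape as memset.  */
void
fill8 (unsigned char *dst, unsigned char value, size_t n)
{
  assert (n == 0 || dst != NULL);
  for (size_t i = 0; i < n; i++)
    dst[i] = value;
}
```

Both loops differ only in the element type, which is why extending the existing recognition to wider elements looked feasible at first sight.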

The same question goes for 2-byte strlen. I didn't see a recognition
pattern for strlen (apart from optimizations due to known string length
in tree-ssa-strlen.c). Would it make sense to include strlen recognition
and subsequent handling for 2-byte strlen? The situation might of
course be more complicated than memset because of encodings etc. My snippet
in question used a fixed-length encoding of 2 bytes, however.
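
For illustration, a "2-byte strlen" over a fixed-length 2-byte encoding could look roughly like the sketch below. The name strlen16 is hypothetical and the code is not the snippet referred to above; it only assumes that the string ends with a 2-byte zero unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 2-byte units before the
   terminating zero unit.  */
size_t
strlen16 (const uint16_t *s)
{
  assert (s != NULL);
  size_t len = 0;
  while (s[len] != 0)
    len++;
  return len;
}
```

Unlike the memset case, the trip count of this loop depends on the data, so recognizing it is a different problem from recognizing a counted store loop.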

Another simple idea to tackle this would be a peephole optimization but
I'm not sure if this is really feasible for something like memset.
Wouldn't the peephole have to be recursive then?

Regards
 Robin


* Re: k-byte memset/memcpy/strlen builtins
  2017-01-11 16:16 k-byte memset/memcpy/strlen builtins Robin Dapp
@ 2017-01-11 17:42 ` Richard Biener
  2017-01-12  8:26   ` Robin Dapp
  2017-01-11 18:05 ` Aaron Sawdey
  2017-01-12 18:25 ` Martin Sebor
  2 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2017-01-11 17:42 UTC
  To: gcc, Robin Dapp

On January 11, 2017 5:16:43 PM GMT+01:00, Robin Dapp <rdapp@linux.vnet.ibm.com> wrote:
>Hi,
>
>When examining the performance of some test cases on s390 I realized
>that we could do better for constructs like 2-byte memcpys or
>2-byte/4-byte memsets. Due to some s390-specific architectural
>properties, we could be faster by e.g. avoiding excessive unrolling and
>using dedicated memory instructions (or similar).

Not sure why you mention memcpy; how does that depend on 'element size'?

>For 1-byte memset/memcpy the builtin functions provide a
>straightforward
>way to achieve this. At first sight it seemed possible to extend
>tree-loop-distribution.c to include the additional variants we need.
>However, multibyte memsets/memcpys are not covered by the C standard
>and
>I'm therefore unsure if such an approach is preferable or if there are
>more idiomatic ways or places where to add the functionality.

Yes, for memset with a larger element size we could add an optab plus
internal function (IFN) combination and use that when the target wants.
Or always use such an IFN and fall back to loopy expansion.

>The same question goes for 2-byte strlen. I didn't see a recognition
>pattern for strlen (apart from optimizations due to known string length
>in tree-ssa-strlen.c). Would it make sense to include strlen
>recognition
>and subsequently handling for 2-byte strlen? The situation might of

I'd say a multibyte memchr might make sense, but strlen specifically?  Not sure.

Likewise multibyte memcmp.
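
As a rough sketch of what a multibyte memchr would compute for 2-byte elements (the name and signature are hypothetical, not an existing builtin or optab):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name and signature.  Return a pointer to the first of
   the n 2-byte elements at s that equals c, or NULL if none does.  */
const uint16_t *
memchr16 (const uint16_t *s, uint16_t c, size_t n)
{
  assert (n == 0 || s != NULL);
  for (size_t i = 0; i < n; i++)
    if (s[i] == c)
      return &s[i];
  return NULL;
}
```

With a known upper bound max on the length, a 2-byte strlen can then be written as memchr16 (s, 0, max) - s whenever a terminator is found, which is one reason the more general operation may be the better primitive.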

Richard.



* Re: k-byte memset/memcpy/strlen builtins
  2017-01-11 16:16 k-byte memset/memcpy/strlen builtins Robin Dapp
  2017-01-11 17:42 ` Richard Biener
@ 2017-01-11 18:05 ` Aaron Sawdey
  2017-01-12 18:25 ` Martin Sebor
  2 siblings, 0 replies; 6+ messages in thread
From: Aaron Sawdey @ 2017-01-11 18:05 UTC
  To: Robin Dapp, gcc

On Wed, 2017-01-11 at 17:16 +0100, Robin Dapp wrote:
> Hi,

Hi Robin,
  I thought I'd share some of what I've run into while doing similar
things for the rs6000 target.

First off, be aware that glibc does some macro expansion things to try
to handle 1/2/3 byte string operations in some cases.

Secondly, the way I approached this was to use the patterns 
defined in optabs.def for these things:

OPTAB_D (cmpmem_optab, "cmpmem$a")
OPTAB_D (cmpstr_optab, "cmpstr$a")
OPTAB_D (cmpstrn_optab, "cmpstrn$a")
OPTAB_D (movmem_optab, "movmem$a")
OPTAB_D (setmem_optab, "setmem$a")
OPTAB_D (strlen_optab, "strlen$a")

If you define movmemsi, that should get used by expand_builtin_memcpy
for any memcpy call that it sees.

The constraints I was able to find when implementing cmpmemsi for
memcmp were:
 * don't compare past the given length (obviously)
 * don't read past the given length
 * except it's ok to do so if you can prove via alignment or
   runtime check that you are not going to cause a pagefault.
   Not crossing a 4k boundary seems to be generally viewed as
   acceptable.
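
The runtime check in the last point can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name is hypothetical, the page size is assumed to be 4096 bytes, and converting a pointer to uintptr_t and masking it is implementation-defined, although it behaves as expected on common flat-address targets. In an actual expander the equivalent test would be emitted as RTL rather than written in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; 4096 matches the 4k boundary mentioned above.  */
#define PAGE_SIZE_ASSUMED ((uintptr_t) 4096)

/* Hypothetical helper: true if reading nbytes starting at p stays
   within the 4k-aligned block that contains p.  */
bool
read_stays_in_page (const void *p, size_t nbytes)
{
  assert (nbytes > 0 && nbytes <= PAGE_SIZE_ASSUMED);
  uintptr_t offset = (uintptr_t) p & (PAGE_SIZE_ASSUMED - 1);
  return offset + nbytes <= PAGE_SIZE_ASSUMED;
}
```

If the check succeeds, a wide load that covers bytes beyond the given length cannot touch a different page than the bytes that must be read anyway, so it cannot introduce a new page fault.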

I would recommend looking at preprocessed code to make sure no funny
business is happening, and then look at your .md files. It looks to me
like s390 has got both movmem and strlen patterns there already.

If I understand correctly, you want to handle multi-byte characters.
Seems to me you need to follow the path Richard Biener suggests and
make optab expansions that handle wider chars and then perhaps map
wcslen et al. to them?

   Aaron
-- 
Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain


* Re: k-byte memset/memcpy/strlen builtins
  2017-01-11 17:42 ` Richard Biener
@ 2017-01-12  8:26   ` Robin Dapp
  2017-01-12 12:30     ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Robin Dapp @ 2017-01-12  8:26 UTC
  To: Richard Biener, gcc

> Yes, for memset with larger element we could add an optab plus
> internal function combination and use that when the target wants.  Or
> always use such IFN and fall back to loopy expansion.

So, adding additional patterns in tree-loop-distribution.c (and mapping
them to dedicated optabs) is fine? Or does the yes refer to the
"else"/"or" part of my question (how would the backend recognize the
patterns then)?

> I'd say a multibyte memchr might make sense, but strlen specifically?
> Not sure.

OK, memchr would also work for the snippet I have in mind.

Regards
 Robin


* Re: k-byte memset/memcpy/strlen builtins
  2017-01-12  8:26   ` Robin Dapp
@ 2017-01-12 12:30     ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2017-01-12 12:30 UTC
  To: Robin Dapp; +Cc: GCC Development

On Thu, Jan 12, 2017 at 9:26 AM, Robin Dapp <rdapp@linux.vnet.ibm.com> wrote:
>> Yes, for memset with larger element we could add an optab plus
>> internal function combination and use that when the target wants.  Or
>> always use such IFN and fall back to loopy expansion.
>
> So, adding additional patterns in tree-loop-distribution.c (and mapping
> them to dedicated optabs) is fine? Or does the yes refer to the
> "else"/"or" part of my question (how would the backend recognize the
> patterns then)?

Yes, enhancing tree-loop-distribution.c with extra patterns is fine (hey, there
were supposed to be patterns for all of lapack & friends ... ;))

The question is only whether loop-distribution should always create the IFN
or only when the backend has an optab, so that expansion is trivial (rather
than needing to re-build a loop doing the operation).

Richard.



* Re: k-byte memset/memcpy/strlen builtins
  2017-01-11 16:16 k-byte memset/memcpy/strlen builtins Robin Dapp
  2017-01-11 17:42 ` Richard Biener
  2017-01-11 18:05 ` Aaron Sawdey
@ 2017-01-12 18:25 ` Martin Sebor
  2 siblings, 0 replies; 6+ messages in thread
From: Martin Sebor @ 2017-01-12 18:25 UTC
  To: Robin Dapp, gcc

On 01/11/2017 09:16 AM, Robin Dapp wrote:
> Hi,
>
> When examining the performance of some test cases on s390 I realized
> that we could do better for constructs like 2-byte memcpys or
> 2-byte/4-byte memsets. Due to some s390-specific architectural
> properties, we could be faster by e.g. avoiding excessive unrolling and
> using dedicated memory instructions (or similar).

There are at least two enhancement requests in Bugzilla to improve
memcmp when one or more of the arguments are constant: bugs 12086
and 78257 (the former for constant small lengths and the latter
for constant byte arrays).  It seems that one or both of these might
also benefit from some of your ideas and/or vice versa.
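
As an illustration only, and not taken from either bug report, the sketch below shows the kind of expansion a small constant length makes possible: an equality test of exactly 4 bytes done with two 4-byte loads and one integer compare. The helper name equal4 is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: equality-only comparison of exactly 4 bytes,
   written as two 4-byte loads and one integer compare.  */
bool
equal4 (const void *a, const void *b)
{
  assert (a != NULL && b != NULL);
  uint32_t x, y;
  memcpy (&x, a, sizeof x);   /* alignment-safe 4-byte load */
  memcpy (&y, b, sizeof y);
  return x == y;              /* same result as memcmp (a, b, 4) == 0 */
}
```

This covers equality only. Reproducing the sign of memcmp's result from a single wide compare additionally depends on the target's byte order.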

Martin


