[RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
@ 2014-09-24  8:11 David Wohlferd
  2014-09-24  8:48 ` Richard Biener
  0 siblings, 1 reply; 17+ messages in thread
From: David Wohlferd @ 2014-09-24  8:11 UTC (permalink / raw)
  To: gcc, Yury Gribov, Hans-Peter Nilsson

Hans-Peter Nilsson: I should have listened to you back when you raised 
concerns about this.  My apologies for ever doubting you.

In summary:

- The "trick" in the docs for using an arbitrarily sized struct to force 
register flushes for inline asm does not work.
- Placing the inline asm in a separate routine can sometimes mask the 
problem with the trick not working.
- The sample that has been in the docs forever performs an unhelpful, 
unexpected, and probably unwanted stack allocation + memcpy.

Details:

Here is the text from the docs:

-----------
One trick to avoid [using the "memory" clobber] is available if the size 
of the memory being accessed is known at compile time. For example, if 
accessing ten bytes of a string, use a memory input like:

     "m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )
-----------

When I did the re-write of gcc's inline asm docs, I left the description 
for this (essentially) untouched.  I just took it on faith that "magic 
happens" and the right code gets generated.  But reading a recent post 
raised questions for me, so I tried it.  And what I found was that not 
only does this not work, it actually just makes a mess.

I started with some code that I knew required some memory clobbering:

     #include <stdio.h>

     int main(int argc, char* argv[])
     {
       struct
       {
         int a;
         int b;
       } c;

       c.a = 1;
       c.b = 2;

       int Count = sizeof(c);
       void *Dest;

       __asm__ __volatile__ ("rep; stosb"
            : "=D" (Dest), "+c" (Count)
            : "0" (&c), "a" (0)
            //: "memory"
       );

       printf("%u %u\n", c.a, c.b);
     }

As written, this x64 code (compiled with -O2) will print out "1 2", even 
though someone might (incorrectly) expect the asm to overwrite the 
struct with zeros.  Adding the memory clobber allows this code to work 
as expected (printing "0 0").

Now that I have code I can use to see if registers are getting flushed, 
I removed the memory clobber, and tried just 'clobbering' the struct:

     #include <stdio.h>

     int main(int argc, char* argv[])
     {
       struct
       {
         int a;
         int b;
       } c;

       c.a = 1;
       c.b = 2;

       int Count = sizeof(c);
       void *Dest;

       __asm__ __volatile__ ("rep; stosb"
            : "=D" (Dest), "+c" (Count)
            : "0" (&c), "a" (0),
            "m" ( ({ struct foo { char x[8]; } *p = (struct foo *)&c ; 
*p; }) )
       );

       printf("%u %u\n", c.a, c.b);
     }

I'm using a named struct (foo) to avoid some compiler messages, but 
other than that, I believe this is the same as what's in the docs. And 
it doesn't work.  I still get "1 2".

At this point I realized that code I've seen using this trick usually 
has the asm in its own routine.  When I try this, it still fails.  
Unless I start cranking up the size of x from 8 to ~250.  At ~250, 
suddenly it starts working.  Apparently this is because at this point, 
gcc decides not to inline the routine anymore, and flushes the registers 
before calling the non-inline code.

And why does changing the size of the structure we are pointing to 
result in increases in the size of the routine?  Reading the -S output, 
the "*p" at the end of this constraint generates a call to memcpy the 
250 characters onto the stack, which it passes to the asm as %4, which 
is never used.  Argh!

Conclusion:

What I expected when using that sample code from the docs was that any 
registers that contain values from the struct would get flushed to 
memory.  This was intended to be a 'cheaper' alternative to doing a 
full-on "memory" clobber.  What I got instead was an unexpected/unneeded 
stack allocation and memcpy, and STILL didn't get the values flushed.  
Yeah, not exactly the 'cheaper' I was hoping for.

Is the example in the docs just written incorrectly?  Did this get 
broken somewhere along the line?  Or am I just using it wrong?

I'm using gcc version 4.9.0 (x86_64-win32-seh-rev2, Built by MinGW-W64 
project).  Remember to compile these x64 samples with -O2.

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-09-24  8:11 [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work David Wohlferd
@ 2014-09-24  8:48 ` Richard Biener
  2014-09-25 20:48   ` Yury Gribov
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-09-24  8:48 UTC (permalink / raw)
  To: David Wohlferd; +Cc: gcc, Yury Gribov, Hans-Peter Nilsson

On Wed, Sep 24, 2014 at 9:43 AM, David Wohlferd <dw@limegreensocks.com> wrote:
> Hans-Peter Nilsson: I should have listened to you back when you raised
> concerns about this.  My apologies for ever doubting you.
>
> In summary:
>
> - The "trick" in the docs for using an arbitrarily sized struct to force
> register flushes for inline asm does not work.
> - Placing the inline asm in a separate routine can sometimes mask the
> problem with the trick not working.
> - The sample that has been in the docs forever performs an unhelpful,
> unexpected, and probably unwanted stack allocation + memcpy.
>
> Details:
>
> Here is the text from the docs:
>
> -----------
> One trick to avoid [using the "memory" clobber] is available if the size of
> the memory being accessed is known at compile time. For example, if
> accessing ten bytes of a string, use a memory input like:
>
>     "m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )

Well - this can't work because you essentially are using a _value_
here (looking at the GIMPLE - I'm not sure if a statement expression
evaluates to an lvalue.

It should work if you simply do this without a stmt expression:

  "m" (*(struct { char x[10]; } *)ptr)

because that's clearly an lvalue (and the GIMPLE correctly says so):

  <bb 2>:
  c.a = 1;
  c.b = 2;
  __asm__ __volatile__("rep; stosb" : "=D" Dest_4, "=c" Count_5 : "0"
&c, "a" 0, "m" MEM[(struct foo *)&c], "1" 8);
  printf ("%u %u\n", 1, 2);

note that we still constant propagated 1 and 2 for the reason that
the asm didn't get any VDEF.  That's because you do not have any
memory output!  So while it keeps 'c' live it doesn't consider it
modified by the asm.  You'd still need to clobber the memory,
but "m" clobbers are not supported, only "memory".

Thus fixed asm:


      __asm__ __volatile__ ("rep; stosb"
           : "=D" (Dest), "+c" (Count)
           : "0" (&c), "a" (0),
           "m" (*( struct foo { char x[8]; } *)&c)
           : "memory"
      );

where I'm not 100% sure if the "m" input is now pointless (that is,
if a "memory" clobber also constitutes a use of all memory).

Richard.

> -----------
>
> When I did the re-write of gcc's inline asm docs, I left the description for
> this (essentially) untouched.  I just took it on faith that "magic happens"
> and the right code gets generated.  But reading a recent post raised
> questions for me, so I tried it.  And what I found was that not only does
> this not work, it actually just makes a mess.
>
> I started with some code that I knew required some memory clobbering:
>
>     #include <stdio.h>
>
>     int main(int argc, char* argv[])
>     {
>       struct
>       {
>         int a;
>         int b;
>       } c;
>
>       c.a = 1;
>       c.b = 2;
>
>       int Count = sizeof(c);
>       void *Dest;
>
>       __asm__ __volatile__ ("rep; stosb"
>            : "=D" (Dest), "+c" (Count)
>            : "0" (&c), "a" (0)
>            //: "memory"
>       );
>
>       printf("%u %u\n", c.a, c.b);
>     }
>
> As written, this x64 code (compiled with -O2) will print out "1 2", even
> though someone might (incorrectly) expect the asm to overwrite the struct
> with zeros.  Adding the memory clobber allows this code to work as expected
> (printing "0 0").
>
> Now that I have code I can use to see if registers are getting flushed, I
> removed the memory clobber, and tried just 'clobbering' the struct:
>
>     #include <stdio.h>
>
>     int main(int argc, char* argv[])
>     {
>       struct
>       {
>         int a;
>         int b;
>       } c;
>
>       c.a = 1;
>       c.b = 2;
>
>       int Count = sizeof(c);
>       void *Dest;
>
>       __asm__ __volatile__ ("rep; stosb"
>            : "=D" (Dest), "+c" (Count)
>            : "0" (&c), "a" (0),
>            "m" ( ({ struct foo { char x[8]; } *p = (struct foo *)&c ; *p; })
> )
>       );
>
>       printf("%u %u\n", c.a, c.b);
>     }
>
> I'm using a named struct (foo) to avoid some compiler messages, but other
> than that, I believe this is the same as what's in the docs. And it doesn't
> work.  I still get "1 2".
>
> At this point I realized that code I've seen using this trick usually has
> the asm in its own routine.  When I try this, it still fails.  Unless I
> start cranking up the size of x from 8 to ~250.  At ~250, suddenly it starts
> working.  Apparently this is because at this point, gcc decides not to
> inline the routine anymore, and flushes the registers before calling the
> non-inline code.
>
> And why does changing the size of the structure we are pointing to result in
> increases in the size of the routine?  Reading the -S output, the "*p" at
> the end of this constraint generates a call to memcpy the 250 characters
> onto the stack, which it passes to the asm as %4, which is never used.
> Argh!
>
> Conclusion:
>
> What I expected when using that sample code from the docs was that any
> registers that contain values from the struct would get flushed to memory.
> This was intended to be a 'cheaper' alternative to doing a full-on "memory"
> clobber.  What I got instead was an unexpected/unneeded stack allocation and
> memcpy, and STILL didn't get the values flushed.  Yeah, not exactly the
> 'cheaper' I was hoping for.
>
> Is the example in the docs just written incorrectly?  Did this get broken
> somewhere along the line?  Or am I just using it wrong?
>
> I'm using gcc version 4.9.0 (x86_64-win32-seh-rev2, Built by MinGW-W64
> project).  Remember to compile these x64 samples with -O2.
>
> dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-09-24  8:48 ` Richard Biener
@ 2014-09-25 20:48   ` Yury Gribov
  2014-09-26 10:50     ` David Wohlferd
  0 siblings, 1 reply; 17+ messages in thread
From: Yury Gribov @ 2014-09-25 20:48 UTC (permalink / raw)
  To: Richard Biener, David Wohlferd; +Cc: gcc, Hans-Peter Nilsson

On 09/24/2014 12:31 PM, Richard Biener wrote:
> On Wed, Sep 24, 2014 at 9:43 AM, David Wohlferd <dw@limegreensocks.com> wrote:
>> Hans-Peter Nilsson: I should have listened to you back when you raised
>> concerns about this.  My apologies for ever doubting you.
>>
>> In summary:
>>
>> - The "trick" in the docs for using an arbitrarily sized struct to force
>> register flushes for inline asm does not work.
>> - Placing the inline asm in a separate routine can sometimes mask the
>> problem with the trick not working.
>> - The sample that has been in the docs forever performs an unhelpful,
>> unexpected, and probably unwanted stack allocation + memcpy.
>>
>> Details:
>>
>> Here is the text from the docs:
>>
>> -----------
>> One trick to avoid [using the "memory" clobber] is available if the size of
>> the memory being accessed is known at compile time. For example, if
>> accessing ten bytes of a string, use a memory input like:
>>
>>      "m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )
>
> Well - this can't work because you essentially are using a _value_
> here (looking at the GIMPLE - I'm not sure if a statement expression
> evaluates to an lvalue.
>
> It should work if you simply do this without a stmt expression:
>
>    "m" (*(struct { char x[10]; } *)ptr)
>
> because that's clearly an lvalue (and the GIMPLE correctly says so):
>
>    <bb 2>:
>    c.a = 1;
>    c.b = 2;
>    __asm__ __volatile__("rep; stosb" : "=D" Dest_4, "=c" Count_5 : "0"
> &c, "a" 0, "m" MEM[(struct foo *)&c], "1" 8);
>    printf ("%u %u\n", 1, 2);
>
> note that we still constant propagated 1 and 2 for the reason that
> the asm didn't get any VDEF.  That's because you do not have any
> memory output!  So while it keeps 'c' live it doesn't consider it
> modified by the asm.  You'd still need to clobber the memory,
> but "m" clobbers are not supported, only "memory".
>
> Thus fixed asm:
>
>
>        __asm__ __volatile__ ("rep; stosb"
>             : "=D" (Dest), "+c" (Count)
>             : "0" (&c), "a" (0),
>             "m" (*( struct foo { char x[8]; } *)&c)
>             : "memory"
>        );
>
> where I'm not 100% sure if the "m" input is now pointless (that is,
> if a "memory" clobber also constitutes a use of all memory).

Or maybe even
   __asm__ __volatile__ ("rep; stosb"
        : "=D" (Dest), "+c" (Count), "+m" (*(struct foo { char x[8]; } *)&c)
        : "0" (&c), "a" (0)
   );
to avoid the big-hammer memory clobber?

-Y

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-09-25 20:48   ` Yury Gribov
@ 2014-09-26 10:50     ` David Wohlferd
  2014-09-26 14:47       ` Richard Biener
  0 siblings, 1 reply; 17+ messages in thread
From: David Wohlferd @ 2014-09-26 10:50 UTC (permalink / raw)
  To: Yury Gribov, richard.guenther, gcc,
	hp@bitrange.com >> Hans-Peter Nilsson, Gerald Pfeifer

On 9/25/2014 12:36 AM, Yury Gribov wrote:
> On 09/24/2014 12:31 PM, Richard Biener wrote:
>> On Wed, Sep 24, 2014 at 9:43 AM, David Wohlferd 
>> <dw@limegreensocks.com> wrote:
>>> Hans-Peter Nilsson: I should have listened to you back when you raised
>>> concerns about this.  My apologies for ever doubting you.
>>>
>>> In summary:
>>>
>>> - The "trick" in the docs for using an arbitrarily sized struct to 
>>> force
>>> register flushes for inline asm does not work.
>>> - Placing the inline asm in a separate routine can sometimes mask the
>>> problem with the trick not working.
>>> - The sample that has been in the docs forever performs an unhelpful,
>>> unexpected, and probably unwanted stack allocation + memcpy.
>>>
>>> Details:
>>>
>>> Here is the text from the docs:
>>>
>>> -----------
>>> One trick to avoid [using the "memory" clobber] is available if the 
>>> size of
>>> the memory being accessed is known at compile time. For example, if
>>> accessing ten bytes of a string, use a memory input like:
>>>
>>>      "m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )
>>
>> Well - this can't work because you essentially are using a _value_
>> here (looking at the GIMPLE - I'm not sure if a statement expression
>> evaluates to an lvalue.
>>
>> It should work if you simply do this without a stmt expression:
>>
>>    "m" (*(struct { char x[10]; } *)ptr)
>>
>> because that's clearly an lvalue (and the GIMPLE correctly says so):
>>
>>    <bb 2>:
>>    c.a = 1;
>>    c.b = 2;
>>    __asm__ __volatile__("rep; stosb" : "=D" Dest_4, "=c" Count_5 : "0"
>> &c, "a" 0, "m" MEM[(struct foo *)&c], "1" 8);
>>    printf ("%u %u\n", 1, 2);
>>
>> note that we still constant propagated 1 and 2 for the reason that
>> the asm didn't get any VDEF.  That's because you do not have any
>> memory output!  So while it keeps 'c' live it doesn't consider it
>> modified by the asm.  You'd still need to clobber the memory,
>> but "m" clobbers are not supported, only "memory".
>>
>> Thus fixed asm:
>>
>>
>>        __asm__ __volatile__ ("rep; stosb"
>>             : "=D" (Dest), "+c" (Count)
>>             : "0" (&c), "a" (0),
>>             "m" (*( struct foo { char x[8]; } *)&c)
>>             : "memory"
>>        );
>>
>> where I'm not 100% sure if the "m" input is now pointless (that is,
>> if a "memory" clobber also constitutes a use of all memory).
>
> Or maybe even
>   __asm__ __volatile__ ("rep; stosb"
>        : "=D" (Dest), "+c" (Count), "+m" (*(struct foo { char x[8]; } 
> *)&c)
>        : "0" (&c), "a" (0)
>   );
> to avoid the big-hammer memory clobber?
>
> -Y
>

Thank you both for the responses.  At this point I've started composing 
some replacement text for the docs (below), cuz clearly what's there is 
both inadequate and wrong.  All comments are welcome.

While the code in Richard's response will always produce the correct 
results, the intent here is to use memory constraints to *avoid* the 
performance penalties of the memory clobber.  The existing docs say this 
should work, and I've seen a number of places using it (linux kernel, 
glibc, etc).  If this does work, we should update the docs to say how 
it's done.  If this doesn't work, we should say that too.

Looking at Yury's response, that code does work (in c, not c++).  At 
least it does sometimes.  The problem is that sometimes gcc can "lose 
track" of what a pointer points to.  And when it does, gcc can get 
confused about what to flush.  Here's a simple example that shows this 
(4.9.0, x64, -O2, 'c'):

#include <stdio.h>

inline void *memset( void * Dest, int c, size_t Count) {
    void *dummy;

    __asm__ (
       "rep stosb"
       : "=D" (dummy), "+c" (Count), "=m" (*( struct foo { char x[8]; } 
*)Dest)
       : "0" (Dest), "a" (c)
       : "cc"//, "memory"
    );

    return Dest;
}

int main() {
    struct {
      int a;
    } c;

    void *v;
    asm volatile("#" : "=r" (v) : "0" (&c)  );

    c.a = 0x30303030;

    //memset(&c, 0, sizeof(c));
    memset(v, 0, sizeof(c));
    printf("%x\n", c.a);
}

This code will work if you pass &c to memset.  But it will fail if you 
use v.  Oh, the wonders of aliasing.

And this is why I'm having a problem doc'ing this.  I love the potential 
benefits of using this.  But if you are writing general-purpose 
routines, how can you hope to know whether the code calling you will 
pass you a pointer that will work with this trick? This kind of thing 
can introduce *horribly* hard to track down problems.  Note that the 
memory clobber always works, but potentially with a performance penalty. 
<sigh>

I could simply skip describing the whole problem with aliasing, but that 
just hides the problem.  I hate when docs do that.

That said, here's what I've written so far.  Any improvements or 
corrections are very welcome.

==========================
Flushing registers to memory has performance implications and may be an 
issue
for time-sensitive code. One trick to avoid flushing more registers than is
absolutely required is to use a memory constraint.

For example this i386 code works for both c and c++:

@example
inline void *memset(void *Dest, int c, size_t Count)
@{
    struct _reallybigstruct @{ char x[SSIZE_MAX]; @};
    void *dummy;

    asm (
       "rep stosb"
       : "=D" (dummy), "+c" (Count), "=m" (*(struct _reallybigstruct *)Dest)
       : "0" (Dest), "a" (c)
       : "cc"
    );

    return Dest;
@}
@end example

Similar code can be used if the memory is being read (by changing the memory
constraint to be an input) or updated (by changing the constraint to use
'+' instead of '=').

A few noteworthy things about clobbering using memory constraints:

@itemize @minus
@item
Note that @code{_reallybigstruct} uses a size of @code{SSIZE_MAX}. This 
does
not mean that when casting @code{Dest} to @code{(struct _reallybigstruct 
*)},
@code{Dest} must be @code{SSIZE_MAX} bytes long or that @code{SSIZE_MAX}
bytes will be used.  Nor does it affect registers associated with items 
that
may follow @code{Dest} in memory.  This only affects the symbol @code{Dest}.

@item
It may be that using memory constraints can remove the need for the asm
@code{volatile} qualifier.  This can, under certain circumstances, result in
more efficient code. Consider the memset sample above.  Without the memory
constraint, the only outputs are @code{dummy} and @code{Count}. However,
both those symbols go out of scope immediately after the asm statement. 
As a
result, gcc optimization could reasonably assume that the asm block isn't
actually needed for anything.  Even adding the "memory" clobber doesn't
change this. Normally this is resolved by adding the asm @code{volatile}
qualifier. However, by adding the memory constraint, gcc also considers
whether updating the @code{Dest} block will affect the subsequent code.
Only in the event that the @code{Dest} block is not needed will gcc 
consider
removing this asm code.

@end itemize

@strong{Warning}: Older versions of this documentation show an example 
for using
memory constraints like this:

@example
"m"( (@{ struct @{ char x[10]; @} *p = (void *)ptr ; *p; @}) )
@end example

This code not only doesn't compile in c++, it has potentially undesirable
side effects that affect performance, as well as producing incorrect 
results.

==========================

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-09-26 10:50     ` David Wohlferd
@ 2014-09-26 14:47       ` Richard Biener
  2014-10-03  6:41         ` David Wohlferd
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-09-26 14:47 UTC (permalink / raw)
  To: David Wohlferd
  Cc: Yury Gribov, gcc, hp@bitrange.com >> Hans-Peter Nilsson,
	Gerald Pfeifer

On Fri, Sep 26, 2014 at 12:37 PM, David Wohlferd <dw@limegreensocks.com> wrote:
>
> On 9/25/2014 12:36 AM, Yury Gribov wrote:
>>
>> On 09/24/2014 12:31 PM, Richard Biener wrote:
>>>
>>> On Wed, Sep 24, 2014 at 9:43 AM, David Wohlferd <dw@limegreensocks.com>
>>> wrote:
>>>>
>>>> Hans-Peter Nilsson: I should have listened to you back when you raised
>>>> concerns about this.  My apologies for ever doubting you.
>>>>
>>>> In summary:
>>>>
>>>> - The "trick" in the docs for using an arbitrarily sized struct to force
>>>> register flushes for inline asm does not work.
>>>> - Placing the inline asm in a separate routine can sometimes mask the
>>>> problem with the trick not working.
>>>> - The sample that has been in the docs forever performs an unhelpful,
>>>> unexpected, and probably unwanted stack allocation + memcpy.
>>>>
>>>> Details:
>>>>
>>>> Here is the text from the docs:
>>>>
>>>> -----------
>>>> One trick to avoid [using the "memory" clobber] is available if the size
>>>> of
>>>> the memory being accessed is known at compile time. For example, if
>>>> accessing ten bytes of a string, use a memory input like:
>>>>
>>>>      "m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )
>>>
>>>
>>> Well - this can't work because you essentially are using a _value_
>>> here (looking at the GIMPLE - I'm not sure if a statement expression
>>> evaluates to an lvalue.
>>>
>>> It should work if you simply do this without a stmt expression:
>>>
>>>    "m" (*(struct { char x[10]; } *)ptr)
>>>
>>> because that's clearly an lvalue (and the GIMPLE correctly says so):
>>>
>>>    <bb 2>:
>>>    c.a = 1;
>>>    c.b = 2;
>>>    __asm__ __volatile__("rep; stosb" : "=D" Dest_4, "=c" Count_5 : "0"
>>> &c, "a" 0, "m" MEM[(struct foo *)&c], "1" 8);
>>>    printf ("%u %u\n", 1, 2);
>>>
>>> note that we still constant propagated 1 and 2 for the reason that
>>> the asm didn't get any VDEF.  That's because you do not have any
>>> memory output!  So while it keeps 'c' live it doesn't consider it
>>> modified by the asm.  You'd still need to clobber the memory,
>>> but "m" clobbers are not supported, only "memory".
>>>
>>> Thus fixed asm:
>>>
>>>
>>>        __asm__ __volatile__ ("rep; stosb"
>>>             : "=D" (Dest), "+c" (Count)
>>>             : "0" (&c), "a" (0),
>>>             "m" (*( struct foo { char x[8]; } *)&c)
>>>             : "memory"
>>>        );
>>>
>>> where I'm not 100% sure if the "m" input is now pointless (that is,
>>> if a "memory" clobber also constitutes a use of all memory).
>>
>>
>> Or maybe even
>>   __asm__ __volatile__ ("rep; stosb"
>>        : "=D" (Dest), "+c" (Count), "+m" (*(struct foo { char x[8]; }
>> *)&c)
>>        : "0" (&c), "a" (0)
>>   );
>> to avoid the big-hammer memory clobber?
>>
>> -Y
>>
>
> Thank you both for the responses.  At this point I've started composing some
> replacement text for the docs (below), cuz clearly what's there is both
> inadequate and wrong.  All comments are welcome.
>
> While the code in Richard's response will always produce the correct
> results, the intent here is to use memory constraints to *avoid* the
> performance penalties of the memory clobber.  The existing docs say this
> should work, and I've seen a number of places using it (linux kernel, glibc,
> etc).  If this does work, we should update the docs to say how it's done.
> If this doesn't work, we should say that too.
>
> Looking at Yury's response, that code does work (in c, not c++).  At least
> it does sometimes.  The problem is that sometimes gcc can "lose track" of
> what a pointer points to.  And when it does, gcc can get confused about what
> to flush.  Here's a simple example that shows this (4.9.0, x64, -O2, 'c'):
>
> #include <stdio.h>
>
> inline void *memset( void * Dest, int c, size_t Count) {
>    void *dummy;
>
>    __asm__ (
>       "rep stosb"
>       : "=D" (dummy), "+c" (Count), "=m" (*( struct foo { char x[8]; }
> *)Dest)
>       : "0" (Dest), "a" (c)
>       : "cc"//, "memory"
>    );
>
>    return Dest;
> }
>
> int main() {
>    struct {
>      int a;
>    } c;
>
>    void *v;
>    asm volatile("#" : "=r" (v) : "0" (&c)  );
>
>    c.a = 0x30303030;
>
>    //memset(&c, 0, sizeof(c));
>    memset(v, 0, sizeof(c));
>    printf("%x\n", c.a);
> }
>
> This code will work if you pass &c to memset.  But it will fail if you use
> v.  Oh, the wonders of aliasing.

Yeah, in this case because *(struct foo { char x[8]; } *)Dest uses alias-set
'4' but c.a is accessed with alias-set '3'.

You can't expect a struct with a char array to end up using alias-set
zero (which probably was intended).  You want

"=m" (*( struct foo  { char x[8]; } __attribute__((may_alias)) *)Dest)

which makes it work.  Or use *(int *)Dest as you are later accessing
it with int, so you even do not lose TBAA.

Richard.

> And this is why I'm having a problem doc'ing this.  I love the potential
> benefits of using this.  But if you are writing general-purpose routines,
> how can you hope to know whether the code calling you will pass you a
> pointer that will work with this trick? This kind of thing can introduce
> *horribly* hard to track down problems.  Note that the memory clobber always
> works, but potentially with a performance penalty. <sigh>
>
> I could simply skip describing the whole problem with aliasing, but that
> just hides the problem.  I hate when docs do that.
>
> That said, here's what I've written so far.  Any improvements or corrections
> are very welcome.
>
> ==========================
> Flushing registers to memory has performance implications and may be an
> issue
> for time-sensitive code. One trick to avoid flushing more registers than is
> absolutely required is to use a memory constraint.
>
> For example this i386 code works for both c and c++:
>
> @example
> inline void *memset(void *Dest, int c, size_t Count)
> @{
>    struct _reallybigstruct @{ char x[SSIZE_MAX]; @};
>    void *dummy;
>
>    asm (
>       "rep stosb"
>       : "=D" (dummy), "+c" (Count), "=m" (*(struct _reallybigstruct *)Dest)
>       : "0" (Dest), "a" (c)
>       : "cc"
>    );
>
>    return Dest;
> @}
> @end example
>
> Similar code can be used if the memory is being read (by changing the memory
> constraint to be an input) or updated (by changing the constraint to use
> '+' instead of '=').
>
> A few noteworthy things about clobbering using memory constraints:
>
> @itemize @minus
> @item
> Note that @code{_reallybigstruct} uses a size of @code{SSIZE_MAX}. This does
> not mean that when casting @code{Dest} to @code{(struct _reallybigstruct
> *)},
> @code{Dest} must be @code{SSIZE_MAX} bytes long or that @code{SSIZE_MAX}
> bytes will be used.  Nor does it affect registers associated with items that
> may follow @code{Dest} in memory.  This only affects the symbol @code{Dest}.
>
> @item
> It may be that using memory constraints can remove the need for the asm
> @code{volatile} qualifier.  This can, under certain circumstances, result in
> more efficient code. Consider the memset sample above.  Without the memory
> constraint, the only outputs are @code{dummy} and @code{Count}. However,
> both those symbols go out of scope immediately after the asm statement. As a
> result, gcc optimization could reasonably assume that the asm block isn't
> actually needed for anything.  Even adding the "memory" clobber doesn't
> change this. Normally this is resolved by adding the asm @code{volatile}
> qualifier. However, by adding the memory constraint, gcc also considers
> whether updating the @code{Dest} block will affect the subsequent code.
> Only in the event that the @code{Dest} block is not needed will gcc consider
> removing this asm code.
>
> @end itemize
>
> @strong{Warning}: Older versions of this documentation show an example for
> using
> memory constraints like this:
>
> @example
> "m"( (@{ struct @{ char x[10]; @} *p = (void *)ptr ; *p; @}) )
> @end example
>
> This code not only doesn't compile in c++, it has potentially undesirable
> side effects that affect performance, as well as producing incorrect
> results.
>
> ==========================
>
> dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-09-26 14:47       ` Richard Biener
@ 2014-10-03  6:41         ` David Wohlferd
  2014-10-03  8:07           ` Hans-Peter Nilsson
  2014-10-06 11:09           ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: David Wohlferd @ 2014-10-03  6:41 UTC (permalink / raw)
  To: richard.guenther; +Cc: y.gribov, gcc, hp, gerald

> You want
>
> "=m" (*( struct foo  { char x[8]; } __attribute__((may_alias)) *)Dest)

Thank you.  With your help, that worse-than-useless sample in the docs 
is getting closer to something people can actually use.

Except for one last serious problem:  This trick only works for very 
specific (and very small) sizes of x (something else the existing docs 
fail to mention).

On x64: Sizes of 1/2/4/8/16 correctly clobber Dest (and only Dest) as 
expected.  Trying to use any other values (5, 12, 32, etc) seem to force 
a full memory clobber.  This is the case REGARDLESS of the actual size 
of the underlying structure.  For example clobbering 16 bytes of a 12 
byte struct only clobbers the 12 bytes of Dest (as expected), but 
clobbering 12 bytes of a 12 byte struct performs a full memory clobber.  
This presents a problem for general purpose routines like memset, where 
the actual size of the block cannot be hardcoded into the struct.

glibc (which uses this trick to avoid using the memory clobber) uses a 
size of 0xfffffff.  Being larger than 16, this always causes a full 
memory clobber, not the "memory constraint" clobber I assume they were 
expecting.  OTOH, since they don't use may_alias, this is may actually 
be a good thing...

While I really like the idea of using memory constraints to avoid all 
out memory clobbers, 16 bytes is a pretty small maximum memory block, 
and x32 only supports a max of 8.  Unless there's some way to use larger 
sizes (say SSIZE_MAX), this feature hardly seems worth documenting.

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-10-03  6:41         ` David Wohlferd
@ 2014-10-03  8:07           ` Hans-Peter Nilsson
  2014-10-06 11:09           ` Richard Biener
  1 sibling, 0 replies; 17+ messages in thread
From: Hans-Peter Nilsson @ 2014-10-03  8:07 UTC (permalink / raw)
  To: David Wohlferd; +Cc: gcc

On Thu, 2 Oct 2014, David Wohlferd wrote:
> > You want
> >
> > "=m" (*( struct foo  { char x[8]; } __attribute__((may_alias)) *)Dest)
>
> Thank you.  With your help, that worse-than-useless sample in the docs
> is getting closer to something people can actually use.

Thank *you* for your investigation.

> While I really like the idea of using memory constraints to avoid all
> out memory clobbers, 16 bytes is a pretty small maximum memory block,
> and x32 only supports a max of 8.  Unless there's some way to use larger
> sizes (say SSIZE_MAX), this feature hardly seems worth documenting.

Exactly.  Now, with it documented and all, we're (should be) on
the hook to actually maintain that...feature, so we don't break
it now that the tweaked version works.  The best practice is a
test-case in the test-suite.  With all the required tweaks I
presume we don't have one.  It seems grepping a gimple dump
works, with a scan-dump-based test (numerous examples; framework
is in place).  Could you please consider adding one?

(Or a patch to remove the feature from the docs since it hasn't
had the intended effect for some long time - my not-so-secret
actual point...)

brgds, H-P

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-10-03  6:41         ` David Wohlferd
  2014-10-03  8:07           ` Hans-Peter Nilsson
@ 2014-10-06 11:09           ` Richard Biener
  2014-11-13 12:14             ` David Wohlferd
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-10-06 11:09 UTC (permalink / raw)
  To: David Wohlferd; +Cc: y.gribov, gcc, Hans-Peter Nilsson, Gerald Pfeifer

On Fri, Oct 3, 2014 at 8:41 AM, David Wohlferd <dw@limegreensocks.com> wrote:
>
>> You want
>>
>> "=m" (*( struct foo  { char x[8]; } __attribute__((may_alias)) *)Dest)
>
>
> Thank you.  With your help, that worse-than-useless sample in the docs is
> getting closer to something people can actually use.
>
> Except for one last serious problem:  This trick only works for very
> specific (and very small) sizes of x (something else the existing docs fail
> to mention).
>
> On x64: Sizes of 1/2/4/8/16 correctly clobber Dest (and only Dest) as
> expected.  Trying to use any other values (5, 12, 32, etc) seem to force a
> full memory clobber.  This is the case REGARDLESS of the actual size of the
> underlying structure.  For example clobbering 16 bytes of a 12 byte struct
> only clobbers the 12 bytes of Dest (as expected), but clobbering 12 bytes of
> a 12 byte struct performs a full memory clobber.  This presents a problem
> for general purpose routines like memset, where the actual size of the block
> cannot be hardcoded into the struct.
>
> glibc (which uses this trick to avoid using the memory clobber) uses a size
> of 0xfffffff.  Being larger than 16, this always causes a full memory
> clobber, not the "memory constraint" clobber I assume they were expecting.
> OTOH, since they don't use may_alias, this is may actually be a good
> thing...
>
> While I really like the idea of using memory constraints to avoid all out
> memory clobbers, 16 bytes is a pretty small maximum memory block, and x32
> only supports a max of 8.  Unless there's some way to use larger sizes (say
> SSIZE_MAX), this feature hardly seems worth documenting.

I wonder how you figured out that a 12 byte clobber performs a full
memory clobber?  I'm quite sure it does not.

Richard.

> dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-10-06 11:09           ` Richard Biener
@ 2014-11-13 12:14             ` David Wohlferd
  2014-11-13 13:08               ` Richard Biener
  2014-11-13 14:02               ` Hans-Peter Nilsson
  0 siblings, 2 replies; 17+ messages in thread
From: David Wohlferd @ 2014-11-13 12:14 UTC (permalink / raw)
  To: Richard Biener, Hans-Peter Nilsson; +Cc: y.gribov, gcc, Gerald Pfeifer

Sorry for the (very) delayed response.  I'm still looking for feedback 
here so I can fix the docs.

To refresh: The topic of conversation was the (extremely) wrong 
explanation that has been in the docs since forever about how to use 
memory constraints with inline asm to avoid the performance hit of a 
full memory clobber.  Trying to understand how this really works has led 
to some surprising results.

Me:
 >> While I really like the idea of using memory constraints to avoid 
all out
 >> memory clobbers, 16 bytes is a pretty small maximum memory block, 
and x86
 >> only supports a max of 8.  Unless there's some way to use larger 
sizes (say
 >> SSIZE_MAX), this feature hardly seems worth documenting.

Richard:
 > I wonder how you figured out that a 12 byte clobber performs a full
 > memory clobber?

Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2, 
using -m64 -O2 -fdump-final-insns):

--------------------
#include <stdio.h>

#define MYSIZE 3

inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
    struct _reallybigstruct { char x[MYSIZE]; }
       *p = (struct _reallybigstruct *)Dest;

    __asm__ __volatile__ ("rep stos{b|b}"
       : "+D" (Dest), "+c" (Count), "=m" (*p)
       : [Data] "a" (Data)
       //: "memory"
    );
}

int main()
{
    unsigned char buff[100];
    buff[5] = 'A';

    __stosb(buff, 'B', sizeof(buff));
    printf("%c\n", buff[5]);
}
--------------------

In summary:

    1) Create a 100 byte buffer, and set buff[5] to 'A'.
    2) Call __stosb, which uses inline asm to overwrite all of buff with 
'B'.
    3) Use a memory constraint in __stosb to flush buff.  The size of 
the memory constraint is controlled by a #define.

With this, I have a simple way to test various sizes of memory 
constraints to see if the buffer gets flushed.  If it *is* flushing the 
buffer, printing buff[5] after __stosb will print 'B'.  If it is *not* 
flushing, it will print 'A'.

Results:
    - Since buff[5] is the 6th byte in the buffer, using memory 
constraint sizes of 1, 2 & 4 (not surprisingly) all print 'A', showing 
that no flush was done.
    - Sizes of 8 and 16 print 'B', showing that the flush was done. This 
is also the expected result, since I am now flushing enough of buff to 
include buff[5].
    - The surprise comes from using a size of 3 or 5.  These also print 
'B'.  WTF?  If 4 doesn't flush, why does 3?

I believe the answer comes from reading the RTL.  The difference between 
sizes of 3 and 16 comes here:

   (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
      (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 
S16 A128])
      (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [

    (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
          (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct 
*)&buff]+0 S3 A128])
          (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [

While I don't actually speak RTL, TI clearly refers to TIMode. 
Apparently when using a size that exactly matches a machine mode, asm 
memory references (on i386) can flush the exact number of bytes.  But 
for other sizes, gcc seems to falls back to BLK mode, which doesn't.

I don't know the exact meaning of BLK on a "set" or "asm_operands." Does 
it cause a full clobber?  Or just a complete clobber of buff? Attempting 
to answer that question leads us to the second bit of code:

--------------------
#include <stdio.h>

#define MYSIZE 8

inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
    struct _reallybigstruct { char x[MYSIZE]; }
       *p = (struct _reallybigstruct *)Dest;

    __asm__ __volatile__ ("rep stos{b|b}"
       : "+D" (Dest), "+c" (Count), "=m" (*p)
       : [Data] "a" (Data)
       //: "memory"
    );
}
int main()
{
    unsigned char buff[100], buff2[100];
    buff[5] = 'A';
    buff2[5] = 'M';
    asm("#" : : "r" (buff2));

    __stosb(buff, 'B', sizeof(buff));
    printf("%c %c\n", buff[5], buff2[5]);
}
--------------------

Here I've added a buff2, and I set buff2[5] to 'M' (aka ascii 77), which 
I also print.  I still perform the memory constraint against buff, then 
I check to see if it is affecting buff2.

I start by compiling this with a size of 8 and look at the -S output.  
If this is NOT doing a full clobber, gcc should be able to just print 
buff2[5] by moving 77 into the appropriate register before calling 
printf.  And indeed, that's what we see.

/APP
  # 17 "mem2.cpp" 1
         rep stosb
  # 0 "" 2
/NO_APP
         movzbl  37(%rsp), %edx
         movl    $77, %r8d
         leaq    .LC0(%rip), %rcx
         call    printf

If using a size of 3 *is* causing a full memory clobber, we would expect 
to see the value getting read from memory before calling printf.  And 
indeed, that's also what we see.

/APP
  # 17 "mem2.cpp" 1
         rep stosb
  # 0 "" 2
/NO_APP
         movzbl  37(%rsp), %edx
         leaq    .LC0(%rip), %rcx
         movzbl  149(%rsp), %r8d

I don't know the internals of gcc well enough to understand exactly why 
this is happening.  But from a user's point of view, it sure looks like 
a memory clobber.

As I said before, triggering a full memory clobber for anything over 16 
bytes (and most sizes under 16 bytes) makes this feature all but 
useless.  So if that's really what's happening, we need to decide what 
to do next:

1) Can this be "fixed?"
2) Do we want to doc the current behavior?
3) Or do we just remove this section?

I think it could be a nice performance win for inline asm if it could be 
made to work right, but I have no idea what might be involved in that.  
Failing that, I guess if it doesn't work and isn't going to work, I'd 
recommend removing the text for this feature.

Since all 3 suggestions require a doc change, I'll just say that I'm 
prepared to start work on the doc patch as soon as someone lets me know 
what the plan is.

Richard?  Hans-Peter?  Your thoughts?

Thanks,
dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-13 12:14             ` David Wohlferd
@ 2014-11-13 13:08               ` Richard Biener
  2014-11-14 10:05                 ` Segher Boessenkool
  2014-11-13 14:02               ` Hans-Peter Nilsson
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-11-13 13:08 UTC (permalink / raw)
  To: David Wohlferd; +Cc: Hans-Peter Nilsson, y.gribov, gcc, Gerald Pfeifer

On Thu, Nov 13, 2014 at 1:03 PM, David Wohlferd <dw@limegreensocks.com> wrote:
> Sorry for the (very) delayed response.  I'm still looking for feedback here
> so I can fix the docs.
>
> To refresh: The topic of conversation was the (extremely) wrong explanation
> that has been in the docs since forever about how to use memory constraints
> with inline asm to avoid the performance hit of a full memory clobber.
> Trying to understand how this really works has led to some surprising
> results.
>
> Me:
>>> While I really like the idea of using memory constraints to avoid all out
>>> memory clobbers, 16 bytes is a pretty small maximum memory block, and x86
>>> only supports a max of 8.  Unless there's some way to use larger sizes
>>> (say
>>> SSIZE_MAX), this feature hardly seems worth documenting.
>
> Richard:
>> I wonder how you figured out that a 12 byte clobber performs a full
>> memory clobber?
>
> Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2,
> using -m64 -O2 -fdump-final-insns):
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 3
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
>    struct _reallybigstruct { char x[MYSIZE]; }
>       *p = (struct _reallybigstruct *)Dest;
>
>    __asm__ __volatile__ ("rep stos{b|b}"
>       : "+D" (Dest), "+c" (Count), "=m" (*p)
>       : [Data] "a" (Data)
>       //: "memory"
>    );
> }
>
> int main()
> {
>    unsigned char buff[100];
>    buff[5] = 'A';
>
>    __stosb(buff, 'B', sizeof(buff));
>    printf("%c\n", buff[5]);
> }
> --------------------
>
> In summary:
>
>    1) Create a 100 byte buffer, and set buff[5] to 'A'.
>    2) Call __stosb, which uses inline asm to overwrite all of buff with 'B'.
>    3) Use a memory constraint in __stosb to flush buff.  The size of the
> memory constraint is controlled by a #define.
>
> With this, I have a simple way to test various sizes of memory constraints
> to see if the buffer gets flushed.  If it *is* flushing the buffer, printing
> buff[5] after __stosb will print 'B'.  If it is *not* flushing, it will
> print 'A'.
>
> Results:
>    - Since buff[5] is the 6th byte in the buffer, using memory constraint
> sizes of 1, 2 & 4 (not surprisingly) all print 'A', showing that no flush
> was done.
>    - Sizes of 8 and 16 print 'B', showing that the flush was done. This is
> also the expected result, since I am now flushing enough of buff to include
> buff[5].
>    - The surprise comes from using a size of 3 or 5.  These also print 'B'.
> WTF?  If 4 doesn't flush, why does 3?
>
> I believe the answer comes from reading the RTL.  The difference between
> sizes of 3 and 16 comes here:
>
>   (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
>      (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
>      (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
>    (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
>          (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
>          (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [
>
> While I don't actually speak RTL, TI clearly refers to TIMode. Apparently
> when using a size that exactly matches a machine mode, asm memory references
> (on i386) can flush the exact number of bytes.  But for other sizes, gcc
> seems to falls back to BLK mode, which doesn't.
>
> I don't know the exact meaning of BLK on a "set" or "asm_operands." Does it
> cause a full clobber?  Or just a complete clobber of buff? Attempting to
> answer that question leads us to the second bit of code:
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 8
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
>    struct _reallybigstruct { char x[MYSIZE]; }
>       *p = (struct _reallybigstruct *)Dest;
>
>    __asm__ __volatile__ ("rep stos{b|b}"
>       : "+D" (Dest), "+c" (Count), "=m" (*p)
>       : [Data] "a" (Data)
>       //: "memory"
>    );
> }
> int main()
> {
>    unsigned char buff[100], buff2[100];
>    buff[5] = 'A';
>    buff2[5] = 'M';
>    asm("#" : : "r" (buff2));
>
>    __stosb(buff, 'B', sizeof(buff));
>    printf("%c %c\n", buff[5], buff2[5]);
> }
> --------------------
>
> Here I've added a buff2, and I set buff2[5] to 'M' (aka ascii 77), which I
> also print.  I still perform the memory constraint against buff, then I
> check to see if it is affecting buff2.
>
> I start by compiling this with a size of 8 and look at the -S output.  If
> this is NOT doing a full clobber, gcc should be able to just print buff2[5]
> by moving 77 into the appropriate register before calling printf.  And
> indeed, that's what we see.
>
> /APP
>  # 17 "mem2.cpp" 1
>         rep stosb
>  # 0 "" 2
> /NO_APP
>         movzbl  37(%rsp), %edx
>         movl    $77, %r8d
>         leaq    .LC0(%rip), %rcx
>         call    printf
>
> If using a size of 3 *is* causing a full memory clobber, we would expect to
> see the value getting read from memory before calling printf.  And indeed,
> that's also what we see.
>
> /APP
>  # 17 "mem2.cpp" 1
>         rep stosb
>  # 0 "" 2
> /NO_APP
>         movzbl  37(%rsp), %edx
>         leaq    .LC0(%rip), %rcx
>         movzbl  149(%rsp), %r8d
>
> I don't know the internals of gcc well enough to understand exactly why this
> is happening.  But from a user's point of view, it sure looks like a memory
> clobber.
>
> As I said before, triggering a full memory clobber for anything over 16
> bytes (and most sizes under 16 bytes) makes this feature all but useless.
> So if that's really what's happening, we need to decide what to do next:
>
> 1) Can this be "fixed?"
> 2) Do we want to doc the current behavior?
> 3) Or do we just remove this section?
>
> I think it could be a nice performance win for inline asm if it could be
> made to work right, but I have no idea what might be involved in that.
> Failing that, I guess if it doesn't work and isn't going to work, I'd
> recommend removing the text for this feature.
>
> Since all 3 suggestions require a doc change, I'll just say that I'm
> prepared to start work on the doc patch as soon as someone lets me know what
> the plan is.
>
> Richard?  Hans-Peter?  Your thoughts?

Just from a very very quick look you miss:

>   (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
>      (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
>      (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
>    (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
>          (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
>          (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [

The memory attributes - the first one has 'S16' (size 16 bytes), the
2nd has 'S3' (size 3 bytes).  So the information is clearly there.
It might be that RTL alias analysis / CSE give up too early here
(we don't optimize across asm() on the GIMPLE level at all ... heh).

I didn't look where it gives up (even though appearantly it does).

Richard.

> Thanks,
> dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-13 12:14             ` David Wohlferd
  2014-11-13 13:08               ` Richard Biener
@ 2014-11-13 14:02               ` Hans-Peter Nilsson
  2014-11-13 14:15                 ` Richard Biener
  1 sibling, 1 reply; 17+ messages in thread
From: Hans-Peter Nilsson @ 2014-11-13 14:02 UTC (permalink / raw)
  To: David Wohlferd; +Cc: Richard Biener, y.gribov, gcc, Gerald Pfeifer

On Thu, 13 Nov 2014, David Wohlferd wrote:
> Sorry for the (very) delayed response.  I'm still looking for feedback here so
> I can fix the docs.

Thank you for your diligence.

> As I said before, triggering a full memory clobber for anything over 16 bytes
> (and most sizes under 16 bytes) makes this feature all but useless.  So if
> that's really what's happening, we need to decide what to do next:
>
> 1) Can this be "fixed?"
> 2) Do we want to doc the current behavior?
> 3) Or do we just remove this section?
>
> I think it could be a nice performance win for inline asm if it could be made
> to work right, but I have no idea what might be involved in that.  Failing
> that, I guess if it doesn't work and isn't going to work, I'd recommend
> removing the text for this feature.
>
> Since all 3 suggestions require a doc change, I'll just say that I'm prepared
> to start work on the doc patch as soon as someone lets me know what the plan
> is.
>
> Richard?  Hans-Peter?  Your thoughts?

I've forgot if someone mentioned whether we have a test-case in
our test-suite for this feature.  If we don't, then 3; removal.
If we do, I guess it's flawed or at least not agreeing with the
documentation?  Then it *might* be worth the effort fixing that
and additional test-coverage (depending on the person stepping
up...) but 3 is IMHO still an arguably sane option.

brgds, H-P

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-13 14:02               ` Hans-Peter Nilsson
@ 2014-11-13 14:15                 ` Richard Biener
  2014-11-14  1:18                   ` David Wohlferd
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-11-13 14:15 UTC (permalink / raw)
  To: Hans-Peter Nilsson; +Cc: David Wohlferd, y.gribov, gcc, Gerald Pfeifer

On Thu, Nov 13, 2014 at 2:53 PM, Hans-Peter Nilsson <hp@bitrange.com> wrote:
> On Thu, 13 Nov 2014, David Wohlferd wrote:
>> Sorry for the (very) delayed response.  I'm still looking for feedback here so
>> I can fix the docs.
>
> Thank you for your diligence.
>
>> As I said before, triggering a full memory clobber for anything over 16 bytes
>> (and most sizes under 16 bytes) makes this feature all but useless.  So if
>> that's really what's happening, we need to decide what to do next:
>>
>> 1) Can this be "fixed?"
>> 2) Do we want to doc the current behavior?
>> 3) Or do we just remove this section?
>>
>> I think it could be a nice performance win for inline asm if it could be made
>> to work right, but I have no idea what might be involved in that.  Failing
>> that, I guess if it doesn't work and isn't going to work, I'd recommend
>> removing the text for this feature.
>>
>> Since all 3 suggestions require a doc change, I'll just say that I'm prepared
>> to start work on the doc patch as soon as someone lets me know what the plan
>> is.
>>
>> Richard?  Hans-Peter?  Your thoughts?
>
> I've forgot if someone mentioned whether we have a test-case in
> our test-suite for this feature.  If we don't, then 3; removal.
> If we do, I guess it's flawed or at least not agreeing with the
> documentation?  Then it *might* be worth the effort fixing that
> and additional test-coverage (depending on the person stepping
> up...) but 3 is IMHO still an arguably sane option.

Well, as what is missing is just an optimization I'd say we should
try to fix it.  And surely the docs should not promise that optimization
will happen - it should just mention that doing this might allow
optimization to happen.

Richard.

> brgds, H-P

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-13 14:15                 ` Richard Biener
@ 2014-11-14  1:18                   ` David Wohlferd
  2014-11-14  9:56                     ` dw
  2014-11-14 10:15                     ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: David Wohlferd @ 2014-11-14  1:18 UTC (permalink / raw)
  To: Richard Biener, Hans-Peter Nilsson; +Cc: y.gribov, gcc, Gerald Pfeifer

On 11/13/2014 6:02 AM, Richard Biener wrote:
> On Thu, Nov 13, 2014 at 2:53 PM, Hans-Peter Nilsson <hp@bitrange.com> wrote:
>> On Thu, 13 Nov 2014, David Wohlferd wrote:
>>> Sorry for the (very) delayed response.  I'm still looking for feedback here so
>>> I can fix the docs.
>> Thank you for your diligence.
>>
>>> As I said before, triggering a full memory clobber for anything over 16 bytes
>>> (and most sizes under 16 bytes) makes this feature all but useless.  So if
>>> that's really what's happening, we need to decide what to do next:
>>>
>>> 1) Can this be "fixed?"
>>> 2) Do we want to doc the current behavior?
>>> 3) Or do we just remove this section?
>>>
>>> I think it could be a nice performance win for inline asm if it could be made
>>> to work right, but I have no idea what might be involved in that.  Failing
>>> that, I guess if it doesn't work and isn't going to work, I'd recommend
>>> removing the text for this feature.
>>>
>>> Since all 3 suggestions require a doc change, I'll just say that I'm prepared
>>> to start work on the doc patch as soon as someone lets me know what the plan
>>> is.
>>>
>>> Richard?  Hans-Peter?  Your thoughts?
>> I've forgot if someone mentioned whether we have a test-case in
>> our test-suite for this feature.

I'm looking thru gcc/testsuite/*.c to see if I can spot anything. It's 
not easy since there is a lot of asm and the people who write these are 
apparently allergic to using comments to describe what they are testing.

>> If we don't, then 3; removal.
>> If we do, I guess it's flawed or at least not agreeing with the
>> documentation?  Then it *might* be worth the effort fixing that
>> and additional test-coverage (depending on the person stepping
>> up...) but 3 is IMHO still an arguably sane option.
> Well, as what is missing is just an optimization I'd say we should
> try to fix it.

While I'd love to be the one to fix this, the fact of the matter is that 
most of gcc is a black box to me.  Even if you told me roughly where to 
start, I'd have no idea of the downstream impacts of anything I changed.

So while I understand that it looks like I'm just finding work for other 
people, fixing something like this is simply beyond me.  That said, I'm 
certainly prepared to outline what I see as the interesting test cases 
and to do some testing if someone else is willing to step up and do this 
optimization.

> And surely the docs should not promise that optimization
> will happen - it should just mention that doing this might allow
> optimization to happen.

I can agree with this.  I am quite confident there will be occasions 
where gcc has no option but to fall back to doing a full clobber to 
ensure correct function (a possibility which the current docs also fails 
to mention).  So yes, the docs should be limited in what it promises here.

Which brings us to the question: what do we do now?  The 15th is fast 
approaching.  Can something like this get done before then? Can it be 
checked in for 5.0 after the 15th?  Or does it need to wait for 6.0?

If it does need to wait for 6.0, what do we want to do with the docs in 
the meantime?  Given how wrong they are currently, I'd hate to ship yet 
another release with that ugly text.  But trying to describe the best 
way to take advantage of optimizations that haven't been written yet 
is... hard.

Since (as I understand it) 5.0 docs *can* be checked in after the 15th, 
my recommendations:

  - If someone is prepared to step up and do this work for v5.0, then 
I'll wait and write the docs when they are done and can describe how it 
works.
  - If this is going to wait for 6.0, then if someone does (at least) 
enough investigative work to be able to describe how this will 
eventually work, I'll update the 5.0 docs in a general way talking about 
ways gcc *may* be able to optimize.  It should be possible to phrase 
this so code people write today will work even better tomorrow.
  - Worst case is if no one has the time to look at this for the 
foreseeable future.  In that case, I'm with Hans-Peter.  Let's take the 
existing text out.  Following the existing text makes things *worse*, 
and the current implementation is so limited that I'd be surprised if 
anyone's code actually uses it successfully.  New text can get added 
when the new code is.

Hmm.  I just had a thought: Is it possible the problem I'm seeing here 
is platform-specific?  Maybe this works perfectly for non-i386 code?  
That would certainly change my recommendations here.

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-14  1:18                   ` David Wohlferd
@ 2014-11-14  9:56                     ` dw
  2014-11-14 10:15                     ` Richard Biener
  1 sibling, 0 replies; 17+ messages in thread
From: dw @ 2014-11-14  9:56 UTC (permalink / raw)
  To: Richard Biener, Hans-Peter Nilsson; +Cc: y.gribov, gcc, Gerald Pfeifer


>>> I've forgot if someone mentioned whether we have a test-case in
>>> our test-suite for this feature.
>
> I'm looking thru gcc/testsuite/*.c to see if I can spot anything. It's 
> not easy since there is a lot of asm and the people who write these 
> are apparently allergic to using comments to describe what they are 
> testing.

So, I found a few tests that were *using* this feature.  But they seem 
to be checking for an ICE or page fault, rather than checking to see if 
the generated code was avoiding the memory clobber.

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-13 13:08               ` Richard Biener
@ 2014-11-14 10:05                 ` Segher Boessenkool
  0 siblings, 0 replies; 17+ messages in thread
From: Segher Boessenkool @ 2014-11-14 10:05 UTC (permalink / raw)
  To: Richard Biener
  Cc: David Wohlferd, Hans-Peter Nilsson, y.gribov, gcc, Gerald Pfeifer

On Thu, Nov 13, 2014 at 01:14:41PM +0100, Richard Biener wrote:
> It might be that RTL alias analysis / CSE give up too early here
> (we don't optimize across asm() on the GIMPLE level at all ... heh).
> 
> I didn't look where it gives up (even though appearantly it does).

The cse1 pass doesn't carry the store forward in the "3" case, while it
does in the "4" case.  So yes it is just a missed optimisation.


Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-14  1:18                   ` David Wohlferd
  2014-11-14  9:56                     ` dw
@ 2014-11-14 10:15                     ` Richard Biener
  2014-11-16 15:51                       ` David Wohlferd
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2014-11-14 10:15 UTC (permalink / raw)
  To: David Wohlferd; +Cc: Hans-Peter Nilsson, y.gribov, gcc, Gerald Pfeifer

On Fri, Nov 14, 2014 at 12:51 AM, David Wohlferd <dw@limegreensocks.com> wrote:
>
> On 11/13/2014 6:02 AM, Richard Biener wrote:
>>
>> On Thu, Nov 13, 2014 at 2:53 PM, Hans-Peter Nilsson <hp@bitrange.com>
>> wrote:
>>>
>>> On Thu, 13 Nov 2014, David Wohlferd wrote:
>>>>
>>>> Sorry for the (very) delayed response.  I'm still looking for feedback
>>>> here so
>>>> I can fix the docs.
>>>
>>> Thank you for your diligence.
>>>
>>>> As I said before, triggering a full memory clobber for anything over 16
>>>> bytes
>>>> (and most sizes under 16 bytes) makes this feature all but useless.  So
>>>> if
>>>> that's really what's happening, we need to decide what to do next:
>>>>
>>>> 1) Can this be "fixed?"
>>>> 2) Do we want to doc the current behavior?
>>>> 3) Or do we just remove this section?
>>>>
>>>> I think it could be a nice performance win for inline asm if it could be
>>>> made
>>>> to work right, but I have no idea what might be involved in that.
>>>> Failing
>>>> that, I guess if it doesn't work and isn't going to work, I'd recommend
>>>> removing the text for this feature.
>>>>
>>>> Since all 3 suggestions require a doc change, I'll just say that I'm
>>>> prepared
>>>> to start work on the doc patch as soon as someone lets me know what the
>>>> plan
>>>> is.
>>>>
>>>> Richard?  Hans-Peter?  Your thoughts?
>>>
>>> I've forgot if someone mentioned whether we have a test-case in
>>> our test-suite for this feature.
>
>
> I'm looking thru gcc/testsuite/*.c to see if I can spot anything. It's not
> easy since there is a lot of asm and the people who write these are
> apparently allergic to using comments to describe what they are testing.
>
>>> If we don't, then 3; removal.
>>> If we do, I guess it's flawed or at least not agreeing with the
>>> documentation?  Then it *might* be worth the effort fixing that
>>> and additional test-coverage (depending on the person stepping
>>> up...) but 3 is IMHO still an arguably sane option.
>>
>> Well, as what is missing is just an optimization I'd say we should
>> try to fix it.
>
>
> While I'd love to be the one to fix this, the fact of the matter is that
> most of gcc is a black box to me.  Even if you told me roughly where to
> start, I'd have no idea of the downstream impacts of anything I changed.
>
> So while I understand that it looks like I'm just finding work for other
> people, fixing something like this is simply beyond me.  That said, I'm
> certainly prepared to outline what I see as the interesting test cases and
> to do some testing if someone else is willing to step up and do this
> optimization.
>
>> And surely the docs should not promise that optimization
>> will happen - it should just mention that doing this might allow
>> optimization to happen.
>
>
> I can agree with this.  I am quite confident there will be occasions where
> gcc has no option but to fall back to doing a full clobber to ensure correct
> function (a possibility which the current docs also fails to mention).  So
> yes, the docs should be limited in what it promises here.
>
> Which brings us to the question: what do we do now?  The 15th is fast
> approaching.  Can something like this get done before then? Can it be
> checked in for 5.0 after the 15th?  Or does it need to wait for 6.0?

Can you please file a bug in bugzilla if you haven't already done so?
We can still fix bugs during the next three months ;)

> If it does need to wait for 6.0, what do we want to do with the docs in the
> meantime?  Given how wrong they are currently, I'd hate to ship yet another
> release with that ugly text.  But trying to describe the best way to take
> advantage of optimizations that haven't been written yet is... hard.
>
> Since (as I understand it) 5.0 docs *can* be checked in after the 15th, my
> recommendations:
>
>  - If someone is prepared to step up and do this work for v5.0, then I'll
> wait and write the docs when they are done and can describe how it works.

We can at least improve the docs and fix the bogus examples therein.

>  - If this is going to wait for 6.0, then if someone does (at least) enough
> investigative work to be able to describe how this will eventually work,
> I'll update the 5.0 docs in a general way talking about ways gcc *may* be
> able to optimize.  It should be possible to phrase this so code people write
> today will work even better tomorrow.
>  - Worst case is if no one has the time to look at this for the foreseeable
> future.  In that case, I'm with Hans-Peter.  Let's take the existing text
> out.  Following the existing text makes things *worse*, and the current
> implementation is so limited that I'd be surprised if anyone's code actually
> uses it successfully.  New text can get added when the new code is.
>
> Hmm.  I just had a thought: Is it possible the problem I'm seeing here is
> platform-specific?  Maybe this works perfectly for non-i386 code?  That
> would certainly change my recommendations here.

I'd say that's highly unlikely.

Richard.

> dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work
  2014-11-14 10:15                     ` Richard Biener
@ 2014-11-16 15:51                       ` David Wohlferd
  0 siblings, 0 replies; 17+ messages in thread
From: David Wohlferd @ 2014-11-16 15:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hans-Peter Nilsson, y.gribov, gcc, Gerald Pfeifer


> Can you please file a bug in bugzilla if you haven't already done so?
> We can still fix bugs during the next three months ;)

Well, I opened one.  Briefly.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63900

dw

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-11-16  3:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24  8:11 [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work David Wohlferd
2014-09-24  8:48 ` Richard Biener
2014-09-25 20:48   ` Yury Gribov
2014-09-26 10:50     ` David Wohlferd
2014-09-26 14:47       ` Richard Biener
2014-10-03  6:41         ` David Wohlferd
2014-10-03  8:07           ` Hans-Peter Nilsson
2014-10-06 11:09           ` Richard Biener
2014-11-13 12:14             ` David Wohlferd
2014-11-13 13:08               ` Richard Biener
2014-11-14 10:05                 ` Segher Boessenkool
2014-11-13 14:02               ` Hans-Peter Nilsson
2014-11-13 14:15                 ` Richard Biener
2014-11-14  1:18                   ` David Wohlferd
2014-11-14  9:56                     ` dw
2014-11-14 10:15                     ` Richard Biener
2014-11-16 15:51                       ` David Wohlferd

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).