[PATCH] Asm memory constraints

public inbox for gcc-patches@gcc.gnu.org

* [PATCH] Asm memory constraints
@ 2017-08-18 17:51 Alan Modra
  2017-08-21  0:59 ` Segher Boessenkool
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Modra @ 2017-08-18 17:51 UTC (permalink / raw)
  To: gcc-patches

This patch adds some documentation on asm memory constraints, aimed
especially at constraints for arrays.  I may have invented something
new here as I've never seen "=m" (*(T (*)[]) ptr) used before.
So this isn't simply a documentation patch.  It needs blessing from a
global maintainer, I think, as to whether this is a valid approach and
something that gcc ought to continue supporting.  My poking around the
code and looking at dumps convinced me that it's OK.
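
The same array cast also works for an asm output. Below is a minimal sketch that assumes GCC on x86 or x86-64 (AT&T syntax). fill_bytes is a hypothetical helper, not part of the patch. A rep stosb fill declares its whole destination with "=m" (*(unsigned char (*)[n]) dst) instead of clobbering "memory". Other targets fall back to memset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with val.  The "=m" operand
   tells GCC that exactly the n-byte array at dst is written, so no
   "memory" clobber is needed.  n must be nonzero: the cast creates a
   variable-length array type, and such a type may not have size 0.  */
static void
fill_bytes (unsigned char *dst, unsigned char val, size_t n)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned char *d = dst;   /* rep stosb advances %rdi/%edi ...      */
  size_t cnt = n;           /* ... and counts %rcx/%ecx down to 0.   */

  /* Assumes the direction flag is clear, as the x86 psABIs require
     on function entry.  */
  __asm__ ("rep stosb"
           : "=m" (*(unsigned char (*)[n]) dst), "+D" (d), "+c" (cnt)
           : "a" (val));
#else
  memset (dst, val, n);
#endif
}
```

Only the n bytes at dst are declared as written. GCC therefore remains free to keep unrelated memory values in registers across the asm. It must still reload anything inside the array after the asm.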

	PR inline-asm/81890
	* doc/extend.texi (Clobbers): Correct vax example.  Delete old
	example of a memory input for a string of known length.  Move
	commentary out of table.  Add a number of new examples
	covering array memory inputs and outputs.
testsuite/
	* gcc.target/i386/asm-mem.c: New test.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 93d542d..224518f 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8755,7 +8755,7 @@ registers:
 asm volatile ("movc3 %0, %1, %2"
                    : /* No outputs. */
                    : "g" (from), "g" (to), "g" (count)
-                   : "r0", "r1", "r2", "r3", "r4", "r5");
+                   : "r0", "r1", "r2", "r3", "r4", "r5", "memory");
 @end example
 
 Also, there are two special clobber arguments:
@@ -8786,14 +8786,75 @@ Note that this clobber does not prevent the @emph{processor} from doing
 speculative reads past the @code{asm} statement. To prevent that, you need 
 processor-specific fence instructions.
 
-Flushing registers to memory has performance implications and may be an issue 
-for time-sensitive code.  You can use a trick to avoid this if the size of 
-the memory being accessed is known at compile time. For example, if accessing 
-ten bytes of a string, use a memory input like: 
+@end table
 
-@code{@{"m"( (@{ struct @{ char x[10]; @} *p = (void *)ptr ; *p; @}) )@}}.
+Flushing registers to memory has performance implications and may be
+an issue for time-sensitive code.  You can provide better information
+to GCC to avoid this, as shown in the following examples.  At a
+minimum, aliasing rules allow GCC to know what memory @emph{doesn't}
+need to be flushed.  Also, if GCC can prove that all of the outputs of
+a non-volatile @code{asm} statement are unused, then the @code{asm}
+may be deleted.  Removal of otherwise dead @code{asm} statements will
+not happen if they clobber @code{"memory"}.
 
-@end table
+Here is a fictitious sum of squares instruction that takes two
+pointers to floating-point values in memory and produces a
+floating-point register output.
+Notice that @code{x} and @code{y} both appear twice in the @code{asm}
+parameters, once to specify memory accessed, and once to specify a
+base register used by the @code{asm}.  You won't normally be wasting a
+register by doing this as GCC can use the same register for both
+purposes.  However, it would be foolish to use both @code{%1} and
+@code{%3} for @code{x} in this @code{asm} and expect them to be the
+same.  In fact, @code{%3} may well not even be a register.  It might
+be a symbolic memory reference to the object pointed to by @code{x}.
+
+@smallexample
+asm ("sumsq %0, %1, %2"
+     : "+f" (result)
+     : "r" (x), "r" (y), "m" (*x), "m" (*y));
+@end smallexample
+
+Here is a fictitious @code{*z++ = *x++ * *y++} instruction.
+Notice that the @code{x}, @code{y} and @code{z} pointer registers
+must be specified as input/output because the @code{asm} modifies
+them.
+
+@smallexample
+asm ("vecmul %0, %1, %2"
+     : "+r" (z), "+r" (x), "+r" (y), "=m" (*z)
+     : "m" (*x), "m" (*y));
+@end smallexample
+
+An x86 example where the string memory argument is of unknown length.
+
+@smallexample
+asm("repne scasb"
+    : "=c" (count), "+D" (p)
+    : "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
+@end smallexample
+
+If you know the above will only be reading a ten-byte array, then you
+could instead use a memory input like:
+@code{"m" (*(const char (*)[10]) p)}.
+
+Here is an example of a PowerPC vector scale implemented in assembly,
+complete with vector and condition code clobbers, and some initialized
+offset registers that are unchanged by the @code{asm}.
+
+@smallexample
+void
+dscal (size_t n, double *x, double alpha)
+@{
+  asm ("/* lots of asm here */"
+       : "+m" (*(double (*)[n]) x), "+r" (n), "+b" (x)
+       : "d" (alpha), "b" (32), "b" (48), "b" (64),
+         "b" (80), "b" (96), "b" (112)
+       : "cr0",
+         "vs32","vs33","vs34","vs35","vs36","vs37","vs38","vs39",
+         "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47");
+@}
+@end smallexample
 
 @anchor{GotoLabels}
 @subsubsection Goto Labels
diff --git a/gcc/testsuite/gcc.target/i386/asm-mem.c b/gcc/testsuite/gcc.target/i386/asm-mem.c
new file mode 100644
index 0000000..01522fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/asm-mem.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O3" } */
+
+/* Check that "m" array references are effective in preventing the
+   array initialization from wandering past a use in the asm.  */
+
+static int
+f1 (const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+static int
+f2 (const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[48]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+static int
+f3 (int n, const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[n]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+int
+main ()
+{
+  int a;
+  char buff[48] = "hello world";
+  buff[4] = 0;
+  a = f1 (buff);
+  if (a != 4)
+    __builtin_abort ();
+  buff[4] = 'o';
+  a = f2 (buff);
+  if (a != 11)
+    __builtin_abort ();
+  buff[4] = 0;
+  a = f3 (48, buff);
+  if (a != 4)
+    __builtin_abort ();
+  return 0;
+}

-- 
Alan Modra
Australia Development Lab, IBM


* Re: [PATCH] Asm memory constraints
  2017-08-18 17:51 [PATCH] Asm memory constraints Alan Modra
@ 2017-08-21  0:59 ` Segher Boessenkool
  2017-08-21  7:36   ` Alan Modra
  0 siblings, 1 reply; 10+ messages in thread
From: Segher Boessenkool @ 2017-08-21  0:59 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches

Hi Alan,

On Sat, Aug 19, 2017 at 12:19:35AM +0930, Alan Modra wrote:
> +Flushing registers to memory has performance implications and may be
> +an issue for time-sensitive code.  You can provide better information
> +to GCC to avoid this, as shown in the following examples.  At a
> +minimum, aliasing rules allow GCC to know what memory @emph{doesn't}
> +need to be flushed.  Also, if GCC can prove that all of the outputs of
> +a non-volatile @code{asm} statement are unused, then the @code{asm}
> +may be deleted.  Removal of otherwise dead @code{asm} statements will
> +not happen if they clobber @code{"memory"}.

void f(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x) : "memory"); }
void g(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x)); }

Both f and g are completely removed by the first jump pass immediately
after expand (via delete_trivially_dead_insns).

Do you have a testcase for the behaviour you saw?


Segher


* Re: [PATCH] Asm memory constraints
  2017-08-21  0:59 ` Segher Boessenkool
@ 2017-08-21  7:36   ` Alan Modra
  2017-08-21  7:44     ` Clobbers and Scratch Registers Alan Modra
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Alan Modra @ 2017-08-21  7:36 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

On Sun, Aug 20, 2017 at 08:00:53AM -0500, Segher Boessenkool wrote:
> Hi Alan,
> 
> On Sat, Aug 19, 2017 at 12:19:35AM +0930, Alan Modra wrote:
> > +Flushing registers to memory has performance implications and may be
> > +an issue for time-sensitive code.  You can provide better information
> > +to GCC to avoid this, as shown in the following examples.  At a
> > +minimum, aliasing rules allow GCC to know what memory @emph{doesn't}
> > +need to be flushed.  Also, if GCC can prove that all of the outputs of
> > +a non-volatile @code{asm} statement are unused, then the @code{asm}
> > +may be deleted.  Removal of otherwise dead @code{asm} statements will
> > +not happen if they clobber @code{"memory"}.
> 
> void f(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x) : "memory"); }
> void g(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x)); }
> 
> Both f and g are completely removed by the first jump pass immediately
> after expand (via delete_trivially_dead_insns).
> 
> Do you have a testcase for the behaviour you saw?

Oh my.  I was sure that was how "memory" worked!  I see though that
every gcc I have lying around, going all the way back to gcc-2.95,
deletes the asm in your testcase.  I definitely don't want to put
something in the docs that is plain wrong, or just my idea of how
things ought to work, so the last two sentences quoted above need to
go.  Thanks for the correction.
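
Here is the corrected rule in runnable form. This is a sketch that assumes GCC on x86 or x86-64, and set_flag is a hypothetical name. The asm stores through a pointer while its register output z goes unused. As Segher's testcase shows, a "memory" clobber alone does not stop GCC from deleting a non-volatile asm whose outputs are dead, so this asm is marked volatile.

```c
/* Hypothetical helper: store 1 to *flag from inside the asm.  The
   register output z is never used afterwards, so without "volatile"
   GCC may delete the whole asm, even though it clobbers "memory".  */
static void
set_flag (int *flag)
{
#if defined (__x86_64__) || defined (__i386__)
  int z;
  __asm__ __volatile__ ("movl $1, (%1)\n\t"   /* the store GCC can't see */
                        "xorl %0, %0"         /* z written after %1 is used */
                        : "=r" (z)
                        : "r" (flag)
                        : "memory", "cc");
  (void) z;
#else
  *flag = 1;
#endif
}
```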

Fixed in this revised patch.  The only controversial aspect now should
be whether those array casts ought to be officially blessed.  I've
checked that "=m" (*(T (*)[]) ptr), "=m" (*(T (*)[n]) ptr), and
"=m" (*(T (*)[10]) ptr), all generate reasonable MEM_ATTRS handled
apparently properly by alias.c and other code.

For example, at -O3 the following shows gcc moving the read of "val"
before the asm, while an asm using a "memory" clobber forces the read
to occur after the asm.

static int
f (double *x)
{
  int res;
  asm ("#%0 %1 %2" : "=r" (res) : "r" (x), "m" (*(double (*)[]) x));
  return res;
}

int val = 123;
double foo[10];

int
main ()
{
  int b = f (foo);
  __builtin_printf ("%d %d\n", val, b);
  return 0;
}


I'm also encouraged by comments like the following by rth in 2004
(gcc/c/c-typeck.c), which say that using non-kosher lvalues in memory
output constraints must continue to be supported.

      /* ??? Really, this should not be here.  Users should be using a
	 proper lvalue, dammit.  But there's a long history of using casts
	 in the output operands.  In cases like longlong.h, this becomes a
	 primitive form of typechecking -- if the cast can be removed, then
	 the output operand had a type of the proper width; otherwise we'll
	 get an error.  Gross, but ...  */
      STRIP_NOPS (output);


	* doc/extend.texi (Clobbers): Correct vax example.  Delete old
	example of a memory input for a string of known length.  Move
	commentary out of table.  Add a number of new examples
	covering array memory inputs.
testsuite/
	* gcc.target/i386/asm-mem.c: New test.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 649be01..940490e 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8755,7 +8755,7 @@ registers:
 asm volatile ("movc3 %0, %1, %2"
                    : /* No outputs. */
                    : "g" (from), "g" (to), "g" (count)
-                   : "r0", "r1", "r2", "r3", "r4", "r5");
+                   : "r0", "r1", "r2", "r3", "r4", "r5", "memory");
 @end example
 
 Also, there are two special clobber arguments:
@@ -8786,14 +8786,72 @@ Note that this clobber does not prevent the @emph{processor} from doing
 speculative reads past the @code{asm} statement. To prevent that, you need 
 processor-specific fence instructions.
 
-Flushing registers to memory has performance implications and may be an issue 
-for time-sensitive code.  You can use a trick to avoid this if the size of 
-the memory being accessed is known at compile time. For example, if accessing 
-ten bytes of a string, use a memory input like: 
+@end table
 
-@code{@{"m"( (@{ struct @{ char x[10]; @} *p = (void *)ptr ; *p; @}) )@}}.
+Flushing registers to memory has performance implications and may be
+an issue for time-sensitive code.  You can provide better information
+to GCC to avoid this, as shown in the following examples.  At a
+minimum, aliasing rules allow GCC to know what memory @emph{doesn't}
+need to be flushed.
 
-@end table
+Here is a fictitious sum of squares instruction that takes two
+pointers to floating-point values in memory and produces a
+floating-point register output.
+Notice that @code{x} and @code{y} both appear twice in the @code{asm}
+parameters, once to specify memory accessed, and once to specify a
+base register used by the @code{asm}.  You won't normally be wasting a
+register by doing this as GCC can use the same register for both
+purposes.  However, it would be foolish to use both @code{%1} and
+@code{%3} for @code{x} in this @code{asm} and expect them to be the
+same.  In fact, @code{%3} may well not be a register.  It might be a
+symbolic memory reference to the object pointed to by @code{x}.
+
+@smallexample
+asm ("sumsq %0, %1, %2"
+     : "+f" (result)
+     : "r" (x), "r" (y), "m" (*x), "m" (*y));
+@end smallexample
+
+Here is a fictitious @code{*z++ = *x++ * *y++} instruction.
+Notice that the @code{x}, @code{y} and @code{z} pointer registers
+must be specified as input/output because the @code{asm} modifies
+them.
+
+@smallexample
+asm ("vecmul %0, %1, %2"
+     : "+r" (z), "+r" (x), "+r" (y), "=m" (*z)
+     : "m" (*x), "m" (*y));
+@end smallexample
+
+An x86 example where the string memory argument is of unknown length.
+
+@smallexample
+asm("repne scasb"
+    : "=c" (count), "+D" (p)
+    : "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
+@end smallexample
+
+If you know the above will only be reading a ten-byte array, then you
+could instead use a memory input like:
+@code{"m" (*(const char (*)[10]) p)}.
+
+Here is an example of a PowerPC vector scale implemented in assembly,
+complete with vector and condition code clobbers, and some initialized
+offset registers that are unchanged by the @code{asm}.
+
+@smallexample
+void
+dscal (size_t n, double *x, double alpha)
+@{
+  asm ("/* lots of asm here */"
+       : "+m" (*(double (*)[n]) x), "+r" (n), "+b" (x)
+       : "d" (alpha), "b" (32), "b" (48), "b" (64),
+         "b" (80), "b" (96), "b" (112)
+       : "cr0",
+         "vs32","vs33","vs34","vs35","vs36","vs37","vs38","vs39",
+         "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47");
+@}
+@end smallexample
 
 @anchor{GotoLabels}
 @subsubsection Goto Labels
diff --git a/gcc/testsuite/gcc.target/i386/asm-mem.c b/gcc/testsuite/gcc.target/i386/asm-mem.c
new file mode 100644
index 0000000..89b713f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/asm-mem.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-options "-O3" } */
+
+/* Check that "m" array references are effective in preventing the
+   array initialization from wandering past a use in the asm, and
+   that the casts remain supported.  */
+
+static int
+f1 (const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+static int
+f2 (const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[48]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+static int
+f3 (int n, const char *p)
+{
+  int count;
+
+  __asm__ ("repne scasb"
+	   : "=c" (count), "+D" (p)
+	   : "m" (*(const char (*)[n]) p), "0" (-1), "a" (0));
+  return -2 - count;
+}
+
+int
+main ()
+{
+  int a;
+  char buff[48] = "hello world";
+  buff[4] = 0;
+  a = f1 (buff);
+  if (a != 4)
+    __builtin_abort ();
+  buff[4] = 'o';
+  a = f2 (buff);
+  if (a != 11)
+    __builtin_abort ();
+  buff[4] = 0;
+  a = f3 (48, buff);
+  if (a != 4)
+    __builtin_abort ();
+  return 0;
+}

-- 
Alan Modra
Australia Development Lab, IBM


* Clobbers and Scratch Registers
  2017-08-21  7:36   ` Alan Modra
@ 2017-08-21  7:44     ` Alan Modra
  2017-08-21 19:03       ` Richard Sandiford
  2017-09-29  1:06     ` [PATCH] Asm memory constraints Alan Modra
  2017-10-12 18:27     ` Jeff Law
  2 siblings, 1 reply; 10+ messages in thread
From: Alan Modra @ 2017-08-21  7:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: Sandra Loosemore

This is a revised version of
https://gcc.gnu.org/ml/gcc-patches/2017-03/msg01562.html limited to
showing just the scratch register aspect, as a followup to
https://gcc.gnu.org/ml/gcc-patches/2017-08/msg01174.html 
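
Here is a minimal x86 sketch of the scratch-register technique. It assumes GCC on x86 or x86-64 (AT&T syntax), and sum_times is a hypothetical name. tmp is an early-clobber output used as a scratch instead of a clobbered fixed register. res is a plain output because it is written only after every input has been read.

```c
/* Hypothetical helper: return (a + b) * c, using a compiler-chosen
   scratch register rather than clobbering a fixed one.  */
static long
sum_times (long a, long b, long c)
{
  long res;
#if defined (__x86_64__) || defined (__i386__)
  long tmp;
  __asm__ ("mov %2, %1\n\t"    /* tmp = a: written before b, c are read */
           "add %3, %1\n\t"    /* tmp += b */
           "imul %4, %1\n\t"   /* tmp *= c */
           "mov %1, %0"        /* res = tmp: all inputs consumed by now */
           : "=r" (res), "=&r" (tmp)
           : "r" (a), "r" (b), "r" (c)
           : "cc");
#else
  res = (a + b) * c;
#endif
  return res;
}
```

Because res lacks "&", GCC may hand it the same register as one of the inputs. That is safe here only because the final mov comes after the last input use.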

	* doc/extend.texi (Extended Asm <Clobbers>): Rename to
	"Clobbers and Scratch Registers".  Add paragraph on
	alternative to clobbers for scratch registers and OpenBLAS
	example.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 940490e..0637672 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8075,7 +8075,7 @@ A comma-separated list of C expressions read by the instructions in the
 @item Clobbers
 A comma-separated list of registers or other values changed by the 
 @var{AssemblerTemplate}, beyond those listed as outputs.
-An empty list is permitted.  @xref{Clobbers}.
+An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
 
 @item GotoLabels
 When you are using the @code{goto} form of @code{asm}, this section contains 
@@ -8435,7 +8435,7 @@ The enclosing parentheses are a required part of the syntax.
 
 When the compiler selects the registers to use to 
 represent the output operands, it does not use any of the clobbered registers 
-(@pxref{Clobbers}).
+(@pxref{Clobbers and Scratch Registers}).
 
 Output operand expressions must be lvalues. The compiler cannot check whether 
 the operands have data types that are reasonable for the instruction being 
@@ -8671,7 +8671,8 @@ as input.  The enclosing parentheses are a required part of the syntax.
 @end table
 
 When the compiler selects the registers to use to represent the input 
-operands, it does not use any of the clobbered registers (@pxref{Clobbers}).
+operands, it does not use any of the clobbered registers
+(@pxref{Clobbers and Scratch Registers}).
 
 If there are no output operands but there are input operands, place two 
 consecutive colons where the output operands would go:
@@ -8722,9 +8723,10 @@ asm ("cmoveq %1, %2, %[result]"
    : "r" (test), "r" (new), "[result]" (old));
 @end example
 
-@anchor{Clobbers}
-@subsubsection Clobbers
+@anchor{Clobbers and Scratch Registers}
+@subsubsection Clobbers and Scratch Registers
 @cindex @code{asm} clobbers
+@cindex @code{asm} scratch registers
 
 While the compiler is aware of changes to entries listed in the output 
 operands, the inline @code{asm} code may modify more than just the outputs. For 
@@ -8853,6 +8855,65 @@ dscal (size_t n, double *x, double alpha)
 @}
 @end smallexample
 
+Rather than allocating fixed registers via clobbers to provide scratch
+registers for an @code{asm} statement, an alternative is to define a
+variable and make it an early-clobber output as with @code{a2} and
+@code{a3} in the example below.  This gives the compiler register
+allocator more freedom.  You can also define a variable and make it an
+output tied to an input as with @code{a0} and @code{a1}, tied
+respectively to @code{ap} and @code{lda}.  Of course, with tied
+outputs your @code{asm} can't use the input value after modifying the
+output register since they are one and the same register.  Note also
+that tying an input to an output is the way to set up an initialized
+temporary register modified by an @code{asm} statement.  An input not
+tied to an output is assumed by GCC to be unchanged, for example
+@code{"b" (16)} below sets up @code{%11} to 16, and GCC might use that
+register in following code if the value 16 happened to be needed.  You
+can even use a normal @code{asm} output for a scratch if all inputs
+that might share the same register are consumed before the scratch is
+used.  The VSX registers clobbered by the @code{asm} statement could
+have used this technique except for GCC's limit on the number of
+@code{asm} parameters.
+
+@smallexample
+static void
+dgemv_kernel_4x4 (long n, const double *ap, long lda,
+                  const double *x, double *y, double alpha)
+@{
+  double *a0;
+  double *a1;
+  double *a2;
+  double *a3;
+
+  __asm__
+    (
+     /* lots of asm here */
+     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
+     "#a0=%3 a1=%4 a2=%5 a3=%6"
+     :
+       "+m" (*(double (*)[n]) y),
+       "+r" (n),	// 1
+       "+b" (y),	// 2
+       "=b" (a0),	// 3
+       "=b" (a1),	// 4
+       "=&b" (a2),	// 5
+       "=&b" (a3)	// 6
+     :
+       "m" (*(const double (*)[n]) x),
+       "m" (*(const double (*)[]) ap),
+       "d" (alpha),	// 9
+       "r" (x),		// 10
+       "b" (16),	// 11
+       "3" (ap),	// 12
+       "4" (lda)	// 13
+     :
+       "cr0",
+       "vs32","vs33","vs34","vs35","vs36","vs37",
+       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
+     );
+@}
+@end smallexample
+
 @anchor{GotoLabels}
 @subsubsection Goto Labels
 @cindex @code{asm} goto labels

-- 
Alan Modra
Australia Development Lab, IBM


* Re: Clobbers and Scratch Registers
  2017-08-21  7:44     ` Clobbers and Scratch Registers Alan Modra
@ 2017-08-21 19:03       ` Richard Sandiford
  2017-08-22  6:32         ` Alan Modra
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Sandiford @ 2017-08-21 19:03 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches, Sandra Loosemore

Thanks for doing this.

Alan Modra <amodra@gmail.com> writes:
> This is a revised version of
> https://gcc.gnu.org/ml/gcc-patches/2017-03/msg01562.html limited to
> showing just the scratch register aspect, as a followup to
> https://gcc.gnu.org/ml/gcc-patches/2017-08/msg01174.html 
>
> 	* doc/extend.texi (Extended Asm <Clobbers>): Rename to
> 	"Clobbers and Scratch Registers".  Add paragraph on
> 	alternative to clobbers for scratch registers and OpenBLAS
> 	example.
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 940490e..0637672 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -8075,7 +8075,7 @@ A comma-separated list of C expressions read by the instructions in the
>  @item Clobbers
>  A comma-separated list of registers or other values changed by the 
>  @var{AssemblerTemplate}, beyond those listed as outputs.
> -An empty list is permitted.  @xref{Clobbers}.
> +An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
>  
>  @item GotoLabels
>  When you are using the @code{goto} form of @code{asm}, this section contains 
> @@ -8435,7 +8435,7 @@ The enclosing parentheses are a required part of the syntax.
>  
>  When the compiler selects the registers to use to 
>  represent the output operands, it does not use any of the clobbered registers 
> -(@pxref{Clobbers}).
> +(@pxref{Clobbers and Scratch Registers}).
>  
>  Output operand expressions must be lvalues. The compiler cannot check whether 
>  the operands have data types that are reasonable for the instruction being 
> @@ -8671,7 +8671,8 @@ as input.  The enclosing parentheses are a required part of the syntax.
>  @end table
>  
>  When the compiler selects the registers to use to represent the input 
> -operands, it does not use any of the clobbered registers (@pxref{Clobbers}).
> +operands, it does not use any of the clobbered registers
> +(@pxref{Clobbers and Scratch Registers}).
>  
>  If there are no output operands but there are input operands, place two 
>  consecutive colons where the output operands would go:
> @@ -8722,9 +8723,10 @@ asm ("cmoveq %1, %2, %[result]"
>     : "r" (test), "r" (new), "[result]" (old));
>  @end example
>  
> -@anchor{Clobbers}
> -@subsubsection Clobbers
> +@anchor{Clobbers and Scratch Registers}
> +@subsubsection Clobbers and Scratch Registers
>  @cindex @code{asm} clobbers
> +@cindex @code{asm} scratch registers
>  
>  While the compiler is aware of changes to entries listed in the output 
>  operands, the inline @code{asm} code may modify more than just the outputs. For 
> @@ -8853,6 +8855,65 @@ dscal (size_t n, double *x, double alpha)
>  @}
>  @end smallexample
>  
> +Rather than allocating fixed registers via clobbers to provide scratch
> +registers for an @code{asm} statement, an alternative is to define a
> +variable and make it an early-clobber output as with @code{a2} and
> +@code{a3} in the example below.  This gives the compiler register
> +allocator more freedom.  You can also define a variable and make it an
> +output tied to an input as with @code{a0} and @code{a1}, tied
> +respectively to @code{ap} and @code{lda}.

I think it's worth emphasising that tying operands doesn't change
whether an output needs an earlyclobber or not.  E.g. for:

  asm ("%0 = f(%1); use %2"
       : "=r" (a) : "0" (b), "r" (c));

the compiler can assign the same register to all three operands if
it can prove that b == c on entry.  Since %0 is being modified before
%2 is used, it needs to be:

  asm ("%0 = f(%1); use %2"
       : "=&r" (a) : "0" (b), "r" (c));

instead.
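
Below is a runnable x86 instance of this pattern. It is a sketch that assumes GCC on x86 or x86-64, and twice_plus is a hypothetical name. The function computes 2*b + c. A call with equal arguments is exactly the case where, without the "&", GCC could legitimately give %0 and %2 the same register.

```c
/* Hypothetical helper: return 2*b + c.  %0 starts out holding b (tied
   via "0") and is modified by shll before %2 is read, so %0 needs the
   early-clobber "&".  Without it, if GCC could prove b == c it might
   use one register for all three operands and return 4*b instead.  */
static int
twice_plus (int b, int c)
{
  int a;
#if defined (__x86_64__) || defined (__i386__)
  __asm__ ("shll $1, %0\n\t"
           "addl %2, %0"
           : "=&r" (a)
           : "0" (b), "r" (c)
           : "cc");
#else
  a = 2 * b + c;
#endif
  return a;
}
```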

Thanks,
Richard



* Re: Clobbers and Scratch Registers
  2017-08-21 19:03       ` Richard Sandiford
@ 2017-08-22  6:32         ` Alan Modra
  2017-08-22  6:33           ` Alan Modra
  2017-10-12 18:36           ` Jeff Law
  0 siblings, 2 replies; 10+ messages in thread
From: Alan Modra @ 2017-08-22  6:32 UTC (permalink / raw)
  To: gcc-patches, Sandra Loosemore, richard.sandiford

On Mon, Aug 21, 2017 at 06:33:09PM +0100, Richard Sandiford wrote:
> I think it's worth emphasising that tying operands doesn't change
> whether an output needs an earlyclobber or not.  E.g. for:

Thanks for noticing this.  It turns out that my OpenBLAS example
actually ought to have an early-clobber on one of the tied outputs, so
you've also alerted me to another bug in the power8 code.  (Well, only
if the dgemv kernel was called directly from user code with a 16*N A
matrix, or I suppose if LTO was used.)  So I now have a real-world
example of the situation where you need an early-clobber on tied
outputs, and also where an early-clobber is undesirable.

Revised and expanded.

	* doc/extend.texi (Extended Asm <Clobbers>): Rename to
	"Clobbers and Scratch Registers".  Add paragraph on
	alternative to clobbers for scratch registers and OpenBLAS
	example.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 940490e..cef6c57 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8075,7 +8075,7 @@ A comma-separated list of C expressions read by the instructions in the
 @item Clobbers
 A comma-separated list of registers or other values changed by the 
 @var{AssemblerTemplate}, beyond those listed as outputs.
-An empty list is permitted.  @xref{Clobbers}.
+An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
 
 @item GotoLabels
 When you are using the @code{goto} form of @code{asm}, this section contains 
@@ -8435,7 +8435,7 @@ The enclosing parentheses are a required part of the syntax.
 
 When the compiler selects the registers to use to 
 represent the output operands, it does not use any of the clobbered registers 
-(@pxref{Clobbers}).
+(@pxref{Clobbers and Scratch Registers}).
 
 Output operand expressions must be lvalues. The compiler cannot check whether 
 the operands have data types that are reasonable for the instruction being 
@@ -8671,7 +8671,8 @@ as input.  The enclosing parentheses are a required part of the syntax.
 @end table
 
 When the compiler selects the registers to use to represent the input 
-operands, it does not use any of the clobbered registers (@pxref{Clobbers}).
+operands, it does not use any of the clobbered registers
+(@pxref{Clobbers and Scratch Registers}).
 
 If there are no output operands but there are input operands, place two 
 consecutive colons where the output operands would go:
@@ -8722,9 +8723,10 @@ asm ("cmoveq %1, %2, %[result]"
    : "r" (test), "r" (new), "[result]" (old));
 @end example
 
-@anchor{Clobbers}
-@subsubsection Clobbers
+@anchor{Clobbers and Scratch Registers}
+@subsubsection Clobbers and Scratch Registers
 @cindex @code{asm} clobbers
+@cindex @code{asm} scratch registers
 
 While the compiler is aware of changes to entries listed in the output 
 operands, the inline @code{asm} code may modify more than just the outputs. For 
@@ -8853,6 +8855,75 @@ dscal (size_t n, double *x, double alpha)
 @}
 @end smallexample
 
+Rather than allocating fixed registers via clobbers to provide scratch
+registers for an @code{asm} statement, an alternative is to define a
+variable and make it an early-clobber output as with @code{a2} and
+@code{a3} in the example below.  This gives the compiler register
+allocator more freedom.  You can also define a variable and make it an
+output tied to an input as with @code{a0} and @code{a1}, tied
+respectively to @code{ap} and @code{lda}.  Of course, with tied
+outputs your @code{asm} can't use the input value after modifying the
+output register since they are one and the same register.  What's
+more, if you omit the early-clobber on the output, it is possible that
+GCC might allocate the same register to another of the inputs if GCC
+could prove they had the same value on entry to the @code{asm}.  This
+is why @code{a1} has an early-clobber.  Its tied input, @code{lda},
+might conceivably be known to have the value 16 and without an
+early-clobber share the same register as @code{%11}.  On the other
+hand, @code{ap} can't be the same as any of the other inputs, so an
+early-clobber on @code{a0} is not needed.  It is also not desirable in
+this case.  An early-clobber on @code{a0} would cause GCC to allocate
+a separate register for the @code{"m" (*(const double (*)[]) ap)}
+input.  Note that tying an input to an output is the way to set up an
+initialized temporary register modified by an @code{asm} statement.
+An input not tied to an output is assumed by GCC to be unchanged; for
+example, @code{"b" (16)} below sets up @code{%11} to 16, and GCC might
+use that register in following code if the value 16 happened to be
+needed.  You can even use a normal @code{asm} output for a scratch if
+all inputs that might share the same register are consumed before the
+scratch is used.  The VSX registers clobbered by the @code{asm}
+statement could have used this technique except for GCC's limit on the
+number of @code{asm} parameters.
+
+@smallexample
+static void
+dgemv_kernel_4x4 (long n, const double *ap, long lda,
+                  const double *x, double *y, double alpha)
+@{
+  double *a0;
+  double *a1;
+  double *a2;
+  double *a3;
+
+  __asm__
+    (
+     /* lots of asm here */
+     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
+     "#a0=%3 a1=%4 a2=%5 a3=%6"
+     :
+       "+m" (*(double (*)[n]) y),
+       "+r" (n),	// 1
+       "+b" (y),	// 2
+       "=b" (a0),	// 3
+       "=&b" (a1),	// 4
+       "=&b" (a2),	// 5
+       "=&b" (a3)	// 6
+     :
+       "m" (*(const double (*)[n]) x),
+       "m" (*(const double (*)[]) ap),
+       "d" (alpha),	// 9
+       "r" (x),		// 10
+       "b" (16),	// 11
+       "3" (ap),	// 12
+       "4" (lda)	// 13
+     :
+       "cr0",
+       "vs32","vs33","vs34","vs35","vs36","vs37",
+       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
+     );
+@}
+@end smallexample
+
 @anchor{GotoLabels}
 @subsubsection Goto Labels
 @cindex @code{asm} goto labels


-- 
Alan Modra
Australia Development Lab, IBM
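The tied-output and early-clobber rules in the paragraph above can be seen in a much smaller function. This is a minimal sketch, assuming an x86-64 target and GCC's AT&T-syntax inline asm, with a plain C fallback elsewhere; the name `add3` is hypothetical and not part of the patch. `result` is tied to input `x`, so it starts out holding `x` (an initialized temporary), while `scratch` must be early-clobber because it is written before input `b` is read.

```c
#include <assert.h>

/* Hypothetical helper: returns x + a + b.
   - result: output tied to input x ("0"), so it enters the asm holding x.
   - scratch: early-clobber ("=&r") because it is written by the first
     instruction while input b is still needed by the second.  */
static long
add3 (long x, long a, long b)
{
  long result;
  long scratch;

#if defined(__x86_64__)
  __asm__ ("mov %2, %1\n\t"     /* scratch = a */
           "add %3, %1\n\t"     /* scratch += b; b is still live here */
           "add %1, %0"         /* result += scratch */
           : "=r" (result),     /* %0 */
             "=&r" (scratch)    /* %1 */
           : "r" (a),           /* %2 */
             "r" (b),           /* %3 */
             "0" (x)            /* %4, same register as %0 */
           : "cc");
#else
  /* Portable fallback for other targets.  */
  scratch = a + b;
  result = x + scratch;
#endif

  assert (scratch == a + b);
  return result;
}
```

With this definition `add3 (10, 20, 12)` returns 42. Only `scratch` needs the early-clobber: `result` is written by the last instruction, after `a` and `b` have both been consumed, so even if GCC gave it the same register as an input with an equal value the answer would still be correct.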

* Re: Clobbers and Scratch Registers
  2017-08-22  6:32         ` Alan Modra
@ 2017-08-22  6:33           ` Alan Modra
  2017-10-12 18:36           ` Jeff Law
  1 sibling, 0 replies; 10+ messages in thread
From: Alan Modra @ 2017-08-22  6:33 UTC
  To: gcc-patches, Sandra Loosemore, richard.sandiford

On Tue, Aug 22, 2017 at 01:41:21PM +0930, Alan Modra wrote:
> +     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
> +     "#a0=%3 a1=%4 a2=%5 a3=%6"
> +     :
> +       "+m" (*(double (*)[n]) y),
> +       "+r" (n),	// 1

Another small revision.  That needs to be "+&r" (n), in case n can be
deduced to be 16, matching one of the other inputs.

-- 
Alan Modra
Australia Development Lab, IBM
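The same concern applies to any in/out operand that is updated while other inputs are still live. A minimal sketch, assuming x86-64 and AT&T syntax (`sum_longs` is a hypothetical helper, not from the patch): a counted loop keeps `n` in a `"+&r"` operand and describes the array it reads with the `"m" (*(const long (*)[n]) p)` form from the patch, so no `"memory"` clobber is needed. `sum` is also early-clobber, since it is written on the first iteration while `p` and any register addressing the `"m"` operand are still in use.

```c
#include <assert.h>

/* Hypothetical helper: sum the n longs at p using an x86-64 counted loop.
   Both in/out operands are early-clobber ("+&r"): they are modified
   inside the loop while the other inputs are still being read.  */
static long
sum_longs (const long *p, long n)
{
  long sum = 0;

  assert (n >= 0);
  if (n == 0)
    return 0;   /* also avoids a zero-length VLA type below */

#if defined(__x86_64__)
  __asm__ ("1:\n\t"
           "add -8(%2,%1,8), %0\n\t"   /* sum += p[n - 1] */
           "dec %1\n\t"                /* --n; sets ZF when n reaches 0 */
           "jnz 1b"
           : "+&r" (sum),              /* %0 */
             "+&r" (n)                 /* %1 */
           : "r" (p),                  /* %2 */
             "m" (*(const long (*)[n]) p)  /* the n longs read */
           : "cc");
#else
  while (n > 0)
    sum += p[--n];
#endif

  return sum;
}
```

The `"m"` operand is never referenced in the template; it only tells GCC which memory the asm reads, which is what lets GCC keep unrelated memory in registers across the asm.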

* Re: [PATCH] Asm memory constraints
  2017-08-21  7:36   ` Alan Modra
  2017-08-21  7:44     ` Clobbers and Scratch Registers Alan Modra
@ 2017-09-29  1:06     ` Alan Modra
  2017-10-12 18:27     ` Jeff Law
  2 siblings, 0 replies; 10+ messages in thread
From: Alan Modra @ 2017-09-29  1:06 UTC
  To: gcc-patches; +Cc: law

On Mon, Aug 21, 2017 at 10:29:30AM +0930, Alan Modra wrote:
> Fixed in this revised patch.  The only controversial aspect now should
> be whether those array casts ought to be officially blessed.  I've
> checked that "=m" (*(T (*)[]) ptr), "=m" (*(T (*)[n]) ptr), and
> "=m" (*(T (*)[10]) ptr), all generate reasonable MEM_ATTRS handled
> apparently properly by alias.c and other code.

Ping https://gcc.gnu.org/ml/gcc-patches/2017-08/msg01174.html
Needs a global reviewer to bless array casts in asm constraints.

-- 
Alan Modra
Australia Development Lab, IBM
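The `"=m" (*(T (*)[n]) ptr)` form listed above describes an output that overwrites exactly `n` elements. A minimal sketch, assuming x86-64 and the SysV ABI guarantee that the direction flag is clear on function entry; `fill_bytes` is a hypothetical name. Because the output covers only `n` bytes, GCC may keep assuming that bytes past `dst[n - 1]` are unchanged. The early return avoids evaluating a zero-length VLA type, which ISO C does not permit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: set the n bytes at dst to val with "rep stosb".
   The "=m" operand declares that exactly those n bytes are written;
   it is not referenced in the template.  */
static void
fill_bytes (unsigned char *dst, unsigned char val, size_t n)
{
  assert (dst != NULL || n == 0);
  if (n == 0)
    return;

#if defined(__x86_64__)
  unsigned char *d = dst;   /* advanced by the instruction */
  size_t count = n;         /* counted down to zero */
  __asm__ ("rep stosb"      /* store AL at (%rdi)++, %rcx times */
           : "=m" (*(unsigned char (*)[n]) dst),  /* n bytes written */
             "+D" (d),
             "+c" (count)
           : "a" (val));
#else
  for (size_t i = 0; i < n; i++)
    dst[i] = val;
#endif
}
```

For example, after `fill_bytes (buf, 0xAB, 5)` on a zeroed 8-byte `buf`, `buf[0]` through `buf[4]` hold 0xAB and `buf[5]` through `buf[7]` are still 0.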

* Re: [PATCH] Asm memory constraints
  2017-08-21  7:36   ` Alan Modra
  2017-08-21  7:44     ` Clobbers and Scratch Registers Alan Modra
  2017-09-29  1:06     ` [PATCH] Asm memory constraints Alan Modra
@ 2017-10-12 18:27     ` Jeff Law
  2 siblings, 0 replies; 10+ messages in thread
From: Jeff Law @ 2017-10-12 18:27 UTC
  To: Alan Modra, Segher Boessenkool; +Cc: gcc-patches

On 08/20/2017 06:59 PM, Alan Modra wrote:
> On Sun, Aug 20, 2017 at 08:00:53AM -0500, Segher Boessenkool wrote:
>> Hi Alan,
>>
>> On Sat, Aug 19, 2017 at 12:19:35AM +0930, Alan Modra wrote:
>>> +Flushing registers to memory has performance implications and may be
>>> +an issue for time-sensitive code.  You can provide better information
>>> +to GCC to avoid this, as shown in the following examples.  At a
>>> +minimum, aliasing rules allow GCC to know what memory @emph{doesn't}
>>> +need to be flushed.  Also, if GCC can prove that all of the outputs of
>>> +a non-volatile @code{asm} statement are unused, then the @code{asm}
>>> +may be deleted.  Removal of otherwise dead @code{asm} statements will
>>> +not happen if they clobber @code{"memory"}.
>>
>> void f(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x) : "memory"); }
>> void g(int x) { int z; asm("hcf %0,%1" : "=r"(z) : "r"(x)); }
>>
>> Both f and g are completely removed by the first jump pass immediately
>> after expand (via delete_trivially_dead_insns).
>>
>> Do you have a testcase for the behaviour you saw?
> 
> Oh my.  I was sure that was how "memory" worked!  I see though that
> every gcc I have lying around, going all the way back to gcc-2.95,
> deletes the asm in your testcase.  I definitely don't want to put
> something in the docs that is plain wrong, or just my idea of how
> things ought to work, so the last two sentences quoted above need to
> go.  Thanks for the correction.
> 
> Fixed in this revised patch.  The only controversial aspect now should
> be whether those array casts ought to be officially blessed.  I've
> checked that "=m" (*(T (*)[]) ptr), "=m" (*(T (*)[n]) ptr), and
> "=m" (*(T (*)[10]) ptr), all generate reasonable MEM_ATTRS handled
> apparently properly by alias.c and other code.
> 
> For example, at -O3 the following shows gcc moving the read of "val"
> before the asm, while an asm using a "memory" clobber forces the read
> to occur after the asm.
> 
> static int
> f (double *x)
> {
>   int res;
>   asm ("#%0 %1 %2" : "=r" (res) : "r" (x), "m" (*(double (*)[]) x));
>   return res;
> }
> 
> int val = 123;
> double foo[10];
> 
> int
> main ()
> {
>   int b = f (foo);
>   __builtin_printf ("%d %d\n", val, b);
>   return 0;
> }
> 
> 
> I'm also encouraged by comments like the following by rth in 2004
> (gcc/c/c-typeck.c), which say that using non-kosher lvalues in memory
> output constraints must continue to be supported.
> 
>       /* ??? Really, this should not be here.  Users should be using a
> 	 proper lvalue, dammit.  But there's a long history of using casts
> 	 in the output operands.  In cases like longlong.h, this becomes a
> 	 primitive form of typechecking -- if the cast can be removed, then
> 	 the output operand had a type of the proper width; otherwise we'll
> 	 get an error.  Gross, but ...  */
>       STRIP_NOPS (output);
> 
> 
> 	* doc/extend.texi (Clobbers): Correct vax example.  Delete old
> 	example of a memory input for a string of known length.  Move
> 	commentary out of table.  Add a number of new examples
> 	covering array memory inputs.
> testsuite/
> 	* gcc.target/i386/asm-mem.c: New test.
OK.  Sorry about the long wait.

jeff
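The quoted test case uses the unknown-bound form `"m" (*(double (*)[]) x)`, which suits an asm whose read extent is not known in advance. A minimal sketch, assuming x86-64, AT&T syntax and a clear direction flag; `asm_strlen` is a hypothetical name. The array of unknown bound tells GCC the asm may read memory starting at `s` with unspecified extent, which is enough to order it correctly against stores to the string without clobbering `"memory"`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: string length via "repne scasb".  */
static size_t
asm_strlen (const char *s)
{
  assert (s != NULL);

#if defined(__x86_64__)
  const char *p = s;
  size_t count = (size_t) -1;   /* scan limit: effectively unbounded */
  __asm__ ("repne scasb"        /* compare AL with (%rdi)++, --%rcx */
           : "+D" (p), "+c" (count)
           : "a" (0),                      /* AL = 0: search for NUL */
             "m" (*(const char (*)[]) s)   /* reads s[0..?] */
           : "cc");
  /* count was decremented once per byte scanned, NUL included,
     so count == -(len + 2) and ~count == len + 1.  */
  return ~count - 1;
#else
  size_t len = 0;
  while (s[len] != '\0')
    len++;
  return len;
#endif
}
```

For instance, `asm_strlen ("hello")` scans six bytes, leaving `count` at `-7`, so `~count - 1` is 5.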

* Re: Clobbers and Scratch Registers
  2017-08-22  6:32         ` Alan Modra
  2017-08-22  6:33           ` Alan Modra
@ 2017-10-12 18:36           ` Jeff Law
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff Law @ 2017-10-12 18:36 UTC
  To: Alan Modra, gcc-patches, Sandra Loosemore, richard.sandiford

On 08/21/2017 10:11 PM, Alan Modra wrote:
> On Mon, Aug 21, 2017 at 06:33:09PM +0100, Richard Sandiford wrote:
>> I think it's worth emphasising that tying operands doesn't change
>> whether an output needs an earlyclobber or not.  E.g. for:
> 
> Thanks for noticing this.  It turns out that my OpenBLAS example
> actually ought to have an early-clobber on one of the tied outputs, so
> you've also alerted me to another bug in the power8 code.  (Well, only
> if the dgemv kernel was called directly from user code with a 16*N A
> matrix, or I suppose if LTO was used.)  So I now have a real-world
> example of the situation where you need an early-clobber on tied
> outputs, and also where an early-clobber is undesirable.
> 
> Revised and expanded.
> 
> 	* doc/extend.texi (Extended Asm <Clobbers>): Rename to
> 	"Clobbers and Scratch Registers".  Add paragraph on
> 	alternative to clobbers for scratch registers and OpenBLAS
> 	example.
Also OK.  I think you had a minor revision on this which is OK as well.
I never would have spotted the additional earlyclobber requirement.

jeff

Thread overview: 10+ messages
2017-08-18 17:51 [PATCH] Asm memory constraints Alan Modra
2017-08-21  0:59 ` Segher Boessenkool
2017-08-21  7:36   ` Alan Modra
2017-08-21  7:44     ` Clobbers and Scratch Registers Alan Modra
2017-08-21 19:03       ` Richard Sandiford
2017-08-22  6:32         ` Alan Modra
2017-08-22  6:33           ` Alan Modra
2017-10-12 18:36           ` Jeff Law
2017-09-29  1:06     ` [PATCH] Asm memory constraints Alan Modra
2017-10-12 18:27     ` Jeff Law
