public inbox for gcc-patches@gcc.gnu.org
* [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations
@ 2021-05-18 19:16 H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
                   ` (11 more replies)
  0 siblings, 12 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Changes in the v4 patches:

1. Define x86 MAX_MOVE_MAX to 64, which is the constant maximum number
of bytes that a single instruction can move quickly between memory and
registers or between two memory locations.
2. Define x86 MOVE_MAX to MOVE_MAX_PIECES, which is the maximum number of
bytes we can move from memory to memory in one reasonably fast instruction.
The difference between MAX_MOVE_MAX and MOVE_MAX is that MAX_MOVE_MAX
must be a constant, independent of compiler options, since it is used in
reload.h to define struct target_reload and MOVE_MAX can vary, depending
on compiler options.
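
A minimal sketch of the distinction above, for illustration only.  Every
name ending in _SKETCH or _sketch is hypothetical and is not the real GCC
definition, and the option-to-width mapping is an assumption: the bound
that sizes a struct member must be an integer constant, while the
per-compilation limit can be computed from the options in effect.

```c
#include <stddef.h>

/* Hypothetical stand-in for a constant, option-independent upper bound.  */
#define MAX_MOVE_MAX_SKETCH 64

/* An array dimension in a struct must be an integer constant expression,
   so only the option-independent bound can be used to size it.  */
struct target_reload_sketch
{
  unsigned char slots[MAX_MOVE_MAX_SKETCH];
};

/* Hypothetical option-dependent limit.  The mapping below (64 bytes with
   512-bit vectors, 32 with 256-bit vectors, 16 otherwise) is an assumption
   made for this sketch.  */
static size_t
move_max_sketch (int have_512bit, int have_256bit)
{
  if (have_512bit)
    return 64;
  if (have_256bit)
    return 32;
  return 16;
}
```

Whatever the options are, the varying limit never exceeds the constant
bound, which is what lets the struct layout stay the same across option
sets.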

Changes in the v3 patches:

1. Split the TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE changes
into the generic part and the x86 part.


1. Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
target instructions to duplicate QImode value to TImode/OImode/XImode
value for memset.
2. x86: Avoid stack realignment when copying data
3. x86: Remove MAX_BITSIZE_MODE_ANY_INT.  Only x86 backend defines it.
4. x86: Use TImode/OImode/XImode integers for piecewise move and store.
5. x86: Add tests for TImode/OImode/XImode for piecewise move and store.
6. x86: Adjust existing tests.

On x86-64, SPEC CPU 2017 performance impact is neutral.  Glibc code size
differences with -O2 build are:

             Before         After
libc.so     1906572        1906444

Some code sequence differences in libc.so are:

<svcudp_bufcreate@GLIBC_2.2.5>:
	...
	jne    <svcudp_bufcreate@GLIBC_2.2.5+0x318>	      |		jne    <svcudp_bufcreate@GLIBC_2.2.5+0x2a8>
	test   %r15,%r15						test   %r15,%r15
	je     <svcudp_bufcreate@GLIBC_2.2.5+0x318>	      |		je     <svcudp_bufcreate@GLIBC_2.2.5+0x2a8>
	mov    %r13d,(%r14)						mov    %r13d,(%r14)
	lea    0x10(%r14),%rdi						lea    0x10(%r14),%rdi
	mov    $0x1,%ecx						mov    $0x1,%ecx
	mov    %r13d,%edx						mov    %r13d,%edx
	mov    %r15,0x40(%r12)						mov    %r15,0x40(%r12)
	mov    %r15,%rsi						mov    %r15,%rsi
	call   <xdrmem_create@GLIBC_2.2.5>				call   <xdrmem_create@GLIBC_2.2.5>
	lea    0xa2f9b(%rip),%rax        # <svcudp_op>	      |		lea    0xa2fab(%rip),%rax        # <svcudp_op>
	xor    %esi,%esi						xor    %esi,%esi
	mov    %ebp,%edi						mov    %ebp,%edi
	mov    %rax,0x8(%r12)						mov    %rax,0x8(%r12)
	movzwl 0x12(%rsp),%eax						movzwl 0x12(%rsp),%eax
	mov    $0x8,%edx				      <
	lea    0xc(%rsp),%rcx						lea    0xc(%rsp),%rcx
	mov    %r14,0x48(%r12)				      <
	add    $0x40,%r14				      <
	mov    $0x4,%r8d						mov    $0x4,%r8d
							      >		movq   $0x0,0x1d0(%r14)
							      >		mov    $0x8,%edx
	rol    $0x8,%ax							rol    $0x8,%ax
	mov    %ebp,(%r12)				      |		mov    %r14,0x48(%r12)
	movq   $0x0,0x190(%r14)				      |		add    $0x40,%r14
	mov    %ax,0x4(%r12)				      <
	mov    %r14,0x30(%r12)						mov    %r14,0x30(%r12)
							      >		mov    %ax,0x4(%r12)
							      >		mov    %ebp,(%r12)
	movl   $0x1,0xc(%rsp)						movl   $0x1,0xc(%rsp)
	call   <setsockopt>						call   <setsockopt>
	mov    %r12,%rdi						mov    %r12,%rdi
	movabs $0x101010101010101,%rdx			      <
	test   %eax,%eax						test   %eax,%eax
	mov    $0xff,%eax						mov    $0xff,%eax
	cmove  %eax,%ebx						cmove  %eax,%ebx
	movzbl %bl,%eax					      |		movd   %ebx,%xmm0
	mov    %ebx,0xc(%rsp)						mov    %ebx,0xc(%rsp)
	mov    %rax,%rsi				      |		punpcklbw %xmm0,%xmm0
	imul   %rdx,%rsi				      |		punpcklwd %xmm0,%xmm0
	mul    %rdx					      |		pshufd $0x0,%xmm0,%xmm0
	add    %rsi,%rdx				      |		movups %xmm0,0x50(%r12)
	mov    %rax,0x50(%r12)				      |		movups %xmm0,0x60(%r12)
	mov    %rdx,0x58(%r12)				      |		movups %xmm0,0x70(%r12)
	mov    %rax,0x60(%r12)				      |		movups %xmm0,0x80(%r12)
	mov    %rdx,0x68(%r12)				      |		movups %xmm0,0x90(%r12)
	mov    %rax,0x70(%r12)				      |		movups %xmm0,0xa0(%r12)
	mov    %rdx,0x78(%r12)				      |		movups %xmm0,0xb0(%r12)
	mov    %rax,0x80(%r12)				      |		movups %xmm0,0xc0(%r12)
	mov    %rdx,0x88(%r12)				      |		movups %xmm0,0xd0(%r12)
	mov    %rax,0x90(%r12)				      |		movups %xmm0,0xe0(%r12)
	mov    %rdx,0x98(%r12)				      |		movups %xmm0,0xf0(%r12)
	mov    %rax,0xa0(%r12)				      |		movups %xmm0,0x100(%r12)
	mov    %rdx,0xa8(%r12)				      |		movups %xmm0,0x110(%r12)
	mov    %rax,0xb0(%r12)				      |		movups %xmm0,0x120(%r12)
	mov    %rdx,0xb8(%r12)				      |		movups %xmm0,0x130(%r12)
	mov    %rax,0xc0(%r12)				      |		movups %xmm0,0x140(%r12)
	mov    %rdx,0xc8(%r12)				      <
	mov    %rax,0xd0(%r12)				      <
	mov    %rdx,0xd8(%r12)				      <
	mov    %rax,0xe0(%r12)				      <
	mov    %rdx,0xe8(%r12)				      <
	mov    %rax,0xf0(%r12)				      <
	mov    %rdx,0xf8(%r12)				      <
	mov    %rax,0x100(%r12)				      <
	mov    %rdx,0x108(%r12)				      <
	mov    %rax,0x110(%r12)				      <
	mov    %rdx,0x118(%r12)				      <
	mov    %rax,0x120(%r12)				      <
	mov    %rdx,0x128(%r12)				      <
	mov    %rax,0x130(%r12)				      <
	mov    %rdx,0x138(%r12)				      <
	mov    %rax,0x140(%r12)				      <
	mov    %rdx,0x148(%r12)				      <
	call   <xprt_register@GLIBC_2.2.5>				call   <xprt_register@GLIBC_2.2.5>
	add    $0x28,%rsp						add    $0x28,%rsp
	mov    %r12,%rax						mov    %r12,%rax
	pop    %rbx							pop    %rbx
	pop    %rbp							pop    %rbp
	pop    %r12							pop    %r12
	pop    %r13							pop    %r13
	pop    %r14							pop    %r14
	pop    %r15							pop    %r15
	ret    								ret    
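
In the listing above, the same 256-byte region (offsets 0x50 through
0x14f from %r12) is filled with 32 8-byte stores before and 16 16-byte
stores after.  The arithmetic can be sketched as follows.  This is a
simplified model with a hypothetical helper name, not GCC's by-pieces
code: it assumes power-of-two piece sizes and no overlapping stores for
the tail.

```c
#include <stddef.h>

/* Count the stores needed to cover LEN bytes when the widest piece is
   MAX_PIECE bytes (assumed to be a power of two), falling back to
   progressively narrower power-of-two pieces for the remainder.  */
static size_t
count_stores_sketch (size_t len, size_t max_piece)
{
  size_t n = 0;
  for (size_t piece = max_piece; piece >= 1 && len > 0; piece /= 2)
    {
      n += len / piece;   /* full pieces of this width */
      len %= piece;       /* bytes left for narrower pieces */
    }
  return n;
}
```

With this model, count_stores_sketch (256, 8) is 32 and
count_stores_sketch (256, 16) is 16, matching the two columns.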

H.J. Lu (12):
  Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  x86: Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  x86: Avoid stack realignment when copying data
  Remove MAX_BITSIZE_MODE_ANY_INT
  x86: Update piecewise move and store
  x86: Add AVX2 tests for PR middle-end/90773
  x86: Add tests for piecewise move and store
  x86: Also pass -mno-avx to pr72839.c
  x86: Also pass -mno-avx to cold-attribute-1.c
  x86: Also pass -mno-avx to sw-1.c for ia32
  x86: Update gcc.target/i386/incoming-11.c
  constructor: Check if it is faster to load constant from memory

 gcc/builtins.c                                |  47 +--
 gcc/config/i386/i386-expand.c                 |  18 +-
 gcc/config/i386/i386-modes.def                |  15 +-
 gcc/config/i386/i386-protos.h                 |   5 +
 gcc/config/i386/i386.c                        | 289 +++++++++++++++++-
 gcc/config/i386/i386.h                        |  44 ++-
 gcc/doc/tm.texi                               |  16 +
 gcc/doc/tm.texi.in                            |   4 +
 gcc/expr.c                                    |  10 +
 gcc/target.def                                |  20 ++
 gcc/targhooks.c                               |  56 ++++
 gcc/targhooks.h                               |   4 +
 .../gcc.target/i386/cold-attribute-1.c        |   2 +-
 gcc/testsuite/gcc.target/i386/eh_return-1.c   |  26 ++
 gcc/testsuite/gcc.target/i386/incoming-11.c   |   2 +-
 .../gcc.target/i386/pieces-memcpy-10.c        |  16 +
 .../gcc.target/i386/pieces-memcpy-11.c        |  17 ++
 .../gcc.target/i386/pieces-memcpy-12.c        |  16 +
 .../gcc.target/i386/pieces-memcpy-13.c        |  16 +
 .../gcc.target/i386/pieces-memcpy-14.c        |  17 ++
 .../gcc.target/i386/pieces-memcpy-15.c        |  16 +
 .../gcc.target/i386/pieces-memcpy-16.c        |  16 +
 .../gcc.target/i386/pieces-memcpy-7.c         |  15 +
 .../gcc.target/i386/pieces-memcpy-8.c         |  14 +
 .../gcc.target/i386/pieces-memcpy-9.c         |  14 +
 .../gcc.target/i386/pieces-memset-1.c         |  16 +
 .../gcc.target/i386/pieces-memset-10.c        |  16 +
 .../gcc.target/i386/pieces-memset-11.c        |  16 +
 .../gcc.target/i386/pieces-memset-12.c        |  16 +
 .../gcc.target/i386/pieces-memset-13.c        |  16 +
 .../gcc.target/i386/pieces-memset-14.c        |  16 +
 .../gcc.target/i386/pieces-memset-15.c        |  16 +
 .../gcc.target/i386/pieces-memset-16.c        |  16 +
 .../gcc.target/i386/pieces-memset-17.c        |  16 +
 .../gcc.target/i386/pieces-memset-18.c        |  16 +
 .../gcc.target/i386/pieces-memset-19.c        |  17 ++
 .../gcc.target/i386/pieces-memset-2.c         |  12 +
 .../gcc.target/i386/pieces-memset-20.c        |  17 ++
 .../gcc.target/i386/pieces-memset-21.c        |  17 ++
 .../gcc.target/i386/pieces-memset-22.c        |  17 ++
 .../gcc.target/i386/pieces-memset-23.c        |  17 ++
 .../gcc.target/i386/pieces-memset-24.c        |  17 ++
 .../gcc.target/i386/pieces-memset-25.c        |  17 ++
 .../gcc.target/i386/pieces-memset-26.c        |  17 ++
 .../gcc.target/i386/pieces-memset-27.c        |  17 ++
 .../gcc.target/i386/pieces-memset-28.c        |  17 ++
 .../gcc.target/i386/pieces-memset-29.c        |  17 ++
 .../gcc.target/i386/pieces-memset-3.c         |  18 ++
 .../gcc.target/i386/pieces-memset-30.c        |  17 ++
 .../gcc.target/i386/pieces-memset-31.c        |  17 ++
 .../gcc.target/i386/pieces-memset-32.c        |  17 ++
 .../gcc.target/i386/pieces-memset-33.c        |  17 ++
 .../gcc.target/i386/pieces-memset-34.c        |  17 ++
 .../gcc.target/i386/pieces-memset-35.c        |  17 ++
 .../gcc.target/i386/pieces-memset-36.c        |  17 ++
 .../gcc.target/i386/pieces-memset-37.c        |  15 +
 .../gcc.target/i386/pieces-memset-38.c        |  17 ++
 .../gcc.target/i386/pieces-memset-39.c        |  16 +
 .../gcc.target/i386/pieces-memset-4.c         |  16 +
 .../gcc.target/i386/pieces-memset-40.c        |  17 ++
 .../gcc.target/i386/pieces-memset-41.c        |  16 +
 .../gcc.target/i386/pieces-memset-42.c        |  17 ++
 .../gcc.target/i386/pieces-memset-43.c        |  17 ++
 .../gcc.target/i386/pieces-memset-5.c         |  12 +
 .../gcc.target/i386/pieces-memset-6.c         |  16 +
 .../gcc.target/i386/pieces-memset-7.c         |  16 +
 .../gcc.target/i386/pieces-memset-8.c         |  16 +
 .../gcc.target/i386/pieces-memset-9.c         |  16 +
 gcc/testsuite/gcc.target/i386/pr72839.c       |   2 +-
 gcc/testsuite/gcc.target/i386/pr90773-1.c     |  10 +-
 gcc/testsuite/gcc.target/i386/pr90773-14.c    |   2 +-
 gcc/testsuite/gcc.target/i386/pr90773-15.c    |  14 +
 gcc/testsuite/gcc.target/i386/pr90773-16.c    |  14 +
 gcc/testsuite/gcc.target/i386/pr90773-17.c    |  14 +
 gcc/testsuite/gcc.target/i386/pr90773-18.c    |  15 +
 gcc/testsuite/gcc.target/i386/pr90773-19.c    |  14 +
 gcc/testsuite/gcc.target/i386/pr90773-20.c    |  13 +
 gcc/testsuite/gcc.target/i386/pr90773-21.c    |  13 +
 gcc/testsuite/gcc.target/i386/pr90773-22.c    |  13 +
 gcc/testsuite/gcc.target/i386/pr90773-23.c    |  13 +
 gcc/testsuite/gcc.target/i386/pr90773-24.c    |  22 ++
 gcc/testsuite/gcc.target/i386/pr90773-25.c    |  20 ++
 gcc/testsuite/gcc.target/i386/pr90773-4.c     |   2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |   1 +
 84 files changed, 1516 insertions(+), 84 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/eh_return-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-14.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-14.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-19.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-22.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-23.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-25.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-26.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-27.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-28.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-29.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-30.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-31.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-32.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-33.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-34.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-35.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-36.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-37.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-38.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-39.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-40.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-41.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-42.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-43.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-19.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-22.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-23.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c

-- 
2.31.1



* [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-19  9:25   ` Richard Biener
  2021-05-18 19:16 ` [PATCH v4 02/12] x86: Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
target instructions to duplicate QImode value to TImode/OImode/XImode
value for memset.
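
For illustration, the two default strategies that this patch moves
behind the hooks can be sketched in plain C on a 64-bit integer.  The
function names are hypothetical.  The first mirrors the constant case
(fill a buffer with the byte, then read it back as an integer), the
second mirrors the register case (multiply the byte by a coefficient
whose bytes are all 0x01):

```c
#include <stdint.h>
#include <string.h>

/* Constant case: fill a buffer with C and read it back as an integer.
   Every byte is identical, so the result does not depend on byte order.  */
static uint64_t
broadcast_fill_sketch (unsigned char c)
{
  unsigned char buf[sizeof (uint64_t)];
  uint64_t v;
  memset (buf, c, sizeof buf);
  memcpy (&v, buf, sizeof v);
  return v;
}

/* Register case: multiply by 0x0101010101010101, the 8-byte analogue of
   the 0x01010101 coefficient used for a 4-byte mode.  */
static uint64_t
broadcast_mul_sketch (unsigned char c)
{
  return (uint64_t) c * UINT64_C (0x0101010101010101);
}
```

Both functions return the same value for any byte, e.g.
0xabababababababab for 0xab.  A target hook can replace the multiply
with a vector broadcast when the mode is wider than a general register.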

	PR middle-end/90773
	* builtins.c (builtin_memset_read_str): Call
	targetm.read_memset_value.
	(builtin_memset_gen_str): Call targetm.gen_memset_value.
	* target.def (read_memset_value): New hook.
	(gen_memset_value): Likewise.
	* targhooks.c: Include "builtins.h".
	(default_read_memset_value): New function.
	(default_gen_memset_value): Likewise.
	* targhooks.h (default_read_memset_value): New prototype.
	(default_gen_memset_value): Likewise.
	* doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
	TARGET_GEN_MEMSET_VALUE hooks.
	* doc/tm.texi: Regenerated.
---
 gcc/builtins.c     | 47 ++++----------------------------------
 gcc/doc/tm.texi    | 16 +++++++++++++
 gcc/doc/tm.texi.in |  4 ++++
 gcc/target.def     | 20 +++++++++++++++++
 gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
 gcc/targhooks.h    |  4 ++++
 6 files changed, 104 insertions(+), 43 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index e1b284846b1..f78a36478ef 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
    previous iteration.  */
 
 rtx
-builtin_memset_read_str (void *data, void *prevp,
+builtin_memset_read_str (void *data, void *prev,
 			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			 scalar_int_mode mode)
 {
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
-    {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
-    }
-
-  const char *c = (const char *) data;
-  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
-
-  memset (p, *c, GET_MODE_SIZE (mode));
-
-  return c_readstr (p, mode);
+  return targetm.read_memset_value ((const char *) data, prev, mode);
 }
 
 /* Callback routine for store_by_pieces.  Return the RTL of a register
@@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
    nullptr, it has the RTL info from the previous iteration.  */
 
 static rtx
-builtin_memset_gen_str (void *data, void *prevp,
+builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			scalar_int_mode mode)
 {
-  rtx target, coeff;
-  size_t size;
-  char *p;
-
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
-    {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
-
-      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
-      if (target != nullptr)
-	return target;
-    }
-
-  size = GET_MODE_SIZE (mode);
-  if (size == 1)
-    return (rtx) data;
-
-  p = XALLOCAVEC (char, size);
-  memset (p, 1, size);
-  coeff = c_readstr (p, mode);
-
-  target = convert_to_mode (mode, (rtx) data, 1);
-  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
-  return force_reg (mode, target);
+  return targetm.gen_memset_value ((rtx) data, prev, mode);
 }
 
 /* Expand expression EXP, which is a call to the memset builtin.  Return
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 85ea9395560..51385044e76 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
  @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
+This function returns the RTL of a constant integer corresponding to
+target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string
+constant @var{c}.  If @var{prev} is not @samp{nullptr}, it contains
+the RTL information from the previous iteration.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
+This function returns the RTL of a register containing
+@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
+char value given in the RTL register @var{data}.  For example, if
+@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
+If @var{prev} is not @samp{nullptr}, it is the RTL information from
+the previous iteration.
+@end deftypefn
+
 @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
 This target hook returns a new value for the number of times @var{loop}
 should be unrolled. The parameter @var{nunroll} is the number of times
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d8e3de14af1..8d4c3949fbf 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7956,6 +7956,10 @@ lists.
 
 @hook TARGET_GEN_CCMP_NEXT
 
+@hook TARGET_READ_MEMSET_VALUE
+
+@hook TARGET_GEN_MEMSET_VALUE
+
 @hook TARGET_LOOP_UNROLL_ADJUST
 
 @defmac POWI_MAX_MULTS
diff --git a/gcc/target.def b/gcc/target.def
index bbaf6b4f3a0..c9aca40fa88 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2694,6 +2694,26 @@ DEFHOOK
  rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
  NULL)
 
+DEFHOOK
+(read_memset_value,
+ "This function returns the RTL of a constant integer corresponding to\n\
+target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
+constant @var{c}.  If @var{prev} is not @samp{nullptr}, it contains\n\
+the RTL information from the previous iteration.",
+ rtx, (const char *c, void *prev, scalar_int_mode mode),
+ default_read_memset_value)
+
+DEFHOOK
+(gen_memset_value,
+ "This function returns the RTL of a register containing\n\
+@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned\n\
+char value given in the RTL register @var{data}.  For example, if\n\
+@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.\n\
+If @var{prev} is not @samp{nullptr}, it is the RTL information from\n\
+the previous iteration.",
+ rtx, (rtx data, void *prev, scalar_int_mode mode),
+ default_gen_memset_value)
+
 /* Return a new value for loop unroll size.  */
 DEFHOOK
 (loop_unroll_adjust,
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 1947ef26fd6..b55e6ec6756 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -90,6 +90,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "attribs.h"
 #include "asan.h"
 #include "emit-rtl.h"
+#include "builtins.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2627,4 +2628,59 @@ default_memtag_untagged_pointer (rtx tagged_pointer, rtx target)
   return untagged_base;
 }
 
+/* Default implementation of TARGET_READ_MEMSET_VALUE.  */
+
+rtx
+default_read_memset_value (const char *c, void *prevp,
+			   scalar_int_mode mode)
+{
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      /* Use the previous data in the same mode.  */
+      if (prev->mode == mode)
+	return prev->data;
+    }
+
+  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
+
+  memset (p, *c, GET_MODE_SIZE (mode));
+
+  return c_readstr (p, mode);
+}
+
+/* Default implementation of TARGET_GEN_MEMSET_VALUE.  */
+
+rtx
+default_gen_memset_value (rtx data, void *prevp, scalar_int_mode mode)
+{
+  rtx target, coeff;
+  size_t size;
+  char *p;
+
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      /* Use the previous data in the same mode.  */
+      if (prev->mode == mode)
+	return prev->data;
+
+      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
+      if (target != nullptr)
+	return target;
+    }
+
+  size = GET_MODE_SIZE (mode);
+  if (size == 1)
+    return data;
+
+  p = XALLOCAVEC (char, size);
+  memset (p, 1, size);
+  coeff = c_readstr (p, mode);
+
+  target = convert_to_mode (mode, data, 1);
+  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
+  return force_reg (mode, target);
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index b537038c0aa..3c00927e196 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -300,4 +300,8 @@ extern rtx default_memtag_set_tag (rtx, rtx, rtx);
 extern rtx default_memtag_extract_tag (rtx, rtx);
 extern rtx default_memtag_untagged_pointer (rtx, rtx);
 
+extern rtx default_read_memset_value (const char *, void *,
+				      scalar_int_mode);
+extern rtx default_gen_memset_value (rtx, void *, scalar_int_mode);
+
 #endif /* GCC_TARGHOOKS_H */
-- 
2.31.1



* [PATCH v4 02/12] x86: Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 03/12] x86: Avoid stack realignment when copying data H.J. Lu
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

1. Make ix86_expand_vector_init_duplicate global to duplicate QImode
value to TImode/OImode/XImode.
2. Make ix86_minimum_incoming_stack_boundary global and add an argument
to ignore stack_alignment_estimated.
3. Define SCRATCH_SSE_REG as a scratch register for ix86_gen_memset_value.
4. Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
target instructions to duplicate QImode value to TImode/OImode/XImode
value for memset.

gcc/

	PR middle-end/90773
	* config/i386/i386-expand.c (ix86_expand_vector_init_duplicate):
	Make it global.
	* config/i386/i386-protos.h (ix86_minimum_incoming_stack_boundary):
	New.
	(ix86_expand_vector_init_duplicate): Likewise.
	* config/i386/i386.c (ix86_minimum_incoming_stack_boundary): Add
	an argument to ignore stack_alignment_estimated.  It is passed
	as false by default.  Make it global.
	(ix86_gen_memset_value_from_prev): New function.
	(ix86_gen_memset_value): Likewise.
	(ix86_read_memset_value): Likewise.
	(TARGET_GEN_MEMSET_VALUE): New.
	(TARGET_READ_MEMSET_VALUE): Likewise.
	* config/i386/i386.h (SCRATCH_SSE_REG): New.

gcc/testsuite/

	PR middle-end/90773
	* gcc.target/i386/pr90773-15.c: New test.
	* gcc.target/i386/pr90773-16.c: Likewise.
	* gcc.target/i386/pr90773-17.c: Likewise.
	* gcc.target/i386/pr90773-18.c: Likewise.
	* gcc.target/i386/pr90773-19.c: Likewise.
---
 gcc/config/i386/i386-expand.c              |   2 +-
 gcc/config/i386/i386-protos.h              |   5 +
 gcc/config/i386/i386.c                     | 268 ++++++++++++++++++++-
 gcc/config/i386/i386.h                     |   4 +
 gcc/testsuite/gcc.target/i386/pr90773-15.c |  14 ++
 gcc/testsuite/gcc.target/i386/pr90773-16.c |  14 ++
 gcc/testsuite/gcc.target/i386/pr90773-17.c |  14 ++
 gcc/testsuite/gcc.target/i386/pr90773-18.c |  15 ++
 gcc/testsuite/gcc.target/i386/pr90773-19.c |  14 ++
 9 files changed, 345 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-19.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 9f3d41955a2..a9fe31fcb39 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13652,7 +13652,7 @@ static bool expand_vec_perm_1 (struct expand_vec_perm_d *d);
 /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
    with all elements equal to VAR.  Return true if successful.  */
 
-static bool
+bool
 ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 				   rtx target, rtx val)
 {
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7782cf1163f..c4896c2da74 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -50,6 +50,9 @@ extern void ix86_reset_previous_fndecl (void);
 
 extern bool ix86_using_red_zone (void);
 
+extern unsigned int ix86_minimum_incoming_stack_boundary (bool,
+							  bool = false);
+
 extern unsigned int ix86_regmode_natural_size (machine_mode);
 #ifdef RTX_CODE
 extern int standard_80387_constant_p (rtx);
@@ -257,6 +260,8 @@ extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_abs (rtx, rtx);
+extern bool ix86_expand_vector_init_duplicate (bool, machine_mode, rtx,
+					       rtx);
 
 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 743d8a25fe3..6a981a01668 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -415,7 +415,6 @@ static unsigned int split_stack_prologue_scratch_regno (void);
 static bool i386_asm_output_addr_const_extra (FILE *, rtx);
 
 static bool ix86_can_inline_p (tree, tree);
-static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 \f
 /* Whether -mtune= or -march= were specified */
@@ -7231,8 +7230,9 @@ find_drap_reg (void)
 
 /* Return minimum incoming stack alignment.  */
 
-static unsigned int
-ix86_minimum_incoming_stack_boundary (bool sibcall)
+unsigned int
+ix86_minimum_incoming_stack_boundary (bool sibcall,
+				      bool ignore_estimated)
 {
   unsigned int incoming_stack_boundary;
 
@@ -7247,7 +7247,8 @@ ix86_minimum_incoming_stack_boundary (bool sibcall)
      estimated stack alignment is 128bit.  */
   else if (!sibcall
 	   && ix86_force_align_arg_pointer
-	   && crtl->stack_alignment_estimated == 128)
+	   && (ignore_estimated
+	       || crtl->stack_alignment_estimated == 128))
     incoming_stack_boundary = MIN_STACK_BOUNDARY;
   else
     incoming_stack_boundary = ix86_default_incoming_stack_boundary;
@@ -23051,6 +23052,259 @@ ix86_optab_supported_p (int op, machine_mode mode1, machine_mode,
     }
 }
 
+/* Return the RTL for memset in MODE from PREV.  */
+
+static rtx
+ix86_gen_memset_value_from_prev (by_pieces_prev *prevp,
+				 scalar_int_mode mode)
+{
+  rtx prev = prevp->data;
+
+  /* Use the previous data in the same mode.  */
+  if (prevp->mode == mode)
+    return prev;
+
+  machine_mode prev_mode = prevp->mode;
+  size_t size = GET_MODE_SIZE (prev_mode);
+
+  /* NB: Skip if the previous value is 1 byte or less.  CONST_WIDE_INT
+     is in VOIDmode whose size is 0.  */
+  if (size <= 1)
+    return nullptr;
+
+  rtx reg, reg_ti;
+  switch (size)
+    {
+    default:
+      gcc_unreachable ();
+
+    case 2:
+    case 4:
+      return simplify_gen_subreg (mode, prev, prev_mode, 0);
+
+    case 8:
+      /* In 64-bit mode, use SUBREG since word size is 8 bytes.  */
+      if (TARGET_64BIT)
+	return simplify_gen_subreg (mode, prev, prev_mode, 0);
+
+      switch (GET_MODE_SIZE (mode))
+	{
+	default:
+	  gcc_unreachable ();
+	case 2:
+	case 4:
+do_hi_si_mode:
+	  /* In 32-bit mode, extract the value from an 8-byte
+	     register into an integer register first.  */
+	  reg = gen_reg_rtx (SImode);
+	  emit_move_insn (reg,
+			  simplify_gen_subreg (SImode, prev,
+					       prev_mode, 0));
+	  return simplify_gen_subreg (mode, reg, SImode, 0);
+	}
+      break;
+
+    case 16:
+      switch (GET_MODE_SIZE (mode))
+	{
+	default:
+	  gcc_unreachable ();
+	case 2:
+	case 4:
+	  /* Extract the value from a 16-byte vector register into
+	     an integer register first.  */
+	  goto do_hi_si_mode;
+	case 8:
+	  return simplify_gen_subreg (mode, prev, prev_mode, 0);
+	case 16:
+	  return prev;
+	}
+      break;
+
+    case 32:
+      switch (GET_MODE_SIZE (mode))
+	{
+	default:
+	  gcc_unreachable ();
+	case 2:
+do_himode:
+	  /* Extract the value from a 32-byte vector register into
+	     a 16-byte vector register first.  */
+	  reg_ti = gen_reg_rtx (TImode);
+	  emit_move_insn (reg_ti,
+			  simplify_gen_subreg (TImode, prev,
+					       prev_mode, 0));
+	  /* Then extract the value from a 16-byte vector register
+	     into an integer register.  */
+	  reg = gen_reg_rtx (SImode);
+	  emit_move_insn (reg,
+			  simplify_gen_subreg (SImode, reg_ti,
+					       TImode, 0));
+	  return simplify_gen_subreg (mode, reg, SImode, 0);
+
+	case 4:
+	case 8:
+do_si_di_mode:
+	  /* Extract the value from a 32-byte vector register into
+	     a 16-byte vector register first.  */
+	  reg_ti = gen_reg_rtx (TImode);
+	  emit_move_insn (reg_ti,
+			  simplify_gen_subreg (TImode, prev,
+					       prev_mode, 0));
+	  /* Generate 4/8-byte SSE -> INT move instruction.  */
+	  reg = gen_reg_rtx (mode);
+	  emit_move_insn (reg,
+			  simplify_gen_subreg (mode, reg_ti,
+					       TImode, 0));
+	  return reg;
+	case 16:
+	  return simplify_gen_subreg (mode, prev, prev_mode, 0);
+	case 32:
+	  return prev;
+	}
+
+    case 64:
+      switch (GET_MODE_SIZE (mode))
+	{
+	default:
+	  gcc_unreachable ();
+	case 2:
+	  /* Extract the value from a 64-byte vector register into
+	     a 16-byte vector register first.  */
+	  goto do_himode;
+	case 4:
+	case 8:
+	  /* Extract the value from a 64-byte vector register into
+	     a 16-byte vector register first.  */
+	  goto do_si_di_mode;
+	case 16:
+	case 32:
+	  return simplify_gen_subreg (mode, prev, prev_mode, 0);
+	case 64:
+	  return prev;
+	}
+    }
+
+  return nullptr;
+}
+
+/* Implement the TARGET_GEN_MEMSET_VALUE hook.  */
+
+static rtx
+ix86_gen_memset_value (rtx data, void *prevp, scalar_int_mode mode)
+{
+  /* Don't use the previous value if size is 1.  */
+  if (GET_MODE_SIZE (mode) == 1)
+    return data;
+
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      rtx value = ix86_gen_memset_value_from_prev (prev, mode);
+      if (value)
+	return value;
+    }
+
+  /* Use default_gen_memset_value if vector store won't be used.  */
+  if (GET_MODE_SIZE (mode) <= GET_MODE_SIZE (DImode))
+    return default_gen_memset_value (data, prevp, mode);
+
+  rtx one, target;
+  scalar_mode one_mode;
+
+  unsigned int incoming_stack_boundary
+    = ix86_minimum_incoming_stack_boundary (false, true);
+
+  switch (GET_MODE_SIZE (mode))
+    {
+    default:
+      gcc_unreachable ();
+
+    case 64:
+      if (!TARGET_AVX512BW)
+	{
+	  rtx tmp;
+	  /* NB: Don't increase stack alignment requirement by using a
+	     scratch SSE register.  */
+	  if (GET_MODE_ALIGNMENT (V32QImode) > incoming_stack_boundary)
+	    tmp = gen_rtx_REG (V32QImode, SCRATCH_SSE_REG);
+	  else
+	    tmp = gen_reg_rtx (V32QImode);
+	  if (!ix86_expand_vector_init_duplicate (false, V32QImode,
+						  tmp, data))
+	    gcc_unreachable ();
+	  target = gen_rtx_VEC_CONCAT (V64QImode, tmp, tmp);
+	  if (REGNO (tmp) == SCRATCH_SSE_REG)
+	    {
+	      tmp = gen_rtx_REG (V64QImode, SCRATCH_SSE_REG);
+	      emit_move_insn (tmp, target);
+	      return gen_rtx_REG (mode, SCRATCH_SSE_REG);
+	    }
+	  else
+	    return convert_to_mode (mode, target, 1);
+	}
+      /* FALLTHRU */
+    case 16:
+    case 32:
+      one_mode = QImode;
+      one = data;
+      break;
+    }
+
+  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (one_mode);
+  machine_mode vector_mode;
+  if (!mode_for_vector (one_mode, nunits).exists (&vector_mode))
+    gcc_unreachable ();
+
+  /* NB: Don't increase stack alignment requirement by using a scratch
+     SSE register.  */
+  if (GET_MODE_ALIGNMENT (vector_mode) > incoming_stack_boundary)
+    target = gen_rtx_REG (vector_mode, SCRATCH_SSE_REG);
+  else
+    target = gen_reg_rtx (vector_mode);
+  if (!ix86_expand_vector_init_duplicate (false, vector_mode, target,
+					  one))
+    gcc_unreachable ();
+
+  if (REGNO (target) == SCRATCH_SSE_REG)
+    return gen_rtx_REG (mode, SCRATCH_SSE_REG);
+  else
+    return convert_to_mode (mode, target, 1);
+}
+
+/* Implement the TARGET_READ_MEMSET_VALUE hook.  */
+
+static rtx
+ix86_read_memset_value (const char *str, void *prevp,
+			scalar_int_mode mode)
+{
+  rtx value;
+
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      /* Don't use the previous value if size is 1.  */
+      if (GET_MODE_SIZE (mode) == 1)
+	return default_read_memset_value (str, nullptr, mode);
+
+      value = ix86_gen_memset_value_from_prev (prev, mode);
+      if (value)
+	return value;
+
+      return default_read_memset_value (str, nullptr, mode);
+    }
+
+  /* Use default_read_memset_value if vector store can't be used.
+     NB: Need AVX2 for fast vector duplication and gen_reg_rtx.  */
+  if (GET_MODE_SIZE (mode) <= GET_MODE_SIZE (DImode)
+      || !TARGET_AVX2
+      || !reg_rtx_no)
+   return default_read_memset_value (str, nullptr, mode);
+
+  value = default_read_memset_value (str, nullptr, QImode);
+  return ix86_gen_memset_value (value, nullptr, mode);
+}
+
 /* Address space support.
 
    This is not "far pointers" in the 16-bit sense, but an easy way
@@ -23952,6 +24206,12 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
 #undef TARGET_LIBC_HAS_FAST_FUNCTION
 #define TARGET_LIBC_HAS_FAST_FUNCTION ix86_libc_has_fast_function
 
+#undef TARGET_GEN_MEMSET_VALUE
+#define TARGET_GEN_MEMSET_VALUE ix86_gen_memset_value
+
+#undef TARGET_READ_MEMSET_VALUE
+#define TARGET_READ_MEMSET_VALUE ix86_read_memset_value
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 97d6f3863cb..45d86802c51 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1131,6 +1131,10 @@ extern const char *host_detect_local_cpu (int argc, const char **argv);
 #define FIRST_MASK_REG  MASK0_REG
 #define LAST_MASK_REG   MASK7_REG
 
+/* A scratch vector reg.  */
+#define SCRATCH_SSE_REG \
+  (TARGET_64BIT ? LAST_REX_SSE_REG : LAST_SSE_REG)
+
 /* Override this in other tm.h files to cope with various OS lossage
    requiring a frame pointer.  */
 #ifndef SUBTARGET_FRAME_POINTER_REQUIRED
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-15.c b/gcc/testsuite/gcc.target/i386/pr90773-15.c
new file mode 100644
index 00000000000..c0a96fed892
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-15.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern char *dst;
+
+void
+foo (int c)
+{
+  __builtin_memset (dst, c, 17);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%edi, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+%dil, 16\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-16.c b/gcc/testsuite/gcc.target/i386/pr90773-16.c
new file mode 100644
index 00000000000..d2d1ec6141c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-16.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 17);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+\\\$-1, 16\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-17.c b/gcc/testsuite/gcc.target/i386/pr90773-17.c
new file mode 100644
index 00000000000..6c8da7d24ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-17.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 12, 19);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "vmovd\[\\t \]+%xmm\[0-9\]+, 15\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-18.c b/gcc/testsuite/gcc.target/i386/pr90773-18.c
new file mode 100644
index 00000000000..b0687abbe01
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-18.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 12, 9);
+}
+
+/* { dg-final { scan-assembler-times "movabsq\[\\t \]+\\\$868082074056920076, %r" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\\$202116108, \\(%\[\^,\]+\\)" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\\$202116108, 4\\(%\[\^,\]+\\)" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+\\\$12, 8\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-19.c b/gcc/testsuite/gcc.target/i386/pr90773-19.c
new file mode 100644
index 00000000000..8aa5540bacc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-19.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 12, 9);
+}
+
+/* { dg-final { scan-assembler-times "movabsq\[\\t \]+\\\$868082074056920076, %r" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\\$202116108, \\(%\[\^,\]+\\)" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\\$202116108, 4\\(%\[\^,\]+\\)" 1 { target ia32 } } } */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 03/12] x86: Avoid stack realignment when copying data
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 02/12] x86: Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT H.J. Lu
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

To avoid stack realignment, use SCRATCH_SSE_REG to copy data from one
memory location to another.
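
A minimal sketch of the decision this patch adds to
ix86_expand_vector_move, written as a standalone model rather than GCC
code.  The enum, the function name choose_tmp and its parameters are
hypothetical; in the patch the inputs come from TARGET_SSE,
GET_MODE_ALIGNMENT (mode) and ix86_minimum_incoming_stack_boundary
(false, true), all measured in bits.

```c
#include <stdbool.h>

/* Hypothetical labels for the two kinds of temporary the patch can use
   for a memory-to-memory vector move.  */
enum tmp_kind
{
  TMP_PSEUDO_REG,	/* force_reg: a new pseudo, which may be spilled.  */
  TMP_SCRATCH_SSE_REG	/* The fixed hard register SCRATCH_SSE_REG.  */
};

/* Model of the new condition: use the scratch hard register only when
   SSE is available and the mode needs more alignment than the incoming
   stack is known to provide.  Otherwise keep the old behavior.  */
enum tmp_kind
choose_tmp (bool have_sse, unsigned mode_align_bits,
	    unsigned incoming_boundary_bits)
{
  if (have_sse && mode_align_bits > incoming_boundary_bits)
    return TMP_SCRATCH_SSE_REG;
  return TMP_PSEUDO_REG;
}
```

For example, a 256-bit mode with a 128-bit incoming stack boundary
selects the scratch register, while a 128-bit mode with the same
boundary still goes through a pseudo register.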

gcc/

	* config/i386/i386-expand.c (ix86_expand_vector_move): Use
	SCRATCH_SSE_REG to copy data from one memory location to
	another.

gcc/testsuite/

	* gcc.target/i386/eh_return-1.c: New test.
---
 gcc/config/i386/i386-expand.c               | 16 ++++++++++++-
 gcc/testsuite/gcc.target/i386/eh_return-1.c | 26 +++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/eh_return-1.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index a9fe31fcb39..28f4f9b0e10 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -435,7 +435,21 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
       && !register_operand (op0, mode)
       && !register_operand (op1, mode))
     {
-      emit_move_insn (op0, force_reg (GET_MODE (op0), op1));
+      rtx tmp;
+      mode = GET_MODE (op0);
+      if (TARGET_SSE
+	  && (GET_MODE_ALIGNMENT (mode)
+	      > ix86_minimum_incoming_stack_boundary (false, true)))
+	{
+	  /* NB: Don't increase stack alignment requirement by using
+	     a scratch SSE register to copy data from one memory
+	     location to another since it doesn't require a spill.  */
+	  tmp = gen_rtx_REG (mode, SCRATCH_SSE_REG);
+	  emit_move_insn (tmp, op1);
+	}
+      else
+	tmp = force_reg (mode, op1);
+      emit_move_insn (op0, tmp);
       return;
     }
 
diff --git a/gcc/testsuite/gcc.target/i386/eh_return-1.c b/gcc/testsuite/gcc.target/i386/eh_return-1.c
new file mode 100644
index 00000000000..671ba635e88
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/eh_return-1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell -mno-avx512f" } */
+
+struct _Unwind_Context
+{
+  void *ra;
+  char array[48];
+};
+
+extern long uw_install_context_1 (struct _Unwind_Context *);
+
+void
+_Unwind_RaiseException (void)
+{
+  struct _Unwind_Context this_context, cur_context;
+  long offset = uw_install_context_1 (&this_context);
+  __builtin_memcpy (&this_context, &cur_context,
+		    sizeof (struct _Unwind_Context));
+  void *handler = __builtin_frob_return_addr ((&cur_context)->ra);
+  uw_install_context_1 (&cur_context);
+  __builtin_eh_return (offset, handler);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
-- 
2.31.1



* [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (2 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 03/12] x86: Avoid stack realignment when copying data H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-25 14:37   ` Richard Biener
  2021-05-18 19:16 ` [PATCH v4 05/12] x86: Update piecewise move and store H.J. Lu
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

MAX_BITSIZE_MODE_ANY_INT is only defined for i386; every other target
uses the default:

 #define MAX_BITSIZE_MODE_ANY_INT (64*BITS_PER_UNIT)

Whatever problems motivated the i386 override have been fixed now.
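
As a rough illustration of why the 160-bit limit matters: the comment
removed by this patch says wide-int.h reserves MAX_BITSIZE_MODE_ANY_INT
rounded up to a multiple of HOST_BITS_PER_WIDE_INT bits.  The sketch
below does only that round-up, assuming HOST_BITS_PER_WIDE_INT is 64 and
BITS_PER_UNIT is 8.  The function name hwi_elts_for_bits is
hypothetical, and the exact formula in wide-int.h may differ.

```c
/* Assumed value of HOST_BITS_PER_WIDE_INT; named differently here to
   make clear this is a model, not the GCC macro.  */
#define MODEL_HOST_BITS_PER_WIDE_INT 64u

/* Number of host wide ints needed to hold BITS bits, rounding up.  */
unsigned
hwi_elts_for_bits (unsigned bits)
{
  return ((bits + MODEL_HOST_BITS_PER_WIDE_INT - 1)
	  / MODEL_HOST_BITS_PER_WIDE_INT);
}
```

Under these assumptions the old value of 160 rounds up to 3 elements
(192 bits), which is too small for OImode (256 bits) or XImode
(512 bits), while the default of 64*BITS_PER_UNIT = 512 bits gives 8
elements.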

	* config/i386/i386-modes.def (MAX_BITSIZE_MODE_ANY_INT): Removed.
---
 gcc/config/i386/i386-modes.def | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/gcc/config/i386/i386-modes.def b/gcc/config/i386/i386-modes.def
index dbddfd8e48f..4e7014be034 100644
--- a/gcc/config/i386/i386-modes.def
+++ b/gcc/config/i386/i386-modes.def
@@ -107,19 +107,10 @@ INT_MODE (XI, 64);
 PARTIAL_INT_MODE (HI, 16, P2QI);
 PARTIAL_INT_MODE (SI, 32, P2HI);
 
-/* Mode used for signed overflow checking of TImode.  As
-   MAX_BITSIZE_MODE_ANY_INT is only 160, wide-int.h reserves only that
-   rounded up to multiple of HOST_BITS_PER_WIDE_INT bits in wide_int etc.,
-   so OImode is too large.  For the overflow checking we actually need
-   just 1 or 2 bits beyond TImode precision.  Use 160 bits to have
-   a multiple of 32.  */
+/* Mode used for signed overflow checking of TImode.  For the overflow
+   checking we actually need just 1 or 2 bits beyond TImode precision.
+   Use 160 bits to have a multiple of 32.  */
 PARTIAL_INT_MODE (OI, 160, POI);
 
-/* Keep the OI and XI modes from confusing the compiler into thinking
-   that these modes could actually be used for computation.  They are
-   only holders for vectors during data movement.  Include POImode precision
-   though.  */
-#define MAX_BITSIZE_MODE_ANY_INT (160)
-
 /* The symbol Pmode stands for one of the above machine modes (usually SImode).
    The tm.h file specifies which one.  It is not a distinct mode.  */
-- 
2.31.1



* [PATCH v4 05/12] x86: Update piecewise move and store
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (3 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 06/12] x86: Add AVX2 tests for PR middle-end/90773 H.J. Lu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

We can use TImode/OImode/XImode integers for piecewise move and store.

1. Define MAX_MOVE_MAX to 64, which is the constant maximum number of
bytes that a single instruction can move quickly between memory and
registers or between two memory locations.
2. Define MOVE_MAX to MOVE_MAX_PIECES, which is the maximum number of
bytes we can move from memory to memory in one reasonably fast instruction.
The difference between MAX_MOVE_MAX and MOVE_MAX is that MAX_MOVE_MAX
must be a constant, independent of compiler options, since it is used in
reload.h to define struct target_reload and MOVE_MAX can vary, depending
on compiler options.
3. When a vector register is used for piecewise move and store, we don't
increase stack_alignment_needed, since piecewise move and store never
require a vector register spill.  Since stack_realign_needed is set to
true based on stack_alignment_estimated, which is updated by pseudo
vector register usage, we also need to check stack_realign_needed in
order to eliminate the frame pointer.
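
The option-dependent selection behind MOVE_MAX_PIECES and
STORE_MAX_PIECES can be modeled outside the compiler roughly as
follows.  This is only a sketch: struct x86_tuning and its fields are
hypothetical stand-ins for the TARGET_* macros tested in the i386.h
hunk of this patch, and units_per_word stands in for UNITS_PER_WORD.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the TARGET_* conditions tested in i386.h.  */
struct x86_tuning
{
  bool avx512f, prefer_avx256;
  bool avx, prefer_avx128;
  bool avx256_split_unaligned_load, avx256_split_unaligned_store;
  bool sse2;
  bool sse_unaligned_load_optimal, sse_unaligned_store_optimal;
  unsigned units_per_word;	/* 8 in 64-bit mode, 4 in 32-bit mode.  */
};

/* Model of MOVE_MAX_PIECES: both loads and stores must be fast.  */
unsigned
move_max_pieces (const struct x86_tuning *t)
{
  if (t->avx512f && !t->prefer_avx256)
    return 64;
  if (t->avx && !t->prefer_avx128
      && !t->avx256_split_unaligned_load
      && !t->avx256_split_unaligned_store)
    return 32;
  if (t->sse2
      && t->sse_unaligned_load_optimal
      && t->sse_unaligned_store_optimal)
    return 16;
  return t->units_per_word;
}

/* Model of STORE_MAX_PIECES: only the store side is checked.  */
unsigned
store_max_pieces (const struct x86_tuning *t)
{
  if (t->avx512f && !t->prefer_avx256)
    return 64;
  if (t->avx && !t->prefer_avx128 && !t->avx256_split_unaligned_store)
    return 32;
  if (t->sse2 && t->sse_unaligned_store_optimal)
    return 16;
  return t->units_per_word;
}
```

Because the result depends on the selected ISA and tuning, it cannot
size a static structure, which is why the separate constant
MAX_MOVE_MAX (64) is used for struct target_reload.  Note also that a
tuning which splits unaligned 256-bit loads lowers the move size to 16
bytes or less, while the store size can remain 32 bytes.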

gcc/

	* config/i386/i386.c (ix86_finalize_stack_frame_flags): Also
	check stack_realign_needed for stack realignment.
	(ix86_legitimate_constant_p): Always allow CONST_WIDE_INT smaller
	than the largest integer supported by vector register.
	* config/i386/i386.h (MAX_MOVE_MAX): New.  Set to 64.
	(MOVE_MAX_PIECES): Set to bytes of the largest integer supported
	by vector register.
	(MOVE_MAX): Defined to MOVE_MAX_PIECES.
	(STORE_MAX_PIECES): New.

gcc/testsuite/

	* gcc.target/i386/pr90773-1.c: Adjust to expect movq for 32-bit.
	* gcc.target/i386/pr90773-4.c: Also run for 32-bit.
	* gcc.target/i386/pr90773-14.c: Likewise.
	* gcc.target/i386/pr90773-15.c: Likewise.
	* gcc.target/i386/pr90773-16.c: Likewise.
	* gcc.target/i386/pr90773-17.c: Likewise.
---
 gcc/config/i386/i386.c                     | 21 ++++++++++--
 gcc/config/i386/i386.h                     | 40 +++++++++++++++++-----
 gcc/testsuite/gcc.target/i386/pr90773-1.c  | 10 ++----
 gcc/testsuite/gcc.target/i386/pr90773-14.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr90773-15.c |  6 ++--
 gcc/testsuite/gcc.target/i386/pr90773-16.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr90773-17.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr90773-4.c  |  2 +-
 8 files changed, 60 insertions(+), 25 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 6a981a01668..bdf773a312f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -7942,8 +7942,17 @@ ix86_finalize_stack_frame_flags (void)
      assumed stack realignment might be needed or -fno-omit-frame-pointer
      is used, but in the end nothing that needed the stack alignment had
      been spilled nor stack access, clear frame_pointer_needed and say we
-     don't need stack realignment.  */
-  if ((stack_realign || (!flag_omit_frame_pointer && optimize))
+     don't need stack realignment.
+
+     When vector register is used for piecewise move and store, we don't
+     increase stack_alignment_needed as there is no register spill for
+     piecewise move and store.  Since stack_realign_needed is set to true
+     by checking stack_alignment_estimated which is updated by pseudo
+     vector register usage, we also need to check stack_realign_needed to
+     eliminate frame pointer.  */
+  if ((stack_realign
+       || (!flag_omit_frame_pointer && optimize)
+       || crtl->stack_realign_needed)
       && frame_pointer_needed
       && crtl->is_leaf
       && crtl->sp_is_unchanging
@@ -10402,7 +10411,13 @@ ix86_legitimate_constant_p (machine_mode mode, rtx x)
 	  /* FALLTHRU */
 	case E_OImode:
 	case E_XImode:
-	  if (!standard_sse_constant_p (x, mode))
+	  if (!standard_sse_constant_p (x, mode)
+	      && GET_MODE_SIZE (TARGET_AVX512F
+				? XImode
+				: (TARGET_AVX
+				   ? OImode
+				   : (TARGET_SSE2
+				      ? TImode : DImode))) < GET_MODE_SIZE (mode))
 	    return false;
 	default:
 	  break;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 45d86802c51..5250b035a85 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1752,9 +1752,10 @@ typedef struct ix86_args {
 /* Define this as 1 if `char' should by default be signed; else as 0.  */
 #define DEFAULT_SIGNED_CHAR 1
 
-/* Max number of bytes we can move from memory to memory
-   in one reasonably fast instruction.  */
-#define MOVE_MAX 16
+/* The constant maximum number of bytes that a single instruction can
+   move quickly between memory and registers or between two memory
+   locations.  */
+#define MAX_MOVE_MAX 64
 
 /* MOVE_MAX_PIECES is the number of bytes at a time which we can
    move efficiently, as opposed to  MOVE_MAX which is the maximum
@@ -1765,11 +1766,34 @@ typedef struct ix86_args {
    widest mode with MAX_FIXED_MODE_SIZE, we can only use TImode in
    64-bit mode.  */
 #define MOVE_MAX_PIECES \
-  ((TARGET_64BIT \
-    && TARGET_SSE2 \
-    && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
-    && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
-   ? GET_MODE_SIZE (TImode) : UNITS_PER_WORD)
+  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+   ? 64 \
+   : ((TARGET_AVX \
+       && !TARGET_PREFER_AVX128 \
+       && !TARGET_AVX256_SPLIT_UNALIGNED_LOAD \
+       && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
+      ? 32 \
+      : ((TARGET_SSE2 \
+	  && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
+	  && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
+	 ? 16 : UNITS_PER_WORD)))
+
+/* Max number of bytes we can move from memory to memory in one
+   reasonably fast instruction.  */
+#define MOVE_MAX MOVE_MAX_PIECES
+
+/* STORE_MAX_PIECES is the number of bytes at a time that we can
+   store efficiently.  */
+#define STORE_MAX_PIECES \
+  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+   ? 64 \
+   : ((TARGET_AVX \
+       && !TARGET_PREFER_AVX128 \
+       && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
+      ? 32 \
+      : ((TARGET_SSE2 \
+	  && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
+	 ? 16 : UNITS_PER_WORD)))
 
 /* If a memory-to-memory move would take MOVE_RATIO or more simple
    move-instruction pairs, we will do a cpymem or libcall instead.
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-1.c b/gcc/testsuite/gcc.target/i386/pr90773-1.c
index 1d9f282dc0d..4fd5a40d99d 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-options "-O2 -msse2 -mtune=generic" } */
 
 extern char *dst, *src;
 
@@ -9,9 +9,5 @@ foo (void)
   __builtin_memcpy (dst, src, 15);
 }
 
-/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "movl\[\\t \]+11\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 } } */
+/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-14.c b/gcc/testsuite/gcc.target/i386/pr90773-14.c
index 6364916ecac..74ba5055960 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-14.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-14.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-do compile } */
 /* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
 
 extern char *dst;
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-15.c b/gcc/testsuite/gcc.target/i386/pr90773-15.c
index c0a96fed892..880f71d1567 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-15.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-15.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-do compile } */
 /* { dg-options "-O2 -march=skylake-avx512" } */
 
 extern char *dst;
@@ -9,6 +9,6 @@ foo (int c)
   __builtin_memset (dst, c, 17);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%edi, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%.*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
-/* { dg-final { scan-assembler-times "movb\[\\t \]+%dil, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+%.*, 16\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-16.c b/gcc/testsuite/gcc.target/i386/pr90773-16.c
index d2d1ec6141c..32a976b10df 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-16.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-16.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-do compile } */
 /* { dg-options "-O2 -march=skylake-avx512" } */
 
 extern char *dst;
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-17.c b/gcc/testsuite/gcc.target/i386/pr90773-17.c
index 6c8da7d24ef..2d6fbf22a8b 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-17.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-17.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-do compile } */
 /* { dg-options "-O2 -march=skylake-avx512" } */
 
 extern char *dst;
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-4.c b/gcc/testsuite/gcc.target/i386/pr90773-4.c
index ec0bc0100ae..ee4c04678d1 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-4.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-4.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-do compile } */
 /* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
 
 extern char *dst;
-- 
2.31.1



* [PATCH v4 06/12] x86: Add AVX2 tests for PR middle-end/90773
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (4 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 05/12] x86: Update piecewise move and store H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 07/12] x86: Add tests for piecewise move and store H.J. Lu
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

	PR middle-end/90773
	* gcc.target/i386/pr90773-20.c: New test.
	* gcc.target/i386/pr90773-21.c: Likewise.
	* gcc.target/i386/pr90773-22.c: Likewise.
	* gcc.target/i386/pr90773-23.c: Likewise.
---
 gcc/testsuite/gcc.target/i386/pr90773-20.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-21.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-22.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-23.c | 13 +++++++++++++
 4 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-22.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-23.c

diff --git a/gcc/testsuite/gcc.target/i386/pr90773-20.c b/gcc/testsuite/gcc.target/i386/pr90773-20.c
new file mode 100644
index 00000000000..e61e405f2b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-20.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (int c)
+{
+  __builtin_memset (dst, c, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 32\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-21.c b/gcc/testsuite/gcc.target/i386/pr90773-21.c
new file mode 100644
index 00000000000..16ad17f3cbb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-21.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (int c)
+{
+  __builtin_memset (dst, c, 34);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movw\[\\t \]%.*, 32\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-22.c b/gcc/testsuite/gcc.target/i386/pr90773-22.c
new file mode 100644
index 00000000000..45a8ff65a84
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-22.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 32\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-23.c b/gcc/testsuite/gcc.target/i386/pr90773-23.c
new file mode 100644
index 00000000000..9256ce10ff0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-23.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 34);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movw\[\\t \]+.+, 32\\(%\[\^,\]+\\)" 1 } } */
-- 
2.31.1



* [PATCH v4 07/12] x86: Add tests for piecewise move and store
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (5 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 06/12] x86: Add AVX2 tests for PR middle-end/90773 H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 08/12] x86: Also pass -mno-avx to pr72839.c H.J. Lu
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

	* gcc.target/i386/pieces-memcpy-10.c: New test.
	* gcc.target/i386/pieces-memcpy-11.c: Likewise.
	* gcc.target/i386/pieces-memcpy-12.c: Likewise.
	* gcc.target/i386/pieces-memcpy-13.c: Likewise.
	* gcc.target/i386/pieces-memcpy-14.c: Likewise.
	* gcc.target/i386/pieces-memcpy-15.c: Likewise.
	* gcc.target/i386/pieces-memcpy-16.c: Likewise.
	* gcc.target/i386/pieces-memcpy-17.c: Likewise.
	* gcc.target/i386/pieces-memcpy-18.c: Likewise.
	* gcc.target/i386/pieces-memcpy-19.c: Likewise.
	* gcc.target/i386/pieces-memset-1.c: Likewise.
	* gcc.target/i386/pieces-memset-2.c: Likewise.
	* gcc.target/i386/pieces-memset-3.c: Likewise.
	* gcc.target/i386/pieces-memset-4.c: Likewise.
	* gcc.target/i386/pieces-memset-5.c: Likewise.
	* gcc.target/i386/pieces-memset-6.c: Likewise.
	* gcc.target/i386/pieces-memset-7.c: Likewise.
	* gcc.target/i386/pieces-memset-8.c: Likewise.
	* gcc.target/i386/pieces-memset-9.c: Likewise.
	* gcc.target/i386/pieces-memset-10.c: Likewise.
	* gcc.target/i386/pieces-memset-11.c: Likewise.
	* gcc.target/i386/pieces-memset-12.c: Likewise.
	* gcc.target/i386/pieces-memset-13.c: Likewise.
	* gcc.target/i386/pieces-memset-14.c: Likewise.
	* gcc.target/i386/pieces-memset-15.c: Likewise.
	* gcc.target/i386/pieces-memset-16.c: Likewise.
	* gcc.target/i386/pieces-memset-17.c: Likewise.
	* gcc.target/i386/pieces-memset-18.c: Likewise.
	* gcc.target/i386/pieces-memset-19.c: Likewise.
	* gcc.target/i386/pieces-memset-20.c: Likewise.
	* gcc.target/i386/pieces-memset-21.c: Likewise.
	* gcc.target/i386/pieces-memset-22.c: Likewise.
	* gcc.target/i386/pieces-memset-23.c: Likewise.
	* gcc.target/i386/pieces-memset-24.c: Likewise.
	* gcc.target/i386/pieces-memset-25.c: Likewise.
	* gcc.target/i386/pieces-memset-26.c: Likewise.
	* gcc.target/i386/pieces-memset-27.c: Likewise.
	* gcc.target/i386/pieces-memset-28.c: Likewise.
	* gcc.target/i386/pieces-memset-29.c: Likewise.
	* gcc.target/i386/pieces-memset-30.c: Likewise.
	* gcc.target/i386/pieces-memset-31.c: Likewise.
	* gcc.target/i386/pieces-memset-32.c: Likewise.
	* gcc.target/i386/pieces-memset-33.c: Likewise.
	* gcc.target/i386/pieces-memset-34.c: Likewise.
	* gcc.target/i386/pieces-memset-35.c: Likewise.
	* gcc.target/i386/pieces-memset-36.c: Likewise.
	* gcc.target/i386/pieces-memset-37.c: Likewise.
	* gcc.target/i386/pieces-memset-38.c: Likewise.
	* gcc.target/i386/pieces-memset-39.c: Likewise.
	* gcc.target/i386/pieces-memset-40.c: Likewise.
	* gcc.target/i386/pieces-memset-41.c: Likewise.
	* gcc.target/i386/pieces-memset-42.c: Likewise.
	* gcc.target/i386/pieces-memset-43.c: Likewise.
---
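Note for readers (illustrative only, not part of the patch): the
scan-assembler-times counts in these tests appear to follow from how many
pieces of the widest usable vector width fit in the operation.  The sketch
below uses a hypothetical helper, widest_piece_count, which is not a GCC
function.  It assumes the widest usable unaligned vector move is 16, 32 or
64 bytes for the chosen -m/-mtune options (an assumption read off the
expected counts, not taken from the GCC sources), and it ignores how the
remaining tail bytes are stored.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GCC.  Return how many full pieces of
   the widest vector width fit in an N-byte operation, and store the
   chosen width in *WIDTH.  MAX_WIDTH is the widest vector move assumed
   usable for the selected options: 16 (xmm), 32 (ymm) or 64 (zmm).  */
static size_t
widest_piece_count (size_t n, size_t max_width, size_t *width)
{
  size_t w = max_width;

  /* Halve the width until one piece fits in N bytes, but never go
     below the 16-byte xmm width that these tests scan for.  */
  while (w > 16 && w > n)
    w /= 2;

  *width = w;
  return n / w;
}
```

For a memcpy each piece is one load plus one store, so pieces-memcpy-10.c
(33 bytes, 16-byte moves) gives 2 pieces and the expected 4 vmovdqu.  For
a memset only the stores are counted, so pieces-memset-1.c (64 bytes,
16-byte moves) expects 4 movups.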
 .../gcc.target/i386/pieces-memcpy-10.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-11.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-12.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-13.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-14.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-15.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-16.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memcpy-7.c          | 15 +++++++++++++++
 .../gcc.target/i386/pieces-memcpy-8.c          | 14 ++++++++++++++
 .../gcc.target/i386/pieces-memcpy-9.c          | 14 ++++++++++++++
 .../gcc.target/i386/pieces-memset-1.c          | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-10.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-11.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-12.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-13.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-14.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-15.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-16.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-17.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-18.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-19.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-2.c          | 12 ++++++++++++
 .../gcc.target/i386/pieces-memset-20.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-21.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-22.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-23.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-24.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-25.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-26.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-27.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-28.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-29.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-3.c          | 18 ++++++++++++++++++
 .../gcc.target/i386/pieces-memset-30.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-31.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-32.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-33.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-34.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-35.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-36.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-37.c         | 15 +++++++++++++++
 .../gcc.target/i386/pieces-memset-38.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-39.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-4.c          | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-40.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-41.c         | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-42.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-43.c         | 17 +++++++++++++++++
 .../gcc.target/i386/pieces-memset-5.c          | 12 ++++++++++++
 .../gcc.target/i386/pieces-memset-6.c          | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-7.c          | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-8.c          | 16 ++++++++++++++++
 .../gcc.target/i386/pieces-memset-9.c          | 16 ++++++++++++++++
 53 files changed, 860 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-14.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-14.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-19.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-22.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-23.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-25.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-26.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-27.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-28.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-29.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-30.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-31.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-32.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-33.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-34.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-35.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-36.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-37.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-38.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-39.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-40.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-41.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-42.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-43.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-9.c

diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c
new file mode 100644
index 00000000000..5faee21f9b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-11.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-11.c
new file mode 100644
index 00000000000..b8917a7f917
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-11.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 64);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c
new file mode 100644
index 00000000000..f1432ebe517
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-12.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 64);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-13.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-13.c
new file mode 100644
index 00000000000..97e6067fec9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-13.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-14.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-14.c
new file mode 100644
index 00000000000..7addc4c0a28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-14.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 33);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c
new file mode 100644
index 00000000000..695e8c3fa67
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-15.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-16.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-16.c
new file mode 100644
index 00000000000..b0643d05ee7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-16.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 34);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-7.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-7.c
new file mode 100644
index 00000000000..3d248d447ea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-7.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+void
+foo (int a1, int a2, int a3, int a4, int a5, int a6, char *dst, char *src)
+{
+  __builtin_memcpy (dst, src, 17);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-8.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-8.c
new file mode 100644
index 00000000000..c13a2beb2f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-8.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */
+
+void
+foo (int a1, int a2, int a3, int a4, int a5, int a6, char *dst, char *src)
+{
+  __builtin_memcpy (dst, src, 18);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-9.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-9.c
new file mode 100644
index 00000000000..238f88b275e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-9.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+void
+foo (int a1, int a2, int a3, int a4, int a5, int a6, char *dst, char *src)
+{
+  __builtin_memcpy (dst, src, 19);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-1.c b/gcc/testsuite/gcc.target/i386/pieces-memset-1.c
new file mode 100644
index 00000000000..2b8032684b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 64);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-10.c b/gcc/testsuite/gcc.target/i386/pieces-memset-10.c
new file mode 100644
index 00000000000..a6390d1bd8f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-10.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 64);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-11.c b/gcc/testsuite/gcc.target/i386/pieces-memset-11.c
new file mode 100644
index 00000000000..3fb9038b04f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-11.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 64);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-12.c b/gcc/testsuite/gcc.target/i386/pieces-memset-12.c
new file mode 100644
index 00000000000..fa834566097
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-12.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-13.c b/gcc/testsuite/gcc.target/i386/pieces-memset-13.c
new file mode 100644
index 00000000000..7f2cd3f58ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-13.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 33);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-14.c b/gcc/testsuite/gcc.target/i386/pieces-memset-14.c
new file mode 100644
index 00000000000..45ece482464
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-14.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-15.c b/gcc/testsuite/gcc.target/i386/pieces-memset-15.c
new file mode 100644
index 00000000000..bddf47d728e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-15.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-16.c b/gcc/testsuite/gcc.target/i386/pieces-memset-16.c
new file mode 100644
index 00000000000..1c5d124cecc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-16.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 17);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-17.c b/gcc/testsuite/gcc.target/i386/pieces-memset-17.c
new file mode 100644
index 00000000000..6cdb33557c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-17.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 17);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-18.c b/gcc/testsuite/gcc.target/i386/pieces-memset-18.c
new file mode 100644
index 00000000000..adbd201b4e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-18.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 18);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-19.c b/gcc/testsuite/gcc.target/i386/pieces-memset-19.c
new file mode 100644
index 00000000000..7e9cf2e26d8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-19.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 64);
+}
+
+/* { dg-final { scan-assembler-times "pxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-2.c b/gcc/testsuite/gcc.target/i386/pieces-memset-2.c
new file mode 100644
index 00000000000..649f344e8f6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 64);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-20.c b/gcc/testsuite/gcc.target/i386/pieces-memset-20.c
new file mode 100644
index 00000000000..b8747e669e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-20.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 64);
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-21.c b/gcc/testsuite/gcc.target/i386/pieces-memset-21.c
new file mode 100644
index 00000000000..4f001c6d06c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-21.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-22.c b/gcc/testsuite/gcc.target/i386/pieces-memset-22.c
new file mode 100644
index 00000000000..5f3c454ef8f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-22.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 33);
+}
+
+/* { dg-final { scan-assembler-times "pxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-23.c b/gcc/testsuite/gcc.target/i386/pieces-memset-23.c
new file mode 100644
index 00000000000..a3b4ffc18e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-23.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-24.c b/gcc/testsuite/gcc.target/i386/pieces-memset-24.c
new file mode 100644
index 00000000000..e222787b541
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-24.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-25.c b/gcc/testsuite/gcc.target/i386/pieces-memset-25.c
new file mode 100644
index 00000000000..195ddb635eb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-25.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 17);
+}
+
+/* { dg-final { scan-assembler-times "pxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-26.c b/gcc/testsuite/gcc.target/i386/pieces-memset-26.c
new file mode 100644
index 00000000000..13606b2da54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-26.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 17);
+}
+
+/* { dg-final { scan-assembler-times "pxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-27.c b/gcc/testsuite/gcc.target/i386/pieces-memset-27.c
new file mode 100644
index 00000000000..54a672b6015
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-27.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 17);
+}
+
+/* { dg-final { scan-assembler-times "pxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-28.c b/gcc/testsuite/gcc.target/i386/pieces-memset-28.c
new file mode 100644
index 00000000000..83c2d3f0fde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-28.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 64);
+}
+
+/* { dg-final { scan-assembler-times "pcmpeqd\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-29.c b/gcc/testsuite/gcc.target/i386/pieces-memset-29.c
new file mode 100644
index 00000000000..650e6fe66a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-29.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 64);
+}
+
+/* { dg-final { scan-assembler-not "vpcmpeqd\[ \\t\]+\[^\n\]*%ymm" } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-3.c b/gcc/testsuite/gcc.target/i386/pieces-memset-3.c
new file mode 100644
index 00000000000..2aed6dbc68e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512bw -mno-avx512vl -mavx512f -mtune=intel" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* { dg-final { scan-assembler-times "vinserti64x4\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-30.c b/gcc/testsuite/gcc.target/i386/pieces-memset-30.c
new file mode 100644
index 00000000000..dcec2c700fc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-30.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 64);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-31.c b/gcc/testsuite/gcc.target/i386/pieces-memset-31.c
new file mode 100644
index 00000000000..5d20af0938d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-31.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-32.c b/gcc/testsuite/gcc.target/i386/pieces-memset-32.c
new file mode 100644
index 00000000000..c5ca0bd17ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-32.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 33);
+}
+
+/* { dg-final { scan-assembler-times "pcmpeqd\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-33.c b/gcc/testsuite/gcc.target/i386/pieces-memset-33.c
new file mode 100644
index 00000000000..a87d1b80ae6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-33.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 33);
+}
+
+/* { dg-final { scan-assembler-not "vpcmpeqd\[ \\t\]+\[^\n\]*%ymm" } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-34.c b/gcc/testsuite/gcc.target/i386/pieces-memset-34.c
new file mode 100644
index 00000000000..0c2f1ee6049
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-34.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-35.c b/gcc/testsuite/gcc.target/i386/pieces-memset-35.c
new file mode 100644
index 00000000000..b0f4a8b898e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-35.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 34);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-36.c b/gcc/testsuite/gcc.target/i386/pieces-memset-36.c
new file mode 100644
index 00000000000..d1f1263c7b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-36.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 17);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-37.c b/gcc/testsuite/gcc.target/i386/pieces-memset-37.c
new file mode 100644
index 00000000000..ec59497b116
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-37.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=generic" } */
+
+void
+foo (int a1, int a2, int a3, int a4, int a5, int a6, int x, char *dst)
+{
+  __builtin_memset (dst, x, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-38.c b/gcc/testsuite/gcc.target/i386/pieces-memset-38.c
new file mode 100644
index 00000000000..ed4a24a54fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-38.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=sandybridge" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-39.c b/gcc/testsuite/gcc.target/i386/pieces-memset-39.c
new file mode 100644
index 00000000000..a330bff5f3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-39.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512bw -mtune=generic" } */
+
+void
+foo (int a1, int a2, int a3, int a4, int a5, int a6, int x, char *dst)
+{
+  __builtin_memset (dst, x, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* { dg-final { scan-assembler-not "vinserti64x4" } } */
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-4.c b/gcc/testsuite/gcc.target/i386/pieces-memset-4.c
new file mode 100644
index 00000000000..9256919bfdf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-4.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 33);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-40.c b/gcc/testsuite/gcc.target/i386/pieces-memset-40.c
new file mode 100644
index 00000000000..4eda73ead59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-40.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx512f -mavx2 -mtune=sandybridge" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 66);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-41.c b/gcc/testsuite/gcc.target/i386/pieces-memset-41.c
new file mode 100644
index 00000000000..f86b6986da9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-41.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-42.c b/gcc/testsuite/gcc.target/i386/pieces-memset-42.c
new file mode 100644
index 00000000000..df0c122aae7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-42.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 0, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-43.c b/gcc/testsuite/gcc.target/i386/pieces-memset-43.c
new file mode 100644
index 00000000000..2f2179c2df9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-43.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, -1, 33);
+}
+
+/* { dg-final { scan-assembler-times "vpcmpeqd\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-5.c b/gcc/testsuite/gcc.target/i386/pieces-memset-5.c
new file mode 100644
index 00000000000..3e95db5efef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-5.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=haswell" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-6.c b/gcc/testsuite/gcc.target/i386/pieces-memset-6.c
new file mode 100644
index 00000000000..571113c3a33
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-6.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=intel" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-7.c b/gcc/testsuite/gcc.target/i386/pieces-memset-7.c
new file mode 100644
index 00000000000..fd159869817
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-7.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 17);
+}
+
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-8.c b/gcc/testsuite/gcc.target/i386/pieces-memset-8.c
new file mode 100644
index 00000000000..7df0019ef63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-8.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 17);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-9.c b/gcc/testsuite/gcc.target/i386/pieces-memset-9.c
new file mode 100644
index 00000000000..ed45d590875
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-9.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 17);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
-- 
2.31.1



* [PATCH v4 08/12] x86: Also pass -mno-avx to pr72839.c
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (6 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 07/12] x86: Add tests for piecewise move and store H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 09/12] x86: Also pass -mno-avx to cold-attribute-1.c H.J. Lu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Also pass -mno-avx to pr72839.c to avoid copying data with YMM or ZMM
registers.

	* gcc.target/i386/pr72839.c: Also pass -mno-avx.
---
 gcc/testsuite/gcc.target/i386/pr72839.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr72839.c b/gcc/testsuite/gcc.target/i386/pr72839.c
index ea724f70377..6888d9d0a55 100644
--- a/gcc/testsuite/gcc.target/i386/pr72839.c
+++ b/gcc/testsuite/gcc.target/i386/pr72839.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target ia32 } */
-/* { dg-options "-O2 -mtune=lakemont" } */
+/* { dg-options "-O2 -mtune=lakemont -mno-avx" } */
 
 extern char *strcpy (char *, const char *);
 
-- 
2.31.1



* [PATCH v4 09/12] x86: Also pass -mno-avx to cold-attribute-1.c
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (7 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 08/12] x86: Also pass -mno-avx to pr72839.c H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 10/12] x86: Also pass -mno-avx to sw-1.c for ia32 H.J. Lu
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Also pass -mno-avx to cold-attribute-1.c to avoid copying data with YMM
or ZMM registers.

	* gcc.target/i386/cold-attribute-1.c: Also pass -mno-avx.
---
 gcc/testsuite/gcc.target/i386/cold-attribute-1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/i386/cold-attribute-1.c b/gcc/testsuite/gcc.target/i386/cold-attribute-1.c
index 57666ac60b6..658eb3e25bb 100644
--- a/gcc/testsuite/gcc.target/i386/cold-attribute-1.c
+++ b/gcc/testsuite/gcc.target/i386/cold-attribute-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-avx" } */
 #include <string.h>
 static inline
 __attribute__ ((cold)) void
-- 
2.31.1



* [PATCH v4 10/12] x86: Also pass -mno-avx to sw-1.c for ia32
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (8 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 09/12] x86: Also pass -mno-avx to cold-attribute-1.c H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 11/12] x86: Update gcc.target/i386/incoming-11.c H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory H.J. Lu
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Also pass -mno-avx to sw-1.c for ia32, since copying data with YMM or ZMM
registers disables shrink-wrapping when the second argument is passed on
the stack.

	* gcc.target/i386/sw-1.c: Also pass -mno-avx for ia32.
---
 gcc/testsuite/gcc.target/i386/sw-1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index aec095eda62..a9c89fca4ec 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-additional-options "-mno-avx" { target ia32 } } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
 #include <string.h>
-- 
2.31.1



* [PATCH v4 11/12] x86: Update gcc.target/i386/incoming-11.c
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (9 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 10/12] x86: Also pass -mno-avx to sw-1.c for ia32 H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-18 19:16 ` [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory H.J. Lu
  11 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

Expect no stack realignment since we no longer realign the stack when
copying data.

	* gcc.target/i386/incoming-11.c: Expect no stack realignment.
---
 gcc/testsuite/gcc.target/i386/incoming-11.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/i386/incoming-11.c b/gcc/testsuite/gcc.target/i386/incoming-11.c
index a830c96f7d1..4b822684b88 100644
--- a/gcc/testsuite/gcc.target/i386/incoming-11.c
+++ b/gcc/testsuite/gcc.target/i386/incoming-11.c
@@ -15,4 +15,4 @@ void f()
 	for (i = 0; i < 100; i++) q[i] = 1;
 }
 
-/* { dg-final { scan-assembler "andl\[\\t \]*\\$-16,\[\\t \]*%esp" } } */
+/* { dg-final { scan-assembler-not "andl\[\\t \]*\\$-16,\[\\t \]*%esp" } } */
-- 
2.31.1



* [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
                   ` (10 preceding siblings ...)
  2021-05-18 19:16 ` [PATCH v4 11/12] x86: Update gcc.target/i386/incoming-11.c H.J. Lu
@ 2021-05-18 19:16 ` H.J. Lu
  2021-05-19  9:33   ` Richard Biener
  11 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-18 19:16 UTC (permalink / raw)
  To: gcc-patches
  Cc: Richard Biener, Richard Sandiford, Uros Bizjak, Bernd Edlinger

When expanding a constant constructor, don't call expand_constructor if
it is more efficient to load the data from memory via move-by-pieces.
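
A rough sketch of the check this patch adds, written as a standalone C
predicate.  The function name prefer_memory_load and its parameters are
hypothetical stand-ins for UNITS_PER_WORD, MOVE_MAX_PIECES and the result
of can_move_by_pieces as used in expr.c; it only illustrates the condition
and is not the actual implementation.

```c
#include <stdbool.h>

/* Hypothetical sketch of the condition added to expand_expr_real_1.
   BYTES is the object size in bytes, UNITS_PER_WORD and MOVE_MAX_PIECES
   model the target macros of the same names, and BY_PIECES_OK models
   the answer from can_move_by_pieces (bytes, align).  */
static bool
prefer_memory_load (unsigned long bytes, unsigned long units_per_word,
                    unsigned long move_max_pieces, bool by_pieces_ok)
{
  /* More than two words of data ...  */
  return bytes / units_per_word > 2
         /* ... a piece can be wider than a word ...  */
         && move_max_pieces > units_per_word
         /* ... and the target agrees to move it by pieces.  */
         && by_pieces_ok;
}
```

Assuming 8-byte words and 16-byte pieces, the 64-byte struct S used in
pr90773-24.c passes the check (64 / 8 = 8 > 2), while a 16-byte object
does not (16 / 8 = 2).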

gcc/

	PR middle-end/90773
	* expr.c (expand_expr_real_1): Don't call expand_constructor if
	it is more efficient to load the data from the memory.

gcc/testsuite/

	PR middle-end/90773
	* gcc.target/i386/pr90773-24.c: New test.
	* gcc.target/i386/pr90773-25.c: Likewise.
---
 gcc/expr.c                                 | 10 ++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
 3 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c

diff --git a/gcc/expr.c b/gcc/expr.c
index d09ee42e262..80e01ea1cbe 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
 		unsigned HOST_WIDE_INT ix;
 		tree field, value;
 
+		/* Check if it is more efficient to load the data from
+		   the memory directly.  FIXME: How many stores do we
+		   need here if not moved by pieces?  */
+		unsigned HOST_WIDE_INT bytes
+		  = tree_to_uhwi (TYPE_SIZE_UNIT (type));
+		if ((bytes / UNITS_PER_WORD) > 2
+		    && MOVE_MAX_PIECES > UNITS_PER_WORD
+		    && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
+		  goto normal_inner_ref;
+
 		FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix,
 					  field, value)
 		  if (tree_int_cst_equal (field, index))
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
new file mode 100644
index 00000000000..4a4b62533dc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
new file mode 100644
index 00000000000..2520b670989
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
-- 
2.31.1



* Re: [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  2021-05-18 19:16 ` [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
@ 2021-05-19  9:25   ` Richard Biener
  2021-05-19 12:55     ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-19  9:25 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> target instructions to duplicate QImode value to TImode/OImode/XImode
> value for memset.
>
>         PR middle-end/90773
>         * builtins.c (builtin_memset_read_str): Call
>         targetm.read_memset_value.
>         (builtin_memset_gen_str): Call targetm.gen_memset_value.
>         * target.def (read_memset_value): New hook.
>         (gen_memset_value): Likewise.
>         * targhooks.c: Include "builtins.h".
>         (default_read_memset_value): New function.
>         (default_gen_memset_value): Likewise.
>         * targhooks.h (default_read_memset_value): New prototype.
>         (default_gen_memset_value): Likewise.
>         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
>         TARGET_GEN_MEMSET_VALUE hooks.
>         * doc/tm.texi: Regenerated.
> ---
>  gcc/builtins.c     | 47 ++++----------------------------------
>  gcc/doc/tm.texi    | 16 +++++++++++++
>  gcc/doc/tm.texi.in |  4 ++++
>  gcc/target.def     | 20 +++++++++++++++++
>  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/targhooks.h    |  4 ++++
>  6 files changed, 104 insertions(+), 43 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index e1b284846b1..f78a36478ef 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
>     previous iteration.  */
>
>  rtx
> -builtin_memset_read_str (void *data, void *prevp,
> +builtin_memset_read_str (void *data, void *prev,
>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                          scalar_int_mode mode)
>  {
> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> -  if (prev != nullptr && prev->data != nullptr)
> -    {
> -      /* Use the previous data in the same mode.  */
> -      if (prev->mode == mode)
> -       return prev->data;
> -    }
> -
> -  const char *c = (const char *) data;
> -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> -
> -  memset (p, *c, GET_MODE_SIZE (mode));
> -
> -  return c_readstr (p, mode);
> +  return targetm.read_memset_value ((const char *) data, prev, mode);
>  }
>
>  /* Callback routine for store_by_pieces.  Return the RTL of a register
> @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
>     nullptr, it has the RTL info from the previous iteration.  */
>
>  static rtx
> -builtin_memset_gen_str (void *data, void *prevp,
> +builtin_memset_gen_str (void *data, void *prev,
>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                         scalar_int_mode mode)
>  {
> -  rtx target, coeff;
> -  size_t size;
> -  char *p;
> -
> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> -  if (prev != nullptr && prev->data != nullptr)
> -    {
> -      /* Use the previous data in the same mode.  */
> -      if (prev->mode == mode)
> -       return prev->data;
> -
> -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> -      if (target != nullptr)
> -       return target;
> -    }
> -
> -  size = GET_MODE_SIZE (mode);
> -  if (size == 1)
> -    return (rtx) data;
> -
> -  p = XALLOCAVEC (char, size);
> -  memset (p, 1, size);
> -  coeff = c_readstr (p, mode);
> -
> -  target = convert_to_mode (mode, (rtx) data, 1);
> -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> -  return force_reg (mode, target);
> +  return targetm.gen_memset_value ((rtx) data, prev, mode);
>  }
>
>  /* Expand expression EXP, which is a call to the memset builtin.  Return
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 85ea9395560..51385044e76 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
>  @end deftypefn
>
> +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> +This function returns the RTL of a constant integer corresponding to
> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> +the RTL information from the previous interation.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> +This function returns the RTL of a register containing
> +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> +char value given in the RTL register @var{data}.  For example, if
> +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> +the previous iteration.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
>  This target hook returns a new value for the number of times @var{loop}
>  should be unrolled. The parameter @var{nunroll} is the number of times
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index d8e3de14af1..8d4c3949fbf 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -7956,6 +7956,10 @@ lists.
>
>  @hook TARGET_GEN_CCMP_NEXT
>
> +@hook TARGET_READ_MEMSET_VALUE
> +
> +@hook TARGET_GEN_MEMSET_VALUE
> +
>  @hook TARGET_LOOP_UNROLL_ADJUST
>
>  @defmac POWI_MAX_MULTS
> diff --git a/gcc/target.def b/gcc/target.def
> index bbaf6b4f3a0..c9aca40fa88 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2694,6 +2694,26 @@ DEFHOOK
>   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
>   NULL)
>
> +DEFHOOK
> +(read_memset_value,
> + "This function returns the RTL of a constant integer corresponding to\n\
> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\

Where is 'str' defined?  I can't really tell what the difference
between read_memset_value and gen_memset_value is.

Somehow I feel that an optab for the "splat" operation similar
to vec_duplicate might be a better way to expose this - of course
that doesn't handle the "prev" thing.

So how's this the right point of abstraction to the target?

> +the RTL information from the previous interation.",
> + rtx, (const char *c, void *prev, scalar_int_mode mode),
> + default_read_memset_value)
> +
> +DEFHOOK
> +(gen_memset_value,
> + "This function returns the RTL of a register containing\n\
> +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned\n\
> +char value given in the RTL register @var{data}.  For example, if\n\
> +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.\n\
> +If @var{PREV} is not @samp{nullptr}, it is the RTL information from\n\
> +the previous iteration.",
> + rtx, (rtx data, void *prev, scalar_int_mode mode),
> + default_gen_memset_value)
> +
>  /* Return a new value for loop unroll size.  */
>  DEFHOOK
>  (loop_unroll_adjust,
> diff --git a/gcc/targhooks.c b/gcc/targhooks.c
> index 1947ef26fd6..b55e6ec6756 100644
> --- a/gcc/targhooks.c
> +++ b/gcc/targhooks.c
> @@ -90,6 +90,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "attribs.h"
>  #include "asan.h"
>  #include "emit-rtl.h"
> +#include "builtins.h"
>
>  bool
>  default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
> @@ -2627,4 +2628,59 @@ default_memtag_untagged_pointer (rtx tagged_pointer, rtx target)
>    return untagged_base;
>  }
>
> +/* Default implementation of TARGET_READ_MEMSET_VALUE.  */
> +
> +rtx
> +default_read_memset_value (const char *c, void *prevp,
> +                          scalar_int_mode mode)
> +{
> +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> +  if (prev != nullptr && prev->data != nullptr)
> +    {
> +      /* Use the previous data in the same mode.  */
> +      if (prev->mode == mode)
> +       return prev->data;
> +    }
> +
> +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> +
> +  memset (p, *c, GET_MODE_SIZE (mode));
> +
> +  return c_readstr (p, mode);
> +}
> +
> +/* Default implementation of TARGET_GEN_MEMSET_VALUE.  */
> +
> +rtx
> +default_gen_memset_value (rtx data, void *prevp, scalar_int_mode mode)
> +{
> +  rtx target, coeff;
> +  size_t size;
> +  char *p;
> +
> +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> +  if (prev != nullptr && prev->data != nullptr)
> +    {
> +      /* Use the previous data in the same mode.  */
> +      if (prev->mode == mode)
> +       return prev->data;
> +
> +      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> +      if (target != nullptr)
> +       return target;
> +    }
> +
> +  size = GET_MODE_SIZE (mode);
> +  if (size == 1)
> +    return data;
> +
> +  p = XALLOCAVEC (char, size);
> +  memset (p, 1, size);
> +  coeff = c_readstr (p, mode);
> +
> +  target = convert_to_mode (mode, data, 1);
> +  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> +  return force_reg (mode, target);
> +}
> +
>  #include "gt-targhooks.h"
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index b537038c0aa..3c00927e196 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -300,4 +300,8 @@ extern rtx default_memtag_set_tag (rtx, rtx, rtx);
>  extern rtx default_memtag_extract_tag (rtx, rtx);
>  extern rtx default_memtag_untagged_pointer (rtx, rtx);
>
> +extern rtx default_read_memset_value (const char *, void *,
> +                                     scalar_int_mode);
> +extern rtx default_gen_memset_value (rtx, void *, scalar_int_mode);
> +
>  #endif /* GCC_TARGHOOKS_H */
> --
> 2.31.1
>


* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-18 19:16 ` [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory H.J. Lu
@ 2021-05-19  9:33   ` Richard Biener
  2021-05-19 13:22     ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-19  9:33 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> When expanding a constant constructor, don't call expand_constructor if
> it is more efficient to load the data from the memory via move by pieces.
>
> gcc/
>
>         PR middle-end/90773
>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>         it is more efficient to load the data from the memory.
>
> gcc/testsuite/
>
>         PR middle-end/90773
>         * gcc.target/i386/pr90773-24.c: New test.
>         * gcc.target/i386/pr90773-25.c: Likewise.
> ---
>  gcc/expr.c                                 | 10 ++++++++++
>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>  3 files changed, 52 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>
> diff --git a/gcc/expr.c b/gcc/expr.c
> index d09ee42e262..80e01ea1cbe 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>                 unsigned HOST_WIDE_INT ix;
>                 tree field, value;
>
> +               /* Check if it is more efficient to load the data from
> +                  the memory directly.  FIXME: How many stores do we
> +                  need here if not moved by pieces?  */
> +               unsigned HOST_WIDE_INT bytes
> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));

that's prone to fail - it could be a VLA.

> +               if ((bytes / UNITS_PER_WORD) > 2
> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> +                 goto normal_inner_ref;
> +

It looks like you're concerned about aggregate copies but this also handles
non-aggregates (which on GIMPLE might already be optimized of course).

Also you say "if it's cheaper" but I see no cost considerations.  How do
we generally handle immediate constant vs. load from constant pool costs?

>                 FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix,
>                                           field, value)
>                   if (tree_int_cst_equal (field, index))
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> new file mode 100644
> index 00000000000..4a4b62533dc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64" } */
> +
> +struct S
> +{
> +  long long s1 __attribute__ ((aligned (8)));
> +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> +};
> +
> +const struct S array[] = {
> +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> +};
> +
> +void
> +foo (struct S *x)
> +{
> +  x[0] = array[0];
> +}
> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> new file mode 100644
> index 00000000000..2520b670989
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake" } */
> +
> +struct S
> +{
> +  long long s1 __attribute__ ((aligned (8)));
> +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> +};
> +
> +const struct S array[] = {
> +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> +};
> +
> +void
> +foo (struct S *x)
> +{
> +  x[0] = array[0];
> +}
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
> --
> 2.31.1
>


* Re: [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE
  2021-05-19  9:25   ` Richard Biener
@ 2021-05-19 12:55     ` H.J. Lu
  2021-05-20 20:49       ` [PATCH] Add 3 target hooks for memset H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-19 12:55 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Wed, May 19, 2021 at 2:25 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > target instructions to duplicate QImode value to TImode/OImode/XImode
> > value for memmset.
> >
> >         PR middle-end/90773
> >         * builtins.c (builtin_memset_read_str): Call
> >         targetm.read_memset_value.
> >         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> >         * target.def (read_memset_value): New hook.
> >         (gen_memset_value): Likewise.
> >         * targhooks.c: Inclue "builtins.h".
> >         (default_read_memset_value): New function.
> >         (default_gen_memset_value): Likewise.
> >         * targhooks.h (default_read_memset_value): New prototype.
> >         (default_gen_memset_value): Likewise.
> >         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> >         TARGET_GEN_MEMSET_VALUE hooks.
> >         * doc/tm.texi: Regenerated.
> > ---
> >  gcc/builtins.c     | 47 ++++----------------------------------
> >  gcc/doc/tm.texi    | 16 +++++++++++++
> >  gcc/doc/tm.texi.in |  4 ++++
> >  gcc/target.def     | 20 +++++++++++++++++
> >  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> >  gcc/targhooks.h    |  4 ++++
> >  6 files changed, 104 insertions(+), 43 deletions(-)
> >
> > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > index e1b284846b1..f78a36478ef 100644
> > --- a/gcc/builtins.c
> > +++ b/gcc/builtins.c
> > @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> >     previous iteration.  */
> >
> >  rtx
> > -builtin_memset_read_str (void *data, void *prevp,
> > +builtin_memset_read_str (void *data, void *prev,
> >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                          scalar_int_mode mode)
> >  {
> > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > -  if (prev != nullptr && prev->data != nullptr)
> > -    {
> > -      /* Use the previous data in the same mode.  */
> > -      if (prev->mode == mode)
> > -       return prev->data;
> > -    }
> > -
> > -  const char *c = (const char *) data;
> > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > -
> > -  memset (p, *c, GET_MODE_SIZE (mode));
> > -
> > -  return c_readstr (p, mode);
> > +  return targetm.read_memset_value ((const char *) data, prev, mode);
> >  }
> >
> >  /* Callback routine for store_by_pieces.  Return the RTL of a register
> > @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> >     nullptr, it has the RTL info from the previous iteration.  */
> >
> >  static rtx
> > -builtin_memset_gen_str (void *data, void *prevp,
> > +builtin_memset_gen_str (void *data, void *prev,
> >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                         scalar_int_mode mode)
> >  {
> > -  rtx target, coeff;
> > -  size_t size;
> > -  char *p;
> > -
> > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > -  if (prev != nullptr && prev->data != nullptr)
> > -    {
> > -      /* Use the previous data in the same mode.  */
> > -      if (prev->mode == mode)
> > -       return prev->data;
> > -
> > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > -      if (target != nullptr)
> > -       return target;
> > -    }
> > -
> > -  size = GET_MODE_SIZE (mode);
> > -  if (size == 1)
> > -    return (rtx) data;
> > -
> > -  p = XALLOCAVEC (char, size);
> > -  memset (p, 1, size);
> > -  coeff = c_readstr (p, mode);
> > -
> > -  target = convert_to_mode (mode, (rtx) data, 1);
> > -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > -  return force_reg (mode, target);
> > +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> >  }
> >
> >  /* Expand expression EXP, which is a call to the memset builtin.  Return
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index 85ea9395560..51385044e76 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> > +This function returns the RTL of a constant integer corresponding to
> > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> > +the RTL information from the previous interation.
> > +@end deftypefn
> > +
> > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> > +This function returns the RTL of a register containing
> > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> > +char value given in the RTL register @var{data}.  For example, if
> > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> > +the previous iteration.
> > +@end deftypefn
> > +
> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> >  This target hook returns a new value for the number of times @var{loop}
> >  should be unrolled. The parameter @var{nunroll} is the number of times
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index d8e3de14af1..8d4c3949fbf 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -7956,6 +7956,10 @@ lists.
> >
> >  @hook TARGET_GEN_CCMP_NEXT
> >
> > +@hook TARGET_READ_MEMSET_VALUE
> > +
> > +@hook TARGET_GEN_MEMSET_VALUE
> > +
> >  @hook TARGET_LOOP_UNROLL_ADJUST
> >
> >  @defmac POWI_MAX_MULTS
> > diff --git a/gcc/target.def b/gcc/target.def
> > index bbaf6b4f3a0..c9aca40fa88 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2694,6 +2694,26 @@ DEFHOOK
> >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> >   NULL)
> >
> > +DEFHOOK
> > +(read_memset_value,
> > + "This function returns the RTL of a constant integer corresponding to\n\
> > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
>
> where is 'str' defined?  I can't really tell what's the difference

Fixed with

diff --git a/gcc/target.def b/gcc/target.def
index c9aca40fa88..4c3a5fcc634 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2699,8 +2699,8 @@ DEFHOOK
  "This function returns the RTL of a constant integer corresponding to\n\
 target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
 constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
-the RTL information from the previous interation.",
- rtx, (const char *c, void *prev, scalar_int_mode mode),
+the RTL information from the previous iteration.",
+ rtx, (const char *str, void *prev, scalar_int_mode mode),
  default_read_memset_value)

 DEFHOOK

> from read_memset_value
> and gen_memset_value.

The difference is that the input of read_memset_value is a string constant
like "123" and the input of gen_memset_value is an RTL register.
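
As a rough illustration of the two cases, here is a standalone sketch
(not GCC code; splat_const, splat_reg and splat_selftest are made-up
names, and a fixed 4-byte width stands in for the mode argument).  For a
known byte, the replicated value can be computed up front, much as
default_read_memset_value fills a buffer with memset before reading it
back.  For a byte only known at run time, default_gen_memset_value
multiplies it by a coefficient whose bytes are all 1:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constant case: the byte is known, so fill a buffer and read it back
   as an integer (a stand-in for memset followed by c_readstr).  */
static uint32_t
splat_const (unsigned char c)
{
  unsigned char buf[sizeof (uint32_t)];
  uint32_t v;

  memset (buf, c, sizeof buf);
  memcpy (&v, buf, sizeof v);
  return v;
}

/* Register case: the byte is only known at run time, so multiply it
   by 0x01010101, whose bytes are all 1.  */
static uint32_t
splat_reg (unsigned char c)
{
  return (uint32_t) c * UINT32_C (0x01010101);
}

/* Both routes produce the same value for every byte.  */
void
splat_selftest (void)
{
  for (unsigned c = 0; c < 256; c++)
    assert (splat_const ((unsigned char) c)
            == splat_reg ((unsigned char) c));
}
```

Every byte of the result is the same, so host byte order does not
affect the comparison.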

> Somehow I feel that an optab for the "splat" operation similar
> to vec_duplicate might be a better way to expose this - of course
> that doesn't handle the "prev" thing.

The x86 backend has ix86_expand_vector_init_duplicate () to
broadcast QImode to TImode/OImode/XImode:

/* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
   with all elements equal to VAR.  Return true if successful.  */

bool
ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
                                   rtx target, rtx val)

> So how's this the right point of abstraction to the target?

I can add 2 target hooks, one for a scratch register and one for
broadcasting QImode to TImode/OImode/XImode.  Then I can
move the x86 code to the middle-end.

> > +the RTL information from the previous interation.",
> > + rtx, (const char *c, void *prev, scalar_int_mode mode),
> > + default_read_memset_value)
> > +
> > +DEFHOOK
> > +(gen_memset_value,
> > + "This function returns the RTL of a register containing\n\
> > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned\n\
> > +char value given in the RTL register @var{data}.  For example, if\n\
> > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.\n\
> > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from\n\
> > +the previous iteration.",
> > + rtx, (rtx data, void *prev, scalar_int_mode mode),
> > + default_gen_memset_value)
> > +
> >  /* Return a new value for loop unroll size.  */
> >  DEFHOOK
> >  (loop_unroll_adjust,
> > diff --git a/gcc/targhooks.c b/gcc/targhooks.c
> > index 1947ef26fd6..b55e6ec6756 100644
> > --- a/gcc/targhooks.c
> > +++ b/gcc/targhooks.c
> > @@ -90,6 +90,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "attribs.h"
> >  #include "asan.h"
> >  #include "emit-rtl.h"
> > +#include "builtins.h"
> >
> >  bool
> >  default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
> > @@ -2627,4 +2628,59 @@ default_memtag_untagged_pointer (rtx tagged_pointer, rtx target)
> >    return untagged_base;
> >  }
> >
> > +/* Default implementation of TARGET_READ_MEMSET_VALUE.  */
> > +
> > +rtx
> > +default_read_memset_value (const char *c, void *prevp,
> > +                          scalar_int_mode mode)
> > +{
> > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > +  if (prev != nullptr && prev->data != nullptr)
> > +    {
> > +      /* Use the previous data in the same mode.  */
> > +      if (prev->mode == mode)
> > +       return prev->data;
> > +    }
> > +
> > +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > +
> > +  memset (p, *c, GET_MODE_SIZE (mode));
> > +
> > +  return c_readstr (p, mode);
> > +}
> > +
> > +/* Default implementation of TARGET_GEN_MEMSET_VALUE.  */
> > +
> > +rtx
> > +default_gen_memset_value (rtx data, void *prevp, scalar_int_mode mode)
> > +{
> > +  rtx target, coeff;
> > +  size_t size;
> > +  char *p;
> > +
> > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > +  if (prev != nullptr && prev->data != nullptr)
> > +    {
> > +      /* Use the previous data in the same mode.  */
> > +      if (prev->mode == mode)
> > +       return prev->data;
> > +
> > +      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > +      if (target != nullptr)
> > +       return target;
> > +    }
> > +
> > +  size = GET_MODE_SIZE (mode);
> > +  if (size == 1)
> > +    return data;
> > +
> > +  p = XALLOCAVEC (char, size);
> > +  memset (p, 1, size);
> > +  coeff = c_readstr (p, mode);
> > +
> > +  target = convert_to_mode (mode, data, 1);
> > +  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > +  return force_reg (mode, target);
> > +}
> > +
> >  #include "gt-targhooks.h"
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index b537038c0aa..3c00927e196 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -300,4 +300,8 @@ extern rtx default_memtag_set_tag (rtx, rtx, rtx);
> >  extern rtx default_memtag_extract_tag (rtx, rtx);
> >  extern rtx default_memtag_untagged_pointer (rtx, rtx);
> >
> > +extern rtx default_read_memset_value (const char *, void *,
> > +                                     scalar_int_mode);
> > +extern rtx default_gen_memset_value (rtx, void *, scalar_int_mode);
> > +
> >  #endif /* GCC_TARGHOOKS_H */
> > --
> > 2.31.1
> >



-- 
H.J.


* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-19  9:33   ` Richard Biener
@ 2021-05-19 13:22     ` H.J. Lu
  2021-05-19 13:27       ` Bernd Edlinger
  2021-05-20  7:51       ` Richard Biener
  0 siblings, 2 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-19 13:22 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Wed, May 19, 2021 at 2:33 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > When expanding a constant constructor, don't call expand_constructor if
> > it is more efficient to load the data from the memory via move by pieces.
> >
> > gcc/
> >
> >         PR middle-end/90773
> >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> >         it is more efficient to load the data from the memory.
> >
> > gcc/testsuite/
> >
> >         PR middle-end/90773
> >         * gcc.target/i386/pr90773-24.c: New test.
> >         * gcc.target/i386/pr90773-25.c: Likewise.
> > ---
> >  gcc/expr.c                                 | 10 ++++++++++
> >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> >  3 files changed, 52 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> >
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index d09ee42e262..80e01ea1cbe 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> >                 unsigned HOST_WIDE_INT ix;
> >                 tree field, value;
> >
> > +               /* Check if it is more efficient to load the data from
> > +                  the memory directly.  FIXME: How many stores do we
> > +                  need here if not moved by pieces?  */
> > +               unsigned HOST_WIDE_INT bytes
> > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>
> that's prone to fail - it could be a VLA.

What do you mean by fail?  Is it an ICE or a missed optimization?
Do you have a testcase?

>
> > +               if ((bytes / UNITS_PER_WORD) > 2
> > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > +                 goto normal_inner_ref;
> > +
>
> It looks like you're concerned about aggregate copies but this also handles
> non-aggregates (which on GIMPLE might already be optimized of course).

Here I check if we copy more than 2 words and we can move more than
a word in a single instruction.
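
The size part of that condition can be sketched outside GCC as follows.
This is only an illustration: prefer_memory_copy and its selftest are
hypothetical names, the three parameters stand in for the object size,
UNITS_PER_WORD and MOVE_MAX_PIECES, and the can_move_by_pieces () check
from the patch is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new condition in expand_expr_real_1.
   BYTES is the object size, UNITS_PER_WORD the word size in bytes and
   MOVE_MAX_PIECES the widest by-pieces move in bytes.  The patch also
   requires can_move_by_pieces () to succeed, which is left out.  */
static bool
prefer_memory_copy (unsigned long bytes, unsigned units_per_word,
                    unsigned move_max_pieces)
{
  return bytes / units_per_word > 2 && move_max_pieces > units_per_word;
}

void
prefer_memory_copy_selftest (void)
{
  /* 64-byte object, 8-byte words, 16-byte moves: copy from memory.  */
  assert (prefer_memory_copy (64, 8, 16));
  /* Exactly 2 words is not more than 2: keep the constructor path.  */
  assert (!prefer_memory_copy (16, 8, 16));
  /* No move wider than a word: nothing to gain.  */
  assert (!prefer_memory_copy (64, 8, 8));
}
```

The numbers in the selftest are chosen to resemble an x86-64 setup and
are assumptions, not values taken from the backend.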

> Also you say "if it's cheaper" but I see no cost considerations.  How do
> we generally handle immed const vs. load from constant pool costs?

This trades 2 (up to 8) stores for one load plus one store.  Is there
a way to check which one is faster?
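
To make the counts concrete, here is a small standalone sketch.  The
helper pieces_needed is a made-up name, and the 64-byte size is the
span covered by the four 16-byte stores that pr90773-24.c scans for.
It only counts chunks and says nothing about which sequence is faster:

```c
#include <assert.h>

/* Number of PIECE-byte chunks needed to cover BYTES bytes.  */
static unsigned long
pieces_needed (unsigned long bytes, unsigned long piece)
{
  return (bytes + piece - 1) / piece;
}

void
pieces_needed_selftest (void)
{
  assert (pieces_needed (64, 8) == 8);   /* word-sized stores */
  assert (pieces_needed (64, 16) == 4);  /* 16-byte load/store pairs */
  assert (pieces_needed (64, 32) == 2);  /* 32-byte load/store pairs */
}
```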

> >                 FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix,
> >                                           field, value)
> >                   if (tree_int_cst_equal (field, index))
> > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> > new file mode 100644
> > index 00000000000..4a4b62533dc
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64" } */
> > +
> > +struct S
> > +{
> > +  long long s1 __attribute__ ((aligned (8)));
> > +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> > +};
> > +
> > +const struct S array[] = {
> > +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > +};
> > +
> > +void
> > +foo (struct S *x)
> > +{
> > +  x[0] = array[0];
> > +}
> > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> > new file mode 100644
> > index 00000000000..2520b670989
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake" } */
> > +
> > +struct S
> > +{
> > +  long long s1 __attribute__ ((aligned (8)));
> > +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> > +};
> > +
> > +const struct S array[] = {
> > +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > +};
> > +
> > +void
> > +foo (struct S *x)
> > +{
> > +  x[0] = array[0];
> > +}
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
> > --
> > 2.31.1
> >



-- 
H.J.


* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-19 13:22     ` H.J. Lu
@ 2021-05-19 13:27       ` Bernd Edlinger
  2021-05-19 19:04         ` H.J. Lu
  2021-05-20  7:51       ` Richard Biener
  1 sibling, 1 reply; 52+ messages in thread
From: Bernd Edlinger @ 2021-05-19 13:27 UTC (permalink / raw)
  To: H.J. Lu, Richard Biener; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak

On 5/19/21 3:22 PM, H.J. Lu wrote:
> On Wed, May 19, 2021 at 2:33 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> When expanding a constant constructor, don't call expand_constructor if
>>> it is more efficient to load the data from the memory via move by pieces.
>>>
>>> gcc/
>>>
>>>         PR middle-end/90773
>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>>>         it is more efficient to load the data from the memory.
>>>
>>> gcc/testsuite/
>>>
>>>         PR middle-end/90773
>>>         * gcc.target/i386/pr90773-24.c: New test.
>>>         * gcc.target/i386/pr90773-25.c: Likewise.
>>> ---
>>>  gcc/expr.c                                 | 10 ++++++++++
>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>>>  3 files changed, 52 insertions(+)
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>>>
>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>> index d09ee42e262..80e01ea1cbe 100644
>>> --- a/gcc/expr.c
>>> +++ b/gcc/expr.c
>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>>                 unsigned HOST_WIDE_INT ix;
>>>                 tree field, value;
>>>
>>> +               /* Check if it is more efficient to load the data from
>>> +                  the memory directly.  FIXME: How many stores do we
>>> +                  need here if not moved by pieces?  */
>>> +               unsigned HOST_WIDE_INT bytes
>>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>
>> that's prone to fail - it could be a VLA.
> 
> What do you mean by fail?  Is it ICE or missed optimization?
> Do you have a testcase?
> 

I think for a VLA the TYPE_SIZE_UNIT may be unknown (NULL), or something like "x".

For instance, something like:

int test (int x)
{
  int vla[x];

  vla[x-1] = 0;
  return vla[x-1];
}
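
A standalone way to see that such a size is a run-time quantity is the
sketch below (vla_bytes and its selftest are made-up names; x is
assumed to be positive).  The compiler cannot fold sizeof here to a
constant, which is presumably why the size cannot be read
unconditionally with tree_to_uhwi:

```c
#include <assert.h>
#include <stddef.h>

/* The size of a VLA is computed at run time from X (assumed > 0).  */
size_t
vla_bytes (int x)
{
  int vla[x];

  return sizeof vla;
}

void
vla_bytes_selftest (void)
{
  assert (vla_bytes (1) == sizeof (int));
  assert (vla_bytes (5) == 5 * sizeof (int));
}
```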


Bernd.

>>
>>> +               if ((bytes / UNITS_PER_WORD) > 2
>>> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
>>> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
>>> +                 goto normal_inner_ref;
>>> +
>>
>> It looks like you're concerned about aggregate copies but this also handles
>> non-aggregates (which on GIMPLE might already be optimized of course).
> 
> Here I check if we copy more than 2 words and we can move more than
> a word in a single instruction.
> 
>> Also you say "if it's cheaper" but I see no cost considerations.  How do
>> we generally handle immed const vs. load from constant pool costs?
> 
> This trades 2 (update to 8) stores with one load plus one store.  Is there
> a way to check which one is faster?
> 
>>>                 FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix,
>>>                                           field, value)
>>>                   if (tree_int_cst_equal (field, index))
>>> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
>>> new file mode 100644
>>> index 00000000000..4a4b62533dc
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
>>> @@ -0,0 +1,22 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -march=x86-64" } */
>>> +
>>> +struct S
>>> +{
>>> +  long long s1 __attribute__ ((aligned (8)));
>>> +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
>>> +};
>>> +
>>> +const struct S array[] = {
>>> +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>>> +};
>>> +
>>> +void
>>> +foo (struct S *x)
>>> +{
>>> +  x[0] = array[0];
>>> +}
>>> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
>>> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
>>> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
>>> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
>>> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
>>> new file mode 100644
>>> index 00000000000..2520b670989
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
>>> @@ -0,0 +1,20 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -march=skylake" } */
>>> +
>>> +struct S
>>> +{
>>> +  long long s1 __attribute__ ((aligned (8)));
>>> +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
>>> +};
>>> +
>>> +const struct S array[] = {
>>> +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>>> +};
>>> +
>>> +void
>>> +foo (struct S *x)
>>> +{
>>> +  x[0] = array[0];
>>> +}
>>> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
>>> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
>>> --
>>> 2.31.1
>>>
> 
> 
> 

* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-19 13:27       ` Bernd Edlinger
@ 2021-05-19 19:04         ` H.J. Lu
  2021-05-20  6:57           ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-19 19:04 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Richard Biener, GCC Patches, Richard Sandiford, Uros Bizjak

On Wed, May 19, 2021 at 6:27 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 5/19/21 3:22 PM, H.J. Lu wrote:
> > On Wed, May 19, 2021 at 2:33 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >>
> >> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>
> >>> When expanding a constant constructor, don't call expand_constructor if
> >>> it is more efficient to load the data from the memory via move by pieces.
> >>>
> >>> gcc/
> >>>
> >>>         PR middle-end/90773
> >>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> >>>         it is more efficient to load the data from the memory.
> >>>
> >>> gcc/testsuite/
> >>>
> >>>         PR middle-end/90773
> >>>         * gcc.target/i386/pr90773-24.c: New test.
> >>>         * gcc.target/i386/pr90773-25.c: Likewise.
> >>> ---
> >>>  gcc/expr.c                                 | 10 ++++++++++
> >>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> >>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> >>>  3 files changed, 52 insertions(+)
> >>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> >>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> >>>
> >>> diff --git a/gcc/expr.c b/gcc/expr.c
> >>> index d09ee42e262..80e01ea1cbe 100644
> >>> --- a/gcc/expr.c
> >>> +++ b/gcc/expr.c
> >>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> >>>                 unsigned HOST_WIDE_INT ix;
> >>>                 tree field, value;
> >>>
> >>> +               /* Check if it is more efficient to load the data from
> >>> +                  the memory directly.  FIXME: How many stores do we
> >>> +                  need here if not moved by pieces?  */
> >>> +               unsigned HOST_WIDE_INT bytes
> >>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> >>
> >> that's prone to fail - it could be a VLA.
> >
> > What do you mean by fail?  Is it ICE or missed optimization?
> > Do you have a testcase?
> >
>
> I think for a VLA the TYPE_SIZE_UNIT may be unknown (NULL), or something like "x".
>
> for instance something like
>
> int test (int x)
> {
>   int vla[x];
>
>   vla[x-1] = 0;
>   return vla[x-1];
> }

My patch changes the CONSTRUCTOR code path.   I couldn't find a CONSTRUCTOR
testcase with VLA.

-- 
H.J.

* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-19 19:04         ` H.J. Lu
@ 2021-05-20  6:57           ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2021-05-20  6:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Edlinger, GCC Patches, Richard Sandiford, Uros Bizjak

On Wed, May 19, 2021 at 9:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 6:27 AM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> >
> > On 5/19/21 3:22 PM, H.J. Lu wrote:
> > > On Wed, May 19, 2021 at 2:33 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >>
> > >> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >>>
> > >>> When expanding a constant constructor, don't call expand_constructor if
> > >>> it is more efficient to load the data from the memory via move by pieces.
> > >>>
> > >>> gcc/
> > >>>
> > >>>         PR middle-end/90773
> > >>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > >>>         it is more efficient to load the data from the memory.
> > >>>
> > >>> gcc/testsuite/
> > >>>
> > >>>         PR middle-end/90773
> > >>>         * gcc.target/i386/pr90773-24.c: New test.
> > >>>         * gcc.target/i386/pr90773-25.c: Likewise.
> > >>> ---
> > >>>  gcc/expr.c                                 | 10 ++++++++++
> > >>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > >>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > >>>  3 files changed, 52 insertions(+)
> > >>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > >>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > >>>
> > >>> diff --git a/gcc/expr.c b/gcc/expr.c
> > >>> index d09ee42e262..80e01ea1cbe 100644
> > >>> --- a/gcc/expr.c
> > >>> +++ b/gcc/expr.c
> > >>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> > >>>                 unsigned HOST_WIDE_INT ix;
> > >>>                 tree field, value;
> > >>>
> > >>> +               /* Check if it is more efficient to load the data from
> > >>> +                  the memory directly.  FIXME: How many stores do we
> > >>> +                  need here if not moved by pieces?  */
> > >>> +               unsigned HOST_WIDE_INT bytes
> > >>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> > >>
> > >> that's prone to fail - it could be a VLA.
> > >
> > > What do you mean by fail?  Is it ICE or missed optimization?
> > > Do you have a testcase?
> > >
> >
> > I think for a VLA the TYPE_SIZE_UNIT may be unknown (NULL), or something like "x".
> >
> > for instance something like
> >
> > int test (int x)
> > {
> >   int vla[x];
> >
> >   vla[x-1] = 0;
> >   return vla[x-1];
> > }
>
> My patch changes the CONSTRUCTOR code path.   I couldn't find a CONSTRUCTOR
> testcase with VLA.

Nevertheless it doesn't hurt to check tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)).
There's also int_size_in_bytes (), which returns a signed HOST_WIDE_INT and -1
on "failure"; that would work well in your case.
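
A minimal sketch of that guard, written as a standalone C model rather
than real GCC code.  The names model_type, model_int_size_in_bytes and
worth_loading_from_memory are hypothetical, and the word size and
MOVE_MAX_PIECES value are passed as plain parameters.  Only the -1
"failure" convention and the shape of the heuristic come from this
thread; the can_move_by_pieces check is left out of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tree type: either it has a known
   constant size in bytes, or (like a VLA) it does not.  */
struct model_type
{
  bool size_is_constant;
  long long size_in_bytes;
};

/* Models the convention described for int_size_in_bytes (): a signed
   result, with -1 meaning the size is not a usable constant.  */
static long long
model_int_size_in_bytes (const struct model_type *type)
{
  return type->size_is_constant ? type->size_in_bytes : -1;
}

/* The heuristic from the patch with the size check in front: more
   than two words to copy, and pieces wider than a word available.  */
static bool
worth_loading_from_memory (const struct model_type *type,
                           long long units_per_word,
                           long long move_max_pieces)
{
  assert (units_per_word > 0);
  long long bytes = model_int_size_in_bytes (type);
  if (bytes < 0)
    return false;   /* Variable size: keep the existing path.  */
  return bytes / units_per_word > 2 && move_max_pieces > units_per_word;
}
```

With an 8-byte word and 16-byte pieces, the 64-byte struct S from the
testcase passes the check, while a variable-sized type is rejected
before any size arithmetic happens.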

* Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory
  2021-05-19 13:22     ` H.J. Lu
  2021-05-19 13:27       ` Bernd Edlinger
@ 2021-05-20  7:51       ` Richard Biener
  2021-05-20 14:03         ` [PATCH] constructor: Elide expand_constructor when can move by pieces is true H.J. Lu
  1 sibling, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-20  7:51 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 2:33 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > When expanding a constant constructor, don't call expand_constructor if
> > > it is more efficient to load the data from the memory via move by pieces.
> > >
> > > gcc/
> > >
> > >         PR middle-end/90773
> > >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > >         it is more efficient to load the data from the memory.
> > >
> > > gcc/testsuite/
> > >
> > >         PR middle-end/90773
> > >         * gcc.target/i386/pr90773-24.c: New test.
> > >         * gcc.target/i386/pr90773-25.c: Likewise.
> > > ---
> > >  gcc/expr.c                                 | 10 ++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > >  3 files changed, 52 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > >
> > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > index d09ee42e262..80e01ea1cbe 100644
> > > --- a/gcc/expr.c
> > > +++ b/gcc/expr.c
> > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> > >                 unsigned HOST_WIDE_INT ix;
> > >                 tree field, value;
> > >
> > > +               /* Check if it is more efficient to load the data from
> > > +                  the memory directly.  FIXME: How many stores do we
> > > +                  need here if not moved by pieces?  */
> > > +               unsigned HOST_WIDE_INT bytes
> > > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> >
> > that's prone to fail - it could be a VLA.
>
> What do you mean by fail?  Is it ICE or missed optimization?
> Do you have a testcase?
>
> >
> > > +               if ((bytes / UNITS_PER_WORD) > 2
> > > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > > +                 goto normal_inner_ref;
> > > +
> >
> > It looks like you're concerned about aggregate copies but this also handles
> > non-aggregates (which on GIMPLE might already be optimized of course).
>
> Here I check if we copy more than 2 words and we can move more than
> a word in a single instruction.
>
> > Also you say "if it's cheaper" but I see no cost considerations.  How do
> > we generally handle immed const vs. load from constant pool costs?
>
> This trades 2 (up to 8) stores for one load plus one store.  Is there
> a way to check which one is faster?

I'm not sure - it depends on whether the target can do stores from immediates
at all or what restrictions apply, what the immediate value actually is
(zero or all-ones should be way cheaper than sth arbitrary) and how much
pressure there is on the load unit.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
also does not guarantee it will actually move pieces larger than UNITS_PER_WORD;
that might depend on alignment.  There's by_pieces_ninsns that might provide
some hint here.
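
To illustrate the alignment point, here is a simplified standalone
model of counting the moves a by-pieces copy needs.  It is not GCC's
by_pieces_ninsns: the function name, the byte-based alignment
parameter and the unaligned_ok flag are assumptions made for this
sketch, and real targets decide per machine mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: count the moves needed to copy LEN bytes using
   power-of-two pieces no wider than MAX_PIECE bytes.  A piece wider
   than ALIGN (in bytes) is only used when UNALIGNED_OK is set.  */
static unsigned long
model_by_pieces_ninsns (unsigned long len, unsigned long align,
                        unsigned long max_piece, bool unaligned_ok)
{
  assert (max_piece > 0 && align > 0);
  unsigned long n_insns = 0;

  /* Start from the widest power-of-two piece that fits MAX_PIECE.  */
  unsigned long piece = 1;
  while (piece * 2 <= max_piece)
    piece *= 2;

  for (; piece >= 1; piece /= 2)
    {
      if (piece > align && !unaligned_ok)
        continue;   /* Too wide for the known alignment.  */
      n_insns += len / piece;
      len %= piece;
    }
  return n_insns;
}
```

In this model a 64-byte copy with 16-byte pieces takes 4 moves when
unaligned access is allowed, but 8 word-sized moves when the data is
only known to be 8-byte aligned and wide unaligned moves are not
usable, which is the case where the new check would gain nothing.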

I'm sure it works well for x86.

I wonder if the existing code is in the appropriate place and we shouldn't
instead handle this further up the call chain, where we ask to copy 'exp' into
some other memory location.  For your testcase that's expand_assignment but I can
imagine passing array[0] by value to a function resulting in similar copying.
Testing that shows we get

        pushq   array+56(%rip)
        .cfi_def_cfa_offset 24
        pushq   array+48(%rip)
        .cfi_def_cfa_offset 32
        pushq   array+40(%rip)
        .cfi_def_cfa_offset 40
        pushq   array+32(%rip)
        .cfi_def_cfa_offset 48
        pushq   array+24(%rip)
        .cfi_def_cfa_offset 56
        pushq   array+16(%rip)
        .cfi_def_cfa_offset 64
        pushq   array+8(%rip)
        .cfi_def_cfa_offset 72
        pushq   array(%rip)
        .cfi_def_cfa_offset 80
        call    bar

for that.  We do have the by-pieces infrastructure to generally do this kind of
copying but in both of these cases we do not seem to use it.  I also wonder
if the by-pieces infrastructure can pick up constant initializers automagically
(we could native_encode the initializer part and feed the by-pieces
infrastructure with an array of bytes).  There might, for example, be byte
parts that are easy to immediate-store and difficult ones, and we could decide
on a case-by-case basis whether to load+store or immediate-store them.

For example if I change your testcase to have the array[] initializer
all-zero we currently emit

        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rdi)
        movups  %xmm0, 16(%rdi)
        movups  %xmm0, 32(%rdi)
        movups  %xmm0, 48(%rdi)
        ret

Will your patch cause us to emit 4 loads?  OTOH if I do

const struct S array[] = {
  { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
};

we get

        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movl    $0, 12(%rdi)
        movl    $7241, 16(%rdi)
...

ideally we'd have sth like

    pxor %xmm0, %xmm0
    movups  %xmm0, (%rdi)
    movaps array+16(%rip), %xmm0
    movups %xmm0, 16(%rdi)
...

thus have the zeros written as immediates and the remaining pieces
with load+stores.
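
A rough standalone sketch of that per-piece decision, assuming the
initializer has already been encoded into a byte image and is split
into 16-byte chunks.  The helper names and the fixed chunk size are
hypothetical; nothing here is the actual by-pieces code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_BYTES = 16 };

/* Return true if the CHUNK_BYTES bytes at IMAGE + OFFSET are all
   zero, i.e. the chunk could be stored from a cleared register
   instead of being loaded from the constant in memory.  */
static bool
chunk_is_zero (const unsigned char *image, size_t offset)
{
  static const unsigned char zeros[CHUNK_BYTES];
  return memcmp (image + offset, zeros, CHUNK_BYTES) == 0;
}

/* Count the all-zero chunks in an image of SIZE bytes.  SIZE is
   assumed to be a multiple of CHUNK_BYTES.  */
static size_t
count_zero_chunks (const unsigned char *image, size_t size)
{
  assert (size % CHUNK_BYTES == 0);
  size_t n = 0;
  for (size_t off = 0; off < size; off += CHUNK_BYTES)
    n += chunk_is_zero (image, off);
  return n;
}
```

For the second initializer above, laid out as sixteen 32-bit words (s1
taking two words, plus one word of tail padding), only the first chunk
is all-zero; the other three would be load+store candidates.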

The by-pieces infrastructure eventually gets to see

(mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
0x7ffff7ff5b40 array>) [1 array+0 S64 A256])

where the MEM_EXPR should provide a way to access the constant initializer.

That said, I do agree the current code is a bit of a premature optimization,
but maybe it should be fended off in expand_constructor, which has the cheap
clear_storage first and which already checks can_move_by_pieces with some
heuristics, but that seems to be guarded by

           || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
               && (! can_move_by_pieces
                   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
                    TYPE_ALIGN (type)))
               && ! mostly_zeros_p (exp))))

which is odd (we _can_ move by pieces, but how does this apply to
TREE_CONSTANT CTORs and avoid_temp_mem?).

That said, I wonder if we want to elide expand_constructor when the
CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
and we can_move_by_pieces.

So sth like

diff --git a/gcc/expr.c b/gcc/expr.c
index 7139545d543..76b3bdf0c01 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
               && (! can_move_by_pieces
                   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
                    TYPE_ALIGN (type)))
+              && ! mostly_zeros_p (exp))
+          || (TREE_CONSTANT (exp)
+              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+              && (can_move_by_pieces
+                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
+                   TYPE_ALIGN (type)))
               && ! mostly_zeros_p (exp))))
       || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
          && TREE_CONSTANT (exp)))

which handles your initializer and the all-zero one optimally?
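
As a sanity check on the logic, here is a standalone boolean model of
the two size-related arms of that condition.  The struct and function
names are hypothetical stand-ins for the tree predicates, and the
BLKmode/safe_from_p arm, the TREE_ADDRESSABLE arm and the
EXPAND_INITIALIZER clause are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical flags standing in for the tree predicates.  */
struct ctor_facts
{
  bool tree_static;
  bool tree_constant;
  bool size_fits_uhwi;
  bool can_move_by_pieces;
  bool mostly_zeros;
};

/* Model of the existing "cannot move by pieces" arm plus the new
   "constant and can move by pieces" arm, both under TREE_STATIC.  */
static bool
ctor_condition_holds (const struct ctor_facts *c)
{
  return c->tree_static
         && ((c->size_fits_uhwi
              && !c->can_move_by_pieces
              && !c->mostly_zeros)
             || (c->tree_constant
                 && c->size_fits_uhwi
                 && c->can_move_by_pieces
                 && !c->mostly_zeros));
}
```

In this model the condition holds for a static constant initializer
like the one in the testcase and does not hold for the all-zero one,
so the latter keeps going through the clear_storage path.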

Richard.

* [PATCH] constructor: Elide expand_constructor when can move by pieces is true
  2021-05-20  7:51       ` Richard Biener
@ 2021-05-20 14:03         ` H.J. Lu
  2021-05-21  5:35           ` Bernd Edlinger
  2021-05-21  6:57           ` Richard Biener
  0 siblings, 2 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-20 14:03 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

[-- Attachment #1: Type: text/plain, Size: 7848 bytes --]

On Thu, May 20, 2021 at 12:51 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, May 19, 2021 at 2:33 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > When expanding a constant constructor, don't call expand_constructor if
> > > > it is more efficient to load the data from the memory via move by pieces.
> > > >
> > > > gcc/
> > > >
> > > >         PR middle-end/90773
> > > >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > > >         it is more efficient to load the data from the memory.
> > > >
> > > > gcc/testsuite/
> > > >
> > > >         PR middle-end/90773
> > > >         * gcc.target/i386/pr90773-24.c: New test.
> > > >         * gcc.target/i386/pr90773-25.c: Likewise.
> > > > ---
> > > >  gcc/expr.c                                 | 10 ++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > > >  3 files changed, 52 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > > >
> > > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > > index d09ee42e262..80e01ea1cbe 100644
> > > > --- a/gcc/expr.c
> > > > +++ b/gcc/expr.c
> > > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> > > >                 unsigned HOST_WIDE_INT ix;
> > > >                 tree field, value;
> > > >
> > > > +               /* Check if it is more efficient to load the data from
> > > > +                  the memory directly.  FIXME: How many stores do we
> > > > +                  need here if not moved by pieces?  */
> > > > +               unsigned HOST_WIDE_INT bytes
> > > > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> > >
> > > that's prone to fail - it could be a VLA.
> >
> > What do you mean by fail?  Is it ICE or missed optimization?
> > Do you have a testcase?
> >
> > >
> > > > +               if ((bytes / UNITS_PER_WORD) > 2
> > > > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > > > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > > > +                 goto normal_inner_ref;
> > > > +
> > >
> > > It looks like you're concerned about aggregate copies but this also handles
> > > non-aggregates (which on GIMPLE might already be optimized of course).
> >
> > Here I check if we copy more than 2 words and we can move more than
> > a word in a single instruction.
> >
> > > Also you say "if it's cheaper" but I see no cost considerations.  How do
> > > we generally handle immed const vs. load from constant pool costs?
> >
> > This trades 2 (up to 8) stores for one load plus one store.  Is there
> > a way to check which one is faster?
>
> I'm not sure - it depends on whether the target can do stores from immediates
> at all or what restrictions apply, what the immediate value actually is
> (zero or all-ones should be way cheaper than sth arbitrary) and how much
> pressure there is on the load unit.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
> also does not guarantee it will actually move pieces larger than UNITS_PER_WORD;
> that might depend on alignment.  There's by_pieces_ninsns that might provide
> some hint here.
>
> I'm sure it works well for x86.
>
> I wonder if the existing code is in the appropriate place and we shouldn't
> instead handle this further up the call chain, where we ask to copy 'exp' into
> some other memory location.  For your testcase that's expand_assignment but I can
> imagine passing array[0] by value to a function resulting in similar copying.
> Testing that shows we get
>
>         pushq   array+56(%rip)
>         .cfi_def_cfa_offset 24
>         pushq   array+48(%rip)
>         .cfi_def_cfa_offset 32
>         pushq   array+40(%rip)
>         .cfi_def_cfa_offset 40
>         pushq   array+32(%rip)
>         .cfi_def_cfa_offset 48
>         pushq   array+24(%rip)
>         .cfi_def_cfa_offset 56
>         pushq   array+16(%rip)
>         .cfi_def_cfa_offset 64
>         pushq   array+8(%rip)
>         .cfi_def_cfa_offset 72
>         pushq   array(%rip)
>         .cfi_def_cfa_offset 80
>         call    bar
>
> for that.  We do have the by-pieces infrastructure to generally do this kind of
> copying but in both of these cases we do not seem to use it.  I also wonder
> if the by-pieces infrastructure can pick up constant initializers automagically
> (we could native_encode the initializer part and feed the by-pieces
> infrastructure with an array of bytes).  There might, for example, be byte
> parts that are easy to immediate-store and difficult ones, and we could decide
> on a case-by-case basis whether to load+store or immediate-store them.

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704

> For example if I change your testcase to have the array[] initializer
> all-zero we currently emit
>
>         pxor    %xmm0, %xmm0
>         movups  %xmm0, (%rdi)
>         movups  %xmm0, 16(%rdi)
>         movups  %xmm0, 32(%rdi)
>         movups  %xmm0, 48(%rdi)
>         ret
>
> Will your patch cause us to emit 4 loads?  OTOH if I do
>
> const struct S array[] = {
>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> };
>
> we get
>
>         movq    $0, (%rdi)
>         movl    $0, 8(%rdi)
>         movl    $0, 12(%rdi)
>         movl    $7241, 16(%rdi)
> ...
>
> ideally we'd have sth like
>
>     pxor %xmm0, %xmm0
>     movups  %xmm0, (%rdi)
>     movaps array+16(%rip), %xmm0
>     movups %xmm0, 16(%rdi)
> ...
>
> thus have the zeros written as immediates and the remaining pieces
> with load+stores.
>
> The by-pieces infrastructure eventually gets to see
>
> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
>
> where the MEM_EXPR should provide a way to access the constant initializer.
>
> That said, I do agree the current code is a bit of a premature optimization,
> but maybe it should be fended off in expand_constructor, which has the cheap
> clear_storage first and which already checks can_move_by_pieces with some
> heuristics, but that seems to be guarded by
>
>            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>                && (! can_move_by_pieces
>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>                     TYPE_ALIGN (type)))
>                && ! mostly_zeros_p (exp))))
>
> which is odd (we _can_ move by pieces, but how does this apply to
> TREE_CONSTANT CTORs and avoid_temp_mem?).
>
> That said, I wonder if we want to elide expand_constructor when the
> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
> and we can_move_by_pieces.
>
> So sth like
>
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 7139545d543..76b3bdf0c01 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>                && (! can_move_by_pieces
>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>                     TYPE_ALIGN (type)))
> +              && ! mostly_zeros_p (exp))
> +          || (TREE_CONSTANT (exp)
> +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> +              && (can_move_by_pieces
> +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> +                   TYPE_ALIGN (type)))
>                && ! mostly_zeros_p (exp))))
>        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
>           && TREE_CONSTANT (exp)))
>
> which handles your initializer and the all-zero one optimally?
>

It works.  Here is the updated patch.

Thanks.

-- 
H.J.

[-- Attachment #2: 0001-constructor-Elide-expand_constructor-when-can-move-b.patch --]
[-- Type: text/x-patch, Size: 4763 bytes --]

From 12989cce4d4c801e505dae96eb8fa36507382aa8 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sun, 25 Apr 2021 13:56:32 -0700
Subject: [PATCH] constructor: Elide expand_constructor when can move by pieces
 is true

Elide expand_constructor when

1. The constructor is TREE_STATIC && TREE_CONSTANT.  And
2. mostly_zeros_p returns false.  And
3. can_move_by_pieces returns true.

2021-XX-XX  Richard Biener  <rguenther@suse.de>
	    H.J. Lu  <hjl.tools@gmail.com>

gcc/

	PR middle-end/90773
	* expr.c (expand_constructor): Elide expand_constructor when can
	move by pieces is true.

gcc/testsuite/

	PR middle-end/90773
	* gcc.target/i386/pr90773-24.c: New test.
	* gcc.target/i386/pr90773-25.c: Likewise.
	* gcc.target/i386/pr90773-26.c: Likewise.
---
 gcc/expr.c                                 |  6 ++++++
 gcc/testsuite/gcc.target/i386/pr90773-24.c | 23 ++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-25.c | 21 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-26.c | 25 ++++++++++++++++++++++
 4 files changed, 75 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-26.c

diff --git a/gcc/expr.c b/gcc/expr.c
index d09ee42e262..068e429a296 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
 	       && (! can_move_by_pieces
 		   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
 		    TYPE_ALIGN (type)))
+	       && ! mostly_zeros_p (exp))
+	   || (TREE_CONSTANT (exp)
+	       && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+	       && (can_move_by_pieces
+		   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
+		    TYPE_ALIGN (type)))
 	       && ! mostly_zeros_p (exp))))
       || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
 	  && TREE_CONSTANT (exp)))
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
new file mode 100644
index 00000000000..71f1fd8c4df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
new file mode 100644
index 00000000000..b2513c3a9c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-26.c b/gcc/testsuite/gcc.target/i386/pr90773-26.c
new file mode 100644
index 00000000000..ad19a88c883
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-26.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+
+/* { dg-final { scan-assembler-not "movdqa" } } */
+/* { dg-final { scan-assembler-times "pxor\[\\t \]%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
-- 
2.31.1



* [PATCH] Add 3 target hooks for memset
  2021-05-19 12:55     ` H.J. Lu
@ 2021-05-20 20:49       ` H.J. Lu
  2021-05-21  5:42         ` Bernd Edlinger
  2021-05-25 14:34         ` Richard Biener
  0 siblings, 2 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-20 20:49 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

[-- Attachment #1: Type: text/plain, Size: 11478 bytes --]

On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 2:25 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > > target instructions to duplicate QImode value to TImode/OImode/XImode
> > > value for memset.
> > >
> > >         PR middle-end/90773
> > >         * builtins.c (builtin_memset_read_str): Call
> > >         targetm.read_memset_value.
> > >         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> > >         * target.def (read_memset_value): New hook.
> > >         (gen_memset_value): Likewise.
> > >         * targhooks.c: Include "builtins.h".
> > >         (default_read_memset_value): New function.
> > >         (default_gen_memset_value): Likewise.
> > >         * targhooks.h (default_read_memset_value): New prototype.
> > >         (default_gen_memset_value): Likewise.
> > >         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> > >         TARGET_GEN_MEMSET_VALUE hooks.
> > >         * doc/tm.texi: Regenerated.
> > > ---
> > >  gcc/builtins.c     | 47 ++++----------------------------------
> > >  gcc/doc/tm.texi    | 16 +++++++++++++
> > >  gcc/doc/tm.texi.in |  4 ++++
> > >  gcc/target.def     | 20 +++++++++++++++++
> > >  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  gcc/targhooks.h    |  4 ++++
> > >  6 files changed, 104 insertions(+), 43 deletions(-)
> > >
> > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > index e1b284846b1..f78a36478ef 100644
> > > --- a/gcc/builtins.c
> > > +++ b/gcc/builtins.c
> > > @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> > >     previous iteration.  */
> > >
> > >  rtx
> > > -builtin_memset_read_str (void *data, void *prevp,
> > > +builtin_memset_read_str (void *data, void *prev,
> > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                          scalar_int_mode mode)
> > >  {
> > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > -  if (prev != nullptr && prev->data != nullptr)
> > > -    {
> > > -      /* Use the previous data in the same mode.  */
> > > -      if (prev->mode == mode)
> > > -       return prev->data;
> > > -    }
> > > -
> > > -  const char *c = (const char *) data;
> > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > -
> > > -  memset (p, *c, GET_MODE_SIZE (mode));
> > > -
> > > -  return c_readstr (p, mode);
> > > +  return targetm.read_memset_value ((const char *) data, prev, mode);
> > >  }
> > >
> > >  /* Callback routine for store_by_pieces.  Return the RTL of a register
> > > @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> > >     nullptr, it has the RTL info from the previous iteration.  */
> > >
> > >  static rtx
> > > -builtin_memset_gen_str (void *data, void *prevp,
> > > +builtin_memset_gen_str (void *data, void *prev,
> > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                         scalar_int_mode mode)
> > >  {
> > > -  rtx target, coeff;
> > > -  size_t size;
> > > -  char *p;
> > > -
> > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > -  if (prev != nullptr && prev->data != nullptr)
> > > -    {
> > > -      /* Use the previous data in the same mode.  */
> > > -      if (prev->mode == mode)
> > > -       return prev->data;
> > > -
> > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > -      if (target != nullptr)
> > > -       return target;
> > > -    }
> > > -
> > > -  size = GET_MODE_SIZE (mode);
> > > -  if (size == 1)
> > > -    return (rtx) data;
> > > -
> > > -  p = XALLOCAVEC (char, size);
> > > -  memset (p, 1, size);
> > > -  coeff = c_readstr (p, mode);
> > > -
> > > -  target = convert_to_mode (mode, (rtx) data, 1);
> > > -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > > -  return force_reg (mode, target);
> > > +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> > >  }
> > >
> > >  /* Expand expression EXP, which is a call to the memset builtin.  Return
> > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > index 85ea9395560..51385044e76 100644
> > > --- a/gcc/doc/tm.texi
> > > +++ b/gcc/doc/tm.texi
> > > @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > >  @end deftypefn
> > >
> > > +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> > > +This function returns the RTL of a constant integer corresponding to
> > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> > > +the RTL information from the previous interation.
> > > +@end deftypefn
> > > +
> > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> > > +This function returns the RTL of a register containing
> > > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> > > +char value given in the RTL register @var{data}.  For example, if
> > > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> > > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> > > +the previous iteration.
> > > +@end deftypefn
> > > +
> > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > >  This target hook returns a new value for the number of times @var{loop}
> > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > index d8e3de14af1..8d4c3949fbf 100644
> > > --- a/gcc/doc/tm.texi.in
> > > +++ b/gcc/doc/tm.texi.in
> > > @@ -7956,6 +7956,10 @@ lists.
> > >
> > >  @hook TARGET_GEN_CCMP_NEXT
> > >
> > > +@hook TARGET_READ_MEMSET_VALUE
> > > +
> > > +@hook TARGET_GEN_MEMSET_VALUE
> > > +
> > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > >
> > >  @defmac POWI_MAX_MULTS
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index bbaf6b4f3a0..c9aca40fa88 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2694,6 +2694,26 @@ DEFHOOK
> > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > >   NULL)
> > >
> > > +DEFHOOK
> > > +(read_memset_value,
> > > + "This function returns the RTL of a constant integer corresponding to\n\
> > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> >
> > where is 'str' defined?  I can't really tell what's the difference
>
> Fixed with
>
> diff --git a/gcc/target.def b/gcc/target.def
> index c9aca40fa88..4c3a5fcc634 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2699,8 +2699,8 @@ DEFHOOK
>   "This function returns the RTL of a constant integer corresponding to\n\
>  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
>  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> -the RTL information from the previous interation.",
> - rtx, (const char *c, void *prev, scalar_int_mode mode),
> +the RTL information from the previous iteration.",
> + rtx, (const char *str, void *prev, scalar_int_mode mode),
>   default_read_memset_value)
>
>  DEFHOOK
>
> > from read_memset_value
> > and gen_memset_value.
>
> The difference is that input of read_memset_value is a string constant
> like "123" and input of gen_memset_value is an RTL register.
>
> > Somehow I feel that an optab for the "splat" operation similar
> > to vec_duplicate might be a better way to expose this - of course
> > that doesn't handle the "prev" thing.
>
> The x86 backend has ix86_expand_vector_init_duplicate () to
> broadcast QImode to TImode/OImode/XImode:
>
> /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
>    with all elements equal to VAR.  Return true if successful.  */
>
> bool
> ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
>                                    rtx target, rtx val)
>
> > So how's this the right point of abstraction to the target?
>
> I can add 2 target hooks, one for scratch register and one for
> broadcasting QImode to TImode/OImode/XImode.   Then I can
> move the x86 code to the middle-end.
>

Here is the patch to add 3 target hooks:

 -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
          scalar_int_mode MODE)
     This function returns the RTL of a constant integer corresponding
     to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
     constant C.

 -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
          MODE)
     This function returns the RTL of a register containing
     'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
     value given in the RTL register DATA.  For example, if MODE is 4
     bytes wide, return the RTL for 0x01010101*DATA.

 -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
          scalar_int_mode MODE)
     This function returns the RTL of a register in MODE generated from
     PREV in the previous iteration.

with

/* Return the RTL of a register in MODE generated from PREV in the
   previous iteration.  */

static rtx
gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
{
  by_pieces_prev *prev = (by_pieces_prev *) prevp;
  rtx value;
  if (prev != nullptr && prev->data != nullptr)
    {
      /* Use the previous data in the same mode.  */
      if (prev->mode == mode)
        return prev->data;

      value = targetm.gen_memset_value_from_prev (prevp, mode);
    }
  else
    value = nullptr;
  return value;
}

/* Callback routine for store_by_pieces.  Read GET_MODE_SIZE (MODE)
   bytes from constant string DATA + OFFSET and return it as target
   constant.  If PREV isn't nullptr, it has the RTL info from the
   previous iteration.  */

rtx
builtin_memset_read_str (void *data, void *prev,
                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
                         scalar_int_mode mode)
{
  const char *str = (const char *) data;

  /* Don't use the previous value if size is 1.  */
  if (GET_MODE_SIZE (mode) == 1)
    return default_read_memset_value (str, mode);

  rtx value = gen_memset_value_from_prev (prev, mode);
  if (value)
    return value;

  return targetm.read_memset_value (str, mode);
}

/* Callback routine for store_by_pieces.  Return the RTL of a register
   containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
   char value given in the RTL register data.  For example, if mode is
   4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
   nullptr, it has the RTL info from the previous iteration.  */

static rtx
builtin_memset_gen_str (void *datap, void *prev,
                        HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
                        scalar_int_mode mode)
{
  rtx data = (rtx) datap;

  /* Don't use the previous value if size is 1.  */
  if (GET_MODE_SIZE (mode) == 1)
    return data;

  rtx value = gen_memset_value_from_prev (prev, mode);
  if (value)
    return value;

  return targetm.gen_memset_value (data, mode);
}
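
The "0x01010101*DATA" wording in the hook description can be illustrated
on plain integers.  This is a minimal sketch, not GCC code: the function
names `splat_by_multiply` and `splat_by_memset` are made up for the
example, and 64-bit integers stand in for RTL values.  The first function
follows the multiply-by-coefficient idea; the second fills a buffer with
memset and reads it back, similar in spirit to the memset plus c_readstr
sequence in the generic code quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Broadcast BYTE into the low SIZE bytes of a 64-bit value by
   multiplying with a 0x0101...01 coefficient, the same idea as
   0x01010101*DATA for a 4-byte mode.  SIZE must be between 1 and 8.  */
uint64_t
splat_by_multiply (uint8_t byte, size_t size)
{
  assert (size >= 1 && size <= 8);
  uint64_t coeff = 0;
  for (size_t i = 0; i < size; i++)
    coeff = (coeff << 8) | 1;
  return (uint64_t) byte * coeff;
}

/* Same result computed by filling a buffer with memset and reading it
   back.  Every byte is equal, so byte order does not matter here.  */
uint64_t
splat_by_memset (uint8_t byte, size_t size)
{
  assert (size >= 1 && size <= 8);
  unsigned char buf[8] = { 0 };
  memset (buf, byte, size);
  uint64_t value = 0;
  for (size_t i = 0; i < size; i++)
    value |= (uint64_t) buf[i] << (8 * i);
  return value;
}
```

Both functions agree for every size from 1 to 8 because each per-byte
product stays below 256, so the multiplication never carries into the
next byte.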


-- 
H.J.

[-- Attachment #2: 0001-Add-3-target-hooks-for-memset.patch --]
[-- Type: application/x-patch, Size: 10326 bytes --]


* Re: [PATCH] constructor: Elide expand_constructor when can move by pieces is true
  2021-05-20 14:03         ` [PATCH] constructor: Elide expand_constructor when can move by pieces is true H.J. Lu
@ 2021-05-21  5:35           ` Bernd Edlinger
  2021-05-21  6:57           ` Richard Biener
  1 sibling, 0 replies; 52+ messages in thread
From: Bernd Edlinger @ 2021-05-21  5:35 UTC (permalink / raw)
  To: H.J. Lu, Richard Biener; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak

On 5/20/21 4:03 PM, H.J. Lu wrote:
> On Thu, May 20, 2021 at 12:51 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>>
>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>
>>>>> When expanding a constant constructor, don't call expand_constructor if
>>>>> it is more efficient to load the data from the memory via move by pieces.
>>>>>
>>>>> gcc/
>>>>>
>>>>>         PR middle-end/90773
>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>>>>>         it is more efficient to load the data from the memory.
>>>>>
>>>>> gcc/testsuite/
>>>>>
>>>>>         PR middle-end/90773
>>>>>         * gcc.target/i386/pr90773-24.c: New test.
>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
>>>>> ---
>>>>>  gcc/expr.c                                 | 10 ++++++++++
>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>>>>>  3 files changed, 52 insertions(+)
>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>>>>>
>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>>>> index d09ee42e262..80e01ea1cbe 100644
>>>>> --- a/gcc/expr.c
>>>>> +++ b/gcc/expr.c
>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>>>>                 unsigned HOST_WIDE_INT ix;
>>>>>                 tree field, value;
>>>>>
>>>>> +               /* Check if it is more efficient to load the data from
>>>>> +                  the memory directly.  FIXME: How many stores do we
>>>>> +                  need here if not moved by pieces?  */
>>>>> +               unsigned HOST_WIDE_INT bytes
>>>>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>>>
>>>> that's prone to fail - it could be a VLA.
>>>
>>> What do you mean by fail?  Is it ICE or missed optimization?
>>> Do you have a testcase?
>>>
>>>>
>>>>> +               if ((bytes / UNITS_PER_WORD) > 2
>>>>> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
>>>>> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
>>>>> +                 goto normal_inner_ref;
>>>>> +
>>>>
>>>> It looks like you're concerned about aggregate copies but this also handles
>>>> non-aggregates (which on GIMPLE might already be optimized of course).
>>>
>>> Here I check if we copy more than 2 words and we can move more than
>>> a word in a single instruction.
>>>
>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
>>>> we generally handle immed const vs. load from constant pool costs?
>>>
>>> This trades 2 (up to 8) stores for one load plus one store.  Is there
>>> a way to check which one is faster?
>>
>> I'm not sure - it depends on whether the target can do stores from immediates
>> at all or what restrictions apply, what the immediate value actually is
>> (zero or all-ones should be way cheaper than sth arbitrary) and how the
>> pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
>> also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
>> that might depend on alignment.  There's by_pieces_ninsns that might provide
>> some hint here.
>>
>> I'm sure it works well for x86.
>>
>> I wonder if the existing code is in the appropriate place and we
>> shouldn't instead
>> handle this somewhere upthread where we ask to copy 'exp' into some other
>> memory location.  For your testcase that's expand_assignment but I can
>> imagine passing array[0] by value to a function resulting in similar copying.
>> Testing that shows we get
>>
>>         pushq   array+56(%rip)
>>         .cfi_def_cfa_offset 24
>>         pushq   array+48(%rip)
>>         .cfi_def_cfa_offset 32
>>         pushq   array+40(%rip)
>>         .cfi_def_cfa_offset 40
>>         pushq   array+32(%rip)
>>         .cfi_def_cfa_offset 48
>>         pushq   array+24(%rip)
>>         .cfi_def_cfa_offset 56
>>         pushq   array+16(%rip)
>>         .cfi_def_cfa_offset 64
>>         pushq   array+8(%rip)
>>         .cfi_def_cfa_offset 72
>>         pushq   array(%rip)
>>         .cfi_def_cfa_offset 80
>>         call    bar
>>
>> for that.  We do have the by-pieces infrastructure to generally do this kind of
>> copying but in both of these cases we do not seem to use it.  I also wonder
>> if the by-pieces infrastructure can pick up constant initializers automagically
>> (we could native_encode the initializer part and feed the by-pieces
>> infrastructure with an array of bytes).  There for example might be easy to
>> immediate-store byte parts and difficult ones where we could decide on a
>> case-by-case basis whether to load+store or immediate-store them.
> 
> I opened:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
> 
>> For example if I change your testcase to have the array[] initializer
>> all-zero we currently emit
>>
>>         pxor    %xmm0, %xmm0
>>         movups  %xmm0, (%rdi)
>>         movups  %xmm0, 16(%rdi)
>>         movups  %xmm0, 32(%rdi)
>>         movups  %xmm0, 48(%rdi)
>>         ret
>>
>> will your patch cause us to emit 4 loads?  OTOH if I do
>>
>> const struct S array[] = {
>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>> };
>>
>> we get
>>
>>         movq    $0, (%rdi)
>>         movl    $0, 8(%rdi)
>>         movl    $0, 12(%rdi)
>>         movl    $7241, 16(%rdi)
>> ...
>>
>> ideally we'd have sth like
>>
>>     pxor %xmm0, %xmm0
>>     movups  %xmm0, (%rdi)
>>     movaps array+16(%rip), %xmm0
>>     movups %xmm0, 16(%rdi)
>> ...
>>
>> thus have the zeros written as immediates and the remaining pieces
>> with load+stores.
>>
>> The by-pieces infrastructure eventually gets to see
>>
>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
>>
>> where the MEM_EXPR should provide a way to access the constant initializer.
>>
>> That said I do agree the current code is a bit premature optimization
>> - but maybe
>> it should be fended off in expand_constructor which has the cheap clear_storage
>> first and which already does check can_move_by_pieces with some heuristics,
>> but that seems to be guarded by
>>
>>            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>                && (! can_move_by_pieces
>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>                     TYPE_ALIGN (type)))
>>                && ! mostly_zeros_p (exp))))
>>
>> which is odd (we _can_ move by pieces, but how does this apply to
>> TREE_CONSTANT CTORs and avoid_temp_mem?).
>>
>> That said, I wonder if we want to elide expand_constructor when the
>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
>> and we can_move_by_pieces.
>>
>> So sth like
>>
>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index 7139545d543..76b3bdf0c01 100644
>> --- a/gcc/expr.c
>> +++ b/gcc/expr.c
>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
>> expand_modifier modifier,
>>                && (! can_move_by_pieces
>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>                     TYPE_ALIGN (type)))
>> +              && ! mostly_zeros_p (exp))
>> +          || (TREE_CONSTANT (exp)
>> +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>> +              && (can_move_by_pieces
>> +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>> +                   TYPE_ALIGN (type)))

Just a minor nit: superfluous parentheses around can_move_by_pieces here.


Bernd.

>>                && ! mostly_zeros_p (exp))))
>>        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
>>           && TREE_CONSTANT (exp)))
>>
>> which handles your initializer and the all-zero one optimal?
>>
> 
> It works.  Here is the updated patch.
> 
> Thanks.
> 


* Re: [PATCH] Add 3 target hooks for memset
  2021-05-20 20:49       ` [PATCH] Add 3 target hooks for memset H.J. Lu
@ 2021-05-21  5:42         ` Bernd Edlinger
  2021-05-21 11:53           ` H.J. Lu
  2021-05-25 14:34         ` Richard Biener
  1 sibling, 1 reply; 52+ messages in thread
From: Bernd Edlinger @ 2021-05-21  5:42 UTC (permalink / raw)
  To: H.J. Lu, Richard Biener; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak

On 5/20/21 10:49 PM, H.J. Lu wrote:
> On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Wed, May 19, 2021 at 2:25 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
>>>> target instructions to duplicate QImode value to TImode/OImode/XImode
>>>> value for memset.
>>>>
>>>>         PR middle-end/90773
>>>>         * builtins.c (builtin_memset_read_str): Call
>>>>         targetm.read_memset_value.
>>>>         (builtin_memset_gen_str): Call targetm.gen_memset_value.
>>>>         * target.def (read_memset_value): New hook.
>>>>         (gen_memset_value): Likewise.
>>>>         * targhooks.c: Include "builtins.h".
>>>>         (default_read_memset_value): New function.
>>>>         (default_gen_memset_value): Likewise.
>>>>         * targhooks.h (default_read_memset_value): New prototype.
>>>>         (default_gen_memset_value): Likewise.
>>>>         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
>>>>         TARGET_GEN_MEMSET_VALUE hooks.
>>>>         * doc/tm.texi: Regenerated.
>>>> ---
>>>>  gcc/builtins.c     | 47 ++++----------------------------------
>>>>  gcc/doc/tm.texi    | 16 +++++++++++++
>>>>  gcc/doc/tm.texi.in |  4 ++++
>>>>  gcc/target.def     | 20 +++++++++++++++++
>>>>  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  gcc/targhooks.h    |  4 ++++
>>>>  6 files changed, 104 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/gcc/builtins.c b/gcc/builtins.c
>>>> index e1b284846b1..f78a36478ef 100644
>>>> --- a/gcc/builtins.c
>>>> +++ b/gcc/builtins.c
>>>> @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
>>>>     previous iteration.  */
>>>>
>>>>  rtx
>>>> -builtin_memset_read_str (void *data, void *prevp,
>>>> +builtin_memset_read_str (void *data, void *prev,
>>>>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>>>>                          scalar_int_mode mode)
>>>>  {
>>>> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
>>>> -  if (prev != nullptr && prev->data != nullptr)
>>>> -    {
>>>> -      /* Use the previous data in the same mode.  */
>>>> -      if (prev->mode == mode)
>>>> -       return prev->data;
>>>> -    }
>>>> -
>>>> -  const char *c = (const char *) data;
>>>> -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
>>>> -
>>>> -  memset (p, *c, GET_MODE_SIZE (mode));
>>>> -
>>>> -  return c_readstr (p, mode);
>>>> +  return targetm.read_memset_value ((const char *) data, prev, mode);
>>>>  }
>>>>
>>>>  /* Callback routine for store_by_pieces.  Return the RTL of a register
>>>> @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
>>>>     nullptr, it has the RTL info from the previous iteration.  */
>>>>
>>>>  static rtx
>>>> -builtin_memset_gen_str (void *data, void *prevp,
>>>> +builtin_memset_gen_str (void *data, void *prev,
>>>>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>>>>                         scalar_int_mode mode)
>>>>  {
>>>> -  rtx target, coeff;
>>>> -  size_t size;
>>>> -  char *p;
>>>> -
>>>> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
>>>> -  if (prev != nullptr && prev->data != nullptr)
>>>> -    {
>>>> -      /* Use the previous data in the same mode.  */
>>>> -      if (prev->mode == mode)
>>>> -       return prev->data;
>>>> -
>>>> -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
>>>> -      if (target != nullptr)
>>>> -       return target;
>>>> -    }
>>>> -
>>>> -  size = GET_MODE_SIZE (mode);
>>>> -  if (size == 1)
>>>> -    return (rtx) data;
>>>> -
>>>> -  p = XALLOCAVEC (char, size);
>>>> -  memset (p, 1, size);
>>>> -  coeff = c_readstr (p, mode);
>>>> -
>>>> -  target = convert_to_mode (mode, (rtx) data, 1);
>>>> -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
>>>> -  return force_reg (mode, target);
>>>> +  return targetm.gen_memset_value ((rtx) data, prev, mode);
>>>>  }
>>>>
>>>>  /* Expand expression EXP, which is a call to the memset builtin.  Return
>>>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
>>>> index 85ea9395560..51385044e76 100644
>>>> --- a/gcc/doc/tm.texi
>>>> +++ b/gcc/doc/tm.texi
>>>> @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
>>>>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
>>>>  @end deftypefn
>>>>
>>>> +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
>>>> +This function returns the RTL of a constant integer corresponding to
>>>> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
>>>> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
>>>> +the RTL information from the previous interation.
>>>> +@end deftypefn
>>>> +
>>>> +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
>>>> +This function returns the RTL of a register containing
>>>> +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
>>>> +char value given in the RTL register @var{data}.  For example, if
>>>> +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
>>>> +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
>>>> +the previous iteration.
>>>> +@end deftypefn
>>>> +
>>>>  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
>>>>  This target hook returns a new value for the number of times @var{loop}
>>>>  should be unrolled. The parameter @var{nunroll} is the number of times
>>>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
>>>> index d8e3de14af1..8d4c3949fbf 100644
>>>> --- a/gcc/doc/tm.texi.in
>>>> +++ b/gcc/doc/tm.texi.in
>>>> @@ -7956,6 +7956,10 @@ lists.
>>>>
>>>>  @hook TARGET_GEN_CCMP_NEXT
>>>>
>>>> +@hook TARGET_READ_MEMSET_VALUE
>>>> +
>>>> +@hook TARGET_GEN_MEMSET_VALUE
>>>> +
>>>>  @hook TARGET_LOOP_UNROLL_ADJUST
>>>>
>>>>  @defmac POWI_MAX_MULTS
>>>> diff --git a/gcc/target.def b/gcc/target.def
>>>> index bbaf6b4f3a0..c9aca40fa88 100644
>>>> --- a/gcc/target.def
>>>> +++ b/gcc/target.def
>>>> @@ -2694,6 +2694,26 @@ DEFHOOK
>>>>   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
>>>>   NULL)
>>>>
>>>> +DEFHOOK
>>>> +(read_memset_value,
>>>> + "This function returns the RTL of a constant integer corresponding to\n\
>>>> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
>>>> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
>>>
>>> where is 'str' defined?  I can't really tell what's the difference
>>
>> Fixed with
>>
>> diff --git a/gcc/target.def b/gcc/target.def
>> index c9aca40fa88..4c3a5fcc634 100644
>> --- a/gcc/target.def
>> +++ b/gcc/target.def
>> @@ -2699,8 +2699,8 @@ DEFHOOK
>>   "This function returns the RTL of a constant integer corresponding to\n\
>>  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
>>  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
>> -the RTL information from the previous interation.",
>> - rtx, (const char *c, void *prev, scalar_int_mode mode),
>> +the RTL information from the previous iteration.",
>> + rtx, (const char *str, void *prev, scalar_int_mode mode),
>>   default_read_memset_value)
>>
>>  DEFHOOK
>>
>>> from read_memset_value
>>> and gen_memset_value.
>>
>> The difference is that input of read_memset_value is a string constant
>> like "123" and input of gen_memset_value is an RTL register.
>>
>>> Somehow I feel that an optab for the "splat" operation similar
>>> to vec_duplicate might be a better way to expose this - of course
>>> that doesn't handle the "prev" thing.
>>
>> The x86 backend has ix86_expand_vector_init_duplicate () to
>> broadcast QImode to TImode/OImode/XImode:
>>
>> /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
>>    with all elements equal to VAR.  Return true if successful.  */
>>
>> bool
>> ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
>>                                    rtx target, rtx val)
>>
>>> So how's this the right point of abstraction to the target?
>>
>> I can add 2 target hooks, one for scratch register and one for
>> broadcasting QImode to TImode/OImode/XImode.   Then I can
>> move x86 codes to the middle-end.
>>
> 
> Here is the patch to add 3 target hooks:
> 
>  -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
>           scalar_int_mode MODE)
>      This function returns the RTL of a constant integer corresponding
>      to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
>      constant C.
> 
>  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
>           MODE)
>      This function returns the RTL of a register containing
>      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
>      value given in the RTL register DATA.  For example, if MODE is 4
>      bytes wide, return the RTL for 0x01010101*DATA.
> 
>  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
>           scalar_int_mode MODE)
>      This function returns the RTL of a register in MODE generated from
>      PREV in the previous iteration.
> 
> with
> 
> /* Return the RTL of a register in MODE generated from PREV in the
>    previous iteration.  */
> 
> static rtx
> gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> {
>   by_pieces_prev *prev = (by_pieces_prev *) prevp;
>   rtx value;
>   if (prev != nullptr && prev->data != nullptr)
>     {
>       /* Use the previous data in the same mode.  */
>       if (prev->mode == mode)
>         return prev->data;
> 
>       value = targetm.gen_memset_value_from_prev (prevp, mode);
>     }
>   else
>     value = nullptr;
>   return value;
> }
> 
> /* Callback routine for store_by_pieces.  Read GET_MODE_SIZE (MODE)
>    bytes from constant string DATA + OFFSET and return it as target
>    constant.  If PREV isn't nullptr, it has the RTL info from the
>    previous iteration.  */
> 
> rtx
> builtin_memset_read_str (void *data, void *prev,
>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                          scalar_int_mode mode)
> {
>   const char *str = (const char *) data;
> 
>   /* Don't use the previous value if size is 1.  */
>   if (GET_MODE_SIZE (mode) == 1)
>     return default_read_memset_value (str, mode);
> 
>   rtx value = gen_memset_value_from_prev (prev, mode);
>   if (value)
>     return value;
> 
>   return targetm.read_memset_value (str, mode);
> }
> 
> /* Callback routine for store_by_pieces.  Return the RTL of a register
>    containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
>    char value given in the RTL register data.  For example, if mode is
>    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
>    nullptr, it has the RTL info from the previous iteration.  */
> 
> static rtx
> builtin_memset_gen_str (void *datap, void *prev,
>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                         scalar_int_mode mode)
> {
>   rtx data = (rtx) datap;
> 
>   /* Don't use the previous value if size is 1.  */
>   if (GET_MODE_SIZE (mode) == 1)
>     return data;
> 
>   rtx value = gen_memset_value_from_prev (prev, mode);
>   if (value)
>     return value;
> 
>   return targetm.gen_memset_value (data, mode);
> }
> 
> +/* Default implementation of TARGET_GEN_MEMSET_VALUE.  */
> +
> +rtx
> +default_gen_memset_value (rtx data, scalar_int_mode mode)
> +{
> +  rtx target, coeff;
> +  size_t size;
> +  char *p;
> +
> +  size = GET_MODE_SIZE (mode);
> +  if (size == 1)
> +    return data;
> +
> +  p = XALLOCAVEC (char, size);
> +  memset (p, 1, size);
> +  coeff = c_readstr (p, mode);
> +
> +  target = convert_to_mode (mode, data, 1);
> +  target = expand_mult (mode, target, coeff, NULL_RTX, 1);


Note this formula does not work for data = -1, for instance,
since 0x01010101U * -1 = 0xFEFEFEFFU,
but memset (str, -1, n) sets each word to 0xFFFFFFFFU, right?

So are we sure that the value of "data" is always in the range [0..255]?
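
To make the arithmetic concrete, here is a minimal standalone sketch in
plain C.  It is not GCC code: splat_mul and splat_memset are names made
up for this illustration, and a 4-byte mode is assumed.  It compares the
multiply-based broadcast with what memset stores:

```c
#include <stdint.h>
#include <string.h>

/* Broadcast by multiplication, as in the 0x01010101 * data formula.
   Only correct when DATA is already in the range [0..255].  */
uint32_t
splat_mul (uint32_t data)
{
  return UINT32_C (0x01010101) * data;
}

/* What memset does: the fill value is converted to unsigned char
   before it is stored, so only its low 8 bits matter.  */
uint32_t
splat_memset (int c)
{
  uint32_t word;
  memset (&word, c, sizeof word);
  return word;
}
```

With -1 as input, splat_mul gives 0xFEFEFEFF while splat_memset gives
0xFFFFFFFF.  Truncating the value to unsigned char before the multiply
makes the two agree, which is why the range of "data" matters.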


Bernd.

* Re: [PATCH] constructor: Elide expand_constructor when can move by pieces is true
  2021-05-20 14:03         ` [PATCH] constructor: Elide expand_constructor when can move by pieces is true H.J. Lu
  2021-05-21  5:35           ` Bernd Edlinger
@ 2021-05-21  6:57           ` Richard Biener
  2021-05-21  7:30             ` Bernd Edlinger
  2021-05-21 13:09             ` [PATCH] Elide expand_constructor if move by pieces is preferred H.J. Lu
  1 sibling, 2 replies; 52+ messages in thread
From: Richard Biener @ 2021-05-21  6:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, May 20, 2021 at 12:51 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 2:33 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > > When expanding a constant constructor, don't call expand_constructor if
> > > > > it is more efficient to load the data from the memory via move by pieces.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR middle-end/90773
> > > > >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > > > >         it is more efficient to load the data from the memory.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >         PR middle-end/90773
> > > > >         * gcc.target/i386/pr90773-24.c: New test.
> > > > >         * gcc.target/i386/pr90773-25.c: Likewise.
> > > > > ---
> > > > >  gcc/expr.c                                 | 10 ++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > > > >  3 files changed, 52 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > > > >
> > > > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > > > index d09ee42e262..80e01ea1cbe 100644
> > > > > --- a/gcc/expr.c
> > > > > +++ b/gcc/expr.c
> > > > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> > > > >                 unsigned HOST_WIDE_INT ix;
> > > > >                 tree field, value;
> > > > >
> > > > > +               /* Check if it is more efficient to load the data from
> > > > > +                  the memory directly.  FIXME: How many stores do we
> > > > > +                  need here if not moved by pieces?  */
> > > > > +               unsigned HOST_WIDE_INT bytes
> > > > > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> > > >
> > > > that's prone to fail - it could be a VLA.
> > >
> > > What do you mean by fail?  Is it ICE or missed optimization?
> > > Do you have a testcase?
> > >
> > > >
> > > > > +               if ((bytes / UNITS_PER_WORD) > 2
> > > > > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > > > > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > > > > +                 goto normal_inner_ref;
> > > > > +
> > > >
> > > > It looks like you're concerned about aggregate copies but this also handles
> > > > non-aggregates (which on GIMPLE might already be optimized of course).
> > >
> > > Here I check if we copy more than 2 words and we can move more than
> > > a word in a single instruction.
> > >
> > > > Also you say "if it's cheaper" but I see no cost considerations.  How do
> > > > we generally handle immed const vs. load from constant pool costs?
> > >
> > > This trades 2 (up to 8) stores with one load plus one store.  Is there
> > > a way to check which one is faster?
> >
> > I'm not sure - it depends on whether the target can do stores from immediates
> > at all or what restrictions apply, what the immediate value actually is
> > (zero or all-ones should be way cheaper than sth arbitrary) and how the
> > pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
> > also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
> > that might depend on alignment.  There's by_pieces_ninsns that might provide
> > some hint here.
> >
> > I'm sure it works well for x86.
> >
> > I wonder if the existing code is in the appropriate place and we
> > shouldn't instead
> > handle this somewhere upthread where we ask to copy 'exp' into some other
> > memory location.  For your testcase that's expand_assignment but I can
> > imagine passing array[0] by value to a function resulting in similar copying.
> > Testing that shows we get
> >
> >         pushq   array+56(%rip)
> >         .cfi_def_cfa_offset 24
> >         pushq   array+48(%rip)
> >         .cfi_def_cfa_offset 32
> >         pushq   array+40(%rip)
> >         .cfi_def_cfa_offset 40
> >         pushq   array+32(%rip)
> >         .cfi_def_cfa_offset 48
> >         pushq   array+24(%rip)
> >         .cfi_def_cfa_offset 56
> >         pushq   array+16(%rip)
> >         .cfi_def_cfa_offset 64
> >         pushq   array+8(%rip)
> >         .cfi_def_cfa_offset 72
> >         pushq   array(%rip)
> >         .cfi_def_cfa_offset 80
> >         call    bar
> >
> > for that.  We do have the by-pieces infrastructure to generally do this kind of
> > copying but in both of these cases we do not seem to use it.  I also wonder
> > if the by-pieces infrastructure can pick up constant initializers automagically
> > (we could native_encode the initializer part and feed the by-pieces
> > infrastructure with an array of bytes).  There for example might be easy to
> > immediate-store byte parts and difficult ones where we could decide on a
> > case-by-case basis whether to load+store or immediate-store them.
>
> I opened:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
>
> > For example if I change your testcase to have the array[] initializer
> > all-zero we currently emit
> >
> >         pxor    %xmm0, %xmm0
> >         movups  %xmm0, (%rdi)
> >         movups  %xmm0, 16(%rdi)
> >         movups  %xmm0, 32(%rdi)
> >         movups  %xmm0, 48(%rdi)
> >         ret
> >
> > will your patch cause us to emit 4 loads?  OTOH if I do
> >
> > const struct S array[] = {
> >   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > };
> >
> > we get
> >
> >         movq    $0, (%rdi)
> >         movl    $0, 8(%rdi)
> >         movl    $0, 12(%rdi)
> >         movl    $7241, 16(%rdi)
> > ...
> >
> > ideally we'd have sth like
> >
> >     pxor %xmm0, %xmm0
> >     movups  %xmm0, (%rdi)
> >     movaps array+16(%rip), %xmm0
> >     movups %xmm0, 16(%rdi)
> > ...
> >
> > thus have the zeros written as immediates and the remaining pieces
> > with load+stores.
> >
> > The by-pieces infrastructure eventually gets to see
> >
> > (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
> > 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
> >
> > where the MEM_EXPR should provide a way to access the constant initializer.
> >
> > That said I do agree the current code is a bit premature optimization
> > - but maybe
> > it should be fend off in expand_constructor which has the cheap clear_storage
> > first and which already does check can_move_by_pieces with some heuristics,
> > but that seems to be guarded by
> >
> >            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> >                && (! can_move_by_pieces
> >                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >                     TYPE_ALIGN (type)))
> >                && ! mostly_zeros_p (exp))))
> >
> > which is odd (we _can_ move by pieces, but how does this apply to
> > TREE_CONSTANT CTORs and avoid_temp_mem?).
> >
> > That said, I wonder if we want to elide expand_constructor when the
> > CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
> > and we can_move_by_pieces.
> >
> > So sth like
> >
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index 7139545d543..76b3bdf0c01 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
> > expand_modifier modifier,
> >                && (! can_move_by_pieces
> >                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >                     TYPE_ALIGN (type)))
> > +              && ! mostly_zeros_p (exp))
> > +          || (TREE_CONSTANT (exp)
> > +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > +              && (can_move_by_pieces
> > +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > +                   TYPE_ALIGN (type)))
> >                && ! mostly_zeros_p (exp))))
> >        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
> >           && TREE_CONSTANT (exp)))
> >
> > which handles your initializer and the all-zero one optimal?
> >
>
> It works.  Here is the updated patch.

So just looking at the code again I think we probably want to add
&& avoid_temp_mem here, at least that's the case we're looking
at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
and !avoid_temp_mem but if so we'd create a temporary here
which of course would be pointless.

So maybe it's then clearer to split the condition out as

diff --git a/gcc/expr.c b/gcc/expr.c
index 7139545d543..ee8f25f9abd 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum
expand_modifier modifier,
       return constructor;
     }

+  /* If the CTOR is available in static storage and not mostly
+     zeros and we can move it by pieces prefer to do so since
+     that's usually more efficient than performing a series of
+     stores from immediates.  */
+  if (avoid_temp_mem
+      && TREE_STATIC (exp)
+      && TREE_CONSTANT (exp)
+      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
+                            TYPE_ALIGN (type))
+      && ! mostly_zeros_p (exp))
+    return NULL_RTX;
+
   /* Handle calls that pass values in multiple non-contiguous
      locations.  The Irix 6 ABI has examples of this.  */
   if (target == 0 || ! safe_from_p (target, exp, 1)


OK with that change.
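
For readers without the GCC tree at hand, the new condition can be
modelled outside the compiler roughly as follows.  This is only a
sketch under stated assumptions: struct ctor_summary, its fields,
mostly_zeros and prefer_copy_from_static are all hypothetical names
invented here, the 3/4 zero threshold is an assumed stand-in for
mostly_zeros_p, and a simple size limit stands in for
can_move_by_pieces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a constructor.  None of these fields exist
   in GCC under these names.  */
struct ctor_summary
{
  bool avoid_temp_mem;        /* caller does not want a temporary */
  bool is_static;             /* stands in for TREE_STATIC */
  bool is_constant;           /* stands in for TREE_CONSTANT */
  bool size_is_known;         /* stands in for tree_fits_uhwi_p on the size */
  size_t zero_elts;           /* number of zero-valued elements */
  size_t total_elts;          /* total number of elements */
  size_t bytes;               /* object size in bytes */
  size_t max_by_pieces_bytes; /* assumed limit for a by-pieces move */
};

/* Assumed heuristic: at least three quarters of the elements are zero.  */
static bool
mostly_zeros (const struct ctor_summary *c)
{
  return c->zero_elts * 4 >= c->total_elts * 3;
}

/* Return true when copying from static storage by pieces is preferred
   over a series of stores from immediates.  */
bool
prefer_copy_from_static (const struct ctor_summary *c)
{
  return (c->avoid_temp_mem
          && c->is_static
          && c->is_constant
          && c->size_is_known
          && c->bytes <= c->max_by_pieces_bytes
          && !mostly_zeros (c));
}
```

In this model the 14-element initializer quoted above (5 zeros) is
copied from static storage, while the all-zero variant still goes
through the cheap clearing path.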

Thanks,
Richard.

> Thanks.
>
> --
> H.J.

* Re: [PATCH] constructor: Elide expand_constructor when can move by pieces is true
  2021-05-21  6:57           ` Richard Biener
@ 2021-05-21  7:30             ` Bernd Edlinger
  2021-05-21 13:13               ` H.J. Lu
  2021-05-21 13:09             ` [PATCH] Elide expand_constructor if move by pieces is preferred H.J. Lu
  1 sibling, 1 reply; 52+ messages in thread
From: Bernd Edlinger @ 2021-05-21  7:30 UTC (permalink / raw)
  To: Richard Biener, H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak



On 5/21/21 8:57 AM, Richard Biener wrote:
> On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, May 20, 2021 at 12:51 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>>
>>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>
>>>>>> When expanding a constant constructor, don't call expand_constructor if
>>>>>> it is more efficient to load the data from the memory via move by pieces.
>>>>>>
>>>>>> gcc/
>>>>>>
>>>>>>         PR middle-end/90773
>>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>>>>>>         it is more efficient to load the data from the memory.
>>>>>>
>>>>>> gcc/testsuite/
>>>>>>
>>>>>>         PR middle-end/90773
>>>>>>         * gcc.target/i386/pr90773-24.c: New test.
>>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
>>>>>> ---
>>>>>>  gcc/expr.c                                 | 10 ++++++++++
>>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>>>>>>  3 files changed, 52 insertions(+)
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>>>>>>
>>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>>>>> index d09ee42e262..80e01ea1cbe 100644
>>>>>> --- a/gcc/expr.c
>>>>>> +++ b/gcc/expr.c
>>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>>>>>                 unsigned HOST_WIDE_INT ix;
>>>>>>                 tree field, value;
>>>>>>
>>>>>> +               /* Check if it is more efficient to load the data from
>>>>>> +                  the memory directly.  FIXME: How many stores do we
>>>>>> +                  need here if not moved by pieces?  */
>>>>>> +               unsigned HOST_WIDE_INT bytes
>>>>>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>>>>
>>>>> that's prone to fail - it could be a VLA.
>>>>
>>>> What do you mean by fail?  Is it ICE or missed optimization?
>>>> Do you have a testcase?
>>>>
>>>>>
>>>>>> +               if ((bytes / UNITS_PER_WORD) > 2
>>>>>> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
>>>>>> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
>>>>>> +                 goto normal_inner_ref;
>>>>>> +
>>>>>
>>>>> It looks like you're concerned about aggregate copies but this also handles
>>>>> non-aggregates (which on GIMPLE might already be optimized of course).
>>>>
>>>> Here I check if we copy more than 2 words and we can move more than
>>>> a word in a single instruction.
>>>>
>>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
>>>>> we generally handle immed const vs. load from constant pool costs?
>>>>
>>>> This trades 2 (up to 8) stores with one load plus one store.  Is there
>>>> a way to check which one is faster?
>>>
>>> I'm not sure - it depends on whether the target can do stores from immediates
>>> at all or what restrictions apply, what the immediate value actually is
>>> (zero or all-ones should be way cheaper than sth arbitrary) and how the
>>> pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
>>> also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
>>> that might depend on alignment.  There's by_pieces_ninsns that might provide
>>> some hint here.
>>>
>>> I'm sure it works well for x86.
>>>
>>> I wonder if the existing code is in the appropriate place and we
>>> shouldn't instead
>>> handle this somewhere upthread where we ask to copy 'exp' into some other
>>> memory location.  For your testcase that's expand_assignment but I can
>>> imagine passing array[0] by value to a function resulting in similar copying.
>>> Testing that shows we get
>>>
>>>         pushq   array+56(%rip)
>>>         .cfi_def_cfa_offset 24
>>>         pushq   array+48(%rip)
>>>         .cfi_def_cfa_offset 32
>>>         pushq   array+40(%rip)
>>>         .cfi_def_cfa_offset 40
>>>         pushq   array+32(%rip)
>>>         .cfi_def_cfa_offset 48
>>>         pushq   array+24(%rip)
>>>         .cfi_def_cfa_offset 56
>>>         pushq   array+16(%rip)
>>>         .cfi_def_cfa_offset 64
>>>         pushq   array+8(%rip)
>>>         .cfi_def_cfa_offset 72
>>>         pushq   array(%rip)
>>>         .cfi_def_cfa_offset 80
>>>         call    bar
>>>
>>> for that.  We do have the by-pieces infrastructure to generally do this kind of
>>> copying but in both of these cases we do not seem to use it.  I also wonder
>>> if the by-pieces infrastructure can pick up constant initializers automagically
>>> (we could native_encode the initializer part and feed the by-pieces
>>> infrastructure with an array of bytes).  There for example might be easy to
>>> immediate-store byte parts and difficult ones where we could decide on a
>>> case-by-case basis whether to load+store or immediate-store them.
>>
>> I opened:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
>>
>>> For example if I change your testcase to have the array[] initializer
>>> all-zero we currently emit
>>>
>>>         pxor    %xmm0, %xmm0
>>>         movups  %xmm0, (%rdi)
>>>         movups  %xmm0, 16(%rdi)
>>>         movups  %xmm0, 32(%rdi)
>>>         movups  %xmm0, 48(%rdi)
>>>         ret
>>>
>>> will your patch cause us to emit 4 loads?  OTOH if I do
>>>
>>> const struct S array[] = {
>>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>>> };
>>>
>>> we get
>>>
>>>         movq    $0, (%rdi)
>>>         movl    $0, 8(%rdi)
>>>         movl    $0, 12(%rdi)
>>>         movl    $7241, 16(%rdi)
>>> ...
>>>
>>> ideally we'd have sth like
>>>
>>>     pxor %xmm0, %xmm0
>>>     movups  %xmm0, (%rdi)
>>>     movaps array+16(%rip), %xmm0
>>>     movups %xmm0, 16(%rdi)
>>> ...
>>>
>>> thus have the zeros written as immediates and the remaining pieces
>>> with load+stores.
>>>
>>> The by-pieces infrastructure eventually gets to see
>>>
>>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
>>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
>>>
>>> where the MEM_EXPR should provide a way to access the constant initializer.
>>>
>>> That said I do agree the current code is a bit premature optimization
>>> - but maybe
>>> it should be fend off in expand_constructor which has the cheap clear_storage
>>> first and which already does check can_move_by_pieces with some heuristics,
>>> but that seems to be guarded by
>>>
>>>            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>>                && (! can_move_by_pieces
>>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>>                     TYPE_ALIGN (type)))
>>>                && ! mostly_zeros_p (exp))))
>>>
>>> which is odd (we _can_ move by pieces, but how does this apply to
>>> TREE_CONSTANT CTORs and avoid_temp_mem?).
>>>
>>> That said, I wonder if we want to elide expand_constructor when the
>>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
>>> and we can_move_by_pieces.
>>>
>>> So sth like
>>>
>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>> index 7139545d543..76b3bdf0c01 100644
>>> --- a/gcc/expr.c
>>> +++ b/gcc/expr.c
>>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
>>> expand_modifier modifier,
>>>                && (! can_move_by_pieces
>>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>>                     TYPE_ALIGN (type)))
>>> +              && ! mostly_zeros_p (exp))
>>> +          || (TREE_CONSTANT (exp)
>>> +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>> +              && (can_move_by_pieces
>>> +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>> +                   TYPE_ALIGN (type)))
>>>                && ! mostly_zeros_p (exp))))
>>>        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
>>>           && TREE_CONSTANT (exp)))
>>>
>>> which handles your initializer and the all-zero one optimal?
>>>
>>
>> It works.  Here is the updated patch.
> 
> So just looking at the code again I think we probably want to add
> && avoid_temp_mem here, at least that's the case we're looking
> at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
> and !avoid_temp_mem but if so we'd create a temporary here
> which of course would be pointless.
> 
> So maybe it's then clearer to split the condition out as
> 
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 7139545d543..ee8f25f9abd 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum
> expand_modifier modifier,
>        return constructor;
>      }
> 
> +  /* If the CTOR is available in static storage and not mostly
> +     zeros and we can move it by pieces prefer to do so since
> +     that's usually more efficient than performing a series of
> +     stores from immediates.  */
> +  if (avoid_temp_mem
> +      && TREE_STATIC (exp)
> +      && TREE_CONSTANT (exp)
> +      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> +      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> +                            TYPE_ALIGN (type))
> +      && ! mostly_zeros_p (exp))
> +    return NULL_RTX;
> +
>    /* Handle calls that pass values in multiple non-contiguous
>       locations.  The Irix 6 ABI has examples of this.  */
>    if (target == 0 || ! safe_from_p (target, exp, 1)
> 
> 
> OK with that change.
> 

Note however (I've been playing with the previous version)
that the test case 

FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1

fails for --target_board=unix

$ grep movdqu pr90773-25.s
	vmovdqu	%xmm0, (%rdi)
	vmovdqu	%xmm1, 16(%rdi)
	vmovdqu	%xmm2, 32(%rdi)
	vmovdqu	%xmm3, 48(%rdi)

while the test expects %ymm
/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */

and

FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times pxor[\\\\t ]%xmm[0-9]+, %xmm[0-9]+ 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1

fails for --target_board=unix/-m32


Bernd.

> Thanks,
> Richard.
> 
>> Thanks.
>>
>> --
>> H.J.

* Re: [PATCH] Add 3 target hooks for memset
  2021-05-21  5:42         ` Bernd Edlinger
@ 2021-05-21 11:53           ` H.J. Lu
  0 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-21 11:53 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Richard Biener, GCC Patches, Richard Sandiford, Uros Bizjak

On Thu, May 20, 2021 at 10:42 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 5/20/21 10:49 PM, H.J. Lu wrote:
> > On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Wed, May 19, 2021 at 2:25 AM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>>
> >>>> Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> >>>> target instructions to duplicate QImode value to TImode/OImode/XImode
> >>>> value for memmset.
> >>>>
> >>>>         PR middle-end/90773
> >>>>         * builtins.c (builtin_memset_read_str): Call
> >>>>         targetm.read_memset_value.
> >>>>         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> >>>>         * target.def (read_memset_value): New hook.
> >>>>         (gen_memset_value): Likewise.
> >>>>         * targhooks.c: Inclue "builtins.h".
> >>>>         (default_read_memset_value): New function.
> >>>>         (default_gen_memset_value): Likewise.
> >>>>         * targhooks.h (default_read_memset_value): New prototype.
> >>>>         (default_gen_memset_value): Likewise.
> >>>>         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> >>>>         TARGET_GEN_MEMSET_VALUE hooks.
> >>>>         * doc/tm.texi: Regenerated.
> >>>> ---
> >>>>  gcc/builtins.c     | 47 ++++----------------------------------
> >>>>  gcc/doc/tm.texi    | 16 +++++++++++++
> >>>>  gcc/doc/tm.texi.in |  4 ++++
> >>>>  gcc/target.def     | 20 +++++++++++++++++
> >>>>  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  gcc/targhooks.h    |  4 ++++
> >>>>  6 files changed, 104 insertions(+), 43 deletions(-)
> >>>>
> >>>> diff --git a/gcc/builtins.c b/gcc/builtins.c
> >>>> index e1b284846b1..f78a36478ef 100644
> >>>> --- a/gcc/builtins.c
> >>>> +++ b/gcc/builtins.c
> >>>> @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> >>>>     previous iteration.  */
> >>>>
> >>>>  rtx
> >>>> -builtin_memset_read_str (void *data, void *prevp,
> >>>> +builtin_memset_read_str (void *data, void *prev,
> >>>>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >>>>                          scalar_int_mode mode)
> >>>>  {
> >>>> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> >>>> -  if (prev != nullptr && prev->data != nullptr)
> >>>> -    {
> >>>> -      /* Use the previous data in the same mode.  */
> >>>> -      if (prev->mode == mode)
> >>>> -       return prev->data;
> >>>> -    }
> >>>> -
> >>>> -  const char *c = (const char *) data;
> >>>> -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> >>>> -
> >>>> -  memset (p, *c, GET_MODE_SIZE (mode));
> >>>> -
> >>>> -  return c_readstr (p, mode);
> >>>> +  return targetm.read_memset_value ((const char *) data, prev, mode);
> >>>>  }
> >>>>
> >>>>  /* Callback routine for store_by_pieces.  Return the RTL of a register
> >>>> @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> >>>>     nullptr, it has the RTL info from the previous iteration.  */
> >>>>
> >>>>  static rtx
> >>>> -builtin_memset_gen_str (void *data, void *prevp,
> >>>> +builtin_memset_gen_str (void *data, void *prev,
> >>>>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >>>>                         scalar_int_mode mode)
> >>>>  {
> >>>> -  rtx target, coeff;
> >>>> -  size_t size;
> >>>> -  char *p;
> >>>> -
> >>>> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> >>>> -  if (prev != nullptr && prev->data != nullptr)
> >>>> -    {
> >>>> -      /* Use the previous data in the same mode.  */
> >>>> -      if (prev->mode == mode)
> >>>> -       return prev->data;
> >>>> -
> >>>> -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> >>>> -      if (target != nullptr)
> >>>> -       return target;
> >>>> -    }
> >>>> -
> >>>> -  size = GET_MODE_SIZE (mode);
> >>>> -  if (size == 1)
> >>>> -    return (rtx) data;
> >>>> -
> >>>> -  p = XALLOCAVEC (char, size);
> >>>> -  memset (p, 1, size);
> >>>> -  coeff = c_readstr (p, mode);
> >>>> -
> >>>> -  target = convert_to_mode (mode, (rtx) data, 1);
> >>>> -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> >>>> -  return force_reg (mode, target);
> >>>> +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> >>>>  }
> >>>>
> >>>>  /* Expand expression EXP, which is a call to the memset builtin.  Return
> >>>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> >>>> index 85ea9395560..51385044e76 100644
> >>>> --- a/gcc/doc/tm.texi
> >>>> +++ b/gcc/doc/tm.texi
> >>>> @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> >>>>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> >>>>  @end deftypefn
> >>>>
> >>>> +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> >>>> +This function returns the RTL of a constant integer corresponding to
> >>>> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> >>>> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> >>>> +the RTL information from the previous interation.
> >>>> +@end deftypefn
> >>>> +
> >>>> +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> >>>> +This function returns the RTL of a register containing
> >>>> +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> >>>> +char value given in the RTL register @var{data}.  For example, if
> >>>> +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> >>>> +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> >>>> +the previous iteration.
> >>>> +@end deftypefn
> >>>> +
> >>>>  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> >>>>  This target hook returns a new value for the number of times @var{loop}
> >>>>  should be unrolled. The parameter @var{nunroll} is the number of times
> >>>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> >>>> index d8e3de14af1..8d4c3949fbf 100644
> >>>> --- a/gcc/doc/tm.texi.in
> >>>> +++ b/gcc/doc/tm.texi.in
> >>>> @@ -7956,6 +7956,10 @@ lists.
> >>>>
> >>>>  @hook TARGET_GEN_CCMP_NEXT
> >>>>
> >>>> +@hook TARGET_READ_MEMSET_VALUE
> >>>> +
> >>>> +@hook TARGET_GEN_MEMSET_VALUE
> >>>> +
> >>>>  @hook TARGET_LOOP_UNROLL_ADJUST
> >>>>
> >>>>  @defmac POWI_MAX_MULTS
> >>>> diff --git a/gcc/target.def b/gcc/target.def
> >>>> index bbaf6b4f3a0..c9aca40fa88 100644
> >>>> --- a/gcc/target.def
> >>>> +++ b/gcc/target.def
> >>>> @@ -2694,6 +2694,26 @@ DEFHOOK
> >>>>   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> >>>>   NULL)
> >>>>
> >>>> +DEFHOOK
> >>>> +(read_memset_value,
> >>>> + "This function returns the RTL of a constant integer corresponding to\n\
> >>>> +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> >>>> +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> >>>
> >>> where is 'str' defined?  I can't really tell what's the difference
> >>
> >> Fixed with
> >>
> >> diff --git a/gcc/target.def b/gcc/target.def
> >> index c9aca40fa88..4c3a5fcc634 100644
> >> --- a/gcc/target.def
> >> +++ b/gcc/target.def
> >> @@ -2699,8 +2699,8 @@ DEFHOOK
> >>   "This function returns the RTL of a constant integer corresponding to\n\
> >>  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
> >>  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> >> -the RTL information from the previous interation.",
> >> - rtx, (const char *c, void *prev, scalar_int_mode mode),
> >> +the RTL information from the previous iteration.",
> >> + rtx, (const char *str, void *prev, scalar_int_mode mode),
> >>   default_read_memset_value)
> >>
> >>  DEFHOOK
> >>
> >>> from read_memset_value
> >>> and gen_memset_value.
> >>
> >> The difference is that input of read_memset_value is a string constant
> >> like "123" and input of gen_memset_value is an RTL register.
> >>
> >>> Somehow I feel that an optab for the "splat" operation similar
> >>> to vec_duplicate might be a better way to expose this - of course
> >>> that doesn't handle the "prev" thing.
> >>
> >> The x86 backend has ix86_expand_vector_init_duplicate () to
> >> broadcast QImode to TImode/OImode/XImode:
> >>
> >> /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
> >>    with all elements equal to VAR.  Return true if successful.  */
> >>
> >> bool
> >> ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> >>                                    rtx target, rtx val)
> >>
> >>> So how's this the right point of abstraction to the target?
> >>
> >> I can add 2 target hooks, one for scratch register and one for
> >> broadcasting QImode to TImode/OImode/XImode.   Then I can
> >> move x86 codes to the middle-end.
> >>
> >
> > Here is the patch to add 3 target hooks:
> >
> >  -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
> >           scalar_int_mode MODE)
> >      This function returns the RTL of a constant integer corresponding
> >      to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
> >      constant C.
> >
> >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> >           MODE)
> >      This function returns the RTL of a register containing
> >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> >      value given in the RTL register DATA.  For example, if MODE is 4
> >      bytes wide, return the RTL for 0x01010101*DATA.
> >
> >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
> >           scalar_int_mode MODE)
> >      This function returns the RTL of a register in MODE generated from
> >      PREV in the previous iteration.
> >
> > with
> >
> > /* Return the RTL of a register in MODE generated from PREV in the
> >    previous iteration.  */
> >
> > static rtx
> > gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > {
> >   by_pieces_prev *prev = (by_pieces_prev *) prevp;
> >   rtx value;
> >   if (prev != nullptr && prev->data != nullptr)
> >     {
> >       /* Use the previous data in the same mode.  */
> >       if (prev->mode == mode)
> >         return prev->data;
> >
> >       value = targetm.gen_memset_value_from_prev (prevp, mode);
> >     }
> >   else
> >     value = nullptr;
> >   return value;
> > }
> >
> > /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> >    bytes from constant string DATA + OFFSET and return it as target
> >    constant.  If PREV isn't nullptr, it has the RTL info from the
> >    previous iteration.  */
> >
> > rtx
> > builtin_memset_read_str (void *data, void *prev,
> >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                          scalar_int_mode mode)
> > {
> >   const char *str = (const char *) data;
> >
> >   /* Don't use the previous value if size is 1.  */
> >   if (GET_MODE_SIZE (mode) == 1)
> >     return default_read_memset_value (str, mode);
> >
> >   rtx value = gen_memset_value_from_prev (prev, mode);
> >   if (value)
> >     return value;
> >
> >   return targetm.read_memset_value (str, mode);
> > }
> >
> > /* Callback routine for store_by_pieces.  Return the RTL of a register
> >    containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
> >    char value given in the RTL register data.  For example, if mode is
> >    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
> >    nullptr, it has the RTL info from the previous iteration.  */
> >
> > static rtx
> > builtin_memset_gen_str (void *datap, void *prev,
> >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                         scalar_int_mode mode)
> > {
> >   rtx data = (rtx) datap;
> >
> >   /* Don't use the previous value if size is 1.  */
> >   if (GET_MODE_SIZE (mode) == 1)
> >     return data;
> >
> >   rtx value = gen_memset_value_from_prev (prev, mode);
> >   if (value)
> >     return value;
> >
> >   return targetm.gen_memset_value (data, mode);
> > }
> > +/* Default implementation of TARGET_GEN_MEMSET_VALUE.  */
> > +
> > +rtx
> > +default_gen_memset_value (rtx data, scalar_int_mode mode)
> > +{
> > +  rtx target, coeff;
> > +  size_t size;
> > +  char *p;
> > +
> > +  size = GET_MODE_SIZE (mode);
> > +  if (size == 1)
> > +    return data;
> > +
> > +  p = XALLOCAVEC (char, size);
> > +  memset (p, 1, size);
> > +  coeff = c_readstr (p, mode);
> > +
> > +  target = convert_to_mode (mode, data, 1);
> > +  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
>
>
> Note this formula does not work for data = -1 for instance,
> since 0x01010101U * -1 = 0xFEFEFEFFU
> but memset(str, -1, n) sets it to 0xFFFFFFFFU, right?
>
> So are we sure that the value of "data" is always in the range [0..255] ?

I didn't change the formula.  I just turned the current formula into
a target hook.  For memset,

DESCRIPTION
       The  memset()  function  fills  the  first  n  bytes of the memory area
       pointed to by s with the constant byte c.

"data", aka the constant byte c, is a byte in the range [0..255].

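To make the range argument concrete, here is a minimal sketch in plain C
rather than RTL.  The names splat_byte, splat_unnarrowed, splat_with_memset
and check_splat are made up for this illustration and are not GCC functions.
It assumes the value is narrowed to unsigned char before the multiplication,
which is what the C standard specifies for the int argument of memset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate the low byte of C into all four bytes of a 32-bit word.
   This mirrors the 0x01010101 * data formula, applied after the
   conversion to unsigned char that memset performs on its argument.  */
uint32_t
splat_byte (int c)
{
  return UINT32_C (0x01010101) * (uint32_t) (unsigned char) c;
}

/* The same multiplication without the narrowing step.  This is the
   case raised in the question: c == -1 gives 0xFEFEFEFF.  */
uint32_t
splat_unnarrowed (int c)
{
  return UINT32_C (0x01010101) * (uint32_t) c;
}

/* Build the reference value with memset itself.  */
uint32_t
splat_with_memset (int c)
{
  uint32_t word;
  memset (&word, c, sizeof word);
  return word;
}

/* Check the narrowed formula against memset for one value of C.  */
void
check_splat (int c)
{
  assert (splat_byte (c) == splat_with_memset (c));
}
```

Calling check_splat (-1) passes because both sides see the byte 0xFF, while
splat_unnarrowed (-1) shows what would go wrong if a value outside
[0..255] ever reached the multiplication.
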
-- 
H.J.


* [PATCH] Elide expand_constructor if move by pieces is preferred
  2021-05-21  6:57           ` Richard Biener
  2021-05-21  7:30             ` Bernd Edlinger
@ 2021-05-21 13:09             ` H.J. Lu
  1 sibling, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-21 13:09 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

[-- Attachment #1: Type: text/plain, Size: 10247 bytes --]

On Thu, May 20, 2021 at 11:57 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, May 20, 2021 at 12:51 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Wed, May 19, 2021 at 2:33 AM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >
> > > > > > When expanding a constant constructor, don't call expand_constructor if
> > > > > > it is more efficient to load the data from the memory via move by pieces.
> > > > > >
> > > > > > gcc/
> > > > > >
> > > > > >         PR middle-end/90773
> > > > > >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > > > > >         it is more efficient to load the data from the memory.
> > > > > >
> > > > > > gcc/testsuite/
> > > > > >
> > > > > >         PR middle-end/90773
> > > > > >         * gcc.target/i386/pr90773-24.c: New test.
> > > > > >         * gcc.target/i386/pr90773-25.c: Likewise.
> > > > > > ---
> > > > > >  gcc/expr.c                                 | 10 ++++++++++
> > > > > >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > > > > >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > > > > >  3 files changed, 52 insertions(+)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > > > > >
> > > > > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > > > > index d09ee42e262..80e01ea1cbe 100644
> > > > > > --- a/gcc/expr.c
> > > > > > +++ b/gcc/expr.c
> > > > > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> > > > > >                 unsigned HOST_WIDE_INT ix;
> > > > > >                 tree field, value;
> > > > > >
> > > > > > +               /* Check if it is more efficient to load the data from
> > > > > > +                  the memory directly.  FIXME: How many stores do we
> > > > > > +                  need here if not moved by pieces?  */
> > > > > > +               unsigned HOST_WIDE_INT bytes
> > > > > > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> > > > >
> > > > > that's prone to fail - it could be a VLA.
> > > >
> > > > What do you mean by fail?  Is it ICE or missed optimization?
> > > > Do you have a testcase?
> > > >
> > > > >
> > > > > > +               if ((bytes / UNITS_PER_WORD) > 2
> > > > > > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > > > > > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > > > > > +                 goto normal_inner_ref;
> > > > > > +
> > > > >
> > > > > It looks like you're concerned about aggregate copies but this also handles
> > > > > non-aggregates (which on GIMPLE might already be optimized of course).
> > > >
> > > > Here I check if we copy more than 2 words and we can move more than
> > > > a word in a single instruction.
> > > >
> > > > > Also you say "if it's cheaper" but I see no cost considerations.  How do
> > > > > we generally handle immed const vs. load from constant pool costs?
> > > >
> > > > This trades 2 (up to 8) stores with one load plus one store.  Is there
> > > > a way to check which one is faster?
> > >
> > > I'm not sure - it depends on whether the target can do stores from immediates
> > > at all or what restrictions apply, what the immediate value actually is
> > > (zero or all-ones should be way cheaper than sth arbitrary) and how the
> > > pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
> > > also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
> > > that might depend on alignment.  There's by_pieces_ninsns that might provide
> > > some hint here.
> > >
> > > I'm sure it works well for x86.
> > >
> > > I wonder if the existing code is in the appropriate place and we
> > > shouldn't instead
> > > handle this somewhere upthread where we ask to copy 'exp' into some other
> > > memory location.  For your testcase that's expand_assignment but I can
> > > imagine passing array[0] by value to a function resulting in similar copying.
> > > Testing that shows we get
> > >
> > >         pushq   array+56(%rip)
> > >         .cfi_def_cfa_offset 24
> > >         pushq   array+48(%rip)
> > >         .cfi_def_cfa_offset 32
> > >         pushq   array+40(%rip)
> > >         .cfi_def_cfa_offset 40
> > >         pushq   array+32(%rip)
> > >         .cfi_def_cfa_offset 48
> > >         pushq   array+24(%rip)
> > >         .cfi_def_cfa_offset 56
> > >         pushq   array+16(%rip)
> > >         .cfi_def_cfa_offset 64
> > >         pushq   array+8(%rip)
> > >         .cfi_def_cfa_offset 72
> > >         pushq   array(%rip)
> > >         .cfi_def_cfa_offset 80
> > >         call    bar
> > >
> > > for that.  We do have the by-pieces infrastructure to generally do this kind of
> > > copying but in both of these cases we do not seem to use it.  I also wonder
> > > if the by-pieces infrastructure can pick up constant initializers automagically
> > > (we could native_encode the initializer part and feed the by-pieces
> > > infrastructure with an array of bytes).  There for example might be easy to
> > > immediate-store byte parts and difficult ones where we could decide on a
> > > case-by-case basis whether to load+store or immediate-store them.
> >
> > I opened:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
> >
> > > For example if I change your testcase to have the array[] initializer
> > > all-zero we currently emit
> > >
> > >         pxor    %xmm0, %xmm0
> > >         movups  %xmm0, (%rdi)
> > >         movups  %xmm0, 16(%rdi)
> > >         movups  %xmm0, 32(%rdi)
> > >         movups  %xmm0, 48(%rdi)
> > >         ret
> > >
> > > will your patch cause us to emit 4 loads?  OTOH if I do
> > >
> > > const struct S array[] = {
> > >   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > > };
> > >
> > > we get
> > >
> > >         movq    $0, (%rdi)
> > >         movl    $0, 8(%rdi)
> > >         movl    $0, 12(%rdi)
> > >         movl    $7241, 16(%rdi)
> > > ...
> > >
> > > ideally we'd have sth like
> > >
> > >     pxor %xmm0, %xmm0
> > >     movups  %xmm0, (%rdi)
> > >     movaps array+16(%rip), %xmm0
> > >     movups %xmm0, 16(%rdi)
> > > ...
> > >
> > > thus have the zeros written as immediates and the remaining pieces
> > > with load+stores.
> > >
> > > The by-pieces infrastructure eventually gets to see
> > >
> > > (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
> > > 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
> > >
> > > where the MEM_EXPR should provide a way to access the constant initializer.
> > >
> > > That said I do agree the current code is a bit premature optimization
> > > - but maybe
> > > it should be fend off in expand_constructor which has the cheap clear_storage
> > > first and which already does check can_move_by_pieces with some heuristics,
> > > but that seems to be guarded by
> > >
> > >            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > >                && (! can_move_by_pieces
> > >                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > >                     TYPE_ALIGN (type)))
> > >                && ! mostly_zeros_p (exp))))
> > >
> > > which is odd (we _can_ move by pieces, but how does this apply to
> > > TREE_CONSTANT CTORs and avoid_temp_mem?).
> > >
> > > That said, I wonder if we want to elide expand_constructor when the
> > > CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
> > > and we can_move_by_pieces.
> > >
> > > So sth like
> > >
> > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > index 7139545d543..76b3bdf0c01 100644
> > > --- a/gcc/expr.c
> > > +++ b/gcc/expr.c
> > > @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
> > > expand_modifier modifier,
> > >                && (! can_move_by_pieces
> > >                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > >                     TYPE_ALIGN (type)))
> > > +              && ! mostly_zeros_p (exp))
> > > +          || (TREE_CONSTANT (exp)
> > > +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > > +              && (can_move_by_pieces
> > > +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > > +                   TYPE_ALIGN (type)))
> > >                && ! mostly_zeros_p (exp))))
> > >        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
> > >           && TREE_CONSTANT (exp)))
> > >
> > > which handles your initializer and the all-zero one optimal?
> > >
> >
> > It works.  Here is the updated patch.
>
> So just looking at the code again I think we probably want to add
> && avoid_temp_mem here, at least that's the case we're looking
> at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
> and !avoid_temp_mem but if so we'd create a temporary here
> which of course would be pointless.
>
> So maybe it's then clearer to split the condition out as
>
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 7139545d543..ee8f25f9abd 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum
> expand_modifier modifier,
>        return constructor;
>      }
>
> +  /* If the CTOR is available in static storage and not mostly
> +     zeros and we can move it by pieces prefer to do so since
> +     that's usually more efficient than performing a series of
> +     stores from immediates.  */
> +  if (avoid_temp_mem
> +      && TREE_STATIC (exp)
> +      && TREE_CONSTANT (exp)
> +      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> +      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> +                            TYPE_ALIGN (type))
> +      && ! mostly_zeros_p (exp))
> +    return NULL_RTX;
> +
>    /* Handle calls that pass values in multiple non-contiguous
>       locations.  The Irix 6 ABI has examples of this.  */
>    if (target == 0 || ! safe_from_p (target, exp, 1)
>
>
> OK with that change.
>

This is the patch I am checking in after testing.
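
For anyone reading the condition in isolation, the new early return can be
viewed as a single predicate.  The sketch below is only a standalone model
of it: struct ctor_facts, prefer_move_by_pieces and check_pr90773_cases are
hypothetical names that do not exist in GCC, and each field merely stands
for the GCC query named in its comment.  The values used for the two test
cases are assumptions about what those queries return, not measured facts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the properties the patch queries on the
   constructor.  None of these field names exist in GCC.  */
struct ctor_facts
{
  bool avoid_temp_mem;      /* the avoid_temp_mem argument */
  bool is_static;           /* TREE_STATIC (exp) */
  bool is_constant;         /* TREE_CONSTANT (exp) */
  bool size_fits_uhwi;      /* tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) */
  bool can_move_by_pieces;  /* can_move_by_pieces (size, align) */
  bool mostly_zeros;        /* mostly_zeros_p (exp) */
};

/* Model of the new early return in expand_constructor: true means
   "return NULL_RTX and let the caller copy from static storage".  */
bool
prefer_move_by_pieces (const struct ctor_facts *f)
{
  return (f->avoid_temp_mem
          && f->is_static
          && f->is_constant
          && f->size_fits_uhwi
          && f->can_move_by_pieces
          && !f->mostly_zeros);
}

/* The two new tests expressed in terms of the model, assuming every
   condition other than mostly_zeros holds for both of them.  */
void
check_pr90773_cases (void)
{
  /* pr90773-24.c: arbitrary non-zero initializer.  */
  struct ctor_facts nonzero = { true, true, true, true, true, false };
  /* pr90773-25.c: all-zero initializer, left to expand_constructor.  */
  struct ctor_facts zeros = { true, true, true, true, true, true };

  assert (prefer_move_by_pieces (&nonzero));
  assert (!prefer_move_by_pieces (&zeros));
}
```

The mostly_zeros term is what keeps the all-zero case on the existing
clear_storage path, so pr90773-25.c can still expect pxor plus movups and
no movdqa load.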

Thanks.

-- 
H.J.

[-- Attachment #2: 0001-Elide-expand_constructor-if-move-by-pieces-is-prefer.patch --]
[-- Type: text/x-patch, Size: 4119 bytes --]

From ca3e15ff233230f2450478459a41e6a93ec42df6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 21 May 2021 05:16:20 -0700
Subject: [PATCH] Elide expand_constructor if move by pieces is preferred

Elide expand_constructor when the constructor is in static storage, is
not mostly zeros, and can be moved by pieces, since that is usually more
efficient than performing a series of stores from immediates.

2021-05-21  Richard Biener  <rguenther@suse.de>
	    H.J. Lu  <hjl.tools@gmail.com>

gcc/

	PR middle-end/90773
	* expr.c (expand_constructor): Elide expand_constructor if
	move by pieces is preferred.

gcc/testsuite/

	* gcc.target/i386/pr90773-24.c: New test.
	* gcc.target/i386/pr90773-25.c: Likewise.
---
 gcc/expr.c                                 | 13 +++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-24.c | 23 ++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr90773-25.c | 25 ++++++++++++++++++++++
 3 files changed, 61 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c

diff --git a/gcc/expr.c b/gcc/expr.c
index d09ee42e262..ba61eb98b3b 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
       return constructor;
     }
 
+  /* If the CTOR is available in static storage and not mostly
+     zeros and we can move it by pieces prefer to do so since
+     that's usually more efficient than performing a series of
+     stores from immediates.  */
+  if (avoid_temp_mem
+      && TREE_STATIC (exp)
+      && TREE_CONSTANT (exp)
+      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
+			     TYPE_ALIGN (type))
+      && ! mostly_zeros_p (exp))
+    return NULL_RTX;
+
   /* Handle calls that pass values in multiple non-contiguous
      locations.  The Irix 6 ABI has examples of this.  */
   if (target == 0 || ! safe_from_p (target, exp, 1)
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c
new file mode 100644
index 00000000000..7b2ea66dcfc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
@@ -0,0 +1,23 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c
new file mode 100644
index 00000000000..57642ea8d2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
@@ -0,0 +1,25 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+struct S
+{
+  long long s1 __attribute__ ((aligned (8)));
+  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+};
+
+const struct S array[] = {
+  { 0, }
+};
+
+void
+foo (struct S *x)
+{
+  x[0] = array[0];
+}
+
+/* { dg-final { scan-assembler-not "movdqa" } } */
+/* { dg-final { scan-assembler-times "pxor\[\\t \]%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */
-- 
2.31.1



* Re: [PATCH] constructor: Elide expand_constructor when can move by pieces is true
  2021-05-21  7:30             ` Bernd Edlinger
@ 2021-05-21 13:13               ` H.J. Lu
  0 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-21 13:13 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Richard Biener, GCC Patches, Richard Sandiford, Uros Bizjak

On Fri, May 21, 2021 at 12:30 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
>
>
> On 5/21/21 8:57 AM, Richard Biener wrote:
> > On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Thu, May 20, 2021 at 12:51 AM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>>
> >>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
> >>>> <richard.guenther@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>>>>
> >>>>>> When expanding a constant constructor, don't call expand_constructor if
> >>>>>> it is more efficient to load the data from the memory via move by pieces.
> >>>>>>
> >>>>>> gcc/
> >>>>>>
> >>>>>>         PR middle-end/90773
> >>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> >>>>>>         it is more efficient to load the data from the memory.
> >>>>>>
> >>>>>> gcc/testsuite/
> >>>>>>
> >>>>>>         PR middle-end/90773
> >>>>>>         * gcc.target/i386/pr90773-24.c: New test.
> >>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
> >>>>>> ---
> >>>>>>  gcc/expr.c                                 | 10 ++++++++++
> >>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> >>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> >>>>>>  3 files changed, 52 insertions(+)
> >>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> >>>>>>
> >>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
> >>>>>> index d09ee42e262..80e01ea1cbe 100644
> >>>>>> --- a/gcc/expr.c
> >>>>>> +++ b/gcc/expr.c
> >>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> >>>>>>                 unsigned HOST_WIDE_INT ix;
> >>>>>>                 tree field, value;
> >>>>>>
> >>>>>> +               /* Check if it is more efficient to load the data from
> >>>>>> +                  the memory directly.  FIXME: How many stores do we
> >>>>>> +                  need here if not moved by pieces?  */
> >>>>>> +               unsigned HOST_WIDE_INT bytes
> >>>>>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> >>>>>
> >>>>> that's prone to fail - it could be a VLA.
> >>>>
> >>>> What do you mean by fail?  Is it ICE or missed optimization?
> >>>> Do you have a testcase?
> >>>>
> >>>>>
> >>>>>> +               if ((bytes / UNITS_PER_WORD) > 2
> >>>>>> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> >>>>>> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> >>>>>> +                 goto normal_inner_ref;
> >>>>>> +
> >>>>>
> >>>>> It looks like you're concerned about aggregate copies but this also handles
> >>>>> non-aggregates (which on GIMPLE might already be optimized of course).
> >>>>
> >>>> Here I check if we copy more than 2 words and we can move more than
> >>>> a word in a single instruction.
> >>>>
> >>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
> >>>>> we generally handle immed const vs. load from constant pool costs?
> >>>>
> >>>> This trades 2 (up to 8) stores with one load plus one store.  Is there
> >>>> a way to check which one is faster?
> >>>
> >>> I'm not sure - it depends on whether the target can do stores from immediates
> >>> at all or what restrictions apply, what the immediate value actually is
> >>> (zero or all-ones should be way cheaper than sth arbitrary) and how the
> >>> pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
> >>> also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
> >>> that might depend on alignment.  There's by_pieces_ninsns that might provide
> >>> some hint here.
> >>>
> >>> I'm sure it works well for x86.
> >>>
> >>> I wonder if the existing code is in the appropriate place and we
> >>> shouldn't instead
> >>> handle this somewhere upthread where we ask to copy 'exp' into some other
> >>> memory location.  For your testcase that's expand_assignment but I can
> >>> imagine passing array[0] by value to a function resulting in similar copying.
> >>> Testing that shows we get
> >>>
> >>>         pushq   array+56(%rip)
> >>>         .cfi_def_cfa_offset 24
> >>>         pushq   array+48(%rip)
> >>>         .cfi_def_cfa_offset 32
> >>>         pushq   array+40(%rip)
> >>>         .cfi_def_cfa_offset 40
> >>>         pushq   array+32(%rip)
> >>>         .cfi_def_cfa_offset 48
> >>>         pushq   array+24(%rip)
> >>>         .cfi_def_cfa_offset 56
> >>>         pushq   array+16(%rip)
> >>>         .cfi_def_cfa_offset 64
> >>>         pushq   array+8(%rip)
> >>>         .cfi_def_cfa_offset 72
> >>>         pushq   array(%rip)
> >>>         .cfi_def_cfa_offset 80
> >>>         call    bar
> >>>
> >>> for that.  We do have the by-pieces infrastructure to generally do this kind of
> >>> copying but in both of these cases we do not seem to use it.  I also wonder
> >>> if the by-pieces infrastructure can pick up constant initializers automagically
> >>> (we could native_encode the initializer part and feed the by-pieces
> >>> infrastructure with an array of bytes).  There for example might be easy to
> >>> immediate-store byte parts and difficult ones where we could decide on a
> >>> case-by-case basis whether to load+store or immediate-store them.
> >>
> >> I opened:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
> >>
> >>> For example if I change your testcase to have the array[] initializer
> >>> all-zero we currently emit
> >>>
> >>>         pxor    %xmm0, %xmm0
> >>>         movups  %xmm0, (%rdi)
> >>>         movups  %xmm0, 16(%rdi)
> >>>         movups  %xmm0, 32(%rdi)
> >>>         movups  %xmm0, 48(%rdi)
> >>>         ret
> >>>
> >>> will your patch cause us to emit 4 loads?  OTOH if I do
> >>>
> >>> const struct S array[] = {
> >>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> >>> };
> >>>
> >>> we get
> >>>
> >>>         movq    $0, (%rdi)
> >>>         movl    $0, 8(%rdi)
> >>>         movl    $0, 12(%rdi)
> >>>         movl    $7241, 16(%rdi)
> >>> ...
> >>>
> >>> ideally we'd have sth like
> >>>
> >>>     pxor %xmm0, %xmm0
> >>>     movups  %xmm0, (%rdi)
> >>>     movaps array+16(%rip), %xmm0
> >>>     movups %xmm0, 16(%rdi)
> >>> ...
> >>>
> >>> thus have the zeros written as immediates and the remaining pieces
> >>> with load+stores.
> >>>
> >>> The by-pieces infrastructure eventually gets to see
> >>>
> >>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
> >>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
> >>>
> >>> where the MEM_EXPR should provide a way to access the constant initializer.
> >>>
> >>> That said I do agree the current code is a bit premature optimization
> >>> - but maybe
> >>> it should be fend off in expand_constructor which has the cheap clear_storage
> >>> first and which already does check can_move_by_pieces with some heuristics,
> >>> but that seems to be guarded by
> >>>
> >>>            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> >>>                && (! can_move_by_pieces
> >>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>>                     TYPE_ALIGN (type)))
> >>>                && ! mostly_zeros_p (exp))))
> >>>
> >>> which is odd (we _can_ move by pieces, but how does this apply to
> >>> TREE_CONSTANT CTORs and avoid_temp_mem?).
> >>>
> >>> That said, I wonder if we want to elide expand_constructor when the
> >>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
> >>> and we can_move_by_pieces.
> >>>
> >>> So sth like
> >>>
> >>> diff --git a/gcc/expr.c b/gcc/expr.c
> >>> index 7139545d543..76b3bdf0c01 100644
> >>> --- a/gcc/expr.c
> >>> +++ b/gcc/expr.c
> >>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
> >>> expand_modifier modifier,
> >>>                && (! can_move_by_pieces
> >>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>>                     TYPE_ALIGN (type)))
> >>> +              && ! mostly_zeros_p (exp))
> >>> +          || (TREE_CONSTANT (exp)
> >>> +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> >>> +              && (can_move_by_pieces
> >>> +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>> +                   TYPE_ALIGN (type)))
> >>>                && ! mostly_zeros_p (exp))))
> >>>        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
> >>>           && TREE_CONSTANT (exp)))
> >>>
> >>> which handles your initializer and the all-zero one optimally?
> >>>
> >>
> >> It works.  Here is the updated patch.
> >
> > So just looking at the code again I think we probably want to add
> > && avoid_temp_mem here, at least that's the case we're looking
> > at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
> > and !avoid_temp_mem but if so we'd create a temporary here
> > which of course would be pointless.
> >
> > So maybe it's then clearer to split the condition out as
> >
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index 7139545d543..ee8f25f9abd 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum
> > expand_modifier modifier,
> >        return constructor;
> >      }
> >
> > +  /* If the CTOR is available in static storage and not mostly
> > +     zeros and we can move it by pieces prefer to do so since
> > +     that's usually more efficient than performing a series of
> > +     stores from immediates.  */
> > +  if (avoid_temp_mem
> > +      && TREE_STATIC (exp)
> > +      && TREE_CONSTANT (exp)
> > +      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > +      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > +                            TYPE_ALIGN (type))
> > +      && ! mostly_zeros_p (exp))
> > +    return NULL_RTX;
> > +
> >    /* Handle calls that pass values in multiple non-contiguous
> >       locations.  The Irix 6 ABI has examples of this.  */
> >    if (target == 0 || ! safe_from_p (target, exp, 1)
> >
> >
> > OK with that change.
> >
>
> Note however (I've been playing with the previous version)
> that the test case
>
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
>
> fails for --target_board=unix
>
> $ grep movdqu pr90773-25.s
>         vmovdqu %xmm0, (%rdi)
>         vmovdqu %xmm1, 16(%rdi)
>         vmovdqu %xmm2, 32(%rdi)
>         vmovdqu %xmm3, 48(%rdi)
>
> while the test expects %ymm
> /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
>
> and
>
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times pxor[\\\\t ]%xmm[0-9]+, %xmm[0-9]+ 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
>
> fails for --target_board=unix/-m32
>

The whole patch set is needed.   My users/hjl/pieces/hook branch is at

https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/hook

I got

[hjl@gnu-cfl-2 testsuite]$
/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-25.c
-fdiagnostics-plain-output -O2 -march=skylake -ffat-lto-objects
-fno-ident -S -o pr90773-25.s
[hjl@gnu-cfl-2 testsuite]$ cat pr90773-25.s
	.file "pr90773-25.c"
	.text
	.p2align 4
	.globl foo
	.type foo, @function
foo:
.LFB0:
	.cfi_startproc
	vpxor %xmm0, %xmm0, %xmm0
	vmovdqu %xmm0, (%rdi)
	vmovdqu %xmm0, 16(%rdi)
	vmovdqu %xmm0, 32(%rdi)
	vmovdqu %xmm0, 48(%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size foo, .-foo
	.globl array
	.section .rodata
	.align 32
	.type array, @object
	.size array, 64
array:
	.zero 64
	.section .note.GNU-stack,"",@progbits
[hjl@gnu-cfl-2 testsuite]$

-- 
H.J.

* Re: [PATCH] Add 3 target hooks for memset
  2021-05-20 20:49       ` [PATCH] Add 3 target hooks for memset H.J. Lu
  2021-05-21  5:42         ` Bernd Edlinger
@ 2021-05-25 14:34         ` Richard Biener
  2021-05-25 15:11           ` H.J. Lu
  1 sibling, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-25 14:34 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Thu, May 20, 2021 at 10:50 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, May 19, 2021 at 2:25 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > > > target instructions to duplicate QImode value to TImode/OImode/XImode
> > > > > value for memset.
> > > >
> > > >         PR middle-end/90773
> > > >         * builtins.c (builtin_memset_read_str): Call
> > > >         targetm.read_memset_value.
> > > >         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> > > >         * target.def (read_memset_value): New hook.
> > > >         (gen_memset_value): Likewise.
> > > >         * targhooks.c: Inclue "builtins.h".
> > > >         (default_read_memset_value): New function.
> > > >         (default_gen_memset_value): Likewise.
> > > >         * targhooks.h (default_read_memset_value): New prototype.
> > > >         (default_gen_memset_value): Likewise.
> > > >         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> > > >         TARGET_GEN_MEMSET_VALUE hooks.
> > > >         * doc/tm.texi: Regenerated.
> > > > ---
> > > >  gcc/builtins.c     | 47 ++++----------------------------------
> > > >  gcc/doc/tm.texi    | 16 +++++++++++++
> > > >  gcc/doc/tm.texi.in |  4 ++++
> > > >  gcc/target.def     | 20 +++++++++++++++++
> > > >  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> > > >  gcc/targhooks.h    |  4 ++++
> > > >  6 files changed, 104 insertions(+), 43 deletions(-)
> > > >
> > > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > > index e1b284846b1..f78a36478ef 100644
> > > > --- a/gcc/builtins.c
> > > > +++ b/gcc/builtins.c
> > > > @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> > > >     previous iteration.  */
> > > >
> > > >  rtx
> > > > -builtin_memset_read_str (void *data, void *prevp,
> > > > +builtin_memset_read_str (void *data, void *prev,
> > > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > >                          scalar_int_mode mode)
> > > >  {
> > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > -    {
> > > > -      /* Use the previous data in the same mode.  */
> > > > -      if (prev->mode == mode)
> > > > -       return prev->data;
> > > > -    }
> > > > -
> > > > -  const char *c = (const char *) data;
> > > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > -
> > > > -  memset (p, *c, GET_MODE_SIZE (mode));
> > > > -
> > > > -  return c_readstr (p, mode);
> > > > +  return targetm.read_memset_value ((const char *) data, prev, mode);
> > > >  }
> > > >
> > > >  /* Callback routine for store_by_pieces.  Return the RTL of a register
> > > > @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> > > >     nullptr, it has the RTL info from the previous iteration.  */
> > > >
> > > >  static rtx
> > > > -builtin_memset_gen_str (void *data, void *prevp,
> > > > +builtin_memset_gen_str (void *data, void *prev,
> > > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > >                         scalar_int_mode mode)
> > > >  {
> > > > -  rtx target, coeff;
> > > > -  size_t size;
> > > > -  char *p;
> > > > -
> > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > -    {
> > > > -      /* Use the previous data in the same mode.  */
> > > > -      if (prev->mode == mode)
> > > > -       return prev->data;
> > > > -
> > > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > -      if (target != nullptr)
> > > > -       return target;
> > > > -    }
> > > > -
> > > > -  size = GET_MODE_SIZE (mode);
> > > > -  if (size == 1)
> > > > -    return (rtx) data;
> > > > -
> > > > -  p = XALLOCAVEC (char, size);
> > > > -  memset (p, 1, size);
> > > > -  coeff = c_readstr (p, mode);
> > > > -
> > > > -  target = convert_to_mode (mode, (rtx) data, 1);
> > > > -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > > > -  return force_reg (mode, target);
> > > > +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> > > >  }
> > > >
> > > >  /* Expand expression EXP, which is a call to the memset builtin.  Return
> > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > index 85ea9395560..51385044e76 100644
> > > > --- a/gcc/doc/tm.texi
> > > > +++ b/gcc/doc/tm.texi
> > > > @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> > > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > > >  @end deftypefn
> > > >
> > > > +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > +This function returns the RTL of a constant integer corresponding to
> > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> > > > +the RTL information from the previous interation.
> > > > +@end deftypefn
> > > > +
> > > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > +This function returns the RTL of a register containing
> > > > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> > > > +char value given in the RTL register @var{data}.  For example, if
> > > > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> > > > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> > > > +the previous iteration.
> > > > +@end deftypefn
> > > > +
> > > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > >  This target hook returns a new value for the number of times @var{loop}
> > > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > index d8e3de14af1..8d4c3949fbf 100644
> > > > --- a/gcc/doc/tm.texi.in
> > > > +++ b/gcc/doc/tm.texi.in
> > > > @@ -7956,6 +7956,10 @@ lists.
> > > >
> > > >  @hook TARGET_GEN_CCMP_NEXT
> > > >
> > > > +@hook TARGET_READ_MEMSET_VALUE
> > > > +
> > > > +@hook TARGET_GEN_MEMSET_VALUE
> > > > +
> > > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > > >
> > > >  @defmac POWI_MAX_MULTS
> > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > index bbaf6b4f3a0..c9aca40fa88 100644
> > > > --- a/gcc/target.def
> > > > +++ b/gcc/target.def
> > > > @@ -2694,6 +2694,26 @@ DEFHOOK
> > > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > > >   NULL)
> > > >
> > > > +DEFHOOK
> > > > +(read_memset_value,
> > > > + "This function returns the RTL of a constant integer corresponding to\n\
> > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > >
> > > where is 'str' defined?  I can't really tell what's the difference
> >
> > Fixed with
> >
> > diff --git a/gcc/target.def b/gcc/target.def
> > index c9aca40fa88..4c3a5fcc634 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2699,8 +2699,8 @@ DEFHOOK
> >   "This function returns the RTL of a constant integer corresponding to\n\
> >  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
> >  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > -the RTL information from the previous interation.",
> > - rtx, (const char *c, void *prev, scalar_int_mode mode),
> > +the RTL information from the previous iteration.",
> > + rtx, (const char *str, void *prev, scalar_int_mode mode),
> >   default_read_memset_value)
> >
> >  DEFHOOK
> >
> > > from read_memset_value
> > > and gen_memset_value.
> >
> > The difference is that input of read_memset_value is a string constant
> > like "123" and input of gen_memset_value is an RTL register.
> >
> > > Somehow I feel that an optab for the "splat" operation similar
> > > to vec_duplicate might be a better way to expose this - of course
> > > that doesn't handle the "prev" thing.
> >
> > The x86 backend has ix86_expand_vector_init_duplicate () to
> > broadcast QImode to TImode/OImode/XImode:
> >
> > /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
> >    with all elements equal to VAR.  Return true if successful.  */
> >
> > bool
> > ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> >                                    rtx target, rtx val)
> >
> > > So how's this the right point of abstraction to the target?
> >
> > I can add 2 target hooks, one for scratch register and one for
> > broadcasting QImode to TImode/OImode/XImode.   Then I can
> > move x86 codes to the middle-end.
> >
>
> Here is the patch to add 3 target hooks:
>
>  -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
>           scalar_int_mode MODE)
>      This function returns the RTL of a constant integer corresponding
>      to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
>      constant C.
>
>  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
>           MODE)
>      This function returns the RTL of a register containing
>      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
>      value given in the RTL register DATA.  For example, if MODE is 4
>      bytes wide, return the RTL for 0x01010101*DATA.

For this one I wonder if it should be an optab instead.  Couldn't you
use the existing vec_duplicate for this by using (paradoxical) subregs
like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...))))?

Currently vec_duplicate is only used for SVE IIRC.
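
As a plain-C illustration of the 0x01010101*DATA splat described in the
hook documentation quoted above: this is only a sketch on host integers,
not the RTL either an optab or a hook would produce, and the function
names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate BYTE into all four bytes of a 32-bit word by multiplying
   with the coefficient 0x01010101.  The product cannot overflow:
   0xff * 0x01010101 == 0xffffffff.  */
uint32_t
splat_byte_u32 (uint8_t byte)
{
  return (uint32_t) byte * UINT32_C (0x01010101);
}

/* The same trick for an 8-byte-wide value.  */
uint64_t
splat_byte_u64 (uint8_t byte)
{
  return (uint64_t) byte * UINT64_C (0x0101010101010101);
}

/* Reference version: fill a word with memset and read it back.  Since
   every byte is identical, the result is independent of endianness.  */
uint32_t
splat_byte_u32_ref (uint8_t byte)
{
  uint32_t word;
  assert (sizeof word == 4);
  memset (&word, byte, sizeof word);
  return word;
}
```

The multiply form is what the generic code computes for scalar integer
modes; a vector broadcast gives the same bytes without the multiply on
targets that have one.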

>  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
>           scalar_int_mode MODE)
>      This function returns the RTL of a register in MODE generated from
>      PREV in the previous iteration.
>
> with
>
> /* Return the RTL of a register in MODE generated from PREV in the
>    previous iteration.  */
>
> static rtx
> gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> {
>   by_pieces_prev *prev = (by_pieces_prev *) prevp;
>   rtx value;
>   if (prev != nullptr && prev->data != nullptr)
>     {
>       /* Use the previous data in the same mode.  */
>       if (prev->mode == mode)
>         return prev->data;
>
>       value = targetm.gen_memset_value_from_prev (prevp, mode);

But why do we need a target hook here?  Doesn't this just duplicate/subreg
this further?

>     }
>   else
>     value = nullptr;
>   return value;
> }
>
> /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
>    bytes from constant string DATA + OFFSET and return it as target
>    constant.  If PREV isn't nullptr, it has the RTL info from the
>    previous iteration.  */
>
> rtx
> builtin_memset_read_str (void *data, void *prev,
>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                          scalar_int_mode mode)
> {
>   const char *str = (const char *) data;
>
>   /* Don't use the previous value if size is 1.  */
>   if (GET_MODE_SIZE (mode) == 1)
>     return default_read_memset_value (str, mode);
>
>   rtx value = gen_memset_value_from_prev (prev, mode);
>   if (value)
>     return value;
>
>   return targetm.read_memset_value (str, mode);
> }
>
> /* Callback routine for store_by_pieces.  Return the RTL of a register
>    containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
>    char value given in the RTL register data.  For example, if mode is
>    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
>    nullptr, it has the RTL info from the previous iteration.  */
>
> static rtx
> builtin_memset_gen_str (void *datap, void *prev,
>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                         scalar_int_mode mode)
> {
>   rtx data = (rtx) datap;
>
>   /* Don't use the previous value if size is 1.  */
>   if (GET_MODE_SIZE (mode) == 1)
>     return data;
>
>   rtx value = gen_memset_value_from_prev (prev, mode);
>   if (value)
>     return value;
>
>   return targetm.gen_memset_value (data, mode);
> }
>
>
> --
> H.J.

* Re: [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT
  2021-05-18 19:16 ` [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT H.J. Lu
@ 2021-05-25 14:37   ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2021-05-25 14:37 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> It is only defined for i386 and everyone uses the default:
>
>  #define MAX_BITSIZE_MODE_ANY_INT (64*BITS_PER_UNIT)
>
> Whatever problems we had before, they have been fixed now.

So I don't have a strong recollection here apart from memory usage
considerations with wide-int (possibly fixed by all the trailing-wide-int stuff
we now have).  So I'm fine if the target maintainer is - but then we probably
should remove all vestiges of non-default MAX_BITSIZE_MODE_ANY_INT,
or do we want to keep it just in case?
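
To put rough numbers on the memory-usage point, here is a sketch of the
"rounded up to multiple of HOST_BITS_PER_WIDE_INT" sizing mentioned in
the comment the patch removes.  It assumes HOST_BITS_PER_WIDE_INT is 64
and BITS_PER_UNIT is 8; the helper name is hypothetical, and the real
wide-int.h may reserve a different number of elements than this.

```c
#include <assert.h>

/* Assumed host wide-int width in bits.  */
enum { ASSUMED_HOST_BITS_PER_WIDE_INT = 64 };

/* Number of host wide ints needed to hold BITS bits, rounding up to a
   whole element.  */
unsigned
wide_int_elts_for_bits (unsigned bits)
{
  assert (bits > 0);
  return ((bits + ASSUMED_HOST_BITS_PER_WIDE_INT - 1)
          / ASSUMED_HOST_BITS_PER_WIDE_INT);
}
```

Under those assumptions the i386 value of 160 bits needs 3 elements,
while the default of 64*BITS_PER_UNIT (512 bits) needs 8.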

Thanks,
Richard.

>         * config/i386/i386-modes.def (MAX_BITSIZE_MODE_ANY_INT): Removed.
> ---
>  gcc/config/i386/i386-modes.def | 15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/config/i386/i386-modes.def b/gcc/config/i386/i386-modes.def
> index dbddfd8e48f..4e7014be034 100644
> --- a/gcc/config/i386/i386-modes.def
> +++ b/gcc/config/i386/i386-modes.def
> @@ -107,19 +107,10 @@ INT_MODE (XI, 64);
>  PARTIAL_INT_MODE (HI, 16, P2QI);
>  PARTIAL_INT_MODE (SI, 32, P2HI);
>
> -/* Mode used for signed overflow checking of TImode.  As
> -   MAX_BITSIZE_MODE_ANY_INT is only 160, wide-int.h reserves only that
> -   rounded up to multiple of HOST_BITS_PER_WIDE_INT bits in wide_int etc.,
> -   so OImode is too large.  For the overflow checking we actually need
> -   just 1 or 2 bits beyond TImode precision.  Use 160 bits to have
> -   a multiple of 32.  */
> +/* Mode used for signed overflow checking of TImode.  For the overflow
> +   checking we actually need just 1 or 2 bits beyond TImode precision.
> +   Use 160 bits to have a multiple of 32.  */
>  PARTIAL_INT_MODE (OI, 160, POI);
>
> -/* Keep the OI and XI modes from confusing the compiler into thinking
> -   that these modes could actually be used for computation.  They are
> -   only holders for vectors during data movement.  Include POImode precision
> -   though.  */
> -#define MAX_BITSIZE_MODE_ANY_INT (160)
> -
>  /* The symbol Pmode stands for one of the above machine modes (usually SImode).
>     The tm.h file specifies which one.  It is not a distinct mode.  */
> --
> 2.31.1
>

* Re: [PATCH] Add 3 target hooks for memset
  2021-05-25 14:34         ` Richard Biener
@ 2021-05-25 15:11           ` H.J. Lu
  2021-05-26  8:28             ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-25 15:11 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Tue, May 25, 2021 at 7:34 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, May 20, 2021 at 10:50 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 2:25 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > > Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > > > > target instructions to duplicate QImode value to TImode/OImode/XImode
> > > > > > value for memset.
> > > > >
> > > > >         PR middle-end/90773
> > > > >         * builtins.c (builtin_memset_read_str): Call
> > > > >         targetm.read_memset_value.
> > > > >         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> > > > >         * target.def (read_memset_value): New hook.
> > > > >         (gen_memset_value): Likewise.
> > > > >         * targhooks.c: Inclue "builtins.h".
> > > > >         (default_read_memset_value): New function.
> > > > >         (default_gen_memset_value): Likewise.
> > > > >         * targhooks.h (default_read_memset_value): New prototype.
> > > > >         (default_gen_memset_value): Likewise.
> > > > >         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> > > > >         TARGET_GEN_MEMSET_VALUE hooks.
> > > > >         * doc/tm.texi: Regenerated.
> > > > > ---
> > > > >  gcc/builtins.c     | 47 ++++----------------------------------
> > > > >  gcc/doc/tm.texi    | 16 +++++++++++++
> > > > >  gcc/doc/tm.texi.in |  4 ++++
> > > > >  gcc/target.def     | 20 +++++++++++++++++
> > > > >  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  gcc/targhooks.h    |  4 ++++
> > > > >  6 files changed, 104 insertions(+), 43 deletions(-)
> > > > >
> > > > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > > > index e1b284846b1..f78a36478ef 100644
> > > > > --- a/gcc/builtins.c
> > > > > +++ b/gcc/builtins.c
> > > > > @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> > > > >     previous iteration.  */
> > > > >
> > > > >  rtx
> > > > > -builtin_memset_read_str (void *data, void *prevp,
> > > > > +builtin_memset_read_str (void *data, void *prev,
> > > > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > >                          scalar_int_mode mode)
> > > > >  {
> > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > -    {
> > > > > -      /* Use the previous data in the same mode.  */
> > > > > -      if (prev->mode == mode)
> > > > > -       return prev->data;
> > > > > -    }
> > > > > -
> > > > > -  const char *c = (const char *) data;
> > > > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > > -
> > > > > -  memset (p, *c, GET_MODE_SIZE (mode));
> > > > > -
> > > > > -  return c_readstr (p, mode);
> > > > > +  return targetm.read_memset_value ((const char *) data, prev, mode);
> > > > >  }
> > > > >
> > > > >  /* Callback routine for store_by_pieces.  Return the RTL of a register
> > > > > @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> > > > >     nullptr, it has the RTL info from the previous iteration.  */
> > > > >
> > > > >  static rtx
> > > > > -builtin_memset_gen_str (void *data, void *prevp,
> > > > > +builtin_memset_gen_str (void *data, void *prev,
> > > > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > >                         scalar_int_mode mode)
> > > > >  {
> > > > > -  rtx target, coeff;
> > > > > -  size_t size;
> > > > > -  char *p;
> > > > > -
> > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > -    {
> > > > > -      /* Use the previous data in the same mode.  */
> > > > > -      if (prev->mode == mode)
> > > > > -       return prev->data;
> > > > > -
> > > > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > > -      if (target != nullptr)
> > > > > -       return target;
> > > > > -    }
> > > > > -
> > > > > -  size = GET_MODE_SIZE (mode);
> > > > > -  if (size == 1)
> > > > > -    return (rtx) data;
> > > > > -
> > > > > -  p = XALLOCAVEC (char, size);
> > > > > -  memset (p, 1, size);
> > > > > -  coeff = c_readstr (p, mode);
> > > > > -
> > > > > -  target = convert_to_mode (mode, (rtx) data, 1);
> > > > > -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > > > > -  return force_reg (mode, target);
> > > > > +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> > > > >  }
> > > > >
> > > > >  /* Expand expression EXP, which is a call to the memset builtin.  Return
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index 85ea9395560..51385044e76 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> > > > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > > +This function returns the RTL of a constant integer corresponding to
> > > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> > > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> > > > > +the RTL information from the previous interation.
> > > > > +@end deftypefn
> > > > > +
> > > > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > > +This function returns the RTL of a register containing
> > > > > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> > > > > +char value given in the RTL register @var{data}.  For example, if
> > > > > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> > > > > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> > > > > +the previous iteration.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > > >  This target hook returns a new value for the number of times @var{loop}
> > > > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index d8e3de14af1..8d4c3949fbf 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -7956,6 +7956,10 @@ lists.
> > > > >
> > > > >  @hook TARGET_GEN_CCMP_NEXT
> > > > >
> > > > > +@hook TARGET_READ_MEMSET_VALUE
> > > > > +
> > > > > +@hook TARGET_GEN_MEMSET_VALUE
> > > > > +
> > > > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > > > >
> > > > >  @defmac POWI_MAX_MULTS
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index bbaf6b4f3a0..c9aca40fa88 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2694,6 +2694,26 @@ DEFHOOK
> > > > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > > > >   NULL)
> > > > >
> > > > > +DEFHOOK
> > > > > +(read_memset_value,
> > > > > + "This function returns the RTL of a constant integer corresponding to\n\
> > > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> > > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > > >
> > > > where is 'str' defined?  I can't really tell what's the difference
> > >
> > > Fixed with
> > >
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index c9aca40fa88..4c3a5fcc634 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2699,8 +2699,8 @@ DEFHOOK
> > >   "This function returns the RTL of a constant integer corresponding to\n\
> > >  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
> > >  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > > -the RTL information from the previous interation.",
> > > - rtx, (const char *c, void *prev, scalar_int_mode mode),
> > > +the RTL information from the previous iteration.",
> > > + rtx, (const char *str, void *prev, scalar_int_mode mode),
> > >   default_read_memset_value)
> > >
> > >  DEFHOOK
> > >
> > > > from read_memset_value
> > > > and gen_memset_value.
> > >
> > > The difference is that input of read_memset_value is a string constant
> > > like "123" and input of gen_memset_value is an RTL register.
> > >
> > > > Somehow I feel that an optab for the "splat" operation similar
> > > > to vec_duplicate might be a better way to expose this - of course
> > > > that doesn't handle the "prev" thing.
> > >
> > > The x86 backend has ix86_expand_vector_init_duplicate () to
> > > broadcast QImode to TImode/OImode/XImode:
> > >
> > > /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
> > >    with all elements equal to VAR.  Return true if successful.  */
> > >
> > > bool
> > > ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> > >                                    rtx target, rtx val)
> > >
> > > > So how's this the right point of abstraction to the target?
> > >
> > > I can add 2 target hooks, one for scratch register and one for
> > > broadcasting QImode to TImode/OImode/XImode.   Then I can
> > > move x86 codes to the middle-end.
> > >
> >
> > Here is the patch to add 3 target hooks:
> >
> >  -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
> >           scalar_int_mode MODE)
> >      This function returns the RTL of a constant integer corresponding
> >      to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
> >      constant C.
> >
> >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> >           MODE)
> >      This function returns the RTL of a register containing
> >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> >      value given in the RTL register DATA.  For example, if MODE is 4
> >      bytes wide, return the RTL for 0x01010101*DATA.
>
> For this one I wonder if it should be an optab instead.  Couldn't you
> use the existing vec_duplicate for this by using (paradoxical) subregs
> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?

I tried.   It doesn't even work on x86.  See:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html

There are special cases to subreg HI, SI and DI modes of TI mode in
ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
work here.   Each backend may need its own special handling.
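
For illustration only, a value-level sketch in standalone C (not GCC code) of
why a narrower piece can be taken from the previous, wider one: the low part
of a broadcast value is itself the narrower broadcast.  The helper names
splat64 and low_part are hypothetical, and a 64-bit integer stands in for
TImode.  The sketch models values only; it says nothing about which subregs a
given backend accepts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: BYTE replicated into all 8 bytes of a 64-bit value,
   using the 0x0101...01 * BYTE multiplication from the hook description.  */
static uint64_t
splat64 (uint8_t byte)
{
  return UINT64_C (0x0101010101010101) * byte;
}

/* Hypothetical helper: the low SIZE bytes of WIDE (SIZE in 1..8).  */
static uint64_t
low_part (uint64_t wide, unsigned size)
{
  assert (size >= 1 && size <= 8);
  if (size == 8)
    return wide;
  return wide & ((UINT64_C (1) << (size * 8)) - 1);
}
```

With these helpers, low_part (splat64 (b), 4) is the 4-byte broadcast of b and
low_part (splat64 (b), 2) is the 2-byte broadcast.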

> Currently vec_duplicate is only used for SVE IIRC.
>
> >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
> >           scalar_int_mode MODE)
> >      This function returns the RTL of a register in MODE generated from
> >      PREV in the previous iteration.
> >
> > with
> >
> > /* Return the RTL of a register in MODE generated from PREV in the
> >    previous iteration.  */
> >
> > static rtx
> > gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > {
> >   by_pieces_prev *prev = (by_pieces_prev *) prevp;
> >   rtx value;
> >   if (prev != nullptr && prev->data != nullptr)
> >     {
> >       /* Use the previous data in the same mode.  */
> >       if (prev->mode == mode)
> >         return prev->data;
> >
> >       value = targetm.gen_memset_value_from_prev (prevp, mode);
>
> But why do we need a target hook here?  Doesn't this just duplicate/subreg
> this further?

See above.

> >     }
> >   else
> >     value = nullptr;
> >   return value;
> > }
> >
> > /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> >    bytes from constant string DATA + OFFSET and return it as target
> >    constant.  If PREV isn't nullptr, it has the RTL info from the
> >    previous iteration.  */
> >
> > rtx
> > builtin_memset_read_str (void *data, void *prev,
> >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                          scalar_int_mode mode)
> > {
> >   const char *str = (const char *) data;
> >
> >   /* Don't use the previous value if size is 1.  */
> >   if (GET_MODE_SIZE (mode) == 1)
> >     return default_read_memset_value (str, mode);
> >
> >   rtx value = gen_memset_value_from_prev (prev, mode);
> >   if (value)
> >     return value;
> >
> >   return targetm.read_memset_value (str, mode);
> > }
> >
> > /* Callback routine for store_by_pieces.  Return the RTL of a register
> >    containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
> >    char value given in the RTL register data.  For example, if mode is
> >    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
> >    nullptr, it has the RTL info from the previous iteration.  */
> >
> > static rtx
> > builtin_memset_gen_str (void *datap, void *prev,
> >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                         scalar_int_mode mode)
> > {
> >   rtx data = (rtx) datap;
> >
> >   /* Don't use the previous value if size is 1.  */
> >   if (GET_MODE_SIZE (mode) == 1)
> >     return data;
> >
> >   rtx value = gen_memset_value_from_prev (prev, mode);
> >   if (value)
> >     return value;
> >
> >   return targetm.gen_memset_value (data, mode);
> > }
> >
> >
> > --
> > H.J.



-- 
H.J.


* Re: [PATCH] Add 3 target hooks for memset
  2021-05-25 15:11           ` H.J. Lu
@ 2021-05-26  8:28             ` Richard Biener
  2021-05-31 12:09               ` [PATCH] Add integer_extract and vec_const_duplicate optabs H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-26  8:28 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Tue, May 25, 2021 at 5:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 7:34 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Thu, May 20, 2021 at 10:50 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Wed, May 19, 2021 at 2:25 AM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >
> > > > > > Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > > > > > target instructions to duplicate QImode value to TImode/OImode/XImode
> > > > > > > value for memset.
> > > > > >
> > > > > >         PR middle-end/90773
> > > > > >         * builtins.c (builtin_memset_read_str): Call
> > > > > >         targetm.read_memset_value.
> > > > > >         (builtin_memset_gen_str): Call targetm.gen_memset_value.
> > > > > >         * target.def (read_memset_value): New hook.
> > > > > >         (gen_memset_value): Likewise.
> > > > > > >         * targhooks.c: Include "builtins.h".
> > > > > >         (default_read_memset_value): New function.
> > > > > >         (default_gen_memset_value): Likewise.
> > > > > >         * targhooks.h (default_read_memset_value): New prototype.
> > > > > >         (default_gen_memset_value): Likewise.
> > > > > >         * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> > > > > >         TARGET_GEN_MEMSET_VALUE hooks.
> > > > > >         * doc/tm.texi: Regenerated.
> > > > > > ---
> > > > > >  gcc/builtins.c     | 47 ++++----------------------------------
> > > > > >  gcc/doc/tm.texi    | 16 +++++++++++++
> > > > > >  gcc/doc/tm.texi.in |  4 ++++
> > > > > >  gcc/target.def     | 20 +++++++++++++++++
> > > > > >  gcc/targhooks.c    | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  gcc/targhooks.h    |  4 ++++
> > > > > >  6 files changed, 104 insertions(+), 43 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > > > > index e1b284846b1..f78a36478ef 100644
> > > > > > --- a/gcc/builtins.c
> > > > > > +++ b/gcc/builtins.c
> > > > > > @@ -6584,24 +6584,11 @@ expand_builtin_strncpy (tree exp, rtx target)
> > > > > >     previous iteration.  */
> > > > > >
> > > > > >  rtx
> > > > > > -builtin_memset_read_str (void *data, void *prevp,
> > > > > > +builtin_memset_read_str (void *data, void *prev,
> > > > > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > > >                          scalar_int_mode mode)
> > > > > >  {
> > > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > > -    {
> > > > > > -      /* Use the previous data in the same mode.  */
> > > > > > -      if (prev->mode == mode)
> > > > > > -       return prev->data;
> > > > > > -    }
> > > > > > -
> > > > > > -  const char *c = (const char *) data;
> > > > > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > > > -
> > > > > > -  memset (p, *c, GET_MODE_SIZE (mode));
> > > > > > -
> > > > > > -  return c_readstr (p, mode);
> > > > > > +  return targetm.read_memset_value ((const char *) data, prev, mode);
> > > > > >  }
> > > > > >
> > > > > >  /* Callback routine for store_by_pieces.  Return the RTL of a register
> > > > > > @@ -6611,37 +6598,11 @@ builtin_memset_read_str (void *data, void *prevp,
> > > > > >     nullptr, it has the RTL info from the previous iteration.  */
> > > > > >
> > > > > >  static rtx
> > > > > > -builtin_memset_gen_str (void *data, void *prevp,
> > > > > > +builtin_memset_gen_str (void *data, void *prev,
> > > > > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > > >                         scalar_int_mode mode)
> > > > > >  {
> > > > > > -  rtx target, coeff;
> > > > > > -  size_t size;
> > > > > > -  char *p;
> > > > > > -
> > > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > > -    {
> > > > > > -      /* Use the previous data in the same mode.  */
> > > > > > -      if (prev->mode == mode)
> > > > > > -       return prev->data;
> > > > > > -
> > > > > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > > > -      if (target != nullptr)
> > > > > > -       return target;
> > > > > > -    }
> > > > > > -
> > > > > > -  size = GET_MODE_SIZE (mode);
> > > > > > -  if (size == 1)
> > > > > > -    return (rtx) data;
> > > > > > -
> > > > > > -  p = XALLOCAVEC (char, size);
> > > > > > -  memset (p, 1, size);
> > > > > > -  coeff = c_readstr (p, mode);
> > > > > > -
> > > > > > -  target = convert_to_mode (mode, (rtx) data, 1);
> > > > > > -  target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> > > > > > -  return force_reg (mode, target);
> > > > > > +  return targetm.gen_memset_value ((rtx) data, prev, mode);
> > > > > >  }
> > > > > >
> > > > > >  /* Expand expression EXP, which is a call to the memset builtin.  Return
> > > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > > index 85ea9395560..51385044e76 100644
> > > > > > --- a/gcc/doc/tm.texi
> > > > > > +++ b/gcc/doc/tm.texi
> > > > > > @@ -11868,6 +11868,22 @@ This function prepares to emit a conditional comparison within a sequence
> > > > > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > > > > >  @end deftypefn
> > > > > >
> > > > > > +@deftypefn {Target Hook} rtx TARGET_READ_MEMSET_VALUE (const char *@var{c}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > > > +This function returns the RTL of a constant integer corresponding to
> > > > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn
> > > > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains
> > > > > > +the RTL information from the previous interation.
> > > > > > +@end deftypefn
> > > > > > +
> > > > > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_VALUE (rtx @var{data}, void *@var{prev}, scalar_int_mode @var{mode})
> > > > > > +This function returns the RTL of a register containing
> > > > > > +@code{GET_MODE_SIZE (@var{mode})} consecutive copies of the unsigned
> > > > > > +char value given in the RTL register @var{data}.  For example, if
> > > > > > +@var{mode} is 4 bytes wide, return the RTL for 0x01010101*@var{data}.
> > > > > > +If @var{PREV} is not @samp{nullptr}, it is the RTL information from
> > > > > > +the previous iteration.
> > > > > > +@end deftypefn
> > > > > > +
> > > > > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > > > >  This target hook returns a new value for the number of times @var{loop}
> > > > > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > > index d8e3de14af1..8d4c3949fbf 100644
> > > > > > --- a/gcc/doc/tm.texi.in
> > > > > > +++ b/gcc/doc/tm.texi.in
> > > > > > @@ -7956,6 +7956,10 @@ lists.
> > > > > >
> > > > > >  @hook TARGET_GEN_CCMP_NEXT
> > > > > >
> > > > > > +@hook TARGET_READ_MEMSET_VALUE
> > > > > > +
> > > > > > +@hook TARGET_GEN_MEMSET_VALUE
> > > > > > +
> > > > > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > > > > >
> > > > > >  @defmac POWI_MAX_MULTS
> > > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > > index bbaf6b4f3a0..c9aca40fa88 100644
> > > > > > --- a/gcc/target.def
> > > > > > +++ b/gcc/target.def
> > > > > > @@ -2694,6 +2694,26 @@ DEFHOOK
> > > > > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > > > > >   NULL)
> > > > > >
> > > > > > +DEFHOOK
> > > > > > +(read_memset_value,
> > > > > > + "This function returns the RTL of a constant integer corresponding to\n\
> > > > > > +target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the stringn\n\
> > > > > > +constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > > > >
> > > > > where is 'str' defined?  I can't really tell what's the difference
> > > >
> > > > Fixed with
> > > >
> > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > index c9aca40fa88..4c3a5fcc634 100644
> > > > --- a/gcc/target.def
> > > > +++ b/gcc/target.def
> > > > @@ -2699,8 +2699,8 @@ DEFHOOK
> > > >   "This function returns the RTL of a constant integer corresponding to\n\
> > > >  target reading @code{GET_MODE_SIZE (@var{mode})} bytes from the string\n\
> > > >  constant @var{str}.  If @var{prev} is not @samp{nullptr}, it contains\n\
> > > > -the RTL information from the previous interation.",
> > > > - rtx, (const char *c, void *prev, scalar_int_mode mode),
> > > > +the RTL information from the previous iteration.",
> > > > + rtx, (const char *str, void *prev, scalar_int_mode mode),
> > > >   default_read_memset_value)
> > > >
> > > >  DEFHOOK
> > > >
> > > > > from read_memset_value
> > > > > and gen_memset_value.
> > > >
> > > > The difference is that input of read_memset_value is a string constant
> > > > like "123" and input of gen_memset_value is an RTL register.
> > > >
> > > > > Somehow I feel that an optab for the "splat" operation similar
> > > > > to vec_duplicate might be a better way to expose this - of course
> > > > > that doesn't handle the "prev" thing.
> > > >
> > > > The x86 backend has ix86_expand_vector_init_duplicate () to
> > > > broadcast QImode to TImode/OImode/XImode:
> > > >
> > > > /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
> > > >    with all elements equal to VAR.  Return true if successful.  */
> > > >
> > > > bool
> > > > ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> > > >                                    rtx target, rtx val)
> > > >
> > > > > So how's this the right point of abstraction to the target?
> > > >
> > > > I can add 2 target hooks, one for scratch register and one for
> > > > broadcasting QImode to TImode/OImode/XImode.   Then I can
> > > > move x86 codes to the middle-end.
> > > >
> > >
> > > Here is the patch to add 3 target hooks:
> > >
> > >  -- Target Hook: rtx TARGET_READ_MEMSET_VALUE (const char *C,
> > >           scalar_int_mode MODE)
> > >      This function returns the RTL of a constant integer corresponding
> > >      to target reading 'GET_MODE_SIZE (MODE)' bytes from the string
> > >      constant C.
> > >
> > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > >           MODE)
> > >      This function returns the RTL of a register containing
> > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > >      value given in the RTL register DATA.  For example, if MODE is 4
> > >      bytes wide, return the RTL for 0x01010101*DATA.
> >
> > For this one I wonder if it should be an optab instead.  Couldn't you
> > use the existing vec_duplicate for this by using (paradoxical) subregs
> > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
>
> I tried.   It doesn't even work on x86.  See:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html

Not sure what I should read from there...

> There are special cases to subreg HI, SI and DI modes of TI mode in
> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> work here.   Each backend may need its own special handling.

OK, I guess I'm not (RTL) qualified enough to further review these parts,
sorry.  Since we're doing code generation the canonical way to communicate
with backends should be optabs, not some set of disconnected target hooks.
But as said, I probably don't know enough of RTL to see why it's the only way.

Richard.

> > Currently vec_duplicate is only used for SVE IIRC.
> >
> > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE_FROM_PREV (void *PREV,
> > >           scalar_int_mode MODE)
> > >      This function returns the RTL of a register in MODE generated from
> > >      PREV in the previous iteration.
> > >
> > > with
> > >
> > > /* Return the RTL of a register in MODE generated from PREV in the
> > >    previous iteration.  */
> > >
> > > static rtx
> > > gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > > {
> > >   by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > >   rtx value;
> > >   if (prev != nullptr && prev->data != nullptr)
> > >     {
> > >       /* Use the previous data in the same mode.  */
> > >       if (prev->mode == mode)
> > >         return prev->data;
> > >
> > >       value = targetm.gen_memset_value_from_prev (prevp, mode);
> >
> > But why do we need a target hook here?  Doesn't this just duplicate/subreg
> > this further?
>
> See above.
>
> > >     }
> > >   else
> > >     value = nullptr;
> > >   return value;
> > > }
> > >
> > > /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> > >    bytes from constant string DATA + OFFSET and return it as target
> > >    constant.  If PREV isn't nullptr, it has the RTL info from the
> > >    previous iteration.  */
> > >
> > > rtx
> > > builtin_memset_read_str (void *data, void *prev,
> > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                          scalar_int_mode mode)
> > > {
> > >   const char *str = (const char *) data;
> > >
> > >   /* Don't use the previous value if size is 1.  */
> > >   if (GET_MODE_SIZE (mode) == 1)
> > >     return default_read_memset_value (str, mode);
> > >
> > >   rtx value = gen_memset_value_from_prev (prev, mode);
> > >   if (value)
> > >     return value;
> > >
> > >   return targetm.read_memset_value (str, mode);
> > > }
> > >
> > > /* Callback routine for store_by_pieces.  Return the RTL of a register
> > >    containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned
> > >    char value given in the RTL register data.  For example, if mode is
> > >    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
> > >    nullptr, it has the RTL info from the previous iteration.  */
> > >
> > > static rtx
> > > builtin_memset_gen_str (void *datap, void *prev,
> > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                         scalar_int_mode mode)
> > > {
> > >   rtx data = (rtx) datap;
> > >
> > >   /* Don't use the previous value if size is 1.  */
> > >   if (GET_MODE_SIZE (mode) == 1)
> > >     return data;
> > >
> > >   rtx value = gen_memset_value_from_prev (prev, mode);
> > >   if (value)
> > >     return value;
> > >
> > >   return targetm.gen_memset_value (data, mode);
> > > }
> > >
> > >
> > > --
> > > H.J.
>
>
>
> --
> H.J.


* [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-26  8:28             ` Richard Biener
@ 2021-05-31 12:09               ` H.J. Lu
  2021-05-31 12:46                 ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-31 12:09 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > >
> > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > >           MODE)
> > > >      This function returns the RTL of a register containing
> > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > >
> > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> >
> > I tried.   It doesn't even work on x86.  See:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> 
> Not sure what I should read from there...
> 
> > There are special cases to subreg HI, SI and DI modes of TI mode in
> > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > work here.   Each backend may need its own special handling.
> 
> OK, I guess I'm not (RTL) qualified enough to further review these parts,
> sorry.  Since we're doing code generation the canonical way to communicate
> with backends should be optabs, not some set of disconnected target hooks.
> But as said, I probably don't know enough of RTL to see why it's the only way.
> 
> Richard.

Here is the patch to add optabs instead.  Does it look OK?

Thanks.

H.J.
---
Add 2 optabs:

1. integer_extract: Extract the low-order bits of an integer value in
TImode, OImode or XImode.
2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
similar to vec_duplicate.  Since the resulting vector is computable at
compile time, vec_duplicate may not be faster, so a backend can opt out
of broadcasting from a constant while opting in to broadcasting from a
variable.

and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
target instructions to duplicate QImode value to TImode, OImode or XImode
value for memset.

Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
scratch register to avoid stack realignment when expanding memset.
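
A minimal value-level sketch, in standalone C rather than GCC RTL, of the
order in which the rewritten builtin_memset_gen_str looks for a value.  The
names prev_value, low_mask and memset_value are hypothetical, 64 bits is the
widest piece modelled, and the broadcast-optab step is only marked by a
comment because the model has no target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value-level stand-in for by_pieces_prev: the value produced
   by the previous iteration and its width in bytes (0 means "none").  */
struct prev_value
{
  uint64_t data;
  unsigned size;
};

/* Mask covering the low SIZE bytes, SIZE in 1..8.  */
static uint64_t
low_mask (unsigned size)
{
  assert (size >= 1 && size <= 8);
  return size == 8 ? UINT64_MAX : (UINT64_C (1) << (size * 8)) - 1;
}

/* Return SIZE copies of BYTE, trying the cheapest source first.  */
static uint64_t
memset_value (uint8_t byte, const struct prev_value *prev, unsigned size)
{
  /* A single byte needs no replication.  */
  if (size == 1)
    return byte;

  /* Reuse the previous value: as-is for the same width, or its low part
     when it is wider (assumption: pieces only get narrower).  */
  if (prev != NULL && prev->size >= size)
    return prev->data & low_mask (size);

  /* The broadcast-optab step would go here; this model has no target.  */

  /* Fallback: multiply by 0x0101...01.  */
  return (UINT64_C (0x0101010101010101) * byte) & low_mask (size);
}
```

Putting the reuse step before the broadcast means a value already produced
for a wider piece is not recomputed for a narrower one.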

	PR middle-end/90773
	* builtins.c (gen_memset_value_from_prev): New function.
	(gen_memset_broadcast): Likewise.
	(builtin_memset_read_str): Use gen_memset_value_from_prev
	and gen_memset_broadcast.
	(builtin_memset_gen_str): Likewise.
	* optabs.def: Add integer_extract and vec_const_duplicate.
	* target.def (gen_memset_scratch_rtx): New hook.
	* doc/md.texi: Document vec_const_duplicate and integer_extract.
	* doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
	* doc/tm.texi: Regenerated.
---
 gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
 gcc/doc/md.texi    |  16 +++++++
 gcc/doc/tm.texi    |   5 ++
 gcc/doc/tm.texi.in |   2 +
 gcc/optabs.def     |   4 ++
 gcc/target.def     |   7 +++
 6 files changed, 131 insertions(+), 20 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index af1fe49bb48..7683169eb96 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
   return NULL_RTX;
 }
 
+/* Return the RTL of a register in MODE generated from PREV in the
+   previous iteration.  */
+
+static rtx
+gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
+{
+  rtx target = nullptr;
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      /* Use the previous data in the same mode.  */
+      if (prev->mode == mode)
+	return prev->data;
+
+      /* Extract the RTL in MODE from PREV.  */
+      enum insn_code icode
+	= convert_optab_handler (integer_extract_optab, mode,
+				 prev->mode);
+      if (icode != CODE_FOR_nothing)
+	{
+	  target = gen_reg_rtx (mode);
+	  class expand_operand ops[2];
+	  create_output_operand (&ops[0], target, mode);
+	  create_input_operand (&ops[1], prev->data, prev->mode);
+	  expand_insn (icode, 2, ops);
+	  if (!rtx_equal_p (target, ops[0].value))
+	    emit_move_insn (target, ops[0].value);
+	}
+      else
+	target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
+    }
+  return target;
+}
+
+/* Return the RTL of a register in MODE broadcast from DATA.  */
+
+static rtx
+gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
+		      scalar_int_mode mode)
+{
+  /* Skip if regno_reg_rtx isn't initialized.  */
+  if (!regno_reg_rtx)
+    return nullptr;
+
+  rtx target = nullptr;
+
+  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
+  machine_mode vector_mode;
+  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
+    gcc_unreachable ();
+
+  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
+  if (icode != CODE_FOR_nothing)
+    {
+      target = targetm.gen_memset_scratch_rtx (vector_mode);
+      class expand_operand ops[2];
+      create_output_operand (&ops[0], target, vector_mode);
+      create_input_operand (&ops[1], (rtx) data, QImode);
+      expand_insn (icode, 2, ops);
+      if (!rtx_equal_p (target, ops[0].value))
+	emit_move_insn (target, ops[0].value);
+      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
+	target = gen_rtx_REG (mode, REGNO (target));
+      else
+	target = convert_to_mode (mode, target, 1);
+    }
+
+  return target;
+}
+
 /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
    bytes from constant string DATA + OFFSET and return it as target
    constant.  If PREV isn't nullptr, it has the RTL info from the
    previous iteration.  */
 
 rtx
-builtin_memset_read_str (void *data, void *prevp,
+builtin_memset_read_str (void *data, void *prev,
 			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			 scalar_int_mode mode)
 {
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
+  rtx target;
+
+  /* Don't use the previous value if size is 1.  */
+  if (GET_MODE_SIZE (mode) != 1)
     {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
+      target = gen_memset_value_from_prev (prev, mode);
+      if (target != nullptr)
+	return target;
     }
 
   const char *c = (const char *) data;
-  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
+  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
+  memset (p, *c, GET_MODE_SIZE (QImode));
+  rtx src = c_readstr (p, QImode);
+  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
+				 mode);
+  if (target != nullptr)
+    return target;
+
+  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
 
   memset (p, *c, GET_MODE_SIZE (mode));
 
@@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
    nullptr, it has the RTL info from the previous iteration.  */
 
 static rtx
-builtin_memset_gen_str (void *data, void *prevp,
+builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			scalar_int_mode mode)
 {
@@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
   size_t size;
   char *p;
 
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
-    {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
-
-      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
-      if (target != nullptr)
-	return target;
-    }
-
   size = GET_MODE_SIZE (mode);
   if (size == 1)
     return (rtx) data;
 
+  target = gen_memset_value_from_prev (prev, mode);
+  if (target != nullptr)
+    return target;
+
+  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
+				 mode);
+  if (target != nullptr)
+    return target;
+
   p = XALLOCAVEC (char, size);
   memset (p, 1, size);
   coeff = c_readstr (p, mode);
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 00caf3844cc..a798fb1a97e 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_const_duplicate@var{m}} instruction pattern
+@item @samp{vec_const_duplicate@var{m}}
+Initialize vector output operand 0 in mode @var{m} so that each element
+has the value given by constant input operand 1.
+
+This pattern only handles duplicates of constant inputs.  Non-constant
+vectors go through the @code{vec_duplicate@var{m}} pattern instead.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_series@var{m}} instruction pattern
 @item @samp{vec_series@var{m}}
 Initialize vector output operand 0 so that element @var{i} is equal to
@@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
 If this pattern is not defined, a call to the library function
 @code{__clear_cache} is used.
 
+@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
+@item @samp{integer_extract@var{m}@var{n}}
+Extract the low-order bits of an integer value in @code{TImode},
+@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
+@var{n} and operand 0 receives the extracted value in mode @var{m}.
+
 @end table
 
 @end ifset
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index e3a080e4a7c..8ccc262b1fc 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
  @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
+This hook should return an rtx for scratch register in @var{mode} to
+be used by memset broadcast.  The default is @code{gen_reg_rtx}.
+@end deftypefn
+
 @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
 This target hook returns a new value for the number of times @var{loop}
 should be unrolled. The parameter @var{nunroll} is the number of times
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d9fbbe20e6f..99bf01fe25d 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7960,6 +7960,8 @@ lists.
 
 @hook TARGET_GEN_CCMP_NEXT
 
+@hook TARGET_GEN_MEMSET_SCRATCH_RTX
+
 @hook TARGET_LOOP_UNROLL_ADJUST
 
 @defmac POWI_MAX_MULTS
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..fd8ab8b4a26 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
 
 OPTAB_CD (while_ult_optab, "while_ult$a$b")
 
+OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
+
 OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
 OPTAB_NX(add_optab, "add$F$a3")
 OPTAB_NX(add_optab, "add$Q$a3")
@@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
 OPTAB_D (len_load_optab, "len_load_$a")
 OPTAB_D (len_store_optab, "len_store_$a")
+
+OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
diff --git a/gcc/target.def b/gcc/target.def
index 1dffedc81e4..b89e7c24471 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2724,6 +2724,13 @@ DEFHOOK
  rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
  NULL)
 
+DEFHOOK
+(gen_memset_scratch_rtx,
+ "This hook should return an rtx for scratch register in @var{mode} to\n\
+be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
+ rtx, (machine_mode mode),
+ gen_reg_rtx)
+
 /* Return a new value for loop unroll size.  */
 DEFHOOK
 (loop_unroll_adjust,
-- 
2.31.1



* Re: [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-31 12:09               ` [PATCH] Add integer_extract and vec_const_duplicate optabs H.J. Lu
@ 2021-05-31 12:46                 ` Richard Biener
  2021-05-31 13:12                   ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-31 12:46 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > >
> > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > >           MODE)
> > > > >      This function returns the RTL of a register containing
> > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > >
> > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > >
> > > I tried.   It doesn't even work on x86.  See:
> > >
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> >
> > Not sure what I should read from there...
> >
> > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > work here.   Each backend may need its own special handling.
> >
> > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > sorry.  Since we're doing code generation the canonical way to communicate
> > with backends should be optabs, not some set of disconnected target hooks.
> > But as said, I probably don't know enough of RTL to see why it's the only way.
> >
> > Richard.
>
> Here is the patch to add optabs instead.  Does it look OK?
>
> Thanks.
>
> H.J.
> ---
> Add 2 optabs:
>
> 1. integer_extract: Extract lower bit value from the integer value in
> TImode, OImode or XImode.

That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
existing target hooks verifying subreg validity - why's that not a good
fit here?  ISTR you say gen_lowpart () doesn't work (or was it
simplify_gen_subreg?), why's that so?

> 2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
> similar to vec_duplicate.  Since the resulting vector is computable at
> compile-time, vec_duplicate may not be faster and backend can opt out
> broadcasting from a constant while opting in broadcasting from a
> variable.

Is it for the latter that you choose to use a new optab since
vec_duplicate is not allowed to FAIL?  You should probably document
that the constant value duplicated should be of the component mode of
m.
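
For reference, the broadcast under discussion can be sketched in plain C,
outside of GCC.  This is only an illustration: broadcast_byte_mul and
broadcast_byte_vec are hypothetical names, not GCC functions.  The first
mirrors the 0x01010101*DATA trick from the hook documentation quoted
above, widened to 8 bytes; the second fills a vector-like buffer whose
elements are single bytes, i.e. the component mode is QImode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: replicate BYTE into all 8 bytes of a 64-bit
   integer.  Multiplying by 0x0101...01 cannot carry between byte
   lanes because BYTE is at most 0xff.  */
static uint64_t
broadcast_byte_mul (uint8_t byte)
{
  return (uint64_t) byte * UINT64_C (0x0101010101010101);
}

/* Hypothetical helper: fill a buffer of NUNITS byte elements with
   BYTE, which is what a QImode -> VnQImode duplicate computes.  */
static void
broadcast_byte_vec (uint8_t *vec, size_t nunits, uint8_t byte)
{
  assert (nunits > 0);
  memset (vec, byte, nunits);
}
```

Both forms produce the same bytes; the optab only changes which
instruction the backend gets to pick for it.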

> and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
> target instructions to duplicate QImode value to TImode, OImode or XImode
> value for memset.
>
> Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
> scratch register to avoid stack realignment when expanding memset.
>
>         PR middle-end/90773
>         * builtins.c (gen_memset_value_from_prev): New function.
>         (gen_memset_broadcast): Likewise.
>         (builtin_memset_read_str): Use gen_memset_value_from_prev
>         and gen_memset_broadcast.
>         (builtin_memset_gen_str): Likewise.
>         * optabs.def: Add integer_extract and vec_const_duplicate.
>         * target.def (gen_memset_scratch_rtx): New hook.
>         * doc/md.texi: Document vec_const_duplicate and integer_extract.
>         * doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
>         * doc/tm.texi: Regenerated.
> ---
>  gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
>  gcc/doc/md.texi    |  16 +++++++
>  gcc/doc/tm.texi    |   5 ++
>  gcc/doc/tm.texi.in |   2 +
>  gcc/optabs.def     |   4 ++
>  gcc/target.def     |   7 +++
>  6 files changed, 131 insertions(+), 20 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index af1fe49bb48..7683169eb96 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
>    return NULL_RTX;
>  }
>
> +/* Return the RTL of a register in MODE generated from PREV in the
> +   previous iteration.  */
> +
> +static rtx
> +gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> +{
> +  rtx target = nullptr;
> +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> +  if (prev != nullptr && prev->data != nullptr)
> +    {
> +      /* Use the previous data in the same mode.  */
> +      if (prev->mode == mode)
> +       return prev->data;
> +
> +      /* Extract the RTL in MODE from PREV.  */
> +      enum insn_code icode
> +       = convert_optab_handler (integer_extract_optab, mode,
> +                                prev->mode);
> +      if (icode != CODE_FOR_nothing)
> +       {
> +         target = gen_reg_rtx (mode);
> +         class expand_operand ops[2];
> +         create_output_operand (&ops[0], target, mode);
> +         create_input_operand (&ops[1], prev->data, prev->mode);
> +         expand_insn (icode, 2, ops);
> +         if (!rtx_equal_p (target, ops[0].value))
> +           emit_move_insn (target, ops[0].value);
> +       }
> +      else
> +       target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> +    }
> +  return target;
> +}
> +
> +/* Return the RTL of a register in MODE broadcasted from DATA.  */
> +
> +static rtx
> +gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
> +                     scalar_int_mode mode)
> +{
> +  /* Skip if regno_reg_rtx isn't initialized.  */
> +  if (!regno_reg_rtx)
> +    return nullptr;
> +
> +  rtx target = nullptr;
> +
> +  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> +  machine_mode vector_mode;
> +  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> +    gcc_unreachable ();
> +
> +  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
> +  if (icode != CODE_FOR_nothing)
> +    {
> +      target = targetm.gen_memset_scratch_rtx (vector_mode);
> +      class expand_operand ops[2];
> +      create_output_operand (&ops[0], target, vector_mode);
> +      create_input_operand (&ops[1], (rtx) data, QImode);
> +      expand_insn (icode, 2, ops);
> +      if (!rtx_equal_p (target, ops[0].value))
> +       emit_move_insn (target, ops[0].value);
> +      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> +       target = gen_rtx_REG (mode, REGNO (target));
> +      else
> +       target = convert_to_mode (mode, target, 1);
> +    }
> +
> +  return target;
> +}
> +
>  /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
>     bytes from constant string DATA + OFFSET and return it as target
>     constant.  If PREV isn't nullptr, it has the RTL info from the
>     previous iteration.  */
>
>  rtx
> -builtin_memset_read_str (void *data, void *prevp,
> +builtin_memset_read_str (void *data, void *prev,
>                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                          scalar_int_mode mode)
>  {
> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> -  if (prev != nullptr && prev->data != nullptr)
> +  rtx target;
> +
> +  /* Don't use the previous value if size is 1.  */
> +  if (GET_MODE_SIZE (mode) != 1)
>      {
> -      /* Use the previous data in the same mode.  */
> -      if (prev->mode == mode)
> -       return prev->data;
> +      target = gen_memset_value_from_prev (prev, mode);
> +      if (target != nullptr)
> +       return target;
>      }
>
>    const char *c = (const char *) data;
> -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
> +  memset (p, *c, GET_MODE_SIZE (QImode));
> +  rtx src = c_readstr (p, QImode);
> +  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
> +                                mode);
> +  if (target != nullptr)
> +    return target;
> +
> +  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
>
>    memset (p, *c, GET_MODE_SIZE (mode));
>
> @@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
>     nullptr, it has the RTL info from the previous iteration.  */
>
>  static rtx
> -builtin_memset_gen_str (void *data, void *prevp,
> +builtin_memset_gen_str (void *data, void *prev,
>                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                         scalar_int_mode mode)
>  {
> @@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
>    size_t size;
>    char *p;
>
> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> -  if (prev != nullptr && prev->data != nullptr)
> -    {
> -      /* Use the previous data in the same mode.  */
> -      if (prev->mode == mode)
> -       return prev->data;
> -
> -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> -      if (target != nullptr)
> -       return target;
> -    }
> -
>    size = GET_MODE_SIZE (mode);
>    if (size == 1)
>      return (rtx) data;
>
> +  target = gen_memset_value_from_prev (prev, mode);
> +  if (target != nullptr)
> +    return target;
> +
> +  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
> +                                mode);
> +  if (target != nullptr)
> +    return target;
> +
>    p = XALLOCAVEC (char, size);
>    memset (p, 1, size);
>    coeff = c_readstr (p, mode);
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 00caf3844cc..a798fb1a97e 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
>
>  This pattern is not allowed to @code{FAIL}.
>
> +@cindex @code{vec_const_duplicate@var{m}} instruction pattern
> +@item @samp{vec_const_duplicate@var{m}}
> +Initialize vector output operand 0 in mode @var{m} so that each element
> +has the value given by constant input operand 1.
> +
> +This pattern only handles duplicates of constant inputs.  Non-constant
> +vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
>  @cindex @code{vec_series@var{m}} instruction pattern
>  @item @samp{vec_series@var{m}}
>  Initialize vector output operand 0 so that element @var{i} is equal to
> @@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
>  If this pattern is not defined, a call to the library function
>  @code{__clear_cache} is used.
>
> +@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
> +@item @samp{integer_extract@var{m}@var{n}}
> +Extract lower bit value from the integer value in @code{TImode},
> +@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
> +@var{n} and operand 0 stores value to be extracted in mode @var{m}.
> +
>  @end table
>
>  @end ifset
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index e3a080e4a7c..8ccc262b1fc 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
>  @end deftypefn
>
> +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> +This hook should return an rtx for scratch register in @var{mode} to
> +be used by memset broadcast.  The default is @code{gen_reg_rtx}.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
>  This target hook returns a new value for the number of times @var{loop}
>  should be unrolled. The parameter @var{nunroll} is the number of times
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index d9fbbe20e6f..99bf01fe25d 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -7960,6 +7960,8 @@ lists.
>
>  @hook TARGET_GEN_CCMP_NEXT
>
> +@hook TARGET_GEN_MEMSET_SCRATCH_RTX
> +
>  @hook TARGET_LOOP_UNROLL_ADJUST
>
>  @defmac POWI_MAX_MULTS
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..fd8ab8b4a26 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
>
>  OPTAB_CD (while_ult_optab, "while_ult$a$b")
>
> +OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
> +
>  OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
>  OPTAB_NX(add_optab, "add$F$a3")
>  OPTAB_NX(add_optab, "add$Q$a3")
> @@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +
> +OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
> diff --git a/gcc/target.def b/gcc/target.def
> index 1dffedc81e4..b89e7c24471 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2724,6 +2724,13 @@ DEFHOOK
>   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
>   NULL)
>
> +DEFHOOK
> +(gen_memset_scratch_rtx,
> + "This hook should return an rtx for scratch register in @var{mode} to\n\
> +be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
> + rtx, (machine_mode mode),
> + gen_reg_rtx)
> +
>  /* Return a new value for loop unroll size.  */
>  DEFHOOK
>  (loop_unroll_adjust,
> --
> 2.31.1
>


* Re: [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-31 12:46                 ` Richard Biener
@ 2021-05-31 13:12                   ` H.J. Lu
  2021-05-31 13:25                     ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-31 13:12 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Mon, May 31, 2021 at 5:46 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > >
> > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > >           MODE)
> > > > > >      This function returns the RTL of a register containing
> > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > > >
> > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > >
> > > > I tried.   It doesn't even work on x86.  See:
> > > >
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > >
> > > Not sure what I should read from there...
> > >
> > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > work here.   Each backend may need its own special handling.
> > >
> > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > sorry.  Since we're doing code generation the canonical way to communicate
> > > with backends should be optabs, not some set of disconnected target hooks.
> > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > >
> > > Richard.
> >
> > Here is the patch to add optabs instead.  Does it look OK?
> >
> > Thanks.
> >
> > H.J.
> > ---
> > Add 2 optabs:
> >
> > 1. integer_extract: Extract lower bit value from the integer value in
> > TImode, OImode or XImode.
>
> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> existing target hooks verifying subreg validity - why's that not a good
> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> simplify_gen_subreg?), why's that so?

{TI,OI,XI}mode are storage-only integer modes.  subreg doesn't work
well on them.  I got

[hjl@gnu-cfl-2 pieces]$ cat s2.i
extern void *ops;

void
foo (int c)
{
  __builtin_memset (ops, c, 34);
}
[hjl@gnu-cfl-2 pieces]$ make s2.s
/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
-O2 -march=haswell -S s2.i
during RTL pass: reload
s2.i: In function ‘foo’:
s2.i:7:1: internal compiler error: maximum number of generated reload
insns per insn achieved (90)
    7 | }
      | ^
0x1050734 lra_constraints(bool)
/export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
0x1039536 lra(_IO_FILE*)
/export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
0xfe1140 do_reload
/export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
0xfe162e execute
/export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
make: *** [Makefile:32: s2.s] Error 1
[hjl@gnu-cfl-2 pieces]$

due to

(insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
                (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
*)ops.0_1]+32 S2 A8])
        (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
     (nil))

The new optab gives us

(insn 12 11 13 2 (set (reg:TI 88)
        (reg:TI 51 xmm15)) "s2.i":6:3 -1
     (nil))
(insn 13 12 14 2 (set (reg:SI 89)
        (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
     (nil))
(insn 14 13 15 2 (set (reg:HI 87)
        (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
     (nil))
(insn 15 14 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
                (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
*)ops.0_1]+32 S2 A8])
        (reg:HI 87)) "s2.i":6:3 -1
     (nil))
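
To show where the HImode store at offset 32 comes from: the 34-byte
memset is split into one 32-byte (OImode) piece and one 2-byte (HImode)
tail, and the tail value has to be derived from the OImode register.
A simplified model of that split in plain C, assuming a greedy
largest-first strategy (split_by_pieces and struct piece are
hypothetical names; the real by-pieces code in GCC has more cases):

```c
#include <assert.h>
#include <stddef.h>

struct piece
{
  size_t offset;  /* Byte offset of the store.  */
  size_t size;    /* Width of the store in bytes.  */
};

/* Hypothetical, simplified model: split LEN bytes into power-of-two
   pieces, largest first, none wider than MAX_PIECE.  Return the
   number of pieces written to OUT.  */
static size_t
split_by_pieces (size_t len, size_t max_piece,
                 struct piece *out, size_t out_cap)
{
  /* MAX_PIECE must be a power of two.  */
  assert (max_piece != 0 && (max_piece & (max_piece - 1)) == 0);

  size_t n = 0;
  size_t offset = 0;
  for (size_t size = max_piece; size != 0; size >>= 1)
    while (len - offset >= size)
      {
        assert (n < out_cap);
        out[n].offset = offset;
        out[n].size = size;
        n++;
        offset += size;
      }
  return n;
}
```

With len = 34 and max_piece = 32 this yields a 32-byte piece at offset 0
and a 2-byte piece at offset 32, matching the mem:HI at +32 above.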

> > 2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
> > similar to vec_duplicate.  Since the resulting vector is computable at
> > compile-time, vec_duplicate may not be faster and backend can opt out
> > broadcasting from a constant while opting in broadcasting from a
> > variable.
>
> Is it for the latter that you choose to use a new optab since
> vec_duplicate is not allowed to FAIL?  You should probably document

Yes.

> that the constant value duplicated should be of the component mode of
> m.

I added:

@cindex @code{vec_const_duplicate@var{m}} instruction pattern
@item @samp{vec_const_duplicate@var{m}}
Initialize vector output operand 0 in mode @var{m} so that each element
has the value given by constant input operand 1.

This pattern only handles duplicates of constant inputs.  Non-constant
vectors go through the @code{vec_duplicate@var{m}} pattern instead.

This pattern is not allowed to @code{FAIL}.

which is in my patch.

> > and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
> > target instructions to duplicate QImode value to TImode, OImode or XImode
> > value for memset.
> >
> > Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
> > scratch register to avoid stack realignment when expanding memset.
> >
> >         PR middle-end/90773
> >         * builtins.c (gen_memset_value_from_prev): New function.
> >         (gen_memset_broadcast): Likewise.
> >         (builtin_memset_read_str): Use gen_memset_value_from_prev
> >         and gen_memset_broadcast.
> >         (builtin_memset_gen_str): Likewise.
> >         * optabs.def: Add integer_extract and vec_const_duplicate.
> >         * target.def (gen_memset_scratch_rtx): New hook.
> >         * doc/md.texi: Document vec_const_duplicate and integer_extract.
> >         * doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
> >         * doc/tm.texi: Regenerated.
> > ---
> >  gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
> >  gcc/doc/md.texi    |  16 +++++++
> >  gcc/doc/tm.texi    |   5 ++
> >  gcc/doc/tm.texi.in |   2 +
> >  gcc/optabs.def     |   4 ++
> >  gcc/target.def     |   7 +++
> >  6 files changed, 131 insertions(+), 20 deletions(-)
> >
> > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > index af1fe49bb48..7683169eb96 100644
> > --- a/gcc/builtins.c
> > +++ b/gcc/builtins.c
> > @@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
> >    return NULL_RTX;
> >  }
> >
> > +/* Return the RTL of a register in MODE generated from PREV in the
> > +   previous iteration.  */
> > +
> > +static rtx
> > +gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > +{
> > +  rtx target = nullptr;
> > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > +  if (prev != nullptr && prev->data != nullptr)
> > +    {
> > +      /* Use the previous data in the same mode.  */
> > +      if (prev->mode == mode)
> > +       return prev->data;
> > +
> > +      /* Extract the RTL in MODE from PREV.  */
> > +      enum insn_code icode
> > +       = convert_optab_handler (integer_extract_optab, mode,
> > +                                prev->mode);
> > +      if (icode != CODE_FOR_nothing)
> > +       {
> > +         target = gen_reg_rtx (mode);
> > +         class expand_operand ops[2];
> > +         create_output_operand (&ops[0], target, mode);
> > +         create_input_operand (&ops[1], prev->data, prev->mode);
> > +         expand_insn (icode, 2, ops);
> > +         if (!rtx_equal_p (target, ops[0].value))
> > +           emit_move_insn (target, ops[0].value);
> > +       }
> > +      else
> > +       target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > +    }
> > +  return target;
> > +}
> > +
> > +/* Return the RTL of a register in MODE broadcasted from DATA.  */
> > +
> > +static rtx
> > +gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
> > +                     scalar_int_mode mode)
> > +{
> > +  /* Skip if regno_reg_rtx isn't initialized.  */
> > +  if (!regno_reg_rtx)
> > +    return nullptr;
> > +
> > +  rtx target = nullptr;
> > +
> > +  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> > +  machine_mode vector_mode;
> > +  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> > +    gcc_unreachable ();
> > +
> > +  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
> > +  if (icode != CODE_FOR_nothing)
> > +    {
> > +      target = targetm.gen_memset_scratch_rtx (vector_mode);
> > +      class expand_operand ops[2];
> > +      create_output_operand (&ops[0], target, vector_mode);
> > +      create_input_operand (&ops[1], (rtx) data, QImode);
> > +      expand_insn (icode, 2, ops);
> > +      if (!rtx_equal_p (target, ops[0].value))
> > +       emit_move_insn (target, ops[0].value);
> > +      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> > +       target = gen_rtx_REG (mode, REGNO (target));
> > +      else
> > +       target = convert_to_mode (mode, target, 1);
> > +    }
> > +
> > +  return target;
> > +}
> > +
> >  /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> >     bytes from constant string DATA + OFFSET and return it as target
> >     constant.  If PREV isn't nullptr, it has the RTL info from the
> >     previous iteration.  */
> >
> >  rtx
> > -builtin_memset_read_str (void *data, void *prevp,
> > +builtin_memset_read_str (void *data, void *prev,
> >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                          scalar_int_mode mode)
> >  {
> > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > -  if (prev != nullptr && prev->data != nullptr)
> > +  rtx target;
> > +
> > +  /* Don't use the previous value if size is 1.  */
> > +  if (GET_MODE_SIZE (mode) != 1)
> >      {
> > -      /* Use the previous data in the same mode.  */
> > -      if (prev->mode == mode)
> > -       return prev->data;
> > +      target = gen_memset_value_from_prev (prev, mode);
> > +      if (target != nullptr)
> > +       return target;
> >      }
> >
> >    const char *c = (const char *) data;
> > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
> > +  memset (p, *c, GET_MODE_SIZE (QImode));
> > +  rtx src = c_readstr (p, QImode);
> > +  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
> > +                                mode);
> > +  if (target != nullptr)
> > +    return target;
> > +
> > +  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> >
> >    memset (p, *c, GET_MODE_SIZE (mode));
> >
> > @@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
> >     nullptr, it has the RTL info from the previous iteration.  */
> >
> >  static rtx
> > -builtin_memset_gen_str (void *data, void *prevp,
> > +builtin_memset_gen_str (void *data, void *prev,
> >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                         scalar_int_mode mode)
> >  {
> > @@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
> >    size_t size;
> >    char *p;
> >
> > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > -  if (prev != nullptr && prev->data != nullptr)
> > -    {
> > -      /* Use the previous data in the same mode.  */
> > -      if (prev->mode == mode)
> > -       return prev->data;
> > -
> > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > -      if (target != nullptr)
> > -       return target;
> > -    }
> > -
> >    size = GET_MODE_SIZE (mode);
> >    if (size == 1)
> >      return (rtx) data;
> >
> > +  target = gen_memset_value_from_prev (prev, mode);
> > +  if (target != nullptr)
> > +    return target;
> > +
> > +  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
> > +                                mode);
> > +  if (target != nullptr)
> > +    return target;
> > +
> >    p = XALLOCAVEC (char, size);
> >    memset (p, 1, size);
> >    coeff = c_readstr (p, mode);
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 00caf3844cc..a798fb1a97e 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
> >
> >  This pattern is not allowed to @code{FAIL}.
> >
> > +@cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > +@item @samp{vec_const_duplicate@var{m}}
> > +Initialize vector output operand 0 in mode @var{m} so that each element
> > +has the value given by constant input operand 1.
> > +
> > +This pattern only handles duplicates of constant inputs.  Non-constant
> > +vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> > +
> > +This pattern is not allowed to @code{FAIL}.
> > +
> >  @cindex @code{vec_series@var{m}} instruction pattern
> >  @item @samp{vec_series@var{m}}
> >  Initialize vector output operand 0 so that element @var{i} is equal to
> > @@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
> >  If this pattern is not defined, a call to the library function
> >  @code{__clear_cache} is used.
> >
> > +@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
> > +@item @samp{integer_extract@var{m}@var{n}}
> > +Extract lower bit value from the integer value in @code{TImode},
> > +@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
> > +@var{n} and operand 0 stores value to be extracted in mode @var{m}.
> > +
> >  @end table
> >
> >  @end ifset
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index e3a080e4a7c..8ccc262b1fc 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
> >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> > +This hook should return an rtx for scratch register in @var{mode} to
> > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.
> > +@end deftypefn
> > +
> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> >  This target hook returns a new value for the number of times @var{loop}
> >  should be unrolled. The parameter @var{nunroll} is the number of times
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index d9fbbe20e6f..99bf01fe25d 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -7960,6 +7960,8 @@ lists.
> >
> >  @hook TARGET_GEN_CCMP_NEXT
> >
> > +@hook TARGET_GEN_MEMSET_SCRATCH_RTX
> > +
> >  @hook TARGET_LOOP_UNROLL_ADJUST
> >
> >  @defmac POWI_MAX_MULTS
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index b192a9d070b..fd8ab8b4a26 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
> >
> >  OPTAB_CD (while_ult_optab, "while_ult$a$b")
> >
> > +OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
> > +
> >  OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
> >  OPTAB_NX(add_optab, "add$F$a3")
> >  OPTAB_NX(add_optab, "add$Q$a3")
> > @@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> >  OPTAB_D (len_load_optab, "len_load_$a")
> >  OPTAB_D (len_store_optab, "len_store_$a")
> > +
> > +OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
> > diff --git a/gcc/target.def b/gcc/target.def
> > index 1dffedc81e4..b89e7c24471 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2724,6 +2724,13 @@ DEFHOOK
> >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> >   NULL)
> >
> > +DEFHOOK
> > +(gen_memset_scratch_rtx,
> > + "This hook should return an rtx for scratch register in @var{mode} to\n\
> > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
> > + rtx, (machine_mode mode),
> > + gen_reg_rtx)
> > +
> >  /* Return a new value for loop unroll size.  */
> >  DEFHOOK
> >  (loop_unroll_adjust,
> > --
> > 2.31.1
> >



-- 
H.J.


* Re: [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-31 13:12                   ` H.J. Lu
@ 2021-05-31 13:25                     ` Richard Biener
  2021-05-31 13:32                       ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-05-31 13:25 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, May 31, 2021 at 5:46 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > >
> > > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > >           MODE)
> > > > > > >      This function returns the RTL of a register containing
> > > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > > > >
> > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > >
> > > > > I tried.   It doesn't even work on x86.  See:
> > > > >
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > >
> > > > Not sure what I should read from there...
> > > >
> > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > work here.   Each backend may need its own special handling.
> > > >
> > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > >
> > > > Richard.
> > >
> > > Here is the patch to add optabs instead.  Does it look OK?
> > >
> > > Thanks.
> > >
> > > H.J.
> > > ---
> > > Add 2 optabs:
> > >
> > > 1. integer_extract: Extract lower bit value from the integer value in
> > > TImode, OImode or XImode.
> >
> > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > existing target hooks verifying subreg validity - why's that not a good
> > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > simplify_gen_subreg?), why's that so?
>
> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> well on them.  I got
>
> [hjl@gnu-cfl-2 pieces]$ cat s2.i
> extern void *ops;
>
> void
> foo (int c)
> {
>   __builtin_memset (ops, c, 34);
> }
> [hjl@gnu-cfl-2 pieces]$ make s2.s
> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> -O2 -march=haswell -S s2.i
> during RTL pass: reload
> s2.i: In function ‘foo’:
> s2.i:7:1: internal compiler error: maximum number of generated reload
> insns per insn achieved (90)
>     7 | }
>       | ^
> 0x1050734 lra_constraints(bool)
> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> 0x1039536 lra(_IO_FILE*)
> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> 0xfe1140 do_reload
> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> 0xfe162e execute
> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions.
> make: *** [Makefile:32: s2.s] Error 1
> [hjl@gnu-cfl-2 pieces]$
>
> due to
>
> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
>                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> *)ops.0_1]+32 S2 A8])
>         (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
>      (nil))
>
> The new optab gives us
>
> (insn 12 11 13 2 (set (reg:TI 88)
>         (reg:TI 51 xmm15)) "s2.i":6:3 -1
>      (nil))
> (insn 13 12 14 2 (set (reg:SI 89)
>         (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
>      (nil))
> (insn 14 13 15 2 (set (reg:HI 87)
>         (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
>      (nil))

That looks odd to me.  What's the final result after LRA?  I think
we should make lowpart_subreg work on {XI,OI,TI}mode.
At most two steps should be necessary:
xmm -> gpr, gpr -> subreg, or just gpr -> subreg.  So the expander
code in memset should try to generate the subreg directly
and, if that fails, try a word_mode subreg followed by the subreg.
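
To make the order of attempts concrete, here is a minimal sketch in
plain C rather than GCC internals.  It models a wide value as a
little-endian byte image.  direct_lowpart_ok, lowpart_word and
extract_low16 are hypothetical names for this illustration only, and
the rule "anything wider than a 64-bit word cannot be narrowed
directly" is an assumption of the model, not a statement about what
the x86 backend accepts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "is a direct lowpart subreg of a
   SRC_BYTES-wide value valid?".  Assumption for this model only:
   values wider than a 64-bit word cannot be narrowed directly.  */
static bool
direct_lowpart_ok (size_t src_bytes)
{
  return src_bytes <= sizeof (uint64_t);
}

/* Low 8 bytes of a little-endian byte image, i.e. the word-sized
   lowpart.  SRC must be at least 8 bytes long.  */
static uint64_t
lowpart_word (const unsigned char *src)
{
  uint64_t w = 0;
  for (size_t i = 0; i < sizeof w; i++)
    w |= (uint64_t) src[i] << (8 * i);
  return w;
}

/* Extract the low 16 bits of SRC (SRC_BYTES wide, at least 2 bytes).
   *STEPS receives the number of narrowing steps used: 1 or 2.  */
static uint16_t
extract_low16 (const unsigned char *src, size_t src_bytes, int *steps)
{
  if (direct_lowpart_ok (src_bytes))
    {
      /* Direct narrowing: one step.  */
      *steps = 1;
      return (uint16_t) (src[0] | (src[1] << 8));
    }

  /* Fallback: go through the word-sized lowpart first, then narrow
     that word.  Two steps in total.  */
  uint64_t word = lowpart_word (src);
  *steps = 2;
  return (uint16_t) word;
}
```

With a 32-byte source the model takes the two-step path; with an
8-byte source it narrows in one step.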

> (insn 15 14 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
>                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> *)ops.0_1]+32 S2 A8])
>         (reg:HI 87)) "s2.i":6:3 -1
>      (nil))
>
> > > 2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
> > > similar to vec_duplicate.  Since the resulting vector is computable at
> > > compile-time, vec_duplicate may not be faster and backend can opt out
> > > broadcasting from a constant while opting in broadcasting from a
> > > variable.
> >
> > Is it for the latter that you choose to use a new optab since
> > vec_duplicate is not allowed to FAIL?  You should probably document
>
> Yes.
>
> > that the constant value duplicated should be of the component mode of
> > m.
>
> I added:
>
> @cindex @code{vec_const_duplicate@var{m}} instruction pattern
> @item @samp{vec_const_duplicate@var{m}}
> Initialize vector output operand 0 in mode @var{m} so that each element
> has the value given by constant input operand 1.
>
> This pattern only handles duplicates of constant inputs.  Non-constant
> vectors go through the @code{vec_duplicate@var{m}} pattern instead.
>
> This pattern is not allowed to @code{FAIL}.
>
> which is in my patch.
>
> > > and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
> > > target instructions to duplicate QImode value to TImode, OImode or XImode
> > > > value for memset.
> > >
> > > Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
> > > scratch register to avoid stack realignment when expanding memset.
> > >
> > >         PR middle-end/90773
> > >         * builtins.c (gen_memset_value_from_prev): New function.
> > >         (gen_memset_broadcast): Likewise.
> > >         (builtin_memset_read_str): Use gen_memset_value_from_prev
> > >         and gen_memset_broadcast.
> > >         (builtin_memset_gen_str): Likewise.
> > >         * optabs.def: Add integer_extract and vec_const_duplicate.
> > >         * target.def (gen_memset_scratch_rtx): New hook.
> > >         * doc/md.texi: Document vec_const_duplicate and integer_extract.
> > >         * doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
> > >         * doc/tm.texi: Regenerated.
> > > ---
> > >  gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
> > >  gcc/doc/md.texi    |  16 +++++++
> > >  gcc/doc/tm.texi    |   5 ++
> > >  gcc/doc/tm.texi.in |   2 +
> > >  gcc/optabs.def     |   4 ++
> > >  gcc/target.def     |   7 +++
> > >  6 files changed, 131 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > index af1fe49bb48..7683169eb96 100644
> > > --- a/gcc/builtins.c
> > > +++ b/gcc/builtins.c
> > > @@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
> > >    return NULL_RTX;
> > >  }
> > >
> > > +/* Return the RTL of a register in MODE generated from PREV in the
> > > +   previous iteration.  */
> > > +
> > > +static rtx
> > > +gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > > +{
> > > +  rtx target = nullptr;
> > > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > +  if (prev != nullptr && prev->data != nullptr)
> > > +    {
> > > +      /* Use the previous data in the same mode.  */
> > > +      if (prev->mode == mode)
> > > +       return prev->data;
> > > +
> > > +      /* Extract the RTL in MODE from PREV.  */
> > > +      enum insn_code icode
> > > +       = convert_optab_handler (integer_extract_optab, mode,
> > > +                                prev->mode);
> > > +      if (icode != CODE_FOR_nothing)
> > > +       {
> > > +         target = gen_reg_rtx (mode);
> > > +         class expand_operand ops[2];
> > > +         create_output_operand (&ops[0], target, mode);
> > > +         create_input_operand (&ops[1], prev->data, prev->mode);
> > > +         expand_insn (icode, 2, ops);
> > > +         if (!rtx_equal_p (target, ops[0].value))
> > > +           emit_move_insn (target, ops[0].value);
> > > +       }
> > > +      else
> > > +       target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > +    }
> > > +  return target;
> > > +}
> > > +
> > > +/* Return the RTL of a register in MODE broadcasted from DATA.  */
> > > +
> > > +static rtx
> > > +gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
> > > +                     scalar_int_mode mode)
> > > +{
> > > +  /* Skip if regno_reg_rtx isn't initialized.  */
> > > +  if (!regno_reg_rtx)
> > > +    return nullptr;
> > > +
> > > +  rtx target = nullptr;
> > > +
> > > +  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> > > +  machine_mode vector_mode;
> > > +  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> > > +    gcc_unreachable ();
> > > +
> > > +  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
> > > +  if (icode != CODE_FOR_nothing)
> > > +    {
> > > +      target = targetm.gen_memset_scratch_rtx (vector_mode);
> > > +      class expand_operand ops[2];
> > > +      create_output_operand (&ops[0], target, vector_mode);
> > > +      create_input_operand (&ops[1], (rtx) data, QImode);
> > > +      expand_insn (icode, 2, ops);
> > > +      if (!rtx_equal_p (target, ops[0].value))
> > > +       emit_move_insn (target, ops[0].value);
> > > +      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> > > +       target = gen_rtx_REG (mode, REGNO (target));
> > > +      else
> > > +       target = convert_to_mode (mode, target, 1);
> > > +    }
> > > +
> > > +  return target;
> > > +}
> > > +
> > >  /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> > >     bytes from constant string DATA + OFFSET and return it as target
> > >     constant.  If PREV isn't nullptr, it has the RTL info from the
> > >     previous iteration.  */
> > >
> > >  rtx
> > > -builtin_memset_read_str (void *data, void *prevp,
> > > +builtin_memset_read_str (void *data, void *prev,
> > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                          scalar_int_mode mode)
> > >  {
> > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > -  if (prev != nullptr && prev->data != nullptr)
> > > +  rtx target;
> > > +
> > > +  /* Don't use the previous value if size is 1.  */
> > > +  if (GET_MODE_SIZE (mode) != 1)
> > >      {
> > > -      /* Use the previous data in the same mode.  */
> > > -      if (prev->mode == mode)
> > > -       return prev->data;
> > > +      target = gen_memset_value_from_prev (prev, mode);
> > > +      if (target != nullptr)
> > > +       return target;
> > >      }
> > >
> > >    const char *c = (const char *) data;
> > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
> > > +  memset (p, *c, GET_MODE_SIZE (QImode));
> > > +  rtx src = c_readstr (p, QImode);
> > > +  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
> > > +                                mode);
> > > +  if (target != nullptr)
> > > +    return target;
> > > +
> > > +  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > >
> > >    memset (p, *c, GET_MODE_SIZE (mode));
> > >
> > > @@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
> > >     nullptr, it has the RTL info from the previous iteration.  */
> > >
> > >  static rtx
> > > -builtin_memset_gen_str (void *data, void *prevp,
> > > +builtin_memset_gen_str (void *data, void *prev,
> > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >                         scalar_int_mode mode)
> > >  {
> > > @@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
> > >    size_t size;
> > >    char *p;
> > >
> > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > -  if (prev != nullptr && prev->data != nullptr)
> > > -    {
> > > -      /* Use the previous data in the same mode.  */
> > > -      if (prev->mode == mode)
> > > -       return prev->data;
> > > -
> > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > -      if (target != nullptr)
> > > -       return target;
> > > -    }
> > > -
> > >    size = GET_MODE_SIZE (mode);
> > >    if (size == 1)
> > >      return (rtx) data;
> > >
> > > +  target = gen_memset_value_from_prev (prev, mode);
> > > +  if (target != nullptr)
> > > +    return target;
> > > +
> > > +  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
> > > +                                mode);
> > > +  if (target != nullptr)
> > > +    return target;
> > > +
> > >    p = XALLOCAVEC (char, size);
> > >    memset (p, 1, size);
> > >    coeff = c_readstr (p, mode);
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > index 00caf3844cc..a798fb1a97e 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
> > >
> > >  This pattern is not allowed to @code{FAIL}.
> > >
> > > +@cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > > +@item @samp{vec_const_duplicate@var{m}}
> > > +Initialize vector output operand 0 in mode @var{m} so that each element
> > > +has the value given by constant input operand 1.
> > > +
> > > +This pattern only handles duplicates of constant inputs.  Non-constant
> > > +vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> > > +
> > > +This pattern is not allowed to @code{FAIL}.
> > > +
> > >  @cindex @code{vec_series@var{m}} instruction pattern
> > >  @item @samp{vec_series@var{m}}
> > >  Initialize vector output operand 0 so that element @var{i} is equal to
> > > @@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
> > >  If this pattern is not defined, a call to the library function
> > >  @code{__clear_cache} is used.
> > >
> > > +@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
> > > +@item @samp{integer_extract@var{m}@var{n}}
> > > +Extract lower bit value from the integer value in @code{TImode},
> > > +@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
> > > +@var{n} and operand 0 stores value to be extracted in mode @var{m}.
> > > +
> > >  @end table
> > >
> > >  @end ifset
> > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > index e3a080e4a7c..8ccc262b1fc 100644
> > > --- a/gcc/doc/tm.texi
> > > +++ b/gcc/doc/tm.texi
> > > @@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
> > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > >  @end deftypefn
> > >
> > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> > > +This hook should return an rtx for scratch register in @var{mode} to
> > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.
> > > +@end deftypefn
> > > +
> > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > >  This target hook returns a new value for the number of times @var{loop}
> > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > index d9fbbe20e6f..99bf01fe25d 100644
> > > --- a/gcc/doc/tm.texi.in
> > > +++ b/gcc/doc/tm.texi.in
> > > @@ -7960,6 +7960,8 @@ lists.
> > >
> > >  @hook TARGET_GEN_CCMP_NEXT
> > >
> > > +@hook TARGET_GEN_MEMSET_SCRATCH_RTX
> > > +
> > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > >
> > >  @defmac POWI_MAX_MULTS
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index b192a9d070b..fd8ab8b4a26 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > >
> > >  OPTAB_CD (while_ult_optab, "while_ult$a$b")
> > >
> > > +OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
> > > +
> > >  OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
> > >  OPTAB_NX(add_optab, "add$F$a3")
> > >  OPTAB_NX(add_optab, "add$Q$a3")
> > > @@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> > >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> > >  OPTAB_D (len_load_optab, "len_load_$a")
> > >  OPTAB_D (len_store_optab, "len_store_$a")
> > > +
> > > +OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index 1dffedc81e4..b89e7c24471 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2724,6 +2724,13 @@ DEFHOOK
> > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > >   NULL)
> > >
> > > +DEFHOOK
> > > +(gen_memset_scratch_rtx,
> > > + "This hook should return an rtx for scratch register in @var{mode} to\n\
> > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
> > > + rtx, (machine_mode mode),
> > > + gen_reg_rtx)
> > > +
> > >  /* Return a new value for loop unroll size.  */
> > >  DEFHOOK
> > >  (loop_unroll_adjust,
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> H.J.

* Re: [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-31 13:25                     ` Richard Biener
@ 2021-05-31 13:32                       ` H.J. Lu
  2021-05-31 13:36                         ` H.J. Lu
  2021-05-31 20:22                         ` [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX H.J. Lu
  0 siblings, 2 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-31 13:32 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Mon, May 31, 2021 at 6:26 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > >
> > > > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > >           MODE)
> > > > > > > >      This function returns the RTL of a register containing
> > > > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > >
> > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > >
> > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > >
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > >
> > > > > Not sure what I should read from there...
> > > > >
> > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > work here.   Each backend may need its own special handling.
> > > > >
> > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > >
> > > > > Richard.
> > > >
> > > > Here is the patch to add optabs instead.  Does it look OK?
> > > >
> > > > Thanks.
> > > >
> > > > H.J.
> > > > ---
> > > > Add 2 optabs:
> > > >
> > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > TImode, OImode or XImode.
> > >
> > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > existing target hooks verifying subreg validity - why's that not a good
> > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > simplify_gen_subreg?), why's that so?
> >
> > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > well on them.  I got
> >
> > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > extern void *ops;
> >
> > void
> > foo (int c)
> > {
> >   __builtin_memset (ops, c, 34);
> > }
> > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > -O2 -march=haswell -S s2.i
> > during RTL pass: reload
> > s2.i: In function ‘foo’:
> > s2.i:7:1: internal compiler error: maximum number of generated reload
> > insns per insn achieved (90)
> >     7 | }
> >       | ^
> > 0x1050734 lra_constraints(bool)
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > 0x1039536 lra(_IO_FILE*)
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > 0xfe1140 do_reload
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > 0xfe162e execute
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > Please include the complete backtrace with any bug report.
> > See <https://gcc.gnu.org/bugs/> for instructions.
> > make: *** [Makefile:32: s2.s] Error 1
> > [hjl@gnu-cfl-2 pieces]$
> >
> > due to
> >
> > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > *)ops.0_1]+32 S2 A8])
> >         (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> >      (nil))
> >
> > The new optab gives us
> >
> > (insn 12 11 13 2 (set (reg:TI 88)
> >         (reg:TI 51 xmm15)) "s2.i":6:3 -1
> >      (nil))
> > (insn 13 12 14 2 (set (reg:SI 89)
> >         (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> >      (nil))
> > (insn 14 13 15 2 (set (reg:HI 87)
> >         (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> >      (nil))
>
> that looks odd to me - what's the final result after LRA?  I think

I got:

vmovd %edi, %xmm15
movq ops(%rip), %rdx
vpbroadcastb %xmm15, %ymm15
vmovq %xmm15, %rax    <<<< move to GPR
vmovdqu %ymm15, (%rdx)
movw %ax, 32(%rdx)   <<<< subreg of GPR
vzeroupper
ret
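
For readers who don't want to trace the assembly, the sequence above
can be modelled in plain C as follows.  This is only an illustrative
sketch: memset34_by_pieces is a hypothetical name, and the local
variables stand in for %ymm15 and %rax.

```c
#include <stdint.h>
#include <string.h>

/* Model of the sequence above for __builtin_memset (dst, c, 34).  */
static void
memset34_by_pieces (unsigned char *dst, int c)
{
  unsigned char vec[32];   /* stands in for %ymm15 */
  uint64_t gpr;            /* stands in for %rax */
  uint16_t low;

  /* vmovd + vpbroadcastb: replicate the byte across all 32 lanes.  */
  memset (vec, (unsigned char) c, sizeof vec);

  /* vmovq: copy the low 8 bytes of the vector into a GPR.  */
  memcpy (&gpr, vec, sizeof gpr);

  /* vmovdqu: one 32-byte store covers bytes 0..31.  */
  memcpy (dst, vec, sizeof vec);

  /* movw %ax: store the low 2 bytes of the GPR at offset 32.  Every
     byte of the GPR is the same, so byte order does not matter.  */
  low = (uint16_t) gpr;
  memcpy (dst + 32, &low, sizeof low);
}
```

The 34 bytes are covered by one 32-byte store plus one 2-byte store,
and the 2-byte value comes from the GPR copy rather than from a
subreg of the vector register.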

> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> Only two steps should be necessary at most:
> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> code in memset should try to generate the subreg directly

Generating the subreg didn't fail on x86 when I tried; the ICE only
shows up later, in LRA.

> and if that fails, try a word_mode subreg followed by the subreg.

I will try word_mode subreg.

>
> > (insn 15 14 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
> >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > *)ops.0_1]+32 S2 A8])
> >         (reg:HI 87)) "s2.i":6:3 -1
> >      (nil))
> >
> > > > 2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
> > > > similar to vec_duplicate.  Since the resulting vector is computable at
> > > > compile-time, vec_duplicate may not be faster and backend can opt out
> > > > broadcasting from a constant while opting in broadcasting from a
> > > > variable.
> > >
> > > Is it for the latter that you choose to use a new optab since
> > > vec_duplicate is not allowed to FAIL?  You should probably document
> >
> > Yes.
> >
> > > that the constant value duplicated should be of the component mode of
> > > m.
> >
> > I added:
> >
> > @cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > @item @samp{vec_const_duplicate@var{m}}
> > Initialize vector output operand 0 in mode @var{m} so that each element
> > has the value given by constant input operand 1.
> >
> > This pattern only handles duplicates of constant inputs.  Non-constant
> > vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> >
> > This pattern is not allowed to @code{FAIL}.
> >
> > which is in my patch.
> >
> > > > and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
> > > > target instructions to duplicate QImode value to TImode, OImode or XImode
> > > > > value for memset.
> > > >
> > > > Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
> > > > scratch register to avoid stack realignment when expanding memset.
> > > >
> > > >         PR middle-end/90773
> > > >         * builtins.c (gen_memset_value_from_prev): New function.
> > > >         (gen_memset_broadcast): Likewise.
> > > >         (builtin_memset_read_str): Use gen_memset_value_from_prev
> > > >         and gen_memset_broadcast.
> > > >         (builtin_memset_gen_str): Likewise.
> > > >         * optabs.def: Add integer_extract and vec_const_duplicate.
> > > >         * target.def (gen_memset_scratch_rtx): New hook.
> > > >         * doc/md.texi: Document vec_const_duplicate and integer_extract.
> > > >         * doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
> > > >         * doc/tm.texi: Regenerated.
> > > > ---
> > > >  gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
> > > >  gcc/doc/md.texi    |  16 +++++++
> > > >  gcc/doc/tm.texi    |   5 ++
> > > >  gcc/doc/tm.texi.in |   2 +
> > > >  gcc/optabs.def     |   4 ++
> > > >  gcc/target.def     |   7 +++
> > > >  6 files changed, 131 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > > index af1fe49bb48..7683169eb96 100644
> > > > --- a/gcc/builtins.c
> > > > +++ b/gcc/builtins.c
> > > > @@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
> > > >    return NULL_RTX;
> > > >  }
> > > >
> > > > +/* Return the RTL of a register in MODE generated from PREV in the
> > > > +   previous iteration.  */
> > > > +
> > > > +static rtx
> > > > +gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > > > +{
> > > > +  rtx target = nullptr;
> > > > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > +  if (prev != nullptr && prev->data != nullptr)
> > > > +    {
> > > > +      /* Use the previous data in the same mode.  */
> > > > +      if (prev->mode == mode)
> > > > +       return prev->data;
> > > > +
> > > > +      /* Extract the RTL in MODE from PREV.  */
> > > > +      enum insn_code icode
> > > > +       = convert_optab_handler (integer_extract_optab, mode,
> > > > +                                prev->mode);
> > > > +      if (icode != CODE_FOR_nothing)
> > > > +       {
> > > > +         target = gen_reg_rtx (mode);
> > > > +         class expand_operand ops[2];
> > > > +         create_output_operand (&ops[0], target, mode);
> > > > +         create_input_operand (&ops[1], prev->data, prev->mode);
> > > > +         expand_insn (icode, 2, ops);
> > > > +         if (!rtx_equal_p (target, ops[0].value))
> > > > +           emit_move_insn (target, ops[0].value);
> > > > +       }
> > > > +      else
> > > > +       target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > +    }
> > > > +  return target;
> > > > +}
> > > > +
> > > > +/* Return the RTL of a register in MODE broadcasted from DATA.  */
> > > > +
> > > > +static rtx
> > > > +gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
> > > > +                     scalar_int_mode mode)
> > > > +{
> > > > +  /* Skip if regno_reg_rtx isn't initialized.  */
> > > > +  if (!regno_reg_rtx)
> > > > +    return nullptr;
> > > > +
> > > > +  rtx target = nullptr;
> > > > +
> > > > +  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> > > > +  machine_mode vector_mode;
> > > > +  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> > > > +    gcc_unreachable ();
> > > > +
> > > > +  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
> > > > +  if (icode != CODE_FOR_nothing)
> > > > +    {
> > > > +      target = targetm.gen_memset_scratch_rtx (vector_mode);
> > > > +      class expand_operand ops[2];
> > > > +      create_output_operand (&ops[0], target, vector_mode);
> > > > +      create_input_operand (&ops[1], (rtx) data, QImode);
> > > > +      expand_insn (icode, 2, ops);
> > > > +      if (!rtx_equal_p (target, ops[0].value))
> > > > +       emit_move_insn (target, ops[0].value);
> > > > +      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> > > > +       target = gen_rtx_REG (mode, REGNO (target));
> > > > +      else
> > > > +       target = convert_to_mode (mode, target, 1);
> > > > +    }
> > > > +
> > > > +  return target;
> > > > +}
> > > > +
> > > >  /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> > > >     bytes from constant string DATA + OFFSET and return it as target
> > > >     constant.  If PREV isn't nullptr, it has the RTL info from the
> > > >     previous iteration.  */
> > > >
> > > >  rtx
> > > > -builtin_memset_read_str (void *data, void *prevp,
> > > > +builtin_memset_read_str (void *data, void *prev,
> > > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > >                          scalar_int_mode mode)
> > > >  {
> > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > +  rtx target;
> > > > +
> > > > +  /* Don't use the previous value if size is 1.  */
> > > > +  if (GET_MODE_SIZE (mode) != 1)
> > > >      {
> > > > -      /* Use the previous data in the same mode.  */
> > > > -      if (prev->mode == mode)
> > > > -       return prev->data;
> > > > +      target = gen_memset_value_from_prev (prev, mode);
> > > > +      if (target != nullptr)
> > > > +       return target;
> > > >      }
> > > >
> > > >    const char *c = (const char *) data;
> > > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
> > > > +  memset (p, *c, GET_MODE_SIZE (QImode));
> > > > +  rtx src = c_readstr (p, QImode);
> > > > +  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
> > > > +                                mode);
> > > > +  if (target != nullptr)
> > > > +    return target;
> > > > +
> > > > +  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > >
> > > >    memset (p, *c, GET_MODE_SIZE (mode));
> > > >
> > > > @@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
> > > >     nullptr, it has the RTL info from the previous iteration.  */
> > > >
> > > >  static rtx
> > > > -builtin_memset_gen_str (void *data, void *prevp,
> > > > +builtin_memset_gen_str (void *data, void *prev,
> > > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > >                         scalar_int_mode mode)
> > > >  {
> > > > @@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
> > > >    size_t size;
> > > >    char *p;
> > > >
> > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > -    {
> > > > -      /* Use the previous data in the same mode.  */
> > > > -      if (prev->mode == mode)
> > > > -       return prev->data;
> > > > -
> > > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > -      if (target != nullptr)
> > > > -       return target;
> > > > -    }
> > > > -
> > > >    size = GET_MODE_SIZE (mode);
> > > >    if (size == 1)
> > > >      return (rtx) data;
> > > >
> > > > +  target = gen_memset_value_from_prev (prev, mode);
> > > > +  if (target != nullptr)
> > > > +    return target;
> > > > +
> > > > +  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
> > > > +                                mode);
> > > > +  if (target != nullptr)
> > > > +    return target;
> > > > +
> > > >    p = XALLOCAVEC (char, size);
> > > >    memset (p, 1, size);
> > > >    coeff = c_readstr (p, mode);
> > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > index 00caf3844cc..a798fb1a97e 100644
> > > > --- a/gcc/doc/md.texi
> > > > +++ b/gcc/doc/md.texi
> > > > @@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
> > > >
> > > >  This pattern is not allowed to @code{FAIL}.
> > > >
> > > > +@cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > > > +@item @samp{vec_const_duplicate@var{m}}
> > > > +Initialize vector output operand 0 in mode @var{m} so that each element
> > > > +has the value given by constant input operand 1.
> > > > +
> > > > +This pattern only handles duplicates of constant inputs.  Non-constant
> > > > +vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> > > > +
> > > > +This pattern is not allowed to @code{FAIL}.
> > > > +
> > > >  @cindex @code{vec_series@var{m}} instruction pattern
> > > >  @item @samp{vec_series@var{m}}
> > > >  Initialize vector output operand 0 so that element @var{i} is equal to
> > > > @@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
> > > >  If this pattern is not defined, a call to the library function
> > > >  @code{__clear_cache} is used.
> > > >
> > > > +@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
> > > > +@item @samp{integer_extract@var{m}@var{n}}
> > > > +Extract lower bit value from the integer value in @code{TImode},
> > > > +@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
> > > > +@var{n} and operand 0 stores value to be extracted in mode @var{m}.
> > > > +
> > > >  @end table
> > > >
> > > >  @end ifset
> > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > index e3a080e4a7c..8ccc262b1fc 100644
> > > > --- a/gcc/doc/tm.texi
> > > > +++ b/gcc/doc/tm.texi
> > > > @@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
> > > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > > >  @end deftypefn
> > > >
> > > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> > > > +This hook should return an rtx for scratch register in @var{mode} to
> > > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.
> > > > +@end deftypefn
> > > > +
> > > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > >  This target hook returns a new value for the number of times @var{loop}
> > > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > index d9fbbe20e6f..99bf01fe25d 100644
> > > > --- a/gcc/doc/tm.texi.in
> > > > +++ b/gcc/doc/tm.texi.in
> > > > @@ -7960,6 +7960,8 @@ lists.
> > > >
> > > >  @hook TARGET_GEN_CCMP_NEXT
> > > >
> > > > +@hook TARGET_GEN_MEMSET_SCRATCH_RTX
> > > > +
> > > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > > >
> > > >  @defmac POWI_MAX_MULTS
> > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > index b192a9d070b..fd8ab8b4a26 100644
> > > > --- a/gcc/optabs.def
> > > > +++ b/gcc/optabs.def
> > > > @@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > > >
> > > >  OPTAB_CD (while_ult_optab, "while_ult$a$b")
> > > >
> > > > +OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
> > > > +
> > > >  OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
> > > >  OPTAB_NX(add_optab, "add$F$a3")
> > > >  OPTAB_NX(add_optab, "add$Q$a3")
> > > > @@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> > > >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> > > >  OPTAB_D (len_load_optab, "len_load_$a")
> > > >  OPTAB_D (len_store_optab, "len_store_$a")
> > > > +
> > > > +OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
> > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > index 1dffedc81e4..b89e7c24471 100644
> > > > --- a/gcc/target.def
> > > > +++ b/gcc/target.def
> > > > @@ -2724,6 +2724,13 @@ DEFHOOK
> > > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > > >   NULL)
> > > >
> > > > +DEFHOOK
> > > > +(gen_memset_scratch_rtx,
> > > > + "This hook should return an rtx for scratch register in @var{mode} to\n\
> > > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
> > > > + rtx, (machine_mode mode),
> > > > + gen_reg_rtx)
> > > > +
> > > >  /* Return a new value for loop unroll size.  */
> > > >  DEFHOOK
> > > >  (loop_unroll_adjust,
> > > > --
> > > > 2.31.1
> > > >
> >
> >
> >
> > --
> > H.J.



-- 
H.J.


* Re: [PATCH] Add integer_extract and vec_const_duplicate optabs
  2021-05-31 13:32                       ` H.J. Lu
@ 2021-05-31 13:36                         ` H.J. Lu
  2021-05-31 20:22                         ` [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX H.J. Lu
  1 sibling, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-05-31 13:36 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Uros Bizjak, Bernd Edlinger

On Mon, May 31, 2021 at 6:32 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > >
> > > > > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > >           MODE)
> > > > > > > > >      This function returns the RTL of a register containing
> > > > > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > >
> > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > >
> > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > >
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > >
> > > > > > Not sure what I should read from there...
> > > > > >
> > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > work here.   Each backend may need its own special handling.
> > > > > >
> > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > >
> > > > > > Richard.
> > > > >
> > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > Add 2 optabs:
> > > > >
> > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > TImode, OImode or XImode.
> > > >
> > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > existing target hooks verifying subreg validity - why's that not a good
> > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > simplify_gen_subreg?), why's that so?
> > >
> > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > well on them.  I got
> > >
> > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > extern void *ops;
> > >
> > > void
> > > foo (int c)
> > > {
> > >   __builtin_memset (ops, c, 34);
> > > }
> > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > -O2 -march=haswell -S s2.i
> > > during RTL pass: reload
> > > s2.i: In function ‘foo’:
> > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > insns per insn achieved (90)
> > >     7 | }
> > >       | ^
> > > 0x1050734 lra_constraints(bool)
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > 0x1039536 lra(_IO_FILE*)
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > 0xfe1140 do_reload
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > 0xfe162e execute
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > Please submit a full bug report,
> > > with preprocessed source if appropriate.
> > > Please include the complete backtrace with any bug report.
> > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > make: *** [Makefile:32: s2.s] Error 1
> > > [hjl@gnu-cfl-2 pieces]$
> > >
> > > due to
> > >
> > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > *)ops.0_1]+32 S2 A8])
> > >         (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > >      (nil))
> > >
> > > The new optab gives us
> > >
> > > (insn 12 11 13 2 (set (reg:TI 88)
> > >         (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > >      (nil))
> > > (insn 13 12 14 2 (set (reg:SI 89)
> > >         (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > >      (nil))
> > > (insn 14 13 15 2 (set (reg:HI 87)
> > >         (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > >      (nil))
> >
> > that looks odd to me - what's the final result after LRA?  I think
>
> I got:
>
> vmovd %edi, %xmm15
> movq ops(%rip), %rdx
> vpbroadcastb %xmm15, %ymm15
> vmovq %xmm15, %rax    <<<< move to GPR
> vmovdqu %ymm15, (%rdx)
> movw %ax, 32(%rdx)   <<<< subreg of GPR
> vzeroupper
> ret
>
> > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > Only two steps should be necessary at most:
> > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > code in memset should try to generate the subreg directly
>
> subreg didn't fail on x86 when I tried.
>
> > and if that fails, try a word_mode subreg followed by the subreg.
>
> I will try word_mode subreg.
>
> >
> > > (insn 15 14 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
> > >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > *)ops.0_1]+32 S2 A8])
> > >         (reg:HI 87)) "s2.i":6:3 -1
> > >      (nil))
> > >
> > > > > > 2. vec_const_duplicate: Broadcast a QImode constant to a vector.  It is
> > > > > > similar to vec_duplicate.  Since the resulting vector is computable at
> > > > > > compile time, vec_duplicate may not be faster, so a backend can opt out
> > > > > > of broadcasting from a constant while opting in to broadcasting from a
> > > > > > variable.
> > > >
> > > > Is it for the latter that you choose to use a new optab since
> > > > vec_duplicate is not allowed to FAIL?  You should probably document
> > >
> > > Yes.

I checked all backends.  There are no V{16,32,64}QI vec_duplicate patterns.
If we allow these expanders to fail, I don't need vec_const_duplicate.

> > > > that the constant value duplicated should be of the component mode of
> > > > m.
> > >
> > > I added:
> > >
> > > @cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > > @item @samp{vec_const_duplicate@var{m}}
> > > Initialize vector output operand 0 in mode @var{m} so that each element
> > > has the value given by constant input operand 1.
> > >
> > > This pattern only handles duplicates of constant inputs.  Non-constant
> > > vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> > >
> > > This pattern is not allowed to @code{FAIL}.
> > >
> > > which is in my patch.
> > >
> > > > > and rewrite builtin_memset_read_str/builtin_memset_gen_str to support
> > > > > target instructions to duplicate QImode value to TImode, OImode or XImode
> > > > > > value for memset.
> > > > >
> > > > > Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
> > > > > scratch register to avoid stack realignment when expanding memset.
> > > > >
> > > > >         PR middle-end/90773
> > > > >         * builtins.c (gen_memset_value_from_prev): New function.
> > > > >         (gen_memset_broadcast): Likewise.
> > > > >         (builtin_memset_read_str): Use gen_memset_value_from_prev
> > > > >         and gen_memset_broadcast.
> > > > >         (builtin_memset_gen_str): Likewise.
> > > > >         * optabs.def: Add integer_extract and vec_const_duplicate.
> > > > >         * target.def (gen_memset_scratch_rtx): New hook.
> > > > >         * doc/md.texi: Document vec_const_duplicate and integer_extract.
> > > > >         * doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
> > > > >         * doc/tm.texi: Regenerated.
> > > > > ---
> > > > >  gcc/builtins.c     | 117 +++++++++++++++++++++++++++++++++++++--------
> > > > >  gcc/doc/md.texi    |  16 +++++++
> > > > >  gcc/doc/tm.texi    |   5 ++
> > > > >  gcc/doc/tm.texi.in |   2 +
> > > > >  gcc/optabs.def     |   4 ++
> > > > >  gcc/target.def     |   7 +++
> > > > >  6 files changed, 131 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/gcc/builtins.c b/gcc/builtins.c
> > > > > index af1fe49bb48..7683169eb96 100644
> > > > > --- a/gcc/builtins.c
> > > > > +++ b/gcc/builtins.c
> > > > > @@ -6598,26 +6598,106 @@ expand_builtin_strncpy (tree exp, rtx target)
> > > > >    return NULL_RTX;
> > > > >  }
> > > > >
> > > > > +/* Return the RTL of a register in MODE generated from PREV in the
> > > > > +   previous iteration.  */
> > > > > +
> > > > > +static rtx
> > > > > +gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
> > > > > +{
> > > > > +  rtx target = nullptr;
> > > > > +  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > +  if (prev != nullptr && prev->data != nullptr)
> > > > > +    {
> > > > > +      /* Use the previous data in the same mode.  */
> > > > > +      if (prev->mode == mode)
> > > > > +       return prev->data;
> > > > > +
> > > > > +      /* Extract the RTL in MODE from PREV.  */
> > > > > +      enum insn_code icode
> > > > > +       = convert_optab_handler (integer_extract_optab, mode,
> > > > > +                                prev->mode);
> > > > > +      if (icode != CODE_FOR_nothing)
> > > > > +       {
> > > > > +         target = gen_reg_rtx (mode);
> > > > > +         class expand_operand ops[2];
> > > > > +         create_output_operand (&ops[0], target, mode);
> > > > > +         create_input_operand (&ops[1], prev->data, prev->mode);
> > > > > +         expand_insn (icode, 2, ops);
> > > > > +         if (!rtx_equal_p (target, ops[0].value))
> > > > > +           emit_move_insn (target, ops[0].value);
> > > > > +       }
> > > > > +      else
> > > > > +       target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > > +    }
> > > > > +  return target;
> > > > > +}
> > > > > +
> > > > > +/* Return the RTL of a register in MODE broadcasted from DATA.  */
> > > > > +
> > > > > +static rtx
> > > > > +gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
> > > > > +                     scalar_int_mode mode)
> > > > > +{
> > > > > +  /* Skip if regno_reg_rtx isn't initialized.  */
> > > > > +  if (!regno_reg_rtx)
> > > > > +    return nullptr;
> > > > > +
> > > > > +  rtx target = nullptr;
> > > > > +
> > > > > +  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> > > > > +  machine_mode vector_mode;
> > > > > +  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> > > > > +    gcc_unreachable ();
> > > > > +
> > > > > +  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
> > > > > +  if (icode != CODE_FOR_nothing)
> > > > > +    {
> > > > > +      target = targetm.gen_memset_scratch_rtx (vector_mode);
> > > > > +      class expand_operand ops[2];
> > > > > +      create_output_operand (&ops[0], target, vector_mode);
> > > > > +      create_input_operand (&ops[1], (rtx) data, QImode);
> > > > > +      expand_insn (icode, 2, ops);
> > > > > +      if (!rtx_equal_p (target, ops[0].value))
> > > > > +       emit_move_insn (target, ops[0].value);
> > > > > +      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> > > > > +       target = gen_rtx_REG (mode, REGNO (target));
> > > > > +      else
> > > > > +       target = convert_to_mode (mode, target, 1);
> > > > > +    }
> > > > > +
> > > > > +  return target;
> > > > > +}
> > > > > +
> > > > >  /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> > > > >     bytes from constant string DATA + OFFSET and return it as target
> > > > >     constant.  If PREV isn't nullptr, it has the RTL info from the
> > > > >     previous iteration.  */
> > > > >
> > > > >  rtx
> > > > > -builtin_memset_read_str (void *data, void *prevp,
> > > > > +builtin_memset_read_str (void *data, void *prev,
> > > > >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > >                          scalar_int_mode mode)
> > > > >  {
> > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > +  rtx target;
> > > > > +
> > > > > +  /* Don't use the previous value if size is 1.  */
> > > > > +  if (GET_MODE_SIZE (mode) != 1)
> > > > >      {
> > > > > -      /* Use the previous data in the same mode.  */
> > > > > -      if (prev->mode == mode)
> > > > > -       return prev->data;
> > > > > +      target = gen_memset_value_from_prev (prev, mode);
> > > > > +      if (target != nullptr)
> > > > > +       return target;
> > > > >      }
> > > > >
> > > > >    const char *c = (const char *) data;
> > > > > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > > +  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
> > > > > +  memset (p, *c, GET_MODE_SIZE (QImode));
> > > > > +  rtx src = c_readstr (p, QImode);
> > > > > +  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
> > > > > +                                mode);
> > > > > +  if (target != nullptr)
> > > > > +    return target;
> > > > > +
> > > > > +  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > > > >
> > > > >    memset (p, *c, GET_MODE_SIZE (mode));
> > > > >
> > > > > @@ -6631,7 +6711,7 @@ builtin_memset_read_str (void *data, void *prevp,
> > > > >     nullptr, it has the RTL info from the previous iteration.  */
> > > > >
> > > > >  static rtx
> > > > > -builtin_memset_gen_str (void *data, void *prevp,
> > > > > +builtin_memset_gen_str (void *data, void *prev,
> > > > >                         HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > > > >                         scalar_int_mode mode)
> > > > >  {
> > > > > @@ -6639,22 +6719,19 @@ builtin_memset_gen_str (void *data, void *prevp,
> > > > >    size_t size;
> > > > >    char *p;
> > > > >
> > > > > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> > > > > -  if (prev != nullptr && prev->data != nullptr)
> > > > > -    {
> > > > > -      /* Use the previous data in the same mode.  */
> > > > > -      if (prev->mode == mode)
> > > > > -       return prev->data;
> > > > > -
> > > > > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> > > > > -      if (target != nullptr)
> > > > > -       return target;
> > > > > -    }
> > > > > -
> > > > >    size = GET_MODE_SIZE (mode);
> > > > >    if (size == 1)
> > > > >      return (rtx) data;
> > > > >
> > > > > +  target = gen_memset_value_from_prev (prev, mode);
> > > > > +  if (target != nullptr)
> > > > > +    return target;
> > > > > +
> > > > > +  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
> > > > > +                                mode);
> > > > > +  if (target != nullptr)
> > > > > +    return target;
> > > > > +
> > > > >    p = XALLOCAVEC (char, size);
> > > > >    memset (p, 1, size);
> > > > >    coeff = c_readstr (p, mode);
> > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > index 00caf3844cc..a798fb1a97e 100644
> > > > > --- a/gcc/doc/md.texi
> > > > > +++ b/gcc/doc/md.texi
> > > > > @@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
> > > > >
> > > > >  This pattern is not allowed to @code{FAIL}.
> > > > >
> > > > > +@cindex @code{vec_const_duplicate@var{m}} instruction pattern
> > > > > +@item @samp{vec_const_duplicate@var{m}}
> > > > > +Initialize vector output operand 0 in mode @var{m} so that each element
> > > > > +has the value given by constant input operand 1.
> > > > > +
> > > > > +This pattern only handles duplicates of constant inputs.  Non-constant
> > > > > +vectors go through the @code{vec_duplicate@var{m}} pattern instead.
> > > > > +
> > > > > +This pattern is not allowed to @code{FAIL}.
> > > > > +
> > > > >  @cindex @code{vec_series@var{m}} instruction pattern
> > > > >  @item @samp{vec_series@var{m}}
> > > > >  Initialize vector output operand 0 so that element @var{i} is equal to
> > > > > @@ -7904,6 +7914,12 @@ inclusive and operand 1 exclusive.
> > > > >  If this pattern is not defined, a call to the library function
> > > > >  @code{__clear_cache} is used.
> > > > >
> > > > > +@cindex @code{integer_extract@var{m}@var{n}} instruction pattern
> > > > > +@item @samp{integer_extract@var{m}@var{n}}
> > > > > +Extract lower bit value from the integer value in @code{TImode},
> > > > > +@code{OImode} or @code{XImode}.  Operand 1 is the integer in mode
> > > > > +@var{n} and operand 0 stores value to be extracted in mode @var{m}.
> > > > > +
> > > > >  @end table
> > > > >
> > > > >  @end ifset
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index e3a080e4a7c..8ccc262b1fc 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
> > > > >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> > > > > +This hook should return an rtx for scratch register in @var{mode} to
> > > > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > > >  This target hook returns a new value for the number of times @var{loop}
> > > > >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index d9fbbe20e6f..99bf01fe25d 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -7960,6 +7960,8 @@ lists.
> > > > >
> > > > >  @hook TARGET_GEN_CCMP_NEXT
> > > > >
> > > > > +@hook TARGET_GEN_MEMSET_SCRATCH_RTX
> > > > > +
> > > > >  @hook TARGET_LOOP_UNROLL_ADJUST
> > > > >
> > > > >  @defmac POWI_MAX_MULTS
> > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > index b192a9d070b..fd8ab8b4a26 100644
> > > > > --- a/gcc/optabs.def
> > > > > +++ b/gcc/optabs.def
> > > > > @@ -100,6 +100,8 @@ OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > > > >
> > > > >  OPTAB_CD (while_ult_optab, "while_ult$a$b")
> > > > >
> > > > > +OPTAB_CD (integer_extract_optab, "integer_extract$a$b")
> > > > > +
> > > > >  OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
> > > > >  OPTAB_NX(add_optab, "add$F$a3")
> > > > >  OPTAB_NX(add_optab, "add$Q$a3")
> > > > > @@ -453,3 +455,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> > > > >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> > > > >  OPTAB_D (len_load_optab, "len_load_$a")
> > > > >  OPTAB_D (len_store_optab, "len_store_$a")
> > > > > +
> > > > > +OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index 1dffedc81e4..b89e7c24471 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2724,6 +2724,13 @@ DEFHOOK
> > > > >   rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
> > > > >   NULL)
> > > > >
> > > > > +DEFHOOK
> > > > > +(gen_memset_scratch_rtx,
> > > > > + "This hook should return an rtx for scratch register in @var{mode} to\n\
> > > > > +be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
> > > > > + rtx, (machine_mode mode),
> > > > > + gen_reg_rtx)
> > > > > +
> > > > >  /* Return a new value for loop unroll size.  */
> > > > >  DEFHOOK
> > > > >  (loop_unroll_adjust,
> > > > > --
> > > > > 2.31.1
> > > > >
> > >
> > >
> > >
> > > --
> > > H.J.
>
>
>
> --
> H.J.



-- 
H.J.


* [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-05-31 13:32                       ` H.J. Lu
  2021-05-31 13:36                         ` H.J. Lu
@ 2021-05-31 20:22                         ` H.J. Lu
  2021-06-01  5:50                           ` Richard Sandiford
  1 sibling, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-05-31 20:22 UTC (permalink / raw)
  To: Richard Biener, GCC Patches, Richard Sandiford, Uros Bizjak,
	Bernd Edlinger
  Cc: Jeff Law

On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > >
> > > > > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > >           MODE)
> > > > > > > > >      This function returns the RTL of a register containing
> > > > > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > >
> > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > >
> > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > >
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > >
> > > > > > Not sure what I should read from there...
> > > > > >
> > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > work here.   Each backend may need its own special handling.
> > > > > >
> > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > >
> > > > > > Richard.
> > > > >
> > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > Add 2 optabs:
> > > > >
> > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > TImode, OImode or XImode.
> > > >
> > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > existing target hooks verifying subreg validity - why's that not a good
> > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > simplify_gen_subreg?), why's that so?
> > >
> > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > well on them.  I got
> > >
> > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > extern void *ops;
> > >
> > > void
> > > foo (int c)
> > > {
> > >   __builtin_memset (ops, c, 34);
> > > }
> > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > -O2 -march=haswell -S s2.i
> > > during RTL pass: reload
> > > s2.i: In function ‘foo’:
> > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > insns per insn achieved (90)
> > >     7 | }
> > >       | ^
> > > 0x1050734 lra_constraints(bool)
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > 0x1039536 lra(_IO_FILE*)
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > 0xfe1140 do_reload
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > 0xfe162e execute
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > Please submit a full bug report,
> > > with preprocessed source if appropriate.
> > > Please include the complete backtrace with any bug report.
> > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > make: *** [Makefile:32: s2.s] Error 1
> > > [hjl@gnu-cfl-2 pieces]$
> > >
> > > due to
> > >
> > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > *)ops.0_1]+32 S2 A8])
> > >         (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > >      (nil))
> > >
> > > The new optab gives us
> > >
> > > (insn 12 11 13 2 (set (reg:TI 88)
> > >         (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > >      (nil))
> > > (insn 13 12 14 2 (set (reg:SI 89)
> > >         (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > >      (nil))
> > > (insn 14 13 15 2 (set (reg:HI 87)
> > >         (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > >      (nil))
> >
> > that looks odd to me - what's the final result after LRA?  I think
> 
> I got:
> 
> vmovd %edi, %xmm15
> movq ops(%rip), %rdx
> vpbroadcastb %xmm15, %ymm15
> vmovq %xmm15, %rax    <<<< move to GPR
> vmovdqu %ymm15, (%rdx)
> movw %ax, 32(%rdx)   <<<< subreg of GPR
> vzeroupper
> ret
> 
> > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > Only two steps should be necessary at most:
> > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > code in memset should try to generate the subreg directly
> 
> subreg didn't fail on x86 when I tried.
> 
> > and if that fails, try a word_mode subreg followed by the subreg.
> 
> I will try word_mode subreg.
> 

Here is the v2 patch to use word_mode subreg.  For

---
extern void *ops;

void
foo (int c)
{
  __builtin_memset (ops, 4, 32);
}
---

without vec_const_duplicate, I got

	movl	$4, %eax
	movq	ops(%rip), %rdx
	movd	%eax, %xmm0
	punpcklbw	%xmm0, %xmm0
	punpcklwd	%xmm0, %xmm0
	pshufd	$0, %xmm0, %xmm0
	movups	%xmm0, (%rdx)
	movups	%xmm0, 16(%rdx)
	ret

With vec_const_duplicate, I got

	movq	ops(%rip), %rax
	movdqa	.LC0(%rip), %xmm0
	movups	%xmm0, (%rax)
	movups	%xmm0, 16(%rax)
	ret

If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
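
For reference, the value being materialized in both sequences above is
just the byte replicated across the whole piece.  Below is a minimal
scalar sketch of that idea in plain C.  It is not GCC code:
broadcast_byte and fill_by_pieces are hypothetical names, and the 8-byte
piece stands in for the TImode/OImode/XImode pieces discussed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate the byte C into every byte of a 64-bit word.  This is the
   0x0101...01 * DATA form of a broadcast.  */
static uint64_t
broadcast_byte (uint8_t c)
{
  return UINT64_C (0x0101010101010101) * c;
}

/* Fill N bytes at DST with C, storing the broadcast word one piece at
   a time and finishing the tail with single-byte stores.  */
static void
fill_by_pieces (unsigned char *dst, size_t n, uint8_t c)
{
  uint64_t v = broadcast_byte (c);
  size_t i = 0;

  for (; i + sizeof v <= n; i += sizeof v)
    memcpy (dst + i, &v, sizeof v);
  for (; i < n; i++)
    dst[i] = c;
}
```

When C is a compile-time constant the broadcast word is a constant too,
which is why a load from the constant pool can replace the shuffle
sequence.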


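The word_mode step used by gen_memset_value_from_prev in this version
can be modeled as below.  This is only a sketch of the decision, under
the assumption that sizes are given in bytes.  lowpart_steps is a
hypothetical helper, not something in the patch.

```c
#include <stddef.h>

/* Return how many lowpart extractions are needed to narrow a value of
   PREV_SIZE bytes to SIZE bytes when a word is WORD_SIZE bytes.  */
static int
lowpart_steps (size_t prev_size, size_t size, size_t word_size)
{
  if (prev_size == size)
    return 0;	/* Same mode: reuse the previous value as is.  */
  if (word_size < prev_size && word_size > size)
    return 2;	/* Wide -> word_mode, then word_mode -> narrow.  */
  return 1;	/* A direct lowpart is enough.  */
}
```

With 8-byte words, narrowing a 32-byte value to 2 bytes takes the
two-step path, while narrowing it to 8 bytes is a single lowpart.
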
H.J.
---
1. Add vec_const_duplicate to broadcast a QImode constant to a vector.
It is similar to vec_duplicate, which is not allowed to fail.  Since the
resulting vector is computable at compile time, vec_duplicate may not
be faster, so a backend can opt out of broadcasting from a constant while
opting in to broadcasting from a variable.
2. Rewrite builtin_memset_read_str/builtin_memset_gen_str to support
target instructions that duplicate a QImode value to a TImode, OImode or
XImode value for memset.
3. Add TARGET_GEN_MEMSET_SCRATCH_RTX to allow the backend to use a hard
scratch register to avoid stack realignment when expanding memset.

	PR middle-end/90773
	* builtins.c (gen_memset_value_from_prev): New function.
	(gen_memset_broadcast): Likewise.
	(builtin_memset_read_str): Use gen_memset_value_from_prev
	and gen_memset_broadcast.
	(builtin_memset_gen_str): Likewise.
	* optabs.def: Add vec_const_duplicate.
	* target.def (gen_memset_scratch_rtx): New hook.
	* doc/md.texi: Document vec_const_duplicate.
	* doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
	* doc/tm.texi: Regenerated.
---
 gcc/builtins.c     | 115 +++++++++++++++++++++++++++++++++++++--------
 gcc/doc/md.texi    |  10 ++++
 gcc/doc/tm.texi    |   5 ++
 gcc/doc/tm.texi.in |   2 +
 gcc/optabs.def     |   2 +
 gcc/target.def     |   7 +++
 6 files changed, 121 insertions(+), 20 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index af1fe49bb48..4573450d2c0 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6598,26 +6598,104 @@ expand_builtin_strncpy (tree exp, rtx target)
   return NULL_RTX;
 }
 
+/* Return the RTL of a register in MODE generated from PREV in the
+   previous iteration.  */
+
+static rtx
+gen_memset_value_from_prev (void *prevp, scalar_int_mode mode)
+{
+  rtx target = nullptr;
+  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  if (prev != nullptr && prev->data != nullptr)
+    {
+      /* Use the previous data in the same mode.  */
+      if (prev->mode == mode)
+	return prev->data;
+
+      rtx prev_rtx = prev->data;
+      machine_mode prev_mode = prev->mode;
+      unsigned int word_size = GET_MODE_SIZE (word_mode);
+      if (word_size < GET_MODE_SIZE (prev->mode)
+	  && word_size > GET_MODE_SIZE (mode))
+	{
+	  /* First generate subreg of word mode if the previous mode is
+	     wider than word mode and word mode is wider than MODE.  */
+	  prev_rtx = simplify_gen_subreg (word_mode, prev_rtx,
+					  prev_mode, 0);
+	  prev_mode = word_mode;
+	}
+      if (prev_rtx != nullptr)
+	target = simplify_gen_subreg (mode, prev_rtx, prev_mode, 0);
+    }
+  return target;
+}
+
+/* Return the RTL of a register in MODE broadcasted from DATA.  */
+
+static rtx
+gen_memset_broadcast (enum optab_tag broadcast_optab, rtx data,
+		      scalar_int_mode mode)
+{
+  /* Skip if regno_reg_rtx isn't initialized.  */
+  if (!regno_reg_rtx)
+    return nullptr;
+
+  rtx target = nullptr;
+
+  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
+  machine_mode vector_mode;
+  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
+    gcc_unreachable ();
+
+  enum insn_code icode = optab_handler (broadcast_optab, vector_mode);
+  if (icode != CODE_FOR_nothing)
+    {
+      target = targetm.gen_memset_scratch_rtx (vector_mode);
+      class expand_operand ops[2];
+      create_output_operand (&ops[0], target, vector_mode);
+      create_input_operand (&ops[1], (rtx) data, QImode);
+      expand_insn (icode, 2, ops);
+      if (!rtx_equal_p (target, ops[0].value))
+	emit_move_insn (target, ops[0].value);
+      if (REGNO (target) < FIRST_PSEUDO_REGISTER)
+	target = gen_rtx_REG (mode, REGNO (target));
+      else
+	target = convert_to_mode (mode, target, 1);
+    }
+
+  return target;
+}
+
 /* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
    bytes from constant string DATA + OFFSET and return it as target
    constant.  If PREV isn't nullptr, it has the RTL info from the
    previous iteration.  */
 
 rtx
-builtin_memset_read_str (void *data, void *prevp,
+builtin_memset_read_str (void *data, void *prev,
 			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			 scalar_int_mode mode)
 {
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
+  rtx target;
+
+  /* Don't use the previous value if size is 1.  */
+  if (GET_MODE_SIZE (mode) != 1)
     {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
+      target = gen_memset_value_from_prev (prev, mode);
+      if (target != nullptr)
+	return target;
     }
 
   const char *c = (const char *) data;
-  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
+  char *p = XALLOCAVEC (char, GET_MODE_SIZE (QImode));
+  memset (p, *c, GET_MODE_SIZE (QImode));
+  rtx src = c_readstr (p, QImode);
+  target = gen_memset_broadcast (vec_const_duplicate_optab, src,
+				 mode);
+  if (target != nullptr)
+    return target;
+
+  p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
 
   memset (p, *c, GET_MODE_SIZE (mode));
 
@@ -6631,7 +6709,7 @@ builtin_memset_read_str (void *data, void *prevp,
    nullptr, it has the RTL info from the previous iteration.  */
 
 static rtx
-builtin_memset_gen_str (void *data, void *prevp,
+builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			scalar_int_mode mode)
 {
@@ -6639,22 +6717,19 @@ builtin_memset_gen_str (void *data, void *prevp,
   size_t size;
   char *p;
 
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
-    {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
-
-      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
-      if (target != nullptr)
-	return target;
-    }
-
   size = GET_MODE_SIZE (mode);
   if (size == 1)
     return (rtx) data;
 
+  target = gen_memset_value_from_prev (prev, mode);
+  if (target != nullptr)
+    return target;
+
+  target = gen_memset_broadcast (vec_duplicate_optab, (rtx) data,
+				 mode);
+  if (target != nullptr)
+    return target;
+
   p = XALLOCAVEC (char, size);
   memset (p, 1, size);
   coeff = c_readstr (p, mode);
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 00caf3844cc..cb9cf420c0b 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5079,6 +5079,16 @@ vectors go through the @code{mov@var{m}} pattern instead.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_const_duplicate@var{m}} instruction pattern
+@item @samp{vec_const_duplicate@var{m}}
+Initialize vector output operand 0 in mode @var{m} so that each element
+has the value given by constant input operand 1.
+
+This pattern only handles duplicates of constant inputs.  Non-constant
+vectors go through the @code{vec_duplicate@var{m}} pattern instead.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_series@var{m}} instruction pattern
 @item @samp{vec_series@var{m}}
 Initialize vector output operand 0 so that element @var{i} is equal to
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index e3a080e4a7c..8ccc262b1fc 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11894,6 +11894,11 @@ This function prepares to emit a conditional comparison within a sequence
  @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
+This hook should return an rtx for scratch register in @var{mode} to
+be used by memset broadcast.  The default is @code{gen_reg_rtx}.
+@end deftypefn
+
 @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
 This target hook returns a new value for the number of times @var{loop}
 should be unrolled. The parameter @var{nunroll} is the number of times
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d9fbbe20e6f..99bf01fe25d 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7960,6 +7960,8 @@ lists.
 
 @hook TARGET_GEN_CCMP_NEXT
 
+@hook TARGET_GEN_MEMSET_SCRATCH_RTX
+
 @hook TARGET_LOOP_UNROLL_ADJUST
 
 @defmac POWI_MAX_MULTS
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..643d2d17a3b 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -453,3 +453,5 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
 OPTAB_D (len_load_optab, "len_load_$a")
 OPTAB_D (len_store_optab, "len_store_$a")
+
+OPTAB_DC (vec_const_duplicate_optab, "vec_const_duplicate$a", VEC_DUPLICATE)
diff --git a/gcc/target.def b/gcc/target.def
index 1dffedc81e4..b89e7c24471 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2724,6 +2724,13 @@ DEFHOOK
  rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
  NULL)
 
+DEFHOOK
+(gen_memset_scratch_rtx,
+ "This hook should return an rtx for scratch register in @var{mode} to\n\
+be used by memset broadcast.  The default is @code{gen_reg_rtx}.",
+ rtx, (machine_mode mode),
+ gen_reg_rtx)
+
 /* Return a new value for loop unroll size.  */
 DEFHOOK
 (loop_unroll_adjust,
-- 
2.31.1
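
The reuse done by gen_memset_value_from_prev can be pictured in plain C.
What follows is only a sketch under stated assumptions, not the code the
patch generates: fill_by_pieces is a hypothetical name, uint64_t stands in
for the widest piece mode, and a narrowing cast stands in for the lowpart
subreg.  The byte is broadcast once with the 0x01...01 coefficient, and
each narrower piece is taken from the low part of the wider value, which
is safe for either byte order because all bytes are equal.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill N bytes at DST with C, widest pieces first (illustration only).  */
void
fill_by_pieces (void *dst, unsigned char c, size_t n)
{
  unsigned char *p = (unsigned char *) dst;

  /* Broadcast once: 0x0101010101010101 * c puts c in every byte.  */
  uint64_t v8 = (uint64_t) c * UINT64_C (0x0101010101010101);
  /* Narrower pieces reuse the low part of the wider value.  */
  uint32_t v4 = (uint32_t) v8;
  uint16_t v2 = (uint16_t) v4;

  for (; n >= 8; n -= 8, p += 8)
    memcpy (p, &v8, 8);
  if (n >= 4)
    {
      memcpy (p, &v4, 4);
      p += 4;
      n -= 4;
    }
  if (n >= 2)
    {
      memcpy (p, &v2, 2);
      p += 2;
      n -= 2;
    }
  if (n >= 1)
    *p = c;
}
```

For a 34-byte fill this issues four 8-byte stores and one 2-byte store,
with the 2-byte value derived from the 8-byte one rather than rebuilt.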


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-05-31 20:22                         ` [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX H.J. Lu
@ 2021-06-01  5:50                           ` Richard Sandiford
  2021-06-01  5:54                             ` Jeff Law
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Sandiford @ 2021-06-01  5:50 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches
  Cc: Richard Biener, Uros Bizjak, Bernd Edlinger, H.J. Lu, Jeff Law

"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>> >
>> > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > >
>> > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
>> > > <richard.guenther@gmail.com> wrote:
>> > > >
>> > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > > > >
>> > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
>> > > > > > > > >
>> > > > > > > > >  -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
>> > > > > > > > >           MODE)
>> > > > > > > > >      This function returns the RTL of a register containing
>> > > > > > > > >      'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
>> > > > > > > > >      value given in the RTL register DATA.  For example, if MODE is 4
>> > > > > > > > >      bytes wide, return the RTL for 0x01010101*DATA.
>> > > > > > > >
>> > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
>> > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
>> > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
>> > > > > > >
>> > > > > > > I tried.   It doesn't even work on x86.  See:
>> > > > > > >
>> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
>> > > > > >
>> > > > > > Not sure what I should read from there...
>> > > > > >
>> > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
>> > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
>> > > > > > > work here.   Each backend may need its own special handling.
>> > > > > >
>> > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
>> > > > > > sorry.  Since we're doing code generation the canonical way to communicate
>> > > > > > with backends should be optabs, not some set of disconnected target hooks.
>> > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
>> > > > > >
>> > > > > > Richard.
>> > > > >
>> > > > > Here is the patch to add optabs instead.  Does it look OK?
>> > > > >
>> > > > > Thanks.
>> > > > >
>> > > > > H.J.
>> > > > > ---
>> > > > > Add 2 optabs:
>> > > > >
>> > > > > 1. integer_extract: Extract lower bit value from the integer value in
>> > > > > TImode, OImode or XImode.
>> > > >
>> > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
>> > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
>> > > > existing target hooks verifying subreg validity - why's that not a good
>> > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
>> > > > simplify_gen_subreg?), why's that so?
>> > >
>> > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
>> > > well on them.  I got
>> > >
>> > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
>> > > extern void *ops;
>> > >
>> > > void
>> > > foo (int c)
>> > > {
>> > >   __builtin_memset (ops, c, 34);
>> > > }
>> > > [hjl@gnu-cfl-2 pieces]$ make s2.s
>> > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
>> > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
>> > > -O2 -march=haswell -S s2.i
>> > > during RTL pass: reload
>> > > s2.i: In function ‘foo’:
>> > > s2.i:7:1: internal compiler error: maximum number of generated reload
>> > > insns per insn achieved (90)
>> > >     7 | }
>> > >       | ^
>> > > 0x1050734 lra_constraints(bool)
>> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
>> > > 0x1039536 lra(_IO_FILE*)
>> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
>> > > 0xfe1140 do_reload
>> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
>> > > 0xfe162e execute
>> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
>> > > Please submit a full bug report,
>> > > with preprocessed source if appropriate.
>> > > Please include the complete backtrace with any bug report.
>> > > See <https://gcc.gnu.org/bugs/> for instructions.
>> > > make: *** [Makefile:32: s2.s] Error 1
>> > > [hjl@gnu-cfl-2 pieces]$
>> > >
>> > > due to
>> > >
>> > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
>> > >                 (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
>> > > *)ops.0_1]+32 S2 A8])
>> > >         (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
>> > >      (nil))
>> > >
>> > > The new optab gives us
>> > >
>> > > (insn 12 11 13 2 (set (reg:TI 88)
>> > >         (reg:TI 51 xmm15)) "s2.i":6:3 -1
>> > >      (nil))
>> > > (insn 13 12 14 2 (set (reg:SI 89)
>> > >         (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
>> > >      (nil))
>> > > (insn 14 13 15 2 (set (reg:HI 87)
>> > >         (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
>> > >      (nil))
>> >
>> > that looks odd to me - what's the final result after LRA?  I think
>> 
>> I got:
>> 
>> vmovd %edi, %xmm15
>> movq ops(%rip), %rdx
>> vpbroadcastb %xmm15, %ymm15
>> vmovq %xmm15, %rax    <<<< move to GPR
>> vmovdqu %ymm15, (%rdx)
>> movw %ax, 32(%rdx)   <<<< subreg of GPR
>> vzeroupper
>> ret
>> 
>> > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
>> > Only two steps should be necessary at most:
>> > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
>> > code in memset should try to generate the subreg directly
>> 
>> subreg didn't fail on x86 when I tried.
>> 
>> > and if that fails, try a word_mode subreg followed by the subreg.
>> 
>> I will try word_mode subreg.
>> 
>
> Here is the v2 patch to use word_mode subreg.  For
>
> ---
> extern void *ops;
>
> void
> foo (int c)
> {
>   __builtin_memset (ops, 4, 32);
> }
> ---
>
> without vec_const_duplicate, I got
>
> 	movl	$4, %eax
> 	movq	ops(%rip), %rdx
> 	movd	%eax, %xmm0
> 	punpcklbw	%xmm0, %xmm0
> 	punpcklwd	%xmm0, %xmm0
> 	pshufd	$0, %xmm0, %xmm0
> 	movups	%xmm0, (%rdx)
> 	movups	%xmm0, 16(%rdx)
> 	ret
>
> With vec_const_duplicate, I got
>
> 	movq	ops(%rip), %rax
> 	movdqa	.LC0(%rip), %xmm0
> 	movups	%xmm0, (%rax)
> 	movups	%xmm0, 16(%rax)
> 	ret
>
> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.

I don't understand why we need an optab for this though.  If the operand
is constant then we should just be doing an ordinary move in which the
source is a CONST_VECTOR.  It's then up to the move patterns to handle
duplicated constants as efficiently as possible.  (Sorry if this was
discussed upthread and I missed it.)

Thanks,
Richard
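
A user-level analogue of the distinction, as a sketch only: it assumes the
GNU C vector extension (GCC or Clang), and the type and function names
are made up for illustration.  With a constant fill byte the source is a
plain vector constant and the stores are ordinary moves; with a run-time
byte the value has to be duplicated first.  Which instructions come out
is up to the target's move and vec_duplicate patterns.

```c
#include <string.h>

/* Hypothetical 16-byte vector of unsigned chars (GNU C extension).  */
typedef unsigned char v16qi __attribute__ ((vector_size (16)));

/* Constant fill: the source is a compile-time vector constant.  */
void
fill32_const (void *dst)
{
  const v16qi v = { 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 };
  memcpy (dst, &v, sizeof v);
  memcpy ((char *) dst + sizeof v, &v, sizeof v);
}

/* Variable fill: the byte is only known at run time, so it is
   duplicated into every element before the stores.  */
void
fill32_var (void *dst, unsigned char c)
{
  v16qi v;
  for (int i = 0; i < 16; i++)
    v[i] = c;
  memcpy (dst, &v, sizeof v);
  memcpy ((char *) dst + sizeof v, &v, sizeof v);
}
```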

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01  5:50                           ` Richard Sandiford
@ 2021-06-01  5:54                             ` Jeff Law
  2021-06-01 13:05                               ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Law @ 2021-06-01  5:54 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches, Richard Biener, Uros Bizjak,
	Bernd Edlinger, H.J. Lu, richard.sandiford



On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
>>>>>>>>>>>   -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
>>>>>>>>>>>            MODE)
>>>>>>>>>>>       This function returns the RTL of a register containing
>>>>>>>>>>>       'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
>>>>>>>>>>>       value given in the RTL register DATA.  For example, if MODE is 4
>>>>>>>>>>>       bytes wide, return the RTL for 0x01010101*DATA.
>>>>>>>>>> For this one I wonder if it should be an optab instead.  Couldn't you
>>>>>>>>>> use the existing vec_duplicate for this by using (paradoxical) subregs
>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
>>>>>>>>> I tried.   It doesn't even work on x86.  See:
>>>>>>>>>
>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
>>>>>>>> Not sure what I should read from there...
>>>>>>>>
>>>>>>>>> There are special cases to subreg HI, SI and DI modes of TI mode in
>>>>>>>>> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
>>>>>>>>> work here.   Each backend may need its own special handling.
>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further review these parts,
>>>>>>>> sorry.  Since we're doing code generation the canonical way to communicate
>>>>>>>> with backends should be optabs, not some set of disconnected target hooks.
>>>>>>>> But as said, I probably don't know enough of RTL to see why it's the only way.
>>>>>>>>
>>>>>>>> Richard.
>>>>>>> Here is the patch to add optabs instead.  Does it look OK?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> H.J.
>>>>>>> ---
>>>>>>> Add 2 optabs:
>>>>>>>
>>>>>>> 1. integer_extract: Extract lower bit value from the integer value in
>>>>>>> TImode, OImode or XImode.
>>>>>> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
>>>>>> existing target hooks verifying subreg validity - why's that not a good
>>>>>> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
>>>>>> simplify_gen_subreg?), why's that so?
>>>>> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
>>>>> well on them.  I got
>>>>>
>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i
>>>>> extern void *ops;
>>>>>
>>>>> void
>>>>> foo (int c)
>>>>> {
>>>>>    __builtin_memset (ops, c, 34);
>>>>> }
>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s
>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
>>>>> -O2 -march=haswell -S s2.i
>>>>> during RTL pass: reload
>>>>> s2.i: In function ‘foo’:
>>>>> s2.i:7:1: internal compiler error: maximum number of generated reload
>>>>> insns per insn achieved (90)
>>>>>      7 | }
>>>>>        | ^
>>>>> 0x1050734 lra_constraints(bool)
>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
>>>>> 0x1039536 lra(_IO_FILE*)
>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
>>>>> 0xfe1140 do_reload
>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
>>>>> 0xfe162e execute
>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
>>>>> Please submit a full bug report,
>>>>> with preprocessed source if appropriate.
>>>>> Please include the complete backtrace with any bug report.
>>>>> See <https://gcc.gnu.org/bugs/> for instructions.
>>>>> make: *** [Makefile:32: s2.s] Error 1
>>>>> [hjl@gnu-cfl-2 pieces]$
>>>>>
>>>>> due to
>>>>>
>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
>>>>>                  (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
>>>>> *)ops.0_1]+32 S2 A8])
>>>>>          (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
>>>>>       (nil))
>>>>>
>>>>> The new optab gives us
>>>>>
>>>>> (insn 12 11 13 2 (set (reg:TI 88)
>>>>>          (reg:TI 51 xmm15)) "s2.i":6:3 -1
>>>>>       (nil))
>>>>> (insn 13 12 14 2 (set (reg:SI 89)
>>>>>          (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
>>>>>       (nil))
>>>>> (insn 14 13 15 2 (set (reg:HI 87)
>>>>>          (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
>>>>>       (nil))
>>>> that looks odd to me - what's the final result after LRA?  I think
>>> I got:
>>>
>>> vmovd %edi, %xmm15
>>> movq ops(%rip), %rdx
>>> vpbroadcastb %xmm15, %ymm15
>>> vmovq %xmm15, %rax    <<<< move to GPR
>>> vmovdqu %ymm15, (%rdx)
>>> movw %ax, 32(%rdx)   <<<< subreg of GPR
>>> vzeroupper
>>> ret
>>>
>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
>>>> Only two steps should be necessary at most:
>>>> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
>>>> code in memset should try to generate the subreg directly
>>> subreg didn't fail on x86 when I tried.
>>>
>>>> and if that fails, try a word_mode subreg followed by the subreg.
>>> I will try word_mode subreg.
>>>
>> Here is the v2 patch to use word_mode subreg.  For
>>
>> ---
>> extern void *ops;
>>
>> void
>> foo (int c)
>> {
>>    __builtin_memset (ops, 4, 32);
>> }
>> ---
>>
>> without vec_const_duplicate, I got
>>
>> 	movl	$4, %eax
>> 	movq	ops(%rip), %rdx
>> 	movd	%eax, %xmm0
>> 	punpcklbw	%xmm0, %xmm0
>> 	punpcklwd	%xmm0, %xmm0
>> 	pshufd	$0, %xmm0, %xmm0
>> 	movups	%xmm0, (%rdx)
>> 	movups	%xmm0, 16(%rdx)
>> 	ret
>>
>> With vec_const_duplicate, I got
>>
>> 	movq	ops(%rip), %rax
>> 	movdqa	.LC0(%rip), %xmm0
>> 	movups	%xmm0, (%rax)
>> 	movups	%xmm0, 16(%rax)
>> 	ret
>>
>> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> I don't understand why we need an optab for this though.  If the operand
> is constant then we should just be doing an ordinary move in which the
> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> duplicated constants as efficiently as possible.  (Sorry if this was
> discussed upthread and I missed it.)
That's exactly the point I'm trying to get across as well.

jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01  5:54                             ` Jeff Law
@ 2021-06-01 13:05                               ` H.J. Lu
  2021-06-01 13:25                                 ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-06-01 13:05 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu via Gcc-patches, Richard Biener, Uros Bizjak,
	Bernd Edlinger, richard.sandiford

On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> 
> 
> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > > > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > > > > >   -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > > > > >            MODE)
> > > > > > > > > > > >       This function returns the RTL of a register containing
> > > > > > > > > > > >       'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > > > > >       value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > > > > >       bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > > > > > 
> > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > > > > > Not sure what I should read from there...
> > > > > > > > > 
> > > > > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > > > > work here.   Each backend may need its own special handling.
> > > > > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > > > > > 
> > > > > > > > > Richard.
> > > > > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > > > > > 
> > > > > > > > Thanks.
> > > > > > > > 
> > > > > > > > H.J.
> > > > > > > > ---
> > > > > > > > Add 2 optabs:
> > > > > > > > 
> > > > > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > > > > TImode, OImode or XImode.
> > > > > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > > > > existing target hooks verifying subreg validity - why's that not a good
> > > > > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > > > > simplify_gen_subreg?), why's that so?
> > > > > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > > > well on them.  I got
> > > > > > 
> > > > > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > > > extern void *ops;
> > > > > > 
> > > > > > void
> > > > > > foo (int c)
> > > > > > {
> > > > > >    __builtin_memset (ops, c, 34);
> > > > > > }
> > > > > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > > > -O2 -march=haswell -S s2.i
> > > > > > during RTL pass: reload
> > > > > > s2.i: In function ‘foo’:
> > > > > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > > > insns per insn achieved (90)
> > > > > >      7 | }
> > > > > >        | ^
> > > > > > 0x1050734 lra_constraints(bool)
> > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > > > 0x1039536 lra(_IO_FILE*)
> > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > > > 0xfe1140 do_reload
> > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > > > 0xfe162e execute
> > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > > > Please submit a full bug report,
> > > > > > with preprocessed source if appropriate.
> > > > > > Please include the complete backtrace with any bug report.
> > > > > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > > > > make: *** [Makefile:32: s2.s] Error 1
> > > > > > [hjl@gnu-cfl-2 pieces]$
> > > > > > 
> > > > > > due to
> > > > > > 
> > > > > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > > >                  (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > > > *)ops.0_1]+32 S2 A8])
> > > > > >          (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > > >       (nil))
> > > > > > 
> > > > > > The new optab gives us
> > > > > > 
> > > > > > (insn 12 11 13 2 (set (reg:TI 88)
> > > > > >          (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > > >       (nil))
> > > > > > (insn 13 12 14 2 (set (reg:SI 89)
> > > > > >          (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > > >       (nil))
> > > > > > (insn 14 13 15 2 (set (reg:HI 87)
> > > > > >          (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > > >       (nil))
> > > > > that looks odd to me - what's the final result after LRA?  I think
> > > > I got:
> > > > 
> > > > vmovd %edi, %xmm15
> > > > movq ops(%rip), %rdx
> > > > vpbroadcastb %xmm15, %ymm15
> > > > vmovq %xmm15, %rax    <<<< move to GPR
> > > > vmovdqu %ymm15, (%rdx)
> > > > movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > vzeroupper
> > > > ret
> > > > 
> > > > > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > > Only two steps should be necessary at most:
> > > > > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > > > code in memset should try to generate the subreg directly
> > > > subreg didn't fail on x86 when I tried.
> > > > 
> > > > > and if that fails, try a word_mode subreg followed by the subreg.
> > > > I will try word_mode subreg.
> > > > 
> > > Here is the v2 patch to use word_mode subreg.  For
> > > 
> > > ---
> > > extern void *ops;
> > > 
> > > void
> > > foo (int c)
> > > {
> > >    __builtin_memset (ops, 4, 32);
> > > }
> > > ---
> > > 
> > > without vec_const_duplicate, I got
> > > 
> > > 	movl	$4, %eax
> > > 	movq	ops(%rip), %rdx
> > > 	movd	%eax, %xmm0
> > > 	punpcklbw	%xmm0, %xmm0
> > > 	punpcklwd	%xmm0, %xmm0
> > > 	pshufd	$0, %xmm0, %xmm0
> > > 	movups	%xmm0, (%rdx)
> > > 	movups	%xmm0, 16(%rdx)
> > > 	ret
> > > 
> > > With vec_const_duplicate, I got
> > > 
> > > 	movq	ops(%rip), %rax
> > > 	movdqa	.LC0(%rip), %xmm0
> > > 	movups	%xmm0, (%rax)
> > > 	movups	%xmm0, 16(%rax)
> > > 	ret
> > > 
> > > If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > I don't understand why we need an optab for this though.  If the operand
> > is constant then we should just be doing an ordinary move in which the
> > source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > duplicated constants as efficiently as possible.  (Sorry if this was
> > discussed upthread and I missed it.)
> That's exactly the point I'm trying to get across as well.
> 

This is what we do today.  But I'd like to generate

	movl	$4, %eax
	vpbroadcastb	%eax, %ymm15
	movq	ops(%rip), %rax
	vmovdqu	%ymm15, (%rax)
	vzeroupper
	ret

instead of

	vmovdqa	.LC0(%rip), %ymm15
	movq	ops(%rip), %rax
	vmovdqu	%ymm15, (%rax)
	vzeroupper
	ret

Do I need a vec_dup pattern for it?

H.J.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01 13:05                               ` H.J. Lu
@ 2021-06-01 13:25                                 ` Richard Biener
  2021-06-01 13:29                                   ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-06-01 13:25 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jeff Law, H.J. Lu via Gcc-patches, Uros Bizjak, Bernd Edlinger,
	Richard Sandiford

On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> >
> >
> > On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > > On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > > <richard.guenther@gmail.com> wrote:
> > > > > > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > > > > > >   -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > > > > > >            MODE)
> > > > > > > > > > > > >       This function returns the RTL of a register containing
> > > > > > > > > > > > >       'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > > > > > >       value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > > > > > >       bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > > > > > >
> > > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > > > > > > Not sure what I should read from there...
> > > > > > > > > >
> > > > > > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > > > > > work here.   Each backend may need its own special handling.
> > > > > > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > > > > > >
> > > > > > > > > > Richard.
> > > > > > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > H.J.
> > > > > > > > > ---
> > > > > > > > > Add 2 optabs:
> > > > > > > > >
> > > > > > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > > > > > TImode, OImode or XImode.
> > > > > > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > > > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > > > > > existing target hooks verifying subreg validity - why's that not a good
> > > > > > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > > > > > simplify_gen_subreg?), why's that so?
> > > > > > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > > > > well on them.  I got
> > > > > > >
> > > > > > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > > > > extern void *ops;
> > > > > > >
> > > > > > > void
> > > > > > > foo (int c)
> > > > > > > {
> > > > > > >    __builtin_memset (ops, c, 34);
> > > > > > > }
> > > > > > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > > > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > > > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > > > > -O2 -march=haswell -S s2.i
> > > > > > > during RTL pass: reload
> > > > > > > s2.i: In function ‘foo’:
> > > > > > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > > > > insns per insn achieved (90)
> > > > > > >      7 | }
> > > > > > >        | ^
> > > > > > > 0x1050734 lra_constraints(bool)
> > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > > > > 0x1039536 lra(_IO_FILE*)
> > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > > > > 0xfe1140 do_reload
> > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > > > > 0xfe162e execute
> > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > > > > Please submit a full bug report,
> > > > > > > with preprocessed source if appropriate.
> > > > > > > Please include the complete backtrace with any bug report.
> > > > > > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > > > > > make: *** [Makefile:32: s2.s] Error 1
> > > > > > > [hjl@gnu-cfl-2 pieces]$
> > > > > > >
> > > > > > > due to
> > > > > > >
> > > > > > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > > > >                  (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > > > > *)ops.0_1]+32 S2 A8])
> > > > > > >          (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > > > >       (nil))
> > > > > > >
> > > > > > > The new optab gives us
> > > > > > >
> > > > > > > (insn 12 11 13 2 (set (reg:TI 88)
> > > > > > >          (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > > > >       (nil))
> > > > > > > (insn 13 12 14 2 (set (reg:SI 89)
> > > > > > >          (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > > > >       (nil))
> > > > > > > (insn 14 13 15 2 (set (reg:HI 87)
> > > > > > >          (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > > > >       (nil))
> > > > > > that looks odd to me - what's the final result after LRA?  I think
> > > > > I got:
> > > > >
> > > > > vmovd %edi, %xmm15
> > > > > movq ops(%rip), %rdx
> > > > > vpbroadcastb %xmm15, %ymm15
> > > > > vmovq %xmm15, %rax    <<<< move to GPR
> > > > > vmovdqu %ymm15, (%rdx)
> > > > > movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > > vzeroupper
> > > > > ret
> > > > >
> > > > > > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > > > Only two steps should be necessary at most:
> > > > > > > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > > > > code in memset should try to generate the subreg directly
> > > > > subreg didn't fail on x86 when I tried.
> > > > >
> > > > > > and if that fails, try a word_mode subreg followed by the subreg.
> > > > > I will try word_mode subreg.
> > > > >
> > > > Here is the v2 patch to use word_mode subreg.  For
> > > >
> > > > ---
> > > > extern void *ops;
> > > >
> > > > void
> > > > foo (int c)
> > > > {
> > > >    __builtin_memset (ops, 4, 32);
> > > > }
> > > > ---
> > > >
> > > > without vec_const_duplicate, I got
> > > >
> > > >   movl    $4, %eax
> > > >   movq    ops(%rip), %rdx
> > > >   movd    %eax, %xmm0
> > > >   punpcklbw       %xmm0, %xmm0
> > > >   punpcklwd       %xmm0, %xmm0
> > > >   pshufd  $0, %xmm0, %xmm0
> > > >   movups  %xmm0, (%rdx)
> > > >   movups  %xmm0, 16(%rdx)
> > > >   ret
> > > >
> > > > With vec_const_duplicate, I got
> > > >
> > > >   movq    ops(%rip), %rax
> > > >   movdqa  .LC0(%rip), %xmm0
> > > >   movups  %xmm0, (%rax)
> > > >   movups  %xmm0, 16(%rax)
> > > >   ret
> > > >
> > > > If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > I don't understand why we need an optab for this though.  If the operand
> > > is constant then we should just be doing an ordinary move in which the
> > > source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > duplicated constants as efficiently as possible.  (Sorry if this was
> > > discussed upthread and I missed it.)
> > That's exactly the point I'm trying to get across as well.
> >
>
> This is what we do today.  But I'd like to generate
>
>         movl    $4, %eax
>         vpbroadcastb    %eax, %ymm15
>         movq    ops(%rip), %rax
>         vmovdqu %ymm15, (%rax)
>         vzeroupper
>         ret
>
> instead of
>
>         vmovdqa .LC0(%rip), %ymm15
>         movq    ops(%rip), %rax
>         vmovdqu %ymm15, (%rax)
>         vzeroupper
>         ret
>
> Do I need a vec_dup pattern for it?

I think we have special code sequences to materialize some
constant vectors already, we should be able to add to that, no?

Richard.

>
> H.J.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01 13:25                                 ` Richard Biener
@ 2021-06-01 13:29                                   ` H.J. Lu
  2021-06-01 14:21                                     ` Jeff Law
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-06-01 13:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, H.J. Lu via Gcc-patches, Uros Bizjak, Bernd Edlinger,
	Richard Sandiford

On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > >
> > >
> > > On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > > On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > > > On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > > > > > > >   -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > > > > > > >            MODE)
> > > > > > > > > > > > > >       This function returns the RTL of a register containing
> > > > > > > > > > > > > >       'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > > > > > > >       value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > > > > > > >       bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > > > > > > >
> > > > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > > > > > > > Not sure what I should read from there...
> > > > > > > > > > >
> > > > > > > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > > > > > > work here.   Each backend may need its own special handling.
> > > > > > > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > > > > > > >
> > > > > > > > > > > Richard.
> > > > > > > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > H.J.
> > > > > > > > > > ---
> > > > > > > > > > Add 2 optabs:
> > > > > > > > > >
> > > > > > > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > > > > > > TImode, OImode or XImode.
> > > > > > > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > > > > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > > > > > > existing target hooks verifying subreg validity - why's that not a good
> > > > > > > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > > > > > > simplify_gen_subreg?), why's that so?
> > > > > > > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > > > > > well on them.  I got
> > > > > > > >
> > > > > > > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > > > > > extern void *ops;
> > > > > > > >
> > > > > > > > void
> > > > > > > > foo (int c)
> > > > > > > > {
> > > > > > > >    __builtin_memset (ops, c, 34);
> > > > > > > > }
> > > > > > > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > > > > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > > > > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > > > > > -O2 -march=haswell -S s2.i
> > > > > > > > during RTL pass: reload
> > > > > > > > s2.i: In function ‘foo’:
> > > > > > > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > > > > > insns per insn achieved (90)
> > > > > > > >      7 | }
> > > > > > > >        | ^
> > > > > > > > 0x1050734 lra_constraints(bool)
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > > > > > 0x1039536 lra(_IO_FILE*)
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > > > > > 0xfe1140 do_reload
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > > > > > 0xfe162e execute
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > > > > > Please submit a full bug report,
> > > > > > > > with preprocessed source if appropriate.
> > > > > > > > Please include the complete backtrace with any bug report.
> > > > > > > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > > > > > > make: *** [Makefile:32: s2.s] Error 1
> > > > > > > > [hjl@gnu-cfl-2 pieces]$
> > > > > > > >
> > > > > > > > due to
> > > > > > > >
> > > > > > > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > > > > >                  (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > > > > > *)ops.0_1]+32 S2 A8])
> > > > > > > >          (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > >
> > > > > > > > The new optab gives us
> > > > > > > >
> > > > > > > > (insn 12 11 13 2 (set (reg:TI 88)
> > > > > > > >          (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > > (insn 13 12 14 2 (set (reg:SI 89)
> > > > > > > >          (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > > (insn 14 13 15 2 (set (reg:HI 87)
> > > > > > > >          (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > that looks odd to me - what's the final result after LRA?  I think
> > > > > > I got:
> > > > > >
> > > > > > vmovd %edi, %xmm15
> > > > > > movq ops(%rip), %rdx
> > > > > > vpbroadcastb %xmm15, %ymm15
> > > > > > vmovq %xmm15, %rax    <<<< move to GPR
> > > > > > vmovdqu %ymm15, (%rdx)
> > > > > > movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > > > vzeroupper
> > > > > > ret
> > > > > >
> > > > > > > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > > > > Only two steps should be necessary at most:
> > > > > > > > xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > > > > > code in memset should try to generate the subreg directly
> > > > > > subreg didn't fail on x86 when I tried.
> > > > > >
> > > > > > > and if that fails, try a word_mode subreg followed by the subreg.
> > > > > > I will try word_mode subreg.
> > > > > >
> > > > > Here is the v2 patch to use word_mode subreg.  For
> > > > >
> > > > > ---
> > > > > extern void *ops;
> > > > >
> > > > > void
> > > > > foo (int c)
> > > > > {
> > > > >    __builtin_memset (ops, 4, 32);
> > > > > }
> > > > > ---
> > > > >
> > > > > without vec_const_duplicate, I got
> > > > >
> > > > >   movl    $4, %eax
> > > > >   movq    ops(%rip), %rdx
> > > > >   movd    %eax, %xmm0
> > > > >   punpcklbw       %xmm0, %xmm0
> > > > >   punpcklwd       %xmm0, %xmm0
> > > > >   pshufd  $0, %xmm0, %xmm0
> > > > >   movups  %xmm0, (%rdx)
> > > > >   movups  %xmm0, 16(%rdx)
> > > > >   ret
> > > > >
> > > > > With vec_const_duplicate, I got
> > > > >
> > > > >   movq    ops(%rip), %rax
> > > > >   movdqa  .LC0(%rip), %xmm0
> > > > >   movups  %xmm0, (%rax)
> > > > >   movups  %xmm0, 16(%rax)
> > > > >   ret
> > > > >
> > > > > If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > > I don't understand why we need an optab for this though.  If the operand
> > > > is constant then we should just be doing an ordinary move in which the
> > > > source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > > duplicated constants as efficiently as possible.  (Sorry if this was
> > > > discussed upthread and I missed it.)
> > > That's exactly the point I'm trying to get across as well.
> > >
> >
> > This is what we do today.  But I'd like to generate
> >
> >         movl    $4, %eax
> >         vpbroadcastb    %eax, %ymm15
> >         movq    ops(%rip), %rax
> >         vmovdqu %ymm15, (%rax)
> >         vzeroupper
> >         ret
> >
> > instead of
> >
> >         vmovdqa .LC0(%rip), %ymm15
> >         movq    ops(%rip), %rax
> >         vmovdqu %ymm15, (%rax)
> >         vzeroupper
> >         ret
> >
> > Do I need a vec_dup pattern for it?
>
> I think we have special code sequences to materialize some
> constant vectors already, we should be able to add to that, no?

We can do that for all 0s and all 1s at the final codegen.   For
other values, since we need a GPR, we can't do that.
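
To make the distinction concrete, here is a minimal C sketch (not GCC
code; broadcast_byte, classify_memset_byte and the enum are names made
up for this illustration).  It replicates a byte with the 0x0101...01
multiplication trick and sorts memset values into the two cases that
need no source register and the general case that does:

```c
#include <stdint.h>

/* Replicate VALUE into every byte of a 64-bit word.  This is the
   scalar analogue of a byte broadcast: VALUE * 0x0101010101010101.  */
static uint64_t
broadcast_byte (unsigned char value)
{
  return (uint64_t) value * UINT64_C (0x0101010101010101);
}

/* Hypothetical classification of a memset byte, for illustration only.  */
enum memset_const_kind
{
  ALL_ZEROS,        /* Can be built from nothing, e.g. with an xor.  */
  ALL_ONES,         /* Can be built from nothing, e.g. with a compare-equal.  */
  NEEDS_BROADCAST   /* The byte must first be placed in a register.  */
};

static enum memset_const_kind
classify_memset_byte (unsigned char value)
{
  if (value == 0)
    return ALL_ZEROS;
  if (value == 0xff)
    return ALL_ONES;
  return NEEDS_BROADCAST;
}
```

For the memset (ops, 4, 32) example above, the value 4 falls into the
NEEDS_BROADCAST case.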

-- 
H.J.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01 13:29                                   ` H.J. Lu
@ 2021-06-01 14:21                                     ` Jeff Law
  2021-06-01 23:07                                       ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Law @ 2021-06-01 14:21 UTC (permalink / raw)
  To: H.J. Lu, Richard Biener
  Cc: H.J. Lu via Gcc-patches, Jeff Law, Bernd Edlinger, Richard Sandiford



On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
>>>>
>>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
>>>>> I don't understand why we need an optab for this though.  If the operand
>>>>> is constant then we should just be doing an ordinary move in which the
>>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
>>>>> duplicated constants as efficiently as possible.  (Sorry if this was
>>>>> discussed upthread and I missed it.)
>>>> That's exactly the point I'm trying to get across as well.
>>>>
>>> This is what we do today.  But I'd like to generate
>>>
>>>          movl    $4, %eax
>>>          vpbroadcastb    %eax, %ymm15
>>>          movq    ops(%rip), %rax
>>>          vmovdqu %ymm15, (%rax)
>>>          vzeroupper
>>>          ret
>>>
>>> instead of
>>>
>>>          vmovdqa .LC0(%rip), %ymm15
>>>          movq    ops(%rip), %rax
>>>          vmovdqu %ymm15, (%rax)
>>>          vzeroupper
>>>          ret
>>>
>>> Do I need a vec_dup pattern for it?
>> I think we have special code sequences to materialize some
>> constant vectors already, we should be able to add to that, no?
> We can do that for all 0s and all 1s at the final codegen.   For
> other values, since we need a GPR, we can't do that.
You can catch them in your movxx expanders, you can create peep2 
patterns that use available GPRs, etc.  I don't see a fundamental need 
to introduce new target macros or hooks to handle this stuff.  In 
fact I've done both to handle a closely related issue on our port.
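
As a rough illustration of the kind of check such an expander would
start from, here is a small C sketch (plain C, not GCC internals;
constant_is_byte_splat is a hypothetical name).  It tests whether a
constant is one byte repeated, which is the case where a broadcast
could replace a load from the constant pool:

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if all N bytes of BUF are equal and store the repeated
   byte in *BYTE.  Return false for an empty buffer or for any buffer
   whose bytes differ; *BYTE is left untouched in that case.  */
static bool
constant_is_byte_splat (const unsigned char *buf, size_t n,
                        unsigned char *byte)
{
  if (n == 0)
    return false;

  for (size_t i = 1; i < n; i++)
    if (buf[i] != buf[0])
      return false;

  *byte = buf[0];
  return true;
}
```

A constant that passes this test is a candidate for the broadcast
sequence; anything else keeps the ordinary constant move.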

jeff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01 14:21                                     ` Jeff Law
@ 2021-06-01 23:07                                       ` H.J. Lu
  2021-06-02  1:21                                         ` Hongtao Liu
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-06-01 23:07 UTC (permalink / raw)
  To: Jeff Law
  Cc: Richard Biener, H.J. Lu via Gcc-patches, Jeff Law,
	Bernd Edlinger, Richard Sandiford

On Tue, Jun 1, 2021 at 7:21 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> >>>>
> >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> >>>>> I don't understand why we need an optab for this though.  If the operand
> >>>>> is constant then we should just be doing an ordinary move in which the
> >>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> >>>>> duplicated constants as efficiently as possible.  (Sorry if this was
> >>>>> discussed upthread and I missed it.)
> >>>> That's exactly the point I'm trying to get across as well.
> >>>>
> >>> This is what we do today.  But I'd like to generate
> >>>
> >>>          movl    $4, %eax
> >>>          vpbroadcastb    %eax, %ymm15
> >>>          movq    ops(%rip), %rax
> >>>          vmovdqu %ymm15, (%rax)
> >>>          vzeroupper
> >>>          ret
> >>>
> >>> instead of
> >>>
> >>>          vmovdqa .LC0(%rip), %ymm15
> >>>          movq    ops(%rip), %rax
> >>>          vmovdqu %ymm15, (%rax)
> >>>          vzeroupper
> >>>          ret
> >>>
> >>> Do I need a vec_dup pattern for it?
> >> I think we have special code sequences to materialize some
> >> constant vectors already, we should be able to add to that, no?
> > We can do that for all 0s and all 1s at the final codegen.   For
> > other values, since we need a GPR, we can't do that.
> You can catch them in your movxx expanders, you can create peep2
> patterns that use available GPRs, etc.  I don't see a fundamental need
> to introduce new target macros or hooks to handle this stuff.  In
> fact I've done both to handle a closely related issue on our port.
>

One problem with expanding TI/OI/XI moves to a broadcast is that other
RTL passes may change it.  For example, the expander generates:

(insn 7 5 6 (set (reg:QI 85)
        (const_int 12 [0xc]))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
-1
     (nil))

(insn 6 7 8 (set (reg:V16QI 84)
        (vec_duplicate:V16QI (reg:QI 85)))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
5103 {*avx512vl_vec_dup_gprv16qi}
     (nil))

(insn 8 6 9 (set (subreg:V16QI (reg:TI 86) 0)
        (reg:V16QI 84))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
-1
     (nil))

(insn 9 8 10 (set (mem:TI (reg/f:DI 83) [0 MEM <char[1:19]> [(void
*)dst.0_1]+0 S16 A8])
        (reg:TI 86))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
-1
     (nil))

combine turns it into:

(insn 9 6 10 2 (set (mem:TI (reg/f:DI 83 [ dst ]) [0 MEM <char[1:19]>
[(void *)dst.0_1]+0 S16 A8])
        (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
73 {*movti_internal}
     (nil))

LRA tries:

(insn 14 15 16 2 (set (reg:V16QI 20 xmm0 [89])
        (vec_duplicate:V16QI (reg:QI 4 si [90])))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
5103 {*avx512vl_vec_dup_gprv16qi}
     (nil))
(insn 16 14 9 2 (set (reg:V16QI 1 dx)
        (reg:V16QI 20 xmm0 [89]))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
1527 {movv16qi_internal}
     (nil))
(insn 9 16 10 2 (set (mem:TI (reg/f:DI 0 ax [orig:83 dst ] [83]) [0
MEM <char[1:19]> [(void *)dst.0_1]+0 S16 A8])
        (reg:TI 1 dx [88]))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
73 {*movti_internal}

and fails:

/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:
In function ‘foo’:
/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:10:1:
error: insn does not satisfy its constraints:
(insn 16 14 9 2 (set (reg:V16QI 1 dx)
        (reg:V16QI 20 xmm0 [89]))
"/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
1527 {movv16qi_internal}
     (nil))

I want to hide

(const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c)

from RTL passes.
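
For reference, the constant above is just the byte 0x0c replicated across
all 16 bytes of a TImode value.  It is the same replication that the
TARGET_GEN_MEMSET_VALUE documentation quoted in this thread describes as
0x01010101*DATA for a 4-byte mode.  A minimal C sketch of that arithmetic
follows.  The function names are hypothetical and are not part of GCC or
of this patch, and the 16-byte value is modelled as two 64-bit halves
purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte across a 32-bit word: 0x01010101 * DATA.
   Hypothetical helper, for illustration only.  */
uint32_t
broadcast_byte32 (unsigned char data)
{
  return UINT32_C (0x01010101) * data;
}

/* The same replication for a 64-bit word.  */
uint64_t
broadcast_byte64 (unsigned char data)
{
  return UINT64_C (0x0101010101010101) * data;
}

/* Return 1 if storing two broadcast 64-bit halves produces the same
   16 bytes as memset, 0 otherwise.  The two halves stand in for a
   TImode value here; that modelling is an assumption of this sketch.  */
int
broadcast_matches_memset (unsigned char data)
{
  unsigned char by_store[16], by_memset[16];
  uint64_t half = broadcast_byte64 (data);

  memcpy (by_store, &half, sizeof half);
  memcpy (by_store + 8, &half, sizeof half);
  memset (by_memset, data, sizeof by_memset);
  return memcmp (by_store, by_memset, sizeof by_store) == 0;
}

/* Check the value from the RTL dump: the byte 12 (0x0c) broadcast.  */
void
check_pr90773_constant (void)
{
  assert (broadcast_byte64 (12) == UINT64_C (0x0c0c0c0c0c0c0c0c));
  assert (broadcast_matches_memset (12));
}
```

Calling check_pr90773_constant () from a test driver exercises the 0x0c
case.  Since every byte of the result is identical, the outcome does not
depend on host byte order.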


-- 
H.J.


* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-01 23:07                                       ` H.J. Lu
@ 2021-06-02  1:21                                         ` Hongtao Liu
  2021-06-02  1:54                                           ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Hongtao Liu @ 2021-06-02  1:21 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jeff Law, Jeff Law, Bernd Edlinger, H.J. Lu via Gcc-patches,
	Richard Sandiford

On Wed, Jun 2, 2021 at 7:07 AM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Jun 1, 2021 at 7:21 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> >
> >
> >
> > On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> > > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > >>>>
> > >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > >>>>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > >>>>>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > >>>>>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > >>>>>>> <richard.guenther@gmail.com> wrote:
> > >>>>>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >>>>>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > >>>>>>>>> <richard.guenther@gmail.com> wrote:
> > >>>>>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >>>>>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > >>>>>>>>>>>>>>>    -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > >>>>>>>>>>>>>>>             MODE)
> > >>>>>>>>>>>>>>>        This function returns the RTL of a register containing
> > >>>>>>>>>>>>>>>        'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > >>>>>>>>>>>>>>>        value given in the RTL register DATA.  For example, if MODE is 4
> > >>>>>>>>>>>>>>>        bytes wide, return the RTL for 0x01010101*DATA.
> > >>>>>>>>>>>>>> For this one I wonder if it should be an optab instead.  Couldn't you
> > >>>>>>>>>>>>>> use the existing vec_duplicate for this by using (paradoxical) subregs
> > >>>>>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > >>>>>>>>>>>>> I tried.   It doesn't even work on x86.  See:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > >>>>>>>>>>>> Not sure what I should read from there...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> There are special cases to subreg HI, SI and DI modes of TI mode in
> > >>>>>>>>>>>>> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > >>>>>>>>>>>>> work here.   Each backend may need its own special handling.
> > >>>>>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > >>>>>>>>>>>> sorry.  Since we're doing code generation the canonical way to communicate
> > >>>>>>>>>>>> with backends should be optabs, not some set of disconnected target hooks.
> > >>>>>>>>>>>> But as said, I probably don't know enough of RTL to see why it's the only way.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Richard.
> > >>>>>>>>>>> Here is the patch to add optabs instead.  Does it look OK?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks.
> > >>>>>>>>>>>
> > >>>>>>>>>>> H.J.
> > >>>>>>>>>>> ---
> > >>>>>>>>>>> Add 2 optabs:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. integer_extract: Extract lower bit value from the integer value in
> > >>>>>>>>>>> TImode, OImode or XImode.
> > >>>>>>>>>> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > >>>>>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > >>>>>>>>>> existing target hooks verifying subreg validity - why's that not a good
> > >>>>>>>>>> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > >>>>>>>>>> simplify_gen_subreg?), why's that so?
> > >>>>>>>>> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > >>>>>>>>> well on them.  I got
> > >>>>>>>>>
> > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > >>>>>>>>> extern void *ops;
> > >>>>>>>>>
> > >>>>>>>>> void
> > >>>>>>>>> foo (int c)
> > >>>>>>>>> {
> > >>>>>>>>>     __builtin_memset (ops, c, 34);
> > >>>>>>>>> }
> > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s
> > >>>>>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > >>>>>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > >>>>>>>>> -O2 -march=haswell -S s2.i
> > >>>>>>>>> during RTL pass: reload
> > >>>>>>>>> s2.i: In function ‘foo’:
> > >>>>>>>>> s2.i:7:1: internal compiler error: maximum number of generated reload
> > >>>>>>>>> insns per insn achieved (90)
> > >>>>>>>>>       7 | }
> > >>>>>>>>>         | ^
> > >>>>>>>>> 0x1050734 lra_constraints(bool)
> > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > >>>>>>>>> 0x1039536 lra(_IO_FILE*)
> > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > >>>>>>>>> 0xfe1140 do_reload
> > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > >>>>>>>>> 0xfe162e execute
> > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > >>>>>>>>> Please submit a full bug report,
> > >>>>>>>>> with preprocessed source if appropriate.
> > >>>>>>>>> Please include the complete backtrace with any bug report.
> > >>>>>>>>> See <https://gcc.gnu.org/bugs/> for instructions.
> > >>>>>>>>> make: *** [Makefile:32: s2.s] Error 1
> > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$
> > >>>>>>>>>
> > >>>>>>>>> due to
> > >>>>>>>>>
> > >>>>>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > >>>>>>>>>                   (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > >>>>>>>>> *)ops.0_1]+32 S2 A8])
> > >>>>>>>>>           (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > >>>>>>>>>        (nil))
> > >>>>>>>>>
> > >>>>>>>>> The new optab gives us
> > >>>>>>>>>
> > >>>>>>>>> (insn 12 11 13 2 (set (reg:TI 88)
> > >>>>>>>>>           (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > >>>>>>>>>        (nil))
> > >>>>>>>>> (insn 13 12 14 2 (set (reg:SI 89)
> > >>>>>>>>>           (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > >>>>>>>>>        (nil))
> > >>>>>>>>> (insn 14 13 15 2 (set (reg:HI 87)
> > >>>>>>>>>           (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > >>>>>>>>>        (nil))
> > >>>>>>>> that looks odd to me - what's the final result after LRA?  I think
> > >>>>>>> I got:
> > >>>>>>>
> > >>>>>>> vmovd %edi, %xmm15
> > >>>>>>> movq ops(%rip), %rdx
> > >>>>>>> vpbroadcastb %xmm15, %ymm15
> > >>>>>>> vmovq %xmm15, %rax    <<<< move to GPR
> > >>>>>>> vmovdqu %ymm15, (%rdx)
> > >>>>>>> movw %ax, 32(%rdx)   <<<< subreg of GPR
> > >>>>>>> vzeroupper
> > >>>>>>> ret
> > >>>>>>>
> > >>>>>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > >>>>>>>> Only two steps should be necessary at most:
> > >>>>>>>> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > >>>>>>>> code in memset should try to generate the subreg directly
> > >>>>>>> subreg didn't fail on x86 when I tried.
> > >>>>>>>
> > >>>>>>>> and if that fails, try a word_mode subreg followed by the subreg.
> > >>>>>>> I will try word_mode subreg.
> > >>>>>>>
> > >>>>>> Here is the v2 patch to use word_mode subreg.  For
> > >>>>>>
> > >>>>>> ---
> > >>>>>> extern void *ops;
> > >>>>>>
> > >>>>>> void
> > >>>>>> foo (int c)
> > >>>>>> {
> > >>>>>>     __builtin_memset (ops, 4, 32);
> > >>>>>> }
> > >>>>>> ---
> > >>>>>>
> > >>>>>> without vec_const_duplicate, I got
> > >>>>>>
> > >>>>>>    movl    $4, %eax
> > >>>>>>    movq    ops(%rip), %rdx
> > >>>>>>    movd    %eax, %xmm0
> > >>>>>>    punpcklbw       %xmm0, %xmm0
> > >>>>>>    punpcklwd       %xmm0, %xmm0
> > >>>>>>    pshufd  $0, %xmm0, %xmm0
> > >>>>>>    movups  %xmm0, (%rdx)
> > >>>>>>    movups  %xmm0, 16(%rdx)
> > >>>>>>    ret
> > >>>>>>
> > >>>>>> With vec_const_duplicate, I got
> > >>>>>>
> > >>>>>>    movq    ops(%rip), %rax
> > >>>>>>    movdqa  .LC0(%rip), %xmm0
> > >>>>>>    movups  %xmm0, (%rax)
> > >>>>>>    movups  %xmm0, 16(%rax)
> > >>>>>>    ret
> > >>>>>>
> > >>>>>> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > >>>>> I don't understand why we need an optab for this though.  If the operand
> > >>>>> is constant then we should just be doing an ordinary move in which the
> > >>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > >>>>> duplicated constants as efficiently as possible.  (Sorry if this was
> > >>>>> discussed upthread and I missed it.)
> > >>>> That's exactly the point I'm trying to get across as well.
> > >>>>
> > >>> This is what we do today.  But I'd like to generate
> > >>>
> > >>>          movl    $4, %eax
> > >>>          vpbroadcastb    %eax, %ymm15
> > >>>          movq    ops(%rip), %rax
> > >>>          vmovdqu %ymm15, (%rax)
> > >>>          vzeroupper
> > >>>          ret
> > >>>
> > >>> instead of
> > >>>
> > >>>          vmovdqa .LC0(%rip), %ymm15
> > >>>          movq    ops(%rip), %rax
> > >>>          vmovdqu %ymm15, (%rax)
> > >>>          vzeroupper
> > >>>          ret
> > >>>
> > >>> Do I need a vec_dup pattern for it?
> > >> I think we have special code sequences to materialize some
> > >> constant vectors already, we should be able to add to that, no?
> > > We can do that for all 0s and all 1s at the final codegen.   For
> > > other values, since we need a GPR, we can't do that.
> > You can catch them in your movxx expanders, you can create peep2
> > patterns that use available GPRs, etc.  I don't see a fundamental need
> > to introduce new target macros or hooks to handle this stuff.  In
> > fact I've done both to handle a closely related issue on our port.
> >
>
> One problem of expanding TI/OI/XI moves to broadcast is that other
> RTL passes may change it.   For example, expander generates:
This could be handled in pass_data_constant_pool_broadcast, which is
designed for AVX512 embedded broadcast but can also do this kind of
transformation.

See
https://godbolt.org/z/8YGzqf938
>
> (insn 7 5 6 (set (reg:QI 85)
>         (const_int 12 [0xc]))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> -1
>      (nil))
>
> (insn 6 7 8 (set (reg:V16QI 84)
>         (vec_duplicate:V16QI (reg:QI 85)))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 5103 {*avx512vl_vec_dup_gprv16qi}
>      (nil))
>
> (insn 8 6 9 (set (subreg:V16QI (reg:TI 86) 0)
>         (reg:V16QI 84))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> -1
>      (nil))
>
> (insn 9 8 10 (set (mem:TI (reg/f:DI 83) [0 MEM <char[1:19]> [(void
> *)dst.0_1]+0 S16 A8])
>         (reg:TI 86))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> -1
>      (nil))
>
> combine turns it into:
>
> (insn 9 6 10 2 (set (mem:TI (reg/f:DI 83 [ dst ]) [0 MEM <char[1:19]>
> [(void *)dst.0_1]+0 S16 A8])
>         (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 73 {*movti_internal}
>      (nil))
>
> LRA tries:
>
> (insn 14 15 16 2 (set (reg:V16QI 20 xmm0 [89])
>         (vec_duplicate:V16QI (reg:QI 4 si [90])))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 5103 {*avx512vl_vec_dup_gprv16qi}
>      (nil))
> (insn 16 14 9 2 (set (reg:V16QI 1 dx)
>         (reg:V16QI 20 xmm0 [89]))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 1527 {movv16qi_internal}
>      (nil))
> (insn 9 16 10 2 (set (mem:TI (reg/f:DI 0 ax [orig:83 dst ] [83]) [0
> MEM <char[1:19]> [(void *)dst.0_1]+0 S16 A8])
>         (reg:TI 1 dx [88]))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 73 {*movti_internal}
>
> and fails:
>
> /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:
> In function ‘foo’:
> /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:10:1:
> error: insn does not satisfy its constraints:
> (insn 16 14 9 2 (set (reg:V16QI 1 dx)
>         (reg:V16QI 20 xmm0 [89]))
> "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> 1527 {movv16qi_internal}
>      (nil))
>
> I want to hide
>
> (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c)
>
> from RTL passes.
>
>
> --
> H.J.



-- 
BR,
Hongtao


* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-02  1:21                                         ` Hongtao Liu
@ 2021-06-02  1:54                                           ` H.J. Lu
  2021-06-02  7:02                                             ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2021-06-02  1:54 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Jeff Law, Jeff Law, Bernd Edlinger, H.J. Lu via Gcc-patches,
	Richard Sandiford

On Tue, Jun 1, 2021 at 6:17 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Jun 2, 2021 at 7:07 AM H.J. Lu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Tue, Jun 1, 2021 at 7:21 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> > >
> > >
> > >
> > > On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> > > > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > > >>>>
> > > >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > >>>>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > >>>>>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > >>>>>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > >>>>>>> <richard.guenther@gmail.com> wrote:
> > > >>>>>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >>>>>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > >>>>>>>>> <richard.guenther@gmail.com> wrote:
> > > >>>>>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >>>>>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > >>>>>>>>>>>>>>>    -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > >>>>>>>>>>>>>>>             MODE)
> > > >>>>>>>>>>>>>>>        This function returns the RTL of a register containing
> > > >>>>>>>>>>>>>>>        'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > >>>>>>>>>>>>>>>        value given in the RTL register DATA.  For example, if MODE is 4
> > > >>>>>>>>>>>>>>>        bytes wide, return the RTL for 0x01010101*DATA.
> > > >>>>>>>>>>>>>> For this one I wonder if it should be an optab instead.  Couldn't you
> > > >>>>>>>>>>>>>> use the existing vec_duplicate for this by using (paradoxical) subregs
> > > >>>>>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > >>>>>>>>>>>>> I tried.   It doesn't even work on x86.  See:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > >>>>>>>>>>>> Not sure what I should read from there...
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> There are special cases to subreg HI, SI and DI modes of TI mode in
> > > >>>>>>>>>>>>> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > >>>>>>>>>>>>> work here.   Each backend may need its own special handling.
> > > >>>>>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > >>>>>>>>>>>> sorry.  Since we're doing code generation the canonical way to communicate
> > > >>>>>>>>>>>> with backends should be optabs, not some set of disconnected target hooks.
> > > >>>>>>>>>>>> But as said, I probably don't know enough of RTL to see why it's the only way.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Richard.
> > > >>>>>>>>>>> Here is the patch to add optabs instead.  Does it look OK?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> H.J.
> > > >>>>>>>>>>> ---
> > > >>>>>>>>>>> Add 2 optabs:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1. integer_extract: Extract lower bit value from the integer value in
> > > >>>>>>>>>>> TImode, OImode or XImode.
> > > >>>>>>>>>> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > >>>>>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > >>>>>>>>>> existing target hooks verifying subreg validity - why's that not a good
> > > >>>>>>>>>> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > >>>>>>>>>> simplify_gen_subreg?), why's that so?
> > > >>>>>>>>> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > >>>>>>>>> well on them.  I got
> > > >>>>>>>>>
> > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > >>>>>>>>> extern void *ops;
> > > >>>>>>>>>
> > > >>>>>>>>> void
> > > >>>>>>>>> foo (int c)
> > > >>>>>>>>> {
> > > >>>>>>>>>     __builtin_memset (ops, c, 34);
> > > >>>>>>>>> }
> > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > >>>>>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > >>>>>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > >>>>>>>>> -O2 -march=haswell -S s2.i
> > > >>>>>>>>> during RTL pass: reload
> > > >>>>>>>>> s2.i: In function ‘foo’:
> > > >>>>>>>>> s2.i:7:1: internal compiler error: maximum number of generated reload
> > > >>>>>>>>> insns per insn achieved (90)
> > > >>>>>>>>>       7 | }
> > > >>>>>>>>>         | ^
> > > >>>>>>>>> 0x1050734 lra_constraints(bool)
> > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > >>>>>>>>> 0x1039536 lra(_IO_FILE*)
> > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > >>>>>>>>> 0xfe1140 do_reload
> > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > >>>>>>>>> 0xfe162e execute
> > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > >>>>>>>>> Please submit a full bug report,
> > > >>>>>>>>> with preprocessed source if appropriate.
> > > >>>>>>>>> Please include the complete backtrace with any bug report.
> > > >>>>>>>>> See <https://gcc.gnu.org/bugs/> for instructions.
> > > >>>>>>>>> make: *** [Makefile:32: s2.s] Error 1
> > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$
> > > >>>>>>>>>
> > > >>>>>>>>> due to
> > > >>>>>>>>>
> > > >>>>>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > >>>>>>>>>                   (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > >>>>>>>>> *)ops.0_1]+32 S2 A8])
> > > >>>>>>>>>           (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > >>>>>>>>>        (nil))
> > > >>>>>>>>>
> > > >>>>>>>>> The new optab gives us
> > > >>>>>>>>>
> > > >>>>>>>>> (insn 12 11 13 2 (set (reg:TI 88)
> > > >>>>>>>>>           (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > >>>>>>>>>        (nil))
> > > >>>>>>>>> (insn 13 12 14 2 (set (reg:SI 89)
> > > >>>>>>>>>           (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > >>>>>>>>>        (nil))
> > > >>>>>>>>> (insn 14 13 15 2 (set (reg:HI 87)
> > > >>>>>>>>>           (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > >>>>>>>>>        (nil))
> > > >>>>>>>> that looks odd to me - what's the final result after LRA?  I think
> > > >>>>>>> I got:
> > > >>>>>>>
> > > >>>>>>> vmovd %edi, %xmm15
> > > >>>>>>> movq ops(%rip), %rdx
> > > >>>>>>> vpbroadcastb %xmm15, %ymm15
> > > >>>>>>> vmovq %xmm15, %rax    <<<< move to GPR
> > > >>>>>>> vmovdqu %ymm15, (%rdx)
> > > >>>>>>> movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > >>>>>>> vzeroupper
> > > >>>>>>> ret
> > > >>>>>>>
> > > >>>>>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > >>>>>>>> Only two steps should be necessary at most:
> > > >>>>>>>> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > >>>>>>>> code in memset should try to generate the subreg directly
> > > >>>>>>> subreg didn't fail on x86 when I tried.
> > > >>>>>>>
> > > >>>>>>>> and if that fails, try a word_mode subreg followed by the subreg.
> > > >>>>>>> I will try word_mode subreg.
> > > >>>>>>>
> > > >>>>>> Here is the v2 patch to use word_mode subreg.  For
> > > >>>>>>
> > > >>>>>> ---
> > > >>>>>> extern void *ops;
> > > >>>>>>
> > > >>>>>> void
> > > >>>>>> foo (int c)
> > > >>>>>> {
> > > >>>>>>     __builtin_memset (ops, 4, 32);
> > > >>>>>> }
> > > >>>>>> ---
> > > >>>>>>
> > > >>>>>> without vec_const_duplicate, I got
> > > >>>>>>
> > > >>>>>>    movl    $4, %eax
> > > >>>>>>    movq    ops(%rip), %rdx
> > > >>>>>>    movd    %eax, %xmm0
> > > >>>>>>    punpcklbw       %xmm0, %xmm0
> > > >>>>>>    punpcklwd       %xmm0, %xmm0
> > > >>>>>>    pshufd  $0, %xmm0, %xmm0
> > > >>>>>>    movups  %xmm0, (%rdx)
> > > >>>>>>    movups  %xmm0, 16(%rdx)
> > > >>>>>>    ret
> > > >>>>>>
> > > >>>>>> With vec_const_duplicate, I got
> > > >>>>>>
> > > >>>>>>    movq    ops(%rip), %rax
> > > >>>>>>    movdqa  .LC0(%rip), %xmm0
> > > >>>>>>    movups  %xmm0, (%rax)
> > > >>>>>>    movups  %xmm0, 16(%rax)
> > > >>>>>>    ret
> > > >>>>>>
> > > >>>>>> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > >>>>> I don't understand why we need an optab for this though.  If the operand
> > > >>>>> is constant then we should just be doing an ordinary move in which the
> > > >>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > >>>>> duplicated constants as efficiently as possible.  (Sorry if this was
> > > >>>>> discussed upthread and I missed it.)
> > > >>>> That's exactly the point I'm trying to get across as well.
> > > >>>>
> > > >>> This is what we do today.  But I'd like to generate
> > > >>>
> > > >>>          movl    $4, %eax
> > > >>>          vpbroadcastb    %eax, %ymm15
> > > >>>          movq    ops(%rip), %rax
> > > >>>          vmovdqu %ymm15, (%rax)
> > > >>>          vzeroupper
> > > >>>          ret
> > > >>>
> > > >>> instead of
> > > >>>
> > > >>>          vmovdqa .LC0(%rip), %ymm15
> > > >>>          movq    ops(%rip), %rax
> > > >>>          vmovdqu %ymm15, (%rax)
> > > >>>          vzeroupper
> > > >>>          ret
> > > >>>
> > > >>> Do I need a vec_dup pattern for it?
> > > >> I think we have special code sequences to materialize some
> > > >> constant vectors already, we should be able to add to that, no?
> > > > We can do that for all 0s and all 1s at the final codegen.   For
> > > > other values, since we need a GPR, we can't do that.
> > > You can catch them in your movxx expanders, you can create peep2
> > > patterns that use available GPRs, etc.  I don't see a fundamental need
> > > to introduce new target macros or hooks to handle this stuff.  In
> > > fact I've done both to handle a closely related issue on our port.
> > >
> >
> > One problem of expanding TI/OI/XI moves to broadcast is that other
> > RTL passes may change it.   For example, expander generates:
> It could be handled in pass_data_constant_pool_broadcast which is
> designed for avx512 embedding broadcast, but can also do such
> transforming.
>
> see
> https://godbolt.org/z/8YGzqf938

It sounds promising, but it doesn't work for TImode:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100865

> >
> > (insn 7 5 6 (set (reg:QI 85)
> >         (const_int 12 [0xc]))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > -1
> >      (nil))
> >
> > (insn 6 7 8 (set (reg:V16QI 84)
> >         (vec_duplicate:V16QI (reg:QI 85)))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 5103 {*avx512vl_vec_dup_gprv16qi}
> >      (nil))
> >
> > (insn 8 6 9 (set (subreg:V16QI (reg:TI 86) 0)
> >         (reg:V16QI 84))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > -1
> >      (nil))
> >
> > (insn 9 8 10 (set (mem:TI (reg/f:DI 83) [0 MEM <char[1:19]> [(void
> > *)dst.0_1]+0 S16 A8])
> >         (reg:TI 86))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > -1
> >      (nil))
> >
> > combine turns it into:
> >
> > (insn 9 6 10 2 (set (mem:TI (reg/f:DI 83 [ dst ]) [0 MEM <char[1:19]>
> > [(void *)dst.0_1]+0 S16 A8])
> >         (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 73 {*movti_internal}
> >      (nil))
> >
> > LRA tries:
> >
> > (insn 14 15 16 2 (set (reg:V16QI 20 xmm0 [89])
> >         (vec_duplicate:V16QI (reg:QI 4 si [90])))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 5103 {*avx512vl_vec_dup_gprv16qi}
> >      (nil))
> > (insn 16 14 9 2 (set (reg:V16QI 1 dx)
> >         (reg:V16QI 20 xmm0 [89]))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 1527 {movv16qi_internal}
> >      (nil))
> > (insn 9 16 10 2 (set (mem:TI (reg/f:DI 0 ax [orig:83 dst ] [83]) [0
> > MEM <char[1:19]> [(void *)dst.0_1]+0 S16 A8])
> >         (reg:TI 1 dx [88]))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 73 {*movti_internal}
> >
> > and fails:
> >
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:
> > In function ‘foo’:
> > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:10:1:
> > error: insn does not satisfy its constraints:
> > (insn 16 14 9 2 (set (reg:V16QI 1 dx)
> >         (reg:V16QI 20 xmm0 [89]))
> > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > 1527 {movv16qi_internal}
> >      (nil))
> >
> > I want to hide
> >
> > (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c)
> >
> > from RTL passes.
> >
> >
> > --
> > H.J.
>
>
>
> --
> BR,
> Hongtao



-- 
H.J.


* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-02  1:54                                           ` H.J. Lu
@ 2021-06-02  7:02                                             ` Richard Biener
  2021-06-02 13:50                                               ` H.J. Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2021-06-02  7:02 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Hongtao Liu, H.J. Lu via Gcc-patches, Bernd Edlinger, Jeff Law,
	Richard Sandiford

On Wed, Jun 2, 2021 at 3:57 AM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Jun 1, 2021 at 6:17 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Jun 2, 2021 at 7:07 AM H.J. Lu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Tue, Jun 1, 2021 at 7:21 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> > > > > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> > > > > <richard.guenther@gmail.com> wrote:
> > > > >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > > > >>>>
> > > > >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > > >>>>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > >>>>>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > >>>>>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > >>>>>>> <richard.guenther@gmail.com> wrote:
> > > > >>>>>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >>>>>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > >>>>>>>>> <richard.guenther@gmail.com> wrote:
> > > > >>>>>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >>>>>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > >>>>>>>>>>>>>>>    -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > >>>>>>>>>>>>>>>             MODE)
> > > > >>>>>>>>>>>>>>>        This function returns the RTL of a register containing
> > > > >>>>>>>>>>>>>>>        'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > >>>>>>>>>>>>>>>        value given in the RTL register DATA.  For example, if MODE is 4
> > > > >>>>>>>>>>>>>>>        bytes wide, return the RTL for 0x01010101*DATA.
> > > > >>>>>>>>>>>>>> For this one I wonder if it should be an optab instead.  Couldn't you
> > > > >>>>>>>>>>>>>> use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > >>>>>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > >>>>>>>>>>>>> I tried.   It doesn't even work on x86.  See:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > >>>>>>>>>>>> Not sure what I should read from there...
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > >>>>>>>>>>>>> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > >>>>>>>>>>>>> work here.   Each backend may need its own special handling.
> > > > >>>>>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > >>>>>>>>>>>> sorry.  Since we're doing code generation the canonical way to communicate
> > > > >>>>>>>>>>>> with backends should be optabs, not some set of disconnected target hooks.
> > > > >>>>>>>>>>>> But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Richard.
> > > > >>>>>>>>>>> Here is the patch to add optabs instead.  Does it look OK?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> H.J.
> > > > >>>>>>>>>>> ---
> > > > >>>>>>>>>>> Add 2 optabs:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> 1. integer_extract: Extract lower bit value from the integer value in
> > > > >>>>>>>>>>> TImode, OImode or XImode.
> > > > >>>>>>>>>> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > >>>>>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > >>>>>>>>>> existing target hooks verifying subreg validity - why's that not a good
> > > > >>>>>>>>>> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > >>>>>>>>>> simplify_gen_subreg?), why's that so?
> > > > >>>>>>>>> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > >>>>>>>>> well on them.  I got
> > > > >>>>>>>>>
> > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > >>>>>>>>> extern void *ops;
> > > > >>>>>>>>>
> > > > >>>>>>>>> void
> > > > >>>>>>>>> foo (int c)
> > > > >>>>>>>>> {
> > > > >>>>>>>>>     __builtin_memset (ops, c, 34);
> > > > >>>>>>>>> }
> > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > >>>>>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > >>>>>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > >>>>>>>>> -O2 -march=haswell -S s2.i
> > > > >>>>>>>>> during RTL pass: reload
> > > > >>>>>>>>> s2.i: In function ‘foo’:
> > > > >>>>>>>>> s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > >>>>>>>>> insns per insn achieved (90)
> > > > >>>>>>>>>       7 | }
> > > > >>>>>>>>>         | ^
> > > > >>>>>>>>> 0x1050734 lra_constraints(bool)
> > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > >>>>>>>>> 0x1039536 lra(_IO_FILE*)
> > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > >>>>>>>>> 0xfe1140 do_reload
> > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > >>>>>>>>> 0xfe162e execute
> > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > >>>>>>>>> Please submit a full bug report,
> > > > >>>>>>>>> with preprocessed source if appropriate.
> > > > >>>>>>>>> Please include the complete backtrace with any bug report.
> > > > >>>>>>>>> See <https://gcc.gnu.org/bugs/> for instructions.
> > > > >>>>>>>>> make: *** [Makefile:32: s2.s] Error 1
> > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$
> > > > >>>>>>>>>
> > > > >>>>>>>>> due to
> > > > >>>>>>>>>
> > > > >>>>>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > >>>>>>>>>                   (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > >>>>>>>>> *)ops.0_1]+32 S2 A8])
> > > > >>>>>>>>>           (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > >>>>>>>>>        (nil))
> > > > >>>>>>>>>
> > > > >>>>>>>>> The new optab gives us
> > > > >>>>>>>>>
> > > > >>>>>>>>> (insn 12 11 13 2 (set (reg:TI 88)
> > > > >>>>>>>>>           (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > >>>>>>>>>        (nil))
> > > > >>>>>>>>> (insn 13 12 14 2 (set (reg:SI 89)
> > > > >>>>>>>>>           (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > >>>>>>>>>        (nil))
> > > > >>>>>>>>> (insn 14 13 15 2 (set (reg:HI 87)
> > > > >>>>>>>>>           (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > >>>>>>>>>        (nil))
> > > > >>>>>>>> that looks odd to me - what's the final result after LRA?  I think
> > > > >>>>>>> I got:
> > > > >>>>>>>
> > > > >>>>>>> vmovd %edi, %xmm15
> > > > >>>>>>> movq ops(%rip), %rdx
> > > > >>>>>>> vpbroadcastb %xmm15, %ymm15
> > > > >>>>>>> vmovq %xmm15, %rax    <<<< move to GPR
> > > > >>>>>>> vmovdqu %ymm15, (%rdx)
> > > > >>>>>>> movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > >>>>>>> vzeroupper
> > > > >>>>>>> ret
> > > > >>>>>>>
> > > > >>>>>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > >>>>>>>> Only two steps should be necessary at most:
> > > > >>>>>>>> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > > >>>>>>>> code in memset should try to generate the subreg directly
> > > > >>>>>>> subreg didn't fail on x86 when I tried.
> > > > >>>>>>>
> > > > >>>>>>>> and if that fails, try a word_mode subreg followed by the subreg.
> > > > >>>>>>> I will try word_mode subreg.
> > > > >>>>>>>
> > > > >>>>>> Here is the v2 patch to use word_mode subreg.  For
> > > > >>>>>>
> > > > >>>>>> ---
> > > > >>>>>> extern void *ops;
> > > > >>>>>>
> > > > >>>>>> void
> > > > >>>>>> foo (int c)
> > > > >>>>>> {
> > > > >>>>>>     __builtin_memset (ops, 4, 32);
> > > > >>>>>> }
> > > > >>>>>> ---
> > > > >>>>>>
> > > > >>>>>> without vec_const_duplicate, I got
> > > > >>>>>>
> > > > >>>>>>    movl    $4, %eax
> > > > >>>>>>    movq    ops(%rip), %rdx
> > > > >>>>>>    movd    %eax, %xmm0
> > > > >>>>>>    punpcklbw       %xmm0, %xmm0
> > > > >>>>>>    punpcklwd       %xmm0, %xmm0
> > > > >>>>>>    pshufd  $0, %xmm0, %xmm0
> > > > >>>>>>    movups  %xmm0, (%rdx)
> > > > >>>>>>    movups  %xmm0, 16(%rdx)
> > > > >>>>>>    ret
> > > > >>>>>>
> > > > >>>>>> With vec_const_duplicate, I got
> > > > >>>>>>
> > > > >>>>>>    movq    ops(%rip), %rax
> > > > >>>>>>    movdqa  .LC0(%rip), %xmm0
> > > > >>>>>>    movups  %xmm0, (%rax)
> > > > >>>>>>    movups  %xmm0, 16(%rax)
> > > > >>>>>>    ret
> > > > >>>>>>
> > > > >>>>>> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > > >>>>> I don't understand why we need an optab for this though.  If the operand
> > > > >>>>> is constant then we should just be doing an ordinary move in which the
> > > > >>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > > >>>>> duplicated constants as efficiently as possible.  (Sorry if this was
> > > > >>>>> discussed upthread and I missed it.)
> > > > >>>> That's exactly the point I'm trying to get across as well.
> > > > >>>>
> > > > >>> This is what we do today.  But I'd like to generate
> > > > >>>
> > > > >>>          movl    $4, %eax
> > > > >>>          vpbroadcastb    %eax, %ymm15
> > > > >>>          movq    ops(%rip), %rax
> > > > >>>          vmovdqu %ymm15, (%rax)
> > > > >>>          vzeroupper
> > > > >>>          ret
> > > > >>>
> > > > >>> instead of
> > > > >>>
> > > > >>>          vmovdqa .LC0(%rip), %ymm15
> > > > >>>          movq    ops(%rip), %rax
> > > > >>>          vmovdqu %ymm15, (%rax)
> > > > >>>          vzeroupper
> > > > >>>          ret
> > > > >>>
> > > > >>> Do I need a vec_dup pattern for it?
> > > > >> I think we have special code sequences to materialize some
> > > > >> constant vectors already, we should be able to add to that, no?
> > > > > We can do that for all 0s and all 1s at the final codegen.   For
> > > > > other values, since we need a GPR, we can't do that.
> > > > You can catch them in your movxx expanders, you can create peep2
> > > > patterns that use available GPRs, etc.  I don't see a fundamental need
> > > > to introduce new target macros or hooks to handle this stuff.  In
> > > > fact I've done both to handle a closely related issue on our port.
> > > >
> > >
> > > One problem of expanding TI/OI/XI moves to broadcast is that other
> > > RTL passes may change it.   For example, expander generates:
> > It could be handled in pass_data_constant_pool_broadcast, which is
> > designed for AVX512 embedded broadcast but can also perform this
> > transformation.
> >
> > see
> > https://godbolt.org/z/8YGzqf938
>
> It sounds promising, but doesn't work on TImode:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100865

It would seem it could be handled in ix86_expand_vector_move?
That is, you create a pseudo and (set (reg:VnQI ..) (const_vector ...)).

Richard.

> > >
> > > (insn 7 5 6 (set (reg:QI 85)
> > >         (const_int 12 [0xc]))
> > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > -1
> > >      (nil))
> > >
> > > (insn 6 7 8 (set (reg:V16QI 84)
> > >         (vec_duplicate:V16QI (reg:QI 85)))
> > > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > > 5103 {*avx512vl_vec_dup_gprv16qi}
> > >      (nil))
> > >
> > > (insn 8 6 9 (set (subreg:V16QI (reg:TI 86) 0)
> > >         (reg:V16QI 84))
> > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > -1
> > >      (nil))
> > >
> > > (insn 9 8 10 (set (mem:TI (reg/f:DI 83) [0 MEM <char[1:19]> [(void
> > > *)dst.0_1]+0 S16 A8])
> > >         (reg:TI 86))
> > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > -1
> > >      (nil))
> > >
> > > combine turns it into:
> > >
> > > > (insn 9 6 10 2 (set (mem:TI (reg/f:DI 83 [ dst ]) [0 MEM <char[1:19]>
> > > > [(void *)dst.0_1]+0 S16 A8])
> > > >         (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c))
> > > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > > 73 {*movti_internal}
> > >      (nil))
> > >
> > > LRA tries:
> > >
> > > > (insn 14 15 16 2 (set (reg:V16QI 20 xmm0 [89])
> > > >         (vec_duplicate:V16QI (reg:QI 4 si [90])))
> > > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > > 5103 {*avx512vl_vec_dup_gprv16qi}
> > > >      (nil))
> > > > (insn 16 14 9 2 (set (reg:V16QI 1 dx)
> > > >         (reg:V16QI 20 xmm0 [89]))
> > > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > > 1527 {movv16qi_internal}
> > > >      (nil))
> > > > (insn 9 16 10 2 (set (mem:TI (reg/f:DI 0 ax [orig:83 dst ] [83]) [0
> > > > MEM <char[1:19]> [(void *)dst.0_1]+0 S16 A8])
> > > >         (reg:TI 1 dx [88]))
> > > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > > 73 {*movti_internal}
> > >
> > > and fails:
> > >
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:
> > > In function ‘foo’:
> > > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c:10:1:
> > > error: insn does not satisfy its constraints:
> > > (insn 16 14 9 2 (set (reg:V16QI 1 dx)
> > >         (reg:V16QI 20 xmm0 [89]))
> > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-17.c":9:3
> > > 1527 {movv16qi_internal}
> > >      (nil))
> > >
> > > I want to hide
> > >
> > > (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c)
> > >
> > > from RTL passes.
> > >
> > >
> > > --
> > > H.J.
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> H.J.

* Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX
  2021-06-02  7:02                                             ` Richard Biener
@ 2021-06-02 13:50                                               ` H.J. Lu
  0 siblings, 0 replies; 52+ messages in thread
From: H.J. Lu @ 2021-06-02 13:50 UTC (permalink / raw)
  To: Richard Biener
  Cc: Hongtao Liu, H.J. Lu via Gcc-patches, Bernd Edlinger, Jeff Law,
	Richard Sandiford

On Wed, Jun 2, 2021 at 12:02 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Jun 2, 2021 at 3:57 AM H.J. Lu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Tue, Jun 1, 2021 at 6:17 PM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Wed, Jun 2, 2021 at 7:07 AM H.J. Lu via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Tue, Jun 1, 2021 at 7:21 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote:
> > > > > > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
> > > > > > <richard.guenther@gmail.com> wrote:
> > > > > >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > > > > >>>>
> > > > > >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > > > >>>>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > > >>>>>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > > >>>>>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > > >>>>>>> <richard.guenther@gmail.com> wrote:
> > > > > >>>>>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >>>>>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > > >>>>>>>>> <richard.guenther@gmail.com> wrote:
> > > > > >>>>>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >>>>>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > >>>>>>>>>>>>>>>    -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > >>>>>>>>>>>>>>>             MODE)
> > > > > >>>>>>>>>>>>>>>        This function returns the RTL of a register containing
> > > > > >>>>>>>>>>>>>>>        'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > >>>>>>>>>>>>>>>        value given in the RTL register DATA.  For example, if MODE is 4
> > > > > >>>>>>>>>>>>>>>        bytes wide, return the RTL for 0x01010101*DATA.
> > > > > >>>>>>>>>>>>>> For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > >>>>>>>>>>>>>> use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > >>>>>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > >>>>>>>>>>>>> I tried.   It doesn't even work on x86.  See:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > >>>>>>>>>>>> Not sure what I should read from there...
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > >>>>>>>>>>>>> ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > >>>>>>>>>>>>> work here.   Each backend may need its own special handling.
> > > > > >>>>>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > >>>>>>>>>>>> sorry.  Since we're doing code generation the canonical way to communicate
> > > > > >>>>>>>>>>>> with backends should be optabs, not some set of disconnected target hooks.
> > > > > >>>>>>>>>>>> But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Richard.
> > > > > >>>>>>>>>>> Here is the patch to add optabs instead.  Does it look OK?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> H.J.
> > > > > >>>>>>>>>>> ---
> > > > > >>>>>>>>>>> Add 2 optabs:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> 1. integer_extract: Extract lower bit value from the integer value in
> > > > > >>>>>>>>>>> TImode, OImode or XImode.
> > > > > >>>>>>>>>> That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > > >>>>>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > > >>>>>>>>>> existing target hooks verifying subreg validity - why's that not a good
> > > > > >>>>>>>>>> fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > > >>>>>>>>>> simplify_gen_subreg?), why's that so?
> > > > > >>>>>>>>> {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > > >>>>>>>>> well on them.  I got
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > > >>>>>>>>> extern void *ops;
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> void
> > > > > >>>>>>>>> foo (int c)
> > > > > >>>>>>>>> {
> > > > > >>>>>>>>>     __builtin_memset (ops, c, 34);
> > > > > >>>>>>>>> }
> > > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > > >>>>>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > > >>>>>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > > >>>>>>>>> -O2 -march=haswell -S s2.i
> > > > > >>>>>>>>> during RTL pass: reload
> > > > > >>>>>>>>> s2.i: In function ‘foo’:
> > > > > >>>>>>>>> s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > > >>>>>>>>> insns per insn achieved (90)
> > > > > >>>>>>>>>       7 | }
> > > > > >>>>>>>>>         | ^
> > > > > >>>>>>>>> 0x1050734 lra_constraints(bool)
> > > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > > >>>>>>>>> 0x1039536 lra(_IO_FILE*)
> > > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > > >>>>>>>>> 0xfe1140 do_reload
> > > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > > >>>>>>>>> 0xfe162e execute
> > > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > > >>>>>>>>> Please submit a full bug report,
> > > > > >>>>>>>>> with preprocessed source if appropriate.
> > > > > >>>>>>>>> Please include the complete backtrace with any bug report.
> > > > > >>>>>>>>> See <https://gcc.gnu.org/bugs/> for instructions.
> > > > > >>>>>>>>> make: *** [Makefile:32: s2.s] Error 1
> > > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> due to
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > > >>>>>>>>>                   (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > > >>>>>>>>> *)ops.0_1]+32 S2 A8])
> > > > > >>>>>>>>>           (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > > >>>>>>>>>        (nil))
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The new optab gives us
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> (insn 12 11 13 2 (set (reg:TI 88)
> > > > > >>>>>>>>>           (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > > >>>>>>>>>        (nil))
> > > > > >>>>>>>>> (insn 13 12 14 2 (set (reg:SI 89)
> > > > > >>>>>>>>>           (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > > >>>>>>>>>        (nil))
> > > > > >>>>>>>>> (insn 14 13 15 2 (set (reg:HI 87)
> > > > > >>>>>>>>>           (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > > >>>>>>>>>        (nil))
> > > > > >>>>>>>> that looks odd to me - what's the final result after LRA?  I think
> > > > > >>>>>>> I got:
> > > > > >>>>>>>
> > > > > >>>>>>> vmovd %edi, %xmm15
> > > > > >>>>>>> movq ops(%rip), %rdx
> > > > > >>>>>>> vpbroadcastb %xmm15, %ymm15
> > > > > >>>>>>> vmovq %xmm15, %rax    <<<< move to GPR
> > > > > >>>>>>> vmovdqu %ymm15, (%rdx)
> > > > > >>>>>>> movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > > >>>>>>> vzeroupper
> > > > > >>>>>>> ret
> > > > > >>>>>>>
> > > > > >>>>>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > > >>>>>>>> Only two steps should be necessary at most:
> > > > > >>>>>>>> xmm -> gpr, gpr -> subreg, or gpr -> subreg.  So the expander
> > > > > >>>>>>>> code in memset should try to generate the subreg directly
> > > > > >>>>>>> subreg didn't fail on x86 when I tried.
> > > > > >>>>>>>
> > > > > >>>>>>>> and if that fails, try a word_mode subreg followed by the subreg.
> > > > > >>>>>>> I will try word_mode subreg.
> > > > > >>>>>>>
> > > > > >>>>>> Here is the v2 patch to use word_mode subreg.  For
> > > > > >>>>>>
> > > > > >>>>>> ---
> > > > > >>>>>> extern void *ops;
> > > > > >>>>>>
> > > > > >>>>>> void
> > > > > >>>>>> foo (int c)
> > > > > >>>>>> {
> > > > > >>>>>>     __builtin_memset (ops, 4, 32);
> > > > > >>>>>> }
> > > > > >>>>>> ---
> > > > > >>>>>>
> > > > > >>>>>> without vec_const_duplicate, I got
> > > > > >>>>>>
> > > > > >>>>>>    movl    $4, %eax
> > > > > >>>>>>    movq    ops(%rip), %rdx
> > > > > >>>>>>    movd    %eax, %xmm0
> > > > > >>>>>>    punpcklbw       %xmm0, %xmm0
> > > > > >>>>>>    punpcklwd       %xmm0, %xmm0
> > > > > >>>>>>    pshufd  $0, %xmm0, %xmm0
> > > > > >>>>>>    movups  %xmm0, (%rdx)
> > > > > >>>>>>    movups  %xmm0, 16(%rdx)
> > > > > >>>>>>    ret
> > > > > >>>>>>
> > > > > >>>>>> With vec_const_duplicate, I got
> > > > > >>>>>>
> > > > > >>>>>>    movq    ops(%rip), %rax
> > > > > >>>>>>    movdqa  .LC0(%rip), %xmm0
> > > > > >>>>>>    movups  %xmm0, (%rax)
> > > > > >>>>>>    movups  %xmm0, 16(%rax)
> > > > > >>>>>>    ret
> > > > > >>>>>>
> > > > > >>>>>> If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > > > >>>>> I don't understand why we need an optab for this though.  If the operand
> > > > > >>>>> is constant then we should just be doing an ordinary move in which the
> > > > > >>>>> source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > > > >>>>> duplicated constants as efficiently as possible.  (Sorry if this was
> > > > > >>>>> discussed upthread and I missed it.)
> > > > > >>>> That's exactly the point I'm trying to get across as well.
> > > > > >>>>
> > > > > >>> This is what we do today.  But I'd like to generate
> > > > > >>>
> > > > > >>>          movl    $4, %eax
> > > > > >>>          vpbroadcastb    %eax, %ymm15
> > > > > >>>          movq    ops(%rip), %rax
> > > > > >>>          vmovdqu %ymm15, (%rax)
> > > > > >>>          vzeroupper
> > > > > >>>          ret
> > > > > >>>
> > > > > >>> instead of
> > > > > >>>
> > > > > >>>          vmovdqa .LC0(%rip), %ymm15
> > > > > >>>          movq    ops(%rip), %rax
> > > > > >>>          vmovdqu %ymm15, (%rax)
> > > > > >>>          vzeroupper
> > > > > >>>          ret
> > > > > >>>
> > > > > >>> Do I need a vec_dup pattern for it?
> > > > > >> I think we have special code sequences to materialize some
> > > > > >> constant vectors already, we should be able to add to that, no?
> > > > > > We can do that for all 0s and all 1s at the final codegen.   For
> > > > > > other values, since we need a GPR, we can't do that.
> > > > > You can catch them in your movxx expanders, you can create peep2
> > > > > patterns that use available GPRs, etc.  I don't see a fundamental need
> > > > > > to introduce new target macros or hooks to handle this stuff.  In
> > > > > fact I've done both to handle a closely related issue on our port.
> > > > >
> > > >
> > > > One problem of expanding TI/OI/XI moves to broadcast is that other
> > > > RTL passes may change it.   For example, expander generates:
> > > It could be handled in pass_data_constant_pool_broadcast, which is
> > > designed for AVX512 embedded broadcast but can also perform this
> > > transformation.
> > >
> > > see
> > > https://godbolt.org/z/8YGzqf938
> >
> > It sounds promising, but doesn't work on TImode:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100865
>
> It would seem it could be handled in ix86_expand_vector_move?
> That is, you create a pseudo and (set (reg:VnQI ..) (const_vector ...)).
>

I can do that.  But builtin_memset_read_str will see the previous
value as

(const_wide_int 0x3030303030303030303030303030303)

and we will generate

movl $3, %edx
movq ops(%rip), %rax
movabsq $217020518514230019, %rcx <<<<< Not needed
vmovd %edx, %xmm15
vpbroadcastb %xmm15, %xmm15
movq %rcx, 16(%rax)  <<< Should use vmovq %xmm15
vmovdqu %xmm15, (%rax)
ret

-- 
H.J.
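
A note on the constant in the sequence above: 217020518514230019 is
0x0303030303030303, that is 0x0101010101010101 * 3, which is the same
bit pattern as the low 8 bytes of the byte broadcast already held in
%xmm15.  The sketch below only checks that arithmetic in plain C.  The
function names are made up for illustration and are not GCC internals;
memset and memcpy merely stand in for vpbroadcastb and vmovq.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte across a 64-bit integer: 0x0101010101010101 * c.
   This is the scalar form of the memset value discussed in the thread.  */
uint64_t
broadcast_byte_u64 (unsigned char c)
{
  return UINT64_C (0x0101010101010101) * c;
}

/* Model the low 8 bytes of a 16-byte register holding a byte broadcast.
   The result does not depend on endianness because every byte is equal.  */
uint64_t
low_qword_of_broadcast (unsigned char c)
{
  unsigned char vec[16];
  uint64_t low;

  memset (vec, c, sizeof vec);      /* stands in for vpbroadcastb */
  memcpy (&low, vec, sizeof low);   /* stands in for vmovq from %xmm15 */
  return low;
}

/* Self-check for the value 3 used in the message above.  */
void
check_memset_value_3 (void)
{
  /* The movabsq immediate is the 8-byte broadcast of 3.  */
  assert (broadcast_byte_u64 (3) == UINT64_C (217020518514230019));
  /* The same bits are available as the low half of the vector.  */
  assert (low_qword_of_broadcast (3) == broadcast_byte_u64 (3));
}
```

Calling check_memset_value_3 () from a test driver passes silently,
which is consistent with the remark that the movabsq is redundant once
the broadcast register exists.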

Thread overview: 52+ messages
2021-05-18 19:16 [PATCH v4 00/12] Allow TImode/OImode/XImode in op_by_pieces operations H.J. Lu
2021-05-18 19:16 ` [PATCH v4 01/12] Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
2021-05-19  9:25   ` Richard Biener
2021-05-19 12:55     ` H.J. Lu
2021-05-20 20:49       ` [PATCH] Add 3 target hooks for memset H.J. Lu
2021-05-21  5:42         ` Bernd Edlinger
2021-05-21 11:53           ` H.J. Lu
2021-05-25 14:34         ` Richard Biener
2021-05-25 15:11           ` H.J. Lu
2021-05-26  8:28             ` Richard Biener
2021-05-31 12:09               ` [PATCH] Add integer_extract and vec_const_duplicate optabs H.J. Lu
2021-05-31 12:46                 ` Richard Biener
2021-05-31 13:12                   ` H.J. Lu
2021-05-31 13:25                     ` Richard Biener
2021-05-31 13:32                       ` H.J. Lu
2021-05-31 13:36                         ` H.J. Lu
2021-05-31 20:22                         ` [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX H.J. Lu
2021-06-01  5:50                           ` Richard Sandiford
2021-06-01  5:54                             ` Jeff Law
2021-06-01 13:05                               ` H.J. Lu
2021-06-01 13:25                                 ` Richard Biener
2021-06-01 13:29                                   ` H.J. Lu
2021-06-01 14:21                                     ` Jeff Law
2021-06-01 23:07                                       ` H.J. Lu
2021-06-02  1:21                                         ` Hongtao Liu
2021-06-02  1:54                                           ` H.J. Lu
2021-06-02  7:02                                             ` Richard Biener
2021-06-02 13:50                                               ` H.J. Lu
2021-05-18 19:16 ` [PATCH v4 02/12] x86: Add TARGET_READ_MEMSET_VALUE/TARGET_GEN_MEMSET_VALUE H.J. Lu
2021-05-18 19:16 ` [PATCH v4 03/12] x86: Avoid stack realignment when copying data H.J. Lu
2021-05-18 19:16 ` [PATCH v4 04/12] Remove MAX_BITSIZE_MODE_ANY_INT H.J. Lu
2021-05-25 14:37   ` Richard Biener
2021-05-18 19:16 ` [PATCH v4 05/12] x86: Update piecewise move and store H.J. Lu
2021-05-18 19:16 ` [PATCH v4 06/12] x86: Add AVX2 tests for PR middle-end/90773 H.J. Lu
2021-05-18 19:16 ` [PATCH v4 07/12] x86: Add tests for piecewise move and store H.J. Lu
2021-05-18 19:16 ` [PATCH v4 08/12] x86: Also pass -mno-avx to pr72839.c H.J. Lu
2021-05-18 19:16 ` [PATCH v4 09/12] x86: Also pass -mno-avx to cold-attribute-1.c H.J. Lu
2021-05-18 19:16 ` [PATCH v4 10/12] x86: Also pass -mno-avx to sw-1.c for ia32 H.J. Lu
2021-05-18 19:16 ` [PATCH v4 11/12] x86: Update gcc.target/i386/incoming-11.c H.J. Lu
2021-05-18 19:16 ` [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory H.J. Lu
2021-05-19  9:33   ` Richard Biener
2021-05-19 13:22     ` H.J. Lu
2021-05-19 13:27       ` Bernd Edlinger
2021-05-19 19:04         ` H.J. Lu
2021-05-20  6:57           ` Richard Biener
2021-05-20  7:51       ` Richard Biener
2021-05-20 14:03         ` [PATCH] constructor: Elide expand_constructor when can move by pieces is true H.J. Lu
2021-05-21  5:35           ` Bernd Edlinger
2021-05-21  6:57           ` Richard Biener
2021-05-21  7:30             ` Bernd Edlinger
2021-05-21 13:13               ` H.J. Lu
2021-05-21 13:09             ` [PATCH] Elide expand_constructor if move by pieces is preferred H.J. Lu
