* [PATCH] Minimize number of operations between two areas of memory
@ 2019-06-11 15:15 H.J. Lu
2019-06-18 15:59 ` PING^1 " H.J. Lu
0 siblings, 1 reply; 3+ messages in thread
From: H.J. Lu @ 2019-06-11 15:15 UTC (permalink / raw)
To: GCC Patches, Jakub Jelinek, Jeffrey Law, Richard Biener
[-- Attachment #1: Type: text/plain, Size: 1287 bytes --]
For op_by_pieces operations between two areas of memory, this patch adds
-fminimize-op-by-pieces-run to minimize number of operations. When
operating on LENGTH bytes of memory, it starts with the widest usable
integer size, MAX_SIZE, for LENGTH bytes and finishes with the smallest
usable integer size, MIN_SIZE, for the remaining bytes where MAX_SIZE
>= MIN_SIZE. If MIN_SIZE > the remaining bytes, the last operation is
performed on MIN_SIZE bytes of overlapping memory from the previous
operation.
Tested on Linux/x86-64 with both -fminimize-op-by-pieces-run enabled
and disabled by default.
I will submit a separate patch to enable -fminimize-op-by-pieces-run for -Os.
OK for trunk?
gcc/
PR middl-end/90773
* common.opt (-fminimize-op-by-pieces-run): New.
* expr.c (op_by_pieces_d): Add get_usable_mode.
(op_by_pieces_d::get_usable_mode): New.
(op_by_pieces_d::run): Use get_usable_mode to get the largest
usable integer mode for -fminimize-op-by-pieces-run.
* doc/invoke.texi: Document -fminimize-op-by-pieces-run.
gcc/testsuite/
PR middl-end/90773
* gcc.target/i386/pr90773-1.c: New test.
* gcc.target/i386/pr90773-2.c: Likewise.
* gcc.target/i386/pr90773-3.c: Likewise.
* gcc.target/i386/pr90773-4.c: Likewise.
* gcc.target/i386/pr90773-5.c: Likewise.
Thanks.
--
H.J.
[-- Attachment #2: 0001-Minimize-number-of-operations-between-two-areas-of-m.patch --]
[-- Type: text/x-patch, Size: 12865 bytes --]
From cca1f7d9578dd927b97a9210aae404b0f6135cc1 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 10 Jun 2019 09:57:15 -0700
Subject: [PATCH] Minimize number of operations between two areas of memory
For op_by_pieces operations between two areas of memory, this patch adds
-fminimize-op-by-pieces-run to minimize number of operations. When
operating on LENGTH bytes of memory, it starts with the widest usable
integer size, MAX_SIZE, for LENGTH bytes and finishes with the smallest
usable integer size, MIN_SIZE, for the remaining bytes where MAX_SIZE
>= MIN_SIZE. If MIN_SIZE > the remaining bytes, the last operation is
performed on MIN_SIZE bytes of overlapping memory from the previous
operation.
Tested on Linux/x86-64 with both -fminimize-op-by-pieces-run enabled
and disabled by default.
gcc/
PR middl-end/90773
* common.opt (-fminimize-op-by-pieces-run): New.
* expr.c (op_by_pieces_d): Add get_usable_mode.
(op_by_pieces_d::get_usable_mode): New.
(op_by_pieces_d::run): Use get_usable_mode to get the largest
usable integer mode for -fminimize-op-by-pieces-run.
* doc/invoke.texi: Document -fminimize-op-by-pieces-run.
gcc/testsuite/
PR middl-end/90773
* gcc.target/i386/pr90773-1.c: New test.
* gcc.target/i386/pr90773-2.c: Likewise.
* gcc.target/i386/pr90773-3.c: Likewise.
* gcc.target/i386/pr90773-4.c: Likewise.
* gcc.target/i386/pr90773-5.c: Likewise.
---
gcc/common.opt | 4 +
gcc/doc/invoke.texi | 9 ++-
gcc/expr.c | 97 +++++++++++++++++------
gcc/testsuite/gcc.target/i386/pr90773-1.c | 17 ++++
gcc/testsuite/gcc.target/i386/pr90773-2.c | 20 +++++
gcc/testsuite/gcc.target/i386/pr90773-3.c | 23 ++++++
gcc/testsuite/gcc.target/i386/pr90773-4.c | 13 +++
gcc/testsuite/gcc.target/i386/pr90773-5.c | 13 +++
8 files changed, 171 insertions(+), 25 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c
create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c
create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c
create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c
create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c
diff --git a/gcc/common.opt b/gcc/common.opt
index e1404165feb..3e5f8075bda 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1934,6 +1934,10 @@ fmessage-length=
Common RejectNegative Joined UInteger
-fmessage-length=<number> Limit diagnostics to <number> characters per line. 0 suppresses line-wrapping.
+fminimize-op-by-pieces-run
+Common Report Var(flag_minimize_op_by_pieces_run) Optimization
+Minimize number of operations between two areas of memory.
+
fmodulo-sched
Common Report Var(flag_modulo_sched) Optimization
Perform SMS based modulo scheduling before the first scheduling pass.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f18d225e5b5..d6d00dfea6a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -431,8 +431,8 @@ Objective-C and Objective-C++ Dialects}.
-floop-block -floop-interchange -floop-strip-mine @gol
-floop-unroll-and-jam -floop-nest-optimize @gol
-floop-parallelize-all -flra-remat -flto -flto-compression-level @gol
--flto-partition=@var{alg} -fmerge-all-constants @gol
--fmerge-constants -fmodulo-sched -fmodulo-sched-allow-regmoves @gol
+-flto-partition=@var{alg} -fmerge-all-constants -fmerge-constants @gol
+-fminimize-op-by-pieces-run -fmodulo-sched -fmodulo-sched-allow-regmoves @gol
-fmove-loop-invariants -fno-branch-count-reg @gol
-fno-defer-pop -fno-fp-int-builtin-inexact -fno-function-cse @gol
-fno-guess-branch-probability -fno-inline -fno-math-errno -fno-peephole @gol
@@ -8610,6 +8610,11 @@ instances of the same variable in recursive calls, to have distinct locations,
so using this option results in non-conforming
behavior.
+@item -fminimize-op-by-pieces-run
+@opindex fminimize-op-by-pieces-run
+Minimize number of operations between two areas of memory by using
+the largest integer operation with ovelapping memory operations.
+
@item -fmodulo-sched
@opindex fmodulo-sched
Perform swing modulo scheduling immediately before the first scheduling
diff --git a/gcc/expr.c b/gcc/expr.c
index c78bc74c0d9..42a1c4a6e5b 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -1015,6 +1015,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size)
class op_by_pieces_d
{
+ private:
+ scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int);
+
protected:
pieces_addr m_to, m_from;
unsigned HOST_WIDE_INT m_len;
@@ -1082,6 +1085,25 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
m_align = align;
}
+/* This function returns the largest usable integer mode for LEN bytes
+ whose size is no bigger than size of MODE. */
+
+scalar_int_mode
+op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len)
+{
+ unsigned int size;
+ do
+ {
+ size = GET_MODE_SIZE (mode);
+ if (len >= size && prepare_mode (mode, m_align))
+ break;
+ /* NB: widest_int_mode_for_size checks SIZE > 1. */
+ mode = widest_int_mode_for_size (size);
+ }
+ while (1);
+ return mode;
+}
+
/* This function contains the main loop used for expanding a block
operation. First move what we can in the largest integer mode,
then go to successively smaller modes. For every access, call
@@ -1090,42 +1112,71 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
void
op_by_pieces_d::run ()
{
- while (m_max_size > 1 && m_len > 0)
+ if (m_len == 0)
+ return;
+
+ /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1. */
+ scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
+ mode = get_usable_mode (mode, m_len);
+
+ do
{
- scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
+ unsigned int size = GET_MODE_SIZE (mode);
+ rtx to1 = NULL_RTX, from1;
- if (prepare_mode (mode, m_align))
+ while (m_len >= size)
{
- unsigned int size = GET_MODE_SIZE (mode);
- rtx to1 = NULL_RTX, from1;
+ if (m_reverse)
+ m_offset -= size;
- while (m_len >= size)
- {
- if (m_reverse)
- m_offset -= size;
+ to1 = m_to.adjust (mode, m_offset);
+ from1 = m_from.adjust (mode, m_offset);
- to1 = m_to.adjust (mode, m_offset);
- from1 = m_from.adjust (mode, m_offset);
+ m_to.maybe_predec (-(HOST_WIDE_INT)size);
+ m_from.maybe_predec (-(HOST_WIDE_INT)size);
- m_to.maybe_predec (-(HOST_WIDE_INT)size);
- m_from.maybe_predec (-(HOST_WIDE_INT)size);
+ generate (to1, from1, mode);
- generate (to1, from1, mode);
+ m_to.maybe_postinc (size);
+ m_from.maybe_postinc (size);
- m_to.maybe_postinc (size);
- m_from.maybe_postinc (size);
+ if (!m_reverse)
+ m_offset += size;
- if (!m_reverse)
- m_offset += size;
+ m_len -= size;
+ }
- m_len -= size;
- }
+ finish_mode (mode);
- finish_mode (mode);
- }
+ if (m_len == 0)
+ return;
- m_max_size = GET_MODE_SIZE (mode);
+ if (flag_minimize_op_by_pieces_run)
+ {
+ /* Get the smallest integer mode for M_LEN bytes. */
+ mode = smallest_int_mode_for_size (m_len * BITS_PER_UNIT);
+ mode = get_usable_mode (mode, GET_MODE_SIZE (mode));
+ int gap = GET_MODE_SIZE (mode) - m_len;
+ if (gap > 0)
+ {
+ /* If size of MODE > M_LEN, generate the last operation
+ in MODE for the remaining bytes with ovelapping memory
+ from the previois operation. */
+ if (m_reverse)
+ m_offset += gap;
+ else
+ m_offset -= gap;
+ m_len += gap;
+ }
+ }
+ else
+ {
+ /* NB: widest_int_mode_for_size checks SIZE > 1. */
+ mode = widest_int_mode_for_size (size);
+ mode = get_usable_mode (mode, m_len);
+ }
}
+ while (1);
/* The code above should have handled everything. */
gcc_assert (!m_len);
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-1.c b/gcc/testsuite/gcc.target/i386/pr90773-1.c
new file mode 100644
index 00000000000..76cd48ff3d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-1.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -fminimize-op-by-pieces-run" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 15);
+}
+
+/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+11\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-2.c b/gcc/testsuite/gcc.target/i386/pr90773-2.c
new file mode 100644
index 00000000000..fb7c37dbac5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-2.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -fminimize-op-by-pieces-run" } */
+/* { dg-additional-options "-msse2" { target { ! ia32 } } } */
+/* { dg-additional-options "-mno-sse" { target ia32 } } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 19);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-3.c b/gcc/testsuite/gcc.target/i386/pr90773-3.c
new file mode 100644
index 00000000000..54fa0df85c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-3.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -fminimize-op-by-pieces-run" } */
+/* { dg-additional-options "-msse2" { target { ! ia32 } } } */
+/* { dg-additional-options "-mno-sse" { target ia32 } } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 31);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movdqu\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+16\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+20\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+24\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movl\[\\t \]+27\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-4.c b/gcc/testsuite/gcc.target/i386/pr90773-4.c
new file mode 100644
index 00000000000..7f386af9d97
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-4.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -fminimize-op-by-pieces-run" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 0, 31);
+}
+
+/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 15\\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-5.c b/gcc/testsuite/gcc.target/i386/pr90773-5.c
new file mode 100644
index 00000000000..124d8fdaeeb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr90773-5.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -fminimize-op-by-pieces-run" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 0, 21);
+}
+
+/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "movq\[\\t \]+\\\$0+, 13\\(%\[\^,\]+\\)" 1 } } */
--
2.20.1
^ permalink raw reply [flat|nested] 3+ messages in thread
* PING^1 [PATCH] Minimize number of operations between two areas of memory
2019-06-11 15:15 [PATCH] Minimize number of operations between two areas of memory H.J. Lu
@ 2019-06-18 15:59 ` H.J. Lu
2019-07-02 19:21 ` PING^2 " H.J. Lu
0 siblings, 1 reply; 3+ messages in thread
From: H.J. Lu @ 2019-06-18 15:59 UTC (permalink / raw)
To: GCC Patches, Jakub Jelinek, Jeffrey Law, Richard Biener
On Tue, Jun 11, 2019 at 8:14 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> For op_by_pieces operations between two areas of memory, this patch adds
> -fminimize-op-by-pieces-run to minimize number of operations. When
> operating on LENGTH bytes of memory, it starts with the widest usable
> integer size, MAX_SIZE, for LENGTH bytes and finishes with the smallest
> usable integer size, MIN_SIZE, for the remaining bytes where MAX_SIZE
> >= MIN_SIZE. If MIN_SIZE > the remaining bytes, the last operation is
> performed on MIN_SIZE bytes of overlapping memory from the previous
> operation.
>
> Tested on Linux/x86-64 with both -fminimize-op-by-pieces-run enabled
> and disabled by default.
>
> I will submit a separate patch to enable -fminimize-op-by-pieces-run for -Os.
>
> OK for trunk?
>
> gcc/
>
> PR middl-end/90773
> * common.opt (-fminimize-op-by-pieces-run): New.
> * expr.c (op_by_pieces_d): Add get_usable_mode.
> (op_by_pieces_d::get_usable_mode): New.
> (op_by_pieces_d::run): Use get_usable_mode to get the largest
> usable integer mode for -fminimize-op-by-pieces-run.
> * doc/invoke.texi: Document -fminimize-op-by-pieces-run.
>
> gcc/testsuite/
>
> PR middl-end/90773
> * gcc.target/i386/pr90773-1.c: New test.
> * gcc.target/i386/pr90773-2.c: Likewise.
> * gcc.target/i386/pr90773-3.c: Likewise.
> * gcc.target/i386/pr90773-4.c: Likewise.
> * gcc.target/i386/pr90773-5.c: Likewise.
>
> Thanks.
>
PING:
https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00641.html
--
H.J.
^ permalink raw reply [flat|nested] 3+ messages in thread
* PING^2 [PATCH] Minimize number of operations between two areas of memory
2019-06-18 15:59 ` PING^1 " H.J. Lu
@ 2019-07-02 19:21 ` H.J. Lu
0 siblings, 0 replies; 3+ messages in thread
From: H.J. Lu @ 2019-07-02 19:21 UTC (permalink / raw)
To: GCC Patches, Jakub Jelinek, Jeffrey Law, Richard Biener
On Tue, Jun 18, 2019 at 8:58 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 11, 2019 at 8:14 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > For op_by_pieces operations between two areas of memory, this patch adds
> > -fminimize-op-by-pieces-run to minimize number of operations. When
> > operating on LENGTH bytes of memory, it starts with the widest usable
> > integer size, MAX_SIZE, for LENGTH bytes and finishes with the smallest
> > usable integer size, MIN_SIZE, for the remaining bytes where MAX_SIZE
> > >= MIN_SIZE. If MIN_SIZE > the remaining bytes, the last operation is
> > performed on MIN_SIZE bytes of overlapping memory from the previous
> > operation.
> >
> > Tested on Linux/x86-64 with both -fminimize-op-by-pieces-run enabled
> > and disabled by default.
> >
> > I will submit a separate patch to enable -fminimize-op-by-pieces-run for -Os.
> >
> > OK for trunk?
> >
> > gcc/
> >
> > PR middl-end/90773
> > * common.opt (-fminimize-op-by-pieces-run): New.
> > * expr.c (op_by_pieces_d): Add get_usable_mode.
> > (op_by_pieces_d::get_usable_mode): New.
> > (op_by_pieces_d::run): Use get_usable_mode to get the largest
> > usable integer mode for -fminimize-op-by-pieces-run.
> > * doc/invoke.texi: Document -fminimize-op-by-pieces-run.
> >
> > gcc/testsuite/
> >
> > PR middl-end/90773
> > * gcc.target/i386/pr90773-1.c: New test.
> > * gcc.target/i386/pr90773-2.c: Likewise.
> > * gcc.target/i386/pr90773-3.c: Likewise.
> > * gcc.target/i386/pr90773-4.c: Likewise.
> > * gcc.target/i386/pr90773-5.c: Likewise.
> >
> > Thanks.
> >
>
> PING:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00641.html
>
PING.
--
H.J.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2019-07-02 19:21 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-11 15:15 [PATCH] Minimize number of operations between two areas of memory H.J. Lu
2019-06-18 15:59 ` PING^1 " H.J. Lu
2019-07-02 19:21 ` PING^2 " H.J. Lu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).