public inbox for gcc-patches@gcc.gnu.org
* [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
@ 2016-07-15 15:14 Kyrill Tkachov
  2016-07-18 12:23 ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2016-07-15 15:14 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 6202 bytes --]

Hi all,

This is a GIMPLE pass to implement PR middle-end/22141, that is, to merge narrow stores of constants
into fewer wider stores.  A 2009 patch from Jakub [1] contains many testcases, but a simple motivating
case is:

struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
}; // packed 64-bit structure

void bar (struct bar *);

void
foo (struct bar *p)
{
   p->b = 0;
   p->a = 0;
   p->c = 0;
   p->d = 1;
   p->e = 0;
}

Currently on aarch64 this will generate:
foo:
         mov     w1, 1
         str     wzr, [x0]
         strb    wzr, [x0, 4]
         strb    wzr, [x0, 5]
         strb    w1, [x0, 6]
         strb    wzr, [x0, 7]
         ret

With this patch this can be improved into a single unaligned store:
foo:
         mov     x1, 0x1000000000000
         str     x1, [x0]
         ret

or, if compiled with -mstrict-align:
foo:
         mov     w1, 0x10000
         stp     wzr, w1, [x0]
         ret

The pass is a tree-ssa pass that runs fairly late in the pipeline, after pass_optimize_widening_mul.
I explain the approach taken in the comments in the new tree-ssa-store-widening.c file but essentially
it has 3 phases applied to each basic block:

1) Scan through the statements, recording assignments of constants to destinations like ARRAY_REF,
COMPONENT_REF, MEM_REF that are determined to write to an ultimate common destination.  get_inner_reference
is used to decompose these destinations.  Continue recording these until we encounter a statement that may
interfere with the stores we've been recording (a load or store that may alias, a volatile access etc.).
These assignments of interest are recorded as store_immediate_info objects in the m_store_info vector.
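
To illustrate (not part of the patch; this is just the conceptual result of the get_inner_reference
decomposition for foo () above, assuming the natural layout of struct bar), phase 1 records roughly:

   p->a = 0;  /* base: *p, bitpos:  0, bitsize: 32, val: 0 */
   p->b = 0;  /* base: *p, bitpos: 32, bitsize:  8, val: 0 */
   p->c = 0;  /* base: *p, bitpos: 40, bitsize:  8, val: 0 */
   p->d = 1;  /* base: *p, bitpos: 48, bitsize:  8, val: 1 */
   p->e = 0;  /* base: *p, bitpos: 56, bitsize:  8, val: 0 */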

2) Analyse the stores recorded in phase one (they all write to a destination offset from a common base)
and merge them into wider assignments up to BITS_PER_WORD bits wide.  These widened assignments are represented
as merged_store_group objects and are recorded in the m_merged_store_groups vector.  This is the
coalesce_immediate_stores function.  It sorts the stores by the bitposition they write to and iterates through
them, merging consecutive stores into accesses up to BITS_PER_WORD wide (it gives up on overlapping stores;
I don't think that case appears often enough to warrant extra logic).
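
Continuing the example (illustration only): on little-endian all five stores above are consecutive from
bitpos 0 to 63, so they coalesce into a single 64-bit group whose value is built by inserting each
constant at its bitposition:

   unsigned long long val = 0;  /* p->a == 0, bits  0-31 */
   val |= 0ULL << 32;           /* p->b == 0, bits 32-39 */
   val |= 0ULL << 40;           /* p->c == 0, bits 40-47 */
   val |= 1ULL << 48;           /* p->d == 1, bits 48-55 */
   val |= 0ULL << 56;           /* p->e == 0, bits 56-63 */
   /* val == 0x1000000000000, the constant in the aarch64 output above.  */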

3) Go through the merged stores recorded in m_merged_store_groups and output each widened store.  Widened stores
that are not of a bitsize that is a power of two (for example 48 bits wide) are output as multiple stores of decreasing
power-of-two width.  So, for a widened store 48 bits wide this phase would emit a 32-bit store followed by a
16-bit store.  The new sequence is only emitted if it contains fewer statements than the original sequence that it
will replace.  This phase also avoids outputting unaligned stores for STRICT_ALIGNMENT targets or targets where
SLOW_UNALIGNED_ACCESS forbids it.  Since some configurations/targets may want to avoid generation of unaligned
stores even when it is legal, I've added the new param PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
to disallow unaligned store generation.  Its default setting is to allow them (assuming that STRICT_ALIGNMENT
and SLOW_UNALIGNED_ACCESS allow it).
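
To sketch the splitting step in C (illustration only, not the emitted GIMPLE; 'base' is a hypothetical
char * pointing at the start of the group), a 48-bit group with little-endian merged value val comes
out as:

   *(unsigned int *) base         = val & 0xffffffff;      /* 32-bit store */
   *(unsigned short *) (base + 4) = (val >> 32) & 0xffff;  /* 16-bit store */

To suppress unaligned stores even where they are legal, use --param store-widening-allow-unaligned=0.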

This is my first GIMPLE-level pass so please do point out places where I'm not using the interfaces correctly.
This patch handles bitfields as well, but only if they are a multiple of BITS_PER_UNIT.  It should be easily
extensible to handle other bitfields too, but I'm not entirely sure of the rules for laying out such bitfields,
and in particular the byteswap logic that needs to be applied for big-endian targets.  If someone can shed some light
on how they should be handled I'll be happy to try it out, but I believe this patch is an improvement as it is.

This has been bootstrapped and tested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
I've also tested it on the big-endian targets: armeb-none-eabi, aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.

I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no regressions; the overall score
improved a bit (about 0.1%).  The interesting improvements were:
458.sjeng     (+0.8%)
483.xalancbmk (+1.1%)
416.gamess    (+1.0%)
454.calculix  (+1.1%)

An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it transformed a long sequence of constant
byte stores into a much shorter sequence of word-size stores (from ~550 instructions to ~190).

On x86_64 SPECINT there was no change in the overall score.  The code size at -Ofast is consistently smaller
with this patch, but the performance differences on sub-benchmarks are in the noise.

I've included the testcases from Jakub's patch [1] and added a few of my own.

Is this direction acceptable for the problem this is trying to solve?

Thanks,
Kyrill

N.B. I'm going on vacation until August so I won't be able to respond to any feedback until I get back.

[1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR middle-end/22141
     * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
     * common.opt (ftree-store-widening): New Optimization option.
     * opts.c (default_options_table): Add entry for
     OPT_ftree_store_widening.
     * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
     * passes.def: Insert pass_tree_store_widening.
     * tree-pass.h (make_pass_tree_store_widening): Declare extern
     prototype.
     * tree-ssa-store-widening.c: New file.
     * doc/invoke.texi (Optimization Options): Document
     -ftree-store-widening.

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
             Jakub Jelinek  <jakub@redhat.com>

     PR middle-end/22141
     * gcc.c-torture/execute/pr22141-1.c: New test.
     * gcc.c-torture/execute/pr22141-2.c: Likewise.
     * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
     * gcc.dg/store_widening_1.c: New test.
     * gcc.dg/store_widening_2.c: Likewise.
     * gcc.dg/store_widening_3.c: Likewise.
     * gcc.dg/store_widening_4.c: Likewise.
     * gcc.target/i386/pr22141.c: Likewise.

[-- Attachment #2: widening.patch --]
[-- Type: text/x-patch, Size: 44961 bytes --]

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 2dc88bee3e98d1642d7603c620faccbc8d4d9c54..9992a212f976cdfaa00cc21a6c70525fd345976e 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1505,6 +1505,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-scopedtables.o \
 	tree-ssa-sink.o \
+	tree-ssa-store-widening.o \
 	tree-ssa-strlen.o \
 	tree-ssa-structalias.o \
 	tree-ssa-tail-merge.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 72e657ca000ec3333e40b93e264767a6d840d1c9..13d47e17ff016e1af5971d36a046849f53d5bcc2 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2471,6 +2471,10 @@ ftree-sra
 Common Report Var(flag_tree_sra) Optimization
 Perform scalar replacement of aggregates.
 
+ftree-store-widening
+Common Var(flag_tree_store_widening) Optimization
+Use the tree store widening pass.
+
 ftree-ter
 Common Report Var(flag_tree_ter) Optimization
 Replace temporary expressions in the SSA->normal pass.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 997faa15b6e4e8dc938bad5b8b249f3870598990..b57c24e2207f3f504e186bef51b05dbbdbeacfec 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -411,8 +411,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-loop-vectorize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
--ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp -funconstrained-commons @gol
+-ftree-store-widening -ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6333,6 +6333,7 @@ compilation time.
 -ftree-sink @gol
 -ftree-slsr @gol
 -ftree-sra @gol
+-ftree-store-widening @gol
 -ftree-pta @gol
 -ftree-ter @gol
 -funit-at-a-time}
@@ -7651,6 +7652,13 @@ Perform scalar replacement of aggregates.  This pass replaces structure
 references with scalars to prevent committing structures to memory too
 early.  This flag is enabled by default at @option{-O} and higher.
 
+@item -ftree-store-widening
+@opindex ftree-store-widening
+Perform merging of narrow stores to consecutive memory addresses.  This pass
+merges contiguous stores of immediate values narrower than a word into fewer
+wider stores to reduce the number of instructions.  This is enabled by default
+at @option{-O} and higher.
+
 @item -ftree-ter
 @opindex ftree-ter
 Perform temporary expression replacement during the SSA->normal phase.  Single
diff --git a/gcc/opts.c b/gcc/opts.c
index 4053fb1db0a9a814a5623e0604619fbe068b37ab..6f7c4a2504b0bf13ab6f4cb6faff35757381fff2 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -463,6 +463,7 @@ static const struct default_options default_options_table[] =
     { OPT_LEVELS_1_PLUS, OPT_ftree_dse, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ter, NULL, 1 },
     { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_ftree_sra, NULL, 1 },
+    { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_ftree_store_widening, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_fre, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_copy_prop, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_sink, NULL, 1 },
diff --git a/gcc/params.def b/gcc/params.def
index 62ec600ba3c88dde78150fae63591e0855e93752..521ca636c08b4e0d845db947e12fd58a033f94e6 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1095,6 +1095,12 @@ DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
           "Maximum amount of similar bbs to compare a bb with.",
           10, 0, 0)
 
+DEFPARAM (PARAM_STORE_WIDENING_ALLOW_UNALIGNED,
+	  "store-widening-allow-unaligned",
+	  "Allow the store widening pass to introduce unaligned stores "
+	  "if it is legal to do so",
+          1, 0, 1)
+
 DEFPARAM (PARAM_MAX_TAIL_MERGE_ITERATIONS,
           "max-tail-merge-iterations",
           "Maximum amount of iterations of the pass over a function.",
diff --git a/gcc/passes.def b/gcc/passes.def
index c78dbf2f61beb3940122109c04473e6499846b22..e54c64893c274dcf69fcee9d61be2193f3d8e148 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -324,6 +324,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_phiopt);
       NEXT_PASS (pass_fold_builtins);
       NEXT_PASS (pass_optimize_widening_mul);
+      NEXT_PASS (pass_tree_store_widening);
       NEXT_PASS (pass_tail_calls);
       /* If DCE is not run before checking for uninitialized uses,
 	 we may get false warnings (e.g., testsuite/gcc.dg/uninit-5.c).
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7c888b469cf39f00ced8ddb8cc6dc245fa30b97b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
@@ -0,0 +1,122 @@
+/* PR middle-end/22141 */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u;
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v;
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v;
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w;
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..cb9cc79026310260ffc3a83bfdf9bfc92f998a86
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
@@ -0,0 +1,122 @@
+/* PR middle-end/22141 */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u __attribute__((aligned));
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v __attribute__((aligned));
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v __attribute__((aligned));
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w __attribute__((aligned));
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/store_widening_1.c b/gcc/testsuite/gcc.dg/store_widening_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..6b7d9b3f83a065a5a420281217fa8299d8e1c275
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_1.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+struct bar {
+  int a;
+  char b;
+  char c;
+  char d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foo1 (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 0;
+  p->d = 0;
+  p->e = 0;
+}
+
+void
+foo2 (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 1;
+  p->d = 0;
+  p->e = 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.dg/store_widening_2.c b/gcc/testsuite/gcc.dg/store_widening_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..e5ac1f1fee24e01cf4d3327afbbb68560919b5e6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_2.c
@@ -0,0 +1,80 @@
+/* { dg-do run } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+struct bar
+{
+  int a;
+  unsigned char b;
+  unsigned char c;
+  short d;
+  unsigned char e;
+  unsigned char f;
+  unsigned char g;
+};
+
+__attribute__ ((noinline)) void
+foozero (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 0;
+  p->d = 0;
+  p->e = 0;
+  p->f = 0;
+  p->g = 0;
+}
+
+__attribute__ ((noinline)) void
+foo1 (struct bar *p)
+{
+  p->b = 1;
+  p->a = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 0;
+  p->g = 0xff;
+}
+
+__attribute__ ((noinline)) void
+foo2 (struct bar *p, struct bar *p2)
+{
+  p->b = 0xff;
+  p2->b = 0xa;
+  p->a = 0xfffff;
+  p2->c = 0xc;
+  p->c = 0xff;
+  p2->d = 0xbf;
+  p->d = 0xfff;
+}
+
+int
+main (void)
+{
+  struct bar b1, b2;
+  foozero (&b1);
+  foozero (&b2);
+
+  foo1 (&b1);
+  if (b1.b != 1 || b1.a != 2 || b1.c != 3 || b1.d != 4 || b1.e != 5
+      || b1.f != 0 || b1.g != 0xff)
+    __builtin_abort ();
+
+  foozero (&b1);
+  /* Make sure writes to aliasing struct pointers preserve the
+     correct order.  */
+  foo2 (&b1, &b1);
+  if (b1.b != 0xa || b1.a != 0xfffff || b1.c != 0xff || b1.d != 0xfff)
+    __builtin_abort ();
+
+  foozero (&b1);
+  foo2 (&b1, &b2);
+  if (b1.a != 0xfffff || b1.b != 0xff || b1.c != 0xff || b1.d != 0xfff
+      || b2.b != 0xa || b2.c != 0xc || b2.d != 0xbf)
+    __builtin_abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.dg/store_widening_3.c b/gcc/testsuite/gcc.dg/store_widening_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..63a76e7ad6ec912749725e92093179b07d02afe3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_3.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+/* Make sure stores to volatile addresses don't get combined with
+   other accesses.  */
+
+struct bar
+{
+  int a;
+  char b;
+  char c;
+  volatile short d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foozero (struct bar *p)
+{
+  p->b = 0xa;
+  p->a = 0xb;
+  p->c = 0xc;
+  p->d = 0;
+  p->e = 0xd;
+  p->f = 0xe;
+  p->g = 0xf;
+}
+
+/* { dg-final { scan-tree-dump "Volatile access terminates chain" "store-widening" } } */
+/* { dg-final { scan-tree-dump-times "=\{v\} 0;" 1 "store-widening" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/store_widening_4.c b/gcc/testsuite/gcc.dg/store_widening_4.c
new file mode 100644
index 0000000000000000000000000000000000000000..4f90b9cdc4dfe1a80d46b1dbc6357fc2e6750f79
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_4.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+/* Check that we can merge interleaving stores that are guaranteed
+   to be non-aliasing.  */
+
+struct bar
+{
+  int a;
+  char b;
+  char c;
+  short d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foozero (struct bar *restrict p, struct bar *restrict p2)
+{
+  p->b = 0xff;
+  p2->b = 0xa;
+  p->a = 0xfffff;
+  p2->a = 0xab;
+  p2->c = 0xc;
+  p->c = 0xff;
+  p2->d = 0xbf;
+  p->d = 0xfff;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
index f02e55f1cc2f01063aea35b3b88f793bb2f7c532..9de4e771ab1e73bce960d4038f8ec5b49b5c612c 100644
--- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
@@ -3,22 +3,22 @@
 int arr[4][4];
 
 void
-foo ()
+foo (int x, int y)
 {
-  arr[0][1] = 1;
-  arr[1][0] = -1;
-  arr[2][0] = 1;
-  arr[1][1] = -1;
-  arr[0][2] = 1;
-  arr[0][3] = -1;
-  arr[1][2] = 1;
-  arr[2][1] = -1;
-  arr[3][0] = 1;
-  arr[3][1] = -1;
-  arr[2][2] = 1;
-  arr[1][3] = -1;
-  arr[2][3] = 1;
-  arr[3][2] = -1;
+  arr[0][1] = x;
+  arr[1][0] = y;
+  arr[2][0] = x;
+  arr[1][1] = y;
+  arr[0][2] = x;
+  arr[0][3] = y;
+  arr[1][2] = x;
+  arr[2][1] = y;
+  arr[3][0] = x;
+  arr[3][1] = y;
+  arr[2][2] = x;
+  arr[1][3] = y;
+  arr[2][3] = x;
+  arr[3][2] = y;
 }
 
 /* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 7 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr22141.c b/gcc/testsuite/gcc.target/i386/pr22141.c
new file mode 100644
index 0000000000000000000000000000000000000000..efded365778417e7b4bf468288f42f5136fdd585
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr22141.c
@@ -0,0 +1,126 @@
+/* PR middle-end/22141 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u;
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v;
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v;
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w;
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
+
+/* { dg-final { scan-assembler-times "67305985\|4030201" 7 } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 6d4d70cc1850464d1b014b91da363ee6164f4432..5a9ffa79dfb2ad1254b709d032455958afe542f1 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -424,6 +424,7 @@ extern gimple_opt_pass *make_pass_late_warn_uninitialized (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cse_reciprocals (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_tree_store_widening (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-store-widening.c b/gcc/tree-ssa-store-widening.c
new file mode 100644
index 0000000000000000000000000000000000000000..febd5dff3474d458c580deacf23bbe2d1e3430b0
--- /dev/null
+++ b/gcc/tree-ssa-store-widening.c
@@ -0,0 +1,875 @@
+/* GIMPLE store widening pass.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   Contributed by ARM Ltd.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The purpose of this pass is to combine multiple memory stores of
+   constant values to consecutive memory locations into fewer wider stores.
+   For example, if we have a sequence performing four byte stores to
+   consecutive memory locations:
+   [p     ] := imm1;
+   [p + 1B] := imm2;
+   [p + 2B] := imm3;
+   [p + 3B] := imm4;
+   we can transform this into a single 4-byte store if the target supports it:
+  [p] := imm1:imm2:imm3:imm4 //concatenated immediates according to endianness.
+
+   The algorithm is applied to each basic block in three phases:
+
+   1) Scan through the basic block recording constant assignments to
+   destinations that can be expressed as a store to memory of a certain size
+   at a certain bit offset.  Record all such assignments as long as they store
+   to a common base object (the address pointed to by 'p' in the above
+   example).
+   These stores can be a result of structure element initializers, array stores
+   etc.  A store_immediate_info object is recorded for every such store.
+   Record as many such assignments to a single base as possible until a
+   statement that interferes with the store sequence is encountered.
+
+   2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
+   store_immediate_info objects) and coalesce contiguous stores into
+   merged_store_group objects.
+
+   For example, given the stores:
+   [p     ] := 0;
+   [p + 1B] := 1;
+   [p + 3B] := 0;
+   [p + 4B] := 1;
+   [p + 5B] := 0;
+   [p + 6B] := 0;
+   This phase would produce two merged_store_group objects, one recording the
+   two bytes stored in the memory region [p : p + 1] and another
+   recording the four bytes stored in the memory region [p + 3 : p + 6].
+
+   3) The merged_store_group objects produced in phase 2) are processed
+   to generate the sequence of wider stores that set the contiguous memory
+   regions to the sequence of bytes that correspond to it.  This may emit
+   multiple stores per store group to handle contiguous stores that are not
+   of a size that is a power of 2.  For example it can try to emit a 40-bit
+   store as a 32-bit store followed by an 8-bit store.
+   We try to emit as wide stores as we can while respecting STRICT_ALIGNMENT or
+   SLOW_UNALIGNED_ACCESS rules.
+
+   Note on endianness and example:
+   The bitposition that is written in each access is determined by
+   get_inner_reference which calculates the position in the object rather than
+   the memory itself so some care must be taken when coalescing stores and
+   synthesizing the result in steps 2) and 3) respectively to handle
+   byte-endianness correctly.
+   Consider 2 contiguous 16-bit stores followed by 2 contiguous 8-bit stores:
+   [p     ] := 0x1234;
+   [p + 2B] := 0x5678;
+   [p + 4B] := 0xab;
+   [p + 5B] := 0xcd;
+
+   The memory layout for little-endian (LE) and big-endian (BE) must be:
+  p |LE|BE|
+  ---------
+  0 |34|12|
+  1 |12|34|
+  2 |78|56|
+  3 |56|78|
+  4 |ab|ab|
+  5 |cd|cd|
+
+  To merge these into a single 48-bit merged value 'val' in phase 2)
+  on little-endian we insert stores to higher (consecutive) bitpositions
+  into the most significant bits of the merged value.
+  For the example above that would be:
+  val = 0x1234;
+  val = val | (0x5678 << 16);
+  val = val | (0xab << 32);
+  val = val | (0xcd << 40);
+  To produce a wide_int containing: 0xcdab56781234
+
+  For big-endian we insert stores to higher bitpositions into the least
+  significant bits of the merged value.  So for the above example the
+  operations would be:
+  val = 0x1234;
+  val = 0x5678 | (val << 16);
+  val = 0xab   | (val << 8);
+  val = 0xcd   | (val << 8);
+  to produce a wide_int containing: 0x12345678abcd
+
+  Then, in phase 3), we want to emit this 48-bit value as a 32-bit store
+  followed by a 16-bit store.  Again, we must consider endianness when
+  breaking down the 48-bit value 'val' computed above.
+  For little endian we emit:
+  [p]      (32-bit) := 0x56781234; // val & 0x0000ffffffff;
+  [p + 4B] (16-bit) := 0xcdab;    // (val & 0xffff00000000) >> 32;
+
+  Whereas for big-endian we emit:
+  [p]      (32-bit) := 0x12345678; // (val & 0xffffffff0000) >> 16;
+  [p + 4B] (16-bit) := 0xabcd;     //  val & 0x00000000ffff;
+
+  The same effect could be achieved without changing the insertion order
+  during phase 2) and changing the extraction logic in phase 3) by
+  byte-swapping the values when they are being merged in phase 2) and again
+  after extracting them in phase 3).  But that approach would be harder to
+  extend to deal with potential future improvements including bitfields that
+  are not a multiple of BITS_PER_UNIT or non-constant contiguous stores.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "builtins.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "gimple-pretty-print.h"
+#include "alias.h"
+#include "fold-const.h"
+#include "params.h"
+#include "gimple-iterator.h"
+#include "gimplify.h"
+
+/* The maximum size (in bits) of the stores this pass should generate.  */
+#define MAX_STORE_BITSIZE (BITS_PER_WORD)
+
+namespace {
+
+/* Struct recording the information about a single store of an immediate
+   to memory.  These are created in the first phase and coalesced into
+   merged_store_group objects in the second phase.  */
+
+struct store_immediate_info
+{
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
+  tree val;
+  tree dest;
+  gimple *stmt;
+  store_immediate_info (HOST_WIDE_INT, HOST_WIDE_INT, tree, tree, gimple *);
+};
+
+store_immediate_info::store_immediate_info (HOST_WIDE_INT bs, HOST_WIDE_INT bp,
+					    tree v, tree d, gimple *st)
+  : bitsize (bs), bitpos (bp), val (v), dest (d), stmt (st)
+{
+}
+
+/* Struct representing a group of stores to contiguous memory locations.
+   These are produced by the second phase (coalescing) and consumed in the
+   third phase that outputs the widened stores.  */
+
+struct merged_store_group
+{
+  HOST_WIDE_INT start;
+  HOST_WIDE_INT width;
+  wide_int val;
+  unsigned int align;
+  auto_vec<gimple *> stmts;
+  merged_store_group (store_immediate_info *);
+  void merge_into (store_immediate_info *);
+};
+
+/* Initialize a merged_store_group object from a store_immediate_info
+   object.  Initialize the wide_int recording the value and the vector
+   of statements that will be replaced by this store.  */
+
+merged_store_group::merged_store_group (store_immediate_info *info)
+{
+  start = info->bitpos;
+  width = info->bitsize;
+
+  val = wide_int::from (info->val, MAX_STORE_BITSIZE, UNSIGNED);
+  align = get_object_alignment (info->dest);
+  stmts.create (1);
+  stmts.safe_push (info->stmt);
+}
+
+/* Merge a store recorded by INFO into this widened store.
+   Insert the value stored into the val field in the appropriate
+   position (byte-swapping as appropriate for endianness).
+   Record the original statement that produced the store.
+   See note on endianness at top of file for an explanation of the
+   BYTES_BIG_ENDIAN logic.  */
+
+void
+merged_store_group::merge_into (store_immediate_info *info)
+{
+  HOST_WIDE_INT wid = info->bitsize;
+  wide_int tmp = wide_int::from (info->val, MAX_STORE_BITSIZE, UNSIGNED);
+
+  /* Make sure we're inserting in the position we think we're inserting.  */
+  gcc_assert (info->bitpos == start + width);
+  if (BYTES_BIG_ENDIAN)
+    val = wi::insert (tmp, val, wid, width);
+  else
+    val = wi::insert (val, tmp, width, wid);
+  width += wid;
+  gimple *stmt = info->stmt;
+  stmts.safe_push (stmt);
+}
+
+const pass_data pass_data_tree_store_widening = {
+  GIMPLE_PASS,      /* type */
+  "store-widening", /* name */
+  OPTGROUP_NONE,    /* optinfo_flags */
+  TV_NONE,	  /* tv_id */
+  PROP_ssa,	 /* properties_required */
+  0,		    /* properties_provided */
+  0,		    /* properties_destroyed */
+  0,		    /* todo_flags_start */
+  TODO_update_ssa,  /* todo_flags_finish */
+};
+
+class pass_tree_store_widening : public gimple_opt_pass
+{
+public:
+  pass_tree_store_widening (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_tree_store_widening, ctxt)
+  {
+  }
+
+  virtual bool
+  gate (function *)
+  {
+    return flag_tree_store_widening && optimize;
+  }
+
+  virtual unsigned int execute (function *);
+
+private:
+  tree m_curr_base_expr;
+  auto_vec<gimple *> m_store_chain;
+  auto_vec<struct store_immediate_info *> m_store_info;
+  auto_vec<gimple *> m_interspersed_mem_accesses;
+  auto_vec<merged_store_group *> m_merged_store_groups;
+
+  bool stmt_terminates_chain_p (gimple *);
+  bool stmt_interferes_with_mem_accesses_p (tree);
+  bool terminate_and_process_chain ();
+  bool coalesce_immediate_stores ();
+  bool output_merged_store (merged_store_group *);
+  bool output_merged_stores ();
+
+}; // class pass_tree_store_widening
+
+/* Sorting function for store_immediate_info objects.
+   Sorts them by bitposition.  */
+
+static int
+sort_by_bitpos (const void *x, const void *y)
+{
+  store_immediate_info *const *tmp = (store_immediate_info * const *) x;
+  store_immediate_info *const *tmp2 = (store_immediate_info * const *) y;
+
+  if ((*tmp)->bitpos <= (*tmp2)->bitpos)
+    return -1;
+  else if ((*tmp)->bitpos > (*tmp2)->bitpos)
+    return 1;
+
+  gcc_unreachable ();
+  return 0;
+}
+
+/* Go through the candidate stores recorded in m_store_info and merge them
+   into merged_store_group objects recorded into m_merged_store_groups
+   representing the widened stores.  Return true if coalescing was successful
+   and the number of widened stores is fewer than the original number
+   of stores.  */
+
+bool
+pass_tree_store_widening::coalesce_immediate_stores ()
+{
+  /* Anything less can't be processed.  */
+  if (m_store_info.length () < 2)
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Attempting to coalesce %u stores in chain.\n",
+	     m_store_info.length ());
+
+  store_immediate_info *info;
+  unsigned int i;
+
+  /* Order the stores by the bitposition they write to.  */
+  m_store_info.qsort (sort_by_bitpos);
+
+  info = m_store_info[0];
+
+  merged_store_group *merged_store = new merged_store_group (info);
+  m_merged_store_groups.safe_push (merged_store);
+
+  FOR_EACH_VEC_ELT (m_store_info, i, info)
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Store %u:\nbitsize:" HOST_WIDE_INT_PRINT_DEC
+			       " bitpos:" HOST_WIDE_INT_PRINT_DEC
+			       " val:\n", i, info->bitsize, info->bitpos);
+	  print_generic_expr (dump_file, info->val, 0);
+	  fprintf (dump_file, "\n------------\n");
+	}
+
+      /* First element was already handled outside the loop.  */
+      if (i == 0)
+	continue;
+
+      /* |---store 1---|
+	       |---store 2---|
+       Partially overlapping stores are not handled at the moment.  */
+      HOST_WIDE_INT start = info->bitpos;
+      if (IN_RANGE (start, merged_store->start,
+		    merged_store->start + merged_store->width - 1))
+	return false;
+
+      /* |---store 1---| <gap> |---store 2---|.
+	 Gap between stores.  Start a new group.  */
+      if (start != merged_store->start + merged_store->width)
+	{
+	  merged_store = new merged_store_group (info);
+	  m_merged_store_groups.safe_push (merged_store);
+	  continue;
+	}
+
+      /* |---store 1---||---store 2---|
+	 This store is consecutive to the previous one.  */
+
+      unsigned int prev_size = merged_store->width;
+      /* If we will be exceeding the maximum store size start a new group but
+	 record the alignment of the new store appropriately.  */
+      if (prev_size + info->bitsize > MAX_STORE_BITSIZE)
+	{
+	  merged_store_group *new_merged_store = new merged_store_group (info);
+	  new_merged_store->align
+	    = MIN (merged_store->align, merged_store->width);
+	  merged_store = new_merged_store;
+	  m_merged_store_groups.safe_push (merged_store);
+	  continue;
+	}
+      /*  Merge it into the current store group.  */
+      else
+	{
+	  merged_store->merge_into (info);
+	  continue;
+	}
+
+      gcc_unreachable ();
+      return false;
+    }
+
+  gcc_assert (m_merged_store_groups.length () <= m_store_info.length ());
+  bool success = m_merged_store_groups.length () < m_store_info.length ();
+  if (dump_file)
+    {
+      if (success)
+	fprintf (dump_file, "Coalescing successful!\n"
+			     "Merged into %u stores\n",
+		 m_merged_store_groups.length ());
+    }
+
+  return success;
+}
+
+/* Return the type to use for the merged stores described by GROUP.
+   This is needed to get the alias sets right.  */
+
+static tree
+get_type_for_merged_store (merged_store_group *group)
+{
+  gimple *stmt;
+  unsigned int i;
+  tree lhs = gimple_assign_lhs (group->stmts[0]);
+  tree type = reference_alias_ptr_type (lhs);
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (i == 0)
+	continue;
+
+      lhs = gimple_assign_lhs (stmt);
+      tree type1 = reference_alias_ptr_type (lhs);
+      if (!alias_ptr_types_compatible_p (type, type1))
+	return ptr_type_node;
+    }
+  return type;
+}
+
+/* Given a store group GROUP record in VDEF and VUSE the last
+   vdef in the original statement sequence and also the first vuse.
+   These are to be used in the replacement sequence.  */
+
+static void
+get_vuse_vdef_for_merged_store (merged_store_group *group, tree *vdef,
+				tree *vuse)
+{
+  gimple *stmt;
+  unsigned int i;
+  *vdef = NULL_TREE;
+  *vuse = NULL_TREE;
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (gimple_vuse (stmt))
+	{
+	  *vuse = gimple_vuse (stmt);
+	  break;
+	}
+    }
+  *vdef = gimple_vdef (group->stmts[group->stmts.length () - 1]);
+  /* What do you mean there's no vdef?  */
+  gcc_assert (*vdef);
+}
+
+/* Return the location_t information we can find among the statements
+   in GROUP.  */
+
+static location_t
+get_merged_store_location (merged_store_group *group)
+{
+  gimple *stmt;
+  unsigned int i;
+
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (gimple_has_location (stmt))
+	return gimple_location (stmt);
+    }
+  return UNKNOWN_LOCATION;
+}
+
+/* Return true if storing an integer of bitsize SIZE using an unaligned
+   access is prohibitively slow.  */
+
+static bool
+slow_unaligned_size_access (unsigned int size)
+{
+  machine_mode mode = mode_for_size (size, MODE_INT, 0);
+  gcc_assert (mode != BLKmode);
+  return SLOW_UNALIGNED_ACCESS (mode, GET_MODE_ALIGNMENT (mode));
+}
+
+/* Given a merged store group GROUP output the widened version of it.
+   Try store sizes of at most MAX_STORE_BITSIZE bits wide and don't output
+   unaligned stores for STRICT_ALIGNMENT targets or if it's too expensive.
+   Make sure that the number of statements output is less than the number of
+   original statements.  If a better sequence is possible emit it and
+   return true.  See note on endianness at top of file for an explanation of
+   the BYTES_BIG_ENDIAN logic.  */
+
+bool
+pass_tree_store_widening::output_merged_store (merged_store_group *group)
+{
+  HOST_WIDE_INT pos = group->start;
+  HOST_WIDE_INT size = group->width;
+  HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
+  unsigned int align = group->align;
+  gcc_assert (IN_RANGE (size, 0, MAX_STORE_BITSIZE));
+
+  unsigned int orig_num_stmts = group->stmts.length ();
+  if (orig_num_stmts < 2)
+    return false;
+
+  /* We don't handle partial bitfields for now.  */
+  if ((size % BITS_PER_UNIT != 0) || (pos % BITS_PER_UNIT != 0))
+    return false;
+
+  bool allow_unaligned
+    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_WIDENING_ALLOW_UNALIGNED);
+  unsigned int try_size = MAX_STORE_BITSIZE;
+  while (try_size > size
+	 || ((!allow_unaligned || slow_unaligned_size_access (try_size))
+	     && try_size > align))
+    {
+      try_size /= 2;
+      if (try_size < BITS_PER_UNIT)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "Failed to output merged store.\n"
+				   "Failed to find starting size meeting"
+				   " alignment requirements.\n");
+	    }
+	  return false;
+	}
+    }
+
+  HOST_WIDE_INT try_pos = bytepos;
+  int wi_offset = BYTES_BIG_ENDIAN ? size - try_size : 0;
+  if (dump_file)
+    {
+      fprintf (dump_file,
+	       "Trying to output merged store at pos " HOST_WIDE_INT_PRINT_DEC
+	       ", of size " HOST_WIDE_INT_PRINT_DEC ", "
+	       "starting size %u and alignment %u\n", pos, size, try_size,
+	       align);
+    }
+
+  gimple_seq seq = NULL;
+  unsigned int num_stmts = 0;
+  tree offset_type = get_type_for_merged_store (group);
+  tree last_vdef, new_vuse;
+  get_vuse_vdef_for_merged_store (group, &last_vdef, &new_vuse);
+  location_t loc = get_merged_store_location (group);
+  gimple *stmt = NULL;
+  while (size > 0)
+    {
+      tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
+      SET_TYPE_ALIGN (int_type, align);
+      tree addr = build_fold_addr_expr (m_curr_base_expr);
+      tree dest = fold_build2 (MEM_REF, int_type, addr,
+			       build_int_cst (offset_type, try_pos));
+
+      wide_int wi_val = group->val;
+
+      wi_val
+	= wi::bit_and (wi_val, wi::shifted_mask (wi_offset, try_size, false,
+						 MAX_STORE_BITSIZE));
+
+      wi_val
+	= wi::rshift (wi_val, wide_int::from (wi_offset, MAX_STORE_BITSIZE, UNSIGNED),
+		      UNSIGNED);
+
+      tree src = wide_int_to_tree (int_type, wi_val);
+      stmt = gimple_build_assign (dest, src);
+      gimple_set_vuse (stmt, new_vuse);
+      gimple_seq_add_stmt (&seq, stmt);
+      num_stmts++;
+
+      try_pos += try_size / BITS_PER_UNIT;
+
+      if (!BYTES_BIG_ENDIAN)
+	wi_offset += try_size;
+      size -= try_size;
+      align = try_size;
+      while (size < try_size)
+	try_size /= 2;
+
+      if (BYTES_BIG_ENDIAN)
+	wi_offset -= try_size;
+
+      if (num_stmts >= orig_num_stmts)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Exceeded original number of stmts (%u)."
+				   "  Not profitable to emit new sequence.\n",
+			orig_num_stmts);
+	    }
+	  return false;
+	}
+
+      tree new_vdef
+	= size ? make_ssa_name (gimple_vop (cfun), stmt) : last_vdef;
+      gimple_set_vdef (stmt, new_vdef);
+      SSA_NAME_DEF_STMT (new_vdef) = stmt;
+      new_vuse = new_vdef;
+    }
+  gcc_assert (size == 0);
+  gcc_assert (seq);
+  annotate_all_with_location (seq, loc);
+  gimple_stmt_iterator gsi = gsi_for_stmt (group->stmts[0]);
+  if (dump_file)
+    {
+      fprintf (dump_file,
+	       "New sequence of %u stmts to replace old one of %u stmts:\n",
+	       num_stmts, orig_num_stmts);
+      print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
+    }
+  gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+
+  return true;
+}
+
+/* Process the merged_store_group objects created in the coalescing phase.
+   Try to output the widened stores and delete the original statements if
+   successful.  Return true iff any changes were made.  */
+
+bool
+pass_tree_store_widening::output_merged_stores ()
+{
+  unsigned int i;
+  merged_store_group *merged_store;
+  bool ret = false;
+  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_store)
+    {
+      bool successful_p = output_merged_store (merged_store);
+      if (successful_p)
+	{
+	  gimple *stmt;
+	  unsigned int j;
+	  FOR_EACH_VEC_ELT (merged_store->stmts, j, stmt)
+	    {
+	      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	      gsi_remove (&gsi, true);
+	    }
+	}
+      ret |= successful_p;
+    }
+  if (ret && dump_file)
+    fprintf (dump_file, "Widening successful!\n");
+
+  return ret;
+}
+
+/* Coalesce the store_immediate_info objects recorded in the first phase
+   and output them.  Clear and reinitialize the structures to record the
+   next store chain.  Return true if any changes were made.  */
+
+bool
+pass_tree_store_widening::terminate_and_process_chain (void)
+{
+  /* Process store chain.  */
+  bool ret = false;
+  if (m_store_info.length () > 1)
+    {
+      ret = coalesce_immediate_stores ();
+      if (ret)
+	ret = output_merged_stores ();
+    }
+
+  /* Reinitialize structures.  */
+  store_immediate_info *info;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_store_info, i, info)
+    delete info;
+
+  merged_store_group *merged_info;
+  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_info)
+    delete merged_info;
+
+  m_merged_store_groups.truncate (0);
+  m_store_info.truncate (0);
+  m_store_chain.truncate (0);
+  m_interspersed_mem_accesses.truncate (0);
+  m_curr_base_expr = NULL_TREE;
+  return ret;
+}
+
+/* Return true iff upon encountering the statement STMT we should terminate.
+   Use aliasing info to determine that.  */
+
+bool
+pass_tree_store_widening::stmt_terminates_chain_p (gimple *stmt)
+{
+  if (!m_curr_base_expr)
+    return false;
+
+  if (!gimple_vuse (stmt) && !gimple_vdef (stmt))
+    return false;
+
+  if (gimple_has_volatile_ops (stmt)
+      || stmt_may_clobber_ref_p (stmt, m_curr_base_expr)
+      || ref_maybe_used_by_stmt_p (stmt, m_curr_base_expr))
+    return true;
+
+  return false;
+}
+
+/* Return true iff storing to LHS may interfere with any store recorded
+   in the m_interspersed_mem_accesses vector.  */
+
+bool
+pass_tree_store_widening::stmt_interferes_with_mem_accesses_p (tree lhs)
+{
+  gimple *tmp_stmt;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_interspersed_mem_accesses, i, tmp_stmt)
+    {
+      if (stmt_may_clobber_ref_p (tmp_stmt, lhs)
+	  || ref_maybe_used_by_stmt_p (tmp_stmt, lhs))
+	return true;
+    }
+  return false;
+}
+
+/* Return true iff LHS is a destination potentially interesting for
+   store widening.  */
+
+static bool
+lhs_code_valid_for_store_widening_p (tree lhs)
+{
+  tree_code code = TREE_CODE (lhs);
+  if (code == ARRAY_REF || code == ARRAY_RANGE_REF || code == MEM_REF
+      || code == COMPONENT_REF || code == BIT_FIELD_REF)
+    return true;
+
+  return false;
+}
+
+/* Entry point for the pass.  Go over each basic block recording chains of
+  immediate stores.  Upon encountering a terminating statement (as defined
+  by stmt_terminates_chain_p) process the recorded stores and emit the widened
+  variants.  */
+
+unsigned int
+pass_tree_store_widening::execute (function *fun)
+{
+  basic_block bb;
+  m_curr_base_expr = NULL_TREE;
+  hash_set<gimple *> orig_stmts;
+
+  bool global_changes_made = false;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      gimple_stmt_iterator gsi;
+      HOST_WIDE_INT num_statements = 0;
+      /* Record the original statements so that we can keep track of
+	 statements emitted in this pass and not re-process new
+	 statements.  */
+      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (!is_gimple_debug (stmt))
+	    orig_stmts.add (stmt);
+	  num_statements++;
+	}
+      if (num_statements < 2)
+	continue;
+      bool changes_made;
+      do
+	{
+	  changes_made = false;
+	  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      if (!orig_stmts.contains (stmt) || is_gimple_debug (stmt))
+		continue;
+	      else if (is_gimple_assign (stmt)
+		       && lhs_code_valid_for_store_widening_p (
+			    gimple_assign_lhs (stmt)))
+		{
+		  tree lhs = gimple_assign_lhs (stmt);
+
+		  HOST_WIDE_INT bitsize, bitpos;
+		  machine_mode mode;
+		  int unsignedp = 0, reversep = 0, volatilep = 0;
+		  tree offset, base_addr;
+		  base_addr
+		    = get_inner_reference (lhs, &bitsize, &bitpos, &offset,
+					   &mode, &unsignedp, &reversep,
+					   &volatilep);
+		  bool same_base_p
+		    = m_curr_base_expr
+		      && operand_equal_p (m_curr_base_expr, base_addr, 0);
+
+		  if (volatilep)
+		    {
+		      if (dump_file)
+			fprintf (dump_file, "Volatile access terminates "
+					    "chain.\n");
+		      changes_made |= terminate_and_process_chain ();
+		    }
+		  else if (gimple_assign_rhs_code (stmt) == INTEGER_CST
+			   && (!m_curr_base_expr || same_base_p) && !offset
+			   && !reversep && bitsize <= MAX_STORE_BITSIZE
+			   /* Don't handle bitfields that are not a multiple
+			      of BITS_PER_UNIT for now.  Can be extended
+			      later.  */
+			   && (bitsize % BITS_PER_UNIT == 0)
+			   && !stmt_interferes_with_mem_accesses_p (lhs))
+		    {
+		      /* Valid store to continue widening or, in the
+			 !m_curr_base_expr case, to start a chain.  */
+		      gcc_assert (m_curr_base_expr
+				  || m_store_chain.is_empty ());
+		      m_curr_base_expr = base_addr;
+		      m_store_chain.safe_push (stmt);
+		      tree int_val = gimple_assign_rhs1 (stmt);
+		      store_immediate_info *info
+			= new store_immediate_info (bitsize, bitpos, int_val,
+						    lhs, stmt);
+
+		      if (dump_file)
+			{
+			  fprintf (dump_file,
+				   "Recording immediate store from stmt:\n");
+			  print_gimple_stmt (dump_file, stmt, 0, 0);
+			}
+		      m_store_info.safe_push (info);
+		    }
+		  else if (same_base_p)
+		    {
+		      if (dump_file)
+			fprintf (dump_file,
+				 "Non-constant store based off current base.  "
+				 "Recording for alias checking purposes.\n");
+
+		      if (!m_store_info.is_empty ())
+			m_interspersed_mem_accesses.safe_push (stmt);
+		    }
+		  /* Store to a different base.  If it conflicts with
+		     m_curr_base_expr stop the chain, otherwise ignore.  */
+		  else if (stmt_terminates_chain_p (stmt))
+		    {
+		      if (dump_file)
+			fprintf (dump_file,
+			   "Store to a different base terminates chain.\n");
+
+		      changes_made |= terminate_and_process_chain ();
+		    }
+		  else
+		    {
+		      /* stmt doesn't affect anything.  */
+		      continue;
+		    }
+		}
+	      else if (stmt_terminates_chain_p (stmt))
+		{
+		  /* Non-assignment that terminates.  */
+		  if (dump_file)
+		    fprintf (dump_file, "Statement terminates chain.\n");
+
+		  changes_made |= terminate_and_process_chain ();
+		}
+	      else if (gimple_vdef (stmt) || gimple_vuse (stmt))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file,
+			       "Statement reads/writes memory.  "
+			       "Recording for alias checking purposes.\n");
+		    }
+		  if (!m_store_info.is_empty ())
+		    m_interspersed_mem_accesses.safe_push (stmt);
+		}
+	      /* Non-assignment statement that doesn't impact current chain.  */
+	    }
+	  /* End of basic block.  Terminate chain.  */
+	  if (dump_file && !m_store_info.is_empty ())
+	    fprintf (dump_file,
+		     "End of basic block.  Terminating recorded chain.\n");
+	  changes_made |= terminate_and_process_chain ();
+	  global_changes_made |= changes_made;
+	  num_statements--;
+	  /* Make sure the number of times we went around the block is
+	     not more than the number of statements in the block.  Otherwise
+	     something has gone seriously wrong and we're probably in an
+	     infinite loop.  */
+	  gcc_assert (num_statements >= 0);
+	  if (changes_made && dump_file)
+	    fprintf (dump_file, "Changes made this iteration.  Going through "
+				"basic block again.\n");
+	}
+      while (changes_made);
+    }
+  return global_changes_made ? TODO_cleanup_cfg : 0;
+}
+
+} // anon namespace
+
+/* Construct and return a store widening pass object.  */
+
+gimple_opt_pass *
+make_pass_tree_store_widening (gcc::context *ctxt)
+{
+  return new pass_tree_store_widening (ctxt);
+}


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-07-15 15:14 [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass Kyrill Tkachov
@ 2016-07-18 12:23 ` Richard Biener
  2016-08-01  9:16   ` Kyrill Tkachov
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Richard Biener @ 2016-07-18 12:23 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> This is a GIMPLE pass to implement PR middle-end/22141. that is merge narrow
> stores of constants
> into fewer wider stores.  A 2009 patch from Jakub [1] contains many
> testcases but a simple motivating
> case can be:
>
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
> }; // packed 64-bit structure
>
> void bar (struct bar *);
>
> void
> foo (struct bar *p)
> {
>   p->b = 0;
>   p->a = 0;
>   p->c = 0;
>   p->d = 1;
>   p->e = 0;
> }
>
> Currently on aarch64 this will generate:
> foo:
>         mov     w1, 1
>         str     wzr, [x0]
>         strb    wzr, [x0, 4]
>         strb    wzr, [x0, 5]
>         strb    w1, [x0, 6]
>         strb    wzr, [x0, 7]
>         ret
>
> With this patch this can be improved into a single unaligned store:
> foo:
>         mov     x1, 0x1000000000000
>         str     x1, [x0]
>         ret
>
> or, if compiled with -mstrict-align:
> foo:
>         mov     w1, 0x10000
>         stp     wzr, w1, [x0]
>         ret
>
> The pass is a tree-ssa pass that runs fairly late in the pipeline, after
> pass_optimize_widening_mul.
> I explain the approach taken in the comments in the new
> tree-ssa-store-widening.c file but essentially
> it has 3 phases applied to each basic block:
>
> 1) Scan through the statements recording assignments of constants to
> destinations like ARRAY_REF,
> COMPONENT_REF, MEM_REF which are determined to write to an ultimate common
> destination. get_inner_reference
> is used to decompose these destinations. Continue recording these until we
> encounter a statement that may
> interfere with the stores we've been recording (load or store that may
> alias, volatile access etc).
> These assignments of interest are recorded as store_immediate_info objects
> in the m_store_info vector.
>
> 2) Analyse the stores recorded in phase one (they all write to a destination
> offset from a common base)
> and merge them into wider assignments up to BITS_PER_WORD bits wide. These
> widened assignments are represented
> as merged_store_group objects and they are recorded in the
> m_merged_store_groups vector. This is the
> coalesce_immediate_stores function. It sorts the stores by the bitposition
> they write to and iterates through
> them, merging consecutive stores (it fails the transformation on overlapping
> stores, I don't think that case
> appears often enough to warrant extra logic) up to BITS_PER_WORD-wide
> accesses.
>
> 3) Go through the merged stores recorded in m_merged_store_groups and output
> each widened store. Widened stores
> that are not of a bitsize that is a power of two (for example 48 bits wide)
> are output as multiple stores of decreasing
> power-of-two width. So, for a widened store 48-bits wide this phase would a
> emit a 32-bit store followed by a
> 16-bit store. The new sequence is only emitted if it contains fewer
> statements than the original sequence that it
> will replace.  This phase also avoids outputting unaligned stores for
> STRICT_ALIGNMENT targets or targets where
> SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want
> to avoid generation of unaligned
> stores even when it is legal I've added the new param
> PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
> to disallow unaligned store generation.  Its default setting is to allow
> them (assuming that STRICT_ALIGNMENT
> and SLOW_UNALIGNED_ACCESS allows it).
>
> This is my first GIMPLE-level pass so please do point out places where I'm
> not using the interfaces correctly.
> This patch handles bitfields as well, but only if they are a multiple of
> BITS_PER_UNIT. It should be easily
> extensible to handle other bitfields as well, but I'm not entirely sure of
> the rules for laying out such bitfields
> and in particular the byteswap logic that needs to be applied for big-endian
> targets. If someone can shed some light
> on how they should be handed I'll be happy to try it out, but I believe this
> patch is an improvement as it is.
>
> This has been bootstrapped and tested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
> I've also tested it on the big-endian targets: armeb-none-eabi,
> aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.
>
> I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no
> regressions, the overall score improved a bit
> (about 0.1%). The interesting improvements were:
> 458.sjeng     (+0.8%)
> 483.xalancbmk (+1.1%)
> 416.gamess    (+1.0%)
> 454.calculix  (+1.1%)
>
> An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it
> transformed a long sequence of constant
> byte stores into a much shorter sequence of word-size stores (from ~550
> instructions to ~190).
>
> On x86_64 SPECINT there was no change in the overall score. The code size at
> -Ofast is consistently smaller
> with this patch but the performance differences on sub-benchmarks are in the
> noise.
>
> I've included the testcases from Jakub's patch [1] and added a few of my
> own.
>
> Is this direction acceptable for the problem this is trying to solve?

+      /* Record the original statements so that we can keep track of
+        statements emitted in this pass and not re-process new
+        statements.  */
+      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+       {
+         gimple *stmt = gsi_stmt (gsi);
+         if (!is_gimple_debug (stmt))
+           orig_stmts.add (stmt);
+         num_statements++;
+       }

please use gimple_set_visited () instead, that should be cheaper.
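
For example (untested sketch, reusing the names from your patch;
gimple_visited_p () would then be the cheap query on the other side):

      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
        {
          gimple *stmt = gsi_stmt (gsi);
          /* Mark the stmt itself instead of adding it to a hash set.  */
          if (!is_gimple_debug (stmt))
            gimple_set_visited (stmt, true);
          num_statements++;
        }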


+      do
+       {
+         changes_made = false;
+         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+           {
...
+       }
+      while (changes_made);

looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
why not use a hash-map to track stores related to a base?

+                          /* Don't handle bitfields that are not a multiple
+                             of BITS_PER_UNIT for now.  Can be extended
+                             later.  */
+                          && (bitsize % BITS_PER_UNIT == 0)

:(

+                          && !stmt_interferes_with_mem_accesses_p (lhs))

given this function loops over all accesses this is quadratic as well.

I think alias queries could be simplified if you reduce them to alias
checks on the base object (and allow overlapping constant stores
which should be easy to handle during merging).
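
Something along these lines (sketch only; terminate_all_chains_for () is a
stand-in for whatever flushes the stores recorded against a base, and BASE
is the tree you got from get_inner_reference):

      /* One base-level oracle query per statement instead of walking
         every recorded access; terminate_all_chains_for is a
         hypothetical helper.  */
      if (stmt_may_clobber_ref_p (stmt, base)
          || ref_maybe_used_by_stmt_p (stmt, base))
        terminate_all_chains_for (base);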

The VUSE/VDEF handling is somewhat odd.  All stores have both
a VDEF and a VUSE, if you merge a set of them you can simply
re-use the VDEF/VUSE of one of them, effectively replacing the
stmt.  For all other stores you remove you miss a
  unlink_stmt_vdef (stmt);
  release_defs (stmt);
to update virtual SSA form and properly release SSA names.
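
I.e. the full removal path for each merged-away store would look
something like (sketch):

      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
      /* Take STMT out of the virtual use-def chains first ...  */
      unlink_stmt_vdef (stmt);
      gsi_remove (&gsi, true);
      /* ... then release the SSA names it defined.  */
      release_defs (stmt);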

As you use TBAA in your alias checks you may only _sink_
stores.  Not sure if your merged store placement ensures this.

I think this kind of transform is useful in early optimizations, not only
very late as you perform it.  Of course it should be really cheap there.

Note that options like -ftree-store-widening are discouraged
("tree" does mean nothing to our users and store widening isn't
what is done - it's store merging).  Simply name it -fstore-merging.
Also the file name should be gimple-ssa-store-merging.c

Looking forward to an algorithmically enhanced version.

Richard.


> Thanks,
> Kyrill
>
> N.B. I'm going on vacation until August so I won't be able to respond to any
> feedback until I get back.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>
> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR middle-end/22141
>     * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>     * common.opt (ftree-store-widening): New Optimization option.
>     * opts.c (default_options_table): Add entry for
>     OPT_ftree_store_widening.
>     * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>     * passes.def: Insert pass_tree_store_widening.
>     * tree-pass.h (make_pass_tree_store_widening): Declare extern
>     prototype.
>     * tree-ssa-store-widening.c: New file.
>     * doc/invoke.texi (Optimization Options): Document
>     -ftree-store-widening.
>
> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>             Jakub Jelinek  <jakub@redhat.com>
>
>     PR middle-end/22141
>     * gcc.c-torture/execute/pr22141-1.c: New test.
>     * gcc.c-torture/execute/pr22141-2.c: Likewise.
>     * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
>     * gcc.dg/store_widening_1.c: New test.
>     * gcc.dg/store_widening_2.c: Likewise.
>     * gcc.dg/store_widening_3.c: Likewise.
>     * gcc.dg/store_widening_4.c: Likewise.
>     * gcc.target/i386/pr22141.c: Likewise.


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-07-18 12:23 ` Richard Biener
@ 2016-08-01  9:16   ` Kyrill Tkachov
  2016-08-01 16:10     ` Jeff Law
  2016-08-03  9:59   ` Kyrill Tkachov
  2016-08-03 15:15   ` Kyrill Tkachov
  2 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2016-08-01  9:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches


On 18/07/16 13:22, Richard Biener wrote:
> On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi all,
>>
>> This is a GIMPLE pass to implement PR middle-end/22141, that is, to merge narrow
>> stores of constants
>> into fewer wider stores.  A 2009 patch from Jakub [1] contains many
>> testcases but a simple motivating
>> case can be:
>>
>> struct bar {
>>    int a;
>>    char b;
>>    char c;
>>    char d;
>>    char e;
>> }; // packed 64-bit structure
>>
>> void bar (struct bar *);
>>
>> void
>> foo (struct bar *p)
>> {
>>    p->b = 0;
>>    p->a = 0;
>>    p->c = 0;
>>    p->d = 1;
>>    p->e = 0;
>> }
>>
>> Currently on aarch64 this will generate:
>> foo:
>>          mov     w1, 1
>>          str     wzr, [x0]
>>          strb    wzr, [x0, 4]
>>          strb    wzr, [x0, 5]
>>          strb    w1, [x0, 6]
>>          strb    wzr, [x0, 7]
>>          ret
>>
>> With this patch this can be improved into a single unaligned store:
>> foo:
>>          mov     x1, 0x1000000000000
>>          str     x1, [x0]
>>          ret
>>
>> or, if compiled with -mstrict-align:
>> foo:
>>          mov     w1, 0x10000
>>          stp     wzr, w1, [x0]
>>          ret
>>
>> The pass is a tree-ssa pass that runs fairly late in the pipeline, after
>> pass_optimize_widening_mul.
>> I explain the approach taken in the comments in the new
>> tree-ssa-store-widening.c file but essentially
>> it has 3 phases applied to each basic block:
>>
>> 1) Scan through the statements recording assignments of constants to
>> destinations like ARRAY_REF,
>> COMPONENT_REF, MEM_REF which are determined to write to an ultimate common
>> destination. get_inner_reference
>> is used to decompose these destinations. Continue recording these until we
>> encounter a statement that may
>> interfere with the stores we've been recording (load or store that may
>> alias, volatile access etc).
>> These assignments of interest are recorded as store_immediate_info objects
>> in the m_store_info vector.
>>
>> 2) Analyse the stores recorded in phase one (they all write to a destination
>> offset from a common base)
>> and merge them into wider assignments up to BITS_PER_WORD bits wide. These
>> widened assignments are represented
>> as merged_store_group objects and they are recorded in the
>> m_merged_store_groups vector. This is the
>> coalesce_immediate_stores function. It sorts the stores by the bitposition
>> they write to and iterates through
>> them, merging consecutive stores (it fails the transformation on overlapping
>> stores, I don't think that case
>> appears often enough to warrant extra logic) up to BITS_PER_WORD-wide
>> accesses.
>>
>> 3) Go through the merged stores recorded in m_merged_store_groups and output
>> each widened store. Widened stores
>> that are not of a bitsize that is a power of two (for example 48 bits wide)
>> are output as multiple stores of decreasing
>> power-of-two width. So, for a widened store 48 bits wide this phase would
>> emit a 32-bit store followed by a
>> 16-bit store. The new sequence is only emitted if it contains fewer
>> statements than the original sequence that it
>> will replace.  This phase also avoids outputting unaligned stores for
>> STRICT_ALIGNMENT targets or targets where
>> SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want
>> to avoid generation of unaligned
>> stores even when it is legal I've added the new param
>> PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
>> to disallow unaligned store generation.  Its default setting is to allow
>> them (assuming that STRICT_ALIGNMENT
>> and SLOW_UNALIGNED_ACCESS allow it).
>>
>> This is my first GIMPLE-level pass so please do point out places where I'm
>> not using the interfaces correctly.
>> This patch handles bitfields as well, but only if they are a multiple of
>> BITS_PER_UNIT. It should be easily
>> extensible to handle other bitfields as well, but I'm not entirely sure of
>> the rules for laying out such bitfields
>> and in particular the byteswap logic that needs to be applied for big-endian
>> targets. If someone can shed some light
>> on how they should be handled I'll be happy to try it out, but I believe this
>> patch is an improvement as it is.
>>
>> This has been bootstrapped and tested on aarch64-none-linux-gnu,
>> arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
>> I've also tested it on the big-endian targets: armeb-none-eabi,
>> aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.
>>
>> I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no
>> regressions, the overall score improved a bit
>> (about 0.1%). The interesting improvements were:
>> 458.sjeng     (+0.8%)
>> 483.xalancbmk (+1.1%)
>> 416.gamess    (+1.0%)
>> 454.calculix  (+1.1%)
>>
>> An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it
>> transformed a long sequence of constant
>> byte stores into a much shorter sequence of word-size stores (from ~550
>> instructions to ~190).
>>
>> On x86_64 SPECINT there was no change in the overall score. The code size at
>> -Ofast is consistently smaller
>> with this patch but the performance differences on sub-benchmarks are in the
>> noise.
>>
>> I've included the testcases from Jakub's patch [1] and added a few of my
>> own.
>>
>> Is this direction acceptable for the problem this is trying to solve?
> +      /* Record the original statements so that we can keep track of
> +        statements emitted in this pass and not re-process new
> +        statements.  */
> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         if (!is_gimple_debug (stmt))
> +           orig_stmts.add (stmt);
> +         num_statements++;
> +       }
>
> please use gimple_set_visited () instead, that should be cheaper.
>
>
> +      do
> +       {
> +         changes_made = false;
> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +           {
> ...
> +       }
> +      while (changes_made);
>
> looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
> why not use a hash-map to track stores related to a base?
>
> +                          /* Don't handle bitfields that are not a multiple
> +                             of BITS_PER_UNIT for now.  Can be extended
> +                             later.  */
> +                          && (bitsize % BITS_PER_UNIT == 0)
>
> :(
>
> +                          && !stmt_interferes_with_mem_accesses_p (lhs))
>
> given this function loops over all accesses this is quadratic as well.
>
> I think alias queries could be simplified if you reduce them to alias
> checks on the base object (and allow overlapping constant stores
> which should be easy to handle during merging).
>
> The VUSE/VDEF handling is somewhat odd.  All stores have both
> a VDEF and a VUSE, if you merge a set of them you can simply
> re-use the VDEF/VUSE of one of them, effectively replacing the
> stmt.  For all other stores you remove you miss a
>    unlink_stmt_vdef (stmt);
>    release_defs (stmt);
> to update virtual SSA form and properly release SSA names.
>
> As you use TBAA in your alias checks you may only _sink_
> stores.  Not sure if your merged store placement ensures this.
>
> I think this kind of transform is useful in early optimizations, not only
> very late as you perform it.  Of course it should be really cheap there.
>
> Note that options like -ftree-store-widening are discouraged
> ("tree" does mean nothing to our users and store widening isn't
> what is done - it's store merging).  Simply name it -fstore-merging.
> Also the file name should be gimple-ssa-store-merging.c
>
> Looking forward to an algorithmically enhanced version.

Thanks, that's all helpful.
I'll rework it with the feedback in mind.
Kyrill

> Richard.
>
>
>> Thanks,
>> Kyrill
>>
>> N.B. I'm going on vacation until August so I won't be able to respond to any
>> feedback until I get back.
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>>
>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      PR middle-end/22141
>>      * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>>      * common.opt (ftree-store-widening): New Optimization option.
>>      * opts.c (default_options_table): Add entry for
>>      OPT_ftree_store_widening.
>>      * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>>      * passes.def: Insert pass_tree_store_widening.
>>      * tree-pass.h (make_pass_tree_store_widening): Declare extern
>>      prototype.
>>      * tree-ssa-store-widening.c: New file.
>>      * doc/invoke.texi (Optimization Options): Document
>>      -ftree-store-widening.
>>
>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>              Jakub Jelinek  <jakub@redhat.com>
>>
>>      PR middle-end/22141
>>      * gcc.c-torture/execute/pr22141-1.c: New test.
>>      * gcc.c-torture/execute/pr22141-2.c: Likewise.
>>      * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
>>      * gcc.dg/store_widening_1.c: New test.
>>      * gcc.dg/store_widening_2.c: Likewise.
>>      * gcc.dg/store_widening_3.c: Likewise.
>>      * gcc.dg/store_widening_4.c: Likewise.
>>      * gcc.target/i386/pr22141.c: Likewise.


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-08-01  9:16   ` Kyrill Tkachov
@ 2016-08-01 16:10     ` Jeff Law
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Law @ 2016-08-01 16:10 UTC (permalink / raw)
  To: Kyrill Tkachov, Richard Biener; +Cc: GCC Patches

On 08/01/2016 03:15 AM, Kyrill Tkachov wrote:
>
> On 18/07/16 13:22, Richard Biener wrote:
>> On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> This is a GIMPLE pass to implement PR middle-end/22141, that is, to merge
>>> narrow
>>> stores of constants
>>> into fewer wider stores.  A 2009 patch from Jakub [1] contains many
>>> testcases but a simple motivating
[ ... ]
Given your work on 22141, you might want to pick up my work on 33562. 
They're different issues, but they touch on similar concepts.  IIRC the 
major thing left on my plate for 33562 was rewriting existing stores 
into smaller pieces when parts of the original store are found to be dead.

jeff


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-07-18 12:23 ` Richard Biener
  2016-08-01  9:16   ` Kyrill Tkachov
@ 2016-08-03  9:59   ` Kyrill Tkachov
  2016-08-03 13:51     ` Richard Biener
  2016-08-03 15:15   ` Kyrill Tkachov
  2 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2016-08-03  9:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Hi Richard,

On 18/07/16 13:22, Richard Biener wrote:

<snip>

> +      /* Record the original statements so that we can keep track of
> +        statements emitted in this pass and not re-process new
> +        statements.  */
> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         if (!is_gimple_debug (stmt))
> +           orig_stmts.add (stmt);
> +         num_statements++;
> +       }
>
> please use gimple_set_visited () instead, that should be cheaper.
>
>
> +      do
> +       {
> +         changes_made = false;
> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +           {
> ...
> +       }
> +      while (changes_made);
>
> looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
> why not use a hash-map to track stores related to a base?

I've implemented this scheme but I'm having trouble making it work.
In particular I have a hash_map keyed on a 'tree' that is the base
object (as extracted by get_inner_reference) but I can't get the hash_map
to properly extract the already recorded stores to the same base.
For example for the simple code:
struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
   char f;
   char g;
};

void
foo1 (struct bar *p)
{
   p->b = 0;
   p->a = 0;
   p->c = 0;
   p->d = 0;
   p->e = 0;
}

As we can see, the stores are all to the same object and should
be recognised as such.

The base of the first store is recorded as:
<mem_ref 0x7f527a482820 ...>
and for the second store as <mem_ref 0x7f527a482848 ...>
where the dumps of the two mem_refs are identical except for that first
hex number (their address in memory?)
In my first version of the patch I compare these with operand_equal_p and that
detects that they are the same, but in the hash_map they are not detected
as equal. Is there some special hashing function I must specify?

Thanks,
Kyrill


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-08-03  9:59   ` Kyrill Tkachov
@ 2016-08-03 13:51     ` Richard Biener
  2016-08-03 13:57       ` Kyrill Tkachov
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2016-08-03 13:51 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Wed, Aug 3, 2016 at 11:59 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi Richard,
>
> On 18/07/16 13:22, Richard Biener wrote:
>
> <snip>
>
>> +      /* Record the original statements so that we can keep track of
>> +        statements emitted in this pass and not re-process new
>> +        statements.  */
>> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> (&gsi))
>> +       {
>> +         gimple *stmt = gsi_stmt (gsi);
>> +         if (!is_gimple_debug (stmt))
>> +           orig_stmts.add (stmt);
>> +         num_statements++;
>> +       }
>>
>> please use gimple_set_visited () instead, that should be cheaper.
>>
>>
>> +      do
>> +       {
>> +         changes_made = false;
>> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> (&gsi))
>> +           {
>> ...
>> +       }
>> +      while (changes_made);
>>
>> looks pretty quadratic to me.  Instead of tracking things with
>> m_curr_base_expr
>> why not use a hash-map to track stores related to a base?
>
>
> I've implemented this scheme but I'm having trouble making it work.
> In particular I have a hash_map keyed on a 'tree' that is the base
> object (as extracted by get_inner_reference) but I can't get the hash_map
> to properly extract the already recorded stores to the same base.
> For example for the simple code:
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
>   char f;
>   char g;
> };
>
> void
> foo1 (struct bar *p)
> {
>   p->b = 0;
>   p->a = 0;
>   p->c = 0;
>   p->d = 0;
>   p->e = 0;
> }
>
> As we can see, the stores are all to the same object and should
> be recognised as such.
>
> The base of the first store is recorded as:
> <mem_ref 0x7f527a482820 ...>
> and for the second store as <mem_ref 0x7f527a482848 ...>
> where the dumps of the two mem_refs are identical except for that first
> hex number (their address in memory?)
> In my first version of the patch I compare these with operand_equal_p and
> that
> detects that they are the same, but in the hash_map they are not detected
> as equal. Is there some special hashing function I must specify?

If you just use hash_map <tree, ...> then it will hash on the pointer value.
I think you need to use tree_operand_hash.
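
That is, something like (sketch; store_chain is a stand-in for whatever
structure records the stores for one base):

      /* tree_operand_hash hashes with iterative_hash_expr and compares
         with operand_equal_p, so structurally identical MEM_REFs land
         in the same slot.  store_chain is a hypothetical type.  */
      hash_map<tree_operand_hash, struct store_chain *> stores_by_base;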

Richard.

> Thanks,
> Kyrill


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-08-03 13:51     ` Richard Biener
@ 2016-08-03 13:57       ` Kyrill Tkachov
  0 siblings, 0 replies; 9+ messages in thread
From: Kyrill Tkachov @ 2016-08-03 13:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches


On 03/08/16 14:50, Richard Biener wrote:
> On Wed, Aug 3, 2016 at 11:59 AM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Richard,
>>
>> On 18/07/16 13:22, Richard Biener wrote:
>>
>> <snip>
>>
>>> +      /* Record the original statements so that we can keep track of
>>> +        statements emitted in this pass and not re-process new
>>> +        statements.  */
>>> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>>> (&gsi))
>>> +       {
>>> +         gimple *stmt = gsi_stmt (gsi);
>>> +         if (!is_gimple_debug (stmt))
>>> +           orig_stmts.add (stmt);
>>> +         num_statements++;
>>> +       }
>>>
>>> please use gimple_set_visited () instead, that should be cheaper.
>>>
>>>
>>> +      do
>>> +       {
>>> +         changes_made = false;
>>> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>>> (&gsi))
>>> +           {
>>> ...
>>> +       }
>>> +      while (changes_made);
>>>
>>> looks pretty quadratic to me.  Instead of tracking things with
>>> m_curr_base_expr
>>> why not use a hash-map to track stores related to a base?
>>
>> I've implemented this scheme but I'm having trouble making it work.
>> In particular I have a hash_map keyed on a 'tree' that is the base
>> object (as extracted by get_inner_reference) but I can't get the hash_map
>> to properly extract the already recorded stores to the same base.
>> For example for the simple code:
>> struct bar {
>>    int a;
>>    char b;
>>    char c;
>>    char d;
>>    char e;
>>    char f;
>>    char g;
>> };
>>
>> void
>> foo1 (struct bar *p)
>> {
>>    p->b = 0;
>>    p->a = 0;
>>    p->c = 0;
>>    p->d = 0;
>>    p->e = 0;
>> }
>>
>> As we can see, the stores are all to the same object and should
>> be recognised as such.
>>
>> The base of the first store is recorded as:
>> <mem_ref 0x7f527a482820 ...>
>> and for the second store as <mem_ref 0x7f527a482848 ...>
>> where the dumps of the two mem_refs are identical except for that first
>> hex number (their address in memory?)
>> In my first version of the patch I compare these with operand_equal_p and
>> that
>> detects that they are the same, but in the hash_map they are not detected
>> as equal. Is there some special hashing function I must specify?
> If you just use hash_map <tree, ...> then it will hash on the pointer value.
> I think you need to use tree_operand_hash.

Ah, thanks. That did the trick.

Kyrill

> Richard.
>
>> Thanks,
>> Kyrill


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-07-18 12:23 ` Richard Biener
  2016-08-01  9:16   ` Kyrill Tkachov
  2016-08-03  9:59   ` Kyrill Tkachov
@ 2016-08-03 15:15   ` Kyrill Tkachov
  2016-08-04  8:17     ` Richard Biener
  2 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2016-08-03 15:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Hi Richard,

On 18/07/16 13:22, Richard Biener wrote:
> On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:

<snip>

> +      /* Record the original statements so that we can keep track of
> +        statements emitted in this pass and not re-process new
> +        statements.  */
> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         if (!is_gimple_debug (stmt))
> +           orig_stmts.add (stmt);
> +         num_statements++;
> +       }
>
> please use gimple_set_visited () instead, that should be cheaper.
>
>
> +      do
> +       {
> +         changes_made = false;
> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +           {
> ...
> +       }
> +      while (changes_made);
>
> looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
> why not use a hash-map to track stores related to a base?
>
> +                          /* Don't handle bitfields that are not a multiple
> +                             of BITS_PER_UNIT for now.  Can be extended
> +                             later.  */
> +                          && (bitsize % BITS_PER_UNIT == 0)
>
> :(
>
> +                          && !stmt_interferes_with_mem_accesses_p (lhs))
>
> given this function loops over all accesses this is quadratic as well.
>
> I think alias queries could be simplified if you reduce them to alias
> checks on the base object (and allow overlapping constant stores
> which should be easy to handle during merging).

I've implemented that and it simplified the detection code and reduced its complexity,
but I'm now missing some cases that were being caught before.
An example is:
struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
   char g;
};

void
foo1 (struct bar *p, char tmp)
{
   p->a = 0;
   p->b = 0;
   p->g = tmp;
   p->c = 0;
   p->d = 0;
   p->e = 0;
}

The store to 'g' doesn't interfere with the contiguous stores to the early fields, but
because we perform the aliasing checks on the base object 'p' rather than on the full
LHS of the assignments, it is deemed to alias the constant stores, so only the first
two and the last three constant stores are merged instead of all five.

I'll experiment with some solutions for this involving recording the non-constant stores and performing
some trickery during the merging phase.

Thanks,
Kyrill

> The VUSE/VDEF handling is somewhat odd.  All stores have both
> a VDEF and a VUSE, if you merge a set of them you can simply
> re-use the VDEF/VUSE of one of them, effectively replacing the
> stmt.  For all other stores you remove you miss a
>    unlink_stmt_vdef (stmt);
>    release_defs (stmt);
> to update virtual SSA form and properly release SSA names.
>
> As you use TBAA in your alias checks you may only _sink_
> stores.  Not sure if your merged store placement ensures this.
>
> I think this kind of transform is useful in early optimizations, not only
> very late as you perform it.  Of course it should be really cheap there.
>
> Note that options like -ftree-store-widening are discouraged
> ("tree" does mean nothing to our users and store widening isn't
> what is done - it's store merging).  Simply name it -fstore-merging.
> Also the file name should be gimple-ssa-store-merging.c
>
> Looking forward to an algorithmically enhanced version.
>
> Richard.
>
>
>> Thanks,
>> Kyrill
>>
>> N.B. I'm going on vacation until August so I won't be able to respond to any
>> feedback until I get back.
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>>
>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      PR middle-end/22141
>>      * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>>      * common.opt (ftree-store-widening): New Optimization option.
>>      * opts.c (default_options_table): Add entry for
>>      OPT_ftree_store_widening.
>>      * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>>      * passes.def: Insert pass_tree_store_widening.
>>      * tree-pass.h (make_pass_tree_store_widening): Declare extern
>>      prototype.
>>      * tree-ssa-store-widening.c: New file.
>>      * doc/invoke.texi (Optimization Options): Document
>>      -ftree-store-widening.
>>
>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>              Jakub Jelinek  <jakub@redhat.com>
>>
>>      PR middle-end/22141
>>      * gcc.c-torture/execute/pr22141-1.c: New test.
>>      * gcc.c-torture/execute/pr22141-2.c: Likewise.
>>      * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
>>      * gcc.dg/store_widening_1.c: New test.
>>      * gcc.dg/store_widening_2.c: Likewise.
>>      * gcc.dg/store_widening_3.c: Likewise.
>>      * gcc.dg/store_widening_4.c: Likewise.
>>      * gcc.target/i386/pr22141.c: Likewise.


* Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass
  2016-08-03 15:15   ` Kyrill Tkachov
@ 2016-08-04  8:17     ` Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2016-08-04  8:17 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Wed, Aug 3, 2016 at 5:15 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi Richard,
>
> On 18/07/16 13:22, Richard Biener wrote:
>>
>> On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>
>
> <snip>
>
>
>> +      /* Record the original statements so that we can keep track of
>> +        statements emitted in this pass and not re-process new
>> +        statements.  */
>> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> (&gsi))
>> +       {
>> +         gimple *stmt = gsi_stmt (gsi);
>> +         if (!is_gimple_debug (stmt))
>> +           orig_stmts.add (stmt);
>> +         num_statements++;
>> +       }
>>
>> please use gimple_set_visited () instead, that should be cheaper.
>>
>>
>> +      do
>> +       {
>> +         changes_made = false;
>> +         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> (&gsi))
>> +           {
>> ...
>> +       }
>> +      while (changes_made);
>>
>> looks pretty quadratic to me.  Instead of tracking things with
>> m_curr_base_expr
>> why not use a hash-map to track stores related to a base?
>>
>> +                          /* Don't handle bitfields that are not a
>> multiple
>> +                             of BITS_PER_UNIT for now.  Can be extended
>> +                             later.  */
>> +                          && (bitsize % BITS_PER_UNIT == 0)
>>
>> :(
>>
>> +                          && !stmt_interferes_with_mem_accesses_p (lhs))
>>
>> given this function loops over all accesses this is quadratic as well.
>>
>> I think alias queries could be simplified if you reduce them to alias
>> checks on the base object (and allow overlapping constant stores
>> which should be easy to handle during merging).
>
>
> I've implemented that and it simplified the detection code and reduced its
> complexity,
> but I'm now missing some cases that were being caught before.
> An example is:
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
>   char g;
> };
>
> void
> foo1 (struct bar *p, char tmp)
> {
>   p->a = 0;
>   p->b = 0;
>   p->g = tmp;
>   p->c = 0;
>   p->d = 0;
>   p->e = 0;
> }
>
> The store to 'g' doesn't interfere with the contiguous stores to the early
> fields, but because we perform the aliasing checks on the base object 'p'
> rather than on the full LHS of the assignments, it is deemed to alias the
> constant stores, so only the first two and the last three constant stores
> are merged instead of all five.
>
> I'll experiment with some solutions for this involving recording the
> non-constant stores and performing
> some trickery during the merging phase.

Not sure how/where exactly you perform the alias checks, but alias
checks between stores in a group with the same base should use the full
reference so that offsets/sizes are factored in.  Only across groups
would I resort to base-only alias checks.
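
Within a group that boils down to a cheap range check (sketch; assuming
store_immediate_info keeps the bitpos/bitsize that get_inner_reference
computed):

      static bool
      stores_interfere_p (const store_immediate_info *info,
                          HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize)
      {
        /* A new store to the same base only interferes with a recorded
           one if the bit ranges overlap, so the store to 'g' above no
           longer kills the whole group.  */
        return ranges_overlap_p (info->bitpos, info->bitsize,
                                 bitpos, bitsize);
      }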

Richard.

> Thanks,
> Kyrill
>
>
>> The VUSE/VDEF handling is somewhat odd.  All stores have both
>> a VDEF and a VUSE, if you merge a set of them you can simply
>> re-use the VDEF/VUSE of one of them, effectively replacing the
>> stmt.  For all other stores you remove you miss a
>>    unlink_stmt_vdef (stmt);
>>    release_defs (stmt);
>> to update virtual SSA form and properly release SSA names.
>>
>> As you use TBAA in your alias checks you may only _sink_
>> stores.  Not sure if your merged store placement ensures this.
>>
>> I think this kind of transform is useful in early optimizations, not only
>> very late as you perform it.  Of course it should be really cheap there.
>>
>> Note that options like -ftree-store-widening are discouraged
>> ("tree" does mean nothing to our users and store widening isn't
>> what is done - it's store merging).  Simply name it -fstore-merging.
>> Also the file name should be gimple-ssa-store-merging.c
>>
>> Looking forward to an algorithmically enhanced version.
>>
>> Richard.
>>
>>
>>> Thanks,
>>> Kyrill
>>>
>>> N.B. I'm going on vacation until August so I won't be able to respond to
>>> any
>>> feedback until I get back.
>>>
>>> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>>>
>>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      PR middle-end/22141
>>>      * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>>>      * common.opt (ftree-store-widening): New Optimization option.
>>>      * opts.c (default_options_table): Add entry for
>>>      OPT_ftree_store_widening.
>>>      * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>>>      * passes.def: Insert pass_tree_store_widening.
>>>      * tree-pass.h (make_pass_tree_store_widening): Declare extern
>>>      prototype.
>>>      * tree-ssa-store-widening.c: New file.
>>>      * doc/invoke.texi (Optimization Options): Document
>>>      -ftree-store-widening.
>>>
>>> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>              Jakub Jelinek  <jakub@redhat.com>
>>>
>>>      PR middle-end/22141
>>>      * gcc.c-torture/execute/pr22141-1.c: New test.
>>>      * gcc.c-torture/execute/pr22141-2.c: Likewise.
>>>      * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
>>>      * gcc.dg/store_widening_1.c: New test.
>>>      * gcc.dg/store_widening_2.c: Likewise.
>>>      * gcc.dg/store_widening_3.c: Likewise.
>>>      * gcc.dg/store_widening_4.c: Likewise.
>>>      * gcc.target/i386/pr22141.c: Likewise.
>
>


end of thread

Thread overview: 9+ messages
2016-07-15 15:14 [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass Kyrill Tkachov
2016-07-18 12:23 ` Richard Biener
2016-08-01  9:16   ` Kyrill Tkachov
2016-08-01 16:10     ` Jeff Law
2016-08-03  9:59   ` Kyrill Tkachov
2016-08-03 13:51     ` Richard Biener
2016-08-03 13:57       ` Kyrill Tkachov
2016-08-03 15:15   ` Kyrill Tkachov
2016-08-04  8:17     ` Richard Biener
